8
ConText Indexing

This chapter introduces the concepts necessary for understanding the indexing objects in the ConText data dictionary.

Overview of Indexing

Figure 8-1

ConText indexes enable text and theme queries to be performed against text columns. Figure 8-1 illustrates the basic relationships between text tables, policies, ConText indexes, and ConText queries.

In a typical ConText system, text is loaded into a text column in a table, then a policy is created for the column.

The policy is used to create the ConText index, which resides in separate database tables associated with the text column through the policy. Once an index exists for a column, queries can be performed against the column using any of the query methods supported by ConText.

When a query is issued against a text column that has a ConText index, ConText does not scan the actual text to find documents that satisfy the search criteria. Instead, it searches the ConText index tables to determine whether a document should be returned in the query results.

The query results are then returned, in the form of a hitlist, to the user who submitted the query. The results can be returned directly or combined with structured data from the base table to refine the query or to provide more information about the documents that satisfy the query.

See Also:

For more information about ConText indexes and the objects used to create them, see:

"ConText Indexes" in Chapter 6, "Text Concepts"
"Policies" in this chapter

For more information about text loading, see "Text Loading" in Chapter 6, "Text Concepts".

For more information about ConText queries, see Oracle8 ConText Cartridge Application Developer's Guide.

Policies

Figure 8-2

This section provides conceptual and reference information about policies.

What is a Policy?

To create a ConText index for text stored in a database column, ConText requires the following information about the text:

how is the text stored in the column? - Data Storage
what format(s) is the text in? - Filtering
how should tokens in the text be identified? - Lexers
how should the index be generated and where should it be stored? - Indexing Engine
are any advanced query options going to be used? - Advanced Query (Wordlist) Options
are there any words which should not have entries in the index? - Stop Words

Note:

ConText also provides a facility for specifying whether the text is compressed; however, this facility is not currently implemented.

A policy provides this information for the column, in the form of indexing preferences (one preference for each of the requirements). Policies can be created by any ConText user with the CTXAPP role and are stored in the ConText data dictionary.

Note:

A policy must exist for a column before a ConText server can create an index for the column.

In addition to the preferences, users specify a name for the policy, the text column to which the policy applies, and a number of other policy attributes.

Each policy created by a user must be unique for that user. As a result, a user cannot assign the same policy to more than one column.

Column Policies

A column policy is a policy that has a text column assigned to it. Only column policies can be used to create ConText indexes.

See Also:

For examples of creating policies, see "Creating a Column Policy" in Chapter 9, "Setting Up and Managing Text".

Template Policies

A template policy is a policy that does not have a text column assigned to it. Template policies are used as source policies when creating column policies or other template policies. The source policy for a policy specifies the preferences (one for each requirement) to be used as defaults in the policy.

For example, ConText provides a template policy, DEFAULT_POLICY, that is the default source policy for all column and template policies.

Any of the preferences provided in a template policy can be overwritten with other preferences (of the same type) by explicitly naming the preference during creation of the new policy.

ConText provides a number of predefined template policies, owned by CTXSYS. Users can create their own template policies or use the predefined template policies when creating policies.
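
As a rough sketch of how a source policy and an overriding preference fit together, the following PL/SQL block creates a column policy from TEMPLATE_DIRECT and replaces only its Filter preference. The table and column (my_docs.text) and the policy name are hypothetical, and the parameter names other than source_policy are assumptions made for illustration; verify them against the CTX_DDL.CREATE_POLICY reference before use.

```sql
-- Sketch only: MY_DOCS.TEXT and MY_DOCS_POLICY are hypothetical names.
-- Parameter names other than source_policy are assumed, not taken from this chapter.
BEGIN
  ctx_ddl.create_policy(
    policy_name   => 'MY_DOCS_POLICY',          -- must be unique for the user
    colspec       => 'MY_DOCS.TEXT',            -- text column assigned to the policy
    source_policy => 'CTXSYS.TEMPLATE_DIRECT',  -- template that supplies the defaults
    filter_pref   => 'CTXSYS.HTML_FILTER'       -- overrides the template's Filter preference
  );
END;
/
```

Under these assumptions, every preference that is not named explicitly would be taken from the source policy.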

Multiple Policies on a Column

Multiple policies, as long as they are unique for the user, can be assigned to a column. As a result, a column can have more than one index. When a query is performed, you can specify a policy name to indicate the index that is used to process the query.

This feature is particularly useful if you have English-language documents for which you want to enable both text and theme queries. To enable text and theme queries, you must create both a text indexing policy and a theme indexing policy on the column containing the documents and create a ConText index for each policy.

See Also:

For more information about text and theme queries, see "Text/Theme Queries" in Chapter 6, "Text Concepts".

For more information about text indexing and theme indexing policies, see "Text Lexers" and "Theme Lexer" in this chapter.

For a complete discussion of text and theme queries, see Oracle8 ConText Cartridge Application Developer's Guide.

Policy Examples

Consider a table with two text columns: one holds Microsoft Word documents and the other holds (plain text) comments for the documents. The table structure is:

Table name	Column Name	Datatype	Description
DOC_AND_COMMENT	TEXTKEY	NUMBER	Primary key column
	DATE	DATE	Publishing date of document
	AUTHOR	VARCHAR2(50)	Name of document author
	COMMENTS	VARCHAR2(2000)	Text column storing comments (ASCII text) for documents
	TEXT	LONG RAW	Text column storing MS Word documents

To create a text index for both the comment and doc columns in doc_and_comment, a policy must be defined for each column. The following example illustrates two policies named i_doc and i_comments that could be created:

Policy Name	Indexing Option	Indexing Option Value
I_DOC	Text Column	DOC_AND_COMMENT.DOC
	Data Store	Direct (text in column)
	Filter	MS Word
	Lexer	General purpose text lexer
	Engine	General purpose indexing engine
	Stoplist	Default stoplist (English)
	Wordlist	Soundex and stemming
I_COMMENTS	Text Column	DOC_AND_COMMENT.COMMENTS
	Data Store	Direct (text in column)
	Filter	None (ASCII text)
	Lexer	General purpose lexer
	Engine	General purpose indexing engine
	Stoplist	Default stoplist (English)
	Wordlist	None

To create a theme index for the doc column, a theme indexing policy must be defined. The following example illustrates a policy named i_theme that could be created for the table:

Policy Name	Indexing Option	Indexing Option Value
I_THEME	Text Column	DOC_AND_COMMENT.DOC
	Data Store	Direct (text in column)
	Filter	MS Word
	Lexer	Theme lexer
	Engine	General purpose indexing engine
	Stoplist	Not applicable
	Wordlist	Not applicable

Predefined Template Policies

ConText provides the following template policies (listed in alphabetical order):

DEFAULT_POLICY

This template policy uses all of the default preferences. It can be used to create a policy with the following characteristics:

Preferences	Characteristics
DEFAULT_DIRECT_DATASTORE	Text stored in database
DEFAULT_NULL_FILTER	No filter (text stored in plain, ASCII format)
DEFAULT_LEXER	Basic lexer (standard punctuation and continuation characters, no printjoins or skipjoins characters)
DEFAULT_INDEX	Indexing memory = 12582912 bytes, default storage/other clauses for ConText index tables and indexes
NO_SOUNDEX	No Soundex word mappings stored during text indexing
DEFAULT_STOPLIST	Default stoplist (English) is active

Note:

DEFAULT_POLICY is the default for source_policy in both CTX_DDL.CREATE_POLICY and CTX_DDL.CREATE_TEMPLATE_POLICY.

TEMPLATE_AUTOB

This template policy uses the AUTOB predefined Filter preference and all the remaining preferences from DEFAULT_POLICY. It can be used to create a column policy for a text column that contains documents in any of the formats supported by the ConText internal filters.

TEMPLATE_BASIC_WEB

This template policy uses the following predefined preferences and can be used to create a column policy which enables basic section searching for a text column containing HTML documents:

Preferences	Characteristics
DEFAULT_URL	Text stored in external files, URLs to external files stored in text column
BASIC_HTML_FILTER	HTML filter with certain HTML tags specified for keep_tag
BASIC_HTML_LEXER	Basic lexer with characters specified for startjoins and endjoins
DEFAULT_INDEX	Indexing memory = 12582912 bytes, default storage/other clauses for ConText index tables and indexes
BASIC_HTML_WORDLIST	No Soundex word mappings stored during text indexing; HTML section group specified for section_group
DEFAULT_STOPLIST	Default stoplist (English) is active

TEMPLATE_DIRECT

This template policy uses the same preferences as DEFAULT_POLICY. It can be used to create a policy for indexing basic text stored in a text column.

TEMPLATE_LONGTEXT_STOPLIST_OFF

This template policy uses the NO_STOPLIST predefined Stoplist preference and all the remaining preferences from DEFAULT_POLICY. It can be used to create a policy that does not use a stoplist during indexing.

TEMPLATE_LONGTEXT_STOPLIST_ON

This template policy uses the DEFAULT_STOPLIST predefined Stoplist preference and all the remaining preferences from DEFAULT_POLICY. It can be used to create a policy that uses the default stoplist (English) during indexing.

TEMPLATE_MD

This template policy uses the MD_TEXT predefined Data Store preference and all the remaining preferences from DEFAULT_POLICY. It can be used to create a policy for indexing text stored in the detail column in a master-detail table.

TEMPLATE_MD_BIN

This template policy uses the MD_BINARY predefined Data Store preference and all the remaining preferences from DEFAULT_POLICY. It can be used to create a policy for indexing binary-format text stored in the detail column in a master-detail table.

TEMPLATE_WW6B

This template policy uses the WW6B predefined Filter preference and all the remaining preferences from DEFAULT_POLICY. It can be used to create a policy for indexing text in Microsoft Word for Windows 6 format.

Preferences for Indexing

This section provides conceptual and reference information about indexing preferences.

What is an Indexing Preference?

Indexing preferences specify the options that ConText uses to create ConText indexes. Each preference represents one (and only one) indexing option and is grouped into one of six categories or types, which correspond to the information ConText requires for creating indexes:

Data Store preferences
Filter preferences
Lexer preferences
Engine preferences
Wordlist preferences
Stoplist preferences

When creating a policy, six preferences are specified, one for each of the six types. If one of the preferences is not specified when the policy is created, the preference (for that type) from the DEFAULT_POLICY template policy is used.

A preference can be used in more than one policy; however, two preferences of the same type cannot be used in the same policy.

Note:

If you want to use the same preferences for two text columns, you must create two separate policies. The policies will be identical (having all of the same preferences), but they must have unique names and be attached to different columns. This is true whether the columns are in the same table or in different tables.

Tiles in Preferences

Tiles are the objects in the ConText data dictionary that provide ConText with information about how text is managed in the system, as well as indexing instructions. Each Tile specifies a distinct indexing option within the ConText framework.

A Tile is the main component of a preference. Each Tile may have none, one, or many attributes that are used to define preferences. The attributes identify which indexing options are active for the preference.

You define one of the types of preferences by setting the attributes with the desired values for the appropriate Tile, then creating the preference. While a type is not explicitly assigned to a preference, it is implied through the association of the Tile with the preference.

Predefined Preferences

ConText provides a number of predefined preferences (owned by CTXSYS) for each type. These predefined preferences can be used by any ConText user with the CTXAPP role to create policies without having to first create preferences.

User-defined Preferences

ConText users with the CTXAPP role can create their own preferences by setting the required attributes for one of the Tiles provided by ConText, then calling CTX_DDL.CREATE_PREFERENCE and specifying the name of the Tile.

Note:

When creating a policy, users can use all preferences that have been defined in the ConText data dictionary, including their own preferences, preferences created by other users, or the predefined preferences provided by ConText.

Data Store Predefined Preferences

ConText provides the following predefined Data Store preferences:

DEFAULT_DIRECT_DATASTORE (Used in DEFAULT_POLICY)
DEFAULT_OSFILE
DEFAULT_URL
MD_BINARY
MD_TEXT

DEFAULT_DIRECT_DATASTORE

This preference calls the DIRECT Tile, which is used to indicate that text is stored directly in the text column of a text table.

DEFAULT_OSFILE

This preference calls the OSFILE Tile, which is used to indicate that text is stored as files in a file system.

DEFAULT_OSFILE uses the path attribute and a hardcoded set of dummy directory paths to indicate the directories in which the text files are located.

The hardcoded paths, delimited by colons, are: /oracle/data, /oracle/data2, /oracle/data3.

Note:

If the locations of your files do not match the hardcoded paths, do not use the DEFAULT_OSFILE preference in a policy.
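
When the hardcoded paths do not apply, a user-defined Data Store preference based on the OSFILE Tile can supply the correct directory instead. The following block is a minimal sketch: the preference name and directory are hypothetical, and both the CTX_DDL.SET_ATTRIBUTE call and the argument order of CTX_DDL.CREATE_PREFERENCE are assumptions that should be checked against the CTX_DDL reference.

```sql
-- Sketch only: MY_OSFILE and /docs/reports are hypothetical.
-- SET_ATTRIBUTE and the argument order of CREATE_PREFERENCE are assumptions.
BEGIN
  ctx_ddl.set_attribute('PATH', '/docs/reports');  -- path attribute of the OSFILE Tile
  ctx_ddl.create_preference(
    'MY_OSFILE',                    -- preference name
    'Reports stored as OS files',   -- description
    'OSFILE'                        -- Tile on which the preference is based
  );
END;
/
```

The resulting preference would then be named as the Data Store preference when the column policy is created.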

DEFAULT_URL

This preference calls the URL Tile, which is used to indicate that text is stored in external files and that the text column holds the URLs for those files.

DEFAULT_URL uses all of the attribute defaults for the URL Tile:

timeout of 30 seconds
up to 8 HTTP threads handled simultaneously
up to 256 HTML documents can be accessed simultaneously
the maximum length of a URL stored in the text column is 256 bytes
the maximum size of an HTML file that the URL data store will access without error is 2 megabytes
no proxy server

MD_BINARY

This preference calls the MASTER DETAIL Tile, which is used to indicate that text is stored in a master-detail table.

MD_BINARY uses the binary attribute and a value of YES to indicate that the text in the table is stored in binary format (newline characters do not indicate end of line).

MD_TEXT

This preference calls the MASTER DETAIL Tile, which is used to indicate that text is stored in a master-detail table.

MD_TEXT uses the binary attribute and a value of NO to indicate that the text in the table is stored in plain text format (newline characters indicate end of line).

Filter Predefined Preferences

ConText provides the following predefined Filter preferences:

AUTOB
BASIC_HTML_FILTER
DEFAULT_NULL_FILTER (Used in DEFAULT_POLICY)
HTML_FILTER
WW6B

AUTOB

This preference calls the BLASTER FILTER Tile which specifies an internal filter used to extract text from formatted documents in a text column.

AUTOB uses the format attribute and a value of 997 to indicate that ConText uses the autorecognize filter to extract text. It can be used to filter text in a column that contains the following document formats:

Document Format	Version
AmiPro for Windows	1, 2, 3
ASCII	N/A
HTML	1, 2, 3
Lotus 123 for DOS	4, 5
Lotus 123 for Windows	2, 3, 4, 5
Microsoft Word for Windows	2, 6.x
Microsoft Word for DOS	5.0, 5.5
Microsoft Word for MAC	3, 4, 5.x
WordPerfect for Windows	5.x, 6.x
WordPerfect for DOS	5.0, 5.1, 6.0
Xerox XIF for UNIX	5, 6

BASIC_HTML_FILTER

This preference is identical to the HTML_FILTER predefined preference, except the keep_tag attribute is set with the following values to support basic section searching in HTML documents:

'P'
'TITLE'
'H1','H2','H3','H4','H5','H6'
'HEAD'
'BODY'

DEFAULT_NULL_FILTER

This preference calls the FILTER NOP Tile which indicates that the text column in a text table contains plain, unformatted (ASCII) text and does not require filtering for indexing and highlighting.

HTML_FILTER

This preference calls the HTML FILTER Tile and can be used to filter documents in a column that contains only HTML-formatted documents.

WW6B

This preference calls the BLASTER FILTER Tile and specifies a value of 11 for the format attribute to indicate ConText uses the Word for Windows 6 filter to extract text. It can be used in a column that contains only Word for Windows 6-formatted documents.

Lexer Predefined Preferences

ConText provides the following predefined Lexer preferences:

BASIC_HTML_LEXER

This preference is identical to DEFAULT_LEXER, except the startjoins and endjoins attributes for the BASIC LEXER Tile are set with '</' and '>' respectively to support basic section searching in HTML documents.

DEFAULT_LEXER

This preference calls the BASIC LEXER Tile, which indicates the lexer settings used to identify word and sentence boundaries for text indexing and text queries.

DEFAULT_LEXER uses the following Tile attributes and values to indicate the lexer settings:

Attribute	Values
punctuations	. ? !
printjoins	NULL (indicates no characters defined as printjoins for the BASIC LEXER)
skipjoins	NULL (indicates no characters defined as skipjoins for the BASIC LEXER)
continuation	- \

KOREAN

This preference calls the KOREAN LEXER Tile and can be used for parsing Korean text. Because the KOREAN LEXER Tile does not have any attributes, no attributes are set for this preference.

THEME_LEXER

This preference calls the THEME LEXER Tile, which indicates the preference can be used in a column policy to create theme indexes for a column.

The THEME_LEXER preference does not set any attributes because the THEME LEXER Tile does not have any attributes.

VGRAM_CHINESE_1 and VGRAM_CHINESE_2

These preferences call the CHINESE V-GRAM LEXER Tile, which indicates the preferences can be used for parsing Chinese text.

The 1 or 2 indicates that the preference uses either method 1 or 2 for identifying tokens in Chinese text (hanzi_indexing attribute).

VGRAM_JAPANESE_1 and VGRAM_JAPANESE_2

These preferences call the JAPANESE V-GRAM LEXER Tile, which indicates the preferences can be used for parsing Japanese text.

The 1 or 2 indicates that the preference uses either method 1 or 2 for identifying tokens in Japanese text (kanji_indexing attribute).

Engine Predefined Preferences

ConText supplies a single predefined Engine preference, DEFAULT_INDEX.

DEFAULT_INDEX

This preference calls the GENERIC ENGINE Tile which is used to specify the amount of memory reserved for indexing.

DEFAULT_INDEX uses the index_memory attribute to allocate 12582912 bytes (12 MB) of memory for indexing.

Wordlist Predefined Preferences

ConText provides the following predefined Wordlist preferences, which all use the GENERIC WORD LIST Tile:

BASIC_HTML_WORDLIST

This preference is identical to the NO_SOUNDEX preference, except the section_group attribute has a value of 'BASIC_HTML_SECTION', which is a predefined section group provided by ConText for basic section searching of HTML text.

NO_SOUNDEX

This preference specifies a value of 0 for the soundex_at_index attribute to indicate that ConText does not generate Soundex word mappings during text indexing.

SOUNDEX

This preference specifies a value of 1 for the soundex_at_index attribute to indicate that ConText generates Soundex word mappings during text indexing.

KOREAN_WORDLIST

This preference specifies a value of 3 for the fuzzy_match attribute to ensure fuzzy matching is not enabled for Korean.

VGRAM_CHINESE_WORDLIST

This preference specifies a value of 4 for the fuzzy_match attribute to ensure fuzzy matching is not enabled for Chinese.

VGRAM_JAPANESE_WORDLIST

This preference specifies a value of 2 for the fuzzy_match attribute to enable fuzzy matching for Japanese.

Stoplist Predefined Preferences

ConText provides the following predefined Stoplist preferences for creating text indexes:

DEFAULT_STOPLIST (Used in DEFAULT_POLICY)
NO_STOPLIST

Note:

All of the Stoplist preferences call the GENERIC STOP LIST Tile.

DEFAULT_STOPLIST

This preference defines a list of English terms treated as stop words during indexing.

In addition to the English stoplist in DEFAULT_STOPLIST, ConText supplies stoplists for many European languages. These stoplists are not provided as predefined Stoplist preferences; they are provided as SQL scripts which can be used to create Stoplist preferences for the languages.

See Also:

For a complete list of the stop words in DEFAULT_STOPLIST, as well as the list of stop words for each supplied stoplist, see Appendix A, "Supplied Stoplists".

NO_STOPLIST

This preference specifies that no list of stop words is used during text indexing. All words that ConText encounters are stored in the text index.

Data Storage

Figure 8-3

ConText supports four methods of storing text in a column:

Direct Storage
Master-Detail Storage
External Storage (Operating System Files)
External Storage (URLs)

Note:

The tables illustrated in the following sections are examples only. The column names and definitions for actual tables used to store text will vary depending on the needs of your application.

Direct Storage

With direct storage, text for documents is stored directly in a database column. The following table description illustrates a table in which text is stored directly in a column:

Table Name	Column Name	Datatype	Description
DIR_TEXT	TEXTKEY	NUMBER	Primary or unique key for table
	TEXTDATE	DATE	Document publication date
	AUTHOR	VARCHAR2(50)	Document author
	NOTES	VARCHAR2(2000)	Text column with direct storage
	TEXT	LONG	Text column with direct storage

The requirements for storing text directly in a column are relatively straightforward. The text is physically stored in a text column and the policy for the text column contains a Data Store preference that utilizes the DIRECT Tile.
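
For illustration, the DIR_TEXT table described above could be created with a statement along the following lines. This is a sketch of the example table only; as noted earlier, the columns of an actual text table depend on the application.

```sql
-- Example table with direct text storage (column list follows the description above).
CREATE TABLE dir_text (
  textkey   NUMBER PRIMARY KEY,   -- primary or unique key for the table
  textdate  DATE,                 -- document publication date
  author    VARCHAR2(50),         -- document author
  notes     VARCHAR2(2000),       -- text column with direct storage
  text      LONG                  -- text column with direct storage
);
```

Because NOTES and TEXT are both text columns, each would need its own policy before it could be indexed.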

Master-Detail Storage

Master-detail storage is for documents stored directly in a text column, similar to direct storage; however, each document consists of one or more rows which are indexed as a single row.

In a master-detail relationship, the master table contains the textkey column and the detail table contains the text column, the line number column, and a foreign key to a primary or unique key column in the master table.

The foreign key and the line number columns comprise the primary key for the detail table, which is used to store the text.

The following table description illustrates two tables with a master-detail relationship:

Table Name	Column Name	Datatype	Description
MASTER	PK	NUMBER	Primary key for table
	AUTHOR	VARCHAR2	Document author
	TITLE	VARCHAR2	Document title
DETAIL	FK	NUMBER	Foreign key to master.pk
	LINENO	NUMBER	Detail information for document
	TEXT	VARCHAR2	Text column
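
The two tables could be created with statements similar to the following sketch. The VARCHAR2 lengths and the constraint name are assumptions, since the description above does not give them; the composite primary key on the detail table reflects the foreign key and line number columns described earlier.

```sql
-- VARCHAR2 lengths and the constraint name detail_pk are assumptions.
CREATE TABLE master (
  pk      NUMBER PRIMARY KEY,             -- primary key for the table
  author  VARCHAR2(50),                   -- document author
  title   VARCHAR2(200)                   -- document title
);

CREATE TABLE detail (
  fk      NUMBER REFERENCES master (pk),  -- foreign key to master.pk
  lineno  NUMBER,                         -- orders the rows of one document
  text    VARCHAR2(2000),                 -- text column
  CONSTRAINT detail_pk PRIMARY KEY (fk, lineno)
);
```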

The following query illustrates the relationship between the two tables:

select DETAIL.TEXT
from MASTER, DETAIL
where DETAIL.FK = MASTER.PK
order by DETAIL.FK, DETAIL.LINENO

ConText supports two methods of creating policies for text columns in master-detail tables:

Policies on Columns in Master Table

With this method, the MASTER DETAIL NEW Tile is used to create Data Store preferences, which are used in the policy assigned to one of the columns in the master table. The column to which the policy is assigned (i.e. the text column) can be any column in the master table, except the column that serves as the textkey column for the policy.

Note:

The contents of the text column are not actually indexed. The text column only serves as a place-holder for the policy.

The detail table name and attributes, including the name of the column that contains the text to be indexed, are specified in the Data Store preference.

Using the tables described above, the textkey for the policy would be pk in master. The text column for the policy could be either author or title.

The Data Store preference for the policy would identify detail as the detail table, lineno as the line number column, and text as the column containing the text to be indexed.

See Also:

For an example of creating a policy on a master table column, see "Creating a Data Store Preference for a Master Table" in Chapter 9, "Setting Up and Managing Text".

Advantages

This method has the following advantages:

DML is handled with one insert to the DML Queue, resulting in a smaller queue and quicker processing
Structured data queries in text/theme queries can be applied to the master table

For example:

exec ctx_query.contains('MY_POL','Oracle','ctx_temp', struct_query=>'author=''SMITH''');

Limitations

This method has the following limitations:

The column storing text in the detail table is limited to CHAR, VARCHAR2, and LONG datatypes.
Updates to individual rows in the detail table are no longer automatically detected, since the DML trigger is on the master table. Updates to the text in the detail table must be manually reindexed using CTX_DML.REINDEX or by creating a trigger on the detail table that calls CTX_DML.REINDEX.
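
A trigger of the kind mentioned in the second limitation might look like the following sketch, which uses the MASTER and DETAIL example tables and the MY_POL policy name from the query example. The argument list of CTX_DML.REINDEX shown here (policy name, then textkey value) is an assumption; confirm it against the CTX_DML reference before relying on it.

```sql
-- Sketch only: the trigger name is hypothetical and the REINDEX arguments are assumed.
CREATE OR REPLACE TRIGGER detail_reindex_trg
AFTER INSERT OR UPDATE OR DELETE ON detail
FOR EACH ROW
BEGIN
  -- :NEW.fk is null for deletes, so fall back to :OLD.fk to find the master row.
  ctx_dml.reindex('MY_POL', TO_CHAR(NVL(:NEW.fk, :OLD.fk)));
END;
/
```

A row-level trigger such as this one issues a request for every changed detail row, so a document with many detail rows would be requested for reindexing many times.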

Policies on Columns in Detail Table

With this method, the policy is created on the detail table, rather than on the master table, and the MASTER DETAIL Tile is used instead of the MASTER DETAIL NEW Tile, to create Data Store preferences.

The textkey column and text column for the detail table, along with the line number column, are specified in the policy. The textkey column and the line number column together uniquely identify rows in the detail table.

Using the tables described above, the textkey for the policy would be fk in detail. The text column for the policy would be text.

Disadvantages

This method has the following disadvantages:

Structured data queries in text/theme queries may be slow. The relevant relational criteria are often stored in a different table, resulting in sub-selects to return structured data.
DML may be slow, because the DML trigger is created on the detail table. When a new row is created in the master table and its corresponding rows are created in the detail table, one request is sent to the DML queue for each new detail row, thereby slowing down the queue.
The syntax for one-step queries is non-intuitive. Since the policy is created on the detail table, the one-step query is on the detail table, which may result in multiple rows per document returned by a query.

Note:

This method is provided primarily to maintain backward compatibility with previous versions of ConText.

If you want to index text stored in master-detail tables, Oracle Corporation suggests that you create policies on the master table.

External Storage (Operating System Files)

With operating system storage, the text column does not contain the actual text of the document, but rather stores a pointer (file name) to the operating-system file that contains the text of the document. The Data Store preference for the column policy uses the OSFILE Tile and specifies the location of the file.

Suggestion:

If text is stored in operating system files, the column containing the file names should be either a CHAR or VARCHAR2 column. LONG and LONG RAW columns are best suited for long documents stored directly in the database.

The following table description illustrates a table that uses external data storage:

Table Name	Column Name	Datatype	Description
EXT_TEXT	TEXTKEY	NUMBER	Primary or unique key for the table
	TEXTDATE	DATE	Document publication date
	AUTHOR	VARCHAR2(50)	Document author
	NOTES	VARCHAR2(2000)	Text column with direct text storage
	TEXT	VARCHAR2(100)	Text column with names of operating system files that contain the document text

In this example, the only difference between a table used to store text internally and externally is the datatype of the text column. In an external table, the text column would typically be assigned a datatype of VARCHAR2, rather than LONG, because the column contains a pointer to a file rather than the contents of the file (which requires more space to store).

File Names

The names of the external text files are stored in the text column.

Directory Path Names

The directory path(s) where the external text files are located can be stored in the text column as part of the file name or in the Data Store preference that you create for the OSFILE Tile.

Note:

If the preference does not contain the directory path for the files, ConText requires the directory path to be included as part of the file name stored in the text column.
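
The two ways of supplying the directory can be pictured with rows in the EXT_TEXT example table. The file names, directory, and author values below are hypothetical and serve only to show the difference between the two forms.

```sql
-- Full path stored in the text column: no directory is needed in the preference.
INSERT INTO ext_text (textkey, textdate, author, text)
VALUES (1, SYSDATE, 'SMITH', '/docs/reports/q1_summary.txt');

-- File name only: the directory must come from the OSFILE Data Store preference.
INSERT INTO ext_text (textkey, textdate, author, text)
VALUES (2, SYSDATE, 'JONES', 'q2_summary.txt');
```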

File Access

All the external files referenced in the text column must be accessible from the server machine on which the ConText server is running. This can be accomplished by storing the files locally in the file system for the server machine or by mounting the remote file system to the server machine.

File Permissions

File permissions for external files in which text is stored must be set to allow ConText to access the files. If ConText cannot access a file because its permissions are not set properly, the file cannot be indexed or retrieved by ConText.

External Storage (URLs)

For text stored in external World Wide Web files, the complete address for each file must be stored as a Uniform Resource Locator (URL) in the text column and the URL Tile must be utilized in the Data Store preference for the column policy.

Note:

Text that contains HTML tags and is stored directly in a text column is considered internal, rather than external, text. As such, the Data Store preference for the text column policy would use the Data Store Tiles which support direct text storage.

In addition, Web files can be in any format supported by the World Wide Web, including HTML files, plain (ASCII) text files, and proprietary formats, such as PDF and Word. The filter for the column must be able to recognize and process any of the possible document formats that may be encountered on the Web.

A URL consists of the access scheme for the Web file and the address of the file, in the following format:

access_scheme://file_address

The ConText URL Tile supports three access scheme protocols in URLs:

Hypertext Transfer Protocol (HTTP)

If a URL uses HTTP, the file address contains the host name of the Web server where the file is located and, optionally, the URL path for the file on the Web server.

For example:

http://my_server.com/welcome.html

http://www.oracle.com

Note:

The file address may also (optionally) contain the port on which the Web server is listening.

In this context, a Web server is any host machine that is running an HTTP daemon, which accepts requests for files and transfers the files to the requestor.

File Transfer Protocol (FTP)

If a URL uses FTP, the file address contains the host name of the Web server where the file is located and, optionally, the directory path for the file on the Web server.

For example:

ftp://my_server.com/code/samples/sample1.tar.Z

Note:

The file address may also (optionally) contain a username/password for accessing the host machine.

In this context, a Web server is any host machine that is running an FTP daemon, which accepts requests for files and transfers the files to the requestor.

File Protocol

If a URL uses the file protocol, the address for the file contains the absolute directory path for the location of the file on the local file system.

For example:

file://private/docs/html/intro.html

The file referenced by a URL using the file protocol must reside locally on a file system that is accessible to the machine running ConText.

Because the file is accessed through the operating system, the machine on which the file is located does not need to be configured as a Web server. However, the same requirements that apply to text stored as file names apply to text stored as URLs which use the file protocol.

If the requirements are not met, ConText returns one or more error messages.

See Also:

For more information, see "External Storage (Operating System Files)" in this chapter.

For the error messages returned by the URL data store, see Oracle8 Error Messages.

Intranet Support

Through HTTP and FTP, the URL Tile can be used to index files in an intranet, as well as files on any publicly-accessible Web servers on the World Wide Web.

Intranets are private networks that use the Internet to link machines in the network, but are protected from public access on the Internet via a gateway proxy server which acts as a firewall.

Outside a firewall, a URL request for a Web file is processed directly by the host machine identified in the URL. Within a firewall, requests are processed by the proxy server, which passes the request to the appropriate host machine and transfers the response back to the requestor.

For security reasons, access to an intranet is generally restricted to machines within the firewall; however, machines in an intranet can access the World Wide Web through the gateway proxy server if they have the appropriate permission and security clearance.

Document Access Using HTTP or FTP

When HTTP or FTP is used in a URL stored in the database, ConText acts as a client, submitting a request to a Web server for the file (document) referenced by the URL. If the request is successful, the Web server returns the file to ConText where it can be indexed for querying or highlighted for viewing.

Proxy Servers

If the document to be accessed is located on the World Wide Web outside a firewall and the machine on which ConText is installed is inside the firewall, a host machine that serves as the proxy (gateway) for the firewall must be specified as an attribute for the URL Tile.

A single machine can be specified as the proxy for handling HTTP and FTP requests or two separate machines can be specified, one for each protocol. If network traffic is expected to be heavy or a large number of FTP requests are expected, separate proxies should be specified for HTTP and FTP, since FTP is generally used for accessing large, binary files which may affect performance on the proxy server.

In addition to specifying proxy servers, a sub-string of host or domain names, which identify all or most of the machines internal to the firewall, should be specified. Access to these machines does not require going through the proxy server, which helps reduce the request load that your proxy server(s) have to process.

Multi-threading

In a single-threaded environment, a request for a URL blocks all other requests until a response to the request is returned. Because a response may not be returned for a long time, a single-threaded environment in any text system using HTTP or FTP to access files could create a bottleneck.

To prevent this type of bottleneck, the URL Tile supports multi-threading. With multi-threading, while one thread is blocked, waiting to communicate with a Web server, another thread can retrieve a document from another Web server.

Redirection

The response to a request to retrieve a URL may be a new (redirected) document to retrieve. The URL Tile supports this type of redirection by automatically processing the redirection to retrieve the new document. However, to avoid infinite loops, the URL Tile limits the number of redirections that it attempts to process to three (3).

Timeouts

The time necessary to retrieve a URL using HTTP may vary widely, depending on where the Web server is geographically located. The Web server may even be temporarily unreachable.

To allow control over the length of time an application waits for a response to an HTTP request for a URL, the URL data store supports specifying a maximum timeout.

Exception Handling

When using URLs as your data store, a number of exceptions can occur when a file is accessed. These exceptions are written as errors to the CTX_INDEX_ERRORS view.

The URL data store returns error messages for the following exceptions:

the document referenced in the URL has been permanently moved or cannot be found
access to the document referenced in the URL requires authentication which the user does not have or requires payment which the user must provide
access to the document referenced in the URL is denied by the Web server
the Web server referenced in the URL does not comply with HTTP standards
the specified URL is incorrectly formatted
connection to the Web server is denied (this may occur when the incorrect port is referenced in the URL or the Web server is outside the firewall of an intranet)
the wait for a response to a request to retrieve a URL from a Web server exceeds the maximum timeout specified for the URL preference in the text column policy
the maximum number of supported redirections was reached in attempting to retrieve the document referenced in the URL
the length of the URL exceeds the maximum specified for the URL preference in the text column policy
the size of the document referenced in the URL exceeds the maximum specified for the URL preference in the text column policy
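After an indexing run, the errors recorded for these exceptions could be inspected with a simple query against the view. This is only a sketch: the columns of CTX_INDEX_ERRORS are not listed in this section, so the query selects all of them rather than naming any.

```sql
-- List all recorded indexing errors.
-- Column names are not assumed here, so every column is selected.
SELECT *
  FROM ctx_index_errors;
```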

See Also:

For the error messages returned by the URL data store, see Oracle8 Error Messages.

Data Store Tiles

ConText provides the following Tile(s) for creating Data Store preferences:

Tile               Description
DIRECT             Data stored internally in the text column. Each row is indexed as a single document.
MASTER DETAIL      Data stored internally in the text column. Document consists of one or more rows in a detail table, with header information stored in a master table. The policy is created on the text column in the detail table. As a result, queries return detail information from the detail table. Header information must be queried explicitly.
MASTER DETAIL NEW  Data stored internally in the text column. Document consists of one or more rows in a detail table, with header information stored in a master table. The policy is created on a designated text column in the master table. As a result, queries return header information from the master table. Detail information must be queried explicitly.
OSFILE             Data stored externally in operating system files. File names stored in the text column.
URL                Data stored externally in files located on an intranet or the Internet. Uniform Resource Locators (URLs) stored in the text column.

DIRECT

The DIRECT Tile is used for text stored directly in the database. It has no attributes.

MASTER DETAIL

The MASTER DETAIL Tile is used for text stored directly in the database in master-detail tables, with the textkey column located in the detail table. The column policy is assigned to this column.

The MASTER DETAIL Tile has the following attribute(s):

Attribute  Attribute Values
binary     0 (plain text)
           1 (binary text)

binary

The binary attribute specifies whether text is in plain text format (0) or binary format (1) in the detail table in a master-detail relationship.

Text in plain text format uses newline characters at the end of each line to indicate the end of the line. Text in binary format does not use newline characters to indicate the end of the line.

MASTER DETAIL NEW

The MASTER DETAIL NEW Tile is used for text stored directly in the database in master-detail tables, with the textkey column located in the master table. The column policy is assigned to this column and all detail information is stored in the Data Store preference, rather than the column policy.

MASTER DETAIL NEW has the following attribute(s):

Attribute         Attribute Values
binary            0 (plain text)
                  1 (binary text)
detail_table      name of the detail table (string)
detail_key        name of the foreign key column in the detail table (string)
detail_lineno     name of the line number column in the detail table (string)
detail_text       name of the text column in the detail table (string)
detail_text_size  Internal use only

binary

The binary attribute specifies whether the text in a master detail table is in plain text format (0) or binary format (1).

detail_table

The detail_table attribute specifies the name of the detail table in the master-detail relationship.

detail_key

The detail_key attribute specifies the name of the foreign key column in the detail table.

detail_lineno

The detail_lineno attribute specifies the name of the column in the detail table that identifies rows in the table.

detail_text

The detail_text attribute specifies the name of the text column in the detail table.
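The attributes above might be combined into a single Data Store preference as sketched below. The block follows the calling pattern of the CTX_DDL examples in this chapter; the detail table name, column names, preference name, and description are all hypothetical.

```sql
begin
  -- Hypothetical detail table and column names; substitute your own.
  ctx_ddl.set_attribute ('BINARY', '0');                -- 0 = plain text
  ctx_ddl.set_attribute ('DETAIL_TABLE', 'DOC_LINES');  -- detail table
  ctx_ddl.set_attribute ('DETAIL_KEY', 'DOC_ID');       -- foreign key column
  ctx_ddl.set_attribute ('DETAIL_LINENO', 'LINE_NO');   -- line number column
  ctx_ddl.set_attribute ('DETAIL_TEXT', 'LINE_TEXT');   -- text column
  ctx_ddl.create_preference ('MD_NEW_PREF', 'Master-detail documents', 'MASTER DETAIL NEW');
end;
```

Because the detail information lives in the preference, the policy itself only needs to reference the designated text column in the master table.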

OSFILE

The OSFILE Tile is used for text stored in files accessed through the local file system.

OSFILE has the following attribute(s):

Attribute  Attribute Values
path       path1:path2:...:pathn

path

The path attribute specifies the location of text files that are stored externally in a file system.

Multiple paths can be specified for path, with each path separated by a colon (:). File names are stored in the text column in the text table. If path is not used to specify a path for external files, ConText requires the path to be included in the file names stored in the text column.

Note:

If text is stored in external files rather than in a database, the files must be accessible from the host machine on which the ConText server is running.

This can be accomplished by storing the files in the file system for the host machine or by mounting the file system where the files are stored to the host machine.
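As a sketch of the colon-separated syntax, a preference that names two directories might be written as follows. The directory paths, preference name, and description are illustrative assumptions; the calls mirror the CTX_DDL usage shown elsewhere in this chapter.

```sql
begin
  -- Two hypothetical UNIX directories, separated by a colon.
  ctx_ddl.set_attribute ('PATH', '/private/mydocs:/private/archive');
  ctx_ddl.create_preference ('MULTI_PATH_PREF', 'Documents in two directories', 'OSFILE');
end;
```

With a preference like this, the text column would hold only the file names, since the directory paths are supplied by the path attribute.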

URL

The URL Tile is used for text stored:

in files on the World Wide Web (accessed through HTTP or FTP)
in files in the local file system (accessed through the file protocol)

URL has the following attribute(s):

Attribute   Attribute Values
timeout     seconds (0 to 3600, default 30)
maxthreads  number of threads (1 to 1024, default 8)
maxurls     number of rows in the internal buffer (1 to 4294967295, default 256)
urlsize     URL length in bytes (32 to 65535, default 256)
maxdocsize  document size in bytes (256 to 4294967295, default 2000000)
http_proxy  host name
ftp_proxy   host name
no_proxy    string (up to 16 strings, separated by commas)

timeout

The timeout attribute specifies the length of time, in seconds, that a network operation such as 'connect' or 'read' waits before timing out and returning a timeout error to the application. The valid range for timeout is 0 to 3600 and the default is 30.

Note:

Since timeout is at the network operation level, the total timeout may be longer than the time specified for timeout.

maxthreads

The maxthreads attribute specifies the maximum number of threads that can be running at the same time. The valid range for maxthreads is 1 to 1024 and the default is 8.

Note:

The upper range of maxthreads corresponds to the number of file descriptors that the operating system can process at one time. If the number of files the operating system can process at one time is less than the value set, an invalid socket error may be returned.

maxurls

The maxurls attribute specifies the maximum number of rows that the internal buffer can hold for HTML documents (rows) retrieved from the text table. The valid range for maxurls is 1 to 4294967295 and the default is 256.

urlsize

The urlsize attribute specifies the maximum length, in bytes, that the URL data store supports for URLs stored in the database. If a URL is over the maximum length, an error is returned. The valid range for urlsize is 32 to 65535 and the default is 256.

Note:

The values specified for maxurls and urlsize, when multiplied, cannot exceed 5000000.

In other words, the maximum size of the memory buffer (maxurls * urlsize) for the URL Tile is approximately 5 Megabytes.
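The buffer limit can be checked before a preference is created. The block below is a minimal sketch of that check: the variable names, the chosen values, and the preference name are assumptions, and the 5000000 limit is the figure given in the note above.

```sql
declare
  -- Illustrative values; choose your own within the documented ranges.
  v_maxurls  number := 1024;
  v_urlsize  number := 2048;
  c_limit    constant number := 5000000;  -- maxurls * urlsize limit from the note above
begin
  if v_maxurls * v_urlsize > c_limit then
    -- Report the problem instead of creating an invalid preference.
    dbms_output.put_line ('Buffer too large: ' || v_maxurls * v_urlsize || ' bytes');
  else
    ctx_ddl.set_attribute ('MAXURLS', to_char (v_maxurls));
    ctx_ddl.set_attribute ('URLSIZE', to_char (v_urlsize));
    ctx_ddl.create_preference ('URL_BUF_PREF', 'URL store with larger buffer', 'URL');
  end if;
end;
```

With these values the product is 2097152, which is under the limit, so the preference would be created.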

maxdocsize

The maxdocsize attribute specifies the maximum size, in bytes, that the URL data store supports for accessing HTML documents whose URLs are stored in the database. The valid range for maxdocsize is 1 to 4294967295 and the default is 2000000 (2 Mb).

http_proxy

The http_proxy attribute specifies the fully-qualified name of the host machine that serves as the HTTP proxy (gateway) for the machine on which ConText is installed. This attribute must be set if the machine is in an intranet that requires authentication through a proxy server to access Web files located outside the firewall.

ftp_proxy

The ftp_proxy attribute specifies the fully-qualified name of the host machine that serves as the FTP proxy (gateway) for the machine on which ConText is installed. This attribute must be set if the machine is in an intranet that requires authentication through a proxy server to access Web files located outside the firewall.

no_proxy

The no_proxy attribute specifies a string of domains (up to sixteen, separated by commas) that appear in the host names of most, if not all, of the machines in your intranet. When one of the domains is encountered in a host name, no request is sent to the machine(s) specified for ftp_proxy and http_proxy. Instead, the request is processed directly by the host machine identified in the URL.

For example, if the string 'us.oracle.com, uk.oracle.com' is entered for no_proxy, any URL requests to machines that contain either of these domains in their host names are not processed by your proxy server(s).
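A URL preference for a machine inside a firewall might combine the proxy attributes as sketched below. All host names, domain names, the timeout value, and the preference name are hypothetical placeholders; only the attribute names and the URL Tile come from this section.

```sql
begin
  -- Host and domain names below are hypothetical placeholders.
  ctx_ddl.set_attribute ('TIMEOUT', '60');                            -- seconds per network operation
  ctx_ddl.set_attribute ('HTTP_PROXY', 'www-proxy.us.example.com');   -- proxy for HTTP requests
  ctx_ddl.set_attribute ('FTP_PROXY', 'ftp-proxy.us.example.com');    -- separate proxy for FTP requests
  ctx_ddl.set_attribute ('NO_PROXY', 'us.example.com, uk.example.com');
  ctx_ddl.create_preference ('URL_PROXY_PREF', 'Web documents through a proxy', 'URL');
end;
```

In this sketch, requests to hosts whose names contain us.example.com or uk.example.com would bypass both proxies, which keeps internal traffic off the gateway.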

Data Store Preference Example

The following example creates a preference named doc_pref for the OSFILE Tile:

begin
  -- Set the directory path, then create the preference for the OSFILE Tile.
  ctx_ddl.set_attribute ('PATH', '/private/mydocs');
  ctx_ddl.create_preference ('DOC_PREF', 'Path for my documents', 'OSFILE');
end;

Note:

This example illustrates usage of OSFILE for documents stored in a UNIX-based environment.

The directory path syntax may be different for other environments.

Filtering

Figure 8-4

ConText supports both plain text and formatted text (e.g. Microsoft Word, WordPerfect). In addition, ConText supports text that contains hypertext markup language (HTML) tags.

Regardless of the format, ConText requires text to be filtered for the purposes of indexing the text or processing the text through the Linguistics, as well as highlighting the text for viewing.

This section discusses the following topics relevant to text filtering:

See Also:

For more information about Linguistics and text highlighting, see Oracle8 ConText Cartridge Application Developer's Guide.

Internal Filters

ConText provides internal filters for:

Plain Text Filtering
HTML Filtering (plain text containing HTML tags)
Formatted Text Filtering

In addition, ConText provides the Autorecognize Filter, an internal filter for columns containing mixed formats.

Plain Text Filtering

Plain text requires no filtering because the text is already in the format that ConText requires for identifying tokens.

HTML Filtering

ConText provides an internal filter that supports English and Japanese text containing HTML tags for HTML versions 1, 2, and 3.

Note:

For non-English and non-Japanese documents that contain HTML tags, an external filter must be used.

The HTML filter processes all text that is delimited by the standard HTML tag characters (angle brackets).

All HTML tags are either ignored or converted to their representative characters in the ASCII character set. This ensures that only the text of the document is processed during indexing or by the Linguistics.

Formatted Text Filtering

ConText provides internal filters for filtering English and Western European text in a number of proprietary word processing formats.

Note:

For Japanese, Korean, and Chinese formatted text, external filters must be used.

The filters extract plain, ASCII text from a document, then pass the text to ConText, where the text is indexed or processed through the Linguistics. The following document formats are supported by the internal filters:

Format                        Version
AmiPro for Windows            1, 2, 3
Lotus 1-2-3 for DOS           4, 5
Lotus 1-2-3 for Windows       2, 3, 4, 5
Microsoft Word for DOS        5.0, 5.5
Microsoft Word for Macintosh  3, 4, 5.x
Microsoft Word for Windows    2, 6.x, 7.0
WordPerfect for DOS           5.0, 5.1, 6.0
WordPerfect for Windows       5.x, 6.x
Xerox XIF for UNIX            5, 6

Note:

Only the following formats support WYSIWYG viewing in the ConText viewers:

Microsoft Word for Windows 2 and 6.x
WordPerfect for DOS 5.0, 5.1, 6.0
WordPerfect for Windows 5.x, 6.x

For more information about the ConText viewers, see Oracle8 ConText Cartridge Workbench User's Guide.

For formats not supported by the internal filters, users can create their own external filters.

Autorecognize Filter

Autorecognize is an internal filter that automatically recognizes the document formats of all the supported internal filters, as well as plain text (ASCII) and HTML formats, and extracts the text from the document using the appropriate filters.

Note:

Microsoft Word for Windows 7.0 documents are not recognized by Autorecognize. As a result, ConText does not support storing Microsoft Word for Windows 7.0 documents in mixed-format columns.

External Filters

ConText provides a framework for users to plug in user-defined and/or third-party filters to extract plain text from documents. These external filters can be used for a number of purposes, including:

indexing text stored in a format, such as PDF, for which an internal filter does not exist
removing unnecessary text or markup in a document prior to indexing or processing through the ConText Linguistics

For example, the Linguistics rely on text that is grouped into logical paragraphs. If the text stored in the database does not contain clearly-identified paragraphs, the quality of the output generated by the Linguistics may be poor.

An external filter that outlines the paragraph boundaries according to ConText standards could be created to ensure that the Linguistics are provided with an ordered, logical text feed.

Note:

External filters do not support WYSIWYG viewing in the ConText viewers provided with the ConText Workbench.

For more information about the ConText viewers, see Oracle8 ConText Cartridge Workbench User's Guide.

External Filter Requirements

An external filter can be any executable (e.g. shell script, C program, perl script) that processes an input file and produces a plain text output file. The text in the output file then can be indexed.

If the document is in a proprietary format, the executable must recognize the format tags for the document and be able to convert the formatted text into plain (ASCII) text.

In addition, the executable must be able to run from the operating system command-line and accept two system-supplied arguments:

name of an input file, which stores the document to be filtered
name of an output file, which stores the filtered, ASCII text of the document

The external filter does not need to provide the values for these arguments; ConText provides the values as part of its external filter processing.

Note:

The name of the executable cannot be larger than 64 bytes. In addition, the name cannot contain blank spaces or any of the following illegal characters:

! @ # $ % ^ & * ( ) ~ \ Q ' , ^ : " ; ,

Performance Issues

Performance is dependent on the external filter; ConText cannot begin processing a document until the entire document has been filtered. The external filter that performs the filtering should be tuned/optimized accordingly.

Using External Filters

The process model for using external filters is:

1. Create a filter in the form of a command-line executable.

2. Store the executable on the server machine where ConText is installed.

Note:

The filter executable must be located in the appropriate directory for your environment.

For example, in a UNIX-based environment, the filter executables must be stored in $ORACLE_HOME/ctx/bin.

In a Windows NT environment, the executables must be stored in \BIN in the Oracle home directory.

For more information about the required location for the external filter executables, see the Oracle8 installation documentation specific to your operating system.

3. Create a Filter preference that calls the filter executable.

The Tile you use to create the preference depends on whether you use the column to store documents in a single format or mixed formats.

4. Create a policy that includes the Filter preference for the external filter.

See Also:

For examples of creating Filter preferences for external filters, see "Creating Filter Preferences" in Chapter 9, "Setting Up and Managing Text".

Supplied External Filters

ConText provides a number of external filters for filtering many of the most popular word processing and desktop publishing formats on a number of platforms.

See Also:

For a complete list of the external filters supplied by ConText, as well as instructions for setting up and using the filters, see "Supplied External Filters" in Appendix D, "External Filter Specifications".

Filters for Single-Format Columns

Figure 8-5

For columns that store documents in only one format, a single filter is specified in the Filter preference for the column policy. The filtering method for the column is determined by whether the format is supported by the internal or external filters.

Figure 8-5 illustrates the different filtering methods for single-format columns.

See Also:

For examples of creating Filter preferences for single-format columns, see "Creating Filter Preferences" in Chapter 9, "Setting Up and Managing Text".

Filters for Mixed-Format Columns

Figure 8-6

For columns that store documents in mixed formats, the filtering method is determined by whether the formats are supported by the internal filters, external filters, or both.

Figure 8-6 illustrates the different filter specification methods for mixed-format columns.

Note:

If required, internal filters can be overridden in a Filter preference by explicitly calling an external filter for the format. This can be useful if you have an external filter that provides additional filtering not provided by the internal filters.

For example, you may have MS Word documents that you want spellchecked before indexing. You could create an external MS Word filter that performs the spellchecking and specify the external filter in the Filter preference for the column policy.

See Also:

For examples of creating Filter preferences for mixed-format columns, see "Creating Filter Preferences" in Chapter 9, "Setting Up and Managing Text".

For a complete list of supported formats for mixed-format columns, see "Supported Formats for Mixed-Format Columns" in Appendix D, "External Filter Specifications".

Filter Tiles

Filter Tiles are used to create preferences which determine how text is filtered for indexing and highlighting. Filters allow word processor and formatted documents, as well as ASCII and HTML text documents, to be indexed and highlighted by ConText.

For formatted documents, ConText stores documents in their native format and uses filters to build temporary ASCII versions of the documents. ConText indexes the temporary ASCII text of the formatted document. ConText also uses the ASCII version to highlight query terms.

ConText provides internal filters for processing many of the popular document formats, including Microsoft Word, WordPerfect, and AmiPro.

In addition, ConText allows users to specify external filters for filtering documents in formats not supported by the internal filters provided with ConText.

External filters can also be used to perform operations, such as cleaning up or converting text, before the text is filtered for indexing and highlighting.

ConText provides the following Tile(s) for creating Filter preferences:

Tile            Description
BLASTER FILTER  Tile for filtering formatted text and/or plain text using internal filters, external filters, or some combination of both
FILTER NOP      Tile for plain text (does not require filtering)
HTML FILTER     Tile for filtering plain text containing HTML tags
USER FILTER     Tile for specifying an external filter for a column

BLASTER FILTER

The BLASTER FILTER Tile is used to specify that either:

internal filters are used to filter documents
multiple external filters are used to filter documents in a mixed-format column

Attributes

BLASTER FILTER has the following attribute(s):

Attribute   Attribute Values
executable  format id (number), filter executable, sequence (number)
format      0 or 999 (No filter -- plain/ASCII text)
            1 or 4 (WordPerfect for Windows 5.x; WordPerfect for DOS 5.0, 5.1)
            2 (MS Word for DOS 5.0, 5.5)
            5 (WordPerfect for Windows 6.x; WordPerfect for DOS 6.0)
            6 (MS Word for Mac 3, 4, 5.x)
            7 (MS Word for Windows 2)
            8 (AmiPro for Windows 1, 2, 3)
            9 (Lotus 1-2-3 for Windows 2, 3, 4, 5; Lotus 1-2-3 for DOS 4, 5)
            11 (MS Word for Windows 6.x, 7.0)
            13 (Xerox XIF for UNIX 5, 6)
            997 (Autorecognize)

executable

The executable attribute specifies the external filters that are used to filter text stored in a mixed-format text column. It has three values that must be specified:

format_id (document format for the external filter)
filter_executable (name of executable that performs the filtering for the document format)
sequence_num (identifier for the executable and document format used in the preference)

Note:

format and executable are mutually exclusive.

See Also:

For a list of the format IDs supported by the executable attribute, see "Supported Formats for Mixed-Format Columns" in Appendix D, "External Filter Specifications".

format

The format attribute specifies the internal filter used for filtering text stored in a text column.
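For a mixed-format column whose formats are all handled by the internal filters, the format attribute could be set to the Autorecognize value. The block below is a sketch that reuses the CTX_DDL calls shown in the examples in this chapter; the preference name and description are made up.

```sql
begin
  -- 997 is the format value listed for Autorecognize.
  ctx_ddl.set_attribute ('FORMAT', '997');
  ctx_ddl.create_preference ('AUTO_FILT_PREF', 'Mixed-format documents', 'BLASTER FILTER');
end;
```

Because format and executable are mutually exclusive, a preference like this one would not also set the executable attribute.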

FILTER NOP

The FILTER NOP Tile is used to specify that plain text is stored in the text column and no filtering needs to be performed. It has no attributes.

HTML FILTER

The HTML FILTER Tile is used to specify that the internal HTML filter is used to filter plain text that contains HTML tags.

Attributes

HTML FILTER has the following attribute(s):

Attribute        Attribute Values
code_conversion  0 (disabled)
                 1 (enabled)
keep_tag         tag (string), sequence (number)

code_conversion

The code_conversion attribute specifies whether code conversion is enabled for documents which contain Japanese ASCII text with HTML tags.

Code conversion is required for Japanese HTML documents if the documents use more than one of the three character sets supported for HTML text in Japanese. If code conversion is enabled, all Japanese HTML documents are converted to a single, common character set before indexing.

The default for code_conversion is 0 (disabled).

Note:

For mixed-format columns that use Autorecognize (BLASTER FILTER Tile, format attribute = 997) or use external filters (BLASTER FILTER Tile, executable attribute) for all formats except HTML, code conversion is always enabled.
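For a single-format column of Japanese HTML documents, code conversion might be enabled explicitly as sketched below. The preference name and description are hypothetical; the attribute name and its values come from the table above, and the calls follow the CTX_DDL pattern used in this chapter's examples.

```sql
begin
  -- 1 enables code conversion; the default is 0 (disabled).
  ctx_ddl.set_attribute ('CODE_CONVERSION', '1');
  ctx_ddl.create_preference ('JA_HTML_PREF', 'Japanese HTML documents', 'HTML FILTER');
end;
```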

keep_tag

The keep_tag attribute takes two values: the HTML tag to retain during indexing and a sequence number that uniquely identifies the tag.

The following rules apply to keep_tag:

the angle brackets '<>' that identify tags in HTML are not required when setting keep_tag
multiple tags can be specified for a Filter preference by calling CTX_DDL.SET_ATTRIBUTE once for each tag, then calling CTX_DDL.CREATE_PREFERENCE
the sequence number specified for each tag must be unique within the preference
if the tag specified for keep_tag contains additional (i.e. meta) information, the additional information is removed by the HTML filter

For example, suppose keep_tag is set to BODY and the following string occurs in a document:

<HTML><BODY BGCOLOR=#ffffff>hello</BODY></HTML>

ConText translates the string to:

<BODY>hello</BODY>

This string is passed to the HTML filter, which ignores the HTML tags, then to the lexer, which indexes the token hello as belonging to the BODY section.

USER FILTER

The USER FILTER Tile is used to specify an external filter for filtering documents in a column.

Attributes

USER FILTER has the following attribute(s):

Attribute  Attribute Values
command    filter executable

command

The command attribute specifies the executable for the single external filter used to filter all text stored in a column. If more than one document format is stored in the column, the external filter specified for command must recognize and handle all such formats, otherwise the BLASTER FILTER Tile (with the executable attribute) should be used instead of the USER FILTER Tile.
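A Filter preference for a single external filter might be created as sketched below. The executable name pdf2text is a hypothetical filter, not one supplied with ConText, and the preference name and description are also assumptions.

```sql
begin
  -- 'pdf2text' is a hypothetical external filter executable.
  ctx_ddl.set_attribute ('COMMAND', 'pdf2text');
  ctx_ddl.create_preference ('PDF_FILTER_PREF', 'External PDF filter', 'USER FILTER');
end;
```

The named executable would have to meet the external filter requirements described earlier: run from the command line, accept an input file name and an output file name, and reside in the required directory on the server machine.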

Filter Preference Examples

This section provides two Filter preference examples.

See Also:

For more examples of creating Filter preferences, see "Creating Filter Preferences" in Chapter 9, "Setting Up and Managing Text".

Example 1 (MS Word 6 documents)

The following example creates a preference named word6 for the BLASTER FILTER Tile:

begin
  ctx_ddl.set_attribute ('FORMAT', '11');
  ctx_ddl.create_preference ('WORD6', 'Microsoft Word docs', 'BLASTER FILTER');
end;

Example 2 (HTML documents with document sections enabled)

The following example creates a preference named sect_filt_pref for the HTML FILTER Tile:

begin
   ctx_ddl.set_attribute('KEEP_TAG', 'TITLE', 1);
   ctx_ddl.set_attribute('KEEP_TAG', 'HEAD', 2);
   ctx_ddl.set_attribute('KEEP_TAG', 'BODY', 3);
   ctx_ddl.set_attribute('KEEP_TAG', 'H1', 4);
   ctx_ddl.create_preference('sect_filt_pref','sect search filt','HTML FILTER');
end;

In this example, the <TITLE>, </TITLE>, <HEAD>, </HEAD>, <BODY>, </BODY>, <H1>, and </H1> HTML tags are retained by the HTML filter during filtering, provided the startjoins and endjoins attributes for the BASIC LEXER Tile are set appropriately.

Note:

When using keep_tag to specify tags to be retained, you do not need to specify the angle bracket or forward slash characters in the tag strings.

See Also:

For more information about document sections, see "Document Sections" in Chapter 6, "Text Concepts".

Lexers

Figure 8-7

A lexer parses text and identifies tokens for indexing. ConText supports two types of lexers: text lexers and a theme lexer.

The text lexer provided for English and other single-byte, space-delimited languages supports the following features: base-letter conversion, NLS compliance, and composite word indexing.

Text Lexers

English and other single-byte languages, including most European languages, can use the same lexer because tokens (words) in those languages are delimited by blank spaces and standard punctuation (commas, periods, question marks, etc.).

Japanese, Chinese, and many other Asian languages are pictorial (multi-byte) languages that cannot be tokenized in the same manner as single-byte languages.

Single-Byte Languages

ConText includes a single lexer (BASIC LEXER Tile) for all of the single-byte, space-delimited languages, such as English (7-bit character set) and other European languages (8-bit character sets). The basic lexer also works with languages such as Greek, which have different alphabets, but still utilize blank spaces to delimit words.

Multi-Byte Languages

ConText provides three separate lexers for processing Japanese, Chinese, and Korean text.

The Chinese (CHINESE V-GRAM LEXER Tile) and Japanese (JAPANESE V-GRAM LEXER Tile) lexers do not rely on finding token boundaries within text; instead, they use a dictionary of terms to match and index patterns of characters at user-specified, variable points of length.

The Japanese and Chinese lexers also work with languages that use a 7-bit character set, such as English. As a result, ConText supports indexing and querying Japanese and Chinese text that also contains English text.

Note:

Languages that use an 8-bit character set, such as many of the European languages, are not supported by the Japanese and Chinese lexers.

The Korean lexer (KOREAN LEXER Tile) works similarly to the Japanese and Chinese lexers by finding character patterns in the text and matching the patterns to a dictionary of terms. However, due to the significant morphological transformations that Korean verbs undergo, the Korean lexer only indexes nouns and noun phrases.

Text Indexing Policies

By specifying one of the text lexers in the Lexer preference for a policy, you designate the policy as a text indexing policy.

Once a text index is created for the policy, any text requests, including text queries, on the policy will result in the text index being accessed.

See Also:

For more information about text indexing, see "Text Indexes" in Chapter 6, "Text Concepts".

Theme Lexer

For English-language text, a separate lexer (THEME LEXER Tile) is provided for creating theme indexes. This lexer breaks text into tokens; however, the tokens are not stored in the theme index. The tokens are passed to the ConText linguistic core where they are analyzed within the context of the sentences and paragraphs in which they appeared to determine whether they are content-bearing words. The linguistic core then generates themes, which are stored in the theme index.

The themes generated by ConText are based on, but are not identical to, the content-bearing tokens in the text.

By specifying the THEME LEXER Tile in the Lexer preference for a policy, you designate the policy as a theme indexing policy.

Once a theme index is created for the policy, any text requests, including theme queries, on the policy will result in the theme index being accessed.

See Also:

For more information about theme indexing, see "Theme Indexes" in Chapter 6, "Text Concepts".

Base-letter Conversion

For text indexes created on text columns containing languages that use an 8-bit (single-byte) character set, you can specify whether extended characters encountered in tokens are converted to their base-letter representation before their tokens are stored in the text index. Extended characters include special characters and characters with diacritical marks (e.g. accents, umlauts).

Text Indexing

Base-letter conversion is an attribute that you can set when creating a Lexer preference using the BASIC LEXER Tile.

If base-letter conversion is enabled for the Lexer preference in a policy, during text indexing, all characters containing diacritical marks are converted to their base form in the text index. The original text is not affected.

Base-letter conversion requires that the database character set is a subset of the NLS_LANG character set.

For example, suppose the NLS_LANG environment variable is set to French_France.WE8ISO8859P1 and base-letter conversion is enabled. The following string of text is encountered:

La référence de session doit être égale à 'name'

The sentence is indexed as:

la reference de session doit etre egale a name

Note:

Base-letter conversion requires that the language component for NLS_LANG is set to a single-byte language (e.g. French, German) that supports an extended (8-bit) character set. In addition, the charset component must be set to one of the 8-bit character sets (e.g. WE8ISO8859P1).

See Also:

For more information about National Language Support and the NLS_LANG environment variable, see Oracle8 Reference Manual.

Text Queries

In a text query on a column with base-letter conversion enabled, the query terms are automatically converted to match the base-letter conversion that was performed during text indexing.

Note:

Base-letter conversion works with all of the query operators (logical, control, expansion, thesaurus, etc.), except the STEM expansion operator.

See Also:

For more information about text queries and the query operators, see Oracle8 ConText Cartridge Application Developer's Guide.

NLS Compliance

The BASIC LEXER Tile supports all NLS-compliant character sets, including the AL24UTFFSS (UTF-8) character set. UTF-8 is a character set that recognizes the characters from most single-byte and multi-byte character sets.

Users with multilingual environments, such as multinational companies, can specify UTF-8 for a database and use the database to store documents that use any one of the character sets supported by UTF-8. ConText supports indexing all documents stored in a UTF-8 database and queries to the database from clients running any of the UTF-8 supported character sets.

Supported Languages

The BASIC LEXER Tile currently supports the UTF-8 character set only for space-delimited, single-byte languages, which include English and other Western European languages.

The BASIC LEXER Tile does not support UTF-8 for the multi-byte languages, nor do the Japanese, Chinese, and Korean lexers currently support UTF-8.

Enabling the NLS-compliant Lexer

The BASIC LEXER Tile does not require any setup to enable it to handle UTF-8 or other NLS-compliant character sets; however, the NLS_LANG environment variable must be set to the appropriate language/territory/character set. In addition, the ORA_NLS32 and ORA_NLS environment variables must be set to the directories containing the appropriate NLS data.

Limitations

The lexer has the following limitations when UTF-8 is the character set specified for the database:

base-letter conversion is not supported
characters from 8-bit character sets are not supported in the BASIC LEXER Tile attributes (i.e. printjoins, skipjoins, startjoins, endjoins, punctuations, numjoin, numgroup, continuation)

Composite Word Indexing

For German or Dutch text, the BASIC LEXER Tile provides an attribute for enabling composite word indexing. With composite word indexing, tokens that are compound words (specifically nouns) are divided into their constituent (root) nouns, including inflected forms of the roots, and the roots are stored in the ConText index along with the entry for the compound word.

For example, if the word Hauptbahnhof is encountered in a German-language document during composite word indexing, the following entries are created in the index: HAUPTBAHNHOF, HAUPT, BAHN, BAHNEN, HOF.

Note:

Because each token that is encountered has to be processed through the ConText decompounding routines, composite indexing may affect indexing performance.

In addition, because composite word indexes may be substantially larger than standard text indexes, composite word indexing may affect query performance.

Supported Character Sets

Composite word indexing supports both single-byte and multi-byte character sets, specifically WE8ISO8859P9 (extended, single-byte) and AL24UTFFSS (multi-byte).

Limitations

Composite indexes have the following limitations:

composite indexing can be enabled for text columns containing only German or Dutch text. If the column contains text in other languages, composite indexing will fail
composite word indexes do not support exact word searches (i.e. standard text queries). If you want to enable composite and exact word queries for a column, you must create both a composite index and a standard index for the column
case-sensitivity is not supported for composite indexes (all tokens are stored in all-uppercase)

Note:

The uppercasing of all tokens in a composite index results in the composite routines not recognizing some compound nouns. As a result, those nouns are not divided into their root nouns and are indexed as regular tokens with a single entry only in the index.

Word Queries

Composite word indexing enables text queries to return all documents that contain either the query term itself or the query term as a root of a compound word; however, queries for phrases that contain one or more compound words return only the documents that contain the exact phrase.

Note:

For more information about composite word queries, see Oracle8 ConText Cartridge Application Developer's Guide.

Lexer Tiles

ConText provides the following Tile(s) for creating Lexer preferences:

Tile                     Description
BASIC LEXER              Basic lexer used for extracting tokens from text in languages, such as English and most Western European languages, that use single-byte character sets.
CHINESE V-GRAM LEXER     Lexer used for extracting tokens from Chinese-language text.
JAPANESE V-GRAM LEXER    Lexer used for extracting tokens from Japanese-language text.
KOREAN LEXER             Lexer used for extracting tokens from Korean-language text.
THEME LEXER              Lexer that uses the Linguistics Theme Extraction System to generate themes as tokens for theme indexing.

BASIC LEXER

The BASIC LEXER Tile is used to identify tokens for creating text indexes for English and all other supported single-byte languages. It is also used to enable base-letter conversion for single-byte languages that have extended character sets and composite word indexing for German and Dutch text.

Note:

Any changes made to tokens before text indexing (e.g. removing of characters, base-letter conversion) are also performed on the query terms in a text query. This ensures that the query terms match the form of the tokens in the text index entries.

BASIC LEXER has the following attribute(s):

Attribute       Attribute Values
continuation    characters (string)
numgroup        characters (string)
numjoin         characters (string)
printjoins      characters (string)
punctuations    characters (string)
skipjoins       characters (string)
startjoins      non-alphanumeric characters that occur at the beginning of a token (string)
endjoins        non-alphanumeric characters that occur at the end of a token (string)
whitespace      characters (string)
newline         characters (string)
sent_para       0 (disabled), 1 (enabled)
base_letter     0 (disabled), 1 (enabled)
mixed_case      0 (disabled), 1 (enabled)
composite       0 (no composite word indexing), 1 (German composite word indexing), 2 (Dutch composite word indexing)

Note:

The BASIC LEXER Tile attributes that use character strings can contain multiple characters. Each character in the string serves as a distinct character for that type of attribute.

For example, if the string '*_.-' is specified for the printjoins attribute, each individual character ('*', '_', '.', and '-') in the string is treated by ConText as a joining character that is included in the index entry for a token in which the character occurs.

continuation

continuation specifies the characters that indicate a word continues on the next line and should be indexed as a single token. The most common continuation characters are hyphen '-' and backslash '\'.

numgroup

numgroup specifies the characters that, when they appear in a string of digits, indicate that the digits are groupings within a larger single unit.

For example, comma ',' or period '.' may be defined as numgroup characters because they often indicate a grouping of thousands when they appear in a string of digits.

numjoin

numjoin specifies the characters that, when they appear in a string of digits, cause ConText to index the string of digits as a single unit or word.

For example, period '.' or comma ',' may be defined as numjoin characters because they often serve as decimal points when they appear in a string of digits.

Note:

The default values for numjoin and numgroup are determined by the NLS initialization parameters that are specified for the database.

In general, a value does not need to be specified for either numjoin or numgroup when creating a Lexer preference for the BASIC LEXER Tile.
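
If the NLS defaults do need to be overridden, the two attributes can be set explicitly. The following is a sketch only, assuming text in which a comma is the decimal separator and a period groups thousands; the preference name EURO_NUM_LEX is hypothetical:

```sql
begin
  -- assumption: comma acts as the decimal point, period groups thousands
  ctx_ddl.set_attribute ('NUMJOIN',  ',');
  ctx_ddl.set_attribute ('NUMGROUP', '.');
  ctx_ddl.create_preference ('EURO_NUM_LEX', 'explicit numjoin/numgroup', 'BASIC LEXER');
end;
```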

printjoins

printjoins specifies the non-alphanumeric characters that, when they appear anywhere in a word (beginning, middle, or end), are processed by ConText as alphanumeric and included with the token in the text index. This includes printjoins that occur consecutively.

For example, if the hyphen '-' and underscore '_' characters are defined as printjoins, terms such as pseudo-intellectual and _file_ are stored in the text index as pseudo-intellectual and _file_.

Note:

If a printjoins character is also defined as a punctuations character, the character is only processed as an alphanumeric character if the character immediately following it is a standard alphanumeric character or has been defined as a printjoins or skipjoins character.

punctuations

punctuations specifies the non-alphanumeric characters that, when they appear at the end of a word, indicate the end of a sentence. The defaults are period '.', question mark '?', and exclamation point '!'.

Characters that are defined as punctuations are removed from a token before text indexing; however, if a punctuations character is also defined as a printjoins character, the character is only removed if it is the last character in the token and it is immediately preceded by the same character.

For example, if the period (.) is defined as both a printjoins and a punctuations character, the following transformations take place during both indexing and querying:

Token       Indexed Token
.doc        .doc
dog.doc     dog.doc
dog..doc    dog..doc
dog.        dog
dog...      dog..

In addition, BASIC LEXER uses punctuations characters in conjunction with newline and whitespace characters to determine sentence and paragraph delimiters for sentence/paragraph searching.

skipjoins

skipjoins specifies the non-alphanumeric characters that, when they appear within a word, identify the word as a single token; however, the characters are not stored with the token in the text index.

For example, if the hyphen character '-' is defined as a skipjoins character, the word pseudo-intellectual is stored in the text index as pseudointellectual.

Note:

printjoins and skipjoins are mutually exclusive. The same characters cannot be specified for both attributes.
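
A Lexer preference that uses both attributes might look like the following sketch, in which the hyphen is a skipjoins character and the underscore is a printjoins character, so that no character appears in both strings. The preference name JOIN_LEX is hypothetical:

```sql
begin
  -- hyphen joins the parts of a word but is dropped from the index entry
  ctx_ddl.set_attribute ('SKIPJOINS',  '-');
  -- underscore is kept in the index entry; it must not also be a skipjoins character
  ctx_ddl.set_attribute ('PRINTJOINS', '_');
  ctx_ddl.create_preference ('JOIN_LEX', 'skipjoins and printjoins', 'BASIC LEXER');
end;
```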

startjoins/endjoins

startjoins specifies the characters that, when encountered as the first character in a token, explicitly identify the start of the token. The character, as well as any other startjoins characters that immediately follow it, is included in the ConText index entry for the token. In addition, the first startjoins character in a string of startjoins characters implicitly ends the previous token.

endjoins specifies the characters that, when encountered as the last character in a token, explicitly identify the end of the token. The character, as well as any other endjoins characters that immediately follow it, is included in the ConText index entry for the token.

The following rules apply to both startjoins and endjoins:

the characters specified for startjoins/endjoins cannot occur in any of the other attributes for BASIC LEXER.
startjoins/endjoins characters can occur only at the beginning/end of tokens
multiple, contiguous startjoins/endjoins characters are allowed at the beginning/end of a token; however, multiple occurrences of the same startjoins/endjoins character at the beginning/end of a token are not supported

Note:

Defining startjoins and endjoins characters is particularly useful for creating document sections that enable section searching in a column.

For examples of creating sections and section groups, see "Managing User-defined Document Sections" in Chapter 9, "Setting Up and Managing Text".

For more information about sections, see "Document Sections" in Chapter 6, "Text Concepts".

For more information about section searching, see Oracle8 ConText Cartridge Application Developer's Guide.

whitespace

whitespace specifies the characters that are treated as blank spaces between tokens. BASIC LEXER uses whitespace characters in conjunction with punctuations and newline characters to identify character strings that serve as sentence delimiters for sentence/paragraph searching.

The predefined, default values for whitespace are 'space' and 'tab'; these values cannot be changed. Specifying characters as whitespace characters adds to these defaults.

newline

newline specifies the characters that indicate the beginning of a new line of text. BASIC LEXER uses newline characters in conjunction with punctuations and whitespace characters to identify character strings that serve as paragraph delimiters for sentence/paragraph searching.

The only valid values for newline are '\n' and '\r' (for carriage returns) and the default is '\n'.

sent_para

sent_para enables (1) or disables (0) sentence/paragraph searching. The default is '0'.
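
As an illustration, sentence/paragraph searching could be enabled with a preference such as the one sketched here. The preference name SENT_LEX is hypothetical, and passing the value as the string '1' is an assumption modeled on the FORMAT example earlier in this chapter:

```sql
begin
  -- punctuations, whitespace, and newline are left at their defaults
  ctx_ddl.set_attribute ('SENT_PARA', '1');
  ctx_ddl.create_preference ('SENT_LEX', 'sentence/paragraph searching', 'BASIC LEXER');
end;
```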

base_letter

base_letter specifies whether characters that have diacritical marks (umlauts, cedillas, acute accents, etc.) are converted to their base form before being stored in the text index. The default is 0 (base-letter conversion disabled).
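
The following sketch enables base-letter conversion. The preference name FRENCH_LEX is hypothetical, the string form of the value is an assumption, and the NLS_LANG requirements described in "Base-letter Conversion" in this chapter still apply:

```sql
begin
  -- 1 = convert characters with diacritical marks to their base form
  ctx_ddl.set_attribute ('BASE_LETTER', '1');
  ctx_ddl.create_preference ('FRENCH_LEX', 'base-letter conversion', 'BASIC LEXER');
end;
```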

mixed_case

mixed_case specifies whether the lexer converts the tokens in text index entries to all uppercase or stores the tokens exactly as they appear in the text. The default is 0 (tokens converted to all uppercase).

Note:

ConText ensures text queries match the case-sensitivity of the index being queried. As a result, if you enable case-sensitivity for your text index, queries against the index are always case-sensitive.

composite

The composite attribute specifies whether composite word indexing is disabled (0) or enabled for either German (1) or Dutch (2) text. The default is 0 (composite word indexing disabled).

Note:

The composite and mixed_case attributes are mutually exclusive; composite indexes do not support case-sensitivity.

See Also:

For more information, see "Composite Word Indexing" in this chapter.
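
For example, a preference for German composite word indexing might be created as sketched here. The preference name GERMAN_COMP_LEX is hypothetical, and the string form of the value is an assumption:

```sql
begin
  -- 1 = German, 2 = Dutch; mixed_case is deliberately not set,
  -- because composite and mixed_case are mutually exclusive
  ctx_ddl.set_attribute ('COMPOSITE', '1');
  ctx_ddl.create_preference ('GERMAN_COMP_LEX', 'German composite words', 'BASIC LEXER');
end;
```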

CHINESE V-GRAM LEXER

The CHINESE V-GRAM LEXER Tile is used for identifying tokens for creating text indexes for Chinese text.

CHINESE V-GRAM LEXER has the following attribute(s):

Attribute         Attribute Values
hanzi_indexing    1, 2

hanzi_indexing

The hanzi_indexing attribute specifies the number of characters used for pattern matching while indexing.

A value of 1 indicates that the Chinese lexer examines each character individually to determine token boundaries.

A value of 2 indicates that the lexer examines characters in pairs to determine token boundaries. Pattern matching using pairs is generally faster than matching individual characters, resulting in faster index creation.

The default is 2.
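
A preference that overrides the default might be written as in this sketch. The preference name CHINESE_LEX is hypothetical, and the string form of the value is an assumption:

```sql
begin
  -- 1 = examine each character individually instead of in pairs
  ctx_ddl.set_attribute ('HANZI_INDEXING', '1');
  ctx_ddl.create_preference ('CHINESE_LEX', 'single-character matching', 'CHINESE V-GRAM LEXER');
end;
```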

JAPANESE V-GRAM LEXER

The JAPANESE V-GRAM LEXER Tile is used for identifying tokens for creating text indexes for Japanese text.

JAPANESE V-GRAM LEXER has the following attribute(s):

Attribute         Attribute Values
kanji_indexing    1, 2

kanji_indexing

The kanji_indexing attribute specifies the number of characters used for pattern matching while indexing.

A value of 1 indicates that the Japanese lexer examines each character individually to determine token boundaries.

A value of 2 indicates that the lexer examines pairs of characters to determine token boundaries. Pattern matching using pairs is generally faster than matching individual characters, resulting in faster index creation.

The default is 2.
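
The equivalent sketch for Japanese text follows the same pattern. The preference name JAPANESE_LEX is hypothetical, and the string form of the value is an assumption:

```sql
begin
  -- 1 = examine each character individually instead of in pairs
  ctx_ddl.set_attribute ('KANJI_INDEXING', '1');
  ctx_ddl.create_preference ('JAPANESE_LEX', 'single-character matching', 'JAPANESE V-GRAM LEXER');
end;
```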

KOREAN LEXER

The KOREAN LEXER Tile is used for identifying tokens for creating text indexes for Korean text. It has no attributes.

THEME LEXER

The THEME LEXER Tile is used in theme indexing policies to create theme indexes for English-language text. It has no attributes.

See Also:

For an example of creating a theme indexing policy, see "Creating a Column Policy for Theme Indexing" in Chapter 9, "Setting Up and Managing Text".

Lexer Preference Examples

The following section provides two Lexer preference examples that both use the BASIC LEXER Tile.

Example 1

The following example creates a preference named doc_link:

begin
  ctx_ddl.set_attribute     ('PRINTJOINS', '.-@&$#/');
  ctx_ddl.create_preference ('DOC_LINK', 'numerous joins', 'BASIC LEXER' );
end;

In this example, the '.', '-', '@', '&', '$', '#', and '/' characters are all defined as printjoins characters.

Characters such as the dollar sign '$' and number sign '#' are useful if you want to index tokens that may contain these characters, such as sums of money and numbers.

Example 2 (startjoins and endjoins)

The following example creates a preference named sect_lex_pref:

exec ctx_ddl.set_attribute('startjoins','</');
exec ctx_ddl.set_attribute('endjoins','>');
exec ctx_ddl.set_attribute('printjoins','_@-&$#.');
...
exec ctx_ddl.create_preference('sect_lex_pref','basic lexing + sections','BASIC LEXER');

In this example, the characters '<' and '/' are defined as startjoins characters. The character '>' is defined as an endjoins character.

The open and closed angle brackets '< >' and the forward slash '/' are useful for identifying HTML tags for document sections.

See Also:

For more information about sections, see "Document Sections" in Chapter 6, "Text Concepts".

Indexing Engine

The indexing engine is the ConText component that creates a ConText index for a text column. A ConText index is required before text in a column can be queried.

ConText supplies a single engine that creates index entries for ConText indexes, independent of the format, location, language, and character set of the text.

In particular, the engine determines the amount of memory used to create ConText indexes and where in the database the indexes are stored.

See Also:

For more information about creating an Engine preference, see "Creating an Engine Preference" in Chapter 9, "Setting Up and Managing Text".

Engine Tiles

ConText provides the following Tile(s) for creating Engine preferences:

Tile              Description
ENGINE NOP        No engine used for indexing (Not implemented - DO NOT USE)
GENERIC ENGINE    Indexing engine used to create index entries and store them in the database tables comprising the ConText index.

ENGINE NOP

The ENGINE NOP Tile specifies that no engine is used for indexing. This Tile is currently not implemented and should not be used to create Engine preferences for indexing.

GENERIC ENGINE

The GENERIC ENGINE Tile specifies that the indexing engine provided by ConText is used for indexing.

In particular, the GENERIC ENGINE Tile attributes specify the amount of memory allocated for indexing, and the tablespace(s) and creation parameters for the database tables and indexes that constitute a ConText index.

See Also:

For descriptions of the ConText index tables and indexes, see Appendix C, "ConText Index Tables and Indexes".

GENERIC ENGINE has the following attribute(s):

Attribute                                       Attribute Values
** none **                                      N/A
index_memory                                    memory in bytes (integer)
optimize_default                                default ConText index optimization method
i1t_tablespace, i1t_storage, i1t_other_parms    tablespace (string), STORAGE clause (string), and other table creation parameters (string) for token table
i1i_tablespace, i1i_storage, i1i_other_parms    tablespace (string), STORAGE clause (string), and other index creation parameters (string) for index on token table
ktb_tablespace, ktb_storage, ktb_other_parms    tablespace (string), STORAGE clause (string), and other table creation parameters (string) for mapping table
kid_tablespace, kid_storage, kid_other_parms,
kik_tablespace, kik_storage, kik_other_parms    tablespace (string), STORAGE clause (string), and other index creation parameters (string) for indexes on mapping table
lst_tablespace, lst_storage, lst_other_parms    tablespace (string), STORAGE clause (string), and other table creation parameters (string) for control table
lix_tablespace, lix_storage, lix_other_parms    tablespace (string), STORAGE clause (string), and other index creation parameters (string) for index on control table
sqr_tablespace, sqr_storage, sqr_other_parms    tablespace (string), STORAGE clause (string), and other table creation parameters (string) for SQE results table
sri_tablespace, sri_storage, sri_other_parms    tablespace (string), STORAGE clause (string), and other index creation parameters (string) for index on SQE results table

index_memory

index_memory specifies the amount of memory, in bytes, allocated for indexing.

Note:

When specifying a value for index_memory in a preference, specify as much real (not virtual) memory as is available on the machine which is running the ConText server that will be creating indexes.

For parallel indexing, the memory specified should be the amount of available memory divided evenly among the number of ConText servers that will perform the indexing in parallel.
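
The division described above can be sketched as follows. The figures (60,000,000 bytes of available real memory and three ConText servers) and the preference name PAR_ENGINE are hypothetical:

```sql
declare
  available_bytes  number := 60000000;  -- hypothetical real memory available for indexing
  server_count     number := 3;         -- hypothetical number of ConText servers indexing in parallel
  per_server_bytes number;
begin
  -- divide the available memory evenly among the servers: 20000000 bytes each
  per_server_bytes := trunc(available_bytes / server_count);
  ctx_ddl.set_attribute ('INDEX_MEMORY', per_server_bytes);
  ctx_ddl.create_preference ('PAR_ENGINE', 'parallel indexing, 3 servers', 'GENERIC ENGINE');
end;
```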

optimize_default

optimize_default specifies the type of optimization used when CTX_DDL.OPTIMIZE_INDEX is called without an optimization type. If no value is specified for optimize_default, the default is DEFRAGMENT_TO_NEW_TABLE.

xxx_tablespace

i1t_tablespace, ktb_tablespace, and lst_tablespace specify the tablespaces used for the ConText index tables created during indexing.

sqr_tablespace specifies the tablespace used for the stored query expression result (SQR) table that is created, but not populated, during indexing. The SQR table for a policy stores the results of stored query expressions for the policy.

i1i_tablespace, kid_tablespace, kik_tablespace, and lix_tablespace specify the tablespaces used for the Oracle indexes generated for each ConText index table.

sri_tablespace specifies the tablespace used for the Oracle index generated for each SQR table.

Note:

For each xxx_tablespace attribute that is not specified when creating an Engine preference, the text table owner's default tablespace is used for storing the ConText index objects (tables and indexes).
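
As a sketch of how these attributes combine, the following Engine preference places the ConText index tables in one tablespace and their Oracle indexes in another. The tablespace names CTX_TABLES and CTX_INDEXES and the preference name SPLIT_ENGINE are hypothetical. The sqr_tablespace and sri_tablespace attributes are not set, so the SQR table and its index would use the text table owner's default tablespace:

```sql
begin
  -- ConText index tables (token, mapping, and control tables)
  ctx_ddl.set_attribute ('I1T_TABLESPACE', 'CTX_TABLES');
  ctx_ddl.set_attribute ('KTB_TABLESPACE', 'CTX_TABLES');
  ctx_ddl.set_attribute ('LST_TABLESPACE', 'CTX_TABLES');
  -- Oracle indexes on those tables
  ctx_ddl.set_attribute ('I1I_TABLESPACE', 'CTX_INDEXES');
  ctx_ddl.set_attribute ('KID_TABLESPACE', 'CTX_INDEXES');
  ctx_ddl.set_attribute ('KIK_TABLESPACE', 'CTX_INDEXES');
  ctx_ddl.set_attribute ('LIX_TABLESPACE', 'CTX_INDEXES');
  ctx_ddl.create_preference ('SPLIT_ENGINE', 'tables and indexes in separate tablespaces',
                             'GENERIC ENGINE');
end;
```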

xxx_storage

i1t_storage, ktb_storage, and lst_storage specify the STORAGE clauses used to create the ConText index tables during ConText indexing.

sqr_storage specifies the STORAGE clause used to create the stored query expression result (SQR) table during ConText indexing.

i1i_storage, kid_storage, kik_storage, and lix_storage specify the STORAGE clauses used to create the Oracle indexes for each ConText index table.

sri_storage specifies the STORAGE clause used to create the Oracle index for each SQR table.

See Also:

For more information about the STORAGE clause, see the CREATE TABLE and CREATE INDEX commands in Oracle8 SQL Reference.

xxx_other_parms

i1t_other_parms, ktb_other_parms, and lst_other_parms specify any additional parameters used to create the ConText index tables during ConText indexing.

sqr_other_parms specifies any additional parameters used to create the stored query expression result (SQR) table during ConText indexing.

i1i_other_parms, kid_other_parms, kik_other_parms, and lix_other_parms specify any additional parameters used to create the Oracle indexes for each ConText index table.

sri_other_parms specifies any additional parameters used to create the Oracle index for each SQR table.

Note:

In particular, the xxx_other_parms attributes are used to specify a value for the PARALLEL clause in the CREATE TABLE|INDEX command. The PARALLEL clause determines the degree of parallelism used by the Oracle parallel query option for operations such as generating Oracle indexes.

For more information about the PARALLEL clause in CREATE TABLE and CREATE INDEX, as well as the other parameters that can be used to create database tables and indexes, see Oracle8 SQL Reference.

For more information about the parallel query option in Oracle, see Oracle8 Tuning.

See Also:

For more information about SQEs, see Oracle8 ConText Cartridge Application Developer's Guide.

Engine Preference Example

The following example creates a preference named doc_engine for the GENERIC ENGINE Tile:

begin
  ctx_ddl.set_attribute ('INDEX_MEMORY',   30000000 );
  ctx_ddl.set_attribute ('I1T_TABLESPACE', 'DOCUMENTS' );
  ctx_ddl.set_attribute ('I1T_STORAGE',' initial 10M next 2M
                         maxextents 10');
  ctx_ddl.set_attribute ('I1T_OTHER_PARMS',' pctfree 20');
  ctx_ddl.set_attribute ('I1I_OTHER_PARMS',' parallel 2');
  ctx_ddl.create_preference ('DOC_ENGINE', 'Test case',
                             'GENERIC ENGINE' );
end;

Advanced Query (Wordlist) Options

ConText provides advanced query (Wordlist) options for expanding text queries using the following methods: stemming, fuzzy matching, and Soundex.

ConText also provides an option for refining text queries using user-defined document sections.

Note:

While the expansion options provided by ConText can be used in theme queries, ConText automatically provides expansion for theme queries through the Linguistics Theme Extraction System.

In addition, the concept of document sections does not apply to theme indexes.

As such, the Wordlist options are not generally used for theme indexes.

See Also:

For more information about expanding and refining text queries, see Oracle8 ConText Cartridge Application Developer's Guide.

For more information about user-defined sections for refining queries, see "User-Defined Sections" in Chapter 6, "Text Concepts".

Stemming

Stemming expands a text query by deriving variations (verb conjugation, noun, pronoun, and adjective inflections) of the search token(s) in the query.

For example, a stem search on the verb buy expands to include its alternate verb forms, such as buys, buying, and bought, but not the noun buyer. A search on the noun buyer would expand only to include its plural form buyers.

Since different languages have different stemming rules, stemming is language-dependent and uses term lists that define the relationships between the words in a given language.

ConText provides a stemmer, licensed from Xerox Corporation, that utilizes Xerox Lexical Technology to support inflectional and derivational stemming in English and inflectional stemming in a number of Western European languages.

Inflectional Stemming

For all the supported languages, the stemmers return standard inflected forms of a word, such as the plural form (e.g. department --> departments).

Derivational Stemming

For English, an additional stemmer is provided which returns standard inflected forms and derived forms (e.g. department --> departments, departmentalize).

Fuzzy Matching

Fuzzy matching expands queries by including terms that are spelled similarly to the search token in the query. This type of expansion can be useful in queries for text that contains frequent misspellings or has been scanned using OCR software.

For example, a fuzzy matching query for the term cat expands to include cats, calc, and case.

The number of expansions generated by fuzzy matching depends on the tokens that ConText identified during indexing; results can vary significantly according to the tokens that were identified and indexed by ConText for the column. As such, fuzzy matching depends on how tokens are delimited in a given language.

Note:

Fuzzy matching is designed primarily for English-language documents, but can be used, with varying degrees of success, with many of the Western European languages.

Soundex

During text indexing of a column, Soundex, if enabled, creates a list of all the words that sound alike and assigns one or more IDs to each word to identify the other words in the list that sound like the word.

Note:

Soundex is designed primarily to look for matches in phonetic spellings used in English, but can be used, with varying degrees of success, with many of the other Western European languages.

The Soundex wordlist is stored in the DR_nnnnn_I1W ConText index table, where nnnnn is the identifier of the policy for the text index.

If Soundex is enabled for a text column, users can call Soundex in a query to expand the query. Soundex expands a query by searching the I1W table for terms that sound similar to the specified query term.

For example, a Soundex search on the name Smith would also find the names Smythe and Smit.

Note:

Soundex in ConText uses the same algorithm as the SOUNDEX function in SQL.

For more information about the SOUNDEX function in SQL, see Oracle8 SQL Reference.
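
As a rough illustration of the algorithm, the following sketch calls the SQL SOUNDEX function from an anonymous PL/SQL block for the three names in the example above. It assumes that server output is enabled in the client (for example, SET SERVEROUTPUT ON in SQL*Plus). It shows only the SQL function, not how ConText stores the mappings in the I1W table.

```plsql
begin
  -- Names that sound alike are expected to receive the same Soundex code.
  dbms_output.put_line('Smith  : ' || soundex('Smith'));
  dbms_output.put_line('Smythe : ' || soundex('Smythe'));
  dbms_output.put_line('Smit   : ' || soundex('Smit'));
end;
```

If the three codes printed are identical, a Soundex query on Smith can also match Smythe and Smit.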

Wordlist Tiles

ConText provides a single Tile, GENERIC WORD LIST, for creating Wordlist preferences.

GENERIC WORD LIST

The GENERIC WORD LIST Tile is used to enable the advanced query options (stemming, fuzzy matching, Soundex, and user-defined section searching) for text indexes.

See Also:

For more information about expansion methods in queries, see Oracle8 ConText Cartridge Application Developer's Guide.

GENERIC WORD LIST has the following attribute(s):

Attribute	Attribute Values
stclause	STORAGE clause (string) for Soundex wordlist table
instclause	STORAGE clause (string) for index on Soundex wordlist table
soundex_at_index	0 (disabled)
	1 (enabled)
stemmer	1 (English)
	2 (English -- derivational)
	3 (Dutch)
	4 (French)
	5 (German)
	6 (Italian)
	7 (Spanish)
fuzzy_match	1 (English and other Western European languages)
	2 (Japanese)
	3 (Korean)
	4 (Chinese)
	12 (Soundex emulation)
	13 (Dutch)
	14 (French)
	15 (German)
	16 (Italian)
	17 (Spanish)
	18 (OCR text)
section_group	name of section group

stclause

The stclause attribute specifies the STORAGE clause used to create the Soundex wordlist table during ConText indexing. The Soundex wordlist table is only created if Soundex is enabled through the soundex_at_index attribute.

instclause

The instclause attribute specifies the STORAGE clause used to create the Oracle index for the Soundex wordlist table.

soundex_at_index

The soundex_at_index attribute specifies whether ConText generates Soundex word mappings and stores them in the Soundex wordlist table during text indexing. If Soundex word mappings are not generated and stored in the wordlist table during indexing, queries that use Soundex are not expanded.

stemmer

The stemmer attribute specifies the stemmer used for word stemming in text queries. The default for stemmer is 1 (inflectional English).
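
A minimal sketch of a Wordlist preference that selects the derivational English stemmer, using the same ctx_ddl calls as the other preference examples in this chapter. The preference name DERIV_STEM and its description are hypothetical.

```plsql
begin
  -- 2 = English (derivational); the default, 1, is inflectional English
  ctx_ddl.set_attribute('STEMMER', '2');
  ctx_ddl.create_preference('DERIV_STEM',   -- hypothetical preference name
                            'Derivational English stemming',
                            'GENERIC WORD LIST');
end;
```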

fuzzy_match

The fuzzy_match attribute specifies which fuzzy matching routines are used for the column. Fuzzy matching is currently supported for English, Japanese, and, to a lesser extent, the Western European languages.

The default for fuzzy_match is 1.

Note:

The fuzzy_match attribute values for Chinese and Korean are dummy attribute values that prevent the English and Japanese fuzzy matching routines from being used on Chinese and Korean text.
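
The following sketch shows how a non-default fuzzy matching routine might be selected, here the OCR text value (18) from the attribute table. The preference name OCR_FUZZY and its description are hypothetical.

```plsql
begin
  -- 18 = OCR text; the default for fuzzy_match is 1
  ctx_ddl.set_attribute('FUZZY_MATCH', '18');
  ctx_ddl.create_preference('OCR_FUZZY',    -- hypothetical preference name
                            'Fuzzy matching for OCR text',
                            'GENERIC WORD LIST');
end;
```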

section_group

The section_group attribute specifies the name of the section group to assign to a text column. The following rules apply to section_group:

no default value for section_group
all available section groups in the ConText data dictionary can be specified for section_group; the section group owner does not need to be the same as the policy owner

See Also:

For more information about section groups, see "Document Sections" in Chapter 6, "Text Concepts".
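
As a sketch only, a Wordlist preference that assigns a section group could look like the following. The section group name MY_SECTIONS and the preference name SECTION_WORDLIST are hypothetical, and the sketch assumes that the section group already exists in the ConText data dictionary.

```plsql
begin
  -- MY_SECTIONS is a hypothetical section group that must already exist
  ctx_ddl.set_attribute('SECTION_GROUP', 'MY_SECTIONS');
  ctx_ddl.create_preference('SECTION_WORDLIST',   -- hypothetical preference name
                            'Enables user-defined section searching',
                            'GENERIC WORD LIST');
end;
```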

Wordlist Preference Example

The following example creates a preference named soundex_yes for the GENERIC WORD LIST Tile:

begin
  ctx_ddl.set_attribute('SOUNDEX_AT_INDEX', '1');
  ctx_ddl.create_preference('SOUNDEX_YES',
                            'Will build the soundex mapping during indexing',
                            'GENERIC WORD LIST');
end;

Stop Words

To manage the size of text indexes, ConText supports defining stop words. Stop words are common terms that you do not want to include in a text index.

The collection of stop words for a text column is called a stoplist, as defined in a Stoplist preference. You can define up to 4095 stop words for a stoplist.

Note:

Because theme indexes contain proportionately fewer entries than text indexes and size is not considered an issue, stoplists generally do not provide much value for theme indexes.

As such, ConText does not use stoplists for theme indexes. If a stoplist exists for a text column, the stoplist is ignored during theme indexing.

Stop Words in Queries

ConText does not create index entries for words defined as stop words; however, it does record the stop words, up to eight, that precede and follow an indexed term. This enables text queries for phrases which contain stop words.

To conserve space in the text index, ConText does not record the actual stop words in the index entries. Instead, ConText records code numbers, called sequences, that correspond to the stop words. Sequence numbers are assigned to stop words by the user when a stoplist is defined.

For example, the words he, is, at, the, and of are defined as stop words and each stop word is assigned a sequence by the user. During indexing, the string "he is at the top of the class" is encountered.

Index entries are created only for the words top and class; however, the words he, is, at, the, and of are stored as preceding and following stop words for the index entries.

As a result, users can query phrases such as 'he is at the top' and 'top of the class'.
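
The classification in the example above can be sketched in a standalone PL/SQL block. This is an illustration only, not the ConText indexing engine: the sequence numbers are hypothetical, the string is assumed to be split into tokens already, and the block relies on associative arrays indexed by VARCHAR2, which assumes a database release that supports them. Server output must be enabled to see the result.

```plsql
declare
  -- Hypothetical stoplist: each stop word maps to a user-assigned sequence.
  type seq_map is table of pls_integer index by varchar2(30);
  stop_seq seq_map;

  -- Tokens of the example string, already split on whitespace.
  type token_list is table of varchar2(30);
  tokens token_list :=
    token_list('he', 'is', 'at', 'the', 'top', 'of', 'the', 'class');
begin
  stop_seq('he')  := 1;
  stop_seq('is')  := 2;
  stop_seq('at')  := 3;
  stop_seq('the') := 4;
  stop_seq('of')  := 5;

  for i in 1 .. tokens.count loop
    if stop_seq.exists(tokens(i)) then
      -- Stop word: no index entry of its own, represented by its sequence.
      dbms_output.put_line(tokens(i) || ': stop word, sequence '
                           || stop_seq(tokens(i)));
    else
      -- Ordinary term: receives its own index entry.
      dbms_output.put_line(tokens(i) || ': indexed term');
    end if;
  end loop;
end;
```

Only top and class are reported as indexed terms; every other token is reported with the sequence that would stand in for it.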

Case-sensitivity

Stoplists for case-sensitive text indexes are automatically case-sensitive, meaning that words in the text are only indexed as stop words if they exactly match the case of the stop words in the stoplist.

As a result, when creating a Stoplist preference for a column on which you want to create a case-sensitive text index, you should specify a stop word entry for each commonly occurring variation (i.e. lowercase, initial uppercase, all-uppercase) that may occur for a stop word. For example, some articles, such as a and the in English, often appear at the beginning of sentences. As a result, the initial uppercase form of the articles (A and The) should be included in the stoplist.

Stoplist Tiles

ConText provides a single Tile, GENERIC STOP LIST, for creating Stoplist preferences.

GENERIC STOP LIST

The GENERIC STOP LIST Tile specifies the terms that should not be included in the text index.

GENERIC STOP LIST has the following attribute(s):

Attribute Attribute Values

stop_word

word (string), sequence (number)

stop_word

The stop_word attribute has two values that must be specified:

the word for which ConText does not create an entry in the text index
the sequence (1 to 4095) for the word

Stoplist Preference Example

The following example creates a preference named mini_stoplist for the GENERIC STOP LIST Tile:

begin
  ctx_ddl.set_attribute     ('STOP_WORD', 'a',   1);
  ctx_ddl.set_attribute     ('STOP_WORD', 'A',   2);
  ctx_ddl.set_attribute     ('STOP_WORD', 'the', 3);
  ctx_ddl.set_attribute     ('STOP_WORD', 'The', 4);
  ctx_ddl.set_attribute     ('STOP_WORD', 'and', 5);
  ctx_ddl.set_attribute     ('STOP_WORD', 'And', 6);
  ctx_ddl.create_preference ('MINI_STOPLIST', 'minilist', 'GENERIC STOP LIST' );
end;

Note:

This example illustrates a stoplist for a case-sensitive text index. If the stoplist is for a case-insensitive index, the stoplist requires only one entry for each stop word and the case of the entry has no effect.
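
For comparison, a sketch of the same stoplist for a case-insensitive text index needs only one entry per stop word. The preference name MINI_STOPLIST_CI and its description are hypothetical.

```plsql
begin
  -- One entry per stop word; the case of each entry has no effect.
  ctx_ddl.set_attribute     ('STOP_WORD', 'a',   1);
  ctx_ddl.set_attribute     ('STOP_WORD', 'the', 2);
  ctx_ddl.set_attribute     ('STOP_WORD', 'and', 3);
  ctx_ddl.create_preference ('MINI_STOPLIST_CI', 'minilist, case-insensitive', 'GENERIC STOP LIST' );
end;
```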

Table Name	Column Name	Datatype	Description
EXT_TEXT	TEXTKEY	NUMBER	Primary or unique key for the table
	TEXTDATE	DATE	Document publication date
	AUTHOR	VARCHAR2(50)	Document author
	NOTES	VARCHAR2(2000)	Text column with direct text storage
	TEXT	VARCHAR2(100)	Text column with names of operating system files that contain the document text

Tile	Description
DIRECT	Data stored internally in the text column. Each row is indexed as a single document.
MASTER DETAIL	Data stored internally in the text column. Document consists of one or more rows in a detail table, with header information stored in a master table. The policy is created on the text column in the detail table. As a result, queries return detail information from the detail table. Header information must be queried explicitly.
MASTER DETAIL NEW	Data stored internally in the text column. Document consists of one or more rows in a detail table, with header information stored in a master table. The policy is created on a designated text column in the master table. As a result, queries return header information from the master table. Detail information must be queried explicitly.
OSFILE	Data stored externally in operating system files. File names stored in the text column.
URL	Data stored externally in files located on an intranet or the Internet. Uniform Resource Locators (URLs) stored in the text column.

Attribute	Attribute Values
binary	0 (plain text)
	1 (binary text)
detail_table	name of the detail table (string)
detail_key	name of the foreign key column in the detail table (string)
detail_lineno	name of the line number column in the detail table (string)
detail_text	name of the text column in the detail table (string)
detail_text_size	Internal use only

Attribute	Attribute Values
timeout	seconds (0 to 3600, default 30)
maxthreads	number of threads (0 to 1024, default 8)
maxurls	buffer length in bytes (1 to 4294967295, default 256)
urlsize	URL length (32 to 65535, default 256)
maxdocsize	document size (256 to 4294967295, default 2000000)
http_proxy	host name
ftp_proxy	host name
no_proxy	string (up to 16 strings, separated by commas)

Tile	Description
BLASTER FILTER	Tile for filtering formatted text and/or plain text using internal filters, external filters or some combination of both.
FILTER NOP	Tile for plain text (does not require filtering)
HTML FILTER	Tile for filtering plain text containing HTML tags
USER FILTER	Tile for specifying external filter for a column.

Attribute	Attribute Values
executable	format id (number), filter executable, sequence (number)
format	0 or 999 (No filter -- plain/ASCII text)
	1 or 4 (Word Perfect for Windows 5.x; Word Perfect for DOS 5.0, 5.1)
	2 (MS Word for DOS 5.0, 5.5)
	5 (Word Perfect for Windows 6.x; Word Perfect for DOS 6.0)
	6 (MS Word for Mac 3, 4, 5.x)
	7 (MS Word for Windows 2)
	8 (AMIPRO for Windows 1, 2, 3)
	9 (Lotus 1-2-3 for Windows 2, 3, 4, 5; Lotus 1-2-3 for DOS 4, 5)
	11 (MS Word for Windows 6.x, 7.0)
	13 (Xerox XIF for UNIX 5, 6)
	997 (Autorecognize)

Attribute	Attribute Values
code_conversion	0 (disabled)
	1 (enabled)
keep_tag	tag (string), sequence (number)

Tile	Description
BASIC LEXER	Basic lexer used for extracting tokens from text in languages, such as English and most Western European languages, that use single-byte character sets.
CHINESE V-GRAM LEXER	Lexer used for extracting tokens from Chinese-language text.
JAPANESE V-GRAM LEXER	Lexer used for extracting tokens from Japanese-language text.
KOREAN LEXER	Lexer used for extracting tokens from Korean-language text.
THEME LEXER	Lexer which utilizes the Linguistics Theme Extraction System to generate themes as tokens for theme indexing.

Attribute	Attribute Values
continuation	characters (string)
numgroup	characters (string)
numjoin	characters (string)
printjoins	characters (string)
punctuations	characters (string)
skipjoins	characters (string)
startjoins	non-alphanumeric characters that occur at the beginning of a token (string)
endjoins	non-alphanumeric characters that occur at the end of a token (string)
whitespace	characters (string)
newline	characters (string)
sent_para	0 (disabled)
	1 (enabled)
base_letter	0 (disabled)
	1 (enabled)
mixed_case	0 (disabled)
	1 (enabled)
composite	0 (no composite word indexing)
	1 (German composite word indexing)
	2 (Dutch composite word indexing)

Tile	Description
ENGINE NOP	No engine used for indexing (Not implemented - DO NOT USE)
GENERIC ENGINE	Indexing engine used to create index entries and store in database tables comprising the ConText index.

Attribute	Attribute Values
none	N/A
index_memory	memory in bytes (integer)
optimize_default	default ConText index optimization method
i1t_tablespace, i1t_storage, i1t_other_parms	tablespace (string), STORAGE clause (string), and other table creation parameters (string) for token table
i1i_tablespace, i1i_storage, i1i_other_parms	tablespace (string), STORAGE clause (string), and other index creation parameters (string) for index on token table
ktb_tablespace, ktb_storage, ktb_other_parms	tablespace (string), STORAGE clause (string), and other table creation parameters (string) for mapping table
kid_tablespace, kid_storage, kid_other_parms kik_tablespace, kik_storage, kik_other_parms	tablespace (string), STORAGE clause (string), and other index creation parameters (string) for indexes on mapping table
lst_tablespace, lst_storage, lst_other_parms	tablespace (string), STORAGE clause (string), and other table creation parameters (string) for control table
lix_tablespace, lix_storage, lix_other_parms	tablespace (string), STORAGE clause (string), and other index creation parameters (string) for index on control table
sqr_tablespace, sqr_storage, sqr_other_parms	tablespace (string), STORAGE clause (string), and other table creation parameters (string) for SQE results table
sri_tablespace, sri_storage, sri_other_parms	tablespace (string), STORAGE clause (string), and other index creation parameters (string) for index on SQE results table

