6
Text Concepts

This chapter introduces the concepts necessary for understanding how text is setup and managed by ConText.

The following topics are discussed in this chapter:

Text Operations

ConText supports five types of operations that are processed by ConText servers:

Automated Text Loading
DDL
DML
Text/Theme Queries
Linguistics Requests

Note:

The personality mask for a ConText server determines which operations the server can process.

For more information about personality masks, see "Personalities" in Chapter 2, "Administration Concepts".

Automated Text Loading

Automated text loading is performed by ConText servers running with the Loader (R) personality. It differs from the other text operations in that a request is not made to the Text Request Queue for handling by the appropriate ConText server.

Instead, ConText servers with the R personality regularly scan a document repository (i.e. operating system directory) for documents to be loaded into text columns for indexing.

If a file is found in the directory, the contents of the file are automatically loaded by the ConText server into the appropriate table and column.

See Also:

For more information about text loading using ConText servers, see "Overview of Automated Loading" in Chapter 7, "Automated Text Loading".

DDL

A ConText DDL operation is a request for the creation, deletion, or optimization of a text/theme index on a text column. DDL requests are sent to the DDL pipe in the Text Request Queue, where available ConText servers with the DDL personality pick up the requests and perform the operation.

DDL operations are requested through the GUI administration tools (System Administration or Configuration Manager) or the CTX_DDL package.

See Also:

For more information about the CTX_DDL package, see "CTX_DDL: Text Setup and Management" in Chapter 11, "PL/SQL Packages - Text Management".

DML

A text DML operation is a request for the ConText index (text or theme) of a column to be updated. An index update is necessary for a column when the following modifications have been made to the table:

insertion of a new row
deletion of an existing row
update of the primary key or text column(s) for an existing row

Requests for index updates are stored in the DML Queue where they are picked up and processed by available ConText servers. The requests can be placed on the queue automatically by ConText or they can be placed on the queue manually.

In addition, the system can be configured so DML requests in the queue are processed immediately or in batch mode.

Automatic DML Queue Notification

DML requests are automatically placed in the queue via an internal trigger that is created on a table the first time a ConText index is created for a text column in the table.

ConText supports disabling automatic DML at index creation time through a parameter, create_trig, for CTX_DDL.CREATE_INDEX. The create_trig parameter specifies whether the DML trigger is created/updated during indexing of the text column in the column policy.

In addition, the DML trigger can be removed at any time from a table using CTX_DDL.DROP_INTTRIG.

Note:

DROP_INTTRIG deletes the trigger for the table. If the table contains more than one text column with existing ConText indexes, automatic DML is disabled for all the text columns.

DROP_INTTRIG is provided mainly for maintaining backward compatibility with previous releases of ConText and should be used only when it is absolutely necessary to disable automatic DML for all the text columns in a table.

If the DML trigger is not created during indexing or is dropped, the ConText index is not automatically updated when subsequent DML occurs for the table. Manual DML can always be performed, but automatic DML can only be reenabled by first dropping, then recreating the ConText index or creating your own trigger to handle updates.

Manual DML Queue Notification

DML operations may be requested manually at any time using the CTX_DML.REINDEX procedure, which places a request in the DML Queue for a specified document.

Immediate DML Processing

In immediate mode, one or more ConText servers are running with the DML personality. The ConText servers regularly poll the DML Queue for requests, pick up any pending requests (up to 10,000 at a time) for an indexed column and update the index in real-time.

In this mode, an index is only briefly out of synchronization with the last insert, delete, or update that was performed on the table; however, immediate DML processing can use considerable system resources and create index fragmentation.

Batch DML Processing

If a text table has frequent updates, you may want to process DML requests in batch mode. In batch mode, no ConText servers are running with the DML personality. The queue continues to accept requests, but the requests are not processed because no DML servers are available.

To start DML processing, the CTX_DML.SYNC procedure is called. This procedure batches all the pending requests for an indexed column in the queue and sends them to the next available ConText server with a DDL personality. Any DML requests that are placed in the queue after SYNC is called are not included in the batch. They are included in the batch that is created the next time SYNC is called.

SYNC can be called with a level of parallelism. The level of parallelism determine the number of batches into which the pending requests are grouped. For example, if SYNC is called with a parallelism level of two, the pending requests are grouped into two batches and the next two available DDL ConText servers process the batches.

Calling SYNC in parallel speeds up the updating of the indexes, but may increase the degree of index fragmentation.

Concurrent Index Creation

A text column within a table can be updated while a ConText server is creating an index on the same text column. Any changes to the table being indexed by a ConText server are stored as entries in the DML Queue, pending the completion of the index creation.

After index creation completes, the entries are picked up by the next available DML ConText server and the index is updated to reflect the changes. This avoids a race condition in which the DML Queue request might be processed, but then overwritten by index creation, even though the index creation was processing an older version of the document.

Text/Theme Queries

A text query is any query that selects rows from a table based on the contents of the text stored in the text column(s) of the table.

A theme query is any query that selects rows from a table based on the themes generated for the text stored in the text column(s) of the table.

Note:

Theme queries are only supported for English-language text.

ConText supports three query methods for text/theme queries:

In addition, ConText supports Stored Query Expressions (SQEs).

Before a user can perform a query using any of the methods, the column to be queried must be defined as a text column in the ConText data dictionary and a text and/or theme index must be generated for the column.

See Also:

For more information about text columns, see "Text Columns" in this chapter.

For more information about text/theme queries and creating/using SQEs, see Oracle8 ConText Cartridge Application Developer's Guide..

Two-step Queries

In a two-step query, the user performs two distinct operations. First, the ConText PL/SQL procedure, CONTAINS, is called for a column. The CONTAINS procedure performs a query of the text stored in a text column and generates a list of the textkeys that match the query expression and a relevance score for each document. The results are stored in a user-defined table.

Then, a SQL statement is executed on the result table to return the list of documents (hitlist) or some subset of the documents.

One-step Queries

In a one-step query, the ConText SQL function, CONTAINS, is called directly in the WHERE clause of a SQL statement. The CONTAINS function accepts a column name and query expression as arguments and generates a list of the textkeys that match the query expression and a relevance score for each document.

The results generated by CONTAINS are returned through the SELECT clause of the SQL statement.

In-memory Queries

In an in-memory query, PL/SQL stored procedures and functions are used to query a text column and store the results in a query buffer, rather than in the result tables used in two-step queries.

The user opens a CONTAINS cursor to the query buffer in memory, executes a text query, then fetches the hits from the buffer, one at a time.

Stored Query Expressions

In a stored query expression (SQE), the results of a query expression for a text column, as well as the definition of the SQE, are stored in database tables. The results of a SQE can be accessed within a query (one-step, two-step, or in-memory) for performing iterative queries and improving query response.

The results of an SQE are stored in an internal table in the index (text or theme) for the text column. The SQE definition is stored in a system-wide, internal table owned by CTXSYS. The SQE definitions can be accessed through the views, CTX_SQES and CTX_USER_SQES.

See Also:

For more information about the SQE result table, see "SQR Table" in Appendix C, "ConText Index Tables and Indexes".

Linguistics Requests

The ConText Linguistics are used to analyze the content of English-language documents. The application developer uses the Linguistics output to create different views of the contents of documents.

The Linguistics currently provide two types of output, on a per document basis, for English-language documents stored in an Oracle database:

list of themes
document Gist and/or theme summaries

See Also:

For more information about themes, Gists, and theme summaries, as well as using the Linguistics in applications, see Oracle8 ConText Cartridge Application Developer's Guide..

Text Columns

A text column is any column used to store either text or text references (pointers) in a database table or view. ConText recognizes a column as a text column if one or more policies are defined for the column.

Supported Datatypes

Text columns can be any of the supported Oracle datatypes; however, text columns are usually one of the following datatypes:

CHAR
VARCHAR2
LONG
LONG RAW
BLOB
CLOB
BFILE

A table can contain more than one text column; however, each text column requires a separate policy.

See Also:

For more information about policies and text columns, see "Policies" in Chapter 8, "ConText Indexing".

For more information about Oracle datatypes, see Oracle8 Concepts.

For more information about managing LOBs (BLOB, CLOB, and BFILE), see Oracle8 ConText Cartridge Application Developer's Guide.e and PL/SQL User's Guide and Reference. .

Textkeys

ConText uses textkeys to uniquely identify a document in a text column. The textkey for a text column usually corresponds to the primary key for the table or view in which the column is located; however, the textkey for a column can also reference unique keys (columns) that have been defined for the table.

When a policy is defined for a column, the textkey for the column is specified. If the textkey is not specified, ConText uses the first primary key or unique key that it encounters for the table.

Note:

ConText fully supports creating indexes on text columns in object tables; however, the object table must have a primary key that was explicitly defined during creation of the table.

For more information about object tables, see Oracle8 Concepts.

Composite Textkeys

A textkey for a text column can consist of up to sixteen primary or unique key columns.

During policy definition, the primary/unique key columns are specified, using a comma to separate each column name.

In two-step queries, the columns in a composite textkey are returned in the order in which the columns were specified in the policy.

In in-memory queries, the columns in a composite textkey are returned in encoded form (e.g. 'p1,p2,p3'). This encoded textkey must be decoded to access the individual columns in the textkey.

Note:

There are some limits to composite textkeys that must be considered when setting up your tables and columns, and when creating policies for the columns.

See Also:

For more information about encoding and decoding composite textkeys, see Oracle8 ConText Cartridge Application Developer's Guide. .

Column Name Limitations

There is a 256 character limit, including the comma separators, on the string of column names that can be specified for a composite textkey.

Because the comma separators are included in this limit, the actual limit is 256 minus (number of columns minus 1), with a maximum of 241 characters (256 - 15), for the combined length of all the column names in the textkey.

This limit is enforced during policy creation.

Column Length Limitations

There is a 256 character limit on the combined lengths of the columns in a composite textkey. This is due to the way the textkey values for composite textkeys are stored in the index.

For a given row, ConText concatenates all of the values from the columns that constitute the composite textkey into a single value, using commas to separate the values from each column.

As such, the actual limit for the lengths of the textkey columns is 256 minus (number of columns minus 1), with a maximum of 241 characters (256 - 15), for the combined length of all the columns.

Note:

If you allow values that contain commas (e.g. numbers, dates) in your textkey columns, the commas are escaped automatically by ConText during indexing. The escape character is the backslash character.

In addition, if you allow values that contain backslashes (e.g. dates or directory structures in Windows) in your textkey columns, ConText uses the backslash character to escape the backslashes.

As a result, when calculating the limit for the length of columns in a composite textkey, the overall limit of 256 (241) characters must include the backslash characters used to escape commas and backslashes contained in the data.

Text Loading

The loading of text into database tables is required for creating ConText indexes and generating linguistic output. This task can be performed within an application; however, if you have a large document set, you may want to perform loading as a batch process.

See Also:

For more information about building text loading capabilities into your applications, see Oracle8 ConText Cartridge Application Developer's Guide. .

Individual Row Insert/Update/Export

The method you can use for inserting, updating, or exporting text for individual rows depends on the amount of text to be manipulated and whether the text is formatted.

SQL

For inserting small amounts of plain (ASCII) text into individual rows, you can use the INSERT command in SQL.

For updating individual rows containing small amounts of plain text, you can use the UPDATE command in SQL.

See Also:

For more information about the INSERT and UPDATE commands, see Oracle8 SQL Reference.

ctxload Utility

For updating individual rows from server-side files containing plain or formatted, you can use the ctxload command-line utility provided by ConText. ctxload is especially well-suited for loading large amounts of text contained in server-side files.

ctxload also allows you to export the contents (plain or formatted text) of the text column for a single row to a server-side file.

Note:

If your server environment is Windows NT, you can also use the Input/Output utility for manipulating text in individual rows.

For more information, see "Client-side Insert/Update/Export" in this chapter.

See Also:

For an example of updating/exporting an individual row using ctxload, see "Updating/Exporting a Document" in Chapter 9, "Setting Up and Managing Text".

Batch Load

Either SQL*Loader or ctxload can be used to perform batch loading of text into a database column.

SQL*Loader

To perform batch loading of plain (ASCII) text into a table, you can use SQL*Loader, a data loading utility provided by Oracle.

See Also:

For more information about SQL*Loader, see Oracle8 Utilities.

ctxload Utility

For batch loading plain or formatted text, you can use the ctxload command-line utility provided by ConText.

The ctxload utility loads text from a load file into the LONG or LONG RAW column of a specified database table. The load file can contain multiple documents, but must use a defined structure and syntax. In addition, the load file can contain plain (ASCII) text or it can contain pointers to separate files containing either plain or formatted text.

Note:

ctxload is best suited for loading text into columns that use direct data store. If you use external data store to store file pointers in the database, it is possible to use ctxload; however, you should consider using another loading method, such as SQL*Loader.

See Also:

For an example of loading text using ctxload, see "Using ctxload" in Chapter 9, "Setting Up and Managing Text".

Automated Text Load

Automated text loading uses ctxload and ConText servers running with a Loader personality to automatically load text from ctxload load files into text columns of datatype LONG or LONG RAW.

See Also:

For more information, see Chapter 7, "Automated Text Loading".

Client-side Insert/Update/Export

Context supports inserting/updating text from files residing on a PC running in a Microsoft Windows 32-bit environment, such as Windows NT or 95. In addition, ConText supports exporting text from individual rows into files on a PC.

ConText provides support for these functions through the Input/Output command-line utility provided with the ConText Workbench.

Note:

Client-side text insert/update/export is supported only from 32-bit Windows environments, such as Windows NT or 95.

In addition, the Input/Output utility must be installed on each PC from which text loading/exporting is performed.

See Also:

For more information, see Oracle8 ConText Cartridge Workbench User's Guide.

ConText Indexes

A ConText index is an inverted index containing entries for all the tokens (words or themes) that occur in a text column and the documents (i.e. rows) in which the tokens are found. The index entries are stored in database tables that are associated with the text column through a policy.

ConText supports creating indexes on text columns in relational tables and views, as well as text columns in object tables. In addition, ConText supports creating two types of indexes, text and theme.

This section discusses the following concepts relevant to ConText indexes:

See Also:

For examples of creating policies and indexes, see "Creating a Column Policy" and "Creating an Index" in Chapter 9, "Setting Up and Managing Text".

For more information about policies, see "Policies" in Chapter 8, "ConText Indexing".

Text Indexes

A text index is generated by the text lexers provided by ConText and consists of:

every unique token (word) in the collection of documents in a text column
for each word, a string that identifies each document in which the word occurs and the location offsets for each occurrence within each document

In addition, if section searching has been enabled for the column, the index stores the section names, as well as the documents in which the section occurs and the location offsets for each occurrence within each document.

There is a one-to-one relationship between a text index and the text indexing policy for which it was created.

See Also:

For more information about text indexing policies, see "Text Indexing Policies" in Chapter 8, "ConText Indexing".

For more information about section searching, see "Document Sections" in this chapter.

Text Lexers

The text lexer identifies tokens for creating text indexes. During text indexing, each document in the text column is retrieved and filtered by ConText. Then, the lexer identifies the tokens and extracts them from the filtered text and stores the tokens in memory, along with the document ID and locations for each word, until all of the documents in the column have been processed or the memory buffer is full.

The index entries, consisting of each token and its location string, are then written as rows to the token table for the ConText index and the buffer is flushed.

ConText provides a number of Lexer Tiles that can be used to create text indexes.

See Also:

For more information about the lexers used for text indexing, see "Text Lexers" in Chapter 8, "ConText Indexing".

Tokens in Text Indexes

A token is the smallest unit of text that can be indexed.

In non-pictorial languages, tokens are generally identified as alphanumeric characters surrounded by white space and/or punctuation marks. As a result, tokens can be single words, strings of numbers, and even single characters.

In pictorial languages, tokens may consist of single characters or combinations of characters, which is why separate lexers are required for each pictorial language. The lexers search for character patterns to determine token boundaries.

See Also:

For more information about token recognition, see "Text Lexers" in Chapter 8, "ConText Indexing".

Token Location Information

The location information for a token is bit string that contains the location (offsets in ASCII) of each occurrence of the token in each document in the column. The location information also contains any stop words that precede and follow the token.

Case-sensitivity

For non-pictorial languages, the BASIC LEXER Tile, by default, creates case-insensitive text indexes. In a case-insensitive index, tokens are converted to all uppercase in the index entries.

However, the Tile also provides an attribute, mixed_case, for creating case-sensitive text indexes. In a case-sensitive index, entries are created using the tokens exactly as they appear in the text, including those tokens that appear at the beginning of sentences.

For example, in a case-insensitive text index, the tokens oracle and Oracle are recorded as a single entry, ORACLE. In a case-sensitive text index, two entries, oracle and Oracle, are created.

As a result, case-sensitive indexes may be much larger than case-insensitive indexes and may have some effect on text query performance; however, case-sensitive indexes allow for greater precision in text queries.

Note:

The case-sensitivity of a text index determines whether the text queries performed against the index are case-sensitive. If the text index is case-sensitive, text queries are automatically case-sensitive.

See Also:

For more information about case-sensitivity in text queries, seeOracle8 ConText Cartridge Application Developer's Guide.

Stop Words

A stop word is any combination of alphanumeric characters (generally a word or single character) for which ConText does not create an entry in the index. Stop words are specified in the Stoplist preference for a text indexing policy.

See Also:

For more information about stop words and stoplists, see "Stop Words" in Chapter 8, "ConText Indexing".

For an example of creating a Stoplist preference, see "Creating a Stoplist Preference" in Chapter 9, "Setting Up and Managing Text".

For more information about stop words in text queries, see Oracle8 ConText Cartridge Application Developer's Guide.

Theme Indexes

A theme index contains a list of all the tokens (themes) for the documents in a column and the documents in which each theme is found. Each document can have up to fifty themes.

Note:

Theme indexing is only supported for English text.

In addition, offset and frequency information are not relevant in a theme index, so this type of information is not stored.

See Also:

For more information about theme queries and query methods, see Oracle8 ConText Cartridge Application Developer's Guide.

Theme Lexer

For theme indexing, ConText provides a Tile, THEME_LEXER, that bypasses the standard text parsing routines and, instead, accesses the linguistic core in ConText to generate themes for documents.

The theme lexer analyzes text at the sentence, paragraph, and document level to create a context in which the document can be understood. It uses a mixture of statistical methods and heuristics to determine the main topics that are developed throughout the course of the document.

It also uses the ConText Knowledge Catalog, a collection of over 200,000 words and phrases, organized into a conceptual hierarchy with over 2,000 categories, to generate its theme information.

See Also:

For more information about the ConText Knowledge Catalog, see Oracle8 ConText Cartridge Application Developer's Guide.

Tokens in Theme Indexes

Unlike the single tokens that constitute the entries in a text index, the tokens in a theme index often consist of phrases. In addition, these phrases may be common terms or they may be the names of companies, products, and fields of study as defined in the Knowledge Catalog.

For example, a document about Oracle contains the phrase Oracle Corp. In a (case-sensitive) text index for the document, this phrase would have two entries, ORACLE and CORP, all in uppercase. In a theme index, the entry would be Oracle Corporation, which is the canonical form of Oracle Corp., as stored in the Knowledge Catalog.

See Also:

For more information about themes and the Knowledge Catalog, see Oracle8 ConText Cartridge Application Developer's Guide.

Theme Weights

Each document theme has a weight associated with it. The theme weight measures the strength of the theme relative to the other themes in the document. Theme weights are stored as part of the theme signature for a document and are used by ConText to calculate scores for ranking the results of theme queries.

Case-sensitivity

Theme indexes are always case-sensitive. Tokens (themes) are recorded in uppercase, lowercase, and mixed-case in a theme index. The case for the entry is determined by whether the theme is found in the Knowledge Catalog:

if the theme is in the Knowledge Catalog, the case for the index entry matches the canonical form of the theme in the Knowledge Catalog
if the theme is not in the Knowledge Catalog, the case for the index entry is identical to the theme as it appears in the text of the document

Linguistic Settings

ConText uses linguistic settings, specified as setting configurations, to perform special processing for text that is in all-uppercase or all-lowercase. ConText provides two predefined setting configurations:

GENERIC (mixed-case text)
SA (all-uppercase or all-lowercase text)

GENERIC is the default predefined setting configuration and is automatically enabled for each ConText server at start up.

You can create your own custom setting configurations in either of the GUI administration tools provided in the ConText Workbench.

See Also:

For more information about Linguistics, see Oracle8 ConText Cartridge Application Developer's Guide.

ConText Index Tables

The ConText index for a text column consists of the following internal tables:

DR_nnnnn_I1Tn (token table)
DR_nnnnn_KTB (textkey mapping table)
DR_nnnnn_LST (DOCID generation table)
DR_nnnnn_NLT (DOCID control table)
DR_nnnnn_I1W (Soundex wordlist table -- created only if Soundex is enabled)
DR_nnnnn_SQR (stored query expression result table)

The nnnnn string is an identifier (from 1000-99999) which indicates the policy of the text column for which the ConText index is created.

In addition, ConText automatically creates one or more Oracle indexes for each ConText index table.

The tablespaces, storage clauses, and other parameters used to create the ConText index tables and Oracle indexes are specified by the attributes set for the Engine preference (GENERIC ENGINE Tile) in the policy for the text column.

See Also:

For a description of the ConText index tables, see Appendix C, "ConText Index Tables and Indexes".

For more information about stored query expressions (SQEs), see Oracle8 ConText Cartridge Application Developer's Guide.

Columns with Multiple Indexes

A column can have more than one index by simply creating more than one policy for the column and creating a ConText index for each policy. This is useful if you want to specify different indexing options for the same column. In particular, this is useful if you want to create a text and theme index on a column.

When two indexes exist for the same column, one-step queries (theme or text) require the policy name, as well as the column name, to be specified for the CONTAINS function in the query. In this way, the correct index is accessed for the query.

This requirement is not enforced for two-step and in-memory queries, because they use policy name, rather than column name, to identify the column to be queried.

See Also:

For more information about one-step queries and the CONTAINS function, see Oracle8 ConText Cartridge Application Developer's Guide.

Index Creation

A ConText index is created for a column by calling CTX_DDL.CREATE_INDEX for the column policy; however, before calling CREATE_INDEX, a ConText server must be running with the DDL (D) personality.

See Also:

For more information, see "ConText Servers" in Chapter 2, "Administration Concepts".

Stages of ConText Indexing

ConText indexing takes place in three stages:

Index Initialization

During index initialization, the tables used to store the ConText index are created.

See Also:

For a list of the tables used to store the ConText index, see "Text Indexes" in this chapter.

Index Population

During index population, the ConText index entries for the documents in the text column are created in memory, then transferred to the index tables.

If the memory buffer fills up before all of the documents in the column have been processed, ConText writes the index entries from the buffer to the index tables and retrieves the next document from the text column to continue ConText indexing.

The amount of memory allocated for ConText indexing for a text column determines the size of the memory buffer and, consequently, how often the index entries are written to the index tables.

See Also:

For more information about the effects of frequent writes to the index tables, see "Index Fragmentation" and "Memory Allocation" in this chapter.

Index Termination

During index termination, the Oracle indexes are created for the ConText index tables. Each ConText index table has one or more Oracle indexes that are created automatically by ConText.

Note:

The termination stage only starts when the population stage has completed for all of the documents in the text column.

Creating Empty ConText Indexes

If you want to create a ConText index without populating the tables, ConText provides a parameter, pop_index, for CTX_DDL.CREATE_INDEX, which specifies whether the ConText index tables are populated during indexing.

Parallel Indexing

Parallel indexing is the process of dividing ConText indexing between two or more ConText servers. Dividing indexing between servers can help reduce the time it takes to index large amounts of text.

To perform indexing in parallel, you must start two or more ConText servers (each with the DDL personality) and you must correctly allocate indexing memory.

The amount of allocated index memory should not exceed the total memory available on the host machine(s) divided by the number of ConText servers performing the parallel indexing.

For example, you allocate 10 Mb of memory in the policy for the text column for which you want to create a ConText index. If you want to use two servers to perform parallel indexing on your machine, you should have at least 20 Mb of memory available during indexing.

Note:

When using multiple ConText servers to perform parallel indexing, the servers can run on different host machines if the machines are able to connect via SQL*Net to the database where the index is stored.

Index Fragmentation

As ConText builds an index entry for each token (word or theme) in the documents in a column, it caches the index entries in memory. When the memory buffer is full, the index entries are written to the ConText index tables as individual rows.

If all the documents (rows) in a text column have not been indexed when the index entries are written to the index tables, the index entry for a token may not include all of the documents in the column. If the same token is encountered again as ConText indexing continues, a new index entry for the token is stored in memory and written to the index table when the buffer is full.

As a result, a token may have multiple rows in the index table, with each row representing a index fragment. The aggregate of all the rows for a word/theme represents the complete index entry for the word/theme.

Note:

Because the number of distinct themes in a collection of documents is usually fewer than the number of distinct tokens, theme indexes generally contain fewer entries than text index.

As a result, index fragmentation is not as much of a concern in theme indexes as in text indexes; however, some fragmentation may occur during theme indexing and subsequent DML.

See Also:

For more information about resolving index fragmentation, see "Index Optimization" in this chapter.

Memory Allocation

A machine performing ConText indexing should have enough memory allocated for indexing to prevent excessive index fragmentation. The amount of memory allocated depends on the capacity of the host machine doing the indexing and the amount of text being indexed.

If a large amount of text is being indexed, the index can be very large, resulting in more frequent inserts of the index text strings to the tables. By allocating more memory, fewer inserts of index strings to the tables are required, resulting in faster indexing and fewer index fragments.

See Also:

For more information about allocating memory for ConText indexing, see "Creating an Engine Preference" in Chapter 9, "Setting Up and Managing Text".

Index Log

The ConText index log records all the indexing operations performed on a policy for a text column. Each time an index is created, optimized, or deleted for a text column, an entry is created in the index log.

Log Details

Each entry in the log provides detailed information about the specified indexing operation, including:

the policy for the text column on which the indexing operation was performed
the indexing operation that was performed (creation, optimization, deletion)
if the indexing operation was performed in parallel, the ID of the server that processed the operation
whether the operation failed and, if it did, the stage at which it failed
the number of documents selected for processing and the number of documents actually processed during the indexing operation
the textkeys of the first and last documents processed

Accessing the Log

The index log is stored in an internal table and can be viewed using the CTX_INDEX_LOG or CTX_USER_INDEX_LOG views. The index log can also be viewed in the GUI administration tools (System Administration or Configuration Manager).

Index Updates (DML)

When an existing document in a text column is deleted or modified such that the ConText index (text and/or theme) is no longer up-to-date, the index must be updated.

Text index updates are processed by ConText servers with the DML or DDL personality, depending on the DML index update method (immediate or batch) that is currently enabled.

Note:

In contrast to requests for theme and Gist/theme summary generation, which are processed by ConText servers with the Linguistic personality, updates to theme indexes are processed identically to text indexes, using ConText servers with the DML personality.

Immediate Vs. Batch Update

If immediate index update is enabled, ConText servers with a DML personality regularly scan the DML Queue and process update requests as they come into the queue.

If batch index update is enabled, no ConText servers with a DML personality are running and update requests in the DML Queue are processed by ConText servers with a DDL personality only when explicitly requested.

See Also:

For more information about DML index update methods, see "DML" in this chapter.

For more information about ConText servers, see "Personalities" in Chapter 2, "Administration Concepts".

Deferred Deletion

Updating the index for modified/deleted documents affects every row that contains references to the document in the index. Because this can take considerable time, ConText utilizes a deferred delete mechanism for updating the index for modified/deleted documents.

In a deferred delete, the document references in the ConText index token table (DR_nnnnn_I1Tn) for the modified/deleted document are not actually removed. Instead, the status of the document is recorded in the ConText index DOCID control table (DR_nnnnn_NLT), so that the textkey for the document is not returned in subsequent text queries that would normally return the document.

Actual deletion of the document references from the token table (I1Tn) takes place only during optimization of a index.

See Also:

For more information, see "Removal of Obsolete Document References" in "Index Optimization" in this chapter.

Index Optimization

ConText supports index optimization for improving query performance. Optimization performs two functions for an index:

ConText supports index optimization through the CTX_DDL.OPTIMIZE_INDEX procedure.

Note:

Optimization cannot be performed for an index while any other DDL or DML operation is being performed on the index. Likewise, while optimization is being performed on an index, no DML operations can be performed on the index.

As such, it may not always be practical to optimize an entire index. ConText also supports piecewise optimization for individual index entries (words and sections) that are stored in the index

Compaction of Index Fragments

Compaction combines the index fragments for a token into longer, more complete strings, up to a maximum of 64 Kb for any individual string. Compaction of index fragments results in fewer rows in the ConText index tables, which results in faster and more efficient queries. It also allows for more efficient use of tablespace.

ConText provides two methods of index compaction:

in-place compaction
two-table compaction (default)

In-place compaction uses available memory to compact index fragments, then writes the compacted strings back into the original (existing) token table in the ConText index.

Two-table compaction creates a second token table into which the compacted index fragments are written. When compaction is complete, the original token table is deleted.

Two-table compaction is faster than in-place compaction; however, it requires enough tablespace to be available during compaction to accommodate the creation and population of the second token table.

Removal of Obsolete Document References

ConText provides optimization methods which can be used to actually delete all references to modified/deleted documents from an index.

During an actual delete (also referred to as garbage collection), the index references for all modified/deleted documents are removed from the ConText index token table (DR_nnnnn_I1Tn), leaving only references to existing, unchanged documents. In addition, the ConText index DOCID control table (DR_nnnnn_NLT) is cleared of the information which records the status of documents.

Similar to compaction, ConText supports both in-place or two-table garbage collection.

Piecewise Optimization

Index optimization can be performed piecewise for individual words in a ConText index (text or theme). Because it is generally faster than optimizing an entire index, piecewise optimization is useful when an index has a large number of index fragments or obsolete document references and it is not practical to block DML on the index while optimization is performed.

Note:

While text index entries consist only of single words, theme index entries often consist of phrases as well as single words.

For documentation purposes, the term word refers to words and phrases with regards to piecewise optimization for theme indexes.

Piecewise optimization is specified for a word using arguments in CTX_DDL.OPTIMIZE_INDEX. Piecewise optimization supports only one type of optimization: combined compaction/garbage collection performed in-place.

Note:

Piecewise garbage collection for a word only removes obsolete document references from the corresponding entries (rows) in the token table; obsolete references are retained in the entries for other words in the index, as well as in DR_nnnnn_NLT, which ensures that the other entries are not affected by the piecewise optimization.

To remove obsolete document references from all the entries in an index, garbage collection must be performed for the entire index.

See Also:

For an example of piecewise optimization, see "Optimizing an Index" in Chapter 9, "Setting Up and Managing Text".

Optimizing Index Entries for Tokens and Sections

The word to be optimized can have two types of entries in the index: token and section.

Token entries consist of a word (and its location information) that occurs in one or more documents in a text column. Section entries, found only in text indexes, consist of the name (and location information) for a section that occurs in one or more documents in the column.

If the word to be optimized has token entries in the index, all the token entries (rows) corresponding to the word are combined into as few rows as possible and all obsolete document references are removed from the location strings for the rows.

If the word to be optimized has section entries in the index, all the section entries (rows) corresponding to the word are combined into as few rows as possible and all obsolete document references are removed from the location strings for the rows.

If the word to be optimized has both types of entries in the index, ConText optimizes all the entries for both types in a single pass; however, ConText optimizes the different types of entries as separate, distinct entities.

Case-sensitivity

Piecewise optimization is case-sensitive regardless of the case of the index, meaning index entries for a word are optimized only if the entries exactly match the word specified for piecewise optimization.

This feature is of particular importance for piecewise optimization in theme indexes, because theme indexes are always case-sensitive and the index entries often consist of phrases in mixed case.

For example, a theme index contains separate token entries for the word oracle and the phrase Oracle Corporation. If piecewise optimization is specified for the phrase Oracle Corporation, only those entries that exactly match the phrase are optimized; entries for oracle are not optimized. In addition, if piecewise optimization is specified for the word Oracle, no entries are optimized.

Identifying Candidates for Piecewise Optimization

The word_text and doclsize columns in the index token table (DR_nnnnn_I1Tn) can be queried to identify words that are potential candidates for piecewise optimization. Also note that the word_type column in the table identifies whether the row serves as a token entry or a section entry.

In general, if word_text returns a large number of rows for a word and/or the doclsize for many of the rows is significantly less than 64 Kilobytes (the maximum size of the location string for an index entry), the word is a good candidate for compaction.

When to Optimize

Index optimization should be performed regularly, as index creation and frequent updating can result in excessive fragmentation and accumulation of obsolete document references. The level of fragmentation for an index depends on the amount of memory allocated for indexing and the amount of text being indexed. The number of obsolete document references in an index depends on the frequency of DML for documents in the column and the degree of DML changes for the documents.

In general, optimize an index after:

large amounts of text are indexed
parallel indexing has been utilized
large numbers of documents in a table have been modified/deleted

Thesauri

Users looking for information on a given topic may not know which words have been used in documents that refer to that topic.

ConText enables users to create case-sensitive or case-insensitive thesauri which define relationships between lexically equivalent words and phrases. Users can then retrieve documents that contain relevant text by expanding queries to include similar or related terms as defined in a thesaurus.

Thesauri are stored in internal tables owned by CTXSYS. Each thesaurus is uniquely identified by a name that is specified when the thesaurus is created.

Note:

The ConText thesauri formats and functionality are compliant with both the ISO-2788 and ANSI Z39.19 (1993) standards.

See Also:

For more information about the relationships you can define for terms in a thesaurus, see "Thesaurus Entries and Relationships" in this chapter.

Thesaurus Creation and Maintenance

Thesauri and thesaurus entries can be created, modified, and deleted by all ConText users with the CTXAPP role.

ConText supports thesaurus maintenance from the command line through the PL/SQL package, CTX_THES. ConText also supports GUI viewing and administration of thesauri in the System Administration tool.

Note:

The CTX_THES package calls an internal package, CTX_THS, which should not be called directly.

In addition, the ctxload utility can be used for loading (creating) thesauri from a load file into the thesaurus tables, as well as dumping thesauri from the tables into output (dump) files.

The thesaurus dump files created by ctxload can be printed out or used as input for other applications. The dump files can also be used to load a thesaurus into the thesaurus tables. This can be useful for using an existing thesaurus as the basis for creating a new thesaurus.

See Also:

For more information about command line administration of thesauri, see "Managing Thesauri" in Chapter 9, "Setting Up and Managing Text".

For more information about GUI administration of thesauri, see the help system provided with the System Administration tool.

For more information about ctxload, see Chapter 10, "Text Loading Utility".

Thesauri in Queries

Thesauri are primarily used for expanding the query terms in text queries to include entries that have been defined as having relationships with the terms in the specified thesaurus.

Thesauri can be used for expanding theme queries; however, expansion of theme queries is generally not needed, because ConText uses an internal lexicon, called the Knowledge Catalog, to automatically expand theme queries.

Note:

ConText supports creating multiple thesauri; however, only one thesaurus can be used at a time in a query.

See Also:

For more information about using thesauri and the thesaurus operators to expand queries, see Oracle8 ConText Cartridge Application Developer's Guide.

Query Expansion

The expansions returned by the thesaurus operators in queries are combined using the ACCUMULATE operator ( , ).

Limitations

In a query, the expansions generated by the thesaurus operators don't follow nested thesaural relationships. In other words, only one thesaural relationship at a time is used to expand a query.

For example, B is a narrower term for A. B is also in a synonym ring with terms C and D, and has two related terms, E and F. In a narrower term query for A, the following expansion occurs:

NT(A) query is expanded to {A}, {B}

Note:

The query expression is not expanded to include C and D (as synonyms of B) or E and F (as related terms for B).

Case-sensitivity

ConText thesauri supports creating case-sensitive and case-insensitive thesauri.

Case-sensitive Thesauri

In a case-sensitive thesaurus, terms (words and phrases) are stored exactly as entered. For example, if a term is entered in mixed-case (using either CTX_THES, the System Administration tool, or a thesaurus load file), the thesaurus stores the entry in mixed-case.

In addition, when a case-sensitive thesaurus is specified in a query, the thesaurus lookup uses the query terms exactly as entered in the query. As a result, queries that use case-sensitive thesauri allow for a higher level of precision in the query expansion performed by ConText.

For example, a case-sensitive thesaurus is created with different entries for the distinct meanings of the terms Turkey (the country) and turkey (the type of bird). Using the thesaurus, a query for Turkey expands to include only the entries associated with Turkey.

Case-insensitive Thesauri

In a case-insensitive thesaurus, terms are stored in all-uppercase, regardless of the case in which they were entered.

In addition, when a case-insensitive thesaurus is specified in a query, the query terms are converted to all-uppercase for thesaurus lookup. As a result, ConText is unable to distinguish between terms that have different meanings when they are in mixed-case.

For example, a case-insensitive thesaurus is created with different entries for the two distinct meanings of the term TURKEY (the country or the type of bird). Using the thesaurus, a query for either Turkey or turkey is converted to TURKEY for thesaurus lookup and then expanded to include all the entries associated with both meanings.

Default Thesaurus

If you do not specify a thesaurus by name in a query, by default, the thesaurus operators use a thesaurus named DEFAULT; however, because the entries in a thesaurus may vary greatly depending on the subject matter of the documents for which the thesaurus is used, ConText does not provide a DEFAULT thesaurus.

As a result, if you want to use a default thesaurus for the thesaurus operators, you must create a thesaurus named DEFAULT. You can create the thesaurus through any of the thesaurus creation methods supported by ConText:

System Administration tool (GUI)
CTX_THES.CREATE_THESAURUS (PL/SQL)
ctxload

Supplied Thesaurus

Although ConText does not provide a default thesaurus, ConText does supply a thesaurus, in the form of a ctxload load file, that can be used to create a general-purpose, English-language thesaurus.

The thesaurus load file can be used to create a default thesaurus for ConText or it can be used as the basis for creating thesauri tailored to a specific subject or range of subjects.

See Also:

For more information about using ctxload to create the thesaurus, see "Creating the Supplied Thesaurus" in Chapter 9, "Setting Up and Managing Text".

Supplied Thesaurus Structure and Content

The supplied thesaurus is similar to a traditional thesaurus, such as Roget's Thesaurus, in that it provides a list of synonymous and semantically related terms, sorted into conceptual domains.

The supplied thesaurus provides additional value by organizing the conceptual domains into a hierarchy that defines real-world, practical relationships between narrower terms and their broader terms.

Additionally, cross-references are established between domains in different areas of the hierarchy. At the lower levels of the hierarchy, synonym rings are attached to domain names.

Supplied Thesaurus Location

The exact name and location of the thesaurus load file is operating system dependent; however, the file is generally named 'dr0thsus' (with an appropriate extension for text files) and is generally located in the following directory structure:

<Oracle_home_directory>
    <ConText_directory>
           thes

See Also:

For more information about the directory structure for ConText, see the Oracle8 installation documentation specific to your operating system.

Thesaurus Entries and Relationships

Three types of relationships can be defined for entries (words and phrases) in a thesaurus:

In addition, each entry in a thesaurus can have Scope Notes associated with it.

Synonyms

Figure 6-1

Support for synonyms is implemented through synonym entries in a thesaurus. The collection of all of the synonym entries for a term and its associated terms is known as a synonym ring.

Synonym entries support the following relationships:

Synonym Rings

Synonym rings are transitive. If term A is synonymous with term B and term B is synonymous with term C, term A and term C are synonymous. Similarly, if term A is synonymous with both terms B and C, terms B and C are synonymous. In either case, the three terms together form a synonym ring.

For example, in the synonym rings shown in this example, the terms car, auto, and automobile are all synonymous. Similarly, the terms main, principal, major, and predominant are all synonymous.

Note:

A thesaurus can contain multiple synonym rings; however, synonym rings are not named. A synonym ring is created implicitly by the transitive association of the terms in the ring.

As such, a term cannot exist twice within the same synonym ring or within more than one synonym ring in a thesaurus.

While synonym rings are not explicitly named, they have an ID associated with them. The ID is assigned when the synonym entry is first created.

Preferred Terms

Each synonym ring can have one, and only one, term that is designated as the preferred term. A preferred term is used in place of the other terms in a synonym ring when one of the terms in the ring is specified with the PT operator in a query.

Note:

A term in a preferred term (PT) query is replaced by, rather than expanded to include, the preferred term in the synonym ring.

Hierarchical Relationships

Figure 6-2

Hierarchical relationships consist of broader and narrower terms represented as an inverted tree. Each entry in the hierarchy is a narrower term for the entry immediately above it and to which it is linked. The term at the root of each tree is known as the top term.

For example, in the tree structure shown in the following example, the term elephant is a narrower term for the term mammal. Conversely, mammal is a broader term for elephant. The top term is animal.

In addition to the standard hierarchy, ConText also supports the following specialized hierarchical relationships in thesauri:

Each of the three hierarchical relationships supported by ConText represents a separate branch of the hierarchy and are accessed in a query using different thesaurus operators.

Note:

The three types of hierarchical relationships are optional. Any of the three hierarchical relationships can be specified for a term.

Generic Hierarchy

The generic hierarchy represents relationships between terms in which one term is a generic name for the other.

For example, the terms rat and rabbit could be specified as narrower generic terms for rodent.

Partitive Hierarchy

The partitive hierarchy represents relationships between terms in which one term is part of another.

For example, the provinces of British Columbia and Quebec could be specified as narrower partitive terms for Canada.

Instance Hierarchy

The instance hierarchy represents relationships between terms in which one term is an instance of another.

For example, the terms Cinderella and Snow White could be specified as narrower instance terms for fairy tales.

Multiple Occurrences of the Same Term

Because the four hierarchies are treated as separate structures, the same term can exist in more than hierarchy. In addition, a term can exist more than once in a single hierarchy; however, in this case, each occurrence of the term in the hierarchy must be accompanied by a qualifier.

If a term exists more than once as a narrower term in one of the hierarchies, broader term queries for the term are expanded to include all of the broader terms for the term.

If a term exists more than once as a broader term in one of the hierarchies, narrower term queries for the term are expanded to include the narrower terms for each occurrence of the broader term.

For example, C is a generic narrower term for both A and B. D and E are generic narrower terms for C. In queries for terms A, B, or C, the following expansions take place:

NTG(A) expands to {C}, {A}
NTG(B) expands to {C}, {B}
NTG(C) expands to {C}, {D}, {E}
BTG(C) expands to {C}, {A}, {B}

Note:

This example uses the generic hierarchy. The same expansions hold true for the standard, partitive, and instance hierarchies.

Qualifiers

For homographs (terms that are spelled the same way, but have different meanings) in a hierarchy, a qualifier must be specified as part of the entry for the word. When homographs that have a qualifier for each occurrence appear in a hierarchy, each term is treated as a separate entry in the hierarchy.

For example, the term spring has different meanings relating to seasons of the year and mechanisms/machines. The term could be qualified in the hierarchy using the terms season and machinery.

To differentiate between the terms during a query, the qualifier must be specified. Then, only the terms that are broader terms, narrower terms, or related terms for the term and its qualifier are returned. If no qualifier is specified, all of the related, narrower, and broader terms for the terms are returned.

Note:

In thesaural queries that include a term and its qualifier, the qualifier must be escaped, because the parentheses required to identify the qualifier for a term will cause the query to fail.

Related Terms

Each entry in a thesaurus can have one or more related terms associated with it. Related terms are terms that are close in meaning to, but not synonymous with, their related term. Similar to synonyms, related terms are reflexive; however, related terms are not transitive.

If a term that has one or more related terms defined for it is specified in a related term query, the query is expanded to include all of the related terms.

For example, B and C are related terms for A. In queries for A, B, and C, the following expansions take place:

RT(A) expands to {A}, {B}, {C}
RT(B) expands to {A}, {B}
RT(C) expands to {C}, {A}

Note:

Terms B and C are not related terms and, as such, are not returned in the expansions performed by ConText.

Scope Notes

Each entry in the hierarchy, whether it is a main entry or one of the synonymous, hierarchical, or related entries for a main entry, can have scope notes associated with it.

Scope notes can be used to provide descriptions or comments for the entry. In particular, they can be used to provide information about the usage/function of the entry or to distinguish the entry from other entries with similar meanings.

Document Sections

ConText enables users to increase query precision using structure (i.e. sections) found in most documents. The most common structure found in documents is the grouping of text into sentences and paragraphs. In addition, many documents create structure through the use of tags or regularly-occurring fields delimited by strings of repeating text.

For example, World Wide Web documents use HTML, a defined set of tags and codes, to identify titles, headers, paragraph offsets, and other document meta-information as part of the document content. Similarly, e-mail messages often contains fields with consistent, regularly-occurring headers such as subject: and date:.

For each text column, users can choose to define rules for dividing the documents in the column into user-defined sections. In addition, for text columns that use the BASIC LEXER Tile, users can enable section searching for sentences and paragraphs. ConText includes section information as entries (rows) in the text index for a column so that text queries on the column can be restricted to a specified section.

Note:

Section searching does not apply to theme queries. As such, defining sections and enabling section searching for theme indexes is not supported.

In addition, because section information is stored in the text index, sections must be defined, if desired, and section searching must be enabled before text index creation. If you want to use section searching for columns with existing text indexes, you must drop the indexes, define sections, if desired, enable section searching (through the preferences for the column policies), then reindex the columns.

Section Searching

A query expression operator, WITHIN, is provided for restricting a text query to a particular section.

The WITHIN operator can be used to restricts queries in two distinct ways:

sentence and paragraph searching
user-defined section searching

Note:

Sentence/paragraph searching and user-defined section searching can be enabled concurrently for a text column; however, text queries can reference only a single section (sentence, paragraph, or user-defined) at a time.

In addition, if both sentence/paragraph searching and user-defined section searching are enabled for a text column, certain restrictions apply. For more information, see "User-Defined Sections" in this chapter.

See Also:

For more information about the WITHIN operator and performing text queries using document sections, see Oracle8 ConText Cartridge Application Developer's Guide.

Sentence and Paragraph Searching

Sentence/paragraph searching returns documents in which two or more words occur within the same sentence or paragraph. In this way, sentence/paragraph searching is similar to proximity searching (NEAR operator), which returns documents in which two or more words occur within a user-specified distance.

For sentence/paragraph searching, the WITHIN operator takes sentence or paragraph as the value for the section name.

User-defined Section Searching

Section searching for user-defined sections returns documents in which one or more terms occur in a user-defined section.

For user-defined section searching, the WITHIN operator takes the name of a user-defined section.

Sentences and Paragraphs as Sections

ConText provides two system-level, predefined sections, sentence and paragraph, for sentence/paragraph searching; however, to enable ConText to identify sentence and paragraphs as sections, sentence and paragraph delimiters must be specified for the text lexer (BASIC LEXER Tile).

BASIC LEXER provides three attributes (punctuations, whitespace, and newline) for specifying sentence and paragraph delimiters.

Sentence Delimiters

Sentence delimiters are characters that, when they occur in the following sequence, indicate the end of a sentence and the beginning of a new sentence:

token -> punctuation character(s) -> whitespace character(s)

Paragraph Delimiters

Paragraph delimiters are characters that, when they occur in any of the following sequences, indicate the end of a paragraph and the beginning of a new paragraph:

token -> punctuation character(s) -> whitespace character(s) -> newline character(s)

token -> punctuation character(s) -> newline character(s) -> newline character(s)

By definition, paragraph delimiters also serve as sentence delimiters.

User-Defined Sections

A user-defined section is a body of text, delimited by user-specified start and end tags, within a document. ConText allows users to control the behavior/interaction of user-defined sections through the definition of sections as top-level or self-enclosing sections.

User-defined sections must be assigned a name and grouped into a section group. Sections are not created as individual, stand-alone objects. Instead, users create sections by adding them to an existing section group.

Note:

If user-defined sections are used in conjunction with sentence/paragraph sections, sentence and paragraph are reserved words and cannot be used as section names.

See Also:

For examples of creating section groups and adding, as well as removing, sections in section groups, see "Managing User-defined Document Sections" in Chapter 9, "Setting Up and Managing Text".

Start and End Tags

The beginning of a user-defined section is explicitly identified by a start tag, which can be any token in the text, as long as the token is a valid token recognized by the lexer for the text column. Each section must have a start tag.

The end of a section can be identified explicitly by an end tag or implicitly by the occurrence of the next occurring start tag, depending on whether the section is defined as a top-level or self-enclosing section. As a result, end tags can be optional. Similar to start tags, end tags can be any token in the text, as long as the token can be recognized by the lexer.

Note:

Start and end tags are case-sensitive if the text index for which they are defined is case-sensitive.

For documentation purposes, all references to start and end tags in this section are presented in uppercase.

For more information about case-sensitivity in text indexes, see "Text Indexes" in this chapter.

Start and end tags are stored as part of the ConText index, but do not take up space in the index. For example, a document contains the following string, where <TITLE> and </TITLE> are defined as start and end tags:

<TITLE>cats</TITLE> make good pets

The string is indexed by ConText as:

cats make good pets

which enables searching on phrases such as cats make.

In addition, start and end tags do not produce hits if searched upon.

Suggestion:

Because each occurrence of a token that is defined as a start/end tag indicates the beginning/end of a section, specify tokens for start and end tags that are as distinctive as possible. Include any non-alphanumeric characters such as colons ': ' or angle brackets '<>' which help to uniquely identify the tokens.

For example, the token TITLE by itself does not make a good start tag, because it is a common word and ConText would record the start of a new section each time the token was encountered in the text. A better start tag would be the string <TITLE> or TITLE:.

Top-level Sections

A top-level section is only closed (implicitly) by the next occurring top-level section or (explicitly) by the occurrence of the end tag for the section; however, end tags are not required for top-level sections. In addition, a top-level section implicitly closes all sections that are not defined as top-level.

Top-level sections cannot enclose themselves or each other. As a result, if a section is defined as top-level, it cannot also be defined as self-enclosing.

Self-Enclosing Sections

A self-enclosing section is only closed (explicitly) when the end tag for the section is encountered or (implicitly) when a top-level section is encountered. As a result, end tags are required for sections that are defined as self-enclosing.

Self-enclosing sections support defining tags such as the table tag <TD> in HTML as a start tag. Table data in HTML is always explicitly ended with the </TD> tag. In addition, tables in HTML can have embedded or nested tables.

If a section is not defined as self-enclosing, the section is implicitly closed when another start tag is encountered. For example, the paragraph tag <P> in HTML can be defined as a start tag for a section that is not self-enclosing, because paragraphs in HTML are sometimes explicitly ended with the </P> tag, but are often ended implicitly with the start of another tag.

Startjoin and Endjoin Characters

To enable defining document sections, ConText supports specifying non-alphanumeric characters (e.g. hyphens, colons, periods, brackets) using the startjoins and endjoins attribute for the BASIC LEXER Tile.

When a character defined as a startjoins appears at the beginning of a word, it explicitly identifies the word as a new token and end the previous token. When an character specified as an endjoins appears at the end of a word, it explicitly identifies the end of the token.

Note:

Characters that are defined as startjoins and endjoins are included as part of the entry for the token in the ConText index.

Text Filtering

Section searching for user-defined sections requires the start and end tags for the document sections to be included in the ConText index. This is accomplished through the use of ConText filters and the (optional) definition of startjoins and printjoins for the BASIC LEXER Tile.

For HTML text that uses the internal HTML filter, document sections have an additional requirement. Because the internal HTML filter removes all HTML markup during filtering, you must explicitly specify the HTML tags that serve as section start and end tags and, consequently, must not be removed by the filter.

This is accomplished through the keep_tag attribute for the HTML FILTER Tile. The keep_tag attribute is a multi-value attribute that lets users specify the HTML tags to keep during filtering with the internal HTML filter.

For HTML filter that is filtered using an external HTML filter, the filter must provide some mechanism for retaining HTML tags used as section start and end tags.

Limitations

User-defined sections have the following limitations:

Implicit Start of Body Sections

ConText does not recognize the start of a body section after the implicit end of a header section.

For example, consider the following e-mail message in which FROM:, SUBJECT:, and NEWSGROUPS: are defined as start tags for three different sections:

From: jsmith@ABC.com
Subject: New teams
Newsgroups: arts.recreation, alt.sports

New teams have been added to the league.

All of the text following the NEWSGROUPS: header tag is included in the header section, including the body of the message.

Multi-word Start and End Tags

ConText does not support start and end tags consisting of more than one word. Each start and end tag for a section can contain only a single word and the word must be unique for each tag within the section group.

For example:

problem description: Insufficent privileges
problem solution: Grant required privileges to file

The strings PROBLEM DESCRIPTION: and PROBLEM SOLUTION: cannot be specified as start tags.

Identical Start and End Tags

ConText does not recognize sections in which the start and end tags are the same.

For example:

:Author:
Joseph Smith
:Author:
:Title:
Guide to Oracle
:Title:

The strings :AUTHOR: and :TITLE: cannot be specified as both start and end tags.

Section Groups

A section group is the collection of all the user-defined sections for a text column. Section groups are assigned by name to a text column through the Wordlist preference in the column policy.

Sections in Section Groups

The start and end tags for a particular section must be unique within the section group to which the section belongs. In addition, within a section group, no start tag can also be an end tag.

Section names do not have to be unique within a section group. This allows defining multiple start and end tags for the same logical section, while making the section details transparent to queries.

Section Group Management

Section groups can be created and deleted by ConText users with the CTXADMIN or CTXAPP roles. In addition, users with CTXADMIN or CTXAPP can add and remove sections from section groups. Section group names must be unique for the user who creates the section group.

See Also:

For examples of creating and deleting section groups, as well as adding and removing sections in section groups, see "Managing User-defined Document Sections" in Chapter 9, "Setting Up and Managing Text".

Predefined HTML Section Group

ConText provides a predefined section group, BASIC_HTML_SECTION, which enables user-defined section searching in basic HTML documents.

BASIC_HTML_SECTION contains the following section definitions:

Section Name Start Tag End Tag Top Level Self-Enclosing

HEAD

<HEAD>

</HEAD>

Yes

No

TITLE

<TITLE>

</TITLE>

No

No

BODY

<BODY>

</BODY>

Yes

No

PARA

<P>

</P>

No

No

HEADING

<H1>

</H1>

No

No

<H2>

</H2>

No

No

<H3>

</H3>

No

No

<H4>

</H4>

No

No

<H5>

</H5>

No

No

<H6>

</H6>

No

No

In addition, the following predefined preferences have been created to support ready-to-use basic HTML section searching:

Filter preference - BASIC_HTML_FILTER
Lexer preference - BASIC_HTML_LEXER
Wordlist preference - BASIC_HTML_WORDLIST

Setup Process for Section Searching

The process for setting up section searching differs depending on whether you are enabling section searching for sentences/paragraphs or user-defined sections.

Sentence and Paragraph Searching

The process model for enabling sentence/paragraph searching is as follows:

If necessary, specify values for the whitespace, newline, and punctuations attributes of the BASIC LEXER Tile.
Specify a value of '1' for the sent_para attribute (BASIC LEXER).
Create a Lexer preference for the Tile.
Create a policy that includes the Lexer preference you created.

User-defined Section Searching

The process model for defining sections and enabling section searching for these sections is as follows:

Use CTX_DDL.CREATE_SECTION_GROUP to create a section group for your user-defined sections.

When you call CREATE_SECTION_GROUP, you specify the name of the section group to create.

Call CTX_DDL.ADD_SECTION for each user-defined section that you want to create in your section group.

When you call ADD_SECTION, you specify the name of the section, the start and end tags for the section, and whether the section is top-level or self-enclosing.

If you are creating sections for HTML documents and you use the internal HTML filter, set the keep_tag attribute (HTML FILTER Tile) once for each of the HTML tags that the filter must retain for use as section start and end tags.

Then create a Filter preference for the Tile.

If necessary, specify values for the startjoins and endjoins attributes of the BASIC LEXER Tile.

Then, create a Lexer preference for the Tile.

Use the section_group attribute of the GENERIC WORD LIST Tile to specify the name of your section group and create a Wordlist preference for the Tile.
Create a policy that includes the section-enabled preferences (Filter, Lexer, and Wordlist) that you created.

See Also:

For examples of defining section groups and sections, as well as creating a section-enabled Wordlist preference, see "Managing User-defined Document Sections" in Chapter 9, "Setting Up and Managing Text".

For examples of specifying attributes for the HTML FILTER and BASIC LEXER Tiles, see "Filter Preference Examples" and "Lexer Preference Examples" in Chapter 8, "ConText Indexing".

Section Name	Start Tag	End Tag	Top Level	Self-Enclosing
HEAD	<HEAD>	</HEAD>	Yes	No
TITLE	<TITLE>	</TITLE>	No	No
BODY	<BODY>	</BODY>	Yes	No
PARA	<P>	</P>	No	No
HEADING	<H1>	</H1>	No	No
	<H2>	</H2>	No	No
	<H3>	</H3>	No	No
	<H4>	</H4>	No	No
	<H5>	</H5>	No	No
	<H6>	</H6>	No	No

6 Text Concepts

Text Operations

Automated Text Loading

DDL

DML

Automatic DML Queue Notification

Manual DML Queue Notification

Immediate DML Processing

Batch DML Processing

Concurrent Index Creation

Text/Theme Queries

Two-step Queries

One-step Queries

In-memory Queries

Stored Query Expressions

Linguistics Requests

Text Columns

Supported Datatypes

Textkeys

Composite Textkeys

Column Name Limitations

Column Length Limitations

Text Loading

Individual Row Insert/Update/Export

SQL

ctxload Utility

Batch Load

SQL*Loader

ctxload Utility

Automated Text Load

Client-side Insert/Update/Export

ConText Indexes

Text Indexes

Text Lexers

Tokens in Text Indexes

Token Location Information

Case-sensitivity

Stop Words

Theme Indexes

Theme Lexer

Tokens in Theme Indexes

Theme Weights

Case-sensitivity

Linguistic Settings

ConText Index Tables

Columns with Multiple Indexes

Index Creation

Stages of ConText Indexing

Index Initialization

Index Population

Index Termination

Creating Empty ConText Indexes

Parallel Indexing

Index Fragmentation

Memory Allocation

Index Log

Log Details

Accessing the Log

Index Updates (DML)

Immediate Vs. Batch Update

Deferred Deletion

Index Optimization

Compaction of Index Fragments

Removal of Obsolete Document References

Piecewise Optimization

Optimizing Index Entries for Tokens and Sections

Case-sensitivity

Identifying Candidates for Piecewise Optimization

When to Optimize

Thesauri

Thesaurus Creation and Maintenance

Thesauri in Queries

Query Expansion

Limitations

Case-sensitivity

Case-sensitive Thesauri

Case-insensitive Thesauri

Default Thesaurus

Supplied Thesaurus

Supplied Thesaurus Structure and Content

6
Text Concepts