8 CTX_DOC Package

This chapter describes the CTX_DOC PL/SQL package for requesting document services, such as highlighting extracted text or generating a list of themes for a document.

Many of these procedures exist in two versions: those that make use of indexes, and those that don't. Those that don't are called "policy-based" procedures. They are offered because there are times when you might like to use document services on a single document without creating a context index in advance. Policy-based procedures enable you to do this.

The policy_* procedures mirror the conventional in-memory document services and are used with policy name replacing index name, and document of type VARCHAR2, CLOB, BLOB or BFILE replacing textkey. Thus, you need not create an index to obtain document services output with these procedures.

The CTX_DOC package includes the following procedures and functions:

Name	Description
FILTER	Generates a plain text or HTML version of a document
GIST	Generates a Gist or theme summaries for a document
HIGHLIGHT	Generates plain text or HTML highlighting offset information for a document
IFILTER	Generates a plain text version of binary data. Can be called from a `USER_DATASTORE` procedure.
MARKUP	Generates a plain text or HTML version of a document with query terms highlighted
PKENCODE	Encodes a composite textkey string (value) for use in other `CTX_DOC` procedures
POLICY_FILTER	Generates a plain text or HTML version of a document, without requiring an index.
POLICY_GIST	Generates a Gist or theme summaries for a document, without requiring an index.
POLICY_HIGHLIGHT	Generates plain text or HTML highlighting offset information for a document, without requiring an index.
POLICY_MARKUP	Generates a plain text or HTML version of a document with query terms highlighted, without requiring an index.
POLICY_THEMES	Generates a list of themes for a document, without requiring an index.
POLICY_TOKENS	Generates all index tokens for a document, without requiring an index.
SET_KEY_TYPE	Sets `CTX_DOC` procedures to accept rowid or primary key document identifiers.
THEMES	Generates a list of themes for a document
TOKENS	Generates all index tokens for a document.

FILTER

Use the CTX_DOC.FILTER procedure to generate either a plain text or HTML version of a document. You can store the rendered document in either a result table or in memory. This procedure is generally called after a query, from which you identify the document to be filtered.

Note:

The resultant HTML document does not include graphics.

Syntax 1:In-memory Result Storage

CTX_DOC.FILTER(
          index_name  IN VARCHAR2, 
          textkey     IN VARCHAR2, 
          restab      IN OUT NOCOPY CLOB, 
          plaintext   IN BOOLEAN  DEFAULT FALSE);

Syntax 2: Result Table Storage

CTX_DOC.FILTER(
          index_name  IN VARCHAR2, 
          textkey     IN VARCHAR2, 
          restab      IN VARCHAR2, 
          query_id    IN NUMBER DEFAULT 0,
          plaintext   IN BOOLEAN  DEFAULT FALSE);

index_name

Specify the name of the index associated with the text column containing the document identified by textkey.

textkey

Specify the unique identifier (usually the primary key) for the document.

The textkey parameter can be one of the following:

a single column primary key value
encoded specification for a composite (multiple column) primary key. Use CTX_DOC.PKENCODE .
the rowid of the row containing the document

You toggle between primary key and rowid identification using CTX_DOC.SET_KEY_TYPE.

restab

You can specify that this procedure store the marked-up text to either a table or to an in-memory CLOB.

To store results to a table specify the name of the table. The result table must exist before you make this call.

See Also:

"Filter Table" in Appendix A, " Result Tables" for more information about the structure of the filter result table.

To store results in memory, specify the name of the CLOB locator. If restab is NULL, a temporary CLOB is allocated and returned. You must de-allocate the locator after using it with DBMS_LOB.FREETEMPORARY().

If restab is not NULL, the CLOB is truncated before the operation.

query_id

Specify an identifier to use to identify the row inserted into restab.

When query_id is not specified or set to NULL, it defaults to 0. You must manually truncate the table specified in restab.

plaintext

Specify TRUE to generate a plaintext version of the document. Specify FALSE to generate an HTML version of the document if you are using the INSO filter or indexing HTML documents.

Example

In-Memory Filter

The following code shows how to filter a document to HTML in memory.

declare
mklob clob;
amt number := 40;
line varchar2(80);

begin
 ctx_doc.filter('myindex','1', mklob, FALSE);
 -- mklob is NULL when passed-in, so ctx-doc.filter will allocate a temporary
 -- CLOB for us and place the results there.
 dbms_lob.read(mklob, amt, 1, line);
 dbms_output.put_line('FIRST 40 CHARS ARE:'||line);
 -- have to de-allocate the temp lob
 dbms_lob.freetemporary(mklob);
 end;

Create the filter result table to store the filtered document as follows:

create table filtertab (query_id  number,   
                        document  clob);

To obtain a plaintext version of document with textkey 20, issue the following statement:

begin 
ctx_doc.filter('newsindex', '20', 'filtertab', '0', TRUE);
end;

GIST

Use the CTX_DOC.GIST procedure to generate gist and theme summaries for a document. You can generate paragraph-level or sentence-level gists or theme summaries.

Syntax 1: In-Memory Storage

CTX_DOC.GIST(

index_name    IN VARCHAR2, 
textkey       IN VARCHAR2, 
restab        IN OUT CLOB, 
glevel        IN VARCHAR2 DEFAULT 'P',
pov           IN VARCHAR2 DEFAULT 'GENERIC',
numParagraphs IN NUMBER DEFAULT 16,
maxPercent    IN NUMBER DEFAULT 10,
num_themes   IN NUMBER DEFAULT 50);

Syntax 2: Result Table Storage

CTX_DOC.GIST(

index_name    IN VARCHAR2, 
textkey       IN VARCHAR2, 
restab        IN VARCHAR2, 
query_id      IN NUMBER DEFAULT 0,
glevel        IN VARCHAR2 DEFAULT 'P',
pov           IN VARCHAR2 DEFAULT NULL,
numParagraphs IN NUMBER DEFAULT 16,
maxPercent    IN NUMBER DEFAULT 10,
num_themes     IN NUMBER DEFAULT 50);

index_name

Specify the name of the index associated with the text column containing the document identified by textkey.

textkey

Specify the unique identifier (usually the primary key) for the document.

The textkey parameter can be one of the following:

a single column primary key value
an encoded specification for a composite (multiple column) primary key. To encode a composite textkey, use the CTX_DOC.PKENCODE procedure.
the rowid of the row containing the document

You toggle between primary key and rowid identification using CTX_DOC.SET_KEY_TYPE.

restab

You can specify that this procedure store the gist and theme summaries to either a table or to an in-memory CLOB.

To store results to a table specify the name of the table.

See Also:

"Gist Table" in Appendix A, " Result Tables" for more information about the structure of the gist result table, see

To store results in memory, specify the name of the CLOB locator. If restab is NULL, a temporary CLOB is allocated and returned. You must de-allocate the locator after using it.

If restab is not NULL, the CLOB is truncated before the operation.

query_id: Specify an identifier to use to identify the row(s) inserted into restab.
glevel: Specify the type of gist or theme summary to produce. The possible values are:

P for paragraph
S for sentence

The default is P.

pov

Specify whether a gist or a single theme summary is generated. The type of gist or theme summary generated (sentence-level or paragraph-level) depends on the value specified for glevel.

To generate a gist for the entire document, specify a value of 'GENERIC' for pov. To generate a theme summary for a single theme in a document, specify the theme as the value for pov.

When using result table storage and you do not specify a value for pov, this procedure returns the generic gist plus up to fifty theme summaries for the document.

When using in-memory result storage to a CLOB, you must specify a pov. However, if you do not specify pov, this procedure generates only a generic gist for the document.

Note:

The pov parameter is case sensitive. To return a gist for a document, specify 'GENERIC' in all uppercase. To return a theme summary, specify the theme exactly as it is generated for the document.

Only the themes generated by THEMES for a document can be used as input for pov.

numParagraphs

Specify the maximum number of document paragraphs (or sentences) selected for the document gist or theme summaries. The default is 16.

Note:

The numParagraphs parameter is used only when this parameter yields a smaller gist or theme summary size than the gist or theme summary size yielded by the maxPercent parameter.

This means that the system always returns the smallest size gist or theme summary.

maxPercent

Specify the maximum number of document paragraphs (or sentences) selected for the document gist or theme summaries as a percentage of the total paragraphs (or sentences) in the document. The default is 10.

Note:

The maxPercent parameter is used only when this parameter yields a smaller gist or theme summary size than the gist or theme summary size yielded by the numParagraphs parameter.

This means that the system always returns the smallest size gist or theme summary.

num_themes

Specify the number of theme summaries to produce when you do not specify a value for pov. For example, if you specify 10, this procedure returns the top 10 theme summaries. The default is 50.

If you specify 0 or NULL, this procedure returns all themes in a document. If the document contains more than 50 themes, only the top 50 themes show conceptual hierarchy.

Examples

In-Memory Gist

The following example generates a non-default size generic gist of at most 10 paragraphs. The result is stored in memory in a CLOB locator. The code then de-allocates the returned CLOB locator after using it.

set serveroutput on;
declare
  gklob clob;
  amt number := 40;
  line varchar2(80);

begin
 ctx_doc.gist('newsindex','34',gklob, pov => 'GENERIC',numParagraphs => 10);
  -- gklob is NULL when passed-in, so ctx-doc.gist will allocate a temporary
  -- CLOB for us and place the results there.
  
  dbms_lob.read(gklob, amt, 1, line);
  dbms_output.put_line('FIRST 40 CHARS ARE:'||line);
  -- have to de-allocate the temp lob
  dbms_lob.freetemporary(gklob);
 end;

Result Table Gists

The following example creates a gist table called CTX_GIST:

create table CTX_GIST (query_id  number,
                       pov       varchar2(80), 
                       gist      CLOB);

Gists and Theme Summaries

The following example returns a default sized paragraph level gist for document 34 as well as the top 10 theme summaries in the document:

begin
   ctx_doc.gist('newsindex','34','CTX_GIST', 1, num_themes=>10);
end;

The following example generates a non-default size gist of at most 10 paragraphs:

begin
  ctx_doc.gist('newsindex','34','CTX_GIST',1,pov =>'GENERIC',numParagraphs=>10);
end;

The following example generates a gist whose number of paragraphs is at most 10 percent of the total paragraphs in document:

begin 
  ctx_doc.gist('newsindex','34','CTX_GIST',1,pov => 'GENERIC',  maxPercent => 10);
end;

Theme Summary

The following example returns a paragraph level theme summary for insects for document 34. The default theme summary size is returned.

begin
   ctx_doc.gist('newsindex','34','CTX_GIST',1, pov => 'insects');
end;

HIGHLIGHT

Use the CTX_DOC.HIGHLIGHT procedure to generate highlight offsets for a document. The offset information is generated for the terms in the document that satisfy the query you specify. These highlighted terms are either the words that satisfy a word query or the themes that satisfy an ABOUT query.

You can generate highlight offsets for either plaintext or HTML versions of the document. The table returned by CTX_DOC.HIGHLIGHT does not include any graphics found in the original document. You can apply the offset information to the same documents filtered with CTX_DOC.FILTER .

You usually call this procedure after a query, from which you identify the document to be processed.

You can store the highlight offsets in either an in-memory PL/SQL table or a result table.

Syntax 1:In-Memory Result Storage

CTX_DOC.HIGHLIGHT(
        index_name  IN VARCHAR2, 
        textkey     IN VARCHAR2, 
        text_query  IN VARCHAR2, 
        restab      IN OUT NOCOPY HIGHLIGHT_TAB, 
        plaintext   IN BOOLEAN  DEFAULT FALSE);

Syntax 2:Result Table Storage

CTX_DOC.HIGHLIGHT(
          index_name  IN VARCHAR2, 
          textkey     IN VARCHAR2, 
          text_query  IN VARCHAR2, 
          restab      IN VARCHAR2, 
          query_id    IN NUMBER   DEFAULT 0,
          plaintext   IN BOOLEAN  DEFAULT FALSE);

index_name

Specify the name of the index associated with the text column containing the document identified by textkey.

textkey

Specify the unique identifier (usually the primary key) for the document.

The textkey parameter can be one of the following:

a single column primary key value
encoded specification for a composite (multiple column) primary key. Use the CTX_DOC.PKENCODE procedure.
the rowid of the row containing the document

You toggle between primary key and rowid identification using CTX_DOC.SET_KEY_TYPE.

text_query

Specify the original query expression used to retrieve the document. If NULL, no highlights are generated.

If text_query includes wildcards, stemming, fuzzy matching which result in stopwords being returned, HIGHLIGHT does not highlight the stopwords.

If text_query contains the threshold operator, the operator is ignored. The HIGHLIGHT procedure always returns highlight information for the entire result set.

restab

You can specify that this procedure store highlight offsets to either a table or to an in-memory PL/SQL table.

To store results to a table specify the name of the table. The table must exist before you call this procedure.

See Also:

see "Highlight Table" in Appendix A, " Result Tables" for more information about the structure of the highlight result table.

To store results to an in-memory table, specify the name of the in-memory table of type CTX_DOC.HIGHLIGHT_TAB. The HIGHLIGHT_TAB datatype is defined as follows:

type highlight_rec is record (
  offset number,
  length number
);
type highlight_tab is table of highlight_rec index by binary_integer;

CTX_DOC.HIGHLIGHT clears HIGHLIGHT_TAB before the operation.

query_id

Specify the identifier used to identify the row inserted into restab.

When query_id is not specified or set to NULL, it defaults to 0. You must manually truncate the table specified in restab.

plaintext

Specify TRUE to generate a plaintext offsets of the document.

Specify FALSE to generate HTML offsets of the document if you are using the INSO filter or indexing HTML documents.

Examples

Create Highlight Table

Create the highlight table to store the highlight offset information:

create table hightab(query_id number, 
                     offset number, 
                     length number);

Word Highlight Offsets

To obtain HTML highlight offset information for document 20 for the word dog:

begin
ctx_doc.highlight('newsindex', '20', 'dog', 'hightab', 0, FALSE);
end;

Theme Highlight Offsets

Assuming the index newsindex has a theme component, you obtain HTML highlight offset information for the theme query of politics by issuing the following query:

begin
ctx_doc.highlight('newsindex', '20', 'about(politics)', 'hightab', 0, FALSE);
end;

The output for this statement are the offsets to highlighted words and phrases that represent the theme of politics in the document.

IFILTER

Use this procedure when you need to filter binary data to text.

This procedure takes binary data (BLOB IN), filters the data through with the Inso filter, and writes the text version to a CLOB. (Any graphics in the original document are ignored.) CTX_DOC.IFILTER employs the safe callout, and it does not require an index to use, as CTX_DOC.FILTER does.

Note:

This procedure will not be supported in future releases. Programs should make use of CTX_DOC.POLICY_FILTER instead.

Requirements

Because CTX_DOC.IFILTER employs the safe callout mechanism, the SQL*Net listener must be running and configured for extproc agent startup.

Syntax

CTX_DOC.IFILTER(data IN BLOB, text IN OUT NOCOPY CLOB);

data: Specify the binary data to be filtered.
text: Specify the destination CLOB. The filtered data is placed in here. This parameter must be a valid CLOB locator that is writable. Passing NULL or a non-writable CLOB will result in an error. Filtered text will be appended to the end of existing content, if any.

Example

The document text used in a MATCHES query can be VARCHAR2 or CLOB. It does not accept BLOB input, so you cannot match filtered documents directly. Instead, you must filter the binary content to CLOB using the INSO filter. Assuming the document data is in bind variable :doc_blob:

declare
    doc_text clob;
  begin
    -- create a temporary CLOB to hold the document text
    doc_text := dbms_lob.createtemporary(doc_text, TRUE, DBMS_LOB.SESSION);

    -- call ctx_doc.ifilter to filter the BLOB to CLOB data
    ctx_doc.ifilter(:doc_blob, doc_text);

    -- now do the matches query using the CLOB version
    for c1 in (select * from queries where matches(query_string, doc_text)>0)
    loop
      -- do what you need to do here
    end loop;

    dbms_lob.freetemporary(doc_text);
  end;

MARKUP

The CTX_DOC.MARKUP procedure takes a query specification and a document textkey and returns a version of the document in which the query terms are marked up. These marked-up terms are either the words that satisfy a word query or the themes that satisfy an ABOUT query.

You can set the marked-up output to be either plaintext or HTML. The marked-up document returned by CTX_DOC.MARKUPdoes not include any graphics found in the original document.

You can use one of the pre-defined tagsets for marking highlighted terms, including a tag sequence that enables HTML navigation.

You usually call CTX_DOC.MARKUP after a query, from which you identify the document to be processed.

You can store the marked-up document either in memory or in a result table.

Note:

Oracle Text does not guarantee well-formed output from CTX.DOC.MARKUP, especially for terms that are already marked up with HTML or XML. In particular, unexpected nesting of markup tags may occasionally result.

Syntax 1: In-Memory Result Storage

CTX_DOC.MARKUP(

index_name     IN VARCHAR2, 
textkey        IN VARCHAR2, 
text_query     IN VARCHAR2, 
restab         IN OUT NOCOPY CLOB, 
plaintext      IN BOOLEAN   DEFAULT FALSE, 
tagset         IN VARCHAR2  DEFAULT 'TEXT_DEFAULT', 
starttag       IN VARCHAR2  DEFAULT NULL, 
endtag         IN VARCHAR2  DEFAULT NULL, 
prevtag        IN VARCHAR2  DEFAULT NULL, 
nexttag        IN VARCHAR2  DEFAULT NULL);

Syntax 2: Result Table Storage

CTX_DOC.MARKUP(

index_name     IN VARCHAR2, 
textkey        IN VARCHAR2, 
text_query     IN VARCHAR2, 
restab         IN VARCHAR2, 
query_id       IN NUMBER    DEFAULT 0,  
plaintext      IN BOOLEAN   DEFAULT FALSE, 
tagset         IN VARCHAR2  DEFAULT 'TEXT_DEFAULT', 
starttag       IN VARCHAR2  DEFAULT NULL, 
endtag         IN VARCHAR2  DEFAULT NULL, 
prevtag        IN VARCHAR2  DEFAULT NULL, 
nexttag        IN VARCHAR2  DEFAULT NULL);

index_name

Specify the name of the index associated with the text column containing the document identified by textkey.

textkey

Specify the unique identifier (usually the primary key) for the document.

The textkey parameter can be one of the following:

a single column primary key value
encoded specification for a composite (multiple column) primary key. Use the CTX_DOC.PKENCODE procedure.
the rowid of the row containing the document

You toggle between primary key and rowid identification using CTX_DOC.SET_KEY_TYPE.

text_query

Specify the original query expression used to retrieve the document.

If text_query includes wildcards, stemming, fuzzy matching which result in stopwords being returned, MARKUP does not highlight the stopwords.

If text_query contains the threshold operator, the operator is ignored. The MARKUP procedure always returns highlight information for the entire result set.

restab

You can specify that this procedure store the marked-up text to either a table or to an in-memory CLOB.

To store results to a table specify the name of the table. The result table must exist before you call this procedure.

See Also:

For more information about the structure of the markup result table, see "Markup Table" in Appendix A, " Result Tables".

To store results in memory, specify the name of the CLOB locator. If restab is NULL, a temporary CLOB is allocated and returned. You must de-allocate the locator after using it.

If restab is not NULL, the CLOB is truncated before the operation.

query_id

Specify the identifier used to identify the row inserted into restab.

When query_id is not specified or set to NULL, it defaults to 0. You must manually truncate the table specified in restab.

plaintext

Specify TRUE to generate plaintext marked-up document. Specify FALSE to generate a marked-up HTML version of document if you are using the INSO filter or indexing HTML documents.

tagset

Specify one of the following pre-defined tagsets. The second and third columns show how the four different tags are defined for each tagset:

Tagset	Tag	Tag Value
`TEXT_DEFAULT`	starttag	`<<<`
	endtag	`>>>`
	prevtag
	nexttag
`HTML_DEFAULT`	starttag	`<B>`
	endtag	`</B>`
	prevtag
	nexttag
`HTML_NAVIGATE`	starttag	`<A NAME=ctx%CURNUM><B>`
	endtag	`</B></A>`
	prevtag	`<A HREF=#ctx%PREVNUM><</A>`
	nexttag	`<A HREF=#ctx%NEXTNUM>></A>`

starttag

Specify the character(s) inserted by MARKUP to indicate the start of a highlighted term.

The sequence of starttag, endtag, prevtag and nexttag with respect to the highlighted word is as follows:

... prevtag starttag word endtag nexttag...

endtag

Specify the character(s) inserted by MARKUP to indicate the end of a highlighted term.

prevtag

Specify the markup sequence that defines the tag that navigates the user to the previous highlight.

In the markup sequences prevtag and nexttag, you can specify the following offset variables which are set dynamically:

Offset Variable	Value
`%CURNUM`	the current offset number
`%PREVNUM`	the previous offset number
`%NEXTNUM`	the next offset number

See the description of the HTML_NAVIGATE tagset for an example.

nexttag

Specify the markup sequence that defines the tag that navigates the user to the next highlight tag.

Within the markup sequence, you can use the same offset variables you use for prevtag. See the explanation for prevtag and the HTML_NAVIGATE tagset for an example.

Examples

In-Memory Markup

The following code generates a marked-up document and stores it in memory. The code passes a NULL CLOB locator to MARKUP and then de-allocates the returned CLOB locator after using it.

set serveroutput on

declare
mklob clob;
amt number := 40;
line varchar2(80);

begin
 ctx_doc.markup('myindex','1','dog & cat', mklob);
 -- mklob is NULL when passed-in, so ctx-doc.markup will allocate a temporary
 -- CLOB for us and place the results there.
 dbms_lob.read(mklob, amt, 1, line);
 dbms_output.put_line('FIRST 40 CHARS ARE:'||line);
 -- have to de-allocate the temp lob
 dbms_lob.freetemporary(mklob);
 end;

Markup Table

Create the highlight markup table to store the marked-up document as follows:

create table markuptab (query_id  number,   
                        document  clob);

Word Highlighting in HTML

You can also store your MARKUP results in a table. To create HTML highlight markup for the words dog or cat for document 23, issue the following statement:

begin
  ctx_doc.markup(index_name => 'my_index',
                      textkey => '23',
                      text_query => 'dog|cat',
                      restab => 'markuptab',
                      query_id => '1',
                      tagset => 'HTML_DEFAULT');
end;

Theme Highlighting in HTML

To create HTML highlight markup for the theme of politics for document 23, issue the following statement:

begin
  ctx_doc.markup(index_name => 'my_index',
                      textkey => '23',
                      text_query => 'about(politics)',
                      restab => 'markuptab',
                      query_id => '1',
                      tagset => 'HTML_DEFAULT');
end;

PKENCODE

The CTX_DOC.PKENCODE function converts a composite textkey list into a single string and returns the string.

The string created by PKENCODE can be used as the primary key parameter textkey in other CTX_DOC procedures, such as CTX_DOC.THEMES and CTX_DOC.GIST.

Syntax

CTX_DOC.PKENCODE(
         pk1    IN VARCHAR2,
         pk2    IN VARCHAR2 DEFAULT NULL, 
         pk4    IN VARCHAR2 DEFAULT NULL, 
         pk5    IN VARCHAR2 DEFAULT NULL, 
         pk6    IN VARCHAR2 DEFAULT NULL,
         pk7    IN VARCHAR2 DEFAULT NULL,
         pk8    IN VARCHAR2 DEFAULT NULL,
         pk9    IN VARCHAR2 DEFAULT NULL,
         pk10   IN VARCHAR2 DEFAULT NULL,
         pk11   IN VARCHAR2 DEFAULT NULL,
         pk12   IN VARCHAR2 DEFAULT NULL,
         pk13   IN VARCHAR2 DEFAULT NULL,
         pk14   IN VARCHAR2 DEFAULT NULL,
         pk15   IN VARCHAR2 DEFAULT NULL,
         pk16   IN VARCHAR2 DEFAULT NULL)
RETURN VARCHAR2;

pk1-pk16: Each PK argument specifies a column element in the composite textkey list. You can encode at most 16 column elements.

Returns

String that represents the encoded value of the composite textkey.

Examples

begin 
ctx_doc.gist('newsindex',CTX_DOC.PKENCODE('smith', 14), 'CTX_GIST');
end;

In this example, smith and 14 constitute the composite textkey value for the document.

POLICY_FILTER

Generates a plain text or an HTML version of a document. With this procedure, no CONTEXT index is required.

This procedure uses a trusted callout.

Syntax

ctx_doc.policy_filter(policy_name    in  VARCHAR2,
                      document       in [VARCHAR2|CLOB|BLOB|BFILE],
                      restab         in out nocopy CLOB,
                      plaintext      in BOOLEAN default FALSE);

policy_name: Specify the policy name created with CTX_DDL.CREATE_POLICY. Using an index name will result in an error.
document: Specify the document to filter.
restab: Specify the name of the result table.
plaintext: Specify TRUE to generate a plaintext version of the document. Specify FALSE to generate an HTML version of the document if you are using the INSO filter or indexing HTML documents.

POLICY_GIST

Generates a Gist or theme summary for document.You can generate paragraph-level or sentence-level gists or theme summaries. With this procedure, no CONTEXT index is required.

Syntax

ctx_doc.policy_gist(policy_name      in VARCHAR2,
                    document         in [VARCHAR2|CLOB|BLOB|BFILE],
                    restab           in out nocopy CLOB,
                    glevel           in VARCHAR2 default 'P',
                    pov              in VARCHAR2 default 'GENERIC',
                    numParagraphs    in VARCHAR2 default NULL,
                    maxPercent       in NUMBER default NULL,
                    num_themes       in NUMBER default 50);

policy_name: Specify the policy name created with CTX_DDL.CREATE_POLICY. Using an index name will result in an error.
document: Specify the document for which to generate the Gist or theme summary.
restab: Specify the name of the result table.
glevel: Specify the type of gist or theme summary to produce. The possible values are:

P for paragraph
S for sentence

The default is P.

pov

Specify whether a gist or a single theme summary is generated. The type of gist or theme summary generated (sentence-level or paragraph-level) depends on the value specified for glevel.

To generate a gist for the entire document, specify a value of 'GENERIC' for pov. To generate a theme summary for a single theme in a document, specify the theme as the value for pov.

When using result table storage and you do not specify a value for pov, this procedure returns the generic gist plus up to fifty theme summaries for the document.

Note:

Only the themes generated by THEMES for a document can be used as input for pov.

numParagraphs

Specify the maximum number of document paragraphs (or sentences) selected for the document gist or theme summaries. The default is 16.

Note:

The numParagraphs parameter is used only when this parameter yields a smaller gist or theme summary size than the gist or theme summary size yielded by the maxPercent parameter.

This means that the system always returns the smallest size gist or theme summary.

maxPercent

Note:

The maxPercent parameter is used only when this parameter yields a smaller gist or theme summary size than the gist or theme summary size yielded by the numParagraphs parameter.

This means that the system always returns the smallest size gist or theme summary.

num_themes

Specify the number of theme summaries to produce when you do not specify a value for pov. For example, if you specify 10, this procedure returns the top 10 theme summaries. The default is 50.

If you specify 0 or NULL, this procedure returns all themes in a document. If the document contains more than 50 themes, only the top 50 themes show conceptual hierarchy.

POLICY_HIGHLIGHT

Generates plain text or HTML highlighting offset information for a document.With this procedure, no CONTEXT index is required.

The offset information is generated for the terms in the document that satisfy the query you specify. These highlighted terms are either the words that satisfy a word query or the themes that satisfy an ABOUT query.

You can generate highlight offsets for either plaintext or HTML versions of the document. You can apply the offset information to the same documents filtered with CTX_DOC.FILTER .

Syntax

ctx_doc.policy_highlight(policy_name  in  VARCHAR2,
                         document     in  [VARCHAR2|CLOB|BLOB|BFILE],
                         text_query   in VARCHAR2,
                         restab       in out nocopy highlight_tab,
                         plaintext    in boolean FALSE);

policy_name

Specify the policy name created with CTX_DDL.CREATE_POLICY. Using an index name will result in an error.

document

Specify the document to generate highlighting offset information.

text_query

Specify the original query expression used to retrieve the document. If NULL, no highlights are generated.

If text_query includes wildcards, stemming, or fuzzy matching which result in stopwords being returned, this procedure does not highlight the stopwords.

If text_query contains the threshold operator, the operator is ignored. This procedure always returns highlight information for the entire result set.

restab

Specify the name of the result table. The table must exist before you call this procedure.

See Also:

see "Highlight Table" in Appendix A, " Result Tables" for more information about the structure of the highlight result table.

plaintext

Specify TRUE to generate a plaintext offsets of the document.

Specify FALSE to generate HTML offsets of the document if you are using the INSO filter or indexing HTML documents.

POLICY_MARKUP

Generates plain text or HTML version of a document with query terms highlighted.With this procedure, no CONTEXT index is required.

The CTX_DOC.POLICY_MARKUP procedure takes a query specification and a document and returns a version of the document in which the query terms are marked up. These marked-up terms are either the words that satisfy a word query or the themes that satisfy an ABOUT query.

You can set the marked-up output to be either plaintext or HTML.

You can use one of the pre-defined tagsets for marking highlighted terms, including a tag sequence that enables HTML navigation.

Syntax

ctx_doc.policy_markup(policy_name     in VARCHAR2,
                      document        in [VARCHAR2|CLOB|BLOB|BFILE],
                      text_query      in VARCHAR2,
                      restab          in out nocopy CLOB,
                      plaintext       in BOOLEAN default FALSE,
                      tagset          in VARCHAR2 default 'TEXT_DEFAULT',
                      starttag        in VARCHAR2 default NULL,
                      endtag          in VARCHAR2 default NULL,
                      prevtag         in VARCHAR2 default NULL,
                      nexttag         in VARCHAR2 default NULL);

policy_name

Specify the policy name created with CTX_DDL.CREATE_POLICY. Using an index name will result in an error.

document

Specify the document to generate highlighting offset information.

text_query

Specify the original query expression used to retrieve the document. If NULL, no highlights are generated.

If text_query includes wildcards, stemming, or fuzzy matching which result in stopwords being returned, this procedure does not highlight the stopwords.

If text_query contains the threshold operator, the operator is ignored. This procedure always returns highlight information for the entire result set.

restab

Specify the name of the result table. The table must exist before you call this procedure.

See Also:

see "Markup Table" in Appendix A, " Result Tables" for more information about the structure of the highlight result table.

plaintext

Specify TRUE to generate plaintext marked-up document. Specify FALSE to generate a marked-up HTML version of document if you are using the INSO filter or indexing HTML documents.

tagset

Specify one of the following pre-defined tagsets. The second and third columns show how the four different tags are defined for each tagset:

Tagset	Tag	Tag Value
`TEXT_DEFAULT`	starttag	`<<<`
	endtag	`>>>`
	prevtag
	nexttag
`HTML_DEFAULT`	starttag	`<B>`
	endtag	`</B>`
	prevtag
	nexttag
`HTML_NAVIGATE`	starttag	`<A NAME=ctx%CURNUM><B>`
	endtag	`</B></A>`
	prevtag	`<A HREF=#ctx%PREVNUM><</A>`
	nexttag	`<A HREF=#ctx%NEXTNUM>></A>`

starttag

Specify the character(s) inserted by MARKUP to indicate the start of a highlighted term.

The sequence of starttag, endtag, prevtag and nexttag with regard to the highlighted word is as follows:

... prevtag starttag word endtag nexttag...

endtag

Specify the character(s) inserted by MARKUP to indicate the end of a highlighted term.

prevtag

Specify the markup sequence that defines the tag that navigates the user to the previous highlight.

In the markup sequences prevtag and nexttag, you can specify the following offset variables which are set dynamically:

Offset Variable	Value
`%CURNUM`	the current offset number
`%PREVNUM`	the previous offset number
`%NEXTNUM`	the next offset number

See the description of the HTML_NAVIGATE tagset for an example.

nexttag

Specify the markup sequence that defines the tag that navigates the user to the next highlight tag.

Within the markup sequence, you can use the same offset variables you use for prevtag. See the explanation for prevtag and the HTML_NAVIGATE tagset for an example.

POLICY_THEMES

Generates a list of themes for a document. With this procedure, no CONTEXT index is required.

Syntax

ctx_doc.policy_themes(policy_name    in VARCHAR2, 
                              document       in [VARCHAR2|CLOB|BLOB|BFILE],
                      restab         in out nocopy theme_tab,
                      full_themes    in BOOLEAN default FALSE,
                      num_themes     in number    default 50);

policy_name

Specify the policy you create with CTX_DDL.CREATE_POLICY. Using an index name will result in an error.

document

Specify the document for which to generate a list of themes.

restab

Specify the name of the result table.

See Also:

"Theme Table" in Appendix A, " Result Tables" for more information about the structure of the theme result table.

full_themes

Specify whether this procedure generates a single theme or a hierarchical list of parent themes (full themes) for each document theme.

Specify TRUE for this procedure to write full themes to the THEME column of the result table.

Specify FALSE for this procedure to write single theme information to the THEME column of the result table. This is the default.

num_themes

Specify the maximum number of themes to retrieve. For example, if you specify 10, up to first 10 themes are returned for the document. The default is 50.

If you specify 0 or NULL, this procedure returns all themes in a document. If the document contains more than 50 themes, only the first 50 themes show conceptual hierarchy.

Example

Create a policy:

exec ctx_ddl.create_policy('mypolicy');

Run themes:

declare
  la      varchar2(200);
  rtab    ctx_doc.theme_tab;
begin
   ctx_doc.policy_themes('mypolicy', 
                    'To define true madness, What is''t but to be nothing but mad?', rtab);
   for i in 1..rtab.count loop
     dbms_output.put_line(rtab(i).theme||':'||rtab(i).weight);
   end loop;
end;

POLICY_TOKENS

Generate all index tokens for document.With this procedure, no CONTEXT index is required.

Syntax

ctx_doc.policy_tokens(policy_name    in  VARCHAR2,
                      document       in  [VARCHAR2|CLOB|BLOB|BFILE],
                      restab         in out nocopy token_tab);

policy_name

Specify the policy name created with CTX_DDL.CREATE_POLICY. Using an index name will result in an error.

document

Specify the document for which to generate tokens.

restab

Specify the name of the result table.

The tokens returned are those tokens which are inserted into the index for the document. Stop words are not returned. Section tags are not returned because they are not text tokens.

Token tables can be named anything, but must include the following columns, with names and data types as follows.

Table 8-1 Required Columns for Token Tables

Column Name	Type	Description
`QUERY_ID`	`NUMBER`	The identifier for the results generated by a particular call to `CTX_DOC.TOKENS` (only populated when table is used to store results from multiple `TOKEN` calls)
`TOKEN`	`VARCHAR2(64)`	The token string in the text.
`OFFSET`	`NUMBER`	The position of the token in the document, relative to the start of document which has a position of 1.
`LENGTH`	`NUMBER`	The character length of the token.

Example

Get tokens:

declare
  la     varchar2(200);
  rtab   ctx_doc.token_tab;
begin
   ctx_doc.policy_tokens('mypolicy', 
                    'To define true madness, What is''t but to be nothing but mad?', 
                    rtab);
   for i in 1..rtab.count loop
     dbms_output.put_line(rtab(i).offset||':'||rtab(i).token);
   end loop;
end;

SET_KEY_TYPE

Use this procedure to set the CTX_DOC procedures to accept either the ROWID or the PRIMARY_KEY document identifiers. This setting affects the invoking session only.

Syntax

ctx_doc.set_key_type(key_type in varchar2);

key_type

Specify either ROWID or PRIMARY_KEY as the input key type (document identifier) for CTX_DOC procedures.

This parameter defaults to the value of the CTX_DOC_KEY_TYPE system parameter.

Note:

When your base table has no primary key, setting key_type to PRIMARY_KEY is ignored. The textkey parameter you specify for any CTX_DOC procedure is interpreted as a ROWID.

Example

To set CTX_DOC procedures to accept primary key document identifiers, do the following:

begin
ctx_doc.set_key_type('PRIMARY_KEY');
end

THEMES

Use the CTX_DOC.THEMES procedure to generate a list of themes for a document. You can store each theme as a row in either a result table or an in-memory PL/SQL table you specify.

Syntax 1: In-Memory Table Storage

CTX_DOC.THEMES(

index_name      IN VARCHAR2,
textkey         IN VARCHAR2,
restab          IN OUT NOCOPY THEME_TAB,
full_themes     IN BOOLEAN DEFAULT FALSE,
num_themes       IN NUMBER DEFAULT 50);

Syntax 2: Result Table Storage

CTX_DOC.THEMES(

index_name      IN VARCHAR2,
textkey         IN VARCHAR2,
restab          IN VARCHAR2,
query_id        IN NUMBER DEFAULT 0,
full_themes     IN BOOLEAN DEFAULT FALSE,
num_themes       IN NUMBER DEFAULT 50);

index_name

Specify the name of the index for the text column.

textkey

Specify the unique identifier (usually the primary key) for the document.

The textkey parameter can be one of the following:

a single column primary key value
an encoded specification for a composite (multiple column) primary key. When textkey is a composite key, you must encode the composite textkey string using the CTX_DOC.PKENCODE procedure.
the rowid of the row containing the document

You toggle between primary key and rowid identification using CTX_DOC.SET_KEY_TYPE.

restab

You can specify that this procedure store results to either a table or to an in-memory PL/SQL table.

To store results in a table, specify the name of the table.

See Also:

"Theme Table" in Appendix A, " Result Tables" for more information about the structure of the theme result table.

To store results in an in-memory table, specify the name of the in-memory table of type THEME_TAB. The THEME_TAB datatype is defined as follows:

type theme_rec is record (
   theme varchar2(2000),
   weight number
);

type theme_tab is table of theme_rec index by binary_integer;

CTX_DOC.THEMES clears the THEME_TAB you specify before the operation.

query_id

Specify the identifier used to identify the row(s) inserted into restab.

full_themes

Specify whether this procedure generates a single theme or a hierarchical list of parent themes (full themes) for each document theme.

Specify TRUE for this procedure to write full themes to the THEME column of the result table.

Specify FALSE for this procedure to write single theme information to the THEME column of the result table. This is the default.

num_themes

Specify the maximum number of themes to retrieve. For example, if you specify 10, up to first 10 themes are returned for the document. The default is 50.

If you specify 0 or NULL, this procedure returns all themes in a document. If the document contains more than 50 themes, only the first 50 themes show conceptual hierarchy.

Examples

In-Memory Themes

The following example generates the first 10 themes for document 1 and stores them in an in-memory table called the_themes. The example then loops through the table to display the document themes.

declare
 the_themes ctx_doc.theme_tab;

begin
 ctx_doc.themes('myindex','1',the_themes, numthemes=>10);
 for i in 1..the_themes.count loop
  dbms_output.put_line(the_themes(i).theme||':'||the_themes(i).weight);
  end loop;
end;

Theme Table

The following example creates a theme table called CTX_THEMES:

create table CTX_THEMES (query_id number, 
                         theme varchar2(2000), 
                         weight number);

Single Themes

To obtain a list of up to the first 20 themes where each element in the list is a single theme, issue a statement like the following:

begin

ctx_doc.themes('newsindex','34','CTX_THEMES',1,full_themes => FALSE, 
 num_themes=> 20);

end;

Full Themes

To obtain a list of the top 20 themes where each element in the list is a hierarchical list of parent themes, issue a statement like the following:

begin

ctx_doc.themes('newsindex','34','CTX_THEMES',1,full_themes => TRUE,      num_themes=>20);

end;

TOKENS

Use this procedure to identify all text tokens in a document. The tokens returned are those tokens which are inserted into the index. This feature is useful for implementing document classification, routing, or clustering.

Stopwords are not returned. Section tags are not returned because they are not text tokens.

Syntax 1: In-Memory Table Storage

CTX_DOC.TOKENS(index_name      IN VARCHAR2,
               textkey         IN VARCHAR2,
               restab          IN OUT NOCOPY TOKEN_TAB);

Syntax 2: Result Table Storage

CTX_DOC.TOKENS(index_name      IN VARCHAR2,
               textkey         IN VARCHAR2,
               restab          IN VARCHAR2,
               query_id        IN NUMBER DEFAULT 0);

index_name

Specify the name of the index for the text column.

textkey

Specify the unique identifier (usually the primary key) for the document.

The textkey parameter can be one of the following:

a single column primary key value
encoded specification for a composite (multiple column) primary key. To encode a composite textkey, use the CTX_DOC.PKENCODE procedure.
the rowid of the row containing the document

You toggle between primary key and rowid identification using CTX_DOC.SET_KEY_TYPE.

restab

You can specify that this procedure store results to either a table or to an in-memory PL/SQL table.

The tokens returned are those tokens which are inserted into the index for the document (or row) named with textkey. Stop words are not returned. Section tags are not returned because they are not text tokens.

Specifying a Token Table

To store results to a table, specify the name of the table. Token tables can be named anything, but must include the following columns, with names and data types as specified.

Table 8-2 Required Columns for Token Tables

Column Name	Type	Description
`QUERY_ID`	`NUMBER`	The identifier for the results generated by a particular call to `CTX_DOC.TOKENS` (only populated when table is used to store results from multiple `TOKEN` calls)
`TOKEN`	`VARCHAR2(64)`	The token string in the text.
`OFFSET`	`NUMBER`	The position of the token in the document, relative to the start of document which has a position of 1.
`LENGTH`	`NUMBER`	The character length of the token.

Specifying an In-Memory Table

To store results to an in-memory table, specify the name of the in-memory table of type TOKEN_TAB. The TOKEN_TAB datatype is defined as follows:

type token_rec is record (

token varchar2(64),
offset number,
length number

);

type token_tab is table of token_rec index by binary_integer;

CTX_DOC.TOKENS clears the TOKEN_TAB you specify before the operation.

query_id: Specify the identifier used to identify the row(s) inserted into restab.

Examples

In-Memory Tokens

The following example generates the tokens for document 1 and stores them in an in-memory table, declared as the_tokens. The example then loops through the table to display the document tokens.

declare
 the_tokens ctx_doc.token_tab;

begin
 ctx_doc.tokens('myindex','1',the_tokens);
 for i in 1..the_tokens.count loop
  dbms_output.put_line(the_tokens(i).token);
  end loop;
end;