Oracle8
ConText Cartridge Application Developer's Guide
Release 2.4 A63821-01
This chapter explains how to use ConText to create query expressions to find relevant text in documents.
A query expression defines the search criteria for retrieving
documents using ConText. A query expression consists of query terms (words
and phrases) and other components such as operators and special characters
which allow users to specify exactly which documents are retrieved by ConText.
A query expression can also call stored query expressions
(SQEs) to return stored query results or call PL/SQL functions to return
values used in the query.
When a query is executed using any of the methods supported
by ConText, one of the arguments included in the query is a query expression.
ConText then returns a list of all the documents that satisfy the search
criteria, as well as scores that measure the relevance of the document
to the search criteria.
Query terms can consist of words and phrases. Query terms
can also contain stopwords.
The words in a query expression are the individual tokens
on which the query expression operators perform an action. If multiple
words are contained in a query expression, separated only by blank spaces
(no operators), the string of words is considered a phrase and the entire
string is searched for during a query.
Stopwords are common words, such as and, the,
of, and to, that are not considered significant query terms
by themselves because they occur so often in text. However, stopwords can
provide useful search information when combined with more significant terms.
For example, a query for documents containing the phrase
"peanut butter and jelly" returns different results than a query for
documents containing the two terms "peanut butter" and "jelly".
When you define a policy for a column, ConText lets you identify
a list of stopwords. When stopwords are encountered in the documents in
the column, they are not included as indexed terms in the text index; however,
they are recorded.
As a result, stopwords cannot be searched for explicitly
in text queries, but can be included as part of a phrase in a query expression.
See Also: For more information about querying with stopwords, see "Querying with Stopwords" in this chapter.
Stoplists can be created in any language supported by ConText.
ConText provides a default stoplist in English.
Note: Stopwords do not have an effect on the theme indexes generated by ConText for your English-language documents.
In addition to query terms, a query expression may contain
any or all of the following components:
ConText supports case-insensitivity for text queries and
case-sensitivity for both text and theme queries.
With text queries, you can issue case-sensitive and case-insensitive
queries. The ability to query in a case-sensitive way depends on the lexer
preference used to index the document set.
By default, ConText uses a lexer preference that is not case-sensitive
when indexing documents. Therefore, with a policy containing the default
lexer preference, queries are not case-sensitive. When queries are not
case-sensitive, a query on United returns the same hits as a query
on united.
To issue case-sensitive text queries, you or your ConText
administrator must first index your document set using a policy with a
case-sensitive lexer preference. Using the same policy, you can issue case-sensitive
queries. With case-sensitive queries, a query on United is different
from a query on united.
Case-sensitive querying helps to identify words that have
different meaning when capitalized. For example, to query on the proper
noun Church (as someone's name) without getting the hits for the
common noun church, you issue Church as your query. ConText
returns all appearances of Church.
When you have case-sensitivity enabled, searches on stopwords
are also case-sensitive. Thus when you issue a case-sensitive query on
a phrase containing stopwords and non-stopwords, ConText searches for the
phrase containing the stopwords with the specified case.
For example, assuming the word on is a stopword and
case-sensitivity is enabled, a search on the phrase on the waterfront
does not return hits for documents containing the phrase On the waterfront.
Theme queries are case-sensitive. For example, a query on
Turkey produces hits on Turkey the country and not Turkey
the bird.
See Also: For more information about case-sensitive theme queries, see Chapter 4, "Theme Queries".
German and Dutch language text contains composite words.
With ConText, you can create a composite index and subsequently issue queries
to search for composite words using a subcomposite word as your query term.
To query against a composite index, you specify the policy
associated with the composite index with two-step or in-memory queries.
For one-step queries, you must specify the policy if the text column has
more than one index attached to it.
See Also: For more information about creating a composite index for German, see Oracle8 ConText Cartridge Administrator's Guide.
When using a German composite index, a query on the term
Bahnhof (train station) returns documents that contain Bahnhof
or any word containing Bahnhof as a sub-composite, such as Hauptbahnhof,
Nordbahnhof, or Ostbahnhof.
However, a query on Bahnhof does not return documents
that contain the single words Bahn or Hof.
When using a Dutch composite index, a query on the term kapitein
returns documents that contain kapitein or any word containing kapitein
as a sub-composite, such as scheepskapitein.
You can use text highlighting with composite word queries.
When you do so, ConText highlights the entire composite word, not just
the sub-composite you entered as your query.
For example, when you issue Bahnhof as your query,
ConText highlights the words Hauptbahnhof, Nordbahnhof, and
Ostbahnhof entirely.
See Also: For more information on highlighting text queries, see Chapter 6, "Document Presentation: Highlighting".
For languages that use an 8-bit character set, such as French
and Spanish, ConText gives you the option of converting characters to their
base-letter representation before text indexing. This means that words
with tildes, accents, umlauts, and so on are converted to their base-letter
representation before their tokens are placed in the text index.
When you specify a text index that has used base-letter conversion
in a query, ConText converts the term in the query expression to match
the base-letter representation before the query is processed.
The result is that with base-letter conversion on
for a Spanish text index, a query on mañana returns documents
that contain mañana and manana.
However, with base-letter conversion off for a Spanish
text index, a query on mañana returns documents that contain
only mañana.
In addition, all expansion and stopword checking for the
query is performed on the base-letter terms.
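The effect of base-letter conversion on matching can be sketched in Python. This is only an illustration under assumptions: the function names are hypothetical, and Unicode decomposition stands in for whatever conversion table ConText actually applies.

```python
import unicodedata

def to_base_letters(term):
    """Drop accents, tildes, and umlauts by removing combining marks.

    Assumption: Unicode NFD decomposition approximates base-letter
    conversion; the real ConText mapping may differ.
    """
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def term_matches(query_term, document_word, base_letter_on):
    """Compare a query term with a document word, with or without conversion."""
    if base_letter_on:
        # Both the indexed token and the query term are converted first.
        return to_base_letters(query_term) == to_base_letters(document_word)
    return query_term == document_word

print(to_base_letters("mañana"))
print(term_matches("mañana", "manana", base_letter_on=True))
print(term_matches("mañana", "manana", base_letter_on=False))
```

With conversion on, the query term and the document word are compared in base-letter form, so both spellings match; with conversion off, only the exact spelling matches.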
See Also: For more information about creating an index that supports base-letter conversion, see Oracle8 ConText Cartridge Administrator's Guide.
The terms in a thesaural query are not converted to
base-letter representation before look-up in the thesaurus. The base-letter
conversion takes place after the thesaurus look-up and is performed on
all the terms returned by the thesaurus.
The following example of a one-step query returns all articles that contain the word wine in the TEXTCOL column of the TEXTTAB table. The query expression consists only of the query term wine, surrounded by single quotes.
SELECT articles FROM texttab WHERE CONTAINS(textcol, 'wine') > 0;
The following example of a one-step query returns all articles that contain the phrase wine and roses in the TEXTCOL column of the TEXTTAB table. The query expression consists of the query phrase wine and roses, enclosed in braces so that and is read as part of the phrase rather than as the AND operator, and surrounded by single quotes.
SELECT articles FROM texttab WHERE CONTAINS(textcol, '{wine and roses}') > 0;
See Also: For more information about the CONTAINS function used in one-step queries, see CONTAINS in Chapter 9.
Logical operators combine the terms in a query expression.
All single words and phrases may be combined with logical operators. When
query terms are combined, the number of spaces around the logical operator
is not significant.
Logical operators link query terms together to produce scores
that are based on the relationship of the terms to each other. The logical
operators combine the scores of their operands up to a maximum value of
100. Operands can be any query terms, as well as other operators.
Use the AND operator to search for documents that contain at least one occurrence of each of the query terms. For example, to obtain all the documents that contain the terms batman and robin and penguin, issue the following query:
'batman & robin & penguin'
In an AND query, the score returned is the score of the lowest
query term. In the example above, if the three individual scores for the
terms batman, robin, and penguin are 10, 20, and 30
within a document, the document scores 10.
Use the OR operator to search for documents that contain at least one occurrence of any of the query terms. For example, to obtain the documents that contain the term cats or the term dogs, use one of the following:
'cats | dogs'
'cats OR dogs'
In an OR query, the score returned is the score for the highest
query term. In the example above, if the scores for cats and dogs
are 30 and 40 within a document, the document scores 40.
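The AND and OR scoring rules described above can be modeled in a few lines of Python. This is a sketch of the stated rules only, not of how ConText computes scores internally; the function names are hypothetical, and a score of 0 is assumed to mean that a term does not occur in the document.

```python
def and_score(term_scores):
    """AND: the document gets the score of its lowest-scoring term.

    Assumption: an absent term scores 0, so the document scores 0
    and is not a hit.
    """
    return min(term_scores)

def or_score(term_scores):
    """OR: the document gets the score of its highest-scoring term."""
    return max(term_scores)

# 'batman & robin & penguin' with term scores 10, 20, and 30
print(and_score([10, 20, 30]))
# 'cats | dogs' with term scores 30 and 40
print(or_score([30, 40]))
```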
Use the NOT operator to search for documents that contain
one query term and not another.
For example, to obtain the documents that contain the term animals but not dogs, use the following expression:
'animals ~ dogs'
Similarly, to obtain the documents that contain the term transportation but not automobiles or trains, use the following expression:
'transportation not (automobiles or trains)'
Note: The NOT operator does not affect the scoring produced by the other logical operators.
Use the equivalence operator to specify an acceptable substitution for a word in a search. For example, if you want all the documents that contain the phrase alsatians are big dogs or labradors are big dogs, you can write:
'labradors=alsatians are big dogs'
ConText processes the above query faster and more efficiently than the same query written with the accumulate operator. For example, you could write the above query less efficiently and less concisely as follows:
'labradors are big dogs, alsatians are big dogs'
The savings you gain in using the equivalence operator over the accumulate operator is most significant when you have more than one equivalence operator in the query expression. For example, the following query
'labradors=alsatians are big canines=dogs'
is a more efficient, more concise form of:
'labradors are big dogs, alsatians are big dogs, alsatians are big canines, labradors are big canines'
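To see why several equivalence operators save so much writing, the expansion can be enumerated in Python. This is a hypothetical sketch: it handles only single-word alternatives separated by blank spaces, and it shows the equivalent accumulate query, not how ConText evaluates the expression.

```python
from itertools import product

def expand_equivalence(expression):
    """List every phrase covered by an expression that uses '='.

    Each blank-separated word may offer alternatives joined by '=';
    the result is one phrase per combination of alternatives.
    """
    slots = [word.split("=") for word in expression.split()]
    return [" ".join(choice) for choice in product(*slots)]

phrases = expand_equivalence("labradors=alsatians are big canines=dogs")
print(", ".join(phrases))
```

Two equivalence operators with two alternatives each already stand for four phrases; each additional operator multiplies the count again.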
The equivalence operator has higher precedence than all other
operators except the unary operators (fuzzy, soundex, stem, and PL/SQL
function calls).
You can use the WITHIN operator to narrow a query down into document sections. Document sections can be sentences, paragraphs, or user-defined sections.
The syntax for the WITHIN operator is as follows:
Querying within sentence or paragraph boundaries is useful
to find combinations of words that occur in the same sentence or paragraph.
To find documents that contain dog and cat within the same sentence:
'(dog and cat) WITHIN SENTENCE'
To find documents that contain dog and cat within the same paragraph:
'(dog and cat) WITHIN PARAGRAPH'
To find documents that contain sentences with the word dog but not cat:
'(dog not cat) WITHIN SENTENCE'
Use the WITHIN operator to narrow down a query into user-defined
document sections.
For example, in an HTML document set, you or your ConText
administrator can define a section for all headings delimited with <HEAD>
and </HEAD> and subsequently issue a query for a term in a heading
across all documents.
Note: The WITHIN operator requires you to know the name of the section you wish to search. A list of defined sections can be obtained using the CTX_ALL_SECTIONS or CTX_USER_SECTIONS views.
See Also: For more information about defining sections, see the Oracle8 ConText Cartridge Administrator's Guide.
To find all the documents that contain the term San Francisco within the user-defined section Headings, write your query as follows:
'San Francisco WITHIN Headings'
To find all the documents that contain the term sailing and contain the term San Francisco within the user-defined section Headings, write your query in one of two ways:
'(San Francisco WITHIN Headings) and sailing'
'sailing and San Francisco WITHIN Headings'
To find all documents that contain the terms dog and cat within the same user-defined section Headings, write your query as follows:
'(dog and cat) WITHIN Headings'
Note that the above query is logically different from:
'dog WITHIN Headings and cat WITHIN Headings'
which finds all documents that contain dog and cat
where the terms dog and cat are in Headings sections,
regardless of whether they occur in the same Headings section or
different sections.
To find all documents in which dog is near cat within the section Headings, write your query as follows:
'dog near cat WITHIN Headings'
The WITHIN operator has the following limitations:
Score changing operators behave like logical operators in
that they return documents given the terms you specify. However, these
operators affect document scores differently and, as such, can be used
to change a document's rank in a hitlist with respect to a query term.
The score-changing operators are accumulate (,), minus (-), and weight (*).
Use the accumulate operator to search for documents that
contain at least one occurrence of any of the query terms, where
the documents that contain the most frequent occurrences of the query terms
are given the highest score.
For example, to search for documents that contain either term Brazil or soccer and to have the highest scores attached to the documents that contain the most occurrences of these words, you can issue:
'soccer,Brazil'
Accumulate is similar to OR, in the sense that a document
satisfies the query expression if any of the terms occur in the document;
however, the scoring is different. OR returns a score based only
on the query term that occurs most frequently in a document. Accumulate
combines the scores for all the query terms that occur in a document, topping
out at 100 when the sum exceeds 100. Thus documents that contain the most
query terms are ranked the highest.
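The difference between accumulate and OR scoring can be shown with a small Python model. It is a sketch of the rule stated above (sum the term scores, capped at 100), with hypothetical function names, and is not the ConText implementation.

```python
def accumulate_score(term_scores):
    """Accumulate: add the scores of all terms, topping out at 100."""
    return min(100, sum(term_scores))

def or_score(term_scores):
    """OR, for comparison: only the highest-scoring term counts."""
    return max(term_scores)

scores = [30, 40]  # hypothetical scores for soccer and Brazil in one document
print(accumulate_score(scores), or_score(scores))
print(accumulate_score([60, 70]))  # the sum exceeds 100, so the cap applies
```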
Use the MINUS operator to search for documents that contain
a query term, and when you want the presence of a second query term to
cause the document to be ranked lower.
The minus operator is useful for lowering the score of documents that contain "noise". For example, suppose a query on the term cars always returned high scoring documents about Ford cars. You can lower the scoring of the Ford documents by using the expression:
'cars - Ford'
In essence, this expression returns the documents that contain the term cars. However, the score returned for a document is the number of occurrences of cars minus the number of occurrences of Ford. When a returned document does not contain Ford, the occurrence of the term Ford is counted as zero.
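As a rough illustration of the rule just described, the following Python sketch counts occurrences and subtracts. The function name and the sample texts are hypothetical, matching is assumed to be case-insensitive (as with the default lexer), and any further scaling ConText applies to the result is not modeled.

```python
import re

def minus_score(text, wanted, unwanted):
    """Occurrences of the wanted term minus occurrences of the unwanted term."""
    words = re.findall(r"\w+", text.lower())
    return words.count(wanted.lower()) - words.count(unwanted.lower())

# Hypothetical documents scored against 'cars - Ford'
print(minus_score("Ford cars and other cars", "cars", "Ford"))
print(minus_score("Fast cars, old cars, new cars", "cars", "Ford"))
```

The second document does not mention Ford, so its count for Ford is zero and it ranks above the first.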
The weight operator multiplies the score by the given
factor, topping out at 100 when the product exceeds 100. For example, the
query 'cat, dog*2' sums the score of cat with twice the score
of dog, topping out at 100 when the score is greater than 100.
In expressions that contain more than one query term, use
the weight operator to adjust the relative scoring of the query terms.
You can reduce the score of a query term by using the weight operator with
a number less than 1; you can increase the score of a query term by using
the weight operator with a number greater than 1 and less than 10.
The weight operator is useful in accumulate, OR, or AND queries
when the expression has more than one query term. With no weighting on
individual terms, the score cannot tell you which of the query terms occurs
the most. If you are interested in documents that contain a particular
query term more than another term, the overall ranking tells you nothing
about which documents pertain to the term that you are most interested
in.
You have a collection of sports articles. You are interested in the articles about soccer, in particular Brazilian soccer. It turns out that a regular query on soccer, Brazil returns many high ranking articles on US soccer. To raise the ranking of the articles on Brazilian soccer, you can issue the following query:
'soccer, Brazil*3'
Table 3-1 illustrates how the weight operator can change the ranking of three hypothetical documents A, B, and C, which all contain information about soccer. The columns in the table show the total score of four different query expressions on the three documents.
Document | soccer | Brazil | soccer,Brazil | soccer,Brazil*3 |
---|---|---|---|---|
A | 20 | 10 | 30 | 50 |
B | 10 | 30 | 40 | 100 |
C | 50 | 10 | 60 | 80 |
The score in the third column containing the query soccer,
Brazil is the sum of the scores in the first two columns. The score
in the fourth column containing the query soccer,Brazil*3 is the
sum of the score of the first column soccer plus three times the
score of the second, Brazil.
With the initial query of soccer,Brazil, the documents
are ranked in the order C B A. With the query of soccer,Brazil*3,
the documents are ranked B C A, which is the preferred ranking.
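The figures in Table 3-1 can be reproduced with a short Python sketch. The data structure and function names are hypothetical; the arithmetic follows the rules stated in this section (each weighted score and the accumulated sum both top out at 100).

```python
# Term scores for the three hypothetical documents in Table 3-1
term_scores = {
    "A": {"soccer": 20, "Brazil": 10},
    "B": {"soccer": 10, "Brazil": 30},
    "C": {"soccer": 50, "Brazil": 10},
}

def weighted_accumulate(scores, weights):
    """Sum the weighted term scores, capping each product and the sum at 100."""
    total = sum(min(100, scores[term] * weight) for term, weight in weights.items())
    return min(100, total)

def rank(weights):
    """Return the documents ordered best first, plus their total scores."""
    totals = {doc: weighted_accumulate(s, weights) for doc, s in term_scores.items()}
    return sorted(totals, key=totals.get, reverse=True), totals

print(rank({"soccer": 1, "Brazil": 1}))  # 'soccer,Brazil'
print(rank({"soccer": 1, "Brazil": 3}))  # 'soccer,Brazil*3'
```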
Use the near operator to have ConText return a score based on the proximity of two or more query terms. ConText returns higher scores for terms closer together and lower scores for terms farther apart in a document.
Note: The NEAR operator works with only text queries. You cannot use NEAR with theme queries.
The syntax for the near operator is as follows:
OPERATOR | SYNTAX |
---|---|
NEAR | NEAR((word1, word2,..., wordn) [, MAX_SPAN [, ORDER]]) |
Specify the terms in the query separated by commas. The query terms can be single words or phrases.
For MAX_SPAN, optionally specify the size of the biggest clump. The default
is 100. ConText returns an error if you specify a number greater than 100.
A clump is the smallest group of words in which all query
terms occur. All clumps begin and end with a query term.
For near queries with two terms, max_span is the maximum distance allowed between the two terms. For example, to query on dog and cat where dog is within 6 words of cat, issue the following query:
'near((dog, cat), 6)'
For ORDER, specify TRUE for ConText to search for terms in the order
you specify. The default is FALSE.
For example, to search for the words monday, tuesday, and wednesday in that order with a maximum clump size of 20, issue the following query:
'near((monday, tuesday, wednesday), 20, TRUE)'
Note: To specify ORDER, you must always specify a number for the MAX_SPAN parameter.
ConText might return different scores for the same document when you use identical query expressions that have the ORDER flag set differently. For example, ConText might return different scores for the same document when you issue the following queries:
'near((dog, cat), 50, FALSE)'
'near((dog, cat), 50, TRUE)'
The scoring for the near operator combines frequency of the
terms with proximity of terms. For each document that satisfies the query,
ConText returns a score between 1 and 100 that is proportional to the number
of clumps in the document and inversely proportional to the average size
of the clumps. This means many small clumps in a document result in higher
scores, since small clumps imply closeness of terms.
The number of terms in a query also affects score. Queries
with many terms, such as seven, generally need fewer clumps in a document
to score 100 than do queries with few terms, such as two.
A clump is the smallest group of words in which all query
terms occur. All clumps begin and end with a query term. You can define
clump size with the max_span parameter as described in this section.
You can use the near operator with other operators such as
AND and OR. Scores are calculated in the regular way.
For example, to find all documents that contain the terms tiger, lion, and cheetah where the terms lion and tiger are within 10 words of each other, issue the following query.
'near((lion, tiger), 10) AND cheetah'
The score returned for each document is the lower score of
the near operator and the term cheetah.
You can also use the equivalence operator to substitute a single term in a near query:
'near((stock crash, Japan=Korea), 20)'
This query asks for all documents that contain the phrase
stock crash within twenty words of Japan or Korea.
You can write near queries using the syntax of ConText release 2.3.6 and before. For example, to find all documents where lion occurs near tiger, you can write:
'lion near tiger'
or with the semi-colon as follows:
'lion;tiger'
This query is equivalent to the following query:
'near((lion, tiger), 100, FALSE)'
Note: Only the syntax of the near operator is backward compatible. In the example above, the score returned is calculated using the clump method as described in this section.
When you use highlighting and your query contains the near
operator, all occurrences of all terms in the query that satisfy the proximity
requirements are highlighted. Highlighted terms can be single words or
phrases.
For example, assume a document contains the following text:
Chocolate and vanilla are my favorite ice cream flavors. I like chocolate served in a waffle cone, and vanilla served in a cup with caramel syrup.
If the query is near((chocolate, vanilla), 100, FALSE), the following is highlighted:
<<Chocolate>> and <<vanilla>> are my favorite ice cream flavors. I like <<chocolate>> served in a waffle cone, and <<vanilla>> served in a cup with caramel syrup.
However, if the query is near((chocolate, vanilla), 4, FALSE), only the following is highlighted:
<<Chocolate>> and <<vanilla>> are my favorite ice cream flavors. I like chocolate served in a waffle cone, and vanilla served in a cup with caramel syrup.
See Also: For more information about highlighting, see Chapter 6, "Document Presentation: Highlighting".
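The clump definition used by the near operator (the smallest group of words in which all query terms occur, beginning and ending with a query term) can be explored with a Python sketch run against the ice cream example. Several things here are assumptions: the function name is hypothetical, only single-word terms and unordered matching are handled, clump size is counted as the number of words from the first to the last term inclusive, and no scoring is attempted.

```python
import re

def find_clumps(text, terms, max_span=100):
    """Return (start, end) word positions of clumps no larger than max_span."""
    tokens = re.findall(r"\w+", text.lower())
    wanted = {term.lower() for term in terms}
    clumps = []
    for start, token in enumerate(tokens):
        if token not in wanted:
            continue  # a clump must begin with a query term
        seen = set()
        for end in range(start, min(len(tokens), start + max_span)):
            if tokens[end] not in wanted:
                continue
            if end > start and tokens[end] == token:
                break  # a smaller window starts later, so this one is not minimal
            seen.add(tokens[end])
            if seen == wanted:
                clumps.append((start, end))  # the clump ends with a query term
                break
    return clumps

text = ("Chocolate and vanilla are my favorite ice cream flavors. "
        "I like chocolate served in a waffle cone, and vanilla served "
        "in a cup with caramel syrup.")

print(find_clumps(text, ["chocolate", "vanilla"], 100))
print(find_clumps(text, ["chocolate", "vanilla"], 4))
```

With a span of 100, every occurrence of both terms falls inside some clump, so all four words are highlighted; with a span of 4, only the opening pair qualifies.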
You can use the NEAR operator with the WITHIN operator for section searching as follows:
'near((dog, cat), 10) WITHIN Headings'
When evaluating expressions such as these, ConText looks
for clumps that lie entirely within the given section.
In the example above, only those clumps that contain dog
and cat that lie entirely within the section Headings are
counted. That is, if the term dog lies within Headings and
the term cat lies five words from dog, but outside of Headings,
this pair of words does not satisfy the expression and is not counted.
Use the result-set operators to control what documents are returned from a query result set. The operands for these operators are expressions, which can be an individual query term or a logical combination of query terms that use other operators.
Result set operators are typically used to exclude noise
from the hitlist (irrelevant documents) and to retrieve documents out of
a hitlist more efficiently. There are three result set operators:
threshold (>), max (:), and first/next (#).
You can use the threshold operator in two ways: at the expression level and at the query term level.
Use the expression level threshold operator to eliminate documents in the result set that score below a threshold number. For example, to search for documents that contain relational databases and to return only documents that score greater than 75, use the following expression:
'relational databases > 75'
Use the query term threshold operator in a query expression to select a document based on how a term scores in the document. For example, to select documents that have at least a score of 30 for lion and contain tiger, use:
'(lion > 30) and tiger'
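The two uses of the threshold operator can be contrasted in a small Python model. The names are hypothetical, and the sketch assumes that a score must be strictly greater than the threshold to pass and that a filtered-out term counts as a score of 0.

```python
def threshold(score, minimum):
    """Keep a score only if it exceeds the threshold; otherwise treat it as 0."""
    return score if score > minimum else 0

def expression_level(document_score):
    """'relational databases > 75': filter on the whole expression's score."""
    return threshold(document_score, 75)

def term_level(lion_score, tiger_score):
    """'(lion > 30) and tiger': filter one term, then apply AND (lowest score)."""
    return min(threshold(lion_score, 30), tiger_score)

print(expression_level(80), expression_level(60))
print(term_level(40, 20), term_level(25, 90))
```

In the term-level case, a document with a low score for lion is dropped even when tiger scores highly, because the AND takes the lower of the two scores.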
Use the max operator to retrieve a given number of the highest
scoring documents. For example, to obtain the twenty highest scoring documents
that contain the word dance, you can write:
'dance:20'
The max operator is particularly useful to prevent writing
a large number of records to the hitlist table, which could result in performance
degradation.
Note: The max operator cannot be used with the CTX_QUERY.COUNT_HITS function or with in-memory queries.
Use the first/next operator to return a specified range of
documents from the hitlist.
For example, to return the first 10 documents encountered by ConText that contain the term dog, use the following expression:
'dog#1-10'
You could then return the next 10 documents using the following expression:
'dog#11-20'
The first/next operator can be used to create an application
interface in which query results (rows in the hitlist) are returned incrementally.
Because the query results are returned incrementally, query response is
generally faster. The application can display the hitlists in a more manageable
size, and control can be returned to the user faster.
Note: The first/next operator cannot be used with the CTX_QUERY.COUNT_HITS function or with in-memory queries.
You can use the first/next operator to extract chunks of a sorted hitlist returned by the max operator. For example, if you use the max operator to return only the highest scoring 50 documents that contain the term cat, you can extract the first 10 documents from the 50 as follows:
'cat:50#1-10'
Note: Placing the max operator inside the first/next operator as such is the only instance in which you can embed the max operator in a query expression.
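The combination of the max and first/next operators behaves like sorting, truncating, and then slicing a list. The following Python sketch models that reading for an expression such as 'cat:50#1-10'; the function names and the generated hitlist are hypothetical.

```python
def max_hits(hits, n):
    """':n' keeps the n highest-scoring hits, best first."""
    return sorted(hits, key=lambda hit: hit[1], reverse=True)[:n]

def first_next(hits, first, last):
    """'#first-last' returns a 1-based, inclusive range of the hitlist."""
    return hits[first - 1:last]

# Hypothetical hitlist of (document id, score) pairs for the term cat
hits = [(doc_id, (doc_id * 37) % 101) for doc_id in range(1, 61)]

top_50 = max_hits(hits, 50)             # cat:50
first_page = first_next(top_50, 1, 10)  # cat:50#1-10
second_page = first_next(top_50, 11, 20)
print(first_page)
```

An application can request one page at a time in this way instead of writing the entire hitlist at once.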
The expansion operators expand a query expression to include
variants of the query term supplied by the user. There are three kinds
of expansion operators: stem ($), soundex (!), and fuzzy (?).
The expansion operators are unary operators. They may be
used in combination with each other and with any other operators described
in this chapter. In addition, searches can be broadened by performing an
expansion on an expansion.
The methods used by the expansion operators to perform stemming,
fuzzy matching, and soundex matching for a text column are determined by
the Wordlist preference in the policy for the column.
See Also: For more information about setting up preferences and policies, see Oracle8 ConText Cartridge Administrator's Guide.
Use the STEM ($) operator to search for terms that have the
same linguistic root as the query term. For example:
Input | Expands To |
---|---|
$scream | scream screaming screamed |
$distinguish | distinguish distinguished distinguishes |
$guitars | guitars guitar |
$commit | commit committed |
$cat | cat cats |
$sing | sang sung sing |
The ConText stemmer, licensed from Xerox Corporation's XSoft
Division, supports the following languages: English, French, Spanish, Italian,
German, and Dutch.
Note: If STEM returns a stopword, the stopword is not included in the query or highlighted by CTX_QUERY.HIGHLIGHT.
The soundex (!) operator enables searches on words that have
similar sounds; that is, words that sound like other words. This function
allows comparison of words that are spelled differently, but sound alike
in English.
Soundex in ConText uses the same logic as the soundex function
in SQL to search for words that have a similar sound. It returns all words
in a text column that have the same soundex value.
The following example illustrates the results that could be returned for a one-step query that uses SOUNDEX:
SELECT ID, COMMENT FROM EMP_RESUME WHERE CONTAINS (COMMENT, '!SMYTHE') > 0;

ID COMMENT
-- ------------
23 Smith is a hard worker who..
Note: SOUNDEX works best for languages that use a 7-bit character set, such as English. It can be used, with lesser effectiveness, for languages that use an 8-bit character set, such as many Western European languages. For more information about the SOUNDEX function in SQL, see Oracle8 SQL Reference.
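To see why a query on !SMYTHE can return a document containing Smith, it helps to compute soundex values by hand. The Python sketch below implements the classic Soundex rules as they are commonly described; it is an assumption that this matches the SQL function in every edge case, and the function name is hypothetical.

```python
# Letter groups of the classic Soundex algorithm; other letters have no code.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)

def soundex(word):
    """Return the four-character Soundex value of a word."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]                 # the first letter is kept as is
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != previous:   # skip repeats of the same code
            result += code
        if ch not in "HW":              # H and W do not separate equal codes
            previous = code
    return (result + "000")[:4]         # pad with zeros, keep four characters

print(soundex("SMYTHE"), soundex("Smith"))
```

Both words reduce to the same value, which is why they are treated as sounding alike.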
Fuzzy (?) expansions generate words that are spelled similarly.
This type of expansion is helpful for finding more accurate results when
there are frequent misspellings in the documents in the database.
Unlike the stem expansion, the number of words generated
by a fuzzy search depends on what is in the text index; results can vary
significantly according to the contents of the database index.
Input | Expands To |
---|---|
?cat | cat cats calc case |
?feline | feline defined filtering |
?apply | apply apple applied April |
?read | lead real |
Note: Fuzzy works best for languages that use a 7-bit character set, such as English. It can be used, with lesser effectiveness, for languages that use an 8-bit character set, such as many Western European languages. Also, the Japanese lexer provides limited fuzzy matching. In addition, if fuzzy returns a stopword, the stopword is not included in the query or highlighted by CTX_QUERY.HIGHLIGHT.
Penetration allows complex query expansions to be expressed
in short concise notation. Penetration is a system of notation for query
expressions and does not affect the meaning of the expansion operators
or the order in which operations are performed; it is a tool to help you
generate non-ambiguous queries using the expansion operators.
Penetration applies the expansion operators to each term
within an explicit expression (that is, an expression delimited by parentheses
or braces). Any expansion operator outside an expression delimited by
parentheses ( ) or braces { } is applied to each word or phrase inside
the expression.
Query Before Penetration | Query After Penetration |
---|---|
?(dog, cat, mouse) | ?dog, ?cat, ?mouse |
?(dog,!(cat & mouse)) | ?dog, (!?cat & !?mouse) |
?((cat=feline) meows) | (?cat=?feline) ?meows |
In the first example, a fuzzy expansion is performed on each
term.
In the second example, a fuzzy expansion is performed on
each term and a soundex expansion is performed only on the terms cat and
mouse because cat and mouse are enclosed in a separate set
of parentheses.
In the third example, a fuzzy expansion is performed on each
term, including both equivalence terms.
Note: Expansion operators do not penetrate expressions delimited by brackets [ ].
You can use query expression feedback to examine how ConText expands query expressions containing fuzzy, stem and soundex operators.
If you have base-letter conversion specified for a text column
and the query expression contains a SOUNDEX or FUZZY operator, ConText
operates on the base-letter form of the query.
The STEM operator does not support base-letter conversion.
The thesaurus operators expand a query for a single term
(word or phrase) using a thesaurus that defines relationships between the
user-specified term and other semantically related terms.
There are ten kinds of thesaurus operators, corresponding
to the ten types of relationships that can be defined in an ISO 2788 standard
thesaurus.
Internally, ConText processes the expansion by bracketing
each individual term returned by the expansion and then combining the terms
using the ACCUMULATE operator.
For example, if bird, birdy, and avian
are all synonyms:
SYN(bird) is expanded to {bird},{avian},{birdy}.
If a term in a thesaural query does not have corresponding
entries in the specified thesaurus, no expansion is produced and the term
itself is used in the query.
See Also: For more information about viewing thesaural expansions, see Chapter 5, "Query Expression Feedback". For more information about thesaural relationships and creating thesauri, see Oracle8 ConText Cartridge Administrator's Guide.
The thesaurus operators can be used in conjunction with all
the other query expression operators and special characters supported by
ConText, with the exception of the near operator.
The maximum length of the expanded query is 32000 characters.
Thesaural operations cannot be nested. For example, the following query is not allowed:
'SYN(BT(bird))'
The thesaurus operators are implemented in ConText as PL/SQL
functions, and, as such, have arguments that must be specified with the
operator. All of the notational conventions and usage rules for PL/SQL
apply to the thesaurus operators.
The thesaurus operators have the following arguments:
term
Specify the operand for the thesaurus operator. You must
specify a term when using the NT operator. For preferred term (PT) and
top term (TT) queries, term is replaced by the preferred term/top
term defined for the term in the specified thesaurus; however, if no PT
or TT entries are defined for the term, the term is not replaced and is
used in the query.
For all other thesaural queries, term is expanded to include the synonymous, related, broader, or narrower terms defined for the term in the specified thesaurus.
level
Specify the number of levels traversed in the thesaurus hierarchy
to return the broader (BT, BTG, BTP) or narrower (NT, NTG, NTP) term for
the specified term. For example, a level of 1 in a BT query returns only
the broader term, if one exists, for the specified term. A level of 2 returns
the broader term for the specified term, as well as the broader term, if
one exists, for the broader term.
The level argument is optional and has a default value of one (1). Zero or negative values for the level argument return only the original query term.
thes
Specify the name of the thesaurus used to return the expansions
for the specified term. The thes argument is optional and has a
default value of DEFAULT. As a result, a thesaurus named DEFAULT must
exist in the thesaurus tables before using any of the thesaurus operators.
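The effect of the level argument can be illustrated with a small Python model. This is a sketch under stated assumptions, not the actual ConText code: `BROADER` and `expand_bt` are hypothetical names, the hierarchy is invented, and each term is assumed to have at most one broader term.

```python
# Hypothetical hierarchy: each term maps to its single broader term.
BROADER = {"tutorial": "training", "training": "learning systems"}

def expand_bt(term, level=1, broader=BROADER):
    """Model BT(term, level): walk up the hierarchy at most `level` steps."""
    terms = [term]
    current = term
    for _ in range(max(level, 0)):   # zero or negative: original term only
        current = broader.get(current)
        if current is None:          # top of the hierarchy reached
            break
        terms.append(current)
    return ",".join("{%s}" % t for t in terms)

print(expand_bt("tutorial"))      # {tutorial},{training}
print(expand_bt("tutorial", 2))   # {tutorial},{training},{learning systems}
print(expand_bt("tutorial", 0))   # {tutorial}
```

A level larger than the depth of the hierarchy simply stops at the topmost term in this model.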
Use the synonym operator (SYN) to expand a query to include
all the terms that have been defined in a thesaurus as synonyms for a specified
term.
The following query returns all documents that contain the term tutorial or any of the synonyms defined for tutorial in the DEFAULT thesaurus:
'SYN(tutorial)'
Expansions of compound phrases for a term in a synonym query
are returned as AND conjunctions.
For example, the compound phrase temperature + measurement + instruments is defined in a thesaurus as a synonym for the term thermometer. In a synonym query for thermometer, the query is expanded to:
{thermometer},({temperature}&{measurement}&{instruments})
Note: In a thesaurus, compound phrases can only be defined in synonym relationships for a term.
Use the preferred term operator (PT) to replace a term in
a query with the preferred term that has been defined in a thesaurus for
the term.
For example, the term building has a preferred term
of construction in a thesaurus. A PT query for building returns
all documents that contain the word construction. Documents that
contain the word building are not returned.
Use the related term operator (RT) to expand a query to include
all the terms that have been defined in a thesaurus as related terms for
a specified term.
For example, the term dinosaur has a related term
of paleontology. An RT query for dinosaur returns all documents
that contain the word paleontology. Documents that contain the word
dinosaur are not returned.
Use the narrower term operators (NT, NTG, NTP, NTI) to expand
a query to include all the terms that have been defined in a thesaurus
as the narrower or lower level terms for a specified term. They can also
expand the query to include all of the narrower terms for each narrower
term, and so on down through the thesaurus hierarchy.
The following query returns all documents that contain either the term tutorial or any of the NT terms defined for tutorial in the DEFAULT thesaurus:
'NT(tutorial)'
The following query returns all documents that contain either fairy tale or any of the narrower instance terms for fairy tale as defined in the DEFAULT thesaurus:
'NTI(fairy tale)'
That is, if the terms cinderella and snow white
are defined as narrower term instances for fairy tale, ConText returns
documents that contain fairy tale, cinderella, or snow
white.
Use the broader term operators (BT, BTG, BTP, BTI) to expand
a query to include the term that has been defined in a thesaurus as the
broader or higher level term for a specified term. They can also expand
the query to include the broader term for the broader term and the broader
term for that broader term, and so on up through the thesaurus hierarchy.
The following query returns all documents that contain the term tutorial or the BT term defined for tutorial in the DEFAULT thesaurus:
'BT(tutorial)'
If a homograph (a word or phrase with multiple meanings,
but the same spelling) appears in two or more nodes in the same hierarchy
branch of a thesaurus, a qualifier is required for each occurrence of the
term in the branch.
If the qualifier is not specified for a homograph in a broader
or narrower term query, the query expands to include all of the broader/narrower
terms for the homograph.
For example, if machine is a broader term for crane
(building equipment) and bird is a broader term for crane
(waterfowl):
BT(crane) expands to {crane},{machine},{bird}
If the qualifier for a homograph is specified in a broader
or narrower term query, only the broader/narrower terms for the qualified
homograph are returned.
BT(crane{(waterfowl)}) expands to {crane},{bird}
Note: When specifying a qualifier in a broader or narrower term query, the qualifier and its notation (parentheses) must be escaped, as is shown in this example.
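The homograph rules can be modelled as a lookup keyed by term and qualifier. The following Python sketch is illustrative only; the `BROADER` table and `expand_bt_homograph` are hypothetical, and the data mirrors the crane example.

```python
# Hypothetical thesaurus entries keyed by (term, qualifier).
BROADER = {
    ("crane", "building equipment"): "machine",
    ("crane", "waterfowl"): "bird",
}

def expand_bt_homograph(term, qualifier=None):
    """Model a BT query on a homograph, with or without a qualifier."""
    if qualifier is None:
        # No qualifier: collect the broader terms of every sense of the term.
        parents = [bt for (t, _q), bt in BROADER.items() if t == term]
    else:
        # Qualifier given: only the qualified sense is expanded.
        parent = BROADER.get((term, qualifier))
        parents = [parent] if parent else []
    return ",".join("{%s}" % t for t in [term] + parents)

print(expand_bt_homograph("crane"))               # {crane},{machine},{bird}
print(expand_bt_homograph("crane", "waterfowl"))  # {crane},{bird}
```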
Use the TOP TERM operator (TT) to replace a term in a query
with the top term that has been defined for the term in the standard hierarchy
(BT, NT) in a thesaurus. Top terms in the generic (BTG, NTG), partitive
(BTP, NTP), and instance (BTI, NTI) hierarchies are not returned.
For example, the term tutorial has a top term of learning
systems in the standard hierarchy of a thesaurus. A TT query for tutorial
returns all documents that contain the phrase learning systems.
Documents that contain the word tutorial are not returned.
Thesaural expansions in text queries can differentiate between
terms based on case.
For example, a case-sensitive thesaurus named thes1
is created and Mercury is defined as a narrower term for planets,
while mercury is defined as a narrower term for metals.
During a query, the following expansions occur:
BT(mercury,1,thes1) expands to {MERCURY}, {METALS}
BT(Mercury,1,thes1) expands to {MERCURY}, {PLANETS}
Case-sensitive thesauri only affect the expansion of a term
and not the terms actually used in the query. The case of the expanded
terms depends on whether the text index being queried is case-sensitive
or case-insensitive.
For example, when the case-sensitive thesaurus, thes1, is used with a case-insensitive index, the following expansion is returned:
The query then returns all documents in which the two terms
occur, regardless of case. In other words, documents that contain mercury,
Mercury, planets, Planets, or any other combinations
of case for the two terms are all returned by the query.
With a case-sensitive text index, the same query expands to:
The query returns only those documents in which Mercury
and planets occur.
When ConText processes a query on a base-letter index and
the expression contains a thesaurus operator, ConText looks up the query
term in the thesaurus without converting the query to base-letter. The
expansions obtained from the thesaurus are converted to base-letter and
looked up subsequently within the index according to query rules.
This sequence of look-up enables base-letter queries to work
independent of whether the thesaurus is in base-letter form. However, if
the keys in the thesaurus are in base letter form, these keys will not
match the corresponding non-base letter form query terms. When you have
a base-letter thesaurus, you must specify the base-letter form in the query.
Wildcard characters can be used in query expressions to expand
word searches into pattern searches. The wildcard characters are:
For example, the following abbreviated one-step query finds all terms beginning with the pattern scal in a column named text:
...contains(TEXT, 'scal%') > 0
Note: To expand the wildcard query, ConText uses the word list for the text column and rewrites the query with these terms. When your wildcard query expands to a number of terms greater than the maximum allowed in a query, ConText returns an error. In addition, if a wildcard expression translates to a stopword, the stopword is not included in the query or highlighted by CTX_QUERY.HIGHLIGHT.
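The rewrite described in the note can be pictured with a short Python model. It is a sketch only: `expand_wildcard` and its `max_terms` limit are hypothetical, and it assumes that % matches any run of characters and _ matches exactly one character, as in SQL LIKE patterns.

```python
import re

def expand_wildcard(pattern, word_list, max_terms=100):
    """Return the indexed words that a pattern containing % and _ expands to."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # assumed: any run of characters
        elif ch == "_":
            parts.append(".")            # assumed: exactly one character
        else:
            parts.append(re.escape(ch))  # literal character
    regex = re.compile("".join(parts))
    matches = [w for w in word_list if regex.fullmatch(w)]
    if len(matches) > max_terms:
        # Models the error returned when the expansion is too large.
        raise ValueError("wildcard expansion exceeds %d terms" % max_terms)
    return matches

words = ["scalar", "scale", "scalp", "school", "escalate"]
print(expand_wildcard("scal%", words))  # ['scalar', 'scale', 'scalp']
print(expand_wildcard("scal_", words))  # ['scale', 'scalp']
```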
The grouping characters control operator precedence by grouping query terms and operators in a query expression. The grouping characters are:
The beginning of a group of terms and operators is indicated
by an open character from one of the sets of grouping characters. The ending
of a group is indicated by the occurrence of the appropriate close character
for the open character that started the group. Between the two characters,
other groups may occur.
For example, the open parenthesis indicates the beginning
of a group. The first close parenthesis encountered is the end of the group.
Any open parentheses encountered before the close parenthesis indicate
nested groups.
Brackets perform the same function as parentheses, but
also prevent the expansion operators from penetrating the group.
You can store the results of a query expression and then
call the SQE later in a query expression to return the stored results.
To call a stored query expression, use the SQE operator.
Operator | Syntax | Description |
---|---|---|
SQE | SQE(SQE_name) | Returns the stored result of SQE_name. |
The advantage of calling an SQE in a query expression, rather
than specifying query terms, is that the results are typically returned
faster, since ConText does not have to query the text table directly.
In addition, SQEs can be used to perform iterative queries,
in which an initial query is refined using one or more additional queries.
The process for using stored query expressions is:
Administration of stored query expressions can be performed
using the REFRESH_SQE, REMOVE_SQE,
and PURGE_SQE procedures in the CTX_QUERY PL/SQL
package.
To create a session SQE named PROG_LANG, use CTX_QUERY.STORE_SQE as follows:
exec ctx_query.store_sqe('emp_resumes', 'prog_lang', 'cobol', 'session');
This SQE queries the text column for the EMP_RESUMES policy
(in this case, EMP.RESUMES) and returns all documents that contain the
term cobol. It stores the results in the SQE table for the policy.
PROG_LANG can then be called within a query expression as follows:
select score, docid from emp where contains(resume, 'sqe(prog_lang)')>0 order by score;
When you initially create an SQE using CTX_QUERY.STORE_SQE,
you can specify whether the SQE is for the current session or for all sessions
(system SQE).
You can use session SQEs only in the current session. These
SQEs are stored only for the duration of the session. When a session is
terminated, all session SQEs created during the session are deleted from
the SQE tables. If you want to use a session SQE in another session, you
must recreate the SQE.
System SQEs can be used in all sessions, including concurrent
sessions. When a session is terminated, system SQEs created during the
session are not deleted from the SQE tables and can be used in future
sessions.
If the text column referenced by a stored query expression
has been modified since the stored query expression was created, the stored
query expression results may be out of date. Before returning the results
of a stored query expression in a query expression, ConText verifies that
the results are current. If they are not current, ConText automatically
evaluates the differences and updates the results.
ConText also verifies that any stored query expressions nested
within a stored query expression have up-to-date results.
Result lists in stored query expression tables may get fragmented
by consecutive re-evaluations. You can resolve fragmentation by calling
CTX_QUERY.REFRESH_SQE.
Iterative queries are queries built on other queries to refine or add to the result set of the original query. Once you define a stored query expression, you can add additional search criteria in two ways:
Sometimes you might want to add a condition to a stored query
expression to re-define your search criteria. You can do so by extending
the query with additional operators when you call CTX_QUERY.CONTAINS.
When you extend stored queries in this way, the response time is usually
faster than an equivalent query without the SQE operator.
For example, you find that wildcard queries take a long time to process. You therefore define a wildcard query as a stored query expression, Q1, to return all documents indexed under policy pol that have words beginning with the letter z:
ctx_query.store_sqe('pol', 'Q1', 'z%', 'session');
You then extend the query by adding an OR condition. You ask for all documents indexed under policy pol that contain words beginning with the letter z or contain the word cat:
ctx_query.contains('pol', 'SQE(Q1) | cat', 'ctx_temp');
Internally, ConText must still use the text index to find those documents that might have the word cat but not z%; however, the response time is generally much faster than the following equivalent query:
ctx_query.contains('pol', 'z% | cat', 'ctx_temp');
You can use stored query expressions to define other stored
query expressions. This is useful when you want to refine the result set
returned from a stored query expression.
For example, you define the stored query expression, Q1 as follows:
ctx_query.store_sqe('pol', 'Q1', 'lions | tigers', 'session');
You then want to reduce this hitlist by adding another condition, so you define Q2 as follows:
ctx_query.store_sqe('pol', 'Q2', 'SQE(Q1) and zoos', 'session');
You then execute Q2 as follows:
ctx_query.contains('pol', 'SQE(Q2)', 'ctx_temp');
This query searches for all documents that contain zoos and either lions or tigers. It is generally faster than the following equivalent query:
ctx_query.contains('pol', '(lions | tigers) and zoos', 'ctx_temp');
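One way to picture nested stored query expressions is as set operations on stored lists of document IDs. The Python sketch below is only a model: the `INDEX` data and the `stored` dictionary are hypothetical stand-ins for the text index and the stored results. It also shows why grouping matters when the refinement is written as a single expression, because AND is applied before OR.

```python
# Hypothetical inverted index: term -> set of document IDs.
INDEX = {
    "lions": {1, 2},
    "tigers": {2, 3},
    "zoos": {2, 3, 4},
}

stored = {}  # stands in for the stored SQE results

stored["Q1"] = INDEX["lions"] | INDEX["tigers"]   # 'lions | tigers'
stored["Q2"] = stored["Q1"] & INDEX["zoos"]       # 'SQE(Q1) and zoos'

# Written without grouping, AND binds tighter than OR:
# lions | (tigers and zoos)
ungrouped = INDEX["lions"] | (INDEX["tigers"] & INDEX["zoos"])

print(sorted(stored["Q2"]))  # [2, 3]
print(sorted(ungrouped))     # [1, 2, 3]
```

In this toy data, document 1 mentions lions but not zoos, so it appears only in the ungrouped result.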
Each stored query expression is stored in two tables: a central
or system table owned by CTXSYS and a text index table attached to the
policy for which the stored query expression was created.
The table owned by CTXSYS is an internal table which stores
the stored query expression definitions for all the stored query expressions
that have been created for all existing policies. It cannot be accessed
directly, but can be viewed through two views, CTX_SQES (users with CTXADMIN
role) and CTX_USER_SQES (users with CTXAPP and CTXADMIN roles).
The table used to store the results of a stored query expression
for a text column (the SQR table) is one of the tables created automatically
when the column is indexed; however, the SQR table is populated only when a
stored query expression is created and is updated only when a stored query
expression is re-evaluated.
The tablespace, storage clause, and other parameters used
to create the SQR table are specified by the Engine preference in the policy
for the text column of the stored query expression.
Note: Like the other ConText index tables, the SQR table is an internal table that is accessed only by ConText when a stored query expression is processed in a query. For more information about policies, preferences, text indexing, and the structure of the stored query expression tables and views, see Oracle8 ConText Cartridge Administrator's Guide.
You can use all query expression operators in stored query expressions, with the following exceptions:
Stored query expressions also support all of the special
characters and other components that can be used in a query expression,
including PL/SQL functions and other stored query expressions.
In a query expression, you can call a PL/SQL function that
returns a value. The syntax for the PL/SQL operator is as follows:
Calling a PL/SQL function within a query is useful for converting
words to alternate forms. For example, you can call a function that takes
acronyms and returns the expanded string.
Suppose you, as user ctxuser, create a function named CONVERT that takes an acronym as input and returns the fully expanded version of the acronym. Then, to obtain all documents that contain either IBM or International Business Machines, you issue the following query:
'execute ctxuser.convert(IBM), IBM'
Likewise, you can call a PL/SQL function that translates words. For example, you can call a function french that converts an English word to its French equivalent. You can then search on the French word for cat by issuing the following query:
'@ctxuser.french(cat)'
Operator precedence is the order in which the components
of a query expression are evaluated. ConText query operators can be divided
into two sets of operators that have their own order of evaluation. These
two groups are described below as Group 1 and Group 2.
In all cases, query expressions are evaluated in order from
left to right according to the precedence of their operators. Operators
with higher precedence are applied first. Operators of equal precedence
are applied in order of their appearance in the expression from left to
right.
Within query expressions, the Group 1 operators have the
following order of evaluation from highest precedence to lowest:
Operator | Equivalent |
---|---|
EQUIV | = |
NEAR | ; |
Weight, Threshold | * > |
MINUS | - |
NOT | ~ |
WITHIN | (none) |
AND | & |
OR | \| |
ACCUM | , |
Max | : |
First/Next | # |
Within query expressions, the Group 2 operators have the following
order of evaluation from highest precedence to lowest:
Operator | Equivalent |
---|---|
Wildcard | % _ |
Stem | $ |
Fuzzy | ? |
Soundex | ! |
Other operators not listed under Group 1 or Group 2 are procedural.
These operators have no sense of precedence attached to them. They include
the SQE, PL/SQL, and thesaurus operators.
In the first example, because AND has a higher precedence
than OR, the query returns all documents that contain w1 and all
documents that contain both w2 and w3.
In the second example, the query returns all documents that
contain both w1 and w2 and all documents that contain w3.
In the third example, the fuzzy operator is first applied
to w1, then the AND operator is applied to arguments w3 and
w4, then the OR operator is applied to term w2 and the results
of the AND operation, and finally, the score from the fuzzy operation on
w1 is added to the score from the OR operation.
The fourth example shows that the equivalence operator has
higher precedence than the AND operator.
The fifth example shows that the AND operator has lower precedence
than the WITHIN operator.
Precedence is altered by grouping characters as follows:
Precedence of operators is maintained during evaluation of expressions inside of the parentheses.
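To make the precedence and grouping rules concrete, the following Python sketch parses a small subset of the Group 1 operators (=, &, |, and the ACCUM comma) and prints the grouping that results. It is an illustrative precedence-climbing parser, not ConText's own; the names `PRECEDENCE`, `tokenize`, `parse`, and `group` are hypothetical, and unary and Group 2 operators are left out.

```python
import re

# Subset of the Group 1 binary operators; a higher number binds tighter.
PRECEDENCE = {"=": 4, "&": 3, "|": 2, ",": 1}

def tokenize(expr):
    """Split an expression into words, operators, and parentheses."""
    return re.findall(r"[A-Za-z0-9_]+|[=&|,()]", expr)

def parse(tokens, min_prec=1):
    """Precedence climbing: return a fully parenthesized string."""
    tok = tokens.pop(0)
    if tok == "(":
        left = parse(tokens, 1)   # precedence restarts inside a group
        tokens.pop(0)             # discard the matching ')'
    else:
        left = tok
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Equal precedence is applied left to right.
        right = parse(tokens, PRECEDENCE[op] + 1)
        left = "(%s %s %s)" % (left, op, right)
    return left

def group(expr):
    return parse(tokenize(expr))

print(group("w1 | w2 & w3"))    # (w1 | (w2 & w3))
print(group("(w1 | w2) & w3"))  # ((w1 | w2) & w3)
print(group("w1 & w2 = w3"))    # (w1 & (w2 = w3))
```

The second call shows grouping characters overriding the default order, while precedence still applies inside the parentheses.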
To query on words or symbols that have special meaning in
query expressions, such as and, &, or, |, accum, and execute, you must
escape them. There are two ways to escape characters in a query expression:
In the following examples, an escape sequence is necessary because each expression contains a ConText operator or reserved symbol:
'AT\&T'
'{AT&T}'
'high\-voltage'
'{high-voltage}'
The following is a list of ConText reserved words and characters
that must be escaped to be searched on:
The open brace { signals the beginning of the escape sequence,
and the close brace } indicates the end. Everything between the opening
brace and the closing brace is part of the query expression (including
any open brace characters). To include the close brace character in a query
expression, use }}.
To escape the backslash escape character, use \\.
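When query terms come from user input, the two escaping styles can be applied programmatically. The Python helpers below are a minimal sketch: `escape_braces` and `escape_with_backslash` are hypothetical names, and the set of reserved characters is supplied by the caller rather than taken from the ConText reserved list.

```python
def escape_braces(term):
    """Wrap a term in braces; a literal } inside is written as }}."""
    return "{" + term.replace("}", "}}") + "}"

def escape_with_backslash(term, reserved):
    """Put a backslash before each reserved character and each backslash."""
    return "".join(
        "\\" + ch if ch in reserved or ch == "\\" else ch
        for ch in term
    )

print(escape_braces("AT&T"))                        # {AT&T}
print(escape_braces("a}b"))                         # {a}}b}
print(escape_with_backslash("high-voltage", "&-"))  # high\-voltage
```

The brace form is usually easier to apply to a whole term, while the backslash form escapes one character at a time.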
Stopwords are words for which ConText does not create an
index entry. They are usually common words that are unlikely to be searched
on by themselves.
ConText is shipped with a default list of stopwords in English
containing common words such as this and that. However, you
or your ConText administrator can define stopwords.
See Also: For more information about defining stopwords, see Oracle8 ConText Cartridge Administrator's Guide.
You cannot query on a stopword by itself or a phrase of only
stopwords; whenever you attempt to query on a stopword by itself or a stopword-only
phrase, the result is always no hits.
For example, you cannot issue a query to retrieve all documents
that contain this if this is defined as a stopword, nor can
you issue a query on a phrase of stopwords such as the who, if the
words the and who are defined as stopwords.
You can query on phrases that contain stopwords as well as
non-stopwords, such as this boy talks to that girl, where this
and that are the only stopwords. This is possible because ConText
records the position of stopwords even though it does not create an index
entry for them.
If you have case-sensitivity enabled for text queries and you issue a query on a phrase containing stopwords and non-stopwords, you must specify the correct case for the stopwords. For example, a query on this boy talks to that girl does not return documents that contain the phrase This boy talks to that girl, assuming this is a stopword.
See Also: For more information about issuing case-sensitive text queries, see "Case-Sensitive Queries" in this chapter.
When you use a stopword or a stopword-only phrase as an operand
of a query operator, ConText rewrites the expression to eliminate the stopword
or stopword-only phrase and then executes the query.
The following table describes some common stopword
transformations. The Stopword Expression column describes the query
expression or component of a query expression you enter, while the right-hand
column describes the way ConText rewrites the query.
In these examples, a value of no_token for the rewritten
expression means no hits are returned for the query.
For example, assuming that the word this is a stopword
and that the word dog is a non-stopword, the query dog and this
is rewritten to dog, applying the first transformation in the list.
See Also: For a complete list of stopword transformations, see Appendix D, "Stopword Transformations". To learn about how to examine stopword transformations, see Chapter 5, "Query Expression Feedback".
ConText indexes text by identifying tokens (words). For English
and most European languages, it assumes that blank spaces delimit tokens.
At index time, ConText must also know how to interpret punctuation characters
and characters that occur within words and numbers. Such special characters
must be defined in the BASIC LEXER preference. They are described as follows:
In the BASIC LEXER preference, ConText defines a default
set of characters for each group.
The way you query on tokens that contain these characters
depends on how ConText indexes the tokens containing these characters.
This is because ConText tokenizes words at query time the same way it tokenizes
words at index time. To query on words or numbers that contain special
characters, you must know how these words are represented in the index.
See Also: For more information about defining special characters for the BASIC LEXER preference, see Oracle8 ConText Cartridge Administrator's Guide.
Punctuation and continuation characters are not indexed with
the words they occur next to or with, and thus are ignored by ConText at
query time. The following table shows how ConText strips punctuation characters
at query time:
Printjoins and skipjoins are characters such as hyphens that
join words together.
When you define a character as a printjoin, such as a hyphen,
you specify that the words on either side of the hyphen are to be indexed
with the hyphen. For example, sister-in-law is indexed as the token
sister-in-law.
When you define a character as a skipjoin, such as a hyphen,
you specify that the two words on either side of the hyphen are to be indexed
as one token without the hyphen. For example, sister-in-law is indexed
as sisterinlaw.
To query on words that contain a join character, you must
know if the character is defined as a skipjoin or printjoin in the BASIC
LEXER preference.
If the hyphen character is defined as a printjoin, you must write your query with the hyphen, since the indexed token contains the hyphen. Thus, to query on all the documents that contain the term sister-in-law, you must write your query as follows with the hyphen:
'{sister-in-law}'
Note: The '-' character must be escaped, or else ConText interprets it as the MINUS operator.
When a character is defined as a skipjoin, it is not indexed
with the word; therefore, you can write queries with or without the skipjoin
character.
If the hyphen character is defined as a skipjoin, you can write your query with or without the hyphen. Thus, to query on all documents that contain sister-in-law, you can write your query as one of the following expressions:
'sisterinlaw'
'{sister-in-law}'
You can write your query in two ways, because both queries
are lexed to sisterinlaw before index look-up. This also means that
the documents retrieved can contain either sisterinlaw or sister-in-law.
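The difference between printjoins and skipjoins can be modelled with a tiny tokenizer. This Python sketch is not the BASIC LEXER; `lex_token` is a hypothetical function, and it assumes that any character that is neither alphanumeric nor a join character splits the word.

```python
def lex_token(word, printjoins="", skipjoins=""):
    """Return the token(s) produced for a word containing join characters."""
    tokens = []
    current = ""
    for ch in word:
        if ch.isalnum() or ch in printjoins:
            current += ch        # a printjoin stays inside the token
        elif ch in skipjoins:
            continue             # a skipjoin is dropped, the word stays joined
        else:
            if current:          # assumed: other characters split the word
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens

print(lex_token("sister-in-law", printjoins="-"))  # ['sister-in-law']
print(lex_token("sister-in-law", skipjoins="-"))   # ['sisterinlaw']
print(lex_token("sisterinlaw", skipjoins="-"))     # ['sisterinlaw']
```

Because the same function is applied at index time and at query time, both spellings reduce to the same token when the hyphen is a skipjoin.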
Numjoin and numgroup characters are characters that can appear
in numbers, such as the decimal point and the comma.
A numjoin is a character that occurs once in a string of digits, such as a decimal point, and gets indexed with the number. (ConText defines the decimal as a default numjoin character for the BASIC LEXER preference.) For example, the number 3.14 is indexed as 3.14. Thus to query on 3.14 with the decimal point defined as a numjoin character, you write:
'3.14'
When you define the numjoin character to be NULL, ConText
indexes 3.14 as the two separate numbers 3 and 14.
A numgroup is a character such as a comma that groups digits
together in a number. Numgroup characters get indexed with the number.
(ConText defines the comma as a default numgroup character for the BASIC
LEXER preference.) For example, the number 6,344,555 gets indexed
as 6,344,555.
To query on a number that contains numgroup characters, you must write the query with the numgroup character. For example, to query on 6,344,555, you write:
'{6,344,555}'
Note that the comma must be escaped.
When you define the numgroup character as NULL, numbers such
as 1,000 get indexed as 1 and 000.
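The numjoin and numgroup behaviour can be modelled the same way. In this Python sketch, `lex_number` is a hypothetical function and passing None stands in for a NULL setting; it is an illustration of the rules above, not the lexer itself.

```python
import re

def lex_number(text, numjoin=".", numgroup=","):
    """Split a numeric string into the tokens that would be indexed.

    Pass None for numjoin or numgroup to model a NULL setting.
    """
    keep = "".join(c for c in (numjoin, numgroup) if c)
    # Digits plus any configured join/group characters stay in one token.
    pattern = "[0-9%s]+" % re.escape(keep) if keep else "[0-9]+"
    return re.findall(pattern, text)

print(lex_number("3.14"))                  # ['3.14']
print(lex_number("3.14", numjoin=None))    # ['3', '14']
print(lex_number("6,344,555"))             # ['6,344,555']
print(lex_number("1,000", numgroup=None))  # ['1', '000']
```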
Startjoin and endjoin characters are non-alphanumeric characters
that start and end tokens. These characters are indexed with the token
they occur with.
You or your ConText administrator typically define startjoin
and endjoin characters when you index tagged text such as HTML. This makes
it easy to define sections for section searching as well as to query on
the tags themselves.
For example, to query on the tag <HEAD> with < defined
as a startjoin and > defined as an endjoin, write your query as follows:
In the query above, an escape sequence is necessary, since
> is an operator.
See Also: For more information about section searching, see "WITHIN Operator" in this chapter.