Oracle8
ConText Cartridge Application Developer's Guide
Release 2.4 A63821-01
This chapter describes how to perform theme queries.
Theme queries enable you to search for documents by their major concepts. The following sections describe the theme indexing and querying processes and how they use the knowledge base.
See Also: For more information about the knowledge base, see "Knowledge Base" in Chapter 7, "ConText Linguistics". For more information about how to create a theme index, see Oracle8 ConText Cartridge Administrator's Guide.
Before you can issue a theme query, your set of documents
must be indexed by theme. During theme indexing, ConText extracts up to
fifty main concepts or themes of a document and stores these themes in
the theme index. A weight is also associated with every theme that is indexed.
A theme can be a concrete concept, such as insects, or an abstract
concept, such as success, provided the concept is sufficiently developed
in the document.
Figure 4-1 illustrates how ConText uses the knowledge base to extract document themes from an example document, "The Reproductive Cycle of Insects", that contains information about insects. This example shows that ConText recognizes two types of themes: known themes and unknown themes.
Known themes are document themes that can attach to a branch
of the knowledge base.
In the example in Figure 4-1,
document A, entitled "The Reproductive Cycle of Insects", contains information
about insects. The known document theme insects has four
parent themes in its branch of the knowledge base: science
and technology, hard sciences, biology, and zoology.
Each theme in the branch, including insects itself, is entered as a
searchable row in the theme index along with a weight.
When themes are indexed in this way, a theme query on insects
or any of its parents returns document A.
Unknown themes are document themes that cannot be found in
the knowledge base, because they are either unknown to the knowledge base
or inherently ambiguous.
Figure 4-1 shows how an unknown
theme of Dr. Mack is extracted without having a representation in
the knowledge base. Unknown themes such as this are indexed as a single
row.
Ambiguous document themes, such as the term cricket
or the term table, also have no attachments to the knowledge base
and hence are indexed as a single row. To query on an ambiguous document
theme such as cricket, you rely on supporting themes, such as sports or
insects, having been indexed for the same document.
See Also: For more information about querying ambiguous themes, see "Refining Theme Queries" in this chapter.
The theme weight is a measure of the strength of a theme
relative to the other themes in a document. Weights are indexed with every
theme and the related parent themes extracted from a document. ConText
uses theme weights to help score theme queries.
To execute a theme query, you specify a query string, which
can be a sentence or a phrase with or without operators. ConText uses the
knowledge base to normalize the word or phrase you enter into a standard
form. It then looks up the normalized theme in the index and returns the
documents that were indexed with the given theme. See Figure
4-2. Scores for theme queries are calculated based on the weights associated
with each theme in the index.
For example, a theme query on insect retrieves the
document indexed in Figure 4-1, entitled "The
Reproductive Cycle of Insects". Likewise, a theme query on any of the indexed
parents, such as science and technology, hard sciences, biology,
or zoology, also retrieves the same document.
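As a sketch of the example above, each of the following one-step queries would be expected to return the document from Figure 4-1. The table TEXTAB, the columns ID and text, and the policy THEME_POL are borrowed from the examples later in this chapter and are assumptions about your schema; substitute your own names.

```sql
-- Assumes TEXTAB.text is indexed by theme under the policy THEME_POL.

-- Query on the document theme itself.
SELECT ID, SCORE(0) FROM TEXTAB
 WHERE CONTAINS (text, 'insects', 0, 'THEME_POL') > 0;

-- Query on a parent theme from the same knowledge base branch.
SELECT ID, SCORE(0) FROM TEXTAB
 WHERE CONTAINS (text, 'zoology', 0, 'THEME_POL') > 0;
```

Both queries match because every theme in the branch is stored as its own searchable row in the theme index.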
ConText returns a relevance score for each document it returns
in a theme query; the higher the score, the more relevant the returned
document. This relevance score is out of 100 and is based on the weight
of the indexed theme.
Generally, specifying broader themes or concepts in a theme
query will return higher scoring documents.
When using operators in theme queries, the scoring behavior
is the same as for regular text queries. For example, the OR operator returns
the higher score of its operands, and the AND operator returns the lower
score of its operands.
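The operator scoring rule can be illustrated with a small worked example. The scores below are hypothetical values chosen for illustration, and TEXTAB, text, and THEME_POL are assumed names taken from the examples later in this chapter.

```sql
-- Hypothetical theme scores for one document: soccer = 40, basketball = 70.

-- OR returns the higher operand score, so this document would score 70.
SELECT ID, SCORE(0) FROM TEXTAB
 WHERE CONTAINS (text, 'soccer | basketball', 0, 'THEME_POL') > 0;

-- AND returns the lower operand score, so this document would score 40.
SELECT ID, SCORE(0) FROM TEXTAB
 WHERE CONTAINS (text, 'soccer and basketball', 0, 'THEME_POL') > 0;
```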
Theme queries are case-sensitive. For example, doing a query
on the common noun turkey produces a hit on turkey the bird. Such
a query does not produce a hit on the proper noun Turkey, which
describes a country. To query on the proper noun, you must enter the query
as Turkey.
Even though ConText theme queries are case-sensitive, ConText
tolerates poorly formatted input for known themes.
For example, entering microsoft or microSoft
returns documents that include the theme of Microsoft, a known company.
Likewise, entering Currency Rates returns documents that include
a theme of currency rates, a standard classification in business
and economics.
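The case-sensitivity rule described above might look as follows in practice. This is a sketch only; TEXTAB, text, and THEME_POL are assumed names taken from the examples later in this chapter.

```sql
-- Common noun: expected to match documents about turkey the bird.
SELECT ID, SCORE(0) FROM TEXTAB
 WHERE CONTAINS (text, 'turkey', 0, 'THEME_POL') > 0;

-- Proper noun: expected to match documents about Turkey the country.
SELECT ID, SCORE(0) FROM TEXTAB
 WHERE CONTAINS (text, 'Turkey', 0, 'THEME_POL') > 0;
```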
The following section describes how to construct theme queries.
With theme queries, the following operators have the same
semantics as with regular text queries:
Operator | Symbol |
---|---|
Accumulate | , |
Or | \| |
And | & |
Minus | - |
Not | ~ |
Weight | * |
Threshold | > |
Max | : |
Some valid theme query strings using operators are as follows:
contains(text, 'cricket ~ insects') > 0;
contains(text, 'cricket & sports') > 0;
contains(text, 'music, reggae*5') > 0;
contains(text, 'chemistry > 30') > 0;
contains(text, 'soccer | basketball') > 0;
contains(text, 'computer software - Microsoft') > 0;
contains(text, 'music:20') > 0;
See Also: For more information about how to use operators in theme queries, see "Refining Theme Queries" in this chapter. For more information about the semantics of query operators, see Chapter 3, "Understanding Query Expressions".
In a theme query, the thesaurus operators (synonym, broader
term, narrower term, and so on) work the same way as in a regular text query,
provided a thesaurus has been created or loaded.
See Also: For more information about thesaurus operators, see "Thesaurus Operators" in Chapter 3.
In theme query expressions, the grouping characters ( ) [
] have the same semantics as with a regular text query.
See Also: For more information about grouping characters, see "Grouping Characters" in Chapter 3.
In theme query expressions, the wildcard characters % and _ work
the same way as in regular text queries.
Note: There is a risk of ambiguity when using the wildcard character. For example, doing a theme query on %court% might return documents that have a theme of court of law or tennis court.
See Also: For more information about wildcard characters, see "Wildcard Characters" in Chapter 3.
ConText does not support the following query expression operators
with theme queries:
Operator | Symbol |
---|---|
Near | ; |
Fuzzy | ? |
Soundex | ! |
Stem | $ |
The following issues affect the phrasing of theme queries.
When you enter your theme query, ConText normalizes the word
or phrase representing your theme into a form that it can use to compare
with document themes in the index. This normal form is nouns and noun phrases,
such as chemistry or personal computer. It is therefore better
to use nouns and noun phrases when constructing theme queries. Avoid using
sentences or long phrases.
For example, to search for documents about computer programming,
use the noun phrase computer programming, not a phrase such as programming my computer.
Avoid splitting phrases that describe your idea as a whole.
For example, use the phrase physical chemistry, not physical
and chemistry.
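The difference between a whole phrase and a split phrase can be sketched as two queries. TEXTAB, text, and THEME_POL are assumed names taken from the examples later in this chapter.

```sql
-- Preferred: the noun phrase is normalized and matched as a single theme.
SELECT ID, SCORE(0) FROM TEXTAB
 WHERE CONTAINS (text, 'physical chemistry', 0, 'THEME_POL') > 0;

-- Not equivalent: AND looks for two separate themes, physical and chemistry,
-- and requires both to be indexed for the same document.
SELECT ID, SCORE(0) FROM TEXTAB
 WHERE CONTAINS (text, 'physical and chemistry', 0, 'THEME_POL') > 0;
```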
Theme queries are case-sensitive. For example, doing a query
on the common noun turkey, which describes a type of bird, will
not produce a hit on the proper noun Turkey, which describes a country.
See Also: For more information about case-sensitivity and theme queries, see the "Theme Querying" section in this chapter.
Depending on how you write your theme query, ConText usually
returns documents that are relevant to your query as well as documents
that might be irrelevant to your query. Before you issue the query, you
do not know what combination of document themes your query will return.
For example, a query on cricket might return documents
on sports and insects depending on your document set. The
best way to know the possible outcome is to run the query and examine the
set of returned documents. Then you run the query again, using logical
operators to eliminate unwanted documents.
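You can approach this trial-and-error method in one of two ways: start with a broad theme query and then narrow it, or start with a specific theme query and then expand it.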
Starting with broad theme queries might generate noise or unwanted documents. This is because of the following:
You can use the AND or NOT operator to eliminate unwanted
documents. However, use these operators with caution, because in both cases
you run the risk of eliminating documents that you might be interested
in. For this reason, it is generally better to tolerate some noise than to
risk losing relevant documents.
You can use the AND operator with a qualifying theme to restrict
your theme query and hence eliminate noise.
For example, if a theme query on cricket always returned documents about the sport cricket and the insect cricket, and you were interested only in those documents about cricket the sport, you can restrict your query by qualifying cricket with the more general category sports as follows:
'cricket and sports'
The disadvantage of using AND with a restricting theme is
that a successful query depends on both themes being developed sufficiently
in the document for ConText to index them as such. For example, a hypothetical
news article about the personal affairs of a cricket player might not have
the theme of sports developed sufficiently for ConText to index
sports as a theme, and therefore such a document would not be returned
in the above query.
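As a complete statement, the restricted query might be issued as shown in this sketch. TEXTAB, text, and THEME_POL are assumed names taken from the examples later in this chapter.

```sql
-- First pass: run the ambiguous theme alone and examine what comes back.
SELECT ID, SCORE(0) FROM TEXTAB
 WHERE CONTAINS (text, 'cricket', 0, 'THEME_POL') > 0;

-- Second pass: keep only documents that also carry the theme sports.
SELECT ID, SCORE(0) FROM TEXTAB
 WHERE CONTAINS (text, 'cricket and sports', 0, 'THEME_POL') > 0;
```

Comparing the two result sets shows which documents the qualifying theme removed, including any relevant documents in which sports was not indexed as a theme.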
You can use the NOT operator to exclude unwanted themes.
For example, suppose you have a collection of news articles. You find that
a theme query on cricket returns documents about cricket the sport
as well as cricket the insect.
In such a scenario, you can use the NOT operator to exclude the unwanted theme. Thus if you are interested in those documents only about the sport cricket, you exclude documents about insects as follows:
'cricket not insects'
One disadvantage of using the NOT operator is that you run
the risk of excluding documents that happen to be about both the desired
theme and the unwanted theme. For example, the above query does not return
a hypothetical document about a cricket game that was swarmed by locusts,
assuming that the theme of insects is developed sufficiently for
ConText to index insects as a document theme.
Another disadvantage of using NOT is that you usually have
a better idea of the themes you want, not of the themes you don't want.
Predicting unwanted themes depends on knowing your document corpus. For
this reason, using NOT is best suited for eliminating irrelevant high-ranking
documents you specifically know about.
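The exclusion approach can be written as a complete statement in the same way. This is a sketch; TEXTAB, text, and THEME_POL are assumed names taken from the examples later in this chapter.

```sql
-- Return documents with the theme cricket, except those that are also
-- indexed with the theme insects.
SELECT ID, SCORE(0) FROM TEXTAB
 WHERE CONTAINS (text, 'cricket not insects', 0, 'THEME_POL') > 0;
```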
Sometimes it is better to start with specific categories
and then expand these queries into more general ones, especially when your
query covers a topic that belongs to a well-defined, specific category. For
example, if you are searching for documents about bees,
you first issue a query on bees, which is a specific category of insects.
If the result set does not contain the documents you need,
you can expand the query by issuing a theme query on insects, which is
slightly broader.
After expanding a query, you can use the NOT or AND operators
to scale back the query.
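The expand-then-scale-back sequence might proceed as in the following sketch. The theme agriculture is a hypothetical qualifying theme chosen for illustration, and TEXTAB, text, and THEME_POL are assumed names taken from the examples later in this chapter.

```sql
-- 1. Start with the specific theme.
SELECT ID, SCORE(0) FROM TEXTAB
 WHERE CONTAINS (text, 'bees', 0, 'THEME_POL') > 0;

-- 2. If too few documents come back, broaden to the parent category.
SELECT ID, SCORE(0) FROM TEXTAB
 WHERE CONTAINS (text, 'insects', 0, 'THEME_POL') > 0;

-- 3. If the broader query is too noisy, scale it back with a qualifying
--    theme (agriculture is hypothetical).
SELECT ID, SCORE(0) FROM TEXTAB
 WHERE CONTAINS (text, 'insects and agriculture', 0, 'THEME_POL') > 0;
```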
To execute a theme query, you specify a query string, which
can be a sentence or a phrase with or without operators. ConText interprets
your query, creating a normalized form of your query that it can use to
match against document themes in the index. ConText returns a list of documents
that satisfy the query, based on certain rules, along with a score of how
relevant each document is to the query.
You can issue theme queries using either the two-step or
one-step method. The way in which ConText matches themes and scores hits
is the same for both methods.
Note: To issue theme queries, you must have a theme index. For more information about how to create a theme index on a text column, see Oracle8 ConText Cartridge Administrator's Guide.
To execute a theme query with the CTX_QUERY.CONTAINS
procedure against a theme index, you must specify a policy that has a theme
lexer associated with it.
For example, you specify a theme query on computer software as follows:
execute ctx_query.contains('THEME_POL', 'computer software', 'CTX_TEMP');
In the above example, ConText normalizes computer software,
and then attempts to match the normal form with document themes in the
index.
When a match is found, ConText uses the weight of the matched
theme to compute a score that reflects how relevant the match is to the
query; the higher the score, the more relevant the hit. ConText returns
the matched document as part of the hitlist.
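A two-step query is usually completed by reading the hitlist back from the result table. The following is a minimal sketch under stated assumptions: that the result table CTX_TEMP already exists with TEXTKEY and SCORE columns, and that ID is the text key column of TEXTAB. None of these details is given in this section, so check them against your own result table definition.

```sql
-- Step 1: run the theme query and store the hits in the result table.
execute ctx_query.contains('THEME_POL', 'computer software', 'CTX_TEMP');

-- Step 2: join the hitlist to the base table, most relevant hits first.
-- TEXTKEY, SCORE, and the join on ID are assumptions about the schema.
SELECT t.ID, r.SCORE
  FROM CTX_TEMP r, TEXTAB t
 WHERE r.TEXTKEY = t.ID
 ORDER BY r.SCORE DESC;
```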
You can execute theme queries in SQL*Plus using the one-step
method. To do so, the text column must be indexed by theme. The way in
which ConText matches themes and scores hits is the same as in a two-step
query.
For example, to execute a theme query on computer software:
SELECT * FROM TEXTAB WHERE CONTAINS (text, 'computer software') > 0;
For a text column that has more than one policy associated
with it, you must specify which policy to use in the CONTAINS
clause using the pol_hint parameter. You might create two policies
for a column when you want to perform both theme and text queries on the
column.
For example, if the column text had a regular text policy and a theme policy THEME_POL associated with it, you issue a theme query as follows:
SELECT ID, SCORE(0) FROM TEXTAB WHERE CONTAINS (text, 'computer software', 0, 'THEME_POL') > 0;
When you specify pol_hint, you must also specify a
placeholder (in this example 0) for the LABEL parameter.
See Also: For more information about using the pol_hint parameter in the CONTAINS function, see the specification for CONTAINS in Chapter 9.