Oracle® Text Reference 10g Release 1 (10.1) Part Number B10730-01
This chapter provides reference information for using the CTX_CLS PL/SQL package. This package enables you to perform document classification.
Name | Description |
---|---|
TRAIN | Generates rules that define document categories. Output is based on the input training document set. |
CLUSTERING | Generates clusters for a document collection. |
TRAIN
Use this procedure to generate query rules that select document categories. You must supply a training set consisting of categorized documents. Documents can be in any format supported by Oracle Text and must belong to one or more categories. This procedure generates the queries that define the categories and then writes the results to a table.
You must also have a document table and a category table. The category table must contain at least two categories.
For example, your document and category tables can be defined as:
create table trainingdoc(
  docid number primary key,
  text varchar2(4000));
create table category (
  docid number references trainingdoc(docid),
  categoryid number);
You can use one of two syntaxes depending on the classification algorithm you need. The query-compatible syntax uses the RULE_CLASSIFIER preference and generates rules as query strings. The support vector machine syntax uses the SVM_CLASSIFIER preference and generates rules in binary format. The SVM_CLASSIFIER offers high classification accuracy, but because its rules are generated in binary format, they cannot be examined like the query strings generated with the RULE_CLASSIFIER. Note that only those document ids that appear in both the document table and the category table affect RULE_CLASSIFIER and SVM_CLASSIFIER learning.
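Because only document ids present in both tables affect training, it can help to check the training set before calling the procedure. The queries below are a minimal sketch against the example trainingdoc and category tables defined above; they are ordinary SQL and not part of the CTX_CLS package.

```sql
-- Training documents with no category assignment (these do not affect training).
select d.docid
  from trainingdoc d
 where not exists (select 1 from category c where c.docid = d.docid);

-- Category rows whose document id is missing from the document table.
select c.docid, c.categoryid
  from category c
 where not exists (select 1 from trainingdoc d where d.docid = c.docid);

-- Number of distinct categories; the category table needs at least two.
select count(distinct categoryid) as category_count
  from category;
```

If the first two queries return no rows and the count is two or more, every row in both tables takes part in training.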
The CTX_CLS.TRAIN procedure requires that your document table have an associated context index. For best results, the index should be synchronized before running this procedure. SVM_CLASSIFIER syntax enables the use of an unpopulated context index, while query-compatible syntax requires that the context index be populated.
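As an illustration of the index requirement, the following sketch creates a context index on the example trainingdoc table and synchronizes it. The index name trainingdoc_idx is hypothetical, and the trailing slash assumes a SQL*Plus-style client.

```sql
-- Populated context index: required by the query-compatible
-- (RULE_CLASSIFIER) syntax.
create index trainingdoc_idx on trainingdoc(text)
  indextype is ctxsys.context;

-- Synchronize the index after later inserts or updates to trainingdoc,
-- so that training sees the current documents.
begin
  ctx_ddl.sync_index('trainingdoc_idx');
end;
/

-- Alternative for SVM_CLASSIFIER only: an unpopulated index.
-- create index trainingdoc_idx on trainingdoc(text)
--   indextype is ctxsys.context parameters ('nopopulate');
```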
Query Compatible Syntax
The following syntax generates query-compatible rules and is used with the RULE_CLASSIFIER preference. Use this syntax and preference when different categories are separated from others by several key words. An advantage of generating your rules as query strings is that you can easily examine the generated rules. This is different from generating SVM rules, which are in binary format.
CTX_CLS.TRAIN(
  index_name      in varchar2,
  doc_id          in varchar2,
  cattab          in varchar2,
  catdocid        in varchar2,
  catid           in varchar2,
  restab          in varchar2,
  rescatid        in varchar2,
  resquery        in varchar2,
  resconfid       in varchar2,
  preference_name in varchar2 DEFAULT NULL
);
index_name
Specify the name of the context index associated with your document training set.
doc_id
Specify the name of the document id column in the document table. This column must contain unique document ids. This column must be a NUMBER.
cattab
Specify the name of the category table. You must have SELECT privilege on this table.
catdocid
Specify the name of the document id column in the category table. The document ids in this table must also exist in the document table. This column must be a NUMBER.
catid
Specify the name of the category ID column in the category table. This column must be a NUMBER.
restab
Specify the name of the result table. You must have INSERT privilege on this table.
rescatid
Specify the name of the category ID column in the result table. This column must be a NUMBER.
resquery
Specify the name of the query column in the result table. This column must be VARCHAR2, CHAR, CLOB, NVARCHAR2, or NCHAR.
The queries generated in this column connect terms with AND or NOT operators, such as:
'T1 & T2 ~ T3'
Terms can also be theme tokens and be connected with the ABOUT operator, such as:
'about(T1) & about(T2) ~ about(T3)'
Generated rules also support WITHIN queries on field sections.
resconfid
Specify the name of the confidence column in the result table. This column contains the probability, estimated from the training data, that a document is relevant if that document satisfies the query.
preference_name
Specify the name of the preference. For classifier types and attributes, see "Classifier Types" in Chapter 2, "Oracle Text Indexing Elements".
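A call using the query-compatible syntax might look like the following sketch. It assumes the example trainingdoc and category tables shown earlier and a populated context index on trainingdoc.text; the names trainingdoc_idx, rule_results, and my_rule_pref are hypothetical. Arguments are passed positionally, in the order of the syntax above.

```sql
-- Result table: one row per generated rule (table and column names are hypothetical).
create table rule_results (
  catid number,
  query varchar2(4000),
  conf  number);

begin
  -- Preference of type RULE_CLASSIFIER, left at its default attribute values.
  ctx_ddl.create_preference('my_rule_pref', 'RULE_CLASSIFIER');

  ctx_cls.train(
    'trainingdoc_idx',   -- context index on the document table
    'docid',             -- document id column in the document table
    'category',          -- category table
    'docid',             -- document id column in the category table
    'categoryid',        -- category ID column in the category table
    'rule_results',      -- result table
    'catid',             -- category ID column in the result table
    'query',             -- query column in the result table
    'conf',              -- confidence column in the result table
    'my_rule_pref');     -- classifier preference
end;
/

-- Examine the generated rules, highest confidence first within each category.
select catid, query, conf
  from rule_results
 order by catid, conf desc;
```

Because the rules are stored as plain query strings, the final SELECT is enough to review what each category rule matches.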
Syntax for Support Vector Machine Rules
The following syntax generates support vector machine (SVM) rules with the SVM_CLASSIFIER preference. This preference generates rules in binary format. Use this syntax when your application requires high classification accuracy.
CTX_CLS.TRAIN(
  index_name      in varchar2,
  docid           in varchar2,
  cattab          in varchar2,
  catdocid        in varchar2,
  catid           in varchar2,
  restab          in varchar2,
  preference_name in varchar2
);
index_name
Specify the name of the text index.
docid
Specify the name of the document id column in the document table.
cattab
Specify the name of the category table.
catdocid
Specify the name of the document id column in the category table.
catid
Specify the name of the category ID column in the category table.
restab
Specify the name of the result table.
The result table has the following format:
Column Name | Datatype | Description |
---|---|---|
CAT_ID | NUMBER | The ID of the category. |
TYPE | NUMBER(3) NOT NULL | 0 for the actual rule or catid; 1 for other. |
RULE | BLOB | The returned rule. |
preference_name
Specify the name of the user preference. For classifier types and attributes, see "Classifier Types" in Chapter 2, "Oracle Text Indexing Elements".
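The SVM syntax can be sketched in the same way. The example below assumes the trainingdoc and category tables shown earlier and a context index on trainingdoc.text (populated or not); the names trainingdoc_idx, svm_results, and my_svm_pref are hypothetical.

```sql
-- Result table in the format described above (table name is hypothetical).
create table svm_results (
  cat_id number,
  type   number(3) not null,
  rule   blob);

begin
  -- Preference of type SVM_CLASSIFIER, left at its default attribute values.
  ctx_ddl.create_preference('my_svm_pref', 'SVM_CLASSIFIER');

  -- Positional arguments: index, document id column, category table,
  -- category table document id column, category ID column, result table,
  -- preference name.
  ctx_cls.train(
    'trainingdoc_idx', 'docid',
    'category', 'docid', 'categoryid',
    'svm_results', 'my_svm_pref');
end;
/
```

The RULE column holds binary data, so unlike the query-compatible result table it is not meant to be read directly.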
Example
The CTX_CLS.TRAIN procedure is used in supervised classification. For an extended example, see the Oracle Text Application Developer's Guide.
CLUSTERING
Use this procedure to cluster a collection of documents. A cluster is a group of documents similar to each other in content. Clustering is also known as unsupervised classification.
Given a set of documents, this procedure assigns each document into a cluster according to the similarity with documents already in the cluster. The result is that documents in a cluster are more similar to one another than documents across different clusters. The more clusters produced, the greater the accuracy and quality of each cluster; however, producing more clusters requires more computing time.
Cluster output may be flat or hierarchical. Hierarchical clustering affords greater specificity of each cluster; however, it may require more computing power. In the case where you want to produce only a few clusters, non-hierarchical clustering may suffice.
See Also: For more information about clustering, see "Cluster Types" in Chapter 2, "Oracle Text Indexing Elements", as well as the Oracle Text Application Developer's Guide.
A clustering result set is composed of document assignments and cluster descriptions. The document assignment result set contains information about the cluster to which the procedure assigned a document, and how similar the document is to the assigned cluster. This result set contains document identification, cluster identification, and similarity score between the cluster and assigned document.
The cluster description result set contains information about what topic a generated cluster is about. This result set contains cluster identification, cluster description text, suggested cluster label, number of documents assigned, and a quality score of the cluster.
There are two versions of this procedure: one with a table result set, and one with an in-memory result set.
Syntax: Table Result Set
CTX_CLS.CLUSTERING(
  index_name  IN VARCHAR2,
  docid       IN VARCHAR2,
  doctab_name IN VARCHAR2,
  clstab_name IN VARCHAR2,
  pref_name   IN VARCHAR2 DEFAULT NULL
);
index_name
Specify the name of the context index on the collection table.
docid
Specify the name of the document ID column of the collection table.
doctab_name
Specify the name of the document assignment table. This procedure creates the table with the following structure:
doc_assign(
  docid     number,
  clusterid number,
  score     number
);
Column | Description |
---|---|
DOCID | Document ID to identify document. |
CLUSTERID | ID of the cluster the document is assigned to. If CLUSTERID is -1, then the cluster contains "miscellaneous" documents; for example, documents that cannot be assigned to any other cluster category. |
SCORE | The associated score between the document and cluster. |
If you require more columns, you can create the table before you call this procedure.
clstab_name
Specify the name of the cluster description table. This procedure creates the table with the following structure:
cluster_desc(
  clusterid     NUMBER,
  descript      VARCHAR2(4000),
  label         VARCHAR2(200),
  sze           NUMBER,
  quality_score NUMBER,
  parent        NUMBER
);
Column | Description |
---|---|
CLUSTERID | Cluster ID to identify cluster. If CLUSTERID is -1, then the cluster contains "miscellaneous" documents; for example, documents that cannot be assigned to any other cluster category. |
DESCRIPT | String to describe the cluster. |
LABEL | A suggested label for the cluster. |
SZE | Number of documents assigned to this cluster. |
QUALITY_SCORE | The quality score of the cluster, higher is better. |
PARENT | The parent cluster id. A negative number means no parent cluster. |
If you require more columns, you can create the table before you call this procedure.
pref_name
Specify the name of the preference.
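The table result set version might be used as in the following sketch. The collection table, the collectionx index, and the my_cluster preference are hypothetical names introduced for illustration; KMEAN_CLUSTERING and its CLUSTER_NUM attribute are described under "Cluster Types".

```sql
-- Hypothetical collection table; load documents into it before indexing.
create table collection (
  id   number primary key,
  text varchar2(4000));

create index collectionx on collection(text)
  indextype is ctxsys.context;

begin
  -- K-means clustering preference that asks for three clusters.
  ctx_ddl.create_preference('my_cluster', 'KMEAN_CLUSTERING');
  ctx_ddl.set_attribute('my_cluster', 'CLUSTER_NUM', '3');

  -- Writes document assignments to doc_assign and cluster
  -- descriptions to cluster_desc.
  ctx_cls.clustering('collectionx', 'id', 'doc_assign', 'cluster_desc', 'my_cluster');
end;
/

-- List each cluster with its documents, best-scoring documents first.
select c.clusterid, c.label, c.sze, d.docid, d.score
  from cluster_desc c
  join doc_assign d on d.clusterid = c.clusterid
 order by c.clusterid, d.score desc;
```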
Syntax: In-Memory Result Set
You can put the result set into in-memory structures for better performance. Two in-memory tables are defined in the CTX_CLS package, one for document assignments and one for cluster descriptions.
CTX_CLS.CLUSTERING(
  index_name  IN VARCHAR2,
  docid       IN VARCHAR2,
  dids        IN DOCID_TAB,
  doctab_name IN OUT NOCOPY DOC_TAB,
  clstab_name IN OUT NOCOPY CLUSTER_TAB,
  pref_name   IN VARCHAR2 DEFAULT NULL
);
index_name
Specify the name of the context index on the collection table.
docid
Specify the document id column of the collection table.
dids
Specify the name of the in-memory docid_tab.
TYPE docid_tab IS TABLE OF number INDEX BY BINARY_INTEGER;
doctab_name
Specify the name of the document assignment in-memory table. This table is defined as follows:
TYPE doc_rec IS RECORD (
  docid     NUMBER,
  clusterid NUMBER,
  score     NUMBER
);
TYPE doc_tab IS TABLE OF doc_rec INDEX BY BINARY_INTEGER;
Column | Description |
---|---|
DOCID | Document ID to identify document. |
CLUSTERID | ID of the cluster the document is assigned to. If CLUSTERID is -1, then the cluster contains "miscellaneous" documents; for example, documents that cannot be assigned to any other cluster category. |
SCORE | The associated score between the document and cluster. |
clstab_name
Specify the name of the cluster description in-memory table. This table is defined as follows:
TYPE cluster_rec IS RECORD (
  clusterid     NUMBER,
  descript      VARCHAR2(4000),
  label         VARCHAR2(200),
  sze           NUMBER,
  quality_score NUMBER,
  parent        NUMBER
);
TYPE cluster_tab IS TABLE OF cluster_rec INDEX BY BINARY_INTEGER;
Column | Description |
---|---|
CLUSTERID | Cluster ID to identify cluster. If CLUSTERID is -1, then the cluster contains "miscellaneous" documents; for example, documents that cannot be assigned to any other cluster category. |
DESCRIPT | String to describe the cluster. |
LABEL | A suggested label for the cluster. |
SZE | Number of documents assigned to this cluster. |
QUALITY_SCORE | The quality score of the cluster, higher is better. |
PARENT | The parent cluster id. A negative number means no parent cluster. |
pref_name
Specify the name of the preference. For cluster types and attributes, see "Cluster Types" in Chapter 2, "Oracle Text Indexing Elements".
Example
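The following is a minimal sketch of the in-memory version, not an official example. It assumes a hypothetical collection table with a numeric id column and a context index named collectionx on its text column; the preference name my_mem_cluster is also hypothetical. It further assumes that dids should list the document ids to cluster, which the parameter description above does not spell out.

```sql
set serveroutput on

declare
  dids    ctx_cls.docid_tab;     -- document ids passed to the procedure
  doc_res ctx_cls.doc_tab;       -- receives document assignments
  cls_res ctx_cls.cluster_tab;   -- receives cluster descriptions
  n       binary_integer := 0;
  i       binary_integer;
begin
  -- K-means clustering preference that asks for three clusters.
  ctx_ddl.create_preference('my_mem_cluster', 'KMEAN_CLUSTERING');
  ctx_ddl.set_attribute('my_mem_cluster', 'CLUSTER_NUM', '3');

  -- Assumption: pass every document id in the collection table.
  for r in (select id from collection) loop
    n := n + 1;
    dids(n) := r.id;
  end loop;

  ctx_cls.clustering('collectionx', 'id', dids, doc_res, cls_res, 'my_mem_cluster');

  -- Walk the cluster descriptions with FIRST/NEXT, which makes no
  -- assumption about how the result table is indexed.
  i := cls_res.first;
  while i is not null loop
    dbms_output.put_line(
      cls_res(i).clusterid || ': ' || cls_res(i).label ||
      ' (' || cls_res(i).sze || ' documents)');
    i := cls_res.next(i);
  end loop;
end;
/
```

The document assignments in doc_res can be walked the same way, reading the docid, clusterid, and score fields of each record.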