Oracle Data Mining Concepts
10g Release 1 (10.1)

Part Number B10698-01

8 Text Mining Using Oracle Data Mining

Oracle provides support for text mining in two products: Oracle Data Mining (ODM) and Oracle Text.

The support for text data in ODM is different from that provided by Oracle Text, which is dedicated to text document processing. ODM allows the combination of text and non-text (traditional categorical and numerical) columns of data to enable clustering, classification, and feature extraction.

Support for text mining is new in ODM. Text is the first type of unstructured data supported by ODM. The approach ODM takes to text can also be used to integrate other unstructured data, such as images and audio files.

Table 8-1 summarizes how DBMS_DATA_MINING, the ODM Java interface, and Oracle Text support text mining.

Oracle Data Mining Application Developer's Guide contains a case study that mines a combination of text data and non-text data.

8.1 What Text Mining Is

Text mining is conventional data mining done using "text features." Text features are usually keywords, frequencies of words, or other document-derived features. Once you derive text features, you mine them just as you would any other data.

Some of the applications for text mining include:

8.1.1 Document Classification

Document classification, also known as document categorization, is the process of assigning documents to categories (for example, themes or subjects). A particular document may fit into two or more different categories. This type of classification can often be represented as a multi-target classification problem where a predictive model is built for each category.
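As a rough illustration of the multi-target framing, the following Python sketch turns a set of multi-category documents into one binary target per category, which is the form in which a separate predictive model could be built for each category. The document IDs, category names, and the `binary_targets` helper are hypothetical and are not part of ODM.

```python
def binary_targets(doc_categories):
    """Return {category: {doc_id: 0 or 1}}, one binary target per category."""
    categories = sorted({c for cats in doc_categories.values() for c in cats})
    return {
        category: {doc: int(category in cats) for doc, cats in doc_categories.items()}
        for category in categories
    }

# Hypothetical documents; a document may belong to several categories.
doc_categories = {
    "doc1": {"finance", "energy"},
    "doc2": {"sports"},
    "doc3": {"finance"},
}

targets = binary_targets(doc_categories)
for category, labels in targets.items():
    print(category, labels)
```

Each entry in `targets` is an ordinary two-class problem, so a document that fits several categories simply has a 1 in more than one of them.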

8.1.2 Combining Text and Numerical Data

In some classes of problems, text is combined with numerical data. For example, patient records or other clinical records usually contain both numerical data (temperature, blood pressure, etc.) and text data (physician's notes). In such a case, you can use ODM to perform mining on the numerical data, the text data, or both the numerical and the text data combined.

If you wish to combine both text and numerical data for mining, you must use some appropriate method to convert the unstructured data (the text) to numerical data. You convert text to numerical data by generating numbers that characterize the document. For example, you might count the number of occurrences of certain important words.
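The word-counting idea can be sketched in a few lines of Python. This is only an illustration of the concept, not the way ODM extracts features: the record fields, the keyword list, and the `text_features` helper are all assumptions made for the example.

```python
import re
from collections import Counter

def text_features(text, keywords):
    """Count how often each keyword occurs in text (case-insensitive)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {f"count_{word}": counts[word] for word in keywords}

# Hypothetical clinical record mixing numerical fields and free text.
record = {
    "temperature": 38.5,
    "systolic_bp": 120,
    "notes": "Patient reports fever and cough. Fever began two days ago.",
}
keywords = ["fever", "cough", "rash"]

# Build one row of purely numerical attributes from both kinds of data.
row = {k: v for k, v in record.items() if k != "notes"}
row.update(text_features(record["notes"], keywords))
print(row)
```

The resulting row contains only numbers, so the text-derived counts can be mined alongside the original numerical columns.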

The DBMS_DATA_MINING_TRANSFORM package includes a procedure for extracting text features that gives you a great deal of control over how features are treated. These features can be used in either ODM interface. The ODM Java interface automatically converts TEXT columns, but it does not provide any control over how the features are generated.

8.2 ODM Technologies Supporting Text Mining

ODM provides infrastructure for developing data mining applications suitable for addressing a variety of business problems involving text. Among these, the following specific technologies provide key elements for addressing problems that require text mining: classification, clustering, feature extraction, association, and regression.

The technologies that are most used in text mining are classification, clustering, and feature extraction.

8.2.1 Classification and Text Mining

A large number of document classification applications fall into one of the following:

Support vector machines (SVMs) are powerful classifiers that have been used successfully in document classification applications. SVMs can deal with thousands of features and are easy to train with small or large amounts of data. SVMs are known to work well with text data. For more information about SVMs, see Chapter 3.

8.2.2 Clustering and Text Mining

Clustering is used heavily in text mining; the main applications of clustering in text mining are

Clustering can also be used to group textual information with other indications from business databases to provide novel insights.

The current release of ODM adds support for clustering text features using the DBMS_DATA_MINING package.
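To make the idea of clustering text features concrete, here is a minimal k-means sketch in plain Python. It is not the ODM implementation and does not use DBMS_DATA_MINING; the feature vectors (hypothetical counts of two words per document), the `kmeans` helper, and the choice to seed centroids with the first k points are all assumptions made for illustration.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids (an arbitrary choice)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical text features: counts of ("sheep", "stock") in four documents.
docs = [[5, 0], [0, 4], [4, 1], [1, 5]]
labels, centroids = kmeans(docs, k=2)
print(labels)
print(centroids)
```

Documents 1 and 3 share one cluster and documents 2 and 4 share the other, because their word counts are close under the distance measure used here.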

8.2.3 Feature Extraction and Text Mining

There are two kinds of problems for which feature extraction is useful:

Non-negative matrix factorization (NMF) is a new feature in release 10.1 of ODM. NMF has been found to provide superior text retrieval when compared to SVD and other traditional decomposition methods. NMF takes as input a term-document matrix and generates a set of topics that represent weighted sets of co-occurring terms. The discovered topics form a basis that provides an efficient representation of the original documents. For more information about NMF, see Chapter 4, "Descriptive Data Mining Models".
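The factorization can be sketched as approximating a non-negative term-document matrix V by the product of two smaller non-negative matrices, W (terms by topics) and H (topics by documents). The Python sketch below uses the widely known multiplicative update rules for the squared-error objective; whether ODM uses this particular scheme is not stated here, and the matrix values, term names, and helper functions are hypothetical.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Sum of squared differences between v and the product w * h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative v (n x m) into w (n x k) and h (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # Update h: h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # Update w: w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(matmul(w, h), ht)
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
    return w, h

# Hypothetical term-document matrix: rows are terms, columns are documents.
terms = ["sheep", "grass", "stock", "bond"]
v = [
    [2, 3, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 2, 2],
]

w, h = nmf(v, k=2)
for topic in range(2):
    weights = ", ".join(f"{term}={w[i][topic]:.2f}" for i, term in enumerate(terms))
    print(f"topic {topic}: {weights}")
```

Each column of `w` is a topic expressed as term weights, and each column of `h` describes one document as a mix of those topics.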

8.2.4 Association and Regression and Text Mining

Association models can be used to uncover the semantic meaning of words. For example, suppose that the word sheep co-occurs with words like sleep, fence, chew, grass, meadow, farmer, and shear. An association model would include rules connecting sheep with these concepts. Inspection of the rules would provide context for sheep in the document collection. Such associations can improve information retrieval engines.
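A simplified view of such rules can be sketched by counting how often words appear in the same document. The Python example below computes the confidence of one-word rules from co-occurrence counts; it is an illustration only, not the association algorithm ODM uses, and the documents, the threshold, and the `cooccurrence_rules` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_rules(documents, min_confidence=0.5):
    """Return {(lhs, rhs): confidence} for rules lhs -> rhs from co-occurrence."""
    word_docs = Counter()   # number of documents containing each word
    pair_docs = Counter()   # number of documents containing each word pair
    for words in documents:
        unique = set(words)
        word_docs.update(unique)
        pair_docs.update(combinations(sorted(unique), 2))
    rules = {}
    for (a, b), n in pair_docs.items():
        for lhs, rhs in ((a, b), (b, a)):
            confidence = n / word_docs[lhs]
            if confidence >= min_confidence:
                rules[(lhs, rhs)] = confidence
    return rules

# Hypothetical document collection, already reduced to keywords.
documents = [
    ["sheep", "grass", "meadow"],
    ["sheep", "farmer", "shear"],
    ["sheep", "grass", "fence"],
    ["farmer", "market"],
]

rules = cooccurrence_rules(documents)
for (lhs, rhs), conf in sorted(rules.items()):
    if lhs == "sheep":
        print(f"{lhs} -> {rhs}: confidence {conf:.2f}")
```

Here confidence is the fraction of documents containing the left-hand word that also contain the right-hand word; rules below the threshold are dropped.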

Regression is most often used in problems that combine text with other types of data.

8.3 Oracle Support for Text Mining

Table 8-1 summarizes how the ODM Java interface, DBMS_DATA_MINING (the ODM PL/SQL package), and Oracle Text support text mining functions.

Table 8-1 Text Mining Comparison

| Feature | ODM Java interface | DBMS_DATA_MINING | Oracle Text |
|---|---|---|---|
| Association | No support | Text data only or text and non-text data | No support |
| Clustering | No support for text data | k-means algorithm supports text only or text and non-text data | k-means algorithm supports text only |
| Attribute importance | No support for text data | No support for text data | No support |
| Regression | Support vector machines (SVM) support text data only or text and non-text data | Support vector machines (SVM) support text data only or text and non-text data | No support |
| Classification | SVM supports text only or text and non-text data. Support for assigning documents to one of many labels | SVM supports text only or text and non-text data. Support for assigning documents to one of many labels | SVM and decision trees support text only. Support for assigning documents to one of many labels and also for assigning documents to multiple labels at the same time |
| Feature extraction (basic features) | Feature extraction is done internally; the results are not exposed. Does not provide a high level of control for feature extraction | Exposes the feature extraction that Oracle Text performs; allows the same degree of control as Oracle Text | Feature extraction is done internally; the results are not exposed |
| Feature extraction (higher order features) | Non-negative matrix factorization (NMF) supports text or text and non-text data | Non-negative matrix factorization (NMF) supports text or text and non-text data | No support |
| Record apply | No support for record apply of text columns | No support for record apply | Supports record apply for classification |
| Support for TEXT columns | Accepts a TEXT column for mining | Features extracted from a column of type CLOB, BLOB, BFILE, LONG, VARCHAR2, XMLType, CHAR, RAW, LONG RAW using an appropriate transformation | Accepts table columns of type CLOB, BLOB, BFILE, LONG, VARCHAR2, XMLType, CHAR, RAW, LONG RAW |