Oracle Data Mining Concepts
10g Release 1 (10.1)

Part Number B10698-01
7
Data Mining Using DBMS_DATA_MINING

Data mining tasks in the ODM PL/SQL interface include model building, model testing and computing lift for a model, and model applying (scoring).

The development methodology for data mining using DBMS_DATA_MINING has two phases: analyzing the problem and the data, and then developing a PL/SQL application that performs the mining.

7.1 DBMS_DATA_MINING Application Development

After you've analyzed the problem and data, use the DBMS_DATA_MINING and DBMS_DATA_MINING_TRANSFORM packages to develop a PL/SQL application that performs the data mining:

  1. Prepare the build and scoring data, using the DBMS_DATA_MINING_TRANSFORM package, a third-party tool, or direct SQL or PL/SQL utilities, as required by the chosen mining function and algorithm. If you are building a predictive model, also prepare a test data set.

    Note that the build, test, and score data sets must be prepared in an identical manner for mining results to be meaningful.

  2. Optionally, prepare a settings table that overrides the default algorithm for the mining function and the default algorithm settings.
  3. Build a mining model using the training data set.
  4. For predictive models (Classification and Regression), test the model for its accuracy and other attributes. You test the model by applying it to the test data (that is, score the test data) and computing metrics on the apply results. In other words, you compare the predictions of the model with the actual values in the test data set.
  5. Retrieve the model signature to determine the mining attributes required by a given model for scoring. This information helps to verify that the scoring data is suitable for scoring. This is an optional step.
  6. Apply a classification, regression, clustering, or feature extraction model to new data to generate predictions or descriptive summaries and patterns about the data.
  7. Retrieve the model details to understand how the model scored the data in a particular manner. This is an optional step.
  8. Repeat steps 3 through 7 until you obtain satisfactory results.
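Steps 2 and 3 above can be sketched as follows. This is a minimal, hypothetical example: the training table drug_train_data, the model name, the case identifier, and the target column are all assumptions, and the settings table is the optional override from step 2.

-- Optional settings table overriding the default algorithm (step 2)
CREATE TABLE drug_build_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(128));

INSERT INTO drug_build_settings (setting_name, setting_value)
VALUES (dbms_data_mining.algo_name,
        dbms_data_mining.algo_support_vector_machines);
COMMIT;

-- Build the model on the training data (step 3)
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'drug_classification_model',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'drug_train_data',
    case_id_column_name => 'case_id',
    target_column_name  => 'responder',
    settings_table_name => 'drug_build_settings');
END;
/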

7.2 Building DBMS_DATA_MINING Models

The DBMS_DATA_MINING package creates a mining model for a mining function using a specified mining algorithm that supports the function. The algorithm can be influenced using specific algorithm settings. Model build is synchronous in the PL/SQL interface. After a model is built, there is single-user, multi-session access to the model.

7.2.1 DBMS_DATA_MINING Models

A model is identified by its name. Like tables in the database, a model has storage associated with it. The form, shape, and content of this storage is opaque to the user. However, the user can view the contents of a model -- that is, the patterns and rules that constitute a mining model -- using algorithm-specific GET_MODEL_DETAILS functions.

7.2.2 DBMS_DATA_MINING Mining Functions

The DBMS_DATA_MINING package supports Classification, Regression, Association, Clustering, and Feature Extraction. You specify the mining function as a parameter to the BUILD procedure.

7.2.3 DBMS_DATA_MINING Mining Algorithms

Each mining function can be implemented using one or more algorithms. Table 7-1 provides a list of supported algorithms. There is a default algorithm for each mining function, but you can override this default through an explicit setting in the settings table.

Table 7-1 DBMS_DATA_MINING Summary of Functions and Algorithms

Mining Function       Mining Algorithm
Classification        Naive Bayes (NB) -- default algorithm
                      Adaptive Bayes Network (ABN)
                      Support Vector Machine (SVM)
Regression            Support Vector Machine (SVM)
Association           Association Rules (AR)
Clustering            k-Means (KM)
Feature Extraction    Non-Negative Matrix Factorization (NMF)


Each algorithm has one or more settings that influence the way it builds the model. There is a default set of algorithm settings for each mining algorithm; these defaults are available through the table function GET_DEFAULT_SETTINGS. To override the defaults, you provide the choice of algorithm and the settings for that algorithm through a settings table input to the BUILD procedure.

7.2.4 DBMS_DATA_MINING Settings Table

The settings table is a simple relational table with a fixed schema. The name of the settings table can be whatever name you choose. The settings table must have exactly two columns with names and types as follows:

setting_name      VARCHAR2(30)
setting_value     VARCHAR2(128)

The values specified in a settings table override the default values. The values in the setting_name column are constants defined in the DBMS_DATA_MINING package. The values in the setting_value column are either predefined constants or actual numerical values corresponding to the setting itself. Because the setting_value column is of type VARCHAR2, you must cast numerical inputs to string using the TO_CHAR() function before inserting them into the settings table.

The following example shows how to create a settings table for an SVM classification model, and edit the individual values using SQL DML.

CREATE TABLE drugstore_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(128));
-- override the default for complexity factor for SVM Classification
INSERT INTO drugstore_settings (setting_name, setting_value)
VALUES (dbms_data_mining.svms_complexity_factor, TO_CHAR(0.081));
COMMIT;

The table function GET_DEFAULT_SETTINGS returns all the default settings for mining functions and algorithms. If you intend to override all the default settings, you can create a seed settings table from these defaults and edit it using appropriate DML.

CREATE TABLE drugstore_settings AS
SELECT setting_name, setting_value
  FROM TABLE(DBMS_DATA_MINING.GET_DEFAULT_SETTINGS)
 WHERE setting_name LIKE 'SVMS_%';
-- update the values using appropriate DML
-- update the values using appropriate DML

You can also create a settings table based on another model's settings using GET_MODEL_SETTINGS, as shown in the example below.

CREATE TABLE my_new_model_settings AS
SELECT setting_name, setting_value
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_SETTINGS('my_other_model'));

7.2.4.1 DBMS_DATA_MINING Prior Probabilities Table

Priors, or prior probabilities, are discussed in Section 3.1.2. You can specify the priors in a prior probabilities table as an optional function setting when building classification models.

The prior probabilities table has a fixed schema. For numerical targets, use the following schema:

target_value         NUMBER
prior_probability    NUMBER

For categorical targets, use the following schema:

target_value         VARCHAR2
prior_probability    NUMBER

Specify the name of the prior probabilities table as input to the setting_value column in the settings table, with the corresponding value for the setting_name column set to DBMS_DATA_MINING.clas_priors_table_name, as shown below:

INSERT INTO drugstore_settings (setting_name, setting_value)
VALUES (DBMS_DATA_MINING.clas_priors_table_name, 'census_priors');
COMMIT;
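For instance, a priors table matching the categorical schema above might be populated as follows. The table name census_priors, the VARCHAR2 length, and the target values are illustrative assumptions:

CREATE TABLE census_priors (
  target_value      VARCHAR2(128),
  prior_probability NUMBER);

-- known population distribution of the target classes
INSERT INTO census_priors VALUES ('buyer',    0.05);
INSERT INTO census_priors VALUES ('nonbuyer', 0.95);
COMMIT;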

7.2.4.2 DBMS_DATA_MINING Cost Matrix Table

Costs are discussed in Section 3.1.1. You specify costs in a cost matrix table. The cost matrix table has a fixed schema. For numerical targets, use the following schema:

actual_target_value       NUMBER
predicted_target_value    NUMBER
cost                      NUMBER

For categorical targets, use the following schema:

actual_target_value       VARCHAR2
predicted_target_value    VARCHAR2
cost                      NUMBER

The DBMS_DATA_MINING package enables you to evaluate the cost of predictions from classification models in an iterative manner during the experimental phase of mining, and to eventually apply the optimal cost matrix to predictions on the actual scoring data in a production environment.

The data input to each COMPUTE procedure in the package is the result generated from applying the model on test data. If you provide a cost matrix as an input, the COMPUTE procedure generates test results taking the cost matrix into account. This enables you to experiment with various costs for a given prediction against the same APPLY results, without rebuilding the model and applying it against the same test data for every iteration.

Once you arrive at an optimal cost matrix, you can input this cost matrix to the RANK_APPLY procedure along with the results of APPLY on your scoring data. RANK_APPLY will provide your new data ranked by cost.
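A sketch of this workflow follows. The table names and target values are hypothetical, and the RANK_APPLY parameter list shown here should be verified against the DBMS_DATA_MINING chapter of PL/SQL Packages and Types Reference:

-- cost matrix for a binary target (categorical schema)
CREATE TABLE drug_cost_matrix (
  actual_target_value    VARCHAR2(128),
  predicted_target_value VARCHAR2(128),
  cost                   NUMBER);

-- misclassifying a buyer as a nonbuyer is five times as costly
INSERT INTO drug_cost_matrix VALUES ('buyer',    'buyer',    0);
INSERT INTO drug_cost_matrix VALUES ('buyer',    'nonbuyer', 5);
INSERT INTO drug_cost_matrix VALUES ('nonbuyer', 'buyer',    1);
INSERT INTO drug_cost_matrix VALUES ('nonbuyer', 'nonbuyer', 0);
COMMIT;

-- rank existing APPLY results by cost, keeping the top 2 predictions per case
BEGIN
  DBMS_DATA_MINING.RANK_APPLY(
    apply_result_table_name     => 'drug_apply_results',
    case_id_column_name         => 'case_id',
    score_column_name           => 'prediction',
    score_criterion_column_name => 'probability',
    ranked_apply_table_name     => 'drug_ranked_results',
    top_N                       => 2,
    cost_matrix_table_name      => 'drug_cost_matrix');
END;
/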

7.3 DBMS_DATA_MINING Mining Operations and Results

There are several sets of mining operations supported by the DBMS_DATA_MINING package:

The first set consists of DDL-like operations; the last set consists of utilities. The remaining sets are query-like operations in that they do not modify the model.

In addition to the operations, the following capabilities are also provided as part of the Oracle Data Mining installation:

Mining results are either returned as result sets or persisted as fixed schema tables.

7.3.1 DBMS_DATA_MINING Build Results

The BUILD operation creates a mining model. The GET_MODEL_DETAILS functions for each supported algorithm permit you to view the model. In addition, GET_MODEL_SIGNATURE and GET_MODEL_SETTINGS provide descriptive information about the model.
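For example, for a Naive Bayes model you could inspect the model contents with the algorithm-specific details function; the model name here is hypothetical, and GET_MODEL_DETAILS_NB is the Naive Bayes variant:

SELECT *
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_NB('drug_classification_model'));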

7.3.2 DBMS_DATA_MINING Apply Results

APPLY creates and populates a fixed schema table with a given name. The schema of this table varies based on the particular mining function, algorithm, and target attribute type -- numerical or categorical.

RANK_APPLY takes an APPLY result table as input and generates another table with results ranked based on a top-N input, and for classification models, also based on cost. The schema of this table varies based on the particular mining function, algorithm, and the target attribute type -- numerical or categorical.

7.3.3 Evaluating DBMS_DATA_MINING Classification Models

DBMS_DATA_MINING includes the following procedures for testing classification models: COMPUTE_CONFUSION_MATRIX, COMPUTE_LIFT, and COMPUTE_ROC.

These procedures are described in the DBMS_DATA_MINING chapter of PL/SQL Packages and Types Reference.

The rest of this section describes confusion matrix, lift, and receiver operating characteristics.

7.3.3.1 Confusion Matrix

ODM supports the calculation of a confusion matrix to assess the accuracy of a classification model. A confusion matrix is a two-dimensional square matrix. The row indexes of a confusion matrix correspond to actual values observed and used for model testing; the column indexes correspond to predicted values produced by applying the model to the test data. For any pair of actual/predicted indexes, the value indicates the number of records classified in that pairing. For example, in Figure 7-1, a value of 25 for an actual value index of "buyer" and a predicted value index of "nonbuyer" indicates that the model incorrectly classified a "buyer" as a "nonbuyer" 25 times. A value of 516 for an actual/predicted value index of "buyer" indicates that the model correctly classified a "buyer" 516 times.

In the confusion matrix of Figure 7-1, the predictions were correct 516 + 725 = 1241 times, and incorrect 25 + 10 = 35 times. The sum of the values in the matrix is equal to the number of scored records in the input data table. The number of scored records is the sum of correct and incorrect predictions, which is 1241 + 35 = 1276. The error rate is 35/1276 = 0.0274; the accuracy rate is 1241/1276 = 0.9725.

A confusion matrix provides a quick understanding of model accuracy and the types of errors the model makes when scoring records. It is the result of a test task for classification models.

Figure 7-1 Confusion Matrix

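A confusion matrix such as the one in Figure 7-1 can also be derived directly from APPLY results with a query along these lines; the table and column names (drug_apply_results, drug_test_targets, case_id, prediction, actual_target_value) are hypothetical:

-- count records for each actual/predicted pairing
SELECT B.actual_target_value AS actual,
       A.prediction          AS predicted,
       COUNT(*)              AS cnt
  FROM drug_apply_results A,
       drug_test_targets  B
 WHERE A.case_id = B.case_id
 GROUP BY B.actual_target_value, A.prediction
 ORDER BY actual, predicted;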

7.3.3.2 Lift

ODM supports computing lift for a classification model. Lift can be computed for both binary (2 values) target fields and multiclass (more than 2 values) target fields. Given a designated positive target value (that is, the value of most interest for prediction, such as "buyer," or "has disease"), test cases are sorted according to how confidently they are predicted to be positive cases. Positive cases with highest confidence come first, followed by positive cases with lower confidence. Negative cases with lowest confidence come next, followed by negative cases with highest confidence. Based on that ordering, they are partitioned into quantiles, and the following statistics are calculated:

Cumulative targets can be computed from the quantities that are available in the LiftResultElement using the following formula:

targets_cumulative = lift_cumulative * percentage_records_cumulative
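Applied to a lift result table, the formula above can be evaluated with a query like this sketch; the table name drug_lift_results and the quantile column are assumptions, while the other column names are the quantities in the formula:

SELECT quantile_number,
       lift_cumulative * percentage_records_cumulative AS targets_cumulative
  FROM drug_lift_results
 ORDER BY quantile_number;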

7.3.3.3 Receiver Operating Characteristics

Another useful method for evaluating classification models is Receiver Operating Characteristics (ROC) analysis. ROC curves are similar to lift charts in that they provide a means of comparison between individual models and help determine thresholds that yield a high proportion of positive hits. Specifically, ROC curves aid users in selecting samples by minimizing error rates. ROC was originally used in signal detection theory to gauge the true hit versus false alarm ratio when sending signals over a noisy channel.

The horizontal axis of an ROC graph measures the false positive rate as a percentage. The vertical axis shows the true positive rate. The top left hand corner is the optimal location in an ROC curve, indicating high TP (true-positive) rate versus low FP (false-positive) rate. The ROC Area Under the Curve is useful as a quantitative measure for the overall performance of models over the entire evaluation data set. The larger this number is for a specific model, the better. However, if the user wants to use a subset of the scored data, the ROC curves help in determining which model will provide the best results at a specific threshold.

In the example graph in Figure 7-2, Model A clearly has a higher ROC Area Under the Curve for the entire data set. However, if the user decides that a false positive rate of 40% is the maximum acceptable, Model B is better suited, since it achieves a better true positive rate at that false positive rate.

Figure 7-2 Receiver Operating Characteristics Curves


Besides model selection, ROC analysis also helps to determine a threshold value that achieves an acceptable trade-off between the hit (true positive) rate and the false alarm (false positive) rate. By selecting a point on the curve for a given model, you choose a given trade-off. This threshold can then be used as a post-processing step to achieve the desired performance with respect to the error rates. ODM models use a threshold of 0.5 by default; this is the threshold behind the confusion matrix reported by the test in the ODM Java interface.

The Oracle Data Mining ROC computation calculates the following statistics: probability thresholds, true positives, true negatives, false positives, false negatives, true positive fraction, and false positive fraction.

7.3.4 Test Results for DBMS_DATA_MINING Regression Models

The most commonly used metrics for regression models are root mean square error and mean absolute error. You can use SQL queries described in the Oracle Data Mining Application Developer's Guide to compute those metrics.

The regression test results provide measures of model accuracy: root mean square error and mean absolute error of the prediction.

7.3.4.1 Root Mean Square Error

The following query calculates the root mean square error.

SELECT SQRT(AVG((A.prediction - B.target_column_name) *
                (A.prediction - B.target_column_name))) rmse
  FROM apply_results_table A,
       targets_table B
 WHERE A.case_id_column_name = B.case_id_column_name;

7.3.4.2 Mean Absolute Error

Given the targets table generated from the test data with the schema:

(case_id_column_name VARCHAR2,
target_column_name   NUMBER)

and apply results table for regression with the schema:

(case_id_column_name VARCHAR2,
prediction           NUMBER)

and an (optional) normalization table with the schema:

(attribute_name      VARCHAR2(30),
scale                NUMBER,
shift                NUMBER)

the query for mean absolute error is:

SELECT /*+PARALLEL(T) PARALLEL(A)*/
       AVG(ABS(T.actual_value - A.prediction)) mean_absolute_error
  FROM (SELECT B.case_id_column_name,
               (B.target_column_name * N.scale + N.shift) actual_value
          FROM targets_table B,
               normalization_table N
         WHERE N.attribute_name = 'target_column_name') T,
       apply_results_table_name A
 WHERE A.case_id_column_name = T.case_id_column_name;

Fill in the placeholder column and table names (for example, target_column_name) with the actual names you chose. If the data is not normalized, eliminate the normalization references from the subquery.

7.4 DBMS_DATA_MINING Model Export and Import

Oracle supports data mining model export and import between Oracle databases or schemas to provide a way to move models. DBMS_DATA_MINING does not support model export and import via PMML.

Model export/import is supported at different levels, as follows:

DBMS_DATA_MINING supports export and import of models based on Oracle Data Pump. When you export a model, the tables that constitute the model and the associated metadata are written to a dump file set that consists of one or more files. When you import a model, the tables and metadata are retrieved from the file set and restored in the new database.
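A sketch of this operation follows, using the EXPORT_MODEL and IMPORT_MODEL procedures. The dump file name, the directory object DM_DUMP, and the model name are assumptions, and the exact parameter lists should be verified against the DBMS_DATA_MINING chapter of PL/SQL Packages and Types Reference:

-- export a single model to a dump file in a directory object
BEGIN
  DBMS_DATA_MINING.EXPORT_MODEL(
    filename     => 'drug_models.dmp',
    directory    => 'DM_DUMP',
    model_filter => 'name = ''DRUG_CLASSIFICATION_MODEL''');
END;
/

-- import the models in the dump file into another schema or database
BEGIN
  DBMS_DATA_MINING.IMPORT_MODEL(
    filename  => 'drug_models.dmp',
    directory => 'DM_DUMP');
END;
/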

For information about requirements for the application system, see Chapter 9.

For detailed information about model export and import, see the Oracle Data Mining Application Developer's Guide.