Skip Headers

PL/SQL Packages and Types Reference
10g Release 1 (10.1)

Part Number B10802-01
Go to Documentation Home
Home
Go to Book List
Book List
Go to Table of Contents
Contents
Go to Index
Index
Go to Master Index
Master Index
Go to Feedback page
Feedback

Go to previous page
Previous
Go to next page
Next
View PDF

24
DBMS_DATA_MINING_TRANSFORM

The DBMS_DATA_MINING_TRANSFORM package is a set of data transformation utilities available for use with the DBMS_DATA_MINING package for preparing mining data.

This chapter contains the following topics:


Using DBMS_DATA_MINING_TRANSFORM


Overview

The DBMS_DATA_MINING_TRANSFORM package has two motivations:

The main principle behind the design of this package is the fact that SQL has enough power to perform most of the common mining transforms efficiently. For example, binning can be done using CASE expression or DECODE function, and linear normalization is a simple algebraic expression of the form (x - shift)/scale where x is the data value that is being transformed.

However, the queries that perform the transforms can be rather lengthy. So it is desirable to have some convenience routines that will help in generating queries. Thus, the goal of this package is to provide query generation services for the most common mining transforms, as well as to provide a framework that can be easily extended for implementing other transforms.

Note: Use of this package for preprocessing data input to DBMS_DATA_MINING operations is not mandatory. You can use any third-party tool or any home-grown utilities that are customized for your application.


Types

Table 24-1  Summary of Data Types
Data Type Purpose

Column_List

List of column names representing mining attributes, defined to be

VARRAY(1000) OF VARCHAR2(32);


Supported Transformation Methods

The DBMS_DATA_MINING_TRANSFORM package supports the following transformations for numerical and categorical attributes, which map to the NUMBER and VARCHAR2/CHAR Oracle data types respectively.

Binning

Binning involves mapping both continuous and discrete values to discrete values of reduced cardinality. For example, the age of persons can be binned into discrete numeric bins: 1-20 to 1, 21-40 to 2, and so on. Popular car manufacturers such as Ford, Chrysler, BMW, Volkswagen can be binned into discrete categorical bins: {Ford, Chrysler} to US_Car_Makers, and {BMW, Volkswagen} to European_Car_Makers. The package supports binning for both categorical and numerical attributes.


Note:

This chapter uses standard interval notation for number sets: [a,b] is the set of all real numbers greater than or equal to a and less than or equal to b; [a,b) is the set of all real numbers greater than or equal to a and less than b. (b is in the set [a,b]; b is not in the set [a,b).)


Top-N Frequency Categorical Binning

The bin definition for each attribute is computed based on the occurrence frequency of values that are computed from the data. The user specifies a particular number of bins, say N. Each of the bins bin1, ..., binN corresponds to the values with top frequencies. The bin binN+1 corresponds to all remaining values.

Equi-Width Numerical Binning

The bin definition for each attribute is computed based on the min and max values that are computed from the data. The user specifies a particular number of bins, say N. Each of the bins bin1,..., binN span ranges of equal width of size inc = (max - min)/N, and bin0 spans range (-inf, min) and binN+1 range (max, + inf).

Quantile Numerical Binning

The definition for each relevant column is computed based on the min values for each quantile, where quantiles are computed from the data using NTILE function. Bins bin1, ..., binN - 1 span ranges [minI, minI+1) and binN range [minN, maxN]. Bins with equal left and right boundaries are collapsed.

Normalization

Normalization involves scaling continuous values down to specific range -- as in -1.0 to 1.0 or 0.0 to 1.0 such that xnew = (xold - shift)/scale. It applies only to numerical attributes.

Min-Max Normalization

The normalization definition for each attribute is computed based on the min and max values that are computed from the data. The values for shift and scale are computed to be shift = min, and scale = (max - min) respectively.

Z-Score Normalization

The normalization definition for each attribute is computed based on the values for mean and standard deviation that are computed from the data. The values for shift and scale are computed to be shift = mean, and scale = standard deviation respectively.

Winsorizing and Trimming (Clipping)

Some computations on attribute values can be significantly affected by extreme values. One approach to achieving a more robust computation is to winsorize or trim the data as a preprocessing step. Winsorizing involves setting the tail values of a particular attribute to some specified quantile of the data, while trimming removes the tails. In other words, trimmed values are ignored in further computations. This is achieved by setting the tails to NULL. For example, for a 90% winsorization, the bottom 5% are set equal to the minimum value in the 6th percentile, while the upper 5% are set equal to the value corresponding to the maximum value in the 95th percentile.


Transformation Operations

The package provides three classes of convenience routines. The first two classes of routines define the transformation, and the last class of routines generate the queries that provide the transformed data.

Creating a Transform Definition Table

You must use CREATE_BIN_NUM or CREATE_BIN_CAT routines to create the bin definition tables. You must use CREATE_NORM_LIN to create the normalization definition table. You must use CREATE_CLIP to create the clipping definition table.

Usually, the consistency and integrity of transform definition tables is guaranteed by the creation process. Alternatively, it can be achieved by leveraging an integrity constraints mechanism. This can be done either by altering the tables created with CREATE routines, or by creating the tables manually with the necessary integrity constraints.

Defining the Transform

The most common way of defining the transform (populating the transform definition tables) for each attribute is based on data inspection using some predefined methods (also known as automatic transform definition). Some of the most popular methods have been captured by the INSERT routines in the package. For example, the z-score normalization method estimates mean and standard deviation from the data to be used as a shift and scale parameters of the linear normalization transform.

You can bin numerical attributes using INSERT_BIN_NUM_EQWIDTH or INSERT_BIN_NUM_QTILE and categorical attributes using INSERT_BIN_CAT_FREQ. You can normalize numerical attributes using INSERT_NORM_LIN_ZSCORE or INSERT_NORM_LIN_MINMAX. You can either winsorize numerical attributes using INSERT_CLIP_WINZOR_TAIL or trim them using INSERT_CLIP_TRIM_TAIL. You can invoke these routines several times to transform all relevant attributes from various data sources till the definition table fully represents all mining attributes for a given problem.

After performing automatic transform definition, some or all of the definitions can be adjusted by issuing SQL DML statements against the transform definition tables, thus providing virtually infinite flexibility in defining custom transforms.

The INSERT routines enable flexible transform definitions in several ways:

Generating the Query for the Transform

Query generation is driven by the simple transform-specific definition tables with a predefined schema. Query generation routines should be viewed as macros, and transform definition tables as parameters used in macro expansions. Similar to using #define macros in the C language, the invoker is responsible for ensuring the correctness of the expanded macro, that is, that the result is a valid SQL query.

You can generate the views representing the transformation queries for binning using XFORM_BIN_CAT and XFORM_BIN_NUM, and for normalization using XFORM_NORM_LIN, and for clipping using XFORM_CLIP.

If your data contains a combination of numerical and categorical attributes, you must essentially feed the results of one transformation step to the next step. For example, the results of XFORM_BIN_CAT can be fed to XFORM_BIN_NUM or vice versa. The order is irrelevant since numerical and categorical transforms work on disjoint sets of attributes.


Transformation Methodology

Given a dataset for a particular mining problem, any preprocessing and transformations on the mining data must be uniform across all mining operations. In other words, if the build data is preprocessed according to a particular transform definition, then it follows that the test data and the scoring data must be preprocessed using the same definition.

The general usage of routines in this package can be explained using this example. Assume that your input table for model build contains both numerical and categorical data that require to be binned. A possible sequence of operations will be:

If this happens to be a classification model, and you want to APPLY this model to scoring data, you must prepare the scoring data similar to the build data. You can achieve this in two simple steps:


Summary of DBMS_DATA_MINING_TRANSFORM Subprograms

Table 24-2  DBMS_DATA_MINING_TRANSFORM Package Subprograms
Subprogram Purpose

CREATE_BIN_CAT Procedure

Creates a categorical bin definition table

CREATE_BIN_NUM Procedure

Creates a numerical bin definition table

CREATE_CLIP Procedure

Creates a clipping definition table

CREATE_NORM_LIN Procedure

Creates a normalization definition table

INSERT_BIN_CAT_FREQ Procedure

Populates the categorical bin definition table, applying frequency-based binning on the categorical input data

INSERT_BIN_NUM_EQWIDTH Procedure

Populates the numerical bin definition table, applying equi-width binning on the numerical input data

INSERT_BIN_NUM_QTILE Procedure

Populates the numerical bin definition table, applying quantile binning on the numerical input data

INSERT_CLIP_TRIM_TAIL Procedure

Populates the clipping definition table, applying trimming based on tail fraction on the numerical input data

INSERT_CLIP_WINSOR_TAIL Procedure

Populates the clipping definition table, applying winsorizing based on tail fraction on the numerical input data

INSERT_NORM_LIN_ZSCORE Procedure

Populates the normalization definition table applying z-score normalization on the numerical input data

INSERT_NORM_LIN_MINMAX Procedure

Populates the normalization definition table, applying min-max normalization on the numerical input data

XFORM_BIN_CAT Procedure

Creates the view representing the transformed output with binned categorical data

XFORM_BIN_NUM Procedure

Creates the view representing the transformed output with binned numerical data

XFORM_CLIP Procedure

Creates the view representing the transformed output with clipped numerical data

XFORM_NORM_LIN Procedure

Creates the view representing the transformed output with normalized numerical data


CREATE_BIN_CAT Procedure

This procedure creates a categorical binning definition table. This table is used as input to INSERT_BIN_CAT_FREQ and XFORM_BIN_CAT procedures.

Syntax

DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_CAT (
     bin_table_name     IN VARCHAR2,
     bin_schema_name    IN VARCHAR2 DEFAULT NULL);

Parameters

Table 24-3  CREATE_BIN_CAT Procedure Parameters
Parameter Description

bin_table_name

Name of the bin definition table.

bin_schema_name

Name of the schema hosting the bin definition table.

Usage Notes

The generated bin definition table will have the predefined schema:

col  VARCHAR2(30)
val  VARCHAR2(4000)
bin  VARCHAR2(4000)

Examples

BEGIN
   DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_CAT('build_bin_cat_table');
END;

CREATE_BIN_NUM Procedure

This procedure creates a numerical binning definition table. This table is used as input to INSERT_BIN_NUM_EQWIDTH, INSERT_BIN_NUM_QTILE, and XFORM_BIN_NUM procedures.

Syntax

DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_NUM (
     bin_table_name    IN VARCHAR2,
     bin_schema_name   IN VARCHAR2 DEFAULT NULL);

Parameters

Table 24-4  CREATE_BIN_NUM Procedure Parameters
Parameter Description

bin_table_name

Name of the bin definition table.

bin_schema_name

Name of the schema hosting the bin definition table.

Usage Notes

The generated bin definition table will have the predefined schema:

col   VARCHAR2(30)
val   NUMBER
bin   VARCHAR2(4000)

Examples

BEGIN

DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_NUM('build_bin_num_table');
END;

CREATE_CLIP Procedure

This procedure creates a clipping definition table. This table is used as input to INSERT_CLIP_WINSOR_TAIL, INSERT_CLIP_TRIM_TAIL, and XFORM_CLIP procedures.

Syntax

DBMS_DATA_MINING_TRANSFORM.CREATE_CLIP (
     clip_table_name    IN VARCHAR2,
     clip_schema_name   IN VARCHAR2 DEFAULT NULL);

Parameters

Table 24-5  CREATE_CLIP Procedure Parameters
Parameter Description

clip_table_name

Name of the clipping definition table.

clip_schema_name

Name of the schema hosting the clipping definition table.

Usage Notes

The generated clipping definition table will have the predefined schema:

col   VARCHAR2(30)
lcut  NUMBER
lval  NUMBER
rcut  NUMBER
rval  NUMBER

Examples

BEGIN
  DBMS_DATA_MINING_TRANSFORM.CREATE_CLIP('build_clip_table');
END;

CREATE_NORM_LIN Procedure

This procedure creates a linear normalization definition table. This table is used as input to INSERT_NORM_LIN_MINMAX, INSERT_NORM_LIN_ZSCORE, XFORM_NORM_LIN procedures.

Syntax

DBMS_DATA_MINING_TRANSFORM.CREATE_NORM_LIN (
     norm_table_name       IN VARCHAR2,
     norm_schema_name      IN VARCHAR2 DEFAULT NULL);

Parameters

Table 24-6  CREATE_NORMALIZE_LIN Procedure Parameters
Parameter Description

norm_table_name

Name of the normalization definition table.

norm_schema_name

Name of the schema hosting the normalization definition table.

Usage Notes

The generated normalization table will have the predefined schema:

col      VARCHAR2(30)
shift    NUMBER
scale    NUMBER

Examples

BEGIN

DBMS_DATA_MINING_TRANSFORM.CREATE_NORM_LIN('build_norm_table');
END;

INSERT_BIN_CAT_FREQ Procedure

This procedure finds the categorical binning definition for every VARCHAR2, CHAR column in the data table that is not specified in the exclusion list and inserts the definition into the categorical binning definition table created using CREATE_BIN_CAT.

Definition for each relevant column is computed based on the occurrence frequency of column values that are computed from the data table. Each of the bin_num(N) bins bin1, ..., binN corresponds to the values with top frequencies when N > 0 or bottom frequencies when N < 0, and binN+1 to all remaining values, where binI = I. Ordering ties among identical frequencies are broken by ordering on column values (ASC for N > 0 or DESC for N < 0). When the number of distinct values C < N only C+1 bins will be created.

The parameter default_num (D) is used for pruning based on the number of values that fall into the default bin. When D > 0 only columns that have at least D defaults are kept while others are ignored. When D < 0 only columns that have at most D values are kept. No pruning is done when D is NULL or D = 0. Parameter bin_support (SUP) is used for restricting bins to frequent (SUP > 0) values frq >= SUP*tot, or infrequent (SUP < 0) ones frq <= -SUP*tot, where frq is a given value count and tot is a sum of all counts as computed from the data. No support filtering is done when SUP is NULL or when SUP = 0.

Syntax

DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_CAT_FREQ (
     bin_table_name       IN VARCHAR2,
     data_table_name      IN VARCHAR2,
     bin_num              IN PLS_INTEGER DEFAULT 9,
     exclude_list         IN Column_List DEFAULT NULL,
     default_num          IN PLS_INTEGER DEFAULT 2,
     bin_support          NUMBER DEFAULT NULL,
     bin_schema_name      IN VARCHAR2 DEFAULT NULL,
     data_schema_name     IN VARCHAR2 DEFAULT NULL);

Parameters

Table 24-7  INSERT_BIN_CAT_FREQ Procedure Parameters
Parameter Description

bin_table_name

Name of the categorical bin table generated using CREATE_BIN_CAT procedure.

data_table_name

Name of the table containing the data.

bin_num

Number of bins.

exclude_list

List of columns (attributes) to be excluded from this iteration of the binning proces.s

default_num

Number of default values.

bin_support

Bin support as a fraction.

bin_schema_name

Name of the schema hosting the bin definition table.

data_schema_name

Name of the schema hosting the table with data.

Usage Notes

For a given input table, you can iteratively call this routine several times with different specifications for number of bins for a given input table. For each iteration, you can selectively exclude attributes (that is, column names) using the exclude_list parameter for a particular binning specification.

Columns with all NULLs are ignored. No bin definitions are populated when bin_num = 0, or bin_num, is NULL.

Examples

The simplest invocation of this routine populates bin definitions in the cat_bin_table for all the categorical attributes found in build_table.


BEGIN

DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM(
'cat_bin_table', 'build_table');
END; /

INSERT_BIN_NUM_EQWIDTH Procedure

This procedure finds the numerical binning definition for every NUMBER column in the data table that is not specified in the exclusion list and inserts the definition into the numerical binning definition table that was created using CREATE_BIN_NUM.

Definition for each relevant column is computed based on the min and max values that are computed from the data table. Each of the bin_num(N) bins bin1,..., binN span ranges of equal width inc = (max - min) / N where binI = I when N > 0 or binI = N+1-I when N < 0, and bin0 = binN+1 = NULL. The values of val column are rounded to round_num significant digits prior to scoring them in the definition table.

Syntax

DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM_EQWIDTH (
     bin_table_name        IN VARCHAR2,
     data_table_name       IN VARCHAR2,
     bin_num               IN PLS_INTEGER DEFAULT 10,
     exclude_list          IN Column_List DEFAULT NULL,
     round_num             IN PLS_INTEGER DEFAULT 6,
     bin_schema_name       IN VARCHAR2 DEFAULT NULL,
     data_schema_name      IN VARCHAR2 DEFAULT NULL);

Parameters

Table 24-8  INSERT_BIN_EQWIDTH Procedure Parameters
Parameter Description

bin_table_name

Name of the categorical bin table generated using CREATE_BIN_NUM procedure.

data_table_name

Name of the table containing the data.

bin_num

Number of bins.

exclude_list

List of columns (attributes) to be excluded from this iteration of the binning process.

round_num

Number of significant digits.

bin_schema_name

Name of the schema hosting the bin definition table.

data_schema_name

Name of the schema hosting the table with data.

Usage Notes

For a given input table, you can iteratively call this routine several times with different specifications for number of bins for a given input table. For each iteration, you can selectively exclude attributes (that is, column names) using the exclude_list parameter for a particular binning specification.

Columns with all NULLs or only one unique value are ignored. No bin definitions are populated when bin_num = 0, or bin_num is NULL.

For example, when N=2, col='mycol', min=10, and max = 21, the following three rows are inserted into the definition table (inc = 5.5):

 COL     VAL  BIN
 ----- -----  -----
 mycol    10  NULL
 mycol  15.5  1
 mycol    21  2

Examples

The simplest invocation of this routine populates bin definitions in the num_bin_
table for all the numerical attributes found in build_table.
BEGIN

DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM(
'num_bin_table', 'build_table');
END;

/


INSERT_BIN_NUM_QTILE Procedure

This procedure finds a numerical binning definition for every NUMBER column in the data table that is not specified in the exclusion list and inserts the definition into the binning definition table that was created using CREATE_BIN_NUM.

The definition for each relevant column is computed based on the min values for each quantile, where quantiles are computed from the data using NTILE function. Bins bin1, ..., binN - 1 span ranges [minI, minI+1) and binN range [minN, maxN] with binI = I when N > 0 or binI = N+1-I when N < 0, and bin0 = binN+1 = NULL. Bins with equal left and right boundaries are collapsed.

Syntax

DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM_QTILE (
    bin_table_name       IN VARCHAR2,
    data_table_name      IN VARCHAR2,
    bin_num              IN PLS_INTEGER DEFAULT 10,
    exclude_list         IN Column_List DEFAULT NULL,
    bin_schema_name      IN VARCHAR2 DEFAULT NULL,
    data_schema_name     IN VARCHAR2 DEFAULT NULL);

Parameters

Table 24-9  INSERT_BIN_NUM_QTILE Procedure Parameters
Parameter Description

bin_table_name

Name of the numerical binning definition table generated using the CREATE_BIN_NUM procedure.

data_table_name

Name of the table containing the data.

bin_num

Number of bins.

exclude_list

List of columns (attributes) to be excluded from this iteration of the binning process.

bin_schema_name

Name of the schema hosting the numerical binning definition table.

data_schema_name

Name of the schema hosting the table with data.

Usage Notes

For a given input table, you can iteratively call this routine several times with different specifications for bin_num for a given input table. For each iteration, you can selectively exclude attributes (that is, column names) using the exclude_list parameter for a particular specification. Columns with all NULLs are ignored.

Example 1. When N = 4, col='mycol', and data is {1,2,2,2,2,3,4}, the following three rows are inserted into the definition table:

COL     VAL   BIN 
----- -----  ----- 
mycol     1   NULL 
mycol     2   1
mycol     4   2

Here quantities are {1,2}, {2,2}, {2,3}, {4} and min(1) = 1, min(2) = 2, min(3) = 2, min(4) = 4, max(4) = 4, and ranges are [1,2), [2,2), [2,4), [4,4]. After collapsing [1,2), [2,4].

Examples

The simplest invocation of this routine populates numerical binning definitions in the num_bin_table for all the numerical attributes found in build_table.

BEGIN
    DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM_QTILE(
    'num_bin_table', 'build_table');
END;

INSERT_CLIP_TRIM_TAIL Procedure

This procedure finds the trimming definition for every NUMBER column in the data table that is not specified in the exclusion list and inserts the definition into the clipping definition table that was created using CREATE_CLIP.

The definition for each relevant column is computed based on the non-NULL values sorted in ascending order such that val(1) < val(2) < ... < val(N), where N is a total number of non-NULL values in a column:

lcut = val(1+floor(N*q))
lval = NULL
rcut = val(N-floor(*N*q))
rval = NULL

where q = ABS(NVL(tail_frac,0)). Nothing is done when q >= 0.5.

Syntax

DBMS_DATA_MINING_TRANSFORM.INSERT_CLIP_TRIM_TAIL (
    clip_table_name     IN VARCHAR2,
    data_table_name     IN VARCHAR2,
    tail_frac           IN NUMBER DEFAULT 0.025,
    exclude_list        IN Column_List DEFAULT NULL,
    clip_schema_name    IN VARCHAR2 DEFAULT NULL,
    data_schema_name    IN VARCHAR2 DEFAULT NULL);

Parameters

Table 24-10  INSERT_CLIP_TRIM_TAIL Procedure Parameters
Parameter Description

clip_table_name

Name of the clipping definition table generated using the CREATE_CLIP procedure.

data_table_name

Name of the table containing the data.

tail_frac

Tail fraction.

exclude_list

List of columns (attributes) to be excluded from this iteration of the clipping process.

clip_schema_name

Name of the schema hosting the clipping definition table.

data_schema_name

Name of the schema hosting the table with data.

Usage Notes

For a given input table, you can iteratively call this routine several times with different specifications for tail_frac for a given input table. For each iteration, you can selectively exclude attributes (that is, column names) using the exclude_list parameter for a particular specification.

Example 1. When q = 0.2, col='mycol', and data is {1,2,2,2,3,4,4}, the following row is inserted into the definition table:

COL    LCUT   LVAL   RCUT   RVAL
----- -----  -----  -----  -----
mycol     2    NULL     4   NULL

Here 1 + floor(N*q) = 1 + floor(7*0.2) = 2, lcut = val(2) = 2.

N - floor(N*q) = 7 - floor(7*0.2) = 6, rcut = val(6) = 4.

Examples

The simplest invocation of this routine populates clipping definitions in the clip_table for all the numerical attributes found in build_table.

BEGIN
    DBMS_DATA_MINING_TRANSFORM.INSERT_CLIP_TRIM_TAIL(
    'clip_table', 'build_table');
END;

INSERT_CLIP_WINSOR_TAIL Procedure

This procedure finds winsorizing definition for every NUMBER column in the data table that is not specified in the exclusion list and inserts the definition into the clipping definition table that was created using CREATE_CLIP.

Definition for each relevant column is computed based on the non-NULL values sorted in ascending order such that val(1) < val(2) < ... < val(N), where N is a total number of non-NULL values in a column:

lcut = val(1+floor(N*q))
lval = lcut
rcut = val(N-floor(N*q))
rval = rcut

where q = ABS(NVL(tail_fraq,0)). Nothing is done when q >= 0.5.

Syntax

DBMS_DATA_MINING_TRANSFORM.INSERT_CLIP_WINSOR_TAIL (
    clip_table_name    IN VARCHAR2,
    data_table_name    IN VARCHAR2,
    tail_frac          IN NUMBER DEFAULT 0.025,
    exclude_list       IN Column_List DEFAULT NULL,
    clip_schema_name   IN VARCHAR2 DEFAULT NULL,
    data_schema_name   IN VARCHAR2 DEFAULT NULL);

Parameters

Table 24-11  INSERT_CLIP_WINSOR_TAIL Procedure Parameters
Parameter Description

clip_table_name

Name of the clipping definition table generated using CREATE_CLIP procedure.

data_table_name

Name of the table containing the data.

tail_frac

Tail fraction.

exclude_list

List of columns (attributes) to be excluded from this iteration of the clipping process.

clip_schema_name

Name of the schema hosting the clipping definition table.

data_schema_name

Name of the schema hosting the table with data.

Usage Notes

For a given input table, you can iteratively call this routine several times with different specifications for tail_frac for a given input table. For each iteration, you can selectively exclude attribute (that is, column names using the exclude_list parameter for a particular specification. Columns with all NULLs are ignored.

Example 1. When q = 0.2, col='mycol', and data is {1,2,2,2,3,4,4}, the following row is inserted into the definition table:

COL    LCUT   LVAL   RCUT  RVAL
----- -----  -----  ----- -----
mycol     2      2      4      4

Here 1 + floor(N*q) = 1 + floor(7*0.2) = 2, lcut = val(2) = 2.

N - floor(N*q) = 7 - floor(7*0.2) = 6, rcut = val(6) = 4.

Examples

The simplest invocation of this routine populates clipping definitions in the clip_table for all the numerical attributes found in build_table.

BEGIN
    DBMS_DATA_MINING_TRANSFORM.INSERT_CLIP_WINSOR_TAIL(
    'clip_table', 'build_table');
END;

INSERT_NORM_LIN_ZSCORE Procedure

This procedure finds the normalization definition for every NUMBER column in the data table that is not specified in the exclusion list and inserts the definition based on z-score normalization into the table that was created using CREATE_NORM_LIN.

Definition for each relevant column is computed based on the min and max values that are computed from the data table, such that shift = min and scale = max - min. The values of shift and scale are rounded to round_num significant digits prior to storing them in the definition table.

Syntax

DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_ZSCORE (
     norm_table_name     IN VARCHAR2,
     data_table_name     IN VARCHAR2,
     exclude_list        IN Column_List DEFAULT NULL,
     round_num           IN PLS_INTEGER DEFAULT 6,
     norm_schema_name    IN VARCHAR2 DEFAULT NULL,
     data_schema_name    IN VARCHAR2 DEFAULT NULL);

Parameters

Table 24-12  INSERT_BIN_NORM_LIN_ZSCORE Procedure Parameters
Parameter Description

norm_table_name

Name of the normalization table generated using CREATE_NORM_LIN procedure.

data_table_name

Name of the table containing the data.

exclude_list

List of columns (attributes) to be excluded from this iteration of the normalization process.

round_num

Number of significant digits.

norm_schema_name

Name of the schema hosting the normalization definition table.

data_schema_name

Name of the schema hosting the table with data.

Usage Notes

For a given input table, you can iteratively call this routine several times with selective exclusion of attributes (that is, column names) using the exclude_list parameter for a particular binning specification.

Columns with all NULLs or only one unique value are ignored.

Examples

The simplest invocation of this routine populates normalization definitions in the norm_zscore_table for all the numerical attributes found in build_table.

BEGIN

DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_ZSCORE(
'norm_zscore_table', 'build_table');
END;

/


INSERT_NORM_LIN_MINMAX Procedure

This procedure finds the normalization definition for every NUMBER column in the data table that is not specified in the exclusion list and inserts the definition based on min-max normalization into the table that was created using CREATE_NORM_LIN.

Definition for each relevant column is computed based on the mean and standard deviation that are computed from the data table, such that shift = mean and scale = stddev. The values of shift and scale are rounded to round_num significant digits prior to storing them in the definition table.

Syntax

DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_MINMAX (
     norm_table_name     IN VARCHAR2,
     data_table_name     IN VARCHAR2,
     exclude_list        IN Column_List DEFAULT NULL,
     round_num           IN PLS_INTEGER DEFAULT 6,
     norm_schema_name    IN VARCHAR2 DEFAULT NULL,
     data_schema_name    IN VARCHAR2 DEFAULT NULL);

Parameters

Table 24-13  INSERT_NORM_LIN_MINMAX Procedure Parameters
Parameter Description

norm_table_name

Name of the normalization table generated using CREATE_NORM_LIN procedure.

data_table_name

Name of the table containing the data.

exclude_list

List of columns (attributes) to be excluded from this iteration of the normalization process.

round_num

Number of significant digits.

norm_schema_name

Name of the schema hosting the normalization definition table.

data_schema_name

Name of the schema hosting the table with data.

Usage Notes

For a given input table, you can iteratively call this routine several times with selective exclusion of attributes (that is, column names) using the exclude_list parameter for a particular normalization specification.

Columns with all NULLs or only one unique value are ignored.

Examples

The simplest invocation of this routine populates normalization definitions in the norm_minmax_table for all the numerical attributes found in build_table.

BEGIN

DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_MINMAX(
'norm_minmax_table', 'build_table');
END;

XFORM_BIN_CAT Procedure

This procedure creates the view that performs categorical binning. Only the columns that are specified in the definition table are transformed; the remaining columns do not change.

Syntax

DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_CAT (
     bin_table_name       IN VARCHAR2,
     data_table_name      IN VARCHAR2,
     xform_view_name      IN VARCHAR2,
     literal_flag         IN BOOLEAN DEFAULT FALSE,
     bin_schema_name      IN VARCHAR2 DEFAULT NULL,
     data_schema_name     IN VARCHAR2 DEFAULT NULL,
     xform_schema_name    IN VARCHAR2 DEFAULT NULL);

Parameters

Table 24-14  XFORM_BIN_CAT Procedure Parameters
Parameter Description

bin_table_name

Name of the categorized binning definition table generated using CREATE_BIN_CAT procedure.

data_table_name

Name of the table containing the data.

xform_view_name

View representing the transformed output.

literal_flag

Literal flag.

bin_schema_name

Name of the schema hosting the bin definition table.

data_schema_name

Name of the schema hosting the data table.

xform_schema_name

Name of the schema hosting the view representing the transformed output.

Usage Notes

The bin table created by CREATE_BIN_CAT and populated with bin definitions by INSERT_BIN_CAT_FREQ is used to guide the query generation process to construct categorical binning expressions of the following form:

DECODE("col", val1, bin1,
              ...
valN, binN,
              NULL,   NULL,
binN+1) "col"

This expression maps values val1,..., valN into N bins bin1,..., binN, and other values into binN+1, while NULL values remain unchanged. binN+1 is optional. If not specified, it defaults to NULL. To specify binN+1 provide a row with val set to NULL.

The literal_flag parameter indicates whether the values in bin are valid SQL literals. When the flag is set to TRUE, the value of bin is used as is in query generation; otherwise it is converted into a valid text literal (surrounded by quotes and each single quote is replaced by two single quotes). By default, the flag is set to FALSE. One example of when it can be set to TRUE is in cases when all bin are numbers. In that case the transformed column will be numeric as opposed to textual (default behavior).

Note that col is case-sensitive since it generates quoted identifiers. In cases when there are multiple entries with the same col,val combination with different bin, the behavior is undefined -- any one of the bin values might be used.

Examples

Example 1. bin_cat contains four rows with col = 'mycol':

{col = 'mycol', val = 'Waltham',        bin = 'MA'}
{col = 'mycol', val = 'Burlington',     bin = 'MA'}
{col = 'mycol', val = 'Redwood Shores', bin = 'CA'}
{col = 'mycol', val = NULL,             bin = 'OTHER'}

the following expression is generated:

 DECODE("mycol", 'Waltham',        'MA',
                 'Burlington',     'MA',
                 'Redwood Shores', 'CA',
                  NULL,             NULL,
                                   'OTHER') "mycol"

Example 2. bin_cat contains three rows with col = 'mycol':

{col = 'mycol', val = 'Waltham',        bin = 'MA'}
{col = 'mycol', val = 'Burlington',     bin = 'MA'}
{col = 'mycol', val = 'Redwood Shores', bin = 'CA'}

the following expression is generated:

DECODE("mycol", 'Waltham',        'MA',
                'Burlington',     'MA',
                'Redwood Shores', 'CA') "mycol"

Example 3. For the definition:

COL   VAL            BIN
----- ----------     ---
mycol Waltham        1
mycol Burlington     1
mycol Redwood Shores 2

the following expression is generated when the literal flag is set to FALSE:

DECODE ("mycol", 'Waltham',        '1',
                 'Burlington'      '1',
                 'Redwood Shores', '2') "mycol"

and when the flag is set to TRUE:

DECODE("mycol", 'Waltham',        1,
                'Burlington',     1,
                'Redwood Shores', 2) "mycol"

The simplest invocation of this routine generates a view build_view that represents the transformation query on build_table based on bin definitions in the cat_bin_table.

BEGIN

DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_CAT(
'cat_bin_table', 'build_table', 'build_view');
END; /

XFORM_BIN_NUM Procedure

This procedure creates the view that performs numerical binning. Only the columns that are specified in the definition table are transformed; the remaining columns do not change.

Syntax

DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_NUM (
     bin_table_name     IN VARCHAR2,
     data_table_name    IN VARCHAR2,
     xform_view_name    IN VARCHAR2,
     literal_flag       IN BOOLEAN DEFAULT FALSE,
     bin_schema_name    IN VARCHAR2 DEFAULT NULL,
     data_schema_name   IN VARCHAR2 DEFAULT NULL,
     xform_schema_name  IN VARCHAR2 DEFAULT NULL);

Parameters

Table 24-15  XFORM_BIN_NUM Procedure Parameters
Parameter Description

bin_table_name

Name of the numerical binning definition table generated using CREATE_BIN_NUM procedure.

data_table_name

Name of the table containing the data.

xform_view_name

View representing the transformed output.

literal_flag

Literal flag.

bin_schema_name

Name of the schema hosting the bin definition table.

data_schema_name

Name of the schema hosting the data table.

xform_schema_name

Name of the schema hosting the view representing the transformed output.

Usage Notes

The bin table created by CREATE_BIN_NUM and populated with bin definitions by INSERT_BIN_NUM_EQWIDTH or INSERT_BIN_NUM_QTILE is used to guide the query generation process to construct numerical binning expressions of the following form:

CASE WHEN "col" <  val0   THEN 'bin0'
     WHEN "col" <= val1   THEN 'bin1'
     ...
     WHEN "col" <= valN   THEN 'binN'
     WHEN "col" IS NOT NULL THEN 'binN+1'
END "col"

This expression maps values in the range [val0;valN] into N bins bin1, ..., binN, values outside of this range into bin0 or binN+1, such that

(-inf; val0)  -> bin0
[val0; val1)  -> bin1
...
(valN-1; valN] -> binN
(valN; +inf)  -> binN+1

NULL values remain unchanged. binN+1 is optional. If it is not specified, the values ("col" > valN) are mapped to NULL. To specify binN+1, provide a row with val set to NULL. The order of the WHEN ... THEN pairs is based on the ascending order of val for a given col.

The literal_flag parameter indicates whether the values in bin are valid SQL literals. When the flag is set to TRUE, the value of bin is used as is in query generation; otherwise it is converted into a valid text literal (surrounded by quotes and each single quote is replaced by two single quotes). By default, the flag is set to FALSE. One example of when it can be set to TRUE is in cases when all bin are numbers. In that case the transformed column will be numeric as opposed to textual (default behavior).

Note that col is case-sensitive since it generates quoted identifiers. In cases where there are multiple entries with the same col,val combination with different bin, the behavior is undefined -- any one of the bin values might be used.

Examples

Example 1. bin_num contains four rows with col = 'mycol':

{col = 'mycol', val = 15.5, bin = 'small'}
{col = 'mycol', val = 10,   bin = 'tiny'}
{col = 'mycol', val = 20,   bin = 'large'}
{col = 'mycol', val = NULL, bin = 'huge'}

the following expression is generated:

CASE WHEN "mycol" <  10       THEN 'tiny'
     WHEN "mycol" <= 15.5     THEN 'small'
     WHEN "mycol" <= 20       THEN 'large'
     WHEN "mycol" IS NOT NULL THEN 'huge'
 END "mycol"

Example 2. bin_num contains three rows with col = 'mycol':

{col = 'mycol', val = 15.5, bin = NULL}
{col = 'mycol', val = 10,   bin = 'tiny'}
{col = 'mycol', val = 20,   bin = 'large'}

the following expression is generated:

CASE WHEN "mycol" <  10   THEN NULL
     WHEN "mycol" <= 15.5 THEN 'small'
     WHEN "mycol" <= 20   THEN 'large'
END "mycol"

Example 3. For the definition:

COL   VAL  BIN
----- ---- ---
mycol   10 NULL
mycol 15.5 1
mycol   21 2

the following expression is generated when the literal flag is set to FALSE:

CASE WHEN "mycol" <  10   THEN NULL
     WHEN "mycol" <= 15.5 THEN '1'
     WHEN "mycol" <= 20   THEN '2' 
 END "mycol"

and when the flag is set to TRUE:

CASE WHEN "mycol" <   10  THEN NULL
     WHEN "mycol" <= 15.5 THEN 1
     WHEN "mycol" <=  20  THEN 2
     END "mycol"

The simplest invocation of this routine generates a view build_view that represents the transformation query on build_table based on transform definitions in bin definitions in the num_bin_table.

BEGIN

DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_NUM(
'num_bin_table', 'build_table', 'build_view');
END; /

XFORM_CLIP Procedure

This procedure creates the view that performs clipping. Only the columns that are specified in the transform definition are clipped; the remaining columns do not change.

Syntax

DBMS_DATA_MINING_TRANSFORM.XFORM_CLIP (
    clip_table_name       IN VARCHAR2, 
    data_table_name       IN VARCHAR2,
    xform_view_name       IN VARCHAR2,
    clip_schema_name      IN VARCHAR2 DEFAULT NULL,
    data_schema_name      IN VARCHAR2,DEFAULT NULL,
    xform_schema_name     IN VARCHAR2,DEFAULT NULL;

Parameters

Table 24-16  XFORM_CLIP Procedure Parameters
Parameter Description

clip_table_name

Name of the clipping definition table generated using CREATE_CLIP.

data_table_name

Name of the table containing the data.

xform_view_name

View representing the transformed output.

clip_schema_name

Name of the schema hosting the clipping definition table.

data_schema_name

Name of the schema hosting the data table.

xform_schema_name

Name of the schema hosting the view representing the transformed output.

Usage Notes

The clipping definition table created by CREATE_CLIP and populated with clipping definitions by INSERT_CLIP_WINSOR_TAIL or INSERT_CLIP_TRIM_TAIL is used to guide query generation process to construct clipping expressions of the following form:

CASE WHEN "col" < lcut THEN lval
     WHEN "col" > rcut THEN rval
                       ELSE "col"
END "col"

Note that col is case-sensitive since it generates quoted identifiers. When there are multiple entries in the transform definition table for the same col, the behavior is undefined. Any one of the definitions may be used in query generation. NULL values remain unchanged.

Example 1 (Winsorizing). When col = 'my_col', lcut = -1.5, lval = -1.5, and rcut = 4.5 and rval = 4.5, the following expression is generated:

CASE WHEN "my_col" < -1.5 THEN NULL
     WHEN "my_col" >  4.5 THEN NULL
                          ELSE "my_col"
END "my_col"

Examples

The simplest invocation of this routine generates a view object build_view that represents the transformation query on build_table based on transform definitions in clipping definitions in the clip_table.

BEGIN
    DBMS_DATA_MINING_TRANSFORM.XFORM_CLIP(
    'clip_table', 'build_table', 'build_view');
END;

XFORM_NORM_LIN Procedure

This procedure creates the view that performs linear normalization. Only the columns that are specified in the definition table are transformed; the remaining columns do not change.

Syntax

DBMS_DATA_MINING_TRANSFORM.XFORM_NORM_LIN (
     norm_table_name      IN VARCHAR2,
     data_table_name      IN VARCHAR2,
     xform_view_name      IN VARCHAR2,
     norm_schema_name     IN VARCHAR2 DEFAULT NULL,
     data_schema_name     IN VARCHAR2 DEFAULT NULL,
     xform_schema_name    IN VARCHAR2 DEFAULT NULL);

Parameters

Table 24-17  XFORM_NORM_LIN Procedure Parameters
Parameter Description

norm_table_name

Name of the normalization definition table generated using CREATE_NORM_LIN procedure.

data_table_name

Name of the table containing the data.

xform_view_name

View representing the transformed output.

norm_schema_name

Name of the schema hosting the normalization definition table.

data_schema_name

Name of the schema hosting the data table.

xform_schema_name

Name of the schema hosting the view representing the transformed output.



Usage Notes

The normalization table created by CREATE_NORM_LIN and populated with definitions by either INSERT_NORM_LIN_ZSCORE or INSERT_NORM_LIN_MINMAX is used to guide the query generation process to construct normalization expressions of the following form:

("col" - shift)/scale "col"

Note that col is case-sensitive since it generates quoted identifiers. When there are multiple entries in the transform definition table for the same col, the behavior is undefined. Any one of the definitions may be used in query generation. NULL values remain unchanged.

For example, when col = 'my_col', shift = -1.5, and scale = 20. The following expression is generated:

("my_col" - (-1.5))/20 "my_col"

Examples

The simplest invocation of this routine generates a view build_view that represents the transformation query on build_table based on normalization definitions in the norm_minmax_table.

BEGIN

DBMS_DATA_MINING_TRANSFORM.XFORM_NORM_LIN(
'norm_minmax_table', 'build_table', 'build_view');
END;