
Oracle Data Mining Concepts
10g Release 1 (10.1)

Part Number B10698-01

4 Descriptive Data Mining Models

This chapter describes descriptive models, that is, the unsupervised learning functions. These functions do not predict a target value; instead, they focus on the intrinsic structure, relationships, and interconnectedness of the data. The Oracle Data Mining interfaces support the following descriptive models and associated algorithms:

Function                         | Algorithm
---------------------------------|--------------------------------------------------------------------------------------
Clustering (Section 4.1)         | Enhanced k-means (Section 4.1.1)
                                 | Orthogonal Partitioning Clustering (O-Cluster), Java interface only (Section 4.1.2)
Association (Section 4.2)        | Apriori (Section 4.2.4)
Feature Extraction (Section 4.3) | Non-Negative Matrix Factorization (Section 4.3.1)

4.1 Clustering in Oracle Data Mining

Clustering is a technique useful for exploring data. It is particularly useful where there are many cases and no obvious natural groupings. Here, clustering data mining algorithms can be used to find whatever natural groupings may exist.

Clustering analysis identifies clusters embedded in the data. A cluster is a collection of data objects that are similar in some sense to one another. A good clustering method produces high-quality clusters to ensure that the inter-cluster similarity is low and the intra-cluster similarity is high; in other words, members of a cluster are more like each other than they are like members of a different cluster.

Clustering can also serve as a useful data-preprocessing step to identify homogeneous groups on which to build predictive models. Clustering models differ from predictive models in that the outcome of the process is not guided by a known result; that is, there is no target attribute. Predictive models predict values for a target attribute, and an error rate between the target and predicted values can be calculated to guide model building. Clustering models, on the other hand, uncover natural groupings (clusters) in the data. The model can then be used to assign group labels (cluster IDs) to data points.

In ODM a cluster is characterized by its centroid, its attribute histograms, and its position in the clustering model's hierarchical tree. ODM performs hierarchical clustering using an enhanced version of the k-means algorithm and O-Cluster, an Oracle proprietary algorithm. The clusters discovered by these algorithms are then used to create rules that capture the main characteristics of the data assigned to each cluster. The rules represent the hyperboxes (bounding boxes) that envelop the data in the clusters discovered by the clustering algorithm. The antecedent of each rule describes the cluster's bounding box. The consequent encodes the cluster ID of the cluster described by the rule. For example, for a data set with two attributes, AGE and HEIGHT, the following rule represents most of the data assigned to cluster 10:

If AGE >= 25 and AGE <= 40 and
HEIGHT >= 5.0ft and HEIGHT <= 5.5ft
then CLUSTER = 10

The clusters are also used to generate a Bayesian probability model which is used during scoring for assigning data points to clusters.

The two clustering algorithms supported by the ODM interfaces, enhanced k-means and O-Cluster, are described in the following sections.

4.1.1 Enhanced k-Means Algorithm

The k-means algorithm is a distance-based clustering algorithm that partitions the data into a predetermined number of clusters (provided there are enough distinct cases). The k-means algorithm works only with numerical attributes. Distance-based algorithms rely on a distance metric (function) to measure the similarity between data points. The distance metric is either Euclidean, Cosine, or Fast Cosine distance. (Cosine and Fast Cosine are supported in the DBMS_DATA_MINING version of k-means only.) Data points are assigned to the nearest cluster according to the distance metric used.


Note:

DBMS_DATA_MINING and the Java interface use different versions of the enhanced k-means algorithm. The two implementations may give different results.
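To make the distance metrics concrete, the following is a minimal PL/SQL sketch that computes the Euclidean distance and the cosine distance (1 minus the cosine similarity) between two small attribute vectors. It illustrates the metrics only, not ODM's internal implementation; Fast Cosine (an ODM-specific optimization) is not shown, and the vectors are hypothetical.

DECLARE
  TYPE vec_t IS VARRAY(10) OF NUMBER;
  a   vec_t  := vec_t(1, 2, 3);   -- data point 1
  b   vec_t  := vec_t(2, 4, 6);   -- data point 2
  dot NUMBER := 0;
  na  NUMBER := 0;
  nb  NUMBER := 0;
  dsq NUMBER := 0;
BEGIN
  FOR i IN 1 .. a.COUNT LOOP
    dot := dot + a(i) * b(i);                    -- dot product (for cosine)
    na  := na  + a(i) * a(i);                    -- squared norm of a
    nb  := nb  + b(i) * b(i);                    -- squared norm of b
    dsq := dsq + (a(i) - b(i)) * (a(i) - b(i));  -- squared Euclidean distance
  END LOOP;
  DBMS_OUTPUT.PUT_LINE('Euclidean distance: ' || SQRT(dsq));
  DBMS_OUTPUT.PUT_LINE('Cosine distance:    ' || (1 - dot / (SQRT(na) * SQRT(nb))));
END;
/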


ODM implements an enhanced version of the k-means algorithm that builds models hierarchically and incrementally. This incremental approach avoids the need to build multiple k-means models and provides clustering results that are consistently superior to traditional k-means.

The k-Means clustering algorithm implementation in DBMS_DATA_MINING is a modified and enhanced version of the implementation in the Java interface. There is no upper limit on the number of attributes or on target cardinality for this implementation of k-Means. In future ODM releases, the DBMS_DATA_MINING implementation will become the single ODM implementation of the k-Means clustering algorithm. Table 4-1 summarizes the differences between the two implementations.

Table 4-1 DBMS_DATA_MINING and Java k-Means Implementation Differences

Java Implementation                                   | DBMS_DATA_MINING Implementation
------------------------------------------------------|-----------------------------------------------------------------------------------------------------------
Numerical attributes only                             | Categorical and numerical attributes
Data automatically binned                             | Data automatically normalized
Does not handle sparse data                           | Handles sparse data
NULLs indicate missing values                         | NULLs indicate sparse data
Builds balanced or unbalanced trees                   | Builds unbalanced trees only (unbalanced trees usually give better results)
Supports Euclidean distance only                      | Supports Euclidean, Cosine, and Fast Cosine distances
Based on a split-and-train k-means approach           | Based on a bi-sect k-means with refinement approach
Centroid reports the mean only                        | Centroid reports the mean (numerical attributes) or mode (categorical attributes) and the variance (numerical attributes)
Attribute relevance computed in a different way from  | Attribute relevance computed in a different way from the Java implementation; uses a bitmap-based computation
the DBMS_DATA_MINING implementation                   |
Basic ranking of attributes                           | Better ranking of attributes

The choice between balanced and unbalanced approaches in the Java interface is controlled by the system parameter CL_ALG_SETTING_TREE_GROWTH in the ODM_CONFIGURATION table. The balanced approach is faster than the unbalanced approach, while the unbalanced approach generates models with smaller overall distortion.

4.1.1.1 Data for k-Means

In the Java interface, the k-means algorithm works with numerical data only. As a result, if you need to cluster data with categorical attributes, you must explode the categorical attribute into multiple binary columns (one per unique value of the categorical attribute) before using k-means. If you bin the data manually, you must bin the new binary columns after you have exploded them.
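For example, a categorical attribute MARITAL_STATUS with the values 'single', 'married', and 'divorced' could be exploded into three binary columns with a query such as the following (the table and column names are hypothetical, and a CASE expression is needed for every distinct value of the attribute):

CREATE VIEW customers_exploded AS
SELECT customer_id,
       age,
       income,
       CASE WHEN marital_status = 'single'   THEN 1 ELSE 0 END AS ms_single,
       CASE WHEN marital_status = 'married'  THEN 1 ELSE 0 END AS ms_married,
       CASE WHEN marital_status = 'divorced' THEN 1 ELSE 0 END AS ms_divorced
FROM   customers;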

The DBMS_DATA_MINING interface supports both categorical and numerical data.

The k-means algorithms work best with a moderate number of attributes (at most 100); however, there is no upper limit on the number of attributes and target cardinality for the DBMS_DATA_MINING implementation of k-Means.
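As a sketch of how a clustering model might be built through the DBMS_DATA_MINING interface (the case table, case identifier, and number of clusters below are hypothetical; check the package documentation for the exact settings available in your release):

-- Settings table: k-means algorithm with a user-specified number of clusters
CREATE TABLE km_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(128)
);

BEGIN
  INSERT INTO km_settings VALUES
    (DBMS_DATA_MINING.ALGO_NAME, DBMS_DATA_MINING.ALGO_KMEANS);
  INSERT INTO km_settings VALUES
    (DBMS_DATA_MINING.CLUS_NUM_CLUSTERS, '10');
  COMMIT;

  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'KM_DEMO',
    mining_function     => DBMS_DATA_MINING.CLUSTERING,
    data_table_name     => 'CUSTOMERS',
    case_id_column_name => 'CUSTOMER_ID',
    settings_table_name => 'KM_SETTINGS');
END;
/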

4.1.1.2 Scalability through Summarization

Because traditional k-means requires multiple passes through the data, it can be impractical for large data tables that don't fit in memory. In this case multiple expensive database scans would be required. ODM's enhanced k-means requires at most one database scan. For data tables that don't fit in memory, the enhanced k-means algorithm employs a smart summarization approach that creates a summary of the data table that can be stored in memory. This approach allows the enhanced k-means algorithm to handle data tables of any size. The summarization scheme can be seen as a smart sampling approach that first identifies the main partitions in the data and then generates summary points for each partition in proportion to their share of the total data. Each summary point has a weight that accounts for the proportion of the data it represents.

4.1.1.3 Scoring (Applying Models)

The clusters discovered by enhanced k-means are used to generate a Bayesian probability model that is then used during scoring (model apply) for assigning data points to clusters. The traditional k-means algorithm can be interpreted as a mixture model where the mixture components are spherical multivariate normal distributions with the same variance for all components. A mixture model is a type of density model that includes several component functions (usually Gaussian) that are combined to provide a multimodal density.

In the mixture model created from the clusters discovered by enhanced k-means, on the other hand, the mixture components are products of independent normal distributions with potentially different variances. Because of this greater flexibility, the probability model created by enhanced k-means provides a better description of the underlying data than the probability model implied by traditional k-means.
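In notation, and as a sketch of the distinction described above (not an excerpt from the ODM implementation): if cluster k has per-attribute means m(k,j) and variances s(k,j)^2 for attributes j = 1, ..., d, the enhanced k-means probability model scores a data point x = (x_1, ..., x_d) as

p(x | cluster k) = N(x_1; m(k,1), s(k,1)^2) * N(x_2; m(k,2), s(k,2)^2) * ... * N(x_d; m(k,d), s(k,d)^2)

where N(.; m, s^2) denotes a normal density. Traditional k-means corresponds to the special case in which every component shares a single spherical variance s^2.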

4.1.2 Orthogonal Partitioning Clustering (O-Cluster)


Note:

O-Cluster is available in the Java interface only.


The O-Cluster algorithm creates a hierarchical, grid-based clustering model; that is, it creates axis-parallel (orthogonal) partitions in the input attribute space. The algorithm operates recursively. The resulting hierarchical structure represents an irregular grid that tessellates the attribute space into clusters. The clusters define dense areas in the attribute space and are described by intervals along the attribute axes and the corresponding centroids and histograms. A parameter called sensitivity defines a baseline density level; only areas with peak density above this baseline level can be identified as clusters.

By comparison, the k-means algorithm tessellates the space even when natural clusters may not exist. For example, if there is a region of uniform density, k-means tessellates it into n clusters (where n is specified by the user). O-Cluster, on the other hand, separates areas of high density by placing cutting planes through areas of low density. O-Cluster needs multi-modal histograms (peaks and valleys). If an area has projections with uniform or monotonically changing density, O-Cluster does not partition it.

4.1.2.1 O-Cluster Data Use

O-Cluster does not necessarily use all the data when it builds a model. It reads the data in batches (the default batch size is 50,000). It reads another batch only if, based on statistical tests, it believes that there may still exist clusters that it has not yet uncovered.

Because O-Cluster may stop the model build before it reads all of the data, it is highly recommended that you randomize the data.
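One simple way to randomize the build data is to order the rows by a random value when creating the build table, for example (the table names are hypothetical):

CREATE TABLE customers_build AS
SELECT *
FROM   customers
ORDER BY DBMS_RANDOM.VALUE;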

4.1.2.2 Binning for O-Cluster

O-Cluster bins the data internally, thus providing automatic data discretization. However, if manual binning is used, the bin values must be represented by contiguous integer numbers starting at 1.
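If you do bin manually, the SQL WIDTH_BUCKET function produces bin numbers that satisfy this requirement. The following sketch (hypothetical table, column, and bin boundaries) assigns AGE values between 18 and 78 to 10 equal-width bins numbered 1 through 10; values outside that range would fall into bins 0 and 11 and would need to be clipped first:

SELECT customer_id,
       WIDTH_BUCKET(age, 18, 78, 10) AS age_bin   -- 10 equal-width bins, numbered 1..10
FROM   customers;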

4.1.2.3 O-Cluster Attribute Type

Binary attributes should be declared as categorical. O-Cluster distinguishes between continuous and discrete numerical attributes. The two types of attributes undergo different binning procedures in order to capture the characteristics of the underlying distributions. For example, a discrete numerical attribute such as age should be declared of data type INTEGER. On the other hand, continuous numerical attributes such as height measured in feet should be declared of data type NUMBER.
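For instance, a build table might declare its columns as follows (an illustrative sketch; the table and column names are hypothetical):

CREATE TABLE customers (
  customer_id     NUMBER,          -- case identifier
  age             INTEGER,         -- discrete numerical attribute
  height          NUMBER,          -- continuous numerical attribute
  has_mortgage    CHAR(1),         -- binary attribute; declare as categorical
  marital_status  VARCHAR2(10)     -- categorical attribute
);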

4.1.2.4 O-Cluster Scoring

The clusters discovered by O-Cluster are used to generate a Bayesian probability model that is then used during scoring (model apply) for assigning data points to clusters. The generated probability model is a mixture model where the mixture components are represented by a product of independent normal distributions for numerical attributes and multinomial distributions for categorical attributes.

4.1.3 K-Means and O-Cluster Comparison

The main characteristics of the enhanced k-means and O-Cluster algorithms are summarized in Table 4-2, below.

Table 4-2 Comparison of Enhanced k-Means and O-Cluster

Feature                          | Enhanced k-Means                                                                                | O-Cluster
---------------------------------|-------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------
Interface                        | Both Java and DBMS_DATA_MINING (different versions)                                            | Java only
Clustering methodology           | Distance-based                                                                                  | Grid-based
Number of cases                  | Handles tables of any size; uses summarization for tables that don't fit in the memory buffer  | More appropriate for data tables that have more than 500 cases; handles large tables via active sampling
Number of attributes             | More appropriate for data sets that have 100 or fewer attributes                               | More appropriate for data tables that have more than 5 attributes
Number of clusters               | User-specified                                                                                  | Automatically determined
Attribute type                   | Numerical attributes only                                                                       | Numerical and categorical attributes
Hierarchical clustering          | Yes                                                                                             | Yes
Probabilistic cluster assignment | Yes                                                                                             | Yes
Automatic data preparation       | Binning (Java interface); normalization (DBMS_DATA_MINING)                                     | Binning

4.2 Association Models in Oracle Data Mining

Association models are often associated with "market basket analysis," which is used to discover relationships or correlations in a set of items. Market basket analysis is widely used in data analysis for direct marketing, catalog design, and other business decision-making processes. A typical association rule of this kind asserts, for example, that "70% of the people who buy spaghetti, wine, and sauce also buy garlic bread."

Association models capture the co-occurrence of items or events in large volumes of customer transaction data. Because of progress in bar-code technology, it is now possible for retail organizations to collect and store massive amounts of sales data, referred to as "basket data." Association models were initially defined on basket data, even though they are applicable in several other domains. Finding all such rules is valuable for cross-marketing and mail-order promotions, but there are other applications as well: catalog design, add-on sales, store layout, customer segmentation, Web page personalization, and target marketing.

Traditionally, association models are used to discover business trends by analyzing customer transactions. However, they can also be used effectively to predict Web page accesses for personalization. For example, assume that after mining the Web access log, Company X discovered an association rule "A and B implies C," with 80% confidence, where A, B, and C are Web page accesses. If a user has visited pages A and B, there is an 80% chance that he/she will visit page C in the same session. Page C may or may not have a direct link from A or B. This information can be used to create a dynamic link to page C from pages A or B so that the user can "click-through" to page C directly. This kind of information is particularly valuable for a Web server supporting an e-commerce site to link the different product pages dynamically, based on the customer interaction.

Several statistical measures can be calculated for association rules. ODM provides two: support and confidence. These measures can be used to rank the rules and hence the predictions.
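These two measures are used but not defined in this chapter; their conventional definitions, for a rule "A implies B," are:

support(A => B)    = (number of transactions containing both A and B) / (total number of transactions)
confidence(A => B) = (number of transactions containing both A and B) / (number of transactions containing A)

For example, the 70% figure in the market basket rule and the 80% figure in the Web page rule above are confidence values.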

ODM supports the Apriori algorithm for association models.

The Apriori algorithm works by iteratively enumerating item sets of increasing length, subject to the minimum support threshold. Because association rule mining is defined this way and the state-of-the-art algorithms work by iterative enumeration, association rule algorithms do not handle the following cases efficiently:

4.2.1 Finding Associations Involving Rare Events

By definition, association rule mining discovers frequent patterns whose frequency is above the minimum support threshold. Therefore, in order to find associations involving rare events, the algorithm must run with very low minimum support values. However, doing so could potentially explode the number of enumerated item sets, especially in cases with a large number of items, and could increase the execution time significantly.

Therefore, association rule mining is not recommended for finding associations involving "rare" events in problem domains with a large number of items.

However, there are ways to restrict the item set enumeration to a smaller set if the "rare" events of interest are known. One could also use association rules to perform "partial classification" of the rare events. Those enhancements to association rules are not supported in this release.

Another option is to use classification models in such problem domains.

4.2.2 Finding Associations in Dense Data Sets

Because association rule algorithms work by iterative enumeration, they work best for sparse data sets, that is, data sets in which each record contains only a small fraction of the total number of possible items (when the total number of items is very large). Algorithm performance degrades exponentially with an increasing number of frequent items per record. Therefore, to get good runtime performance, one of the following conditions should hold:

The last condition holds for higher minimum support values.

Typical data sets in many bioinformatics applications are dense, with a large number of attributes. In order to use association rules effectively for such problems, careful planning is required. One option is to start with a high minimum support threshold and repeat the analysis with progressively lower values until desirable results are obtained. Another option is to recode some of the uninteresting attribute values to NULL, if such recoding makes the data set sparse, as sketched below.
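As a sketch of the second option (the table, column, and "uninteresting" value below are hypothetical), a dominant value that carries no information can be recoded to NULL so that the data set becomes sparse:

UPDATE gene_expression
SET    expression_level = NULL       -- recode the dominant, uninteresting value to NULL
WHERE  expression_level = 0;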

4.2.3 Data for Association Models

Association models are designed to use sparse data. Sparse data is data for which only a small fraction of the attributes are non-zero or non-null in any given row. Examples of sparse data include market basket and text mining data. For example, in a market basket problem there might be 1,000 products in the company's catalog, and the average size of a basket (the collection of items that a customer purchases in a typical transaction) is 20 products. In this example, a transaction/case/record has on average 20 out of 1,000 attributes that are not null. This implies that the fraction of non-zero attributes in the table (the density) is 20/1000, or 2%. This density is typical for market basket and text processing problems. Data with a significantly higher density can require extremely large amounts of TEMP space to build associations. For more information about sparse data, see Section 2.2.6.

Association models treat NULL values as an indication of sparse data. The algorithm does not handle missing values. If the data is not sparse and the NULL values are indeed missing at random, it is necessary to perform missing-data imputation (that is, "treat" the missing values) by substituting non-null values for the NULLs.

4.2.4 Apriori Algorithm

The association rule mining problem can be decomposed into two subproblems: finding all frequent itemsets whose support exceeds the minimum support threshold, and generating, from those frequent itemsets, the association rules that satisfy the minimum confidence constraint.

The number of frequent itemsets is governed by the minimum support parameter. The number of rules generated is governed by the number of frequent itemsets and by the confidence parameter. If the confidence parameter is set too high, the association model may contain frequent itemsets but no rules.

ODM uses a SQL-based implementation of the Apriori algorithm. The candidate generation and support counting steps are implemented using SQL queries; no specialized in-memory data structures are used. The SQL queries are fine-tuned to run efficiently in the database server by using various hints.
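As a sketch of an association model build through the DBMS_DATA_MINING interface (the table name, case identifier, and threshold values are hypothetical; the settings table AR_SETTINGS is assumed to have the same two-column structure as the k-means example in Section 4.1.1.1, and the columns returned by GET_ASSOCIATION_RULES should be checked against the package documentation):

BEGIN
  INSERT INTO ar_settings VALUES
    (DBMS_DATA_MINING.ALGO_NAME, DBMS_DATA_MINING.ALGO_APRIORI_ASSOCIATION_RULES);
  INSERT INTO ar_settings VALUES
    (DBMS_DATA_MINING.ASSO_MIN_SUPPORT, '0.02');     -- 2% minimum support
  INSERT INTO ar_settings VALUES
    (DBMS_DATA_MINING.ASSO_MIN_CONFIDENCE, '0.1');   -- 10% minimum confidence
  COMMIT;

  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'AR_DEMO',
    mining_function     => DBMS_DATA_MINING.ASSOCIATION,
    data_table_name     => 'MARKET_BASKET',
    case_id_column_name => 'TRANSACTION_ID',
    settings_table_name => 'AR_SETTINGS');
END;
/

-- Rules and their support and confidence can then be retrieved, for example:
SELECT rule_id, rule_support, rule_confidence
FROM   TABLE(DBMS_DATA_MINING.GET_ASSOCIATION_RULES('AR_DEMO'))
ORDER BY rule_confidence DESC;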

4.3 Feature Extraction in Oracle Data Mining

ODM Feature Extraction creates a new set of features by decomposing the original data. Feature extraction lets you describe the data with a number of features far smaller than the number of original dimensions (attributes). A feature is a combination of attributes in the data that is of special interest and captures important characteristics of the data.

Some applications of feature extraction are latent semantic analysis, data compression, data decomposition and projection, and pattern recognition. Feature extraction can also be used to enhance the speed and effectiveness of supervised learning.

For example, feature extraction can be used to extract the themes of a document collection, where documents are represented by a set of key words and their frequencies. Each theme (feature) is represented by a combination of keywords. The documents in the collection can then be expressed in terms of the discovered themes.

4.3.1 Non-Negative Matrix Factorization

Non-Negative Matrix Factorization (NMF) is described in the paper "Learning the Parts of Objects by Non-Negative Matrix Factorization" by D. D. Lee and H. S. Seung in Nature (401, pages 788-791, 1999). NMF is a feature extraction algorithm that decomposes multivariate data by creating a user-defined number of features, resulting in a reduced representation of the original data. NMF decomposes a data matrix V into the product of two lower-rank matrices W and H so that V is approximately equal to W times H. NMF uses an iterative procedure to modify the initial values of W and H so that the product approaches V; the procedure terminates when the approximation error converges or the specified number of iterations is reached. Each feature is a linear combination of the original attributes, and the coefficients of these linear combinations are non-negative. During model apply, an NMF model maps the original data into the new set of attributes (features) discovered by the model.

There is no upper limit on the number of attributes and target cardinality for NMF.
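As a sketch of how an NMF model might be built through the DBMS_DATA_MINING interface (the table name, case identifier, and number of features are hypothetical; the settings table NMF_SETTINGS is assumed to have the same two-column structure as the k-means example in Section 4.1.1.1, and FEAT_NUM_FEATURES is assumed to control the user-defined number of features mentioned above):

BEGIN
  INSERT INTO nmf_settings VALUES
    (DBMS_DATA_MINING.ALGO_NAME, DBMS_DATA_MINING.ALGO_NONNEGATIVE_MATRIX_FACTOR);
  INSERT INTO nmf_settings VALUES
    (DBMS_DATA_MINING.FEAT_NUM_FEATURES, '5');   -- user-defined number of features
  COMMIT;

  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'NMF_DEMO',
    mining_function     => DBMS_DATA_MINING.FEATURE_EXTRACTION,
    data_table_name     => 'DOCUMENT_TERMS',
    case_id_column_name => 'DOC_ID',
    settings_table_name => 'NMF_SETTINGS');
END;
/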

4.3.1.1 NMF for Text Mining

Text mining involves extracting information from unstructured data. Further, typical text data is high-dimensional and sparse. Unsupervised algorithms like Principal Components Analysis (PCA), Singular Value Decomposition (SVD), and NMF involve factorizing the document-term matrix based on different constraints. One widely used approach for text mining is latent semantic analysis. NMF focuses on reducing dimensionality. By comparing the vectors for two adjoining segments of text in a high-dimensional semantic space, NMF provides a characterization of the degree of semantic relatedness between the segments. NMF is less complex than PCA and can be applied to sparse data. NMF-based latent semantic analysis is an attractive alternative to SVD approaches due to the additive non-negative nature of the solution and the reduced computational complexity and resource requirements.