Oracle Data Mining Application Developer's Guide
10g Release 1 (10.1)

Part Number B10699-01

B ODM Tips and Techniques

This appendix describes special considerations for clustering models, SVM models, and NMF models.

B.1 Clustering Models

ODM supports two algorithms for clustering: enhanced k-Means and O-Cluster.

The two algorithms treat data differently. This section discusses important considerations about data for clustering.

B.1.1 Attributes for Clustering

Binary attributes should be assigned a data mining type appropriate to the algorithm in use; for k-Means, for example, binary attributes are treated numerically and binned as described in the next section.

B.1.2 Binning Data for k-Means Models

You can either bin the data manually or let the algorithm do the binning. For k-Means, it is usually best to let the algorithm do the binning. If you bin manually, the first bin number must be 1. We recommend using the same number of bins per attribute so that all attributes have the same scale in the distance computation. For example, if age is binned into 20 bins (1...20) and there is a binary attribute such as gender, the binary attribute should be binned as 1 and 20 instead of 1 and 2. If this is not done, the algorithm still works, but the results will be unreliable.
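The following sketch illustrates this kind of manual binning outside of ODM; it is not ODM code, and the attribute names and the use of equal-width bins are assumptions made for the example. The point is that the binary attribute is mapped onto the same 1-to-20 scale as the numeric attribute.

import numpy as np

def equal_width_bin(values, n_bins=20):
    # Assign each value to one of n_bins equal-width bins labeled 1..n_bins.
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    return np.digitize(values, edges[1:-1]) + 1

age = np.array([18, 25, 31, 40, 52, 67, 73])
age_bins = equal_width_bin(age, n_bins=20)        # labels 1..20

# Binary attribute: use the extremes of the same scale (1 and 20),
# not 1 and 2, so the distance computation is not distorted.
gender = np.array(["F", "M", "F", "M", "M", "F", "F"])
gender_bins = np.where(gender == "F", 1, 20)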

B.1.3 Binning Data for O-Cluster Models

You can either bin the data manually or let the algorithm do the binning. For O-Cluster, it is usually best to let the algorithm do the binning. If you bin manually, the first bin number must be 1. In the case of O-Cluster, manual binning should not over-smooth or under-smooth the histograms of the numeric attributes. The number of bins does not need to be the same across attributes, but it should be chosen to capture the distribution of each attribute's values accurately.
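One common heuristic for choosing a per-attribute bin count that tracks the shape of the distribution is the Freedman-Diaconis rule, sketched below. This is a general-purpose illustration, not part of ODM, and the sample data is synthetic.

import numpy as np

def choose_bin_count(values):
    # Freedman-Diaconis rule: bin width ~ 2 * IQR / n^(1/3).
    q75, q25 = np.percentile(values, [75, 25])
    iqr = q75 - q25
    if iqr == 0:
        return 1
    width = 2 * iqr / len(values) ** (1.0 / 3.0)
    return max(1, int(np.ceil((values.max() - values.min()) / width)))

income = np.random.default_rng(0).lognormal(mean=10, sigma=1, size=1000)
n_bins = choose_bin_count(income)
edges = np.linspace(income.min(), income.max(), n_bins + 1)
income_bins = np.digitize(income, edges[1:-1]) + 1   # bin labels start at 1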

B.2 SVM Models

This section describes the ways in which you can affect model build quality and performance with SVM models.

B.2.1 Build Quality and Performance

The user can influence both the SVM model quality (accuracy) and performance (build time) through two basic mechanisms: data preparation and model settings.

A poor choice of settings or data preparation can lead to serious performance degradation. A poor choice of settings can also lead to inaccurate models; for example, a model that predicts only one class. ODM offers built-in mechanisms that estimate appropriate settings for the problem at hand.

SVM estimates three settings: complexity factor, standard deviation for Gaussian kernels, and epsilon for regression models. These three settings are estimated by default.

Default settings are overridden by specifying a value for the setting when creating the algorithm settings object (Java) or by inserting a row in the settings table for that setting (DBMS_DM).

B.2.2 Data Preparation

In the Java API (ODM), default data preparation is overridden by specifying the data as prepared. In DBMS_DM there is no default data preparation; data preparation must be explicitly invoked.

ODM and DBMS_DM accept two types of predictors: numeric and categorical. In ODM, the logical data specification identifies the data type of each predictor. DBMS_DM identifies all database numeric types as numeric predictors and all database string types as categorical predictors. SVM requires all predictors to be numeric and the range of values restricted to a small interval around 0. Hence numeric predictors are normalized and categorical predictors are exploded (described below).

B.2.3 Numeric Predictor Handling

Normalization of numeric predictors is typically required for two reasons: (1) so that the relative influence of the various predictors on the model is not distorted by their relative scaling, and (2) to avoid computational overflow or underflow. On the first point, note that an SVM model coefficient (ai) applies to an entire training vector rather than to individual predictors within the vector; hence predictors with large values can dominate. On the second point, note that the kernel computation includes sums of dot products of two potentially large vectors. Such sums can result in overflow or, as the exponent of a Gaussian, underflow.

Our offerings for normalization include z-score and min-max normalization. Each normalization technique has advantages and drawbacks. Z-score has the advantage of not distorting the shape of the distribution. However, z-scored data can still have many values of large absolute magnitude (outside the range [-1, 1]). The number and size of large values can differ greatly across attributes, depending on their relative non-normality. In addition, the z-scored range differs from the exploded categorical range {0, 1}. Min-max normalization avoids the large-value issue and has the same range as the categorical data, but it may suffer compression of its dense range in the presence of large extreme values. The user can address this drawback of min-max normalization with some procedure for handling outliers.
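The two techniques can be sketched as follows. This is a general illustration of the transformations, not ODM's internal implementation, and the sample values are invented.

import numpy as np

def z_score(x):
    # Zero mean, unit standard deviation; preserves the shape of the
    # distribution, but outliers can still fall well outside [-1, 1].
    return (x - x.mean()) / x.std()

def min_max(x):
    # Rescales values into [0, 1], matching the exploded categorical
    # range, but extreme values compress the dense part of the range.
    return (x - x.min()) / (x.max() - x.min())

income = np.array([18_000.0, 32_000.0, 41_000.0, 55_000.0, 2_000_000.0])
print(z_score(income))   # the outlier still dominates the scale
print(min_max(income))   # bounded, but typical values are squeezed near 0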

B.2.4 Categorical Predictor Handling

Categorical predictors are exploded into arrays of indicator variables. This is transparent to the user in both interfaces. For example, a predictor VAR that takes on one of three possible values (A, B, or C) becomes three predictors that one could think of as VAR_A, VAR_B, and VAR_C. Each of these exploded predictors is a binary attribute with values in the set {0, 1}. If VAR_A is 1, then VAR = "A". If VAR_A is 0, then VAR is not equal to "A". For multi-valued categorical predictors with no implicit order, explosion, as specified, is the most sensible procedure.
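The explosion can be sketched with pandas as shown below. This only illustrates the transformation; ODM performs it transparently, and the data frame and column name VAR are invented for the example.

import pandas as pd

# Hypothetical data frame with a three-valued categorical predictor VAR.
df = pd.DataFrame({"VAR": ["A", "C", "B", "A"]})

# Explode VAR into indicator columns VAR_A, VAR_B, VAR_C with values in {0, 1}.
exploded = pd.get_dummies(df, columns=["VAR"], prefix="VAR").astype(int)
print(exploded)
#    VAR_A  VAR_B  VAR_C
# 0      1      0      0
# 1      0      0      1
# 2      0      1      0
# 3      1      0      0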

For binary predictors or multi-valued categoricals with an implicit order, there are choices. Consider a predictor taking on values in the set {LOW, MEDIUM, HIGH}. The user could choose to recode the values to {0, 0.5, 1}, or use some other mapping that is intended to reflect the relative distance between the categories. The recoded predictor would then be passed as a numeric field.

Binary predictors take on one of two possible values; for example, a binary predictor might take values from the set {NO, YES}. The decision to recode such predictors to {0, 1} depends on whether there is a preferred positive value.
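Both recodings can be sketched as follows. The mappings, value spacings, and variable names are assumptions made for illustration; they are modeling choices, not requirements of ODM.

# Ordinal categorical: map to numeric values that reflect the implicit order
# (the exact spacing is a modeling choice, shown here only as an illustration).
severity_map = {"LOW": 0.0, "MEDIUM": 0.5, "HIGH": 1.0}
severity = ["LOW", "HIGH", "MEDIUM"]
severity_numeric = [severity_map[v] for v in severity]   # pass as a numeric field

# Binary predictor: recode to {0, 1} when one value is the natural "positive";
# otherwise explode it like any other categorical.
response = ["NO", "YES", "YES"]
response_numeric = [1 if v == "YES" else 0 for v in response]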

B.2.5 Regression Target Handling

In the Java API, for performance and accuracy, we internally normalize regression targets. In DBMS_DM, the user can choose to normalize the target externally. Consider a regression target with large absolute values. Because the predictors are constrained, the coefficients (ai) must be large, or the number of non-zero coefficients must be large, or both; otherwise it is impossible to predict a large value. The range of the coefficients is restricted by the complexity factor (C) setting. With a small C, the performance and, possibly, the accuracy of SVM can be poor. With a large C, the training time can increase significantly. In addition, the model can be larger than would otherwise be necessary. To avoid this issue we normalize the targets. Since the goal is to avoid large values, min-max normalization is often a good choice (see the numeric predictor handling discussion above).
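For users who normalize the target externally (for example, when working through DBMS_DM), the following sketch min-max normalizes the target and maps predictions back to the original scale. The function names and data are illustrative, not part of either interface.

import numpy as np

def normalize_target(y):
    # Min-max normalize a regression target into [0, 1] and return the
    # parameters needed to map predictions back to the original scale.
    y_min, y_max = y.min(), y.max()
    return (y - y_min) / (y_max - y_min), y_min, y_max

def denormalize(pred, y_min, y_max):
    return pred * (y_max - y_min) + y_min

house_prices = np.array([150_000.0, 420_000.0, 275_000.0, 1_100_000.0])
y_scaled, y_min, y_max = normalize_target(house_prices)
# ... build the SVM regression model against y_scaled ...
predictions_original_scale = denormalize(np.array([0.3, 0.8]), y_min, y_max)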

B.2.6 SVM Algorithm Settings

Several of the algorithm settings can significantly affect SVM accuracy and performance. The issues and considerations in choosing setting values are discussed briefly in the sections below. In the literature, cross-validation and grid search techniques are often used to select parameters. These methods, while accurate, are very expensive. Rather than saddle users with an expensive procedure, we have opted for computationally inexpensive procedures that fit easily within our implementation framework. The intent is to provide settings that should give the model adequate complexity for the problem at hand. If the user requires greater accuracy and is willing to incur the additional computational expense, our defaults can be used as the starting point for a grid search.

B.2.7 Complexity Factor (C)

The complexity factor, C, determines the trade-off between minimizing model error on the training data and minimizing model complexity. Its role is to avoid over-fit (an over-complex model fitting noise in the training data) and under-fit (a model that is too simple). Over-fit exists if the model fits the training data well but does poorly on held-aside test data. Under-fit exists if the model does poorly on both training and test data. If the user wishes to override the default C after seeing the model result, the user can obtain the internally computed default value as a starting point for a grid search. Both the Java API and PL/SQL interfaces have methods for getting the complexity factor. The subsequent discussion provides a qualitative description of the impact of C on the model build.

A very large value of C places an extreme penalty on errors, so that SVM seeks a perfect separation of target classes or, in the case of regression, a perfect (epsilon-insensitive; see below) fit. Assuming the data have at least some noise, this is over-fit. A large C leaves the parameters of the model (ai) unconstrained; that is, the model is complex with a high capacity for fitting data. Conversely, a small C places a low penalty on errors and high constraints on the model parameters, which can lead to under-fit.

Standard Deviation -- Gaussian Kernels Only

Standard deviation is an SVM setting that applies to Gaussian kernels only. This parameter, in conjunction with C, affects the trade-off between error on the training data and generalization. For a fixed C, under-fit results as the standard deviation gets large, and over-fit results as the standard deviation goes to zero.

To facilitate using the default values computed in ODM as a starting point, methods exist to extract the setting values.
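If greater accuracy is needed and the extra computational cost is acceptable, the extracted defaults can seed a grid search. The sketch below uses scikit-learn's SVC with an RBF kernel as a stand-in estimator (ODM itself does not perform this search); the default values shown are placeholders for values extracted from a built model, and gamma = 1/(2*sigma^2) converts a standard deviation into scikit-learn's RBF parameterization.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Placeholder values standing in for the defaults extracted from the model.
default_c = 10.0
default_std_dev = 2.0

# Log-spaced grids centered on the defaults.
c_grid = default_c * np.logspace(-2, 2, 5)            # 0.01x .. 100x
sigma_grid = default_std_dev * np.logspace(-1, 1, 3)  # 0.1x .. 10x
gamma_grid = 1.0 / (2.0 * sigma_grid ** 2)            # gamma = 1 / (2 * sigma^2)

search = GridSearchCV(SVC(kernel="rbf"),
                      param_grid={"C": c_grid, "gamma": gamma_grid},
                      cv=5)
# search.fit(X_train, y_train)   # X_train, y_train: your prepared training data
# print(search.best_params_)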

B.2.8 Epsilon -- Regression Only

Epsilon is an SVM setting that applies to regression models only. The parameter separates small errors from large errors: small errors are considered to have no associated penalty, and only large errors are penalized. In the Java API, if the default target normalization is used, the value of epsilon is re-scaled accordingly. In DBMS_DM, epsilon needs to be re-scaled by the user. By default, epsilon is estimated internally.

The epsilon parameter affects the number of support vectors and, thus, indirectly, the trade-off between model complexity and generalization (over-fit and under-fit as possible consequences). An estimate of the standard deviation of the additive noise in the target variable is needed to find an appropriate epsilon value and thus minimize predictive error. The estimate can be based either on domain knowledge or it can be obtained from the residuals of a crude model (e.g., polynomial, KNN) built on the training data.
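The residual-based approach can be sketched as follows, using a k-nearest-neighbor regressor from scikit-learn as the crude model. The data is synthetic and the choice of k is arbitrary; this only illustrates the heuristic described above.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data standing in for the prepared predictors and target.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.8]) + rng.normal(scale=0.05, size=200)

# Residuals of a crude k-NN fit approximate the additive noise in the target.
crude = KNeighborsRegressor(n_neighbors=5).fit(X, y)
residuals = y - crude.predict(X)
epsilon_estimate = residuals.std()    # candidate value for the epsilon setting
print(f"suggested epsilon: {epsilon_estimate:.4f}")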

B.2.9 Kernel Cache -- Gaussian Kernels Only

The most expensive operation in building a Gaussian SVM model is the computation of kernels. The general approach taken by the build is to converge within one chunk of data at a time and then test for violators outside the chunk. The build is complete when there are no more violators within tolerance. The size of the chunk is chosen such that the associated kernels can be maintained in memory in a kernel cache. The larger the chunk size, the better the chunk presumably represents the population of training data and the fewer times new chunks need to be created. Generally, larger caches imply faster builds. The default size is 50 MB.

B.2.10 Tolerance

Tolerance is the maximum size of a violation of convergence criteria such that the model is considered to have converged. The default value is 0.001. Larger values imply faster building but less accurate models.

B.3 NMF Models

Traditionally, as part of standard numerical analysis, matrix factorization is a common preprocessing step prior to solving a linear system of equations. For data mining, matrix factorization offers a way to reduce the dimensionality of a dataset and extract features that reveal interesting structure in the data or provide inputs to further types of analysis. In matrix factorization, the number of independent columns in the dataset is reduced by projection onto a lower-dimensional space (that is, onto smaller matrices).

Rank reduction by factorization can reveal interesting low-dimensional subspaces embedded in high-dimensional datasets and is a useful operation for pattern discovery and feature extraction. For example, traditional Principal Component Analysis (PCA) projects the data onto the dimensions along which it varies the most and can be used to visualize the most dominant structure in a dataset.

Non-negative matrix factorization (NMF), by imposing non-negativity constraints on the factors, has been shown to be a useful decomposition and feature extraction method in fields such as object detection and recognition, and to be a valuable alternative to PCA (footnote 1). By forcing a dataset (matrix) to "fit" into a product of smaller datasets (matrices), NMF compresses the data and tends to eliminate some of the redundancies and expose the most common patterns. Because it uses a parts-based or component-based decomposition, and in contrast to PCA and other techniques, the compressed version of the data is not random-looking and can be used to understand interesting patterns and common trends in the dataset. The NMF decomposition also induces a numerical taxonomy that can be used for grouping the rows or columns of the original dataset. The extracted features can be used as inputs to other analysis tasks such as classification or indexing. This procedure has proven useful in face recognition problems and in the discovery of semantic features in texts (footnote 2).

Given an N x M (rows x columns) dataset A and k < min(N, M), NMF computes an approximation of the original data, A ~ W H, where W is N x k and H is k x M. Starting from random initial conditions, W and H are iteratively updated until convergence to a local minimum is achieved, monitored by the minimization of the Euclidean cost function. A must have non-negative entries, and W and H are non-negative by construction. Even though localization is not an explicit property of the algorithm, NMF appears to produce quite localized and sparse features that facilitate the interpretation of results and the transparency of the model. For example, when NMF is applied to a dataset of facial images, the extracted features are facial parts: eyes, noses, and so on. When the dataset is a document/keyword matrix, NMF extracts "semantic" features (footnote 3).
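A minimal sketch of the multiplicative-update form of NMF for the Euclidean cost (following Lee and Seung, footnote 1) is shown below to make the factorization concrete. It is not ODM's implementation; the iteration count, random data, and value of k are arbitrary choices for the example.

import numpy as np

def nmf(A, k, n_iter=200, eps=1e-9):
    # Multiplicative-update NMF minimizing ||A - W H||_F^2.
    # A: non-negative N x M matrix; returns W (N x k) and H (k x M).
    N, M = A.shape
    rng = np.random.default_rng(0)
    W = rng.uniform(size=(N, k))          # random non-negative initialization
    H = rng.uniform(size=(k, M))
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

A = np.abs(np.random.default_rng(1).normal(size=(100, 40)))
W, H = nmf(A, k=5)
print(np.linalg.norm(A - W @ H))          # reconstruction error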

When NMF is used as a feature extractor, it can benefit from scaling individual attributes to a common scale via normalization. This facilitates pattern discovery and makes dimensionality reduction more effective. A preferred approach is to perform outlier treatment (for example, with a winsorizing transformation) and then perform min-max normalization. Having individual attributes on a common scale helps the ranking and interpretation of feature coefficients in terms of their relative magnitude. Additionally, applying such a normalization transformation allows NMF to operate on data with negative values as well. Otherwise, the data need to be shifted to the non-negative range manually on a per-attribute basis.
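A sketch of this preparation step follows. The percentile cut-offs and synthetic attribute are assumptions made for the example, not values prescribed by ODM.

import numpy as np

def winsorize(x, lower_pct=1.0, upper_pct=99.0):
    # Clip extreme values to the given percentiles (a simple winsorizing step).
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

def min_max(x):
    return (x - x.min()) / (x.max() - x.min())

# Treat outliers first, then bring the attribute onto a common [0, 1] scale.
attribute = np.random.default_rng(2).lognormal(mean=0.0, sigma=2.0, size=1000)
prepared = min_max(winsorize(attribute))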

If NMF is used to cluster column instances (for example, by using the amplitudes in the rows of H), then, depending on the nature of the problem, one may consider normalizing the rows, the columns, or both prior to using NMF.


1 Daniel D. Lee and H. Sebastian Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, October 21, 1999, 401(6755), 788-793.
2 Same as footnote 1, above.
3 Same as footnote 1, above.