23 OLAP and Data Mining

In large data warehouse environments, many different types of analysis can occur. In addition to SQL queries, you may also apply more advanced analytical operations to your data. Two major types of such analysis are OLAP (On-Line Analytic Processing) and data mining. Rather than having a separate OLAP or data mining engine, Oracle has integrated OLAP and data mining capabilities directly into the database server. Oracle OLAP and Oracle Data Mining (ODM) are options to the Oracle Database. This chapter provides a brief introduction to these technologies, and more detail can be found in these products' respective documentation.

The following topics provide an introduction to Oracle's OLAP and data mining capabilities:

OLAP Overview
Oracle Data Mining Overview

See Also:
Oracle OLAP Application Developer's Guide for further information regarding OLAP and Oracle Data Mining documentation for further information regarding data mining

OLAP Overview

Oracle Database OLAP adds the query performance and calculation capability previously found only in multidimensional databases to Oracle's relational platform. In addition, it provides a Java OLAP API that is appropriate for the development of internet-ready analytical applications. Unlike other combinations of OLAP and RDBMS technology, Oracle Database OLAP is not a multidimensional database using bridges to move data from the relational data store to a multidimensional data store. Instead, it is truly an OLAP-enabled relational database. As a result, this release provides the benefits of a multidimensional database along with the scalability, accessibility, security, manageability, and high availability of the Oracle Database. The Java OLAP API, which is specifically designed for internet-based analytical applications, offers productive data access. See Oracle OLAP Application Developer's Guide for further information regarding OLAP.

Benefits of OLAP and RDBMS Integration

Basing an OLAP system directly on the Oracle server offers the following benefits:

Scalability
Availability
Manageability
Backup and Recovery
Security

Scalability

Oracle Database OLAP is highly scalable. In today's environment, there is tremendous growth along three dimensions of analytic applications: number of users, size of data, complexity of analyses. There are more users of analytical applications, and they need access to more data to perform more sophisticated analysis and target marketing. For example, a telephone company might want a customer dimension to include detail such as all telephone numbers as part of an application that is used to analyze customer turnover. This would require support for multi-million row dimension tables and very large volumes of fact data. Oracle Database can handle very large data sets using parallel execution and partitioning, as well as offering support for advanced hardware and clustering.

Availability

Oracle Database includes many features that support high availability. One of the most significant is partitioning, which allows management of precise subsets of tables and indexes, so that management operations affect only small pieces of these data structures. By partitioning tables and indexes, data management processing time is reduced, thus minimizing the time data is unavailable. Another feature supporting high availability is transportable tablespaces. With transportable tablespaces, large data sets, including tables and indexes, can be added with almost no processing to other databases. This enables extremely rapid data loading and updates.

Manageability

Oracle enables you to precisely control resource utilization. The Database Resource Manager, for example, provides a mechanism for allocating the resources of a data warehouse among different sets of end-users. Consider an environment where the marketing department and the sales department share an OLAP system. Using the Database Resource Manager, you could specify that the marketing department receive at least 60 percent of the CPU resources of the machines, while the sales department receive 40 percent of the CPU resources. You can also further specify limits on the total number of active sessions, and the degree of parallelism of individual queries for each department.

Another resource management facility is the progress monitor, which gives end users and administrators the status of long-running operations. Oracle Database 10g maintains statistics describing the percent-complete of these operations. Oracle Enterprise Manager enables you to view a bar-graph display of these operations showing what percent complete they are. Moreover, any other tool or any database administrator can also retrieve progress information directly from the Oracle data server, using system views.

Backup and Recovery

Oracle provides a server-managed infrastructure for backup, restore, and recovery tasks that enables simpler, safer operations at terabyte scale. Some of the highlights are:

Details related to backup, restore, and recovery operations are maintained by the server in a recovery catalog and automatically used as part of these operations. This reduces administrative burden and minimizes the possibility of human errors.
Backup and recovery operations are fully integrated with partitioning. Individual partitions, when placed in their own tablespaces, can be backed up and restored independently of the other partitions of a table.
Oracle includes support for incremental backup and recovery using Recovery Manager, enabling operations to be completed efficiently within times proportional to the amount of changes, rather than the overall size of the database.
The backup and recovery technology is highly scalable, and provides tight interfaces to industry-leading media management subsystems. This provides for efficient operations that can scale up to handle very large volumes of data. Open Platforms for more hardware options & enterprise-level platforms.

Security

Just as the demands of real-world transaction processing required Oracle to develop robust features for scalability, manageability and backup and recovery, they lead Oracle to create industry-leading security features. The security features in Oracle have reached the highest levels of U.S. government certification for database trustworthiness. Oracle's fine grained access control feature, enables cell-level security for OLAP users. Fine grained access control works with minimal burden on query processing, and it enables efficient centralized security management.

Oracle Data Mining Overview

Oracle Data Mining uses data mining algorithms to sift through the large volumes of data generated by businesses to produce, evaluate, and deploy predictive and descriptive models. It also enriches mission critical applications in CRM, manufacturing control, inventory management, customer service and support, Web portals, wireless devices and other fields with context-specific recommendations and predictive monitoring of critical processes. ODM delivers real-time answers to questions such as:

Which items is a person most likely to buy or like?
What is the likelihood that this product will be returned for repair?
What is the likelihood that this person poses a credit risk?

Oracle Data Mining enables data mining inside the database for performance and scalability. Some of the capabilities are:

Java and PL/SQL interfaces that provide programmatic control and application integration
Several algorithms:
- Classification: Naive Bayes, Adaptive Bayes Network, Support Vector Machine
- Regression: Support Vector Machine
- Clustering: k-Means, O-Cluster
- Association: Apriori
- Attribute Importance: Predictor Variance
- Feature Extraction: Non-Negative Matrix Factorization
Real-time and batch scoring modes

Oracle Data Mining also supports sequence similarity search and annotation (BLAST) in the database.

Enabling Data Mining Applications

Oracle Data Mining provides a Java API and PL/SQL packages to exploit the data mining functionality that is embedded in the Oracle database.

By delivering complete programmatic control of data mining in the database, Oracle Data Mining (ODM) delivers powerful, scalable modeling and real-time scoring. This enables businesses to incorporate rules, predictions, and classifications in all processes and decision points throughout the business cycle.

ODM is designed to meet the challenges of vast amounts of data, delivering accurate insights completely integrated into e-business applications. This integrated intelligence enables the automation and decision speed that e-businesses require to compete today.

Data Mining in the Database

Oracle Data Mining performs all phases of data mining within the database. In each data mining phase, this architecture results in significant improvements including performance, automation, and integration.

Performing data mining in the database has the following benefits:

All phases of data mining take place in the database:
- All data preparation occurs in the database
- The data that is mined remains in the database
- The models produced by mining reside in the database
- Scoring occurs in the database along with results immediately available as tables
Data mining automatically inherits important database features, including:
- Scalability
- Availability
- Manageability
- Backup and recovery
- Security

Data Preparation

Data preparation usually requires the creation of new tables or views based on existing data. Both options perform faster than moving data to an external data mining utility and offer the programmer the option of snapshots or real-time updates.

Oracle Data Mining provides utilities for complex, data mining-specific tasks. For example, for some types of models, binning improves model build time and model performance; therefore, ODM provides a utility for user-defined binning.

ODM accepts data in either non-transactional (single-record case) format or transactional (multi-record case) format. ODM provides a pivoting utility for converting multiple non-transactional tables into a single transactional table.

ODM data exploration and model evaluation features are extended by Oracle's statistical functions and OLAP capabilities. Because these also operate within the database, they can all be incorporated into a seamless application that shares database objects. This allows for more functional and faster applications.

Model Building

Oracle Data Mining supports all the major data mining functions: classification, regression, association rules, clustering, attribute importance, and feature extraction.

These algorithms address a broad spectrum of business problems, ranging from predicting the future likelihood of a customer purchasing a given product, to understand which products are likely to be purchased together in a single trip to the grocery store. Since all model building takes place inside the database, the data never needs to move outside the database, and therefore the entire data-mining process is accelerated.

Model Evaluation

Models are stored in the database and are directly accessible for evaluation, reporting, and further analysis by a wide variety of tools and application functions. ODM provides APIs for calculating confusion matrix and lift charts. ODM stores the models, the underlying data, and the results of model evaluation together in the database to enable further analysis, reporting, and application-specific model management.

Model Apply (Scoring)

Oracle Data Mining provides both batch and real-time scoring. In batch mode, ODM takes a table as input. It scores every record, and returns a scored table as a result. In real-time mode, parameters for a single record are passed in and the scores are returned in a Java object.

In both modes, ODM can deliver a variety of scores. It can return a rating or probability of a specific outcome. Alternatively it can return a predicted outcome and the probability of occurrence of the outcome.

ODM Programmatic Interfaces

ODM provides Java and PL/SQL interfaces for data mining. These interfaces make is possible to embed data mining in applications.

ODM Java API

The ODM Java API allows programmers to develop data mining applications or tools in the J2SE/J2EE environment. The API defines a set of classes that can be used to develop a complete data mining solution. The API has built-in data mining metadata management and provides an infrastructure to build data mining applications easily. The API supports importing and exporting of PMML models for Naive Bayes and Association Rules models. The API supports asynchronous execution of mining operations and provides mechanisms to retrieve state transition information for running or completed operations. The API supports real-time scoring for all the supervised models and clustering models. This API supports text mining.

The ODM Java API provides flexible data preparation options. Applications can use either automated data preparation, or they perform data processing using external transformations defined as utility methods in the oracle.dmt.odm.transformation.Transformation class. In addition, applications can embed externals transformation details in the logical data specification as an input for the build operation, then the system will persist the details with the model and perform the embedded transformations in the future apply and test operations.

The ODM Java API design reflects concepts present in the emerging Java standard (JSR-73) for Data Mining, which is being developed through the Java Community Process.

ODM PL/SQL Packages

The following supplied packages support data mining in PL/SQL programs:

DBMS_DATA_MINING
DBMS_DATA_MINING_TRANSFORM

DBMS_DATA_MINING provides support for in-database data mining. This package can be used to build and test models and to apply models to new data (scoring). The package provides the basic building blocks for data mining, along with utilities and functions to inspect models and their results. The package also supports export and import of native models from a user's schema or database instance.

DBMS_DATA_MINING_TRANSFORM, a complementary package, provides support for popular data transformations such as numerical and categorical binning and linear and z-score normalization. The DBMS_DATA_MINING_TRANSFORM package is open source in nature, in that the package code is distributed with the product, so that users can study the utility routines and learn how to define their own data transformations using Oracle SQL and PL/SQL scripting.

ODM Sequence Similarity Search (BLAST)

ODM also supports specialized sequence matching and annotation algorithms. In life sciences, vast quantities of data including nucleotide and amino acid sequences are stored, typically, in a database. This sequence data help biologists determine the chemical structure, biological function, and evolutionary history of organisms.

A version of BLAST, based on NCBI BLAST 2.0, has been implemented in the Oracle Database using table functions. This enables users to perform BLAST queries against data in an Oracle database.