Chapter

Understanding Data Mining

Through a variety of techniques, data mining identifies nuggets of information in bodies of data.
Data mining extracts information in such a way that it can be used in areas such as decision
support, prediction, forecasts, and estimation. Data is often voluminous but of low value and with
little direct usefulness in its raw form. It is the hidden information in the data that has value.

In data mining, success comes from combining your (or your expert’s) knowledge of the
data with advanced, active analysis techniques in which the computer identifies the underlying
relationships and features in the data. The process of data mining generates models from historical
data that are later used for predictions, pattern detection, and more. The technique for building
these models is called machine learning or modeling.
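
The idea of building a model from historical data and applying it later to new records can be sketched in a few lines of Python. This is a minimal illustration only, not how SPSS Modeler builds models: the records, the field meanings, and the function names fit_line and predict are all hypothetical, and an ordinary least-squares line stands in for whatever modeling technique would actually be chosen.

```python
# Hypothetical historical records: (visits per month, spend per month).
historical = [(1, 12.0), (2, 19.0), (3, 31.0), (4, 38.0), (5, 50.0)]


def fit_line(records):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    sxx = sum((x - mean_x) ** 2 for x, _ in records)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input value."""
    slope, intercept = model
    return slope * x + intercept


# Build the model once from historical data...
model = fit_line(historical)
# ...then reuse it later to score a record that was not in that data.
forecast = predict(model, 6)
```

The split into a fitting step and a separate scoring step mirrors the description above: the model is generated once and then used for predictions on data that arrives afterward.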

Modeling Techniques

IBM® SPSS® Modeler includes a number of machine-learning and modeling technologies, which
can be roughly grouped according to the types of problems they are intended to solve.

Predictive modeling methods include decision trees, neural networks, and statistical models.

Clustering models focus on identifying groups of similar records and labeling the records
according to the group to which they belong. Clustering methods include Kohonen, k-means,
and TwoStep.

Association rules associate a particular conclusion (such as the purchase of a particular
product) with a set of conditions (the purchase of several other products).

Screening models can be used to screen data to locate fields and records that are most likely to
be of interest in modeling and identify outliers that may not fit known patterns. Available
methods include feature selection and anomaly detection.
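
To make the association-rule idea concrete, the following sketch scores one rule against a handful of shopping baskets. The baskets, the product names, and the function rule_stats are hypothetical. Support and confidence are computed in the common textbook sense, which is an assumption here and is not necessarily how SPSS Modeler defines or reports these measures.

```python
# Hypothetical transactions: each set holds the products in one basket.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk", "jam"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter"},
]


def rule_stats(transactions, conditions, conclusion):
    """Return (support, confidence) for the rule: conditions -> conclusion.

    support    = share of all transactions containing conditions and conclusion
    confidence = share of condition-matching transactions that also
                 contain the conclusion
    """
    conditions = set(conditions)
    with_conditions = [t for t in transactions if conditions <= t]
    with_both = [t for t in with_conditions if conclusion in t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_conditions) if with_conditions else 0.0
    return support, confidence


# Rule: customers who buy bread and butter also buy milk.
support, confidence = rule_stats(baskets, {"bread", "butter"}, "milk")
```

Here the conditions are the set of other products purchased and the conclusion is the single product of interest, matching the description of association rules above.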

Data Manipulation and Discovery

SPSS Modeler also includes many facilities that let you apply your expertise to the data:

Data manipulation. Constructs new data items derived from existing ones and breaks down the
data into meaningful subsets. Data from a variety of sources can be merged and filtered.

Browsing and visualization. Displays aspects of the data using the Data Audit node to perform
an initial audit including graphs and statistics. Advanced visualization includes interactive
graphics, which can be exported for inclusion in project reports.

Statistics. Confirms suspected relationships between variables in the data. Statistics from
IBM® SPSS® Statistics can also be used within SPSS Modeler.

Hypothesis testing. Constructs models of how the data behaves and verifies these models.
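
The data manipulation operations listed above (deriving new items, merging sources, and selecting subsets) can be illustrated outside the product with plain Python. This is a rough sketch under stated assumptions: the customer records, the field names, the regions lookup, and the 0.2 threshold are all invented for illustration, and SPSS Modeler performs these steps with its own nodes rather than with code like this.

```python
# Hypothetical source 1: customer records.
customers = [
    {"id": 1, "income": 52000, "debt": 13000},
    {"id": 2, "income": 38000, "debt": 19000},
    {"id": 3, "income": 61000, "debt": 6100},
]

# Hypothetical source 2: region per customer, keyed by id.
regions = {1: "north", 2: "south", 3: "north"}

# Derive: construct a new data item from existing ones.
derived = [{**c, "debt_ratio": c["debt"] / c["income"]} for c in customers]

# Merge: combine the two sources on the shared id.
merged = [{**c, "region": regions[c["id"]]} for c in derived]

# Subset: keep only the records that meet a condition of interest.
high_ratio = [c for c in merged if c["debt_ratio"] > 0.2]

# Filter fields: drop the columns that are no longer needed.
report = [{"id": c["id"], "region": c["region"]} for c in high_ratio]
```

Each step produces a new list rather than changing the input, so intermediate results stay available for inspection, much as intermediate data can be examined at each stage of an analysis.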