Understanding Data Mining
together. It may not even be online. If it exists only on paper, data entry will be required before
you can begin data mining.
Check whether the data covers the relevant attributes
The object of data mining is to identify relevant attributes, so including this check may seem odd
at first. It is very useful, however, to look at what data is available and to try to identify the likely
relevant factors that are not recorded. In trying to predict ice cream sales, for example, you may
have a lot of information about retail outlets or sales history, but you may not have weather
and temperature information, which is likely to play a significant role. Missing attributes do
not necessarily mean that data mining will not produce useful results, but they can limit the
accuracy of resulting predictions.
A quick way of assessing the situation is to perform a comprehensive audit of your data.
Before moving on, consider attaching a Data Audit node to your data source and running it to
generate a full report.
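
Outside SPSS Modeler, the idea behind such an audit can be approximated in a few lines of code. The following Python sketch is only an illustration and does not reproduce the Data Audit node or its report. The function name audit, the field names, and the sample ice cream records are all hypothetical.

```python
from collections import Counter


def audit(records, missing=(None, "")):
    """Summarize each field: record count, missing values, distinct values,
    and either the numeric range or the most common value."""
    # Collect field names in the order they first appear.
    fields = []
    for rec in records:
        for name in rec:
            if name not in fields:
                fields.append(name)

    report = {}
    for name in fields:
        values = [rec.get(name) for rec in records]
        present = [v for v in values if v not in missing]
        numeric = [v for v in present
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        summary = {
            "records": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        if numeric and len(numeric) == len(present):
            # Every non-missing value is a number: report the range.
            summary["min"] = min(numeric)
            summary["max"] = max(numeric)
        else:
            # Otherwise report the most frequent value, if there is one.
            summary["top"] = Counter(present).most_common(1)[0][0] if present else None
        report[name] = summary
    return report


# Hypothetical ice cream sales records; temperature is often not recorded.
sales = [
    {"outlet": "A", "units_sold": 120, "temperature_c": 24.5},
    {"outlet": "B", "units_sold": 95, "temperature_c": None},
    {"outlet": "A", "units_sold": 143, "temperature_c": 29.0},
    {"outlet": "C", "units_sold": 60, "temperature_c": None},
]

report = audit(sales)
for field, summary in report.items():
    print(field, summary)
```

In this made-up data set the report shows that half of the temperature values are missing, which is the kind of gap worth knowing about before modeling begins.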
Beware of noisy data
Data often contains errors or may contain subjective, and therefore variable, judgments. These
phenomena are collectively referred to as noise. Sometimes noise in data is normal. There may
well be underlying rules, but they may not hold for 100% of the cases.
Typically, the more noise there is in data, the more difficult it is to get accurate results.
However, SPSS Modeler’s machine-learning methods are able to handle noisy data and have been
used successfully on data sets containing almost 50% noise.
Ensure that there is sufficient data
In data mining, it is not necessarily the size of a data set that is important. The representativeness
of the data set is far more significant, together with its coverage of possible outcomes and
combinations of variables.
Typically, the more attributes that are considered, the more records will be needed to
give representative coverage.
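
One way to see why is that the number of possible value combinations multiplies with each attribute added. The Python sketch below counts how many combinations a data set actually contains against how many are possible. It is a rough illustration under stated assumptions: the function name combination_coverage and the records are hypothetical, and the possible values of each attribute are taken to be only those observed in the data.

```python
def combination_coverage(records, attributes):
    """Return (combinations seen, combinations possible) for the attributes.

    Assumption: the set of possible values for each attribute is taken
    from the values observed in the records themselves.
    """
    possible = 1
    for attr in attributes:
        possible *= len({rec[attr] for rec in records})
    seen = {tuple(rec[attr] for attr in attributes) for rec in records}
    return len(seen), possible


# Hypothetical records with three two-valued attributes.
records = [
    {"season": "summer", "region": "north", "outlet": "kiosk"},
    {"season": "summer", "region": "south", "outlet": "store"},
    {"season": "winter", "region": "north", "outlet": "store"},
    {"season": "winter", "region": "south", "outlet": "kiosk"},
]

two_attrs = combination_coverage(records, ["season", "region"])
three_attrs = combination_coverage(records, ["season", "region", "outlet"])
print(two_attrs)    # (4, 4): every season/region pair appears
print(three_attrs)  # (4, 8): only half of the three-way combinations appear
```

The same four records cover every combination of two attributes but only half of the combinations once a third attribute is considered.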
If the data is representative and there are general underlying rules, it may well be that a data
sample of a few thousand (or even a few hundred) records will give results as good as those from a
million, and you will get the results more quickly.
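
The point about sample size can be checked with a small simulation. The Python sketch below builds a hypothetical population of one million records in which a simple rule holds for about 30% of cases, then compares the rate measured on the full set with the rate measured on a random sample of 2,000 records. The 30% figure, the sample size, and the random seed are arbitrary assumptions chosen for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: one million records, each True in about 30% of cases.
population = [rng.random() < 0.30 for _ in range(1_000_000)]

# A simple random sample of 2,000 records drawn without replacement.
sample = rng.sample(population, 2_000)

full_rate = sum(population) / len(population)
sample_rate = sum(sample) / len(sample)

print(f"rate in full data set: {full_rate:.3f}")
print(f"rate in sample:        {sample_rate:.3f}")
```

Because the sample is drawn at random, it is representative of the population, and the two rates should typically differ by only a percentage point or two. A sample that was not representative, for example one taken from a single outlet or season, would carry no such assurance regardless of its size.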
Seek out the experts on the data
In many cases, you will be working on your own data and will therefore be highly familiar with
its content and meaning. However, if you are working on data for another department of your
organization or for a client, it is highly desirable that you have access to experts who know the
data. They can guide you in the identification of relevant attributes and can help to interpret the
results of data mining, distinguishing the true nuggets of information from “fool’s gold,” or
artifacts caused by anomalies in the data sets.