Chapter 6
Handling Missing Values

Overview of Missing Values

During the Data Preparation phase of data mining, you will often want to replace missing values
in the data. Missing values are values in the data set that are unknown, uncollected, or incorrectly
entered. Usually, such values are invalid for their fields. For example, the field Sex should contain
the values M and F. If you discover the values Y or Z in the field, you can safely assume that such
values are invalid and should therefore be interpreted as blanks. Likewise, a negative value for the
field Age is meaningless and should also be interpreted as a blank. Frequently, such obviously
wrong values are purposely entered, or fields left blank, during a questionnaire to indicate a
nonresponse. At times, you may want to examine these blanks more closely to determine whether
a nonresponse, such as the refusal to give one’s age, is a factor in predicting a specific outcome.
Some modeling techniques handle missing data better than others. For example, C5.0 and
Apriori cope well with values that are explicitly declared as “missing” in a Type node. Other
modeling techniques have trouble dealing with missing values: they can take longer to train
and can produce less-accurate models.
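To make the Sex and Age examples above concrete, the following is a minimal sketch in plain Python, not SPSS Modeler code. The field names come from the text. The record layout, the BLANK marker, and the helper names clean_record and nonresponse_flags are assumptions made for illustration. The sketch replaces invalid values with a blank marker and derives a per-field flag so that a nonresponse can itself be examined as a possible predictor.

```python
# Illustrative sketch only: plain Python, not SPSS Modeler functionality.
# BLANK, VALID_SEX, clean_record and nonresponse_flags are hypothetical names.

BLANK = None  # assumed marker meaning "interpret this value as a blank"
VALID_SEX = {"M", "F"}


def clean_record(record):
    """Return a copy of record with invalid Sex/Age values replaced by BLANK."""
    cleaned = dict(record)
    if cleaned.get("Sex") not in VALID_SEX:
        cleaned["Sex"] = BLANK
    age = cleaned.get("Age")
    # A missing, non-numeric, or negative age is treated as meaningless.
    if not isinstance(age, (int, float)) or isinstance(age, bool) or age < 0:
        cleaned["Age"] = BLANK
    return cleaned


def nonresponse_flags(record):
    """Return one boolean per field: True where the cleaned value is blank."""
    return {field + "_missing": value is BLANK for field, value in record.items()}


records = [
    {"Sex": "M", "Age": 34},
    {"Sex": "Z", "Age": 51},   # invalid code entered for Sex
    {"Sex": "F", "Age": -1},   # negative age used to signal a nonresponse
]

cleaned = [clean_record(r) for r in records]
flags = [nonresponse_flags(r) for r in cleaned]
```

The flag values could then be supplied to a model alongside the cleaned fields to check whether a refusal to answer carries any predictive signal.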
There are several types of missing values recognized by IBM® SPSS® Modeler:
Null or system-missing values. These are nonstring values that have been left blank in the
database or source file and have not been specifically defined as “missing” in a source or
Type node. System-missing values are displayed as $null$. Note that empty strings are not
considered nulls in SPSS Modeler, although they may be treated as nulls by certain databases.
Empty strings and white space. Empty string values and white space (strings with no visible
characters) are treated as distinct from null values. Empty strings are treated as equivalent to
white space for most purposes. For example, if you select the option to treat white space as
blanks in a source or Type node, this setting applies to empty strings as well.
Blank or user-defined missing values. These are values such as unknown, 99, or –1 that are
explicitly defined in a source node or Type node as missing. Optionally, you can also choose
to treat nulls and white space as blanks, which allows them to be flagged for special treatment
and to be excluded from most calculations. For example, you can use the @BLANK function to
treat these values, along with other types of missing values, as blanks.
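The three categories above can be summarized with a small classifier. This is a sketch in plain Python under stated assumptions, not a description of how SPSS Modeler implements them. None stands in for $null$, USER_BLANKS is a hypothetical set of user-defined missing codes taken from the examples in the text, and is_blank only loosely mirrors the idea behind @BLANK when the optional settings for nulls and white space are enabled.

```python
# Illustrative sketch only: mimics the categories described above in plain
# Python. It is not SPSS Modeler code, and all names here are hypothetical.

USER_BLANKS = {"unknown", 99, -1}  # assumed user-defined missing codes


def classify(value, user_blanks=USER_BLANKS):
    """Return the missing-value category of value, or 'valid'."""
    if value is None:                      # None stands in for $null$
        return "null"
    if isinstance(value, str):
        if value == "":
            return "empty string"
        if value.strip() == "":            # string with no visible characters
            return "white space"
    if value in user_blanks:
        return "user-defined blank"
    return "valid"


def is_blank(value, nulls_as_blanks=True, whitespace_as_blanks=True):
    """Loose analogue of a blank check with the optional settings applied."""
    kind = classify(value)
    if kind == "user-defined blank":
        return True
    if kind == "null":
        return nulls_as_blanks
    if kind in ("empty string", "white space"):
        # The white space setting applies to empty strings as well.
        return whitespace_as_blanks
    return False


samples = [None, "", "   ", "unknown", 99, -1, "F", 42]
kinds = [classify(v) for v in samples]
```

Keeping classification separate from the blank check reflects the distinction drawn in the text: nulls, empty strings, and white space are distinct kinds of value, and whether they count as blanks depends on the options chosen.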
© Copyright IBM Corporation 1994, 2012.