
Handling Missing Values
In general terms, there are two approaches you can follow:
You can exclude fields or records with missing values.
You can impute, replace, or coerce missing values using a variety of methods.
Both of these approaches can be largely automated using the Data Audit node. For example, you
can generate a Filter node that excludes fields with too many missing values to be useful in
modeling, and generate a Supernode that imputes missing values for any or all of the fields that
remain. This is where the real power of the audit comes in, allowing you not only to assess the
current state of your data, but to take action based on the assessment.
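The filter-then-impute workflow described above can be sketched outside the product in plain Python. This is only an illustration of the idea, not the Data Audit node's own logic: the sample records, the field names, the 0.5 cutoff (`MAX_MISSING`), and the choice of mean imputation are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical sample data; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "fax": None},
    {"age": None, "income": 61000, "fax": None},
    {"age": 45, "income": None, "fax": None},
    {"age": 29, "income": 48000, "fax": "555-0100"},
]

MAX_MISSING = 0.5  # assumed cutoff for this sketch, not a product default


def missing_fraction(rows, field):
    """Return the fraction of rows in which `field` is missing."""
    return sum(row[field] is None for row in rows) / len(rows)


def filter_and_impute(rows, max_missing=MAX_MISSING):
    """Drop fields that are mostly missing, then mean-impute the rest."""
    fields = list(rows[0])
    # Step 1 (filter): keep only fields whose missing fraction is acceptable.
    kept = [f for f in fields if missing_fraction(rows, f) <= max_missing]
    # Step 2 (impute): mean of the observed values; assumes kept fields are numeric.
    fill = {f: mean(r[f] for r in rows if r[f] is not None) for f in kept}
    return [{f: fill[f] if r[f] is None else r[f] for f in kept} for r in rows]


cleaned = filter_and_impute(records)
```

Here the `fax` field is missing in three of four records, so it is dropped, while the single blanks in `age` and `income` are filled with the mean of the observed values.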
Handling Records with Missing Values
If the majority of missing values is concentrated in a small number of records, you can just
exclude those records. For example, a bank usually keeps detailed and complete records on
its loan customers. If, however, the bank is less restrictive in approving loans for its own staff
members, data gathered for staff loans is likely to have several blank fields. In such a case, there
are two options for handling these missing values:
You can use a Select node to remove the staff records.
If the data set is large, you can discard all records with blanks.
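The two record-level options can be compared with a small Python sketch. It is not Select node syntax; the loan records and the `is_staff`, `income`, and `employer` field names are hypothetical and chosen only to mirror the bank example.

```python
# Hypothetical loan records; None marks a blank field.
loans = [
    {"id": 1, "is_staff": False, "income": 52000, "employer": "Acme"},
    {"id": 2, "is_staff": True, "income": None, "employer": None},
    {"id": 3, "is_staff": False, "income": 61000, "employer": None},
    {"id": 4, "is_staff": True, "income": 70000, "employer": "Bank"},
]


def remove_staff(rows):
    """Option 1: keep only the non-staff records."""
    return [r for r in rows if not r["is_staff"]]


def drop_incomplete(rows):
    """Option 2: discard every record that has at least one blank."""
    return [r for r in rows if all(v is not None for v in r.values())]


non_staff = remove_staff(loans)
complete = drop_incomplete(loans)
```

The two options are not equivalent. Removing staff records keeps record 3 even though it still has a blank, while discarding all records with blanks keeps the complete staff record 4 and loses record 3.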
Handling Fields with Missing Values
If the majority of missing values is concentrated in a small number of fields, you can address them
at the field level rather than at the record level. This approach also allows you to experiment with
the relative importance of particular fields before deciding on an approach for handling missing
values. If a field is unimportant in modeling, it probably is not worth keeping, regardless of how
many missing values it has.
For example, a market research company may collect data from a general questionnaire
containing 50 questions. Two of the questions address age and political persuasion, information
that many people are reluctant to give. In this case, Age and Political_persuasion have many
missing values.
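One way to check whether missing values really are concentrated in a few fields is to count blanks per field. The following Python sketch assumes a tiny hypothetical set of questionnaire responses; the `q1` and `q2` field names and the decision to drop the two worst fields are illustrative choices, not a recommendation from this guide.

```python
from collections import Counter

# Hypothetical questionnaire responses; None marks an unanswered question.
responses = [
    {"q1": 3, "q2": 5, "Age": None, "Political_persuasion": None},
    {"q1": 4, "q2": 2, "Age": 41, "Political_persuasion": None},
    {"q1": 1, "q2": 4, "Age": None, "Political_persuasion": "independent"},
    {"q1": 5, "q2": None, "Age": None, "Political_persuasion": None},
]


def missing_by_field(rows):
    """Count how many records have a missing value in each field."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                counts[field] += 1
    return counts


def drop_fields(rows, fields):
    """Return copies of the records without the named fields."""
    return [{k: v for k, v in r.items() if k not in fields} for r in rows]


by_field = missing_by_field(responses)
# The two fields with the most blanks (an arbitrary number for this sketch).
worst = {field for field, _ in by_field.most_common(2)}
trimmed = drop_fields(responses, worst)
```

Before dropping such fields in practice, it is worth checking how much they contribute to the model, since a field with many blanks may still be important.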
Field Measurement Level
In determining which method to use, you should also consider the measurement level of fields
with missing values.
Numeric fields. For numeric field types, such as Continuous, you should always eliminate any
non-numeric values before building a model, because many models will not function if blanks are
included in numeric fields.
Categorical fields. For categorical fields, such as Nominal and Flag, altering missing values is not
necessary but will increase the accuracy of the model. For example, a model that uses the field Sex
will still function with meaningless values, such as Y and Z, but removing all values other than M
and F will increase the accuracy of the model.
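The difference between the two measurement levels can be illustrated with a short Python sketch. It assumes raw values arrive as strings, treats anything that does not parse as a finite number as missing, and takes M and F as the only valid categories for Sex; the helper names `coerce_numeric` and `clean_category` are hypothetical.

```python
import math

VALID_SEX = {"M", "F"}  # assumed set of meaningful categories


def coerce_numeric(value):
    """Return `value` as a float, or None if it is blank or non-numeric."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    # Treat NaN and infinities as missing as well.
    return number if math.isfinite(number) else None


def clean_category(value, valid=VALID_SEX):
    """Return `value` if it is a valid category, otherwise None."""
    return value if value in valid else None


# Hypothetical raw records with blanks and meaningless values.
rows = [
    {"Age": "34", "Sex": "M"},
    {"Age": "n/a", "Sex": "Y"},
    {"Age": "", "Sex": "F"},
    {"Age": None, "Sex": "Z"},
]

cleaned = [
    {"Age": coerce_numeric(r["Age"]), "Sex": clean_category(r["Sex"])}
    for r in rows
]
```

After this step, every non-numeric Age and every Sex value other than M or F is marked as missing (None), so it can be excluded or imputed using one of the approaches described earlier.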