
Aggregate. When the Keys are contiguous option is not set, this node reads (but does not store) its entire input data set before it produces any aggregated output. In the more extreme situations, where the size of the aggregated data reaches a limit (determined by the SPSS Modeler Server configuration option Memory usage multiplier), the remainder of the data set is sorted and processed as if the Keys are contiguous option were set. When this option is set, no data is stored because the aggregated output records are produced as the input data is read.
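The difference can be sketched in a few lines of Python (illustrative only, not Modeler code; the field and function names are hypothetical):

    from itertools import groupby

    def aggregate_contiguous(records, key_field, value_field):
        # Keys arrive already grouped: each aggregate is emitted as soon as
        # its key group ends, so only the current group is held in memory.
        for key, group in groupby(records, key=lambda r: r[key_field]):
            total = sum(r[value_field] for r in group)
            yield {key_field: key, value_field + "_Sum": total}

    def aggregate_unordered(records, key_field, value_field):
        # Keys arrive in arbitrary order: a running total for every group
        # must be kept until the whole input has been read.
        totals = {}
        for r in records:
            totals[r[key_field]] = totals.get(r[key_field], 0) + r[value_field]
        for key, total in totals.items():
            yield {key_field: key, value_field + "_Sum": total}

In the contiguous case the memory footprint is bounded by the largest single key group, which is why the option avoids storing data when the input really is grouped on the keys.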
Distinct. The Distinct node stores all of the unique key fields in the input data set; in cases where all fields are key fields and all records are unique, it stores the entire data set. By default, the Distinct node sorts the data on the key fields and then selects (or discards) the first distinct record from each group. For smaller data sets with a low number of distinct keys, or those that have been pre-sorted, you can choose options to improve the speed and efficiency of processing.
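As a rough Python illustration (not Modeler syntax; the presorted flag stands in for the option that skips the sort on pre-sorted input), the default behaviour amounts to a sort on the key fields followed by taking the first record of each key group:

    from itertools import groupby
    from operator import itemgetter

    def distinct_first(records, key_fields, presorted=False):
        keyfunc = itemgetter(*key_fields)
        # Default behaviour: sort on the key fields, then keep the first
        # record of each key group (discarding it instead keeps the rest).
        data = records if presorted else sorted(records, key=keyfunc)
        for _, group in groupby(data, key=keyfunc):
            yield next(group)

When the input is already sorted on the keys, the sort can be skipped and the node streams through the data.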
Type. In some instances, the Type node caches the input data when reading values; the cache is used for downstream processing. The cache requires sufficient disk space to store the entire data set but speeds up processing.
Evaluation. The Evaluation node must sort the input data to compute tiles. The sort is repeated for each model evaluated because the scores, and the consequent record order, are different in each case. The running time is M*N*log(N), where M is the number of models and N is the number of records.
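As a back-of-envelope check of that formula (plain Python, figures purely illustrative):

    import math

    def evaluation_cost(num_models, num_records):
        # One O(N log N) sort per model evaluated.
        return num_models * num_records * math.log2(num_records)

    print(evaluation_cost(3, 1_000_000))   # ~6.0e7 comparison steps
    print(evaluation_cost(6, 1_000_000))   # doubling the models doubles the work

Doubling the number of models doubles the sorting work, while doubling the number of records slightly more than doubles it.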
Performance: Modeling Nodes
Neural Net and Kohonen. Neural network training algorithms (including the Kohonen algorithm) make many passes over the training data. The data is stored in memory up to a limit, and the excess is spilled to disk. Accessing the training data from disk is expensive because the access method is random, which can lead to excessive disk activity. You can disable the use of disk storage for these algorithms, forcing all data to be stored in memory, by selecting the Optimize for speed option on the Model tab of the node’s dialog box. Note that if the amount of memory required to store the data is greater than the working set of the server process, part of it will be paged to disk and performance will suffer accordingly.
When Optimize for memory is enabled, a percentage of physical RAM is allocated to the algorithm according to the value of the IBM® SPSS® Modeler Server configuration option Modeling memory limit percentage. To use more memory for training neural networks, either provide more RAM or increase the value of this option, but note that setting the value too high will cause paging.
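For example (the figures are illustrative, not defaults): on a server with 16 GB of physical RAM and the option set to 25, roughly 4 GB would be made available to the training algorithm; raising the option to 50 would make roughly 8 GB available, provided that much memory is actually free.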
The running time of the neural network algorithms depends on the required level of accuracy. You can control the running time by setting a stopping condition in the node’s dialog box.
K-Means. The K-Means clustering algorithm has the same options for controlling memory usage as the neural network algorithms. Performance on data stored on disk is better, however, because access to the data is sequential.
Performance: CLEM Expressions
CLEM sequence functions (“@functions”) that look back into the data stream must store enough of the data to satisfy the longest look-back. For operations whose degree of look-back is unbounded, all values of the field must be stored. An unbounded operation is one where the