
Performance Considerations for Streams and Nodes
Note: The following databases support temporary tables for the purpose of caching: DB2,
Netezza, Oracle, SQL Server, and Teradata. Other databases will use a normal table for database
caching. The SQL code can be customized for specific databases; contact Support for assistance.
Performance: Process Nodes
Sort. The Sort node must read the entire input data set before it can be sorted. The data is stored in
memory up to some limit, and the excess is spilled to disk. The sorting algorithm is a combination
algorithm: data is read into memory up to the limit and sorted using a fast hybrid quick-sort
algorithm. If all the data fits in memory, then the sort is complete. Otherwise, a merge-sort
algorithm is applied. The sorted data is written to file and the next chunk of data is read into
memory, sorted, and written to disk. This is repeated until all the data has been read; then the sorted
chunks are merged. Merging may require repeated passes over the data stored on disk. At peak
usage, the Sort node will have two complete copies of the data set on disk: sorted and unsorted.
The overall running time of the algorithm is on the order of N*log(N), where N is the number
of records. Sorting in memory is faster than merging from disk, so the actual running time can
be reduced by allocating more memory to the sort. The algorithm allocates to itself a fraction of
physical RAM controlled by the IBM® SPSS® Modeler Server configuration option Memory
usage multiplier. To increase the memory used for sorting, provide more physical RAM or increase
this value. Note that when the proportion of memory used exceeds the working set of the process,
so that part of the memory is paged to disk, performance degrades because the memory-access
pattern of the in-memory sort algorithm is random and can cause excessive paging. The sort
algorithm is used by several nodes other than the Sort node, but the same performance rules apply.
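The sort-then-merge behavior described above can be sketched in Python. This is an illustrative example only, not SPSS Modeler's actual implementation; the function name, the chunk limit, and the use of temporary files are assumptions made for the sketch.

```python
import heapq
import os
import pickle
import tempfile

def external_sort(records, max_in_memory=4):
    """Sort records using in-memory chunks spilled to disk as sorted runs,
    then a k-way merge of the runs (a simplified external merge sort)."""
    runs = []   # file paths of sorted runs spilled to disk
    chunk = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) >= max_in_memory:
            chunk.sort()  # fast in-memory sort of one chunk
            f = tempfile.NamedTemporaryFile(delete=False)
            pickle.dump(chunk, f)
            f.close()
            runs.append(f.name)
            chunk = []
    chunk.sort()
    if not runs:
        # everything fit in memory: the sort is already complete
        yield from chunk
        return

    def read_run(path):
        with open(path, "rb") as f:
            yield from pickle.load(f)
        os.remove(path)

    # merge-sort phase: merge all sorted runs plus the final in-memory chunk
    yield from heapq.merge(*(read_run(p) for p in runs), iter(chunk))

print(list(external_sort([5, 3, 8, 1, 9, 2, 7])))  # [1, 2, 3, 5, 7, 8, 9]
```

Raising `max_in_memory` plays the same role as raising the Memory usage multiplier: more data is sorted in fast memory and fewer runs need to be merged from disk.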
Binning. The Binning node reads the entire input data set to compute the bin boundaries before
it allocates records to bins. The data set is cached while the boundaries are computed; then it is
rescanned for allocation. When the binning method is fixed-width or mean+standard deviation,
the data set is cached directly to disk. These methods have a linear running time and require
enough disk space to store the entire data set. When the binning method is ranks or tiles, the data
set is sorted using the sort algorithm described earlier, and the sorted data set is used as the cache.
Sorting gives these methods a running time of M*N*log(N), where M is the number of binned
fields and N is the number of records; it requires disk space equal to twice the data set size.
Generating a Derive node based on generated bins will improve performance in subsequent
passes. Derive operations are much faster than binning.
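The difference between the two binning families can be illustrated with a short sketch (hypothetical helper names, not SPSS Modeler code): fixed-width boundaries need only the minimum and maximum from one linear scan, while tile (quantile) boundaries require the sorted data.

```python
def fixed_width_bins(values, n_bins):
    """Boundaries from min/max only: one linear pass over the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def tile_bins(values, n_bins):
    """Rank-based boundaries: requires the full sort described earlier."""
    s = sorted(values)
    return [s[len(s) * i // n_bins] for i in range(1, n_bins)]

def assign(value, edges):
    """Second pass: allocate a value to a bin by comparing to boundaries.
    This comparison is the cheap operation a generated Derive node performs."""
    return sum(value > e for e in edges)

vals = [2, 9, 4, 7, 1, 8, 5, 3, 6, 10]
edges = tile_bins(vals, 4)        # [3, 6, 8]
print([assign(v, edges) for v in vals])
```

Once the boundaries exist, `assign` is a fixed set of comparisons per record, which is why deriving from generated bins beats re-running the Binning node.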
Merge by Key (Join). The Merge node, when the merge method is keys (equivalent to a database
join), sorts each of its input data sets by the key fields. This part of the procedure has a running
time of M*N*log(N), where M is the number of inputs and N is the number of records in the largest
input; it requires sufficient disk space to store all of its input data sets plus a second copy of the
largest data set. The running time of the merge itself is proportional to the size of the output
data set, which depends on the frequency of matching keys. In the worst case, where the output
is the Cartesian product of the inputs, the running time may approach N^M. This is rare; most
joins have many fewer matching keys. If one data set is relatively larger than the other(s), or if
the incoming data is already sorted by a key field, then you can improve the performance of
this node using the Optimization tab.
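A sort-merge join of the kind described can be sketched as follows. This is an illustrative two-input sketch with hypothetical names, not the Merge node's implementation; records are modeled as tuples whose first element is the key.

```python
def merge_by_key(left, right, key=lambda r: r[0]):
    """Join two record lists on a key: sort both inputs, then advance two
    cursors, emitting the cross product of each group of matching keys."""
    left = sorted(left, key=key)    # the M*N*log(N) sorting phase
    right = sorted(right, key=key)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # find the run of matching keys on each side
            i2 = i
            while i2 < len(left) and key(left[i2]) == kl:
                i2 += 1
            j2 = j
            while j2 < len(right) and key(right[j2]) == kl:
                j2 += 1
            # output is the cross product of the two runs; with all keys
            # equal this degenerates toward the N^M Cartesian worst case
            for a in left[i:i2]:
                for b in right[j:j2]:
                    out.append(a + b[1:])
            i, j = i2, j2
    return out

print(merge_by_key([(1, "a"), (2, "b")], [(2, "x"), (3, "y")]))
```

If an input is already sorted by the key, its sorting phase can be skipped entirely, which is the saving the Optimization tab exposes.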