
Performance Considerations for Streams and Nodes
The following operations cannot be performed in most databases. They should be placed in the
stream after the operations in the preceding list:
Operations on any nondatabase data, such as flat files
Merge by order
Balance
Distinct operations in discard mode or where only a subset of fields is selected as distinct
Any operation that requires accessing data from records other than the one being processed
State and count field derivations
History node operations
Operations involving “@” (time-series) functions
Type-checking modes Warn and Abort
Model construction, application, and analysis
Note: Decision trees, rulesets, linear regression, and factor-generated models can generate
SQL and can therefore be pushed back to the database.
Data output to anywhere other than the same database that is processing the data
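The ordering advice above can be illustrated with a small sketch in plain Python. It is a simplified model, not the IBM SPSS Modeler scripting API: the names Op, sql_capable, and pushback_prefix are hypothetical, and the sketch assumes that only the unbroken run of database-capable operations at the start of a stream can be executed in the database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical stand-in for a stream operation (not a Modeler class)."""
    name: str
    sql_capable: bool  # assumption: True if the database can run this operation


def pushback_prefix(ops):
    """Return the leading operations that could run in the database.

    Simplified model: the run ends at the first operation that the
    database cannot perform.
    """
    prefix = []
    for op in ops:
        if not op.sql_capable:
            break
        prefix.append(op)
    return prefix


# A History operation placed early ends the database-capable run immediately.
early_history = [Op("select", True), Op("history", False), Op("aggregate", True)]
# The same operations with the History operation placed last.
late_history = [Op("select", True), Op("aggregate", True), Op("history", False)]

early_names = [op.name for op in pushback_prefix(early_history)]
late_names = [op.name for op in pushback_prefix(late_history)]
```

In this model, moving the History operation to the end lets two operations run in the database instead of one. Reordering is appropriate only when it does not change the result of the stream.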
Node Caches
To optimize stream running, you can set up a cache on any nonterminal node. When you set up a
cache on a node, the cache is filled with the data that passes through the node the next time you
run the data stream. From then on, the data is read from the cache (which is stored on disk in a
temporary directory) rather than from the data source.
Caching is most useful following a time-consuming operation such as a sort, merge, or
aggregation. For example, suppose that you have a source node set to read sales data from a
database and an Aggregate node that summarizes sales by location. You can set up a cache on the
Aggregate node rather than on the source node because you want the cache to store the aggregated
data rather than the entire data set.
Note: Caching at source nodes, which simply stores a copy of the original data as it is read into
IBM® SPSS® Modeler, will not improve performance in most circumstances.
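The sales example can be sketched in plain Python to show why the cache belongs on the aggregation step. This is an illustration of the idea only and does not use the IBM SPSS Modeler API: CachedStep, read_sales, and aggregate_sales are hypothetical names, and the sales figures are invented sample values.

```python
import os
import pickle
import tempfile
from collections import defaultdict


class CachedStep:
    """Hypothetical step whose result is cached in a temporary file on disk."""

    def __init__(self, compute):
        self.compute = compute
        self.path = None  # set once the cache has been filled
        self.runs = 0     # number of times compute() was actually executed

    def run(self):
        # Later runs read the stored result instead of recomputing it.
        if self.path is not None:
            with open(self.path, "rb") as fh:
                return pickle.load(fh)
        # First run: compute the result and fill the cache.
        self.runs += 1
        result = self.compute()
        fd, self.path = tempfile.mkstemp(suffix=".cache")
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(result, fh)
        return result

    def flush(self):
        # Discard the cached data so that the next run recomputes it.
        if self.path is not None:
            os.remove(self.path)
            self.path = None


def read_sales():
    # Stand-in for the source node; invented sample records.
    return [("north", 120.0), ("south", 80.0), ("north", 30.0)]


def aggregate_sales():
    # Stand-in for the Aggregate node: total sales by location.
    totals = defaultdict(float)
    for location, amount in read_sales():
        totals[location] += amount
    return dict(totals)


# Cache the aggregated result, not the raw source data.
aggregate_node = CachedStep(aggregate_sales)
first = aggregate_node.run()   # computes and fills the cache
second = aggregate_node.run()  # served from the cache on disk
```

After the first run, the aggregation is not repeated, and the cache holds only the summarized rows rather than the entire data set.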
Nodes with caching enabled are displayed with a small document icon at the top right corner.
When the data is cached at the node, the document icon is green.