HP Integrated Archive Platform Proximity word sequences, Matching word sequences in attachments

Models: Integrated Archive Platform


Proximity word sequences

You can use simple word sequences to search for words separated by separators but not by other words. To search for document words that are in an ordered sequence, but might be separated by other words, use a proximity word sequence.

To write a proximity word sequence, use the same syntax as a simple word sequence, but append a tilde (~) character to the second quote, and follow that with a numeric proximity value. The proximity value represents the maximum number of other document words that can occur between any two successive words of the sequence. A document matches a proximity word sequence if all words occur in the document in the same order, with at most N intervening words, where N is the proximity value.

For example, the sequence "bird garden stone"~3 matches any document that has these three words in this order, with bird and garden separated by no more than three words, and garden and stone separated by no more than three words. This sequence matches a document with the text "a bird in the rose garden is near a stone" because there are at most three words between successive sequence words ("in the rose" between bird and garden, "is near a" between garden and stone). It also matches "a bird garden with a stone" for the same reason.

Simple word sequences are a special case of proximity word sequences: ". . ." is the same as ". . ."~0. Any documents found by ". . ."~N are also found by ". . ."~M, when M > N.
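The matching rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this section, not IAP's actual implementation: the function names tokenize and proximity_match are hypothetical, and the tokenizer (lowercased runs of letters and digits) is an assumption about how text is split into words.

```python
import re


def tokenize(text):
    """Assumed tokenizer: lowercase runs of letters/digits are words;
    every other character is treated as a separator."""
    return re.findall(r"[a-z0-9]+", text.lower())


def proximity_match(sequence, n, text):
    """Return True if the words of `sequence` occur in `text` in order,
    with at most `n` other words between any two successive words."""
    words = tokenize(text)
    terms = tokenize(sequence)
    if not terms:
        return False
    # Positions where a partial match ending in the current term can end.
    reachable = {i for i, w in enumerate(words) if w == terms[0]}
    for term in terms[1:]:
        # A gap of at most n words means the positions differ by at most n + 1.
        reachable = {
            q
            for q, w in enumerate(words)
            if w == term and any(0 < q - p <= n + 1 for p in reachable)
        }
    return bool(reachable)


doc = "a bird in the rose garden is near a stone"
print(proximity_match("bird garden stone", 3, doc))  # True
print(proximity_match("bird garden stone", 2, doc))  # False
```

With a proximity value of 0 the function behaves like a simple word sequence, and any text it accepts for a value N is also accepted for a larger value M, which mirrors the two properties stated above.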

Matching word sequences in attachments

This section discusses word matching in attachments. As with other documents, IAP renders attachments (such as spreadsheets and PDF files) into text words. When IAP renders a document, it follows the document application’s internal representation of the file.

Certain file types, for example spreadsheets, look very different internally than they do externally. This means that the word sequence in the external representation, which the end user sees, may differ from the word sequence in the internal representation. IAP query matching uses the internal representation. The following examples illustrate the difference.

Example 1. Separators are ignored

IAP renders text into words. Remaining characters such as periods, commas, spaces, and newlines are considered separators and are ignored. Phrase queries ignore all formatting elements and non-word characters. For example, the original plain text:

“This was news to Mr. Smith.

Johnson, however, knew better.”

matches the phrase query: “Smith Johnson”

This is because internally, the two plain text sentences are represented as one long string of continuous words: “This was news to Mr Smith Johnson however knew better”.
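A minimal sketch of this behavior follows. It assumes that words are runs of letters, digits, or underscores and that everything else is a separator; the helper names to_words and phrase_match are hypothetical and do not come from IAP.

```python
import re


def to_words(text):
    """Assumed rendering step: keep word characters, drop all separators
    (periods, commas, spaces, newlines, and so on)."""
    return re.findall(r"\w+", text)


def phrase_match(phrase, text):
    """Return True if the words of `phrase` appear consecutively in `text`."""
    words = to_words(text)
    terms = to_words(phrase)
    k = len(terms)
    return k > 0 and any(
        words[i:i + k] == terms for i in range(len(words) - k + 1)
    )


text = "This was news to Mr. Smith.\n\nJohnson, however, knew better."
print(" ".join(to_words(text)))
# This was news to Mr Smith Johnson however knew better
print(phrase_match("Smith Johnson", text))  # True
```

Because the sentence boundary and the blank line are discarded before matching, "Smith" and "Johnson" become adjacent words even though they belong to different sentences.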

Example 2. Sequence is not intuitive

Within an attachment’s original application, a large multi-page document or a single-page spreadsheet is represented internally as one long text sequence. Text may not appear in the same order internally as it does externally. Also, in certain file types, multiple instances of the same text are represented as a single instance.

Spreadsheets

Look at the external representation of the following example spreadsheet.
