HP Integrated Archive Platform Proximity word sequences, Matching word sequences in attachments


Proximity word sequences

You can use simple word sequences to search for words separated by separators but not by other words. To search for document words that are in an ordered sequence, but might be separated by other words, use a proximity word sequence.

To write a proximity word sequence, use the same syntax as a simple word sequence, but append a tilde (~) character after the closing quote, and follow that with a numeric proximity value. The proximity value represents the maximum number of other document words that can occur between any two successive words of the sequence. A document matches a proximity word sequence if all words of the sequence occur in the document in the same order, with at most N intervening words between any two successive sequence words, where N is the proximity value.

For example, the sequence "bird garden stone"~3 matches any document that has these three words in this order, with bird and garden separated by no more than three words, and garden and stone separated by no more than three words. This sequence matches a document with the text “a bird in the rose garden is near a stone” because there are at most three words between successive sequence words. This sequence also matches “a bird garden with a stone” for the same reason.
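The matching rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not IAP's implementation: the function names `tokenize` and `matches_proximity` are hypothetical, and the tokenizer assumes that words are runs of ASCII letters and digits and that matching ignores case.

```python
import re


def tokenize(text):
    # Assumption: a word is a run of ASCII letters or digits; everything
    # else is a separator. Lowercasing (case-insensitive matching) is also
    # an assumption made for this sketch.
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_proximity(text, sequence, n):
    """Return True if the words of `sequence` occur in `text` in order,
    with at most `n` other words between any two successive words."""
    doc = tokenize(text)
    words = tokenize(sequence)
    if not words:
        return False
    # Document positions where the first sequence word occurs.
    reachable = {i for i, w in enumerate(doc) if w == words[0]}
    for word in words[1:]:
        # Keep positions of `word` that follow some reachable position
        # of the previous word with at most n intervening words.
        reachable = {
            p
            for p, w in enumerate(doc)
            if w == word and any(0 < p - q <= n + 1 for q in reachable)
        }
        if not reachable:
            return False
    return True


print(matches_proximity("a bird in the rose garden is near a stone",
                        "bird garden stone", 3))   # True
print(matches_proximity("a bird garden with a stone",
                        "bird garden stone", 3))   # True
print(matches_proximity("a bird garden with a stone",
                        "bird garden stone", 0))   # False
```

Tracking every reachable position, rather than only the first occurrence of each word, lets the sketch find a match even when an earlier occurrence of a word is too far from the next one.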

Simple word sequences are a special case of proximity word sequences: ". . ." is the same as ". . ."~0. Any documents found by ". . ."~N are also found by ". . ."~M, when M > N.

Matching word sequences in attachments

This section discusses word matching in attachments. As with other documents, IAP renders attachment documents (such as spreadsheets and PDF files) into text words. When IAP renders a document, it follows the document application’s internal representation of the file.

Certain file types, for example spreadsheets, look very different internally than they do externally. This means that the word sequence in the external application representation, which the end user sees, may differ from the word sequence in the internal application representation. IAP query matching uses the internal application representation. The following two examples illustrate this.

Example 1. Separators are ignored

IAP renders text into words. Remaining characters such as periods, commas, spaces, and newlines are considered separators and are ignored. Phrase queries ignore all formatting elements and non-word characters. For example, the following original plain text:

“This was news to Mr. Smith.

Johnson, however, knew better.”

matches the phrase query “Smith Johnson”.

This is because internally, the two plain text sentences are represented as one long string of continuous words: “This was news to Mr Smith Johnson however knew better”.
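The effect of ignoring separators can be reproduced with a short Python sketch. It is illustrative only and does not show how IAP renders text: the helper names `to_words` and `contains_phrase` are hypothetical, and the sketch assumes that words are runs of ASCII letters and digits.

```python
import re


def to_words(text):
    # Assumption: a word is a run of ASCII letters or digits. Periods,
    # commas, spaces, and newlines act as separators and are dropped.
    return re.findall(r"[A-Za-z0-9]+", text)


def contains_phrase(text, phrase):
    """Return True if the words of `phrase` appear consecutively,
    in order, in the word sequence rendered from `text`."""
    doc = to_words(text)
    target = to_words(phrase)
    n = len(target)
    return n > 0 and any(
        doc[i:i + n] == target for i in range(len(doc) - n + 1)
    )


text = "This was news to Mr. Smith.\n\nJohnson, however, knew better."

print(" ".join(to_words(text)))
# This was news to Mr Smith Johnson however knew better

print(contains_phrase(text, "Smith Johnson"))  # True
```

Because the sentence boundary and the blank line are both separators, “Smith” and “Johnson” end up adjacent in the rendered word sequence, which is why the phrase query matches.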

Example 2. Sequence is not intuitive

Internally in an attachment’s original application, a large multi-page document or a single page spreadsheet equates to a long text sequence. Text may not appear in the same sequence internally as it appears externally. Also, multiple instances of the same text in certain file types are represented as a single instance.

Spreadsheets

Look at the external representation of the following example spreadsheet.
