HP Integrated Archive Platform manual Matching similar words, Matching word sequences, Fuzzy words

Matching similar words

Topics include:

Fuzzy words, page 40

Measuring word similarity, page 40

Fuzzy words

You can search for document words that are textually similar to a given literal query word (that is, one containing no wildcards). To do this, append a tilde (~) character to the word, creating a fuzzy word. For example, the fuzzy word define~ matches the similar words defined and definite, but does not match defining, definition, indefinite, or pine. It also matches define itself.

Measuring word similarity

The edit distance (also called Levenshtein distance) between two words is the number of single-character operations (deletion, replacement, or insertion) required to change one word into the other word.

For example, the edit distance between define and pine is three: two deletions (d and e) and one replacement (f by p). The distance between define and definite is also three (e replaced by i; te inserted).

The search engine considers define more similar to definite than to pine, even though the edit distances are the same (three), because it compares the edit distance (the number of character changes) to the length of the shorter of the query word and the document word. For querying purposes, two words are closer if fewer changes are needed, relative to their lengths, to change one word into the other.

The similarity ratio used by the search engine is d/min(query, doc), where d is the edit distance, min is a function that returns the lesser of its arguments, and query and doc are the lengths of the query word and document word, respectively. A fuzzy word matches a document word if this ratio is no more than 0.5.
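The ratio test described above can be sketched in a few lines of Python. This is a minimal illustration, not the search engine's implementation: the function names edit_distance, similarity_ratio, and fuzzy_match are hypothetical, and the distance is computed with the textbook Levenshtein algorithm. Note that a textbook computation finds a two-step path from define to definite (insert i and t), which is shorter than the three-step path counted in the worked example, so the sketch's ratio for that pair is lower; the match decision is the same.

```python
def edit_distance(a: str, b: str) -> int:
    """Textbook Levenshtein distance: the minimum number of single-character
    deletions, insertions, or replacements needed to change a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca by cb, or keep
        prev = curr
    return prev[-1]


def similarity_ratio(query: str, doc: str) -> float:
    """d / min(query, doc); assumes both words are non-empty."""
    return edit_distance(query, doc) / min(len(query), len(doc))


def fuzzy_match(query: str, doc: str, threshold: float = 0.5) -> bool:
    """True if the ratio is no more than the threshold (0.5 in the text)."""
    return similarity_ratio(query, doc) <= threshold


print(edit_distance("define", "pine"))     # 3
print(similarity_ratio("define", "pine"))  # 0.75
print(fuzzy_match("define", "pine"))       # False
print(fuzzy_match("define", "defined"))    # True
```

Because the product may compute or bound the distance differently, treat results for borderline words (ratio exactly 0.5) as indicative only.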

Examples:

Words Compared      Similarity Ratio             Match?
define, definite    3/min(6, 8) = 3/6 = 0.5      yes
define, pine        3/min(6, 4) = 3/4 = 0.75     no (0.75 > 0.5)

Matching word sequences

You can use word sequences to find documents with words that occur in a specified order and are separated by a specified maximum distance.

Topics include:

Simple word sequences, page 40

Proximity word sequences, page 41

Matching word sequences in attachments, page 41

Simple word sequences

To search for an ordered sequence of words, use a simple word sequence, which is a list of literal query words (no wildcards) separated by spaces (or other separators) and enclosed in quotes ("). A document matches a simple word sequence if all words occur in the document in the same order, with no intervening words.

For example, the sequence "like a rolling stone" does not match a document with the text like a large rolling stone because of the intervening word large.
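The matching rule for simple word sequences can be illustrated with a short Python sketch. It is an approximation under stated assumptions: the helper names split_words and matches_simple_sequence are hypothetical, and words are taken to be runs of ASCII letters and digits, which simplifies the product's own definition of word characters and separators.

```python
import re


def split_words(text: str) -> list[str]:
    """Split text into words. Assumption: a word is a run of ASCII letters
    and digits; everything else is treated as a separator."""
    return re.findall(r"[A-Za-z0-9]+", text)


def matches_simple_sequence(sequence: str, document: str) -> bool:
    """True if the words of the sequence occur in the document in the same
    order with no intervening words."""
    query = split_words(sequence)
    doc = split_words(document)
    n = len(query)
    if n == 0:
        return False
    # Slide a window of n words over the document and compare word lists.
    return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))


print(matches_simple_sequence("like a rolling stone",
                              "like a large rolling stone"))  # False
print(matches_simple_sequence("like a rolling stone",
                              "just like a rolling stone."))  # True
```

Comparing whole word lists, rather than searching for a substring, is what makes the intervening word large break the match.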
