HP RISS Components manual Matching word sequences, Fuzzy words, Measuring word similarity

Fuzzy words

You can search for document words that are textually similar to a given literal query word (that is, one containing no wildcards). To do this, append a tilde (~) character to the word, creating a fuzzy word. For example, the fuzzy word define~ matches the similar words defined and definite, but does not match defining, definition, indefinite, or pine. It also matches define itself.

Measuring word similarity

The edit distance (also called Levenshtein distance) between two words is the number of single-character operations (deletion, replacement, or insertion) required to change one word into the other word.

For example, the edit distance between define and pine is three: two deletions (d and e) and one replacement (f by p). The distance between define and definite is also three (e replaced by i; te inserted).

The search engine considers define more similar to definite than to pine, even though the edit distances are the same (three), because it compares the edit distance to the length of the shorter of the two words (the query word and the document word). For querying purposes, two words are closer if fewer changes, relative to their lengths, are needed to turn one word into the other.

The similarity ratio used by the search engine is d/min(query, doc), where d is the edit distance, min is a function that returns the lesser of its arguments, and query and doc are the lengths of the query word and document word, respectively. A fuzzy word matches a document word if this ratio is no more than 0.5.
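The ratio and the 0.5 threshold can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the search engine's actual implementation: the function names (edit_distance, similarity_ratio, fuzzy_match) are hypothetical, and the distance is computed with the standard minimal Levenshtein algorithm. Note that the minimal algorithm finds a distance of two for define and definite (insert i and t before the final e) rather than the three counted in this guide; both values give a ratio of no more than 0.5, so the match result is the same. How the engine itself counts operations is not specified here, so results at exactly 0.5 may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimal number of single-character deletions, replacements, or insertions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca by cb, or keep
        prev = curr
    return prev[-1]


def similarity_ratio(query: str, doc: str) -> float:
    """d / min(query, doc), as described in the text. Assumes non-empty words."""
    if not query or not doc:
        raise ValueError("words must be non-empty")
    return edit_distance(query, doc) / min(len(query), len(doc))


def fuzzy_match(query: str, doc: str, threshold: float = 0.5) -> bool:
    """True if the ratio is no more than the threshold (0.5 per the text)."""
    return similarity_ratio(query, doc) <= threshold


for word in ("definite", "pine"):
    print(word, edit_distance("define", word), fuzzy_match("define", word))
# definite 2 True
# pine 3 False
```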

Examples:

Words Compared      Similarity Ratio            Match?

define, definite    3/min(6, 8) = 3/6 = 0.5     yes

define, pine        3/min(6, 4) = 3/4 = 0.75    no (0.75 > 0.5)

Matching word sequences

You can use word sequences to find documents with words that occur in a specified order and are separated by a specified maximum distance.

Topics include:

Simple word sequences, page 36

Proximity word sequences, page 36

Simple word sequences

To search for an ordered sequence of words, use a simple word sequence, which is a list of literal query words (no wildcards) separated by spaces (or other separators) and enclosed in quotes ("). A document matches a simple word sequence if all words occur in the document in the same order, with no intervening words.

For example, the sequence "like a rolling stone" does not match a document with the text like a large rolling stone because of the intervening word large.
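The matching rule for simple word sequences can be illustrated with a short Python sketch. It is not the product's implementation: the function names are hypothetical, and two simplifying assumptions are made that this section does not confirm. Word characters are taken to be ASCII letters and digits only (everything else is treated as a separator), and matching is case-insensitive.

```python
import re


def words(text: str) -> list[str]:
    # Assumption: ASCII letters and digits are word characters;
    # any other character acts as a separator. Case is ignored.
    return re.findall(r"[A-Za-z0-9]+", text.lower())


def matches_simple_sequence(sequence: str, document: str) -> bool:
    """True if all sequence words occur in the document in the same order,
    with no intervening words."""
    query = words(sequence)
    doc = words(document)
    if not query:
        return False
    n = len(query)
    # Slide a window of n words across the document and compare.
    return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))


print(matches_simple_sequence("like a rolling stone", "like a large rolling stone"))
# False
print(matches_simple_sequence("like a rolling stone", "Just like a rolling stone."))
# True
```

Because the comparison is made on word lists rather than raw text, separators such as punctuation between the words do not prevent a match, but any extra word does.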

Proximity word sequences

A simple word sequence matches only when the document words are separated by separators and not by other words. To search for document words that occur in an ordered sequence but might be separated by other words, use a proximity word sequence.

To write a proximity word sequence, use the same syntax as a simple word sequence, but append a tilde (~) character to the second quote, and follow that with a numeric proximity value. The proximity value represents the maximum number of other document words that can occur between any two successive
