Matching similar words
Topics include:
•Fuzzy words, page 40
•Measuring word similarity, page 40
Fuzzy words
You can search for document words that are textually similar to a given literal query word (that is, one containing no wildcards). To do this, append a tilde (~) character to the word, creating a fuzzy word. For example, the fuzzy word define~ matches the similar words defined and definite, but does not match defining, definition, indefinite, or pine. It also matches define itself.
Measuring word similarity
The edit distance (also called Levenshtein distance) between two words is the number of single-character insertions, deletions, and replacements needed to change one word into the other.
For example, the edit distance between define and pine is three: two deletions (d and e) and one replacement (f by p). The distance between define and definite is also three (e replaced by i; te inserted).
The search engine considers define more similar to definite than to pine, even though both edit distances are three, because it compares the edit distance (the number of character changes) to the length of the shorter of the two words (the query word and the document word). For querying purposes, two words are closer when fewer changes, relative to word length, are needed to turn one word into the other.
The similarity ratio used by the search engine is d/min(query, doc), where d is the edit distance, min is a function that returns the lesser of its arguments, and query and doc are the lengths of the query word and document word, respectively. A fuzzy word matches a document word if this ratio is no more than 0.5.
Examples:
Words Compared | Similarity Ratio | Match?
define, definite | 3/min(6, 8) = 3/6 = 0.5 | yes
define, pine | 3/min(6, 4) = 3/4 = 0.75 | no (0.75 > 0.5)
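The ratio test described above can be sketched in a few lines of Python. This is only an illustration, not the search engine's implementation: the function names edit_distance, similarity_ratio, and fuzzy_match are hypothetical, and the distance is computed with the textbook Levenshtein algorithm. The engine may count edits differently. For instance, the textbook algorithm reports 2 for define and definite (insert i and t before the final e), whereas the example above counts three, so results near the 0.5 boundary may not agree with the engine.

```python
def edit_distance(a, b):
    """Textbook Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # replace ca by cb, or keep it
            ))
        prev = curr
    return prev[-1]


def similarity_ratio(query, doc):
    """d / min(len(query), len(doc)), as defined in the text."""
    return edit_distance(query, doc) / min(len(query), len(doc))


def fuzzy_match(query, doc, threshold=0.5):
    """True if the ratio is no more than the threshold (0.5 in the text)."""
    if not query or not doc:
        # Assumption: an empty word never matches (avoids dividing by zero).
        return False
    return similarity_ratio(query, doc) <= threshold


print(edit_distance("define", "pine"))     # 3
print(similarity_ratio("define", "pine"))  # 0.75
print(fuzzy_match("define", "pine"))       # False
print(fuzzy_match("define", "defined"))    # True
```

Dividing by the shorter length makes the test stricter for short words: three edits are tolerated against a six-letter word but not against a four-letter one.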
Matching word sequences
You can use word sequences to find documents with words that occur in a specified order and are separated by a specified maximum distance.
Topics include:
•Simple word sequences, page 40
•Proximity word sequences, page 41
•Matching word sequences in attachments, page 41
Simple word sequences
To search for an ordered sequence of words, use a simple word sequence, which is a list of literal query words (no wildcards) separated by spaces (or other separators) and enclosed in quotes ("). A document matches a simple word sequence if all words occur in the document in the same order, with no intervening words.
For example, the sequence "like a rolling stone" does not match a document with the text like a large rolling stone because of the intervening word large.
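The matching rule for simple word sequences can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the engine's actual behavior: the helper names tokenize and matches_simple_sequence are hypothetical, and the sketch assumes that words are runs of letters, digits, or apostrophes and that comparison ignores case.

```python
import re


def tokenize(text):
    # Assumption: a word is a run of letters, digits, or apostrophes,
    # and matching is case-insensitive. The engine's real rules may differ.
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def matches_simple_sequence(sequence, document):
    """True if the words of `sequence` occur in `document` in the same
    order with no intervening words."""
    words = tokenize(sequence)
    doc_words = tokenize(document)
    n = len(words)
    if n == 0:
        return False  # assumption: an empty sequence matches nothing
    # Slide a window of n words across the document and compare.
    return any(doc_words[i:i + n] == words
               for i in range(len(doc_words) - n + 1))


print(matches_simple_sequence("like a rolling stone",
                              "like a large rolling stone"))  # False
print(matches_simple_sequence("like a rolling stone",
                              "just like a rolling stone"))   # True
```

The first call fails because the intervening word large breaks the run of adjacent words, which mirrors the example in the text.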
40 Query expression syntax and matching