HP Integrated Archive Platform manual Letters and digits in different character sets


Word characters and separators

Word characters include all uppercase and lowercase letters, digits, and the following additional characters:

_ (underscore)

# (number/pound/hash sign)

& (ampersand)

All other characters are separators. The exceptions apply only in queries: the wildcards ? and *, and the special query characters ~, ", -, and !.

However, && by itself is not a word. It is a Boolean operator. When combined with at least one more word character, && can be part of a word. For example, a&&b is a word.

Query analysis and document indexing are not case-sensitive. Uppercase and lowercase letters are treated the same.

Regular expression definition of English word characters

The following regular expression provides, in succinct form, a complete specification of English word characters (except for treatment of && as a non-word):

[A-Za-z0-9_#&]+
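The rules above can be sketched in a few lines of Python. This is only an illustration, not IAP's actual implementation: the names WORD_RE and tokenize are hypothetical, the character class is written without padding spaces (a space is a separator, not a word character), and lowercasing is an assumed way to model case-insensitive matching.

```python
import re

# English word characters: letters, digits, underscore, number sign, ampersand.
WORD_RE = re.compile(r"[A-Za-z0-9_#&]+")

def tokenize(text):
    """Split text into lowercase words; a bare '&&' is skipped as an operator."""
    tokens = []
    for match in WORD_RE.finditer(text):
        token = match.group(0)
        if token == "&&":
            # '&&' by itself is a Boolean operator, not a word.
            continue
        # Assumption: lowercasing stands in for case-insensitive indexing.
        tokens.append(token.lower())
    return tokens

print(tokenize("Budget_2024 && R&D, item#7 a&&b"))
# ['budget_2024', 'r&d', 'item#7', 'a&&b']
```

The space and the comma act as separators, the standalone && is dropped, and a&&b survives as a single word because && is combined with other word characters.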

Letters and digits in different character sets

Topics include:

Letters and digits defined, page 38

Letters and digits in files, page 38

Letters and digits defined

All letters and digits are word characters. What IAP considers a letter or digit depends on the character set encoding used. For US ASCII encoding, letters are uppercase and lowercase English letters (A-Z, a-z). For ISO 8859-1 (Latin-1) encoding, used for Western European languages, accented letters are included. Most ideographic characters, such as those used in Asian languages, are also considered letters.

Whatever the language and encoding used for a particular document (file or email message), IAP maps encoded characters to the Unicode 2.0 standard. The Unicode 2.0 standard is then used to determine if a given character is a letter or a digit (or neither):

A letter is any Unicode character in one of the following Unicode categories: Ll (lowercase letter), Lu (uppercase letter), Lt (title case letter), Lm (modifier letter), or Lo (other letter).

A digit is any Unicode character whose Unicode name contains the word DIGIT, provided it is not in the range \u2000 (en quad = en space) through \u2FFF (ideographic description - future).
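As a rough illustration of these two definitions, the following Python sketch classifies single characters with the standard unicodedata module. The function names is_letter and is_digit are hypothetical, and unicodedata reflects whatever Unicode version ships with the Python interpreter rather than Unicode 2.0, so results for characters added after Unicode 2.0 may differ from IAP's behavior.

```python
import unicodedata

# Unicode general categories that the manual lists as letters.
LETTER_CATEGORIES = {"Ll", "Lu", "Lt", "Lm", "Lo"}

def is_letter(ch):
    """True if ch falls in one of the letter categories."""
    return unicodedata.category(ch) in LETTER_CATEGORIES

def is_digit(ch):
    """True if the character name contains the word DIGIT and the
    character lies outside the excluded range U+2000 through U+2FFF."""
    name = unicodedata.name(ch, "")  # "" for characters without a name
    in_excluded_range = 0x2000 <= ord(ch) <= 0x2FFF
    return "DIGIT" in name.split() and not in_excluded_range

print(is_letter("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE
print(is_letter("\u4e2d"))  # a CJK ideograph (category Lo)
print(is_digit("\u0664"))   # ARABIC-INDIC DIGIT FOUR
print(is_digit("\u2460"))   # CIRCLED DIGIT ONE, inside the excluded range
```

The first three calls print True. The last prints False: the name of U+2460 contains DIGIT, but the character lies in the excluded range.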

Letters and digits in files

Although all letters and digits are word characters, their treatment in files (including email message attachments) depends on the character encoding used. You can search for any words in email message bodies and headers, regardless of the encoding.

You can search for words in files (including email bodies, headers, attachments, and indexed documents) provided the character encoding is one of the following:

