Collation and Text Boundaries |
If you are an advanced user and interested in trying out more rules, here is a brief explanation of how they work. The Collation Rules field is a list of rules, where each rule takes one of three forms:
- <modifier>
- <relation> <text-argument>
- <reset> <text-argument>
Text Argument
A text argument is any sequence of characters, excluding special characters (that is, whitespace characters and ASCII punctuation characters). If those characters are desired, you can put them in single quotes (as in the Ampersand example).
Modifier
Currently there is only one modifier character, which specifies that all accents (secondary differences) are sorted backwards.
@ Indicates that accents are sorted backwards, as in French
Relation
The relations are the following:
< Greater, as a letter difference (primary)
; Greater, as an accent difference (secondary)
, Greater, as a case difference (tertiary)
= Equal
Reset
Currently there is only one reset character, which is used primarily for contractions and expansions but can also be used to add a modification at the end of a set of rules.
& Indicates that the next rule follows the position where the reset text-argument would be sorted.
The reset does not put the text-argument into the sorting sequence.
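The relations and the reset can be tried out together in a small program. The sketch below assumes the rule syntax described here is the one accepted by java.text.RuleBasedCollator, which documents the same grammar; the class name RelationStrengthDemo and the rule string are made up for illustration. The rules declare "V" as a case variant of "v", then use a reset to append "w" as an accent-level variant after "V".

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

final class RelationStrengthDemo {
    // Illustrative rules (not from the original page): a reset on "V" appends
    // "w" as a secondary difference and "W" as a tertiary difference from "w".
    static final String RULES = "< a , A < v , V < x & V ; w , W";

    /** Compares two strings under RULES at the given Collator strength. */
    static int compare(int strength, String left, String right) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(RULES);
            collator.setStrength(strength);
            return collator.compare(left, right);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules: " + RULES, e);
        }
    }

    public static void main(String[] args) {
        int[] strengths = {Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY};
        String[] names = {"PRIMARY", "SECONDARY", "TERTIARY"};
        String[][] pairs = {{"v", "V"}, {"v", "w"}, {"w", "x"}};
        for (int i = 0; i < strengths.length; i++) {
            for (String[] pair : pairs) {
                // Print only the sign of the comparison: -1, 0 or 1.
                int sign = Integer.signum(compare(strengths[i], pair[0], pair[1]));
                System.out.printf("%-9s %s vs %s -> %d%n", names[i], pair[0], pair[1], sign);
            }
        }
    }
}
```

Lowering the strength makes the collator ignore the finer levels, so a case difference only matters at tertiary strength and an accent-level difference only at secondary strength or above.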
This sounds more complicated than it is in practice. For example, the following are equivalent ways of expressing the same thing:
Rules            Meaning
a < b < c        Put b after a, then put c after b.
a < b & b < c    Put b after a, then put c after b.
a < c & a < b    Put c after a, then put b after a.
Notice that the order is very important, as the subsequent item goes immediately after the text-argument. The following are not equivalent:
Rules            Meaning
a < b & a < c    Put b after a, then put c after a. Same as a < c < b!
a < c & a < b    Put c after a, then put b after a. Same as a < b < c!
Either the text-argument must already be present in the sequence, or some initial substring of the text-argument must be present. (e.g. "a < b & ae < e" is valid since "a" is present in the sequence before "ae" is reset). In this latter case, "ae" is not entered and treated as a single character; instead, "e" is sorted as if it were expanded to two characters: "a" followed by an "e".
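The equivalences in the two rule tables can be checked by sorting the three letters under each rule set. This is a minimal sketch, again assuming java.text.RuleBasedCollator as the implementation; the class name RuleOrderDemo is hypothetical, and each rule string carries the leading "<" that the fragments in the tables omit.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

final class RuleOrderDemo {
    /** Sorts the letters a, b and c under the given rules and joins them with spaces. */
    static String sortedLetters(String rules) {
        String[] letters = {"c", "b", "a"};
        try {
            // RuleBasedCollator is a Comparator, so it can drive Arrays.sort directly.
            Arrays.sort(letters, new RuleBasedCollator(rules));
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules: " + rules, e);
        }
        return String.join(" ", letters);
    }

    public static void main(String[] args) {
        String[] ruleSets = {
            "< a < b < c",
            "< a < b & b < c",
            "< a < c & a < b",
            "< a < b & a < c",  // the reset on "a" places c directly after a
        };
        for (String rules : ruleSets) {
            System.out.println(rules + "  =>  " + sortedLetters(rules));
        }
    }
}
```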
This difference appears in natural languages: in traditional Spanish "ch" is treated as though it contracts to a single character (expressed as "c < ch < d"), while in traditional German "ä" (a-umlaut) is treated as though it expands to two characters (expressed as "a & ae ; ä < b").
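The contraction case can be illustrated with a reduced alphabet. The sketch below assumes java.text.RuleBasedCollator; the class name ContractionDemo is hypothetical, and the letters "h" and "o" are added to the "c < ch < d" fragment only so that a comparison can tell the two rule sets apart. The expansion case is not exercised here.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

final class ContractionDemo {
    // Illustrative rule sets: the first treats "ch" as a single letter between c and d.
    static final String WITH_CH = "< a < c < ch < d < h < o";
    static final String WITHOUT_CH = "< a < c < d < h < o";

    /** Returns -1, 0 or 1 for the comparison of left and right under the given rules. */
    static int sign(String rules, String left, String right) {
        try {
            return Integer.signum(new RuleBasedCollator(rules).compare(left, right));
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        // With the contraction, "ch" is one letter that sorts after every "c" + letter pair.
        System.out.println("with ch:    co vs ch -> " + sign(WITH_CH, "co", "ch"));
        // Without it, the second letters decide: "h" against "o".
        System.out.println("without ch: co vs ch -> " + sign(WITHOUT_CH, "co", "ch"));
        System.out.println("with ch:    ch vs d  -> " + sign(WITH_CH, "ch", "d"));
    }
}
```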
Ignorable Characters
The first rule must start with a relation (the examples we have used above are really fragments; "a < b" really should be "< a < b"). If, however, the first relation is not "<", then all the text-arguments up to the first "<" are ignorable. For example, ", - < a < b" makes "-" an ignorable character, as we saw earlier in the word "black-birds". In the samples for different languages, you see that most accents are ignorable.
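The effect of an ignorable hyphen can be sketched as follows, assuming java.text.RuleBasedCollator. The class name IgnorableDemo and the reduced alphabet are illustrative. The hyphen is written in single quotes because that parser rejects unquoted ASCII punctuation, and the comparison is made at primary strength, where ignorable characters do not contribute at all.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

final class IgnorableDemo {
    // The hyphen appears before the first "<", which makes it ignorable.
    // Only the letters needed for the example words are listed (an assumption).
    static final String RULES = ", '-' < a < b < c < d < i < k < l < r < s";

    /** Returns true if the two strings are equal at primary strength under RULES. */
    static boolean primaryEqual(String left, String right) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(RULES);
            collator.setStrength(Collator.PRIMARY);
            return collator.compare(left, right) == 0;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules: " + RULES, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(primaryEqual("black-birds", "blackbirds"));
        System.out.println(primaryEqual("black-birds", "blackbird"));
    }
}
```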
Normalization and Accents
The Collation object automatically normalizes text internally to separate accents from base characters where possible. This is done both when processing the rules and when comparing two strings. Collation also uses the Unicode canonical mapping to ensure that combining sequences are sorted properly (for more information, see The Unicode Standard, Version 2.0).
Most languages that use accents sort them in a consistent fashion, immediately after the unmodified base character. This can be achieved by making the accents ignorable, and putting them in the right order at the beginning of the collation rules. When this is done, only special cases like the German "ä" need to be handled by explicit rules.
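The separation of accents from base characters can be observed directly. The sketch below uses java.text.Normalizer and a java.text.Collator for the root locale with canonical decomposition switched on; both are assumptions about the underlying implementation and are not named on this page, and the class name NormalizationDemo is hypothetical.

```java
import java.text.Collator;
import java.text.Normalizer;
import java.util.Locale;

final class NormalizationDemo {
    /** Canonically decomposes text, separating accents from base characters. */
    static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    /** Returns true if a collator with canonical decomposition treats the strings as equal. */
    static boolean canonicallyEqual(String left, String right) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(left, right) == 0;
    }

    public static void main(String[] args) {
        String precomposed = "\u00e4";  // a-umlaut as a single code point
        String combining = "a\u0308";   // "a" followed by a combining diaeresis
        System.out.println(decompose(precomposed).equals(combining));
        System.out.println(canonicallyEqual(precomposed, combining));
    }
}
```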
Errors
The following are errors:
- A text-argument not preceded by either a reset or relation character (e.g. "a < b c < d"). This example will not have the desired effect of "a < b < c < d"; instead, it will treat "bc" as a single letter.
- A relation or reset character not followed by a text-argument (e.g. "a < , b")
- A reset where the text-argument (or an initial substring of the text-argument) is not already in the sequence (e.g. "a < b & e < f")
- A punctuation character that is not enclosed in quotes.
If you produce one of the latter three errors, a message at the bottom of the screen will tell you what the error is.
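Outside the demo screen, the same mistakes can be detected programmatically. This sketch assumes java.text.RuleBasedCollator, whose constructor throws a ParseException for malformed rules; the class name RuleValidationDemo and the helper parses are hypothetical. Every rule string starts with "<" so that only the error under test is present.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

final class RuleValidationDemo {
    /** Returns true if the rules are accepted, false if the parser rejects them. */
    static boolean parses(String rules) {
        try {
            new RuleBasedCollator(rules);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "< a < b < c",      // well-formed
            "< a < b c < d",    // missing relation between b and c
            "< a < , b",        // relation not followed by a text-argument
            "< a < b & e < f",  // reset on text that is not in the sequence
            "< a < b-c",        // unquoted punctuation
        };
        for (String rules : candidates) {
            System.out.println("\"" + rules + "\" parses: " + parses(rules));
        }
    }
}
```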
This page incorporates material or code copyrighted by Taligent, Inc. For more information on international resources, see their International Fact Sheet.