Network Working Group C. Newman
Internet-Draft Sun Microsystems
Expires: April 26, 2004 October 27, 2003
Internet Application Protocol Collation Registry
draft-newman-i18n-comparator-01.txt
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on April 26, 2004.
Copyright Notice
Copyright (C) The Internet Society (2004). All Rights Reserved.
Abstract
Many Internet application protocols include string-based lookup,
searching, or sorting operations. However, the problem space for
searching and sorting international strings is large, not fully
explored, and is outside the area of expertise for the Internet
Engineering Task Force (IETF). Rather than attempt to solve such a
large problem, this specification creates an abstraction framework so
that application protocols can precisely identify a comparison
function and the repertoire of comparison functions can be extended
in the future.
Newman Expires April 26, 2004 [Page 1]
Internet-Draft Collation Registry October 2003
Table of Contents
1. Introduction
1.1 Conventions Used in this Document
2. Collation Definition and Purpose
3. Collation Name Syntax
4. Collation Specification Requirements
5. Application Protocol Requirements
6. Initial Collations
6.1 Octet Collation
6.2 ASCII Numeric Collation
6.3 ASCII Casemap Collation
6.4 Nameprep Collation
6.5 Basic Collation
7. Use by ACAP and Sieve
8. IANA Considerations
8.1 Collation Registration Procedure
8.2 Collation Registration Template
8.3 Octet Collation Registration
8.4 ASCII Numeric Collation Registration
8.5 Legacy English Casemap Collation Registration
8.6 English Casemap Collation Registration
8.7 Nameprep Collation Registration
8.8 Basic Collation Registration
8.9 Basic Accent Sensitive Match Collation Registration
8.10 Basic Case Sensitive Match Collation Registration
8.11 Structure of Collation Registry
8.12 Example Initial Registry Summary
9. DTD for Collation Registration
10. Guidelines for Expert Reviewer
11. Security Considerations
12. Open Issues
13. Changes From -00
Normative References
Informative References
Author's Address
Intellectual Property and Copyright Statements
1. Introduction
The ACAP [11] specification introduced the concept of a comparator
(which we call collation in this document), but failed to create an
IANA registry. With the introduction of stringprep [6] and the
Unicode Collation Algorithm [8], it is now time to create that
registry and populate it with some initial values appropriate for an
international community. This specification replaces and generalizes
the definition of a comparator in ACAP and creates a collation
registry.
1.1 Conventions Used in this Document
The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY"
in this document are to be interpreted as defined in "Key words for
use in RFCs to Indicate Requirement Levels" [1].
The attribute syntax specifications use the Augmented Backus-Naur
Form (ABNF) [2] notation including the core rules defined in Appendix
A. This also inherits ABNF rules from Language Tags [5].
2. Collation Definition and Purpose
A collation is a named function which takes two arbitrary length
octet strings (encoded in UTF-8 [3] for collations which operate on
characters) as input and can be used to perform one or more of three
basic comparison operations: equality test, substring match and
ordering test.
Collations provide a multi-protocol abstraction layer for comparison
functions so the details of a particular comparison operation can be
specified by someone with appropriate expertise independent of the
application protocol that consumes that collation. This is similar
to the way a charset [14] separates the details of octet to character
mapping from a protocol specification such as MIME [9] or the way
SASL [10] separates the details of an authentication mechanism from a
protocol specification such as ACAP [11].
Here is a small diagram to help illustrate the value of this
abstraction layer:
                                              +-----------------+
                                              | Octet           |
+-------------------+                      +--| Collation Spec  |
| IMAP i18n SEARCH  |--+                   |  +-----------------+
+-------------------+  |  +-------------+  |  +-----------------+
                       +--| Collation   |--+--| A stringprep    |
+-------------------+  |  | Registry    |  |  | Collation Spec  |
| ACAP i18n SEARCH  |--+  +-------------+  |  +-----------------+
+-------------------+                      |  +-----------------+
                                           |  | locale-specific |
                                           +--| Collation Spec  |
                                              +-----------------+
Thus IMAP, ACAP and future application protocols with international
search capability simply specify how to interface to the collation
registry instead of each protocol spec having to specify all the
collations it supports.
One component of a collation is a canonicalization function which can
be pre-applied to single strings and may enhance the performance of
subsequent comparison operations. Normally, this is an
implementation detail of collations, but at times it may be useful
for an application protocol to expose collation canonicalization over
the protocol. Collation canonicalization can range from an identity
mapping (e.g., the i;octet collation) to a mapping which makes the
string unreadable to a human (e.g., the basic collation).
3. Collation Name Syntax
The collation name itself is a single US-ASCII string beginning with
a letter and made up of letters, digits, or one of the following 4
symbols: "-", ";", "=" or ".". The name MUST NOT be longer than 254
characters.
collation-char = ALPHA / DIGIT / "-" / ";" / "=" / "."
collation-name = ALPHA *253collation-char
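As an illustration only, the grammar above can be checked with a
regular expression. The following Python sketch is not part of this
specification; the function name is_collation_name is hypothetical.

```python
import re

# collation-char = ALPHA / DIGIT / "-" / ";" / "=" / "."
# collation-name = ALPHA *253collation-char   (at most 254 characters)
_COLLATION_NAME = re.compile(r"[A-Za-z][A-Za-z0-9;=.\-]{0,253}")


def is_collation_name(name: str) -> bool:
    """Return True if name matches the collation-name production."""
    return _COLLATION_NAME.fullmatch(name) is not None
```

The repetition bound of 253 after the leading letter is what enforces
the 254 character limit stated in the prose.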
The string a client uses to select a collation MAY contain a wildcard
("*") character which matches zero or more collation-chars. Wildcard
characters MUST NOT be adjacent. Clients which support disconnected
operation SHOULD NOT use wildcards to select a collation, but clients
which provide collation operations only when connected to the server
MAY use wildcards. If the wildcard string matches multiple
collations, the server SHOULD select the collation with the broadest
scope (preferably international scope), the most recent table
versions and the greatest number of supported operations. A single
wildcard character ("*") refers to the application protocol collation
behavior that would occur if no explicit negotiation were used.
When used as a protocol element for ordering, the collation name MAY
be prefixed by either "+" or "-" to explicitly specify an ordering
direction. As described in Section 4, "+" has no effect on the
ordering function, while "-" negates the result of the ordering
function. In general, collation-order is used when a client requests
a collation, and collation-sel is used when the server informs the
client of the selected collation.
collation-wild = ("*" / (ALPHA ["*"])) *(collation-char ["*"])
; MUST NOT exceed 255 characters total
collation-sel = ["+" / "-"] collation-name
collation-order = ["+" / "-"] collation-wild
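The following Python sketch shows one way a server might parse a
collation-order token and test a wildcard pattern against the names
it implements. It is an illustration under stated assumptions, not a
normative algorithm; the function names parse_collation_order and
wildcard_matches are hypothetical, and the choice among multiple
matching collations is left out.

```python
import re

_COLLATION_CHARS = r"[A-Za-z0-9;=.\-]*"


def parse_collation_order(token: str):
    """Split a collation-order token into (descending, pattern)."""
    descending = token.startswith("-")
    if token[:1] in ("+", "-"):
        token = token[1:]
    # collation-wild: starts with "*" or a letter, at most 255 characters,
    # and wildcard characters are never adjacent.
    if len(token) > 255 or "**" in token:
        raise ValueError("not a valid collation-wild string")
    if not re.fullmatch(r"[A-Za-z*][A-Za-z0-9;=.*\-]*", token):
        raise ValueError("not a valid collation-wild string")
    return descending, token


def wildcard_matches(pattern: str, name: str) -> bool:
    """Return True if the collation-wild pattern matches the name."""
    # Each "*" matches zero or more collation-chars; the rest is literal.
    literal_parts = (re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(_COLLATION_CHARS.join(literal_parts), name) is not None
```

Note that a pattern consisting of a single "*" has the special
meaning described above (the protocol's default collation behavior),
so a caller would normally handle that case before pattern matching.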
Some protocols are designed to use URIs to refer to collations rather
than simple tokens. A special section of the IANA web page is
reserved for such usage. The "collation-uri" form is used to refer
to a specific IANA registry entry for a specific named collation (the
collation registration may not actually be present if it is
experimental). The "collation-auri" form is an abstract name for an
ordering, a comparator pattern or a vendor private comparator.
collation-uri = "http://www.iana.org/assignments/collation/"
collation-name ".xml"
collation-auri = ( "http://www.iana.org/assignments/collation/"
collation-order [".xml"]) / other-uri
other-uri = absoluteURI
; excluding the IANA collation namespace.
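As a small illustration, the collation-uri form can be built from and
reduced to a collation name as follows. This Python sketch is not
normative; the function names are hypothetical, and the collation
name is placed in the URI unescaped, exactly as the ABNF above
writes it.

```python
IANA_COLLATION_BASE = "http://www.iana.org/assignments/collation/"


def collation_uri(name: str) -> str:
    """Build the collation-uri form for a collation name."""
    return IANA_COLLATION_BASE + name + ".xml"


def name_from_collation_uri(uri: str) -> str:
    """Recover the collation name from a collation-uri."""
    if not (uri.startswith(IANA_COLLATION_BASE) and uri.endswith(".xml")):
        raise ValueError("not a collation-uri")
    return uri[len(IANA_COLLATION_BASE):-len(".xml")]
```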
While this specification makes no absolute requirements on the
structure of collation names, naming consistency is important, so the
following initial guidelines are provided.
Collation names with an international audience typically begin with
"i;". Collation names intended for a particular language or locale
typically begin with a language tag [5] followed by a ";". After the
first ";" is normally the name of the general collation algorithm
followed by a series of algorithm modifications separated by the ";"
delimiter. Parameterized modifications will use "=" to delimit the
parameter from the value. The version numbers of any lookup tables
used by the algorithm SHOULD be present as parameterized
modifications.
Collation names of the form *;vnd-domain.com;* are reserved for
vendor-specific collations created by the owner of the domain name
following the "vnd-" prefix. Registration of such collations (or the
name space as a whole) with intended use of "Vendor" is encouraged
when a public specification or open-source implementation is
available, but is not required.
4. Collation Specification Requirements
A collation specification MUST state which of the three basic
functions are supported (equality, substring, ordering) and how to
perform each of the supported functions on any two input
octet-strings including empty strings. Given a collation with a
specific name, and any two fixed input strings, the result MUST be
the same. The collation specification MUST state whether the
collation operates on raw octets or on characters (in which case the
UTF-8 charset is presumed). Collations MUST be transitive.
A collation specification MUST describe the internal canonicalization
algorithm. This algorithm can be applied to individual strings and
the result strings can be stored to potentially optimize future
comparison operations. A collation MAY specify that the
canonicalization algorithm is the identity function. The output of
the canonicalization algorithm MAY have no meaning to a human.
Collations which use more than one customizable lookup table in a
documented format MUST assign numbers to the tables they use. This
permits an application protocol command to access the tables used by
a server collation.
o The equality function always returns "match" or "no-match" when
supplied valid input and MAY return "error" if the input strings
are not valid UTF-8 strings or violate other collation
constraints.
o The substring matching function determines if the first string is
a substring of the second string. A collation which supports
substring matching will automatically support the two special
cases of substring matching: prefix and suffix matching if those
special cases are supported by the application protocol. It
returns "match" or "no-match" when supplied valid input and
returns "error" when supplied invalid input.
o The ordering function determines how two octet strings are
ordered. It returns "-1" if the first string is listed before the
second string according to the collation, "+1" if the second
string is listed before the first string, and "0" if the two
strings are equal. If the order of the two strings is reversed,
the result of the ordering function of the collation MUST be
negated. In general, collations SHOULD NOT return "0" unless the
two octet sequences are identical.
Since ordering is normally used to sort a list of items, "error"
is not a useful return value from the ordering function. Strings
with errors that prevent the sorting algorithm from functioning
correctly should sort to the end of the list. Thus if the first
string is invalid UTF-8 while the second string is valid, the
result will be "+1". If the second string is invalid UTF-8 while
the first string is valid, the result will be "-1". If the
collation is character-based, and both strings are invalid UTF-8,
the result SHOULD match the result from the "i;octet" collation.
When the collation is used with a "+" prefix, the behavior is the
same as when used with no prefix. When the collation is used with
a "-" prefix, results which would be "+1" are instead "-1" and
results which would be "-1" are instead "+1".
Unless otherwise specified by the collation or application protocol,
a NULL string (as opposed to an empty string) is equal only to
another NULL string, a NULL string is not a substring of any other
string, and a NULL string sorts to a position after all non-NULL
strings, but before strings which generate errors.
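The ordering rules above (invalid strings last, NULL strings just
before them, and negation for the "-" prefix) can be sketched as a
wrapper around a character-based ordering function. The Python code
below is illustrative only. The names order, base_order and prefix
are hypothetical, a NULL string is represented by None, and treating
two NULL strings as ordering result "0" is an assumption drawn from
the equality rule.

```python
from typing import Callable, Optional


def octet_order(a: bytes, b: bytes) -> int:
    """Order two octet strings by unsigned octet value, as "i;octet" does."""
    return (a > b) - (a < b)


def _rank(value: Optional[bytes]) -> int:
    # 0 = valid UTF-8 string, 1 = NULL string, 2 = invalid UTF-8 (sorts last).
    if value is None:
        return 1
    try:
        value.decode("utf-8")
    except UnicodeDecodeError:
        return 2
    return 0


def order(a: Optional[bytes], b: Optional[bytes],
          base_order: Callable[[bytes, bytes], int] = octet_order,
          prefix: str = "+") -> int:
    """Apply the error, NULL and prefix rules around base_order."""
    rank_a, rank_b = _rank(a), _rank(b)
    if rank_a != rank_b:
        result = -1 if rank_a < rank_b else 1
    elif rank_a == 1:
        result = 0                    # assumption: two NULL strings tie
    elif rank_a == 2:
        result = octet_order(a, b)    # both invalid: use "i;octet"
    else:
        result = base_order(a, b)
    return -result if prefix == "-" else result
```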
Some application protocols will permit the use of multi-value
attributes with a collation. This paragraph describes the rules that
apply unless otherwise specified by the collation or application
protocol. The equality and substring collation algorithms will be
iterated over each pair of single values from the two inputs. If any
combination produces an error, the result is an error. Otherwise, if
any combination produces a "match", the result is a match. Otherwise
the result is "no-match". For the ordering function, the smallest
ordinal octet string from the first set of values is compared to the
smallest ordinal octet string from the second set of values.
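The default multi-value rules can be sketched as follows. This Python
code is illustrative only; the function names are hypothetical, both
inputs are assumed to be non-empty lists of octet strings, and
"smallest" is interpreted here as smallest under the collation's own
ordering function, which is an assumption.

```python
from functools import cmp_to_key
from itertools import product
from typing import Callable, Sequence


def multi_value_match(first: Sequence[bytes], second: Sequence[bytes],
                      match: Callable[[bytes, bytes], str]) -> str:
    """Combine equality or substring results over every pair of values."""
    results = [match(a, b) for a, b in product(first, second)]
    if "error" in results:
        return "error"        # any error makes the whole result an error
    if "match" in results:
        return "match"
    return "no-match"


def multi_value_order(first: Sequence[bytes], second: Sequence[bytes],
                      order: Callable[[bytes, bytes], int]) -> int:
    """Compare the smallest value of each set using the ordering function."""
    key = cmp_to_key(order)
    return order(min(first, key=key), min(second, key=key))


def octet_equal(a: bytes, b: bytes) -> str:
    """Stand-in equality function with "i;octet" behavior."""
    return "match" if a == b else "no-match"


def octet_order(a: bytes, b: bytes) -> int:
    """Stand-in ordering function with "i;octet" behavior."""
    return (a > b) - (a < b)
```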
Application protocols MAY return position information for substring
matches. If this is done, the position information MUST include both
the starting offset and the ending offset in the string. This is
important because more sophisticated collations can match strings of
unequal length (for example, a pre-composed accented character will
match a decomposed accented character).
Collation specifications intended for common use are expected to
reference standards from standards bodies with significant experience
dealing with the details of international character sets.
5. Application Protocol Requirements
An application protocol which offers searching, substring matching
and/or sorting and permits the use of characters outside the US-ASCII
charset needs to consider the following requirements and issues:
The protocol MUST provide a mechanism for the client to select the
collation to use with equality matching, substring matching and
ordering.
The protocol MUST specify how comparisons behave in the absence of an
explicit collation negotiation or when a collation negotiation of "*"
is used. The protocol MAY specify that the default collation used in
such circumstances is sensitive to server configuration.
The protocol SHOULD provide a way to list available collations
matching a given wildcard pattern or patterns.
If the protocol provides positional information for the results of a
substring match, that positional information MUST fully specify the
substring in the result that matches independent of the length of the
search string. For example, returning both the starting and ending
offset of the match would suffice, as would the starting offset and a
length. Returning just the starting offset is not acceptable. This
rule is necessary because advanced collations can treat strings of
different lengths as equal (for example, pre-composed and decomposed
accented characters).
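To see why a starting offset alone is insufficient, consider a
matching rule that treats canonically equivalent sequences as equal.
The Python sketch below uses Unicode NFD normalization from the
standard library as a stand-in for such a collation (it is not the
basic collation, and it uses the Unicode version shipped with the
interpreter rather than 3.2). The function name find_equivalent is
hypothetical, offsets are counted in code points, and the quadratic
search is chosen for clarity rather than speed.

```python
import unicodedata
from typing import Optional, Tuple


def find_equivalent(needle: str, haystack: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first slice canonically equal to needle."""
    target = unicodedata.normalize("NFD", needle)
    for start in range(len(haystack) + 1):
        for end in range(start, len(haystack) + 1):
            candidate = unicodedata.normalize("NFD", haystack[start:end])
            if candidate == target:
                return start, end
    return None
```

Searching for the four code point string "caf\u00e9" (pre-composed)
in "un cafe\u0301 noir" (decomposed) yields the offsets (3, 8), a
match that is five code points long. A client that assumed the match
had the same length as the search string would highlight the wrong
span.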
If the protocol permits the use of collations on stored character
data which is not encoded with the UTF-8 charset, then the protocol
specification has to describe relevant issues of the conversion.
Details to consider include how to handle unknown charsets, any
charsets which are mandatory-to-implement, any issues with byte-order
that might apply, and any transfer encodings which need to be
supported.
If the protocol provides a canonicalization function for strings,
then use of collations MAY be appropriate for that function.
If the protocol supports disconnected clients, then a mechanism for
the client to precisely replicate the server's collation algorithm is
likely desirable. Thus the protocol MAY wish to provide a command to
fetch lookup tables used by charset conversions and collations.
The protocol specification should consider assigning protocol error
codes for the following circumstances:
o The client requests the use of a collation by name or pattern, but
no implemented collation matches that pattern.
o The client attempts to use a collation for a function that is not
supported by that collation. For example, attempting to use the
"i;ascii-numeric" collation for a substring matching function.
o The client uses an equality or substring matching collation and
the result is an error. It may be appropriate to distinguish
between the two input strings, particularly when one is supplied
by the client and one is stored by the server. It might also be
appropriate to distinguish the specific case of an invalid UTF-8
string.
If the protocol permits the use of a collation with data structures
beyond those described in this specification (octet strings, NULL
string, array of octet strings), the protocol MUST describe the
default behavior for a collation with that data structure.
6. Initial Collations
This section describes an initial set of collations for the collation
registry.
6.1 Octet Collation
The "i;octet" collation is a simple and fast collation intended for
use on binary octet strings rather than on character data. It never
returns an "error" result. It provides equality, substring and
ordering functions. The ordering algorithm is as follows:
1. If both strings are the empty string, return the result "0".
2. If the first string is empty and the second is not, return the
result "-1".
3. If the second string is empty and the first is not, return the
result "+1".
4. If both strings begin with the same octet value, remove the first
octet from both strings and repeat this algorithm from step 1.
5. If the unsigned value (0 to 255) of the first octet of the first
string is less than the unsigned value of the first octet of the
second string, then return "-1".
6. If this step is reached, return "+1".
This algorithm is roughly equivalent to the C library function memcmp
with appropriate length checks added.
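For illustration, the six steps above translate directly into the
following Python sketch, which uses an index instead of removing
octets from the strings. The function name i_octet_order is
hypothetical and the code is not normative.

```python
def i_octet_order(first: bytes, second: bytes) -> int:
    """Return -1, 0 or +1 following the "i;octet" ordering steps."""
    i = 0
    while True:
        first_empty = i >= len(first)
        second_empty = i >= len(second)
        if first_empty and second_empty:
            return 0                      # step 1
        if first_empty:
            return -1                     # step 2
        if second_empty:
            return 1                      # step 3
        if first[i] == second[i]:
            i += 1                        # step 4: drop the equal octet
            continue
        # Indexing bytes yields unsigned values 0 to 255 (steps 5 and 6).
        return -1 if first[i] < second[i] else 1
```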
The equality function returns "match" if the ordering algorithm would
return "0". Otherwise the equality function returns "no-match".
The substring function returns "match" if the first string is the
empty string, or if there exists a substring of the second string of
length equal to the length of the first string which would result in
a "match" result from the equality function. Otherwise the substring
function returns "no-match".
The associated canonicalization algorithm is the identity function.
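The equality and substring functions of this collation map onto
built-in operations in many languages. A minimal Python sketch, with
hypothetical function names, follows.

```python
def i_octet_equal(first: bytes, second: bytes) -> str:
    """Equality function of "i;octet": exact octet-for-octet comparison."""
    return "match" if first == second else "no-match"


def i_octet_substring(first: bytes, second: bytes) -> str:
    """Substring function of "i;octet": is first contained in second?"""
    # The "in" operator on bytes searches for a contiguous run of octets
    # and treats the empty string as contained in every string.
    return "match" if first in second else "no-match"
```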
6.2 ASCII Numeric Collation
The "i;ascii-numeric" collation is a simple collation intended for
use with arbitrary sized decimal numbers stored as octet strings of
US-ASCII digits (0x30 to 0x39). It supports equality and ordering,
but does not support the substring function. The algorithm is as
follows:
1. If neither string begins with a digit, return "error" if
matching, or the result of the "i;octet" collation for ordering.
2. If the first string begins with a digit and the second string
does not, return "error" if matching and "-1" for ordering.
3. If the second string begins with a digit and the first string
does not, return "error" if matching and "+1" for ordering.
4. Let "n" be the number of digits at the beginning of the first
string, and "m" be the number of digits at the beginning of the
second string.
5. If n is equal to m, return the result of the "i;octet" collation.
6. If n is greater than m, prepend a string of "n - m" zeros to the
second string and return the result of the "i;octet" collation.
7. If m is greater than n, prepend a string of "m - n" zeros to the
first string and return the result of the "i;octet" collation.
The associated canonicalization algorithm is to truncate the input
string at the first non-digit character.
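The steps above can be sketched in Python as follows. This is an
illustration, not a normative implementation. The function names are
hypothetical, and the sketch assumes that steps 5 to 7 compare the
canonicalized strings (the leading digits only), so that any text
after the digits does not influence the result.

```python
def _digit_prefix(s: bytes) -> bytes:
    """Canonicalize: keep the leading run of US-ASCII digits."""
    n = 0
    while n < len(s) and 0x30 <= s[n] <= 0x39:
        n += 1
    return s[:n]


def _octet_order(a: bytes, b: bytes) -> int:
    return (a > b) - (a < b)


def ascii_numeric_order(first: bytes, second: bytes) -> int:
    a, b = _digit_prefix(first), _digit_prefix(second)
    if not a and not b:
        return _octet_order(first, second)    # step 1
    if not b:
        return -1                             # step 2
    if not a:
        return 1                              # step 3
    width = max(len(a), len(b))               # steps 4 to 7: pad with zeros
    return _octet_order(a.rjust(width, b"0"), b.rjust(width, b"0"))


def ascii_numeric_equal(first: bytes, second: bytes) -> str:
    if not _digit_prefix(first) or not _digit_prefix(second):
        return "error"                        # steps 1 to 3 when matching
    same = ascii_numeric_order(first, second) == 0
    return "match" if same else "no-match"
```

Padding the shorter digit string with leading zeros is what makes
"10" order after "9" and "007" compare equal to "7", while keeping
the comparison independent of any fixed integer width.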
6.3 ASCII Casemap Collation
The "en;ascii-casemap" collation is a simple collation intended for
use with English language text in pure US-ASCII. It provides
equality, substring and ordering functions. The algorithm first
applies a canonicalization algorithm to both input strings which
subtracts 32 (0x20) from all octet values between 97 (0x61) and 122
(0x7A) inclusive. The result of the collation is then the same as
the result of the "i;octet" collation for the canonicalized strings.
Care should be taken when using OS-supplied functions to implement
this collation as this is not locale sensitive, but functions such as
strcasecmp and toupper can be locale sensitive.
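One locale-independent way to implement this collation is a fixed
256-entry translation table, as in the Python sketch below. The
function names are hypothetical and the code is illustrative only.

```python
# Map 0x61..0x7A ("a".."z") down by 32; leave every other octet unchanged.
_CASEMAP = bytes(c - 32 if 0x61 <= c <= 0x7A else c for c in range(256))


def ascii_casemap_canon(s: bytes) -> bytes:
    """Canonicalization step of the ASCII casemap collation."""
    return s.translate(_CASEMAP)


def ascii_casemap_order(first: bytes, second: bytes) -> int:
    a, b = ascii_casemap_canon(first), ascii_casemap_canon(second)
    return (a > b) - (a < b)          # "i;octet" ordering of canonical forms


def ascii_casemap_equal(first: bytes, second: bytes) -> str:
    same = ascii_casemap_canon(first) == ascii_casemap_canon(second)
    return "match" if same else "no-match"


def ascii_casemap_substring(first: bytes, second: bytes) -> str:
    found = ascii_casemap_canon(first) in ascii_casemap_canon(second)
    return "match" if found else "no-match"
```

Because the table touches only octets 0x61 to 0x7A, the octets of
multi-octet UTF-8 sequences (all 0x80 or above) pass through
unchanged.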
For historical reasons, in the context of ACAP and Sieve, the name
"i;ascii-casemap" is a synonym for this collation.
6.4 Nameprep Collation
The "i;nameprep;v=1;uv=3.2" collation is an implementation of the
nameprep [7] specification based on normalization tables from Unicode
version 3.2. This collation applies the nameprep canonicalization
function to both input strings and then returns the result of the
i;octet collation on the canonicalized strings. While this collation
offers all three functions, the ordering function it provides is
inadequate for use by the majority of the world.
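As an illustration, CPython ships an undocumented helper,
encodings.idna.nameprep, which applies the nameprep profile using
Unicode 3.2 tables. The sketch below relies on that helper, which is
an assumption about the implementation environment rather than
anything required by this specification. The wrapper names are
hypothetical, and invalid or prohibited input surfaces as a
UnicodeError exception rather than being mapped to an "error" result.

```python
from encodings.idna import nameprep  # undocumented CPython helper (assumption)


def nameprep_canon(data: bytes) -> bytes:
    """Canonicalize a UTF-8 octet string with the nameprep profile."""
    return nameprep(data.decode("utf-8")).encode("utf-8")


def nameprep_order(first: bytes, second: bytes) -> int:
    """Order two strings by "i;octet" comparison of their canonical forms."""
    a, b = nameprep_canon(first), nameprep_canon(second)
    return (a > b) - (a < b)


def nameprep_equal(first: bytes, second: bytes) -> str:
    same = nameprep_canon(first) == nameprep_canon(second)
    return "match" if same else "no-match"
```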
Version number 1 is applied to nameprep as specified in RFC 3491. If
the nameprep specification is revised without any changes that would
produce different results when given the same pair of input octet
strings, then the version number will remain unchanged.
The table numbers for tables used by nameprep are as follows:
+--------------+-----------------------+
| Table Number | Table Name |
+--------------+-----------------------+
| 1 | UnicodeData-3.2.0.txt |
| 2 | Table B.1 |
| 3 | Table B.2 |
| 4 | Table C.1.2 |
| 5 | Table C.2.2 |
| 6 | Table C.3 |
| 7 | Table C.4 |
| 8 | Table C.5 |
| 9 | Table C.6 |
| 10 | Table C.7 |
| 11 | Table C.8 |
| 12 | Table C.9 |
+--------------+-----------------------+
6.5 Basic Collation
The basic collation is intended to provide tolerable results for a
number of languages for all three functions (equality, substring and
ordering) so it is suitable as a mandatory-to-implement collation for
protocols which include ordering support. The ordering function of
the basic collation is the Unicode Collation Algorithm [8] version 9
(UCAv9).
The equality and substring functions are created as described in
UCAv9 section 8. While that section is informative to UCAv9, it is
normative to this collation specification.
This collation is based on Unicode version 3.2, with the following
tables relevant:
1. For the normalization step, UnicodeData-3.2.0.txt [16] is used.
Column 5 is used to determine the canonical decomposition, while
column 3 contains the canonical combining classes necessary to
attain canonical order.
2. The table of characters which require a logical order exception
is a subset of the table in PropList-3.2.0.txt [17] and is
included here:
0E40..0E44 ; Logical_Order_Exception
# Lo [5] THAI CHARACTER SARA E..THAI CHARACTER SARA AI MAIMALAI
0EC0..0EC4 ; Logical_Order_Exception
# Lo [5] LAO VOWEL SIGN E..LAO VOWEL SIGN AI
# Total code points: 10
3. The table used to translate normalized code points to a sort key
is allkeys-3.1.1.txt [18].
UCAv9 includes a number of configurable parameters and steps labelled
as potentially optional. The following list summarizes the defaults
used by this collation:
o The logical order exception step is mandatory by default to
support the largest number of languages.
o Steps 2.1.1 to 2.1.3 are mandatory as the repertoire of the basic
collation is intended to be large.
o The second level in the sort key is evaluated forwards by default.
o The variable weighting uses the "non-ignorable" option by default.
o The semi-stable option is not used by default.
o Support for exactly three levels of collation is the default
behavior.
o No preprocessing step is used by the basic collation prior to
applying the UCAv9 algorithm. Note that an application protocol
specification MAY require pre-processing prior to the use of any
collations.
o The equality and substring algorithms exclude differences at levels
2 and 3 by default (thus they are case-insensitive and ignore
accentual distinctions).
o The equality and substring algorithms use the "Whole Characters
Only" feature described in UCAv9 section 8 by default.
The exact collation name with these defaults is
"i;basic;uca=3.1.1;uv=3.2". When a specification states that the
basic collation is mandatory-to-implement, only this specific name is
mandatory-to-implement.
In order to allow modification of the optional behaviors, the
following ABNF is used for variations of the basic collation:
basic-collation = ("i" / Language-Tag) ";basic;uca=3.1.1;uv=3.2"
[";match=accent" / ";match=case"]
[";tailor=" 1*collation-char ]
If multiple modifiers appear, they MUST appear in the order described
above. The modifiers have the following meanings:
match=accent Both the first and second levels of the sort keys are
considered relevant to the equality and substring
operations (rather than the default of first level
only). This makes the matching functions sensitive to
accentual distinctions.
match=case The first three levels of sort keys are considered
relevant to the equality and substring operations.
This makes the matching functions sensitive to both
case and accentual distinctions.
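The name grammar and the meaning of the match modifier can be
sketched as a small parser that reports how many sort key levels the
equality and substring operations consider. This Python code is
illustrative only; the function name parse_basic_collation and the
returned dictionary keys are hypothetical, and the language tag is
checked only loosely (letters, digits and hyphens).

```python
import re

_BASIC = re.compile(
    r"(?P<scope>[A-Za-z0-9\-]+);basic;uca=3\.1\.1;uv=3\.2"
    r"(?:;match=(?P<match>accent|case))?"
    r"(?:;tailor=(?P<tailor>[A-Za-z0-9;=.\-]+))?"
)

# Number of sort key levels used for equality and substring operations.
_MATCH_LEVELS = {None: 1, "accent": 2, "case": 3}


def parse_basic_collation(name: str) -> dict:
    """Parse a basic-collation name into its scope and modifiers."""
    m = _BASIC.fullmatch(name)
    if m is None:
        raise ValueError("not a basic collation name")
    return {
        "scope": m.group("scope"),
        "match_levels": _MATCH_LEVELS[m.group("match")],
        "tailor": m.group("tailor"),
    }
```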
The default weighting option is "non-ignorable". The "semi-stable"
sort key option is not used by default.
The canonicalization algorithm associated with this collation is the
output of step 3 of the UCAv9 algorithm (described in section 4.3 of
the UCA specification). This canonicalization is not suitable for
human consumption.
Finally, the UCAv9 algorithm permits the "allkeys" table to be
tailored to a language. People who make quality tailorings are
encouraged to register those tailorings using the collation registry.
Tailoring names beginning with "x" are reserved for experimental use,
are treated as "Limited use" and MUST NOT match wildcards if any
registered collation is available that does match.
7. Use by ACAP and Sieve
Both ACAP [11] and Sieve [15] are standards track specifications
which used collations prior to the creation of this specification and
registry. Those standards do not meet all the application protocol
requirements described in Section 5. For backwards compatibility,
those protocols use the name "i;ascii-casemap" instead of
"en;ascii-casemap".
8. IANA Considerations
8.1 Collation Registration Procedure
IANA will create a mailing list collation@iana.org which can be used
for public discussion of collation proposals prior to registration.
Use of the mailing list is encouraged but not required. The actual
registration procedure will not begin until the completed
registration template is sent to iana@iana.org. The IESG will
appoint a designated expert who will monitor the collation@iana.org
mailing list and review registrations forwarded from IANA. The
designated expert is expected to tell IANA and the submitter of the
registration within two weeks whether the registration is approved,
approved with minor changes, or rejected with cause. When a
registration is rejected with cause, it can be re-submitted if the
concerns listed in the cause are addressed. Decisions made by the
designated expert can be appealed to the IESG and subsequently follow
the normal appeals procedure for IESG decisions.
Collation registrations in a standards track, BCP or IESG-approved
experimental RFC are owned by the IESG and changes to the
registration follow normal procedures for updating such documents.
Collation registrations in other RFCs are owned by the RFC author(s).
Other collation registrations are owned by the individual(s) listed
in the contact field of the registration and IANA will preserve this
information. Changes to a registration MUST be approved by the
owner. In the event the owner can't be contacted for a period of one
month and a change is deemed necessary, the IESG MAY re-assign
ownership to an appropriate party.
8.2 Collation Registration Template
Registration of a collation is done by sending a well-formed XML
document that validates with collationreg.dtd (Section 9). The
registration MUST include a collation element that MAY include an
"rfc=" attribute if the specification is in an RFC and MUST include a
scope attribute of "i18n", "local" or "other" and an intendedUse
attribute of "common", "limited", "vendor", or "deprecated".
The collation element contains the other elements in the
registration. The mandatory name element gives the precise name of
the collation. The mandatory title element gives the title of the
collation. The mandatory functions element lists which of the three
functions the collation provides. The mandatory specification
element describes where to find the specification, and MAY have a URI
attribute. The submittor element provides an RFC 2822 email address
for the person who submitted the registration. It is optional if the
owner element contains an email address. The mandatory owner element
contains either the four letters "IETF" or an email address of the
owner of the registration. The optional version element is included
when the registration is likely to be revised or has been revised in
such a way that the results change for certain input strings. The
optional UnicodeVersion element indicates the version number of the
UnicodeData file on which the collation is based. The optional
UCAVersion element specifies the version of the Unicode Collation
Algorithm on which the collation is based. The optional
UCAMatchLevel element specifies the number of Unicode Collation
Algorithm sort key levels used for the equality and substring
operations.
Here is a template for the registration:
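   <collation scope="i18n" intendedUse="common">
     <name>collation name</name>
     <title>technical title for collation</title>
     <functions>equality order substring</functions>
     <specification>specification reference</specification>
     <owner>email address of owner or IETF</owner>
     <submittor>email address of submittor</submittor>
     <version>1</version>
     <UnicodeVersion>3.2</UnicodeVersion>
     <UCAVersion>3.1.1</UCAVersion>
   </collation>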
Be aware that future revisions of this specification may add
additional function types, as well as additional XML attributes and
values. Any system which automatically parses these XML documents
MUST take this into account to preserve future compatibility.
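The following non-normative Python sketch illustrates one way a
parser could honor this compatibility requirement. It checks only
what Section 8.2 states is mandatory and carries unknown elements and
function types through instead of rejecting them; unknown attributes
are simply ignored. The function name read_registration and the
layout of the returned dictionary are assumptions made for this
illustration, not part of this specification, and the sketch does not
perform DTD validation.

```python
import xml.etree.ElementTree as ET

# Mandatory child elements and attributes, as listed in Section 8.2.
MANDATORY_ELEMENTS = ("name", "title", "functions", "specification", "owner")
MANDATORY_ATTRIBUTES = ("scope", "intendedUse")


def read_registration(xml_text):
    """Parse a registration, tolerating elements and values not known today.

    Hypothetical helper: the name and return layout are illustrative only.
    """
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    if root.tag != "collation":
        raise ValueError("root element must be 'collation'")

    missing = [a for a in MANDATORY_ATTRIBUTES if root.get(a) is None]
    missing += [e for e in MANDATORY_ELEMENTS if root.find(e) is None]
    if missing:
        raise ValueError("missing: " + ", ".join(missing))

    # Unknown child elements are kept as plain strings rather than rejected.
    fields = {child.tag: (child.text or "").strip() for child in root}

    # The submittor is optional only when the owner is an email address.
    if fields["owner"] == "IETF" and "submittor" not in fields:
        raise ValueError("submittor is required when the owner is IETF")

    # Function types are kept as given so future additions pass through.
    fields["functions"] = fields["functions"].split()
    for attribute in MANDATORY_ATTRIBUTES:
        fields[attribute] = root.get(attribute)
    return fields
```

A caller that needs strict checking of scope or intendedUse values
can compare the returned strings against the values it knows, and
decide locally how to treat any it does not recognize.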
8.3 Octet Collation Registration
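   <collation scope="other" intendedUse="common">
     <name>i;octet</name>
     <title>Octet</title>
     <functions>equality order substring</functions>
     <specification>RFC XXXX</specification>
     <owner>IETF</owner>
     <submittor>chris.newman@sun.com</submittor>
   </collation>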
8.4 ASCII Numeric Collation Registration
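   <collation scope="other" intendedUse="limited">
     <name>i;ascii-numeric</name>
     <title>ASCII Numeric</title>
     <functions>equality order</functions>
     <specification>RFC XXXX</specification>
     <owner>IETF</owner>
     <submittor>chris.newman@sun.com</submittor>
   </collation>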
8.5 Legacy English Casemap Collation Registration
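   <collation scope="local" intendedUse="deprecated">
     <name>i;ascii-casemap</name>
     <title>Legacy English Casemap</title>
     <functions>equality order substring</functions>
     <specification>RFC XXXX</specification>
     <owner>IETF</owner>
     <submittor>chris.newman@sun.com</submittor>
   </collation>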
8.6 English Casemap Collation Registration
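   <collation scope="local" intendedUse="common">
     <name>en;ascii-casemap</name>
     <title>English Casemap</title>
     <functions>equality order substring</functions>
     <specification>RFC XXXX</specification>
     <owner>IETF</owner>
     <submittor>chris.newman@sun.com</submittor>
   </collation>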
8.7 Nameprep Collation Registration
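   <collation scope="i18n" intendedUse="common">
     <name>i;nameprep;v=1;uv=3.2</name>
     <title>Nameprep</title>
     <functions>equality order substring</functions>
     <specification>RFC XXXX</specification>
     <owner>IETF</owner>
     <submittor>chris.newman@sun.com</submittor>
     <version>1</version>
     <UnicodeVersion>3.2</UnicodeVersion>
   </collation>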
8.8 Basic Collation Registration
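   <collation scope="i18n" intendedUse="common">
     <name>i;basic;uca=3.1.1;uv=3.2</name>
     <title>Basic</title>
     <functions>equality order substring</functions>
     <specification>RFC XXXX</specification>
     <owner>IETF</owner>
     <submittor>chris.newman@sun.com</submittor>
     <UnicodeVersion>3.2</UnicodeVersion>
     <UCAVersion>3.1.1</UCAVersion>
     <UCAMatchLevel>1</UCAMatchLevel>
   </collation>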
8.9 Basic Accent Sensitive Match Collation Registration
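   <collation scope="i18n" intendedUse="common">
     <name>i;basic;uca=3.1.1;uv=3.2;match=accent</name>
     <title>Basic Accent Sensitive Match</title>
     <functions>equality order substring</functions>
     <specification>RFC XXXX</specification>
     <owner>IETF</owner>
     <submittor>chris.newman@sun.com</submittor>
     <UnicodeVersion>3.2</UnicodeVersion>
     <UCAVersion>3.1.1</UCAVersion>
     <UCAMatchLevel>2</UCAMatchLevel>
   </collation>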
8.10 Basic Case Sensitive Match Collation Registration
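   <collation scope="i18n" intendedUse="common">
     <name>i;basic;uca=3.1.1;uv=3.2;match=case</name>
     <title>Basic Case Sensitive Match</title>
     <functions>equality order substring</functions>
     <specification>RFC XXXX</specification>
     <owner>IETF</owner>
     <submittor>chris.newman@sun.com</submittor>
     <UnicodeVersion>3.2</UnicodeVersion>
     <UCAVersion>3.1.1</UCAVersion>
     <UCAMatchLevel>3</UCAMatchLevel>
   </collation>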
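The three basic registrations differ only in the number of sort key
levels used for matching (1, 2 and 3). The following non-normative
Python sketch shows the general idea of comparing only the first N
levels of a sort key. The keys in TOY_KEYS are invented for this
illustration and are not real Unicode Collation Algorithm weights;
the function name equal_at_level is likewise hypothetical.

```python
# Toy sort keys: one tuple of weights per level (primary, secondary,
# tertiary). The weights are made up for illustration and are NOT taken
# from any real Unicode Collation Algorithm table.
TOY_KEYS = {
    "cafe": ((1, 2, 3, 4), (0, 0, 0, 0), (0, 0, 0, 0)),
    "café": ((1, 2, 3, 4), (0, 0, 0, 1), (0, 0, 0, 0)),
    "Cafe": ((1, 2, 3, 4), (0, 0, 0, 0), (1, 0, 0, 0)),
}


def equal_at_level(a, b, match_level):
    """Compare two strings using only the first match_level key levels."""
    return TOY_KEYS[a][:match_level] == TOY_KEYS[b][:match_level]
```

With these toy keys, all three strings match at level 1, "cafe" and
"café" stop matching at level 2, and "cafe" and "Cafe" stop matching
at level 3.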
8.11 Structure of Collation Registry
Once the registration is approved, IANA will store each XML
registration document at a URL of the form
http://www.iana.org/assignments/collation/collation-name.xml where
collation-name is the contents of the name element in the
registration. Both the submittor and the designated expert are
responsible for verifying that the XML is well-formed and complies
with the DTD. In the future, it is hoped that IANA will take over
XML verification responsibility from the designated expert.
IANA will also maintain a text summary of the registry under the name
http://www.iana.org/assignments/collation/summary.txt. This summary
is divided into four sections. The first section is for collations
intended for common use. This section is intended for collation
registrations published in IESG approved RFCs or for locally scoped
collations from the primary standards body for that locale. The
designated expert is encouraged to reject collation registrations
with an intended use of "common" if the expert believes the intended
use should be "limited", as it is desirable to keep the set of
"common" registrations small and of high quality. The second section
is reserved for limited use collations. The third section is reserved
for registered vendor-specific collations. The final section is
reserved for deprecated collations.
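As a non-normative illustration, the two registry locations described
in this section could be derived as in the Python sketch below. The
helper names are hypothetical, and using the collation name verbatim
(with no percent-encoding of characters such as ";" or "=") is an
assumption that this document does not confirm.

```python
# Base location quoted in Section 8.11; the helper names are hypothetical.
REGISTRY_BASE = "http://www.iana.org/assignments/collation/"


def registration_url(collation_name):
    """Return the URL of the XML registration for a collation name."""
    # Assumption: the name is used verbatim, with no percent-encoding.
    return REGISTRY_BASE + collation_name + ".xml"


def summary_url():
    """Return the URL of the text summary of the registry."""
    return REGISTRY_BASE + "summary.txt"
```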
8.12 Example Initial Registry Summary
The following is an example of how IANA might structure the initial
registry summary.txt file:
Collation                                Functions  Scope  Reference
---------                                ---------  -----  ---------
Common Use Collations:
  i;octet                                e, o, s    Other  [RFC XXXX]
  i;nameprep;v=1;uv=3.2                  e, o, s    i18n   [RFC XXXX]
  i;basic;uca=3.1.1;uv=3.2               e, o, s    i18n   [RFC XXXX]
  i;basic;uca=3.1.1;uv=3.2;match=accent  e, o, s    i18n   [RFC XXXX]
  i;basic;uca=3.1.1;uv=3.2;match=case    e, o, s    i18n   [RFC XXXX]
  en;ascii-casemap                       e, o, s    Local  [RFC XXXX]
Limited Use Collations:
  i;ascii-numeric                        e, o       Other  [RFC XXXX]
Vendor Collations:
Deprecated Collations:
  i;ascii-casemap                        e, o, s    Local  [RFC XXXX]

References
----------
[RFC XXXX]  Newman, C., "Internet Application Protocol Collation
            Registry", RFC XXXX, Sun Microsystems, October 2003.
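The grouping used in the example summary can be sketched in Python as
follows. This is a non-normative illustration: the record layout (one
dictionary per registration), the function name render_summary and
the column widths are assumptions, and IANA is free to produce the
summary in any other way.

```python
# Section order and headings follow the example summary; the record layout
# (a dict per registration) is an assumption made for this sketch.
SECTIONS = (
    ("common", "Common Use Collations:"),
    ("limited", "Limited Use Collations:"),
    ("vendor", "Vendor Collations:"),
    ("deprecated", "Deprecated Collations:"),
)


def render_summary(registrations):
    """Group registrations by intended use and return the summary text."""
    lines = []
    for intended_use, heading in SECTIONS:
        lines.append(heading)
        for reg in registrations:
            if reg["intendedUse"] != intended_use:
                continue
            # "equality order substring" is abbreviated to "e, o, s".
            abbreviated = ", ".join(name[0] for name in reg["functions"])
            lines.append("  %-38s %-10s %-6s %s" % (
                reg["name"], abbreviated, reg["scope"], reg["reference"]))
    return "\n".join(lines)
```

Every section heading is emitted even when the section is empty,
matching the empty "Vendor Collations:" section in the example.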
9. DTD for Collation Registration
10. Guidelines for Expert Reviewer
The expert reviewer appointed by the IESG has fairly broad latitude
for this registry. While a number of collations are expected
(particularly customizations of the basic collation for localized
use), an explosion of collations (particularly common use collations)
is not desirable for widespread interoperability. However, it is
important for the expert reviewer to provide cause when rejecting a
registration, and when possible to describe corrective action to
permit the registration to proceed. The following list includes
some example reasons to reject a registration with cause:
o The registration is not a well-formed XML document that follows
the DTD.
o The registration has intended use of "common", but there is no
evidence the collation will be widely deployed so it should be
listed as "limited".
o The registration has intended use of "common", but is redundant
with the functionality of a previously registered "common"
collation.
o The collation name fails to precisely identify the version numbers
of relevant tables to use.
o The registration fails to meet one of the "MUST" requirements in
Section 4.
o The collation name fails to meet the syntax in Section 3.
o The collation specification referenced in the registration is
vague or has optional features without a clear behavior specified.
o The referenced specification does not adequately address security
considerations specific to that collation.
11. Security Considerations
Collations will normally be used with UTF-8 strings. Thus the
security considerations for UTF-8 [3] and stringprep [6] also apply
and are normative to this specification.
12. Open Issues
1. Is any Nameprep processing appropriate for the basic collation?
Because a result of "0" from an ordering algorithm is
undesirable, much of the nameprep processing is inappropriate.
Furthermore, a result of "error" which is important for nameprep
is generally inappropriate as an internal result in an ordering
algorithm since it makes the results less intuitive. The sort
key table also eliminates most problematic characters from
consideration if the appropriate collation modifier is used.
Finally, exact compatibility with the Unicode Collation Algorithm
is deemed desirable by the author, as even the smallest variation
may require implementation of largely duplicate code. However,
this decision is outside my expertise, so I welcome alternate
viewpoints.
2. The ICU implementation of the UCA includes additional
algorithmic customizations such as the ability to be
case-sensitive while at the same time being insensitive to
accents. Should these customizations be added to this
specification?
3. Should a format for customization data for the basic collation be
defined so that disconnected clients might have the option of
downloading that information?
4. Need to deal with the concept of "maybe" or "indeterminate"
results from matching or ordering. See what LDAP does as an
example.
13. Changes From -00
1. Replaced the term comparator with collation. While comparator is
somewhat more precise because these abstract functions are used
for matching as well as ordering, collation is the term used by
other parts of the industry. Thus I have changed the name to
collation for consistency.
2. Removed all modifiers to the basic collation except for the
customization and the match rules. The other behavior
modifications can be specified in a customization of the
collation.
3. Used ";" instead of "-" as the delimiter between parameters to
make names more URL-ish.
4. Added URL form for collation reference.
5. Switched registration template to use an XML document.
6. Added a number of useful registration template elements related
to the Unicode Collation Algorithm.
7. Switched language from "custom" to "tailor" to match UCA language
for tailoring of the collation algorithm.
Normative References
[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", BCP 14, RFC 2119, March 1997.
[2] Crocker, D. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", RFC 2234, November 1997.
[3] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC
2279, January 1998.
[4] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform Resource
Identifiers (URI): Generic Syntax", RFC 2396, August 1998.
[5] Alvestrand, H., "Tags for the Identification of Languages", BCP
47, RFC 3066, January 2001.
[6] Hoffman, P. and M. Blanchet, "Preparation of Internationalized
Strings ("stringprep")", RFC 3454, December 2002.
[7] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)", RFC 3491, March 2003.
[8] Davis, M. and K. Whistler, "Unicode Collation Algorithm version
9", July 2002.
Informative References
[9] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message Bodies",
RFC 2045, November 1996.
[10] Myers, J., "Simple Authentication and Security Layer (SASL)",
RFC 2222, October 1997.
[11] Newman, C. and J. Myers, "ACAP -- Application Configuration
Access Protocol", RFC 2244, November 1997.
[12] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
Considerations Section in RFCs", BCP 26, RFC 2434, October
1998.
[13] Resnick, P., "Internet Message Format", RFC 2822, April 2001.
[14] Freed, N. and J. Postel, "IANA Charset Registration
Procedures", BCP 19, RFC 2978, October 2000.
[15] Showalter, T., "Sieve: A Mail Filtering Language", RFC 3028,
January 2001.
Author's Address
Chris Newman
Sun Microsystems
1050 Lakes Drive
West Covina, CA 91790
US
EMail: chris.newman@sun.com
Intellectual Property Statement
The IETF takes no position regarding the validity or scope of any
intellectual property or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; neither does it represent that it
has made any effort to identify any such rights. Information on the
IETF's procedures with respect to rights in standards-track and
standards-related documentation can be found in BCP-11. Copies of
claims of rights made available for publication and any assurances of
licenses to be made available, or the result of an attempt made to
obtain a general license or permission for the use of such
proprietary rights by implementors or users of this specification can
be obtained from the IETF Secretariat.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights which may cover technology that may be required to practice
this standard. Please address the information to the IETF Executive
Director.
Full Copyright Statement
Copyright (C) The Internet Society (2004). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assignees.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Acknowledgment
Funding for the RFC Editor function is currently provided by the
Internet Society.