   1
   2
   3
   4
   5
   6
   7 Network Working Group                                         M. Crispin
   8 Request for Comments: 5051                      University of Washington
   9 Category: Standards Track                                   October 2007
  10
  11
  12          i;unicode-casemap - Simple Unicode Collation Algorithm
  13
  14 Status of This Memo
  15
  16    This document specifies an Internet standards track protocol for the
  17    Internet community, and requests discussion and suggestions for
  18    improvements.  Please refer to the current edition of the "Internet
  19    Official Protocol Standards" (STD 1) for the standardization state
  20    and status of this protocol.  Distribution of this memo is unlimited.
  21
  22 Abstract
  23
  24    This document describes "i;unicode-casemap", a simple case-
  25    insensitive collation for Unicode strings.  It provides equality,
  26    substring, and ordering operations.
  27
  28 1.  Introduction
  29
  30    The "i;ascii-casemap" collation described in [COMPARATOR] is quite
  31    simple to implement and provides case-independent comparisons for the
  32    26 Latin alphabetics.  It is specified as the default and/or baseline
  33    comparator in some application protocols, e.g., [IMAP-SORT].
  34
  35    However, the "i;ascii-casemap" collation does not produce
  36    satisfactory results with non-ASCII characters.  It is possible, with
  37    a modest extension, to provide a more sophisticated collation with
  38    greater multilingual applicability than "i;ascii-casemap".  This
  39    extension provides case-independent comparisons for a much greater
  40    number of characters.  It also collates characters with diacriticals
  41    with the non-diacritical character forms.
  42
  43    This collation, "i;unicode-casemap", is intended to be an alternative
  44    to, and preferred over, "i;ascii-casemap".  It does not replace the
  45    "i;basic" collation described in [BASIC].
  46
  47 2.  Unicode Casemap Collation Description
  48
  49    The "i;unicode-casemap" collation is a simple collation which is
  50    case-insensitive in its treatment of characters.  It provides
  51    equality, substring, and ordering operations.  The validity test
  52    operation returns "valid" for any input.
  53
  54
  55
  56
  57
  58 Crispin                     Standards Track                     [Page 1]
  59 \f
  60 RFC 5051                   i;unicode-casemap                October 2007
  61
  62
  63    This collation allows strings in arbitrary (and mixed) character
  64    sets, as long as the character set for each string is identified and
  65    it is possible to convert the string to Unicode.  Strings which have
  66    an unidentified character set and/or cannot be converted to Unicode
  67    are not rejected, but are treated as binary.
  68
  69    Each input string is prepared by converting it to a "titlecased
  70    canonicalized UTF-8" string according to the following steps, using
  71    UnicodeData.txt ([UNICODE-DATA]):
  72
  73       (1) A Unicode codepoint is obtained from the input string.
  74
  75           (a) If the input string is in a known charset that can be
  76               converted to Unicode, a sequence in the string's charset
  77               is read and checked for validity according to the rules of
  78               that charset.  If the sequence is valid, it is converted
  79               to a Unicode codepoint.  Note that for input strings in
  80               UTF-8, the UTF-8 sequence must be valid according to the
  81               rules of [UTF-8]; e.g., overlong UTF-8 sequences are
  82               invalid.
  83
  84           (b) If the input string is in an unknown charset, or an
  85               invalid sequence occurs in step (1)(a), conversion ceases.
  86               No further preparation is performed, and any partial
  87               preparation results are discarded.  The original string is
  88               used unchanged with the i;octet comparator.
  89
  90       (2) The following steps, using UnicodeData.txt ([UNICODE-DATA]),
  91           are performed on the resulting codepoint from step (1)(a).
  92
  93           (a) If the codepoint has a titlecase property in
  94               UnicodeData.txt (this is normally the same as the
  95               uppercase property), the codepoint is converted to the
  96               codepoints in the titlecase property.
  97
  98           (b) If the resulting codepoint from (2)(a) has a decomposition
  99               property of any type in UnicodeData.txt, the codepoint is
 100               converted to the codepoints in the decomposition property.
 101               This step is recursively applied to each of the resulting
 102               codepoints until no more decomposition is possible
 103               (effectively Normalization Form KD).
 104
 105           Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
 106           has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
 107           WITH SMALL LETTER Z WITH CARON).  Codepoint U+01C5 has a
 108           decomposition property of U+0044 (LATIN CAPITAL LETTER D)
 109           U+017E (LATIN SMALL LETTER Z WITH CARON).  U+017E has a
  110           decomposition property of U+007A (LATIN SMALL LETTER Z) U+030C
  119           (COMBINING CARON).  None of U+0044, U+007A, or U+030C has a
  120           decomposition property.  Therefore, U+01C4 is converted
 121           to U+0044 U+007A U+030C by this step.
 122
 123       (3) The resulting codepoint(s) from step (2) is/are appended, in
 124           UTF-8 format, to the "titlecased canonicalized UTF-8" string.
 125
 126       (4) Repeat from step (1) until there is no more data in the input
 127           string.
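
The preparation steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes the input has already been decoded to a Python str (step (1)(a)), it relies on the Unicode tables bundled with the Python interpreter rather than on a particular UnicodeData.txt revision, and the helper names (simple_titlecase, decompose, prepare) are hypothetical. Python does not expose the UnicodeData.txt titlecase field directly, so simple_titlecase approximates it with str.title() and keeps the original codepoint whenever the full case mapping yields more than one codepoint.

```python
import unicodedata


def simple_titlecase(ch: str) -> str:
    """Approximate the single-codepoint titlecase field of UnicodeData.txt.

    Assumption: str.title() applies full case mappings, so a result longer
    than one codepoint is treated as "no titlecase property".
    """
    mapped = ch.title()
    return mapped if len(mapped) == 1 else ch


def decompose(ch: str) -> str:
    """Recursively expand the decomposition field, step (2)(b)."""
    field = unicodedata.decomposition(ch)
    if not field:
        return ch  # no decomposition property: keep the codepoint
    # Skip a leading tag such as "<compat>"; the rest are hex codepoints.
    parts = [p for p in field.split() if not p.startswith("<")]
    return "".join(decompose(chr(int(p, 16))) for p in parts)


def prepare(text: str) -> bytes:
    """Build the "titlecased canonicalized UTF-8" string for decoded input."""
    pieces = []
    for ch in text:  # step (1): one codepoint at a time
        # step (2)(a) is applied once, then step (2)(b) recursively
        pieces.append(decompose(simple_titlecase(ch)))
    return "".join(pieces).encode("utf-8")  # step (3)


if __name__ == "__main__":
    # The example from step (2): U+01C4 becomes U+0044 U+007A U+030C.
    print(prepare("\u01c4") == "\u0044\u007a\u030c".encode("utf-8"))
```

Note that titlecasing is applied only to the input codepoint, not to the codepoints produced by decomposition, which is why the example ends in a lowercase U+007A.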
 128
 129    Following the above preparation process on each string, the equality,
 130    ordering, and substring operations are as for i;octet.
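
Once both strings are prepared, the three operations reduce to plain octet comparisons. The sketch below assumes the inputs are already "titlecased canonicalized UTF-8" byte strings; the function names are hypothetical and only illustrate i;octet-style behaviour using Python's built-in bytes comparison, which compares octet by octet as unsigned values and sorts a proper prefix first.

```python
def octet_equal(a: bytes, b: bytes) -> bool:
    """Equality: the two octet sequences are identical."""
    return a == b


def octet_order(a: bytes, b: bytes) -> int:
    """Ordering: -1, 0, or +1 by unsigned octet value."""
    return (a > b) - (a < b)


def octet_substring(needle: bytes, haystack: bytes) -> bool:
    """Substring: needle occurs as a contiguous run of octets in haystack."""
    return needle in haystack


if __name__ == "__main__":
    # Illustrative, already-prepared inputs (assumed, not taken from the RFC).
    print(octet_equal(b"APPLE", b"APPLE"))
    print(octet_order(b"APPLE", b"APPLE PIE"))
    print(octet_substring(b"PLE", b"APPLE"))
```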
 131
 132    It is permitted to use an alternative implementation of the above
 133    preparation process if it produces the same results.  For example, it
 134    may be more convenient for an implementation to convert all input
 135    strings to a sequence of UTF-16 or UTF-32 values prior to performing
 136    any of the step (2) actions.  Similarly, if all input strings are (or
 137    are convertible to) Unicode, it may be possible to use UTF-32 as an
 138    alternative to UTF-8 in step (3).
 139
 140       Note: UTF-16 is unsuitable as an alternative to UTF-8 in step (3),
 141       because UTF-16 surrogates will cause i;octet to collate codepoints
  142       U+E000 through U+FFFF after non-BMP codepoints.
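
The effect described in the note can be checked with a short Python sketch. It compares one BMP codepoint from the range U+E000 through U+FFFF with the first non-BMP codepoint under three encodings. The function name is hypothetical, and big-endian byte order is assumed for UTF-16 and UTF-32 so that octet order reflects code unit order.

```python
def preserves_order(a: str, b: str, encoding: str) -> bool:
    """True if octet order under `encoding` agrees with codepoint order."""
    return (a < b) == (a.encode(encoding) < b.encode(encoding))


if __name__ == "__main__":
    bmp_high = "\uff5e"        # a BMP codepoint above the surrogate range
    non_bmp = "\U00010000"     # first codepoint outside the BMP
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        # UTF-16 encodes U+10000 as D800 DC00, which sorts below FF5E.
        print(enc, preserves_order(bmp_high, non_bmp, enc))
```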
 143
 144    This collation is not locale sensitive.  Consequently, care should be
 145    taken when using OS-supplied functions to implement this collation.
 146    Functions such as strcasecmp and toupper are sometimes locale
 147    sensitive and may inconsistently casemap letters.
 148
 149    The i;unicode-casemap collation is well suited to use with many
 150    Internet protocols and computer languages.  Use with natural language
 151    is often inappropriate; even though the collation apparently supports
 152    languages such as Swahili and English, in real-world use it tends to
 153    mis-sort a number of types of string:
 154
 155    o  people and place names containing scripts that are not collated
 156       according to "alphabetical order".
 157    o  words with characters that have diacriticals.  However,
 158       i;unicode-casemap generally does a better job than i;ascii-casemap
 159       for most (but not all) languages.  For example, German umlaut
 160       letters will sort correctly, but some Scandinavian letters will
 161       not.
 162    o  names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
  163       in English).
 164    o  strings containing other non-letter symbols; e.g., euro and pound
 165       sterling symbols, quotation marks other than '"', dashes/hyphens,
 166       etc.
 167
 175 3.  Unicode Casemap Collation Registration
 176
 177    <?xml version='1.0'?>
 178    <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
 179    <collation rfc="5051" scope="global" intendedUse="common">
 180    <identifier>i;unicode-casemap</identifier>
 181    <title>Unicode Casemap</title>
 182    <operations>equality order substring</operations>
 183    <specification>RFC 5051</specification>
 184    <owner>IETF</owner>
 185    <submitter>mrc@cac.washington.edu</submitter>
 186    </collation>
 187
 188 4.  Security Considerations
 189
  190    The security considerations for [UTF-8], [STRINGPREP], and
  191    [UNICODE-SECURITY] apply and are normative to this specification.
 192
 193    The results from this comparator will vary depending upon the
 194    implementation for several reasons.  Implementations MUST consider
 195    whether these possibilities are a problem for their use case:
 196
 197    1) New characters added in Unicode may have decomposition or
 198       titlecase properties that will not be known to an implementation
 199       based upon an older revision of Unicode.  This impacts step (2).
 200
 201    2) Step (2)(b) defines a subset of Normalization Form KD (NFKD) that
 202       does not require normalization of out-of-order diacriticals.
 203       However, an implementation MAY use an NFKD library routine that
 204       does such normalization.  This impacts step (2)(b) and possibly
 205       also step (1)(a), and is an issue only with ill-formed UTF-8
 206       input.
 207
 208    3) The set of charsets handled in step (1)(a) is open-ended.  UTF-8
 209       (and, by extension, US-ASCII) are the only mandatory-to-implement
 210       charsets.  This impacts step (1)(a).
 211
 212       Implementations SHOULD, as far as feasible, support all the
 213       charsets they are likely to encounter in the input data, in order
 214       to avoid poor collation caused by the fall through to the (1)(b)
 215       rule.
 216
 217    4) Other charsets may have revisions which add new characters that
 218       are not known to an implementation based upon an older revision.
 219       This impacts step (1)(a) and possibly also step (1)(b).
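
The fall-through to rule (1)(b) mentioned in items 3 and 4 could look like the following Python sketch. It is an assumption-laden illustration: the function name is hypothetical, the set of known charsets is whatever the Python codec registry happens to provide, and a return value of None is simply this sketch's way of signalling that the original octets should be compared unchanged with i;octet.

```python
from typing import Optional


def decode_for_collation(raw: bytes, charset: Optional[str]) -> Optional[str]:
    """Return decoded text, or None when the step (1)(b) fallback applies."""
    if charset is None:
        return None  # unidentified charset: treat the string as binary
    try:
        # Strict decoding rejects invalid sequences, including overlong UTF-8.
        return raw.decode(charset, errors="strict")
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name, or an invalid sequence in a known charset.
        return None


if __name__ == "__main__":
    print(decode_for_collation(b"caf\xc3\xa9", "utf-8"))        # valid UTF-8
    print(decode_for_collation(b"\xc0\x80", "utf-8"))           # overlong form
    print(decode_for_collation(b"abc", "x-no-such-charset"))    # unknown name
```

Discarding any partial result on failure, rather than substituting a replacement character, keeps the behaviour in line with step (1)(b), which uses the original string unchanged.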
 220
 231    An attacker may create input that is ill-formed or in an unknown
 232    charset, with the intention of impacting the results of this
 233    comparator or exploiting other parts of the system which process this
 234    input in different ways.  Note, however, that even well-formed data
 235    in a known charset can impact the result of this comparator in
 236    unexpected ways.  For example, an attacker can substitute U+0041
 237    (LATIN CAPITAL LETTER A) with U+0391 (GREEK CAPITAL LETTER ALPHA) or
  238    U+0410 (CYRILLIC CAPITAL LETTER A) with the intention of causing a
 239    non-match of strings which visually appear the same and/or causing
 240    the string to appear elsewhere in a sort.
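
The look-alike substitution described above can be illustrated with a small Python sketch. It uses the Unicode tables bundled with the interpreter to show that the three codepoints carry no decomposition property, so the preparation process leaves them distinct. The sample strings and the helper name are hypothetical.

```python
import unicodedata

# U+0041 LATIN A, U+0391 GREEK ALPHA, U+0410 CYRILLIC A: visually similar.
LOOKALIKES = ["\u0041", "\u0391", "\u0410"]


def has_decomposition(ch: str) -> bool:
    """True if the codepoint has a decomposition property of any type."""
    return unicodedata.decomposition(ch) != ""


if __name__ == "__main__":
    for ch in LOOKALIKES:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch), has_decomposition(ch))

    genuine = "ADMIN"          # assumed sample value
    spoofed = "\u0410DMIN"     # Cyrillic A in place of Latin A
    # Both strings are already uppercase and nothing decomposes, so their
    # prepared forms are just their UTF-8 encodings, which differ.
    print(genuine.encode("utf-8") == spoofed.encode("utf-8"))
```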
 241
 242 5.  IANA Considerations
 243
 244    The i;unicode-casemap collation defined in section 2 has been added
 245    to the registry of collations defined in [COMPARATOR].
 246
 247 6.  Normative References
 248
 249    [COMPARATOR]          Newman, C., Duerst, M., and A. Gulbrandsen,
 250                          "Internet Application Protocol Collation
 251                          Registry", RFC 4790, February 2007.
 252
 253    [STRINGPREP]          Hoffman, P. and M. Blanchet, "Preparation of
 254                          Internationalized Strings ("stringprep")", RFC
 255                          3454, December 2002.
 256
 257    [UTF-8]               Yergeau, F., "UTF-8, a transformation format of
 258                          ISO 10646", STD 63, RFC 3629, November 2003.
 259
 260    [UNICODE-DATA]        <http://www.unicode.org/Public/UNIDATA/
 261                          UnicodeData.txt>
 262
 263                          Although the UnicodeData.txt file referenced
 264                          here is part of the Unicode standard, it is
 265                          subject to change as new characters are added
 266                          to Unicode and errors are corrected in Unicode
 267                          revisions.  As a result, it may be less stable
 268                          than might otherwise be implied by the
 269                          standards status of this specification.
 270
 271    [UNICODE-SECURITY]    Davis, M. and M. Suignard, "Unicode Security
 272                          Considerations", February 2006,
 273                          <http://www.unicode.org/reports/tr36/>.
 274
 287 7.  Informative References
 288
 289    [BASIC]               Newman, C., Duerst, M., and A. Gulbrandsen,
 290                          "i;basic - the Unicode Collation Algorithm",
 291                          Work in Progress, March 2007.
 292
 293    [IMAP-SORT]           Crispin, M. and K. Murchison, "Internet Message
 294                          Access Protocol - SORT and THREAD Extensions",
 295                          Work in Progress, September 2007.
 296
 297 Author's Address
 298
 299    Mark R. Crispin
 300    Networks and Distributed Computing
 301    University of Washington
 302    4545 15th Avenue NE
 303    Seattle, WA  98105-4527
 304
 305    Phone: +1 (206) 543-5762
 306    EMail: MRC@CAC.Washington.EDU
 307
 343 Full Copyright Statement
 344
 345    Copyright (C) The IETF Trust (2007).
 346
 347    This document is subject to the rights, licenses and restrictions
 348    contained in BCP 78, and except as set forth therein, the authors
 349    retain all their rights.
 350
 351    This document and the information contained herein are provided on an
 352    "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
 353    OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
 354    THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
 355    OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
 356    THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
 357    WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
 358
 359 Intellectual Property
 360
 361    The IETF takes no position regarding the validity or scope of any
 362    Intellectual Property Rights or other rights that might be claimed to
 363    pertain to the implementation or use of the technology described in
 364    this document or the extent to which any license under such rights
 365    might or might not be available; nor does it represent that it has
 366    made any independent effort to identify any such rights.  Information
 367    on the procedures with respect to rights in RFC documents can be
 368    found in BCP 78 and BCP 79.
 369
 370    Copies of IPR disclosures made to the IETF Secretariat and any
 371    assurances of licenses to be made available, or the result of an
 372    attempt made to obtain a general license or permission for the use of
 373    such proprietary rights by implementers or users of this
 374    specification can be obtained from the IETF on-line IPR repository at
 375    http://www.ietf.org/ipr.
 376
 377    The IETF invites any interested party to bring to its attention any
 378    copyrights, patents or patent applications, or other proprietary
 379    rights that may cover technology that may be required to implement
 380    this standard.  Please address the information to the IETF at
 381    ietf-ipr@ietf.org.