Unicode-related debtags

Sun Aug 24 17:41:00 UTC 2008

I maintain the "unifont" Debian package, which includes a Unicode
font that spans all of Unicode Plane 0 (the Basic Multilingual Plane,
or BMP, which contains most of the world's modern scripts).  Currently
the only way to indicate that a package covers a certain script is
with the "culture::" tags.  These tags do not cover all scripts
available in Unicode.  At the same time, if a package consists only of
a font (such as the "ttf-unifont" package), claiming support for a
whole culture could be somewhat of a stretch, especially for a smaller
font.

With the standardization of Debian documentation on UTF-8, a perpetual
increase in international fonts (and international language support),
and the rise to prominence of the Unicode Standard in general, could
we consider a special tag category that denotes the Unicode scripts
that are covered in a font?  For example,
"unicode::cjk_unified_ideographs" spans the current "culture::chinese"
and "culture::taiwanese" but is not all-inclusive.  Further Unified
Han ideographs exist in CJK Unified Ideographs Extension A, as well as
rarer ideographs outside of Plane 0.  Someone might be looking for a
font that specifically contains such an extension.

In the case of CJK ideographs, there is also a Unicode "z-axis" that
someone could use to search for a CJK font focused on a Japanese style
or a Korean style, or for a CJK font that contains ancient forms.
Maybe that could be supported in the future.

A further tag indicating the version of the Unicode Standard followed
would also be useful, as characters are added with each new version;
for example, "unicode::version:5.1" or "unicode::version:5.1.0" would
help someone looking for CJK Unified Ideographs newly added in Unicode
version 5.1.
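
As a rough illustration of how such a version tag might be compared,
here is a minimal Python sketch.  The "unicode::version:" prefix is
the one proposed above; the function names and the rule that "5.1"
means the same as "5.1.0" are assumptions made only for this example,
not part of any existing debtags tool.

```python
# Hypothetical helpers for the proposed "unicode::version:X.Y[.Z]" tag.
PREFIX = "unicode::version:"


def parse_version_tag(tag):
    """Return the version in a tag as a 3-tuple of ints, e.g. (5, 1, 0)."""
    if not tag.startswith(PREFIX):
        raise ValueError("not a unicode::version tag: %r" % tag)
    parts = [int(p) for p in tag[len(PREFIX):].split(".")]
    if len(parts) > 3:
        raise ValueError("too many version components: %r" % tag)
    # Assumption: missing components count as zero, so 5.1 == 5.1.0.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def at_least(font_tag, wanted_tag):
    """True if the font's Unicode version is the wanted one or newer."""
    return parse_version_tag(font_tag) >= parse_version_tag(wanted_tag)
```

Comparing integer tuples rather than strings keeps a later version
such as 10.0 ordered after 5.1.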

Below is a list of the blocks in Unicode Plane 0 (hexadecimal code
points 0x0000 through 0xFFFF), most of which correspond to a script.
Unicode code points are defined up to 0x10FFFF, for a total of 17
(decimal) planes, numbered 0 through 16.  A naming convention could be
adopted for the tags, such as lowercasing the block name and replacing
spaces with underscores.  Also, some aliases might be permitted; for
example, instead of "unicode::c0_controls_and_basic_latin" a possible
alternative could be "unicode::ascii" (which is what the range
U+0000..U+007F happens to be).
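
One possible reading of that naming convention, sketched in Python.
The lowercasing, the treatment of hyphens and apostrophes as
separators, and the function and table names are all assumptions for
illustration; only the "unicode::ascii" alias comes from the text
above.

```python
import re

# Hypothetical alias table; only this one alias is suggested above.
ALIASES = {
    "unicode::c0_controls_and_basic_latin": "unicode::ascii",
}


def block_tag(block_name, use_alias=False):
    """Turn a block name into a proposed 'unicode::' tag.

    Assumption: every run of characters other than a-z and 0-9
    (spaces, hyphens, apostrophes) becomes a single underscore.
    """
    slug = re.sub(r"[^a-z0-9]+", "_", block_name.lower()).strip("_")
    tag = "unicode::" + slug
    if use_alias:
        return ALIASES.get(tag, tag)
    return tag
```

With this sketch, block_tag("CJK Unified Ideographs") gives
"unicode::cjk_unified_ideographs", matching the example tag used
earlier.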

      U+0000..U+007F  C0 Controls and Basic Latin
      U+0080..U+00FF  C1 Controls and Latin-1 Supplement
      U+0100..U+017F  Latin Extended-A
      U+0180..U+024F  Latin Extended-B
      U+0250..U+02AF  IPA Extensions
      U+02B0..U+02FF  Spacing Modifier Letters
      U+0300..U+036F  Combining Diacritical Marks
      U+0370..U+03FF  Greek and Coptic
      U+0400..U+04FF  Cyrillic
      U+0500..U+052F  Cyrillic Supplement
      U+0530..U+058F  Armenian
      U+0590..U+05FF  Hebrew
      U+0600..U+06FF  Arabic
      U+0700..U+074F  Syriac
      U+0750..U+077F  Arabic Supplement
      U+0780..U+07BF  Thaana
      U+07C0..U+07FF  N'Ko
      U+0800..U+08FF  Unassigned
      U+0900..U+097F  Devanagari
      U+0980..U+09FF  Bengali
      U+0A00..U+0A7F  Gurmukhi
      U+0A80..U+0AFF  Gujarati
      U+0B00..U+0B7F  Oriya
      U+0B80..U+0BFF  Tamil
      U+0C00..U+0C7F  Telugu
      U+0C80..U+0CFF  Kannada
      U+0D00..U+0D7F  Malayalam
      U+0D80..U+0DFF  Sinhala
      U+0E00..U+0E7F  Thai
      U+0E80..U+0EFF  Lao
      U+0F00..U+0FFF  Tibetan
      U+1000..U+109F  Myanmar
      U+10A0..U+10FF  Georgian
      U+1100..U+11FF  Hangul Jamo
      U+1200..U+137F  Ethiopic
      U+1380..U+139F  Ethiopic Supplement
      U+13A0..U+13FF  Cherokee
      U+1400..U+167F  Unified Canadian Aboriginal Syllabics
      U+1680..U+169F  Ogham
      U+16A0..U+16FF  Runic
      U+1700..U+171F  Tagalog
      U+1720..U+173F  Hanunoo
      U+1740..U+175F  Buhid
      U+1760..U+177F  Tagbanwa
      U+1780..U+17FF  Khmer
      U+1800..U+18AF  Mongolian
      U+18B0..U+18FF  Unassigned
      U+1900..U+194F  Limbu
      U+1950..U+197F  Tai Le
      U+1980..U+19DF  New Tai Lue
      U+19E0..U+19FF  Khmer Symbols
      U+1A00..U+1A1F  Buginese
      U+1A20..U+1AFF  Unassigned
      U+1B00..U+1B7F  Balinese
      U+1B80..U+1BBF  Sundanese
      U+1BC0..U+1BFF  Unassigned
      U+1C00..U+1C4F  Lepcha
      U+1C50..U+1C7F  Ol Chiki
      U+1C80..U+1CFF  Unassigned
      U+1D00..U+1D7F  Phonetic Extensions
      U+1D80..U+1DBF  Phonetic Extensions Supplement
      U+1DC0..U+1DFF  Combining Diacritical Marks Supplement
      U+1E00..U+1EFF  Latin Extended Additional
      U+1F00..U+1FFF  Greek Extended
      U+2000..U+206F  General Punctuation
      U+2070..U+209F  Superscripts and Subscripts
      U+20A0..U+20CF  Currency Symbols
      U+20D0..U+20FF  Combining Diacritical Marks for Symbols
      U+2100..U+214F  Letterlike Symbols
      U+2150..U+218F  Number Forms
      U+2190..U+21FF  Arrows
      U+2200..U+22FF  Mathematical Operators
      U+2300..U+23FF  Miscellaneous Technical
      U+2400..U+243F  Control Pictures
      U+2440..U+245F  Optical Character Recognition
      U+2460..U+24FF  Enclosed Alphanumerics
      U+2500..U+257F  Box Drawing
      U+2580..U+259F  Block Elements
      U+25A0..U+25FF  Geometric Shapes
      U+2600..U+26FF  Miscellaneous Symbols
      U+2700..U+27BF  Dingbats
      U+27C0..U+27EF  Miscellaneous Mathematical Symbols-A
      U+27F0..U+27FF  Supplemental Arrows-A
      U+2800..U+28FF  Braille Patterns
      U+2900..U+297F  Supplemental Arrows-B
      U+2980..U+29FF  Miscellaneous Mathematical Symbols-B
      U+2A00..U+2AFF  Supplemental Mathematical Operators
      U+2B00..U+2BFF  Miscellaneous Symbols and Arrows
      U+2C00..U+2C5F  Glagolitic
      U+2C60..U+2C7F  Latin Extended-C
      U+2C80..U+2CFF  Coptic
      U+2D00..U+2D2F  Georgian Supplement
      U+2D30..U+2D7F  Tifinagh
      U+2D80..U+2DDF  Ethiopic Extended
      U+2DE0..U+2DFF  Cyrillic Extended-A
      U+2E00..U+2E7F  Supplemental Punctuation
      U+2E80..U+2EFF  CJK Radicals Supplement
      U+2F00..U+2FDF  Kangxi Radicals
      U+2FE0..U+2FEF  Unassigned
      U+2FF0..U+2FFF  Ideographic Description Characters
      U+3000..U+303F  CJK Symbols and Punctuation
      U+3040..U+309F  Hiragana
      U+30A0..U+30FF  Katakana
      U+3100..U+312F  Bopomofo
      U+3130..U+318F  Hangul Compatibility Jamo
      U+3190..U+319F  Kanbun
      U+31A0..U+31BF  Bopomofo Extended
      U+31C0..U+31EF  CJK Strokes
      U+31F0..U+31FF  Katakana Phonetic Extensions
      U+3200..U+32FF  Enclosed CJK Letters and Months
      U+3300..U+33FF  CJK Compatibility
      U+3400..U+4DBF  CJK Unified Ideographs Extension A
      U+4DC0..U+4DFF  Yijing Hexagram Symbols
      U+4E00..U+9FFF  CJK Unified Ideographs
      U+A000..U+A48F  Yi Syllables
      U+A490..U+A4CF  Yi Radicals
      U+A4D0..U+A4FF  Unassigned
      U+A500..U+A63F  Vai
      U+A640..U+A69F  Cyrillic Extended-B
      U+A6A0..U+A6FF  Unassigned
      U+A700..U+A71F  Modifier Tone Letters
      U+A720..U+A7FF  Latin Extended-D
      U+A800..U+A82F  Syloti Nagri
      U+A830..U+A83F  Unassigned
      U+A840..U+A87F  Phags-pa
      U+A880..U+A8DF  Saurashtra
      U+A8E0..U+A8FF  Unassigned
      U+A900..U+A92F  Kayah Li
      U+A930..U+A95F  Rejang
      U+A960..U+A9FF  Unassigned
      U+AA00..U+AA5F  Cham
      U+AA60..U+ABFF  Unassigned
      U+AC00..U+D7AF  Hangul Syllables
      U+D7B0..U+D7FF  Unassigned
      U+D800..U+DFFF  Surrogates - Not Used for Characters
      U+E000..U+F8FF  Private Use Area
      U+F900..U+FAFF  CJK Compatibility Ideographs
      U+FB00..U+FB4F  Alphabetic Presentation Forms
      U+FB50..U+FDFF  Arabic Presentation Forms-A
      U+FE00..U+FE0F  Variation Selectors
      U+FE10..U+FE1F  Vertical Forms
      U+FE20..U+FE2F  Combining Half Marks
      U+FE30..U+FE4F  CJK Compatibility Forms
      U+FE50..U+FE6F  Small Form Variants
      U+FE70..U+FEFF  Arabic Presentation Forms-B
      U+FF00..U+FFEF  Halfwidth and Fullwidth Forms
      U+FFF0..U+FFFF  Specials
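
To show how a list like the one above could be used when tagging a
font, here is a small Python sketch.  It copies only four rows of the
list, so lookups outside those rows return None; a real tool would
load every row.  The function names are hypothetical, and the way a
font's code points would be obtained is left out entirely.

```python
from bisect import bisect_right

# Four rows copied from the list above, sorted by starting code point.
BLOCKS = [
    (0x0000, 0x007F, "C0 Controls and Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x1C50, 0x1C7F, "Ol Chiki"),
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
]
STARTS = [start for start, _end, _name in BLOCKS]


def plane(code_point):
    """Return the plane number (0 through 16) of a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16  # each plane holds 0x10000 code points


def block_of(code_point):
    """Return the block name containing code_point, or None."""
    i = bisect_right(STARTS, code_point) - 1
    if i >= 0:
        _start, end, name = BLOCKS[i]
        if code_point <= end:
            return name
    return None


def blocks_touched(code_points):
    """Return the sorted names of blocks with at least one code point."""
    return sorted({b for b in map(block_of, code_points) if b is not None})
```

Note that blocks_touched() only reports that a font has something in
a block.  Whether one glyph is enough to justify a tag, or whether
the whole block must be present, is a policy question this sketch
does not settle.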

There isn't always a one-to-one mapping between a script and a
culture.  Also, a script name and culture name could differ.  For
example, the Ol Chiki script (added in Unicode version 5.1) is used by
the Santali people of India.  For these reasons, being able to tag a
supported script would also be useful.

Paul Hardy
GPG Key ID: E6E6E390