Unicode-related debtags
Paul Hardy
unifoundry at gmail.com
Sun Aug 24 17:41:00 UTC 2008
I am maintaining the "unifont" Debian package, which includes a
Unicode font that spans all of Unicode Plane 0 (the Basic Multilingual
Plane, or BMP, which contains most of the world's modern scripts).
Currently, the only way to indicate that a certain script is covered is
with the "culture::" tags. These tags do not cover all scripts
available in Unicode. At the same time, if a package consists only of
a font (such as the "ttf-unifont" package), it could be something of a
stretch for a smaller font to claim culture support.
With the standardization of Debian documentation on UTF-8, a perpetual
increase in international fonts (and international language support),
and the rise to prominence of the Unicode Standard in general, could
we consider a special tag category that denotes the Unicode scripts
that are covered in a font? For example,
"unicode::cjk_unified_ideographs" spans the current "culture::chinese"
and "culture::taiwanese" but is not all-inclusive. Further Unified
Han ideographs exist in CJK Unified Ideographs Extension A, as well as
rarer ideographs outside of Plane 0. Someone might be looking for a
font that specifically contains such an extension.
In the case of CJK ideographs, there is also a Unicode "z-axis" that
someone could use to search for a CJK font focused on a Japanese style
or a Korean style, or for a CJK font that contained ancient forms.
Maybe that could be supported in the future.
A further tag indicating the version of the Unicode Standard followed
would also be useful, as characters are continually added; for example,
"unicode::version:5.1" or "unicode::version:5.1.0" would help someone
looking for Unified Han CJK ideographs newly added in that version.
Below is a list of the scripts in Unicode Plane 0 (hexadecimal code
points 0x0000 through 0xFFFF). Unicode code points are defined up to
0x10FFFF, for a total of 17 (decimal) planes, numbered 0 through 16.
A convention could be adopted such as replacing spaces with
underscores. Also, some aliases might be permitted; for example,
instead of "unicode::c0_controls_and_basic_latin" a possible
alternative could be "unicode::ascii" (which is what the range
U+0000..U+007F happens to be).
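As a rough illustration of that naming convention, here is a minimal Python sketch. The function name block_to_tag and the ALIASES table are hypothetical. Only the "unicode::" prefix, the underscore convention, and the "ascii" alias come from the proposal; treating hyphens and apostrophes like spaces is an extra assumption.

```python
import re

# Hypothetical alias table; only the "ascii" alias is taken from the proposal.
ALIASES = {"c0_controls_and_basic_latin": "ascii"}


def block_to_tag(block_name, use_aliases=True):
    """Turn a Unicode block name into a proposed 'unicode::' tag."""
    # Lowercase the name and collapse every run of characters other than
    # a-z and 0-9 (spaces, hyphens, apostrophes) into one underscore.
    # Handling hyphens and apostrophes this way is an assumption.
    slug = re.sub(r"[^a-z0-9]+", "_", block_name.lower()).strip("_")
    if use_aliases:
        slug = ALIASES.get(slug, slug)
    return "unicode::" + slug


print(block_to_tag("CJK Unified Ideographs"))
print(block_to_tag("C0 Controls and Basic Latin"))
```

Passing use_aliases=False keeps the long form, so both spellings of a tag could be generated from the same block name.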
U+0000..U+007F C0 Controls and Basic Latin
U+0080..U+00FF C1 Controls and Latin-1 Supplement
U+0100..U+017F Latin Extended-A
U+0180..U+024F Latin Extended-B
U+0250..U+02AF IPA Extensions
U+02B0..U+02FF Spacing Modifier Letters
U+0300..U+036F Combining Diacritical Marks
U+0370..U+03FF Greek and Coptic
U+0400..U+04FF Cyrillic
U+0500..U+052F Cyrillic Supplement
U+0530..U+058F Armenian
U+0590..U+05FF Hebrew
U+0600..U+06FF Arabic
U+0700..U+074F Syriac
U+0750..U+077F Arabic Supplement
U+0780..U+07BF Thaana
U+07C0..U+07FF N'Ko
U+0800..U+08FF Unassigned
U+0900..U+097F Devanagari
U+0980..U+09FF Bengali
U+0A00..U+0A7F Gurmukhi
U+0A80..U+0AFF Gujarati
U+0B00..U+0B7F Oriya
U+0B80..U+0BFF Tamil
U+0C00..U+0C7F Telugu
U+0C80..U+0CFF Kannada
U+0D00..U+0D7F Malayalam
U+0D80..U+0DFF Sinhala
U+0E00..U+0E7F Thai
U+0E80..U+0EFF Lao
U+0F00..U+0FFF Tibetan
U+1000..U+109F Myanmar
U+10A0..U+10FF Georgian
U+1100..U+11FF Hangul Jamo
U+1200..U+137F Ethiopic
U+1380..U+139F Ethiopic Supplement
U+13A0..U+13FF Cherokee
U+1400..U+167F Unified Canadian Aboriginal Syllabics
U+1680..U+169F Ogham
U+16A0..U+16FF Runic
U+1700..U+171F Tagalog
U+1720..U+173F Hanunoo
U+1740..U+175F Buhid
U+1760..U+177F Tagbanwa
U+1780..U+17FF Khmer
U+1800..U+18AF Mongolian
U+18B0..U+18FF Unassigned
U+1900..U+194F Limbu
U+1950..U+197F Tai Le
U+1980..U+19DF New Tai Lue
U+19E0..U+19FF Khmer Symbols
U+1A00..U+1A1F Buginese
U+1A20..U+1AFF Unassigned
U+1B00..U+1B7F Balinese
U+1B80..U+1BBF Sundanese
U+1BC0..U+1BFF Unassigned
U+1C00..U+1C4F Lepcha
U+1C50..U+1C7F Ol Chiki
U+1C80..U+1CFF Unassigned
U+1D00..U+1D7F Phonetic Extensions
U+1D80..U+1DBF Phonetic Extensions Supplement
U+1DC0..U+1DFF Combining Diacritical Marks Supplement
U+1E00..U+1EFF Latin Extended Additional
U+1F00..U+1FFF Greek Extended
U+2000..U+206F General Punctuation
U+2070..U+209F Superscripts and Subscripts
U+20A0..U+20CF Currency Symbols
U+20D0..U+20FF Combining Diacritical Marks for Symbols
U+2100..U+214F Letterlike Symbols
U+2150..U+218F Number Forms
U+2190..U+21FF Arrows
U+2200..U+22FF Mathematical Operators
U+2300..U+23FF Miscellaneous Technical
U+2400..U+243F Control Pictures
U+2440..U+245F Optical Character Recognition
U+2460..U+24FF Enclosed Alphanumerics
U+2500..U+257F Box Drawing
U+2580..U+259F Block Elements
U+25A0..U+25FF Geometric Shapes
U+2600..U+26FF Miscellaneous Symbols
U+2700..U+27BF Dingbats
U+27C0..U+27EF Miscellaneous Mathematical Symbols-A
U+27F0..U+27FF Supplemental Arrows-A
U+2800..U+28FF Braille Patterns
U+2900..U+297F Supplemental Arrows-B
U+2980..U+29FF Miscellaneous Mathematical Symbols-B
U+2A00..U+2AFF Supplemental Mathematical Operators
U+2B00..U+2BFF Miscellaneous Symbols and Arrows
U+2C00..U+2C5F Glagolitic
U+2C60..U+2C7F Latin Extended-C
U+2C80..U+2CFF Coptic
U+2D00..U+2D2F Georgian Supplement
U+2D30..U+2D7F Tifinagh
U+2D80..U+2DDF Ethiopic Extended
U+2DE0..U+2DFF Cyrillic Extended-A
U+2E00..U+2E7F Supplemental Punctuation
U+2E80..U+2EFF CJK Radicals Supplement
U+2F00..U+2FDF Kangxi Radicals
U+2FE0..U+2FEF Unassigned
U+2FF0..U+2FFF Ideographic Description Characters
U+3000..U+303F CJK Symbols and Punctuation
U+3040..U+309F Hiragana
U+30A0..U+30FF Katakana
U+3100..U+312F Bopomofo
U+3130..U+318F Hangul Compatibility Jamo
U+3190..U+319F Kanbun
U+31A0..U+31BF Bopomofo Extended
U+31C0..U+31EF CJK Strokes
U+31F0..U+31FF Katakana Phonetic Extensions
U+3200..U+32FF Enclosed CJK Letters and Months
U+3300..U+33FF CJK Compatibility
U+3400..U+4DBF CJK Unified Ideographs Extension A
U+4DC0..U+4DFF Yijing Hexagram Symbols
U+4E00..U+9FCF CJK Unified Ideographs
U+9FD0..U+9FFF Unassigned
U+A000..U+A48F Yi Syllables
U+A490..U+A4CF Yi Radicals
U+A4D0..U+A4FF Unassigned
U+A500..U+A63F Vai
U+A640..U+A69F Cyrillic Extended-B
U+A6A0..U+A6FF Unassigned
U+A700..U+A71F Modifier Tone Letters
U+A720..U+A7FF Latin Extended-D
U+A800..U+A82F Syloti Nagri
U+A830..U+A83F Unassigned
U+A840..U+A87F Phags-pa
U+A880..U+A8DF Saurashtra
U+A8E0..U+A8FF Unassigned
U+A900..U+A92F Kayah Li
U+A930..U+A95F Rejang
U+A960..U+A9FF Unassigned
U+AA00..U+AA5F Cham
U+AA60..U+ABFF Unassigned
U+AC00..U+D7AF Hangul Syllables
U+D7B0..U+D7FF Unassigned
U+D800..U+DFFF Surrogate Pairs - Not Used
U+E000..U+F8FF Private Use Area
U+F900..U+FAFF CJK Compatibility Ideographs
U+FB00..U+FB4F Alphabetic Presentation Forms
U+FB50..U+FDFF Arabic Presentation Forms-A
U+FE00..U+FE0F Variation Selectors
U+FE10..U+FE1F Vertical Forms
U+FE20..U+FE2F Combining Half Marks
U+FE30..U+FE4F CJK Compatibility Forms
U+FE50..U+FE6F Small Form Variants
U+FE70..U+FEFF Arabic Presentation Forms-B
U+FF00..U+FFEF Halfwidth and Fullwidth Forms
U+FFF0..U+FFFF Specials
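To show how a range list like the one above might be used when tagging a font, here is a minimal Python sketch that maps code points to block names. The names BLOCKS, block_of, and blocks_covered are hypothetical, and only a handful of rows from the list are included; a real table would hold every row.

```python
from bisect import bisect_right

# A few (start, end, name) rows copied from the list above, sorted by start.
BLOCKS = [
    (0x0000, 0x007F, "C0 Controls and Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x1C50, 0x1C7F, "Ol Chiki"),
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
]
STARTS = [start for start, _end, _name in BLOCKS]


def block_of(code_point):
    """Return the block name containing code_point, or None if not listed."""
    # Find the last block whose start is <= code_point, then check its end.
    i = bisect_right(STARTS, code_point) - 1
    if i >= 0:
        _start, end, name = BLOCKS[i]
        if code_point <= end:
            return name
    return None


def blocks_covered(code_points):
    """Return the sorted names of all listed blocks touched by code_points."""
    return sorted({b for b in map(block_of, code_points) if b is not None})


# Code points a hypothetical font might contain: "A", "B", and U+0416.
print(blocks_covered([0x41, 0x42, 0x0416]))
```

This reports a block as soon as a single code point falls inside it. Whether a tag should require full coverage of a block or only partial coverage is a policy question the sketch leaves open.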
There isn't always a one-to-one mapping between a script and a
culture. Also, a script name and a culture name can differ. For
example, the Ol Chiki script (added in Unicode version 5.1) is used by
the Santali people of India. For these reasons, being able to tag
a supported script would also be useful.
Paul Hardy
GPG Key ID: E6E6E390