[gopher] Gopher-powered Hangman game!

Kacper Gutowski mwgamera at gmail.com
Tue Nov 24 18:19:53 UTC 2009


We have a problem here.  Originally Gopher was designed to work with plain
ASCII.  However, RFC 1436 not only fails to define what that means, but its
appendix also has a note saying that characters with the high bit set should
be interpreted as ISO Latin 1.  The note concerns only the name displayed in
a directory (the User_Name field), but it shows there was some confusion
about this.
ISO Latin 1 was a common choice for the default character set and encoding.
For example, HTTP also uses Latin 1 by default, but unlike Gopher it has
a mechanism (headers) to change it.  It is strange that Gopher+ didn't
introduce anything for this purpose (even though it has content negotiation
similar to the one later introduced in HTTP/1.1).  Thanks to this, however,
we avoided the hell of having to support multiple incompatible legacy
encodings.  But by now everybody should realize that ASCII, or even Latin 1,
is not enough, because many languages cannot be written with only those
characters.

On 2009-11-24 05:46:22, Cameron Kaiser wrote:
> I think gopher should use ASCII where possible and UTF-8 where necessary.
> But that's just me.

Personally, I also think that we should disregard the note about Latin 1 and
start using UTF-8.  To retain the illusion of compatibility, one should simply
avoid accented letters when they are not necessary, i.e. when writing in
English or Latin.  (Note that in most other languages that use Latin-based
alphabets, diacritical marks are not optional [1].)  Please also note that in
reality there is unfortunately NO way to make Gopher support other languages
without sacrificing some compatibility (that damned ISO Latin 1).

I see that there is some consensus about which encoding should be used
when ASCII is not enough, but I'll throw in some arguments here anyway.

Pro UTF-8:
1. Since it's an encoding of Unicode, there won't be any problems with the
   set of available characters, either now or in any foreseeable future.
   If it's machine encodable, it can be expressed in UTF-8.
2. Unlike some other multi-byte encodings, it won't break the protocol itself
   because it's fully compatible with 7-bit ASCII.  One can say that ASCII
   is in every respect a subset of UTF-8.
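
Point 2 can be checked with a short Python sketch.  The directory line below
is made up for illustration (the selector, host and port are hypothetical);
it only assumes the tab-separated directory layout described in RFC 1436.

```python
# Hypothetical directory line: type character + display name, then
# selector, host and port, separated by TABs and ended with CR LF.
name = "Wisielec (zażółć gęślą jaźń)"
line = "1" + name + "\t/hangman/pl\tgopher.example.org\t70\r\n"

raw = line.encode("utf-8")

# Every byte of a multi-byte UTF-8 sequence has its high bit set, so none
# of them can be mistaken for TAB, CR or LF when the line is split.
fields = raw.rstrip(b"\r\n").split(b"\t")

# Pure ASCII text encodes to exactly the same bytes under ASCII and UTF-8.
ascii_text = "Gopher-powered Hangman game!"
same = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
```

Splitting on the raw bytes still yields four fields, and the display name
decodes back unchanged, which is all the protocol itself needs.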

Cons:
1. Text written in a script other than Latin, or with accented letters, will
   appear as garbage in a client that expects either plain ASCII or Latin 1.
  - UMN Gopher client expects Latin 1 both in directory and in text
  - lynx does some strange things [2]
  - Firefox (without overbite) does some strange things [2] with directory
    but displays text as usual (heuristically guessing the encoding or
    letting user choose it manually)
2. Compatibility is broken and clients would have to be fixed in order to
   support UTF-8.  Browser-based clients won't care about text content, but
   directories are a whole different thing (see above).
3. Correctly displaying Unicode text is much more complex than displaying
   plain ASCII.  This is obvious and nothing can be done about it, but
   it goes against Gopher's simplicity.
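
The garbage from point 1 is easy to reproduce.  This is only a sketch of
what a client assuming Latin 1 or ASCII would do with UTF-8 bytes; the sample
text is arbitrary and no real client code is involved.

```python
text = "zażółć gęślą jaźń"   # arbitrary Polish sample
raw = text.encode("utf-8")

# A Latin 1 client maps each byte to one character, so every multi-byte
# sequence turns into two unrelated characters.
as_latin1 = raw.decode("latin-1")

# A strict ASCII-only decoder cannot handle the high-bit bytes at all.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Plain English text survives either way.
english = "Hangman".encode("utf-8").decode("latin-1")
```

The damage is reversible as long as the bytes are kept intact (re-encoding
as_latin1 as Latin 1 and decoding as UTF-8 restores the text), but a user of
such a client will only ever see the garbled form.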


[1] For example in Polish and other languages in this hangman game.  There
    was a practice of writing with ASCII equivalents, especially on IRC,
    but it is against the orthography.  N.B. IRC is in a situation similar
    to Gopher's regarding I18N.
[2] Looks like UTF-8 encoding is accepted but only characters present in
    the ISO Latin 1 set are displayed correctly?


-- 
Kacper Gutowski


