[Women-website] Language Negotiation and charsets

Jutta Wrage jw@witch.westfalen.de
Tue, 29 Mar 2005 22:12:19 +0200


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


On Tuesday, 29.03.05 at 19:36, Thierry Reding wrote:

> I was thinking about that the other day as well. I think we should have
> something like that too. The main Debian website has one problem
> though, in that it doesn't remember the language that you choose from
> that language bar. Whenever you click your way to the next page, it
> will switch back to the negotiated language.

It seems I need to explain a bit more about negotiation, charsets and 
languages.

Language negotiation is communication between browser and server. The 
server does not remember the result for a session, because the browser 
sends its language preferences with every request.

So if your browser is set up to prefer fr over en and then de, the 
server will deliver the French version if there is one; if not, the 
English one; and if that is not available either, the German one. If 
the server's default is Dutch (nl), it will fall back to that once your 
browser has sent no further preferences.

This only works if
- - the browser has language settings
- - the server knows about negotiation
- - there are pages that can be negotiated
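
The selection described above can be sketched roughly as follows. This 
is only an illustration, not how Apache implements it: the function 
names parse_accept_language and negotiate are made up for this example, 
and real servers also handle wildcards, language prefixes and other 
dimensions such as charset.

```python
def parse_accept_language(header):
    """Return (language, q) pairs, most preferred first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        lang, _, params = item.partition(";")
        q = 1.0  # a language without an explicit q value counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat an unreadable q value as "not wanted"
        prefs.append((lang.strip().lower(), q))
    # sorted() is stable, so equal q values keep the order the browser sent
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def negotiate(header, available, default):
    """Pick the most preferred available language, else the server default."""
    for lang, q in parse_accept_language(header):
        if q > 0 and lang in available:
            return lang
    return default


available = {"en", "de", "nl"}  # hypothetical set of page variants
print(negotiate("fr, en;q=0.8, de;q=0.5", available, "nl"))  # en
print(negotiate("ja, ko;q=0.5", available, "nl"))            # nl
```

Because nothing is stored on the server, the same selection is repeated 
for every request, which is why a language picked by hand is forgotten 
on the next click.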

If the server works with variants, there is a .var file (the extension 
depends on the server settings) listing the pages, languages and other 
information. The server looks into it and delivers the matching page. 
Another way is to negotiate using MultiViews. Then the languages and 
file extensions have to be defined in the server settings and 
MultiViews must be enabled.

For MultiViews, you may have an index.en.html for English, an 
index.de.html for German and so on. For pl (Polish) you define .po in 
the server settings, because .pl files are treated as executables by 
Apache and it denies access to them (CGIs have to be in the cgi 
directory for security reasons).
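
To illustrate the naming scheme, here is a small sketch of how file 
extensions could be mapped to languages. The table 
EXTENSION_TO_LANGUAGE and the function variants_for are invented for 
this example; on a real server the mapping lives in the server settings 
and the lookup is done by the server itself.

```python
# Hypothetical mapping of file extensions to languages; note .po for Polish
EXTENSION_TO_LANGUAGE = {"en": "en", "de": "de", "po": "pl"}


def variants_for(base, filenames):
    """Return {language: filename} for all language variants of `base`."""
    found = {}
    for name in filenames:
        parts = name.split(".")
        if parts[0] != base:
            continue  # belongs to a different page
        for ext in parts[1:]:
            lang = EXTENSION_TO_LANGUAGE.get(ext)
            if lang:
                found[lang] = name
    return found


files = ["index.en.html", "index.de.html", "index.po.html", "dward.de.html"]
print(variants_for("index", files))
```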

The dicts in http://debianwomen.org/dicts/ work with MultiViews. So if 
you call http://www.debianwomen.org/dicts/dward you will get a page 
according to your browser settings; I get dward.de.html. On 
westfalen.de I am working with .var files, as the web server has other 
settings and the virtual servers are configured automatically (direct 
changes are overwritten on the next restart). I would have to dive a 
bit into it to make the correct changes in the site scripts. So there 
you may call http://www.witch.westfalen.de/debian-women/dward.var to 
see the page in your language.

BTW: The settings I have made on d-w can be seen in the .htaccess file 
in the dicts directory. There you can also see that adding a default 
charset is switched off. If it were on (the default in Apache), 
browsers would get the default charset advertised, and pages written in 
a different charset would look ugly.

Here is a bit of the communication between client and server.

My client sends the page request and
...
Accept-Charset: iso-8859-1, utf-8, iso-10646-ucs-2, macintosh, 
windows-1252, *
Accept-Language: de, en;q=0.94, fr;q=0.88, nl;q=0.81, it;q=0.75, 
ja;q=0.69, es;q=0.62, da;q=0.56, fi;q=0.50, ko;q=0.44, no;q=0.38, 
pt;q=0.31, sv;q=0.25, zh-cn;q=0.19, zh-tw;q=0.12

Server sends
Mar 29 21:20:11  http Rx: HTTP/1.1 200 OK
Mar 29 21:20:11  Rx: 
http://www.witch.westfalen.de/debian-women/dward.var
...
Mar 29 21:20:11  Content-Location: dward.de.html
Mar 29 21:20:11  Vary: negotiate,accept-language,accept-charset
...
Mar 29 21:20:11  Content-Type: text/html
Mar 29 21:20:11  Content-Language: de
Mar 29 21:20:11
Mar 29 21:20:11  Rx Headers:
{
    ...
             "Content-Language" = (de);
             "Content-Location" = ("dward.de.html");
...
             Vary = ("negotiate,accept-language,accept-charset");
         };
     };
}

The Vary line shows which request headers the server negotiated on. 
Here it accepted (and found) the language (de) and the charset 
(iso-8859-1) the client wanted most.

The dward.var has the following lines for the German page:
URI: dward.de.html
Content-type: text/html;charset=iso-8859-1
Content-language: de

The Polish page is UTF-8 (could be something different):
URI: dward.po.html
Content-type: text/html;charset=UTF-8
Content-language: pl
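
As an illustration of what the server reads from such a file, here is a 
rough sketch of a parser for the two entries above. The function 
parse_type_map is made up for this example and only handles simple 
"Name: value" lines in blocks separated by blank lines; it is not the 
server's own parser.

```python
def parse_type_map(text):
    """Split a .var style text into one dict of headers per variant."""
    variants = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # a blank line ends the current variant block
            if current:
                variants.append(current)
                current = {}
            continue
        if line.startswith("#"):
            continue  # comment line
        name, sep, value = line.partition(":")
        if sep:
            current[name.strip().lower()] = value.strip()
    if current:
        variants.append(current)
    return variants


SAMPLE = """\
URI: dward.de.html
Content-type: text/html;charset=iso-8859-1
Content-language: de

URI: dward.po.html
Content-type: text/html;charset=UTF-8
Content-language: pl
"""

by_language = {v["content-language"]: v["uri"] for v in parse_type_map(SAMPLE)}
print(by_language)
```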

The debianwomen server does not negotiate the charset (so the browser 
takes it from the page), but for the dicts files it does negotiate the 
language:
             Vary = ("negotiate,accept-language");

For the main pages, debianwomen.org doesn't do any negotiation but 
delivers the page and sets the charset to UTF-8 (even if the page is 
not):

Mar 29 21:31:47  Content-Type: text/html; charset=UTF-8
             "Content-Type" = ("text/html; charset=UTF-8");

and no negotiation

The wiki behaves the same way but is set up to deliver a default 
encoding (so the browser can take the charset from the page by default).
Mar 29 21:34:59  Content-Type: text/html; charset=iso-8859-1;
and again the Rx headers:
             "Content-Type" = ("text/html; charset=iso-8859-1;");
If the wiki did not override the UTF-8 set up in the .htaccess, we 
would not see correct characters above 127.

Maybe it is now a bit easier to understand why I complained about the 
default charset delivered by the server due to the settings on 
debianwomen.org: that is not the way it should work if you have 
multilingual pages and different charsets. It is meant for people whose 
pages are all in one encoding and one language. Then they can be lazy 
and tell the server to deliver a default encoding, as they know what 
charset they have used to edit the pages. But if they decide to edit 
the pages using a different charset, or put documents edited by others 
online, the server's charset overrides the charset declared in the page 
and the browser may show ugly pages.
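
A small sketch of what goes wrong when the advertised charset does not 
match the bytes of the page. The sample word is made up for this 
example; the point is only that the same bytes give different 
characters depending on which charset the browser is told to use.

```python
text = "Gr\u00fc\u00dfe"  # the German word "Gruesse" with u-umlaut and sharp s

# Page saved as iso-8859-1, but the server advertises UTF-8
latin1_bytes = text.encode("iso-8859-1")
shown_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Page saved as UTF-8, but the server advertises iso-8859-1
utf8_bytes = text.encode("utf-8")
shown_as_latin1 = utf8_bytes.decode("iso-8859-1")

print(shown_as_utf8 == text)    # False: umlauts become replacement characters
print(shown_as_latin1 == text)  # False: each umlaut turns into two characters
print(latin1_bytes.decode("iso-8859-1") == text)  # True: matching charset
```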

A few words about UTF-8 and Perl, less, vi and the terminal:

- - vi does not automatically know whether something is UTF-8. Even if 
you tell vi to use UTF-8, it may damage files (it did for me when 
editing the UTF-8 dicts, like Punjabi).
- - less can be used to view different encodings from the command line 
(I have a less.utf8 script for that).
- - To see the correct characters, the terminal/shell must have the 
charsets needed, be 8-bit clean, and be set up correctly.
- - Perl 5.6 has bad and incomplete Unicode support. To get 
makedictutf8.pl working I had to steal libs from Perl 5.8; otherwise 
there was no way to get it working correctly. Hmm... alioth is running 
Woody, too. So there is no way to run makedictutf8.pl there without 
making changes to the system.

People using English may not notice some of these issues, as an 
English-only file/page looks the same whether viewed as iso8859-1 or 
utf-8, because US-ASCII is part of both charsets (and many others).
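
This can be checked quickly. A minimal sketch with two made-up sample 
strings:

```python
english = "Language negotiation"
german = "M\u00fcnsterland"  # contains one character above 127 (u-umlaut)

# Pure ASCII text gives the same bytes in all three charsets
same_for_english = (
    english.encode("ascii")
    == english.encode("iso-8859-1")
    == english.encode("utf-8")
)

# As soon as a character above 127 appears, the byte sequences differ
same_for_german = german.encode("iso-8859-1") == german.encode("utf-8")

print(same_for_english)  # True
print(same_for_german)   # False
```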

I hope this summary of the things I found out while working on the 
dicts is understandable. If not, feel free to ask.

greetings

Jutta

- -- 
http://www.witch.westfalen.de
http://witch.muensterland.org

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (Darwin)

iD8DBQFCSba1OgZ5N97kHkcRAhSsAKDHJ4hHE/Hx+CTRxfubiI6aylPvHgCeJPQq
YAr7GRhQsrFG42eo/r2Kaxk=
=JurF
-----END PGP SIGNATURE-----