[Women-website] Language Negotiation and charsets
Jutta Wrage
jw@witch.westfalen.de
Tue, 29 Mar 2005 22:12:19 +0200
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On Tuesday, 29.03.05 at 19:36, Thierry Reding wrote:
> I was thinking about that the other day as well. I think we should have
> something like that too. The main Debian website has one problem
> though, in that it doesn't remember the language that you choose from
> that language bar.
> Whenever you click your way to the next page, it will switch back to
> the negotiated language.
It seems I have to explain a bit more about negotiation, charsets and
languages.
Language negotiation is communication between browser and server. The
server does not remember the language for a session, because the browser
sends its language preferences with every request.
So if your browser is set up to prefer fr over en and then de, the
server will deliver the French version if there is one; if not, the
English one; and if that is not available either, the German one. If the
server default is Dutch (nl), the server will fall back to that once
there are no more preferences from your browser left to try.
This only works if
- - the browser has language settings
- - the server knows about negotiation
- - there are pages that can be negotiated
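The fallback order described above can be sketched in a few lines of
Python. This is only an illustration of the idea, not how Apache
implements it; the function name pick_language and its arguments are made
up for this example.

```python
def pick_language(preferred, available, default):
    """Return the first preferred language that has a page, else the default.

    preferred: languages in the browser's order of preference
    available: set of languages the server has a page for
    default:   the server's fallback language (hypothetical setting)
    """
    for lang in preferred:
        if lang in available:
            return lang
    return default


# Browser prefers fr, then en, then de; the server default is nl.
print(pick_language(["fr", "en", "de"], {"de", "nl"}, "nl"))
print(pick_language(["fr", "en", "de"], {"it", "nl"}, "nl"))
```

The first call finds only the German page among the preferred languages;
the second finds none of them and falls back to the default.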
If the server works with variants, there is a .var file (the ending
depends on the server settings) listing the pages, languages and other
information. The server looks into it and delivers the matching page.
Another way is to negotiate using MultiViews. Then the languages and file
endings have to be defined in the server settings and MultiViews must be
enabled.
For MultiViews, you may have an index.en.html for English, an
index.de.html for German and so on. For Polish (pl) you define .po in
the server settings, because Apache knows .pl files as executables and
denies access to them (CGIs have to be in the cgi directory for
security reasons).
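To illustrate the naming scheme, here is a rough Python sketch that maps
file endings to languages and collects the variants of one page. The
mapping table and the function find_variants are assumptions made for this
example; on a real server the mapping lives in the Apache settings.

```python
# Hypothetical ending-to-language map, in the spirit of the server
# settings described above. Note that Polish uses "po", not "pl".
LANGUAGE_BY_ENDING = {"en": "en", "de": "de", "po": "pl"}


def find_variants(basename, filenames):
    """Map language -> filename for files named <basename>.<ending>.html."""
    variants = {}
    for name in filenames:
        parts = name.split(".")
        if len(parts) == 3 and parts[0] == basename and parts[2] == "html":
            lang = LANGUAGE_BY_ENDING.get(parts[1])
            if lang is not None:
                variants[lang] = name
    return variants


files = ["index.en.html", "index.de.html", "index.po.html", "other.en.html"]
print(find_variants("index", files))
```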
The dicts in http://debianwomen.org/dicts/ work with MultiViews. So if
you call http://www.debianwomen.org/dicts/dward you will get a page
according to your browser settings; I get dward.de.html. On
westfalen.de I am working with .var files, as the web server has other
settings and the virtual servers are configured automatically (direct
changes are overwritten on the next restart). I would have to dive a bit
deeper into it to make the correct changes in the site scripts. So there
you may call http://www.witch.westfalen.de/debian-women/dward.var to
see the page in your language.
BTW: The settings I have made on d-w can be seen in the .htaccess file
in the dicts directory. There you can also see that adding a default
charset is switched off. If it were on (the default in Apache), browsers
would get the default charset advertised, and pages in a different
charset would look ugly.
Here is a bit of the communication between client and server.
My client sends the page request and
...
Accept-Charset: iso-8859-1, utf-8, iso-10646-ucs-2, macintosh,
windows-1252, *
Accept-Language: de, en;q=0.94, fr;q=0.88, nl;q=0.81, it;q=0.75,
ja;q=0.69, es;q=0.62, da;q=0.56, fi;q=0.50, ko;q=0.44, no;q=0.38,
pt;q=0.31, sv;q=0.25, zh-cn;q=0.19, zh-tw;q=0.12
Server sends
Mar 29 21:20:11 http Rx: HTTP/1.1 200 OK
Mar 29 21:20:11 Rx:
http://www.witch.westfalen.de/debian-women/dward.var
...
Mar 29 21:20:11 Content-Location: dward.de.html
Mar 29 21:20:11 Vary: negotiate,accept-language,accept-charset
...
Mar 29 21:20:11 Content-Type: text/html
Mar 29 21:20:11 Content-Language: de
Mar 29 21:20:11
Mar 29 21:20:11 Rx Headers:
{
...
"Content-Language" = (de);
"Content-Location" = ("dward.de.html");
...
Vary = ("negotiate,accept-language,accept-charset");
};
};
}
The Vary line shows which request headers the server used for the
negotiation. Here it found the language (de) and the charset
(iso-8859-1) the client wanted most.
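The q values in the Accept-Language header above rank the client's
languages: an entry without q counts as 1.0, and higher values are
preferred. A minimal Python sketch of reading such a header might look
like this. The function name is made up, and the parsing is simplified
(it ignores wildcards and anything but a single q parameter).

```python
def parse_accept_language(header):
    """Return (language, q) pairs sorted by descending q; q defaults to 1.0."""
    entries = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        lang, _, params = item.partition(";")
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            q = float(params[2:])
        entries.append((lang.strip().lower(), q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


header = "de, en;q=0.94, fr;q=0.88, nl;q=0.81"
prefs = parse_accept_language(header)
print(prefs[0])
```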
The dward.var has the following lines for the German page:
URI: dward.de.html
Content-type: text/html;charset=iso-8859-1
Content-language: de
The Polish page is UTF-8 (it could be something different):
URI: dward.po.html
Content-type: text/html;charset=UTF-8
Content-language: pl
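As a rough illustration of how such a .var file can be read, the Python
sketch below splits the two entries shown above into one record per
variant. It assumes the entries are separated by a blank line; the
function parse_type_map is made up for this example and handles only
simple "Name: value" lines.

```python
TYPE_MAP = """\
URI: dward.de.html
Content-type: text/html;charset=iso-8859-1
Content-language: de

URI: dward.po.html
Content-type: text/html;charset=UTF-8
Content-language: pl
"""


def parse_type_map(text):
    """Split a type map into one dict of lower-cased header names per variant."""
    variants = []
    for block in text.strip().split("\n\n"):
        entry = {}
        for line in block.splitlines():
            name, _, value = line.partition(":")
            entry[name.strip().lower()] = value.strip()
        variants.append(entry)
    return variants


variants = parse_type_map(TYPE_MAP)
by_language = {v["content-language"]: v["uri"] for v in variants}
print(by_language)
```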
The debianwomen server does not negotiate the charset (so the browser
takes it from the page), but for the dicts files it does negotiate the
language:
Vary = ("negotiate,accept-language");
For the main pages debianwomen.org doesn't do any negotiation but
delivers the page and sets the charset to utf-8 (even if the page is not):
Mar 29 21:31:47 Content-Type: text/html; charset=UTF-8
"Content-Type" = ("text/html; charset=UTF-8");
and no negotiation
The Wiki behaves the same way but is set up to deliver a default
encoding (so the browser can take the charset from the page by default).
Mar 29 21:34:59 Content-Type: text/html; charset=iso-8859-1;
and again the Rx headers:
"Content-Type" = ("text/html; charset=iso-8859-1;");
If the wiki did not override the utf-8 set up in the .htaccess, we
would not see characters above 127 correctly.
Maybe it is now a bit more understandable why I complained about the
default charset delivered by the server due to the settings on
debianwomen.org: that is not the way it should work if you have
multilingual pages and different charsets. It is meant for people having
pages in only one encoding and one language. They can be lazy and tell
the server to deliver a default encoding, as they know what charset
they used to edit the pages. But if they decide to edit the pages
using a different charset, or put documents edited by others online,
the server's charset overrides the charset declared in the page and the
browser may show ugly pages.
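The precedence described here can be sketched in Python: a charset in the
HTTP Content-Type header wins, and the declaration inside the page is only
consulted when the header has none. The function names and the sample page
are made up for this example, and the regular expressions are a
simplification of what real browsers do.

```python
import re


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    match = re.search(r"charset=([\w-]+)", value, re.IGNORECASE)
    return match.group(1).lower() if match else None


def effective_charset(http_content_type, html_text):
    """Prefer the HTTP header charset; fall back to a <meta> declaration."""
    from_header = charset_from_content_type(http_content_type)
    if from_header:
        return from_header
    meta = re.search(r"<meta[^>]*charset=[\"']?([\w-]+)", html_text,
                     re.IGNORECASE)
    return meta.group(1).lower() if meta else None


# Hypothetical page that declares iso-8859-1 for itself.
page = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'

# Server advertises a default charset: the page's own declaration loses.
print(effective_charset("text/html; charset=UTF-8", page))
# Default charset switched off: the browser takes the charset from the page.
print(effective_charset("text/html", page))
```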
A few words about UTF-8 and perl, less, vi and the terminal:
- - vi does not automatically know whether something is UTF-8. Even if you
tell vi to use UTF-8, it may damage files (it did for me when editing
the UTF-8 dicts, like Punjabi).
- - less can be used to view different encodings via the command line (I
have a less.utf8 script for that).
- - To see the correct characters, the terminal/shell must have the
charsets needed, be 8-bit clean, and be set up correctly.
- - perl 5.6 has bad and incomplete Unicode support. To get
makedictutf8.pl working I had to steal libs from perl 5.8; otherwise
there was no way to get it working correctly. Hmm... alioth is running
Woody, too. So there is no way to run makedictutf8.pl there without
making changes to the system.
People writing in English may not notice some of these issues, as an
English-only file/page looks the same whether viewed as iso-8859-1 or
UTF-8: US-ASCII is a subset of both charsets (and many others).
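A small Python check makes the difference visible. The sample strings are
just examples chosen for this sketch; the German word is written with
escapes so the snippet itself stays pure ASCII.

```python
english = "Hello"
german = "Gr\u00fc\u00dfe"  # "Gruesse" with u-umlaut and sharp s

# Pure ASCII text yields the same bytes in both charsets.
print(english.encode("iso-8859-1") == english.encode("utf-8"))

# Text with characters above 127 yields different bytes.
print(german.encode("iso-8859-1") == german.encode("utf-8"))

# UTF-8 bytes read as iso-8859-1: this is what an "ugly page" looks like.
garbled = german.encode("utf-8").decode("iso-8859-1")
print(len(german), len(garbled))
```

Each of the two non-ASCII characters becomes two bytes in UTF-8, so the
misread text is longer than the original and shows the wrong characters.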
Hope the summary of things I've found out while working on the dicts is
understandable. If not, feel free to ask.
greetings
Jutta
- --
http://www.witch.westfalen.de
http://witch.muensterland.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (Darwin)
iD8DBQFCSba1OgZ5N97kHkcRAhSsAKDHJ4hHE/Hx+CTRxfubiI6aylPvHgCeJPQq
YAr7GRhQsrFG42eo/r2Kaxk=
=JurF
-----END PGP SIGNATURE-----