[Women-website] Language Negotiation and charsets

Thierry Reding women-website@lists.alioth.debian.org
Thu, 31 Mar 2005 07:38:18 +0200


* Jutta Wrage wrote:
[...]
> > I was thinking about that the other day as well. I think we should have
> > something like that too. The main Debian website has one problem though,
> > in that it doesn't remember the language that you choose from that
> > language bar. Whenever you click your way to the next page, it will
> > switch back to the negotiated language.
[...]
>
> Seems I have to tell a bit more about negotiation, charsets and languages.
>
> Language negotiation is communication between browser and server. The
> server does not remember that for a session, as the browser sends the
> language preferences on each request.
>
[...]

I think you misunderstood me. I do know how content negotiation works, and in
fact, if you take a look at the archive of the new website structure, you will
see that content negotiation is already used. It also works correctly.

That does not, however, solve the problem that I was describing. In fact, it
explains why the problem occurs. Since the language is negotiated every time
a page is accessed, there is no automatic way of permanently switching to a
language different from the one provided by your browser (except by setting
up your browser differently).
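
To make that concrete, here is a tiny illustrative sketch (modern Python,
purely for demonstration; the header values are made up): the browser sends
its Accept-Language preferences with every single request, and the server
negotiates from scratch each time, so nothing about an earlier choice is
remembered.

    import urllib.request

    # Every request carries its own Accept-Language header; the server picks
    # a variant based on it each time, without remembering earlier requests.
    request = urllib.request.Request(
        "http://www.debian.org/",
        headers={"Accept-Language": "es, en;q=0.5"},
    )
    with urllib.request.urlopen(request) as response:
        # If negotiation picked the Spanish variant, the server usually
        # reports it in the Content-Language header.
        print(response.headers.get("Content-Language"))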

The site to which I linked in my previous mail handles this by appending the
language suffix to every link on a localized page. For instance, if a page
links to `foo/bar', then the link on the Spanish page becomes
`foo/bar.es.html', so clicking that link from the Spanish page takes you to
`foo/bar' in Spanish and does *not* use the negotiated language. This way you
can permanently stay with the language that you chose from the language bar.
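
A rough sketch of how such link rewriting could work (my own hypothetical
illustration in Python, not the code that site actually uses): when a
localized page is generated, every internal link gets the language suffix
appended, so following links keeps you within that language.

    import re

    # Hypothetical helper: rewrite relative links on a localized page so that
    # they point at the language-specific variant instead of relying on
    # negotiation. External (absolute) links are left alone.
    def localize_links(html, lang):
        return re.sub(
            r'href="(?!https?://)([^"#?]+?)(?:\.html)?"',
            r'href="\1.%s.html"' % lang,
            html,
        )

    page = '<a href="foo/bar">bar</a> <a href="http://example.org/">ext</a>'
    print(localize_links(page, "es"))
    # -> <a href="foo/bar.es.html">bar</a> <a href="http://example.org/">ext</a>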

I think this is the way we should handle it on the Debian Women website. I
believe it is the most intuitive approach.

[...]
> Maybe now it is understandable a bit more why I complained about the
> default charset delivered by the server due to the settings on
> debianwomen.org: that is not the way it should work if you have
> multilingual pages and different charsets. It is for people having pages
> in only one encoding and one language. Then they can be lazy and tell the
> server to deliver a default encoding, as they know what charset they have
> used to edit the pages. But if they decide to edit the pages using a
> different charset, or put documents edited by others online, that charset
> overwrites the charset in the page and the browser may show ugly pages.

I wouldn't go so far as to call all those people lazy. After all, some of
them do a lot of work =) I also wouldn't say that it "is not the way it
should work". Besides, there is iconv, which can convert files to pretty
much any encoding you can think of. In my opinion, it is neither lazy nor
futile to try to use only one charset that can account for all the others.
In the end, it makes life easier for everyone.
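
For what it's worth, the conversion itself is trivial. Re-encoding a Latin-1
file as UTF-8 is a one-liner with iconv, and roughly amounts to this (a
Python sketch; the file name is just an example):

    # Roughly what `iconv -f ISO-8859-1 -t UTF-8` does, as a Python sketch.
    def to_utf8(path, source_encoding="iso-8859-1"):
        with open(path, encoding=source_encoding) as f:
            text = f.read()
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)

    to_utf8("index.de.html")  # hypothetical page edited in Latin-1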

[...]
> - Vi does not automatically know if something is UTF-8. Even if you
> decide to tell vi to use UTF-8, it may damage files (it did for me when
> editing the UTF-8 dicts, like Punjabi).

I have never encountered the same problem. It has happened to me that
characters were not displayed correctly, but the contents of a file were
never damaged.

[...]
> - Perl 5.6 has bad and incomplete Unicode support. To get
> makedictutf8.pl working I had to steal libs from Perl 5.8, else there was
> no way to get it working correctly.
> - Hmm... alioth is running Woody, too. So there is no way to run
> makedictutf8.pl there without making changes to the system.

Alioth does have iconv. Maybe that could be included in the build process of
the dicts?
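
Something along these lines, perhaps (a sketch only; the paths and the
assumed source encoding are placeholders, not the actual dict layout): the
build step would simply pipe every dict source through iconv before the rest
of the processing runs.

    import glob
    import subprocess

    # Sketch of a possible build step on Alioth: normalize every dict source
    # to UTF-8 with iconv before further processing.
    for source in glob.glob("dicts/*.txt"):
        converted = subprocess.run(
            ["iconv", "-f", "ISO-8859-1", "-t", "UTF-8", source],
            check=True, capture_output=True,
        ).stdout
        with open(source + ".utf8", "wb") as output:
            output.write(converted)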

> People using the English language may not notice some issues, as an
> English-only file/page looks the same whether viewed as ISO-8859-1 or
> UTF-8, since US-ASCII is part of both charsets (and many others).

I do know the problem. In my experience, the easiest (and in my opinion best)
way to go about this is to use either UTF-8 or only US-ASCII with HTML
entities. I do not think HTML entities are a viable option, though.
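
(Just to show what the entity option would mean in practice: every non-ASCII
character ends up as a numeric character reference, roughly like the Python
snippet below, which is part of why I don't consider it viable for pages
people actually edit.)

    # Illustration of the "US-ASCII plus HTML entities" option: every
    # non-ASCII character becomes a numeric character reference.
    text = "Künstlerin – frauenfördernd"
    print(text.encode("ascii", "xmlcharrefreplace").decode("ascii"))
    # -> K&#252;nstlerin &#8211; frauenf&#246;rdernd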

> Hope the summary of things I've found out while working on the dicts is
> understandable. If not, feel free to ask.

The way I see it, there are two possibilities: 1) we make everything UTF-8
from the start, or 2) we use different charsets for different languages.

I think by now it is clear that I prefer the first choice. This would mainly
involve either having a policy of using only UTF-8 when editing files, or
processing every file that is not proper UTF-8 with iconv.
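
A possible sketch of that second variant (assuming, purely for illustration,
that anything which is not valid UTF-8 was written in Latin-1; the real
source encoding would of course have to be confirmed per language):

    import subprocess

    # Sketch: leave files that already decode as UTF-8 alone, run everything
    # else through iconv. The Latin-1 assumption and the file name are only
    # placeholders.
    def ensure_utf8(path):
        with open(path, "rb") as f:
            data = f.read()
        try:
            data.decode("utf-8")        # already proper UTF-8, nothing to do
        except UnicodeDecodeError:
            converted = subprocess.run(
                ["iconv", "-f", "ISO-8859-1", "-t", "UTF-8", path],
                check=True, capture_output=True,
            ).stdout
            with open(path, "wb") as f:
                f.write(converted)

    ensure_utf8("about.fr.html")  # hypothetical file name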

The second approach would require configuring Alioth not to provide UTF-8 as
the default charset, and having us send a different header for different
languages. Additionally, files would still need to be processed with iconv in
case people working on a language use a different encoding.

I'd very much like input on this from everyone involved. Questions, comments,
suggestions?

Thierry

