Altova Mailing List Archives


Re: UTF-8 & Unicode

From: "Alan J. Flavell" <flavell@--.---.--.-->
To: NULL
Date: 2/2/2005 12:17:00 PM

On Wed, 2 Feb 2005, EU citizen wrote:

> > > Do web pages have to be created in unicode in order to use UTF-8
> encoding?

[...]

> I wish people would give simple answers to simple questions.

I don't think you've understood the problem.  If the questioner was in 
a position to understand the "simple answer" which you say you want, I 
can't imagine how they would have asked the question in that form in 
the first place.

> This is not a silly question; 

The original questioner should not feel offended or dispirited by what 
I'm going to say: but, in the form in which it was asked, the question 
is incoherent. 

This is not unusual: many people are confused both by the theory and 
by the terminology of character representation, especially if they 
gained an initial understanding in a simpler situation (typically, 
character repertoires of 256 characters or fewer, represented by an 
8-bit character encoding such as iso-8859-anything; and fonts that 
were laid out accordingly).

> See
> http://www.w3schools.com/xml/xml_encoding.asp on XML Encoding. 

How very strange.  This claims to be XHTML, but, as far as I can see, 
it has no character encoding specified on its HTTP Content-type header 
*nor* on its <?xml...> thingy (indeed it doesn't have a <?xml...> 
thingy).

In the absence of a BOM or an encoding declaration, an XML processor 
is entitled to deduce that it's utf-8: but since it's invalid utf-8, 
it *ought* to refuse to process it.
Unless someone can show me what I'm missing.
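
To make that point concrete, here is a minimal sketch in Python (my 
choice of language; the element name "p" and the helper is_valid_utf8 
are made up for illustration).  It takes the three Norwegian letters 
as an iso-8859-1 editor would save them, checks whether those bytes 
are valid utf-8, and then hands an undeclared document containing them 
to the standard library's expat-based parser.

```python
import xml.etree.ElementTree as ET

# The letters ae, o-slash, a-ring as saved by an iso-8859-1 editor.
latin1_bytes = "\u00e6\u00f8\u00e5".encode("iso-8859-1")  # b'\xe6\xf8\xe5'


def is_valid_utf8(data: bytes) -> bool:
    """Return True only if the bytes decode cleanly as utf-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# No BOM and no encoding declaration: the parser must assume utf-8.
undeclared_doc = b"<p>" + latin1_bytes + b"</p>"

try:
    ET.fromstring(undeclared_doc)
    parser_accepted = True
except ET.ParseError:
    parser_accepted = False

print(is_valid_utf8(latin1_bytes))  # validity of the raw bytes as utf-8
print(parser_accepted)              # whether the parser processed the document
```

A conforming parser refuses the document here, which is the behaviour 
I would have expected of anything claiming to treat that page as XML.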

By looking at it, it is evidently encoded in iso-8859-1.
It purports to declare that via a "meta http-equiv", but for XML this 
is meaningless - and anyway comes far too late.

I don't know why the W3C validator doesn't reject it out of hand.

(Of course the popular browsers will be slurping it as slightly 
xhtml-flavoured tag soup, so we can't expect to deduce very much from 
the fact that they calmly display what the author intended.)

> Slightly
> edited, this says:
> 
> XML documents can contain foreign characters like Norwegian æøå, or French
> êèé.

And those characters are presented encoded in iso-8859-1 ...

> To let your XML parser understand these characters, you should save 
> your XML documents as Unicode.

Two things wrong here.  What do they suppose they mean by "save ... as 
Unicode"?  The XML Document Character Set is *by definition* Unicode, 
there's nothing that an author can do to change that (unlike SGML).

Characters can be represented in at least two different ways in XML: 
by /numerical character references/ (&#number;), or as /encoded 
characters/ using some /character encoding scheme/.  (In some contexts 
there may also be named character entities, but they introduce no new 
principles for the present purpose so we won't need to discuss them 
here).
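
As an illustration of those two representations, the following Python 
sketch (the element name "w" is an arbitrary assumption) writes the 
same character, U+00E6, once as a numerical character reference in 
pure ASCII and once as an encoded character in utf-8, and parses both.

```python
import xml.etree.ElementTree as ET

# The same character (U+00E6) written two ways.
as_reference = b"<w>&#230;</w>"                # numerical character reference, ASCII bytes only
as_encoded = "<w>\u00e6</w>".encode("utf-8")   # encoded character, utf-8 encoding scheme

text_from_reference = ET.fromstring(as_reference).text
text_from_encoded = ET.fromstring(as_encoded).text

# Both routes deliver the same Unicode character to the application.
print(text_from_reference == text_from_encoded)
```

The bytes on the wire differ, but the parsed result is one and the 
same character, because the document character set is Unicode either 
way.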

The only coherent interpretation I can put on their "should save as 
Unicode" statement is "should save in one of the character encoding 
schemes of Unicode".  But /should/ we?  Do they?  No, they don't: they 
are using iso-8859-1 (they *could* even do it correctly); and they 
also discuss the use of windows-1252, although without giving much 
detail about the implications of deploying a proprietary character 
encoding on the WWW.
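
One of those implications can be sketched in a few lines of Python 
(the variable names are mine).  The two encodings agree on the 
accented letters, but they disagree about the byte range 0x80-0x9F, 
where windows-1252 places printable characters and iso-8859-1 has only 
control codes.

```python
# A byte from the range where the two encodings disagree.
byte = b"\x80"
as_cp1252 = byte.decode("windows-1252")   # EURO SIGN, U+20AC
as_latin1 = byte.decode("iso-8859-1")     # a C1 control character, U+0080

# A byte from the range where they agree (ae ligature).
shared = b"\xe6"
same_in_both = shared.decode("windows-1252") == shared.decode("iso-8859-1")

print(hex(ord(as_cp1252)), hex(ord(as_latin1)), same_in_both)
```

So a document that is really windows-1252 but labelled iso-8859-1 will 
look fine until somebody types a euro sign or a "smart" quote.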

The /conclusions/ are fine, in their way:

    * Use an editor that supports encoding.
    * Make sure you know what encoding it uses.
    * Use the same encoding attribute in your XML documents.
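
The third of those points can be sketched as follows, again in Python 
and again with made-up names (the element "greeting" is purely 
illustrative).  The document is serialized as iso-8859-1 with a 
matching encoding attribute and parses cleanly; relabelling the same 
bytes as utf-8 makes the parser reject them.

```python
import xml.etree.ElementTree as ET

elem = ET.Element("greeting")
elem.text = "\u00e6\u00f8\u00e5"

# Serialize as iso-8859-1; ElementTree writes a matching XML declaration.
matching = ET.tostring(elem, encoding="iso-8859-1")
round_trip = ET.fromstring(matching).text

# Same bytes, but the declaration now claims utf-8: label and data disagree.
mismatched = matching.replace(b"iso-8859-1", b"utf-8")
try:
    ET.fromstring(mismatched)
    mismatch_accepted = True
except ET.ParseError:
    mismatch_accepted = False

print(round_trip == elem.text)   # declaration matches the bytes
print(mismatch_accepted)         # declaration contradicts the bytes
```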

But the reader still hasn't really learned anything about the 
underlying principles yet.  And the page hasn't told them anything 
useful about *which* encoding to choose for deploying their documents 
on the WWW.

> Windows 95/98 Notepad cannot save files in Unicode format.

Then it's unfit for composing the kind of document that we are 
discussing here.  No matter - there are plenty of competent editors 
which can work on that platform.

My own tutorial pages weren't really aimed at XML, so I won't suggest 
them as an appropriate answer here.  Actually, the relevant chapter of 
the Unicode specification is not unreasonable as an introduction to 
the principles of character representation and encoding, even if it 
might be a bit indigestible at a first reading.

