Altova Mailing List Archives


Re: SDATA or UNICODE

From: "Rick Jelliffe" <ricko@-------.---.-->
To: "'xml development mailing list xml-dev'" <xml-dev@--.--.-->
Date: 1/29/1998 5:13:00 AM
> From: Paul Prescod <papresco@t...>

> On Wed, 28 Jan 1998, Gavin McKenzie wrote:
> > 
> > XML provides a way for specifying the encoding of an entity with the
> > ?XML pi encoding declaration.  Why wouldn't this be sufficient.  If the
> > euro or florin symbol is available in some non-Unicode character
> > encoding scheme, isn't it sufficient to encode the text which requires
> > the symbol in the appropriate scheme and use the encoding declaration?
> 
> No, for the reason Tim points out. On the other hand, you might be on the 
> right track. A processing instruction would serve as a hack to tell the 
> application where to insert the euro. <?EURO>

XML has, underlying its decisions, the SGML model which separates the
encoding of data (i.e. "storage management") from their logical representation
as streams of characters in a single character set (i.e. "entity management").

This is a very flexible model, since it allows any system of encoding that
anyone can dream up to be used without having to alter XML/SGML: an entity
can be sourced from files, multipart MIME, databases, random number generators,
standard input, anything.  Allowing multiple encodings within a single XML
file, delimited using PIs, elements, or internal entities, would violate
this model, and I would strongly recommend against it. If your customers
require multiple encodings, then they have to source each one from a separate
external entity. These entities can be bundled up or interleaved in any
fashion you like, but this is a *PRE* XML storage management issue, not
an XML issue. 
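
A minimal sketch of that separation, in Python rather than anything
XML-specific: each entity arrives as bytes in its own encoding, a storage
management step decodes it, and only the resulting single stream of Unicode
characters is passed on. The helper name `load_entity_text` and the sample
data are made up for illustration; a real entity manager would also resolve
system identifiers and honour encoding declarations.

```python
from typing import Iterable, Tuple


def load_entity_text(sources: Iterable[Tuple[bytes, str]]) -> str:
    """Decode each (raw_bytes, encoding) pair and join the results.

    Hypothetical helper standing in for the storage-management layer:
    it deals with encodings so that the parser never has to.
    """
    return "".join(raw.decode(encoding) for raw, encoding in sources)


# Three "entities", each stored in a different encoding (illustrative data).
sources = [
    ("<note>caf\u00e9 ".encode("iso-8859-1"), "iso-8859-1"),
    ("\u65e5\u672c\u8a9e".encode("shift_jis"), "shift_jis"),
    (" \u20ac5</note>".encode("utf-8"), "utf-8"),
]

# The parser would only ever see this one Unicode string.
document_text = load_entity_text(sources)
```

The point of the design is that the list of sources can be bundled or
interleaved however you like; none of that is visible once decoding is done.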

I think there is a great desire that XML will be a Trojan horse to force
the development of wide-character applications, and Universal Character 
Set-using ones (UCS = ISO 10646 ~= Unicode) in particular. 
I, for one, hope that by disconnecting encoding and character "repertoire", 
XML will marginalise the character encoding issue to the extent that 
it will become easier to use Unicode than to use a regional encoding, 
in the long run.
 
> I think you should implement a language that allows this and is preprocessed 
> into XML. If I were you I would use marked sections and not attributes to 
> describe the boundaries. Marked sections are really easy to scan for.

But once you have changed encodings, do you scan for the end of the
marked section using the old or the new encoding? This kind of ISO 2022
mode changing is what we are trying to get rid of from XML (and from
SGML).
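
To make the scanning problem concrete, here is a small Python sketch (the
function name `naive_scan` is hypothetical, and the byte strings are chosen
purely for illustration). A scanner that looks for the bytes of "]]>" in the
old encoding can both miss a real delimiter and find a false one once the
data has switched encoding:

```python
END_MARKER = b"]]>"  # the delimiter as an ASCII/Latin-1 byte scanner sees it


def naive_scan(data: bytes) -> int:
    """Byte-level search for the delimiter, ignoring the data's encoding."""
    return data.find(END_MARKER)


# Case 1: a real "]]>" written in UTF-16. Each character takes two bytes,
# so the three ASCII bytes never appear next to each other.
utf16_data = "text]]>".encode("utf-16-le")
missed = naive_scan(utf16_data) == -1

# Case 2: Shift_JIS text that does NOT contain "]]>" as characters.
# KATAKANA LETTER ZO (U+30BE) has 0x5D, the ASCII code for "]", as its
# second byte, so following it with "]>" yields the marker's bytes.
sjis_text = "\u30be]>"
sjis_data = sjis_text.encode("shift_jis")
false_hit = "]]>" not in sjis_text and naive_scan(sjis_data) != -1
```

Both `missed` and `false_hit` come out true, which is exactly why the scan
has to know which encoding it is in.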

So you can have multiple encodings before the parser, but they must not be
presented to the parser. The other choice is multiple encodings after the
parser: e.g. embed the SJIS text, encoded in a Latin-1-safe way. This is the
same as Dave's comment about transliteration using notations. You can have a
document like

<?XML version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE x SYSTEM "x.dtd"
[
	<!NOTATION sjis-Qencoded SYSTEM "SjisQ.pl">
	<!ELEMENT SJIS-SECTION ( #PCDATA ) >
	<!ATTLIST SJIS-SECTION
		I-need-decoding NOTATION ( sjis-Qencoded ) "sjis-Qencoded" >
]>
<x>
...

<SJIS-SECTION><![CDATA[
smdkfjhhjwfnnweofijslkdm
]]></SJIS-SECTION>
...
</x>


(You cannot do the same thing using internal entities in XML, since you 
cannot put a notation on an internal entity declaration.)
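
As a rough sketch of the "after the parser" route, the Python below embeds
Shift_JIS text in a Latin-1 document and lets the application decode it
based on the attribute value. Several things here are assumptions and not
part of the example above: "Q-encoded" is taken to mean quoted-printable
(via the standard `quopri` module), the DTD is left out and the attribute is
written explicitly on the element, the declaration uses the lowercase
`<?xml` form that current parsers expect, and the `DECODERS` table is a
stand-in for whatever the notation's system identifier would name.

```python
import quopri
import xml.etree.ElementTree as ET
from typing import List


def q_encode(text: str, encoding: str = "shift_jis") -> str:
    """Encode text as Shift_JIS bytes, then as ASCII-only quoted-printable."""
    return quopri.encodestring(text.encode(encoding)).decode("ascii")


def q_decode(payload: str, encoding: str = "shift_jis") -> str:
    """Reverse q_encode; surrounding whitespace from the CDATA is dropped."""
    raw = quopri.decodestring(payload.strip().encode("ascii"))
    return raw.decode(encoding)


# Hypothetical mapping from notation name to the decoder the application runs.
DECODERS = {"sjis-Qencoded": q_decode}


def build_document(japanese: str) -> bytes:
    """Build a Latin-1 document carrying the Japanese text as a safe payload."""
    xml_text = (
        '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
        "<x>\n"
        "<p>caf\u00e9</p>\n"
        '<SJIS-SECTION I-need-decoding="sjis-Qencoded"><![CDATA[\n'
        f"{q_encode(japanese)}\n"
        "]]></SJIS-SECTION>\n"
        "</x>\n"
    )
    return xml_text.encode("iso-8859-1")


def read_sections(document: bytes) -> List[str]:
    """Parse the document, then decode each section named by its notation."""
    root = ET.fromstring(document)
    decoded = []
    for element in root.iter("SJIS-SECTION"):
        decoder = DECODERS[element.get("I-need-decoding")]
        decoded.append(decoder(element.text or ""))
    return decoded
```

The parser only ever sees one encoding; the second decoding step belongs to
the application. A payload that happened to contain "]]>" would need extra
escaping before it could sit inside a CDATA section, which this sketch does
not attempt.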
 


Rick Jelliffe

