Altova Mailing List Archives


Re: Non-Ascii Characters modified when Document Loaded

From: "Anthony Jones" <Ant@------------.--->
To: NULL
Date: 3/20/2007 9:34:00 AM


"mikes" <mikes@d...> wrote in message
news:4F3D7783-9493-4335-B9E8-7596E7A80FA4@m......
> I was hoping that I was doing something really dumb and there would be a
> quick solution that would not require a lot of detail.  But I guess that's
> not the case, so here is the detail.
>
> My application is written in C++.  I am using VS 2003 (soon to be upgraded
> to VS 2005 I hope) and Platform SDK version 3790.1830.  I'm using the Load
> method of the IXMLDOMDocument class to load the XML document file and the
> Save method to write the document back to a file.  The XML document I am
> loading is UTF-8 encoded.
>
> After a much deeper look at what is happening I have found that the Load
> method is loading the XML correctly from the file; it is not stepping on
> the non-ASCII characters in the XML.  It is the Save method that is causing
> the problem.  It is UTF-8 encoding characters that have already been
> encoded as UTF-8.  An example will make this clearer.
>
> The test XML document I am using has a ® (registered symbol) in the data
> associated with one of its elements.  Its Unicode value is 00AE.  When the
> document is read into the DOM the symbol's code is C2 AE, which is the
> correct UTF-8 encoding of the registered symbol character.  When the
> document is saved the two-byte code C2 AE is replaced by the four bytes
> C3 82 C2 AE.  It turns out C3 82 is the UTF-8 encoding of the non-ASCII
> character Â, which has a Unicode value of 00C2.  So it appears that the
> Save method is processing the content of the DOM as though it is raw
> Unicode that needs to be UTF-8 encoded, even though the content was read
> from a UTF-8 encoded source and needs no encoding.
>
> How do I turn off this unwanted encoding?
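
The byte pattern described above (C2 AE turning into C3 82 C2 AE) is what you would see if the two UTF-8 bytes were at some point treated as two separate characters, U+00C2 and U+00AE, and then encoded as UTF-8 again. Here is a minimal sketch of that arithmetic in Python, used only because it is compact. The Latin-1 misreading step is an assumption about how such a result can arise, not a confirmed diagnosis of this case, and `hex_bytes` is a hypothetical helper.

```python
# Sketch: reproduce the double-encoding byte pattern from the post.
registered = "\u00ae"                    # REGISTERED SIGN, U+00AE

# Encoding the character once gives the correct UTF-8 bytes.
utf8_once = registered.encode("utf-8")

# Assumption: the two UTF-8 bytes get misread as two one-byte characters
# (as a Latin-1 decode would do), producing U+00C2 followed by U+00AE.
misread = utf8_once.decode("latin-1")

# Encoding the misread text as UTF-8 again doubles up the lead byte.
utf8_twice = misread.encode("utf-8")


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


print(hex_bytes(utf8_once))   # C2 AE
print(hex_bytes(utf8_twice))  # C3 82 C2 AE
```

If that is what is happening, the misreading could sit in either the load or the save path, so the save step alone cannot be assumed to be at fault.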


Does the source file contain the UTF-8 byte order mark at the beginning of
the file?
Does the source file contain an <?xml declaration and does it specify an
encoding?
If so, what encoding does it specify?
How have you determined that the load hasn't misinterpreted the encoding?
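
The first three questions can be answered by looking at the raw bytes at the start of the file. A rough sketch in Python follows, assuming the file uses an ASCII-compatible encoding (UTF-16 input is out of scope here). `inspect_xml_header` is a hypothetical helper for illustration, not part of MSXML.

```python
import codecs
import re

# Matches an encoding pseudo-attribute inside a leading XML declaration.
_DECL_RE = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def inspect_xml_header(data: bytes):
    """Return (has_utf8_bom, declared_encoding_or_None) for raw file bytes."""
    has_bom = data.startswith(codecs.BOM_UTF8)      # EF BB BF
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    match = _DECL_RE.match(body)                    # declaration must come first
    declared = match.group(1).decode("ascii") if match else None
    return has_bom, declared


# Illustrative inputs (made-up sample documents, not the poster's file).
with_bom = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="UTF-8"?><a>\xc2\xae</a>'
no_encoding = b'<?xml version="1.0"?><a/>'
latin1_decl = b"<?xml version='1.0' encoding='ISO-8859-1'?><a/>"

print(inspect_xml_header(with_bom))
print(inspect_xml_header(no_encoding))
print(inspect_xml_header(latin1_decl))
```

For a real file, the first couple of hundred bytes read in binary mode should be enough to pass to the function. A declared encoding that disagrees with the actual bytes is one way a parser could end up misreading UTF-8 content.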


