Altova Mailing List Archives
Re: Non-Ascii Characters modified when Document Loaded
Date: 3/20/2007 9:34:00 AM
"mikes" <mikes@d...> wrote in message news:4F3D7783-9493-4335-B9E8-7596E7A80FA4@m...... > I was hoping that I was doing something really dumb and there would be a > quick solution that would not require a lot of detail. But I guess that's > not the case, so here is the detail. > > My application is written in C++. I am using VS 2003 (soon to be upgraded > to VS 2005 I hope) and Platform SDK version 3790.1830. I'm using the Load > method of the IXMLDOMDocument class to load the xml document file and the > Save method to write the document back to a file. The xml document I am > loading is UFT-8 encoded. > > After a much deeper look at what is happen I have found that the Load method > is loading the XML correctly from the file, it is not stepping on the > non-ASCII characters in the XML. It is the Save method that is causing the > problem. It is UFT-8 encoding characters that have already been encoded as > UFT-8. An example will make this clearer. > > The test XML document I am using has a ® (registered symbol) in the data > associated with one of its elements. Its unicode value is 00AE. When the > document is read into the DOM the symbol's code is C2 AE which is the correct > UFT-8 encoding of the register symbol character. When the document is saved > the two character C2 AE code is replaced by the four characters C3 82 C2 AE. > It turns out C3 82 is the UTF-8 encoding of the non-ASCII character Â which > has a unicode value of 00C2. So it appears that the Save method is > processing the content of the DOM as though it is raw unicode that needs to > be UFT-8 encoded, even though the content was read from a UFT-8 encoded > source and needs no encoding. > > How do I turn off this unwanted encoding? Does the source file contain the UTF-8 byte order mark at the beginning of the file? Does the source file contain a <?xml declaration and does it specify an encoding? If so what encoding does it specifiy? 
How have you determined that the load hasn't misinterpreted the encoding?
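
For illustration, the byte pattern described in the quoted message can be reproduced outside MSXML. The sketch below is in Python rather than C++, and it assumes that the bytes C2 AE were at some point read as a single-byte encoding (ISO-8859-1 or Windows-1252) and then written out as UTF-8. That is one common way to arrive at C3 82 C2 AE, not a confirmed diagnosis of this case. The helper names `double_encode` and `sniff_xml_encoding` are made up for this example; the second one only reports whether a UTF-8 byte order mark is present and what encoding the XML declaration names.

```python
import codecs
import re


def double_encode(utf8_bytes: bytes, wrong_encoding: str = "latin-1") -> bytes:
    """Decode UTF-8 bytes with the wrong single-byte encoding, then re-encode as UTF-8."""
    return utf8_bytes.decode(wrong_encoding).encode("utf-8")


def sniff_xml_encoding(data: bytes):
    """Return (has_utf8_bom, declared_encoding or None) for the start of an XML file."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    head = data[len(codecs.BOM_UTF8):] if has_bom else data
    # Look for encoding="..." inside the XML declaration only (stops at the first '>').
    match = re.match(rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', head)
    declared = match.group(1).decode("ascii") if match else None
    return has_bom, declared


registered = "\u00ae".encode("utf-8")      # U+00AE, the registered sign
print(registered.hex(" "))                 # c2 ae
print(double_encode(registered).hex(" "))  # c3 82 c2 ae
```

If the file has no byte order mark and its declaration names something other than UTF-8 (or the load path otherwise treats the bytes as a single-byte encoding), each byte of C2 AE becomes its own character on load, and a UTF-8 save then produces exactly the four bytes reported.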