Altova Mailing List Archives


Re: Java sax UTF-8 parsing troubles -- PLEASE HELP...

From: aleksm@-------.--- (---------- --------)
To: NULL
Date: 9/1/2004 6:31:00 AM
Yes, actually, the string does have some UTF-8 characters, which I am indeed
expecting. I am expecting a combination of Yen currency characters, British
pounds, etc. This is an XML stream that needs to be parsed, modified, and
sent to FOP for PDF generation.

I have always dealt with SAX parsing with plain Strings, and
that has always worked; however, I really did get stuck on this one...


Regards, Alex.

Soren Kuula <dongfang-remove_this@r...> wrote in message news:<3v3Zc.41993$Vf.2222667@n...>...
> Aleksandar Matijaca wrote:
> > Hi there,
> 
> Hi, I can see you got your problem solved, but are you sure it is 
> _really_ doing what you want it to do (and are you aware of what is 
> happening)?
> 
> Assuming the type of your parameter infile is String:
> 
> Character encoding is the translation between character strings and byte 
> strings. I assume also that whatever made the String infile, it has 
> somehow managed to get the right chars out of the bytes in your file.
> 
> I think that this happens:
> 
> > 		infile = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
> > 			+ "<display_values><currency_display>\u00A5 Japanese Yen"
> > 			+ "</currency_display></display_values>";
> > 
> > // the above is perfectly valid UNICODE symbol for Yen
> > 
> > 		XMLReader xr = new org.apache.xerces.parsers.SAXParser();
> > 		
> > 		xr.setContentHandler(this);
> > 		xr.setErrorHandler(this);
> > 		ByteArrayInputStream bi = new
> > ByteArrayInputStream(infile.getBytes());
> 
> 1a) (as before) getBytes() returns the PLATFORM-DEFAULT-ENCODED byte string 
> representation of your String,
> or
> 1b) (after the fix) getBytes() returns the UTF-8-ENCODED byte string 
> representation of your String.
> 2) Your Reader then correctly DEcodes the byte stream into chars again
> > 		Reader reader = new InputStreamReader(bi,"UTF-8");
> > 		InputSource is = new InputSource(reader);
> 3) the setEncoding statement should really have no effect; the 
> InputSource does not have the challenge of turning bytes into chars, as 
> it already has a Reader (a source of chars, as opposed to a Stream, a 
> source of bytes) to extract characters from. In other words, the 
> decoding work should have been done already.
> > 		is.setEncoding("UTF-8");
> > 		xr.parse(is);  // CRASHES RIGHT HERE...
> 
> - because  UTF_DECODE(SOME_OTHER_ENCODING_ENCODE(s)) is not necessarily 
> = s for some String s.
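
A minimal sketch of the mismatch described above, assuming Java 8 or later. The class and method names (RoundTripDemo, roundTrip) are hypothetical, and ISO-8859-1 merely stands in for "some other encoding"; the actual platform default in the original setup is not known.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encode a String to bytes with one charset, then decode those bytes with another.
    static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = s.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String s = "\u00A5 Japanese Yen";

        // Same charset both ways: the String survives the round trip.
        String same = roundTrip(s, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        // ISO-8859-1 encodes the Yen sign as the single byte 0xA5, which is not
        // a valid UTF-8 sequence on its own, so decoding as UTF-8 cannot recover it.
        String mixed = roundTrip(s, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        System.out.println(s.equals(same));
        System.out.println(s.equals(mixed));
    }
}
```

Note that the String constructor substitutes a replacement character for malformed bytes instead of throwing, so this kind of damage can pass silently until something downstream trips over it.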
> 
> I think you read a file into a String (correctly decoded, maybe by 
> coincidence).
> Then you encode that String (a String is just a sequence of chars) into 
> bytes and decode that back into chars again. No need for that !!
> 
> I suggest you let an InputStream read from your file, and use that 
> InputStream DIRECTLY as an argument to your InputSource. Reason: the 
> parser reading from the InputSource is clever enough to UNDERSTAND the 
> <?xml ... encoding="blah"?> declaration IN the XML file PROPER. Then it 
> will automagically use the proper decoding.
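
The suggestion above could be sketched as follows, assuming Java 8 or later. ByteStreamParse and parseText are hypothetical names, a ByteArrayInputStream stands in for the file's InputStream, and the parser comes from the JDK's SAXParserFactory rather than the org.apache.xerces.parsers.SAXParser class used in the thread. No charset is named on the parsing side; the parser is left to read the encoding declaration itself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ByteStreamParse {
    // Parse raw XML bytes and return the concatenated character data.
    static String parseText(byte[] xmlBytes) {
        StringBuilder text = new StringBuilder();
        try {
            XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            xr.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            // A byte stream, not a Reader: decoding is left to the parser.
            xr.parse(new InputSource(new ByteArrayInputStream(xmlBytes)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String body = "<currency_display>\u00A5 Japanese Yen</currency_display>";
        // The same document stored in two encodings, each declared in its prolog.
        byte[] utf8 = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + body)
                .getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body)
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseText(utf8).equals(parseText(latin1)));
    }
}
```

Because the declaration travels with the bytes, the same parsing code handles both encodings without change.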
> 
> If that fails, you may try opening a Reader on an InputStream from the 
> file, and then supply the encoding yourself (taking the risk that one 
> day you will prefer to write your XML files in some other encoding, and 
> your program will not work anymore).
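
The fallback above might look like this sketch, assuming Java 8 or later. ReaderParse, parseWithReader and demo are hypothetical names, and the temporary file exists only to keep the example self-contained. Here the caller picks the charset; with a character stream in place, a SAX parser disregards the encoding declaration in the document, which is exactly the risk mentioned above.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ReaderParse {
    // Decode the file with a caller-chosen charset, then hand chars to the parser.
    static String parseWithReader(Path file, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(Files.newInputStream(file), charset)) {
            XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            xr.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            xr.parse(new InputSource(reader));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return text.toString();
    }

    // Write a small UTF-8 document to a temp file, parse it, and clean up.
    static String demo() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<currency_display>\u00A5 Japanese Yen</currency_display>";
        try {
            Path file = Files.createTempFile("yen", ".xml");
            try {
                Files.write(file, xml.getBytes(StandardCharsets.UTF_8));
                return parseWithReader(file, StandardCharsets.UTF_8);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("\u00A5 Japanese Yen".equals(demo()));
    }
}
```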
> 
> Anyway encoding a String into bytes and then back to a source of chars 
> (a Reader) only adds confusion.
> 
> Soren


