Newsgroup: comp.text.xml
Subject: Re: Java sax UTF-8 parsing troubles -- PLEASE HELP...
Date: 9/1/2004 6:31:00 AM

Yes, actually, the string does have some UTF-8 characters which I am indeed
expecting. I am expecting a combination of Yen currency characters, British
pounds etc... This is an XML stream that needs to be parsed, modified, and
sent to FOP for PDF generation.
I have always dealt with SAX parsing with plain Strings, and
that has always worked; however, I really did get stuck on this one...
Regards, Alex.
Soren Kuula <dongfang-remove_this@r...> wrote in message news:<3v3Zc.41993$Vf.2222667@n...>...
> Aleksandar Matijaca wrote:
> > Hi there,
>
> Hi, I can see you got your problem solved, but are you sure it is
> _really_ doing what you want it to do (and are you aware of what is
> happening)?
>
> Assuming the type of your parameter infile is String:
>
> Character encoding is the translation between character strings and byte
> strings. I assume also that whatever made the String infile, it has
> somehow managed to get the right chars out of the bytes in your file.
>
> I think that this happens:
>
> > infile = "<?xml version=\"1.0\"
> > encoding=\"UTF-8\"?><display_values><currency_display>\u00A5 Japanese
> > Yen</currency_display></display_values>";
> >
> > // the above is perfectly valid UNICODE symbol for Yen
> >
> > XMLReader xr = new org.apache.xerces.parsers.SAXParser();
> >
> > xr.setContentHandler(this);
> > xr.setErrorHandler(this);
> > ByteArrayInputStream bi = new
> > ByteArrayInputStream(infile.getBytes());
>
> 1a) (as before): getBytes() returns the PLATFORM_DEFAULT-ENCODED byte
> string representation of your String.
> or
> 1b) (after the fix): getBytes() returns the UTF-ENCODED byte string
> representation of your String.
> 2) Your Reader then correctly DEcodes the byte stream into chars again
> > Reader reader = new InputStreamReader(bi,"UTF-8");
> > InputSource is = new InputSource(reader);
> 3) the setEncoding statement should really have no effect; the
> InputSource does not have the challenge of turning bytes into chars, as
> it already has a Reader (a source of chars, as opposed to a Stream, a
> source of bytes) to extract characters from. In other words, the
> decoding work should have been done already.
> > is.setEncoding("UTF-8");
> > xr.parse(is); // CRASHES RIGHT HERE...
>
> - because UTF_DECODE(SOME_OTHER_ENCODING_ENCODE(s)) is not necessarily
> = s for some String s.
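To make the point above concrete, here is a minimal sketch of an encode/decode round trip. It assumes ISO-8859-1 as the "some other encoding" (the actual platform default in the original setup is not known), and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // Encode a String to bytes with one charset, then decode with another.
    static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = s.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String yen = "\u00A5 Japanese Yen";

        // Same charset both ways: the String survives.
        String same = roundTrip(yen, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(same.equals(yen));  // true

        // ISO-8859-1 (assumed stand-in for a platform default) writes the Yen
        // sign as the single byte 0xA5, which is not valid UTF-8 on its own.
        String mixed = roundTrip(yen, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(mixed.equals(yen)); // false
    }
}
```

Note that `new String(bytes, charset)` quietly substitutes U+FFFD for the malformed byte. A stricter decoder, such as the one inside an XML parser, may throw instead, which would be consistent with a crash at `xr.parse(is)`.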
>
> I think you read a file into a String (correctly decoded, maybe by
> coincidence).
> Then you encode that String (a String is just a sequence of chars) into
> bytes and decode that back into chars again. No need for that !!
>
> I suggest you let an InputStream read from your file, and use that
> InputStream DIRECTLY as an argument to your InputSource. Reason : The
> InputSource may be clever enough (I think it is) to UNDERSTAND the <?xml
> encoding="blah"... IN the XML file PROPER. Then it will automagically
> use the proper decoding.
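The suggestion above might look roughly like the sketch below. It assumes the JDK's built-in parser obtained through `SAXParserFactory` rather than the Xerces class from the original code, and it uses an in-memory byte array as a stand-in for the file; in a real program the stream would come from something like `new FileInputStream(path)`. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class StreamParseDemo {
    // Hands raw bytes to the parser and collects all character data.
    // No charset is named here: the parser is left to work it out from
    // the XML declaration at the start of the byte stream.
    static String parseText(InputStream in) throws Exception {
        StringBuilder text = new StringBuilder();
        XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        xr.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        xr.parse(new InputSource(in));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<display_values><currency_display>\u00A5 Japanese Yen"
                + "</currency_display></display_values>";
        // Stand-in for a file that was saved on disk as ISO-8859-1.
        byte[] fileBytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseText(new ByteArrayInputStream(fileBytes)));
    }
}
```

The document here is deliberately not UTF-8, to show that the declared encoding is what gets used when the parser is given bytes.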
>
> If that fails, you may try opening a Reader on an InputStream from the file,
> and then supply the encoding yourself (taking the risk that one day you
> will prefer to write your XML files in some other encoding, and your
> program will not work anymore).
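For comparison, the fallback described above (decoding the bytes yourself with a Reader) could be sketched as follows. Again this assumes the JDK's built-in parser and an in-memory stand-in for the file, and the names are invented for the example. As far as I know, the SAX `InputSource` documentation says that when a character stream is supplied the parser reads it directly and disregards the encoding declaration, so the charset passed here has to match how the file was really written.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ReaderParseDemo {
    // Decodes the bytes with a caller-chosen charset, then gives the parser
    // a Reader, so the parser only ever sees chars.
    static String parseText(InputStream in, Charset charset) throws Exception {
        StringBuilder text = new StringBuilder();
        Reader reader = new InputStreamReader(in, charset);
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(reader),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }
                });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<display_values><currency_display>\u00A5 Japanese Yen"
                + "</currency_display></display_values>";
        // Stand-in for a file that was saved on disk as UTF-8.
        byte[] fileBytes = xml.getBytes(StandardCharsets.UTF_8);
        System.out.println(parseText(new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8));
    }
}
```

The hard-coded charset is exactly the risk mentioned above: if the files are later written in another encoding, this call has to change with them.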
>
> Anyway encoding a String into bytes and then back to a source of chars
> (a Reader) only adds confusion.
>
> Soren
