Altova Mailing List Archives


Re: [xml-dev] Dangers of Copying Text into an XML Document

From: "G. Ken Holman" <gkholman@----------------.--->
To: <xml-dev@-----.---.--->
Date: 9/5/2007 3:44:00 PM

At 2007-09-05 11:10 -0400, Costello, Roger L. wrote:
>I am compiling a list of well-formedness problems that may arise from
>copying text from one document and pasting it into an XML document.
>
>For example, consider this XML document:
>
><?xml version="1.0" encoding="UTF-8"?>
><Document>
>       <Para id="...">...</Para>
></Document>
>
>Suppose that text is copied from a document and pasted into the XML
>document, either as the content of the <Para> element

Use
<![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<Document>
       <Para id="...">...</Para>
</Document>
]]>

>or as the value of the id attribute.

Ouch!  You've got some character editing to do then ... you'll have
to escape each markup-sensitive character individually.

>Here is my current list of problems:
>
>1. The text may contain these reserved characters: {<, >, ', ", &}.
>These characters may introduce syntax errors into the XML document and
>may need to be escaped.

Not a problem with element content ... laborious with attribute values.

>2. The editor that was used to create the text may use a different
>encoding than the XML document's encoding. A binary string that
>represents a character in one encoding may represent a different
>character in another encoding.  Consequently, if the text was created
>in an editor that uses a different encoding than the XML document then
>the characters that result from pasting the text into the XML document
>may not be the same.

Usually the answer isn't related to either application's character 
encoding of the files ... if the application has appropriately 
created internally a set of Unicode characters when translating from 
the external document encoding, then the copy/paste functions between 
Unicode-aware applications will be working with the abstract Unicode 
character, only realizing a particular encoding when the application 
writes a file.

>Example: Word uses Windows-1252 encoding. The hex
>value for the left curly (a.k.a. smart) quote is x93. In UTF-8 encoding
>the hex value for the left curly quote is x201C. In UTF-8 the hex value
>x93 corresponds to a control character.  Copying a left curly quote
>from a Word document and pasting it into a UTF-8 XML document may
>result in the XML document receiving a control character rather than a
>left curly quote.

This discussion came up in just the last few days.  Copying from Word 
to Notepad appeared to use the abstract characters and not the 
encoding sequences.
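
A minimal Python sketch of the distinction between the abstract
character and its encoded bytes. It assumes Python's standard "cp1252",
"utf-8" and "latin-1" codecs; the variable names are illustrative only.
Note that U+201C is the character's Unicode code point, while its UTF-8
form is the three bytes E2 80 9C.

```python
# The left curly quote as an abstract Unicode character (U+201C).
left_quote = "\u201c"

# Each encoding realizes the same character as different bytes.
cp1252_bytes = left_quote.encode("cp1252")  # one byte: 0x93
utf8_bytes = left_quote.encode("utf-8")     # three bytes: 0xE2 0x80 0x9C

# Decoding with the matching encoding recovers the same character,
# which is what a Unicode-aware clipboard hands over.
same_character = cp1252_bytes.decode("cp1252") == utf8_bytes.decode("utf-8")

# Trouble arises only when raw bytes are reinterpreted with the wrong
# encoding. Read as Latin-1, byte 0x93 becomes the control character U+0093.
as_latin1 = cp1252_bytes.decode("latin-1")

# Read as UTF-8, the lone byte 0x93 is not even a valid sequence.
try:
    cp1252_bytes.decode("utf-8")
    misread_ok = True
except UnicodeDecodeError:
    misread_ok = False
```

So the control-character outcome would need the raw Windows-1252 bytes
to be carried across and misread, which is not what a copy/paste of
Unicode characters does.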

>Can you think of other problems that may result from copying text from
>one document and pasting it into an XML document?

"problems"?  I suppose it is just a matter of how XML-aware your 
application doing the pasting is.  If you are just using a simple 
text editor then you can't expect it to do much and the onus is on you.

If you work on the clipboard with Unicode characters then you should 
be insulated from encoding problems.

Pasting into element content and pasting into attribute values have
different rules, so just be sensitive to the requirements.  CDATA is 
a handy way of doing it in element content.  The only characters you 
need to escape in attribute content are "<", "&", and whichever of 
the single or double quotes you use for your attribute literal 
delimiter ... the ">" and other quote do not have to be escaped.
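
The attribute rule above can be sketched in Python as follows. The
helper name escape_attribute and the sample string are hypothetical,
made up for illustration; only the standard xml.etree.ElementTree
module is assumed, and it is used here just to check the result.

```python
import xml.etree.ElementTree as ET


def escape_attribute(value: str, quote: str = '"') -> str:
    """Escape text for an attribute literal delimited by `quote`."""
    if quote not in ('"', "'"):
        raise ValueError("quote must be a single or double quote")
    # Ampersand first, so the entities added below are not re-escaped.
    value = value.replace("&", "&amp;").replace("<", "&lt;")
    # Only the delimiter in use is escaped; ">" and the other quote stay.
    entity = "&quot;" if quote == '"' else "&apos;"
    return value.replace(quote, entity)


# Hypothetical pasted text containing all five sensitive characters.
pasted = 'a < b && say "hi" > it\'s'

double_quoted = '<Para id="' + escape_attribute(pasted, '"') + '"/>'
single_quoted = "<Para id='" + escape_attribute(pasted, "'") + "'/>"

# Both forms parse, and the parser returns the original text.
from_double = ET.fromstring(double_quoted).get("id")
from_single = ET.fromstring(single_quoted).get("id")
```

This sketch covers only the characters named above. Literal tabs and
line breaks in an attribute value are normalized to spaces by the XML
processor, so they would need character references if they must survive.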

Now if the XML content you have contains a CDATA section and you are 
pasting that into element content, you have to create two CDATA 
sections.  This is the challenge in a hands-on exercise in the XML 
class I deliver.
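
One way to do that split, sketched in Python. The function name
wrap_in_cdata and the sample text are hypothetical; the idea is simply
that the inner "]]>" must never appear whole inside a section, so "]]"
closes one section and ">" opens the next. The standard
xml.etree.ElementTree module is assumed only to verify the result.

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting sections at every inner "]]>"."""
    # "]]" ends up at the end of one section, ">" at the start of the next.
    split = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + split + "]]>"


# Hypothetical pasted XML that already contains a CDATA section.
pasted = "<Note><![CDATA[x < y]]></Note>"

wrapped = wrap_in_cdata(pasted)
document = "<Para>" + wrapped + "</Para>"

# The parser joins the adjacent sections back into the original text.
recovered = ET.fromstring(document).text
```

For the sample above this yields exactly two CDATA sections; text with
several inner "]]>" sequences would yield one more section per sequence.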

I hope this helps.

. . . . . . . . . . . Ken

--
Upcoming public training: XSLT/XSL-FO Sep 10, UBL/code lists Oct 1
World-wide corporate, govt. & user group XML, XSL and UBL training
RSS feeds:     publicly-available developer resources and training
G. Ken Holman                 mailto:gkholman@C...
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/m/
Box 266, Kars, Ontario CANADA K0A-2E0    +1(613)489-0999 (F:-0995)
Male Cancer Awareness Jul'07  http://www.CraneSoftwrights.com/m/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal
