Altova Mailing List Archives


Re: [ANN] XSDBench XML Schema Benchmark 1.0.0 released

From: Boris Kolpackov <boris@-------------.--->
To: Michael Kay <mike@--------.--->
Date: 10/18/2006 4:13:00 PM

Hi Michael,

Michael Kay <mike@s...> writes:

> You say:
>
> "We expect that in most applications the structure validation
>        overhead will greatly outweigh that of the content validation."
>
> Why do you expect that? I would have expected exactly the opposite. A
> process that makes one decision per element node in the document is surely
> likely to be faster than one that has to examine each character.

I think you would agree that values of datatypes which require examination
of every character in order to be validated (e.g., numbers, token, name,
enum, regex) tend to be rather short. So the ratio of data that will need
to be examined character-by-character to the total XML document size should
generally be rather small.
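
As a rough illustration of that ratio, here is a minimal Python sketch. The sample document, its element names, and the choice of which elements count as "checked" are all made up for the example; they are not taken from XSDBench.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names and values are made up.
SAMPLE = (
    "<order>"
    "<id>10472</id>"
    "<status>shipped</status>"
    "<note>Leave the parcel with the neighbour if nobody is home.</note>"
    "</order>"
)

# Assumption: these elements have types that need a per-character check
# (a number and an enum); everything else is treated as xsd:string.
CHECKED = {"id", "status"}


def checked_ratio(xml_text, checked_names):
    """Return (checked_chars, total_chars, ratio) for one document."""
    root = ET.fromstring(xml_text)
    checked = sum(
        len(el.text or "") for el in root.iter() if el.tag in checked_names
    )
    total = len(xml_text)
    return checked, total, checked / total


checked, total, ratio = checked_ratio(SAMPLE, CHECKED)
print(f"{checked} of {total} characters ({ratio:.0%}) need a per-character check")
```

The parser still has to walk every character of the document, markup included; only the text of the "checked" elements is examined a second time by content validation.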

I compared the results of the test for Xerces-C++ with validation enabled
and disabled (remember, the schema does not use anything except xsd:string,
so it is "pure" structure validation). It turned out that about 60% of the
time is spent on XML parsing and 40% on structure validation.
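
The arithmetic behind such a split can be sketched as follows, assuming it is derived from two runs of the same test: one with validation disabled (parsing only) and one with validation enabled (parsing plus structure validation). The function name and the timings are hypothetical; the numbers are chosen only to reproduce a 60/40 split and are not measurements.

```python
def split_overhead(parse_only, parse_and_validate):
    """Return (parsing_share, validation_share) of the validating run."""
    if not 0 < parse_only <= parse_and_validate:
        raise ValueError("expected 0 < parse_only <= parse_and_validate")
    # Whatever the validating run costs beyond plain parsing is
    # attributed to validation.
    validation = parse_and_validate - parse_only
    return parse_only / parse_and_validate, validation / parse_and_validate


# Hypothetical timings in seconds, not real benchmark results.
parsing_share, validation_share = split_overhead(6.0, 10.0)
print(f"parsing {parsing_share:.0%}, structure validation {validation_share:.0%}")
```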

Now if we could compare XML parsing to content validation, we could get
an idea of whether structure validation is more expensive. I would say
(proper) XML parsing would be a lot more expensive than content
validation because:

 1) The XML parser has to examine the whole document character-by-character,
    which is a lot more than what will be validated (see above).

 2) The XML parser will need to convert the XML document encoding to the
    parser's internal encoding (in the case of Xerces-C++, from UTF-8 to
    UTF-16).

 3) The XML parser will need to allocate memory for element/attribute
    names and their values. Most content validation can happen
    without allocating any extra memory.
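
To give an idea of how light a typical content check can be, here is a simplified Python sketch of a lexical check for an integer-like value. It is an illustration only, not how Xerces-C++ or any other validator implements it, and the function name is made up. A real xsd:int check would also collapse surrounding whitespace and enforce the value range.

```python
def is_integer_lexical(value):
    """Check for an optional sign followed by one or more ASCII digits."""
    # Skip a single leading sign, if present.
    start = 1 if value[:1] in ("+", "-") else 0
    if start == len(value):
        return False  # empty value, or a sign with no digits
    # Single pass over the remaining characters; the only state kept
    # is the current index.
    for i in range(start, len(value)):
        if not "0" <= value[i] <= "9":
            return False
    return True


for sample in ("10472", "-7", "12a", ""):
    print(repr(sample), is_integer_lexical(sample))
```

The work is proportional to the length of the value, which, as argued above, tends to be short for this kind of datatype.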

> Of course it may be true that most of the content is xs:string, but who
> knows.

I think most of the content is strings, with numbers and enums/regexes now
and then. But I agree it is all pure speculation until we run some tests.


-boris

--
Boris Kolpackov
Code Synthesis Tools CC
http://www.codesynthesis.com
tel: +27 76 1672134
fax: +27 21 5526869

