Altova Mailing List Archives


Re: [xml-dev] An alternative formulation of the document-centric/data-centric XML divide

From: Bob Glushko <glushko@----.--------.--->
To: Sean McGrath <sean.mcgrath@--------.--->, xml-dev@-----.---.---
Date: 6/3/2004 2:12:00 PM
At 11:01 AM 6/3/2004 +0100, Sean McGrath wrote:

Document-centric XML:

    XML in which corpora conforming to schema X exhibit power law
    distributions of the element types in X.


Data-centric XML:

    XML in which corpora conforming to schema X exhibit uniform
    distributions of the element types in X.


Not perfect but useful nonetheless I think. Mixed content is missing for
a start.


Anyway, please take a look at the graphs at:

    http://seanmcgrath.blogspot.com/2004_05_23_seanmcgrath_archive.html#108576202776583412



I'd be very interested in seeing other people's graphs of the tag-share of their XML corpora.

This reminds me of a classic paper by Darrell Raymond and Frank Tompa called "Hypertext and the Oxford English Dictionary" from the Communications of the ACM in 1988 or so. At Waterloo -- Tim Bray was also part of this work at the time -- they had a research program on how to handle large text data and hypertexts like the OED (in preparation for creating electronic versions), and they did a lot of very clever analyses of the dictionary, which had just been converted into SGML from the typesetting tapes. The paper includes several charts showing the distribution of (a) entry length, (b) number of tags per entry, (c) number of cross references, and so on. Either explicitly or implicitly, they show tag-share in the dictionary to have the kind of distribution that Sean has in his analyses.
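For anyone who wants to produce the same kind of tag-share numbers for their own corpus, here is a minimal sketch in Python. It is not Sean's method or the Waterloo analysis, just one plausible way to count element types; the function name `tag_share` and the tiny two-document corpus are made up for illustration, and a real corpus would be read from files.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_share(xml_documents):
    """Return (tag, share) pairs, most frequent first, across all documents."""
    counts = Counter()
    for text in xml_documents:
        root = ET.fromstring(text)
        # iter() visits the root element and every descendant element.
        counts.update(elem.tag for elem in root.iter())
    total = sum(counts.values())
    return [(tag, n / total) for tag, n in counts.most_common()]


# Hypothetical corpus, for illustration only.
corpus = [
    "<doc><p>a</p><p>b</p><p>c</p><note>d</note></doc>",
    "<doc><p>e</p><p>f</p></doc>",
]

for rank, (tag, share) in enumerate(tag_share(corpus), start=1):
    print(f"{rank}\t{tag}\t{share:.3f}")
```

Plotting share against rank on log-log axes is the usual quick check: a roughly straight descending line suggests something power-law-like, while a flat line suggests a near-uniform distribution. That reading is a rule of thumb, not a statistical test.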


Rick Jelliffe has some software that does the same sort of thing, which I saw demonstrated at the GCA XML conferences over the last year or so.


But I don't buy into this data-centric vs doc-centric view of the world. It is obviously a continuum (called the "Document Type Spectrum" in the Document Engineering book I'm writing with Tim McGrath [just about done, MIT Press early 2005]). On one end are purely narrative things and on the other end are purely transactional ones: Moby Dick to invoices. In the middle are hybrid types like catalogs and reference books that have lots of structured content mixed in with narrative content.


I always use Moby Dick as the endpoint when I talk about this because its opening line is "call me XML" or something like that. :-)


-bob glushko

