Re: [xml-dev] The triples datamodel -- was Re: [xml-dev] SemanticWeb permathread, iteration n+1

From: Henrik Martensson <henrik.martensson@--------.-->
To: Elliotte Rusty Harold <elharo@-------.---.--->
Date: 6/6/2004 8:38:00 PM
On Sun, 2004-06-06 at 12:52, Elliotte Rusty Harold wrote:
> At 10:52 AM +0200 6/6/04, Henrik Martensson wrote:
> 
> 
> >Using the XHTML 1.0 doctype declaration constitutes a promise to stick
> >to XHTML, and not to mix in elements and attributes from namespaces
> >that have not been declared in the XHTML DTD.
> 
> Absolutely not! All using the XHTML DTD promises is that the DTD can 
> be found at a certain URL, and used to apply some default attributes 
> values and resolve some entity references. It in no way promises that 
> the document is valid.

It is technically correct that an XML document may carry a doctype
declaration without being valid against the DTD that the declaration
identifies.

The XHTML spec says that strictly conforming documents must have a
doctype declaration (section 3.1.1).

There are a couple of examples in the spec of XHTML documents that use
elements from other namespaces (section 3.1.2). None of them have a
doctype declaration. However, they do declare namespaces for all
extensions.

I haven't found any suggestion that an XML document is an XHTML document
if non-conformant elements are added to the XHTML namespace.

So, you are completely right that the presence of a doctype declaration
doesn't really matter from a strict "I did follow the XML rules to the
letter" perspective. However, using a doctype is still a declaration of
intent in XHTML. So is declaring the XHTML namespace for the elements.

Of course, from a practical perspective, this may not matter much to the
publisher of a web site. However, as we have seen earlier in the thread,
it may matter to consumers of the information.

This is yet another example of how impossible it is for an information
producer to foresee how a consumer will process the information. It is
just these difficulties that make it necessary to have clear,
unambiguous rules for how to mark the content up. A key factor in this
is that the information producer must abide by the rules, and not change
them arbitrarily.

> 
> This is one of two qualitative differences between SGML and XML. In 
> XML validity is optional. It is *not* required. Adding a document 
> type declaration makes no promises that the document is valid. It 
> simply says you might find this DTD to be useful when processing the 
> document.

On the other hand, validating against a DTD tends to be an all or
nothing affair, so if the document isn't intended to be valid, the DTD
isn't very useful. (Except for entity resolution, but why use entities
if one does not wish to use doctypes and DTDs fully? There are other
mechanisms that do the same thing. RELAX NG has the right approach
here.)

> 
> Many developers believe that rigid, conservative (everything not 
> permitted is forbidden) schemas are necessary to produce software. 
> Nothing could be further from the truth. Programming with the 
> expectation that the schema will be followed leads to brittle, 
> inextensible, closed systems that break at the first whiff of change. 
> Robust, flexible software that can handle extensions gracefully 
> begins with the realization that any fixed schema is inadequate for 
> some uses, and that one must be prepared to handle both schemaless 
> and invalid documents.

As far as I know, the notion that the best way to handle a flaw in a
production system is to close the system down, locate the root cause,
fix the root problem, and then start everything up again was first used
in the Japanese car industry, by Toyota, in the fifties. I will not
claim that this is the best way to go everywhere, but in industrial
settings it has proven very effective. (This philosophy of error
management played an important part in the rise of Japanese industry in
the sixties and seventies.) The same idea is expressed in many software
development methodologies, notably in Agile ones. It is basically the
same idea as requiring SGML documents (and many XML documents) to
validate.

Equating this strategy with making brittle software is not correct. For
example, I am involved with working on a rule based system for verifying
financial transactions. Adding new information about the transactions
should of course not break existing rules.

On the other hand, adding new information should not go unnoticed
either, because ignoring new information is dangerous. Consider a system
that evaluates transactions based on the following information:

<person>
  <name>...</name>
  <financial-status>...</financial-status>
</person>

Now we add a bit of information:

<person>
  <name>...</name>
  <legal-status>wanted fugitive</legal-status>
  <financial-status>...</financial-status>
</person>

Even though all the old rules still work, it is clear (to a human) that
a transaction that validates using the old information should perhaps
not validate when using the additional information.

Systems that blithely continue to work, assuming that new information
has no bearing on the tasks they currently perform, are prone to make
mistakes. Whether such mistakes are acceptable or not depends on the
circumstances. Assuming that new markup in an XML document has no
bearing on old processes, or that if it does, the errors are relatively
harmless, is very dangerous.
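
To make the hazard concrete, here is a minimal sketch in Python. It is
an illustration only, not the system described above: the rule, the
value "solvent", and the names KNOWN_CHILDREN, lenient_approve and
unknown_children are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical: the child elements the existing rules were written against.
KNOWN_CHILDREN = {"name", "financial-status"}

# Placeholder values stand in for the "..." in the examples above.
OLD_DOC = (
    "<person><name>A</name>"
    "<financial-status>solvent</financial-status></person>"
)
NEW_DOC = (
    "<person><name>A</name>"
    "<legal-status>wanted fugitive</legal-status>"
    "<financial-status>solvent</financial-status></person>"
)


def lenient_approve(xml_text):
    """Old rule: look only at financial-status, ignore everything else."""
    person = ET.fromstring(xml_text)
    return person.findtext("financial-status") == "solvent"


def unknown_children(xml_text):
    """Return the tags of child elements the old rules know nothing about."""
    person = ET.fromstring(xml_text)
    return [child.tag for child in person if child.tag not in KNOWN_CHILDREN]


# The old rule gives the same answer for both documents.
print(lenient_approve(OLD_DOC), lenient_approve(NEW_DOC))
# Only an explicit check notices that something new has arrived.
print(unknown_children(NEW_DOC))
```

The lenient rule approves both documents. Only the explicit check
reports legal-status, and what to do about it is still a decision for a
human.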

It isn't a matter of one approach being right, and the other being
wrong. It is a matter of which approach is the most appropriate for a
given set of circumstances.

> 
> Again, the question one should ask when presented with an XML 
> document is, "Can I extract the information needed to perform my task 
> from this document?" The question should not be, "Does this document 
> adhere to one very precise and constrained format in the universe of 
> all possible formats?"

The problem is that, as shown above, when new information is added, the
answer to the question "Can I extract the information needed from this
document?" is nearly always "I don't know". This is why validation, and
refusing to process documents with unidentifiable content, is a good
idea.

In some instances, an application can make such decisions based on
metadata:

<person>
  <name>...</name>
  <hair-color importance="low">...</hair-color>
  <legal-status importance="critical">wanted fugitive</legal-status>
  <financial-status>...</financial-status>
</person>

In most cases, though, this does not work very well. For whom is the
information critical? If the same information is used in another
context, it may be irrelevant whether a person is a (suspected) criminal
or not.
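
A rough sketch of such a metadata-driven decision, in Python, might look
like the following. The function name can_process, the set
KNOWN_CHILDREN, and the choice to treat a missing importance attribute
as "critical" are assumptions for the sake of the example, not part of
any real system.

```python
import xml.etree.ElementTree as ET

# Hypothetical: the child elements this consumer already understands.
KNOWN_CHILDREN = {"name", "financial-status"}


def can_process(xml_text):
    """Accept a document only if every unknown child is marked low importance."""
    person = ET.fromstring(xml_text)
    for child in person:
        if child.tag in KNOWN_CHILDREN:
            continue
        # Assumption: an unknown element without the attribute is
        # treated as critical, which is the cautious default.
        if child.get("importance", "critical") != "low":
            return False
    return True


HAIR_ONLY = (
    "<person><name>A</name>"
    '<hair-color importance="low">brown</hair-color>'
    "<financial-status>solvent</financial-status></person>"
)
WITH_LEGAL = (
    "<person><name>A</name>"
    '<hair-color importance="low">brown</hair-color>'
    '<legal-status importance="critical">wanted fugitive</legal-status>'
    "<financial-status>solvent</financial-status></person>"
)

print(can_process(HAIR_ONLY))
print(can_process(WITH_LEGAL))
```

Note that the importance value is fixed by the producer. The sketch has
no way to express that the same element matters to one consumer and not
to another.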

I do agree that conformance to a strict format is not a goal in itself.
Then again, it never has been, and I have never met anyone who claims
that it is. Strictness is just a means to an end. I also believe that
one should not impose strict conformance unnecessarily. However, when
constraints have been imposed, they should not be broken arbitrarily.
Either they should be adhered to, or the constraints themselves should
be changed.

I am not claiming that this is a universal truth. I am just saying that
for the systems I work with, this approach works a lot better than an
approach that may make a system ignore important information.

> 
> Sometimes the answer, is "I don't know" and the document may need to 
> be kicked to a human for further analysis. In practice, however, most 
> systems encounter a fairly limited number of document formats (though 
> that limited number is normally greater than 1) and these can all be 
> recognized and handled, with occasional fallbacks to people to add 
> support for newly recognized formats.

Fallback to people is exactly what validation is for. It is a way for a
system to say "I don't know how to handle this, please help". A
validator does not say "you broke the rules, shame on you!", even though
many people seem to think it does.
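
As a sketch of what that hand-off could look like, assuming a simple
check for unknown child elements rather than a real DTD or schema
validator, consider the Python below. The names triage and
KNOWN_CHILDREN are hypothetical, as is the idea of returning a review
list with a reason for each document.

```python
import xml.etree.ElementTree as ET

# Hypothetical: the child elements the processing system understands.
KNOWN_CHILDREN = {"name", "financial-status"}


def triage(documents):
    """Split documents into those safe to process and those needing a human."""
    accepted, needs_review = [], []
    for xml_text in documents:
        try:
            person = ET.fromstring(xml_text)
        except ET.ParseError as err:
            needs_review.append((xml_text, "not well-formed: %s" % err))
            continue
        unknown = sorted({c.tag for c in person if c.tag not in KNOWN_CHILDREN})
        if unknown:
            # Not an accusation, just a request for help.
            reason = "unknown elements: " + ", ".join(unknown)
            needs_review.append((xml_text, reason))
        else:
            accepted.append(xml_text)
    return accepted, needs_review


docs = [
    "<person><name>A</name><financial-status>ok</financial-status></person>",
    "<person><name>B</name><legal-status>wanted fugitive</legal-status></person>",
]
accepted, needs_review = triage(docs)
print(len(accepted), "accepted")
for _, reason in needs_review:
    print("needs review:", reason)
```

Nothing is rejected outright here. A document the system cannot vouch
for is set aside, with a reason attached, for a person to look at.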

Speaking of limited numbers of variants:

The latest system I worked with had to deal with more than 80 variants
of a single schema. Determining which markup was obsolete, and which
carried new and vital information, could be quite interesting. It was
definitely not something that could be left up to a processing system.
(Though I did create methods for the system to provide hints to
authors.)

At a company where I worked a couple of years ago, a single formatting
system had to deal with more than 170 different input formats, some
related to each other, some not.

Having this many different formats sloshing around is not desirable. It
is, however, quite common; normal, if you will. In neither of the
companies mentioned above were there any technical reasons for having so
many different formats. The reasons were partly political, partly due to
ignorance, partly due to trying to fix problems downstream instead of at
the source, and partly due to misunderstandings. (In my experience only
about a third of all DTD/schema requirements are driven by technical
needs.)

On the other hand, I have seen one or two companies that manage their
information assets very well. As far as I can see, cultural and social
factors are at least as important as technical factors when designing
successful information management systems.

/Henrik
