
Altova Mailing List Archives


[Summary] Fallacies of Validation

From: "Roger L. Costello" <costello@-----.--->
To: <xml-dev@-----.---.--->
Date: 9/4/2004 8:15:00 PM
Fallacies of Validation

Introduction
The purpose of this document is to identify common "fallacies" about
validation and its role in a system architecture.  These fallacies were
identified through discussions on the xml-dev list; this document is a
record of those discussions.

Fallacies of Validation

1. Fallacy of "THE Schema"
2. Fallacy of Schema Locality
3. Fallacy of Requisite Validation
4. Fallacy of Validation as a Pass/Fail Operation
5. Fallacy of a Universal Validation Language
6. Fallacy of Closed System Validation
7. Fallacy that Validation is Exclusively for Constraint Checking

Each of these fallacies is examined below.

1. Fallacy of "THE Schema"

This fallacy was identified by Michael Kay:

... there's no 
harm in using XML Schema to check data against the business rules, so long as 
you realize this is *an* XML Schema, not *the* XML Schema. We need to stop 
thinking that there can only be one schema.

Len Bullard made a 
similar statement:

... most fundamental errors are 
... to consider only a single schema.

and at another point Len 
states:

 ... fall into the trap of thinking of 
THE schema and not recognizing the system as a declarative ecosystem of schemas 
and schema components.

Both Michael and Len are saying that a system will often need several
schemas rather than just one.

Sidebar

Len was asked to define "declarative ecosystem".  This is a very important
term and underlies much of what is presented here. Here's what "declarative
ecosystem" means:

Every system lives within a world where there is a lot of variety, i.e.,
systems aren't islands.  For example, the Wal-Mart system must coexist with
its supplier systems, its distributor systems, and its retailer systems. One
can think of this system-of-systems as an "ecosystem".  Thus, the Wal-Mart
system resides in an ecosystem.  Each system within the ecosystem has its
own local requirements, which are documented by its own (declarative-based)
schemas.  Thus, not only are there a bunch of systems which must coexist,
there are a bunch of schemas that must coexist.  This ecosystem of schemas
is a "declarative ecosystem".

One more comment on declarative ecosystems.  Len made this important remark:

... [if two 
systems are interoperating in a closed environment then] it doesn't matter how 
singular or multiple they [the schemas] are; but when they are in an ecosystem, 
they typically overlap and exchange information, and adapt as a 
result.

Now back to the fallacy of "THE schema" 
...

Many examples were provided to demonstrate the value of multiple 
validations:

Len provided an example of a distributed reporting 
system:

Look at any large reporting system.  You
can build that up a large schema but given local variations, do you have
sufficient power/force/authority to make them stick or will you be constantly
adjusting them, loosening them, strengthening them, and how will you know which
is the right thing to do?

Here is further elaboration on this.  Suppose that a company has offices in
London, Hong Kong, and Sydney, all reporting to the main office in New York.
With such a geographically dispersed collection of offices, it is easy to
imagine that there will be local variations.  There will probably be some
data that is common to all the offices (Rick Jelliffe calls the constraints
on this type of data invariant constraints), and then there will be
locale-specific data (subject to variant constraints).  So it doesn't seem
reasonable to assume that a single reporting schema would suffice for this
geographically dispersed organization.

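
One way to picture the split between invariant and variant constraints is a shared rule set plus a per-office rule set. The sketch below is only an illustration: the field names (`office`, `revenue`, `currency`, `vat_number`, `gst_number`) and the rules themselves are hypothetical and were not part of the xml-dev discussion.

```python
# Hypothetical report fields and rules, for illustration only.

def has_revenue(report):
    return isinstance(report.get("revenue"), (int, float))

def has_currency(report):
    return bool(report.get("currency"))

# Invariant constraints: apply to every office's report.
INVARIANT = {"revenue is numeric": has_revenue, "currency present": has_currency}

# Variant constraints: apply only to reports from a specific office.
VARIANT = {
    "London": {"VAT number present": lambda r: bool(r.get("vat_number"))},
    "Sydney": {"GST number present": lambda r: bool(r.get("gst_number"))},
}

def failed_rules(report):
    """Return the names of all rules the report fails, common rules first."""
    rules = {**INVARIANT, **VARIANT.get(report.get("office"), {})}
    return [name for name, check in rules.items() if not check(report)]

london = {"office": "London", "revenue": 1200.0, "currency": "GBP"}
hong_kong = {"office": "Hong Kong", "revenue": 900.0, "currency": "HKD"}
print(failed_rules(london))     # ['VAT number present']
print(failed_rules(hong_kong))  # []
```

An office with no entry in `VARIANT` is checked against the invariant rules only, so a new office can be added without touching the common rule set.
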
Mary Holstege and Michael Kay gave examples of the value of multiple schemas 
in a workflow environment:

From Mary Holstege:

... suppose all you care about in some phase of processing is
picking up the IDs in a document. Then you define a minimal schema where
everything is open with the appropriate ID attributes. Maybe you're going to
generate an index. In another phase of processing all you care about is
checking that dates are in the right date range. So you have another minimal
schema that only pays attention to dates.

From Michael 
Kay:

One example I am thinking of is where a document 
is gradually built up in the course of a workflow. At each stage in the workflow 
the validation constraints are different. You can think of each schema as a 
filter that allows the document to proceed to the next stage of 
processing.
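
The two workflow descriptions above can be sketched as a chain of small checks, each looking at one aspect of the document and ignoring the rest. This is a rough sketch, not a schema language: the document, the element and attribute names (`order`, `item`, `shipped`, `id`) and the date range are all made up for the example, and plain Python functions stand in for the "minimal schemas".

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical document; element and attribute names are invented for this sketch.
DOC = """<order id="o1">
  <item id="i1"><shipped>2004-09-01</shipped></item>
  <item id="i2"><shipped>2003-01-15</shipped></item>
</order>"""

def ids_are_unique(root):
    """Stage 1: looks only at id attributes; everything else is 'open'."""
    ids = [el.get("id") for el in root.iter() if el.get("id") is not None]
    return len(ids) == len(set(ids))

def dates_in_range(root, start=date(2004, 1, 1), end=date(2004, 12, 31)):
    """Stage 2: looks only at <shipped> dates."""
    return all(start <= date.fromisoformat(el.text) <= end
               for el in root.iter("shipped"))

def run_workflow(xml_text, stages):
    """Pass the document through each stage; stop at the first stage it fails."""
    root = ET.fromstring(xml_text)
    for name, check in stages:
        if not check(root):
            return name   # name of the stage that blocked the document
    return None           # the document passed every filter

STAGES = [("ids", ids_are_unique), ("dates", dates_in_range)]
print(run_workflow(DOC, STAGES))  # dates
```

The same document passes when only the first stage is applied, which is the point of treating each schema as a filter for one step of the workflow.
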

Finally, Len made a good statement:

Sometimes, a single schema suffices for the whole system.  
Sometimes, you need lots of little ones.

2. Fallacy of Schema Locality

Len identified this 
fallacy:

... most fundamental errors are to consider 
schemas only at the external system junctions ...

What is being said is this: if you build a system with local customs
hardcoded into it but then deploy it into a global environment, that is a
serious mistake. An example of this is Michael Kay's example of interacting
with an online U.S. service that insisted on users providing a state code.
Clearly, the online service was built with local customs hardcoded, but then
deployed in a global environment.

Here's a comment that Len made on this fallacy:

The problem of locale is that it is declared locally but might 
require global management.

3. Fallacy of Requisite Validation

Michael Kay made a very compelling statement about whether validation
should be done at all in certain situations. Michael was responding to the
example of an online service validating a user's address. Here's what
Michael said about the online service's insistence on validating the user's
address:

The 
strategy (validating the user's address) assumes that you know better than your 
customers what constitutes a valid address. Let's face it, you don't, and you 
never will. A much better strategy is to let them (the user) express their 
address in their own terms. After all, that's what they do in old-fashioned 
paper correspondence, and it seems to work quite well.

Michael argues very effectively that in this situation it makes no sense to
do any validation at all!

Jonathan Robie rebutted Michael's argument, saying that validation is
necessary for machine processing:

In old-fashioned paper correspondence, addresses 
are interpreted by human beings, and this is a perfectly fine strategy in an 
application that formats addresses so that they can be read by human beings. But 
if I have a program that needs to be able to identify customers in a given 
region, or that needs to be able to compute the shipping costs before sending an 
item, then my program needs to know how to read the address. I'm not asking the 
customer to provide an address in a format that they might recognize, I'm asking 
the customer to provide an address in a format that my program can use. In that 
context, even if the customer finds it a little painful, I'm going to make them 
communicate at least the basic information.
For addresses, many applications have a certain 
middle ground. They insist on knowing the country and postal code, and perhaps 
street name and number, but allow other information to be added in a way that 
the program might not recognize. One more useful application of partial 
understanding.
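
The "middle ground" Jonathan describes can be sketched as a check that insists on a few fields and passes everything else through untouched. The choice of required fields and the field names below are assumptions for illustration; a real application would pick its own, and the postal code shown is a placeholder.

```python
# Hypothetical required fields; a real application would choose its own.
REQUIRED = ("country", "postal_code")

def check_address(address):
    """Return (missing, extras): required fields that are absent or blank,
    and all other fields, which are passed through without being checked."""
    missing = [f for f in REQUIRED if not str(address.get(f, "")).strip()]
    extras = {k: v for k, v in address.items() if k not in REQUIRED}
    return missing, extras

address = {
    "country": "US",
    "postal_code": "00000",        # placeholder value
    "street": "50A Butters Row",   # free text: the letter in "50A" is not rejected
}
missing, extras = check_address(address)
print(missing)  # []
print(extras)   # {'street': '50A Butters Row'}
```

Because the street line is never parsed, the program cannot compute anything from it, but it also cannot mangle it.
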

Then Frank Manola rebutted Jonathan, emphasizing that constraints are often
unknowable:

Part of the problem, though, is when the people 
defining the constraints think they know the requirements of actually performing 
the activity the program is supposed to help implement, but really don't; or in 
the example you cite, think the address constraints they define will actually 
help deliver the goods, but they actually get in the way. I experience this 
problem quite frequently. My street "number" (some other street "numbers" in our 
neighborhood do too) has a letter in it: 50A Butters Row (don't ask me why: I'm 
not responsible for how addresses are assigned here). Sometimes a program 
accepting addresses won't allow me to enter the letter (or the letter magically 
becomes an apartment number, which it isn't; this is a single-family house), 
because the writers of the constraints think they know how addresses are 
supposed to look. Not having a street number that matches the actual address of 
the house doesn't help delivery very much.

4. Fallacy of Validation as a Pass/Fail Operation

Mary Holstege identified this fallacy.  Here's what she said:

[Many people think that 
validation is a pass/fail operation.] Not so, although lots of people are still 
stuck in that way of thinking, including, alas, a lot of the vendors. The schema 
design goes to great pains to make it possible to do things like this, for 
example: validate a document against a tight schema, and then ask questions of 
the result such as "show me all the item counts that failed validation because 
they were too high"
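
The idea of validating first and asking questions of the result afterwards can be sketched in a few lines. This is not how an XML Schema processor exposes its results; it is a plain Python illustration in which the `Finding` record, the `MAX_COUNT` bound, and the item data are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    path: str      # where in the document the value lives
    rule: str      # which constraint was checked
    ok: bool       # whether the value satisfied the constraint
    value: object  # the value that was checked

MAX_COUNT = 100    # hypothetical upper bound on an item count

def validate(items):
    """Check every item and record a finding for each rule,
    instead of stopping at the first failure."""
    findings = []
    for i, item in enumerate(items):
        count = item["count"]
        path = f"items[{i}]/count"
        findings.append(Finding(path, "count >= 0", count >= 0, count))
        findings.append(Finding(path, "count <= MAX_COUNT", count <= MAX_COUNT, count))
    return findings

report = validate([{"count": 5}, {"count": 250}, {"count": -1}])

# "Show me all the item counts that failed validation because they were too high."
too_high = [f.path for f in report if f.rule == "count <= MAX_COUNT" and not f.ok]
print(too_high)  # ['items[1]/count']
```

The validator returns the full set of findings, and the pass/fail decision (if one is needed at all) is left to whoever queries them.
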

Rick Jelliffe notes that XML Schema validators are limited in the
information they provide about where an error occurred.  In fact, he argues
that an XML Schema validator will often report the wrong location for an
error.  He notes that this problem is not specific to XML Schema validators
but applies to all grammar-based validators (e.g., validators for XML Schema
and RELAX NG).  Rick notes that with Schematron (which is not grammar-based)
you can associate specific error messages with each assertion.  Here's an
example that Rick provided:

<sch:rule context="beatles/member">
  <sch:assert test="count(../member)=4" flag="tooManyBeatles" diagnostics="tmb">
     The beatles should have four members
  </sch:assert>
</sch:rule>
..
<sch:diagnostic id="tmb">
  Check that <sch:value-of select="."/> is a correct Beatle
</sch:diagnostic>

If the number of beatles/member elements is not equal to 4, a specific,
user-defined error message is generated.  This is very nice!

The above example showed generating a user-defined message when an error
occurs in the data.  Rick also notes that Schematron can generate messages
when the data is found to be correct.

The ability to specify user-defined messages is a very important feature, as
will be seen when fallacy #7 is examined.

5. Fallacy of a Universal Validation Language

Dave Pawson identified this fallacy.  He noted that Atom documents cannot
be validated using a single technology:

From [Atom, version] 0.3 
onwards it's not been possible to validate an instance against a single schema, 
not even Relax NG. They need a mix of Schema and 'other' processing before being 
given a clean bill of health.

6. Fallacy of Closed System Validation

This fallacy was identified by 
Len a long time ago.  He was discussing closed versus open 
systems when he stated, "Systems leak.  There's no 
such thing as a closed system".  This is an important comment.  
Many people imagine that they can create a monolithic, invariant schema because 
"there's just me and my well-known trading partners".  This statement fails 
to recognize the existence of a changing world; more precisely, a changing 
ecosystem.

7. Fallacy that Validation is Exclusively for Constraint Checking

I suspect that many people have the same mentality that I had regarding
validation: "An XML instance document arrives, I forward it to a validator
tool, and if the validator tool doesn't complain then I forward the instance
to some software to process it.  If there's an error, I discard the
instance."  Len has opened my eyes to the greater role that validation can
play in a system.  This is discussed below:

Launching Point for Messages

While 
validating an instance document it is reasonable to generate messages - "error 
messages" when errors are encountered, and even "success messages" when instance 
data is found to be conformant.  Thus, validation can result in spawning 
messages that are sent around the system, which activate other parts of the 
system.  Where do the messages go?  What parts of the system receive 
the message? One approach is subscription - a part of the system will 
receive a message only if it has subscribed to receive that type of 
message.  Here are some snippets of a message from Len on this use of 
validation:

... if the expectation of the contract is 
violated, a flag goes up and is sent to whoever has subscribed to that 
event type.

And this snippet:

Aka, 
event-driven intelligence: a flag is raised given recognition of a 
pattern/error/trend and the system sends a subscriber(s) a 
message.
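
The subscription approach described above can be sketched as a small publish/subscribe hub that the validation step reports to. Everything named here is an assumption made for the sketch: the `EventBus` class, the event type names `state.valid` and `state.invalid`, and the tiny set of known state codes.

```python
from collections import defaultdict

class EventBus:
    """Minimal publish/subscribe hub: a handler receives only the
    event types it has subscribed to."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subscribers[event_type]:
            handler(payload)

def validate_state_code(record, bus, known_codes=frozenset({"MA", "NY", "CA"})):
    """Publish a message whether the check succeeds or fails."""
    if record.get("state") in known_codes:
        bus.publish("state.valid", record)
        return True
    bus.publish("state.invalid", record)
    return False

bus = EventBus()
error_log = []
bus.subscribe("state.invalid", error_log.append)  # the logger only hears about failures

validate_state_code({"user": "a", "state": "MA"}, bus)
validate_state_code({"user": "b", "state": "Hampshire"}, bus)
print(error_log)  # [{'user': 'b', 'state': 'Hampshire'}]
```

The validator does not know who is listening; a part of the system that cares about successes can subscribe to `state.valid` without any change to the validation code.
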

Feedback-Mediated Evolution

Suppose that during validation an exception is raised.  As discussed above,
this may result in spawning a message to some part of the system. (For
example, a user enters an invalid value for a U.S. state code, which results
in sending a message to a logger routine.)  Len notes, "an exception is not
an error, it's a learning (feedback) mechanism".  The recipient of the
message can take advantage of the information that the message provides.
(For example, the recipient of the invalid state code message may realize
that the system should not be forcing non-U.S. users to enter a state code;
the recipient then changes the system.)  Thus, validation messages become
valuable feedback, which may be used to facilitate evolution of the system.
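
As a rough illustration of validation messages acting as feedback, the recipient of the invalid state code messages could tally them and flag the rule for review when most rejections come from outside the U.S. The record layout, the 50% threshold, and the function names are hypothetical.

```python
from collections import Counter

def summarize_rejections(rejections):
    """Count rejected state-code submissions per country."""
    return Counter(r["country"] for r in rejections)

def rule_needs_review(rejections, threshold=0.5):
    """Suggest reviewing the state-code rule when a large share of
    rejections come from non-U.S. users."""
    by_country = summarize_rejections(rejections)
    total = sum(by_country.values())
    non_us = total - by_country.get("US", 0)
    return total > 0 and non_us / total >= threshold

# Hypothetical messages collected by the logger routine.
rejections = [
    {"country": "GB", "state": ""},
    {"country": "AU", "state": "NSW"},
    {"country": "US", "state": "ZZ"},
]
print(rule_needs_review(rejections))  # True
```

The decision to change the system still rests with a person; the tally only turns a stream of exceptions into something that can be acted on.
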

Darwinian Selection Process

In Darwinian evolution less fit species are 
filtered out.  Only the fittest species survive.  You may view 
validation as a process in which less fit (erroneous) instances are filtered 
out, and only the fittest (conforming) instances survive.  Here's a snippet 
from Len on this:

... the model of selection based on 
fitness or other criteria can be used to direct the evolution of the system 
based on feedback in the form of messages.

