Altova Mailing List Archives


Re: [xml-dev] generate common xml shema from multiple xml instances

From: bryan rasmussen <rasmussen.bryan@-----.--->
To: Paul Spencer <xml-dev-list@--------.--.-->
Date: 7/6/2009 8:14:00 AM
I think a weighted likelihood algorithm of some sort would be usable
(although I'm unsure what the weights should be), for example
something like:

User says they would like to generate enumerations - this could
actually be asked as "how would you like to generate enumerations?":
never generate enumerations, only generate enumerations if highly
certain, generate enumerations where likely, always generate
enumerations.

Check if a node across the set of documents is likely to be an
enumeration. Variations to check for would be:

No whitespace - most enumerations contain no whitespace, so give the
user the opportunity to allow whitespace in enumerations. If whitespace
is found and no whitespace is allowed, then the type is just string.

If whitespace is allowed, I would also note that the node would
probably still have some regularity that could be used to determine
whether it is likely to be an enumeration.


Do values repeat in any of these documents:
if, for example, we have a set of 100 nodes with all different values,
we could have an enumeration with 100 values, but if the nodes
sometimes repeat values, that would increase the chance of it being an
enumeration. I think the question as first phrased was about being
able to generate enumerations from input something like

RED
GREEN
ORANGE
BLUE

This gives us pretty good clues to guess it is an enumeration: the
values are all pretty close to each other in size, they are all
letters in a particular alphabet, and they are words (WordNet turns
them up).

I think an algorithm could be written to make this an enumeration
pretty easily. However, in situations like this you like it if your
inputs can guide you: if the rule of the application was that

RED
GREEN
ORANGE
BLUE
RED

Has a 20% higher chance of being an enumeration than
RED
GREEN
ORANGE
BLUE
that would be useful.
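
To make that concrete, here is a minimal Python sketch of the kind of
scoring I mean. Everything in it is an assumption for illustration: the
names (WEIGHTS, THRESHOLDS, enumeration_score, should_generate), the
weights themselves (which, as I said, I am unsure about) and the
"within 3 characters" test for similar size. The wordnet lookup is left
out. Scores are in percentage points so the repeat bonus is a flat 20.

```python
# Hypothetical weights in percentage points; the right values are an open question.
WEIGHTS = {"similar_length": 30, "alphabetic": 30, "repeats": 20}

# Hypothetical mapping from the user's chosen mode to a minimum score.
THRESHOLDS = {"certain": 80, "likely": 50}


def enumeration_score(values, allow_whitespace=False):
    """Return a rough score (0..80) that `values` come from an enumeration."""
    if not values:
        return 0
    # Whitespace rules the enumeration out unless the user allowed it.
    if not allow_whitespace and any(ch.isspace() for v in values for ch in v):
        return 0
    score = 0
    lengths = [len(v) for v in values]
    # Clue: the values are all close to each other in size.
    if max(lengths) - min(lengths) <= 3:
        score += WEIGHTS["similar_length"]
    # Clue: the values consist only of letters.
    if all(v.isalpha() for v in values):
        score += WEIGHTS["alphabetic"]
    # Clue: at least one value repeats across the instances.
    if len(set(values)) < len(values):
        score += WEIGHTS["repeats"]
    return score


def should_generate(score, mode):
    """Apply the user's mode: 'never', 'certain', 'likely' or 'always'."""
    if mode == "never":
        return False
    if mode == "always":
        return True
    return score >= THRESHOLDS[mode]


print(enumeration_score(["RED", "GREEN", "ORANGE", "BLUE"]))
print(enumeration_score(["RED", "GREEN", "ORANGE", "BLUE", "RED"]))
```

With these made-up weights the list with the repeated RED scores 20
points higher than the list without it, so "generate where likely"
accepts both and "only if highly certain" accepts only the second.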






On Thu, Jun 18, 2009 at 8:30 AM, Paul
Spencer<xml-dev-list@b...> wrote:
> XML to schema tools tend to allow the user to set various options. For example, XMLSpy asks if you want to create enumerations or not. If you had a single instance, you might just get
>
> <xs:simpleType name="Color">
>  <xs:restriction base="xs:string">
>    <xs:enumeration value="RED" />
>  </xs:restriction>
> </xs:simpleType>
>
> As Mukul says, it is then up to the schema author to fix this. I saw the original message as trying to improve this "first cut" by taking account of several instance documents. The complexity of the tool goes up markedly with more than one instance. For example, how would you handle this:
>
> Instance 1
> <a>
>  <b/>
>  <c/>
> </a>
>
> Instance 2
> <a>
>  <b/>
>  <d/>
> </a>
>
> Is there a choice between c and d? Or are both optional? If optional, which order do they go in?

I think this part actually falls under what I described above: a
weighted likelihood algorithm plus a choice of user inputs.

When you say "The complexity of the tool goes up markedly with more
than one instance", you mean not just the complexity of programming
but also the complexity of making the right choice. These should
probably be separated.

The complexity of programming goes up markedly with more than one
instance, but the complexity of making the right choice goes down
after some point if there is some guiding repetition:


 Instance 1
 <a>
  <b/>
  <c/>
 </a>

 Instance 2
 <a>
  <b/>
  <d/>
 </a>

Instance 3
 <a>
  <b/>
  <c/>
  <d/>
 </a>

Furthermore, just as you can ask the user "do you want to make
enumerations?", you can ask "do you want to generate choices?"

Choice is generally used less than cardinality games, so I suppose
the default would be to go to minOccurs="0".
The problem then is, as you noted, how to get things into the
right order. The right order problem decreases if we have multiple
instances that adumbrate the order.
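
A rough Python sketch of that default, under my own assumptions: each
instance is reduced to the list of child element names seen under <a>,
any element missing from some instance gets minOccurs 0, and the order
is whatever is consistent with every ordering actually observed. The
function name infer_sequence is made up, and this ignores choices and
maxOccurs entirely.

```python
def infer_sequence(instances):
    """Infer (name, minOccurs) pairs, in order, from observed child-name lists."""
    # Collect element names in first-seen order.
    names = []
    for children in instances:
        for name in children:
            if name not in names:
                names.append(name)

    # before[x] = every name observed earlier than x in some instance.
    before = {name: set() for name in names}
    for children in instances:
        for i, later in enumerate(children):
            before[later].update(children[:i])

    # Repeatedly emit a name whose observed predecessors are all emitted.
    ordered = []
    remaining = list(names)
    while remaining:
        ready = [n for n in remaining if not (before[n] - set(ordered) - {n})]
        if not ready:
            raise ValueError("instances disagree on element order")
        ordered.append(ready[0])
        remaining.remove(ready[0])

    # An element is required only if every instance contains it.
    return [(n, 1 if all(n in c for c in instances) else 0) for n in ordered]


print(infer_sequence([["b", "c"], ["b", "d"], ["b", "c", "d"]]))
```

With only Instance 1 and Instance 2 the sketch still puts c before d,
but only because c was seen first, which is a guess. Instance 3 is what
actually supplies evidence for that order.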

The problems with this are of course -
1. if it takes 100 instances to get something good out you would
probably write a schema :)
2. The likely UI for a tool that allowed you to do this would be crap,
or it would be a command line tool which would be pretty sweet I
guess.

finally - I think actually it would be easier to write something that
generated Schematron schemas based on multiple outputs that handled a
weighted likelihood algorithm in the generation than something that
generated XSD - or maybe I just mean it would be more enjoyable.

Cheers,
Bryan Rasmussen

_______________________________________________________________________

XML-DEV is a publicly archived, unmoderated list hosted by OASIS
to support XML implementation and development. To minimize
spam in the archives, you must subscribe before posting.

[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-dev-unsubscribe@l...
subscribe: xml-dev-subscribe@l...
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php


