Altova Mailing List Archives


Re: RFC: thoughts for a "streamlined" XML syntax variant...

From: BGB <cr88192@-------.--->
To: NULL
Date: 5/11/2012 5:01:00 PM

On 5/11/2012 1:44 PM, Peter Flynn wrote:
> On 11/05/12 18:40, BGB wrote:
>> one issue partly in the case of XML for its use in structured data
>> is its relative verbosity, especially in cases where it is entered by
>> hand or being read by a human (say, for debugging reasons, ...).
>
> I think this was expected to be a very rare case, which is why the spec
> says that terseness in XML markup is of minimal importance.
>

fair enough.

I mostly use it for things like compiler ASTs, network protocols, and 
file-formats (generally structured-data).


the forms of XML I currently use are:
raw/plaintext XML;
deflated plaintext XML;
a binary format (similar to an "improved" version of WBXML, with a few 
more features and density improvements; both are byte-based).
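
For a rough feel of what the "deflated plaintext XML" option buys, here is a minimal sketch using Python's standard zlib module. The sample document is made up for illustration (a small AST-like fragment repeated many times), so the sizes it prints say nothing about real files.

```python
import zlib

# Hypothetical sample: a small AST-like fragment repeated many times,
# standing in for a large, repetitive XML dump.
fragment = '<binary op="&lt;"><ref name="x"/><number value="3"/></binary>\n'
raw = ("<block>\n" + fragment * 200 + "</block>\n").encode("utf-8")

packed = zlib.compress(raw, 9)      # zlib-wrapped deflate stream, max level
restored = zlib.decompress(packed)  # round-trips to the original bytes

print(len(raw), "bytes raw,", len(packed), "bytes deflated")
```

Repeated tag and attribute names are the kind of redundancy deflate handles well, but the result is no longer readable in a text editor.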

I have another format I could use, but going into it likely pushes 
topicality (it is a Huffman-compressed binary serialization format, 
currently used for sending messages over a TCP socket in a 3D game 
engine, but this doesn't have much in particular to do with XML, as the 
message format it is currently used with is S-Expression based, rather 
than XML based).

but, yeah, I guess originally XML was intended for markup of mostly 
textual documents (like in HTML or similar), rather than for 
representing structured data (or being used for humans viewing said 
structured data as debugging output).

I wonder if anyone ever really considered "scene-graph delta-update 
messages in a 3D FPS game" as a possible use-case for XML either? 
somehow I doubt it (I had intended to do this originally, despite 
eventually opting for a different representation for said deltas).

even as such, I did end up aggressively compressing them (via a 
specialized encoding scheme), as otherwise the bandwidth usage would 
have been a bit steep for a typical end-user internet connection.


>> so, the thought here would be to allow a "modest" syntax extension
>> (probably would be limited to particular implementations which
>> support the extension).
>>
>> more specifically, I was considering it as a possible extension
>> feature to my own implementation, but have some doubts given that,
>> yes, this would be non-standard extension. note that there probably
>> would be a feature to manually "enable" it, such as to avoid
>> necessarily breaking compatibility.
>
> Switchable is good.
>

yeah.


>> in my case, the current primary use is for things like compiler ASTs,
>> where it competes some with the use of S-Expressions for ASTs (Lisp
>> style, not the "Rivest" variant / name-hijack). note that these ASTs
>> normally never leave the application which created them, so the
>> impact of using a non-standard syntax when serializing them is likely
>> fairly small.
>>
>> example, say that a person has an expression like:
>> <if>
>>      <cond>
>>          <binary op="&lt;">
>>              <ref name="x"/>
>>              <number value="3"/>
>>          </binary>
>>      </cond>
>>      <then>
>>          <funcall name="foo">
>>              <args/>
>>          </funcall>
>>      </then>
>> </if>
>>
>> representing, say, the AST of the statement "if(x<3)foo();".
>>
>> the parser and printer could use a more compact encoding, say:
>> <if
>>      <cond<binary op="&lt;"<ref name="x"/>  <number value="3"/>>>
>>      <then<funcall name="foo"<args/>>>>
>
> This syntax (or very nearly) is already available in SGML:
>
> <!doctype if [
> <!element if - - (cond,then)>
> <!element cond - - (binary)>
> <!element binary - - (ref,number)>
> <!element number - - empty>
> <!element then - - (funcall)>
> <!element funcall - - (args)>
> <!element (args,ref) - - empty>
> <!attlist binary op cdata #required>
> <!attlist (ref,funcall) name cdata #required>
> <!attlist number value cdata #required>
> <!entity lt sdata "<">
> ]>
> <if<cond<binary op="&lt;"<ref name=x<number value="3"></></>
>     <then<funcall name=foo<args></></></>
>

fair enough.


>> which would be regarded as functionally-equivalent to the prior
>> expression (and would generate equivalent DOM trees when read back in).
>>
>> with the following rules:
>> <tag>...</tag> and <tag/> are the same as before.
>>
>> while:
>> <tag<...>  ...>
>> would use an alternate parsing strategy, where ">" is significant (since
>> the prior tag didn't actually end), and indicates the end of the
>> expression (the magic here would be seeing another "<" within a tag).
>>
>> similarly, maybe "<[[" could also be parsed as a shorthand for
>> "<![CDATA[" as well (and would also match nicer with the closing bracket
>> "]]>").
>>
>> note that it would be possible to mix them, as in:
>> <foo>  <bar<baz/>>  </foo>
>> and:
>> <foo<bar>  <baz/>  </bar>>
>>
>> maybe also a different "name" would be a good idea, like "XEML" or
>> similar would make sense, such as to reduce possible confusion.
>>
>> any thoughts or relevant information to look at?...
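
The rules quoted above can be sketched as a small recursive-descent parser. This is only an illustration in Python, not the implementation discussed in this thread: the names parse_compact, parse_element, parse_content and shape are hypothetical, and it assumes double-quoted attributes, only the five predefined entities, and no comments, CDATA (including the proposed "<[[" shorthand) or processing instructions. Whitespace-only text is dropped, and inside the compact form a literal ">" in text would have to be written as "&gt;".

```python
import re
import xml.etree.ElementTree as ET

NAME = re.compile(r"[A-Za-z_][\w.-]*")
ATTR = re.compile(r'\s+([A-Za-z_][\w.-]*)\s*=\s*"([^"]*)"')
ENTITY = re.compile(r"&(lt|gt|amp|quot|apos);")
ENTITY_TEXT = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}


def unescape(text):
    """Expand the five predefined XML entities (nothing else is supported)."""
    return ENTITY.sub(lambda m: ENTITY_TEXT[m.group(1)], text)


def skip_space(src, pos):
    while pos < len(src) and src[pos].isspace():
        pos += 1
    return pos


def parse_content(src, pos, parent, compact):
    """Parse text and child elements into parent; return the stop position."""
    # In the compact form a bare ">" ends the element, so it also ends text.
    stop = re.compile(r"[<>]" if compact else r"<")
    last = None
    while pos < len(src):
        found = stop.search(src, pos)
        end = found.start() if found else len(src)
        text = unescape(src[pos:end])
        if text.strip():  # whitespace-only text is dropped in this sketch
            if last is None:
                parent.text = (parent.text or "") + text
            else:
                last.tail = (last.tail or "") + text
        pos = end
        if not src.startswith("<", pos) or src.startswith("</", pos):
            break
        last, pos = parse_element(src, pos)
        parent.append(last)
    return pos


def parse_element(src, pos):
    """Parse one element starting at src[pos]; return (element, next position)."""
    name = NAME.match(src, pos + 1) if src.startswith("<", pos) else None
    if name is None:
        raise ValueError(f"expected an element at offset {pos}")
    elem = ET.Element(name.group())
    pos = name.end()
    attr = ATTR.match(src, pos)
    while attr is not None:
        elem.set(attr.group(1), unescape(attr.group(2)))
        pos = attr.end()
        attr = ATTR.match(src, pos)
    pos = skip_space(src, pos)
    if src.startswith("/>", pos):            # <tag/>
        return elem, pos + 2
    if src.startswith(">", pos):             # <tag>...</tag>
        pos = parse_content(src, pos + 1, elem, compact=False)
        closing = "</" + elem.tag
        if not src.startswith(closing, pos):
            raise ValueError(f"expected {closing}> at offset {pos}")
        pos = skip_space(src, pos + len(closing))
    elif src.startswith("<", pos):           # <tag<...> ...>
        pos = parse_content(src, pos, elem, compact=True)
    if not src.startswith(">", pos):
        raise ValueError(f"expected '>' at offset {pos}")
    return elem, pos + 1


def parse_compact(src):
    elem, pos = parse_element(src, skip_space(src, 0))
    if skip_space(src, pos) != len(src):
        raise ValueError("trailing content after the root element")
    return elem


def shape(elem):
    """Tag, attributes and children only, for comparing two trees."""
    return (elem.tag, dict(elem.attrib), [shape(child) for child in elem])


COMPACT = '''<if
    <cond<binary op="&lt;"<ref name="x"/>  <number value="3"/>>>
    <then<funcall name="foo"<args/>>>>'''
VERBOSE = ('<if><cond><binary op="&lt;"><ref name="x"/><number value="3"/>'
           '</binary></cond><then><funcall name="foo"><args/></funcall>'
           '</then></if>')
print(shape(parse_compact(COMPACT)) == shape(ET.fromstring(VERBOSE)))
```

The only decision point is the character that follows the tag name and attributes: "/>" means an empty element, ">" means the usual form, and another "<" switches to the compact form, where the element ends at the first unmatched ">".
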
>
> I think you'd need a special editor: if the objective is to abbreviate
> the syntax, there is a delicate breakpoint between the denseness of the
> reduced syntax and the ability of the creator/user to understand it.
>

I hadn't considered this case.
if the code is being viewed/edited in a generic text editor (such as 
Notepad), it shouldn't make too much of a difference, but granted a 
specialized XML editor could very well get confused.

in this case, though, I doubt that such a change would render the syntax 
unreadable (to humans), and it could reduce verbosity and sprawl 
somewhat in the intermediate data files spit out by the application, 
which is currently the main problem area (finding things in multi-MB 
files is hard enough as-is, much less when the AST for a single function 
in a C-like syntax can span a fairly large number of pages).

I also don't think reading the more compact format would be much 
different from a person trying to read S-Expressions.

this is partly because a C-style (programming language) syntax is fairly 
information-dense, but when parsed into ASTs and then dumped as XML, 
there is a significant amount of expansion.
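
To put a rough number on that expansion, here is a sketch of a printer for the compact form, written against Python's xml.etree.ElementTree. The function name to_compact is hypothetical, and falling back to ordinary <tag>...</tag> syntax for elements that contain text is an assumption of this sketch, not something settled in this thread.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def to_compact(elem):
    """Serialize an element, using <tag<child/>> where there is no text."""
    attrs = "".join(f" {key}={quoteattr(value)}"
                    for key, value in elem.attrib.items())
    children = list(elem)
    has_text = bool((elem.text or "").strip()) or any(
        (child.tail or "").strip() for child in children)
    if not children and not has_text:
        return f"<{elem.tag}{attrs}/>"
    if not has_text:
        # Element-only content: the closing tag shrinks to a single ">".
        inner = "".join(to_compact(child) for child in children)
        return f"<{elem.tag}{attrs}{inner}>"
    # Assumption: mixed or text content keeps the standard syntax.
    parts = [f"<{elem.tag}{attrs}>", escape(elem.text or "")]
    for child in children:
        parts.append(to_compact(child))
        parts.append(escape(child.tail or ""))
    parts.append(f"</{elem.tag}>")
    return "".join(parts)


verbose = ('<if><cond><binary op="&lt;"><ref name="x"/><number value="3"/>'
           '</binary></cond><then><funcall name="foo"><args/></funcall>'
           '</then></if>')
compact = to_compact(ET.fromstring(verbose))
print(compact)
print(len(verbose), "characters verbose,", len(compact), "characters compact")
```

Even with all indentation stripped from the verbose form, the saving here comes purely from the dropped closing tags, so it grows with the length of the element names.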


> What about writing up the method as a paper for the Balisage (markup)
> conference? That's really the place to discuss new syntaxes.
>

I don't know much about it; I hadn't heard of it before.


> ///Peter
>
