The Digital Lumberjack

Hacking Code When Life Goes Digital

XML for Compilers


Post date 27 Mar 2010

Thinking about how to implement a tool for measuring source code smelliness, my main concern was extracting the syntactic elements I wanted from a programming language.

Obviously, plain regular expression matching is not the way to go. I decided to opt for a reduced AST representing the common elements of most imperative languages: functions, variables and some type of packaging (namespaces, files and/or classes).

Even though I could easily build an AST on my own with Flex and Bison, I started looking for projects that had already taken care of that. I discovered many tools with no apparent code reuse: they all implemented their own parser for the same language. It did not take long to realize how useful XML could be here, since its nested, tree-shaped structure could represent every detail I wanted.
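To make the idea of a reduced AST more concrete, here is a minimal sketch in Python. It leans on Python's own `ast` module as a stand-in for a real compiler front end, and the element names (`file`, `class`, `function`, `variable`) are my own assumptions, not an existing standard.

```python
import ast
import xml.etree.ElementTree as ET

# Small sample program to analyse (illustrative only).
SOURCE = '''
counter = 0

def increment(step):
    total = counter + step
    return total

class Stack:
    def push(self, item):
        pass
'''


def reduce_to_xml(source: str, filename: str) -> ET.Element:
    """Build a reduced XML tree holding only classes, functions and variables."""
    root = ET.Element("file", name=filename)

    def visit(node: ast.AST, parent: ET.Element) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                visit(child, ET.SubElement(parent, "class", name=child.name))
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                elem = ET.SubElement(
                    parent, "function", name=child.name, line=str(child.lineno)
                )
                visit(child, elem)
            elif isinstance(child, ast.Assign):
                # Only plain names are recorded; attribute and tuple targets are skipped.
                for target in child.targets:
                    if isinstance(target, ast.Name):
                        ET.SubElement(parent, "variable", name=target.id)
            else:
                # Any other syntax is dropped, but may contain nested definitions.
                visit(child, parent)

    visit(ast.parse(source), root)
    return root


if __name__ == "__main__":
    print(ET.tostring(reduce_to_xml(SOURCE, "example.py"), encoding="unicode"))
```

Everything that is not a class, a function or a simple assignment is thrown away, which is exactly the point: a smell-measuring tool rarely needs the full tree.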

Like most of the Aha! moments of my life, a quick look around the internet made me realize that not only had people thought about this before I did, they had done so a long time ago [YZKK01, TCDJR04]. So here I will just summarize (and force myself to think about) what I want from XML.

Portable representation

XML is the bridge that eases the flow of data structures among different systems because binary representation leads to several communication problems.

A compiler is the king of the tools dealing with our code, but there is a whole array of development-related applications that also demand (or benefit from) an understanding of our code: debuggers, static analysers, IDEs, editors, code browsers, bug trackers, documentation systems, requirements management, version control...

Just for the first type of application we came up with debugging data formats such as DWARF and similar variants. What about the rest? Well, every single one of them seems to parse code on its own. When they simply don't bother, they lack functionality that could help programmers.

Even if the LLVM library model makes code reuse easier, a common way to represent the basic elements of a language could reduce the development effort of those tools.

There are also plenty of commercial tools which will not disclose their data structures. These tools could work with the XML representation of their targeted language and the integration with any other tool would be immediate.

Defined conversions

Now you might be thinking that the differences among languages and tools, and the lack of a common standard, are hard stuff to chew. Well, that is where XSLT can help. Just look at this small piece of XML (the example was taken from the Wikipedia article about XSLT):

    <?xml version="1.0" ?>
    <persons>
      <person username="JS1">
        <name>John</name>
        <family-name>Smith</family-name>
      </person>
      <person username="MI1">
        <name>Morka</name>
        <family-name>Ismincius</family-name>
      </person>
    </persons>

With XSLT you could define a set of rules for translating that into something like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
      <name username="JS1">John</name>
      <name username="MI1">Morka</name>
    </root>
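To show what such a set of rules amounts to, here is the same reshaping written by hand in Python with the standard `xml.etree.ElementTree` module. This is not XSLT, only a stand-in for what a stylesheet would declare, and the function name `persons_to_names` is made up for the illustration.

```python
import xml.etree.ElementTree as ET

PERSONS_XML = """<?xml version="1.0" ?>
<persons>
  <person username="JS1">
    <name>John</name>
    <family-name>Smith</family-name>
  </person>
  <person username="MI1">
    <name>Morka</name>
    <family-name>Ismincius</family-name>
  </person>
</persons>
"""


def persons_to_names(xml_text: str) -> ET.Element:
    """Rule 1: <persons> becomes <root>. Rule 2: each <person> becomes a <name>."""
    source = ET.fromstring(xml_text)
    root = ET.Element("root")
    for person in source.findall("person"):
        # Keep the username attribute and the first name, drop the family name.
        name = ET.SubElement(root, "name", username=person.get("username", ""))
        name.text = person.findtext("name", default="")
    return root


if __name__ == "__main__":
    print(ET.tostring(persons_to_names(PERSONS_XML), encoding="unicode"))
```

The difference with XSLT is that the rules there live in a separate stylesheet, so a new output shape does not require touching the program that produced the XML.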

The application to developer tools would be great because the XML output of a compiler could be reshaped and filtered to provide just the basic elements that other tools need. It would even be possible to have different transformations depending on the targeted tool.
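As a rough sketch of per-tool transformations, the Python below prunes a made-up compiler dump down to the elements each tool cares about. The XML layout, the tool names and the `TOOL_VIEWS` table are all assumptions for the sake of the example; no compiler I know of emits this exact format.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical reduced AST as a compiler might dump it.
COMPILER_OUTPUT = """
<file name="stack.c">
  <variable name="capacity"/>
  <function name="push"><variable name="slot"/></function>
  <function name="pop"/>
</file>
"""

# Which element types each (hypothetical) tool wants to see.
TOOL_VIEWS = {
    "code-browser": {"function", "variable"},
    "bug-tracker": {"function"},
}


def keep_only(root: ET.Element, wanted_tags: set) -> ET.Element:
    """Return a pruned copy of the tree; the original is left untouched."""
    result = copy.deepcopy(root)

    def prune(elem: ET.Element) -> None:
        for child in list(elem):
            if child.tag in wanted_tags:
                prune(child)
            else:
                elem.remove(child)

    prune(result)
    return result


def view_for(tool: str, root: ET.Element) -> ET.Element:
    """Pick the transformation that matches the targeted tool."""
    return keep_only(root, TOOL_VIEWS[tool])


if __name__ == "__main__":
    tree = ET.fromstring(COMPILER_OUTPUT)
    print(ET.tostring(view_for("bug-tracker", tree), encoding="unicode"))
```

With XSLT, each entry of that table would simply be a different stylesheet applied to the same compiler output.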

Binary mapping

There will be concerns about the efficiency of these methods. People with a compiler background, used to highly optimized solutions, might be skeptical about the cost of using XML.

Even if we prevent several other tools from parsing the code over and over again, it just doesn't look right to translate the internal binary AST into the text of an XML file, transform it, and turn it back into binary.

Well, that is what XML static binding solves. It defines the binary representation of the XML nodes. A compiler can offer XML generation for the sake of portability. However, the binary representation of a full or reduced AST could also be produced, and other tools would have a common medium to understand the structure of such a dump.
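A very small picture of the binding idea, again in Python: each kind of XML node gets a typed in-memory counterpart, and the conversion in both directions is fixed once. The `Function` class is hand-written and hypothetical; as far as I understand, real binding tools generate this kind of code from a schema rather than have you type it.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Function:
    """In-memory counterpart of a hypothetical <function> node."""

    name: str
    variables: List[str] = field(default_factory=list)

    @classmethod
    def from_xml(cls, elem: ET.Element) -> "Function":
        # Read the XML node once; afterwards tools work on plain objects.
        return cls(
            name=elem.get("name", ""),
            variables=[v.get("name", "") for v in elem.findall("variable")],
        )

    def to_xml(self) -> ET.Element:
        # Emit the portable form again only when another system needs it.
        elem = ET.Element("function", name=self.name)
        for variable in self.variables:
            ET.SubElement(elem, "variable", name=variable)
        return elem


if __name__ == "__main__":
    node = ET.fromstring('<function name="push"><variable name="slot"/></function>')
    print(Function.from_xml(node))
```

Tools that share the binding can exchange the objects directly and fall back to the XML text only at the borders between systems.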

Today compilers are envisioned as a single tool on every programmer's computer. It is possible that the current distributed compilation systems will become more common than we think. In the same way, other developer tools dealing with documentation and requirements might also evolve into distributed environments. An XML representation would be a great asset there.

Conclusions

Most programming languages have common elements such as functions and variables. With a unified representation, some tools could be expanded to provide us with more information, no matter the underlying language.

Imagine a bug tracker that automatically lists the functions modified when solving a bug. A VCS that lists the history of functions, not just the history of files, and that could also define modification rights classes. A documentation and/or requirements tool that adapts to a user-defined notation. Is there any easier way to encourage the development of such tools?
