This blog is a reminder that cheating in software development can get you into big trouble.
Sometimes developers get really, really lazy, or are pressured into writing something with overly simplified data structures. Almost every time this happens, it comes back to bite you in the butt! Sometimes the problems show up immediately, and other times they take months or years to surface (especially in integrated systems).
I was recently asked to look at some production code that was screwing up! The code read in an XML document, changed the document title, and sent it along to a different portion of the software. When I see this type of issue, I like to compare the input from when the software was working with the input after it stopped working. I saw immediately what was wrong.
The system providing the XML document had recently upgraded their software and their new version exported the XML document slightly differently. The XML that worked looked something like this:
<document>
  <title>Document Title</title>
  <sections>
    <section>
      <title>Section 1 Title</title>
    </section>
    <section>
      <title>Section 2 Title</title>
    </section>
    <section>
      <title>Section 3 Title</title>
    </section>
  </sections>
</document>
As you can see, the node structure has the document title at the top, and each section also has its own title. The code that was manipulating this XML was written in Java and used indexOf and substring to do the dirty work. It worked roughly like this:
// Naive approach: treats the XML as plain text and rewrites the first <title> it finds.
public static String changeDocumentTitle(String xmlString, String title) {
    int startIndex = xmlString.indexOf("<title>");
    int endIndex = xmlString.indexOf("</title>");
    String newXmlString = xmlString.substring(0, startIndex);
    newXmlString += "<title>" + title;
    newXmlString += xmlString.substring(endIndex);
    return newXmlString;
}
Technically this worked OK, because the first occurrences of '<title>' and '</title>' were the correct target. When the XML document above is run through this code, the outcome is this:
<document>
  <title>New Document Title</title>
  <sections>
    <section>
      <title>Section 1 Title</title>
    </section>
    <section>
      <title>Section 2 Title</title>
    </section>
    <section>
      <title>Section 3 Title</title>
    </section>
  </sections>
</document>
When I called this static method I set title to “New Document Title” and, as you can see, the topmost title node was modified as I wanted.
After the upgrade on the sending system I noticed that the XML was slightly different. It looked something like this:
<document>
  <sections>
    <section>
      <title>Section 1 Title</title>
    </section>
    <section>
      <title>Section 2 Title</title>
    </section>
    <section>
      <title>Section 3 Title</title>
    </section>
  </sections>
  <title>Document Title</title>
</document>
Whoa! Now the document title is at the bottom of the document. So clearly, the system providing the XML document made a rather significant change, right? Well, that is what some folks think; however, I don't see it that way. My view is: the XML document is still valid, so it is the code consuming the XML that is at fault. Right? I mean, they did make a structural change to the XML document, but it is still valid; therefore, it should be parsed and modified correctly.
When this XML is processed by the same code above the outcome is totally different:
<document>
  <sections>
    <section>
      <title>New Document Title</title>
    </section>
    <section>
      <title>Section 2 Title</title>
    </section>
    <section>
      <title>Section 3 Title</title>
    </section>
  </sections>
  <title>Document Title</title>
</document>
HA! The first section of the document was modified! This is totally wrong, and it causes the consuming application to display the document incorrectly.
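To make the failure easy to reproduce, here is a minimal sketch that runs the substring approach against both layouts. The class name NaiveTitleDemo and the shortened one-section XML strings are my own stand-ins for illustration, not the production code or data.

```java
// Hypothetical demo class; the method body mirrors the substring approach shown earlier.
class NaiveTitleDemo {

    // Rewrites the text of the FIRST <title> element found in the raw string.
    static String changeDocumentTitle(String xmlString, String title) {
        int startIndex = xmlString.indexOf("<title>");
        int endIndex = xmlString.indexOf("</title>");
        return xmlString.substring(0, startIndex) + "<title>" + title + xmlString.substring(endIndex);
    }

    public static void main(String[] args) {
        // Old layout: the document title comes first.
        String oldLayout = "<document><title>Document Title</title>"
                + "<sections><section><title>Section 1 Title</title></section></sections></document>";
        // New layout: the document title comes last.
        String newLayout = "<document><sections><section><title>Section 1 Title</title></section></sections>"
                + "<title>Document Title</title></document>";

        // The document title is replaced, as intended.
        System.out.println(changeDocumentTitle(oldLayout, "New Document Title"));
        // The section title is replaced instead, and the document title is left alone.
        System.out.println(changeDocumentTitle(newLayout, "New Document Title"));
    }
}
```

Nothing in the method knows which title it is touching; it only knows which one comes first in the text.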
What went wrong?
It is important to remember that when you are working with XML you are working with an object. Even though it looks like text it is absolutely not text. Many people make the mistake of processing XML as text, but this is just an awful and dangerous way to manage this particular format.
XML is a serialized object notation. This XML document is an object called 'document'. The object has a property called 'title'. The object also has a property called 'sections' that is an array of objects of type 'section'. Each section object has a property called 'title'. In my opinion, this is the correct way to interpret this document. The problem with this code is that the developer did not think of the XML document as an object.
In Java terms, the XML can be looked at like this:
import java.util.*;

class Section {
    public String title;
}

class Document {
    public String title;
    public ArrayList<Section> sections = new ArrayList<Section>();
}

class XMLTest {
    public static void main(String[] args) throws Exception {
        Document document = new Document();
        document.title = "Document Title";
        Section section1 = new Section();
        Section section2 = new Section();
        Section section3 = new Section();
        section1.title = "Section 1 Title";
        section2.title = "Section 2 Title";
        section3.title = "Section 3 Title";
        document.sections.add(section1);
        document.sections.add(section2);
        document.sections.add(section3);
    }
}
This makes sense, right? You can see that this is a Java representation of the XML we are analyzing. If you wanted to change the document title you would use the field document.title. How do we do this with an XML document?
Java has some great XML classes. They are not as simple and straightforward as the ones in C#, but they work! Java is known for being very “wordy” and this example will show that, but it will get the job done the right way!
Using the Java DOM parser is pretty easy. You just need to learn how to use a few classes.
DocumentBuilderFactory - A factory API for obtaining a parser that produces DOM object trees from XML documents
DocumentBuilder - Defines the API for obtaining DOM Document instances from an XML document
Document - Represents an entire HTML or XML document as a tree
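As a rough sketch of how these three classes fit together, the snippet below parses a small XML string and inspects the root of the resulting tree. ParseDemo and the sample string are hypothetical names of mine, and I read the string through an InputSource wrapped around a StringReader, which is just one of several ways to feed the parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class showing the factory -> builder -> document chain.
class ParseDemo {

    // Parses an XML string into a DOM tree.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document dom = parse("<document><title>Document Title</title><sections/></document>");
        // The root element is the 'document' object from the discussion above.
        System.out.println(dom.getDocumentElement().getNodeName());
        // Its direct children are the 'title' and 'sections' properties.
        System.out.println(dom.getDocumentElement().getChildNodes().getLength());
    }
}
```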
Once the XML string has been deserialized into a Document object, you can change it with no fear! As long as the XML is well-formed, you can feel confident this will do what you need without the risk I described earlier. I find that a simple way to traverse XML is with XPath expressions. Here is a simple method that works on both of the XML examples shown above:
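import java.io.*;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.w3c.dom.*;

public static String changeDocumentTitle(String xmlString, String title) throws Exception {
    // XPath to the document-level title, wherever it sits among its siblings
    String xPath = "/document/title";
    // Use an explicit charset rather than the platform default
    InputStream is = new ByteArrayInputStream(xmlString.getBytes(StandardCharsets.UTF_8));
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document dom = db.parse(is);
    XPathFactory xPathfactory = XPathFactory.newInstance();
    XPath xpath = xPathfactory.newXPath();
    XPathExpression expr = xpath.compile(xPath);
    // evaluate returns null when no node matches
    Node node = (Node) expr.evaluate(dom, XPathConstants.NODE);
    if (node == null) {
        throw new IllegalArgumentException("No node found at " + xPath);
    }
    node.setTextContent(title);
    String resultXML = domToXMLString(dom);
    is.close();
    return resultXML;
}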
Not bad, right? We start off by setting the XPath expression that represents the location of our target node. We are going inside the 'document' node, and then the 'title' node. Because the path is anchored at the root, it only matches a title that is a direct child of 'document', never a section title, no matter where it appears. I think that is a very easy way to traverse the XML document to get exactly what you are targeting.
After we set our XPath, we build our Document object using the classes I mentioned earlier. Now that our XML string has been broken apart into a proper object, we build the infrastructure we need to make use of our XPath string.
XPathFactory - Factory API that creates XPath objects
XPath - Provides access to the XPath evaluation environment and compiles expressions
XPathExpression - Represents a compiled XPath expression
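To see why the exact path matters, here is a small sketch that counts how many nodes two different expressions select in the new layout. XPathDemo, countMatches and the shortened two-section sample are hypothetical names and data I made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class comparing a rooted path with a "match anywhere" path.
class XPathDemo {

    // Counts how many nodes in the XML string match the given XPath expression.
    static int countMatches(String xml, String expression) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        XPathExpression expr = xpath.compile(expression);
        NodeList matches = (NodeList) expr.evaluate(dom, XPathConstants.NODESET);
        return matches.getLength();
    }

    public static void main(String[] args) throws Exception {
        // New layout, shortened to two sections: the document title comes last.
        String xml = "<document><sections>"
                + "<section><title>Section 1 Title</title></section>"
                + "<section><title>Section 2 Title</title></section>"
                + "</sections><title>Document Title</title></document>";

        System.out.println(countMatches(xml, "/document/title")); // 1: only the document title
        System.out.println(countMatches(xml, "//title"));         // 3: document title plus both section titles
    }
}
```

An expression like //title would bring back the same ambiguity the substring code had, so the rooted path is the safer choice here.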
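Then the code evaluates the XPath expression, which returns the 'title' node if it is there. If nothing matches, evaluate returns null rather than throwing, so the method either has to fail explicitly or it will fail with a NullPointerException on the next line; either way, the exception can be handled at a higher scope.
Once we have a reference to the node we can change its inner text by simply calling the setTextContent method.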
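The missing-node case is worth a quick sketch of its own. The class MissingNodeDemo and its helper findTitle are hypothetical; the point is only to show what evaluate hands back when the document has no top-level title.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class for the "title is not there" case.
class MissingNodeDemo {

    // Returns the /document/title node, or null if the document has none.
    static Node findTitle(String xml) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return (Node) XPathFactory.newInstance().newXPath()
                .evaluate("/document/title", dom, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        Node present = findTitle("<document><title>Document Title</title></document>");
        Node missing = findTitle("<document><sections/></document>");

        System.out.println(present.getTextContent()); // Document Title
        System.out.println(missing == null);          // true: no exception, just null
    }
}
```

A null check right after the evaluate call turns this into a clear error message instead of a surprise further down.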
BOOM! The node is changed properly and both XML formats above will work the same.
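There is a side benefit here that the substring approach does not give you. This is a sketch under my own assumption that a new title might contain characters such as & or <; the class EscapingDemo and its sample title are made up for the illustration. Because the title is set on a node rather than pasted into a string, the serializer takes care of escaping it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: setting a title that contains XML special characters.
class EscapingDemo {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Sets /document/title through the DOM and serializes the result back to a string.
    static String setRootTitle(String xml, String title) throws Exception {
        Document dom = parse(xml);
        Node node = (Node) XPathFactory.newInstance().newXPath()
                .evaluate("/document/title", dom, XPathConstants.NODE);
        node.setTextContent(title);
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String title = "R&D <draft>";
        String result = setRootTitle("<document><title>Old</title></document>", title);
        // The ampersand and angle bracket come out escaped, so the XML stays well-formed.
        System.out.println(result);
        // Parsing the result again gives back the original title text.
        System.out.println(parse(result).getDocumentElement().getTextContent().equals(title));
    }
}
```

Concatenating that same title into the string by hand would produce XML that no parser downstream could read.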
The last part of this is re-serializing the XML back into a string, because in this example the software sends the XML as a string to the consuming application. To do this, I wrote a simple method called domToXMLString.
import java.io.StringWriter;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

public static String domToXMLString(Document dom) throws Exception {
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer transformer = tf.newTransformer();
    // Leave out the <?xml ...?> declaration in the output
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    StringWriter writer = new StringWriter();
    transformer.transform(new DOMSource(dom), new StreamResult(writer));
    return writer.toString();
}
After sending the Document instance to this static method you will receive its text representation.
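Putting the pieces together, here is a self-contained sketch that runs the DOM and XPath approach against both layouts. The class name TitleRoundTrip and the shortened one-section samples are my own; the two methods follow the ones described above, with an explicit check for a missing title added as an assumption about how you would want that case handled.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: parse, change /document/title, serialize.
class TitleRoundTrip {

    static String changeDocumentTitle(String xmlString, String title) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xmlString)));
        Node node = (Node) XPathFactory.newInstance().newXPath()
                .evaluate("/document/title", dom, XPathConstants.NODE);
        if (node == null) {
            // Assumption: a document without a top-level title is an error for the caller.
            throw new IllegalArgumentException("No /document/title element found");
        }
        node.setTextContent(title);
        return domToXMLString(dom);
    }

    static String domToXMLString(Document dom) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        String oldLayout = "<document><title>Document Title</title>"
                + "<sections><section><title>Section 1 Title</title></section></sections></document>";
        String newLayout = "<document><sections><section><title>Section 1 Title</title></section></sections>"
                + "<title>Document Title</title></document>";

        // In both cases only the document title changes; the section title is untouched.
        System.out.println(changeDocumentTitle(oldLayout, "New Document Title"));
        System.out.println(changeDocumentTitle(newLayout, "New Document Title"));
    }
}
```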
In summary… Working with XML as if it were text is very tempting. Sometimes it gives the illusion that a simple regex substitution or substring will do the trick. There are a lot of pitfalls in parsing and manipulating XML that make text manipulation the wrong way to go. Always treat XML as what it is: an object. Treat XML with respect in your software and it will treat you with respect back!