xml.etree.ElementTree.ParseError: junk after document element | by Teri Radichel | Biting Bugs | December 2023

A malformed XML document causes this Python error

Here’s what I did that caused this error. I kind of suspected this would be a problem, but I lazily tried it to see if I could get away with it.

I wrote Python 3 code to merge multiple XML files into a single file. I'm merging multiple Burp files for processing, something I've done before, but I wanted to rework the code. It looks like this:

#!/usr/bin/env python3
import glob
import xml.etree.ElementTree as ET
#merge multiple burp xml files
ptcode="xxxxxxxxxxx"
reportdatadir="/home/ec2-user/report/" + ptcode + "/data"
burpdatadir=reportdatadir +"/burp"
mergedfile=reportdatadir + "/burp.xml"

f = open(mergedfile, "a+")

xml_files = glob.glob(burpdatadir +"/*.xml")
xml_element_tree = None
for xml_file in xml_files:
    data = ET.tostring(ET.parse(xml_file).getroot()).decode("utf-8")
    f.write(data)
f.close()

Great, this takes a bunch of Burp XML files and merges them, but it simply concatenates their root elements. For example, merging three files produces this:

<issues>
<issue>security bug 1</issue>
<issue>security bug 2</issue>
</issues>

<issues>
<issue>security bug 3</issue>
<issue>security bug 4</issue>
</issues>

<issues>
<issue>security bug 5</issue>
<issue>security bug 6</issue>
</issues>

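The problem is that an XML document can have only one root element. When the parser reaches the second <issues> element, it raises the ParseError in the title.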
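
To see the error in isolation, here is a minimal sketch that parses a string with two root elements back to back, like the concatenated output above. The sample strings are made up for illustration and are not real Burp output.

```python
import xml.etree.ElementTree as ET

# Two <issues> root elements in one string, mimicking the concatenated file.
# The content is illustrative only.
concatenated = (
    "<issues><issue>security bug 1</issue></issues>\n"
    "<issues><issue>security bug 3</issue></issues>\n"
)

try:
    ET.fromstring(concatenated)
    error_message = None
except ET.ParseError as err:
    # The parser accepts the first root element and rejects whatever follows it.
    error_message = str(err)

print(error_message)
```

The message should start with "junk after document element", followed by the position where the second root element begins.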

To fix this when I merge the files, I can either wrap everything in a new root element:

<root>
<issues>
<issue>security bug 1</issue>
<issue>security bug 2</issue>
</issues>

<issues>
<issue>security bug 3</issue>
<issue>security bug 4</issue>
</issues>

<issues>
<issue>security bug 5</issue>
<issue>security bug 6</issue>
</issues>
</root>
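
One way to get the wrapped structure is sketched below. It builds a new element and appends each parsed root to it. The helper name wrap_in_root and the tag name "root" are my own choices for illustration, and the inputs are inline strings so the example runs without any files. With real files you would append ET.parse(path).getroot() instead.

```python
import xml.etree.ElementTree as ET


def wrap_in_root(xml_strings, root_tag="root"):
    """Parse each document and hang its root element under one new root.

    wrap_in_root and root_tag are hypothetical names, not part of any library.
    """
    root = ET.Element(root_tag)
    for text in xml_strings:
        root.append(ET.fromstring(text))
    return root


# Illustrative sample documents, not real Burp output.
docs = [
    "<issues><issue>security bug 1</issue><issue>security bug 2</issue></issues>",
    "<issues><issue>security bug 3</issue><issue>security bug 4</issue></issues>",
]

wrapped = wrap_in_root(docs)
merged_xml = ET.tostring(wrapped, encoding="unicode")

# The merged text now has a single root element, so it parses cleanly.
reparsed = ET.fromstring(merged_xml)
```

The trade-off is that anything reading the merged file has to look one level deeper, since each issue now sits under root/issues rather than directly under issues.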

Alternatively, I can take the child nodes of each file's root and add them to a single root node to get this:

<issues>
<issue>security bug 1</issue>
<issue>security bug 2</issue>
<issue>security bug 3</issue>
<issue>security bug 4</issue>
<issue>security bug 5</issue>
<issue>security bug 6</issue>
</issues>

The latter is probably easier for sorting items.
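
Here is a rough sketch of that second approach, assuming every input file has the same kind of root element. It keeps the first file's root (including any attributes on it) and appends the children of the other roots to it. The function name merge_children and the throwaway sample files are hypothetical, used here only so the example runs on its own.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET


def merge_children(xml_paths):
    """Return one root element holding the children of every file's root.

    merge_children is a hypothetical helper, not part of any library.
    """
    merged_root = None
    for path in sorted(xml_paths):
        root = ET.parse(path).getroot()
        if merged_root is None:
            # Reuse the first file's root as the single root of the result.
            merged_root = root
        else:
            # Move the children of the remaining roots under the first root.
            merged_root.extend(list(root))
    return merged_root


# Demo with temporary sample files instead of a real report directory.
with tempfile.TemporaryDirectory() as tmp:
    for n, (a, b) in enumerate([(1, 2), (3, 4), (5, 6)]):
        with open(os.path.join(tmp, f"scan{n}.xml"), "w") as fh:
            fh.write(
                f"<issues><issue>security bug {a}</issue>"
                f"<issue>security bug {b}</issue></issues>"
            )

    merged = merge_children(glob.glob(os.path.join(tmp, "*.xml")))

    # Write the merged tree once, then read it back to confirm it parses.
    merged_path = os.path.join(tmp, "burp.xml")
    ET.ElementTree(merged).write(merged_path, encoding="utf-8", xml_declaration=True)
    issue_texts = [e.text for e in ET.parse(merged_path).getroot().findall("issue")]
```

Writing the whole tree in one call, rather than appending strings to an open file, also means that running the script twice does not duplicate the contents of the merged file.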

Teri Radichel | © 2nd Sight Lab 2023
