Sunday, 18 August 2013

Parsing XML in Groovy with namespace and entities

Parsing XML in Groovy should be a piece of cake, but I always run into
problems.
I would like to parse a string like this:
<html>
<p>
This&nbsp;is a <span>test</span> with <b>some</b> formattings.<br />
And this has a <ac:special>special</ac:special> formatting.
</p>
</html>
When I do it the standard way, new XmlSlurper().parseText(body), the parser
complains about the &nbsp; entity. My secret weapon in cases like this is
to use TagSoup:
def parser = new org.ccil.cowan.tagsoup.Parser()
def page = new XmlSlurper(parser).parseText(body)
But now the <ac:special> tag is closed immediately by the parser: the
text "special" is not inside this tag in the resulting DOM. This happens even
when I disable the namespaces feature:
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature(parser.namespacesFeature,false)
def page = new XmlSlurper(parser).parseText(body)
Another approach was to use the standard parser and to add a doctype like
this one:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
This seems to work for most of my files, but it takes ages for the parser
to fetch the DTD and process it.
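One idea that might avoid the slow download, sketched here under assumptions rather than tested against real files: declare only the nbsp entity in an internal DTD subset, so the standard parser has nothing to fetch. The doctype string, the non-namespace-aware XmlSlurper(false, false) setup and the explicit feature switch are my assumptions. As far as I know, on Groovy 4 XmlSlurper lives in the groovy.xml package and would need an import.

```groovy
// Sketch: define &nbsp; locally so no external DTD has to be downloaded.
def body = """<html>
<p>
This&nbsp;is a <span>test</span> with <b>some</b> formattings.<br />
And this has a <ac:special>special</ac:special> formatting.
</p>
</html>"""

// Internal DTD subset declaring only the one entity used in the sample (assumption).
def doctype = '<!DOCTYPE html [ <!ENTITY nbsp "&#160;"> ]>\n'

// Non-validating and not namespace-aware, so the undeclared "ac" prefix is not an error.
def slurper = new XmlSlurper(false, false)
// Assumption: only needed on Groovy versions that reject DOCTYPE declarations by default.
slurper.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)

def page = slurper.parseText(doctype + body)

// Without namespace processing the element keeps its qualified name "ac:special".
def special = page.depthFirst().find { it.name() == 'ac:special' }
println special.text()
```

This only covers entities that are declared by hand. A file using other XHTML entities (&copy;, &uuml; and so on) would need a line for each of them in the internal subset.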
Any good ideas on how to solve this?
PS: here is some sample code to play around with:
// Recursively turn the children of a node back into markup.
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='0.9.7')
def processNode(node) {
    def out = new StringBuilder()
    // children() returns a mix of Strings (text) and child nodes
    node.children().each {
        if (it instanceof String) {
            out << it
        } else {
            out << "<${it.name()}>${processNode(it)}</${it.name()}>"
        }
    }
    return out.toString()
}

def body = """<html>
<p>
This&nbsp;is a <span>test</span> with <b>some</b> formattings.<br />
And this has a <ac:special>special</ac:special> formatting.
</p>
</html>"""

// TagSoup parser with namespace processing switched off
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature(parser.namespacesFeature, false)
def page = new XmlSlurper(parser).parseText(body)

// Print what the parser made of the document
def out = new StringBuilder()
page.childNodes().each {
    out << processNode(it)
}
println out.toString()
