Java DocumentBuilder: XML parsing is very slow?
March 28, 2011
I’ve been messing around with some Java code to find certain links in an XHTML page. I started with XPath over a page source parsed by the out-of-the-box
javax.xml.parsers.DocumentBuilder, but it was painfully slow. Most interestingly, the slow part was not the XPath evaluation but the XHTML parsing.
The page was only 12 kB, yet it took around 2 minutes to parse! That was simply unusable (which is why the regex from the previous post was born). Once the page was parsed, XPath evaluated in no time. The cause is that the XML parser, by default, tries to load everything the document refers to while parsing, which for XHTML means downloading the DTD named in the DOCTYPE (and the entity files it references) over the network. Everything was fixed by disabling validation and external DTD loading. So here it is if you need it:
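import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

DocumentBuilderFactory fac = DocumentBuilderFactory.newInstance();
// Plain, non-validating, non-namespace-aware parsing
fac.setNamespaceAware(false);
fac.setValidating(false);
fac.setFeature("http://xml.org/sax/features/namespaces", false);
fac.setFeature("http://xml.org/sax/features/validation", false);
// Do not load the DTD grammar or fetch the external DTD at all
fac.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
fac.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
// setFeature and newDocumentBuilder can throw ParserConfigurationException
DocumentBuilder builder = fac.newDocumentBuilder();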
Now use this builder to parse XML documents with no validation (and in no time).
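To show how the pieces fit together, here is a minimal sketch that builds such a parser, parses a small XHTML string and pulls the link targets out with XPath. The class name XhtmlLinkFinder, the method names, the sample page and the //a/@href expression are my own illustration, not the code from my project, and it assumes the JDK's built-in (Xerces-based) parser, which is the one that understands the apache.org feature names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XhtmlLinkFinder {

    // Hypothetical sample page. The DOCTYPE points at the W3C DTD that a
    // default parser would try to download before parsing anything else.
    static final String SAMPLE =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
        + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
        + "<head><title>Sample</title></head>"
        + "<body><p><a href=\"http://example.com/a\">A</a> "
        + "<a href=\"/b\">B</a></p></body></html>";

    // Builds a parser with validation and external DTD loading switched off.
    static DocumentBuilder newFastBuilder() throws Exception {
        DocumentBuilderFactory fac = DocumentBuilderFactory.newInstance();
        fac.setNamespaceAware(false);
        fac.setValidating(false);
        fac.setFeature("http://xml.org/sax/features/namespaces", false);
        fac.setFeature("http://xml.org/sax/features/validation", false);
        fac.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
        fac.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        return fac.newDocumentBuilder();
    }

    // Parses the XHTML text and returns the href value of every <a> element.
    static List<String> findLinks(String xhtml) throws Exception {
        Document doc = newFastBuilder().parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Namespace awareness is off, so the plain element name "a" is enough.
        NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
        List<String> links = new ArrayList<>();
        for (int i = 0; i < hrefs.getLength(); i++) {
            links.add(hrefs.item(i).getNodeValue());
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(findLinks(SAMPLE));
    }
}
```

One caveat with this approach: since the DTD is never loaded, the parser cannot expand the named entities it defines (such as &nbsp;), so treat this as a sketch for pages where only the markup structure matters.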