Abstract

Though XML has gained significant acceptance in a number of application domains, XML parsing can still be a vexing performance bottleneck. With the growing prevalence of multicore CPUs, parallel XML parsing is one option for addressing this bottleneck. Achieving data parallelism by dividing the XML document into chunks and then processing all chunks independently in parallel is difficult, however, because the state of an XML parser at the first character of a chunk potentially depends on the characters in all preceding chunks. In previous work, we used a sequential preparsing pass to determine the document structure, followed by a parallel full parse. Because that preparsing pass is sequential, it limits the achievable speedup. In this work, we parallelize the preparsing pass itself by using a simultaneous finite transducer (SFT), which implicitly maintains multiple preparser results. Each result corresponds to starting the preparser in a different state at the beginning of the chunk. This addresses the challenge of determining the correct initial state at the beginning of a chunk by simply considering all possible initial states simultaneously. Since the SFT is finite, the simultaneity can be implemented efficiently by enumerating the states, which limits the overhead. To demonstrate effectiveness, we use an SFT to build a parallel XML parsing implementation on an unmodified version of libxml2, and obtain good scalability on both a 30-CPU Sun E6500 machine running Solaris and a Linux machine with two Intel Xeon L5320 CPUs for a total of 8 physical cores.
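
The idea of running the preparser from every possible initial state can be illustrated with a minimal Python sketch. It is not the implementation described above: the three-state automaton (`CONTENT`, `TAG`, `QUOTE`), the function names, and the choice of output (offsets of tag-opening `<` characters) are all assumptions made for illustration, and real XML would need further states for comments, CDATA sections, processing instructions, and single-quoted values. Each chunk is scanned once per initial state, and a cheap sequential pass then selects, for each chunk, the result whose initial state matches the final state of the preceding chunk.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, heavily simplified lexical states for illustration only.
CONTENT, TAG, QUOTE = 0, 1, 2
STATES = (CONTENT, TAG, QUOTE)


def step(state, ch):
    """Transition function of the toy automaton."""
    if state == CONTENT:
        return TAG if ch == "<" else CONTENT
    if state == TAG:
        if ch == ">":
            return CONTENT
        if ch == '"':
            return QUOTE
        return TAG
    # QUOTE: everything up to the closing double quote is attribute text.
    return TAG if ch == '"' else QUOTE


def run(text, offset, state):
    """Scan text from a given initial state.

    Returns the final state and the absolute offsets of every '<'
    that opens a tag (that is, every '<' seen in CONTENT state).
    """
    starts = []
    for i, ch in enumerate(text):
        if state == CONTENT and ch == "<":
            starts.append(offset + i)
        state = step(state, ch)
    return state, starts


def run_all_states(piece):
    """Scan one chunk once per possible initial state."""
    chunk, offset = piece
    return {s: run(chunk, offset, s) for s in STATES}


def parallel_preparse(doc, chunk_size):
    """Scan chunks independently, then stitch the results in order."""
    pieces = [(doc[i:i + chunk_size], i)
              for i in range(0, len(doc), chunk_size)]
    # Threads are used only to show the structure; CPython threads will
    # not give a real speedup for this CPU-bound loop.
    with ThreadPoolExecutor() as pool:
        tables = list(pool.map(run_all_states, pieces))

    state, starts = CONTENT, []
    for table in tables:
        # Pick the result computed for the state the previous chunk ended in.
        state, found = table[state]
        starts.extend(found)
    return state, starts


if __name__ == "__main__":
    doc = '<a x="1<2>3"><b/>text</a>'
    print(parallel_preparse(doc, 4))
    print(run(doc, 0, CONTENT))
```

In this sketch the per-chunk cost grows with the number of states, which is why a small, finite state set keeps the overhead bounded. The stitching loop touches each chunk's table only once, so the only work that stays sequential is proportional to the number of chunks rather than to the document length.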
