Abstract
Blueprint is a declarative domain-specific language for document extraction. Users describe document layout using spatial, textual, semantic, and numerical fuzzy constraints, and the language runtime extracts the field-value mappings that best satisfy the constraints in a given document. We used Blueprint to develop several document extraction solutions in a commercial setting. This approach to the extraction problem proved powerful. Concise Blueprint programs were able to generate good accuracy on a broad set of use cases. However, a major goal of our work was to build a system that non-experts, and in particular non-engineers, could use effectively, and we found that writing declarative fuzzy constraint-based extraction programs was not intuitive for many users: a large up-front learning investment was required to be effective, and debugging was often challenging. To address these issues, we developed a no-code IDE for Blueprint, called Studio, as well as program synthesis functionality for automatically generating Blueprint programs from training data, which could be created by labeling document samples in our IDE. Overall, the IDE significantly improved the Blueprint development experience and the results users were able to achieve. In this paper, we discuss the design, implementation, and deployment of Blueprint and Studio. We compare our system with a state-of-the-art deep-learning based extraction tool and show that our system can achieve comparable accuracy results, with comparable development time, for appropriately-chosen use cases, while providing better interpretability and debuggability.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.