Abstract

Inferring the input grammar accepted by a program is central to a variety of software engineering problems, including parser verification, grammar-based fuzzing, communication protocol inference, and documentation. Sound and complete active learning techniques have been developed for several classes of languages and their corresponding automaton representations; however, outstanding challenges limit their effective application to the inference of input grammars. We focus on active learning techniques based on L^* and propose two extensions of the Minimally Adequate Teacher framework that allow the efficient learning of the input language of a program in the form of symbolic automata, leveraging the additional information that can be extracted from concolic execution. Building on these extensions, we develop two learning algorithms that significantly reduce the number of queries required to converge to the correct hypothesis.

Highlights

  • Inferring the input grammar of a program from its implementation is central for a variety of software engineering activities, including automated documentation, compiler analyses, and grammar-based fuzzing. Several learning algorithms have been investigated for inferring a grammar from examples of accepted and rejected input words, with active learning approaches achieving the highest data efficiency and strong convergence guarantees.

  • In our preliminary evaluation based on Java implementations of parsers for regular languages from the Automatark benchmark suite, the new active learning algorithms enabled by concolic execution learned the correct input language for 76% of the subjects, despite the lack of a complete equivalence oracle, while reducing the number of membership and equivalence queries produced by the learner by up to 96%.

Summary

Introduction

Inferring the input grammar of a program from its implementation is central for a variety of software engineering activities, including automated documentation, compiler analyses, and grammar-based fuzzing. We extend two state-of-the-art active learning frameworks for symbolic learning by enabling the teacher to 1) provide more informative answers to membership queries, pairing the accept/reject outcome with a path condition that describes all the input words that would result in the same execution as the word indicated by the learner, and 2) act as a partial equivalence oracle that may produce counterexamples for the learner's hypothesis. In our preliminary evaluation based on Java implementations of parsers for regular languages from the Automatark benchmark suite, the new active learning algorithms enabled by concolic execution learned the correct input language for 76% of the subjects, despite the lack of a complete equivalence oracle, while reducing the number of membership and equivalence queries produced by the learner by up to 96%.
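The two teacher extensions described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy target (non-empty strings of digits), the hand-written instrumentation that stands in for a concolic execution engine, and all names (`CharClass`, `SymbolicAnswer`, `symbolic_membership_query`, `partial_equivalence_query`) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class CharClass:
    """A named predicate over single characters (hypothetical helper)."""
    name: str
    test: Callable[[str], bool]


DIGIT = CharClass("digit", lambda c: c.isdigit())
NON_DIGIT = CharClass("non-digit", lambda c: not c.isdigit())


@dataclass(frozen=True)
class SymbolicAnswer:
    """Accept/reject outcome paired with a simplified path condition."""
    accepted: bool
    prefix: Tuple[CharClass, ...]  # branch taken for each inspected character
    any_suffix: bool               # True if the run stopped before end of input

    def covers(self, word: str) -> bool:
        """True if `word` would follow the same execution path."""
        if len(word) < len(self.prefix):
            return False
        if not self.any_suffix and len(word) != len(self.prefix):
            return False
        return all(p.test(c) for p, c in zip(self.prefix, word))


def symbolic_membership_query(word: str) -> SymbolicAnswer:
    """Toy target: accepts non-empty strings made only of digits.

    The branch taken at each character is recorded by hand; a real setup
    would obtain the path condition from a concolic execution engine.
    """
    path = []
    for ch in word:
        if ch.isdigit():
            path.append(DIGIT)
        else:
            # The toy parser rejects immediately, so the rest is unconstrained.
            path.append(NON_DIGIT)
            return SymbolicAnswer(False, tuple(path), any_suffix=True)
    return SymbolicAnswer(len(word) > 0, tuple(path), any_suffix=False)


def partial_equivalence_query(
    hypothesis: Callable[[str], bool], candidates: Iterable[str]
) -> Optional[str]:
    """Return a counterexample among `candidates`, or None if none is found.

    None means "no counterexample found", not "hypothesis is correct".
    """
    for word in candidates:
        if hypothesis(word) != symbolic_membership_query(word).accepted:
            return word
    return None
```

In this sketch, a single query for `"12a4"` yields a rejection together with the condition "digit, digit, non-digit, then anything", so the learner need not ask separately about `"99x"` or `"00-zzz"`. This is the intuition behind the reduction in membership queries; the oracle is only partial because it searches a bounded set of candidate words.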

Symbolic finite state automata
Active learning and minimally adequate teachers
Concolic execution
From path conditions to SFA
Active learning for SFA
Learning using observation tables
Learning using discrimination trees
Active learning with concolic execution
Concolic learning with symbolic observation tables
Concolic learning with a symbolic membership oracle
Experimental Setup
Learning with symbolic membership queries
Active learning
Findings
Passive learning
Conclusions
