Abstract
We present an approach for systematically probing a trained neural network to extract a symbolic abstraction of it, represented as a Boolean formula. We formulate this task within Angluin's exact learning framework, where a learner attempts to extract information from an oracle (in our work, the neural network) by posing membership and equivalence queries. We adapt Angluin's algorithm for Horn formula to the case where the examples are labelled w.r.t. an arbitrary Boolean formula in CNF (rather than a Horn formula). In this setting, the goal is to learn the smallest representation of all the Horn clauses implied by a Boolean formula—called its Horn envelope—which in our case correspond to the rules obeyed by the network. Our algorithm terminates in exponential time in the worst case and in polynomial time if the target Boolean formula can be closely approximated by its envelope. We also show that extracting Horn envelopes in polynomial time is as hard as learning CNFs in polynomial time. To showcase the applicability of the approach, we perform experiments on BERT based language models and extract Horn envelopes that expose occupation-based gender biases.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.