Abstract

The task of Question Answering has gained prominence in the past few decades as a means of testing the ability of machines to understand natural language. Large datasets for Machine Reading have led to the development of neural models that target deeper language understanding than information retrieval tasks require. Different components in these neural architectures are intended to tackle different challenges. As a first step towards achieving generalization across multiple domains, we attempt to understand and compare the peculiarities of existing end-to-end neural models on the Stanford Question Answering Dataset (SQuAD) by performing quantitative as well as qualitative analysis of the results attained by each of them. We observe that prediction errors reflect certain model-specific biases, which we discuss further in this paper.

Highlights

  • Machine Reading is a task in which a model reads a piece of text and attempts either to formally represent it or to perform a downstream task such as Question Answering (QA)

  • We focused on Bi-Directional Attention Flow (BiDAF) (Seo et al., 2016), Gated Self-Matching Networks (R-Net) (Wang et al., 2017), Document Reader (DrQA) (Chen et al., 2017), Multi-Paragraph Reading Comprehension (DocQA) (Clark and Gardner, 2017), and the Logistic Regression baseline model (Rajpurkar et al., 2016). We chose these models mainly because they achieve comparably high performance on the evaluation metrics (sketched after this list) and because their results are easy to replicate thanks to openly available implementations

  • We analyze, both quantitatively and qualitatively, the results generated by four end-to-end neural models on the Stanford Question Answering Dataset
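
The models highlighted above are typically compared on SQuAD's two standard metrics, Exact Match (EM) and token-level F1. The snippet below is a minimal sketch of how such scores are commonly computed (normalize the answer string, then check string equality for EM and token overlap for F1); the function names and example strings are our own illustrations, not taken from the paper or the official evaluation script.

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lowercase, drop articles and punctuation, collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    return " ".join(s.split())

def exact_match(prediction, ground_truth):
    """1 if the normalized prediction equals the normalized reference, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(ground_truth))

def f1_score(prediction, ground_truth):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: a boundary error loses EM but keeps partial credit under F1.
print(exact_match("the Denver Broncos", "Denver Broncos"))  # 1 (article ignored)
print(f1_score("Broncos", "Denver Broncos"))                # ~0.67
```

This also illustrates why span-boundary errors, discussed later in the paper, hurt EM more sharply than F1.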


Summary

Introduction

Machine Reading is a task in which a model reads a piece of text and attempts either to formally represent it or to perform a downstream task such as Question Answering (QA). Neural approaches to the latter have gained a lot of prominence, especially owing to the recent surge in developing and publicly releasing large datasets on Machine Reading and Comprehension (MRC). These datasets are created from different underlying sources, such as web resources in MS MARCO (Nguyen et al., 2016); trivia and web in QUASAR-S and QUASAR-T (Dhingra et al., 2017), SearchQA (Dunn et al., 2017), and TriviaQA (Joshi et al., 2017); news articles in CNN/Daily Mail (Chen et al.) and NewsQA (Trischler et al., 2016); and stories in NarrativeQA (Kocisky et al., 2017).

Relevant Neural Models
Span-Level Performance
Sentence-Level Performance
Passage Length Distribution
Question Length Distribution
Answer Length Distribution
Error Overlap
Inference-Based Errors
Qualitative Analysis
Boundary-Based Errors
Findings
Observations
Conclusion and Future Work