Abstract
The conferences ACL (Association for Computational Linguistics) and EMNLP (Empirical Methods in Natural Language Processing) rank among the premier venues that track research developments in Natural Language Processing and Computational Linguistics. In this paper, we present a study of the research papers published at these two NLP conferences over approximately two decades. We apply keyphrase extraction and corpus analysis tools to the proceedings from these venues and propose probabilistic and vector-based representations of the topics published in a venue in a given year. We then study similarity metrics over pairs of venue representations to capture how the two venues progress relative to each other and over time.
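As an illustration of the probabilistic representation and of a similarity metric over pairs of venues, the following is a minimal sketch, not the method used in the paper. It assumes that each paper already has a topic distribution (for example, from a topic model), that a venue-year is summarized by averaging those distributions, and that two venue-years are compared with the Jensen-Shannon divergence. The function names and the choice of metric are assumptions made for this example; the abstract does not name the metrics studied.

```python
import math

def venue_distribution(doc_topic_rows):
    """Average per-paper topic distributions into one venue-year distribution.

    Assumption: each row is a probability distribution over the same topics.
    """
    n_topics = len(doc_topic_rows[0])
    totals = [0.0] * n_topics
    for row in doc_topic_rows:
        for k, p in enumerate(row):
            totals[k] += p
    mass = sum(totals)
    return [t / mass for t in totals]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical toy data: two papers per venue, three topics.
acl_year = venue_distribution([[0.5, 0.5, 0.0], [0.7, 0.3, 0.0]])
emnlp_year = venue_distribution([[0.2, 0.2, 0.6], [0.4, 0.2, 0.4]])
divergence = js_divergence(acl_year, emnlp_year)
```

A lower divergence indicates that the two venue-years place their topic mass similarly. Computing it for every pair of years gives one possible view of how the venues move relative to each other over time.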
Highlights
Scientific findings in a subject area are typically published in conferences, journals, patents, and books in that domain.
We present our study of research proceedings spanning approximately two decades from two leading NLP conferences, namely ACL and EMNLP, to complement a previous study on this topic by Hall et al. (2008).
We propose two novel representations for summarizing the venue proceedings in a given year. (1) The probabilistic representation expresses each venue as a probability distribution over topics, whereas (2) the TPICP representation captures the topics that are the major focus of the venue in a particular year via Topic Proportion (TP), together with topic importance as measured by Inverse Corpus Proportion (ICP).
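The highlights above do not give the formulas for TP and ICP. The sketch below therefore assumes a definition by analogy with TF-IDF: TP is the share of a venue-year's topic mass assigned to a topic, and ICP is the logarithm of the inverse of that topic's share of the whole corpus. Both definitions and the function name `tpicp_scores` are assumptions made for illustration and may differ from those in the paper.

```python
import math

def tpicp_scores(venue_topic_mass, corpus_topic_mass):
    """Score each topic for one venue-year as TP * ICP.

    Assumed definitions (by analogy with TF-IDF, not taken from the paper):
      TP  = topic mass in the venue-year / total mass in the venue-year
      ICP = log(total corpus mass / topic mass in the corpus)
    """
    venue_total = sum(venue_topic_mass)
    corpus_total = sum(corpus_topic_mass)
    scores = []
    for venue_mass, corpus_mass in zip(venue_topic_mass, corpus_topic_mass):
        tp = venue_mass / venue_total
        # A topic absent from the corpus gets ICP 0 to avoid dividing by zero.
        icp = math.log(corpus_total / corpus_mass) if corpus_mass > 0 else 0.0
        scores.append(tp * icp)
    return scores

# Hypothetical toy data: topic 0 is common corpus-wide, topic 1 is rare.
scores = tpicp_scores([3.0, 1.0], [30.0, 10.0])
```

Under these assumptions, a topic that is prominent in the venue-year but rare in the corpus as a whole receives a high score, while a topic that dominates the entire corpus is down-weighted.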
Summary
Scientific findings in a subject area are typically published in conferences, journals, patents, and books in that domain. Despite the potential benefits of such a study and the free availability of most NLP research proceedings through the ACL Anthology, the topical and temporal aspects of this corpus are yet to be fully studied in the current literature. We present our study of research proceedings spanning approximately two decades from two leading NLP conferences, namely ACL and EMNLP, to complement a previous study on this topic by Hall et al. (2008). To the best of our knowledge, we are the first to characterize developments in the NLP domain using a comparative study of two of its leading publication venues. Our contributions are summarized below:
1. We represent the NLP research corpus from approximately two decades as a keyphrase-document matrix and apply Latent Dirichlet Allocation (Blei et al., 2003) to extract coherent topics from it (Newman et al., 2010).