Abstract

Our work aims to aid the development of an open-source data schema for educational interventions by applying natural language processing (NLP) techniques to publications within the What Works Clearinghouse (WWC) and the Education Resources Information Center (ERIC). A data schema captures the relationships between individual elements of interest (in this case, research in education) and collectively documents those elements in a data dictionary. To facilitate the creation of this educational data schema, we first run a two-topic latent Dirichlet allocation (LDA) model comparing the titles and abstracts of papers that met WWC standards without reservations against those of papers that did not, separated into math and reading subdomains. We find that the distributions of allocation to these two topics suggest structural differences between WWC and non-WWC literature. We then apply Term Frequency-Inverse Document Frequency (TF-IDF) scoring to study the vocabulary of WWC titles and abstracts and determine the most relevant unigrams and bigrams currently present in WWC. Finally, we use an LDA model again to cluster WWC titles and abstracts into topics, i.e., sets of words grouped by underlying semantic similarities. We find that 11 is the optimal number of subtopics in WWC, with an average coherence score of 0.4096 among the 39 of 50 models that returned 11 as the optimal number of topics. Based on the TF-IDF and LDA methods presented, we can begin to identify core themes of high-quality literature that will better inform the creation of a universal data schema within education research.
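The paper's own pipeline is not reproduced here, but the TF-IDF scoring step it describes can be illustrated with a minimal sketch. The toy documents below stand in for WWC titles and abstracts, and the plain tf × idf weighting (no smoothing or normalization, unlike library implementations such as scikit-learn's) is an assumption for illustration only:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """Score unigrams and bigrams per document with plain tf * idf.

    docs: list of token lists. Returns one {term: score} dict per document.
    idf is log(N / df), so a term appearing in every document scores 0.
    """
    term_counts = [Counter(ngrams(d, 1) + ngrams(d, 2)) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for counts in term_counts:
        df.update(counts.keys())          # document frequency of each term
    scores = []
    for counts in term_counts:
        total = sum(counts.values())      # terms in this document
        scores.append({
            t: (c / total) * math.log(n_docs / df[t])
            for t, c in counts.items()
        })
    return scores

# Toy corpus (hypothetical stand-ins for study titles/abstracts)
docs = [
    "reading intervention improves student outcomes".split(),
    "math intervention improves test scores".split(),
]
scores = tfidf(docs)
# "intervention" occurs in both documents, so its idf (and score) is 0;
# subdomain-specific terms like "reading" and "math" score positively.
```

Production systems typically use a smoothed idf such as log((1 + N) / (1 + df)) + 1 to avoid zeroing out ubiquitous terms entirely, which is why library scores differ from this sketch.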
