Abstract

Our work aims to aid the development of an open-source data schema for educational interventions by applying natural language processing (NLP) techniques to publications within the What Works Clearinghouse (WWC) and the Education Resources Information Center (ERIC). A data schema captures the relationships between individual elements of interest (in this case, research in education) and collectively documents those elements in a data dictionary. To facilitate the creation of this educational data schema, we first run a two-topic latent Dirichlet allocation (LDA) model comparing the titles and abstracts of papers that met WWC standards without reservations against those of papers that did not, separated into math and reading subdomains. We find that the distributions of allocation to these two topics suggest structural differences between WWC and non-WWC literature. We then apply Term Frequency-Inverse Document Frequency (TF-IDF) scoring to study the vocabulary of WWC titles and abstracts and determine the most relevant unigrams and bigrams currently present in WWC. Finally, we use an LDA model again to cluster WWC titles and abstracts into topics, i.e., sets of words grouped by underlying semantic similarities. We find that 11 is the optimal number of subtopics in WWC, with an average coherence score of 0.4096 among the 39 of 50 models that returned 11 as the optimal number of topics. Based on the TF-IDF and LDA methods presented, we can begin to identify core themes of high-quality literature that will better inform the creation of a universal data schema within education research.
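The paper's own pipeline is not reproduced here, but the TF-IDF scoring step it describes can be illustrated with a minimal sketch. The toy documents below stand in for WWC titles and abstracts, and the plain tf × idf weighting (no smoothing or normalization, unlike library implementations such as scikit-learn's) is an assumption for illustration only:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """Score unigrams and bigrams per document with plain tf * idf.

    docs: list of token lists. Returns one {term: score} dict per document.
    idf is log(N / df), so a term appearing in every document scores 0.
    """
    term_counts = [Counter(ngrams(d, 1) + ngrams(d, 2)) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for counts in term_counts:
        df.update(counts.keys())          # document frequency of each term
    scores = []
    for counts in term_counts:
        total = sum(counts.values())      # terms in this document
        scores.append({
            t: (c / total) * math.log(n_docs / df[t])
            for t, c in counts.items()
        })
    return scores

# Toy corpus (hypothetical stand-ins for study titles/abstracts)
docs = [
    "reading intervention improves student outcomes".split(),
    "math intervention improves test scores".split(),
]
scores = tfidf(docs)
# "intervention" occurs in both documents, so its idf (and score) is 0;
# subdomain-specific terms like "reading" and "math" score positively.
```

Production systems typically use a smoothed idf such as log((1 + N) / (1 + df)) + 1 to avoid zeroing out ubiquitous terms entirely, which is why library scores differ from this sketch.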
