Abstract

Historical topic modeling and semantic concept exploration in a large corpus of unstructured text remains a hard, open problem. Despite advances in natural language processing tools, statistical linguistic models, graph theory, and visualization, there is no framework that combines these separate tools under one roof. We designed and constructed a Semantic Network Analysis Pipeline (SNAP), available as an open-source web service, that implements the work-flow a data scientist needs to explore historical semantic concepts in a text corpus. We define a graph-theoretic notion of a semantic concept as a flow of closely related tokens through the corpus of text. The modular work-flow pipeline processes text using natural language processing tools, applies statistical content narrowing, creates semantic networks from lexical token chaining, performs social network analysis of the token networks, and creates a 3D visualization of the semantic concept flows through the corpus for interactive concept exploration. Finally, we illustrate the framework’s utility by extracting information from three text corpora: Herman Melville’s novel Moby Dick, the transcript of the 2015–2016 United States (U.S.) Senate Hearings on Environment and Public Works, and the Australian Broadcasting Corporation’s short news articles on rural and science topics.

Highlights

  • Historical semantic concepts (HSC) modeling aims to understand which key concepts are discussed in a text corpus, how those concepts evolve over time, and the context in which semantic concepts are used, both in relation to each other and in relation to their supporting sub-concepts

  • The modular framework relies on mature linguistic tools that can be swapped to customize the mechanics of the computational linguistics processing

  • One such customization might include the implementation of a workflow to analyze the sentiment concept flows, where a sentiment concept flow would track and connect tokens coded with a sentiment label

Summary

Introduction

Sci. 2019, 9, 5302

Historical semantic concepts (HSC) modeling aims to understand which key concepts are discussed in a text corpus, how those concepts evolve over time, and the context in which semantic concepts are used, both in relation to each other and in relation to their supporting sub-concepts. While semantic networks can be used to capture the relationships among co-occurring words in a single document [1,2], interactive HSC exploration requires a multi-step computational linguistic work-flow that processes unstructured text and extracts information from many documents in order to synthesize knowledge about the different concepts found in the corpus. Existing text analysis techniques often extract a set of discrete textual memes from a text corpus, which preserves neither a meme’s context, nor its relationship to other memes, nor how these relationships change throughout the corpus. To illustrate these shortcomings, let us consider a toy example of three newspaper articles that were published sequentially on the topic of “salmon” and a set of key textual memes extracted from each article: environment, cost, salmon, economy, harvest, ecology, economy, investment, and global, economy, salmon, environment, cost.

The framework’s project management allows for the inspection and validation of the intermediate text-processing steps, supports the management of large data sets, and provides data security.
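
The toy “salmon” example above can be made concrete with a small sketch. It is not SNAP’s algorithm; it only illustrates why a flat list of memes loses context, and how tracking a token across sequential articles recovers some of it. The 4/4/5 split of the thirteen memes into three articles is an assumption (the text lists them in a single run), and the function names are hypothetical.

```python
from collections import defaultdict

# ASSUMPTION: the thirteen memes are split 4/4/5 across the three articles.
# The source lists them in one run, so this grouping is illustrative only.
articles = [
    ["environment", "cost", "salmon", "economy"],
    ["harvest", "ecology", "economy", "investment"],
    ["global", "economy", "salmon", "environment", "cost"],
]


def token_positions(docs):
    """Map each token to the list of article indices in which it appears."""
    positions = defaultdict(list)
    for index, tokens in enumerate(docs):
        for token in set(tokens):
            positions[token].append(index)
    return {token: sorted(indices) for token, indices in positions.items()}


def recurring_tokens(docs):
    """Keep only tokens that appear in more than one article."""
    return {
        token: indices
        for token, indices in token_positions(docs).items()
        if len(indices) > 1
    }


def context_by_article(docs, token):
    """For one token, list the other memes it co-occurs with in each article."""
    return {
        index: sorted(set(tokens) - {token})
        for index, tokens in enumerate(docs)
        if token in tokens
    }


# Tokens that persist across articles, with the articles they appear in.
flows = recurring_tokens(articles)

# How the neighbourhood of "economy" changes from one article to the next.
context = context_by_article(articles, "economy")

for index, neighbours in context.items():
    print(f"article {index}: economy appears with {neighbours}")
```

Under the assumed grouping, “economy” appears in all three articles but with different neighbours each time, which is the kind of changing relationship that a discrete list of memes does not preserve.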

Background
Work-Flow
From Unstructured Text to Semantic Flows
Natural Language Processing
Term frequency and stop word removal
Semantic Concept
Semantic Flows
Implementation Notes
Sample Corpus Analysis
Moby Dick
Australian Broadcast Commission
Discussion and Conclusions
