From Words to Numbers: Getting Started with Text Analysis for Applied Social Scientists

Hyun Woo Kim, Hyejung Chang

doi:10.22682/bcrp.2020.3.2.122

Abstract

Objectives: Text is unstructured data found everywhere, and text analysis, or natural language processing (NLP), is a rapidly growing academic field with great potential for novel research among applied social scientists and practitioners. This paper presents a practical introduction to NLP using Python as a tool for text analysis. Methods: Starting with the installation of Python and an external library for NLP, this paper describes, with examples, a step-by-step process of data preparation, transformation, and summarization for text data. The example texts were obtained from a transcribed business meeting record of a multinational company based in Helsinki. Results: Starting from unstructured text data containing numerous irrelevant elements, the data preparation procedures of tokenization, stop-word removal, stemming, and lemmatization yielded a set of words useful for the main analyses. Next, the words were transformed into numbers using a bag-of-words method, which assigns a unique value to each word in a matrix. Finally, the matrix was used to compute a frequency summary with TF-IDF (term frequency-inverse document frequency). Conclusions: Unlike structured data, much unstructured text data is not generated for the purpose of data analysis. With the numeric data produced by the process presented in this paper, communication researchers can apply various statistical methods or machine learning algorithms. Beyond the scope of this paper, it is strongly recommended that readers study statistics and computational linguistics and acquire a working knowledge of R and/or Python for advanced text analysis.
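The pipeline summarized in the abstract (tokenization, stop-word removal, stemming, bag-of-words, TF-IDF) can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions, not the authors' code: it uses the standard library instead of the external NLP library the paper installs, the three documents are hypothetical toy sentences rather than the meeting transcript, the stop-word list is a tiny hand-made one, `crude_stem` is a rough stand-in for a real stemmer, lemmatization is omitted, and the TF-IDF weighting shown (relative term frequency times log(N/df)) is one common variant that may differ from the formula used in the paper.

```python
import math
import re
from collections import Counter

# Hypothetical toy documents (NOT the meeting transcript used in the paper).
DOCS = [
    "The meeting started and the managers discussed the budget.",
    "The managers approved the budget after a short discussion.",
    "Sales are growing in Helsinki.",
]

# Tiny illustrative stop-word list; NLP libraries ship much longer ones.
STOP_WORDS = {"the", "and", "a", "after", "in", "are"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip one common suffix; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def bag_of_words(token_lists):
    """Return the sorted vocabulary and a document-by-term count matrix."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = []
    for tokens in token_lists:
        row = [0] * len(vocab)
        for term, count in Counter(tokens).items():
            row[index[term]] = count
        matrix.append(row)
    return vocab, matrix


def tf_idf(matrix):
    """Weight counts by relative term frequency times log(N / df)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in matrix:
        total = sum(row)
        weighted.append([
            (row[j] / total) * math.log(n_docs / df[j]) if total else 0.0
            for j in range(n_terms)
        ])
    return weighted


tokens_per_doc = [preprocess(doc) for doc in DOCS]
vocab, counts = bag_of_words(tokens_per_doc)
weights = tf_idf(counts)

for term, weight in zip(vocab, weights[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In this toy corpus, a term that appears in only one document (such as the stem of "meeting") receives a higher weight in that document than a term shared across documents (such as "budget"), which is the behavior TF-IDF is meant to capture. The crude stemmer also leaves "discussed" and "discussion" as different terms, a reminder of why a proper stemmer or lemmatizer is preferable in practice.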

Full Text