Abstract

The widely deployed and easy-to-use Linguistic Inquiry and Word Count (LIWC) tool is the gold standard for computerized text analysis in many medical applications, such as patient sentiment analysis, depression detection, and ADHD detection. Unlike in most other natural language processing (NLP) tasks, large-scale data sets are often very difficult to obtain in the medical field, which makes effective automatic representation learning from complex text patterns (e.g., using a deep auto-encoder) challenging. LIWC sidesteps this problem by using a human-designed dictionary, in place of a machine learning model, to convert text into a concise and effective vector representation. However, although LIWC's dictionary is large, some potentially informative words may still be missed because of the limits of the dictionary editors' knowledge. This problem is particularly conspicuous when the analyzed text is not written in formal language (e.g., dialect, slang, or internet vocabulary). To address this problem, we propose a new matching scheme that does not require an exact word match, but instead counts all words that are similar to a key in the LIWC dictionary. This scheme is implemented using WordNet, a large lexical database, and Word2Vec, a machine-learning-based word embedding technique. The output of the proposed method is in exactly the same format as LIWC's output, thereby preserving its usability. Like previous work, the proposed method can be viewed as a combination of human domain knowledge and machine learning for text representation encoding.
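
The matching scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the category dictionary, the three-dimensional vectors, and the similarity threshold below are all hypothetical stand-ins (the real LIWC dictionary and trained Word2Vec embeddings are not reproduced here), and the WordNet lookup is omitted entirely. The sketch only shows the idea of counting a word toward a category when it is either an exact dictionary key or sufficiently close to one in embedding space, and of reporting per-category percentages of total words.

```python
import math

# Hypothetical toy dictionary: category -> key words (NOT the real LIWC dictionary).
TOY_DICTIONARY = {
    "posemo": ["happy", "good"],
    "negemo": ["sad", "bad"],
}

# Hypothetical 3-d vectors standing in for trained Word2Vec embeddings.
TOY_EMBEDDINGS = {
    "happy": (0.9, 0.1, 0.0),
    "good": (0.8, 0.2, 0.1),
    "stoked": (0.85, 0.15, 0.05),  # slang word, absent from the dictionary
    "sad": (0.1, 0.9, 0.0),
    "bad": (0.2, 0.8, 0.1),
    "bummed": (0.15, 0.85, 0.05),  # slang word, absent from the dictionary
    "table": (0.0, 0.1, 0.9),      # unrelated word
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def soft_category_percentages(tokens, dictionary, embeddings, threshold=0.95):
    """Return, per category, the percentage of tokens that match it.

    A token matches a category if it is a dictionary key of that category,
    or if its embedding has cosine similarity >= threshold (an assumed
    value) with the embedding of at least one key of that category.
    """
    counts = {category: 0 for category in dictionary}
    for token in tokens:
        for category, keys in dictionary.items():
            if token in keys:  # exact match, as in plain dictionary counting
                counts[category] += 1
                continue
            vec = embeddings.get(token)
            if vec is None:  # no embedding available: cannot soft-match
                continue
            if any(
                key in embeddings and cosine(vec, embeddings[key]) >= threshold
                for key in keys
            ):
                counts[category] += 1
    total = len(tokens)
    return {c: (100.0 * n / total if total else 0.0) for c, n in counts.items()}


if __name__ == "__main__":
    tokens = ["stoked", "happy", "bummed", "table"]
    print(soft_category_percentages(tokens, TOY_DICTIONARY, TOY_EMBEDDINGS))
```

With a threshold above 1 the soft match can never fire, so the function falls back to exact dictionary matching; comparing the two settings on the same tokens shows which counts come from similar but unlisted words.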
