Abstract

Area studies is an interdisciplinary field of the humanities adjacent to a range of other research fields, including the social sciences, natural sciences, engineering, medicine, and health. Within the humanities, researchers' primary resources have been text media such as historical records, literary works, research papers, newspapers, and magazines, which they analyze through careful reading. With the spread of the Internet, however, Web data have become an indispensable source for area studies. We have been exploring new directions for area studies based on informatics suited to big data and the Internet age. A major obstacle is the low accuracy of extracting place names from texts and identifying their locations on a map, which hinders text processing and makes it impossible to analyze big text data on the Web automatically. In our previous study, using BiLSTM-CRF and the Balanced Corpus of Contemporary Written Japanese, we achieved a recognition accuracy of approximately 0.9, which demonstrated the effectiveness of BiLSTM-CRF. That method extracted place names properly, but extraction alone is not enough to identify their locations on the map correctly. We are currently testing a simple strategy based on the assumption that the place names appearing in the same news article refer to locations close to one another. We compute the weighted average of all candidate coordinates of all place names in a news article as a pseudo center and then, for each place name, choose the candidate coordinate with the shortest distance from the pseudo center. This paper describes these methods and their results in detail.
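
The pseudo-center strategy described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the weighting scheme or the distance measure, so the sketch assumes uniform weights by default, great-circle (haversine) distance, and a plain arithmetic mean of latitude and longitude, which is only reasonable for articles confined to a small region. The function names (`haversine_km`, `pseudo_center`, `resolve`) and the approximate toy coordinates are hypothetical and are not taken from the paper.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def pseudo_center(candidates, weights=None):
    """Weighted mean of every candidate coordinate of every place name.

    candidates: dict mapping a place name to a list of (lat, lon) candidates.
    weights:    optional dict with the same shape holding one weight per
                candidate; uniform weights are assumed when omitted.
    """
    total = lat_sum = lon_sum = 0.0
    for name, coords in candidates.items():
        for i, (lat, lon) in enumerate(coords):
            w = 1.0 if weights is None else weights[name][i]
            total += w
            lat_sum += w * lat
            lon_sum += w * lon
    return (lat_sum / total, lon_sum / total)


def resolve(candidates, weights=None):
    """For each place name, pick the candidate closest to the pseudo center."""
    center = pseudo_center(candidates, weights)
    return {
        name: min(coords, key=lambda c: haversine_km(c, center))
        for name, coords in candidates.items()
    }


# Toy article: "Fuchu" is ambiguous (Tokyo or Hiroshima); the other names are not.
# Coordinates are approximate and for illustration only.
candidates = {
    "Fuchu": [(35.669, 139.478), (34.568, 133.237)],
    "Shinjuku": [(35.694, 139.703)],
    "Chofu": [(35.651, 139.541)],
}
resolved = resolve(candidates)
print(resolved["Fuchu"])
```

In this toy case the unambiguous Tokyo-area names pull the pseudo center eastward, so the Tokyo candidate is chosen for "Fuchu". A different weighting, for example one that down-weights names with many candidates, can be passed through the `weights` argument.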
