Abstract

Online news articles are updated daily, hourly, and sometimes every minute, so the volume of online news data is growing rapidly. These data form a large corpus for text mining. This research focuses on Thai personal names that appear in online news with slightly different spellings but actually refer to the same person. The news data collected from 30 July 2009 to 5 November 2009 contain many such name variations. The objective of this paper is to disambiguate Thai personal names by applying five string matching techniques: Guth, Levenshtein, Damerau-Levenshtein, Longest Common Substring, and Longest Common Subsequence. The experimental results show that Longest Common Subsequence was the most effective technique for matching Thai personal names, with an F-score of 94.43%. The two-scan labeling technique was then used to identify the unique full Thai personal names. The results show that it reduces the 6,884 distinct personal names to 830 unique personal named entities, that is, to 12.057% of the original count (a reduction of about 87.94%).
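To make the matching step concrete, the following is a minimal Python sketch of Longest Common Subsequence matching followed by grouping of matched names. It is an illustration under stated assumptions, not the paper's implementation: the normalization by the longer string, the 0.8 threshold, the function names, and the romanized sample names are all hypothetical, and a simple union-find grouping stands in for the two-scan labeling step, whose details the abstract does not give.

```python
from typing import Dict, List


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """LCS length divided by the longer string's length (an assumed normalization)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def group_names(names: List[str], threshold: float = 0.8) -> List[List[str]]:
    """Group names whose pairwise similarity reaches the (hypothetical) threshold.

    Union-find merges matched pairs into equivalence classes; this is a
    stand-in for the two-scan labeling technique named in the abstract.
    """
    parent = list(range(len(names)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if lcs_similarity(names[i], names[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[max(root_i, root_j)] = min(root_i, root_j)

    groups: Dict[int, List[str]] = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


# Hypothetical romanized names; the study itself works on Thai-script names.
sample = ["somchai jaidee", "somchay jaidee", "anan srisuk"]
print(group_names(sample))
```

Here the two spellings that differ by one character fall into one group, while the unrelated name stays alone. On Thai script, Python compares strings code point by code point, so vowel and tone marks count as separate characters; whether the paper compares at that level is not stated in the abstract.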

