Building semantic data populations from unstructured data or text is challenging. This type of data raises several problems, some of which are difficult to analyze. Some groups of words or expressions cannot be defined according to their meaning and are a source of ambiguity, because the same expression can have a different meaning depending on the context of its use. This work aims to automatically annotate Indonesian-language text, especially phrases, using existing knowledge bases. The result is text with semantic markup, which machines can process automatically because the markup describes the meaning of the text. This work applies an n-gram language model to identify meaningful phrases and treats each one as a single unit, so that every word or phrase is semantically tagged automatically. The knowledge bases used are DBpedia and schema.org. The percentage of successfully labeled data in this work was 78% with 84.95% accuracy using DBpedia, and 5.9% with 97.46% accuracy using schema.org. Several factors affect the accuracy score, including how well the required data matches the data available in the knowledge base, the system's performance in the POS tagging process, and the many new terms and local cultural concepts not yet covered by the knowledge bases, especially schema.org, which is used as a standard by major search engines. This work will help machines understand the semantics of text data: all pages obtained will be semantically tagged and can therefore be understood by machines. This ability will support subsequent processing.
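The phrase-as-a-unit idea can be illustrated with a minimal sketch. It does not reproduce the n-gram language model used in this work; it substitutes a simpler greedy longest-match lookup over word n-grams. The lexicon, its type labels, and the function name `tag_phrases` are hypothetical stand-ins for a real DBpedia or schema.org lookup.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon (lowercase phrase -> knowledge-base type).
# The entries are illustrative only; a real system would query DBpedia
# or schema.org instead of a hard-coded dictionary.
TOY_LEXICON: Dict[str, str] = {
    "candi borobudur": "dbo:Place",
    "jawa tengah": "dbo:Place",
    "candi": "dbo:Building",
}


def tag_phrases(
    tokens: List[str],
    lexicon: Dict[str, str],
    max_n: int = 3,
) -> List[Tuple[str, Optional[str]]]:
    """Greedy longest-match tagging over word n-grams.

    At each position, try the longest n-gram first (up to max_n words).
    The first n-gram found in the lexicon is emitted as one tagged unit;
    if none matches, the single word is emitted with tag None.
    """
    tagged: List[Tuple[str, Optional[str]]] = []
    i = 0
    while i < len(tokens):
        longest = min(max_n, len(tokens) - i)
        for n in range(longest, 0, -1):
            span = tokens[i:i + n]
            key = " ".join(span).lower()
            if key in lexicon:
                tagged.append((" ".join(span), lexicon[key]))
                i += n
                break
        else:
            # No n-gram starting here is known: keep the word untagged.
            tagged.append((tokens[i], None))
            i += 1
    return tagged


sentence = "Candi Borobudur terletak di Jawa Tengah".split()
result = tag_phrases(sentence, TOY_LEXICON)
```

Trying longer n-grams first means "Candi Borobudur" is tagged as one unit rather than as the separate word "Candi", which is the kind of ambiguity that treating phrases as units is meant to avoid.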