Abstract

This study proposes a new approach to sentence tokenization. Conventional tokenization splits a sentence on spaces, so every token is a single word: a five-word sentence yields five one-word tokens. This process can lose the original meaning of multi-word expressions once their words are separated. Our proposed framework generates one-word tokens and multi-word tokens at the same time by extracting the sentence structure to obtain sentence elements, each of which becomes a token. There are five sentence elements: Subject, Predicate, Object, Complement, and Adverb. We extract sentence structures with a deep learning model trained on a dataset prepared for this purpose. The training results are reasonably good, with an F1 score of 0.7, and there is still room for improvement. Sentence similarity is used as the task for comparing the performance of one-word tokens with multi-word tokens; in this comparison, the multi-word tokens achieve higher accuracy. The framework was built for Indonesian but can be applied to other languages by adjusting the dataset.
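To make the contrast concrete, the minimal sketch below compares space-based tokenization with element-based tokenization on a short Indonesian sentence. The example sentence, its element spans, and their labels are hand-written for illustration only; in the paper these elements come from the trained deep learning model.

```python
# Contrast between the two tokenization outputs described above.
# The element labeling below is a hypothetical, hand-labeled illustration;
# the proposed framework obtains it from a trained deep learning model.

sentence = "Adik saya membaca buku cerita di perpustakaan"

# Conventional space-based tokenization: every word becomes its own token.
word_tokens = sentence.split()
# ['Adik', 'saya', 'membaca', 'buku', 'cerita', 'di', 'perpustakaan']

# Element-based tokenization: each sentence element (Subject, Predicate,
# Object, Complement, Adverb) becomes one token, so multi-word phrases
# such as "buku cerita" stay together.
element_spans = [
    ("Subject",   "Adik saya"),
    ("Predicate", "membaca"),
    ("Object",    "buku cerita"),
    ("Adverb",    "di perpustakaan"),
]
element_tokens = [span for _, span in element_spans]
# ['Adik saya', 'membaca', 'buku cerita', 'di perpustakaan']

print(word_tokens)
print(element_tokens)
```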
