Trigram Features Research Articles

Detecting local languages in Indonesia is essential for recognizing linguistic diversity, promoting intercultural understanding, preserving endangered languages, and improving access to education and services. Identifying and documenting these languages supports preservation efforts, makes it possible to provide tailored resources for communities, and celebrates the cultural heritage of different ethnic groups. This research aims to identify the most suitable algorithm for detecting Indonesian regional languages and to gain insight into their characteristics through n-gram analysis. The study compares the performance of five algorithms (Naïve Bayes, K-nearest neighbors (KNN), least-squares, Kullback-Leibler divergence, and the Kolmogorov-Smirnov test) to determine the most accurate and efficient method for language identification. Incorporating trigram features alongside unigrams and bigrams significantly improved the model's performance, with the F1 score increasing from 0.923 to 0.959. The study found that using more features leads to better accuracy, with Naïve Bayes and KNN emerging as the top-performing algorithms for language identification.
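To make the idea of combining unigram, bigram, and trigram features concrete, here is a minimal Python sketch of n-gram language identification with a multinomial Naïve Bayes classifier. It is not the study's implementation: the abstract does not say whether the n-grams are character-level or word-level, so character n-grams are an assumption here, as are the add-one smoothing, the names `char_ngrams` and `NgramNaiveBayes`, and the invented training strings with placeholder labels `lang_a` and `lang_b`.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_max=3):
    """Return every character n-gram of length 1..n_max (assumed feature type)."""
    text = text.lower()
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


class NgramNaiveBayes:
    """Multinomial Naive Bayes over n-gram counts with additive smoothing."""

    def __init__(self, n_max=3, alpha=1.0):
        self.n_max = n_max
        self.alpha = alpha                    # smoothing constant (assumed value)
        self.counts = defaultdict(Counter)    # label -> n-gram counts
        self.doc_counts = Counter()           # label -> number of training texts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n_max)
            self.counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n_max)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counter in self.counts.items():
            total = sum(counter.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                score += math.log(
                    (counter[gram] + self.alpha)
                    / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data; the labels are placeholders, not real languages.
train_texts = ["aka taka maka", "taka saka aka", "oni moni loni", "loni roni oni"]
train_labels = ["lang_a", "lang_a", "lang_b", "lang_b"]
model = NgramNaiveBayes(n_max=3).fit(train_texts, train_labels)
print(model.predict("maka taka"))
```

Setting `n_max=1` or `n_max=2` on the same data is one way to see how the feature set grows as bigrams and trigrams are added, which is the comparison the abstract reports.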

Background: Timely data is key to effective public health responses to epidemics. Drug overdose deaths are identified in surveillance systems through ICD-10 codes present on death certificates. ICD-10 coding takes time, but free-text information is available on death certificates prior to ICD-10 coding. The objective of this study was to develop a machine learning method to classify free-text death certificates as drug overdoses to provide faster drug overdose mortality surveillance.

Methods: Using 2017–2018 Kentucky death certificate data, free-text fields were tokenized and features were created from these tokens using natural language processing (NLP). Word, bigram, and trigram features were created as well as features indicating the part-of-speech of each word. These features were then used to train machine learning classifiers on 2017 data. The resulting models were tested on 2018 Kentucky data and compared to a simple rule-based classification approach. Documented code for this method is available for reuse and extensions: https://github.com/pjward5656/dcnlp.

Results: The top scoring machine learning model achieved 0.96 positive predictive value (PPV) and 0.98 sensitivity for an F-score of 0.97 in identification of fatal drug overdoses on test data. This machine learning model achieved significantly higher performance for sensitivity (p<0.001) than the rule-based approach. Additional feature engineering may improve the model’s prediction. This model can be deployed on death certificates as soon as the free-text is available, eliminating the time needed to code the death certificates.

Conclusion: Machine learning using natural language processing is a relatively new approach in the context of surveillance of health conditions. This method presents an accessible application of machine learning that improves the timeliness of drug overdose mortality surveillance. As such, it can be employed to inform public health responses to the drug overdose epidemic in near-real time as opposed to several weeks following events.
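As a rough illustration of the word, bigram, and trigram features described in this abstract, the Python sketch below tokenizes a free-text field and counts its word n-grams. It then checks that the reported F-score follows from the reported PPV and sensitivity. This is not the code from the linked repository: the tokenization rule, the function names `word_ngram_features` and `f_score`, and the example cause-of-death text are all assumptions made for illustration, and the part-of-speech features are omitted.

```python
import re
from collections import Counter


def word_ngram_features(text, n_max=3):
    """Count word unigrams, bigrams, and trigrams in a free-text field."""
    # Assumed tokenization: lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    features = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


def f_score(ppv, sensitivity):
    """Harmonic mean of positive predictive value and sensitivity (F1)."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# Invented example text, not taken from the study's data.
features = word_ngram_features("Acute fentanyl intoxication")
for name, count in features.items():
    print(name, count)

# Reported values from the abstract: PPV 0.96, sensitivity 0.98.
print(round(f_score(0.96, 0.98), 2))
```

A three-word field yields three unigrams, two bigrams, and one trigram, so the feature space grows quickly with longer text; counts like these would typically be passed to a classifier as a sparse vector.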

Articles published on Trigram Features

Analysis of language identification algorithms for regional Indonesian languages

Sentiment Analysis on Twitter Hashtag Datasets

Sentence Classification Using N-Grams in Urdu Language Text

Public Perception of the Fifth Generation of Cellular Networks (5G) on Social Media.

Exploring multinomial naïve Bayes for Yorùbá text document classification

Enhancing timeliness of drug overdose mortality surveillance: A machine learning approach.

An improved Urdu stemming algorithm for text mining based on multi-step hybrid approach
