The development of a hyphenator and compound analyser for Afrikaans (Die ontwikkeling van ’n woordafbreker en kompositumanaliseerder vir Afrikaans)

S. Pilon, G.B. van Huyssteen, M.J. Puttkammer

doi:10.4102/lit.v29i1.99

Abstract

This article describes the development of two core technologies for Afrikaans, viz. a hyphenator and a compound analyser. As no annotated Afrikaans data existed prior to this project to serve as training data for a machine learning classifier, the core technologies in question were first developed using a rule-based approach. In evaluation, the rule-based hyphenator obtained an f-score of 90,84%, while the rule-based compound analyser reached an f-score of only 78,20%. Since these results were somewhat disappointing and/or insufficient for practical implementation, it was decided to use a machine learning technique (memory-based learning) instead. Training data for each of the two core technologies was then developed using “TurboAnnotate”, an interface designed to improve the accuracy and speed of manual annotation. The hyphenator developed using machine learning was trained with 39 943 words and reaches an f-score of 98,11%, while the compound analyser reaches an f-score of 90,57% after being trained with 77 589 annotated words. It is concluded that machine learning (specifically memory-based learning) seems an appropriate approach for developing core technologies for Afrikaans.

Highlights

This article describes the development of two core technologies for Afrikaans: a hyphenator and a compound analyser.
Morphological analysers are used not only in text-based applications but also in speech-based applications.
Because morphological analysis is essential in most language technology applications (Daelemans et al., 2005), it is of key importance to develop a sophisticated, reusable morphological analyser for a language such as Afrikaans.

Summary

Introduction

The growth and development of a human language technology industry for a language depends on the development of core technologies (i.e. modules that are developed for specific tasks and can then be implemented in applications) for that language. Before such a morphological analyser is developed, careful thought must be given to the kinds of analyses the analyser must be able to perform. To ensure that the morphological analyser being developed is reusable (i.e. suitable for use in a variety of applications), an attempt must be made to build in functionality that would make it usable in as many applications as possible. This article describes the development of a hyphenator and a compound analyser for Afrikaans; both can be regarded as core technologies that can be implemented in an automatic morphological analyser. Consideration must also be given to the methods that will be used to develop the analyser. Section 3 describes the development of a data-driven hyphenator and compound analyser and discusses the results these modules achieved in evaluations. The article concludes with recommendations for future work that could lead to the improvement of the data-driven modules.
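To make the data-driven idea concrete, the following is a minimal sketch of memory-based (nearest-neighbour) hyphenation. It is not the authors' implementation: the character-window features, the window size, the overlap distance, the names `parse`, `windows` and `MemoryBasedHyphenator`, and the four toy training words are all assumptions chosen for illustration.

```python
from collections import Counter


def parse(annotated):
    """Split a hyphen-annotated word into (plain word, set of break offsets)."""
    breaks, offset = set(), 0
    for ch in annotated:
        if ch == "-":
            breaks.add(offset)  # a break falls before the letter at this offset
        else:
            offset += 1
    return annotated.replace("-", ""), breaks


def windows(word, breaks, size=3):
    """Yield (features, label) per letter; label is 1 if a break follows it."""
    padded = "_" * size + word + "_" * size
    for i in range(len(word)):
        # Window of `size` characters on each side of the focus letter.
        feats = tuple(padded[i : i + 2 * size + 1])
        yield feats, int(i + 1 in breaks)


class MemoryBasedHyphenator:
    """Stores all training instances and classifies by nearest neighbours."""

    def __init__(self, size=3):
        self.size = size
        self.memory = []  # list of (features, label) instances

    def train(self, annotated_words):
        for annotated in annotated_words:
            word, breaks = parse(annotated)
            self.memory.extend(windows(word, breaks, self.size))

    def _classify(self, feats):
        # Overlap distance: number of window positions that differ.
        def dist(other):
            return sum(a != b for a, b in zip(feats, other))

        best = min(dist(f) for f, _ in self.memory)
        votes = Counter(lbl for f, lbl in self.memory if dist(f) == best)
        return votes.most_common(1)[0][0]

    def hyphenate(self, word):
        pieces = []
        for i, (feats, _) in enumerate(windows(word, set(), self.size)):
            pieces.append(word[i])
            # Never place a break after the final letter.
            if i < len(word) - 1 and self._classify(feats):
                pieces.append("-")
        return "".join(pieces)


# Toy, hand-made training data (illustrative only, not from the paper).
model = MemoryBasedHyphenator()
model.train(["ta-fel", "wa-ter", "ka-mer", "le-pel"])
```

Because every training instance is kept in memory, a word seen during training is reproduced exactly; unseen words are hyphenated by analogy with the most similar stored windows. A compound analyser could plausibly be framed the same way, with compound boundaries as the labels instead of hyphenation points.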

Rule-based approach

Conclusion

Data-driven approach

Algorithm

Features

Evaluation
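The abstract reports results as f-scores. As a rough illustration of what such a figure measures, the sketch below computes precision, recall and the balanced f-score over predicted break positions. Whether the authors score individual break points or whole words is not stated in this summary, so scoring per break point is an assumption, as are the function names and the toy example words.

```python
def break_positions(annotated):
    """Return the set of letter offsets at which a hyphen-annotated word breaks."""
    positions, offset = set(), 0
    for ch in annotated:
        if ch == "-":
            positions.add(offset)
        else:
            offset += 1
    return positions


def f_score(gold, predicted):
    """Balanced f-score over break points, given parallel lists of annotated words."""
    tp = fp = fn = 0
    for gold_word, pred_word in zip(gold, predicted):
        gold_set = break_positions(gold_word)
        pred_set = break_positions(pred_word)
        tp += len(gold_set & pred_set)  # breaks found correctly
        fp += len(pred_set - gold_set)  # breaks predicted but wrong
        fn += len(gold_set - pred_set)  # breaks missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: one word is correct, one has its break in the wrong place.
score = f_score(["ta-fel", "wa-ter"], ["ta-fel", "wat-er"])
```

In the toy example one break is correct, one is spurious and one is missed, so precision and recall are both 0.5 and the f-score is 0.5.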


Journal: Literator
Publication Date: Jul 25, 2008
Citations: 5
License type: CC BY 4.0
