Abstract

The availability of huge amounts of textual data from a bewildering variety of sources leads to a well-known paradox: an overload of information yields no usable knowledge. In fact, up to 80% of electronic data is textual. Moreover, the most valuable information is encoded in pages that are written in various native languages but are relevant even to non-native speakers. The process of accessing all these raw data, heterogeneous in language, and transforming them into information is therefore inextricably linked to the concepts of textual analysis and synthesis, and hinges greatly on the ability to master the problems of multilingualism. Through multilingual text mining, users can get an overview of large volumes of textual data by means of a highly readable grid, which helps them discover meaningful similarities among documents and find all related information. This paper describes the approach used by SYNTHEMA for multilingual text mining, showing the classification results on around 600 breaking news items written in English, Italian and French.

1 Multilingual resources construction

Generally speaking, the manual construction and maintenance of multilingual language resources is expensive and requires considerable effort. SYNTHEMA was established in 1994 by computer scientists from the IBM Research Center, with the expertise and skills to provide effective software solutions and to carry out R&D in the Natural Language Processing area. It has been involved in Machine Translation, Information Extraction and Text Mining activities since 1996, primarily in the field of Technology Watch. The growing availability of comparable and parallel corpora has pushed SYNTHEMA to develop specific methods for the semi-automatic updating of lexical resources. These methods are based on Natural Language Understanding and Machine Learning.
These techniques detect multilingual lexicons from such corpora, by extracting all the

© 2005 WIT Press, WIT Transactions on Information and Communication Technologies, Vol 35, www.witpress.com, ISSN 1743-3517 (on-line), Data Mining VI
