Abstract

The creation and development of a large Lexical Database (LDB), which until now has mainly reused the data found in standard Machine Readable Dictionaries, has been going on in Pisa for a number of years (see Calzolari 1984, 1988, Calzolari, Picchi 1988). We are well aware that, in order to build a more powerful LDB (or even a Lexical Knowledge Base) to be used in different Computational Linguistics (CL) applications, types of information other than those usually found in machine readable dictionaries are urgently needed. Different sources of information must therefore be exploited if we want to overcome the 'lexical bottleneck' of Natural Language Processing (NLP). In a trend which is becoming increasingly relevant both in CL proper and in Literary and Linguistic Computing, we feel that very interesting data for our LDBs can be found by processing large textual corpora, where the actual usage of the language can be truly investigated. Many research projects are nowadays collecting large amounts of textual data, thus providing more and more material to be analyzed for descriptions based on measurable evidence of how language is actually used.

We ultimately aim at integrating lexical data extracted from the analysis of large textual corpora into the LDB we are implementing. These data refer, typically, to: i) complementation relations introduced by prepositions (e.g. dividere subcategorizes for a PP headed by the preposition in in one sense, and by the preposition fra in another sense); ii) lexically conditioned modification relations (una macchina potente, un farmaco potente and not forte, while un caffè forte, una moneta forte and not potente); iii) lexically significant collocations (prendere una decisione and not fare una decisione, prestare attenzione and not dare); iv) fixed phrases and idioms (donna in carriera, dottorato di ricerca, a proposito di); v) compounds (tavola calda, nave scuola).

All these types of data are a major issue of practical relevance, and are particularly problematic, in many NLP applications in different areas. They should therefore be given very large coverage in any useful LDB and, moreover, should be annotated, in a computerized lexicon, with the pertinent frequency information obtained from the processed corpus, and obviously updated from time to time. As a matter of fact, dictionaries now tend to encode all the theoretical possibilities on the same level, but if every possibility in the dictionary must be given equal weight, parsing is very difficult (Church 1988, p.3): dictionaries should provide information on what is more likely to occur, e.g. the relative likelihood of alternate parts of speech for a word or of alternate word-senses, both out of context and, if possible, taking contextual factors into account.

Statistical analyses of linguistic data were very popular in the '50s and '60s, mainly, though not only, for literary types of analysis and for studies of the lexicon (Guiraud 1959, Muller 1964, Moskovich 1977). Stochastic approaches to linguistic analysis have been strongly reevaluated in the past few years, whether for syntactic analysis (Garside et al. 1987, Church 1988), for NLP applications (Brown et al. 1988), or for semantic analysis (Zernik 1989, Smadja 1989). Quantitative (not statistical) evidence on, e.g., word-sense occurrences in a large corpus has been taken into account for lexicographic descriptions (Cobuild 1987).
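
The kind of corpus-derived frequency annotation described above can be illustrated with a minimal sketch, not the authors' actual procedure: given a lemmatized corpus, count candidate verb-noun collocations and compare the relative frequency of competing verbs for the same noun (e.g. prendere vs. fare with decisione). The toy corpus and function names below are invented for illustration only.

```python
from collections import Counter
from itertools import islice

def bigrams(tokens):
    """Yield adjacent token pairs from a lemmatized token sequence."""
    return zip(tokens, islice(tokens, 1, None))

# Toy lemmatized corpus; in practice this would be a large textual corpus.
corpus = [
    ["prendere", "decisione"], ["prendere", "decisione"],
    ["fare", "decisione"],
    ["prestare", "attenzione"], ["prestare", "attenzione"],
    ["dare", "attenzione"],
]

# Count verb-noun co-occurrences over the whole corpus.
counts = Counter(pair for sentence in corpus for pair in bigrams(sentence))

def relative_likelihood(noun, verbs):
    """Relative frequency of each candidate verb occurring with the given noun."""
    totals = {v: counts[(v, noun)] for v in verbs}
    n = sum(totals.values()) or 1
    return {v: c / n for v, c in totals.items()}

print(relative_likelihood("decisione", ["prendere", "fare"]))
# e.g. {'prendere': 0.67, 'fare': 0.33} -> prendere una decisione is preferred
```

Such relative frequencies are what would let a computerized lexicon weight competing collocations or word-senses rather than listing all theoretical possibilities on the same level.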
