Abstract

The creation and development of a large Lexical Database (LDB), which until now has mainly reused the data found in standard Machine Readable Dictionaries, has been going on in Pisa for a number of years (see Calzolari 1984, 1988, Calzolari, Picchi 1988). We are well aware that, in order to build a more powerful LDB (or even a Lexical Knowledge Base) to be used in different Computational Linguistics (CL) applications, types of information other than those usually found in machine readable dictionaries are urgently needed. Different sources of information must therefore be exploited if we want to overcome the 'lexical bottleneck' of Natural Language Processing (NLP). In a trend which is becoming increasingly relevant both in CL proper and in Literary and Linguistic Computing, we feel that very interesting data for our LDBs can be found by processing large textual corpora, where the actual usage of the language can be truly investigated. Many research projects are nowadays collecting large amounts of textual data, thus providing more and more material to be analyzed for descriptions based on measurable evidence of how language is actually used.

We ultimately aim at integrating lexical data extracted from the analysis of large textual corpora into the LDB we are implementing. These data refer, typically, to: i) complementation relations introduced by prepositions (e.g. dividere subcategorizes for a PP headed by the preposition in in one sense, and by the preposition fra in another sense); ii) lexically conditioned modification relations (una macchina potente, un farmaco potente and not forte, while un caffè forte, una moneta forte and not potente); iii) lexically significant collocations (prendere una decisione and not fare una decisione, prestare attenzione and not dare); iv) fixed phrases and idioms (donna in carriera, dottorato di ricerca, a proposito di); v) compounds (tavola calda, nave scuola).

All these types of data are a major issue of practical relevance, and are particularly problematic, in many NLP applications in different areas. They should therefore be given very large coverage in any useful LDB and, moreover, should be annotated, in a computerized lexicon, with the pertinent frequency information obtained from the processed corpus, and obviously updated from time to time. As a matter of fact, dictionaries now tend to encode all the theoretical possibilities on the same level, but if every possibility in the dictionary must be given equal weight, parsing is very difficult (Church 1988, p.3): dictionaries should provide information on what is more likely to occur, e.g. the relative likelihood of alternate parts of speech for a word or of alternate word-senses, both out of context and, if possible, taking contextual factors into account.

Statistical analyses of linguistic data were very popular in the '50s and '60s, mainly, though not only, for literary types of analysis and for studies of the lexicon (Guiraud 1959, Muller 1964, Moskovich 1977). Stochastic approaches to linguistic analysis have been strongly reevaluated in the past few years, whether for syntactic analysis (Garside et al. 1987, Church 1988), for NLP applications (Brown et al. 1988), or for semantic analysis (Zernik 1989, Smadja 1989). Quantitative (not statistical) evidence on, e.g., word-sense occurrences in a large corpus has been taken into account for lexicographic descriptions (Cobuild 1987).
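
The kind of corpus-derived frequency annotation described above can be illustrated with a minimal sketch, not the authors' actual procedure: given a lemmatized corpus, count candidate verb-noun collocations and compare the relative frequency of competing verbs for the same noun (e.g. prendere vs. fare with decisione). The toy corpus and function names below are invented for illustration only.

```python
from collections import Counter
from itertools import islice

def bigrams(tokens):
    """Yield adjacent token pairs from a lemmatized token sequence."""
    return zip(tokens, islice(tokens, 1, None))

# Toy lemmatized corpus; in practice this would be a large textual corpus.
corpus = [
    ["prendere", "decisione"], ["prendere", "decisione"],
    ["fare", "decisione"],
    ["prestare", "attenzione"], ["prestare", "attenzione"],
    ["dare", "attenzione"],
]

# Count verb-noun co-occurrences over the whole corpus.
counts = Counter(pair for sentence in corpus for pair in bigrams(sentence))

def relative_likelihood(noun, verbs):
    """Relative frequency of each candidate verb occurring with the given noun."""
    totals = {v: counts[(v, noun)] for v in verbs}
    n = sum(totals.values()) or 1
    return {v: c / n for v, c in totals.items()}

print(relative_likelihood("decisione", ["prendere", "fare"]))
# e.g. {'prendere': 0.67, 'fare': 0.33} -> prendere una decisione is preferred
```

Such relative frequencies are what would let a computerized lexicon weight competing collocations or word-senses rather than listing all theoretical possibilities on the same level.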
