
Information Technology

REVIEW OF O CORPUS DO PORTUGUÊS BY MARK DAVIES AND MICHAEL J. FERREIRA

Tony Berber Sardinha
Pontifícia Universidade Católica de São Paulo

La corónica 36.1 (Fall, 2007): 283-92

The Corpus do Português is an online, fully searchable corpus of about 45 million words that spans seven centuries (the fourteenth through the twentieth). It was developed by Mark Davies (Linguistics and English Language Dept., Brigham Young University) and Michael Ferreira (Spanish and Portuguese Dept., Georgetown University) with funding from the National Endowment for the Humanities. The corpus was released in 2006 and is accessible at .1 This corpus forms part of a network of online corpora, created by Mark Davies at Brigham Young, that are available at . The various corpora in the network all share the same architecture, based on SQL databases.

1 The 'www' part of the address is necessary; without it the user is led to a totally different service.

Davies and Ferreira will be referred to here as the authors of the corpus, which is intended to mean "corpus compilers and Web designers". Ferreira was in charge of most of the historical subcorpus (fourteenth to eighteenth centuries) and Davies was responsible for the more recent subcorpus (nineteenth and twentieth centuries) as well as for the Web interface, tagging and lemmatization. This review will focus on both the corpus and the interface to access the corpus, as the two are inseparable; the corpus texts are only available, as a whole, via the corpus interface. The corpus Web interface is available in two languages (English and Portuguese). For this review, I used the English version.

The corpus is made up of more than 50,000 texts spanning seven centuries (fourteenth to twentieth), including four broad registers (oral, fiction, news and academic), from Portugal and Brazil.
It is divided into two periods, namely 1300-1899 and 1900-1999, each comprising roughly 50% of the word totals (52% and 48%, respectively; but see note about word counts below). The 1900s are further broken down into Brazilian and Portuguese texts, each with nearly identical sizes (12,009,402 words for Brazilian Portuguese, and 12,002,965 for European). At present, this is the largest multi-register online corpus of Portuguese available. Previously, the only register-diversified Web searchable corpora of Portuguese were the NILC Corpus, with 29.6 million words (http://www.linguateca.pt), the Lácio-Web, with ten million words (http://www.nilc.icmc.usp.br/lacioweb), and the one million-word sample of the Banco de Português (http://lael.pucsp.br/corpora/bp), but they were restricted to Brazilian Portuguese, were smaller in size, and lacked a diachronic component. The Corpus do Português, then, fills a gap that has long existed in the field of Portuguese corpus linguistics.

The corpus is actually made up in part of other corpora. It includes, for example, the Lácio-Web (mentioned above), the Tycho-Brahe Corpus of Historical Portuguese, and the Floresta Sintáctica, all of which still exist independently, either on or off the Web. The corpus text sources are well documented on several listings accessible from the help menu or directly at . In addition to the actual textual source, the online help provides information on the number of texts coming from each source and their size in words; for instance, we learn that the "Corpus Dialectal para o Estudo da Sintaxe" contributed 968 texts to the corpus, totaling 428,826 words. By clicking on an entry, the user is taken to another reference, which shows all the texts from a particular source, with their size in words.2 Users can also navigate to the source Web pages of some texts.
However, as often happens with hyperlinks, the addresses that they point to may no longer exist, and indeed there are lots of broken links...

2 The date of publication is also provided for some texts. Users in Portuguese-speaking countries must note that some dates are in the Month-Day-Year format, which may cause confusion (2/12 may be interpreted as 2 December when in fact it is 12 February).

