Orthographic Errors in Web Pages: Toward Cleaner Web Corpora

Christoph Ringlstetter, Stoyan Mihov, Klaus U. Schulz

doi:10.1162/coli.2006.32.3.295

Journal: Computational Linguistics
Publication Date: Sep 1, 2006
Citations: 42

Affiliation: Deutsche Forschungsgemeinschaft, Bulgarian Academy of Sciences, Ludwig-Maximilians-Universität München

#Orthographic Errors #Errors In Corpora #Natural Language Tools #Repository Of Texts #Repository Of Tools #Web Pages #Area Of Linguistics #Recent Experiments #High Accuracy #Distribution Of Errors

Abstract

Because the Web is by far the largest public repository of natural language texts, recent experiments, methods, and tools in corpus linguistics often use the Web as a corpus. For applications where high accuracy is crucial, one has to face the problem that a non-negligible number of orthographic and grammatical errors occur in Web documents. In this article we investigate the distribution of orthographic errors of various types in Web pages. As a by-product, we develop methods for efficiently detecting erroneous pages and for marking orthographic errors in acceptable Web documents, thus reducing the number of errors in corpora and linguistic knowledge bases automatically retrieved from the Web.
