Abstract

With the development of Tibetan information technology, Tibetan web crawling has become extremely important. We construct a web crawler that crawls different Tibetan websites, define page preprocessing rules tailored to each site, and save the collected Tibetan web text as Tibetan documents. Experiments show that the self-developed crawler and its preprocessing module can quickly and effectively build a large-scale Tibetan corpus, laying a foundation for Tibetan information processing technology.
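As a rough illustration of what site-specific preprocessing rules could look like, the sketch below maps each host to a pattern that extracts the article body, strips markup, and keeps only lines containing Tibetan script (Unicode block U+0F00 to U+0FFF). This is not the authors' software: the host names, the patterns, and the names `SITE_RULES` and `preprocess` are all hypothetical, and a real crawler would more likely use an HTML parser than regular expressions.

```python
import re
from html import unescape
from urllib.parse import urlparse

# Hypothetical per-site rules: host -> regex capturing the article body.
# The hosts and patterns are placeholders, not taken from the paper.
SITE_RULES = {
    "news.example.org": re.compile(r'<div class="article">(.*?)</div>', re.S),
    "blog.example.net": re.compile(r"<article>(.*?)</article>", re.S),
}
# Fallback for sites without a dedicated rule: take the whole <body>.
DEFAULT_RULE = re.compile(r"<body[^>]*>(.*?)</body>", re.S | re.I)

TAG = re.compile(r"<[^>]+>")            # any HTML tag
TIBETAN = re.compile(r"[\u0F00-\u0FFF]")  # Unicode Tibetan block


def preprocess(url, html):
    """Apply the rule for the page's host and return its Tibetan text lines."""
    host = urlparse(url).netloc
    rule = SITE_RULES.get(host, DEFAULT_RULE)
    match = rule.search(html)
    body = match.group(1) if match else html
    # Replace tags with spaces, then decode entities such as &amp;.
    text = unescape(TAG.sub(" ", body))
    # Collapse whitespace per line and drop lines with no Tibetan characters.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if TIBETAN.search(line))


if __name__ == "__main__":
    page = (
        '<html><body><div class="nav">Home</div>\n'
        '<div class="article">བོད་ཡིག</div>\n</body></html>'
    )
    print(preprocess("http://news.example.org/a", page))
```

Keeping the rules in a table keyed by host means a new site can be supported by adding one entry, while unknown sites still fall back to a generic rule.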
