Development of a large-scale web crawler and search engine infrastructure

Susumu Akamine, Kentaro Inui, Sadao Kurohashi, Yoshikiyo Kato, Daisuke Kawahara, Yutaka Kidawara, Keiji Shinzato

doi:10.1145/1667780.1667805

Publication Date: Dec 3, 2009

Citations: 14

Affiliation: National Institute of Information and Communications Technology, Nara Institute of Science and Technology, Kyoto University

#Large-scale Web Search Engine #Large-scale Web Search

Abstract

This paper reports on the ongoing development of a large-scale Web crawler and search engine infrastructure at the National Institute of Information and Communications Technology. The infrastructure has the following characteristics: (1) It collects one billion Japanese Web pages and keeps them up to date. (2) It selects 100 million of the collected pages and converts them into a standard data format that stores the results of morphological analysis, dependency parsing, and synonym augmentation. (3) The selected pages are searchable and accessible to users. (4) Scalability is achieved by using a large-scale cluster machine for distributed data processing.
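The abstract does not spell out the standard data format or how work is distributed across the cluster, so the following is only a minimal sketch of how such a pipeline might be organized. Every name here (`Sentence`, `PageRecord`, `assign_worker`, `partition`), the JSON serialization, and the hash-based sharding are assumptions made for illustration, not the authors' design.

```python
from dataclasses import dataclass, field, asdict
import hashlib
import json


@dataclass
class Sentence:
    """One sentence plus the three annotation layers named in the abstract.

    The field layout is hypothetical; the abstract does not define it.
    """
    text: str
    morphemes: list = field(default_factory=list)     # results of morphological analysis
    dependencies: list = field(default_factory=list)  # (dependent index, head index) pairs
    synonyms: dict = field(default_factory=dict)      # term -> list of synonym strings


@dataclass
class PageRecord:
    """Hypothetical per-page record standing in for the 'standard data format'."""
    url: str
    sentences: list = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested Sentence dataclasses.
        return json.dumps(asdict(self), ensure_ascii=False)


def assign_worker(url: str, num_workers: int) -> int:
    """Map a URL to a worker index with a stable hash (assumed sharding scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers


def partition(urls, num_workers: int):
    """Split a list of URLs into one shard per worker."""
    shards = [[] for _ in range(num_workers)]
    for url in urls:
        shards[assign_worker(url, num_workers)].append(url)
    return shards
```

A stable hash is used rather than Python's built-in `hash()` so that the same URL maps to the same worker across processes and runs, which matters when pages are re-crawled to keep them up to date.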
