Abstract

The paper discusses the use of web crawler technology. We created an application based on a standard web crawler and intended for data extraction. The application was designed primarily to extract data by keyword from the social network Twitter. First, we created a standard crawler that went through a predefined list of URLs and downloaded the page content of each URL in turn. The page content was then parsed, and important text and metadata were stored in a database. More recently, the application was converted into a multi-agent system. The system was developed in C#, a language also used to build web applications and sites. The obtained data were evaluated graphically. The system was created within the Indect project at the VSB-Technical University of Ostrava.
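The standard crawler described in the abstract (walk a predefined URL list, download each page, parse it, store text and metadata) can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the authors' C# implementation; the `pages` table layout, the `PageParser` class, and the injectable `fetch` parameter are all assumptions made for the example.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects the title, visible text, and outgoing links of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self.links = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def fetch_html(url):
    """Download one page over HTTP and decode it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(urls, db, fetch=fetch_html):
    """Visit each URL in order, parse the page, and store one row per page."""
    # Hypothetical schema: the paper does not describe its database layout.
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(url TEXT PRIMARY KEY, title TEXT, body TEXT, link_count INTEGER)"
    )
    for url in urls:
        parser = PageParser()
        parser.feed(fetch(url))
        parser.close()
        db.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
            (url, parser.title.strip(), " ".join(parser.text_parts), len(parser.links)),
        )
    db.commit()
```

Calling `crawl(url_list, sqlite3.connect("pages.db"))` would process the list sequentially. Passing a different `fetch` function lets the same loop run against cached or test pages without network access.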

Highlights

  • Browsing the code of web pages, gathering the information found in it, and searching for links to other websites are the most common tasks of robots

  • We addressed the problem of mining data from social networks such as Twitter

  • The obtained results show that the tool manages to download large amounts of data


Summary

Web Crawler

The web crawler itself is started within every agent instance. The multi-agent system is able to encapsulate any application that needs to run inside it. It may delegate part of its communication and management tasks to control elements at lower levels of the hierarchy. Such an architecture can be represented in the form of a tree (Fig. 4).

In our experiments, we found that running about 30–40 crawlers lowers the number of requests that a single crawler processes. This is caused by the manager agent not being able to handle all requests. These requests are divided into inbound and outbound: inbound requests are data returned from a crawler, and outbound requests are URLs to be crawled. This phenomenon could be observed when running about 90 crawlers, where the manager agent is overwhelmed with inbound requests and is not able to distribute new URLs to be crawled.
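The bottleneck described above can be illustrated with a toy model of the manager agent. This is only a sketch under stated assumptions, written in Python rather than the authors' C#: the fixed per-tick `capacity`, the rule that inbound results are served before outbound URLs, and the `ManagerAgent` class itself are hypothetical and are not taken from the paper. The numbers in the usage note are invented for illustration and are not measurements.

```python
from collections import deque


class ManagerAgent:
    """Toy manager that handles at most `capacity` messages per tick."""

    def __init__(self, capacity, urls):
        self.capacity = capacity
        self.inbound = deque()        # results returned by crawlers
        self.outbound = deque(urls)   # URLs waiting to be handed out
        self.stored = []              # results the manager has processed

    def receive(self, result):
        """Queue one result coming back from a crawler."""
        self.inbound.append(result)

    def tick(self):
        """Process one time slice and return the URLs dispatched in it."""
        budget = self.capacity
        # Assumption: inbound results are served before new URLs are sent.
        while budget > 0 and self.inbound:
            self.stored.append(self.inbound.popleft())
            budget -= 1
        dispatched = []
        while budget > 0 and self.outbound:
            dispatched.append(self.outbound.popleft())
            budget -= 1
        return dispatched
```

With a capacity of 3 and a single pending result, a tick still dispatches two URLs. Once four or more results are queued, the whole budget goes to inbound traffic and the tick dispatches nothing, which mirrors the situation where crawlers sit idle because the manager cannot distribute new URLs.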

Introduction
Multi-Agent System
Used Technology and Methodology
Twitter Search API
Results
Conclusion
