Development of an algorithm for obtaining data from thematic internet resources

S Mambetov, Ye Begimbayeva, S Joldasbayev, A Khikmetov
Al-Farabi Kazakh National University; International IT University; Satbayev University

doi:10.47533/2023.1606-146x.6

Abstract

With the rapid development of the Internet, users actively share personal data and other information on many social networks. Information on the Internet should be analyzed to make sure that it is reliable and does not pose a threat to the public, which creates a need to collect, monitor and analyze it. Data collection is a complex task because it depends on the structure of each web page. Because not all resources permit automated collection, several methods have to be combined. This article presents effective ways of using syntactic analysis to obtain information. The method of syntactic analysis (parsing) of web page contents is explained using a program written in Python with the BeautifulSoup library. The article also covers methods of collecting information through APIs and through tools that emulate user behavior in the browser. An algorithm for extracting information from thematic Internet resources using the BeautifulSoup and Requests libraries is presented. As a result, information was obtained from English- and Russian-speaking hacker and carding forums.
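The parsing approach described in the abstract can be illustrated with a minimal sketch using Requests and BeautifulSoup. This is not the authors' program: the function names `fetch_html` and `extract_posts`, the CSS selectors (`div.post`, `.author`, `.content`), the User-Agent string, and the sample markup are all assumptions made for illustration, since real forums each have their own page structure.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML text (hypothetical helper)."""
    # The User-Agent value is a placeholder chosen for this sketch.
    headers = {"User-Agent": "example-collector/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_posts(html):
    """Parse forum-style HTML into a list of {"author", "text"} dicts.

    The selectors below are assumptions; they must be adapted to the
    structure of each target page.
    """
    soup = BeautifulSoup(html, "html.parser")  # built-in parser backend
    posts = []
    for node in soup.select("div.post"):
        author = node.select_one(".author")
        body = node.select_one(".content")
        if author is None or body is None:
            continue  # skip blocks that do not match the assumed layout
        posts.append({
            "author": author.get_text(strip=True),
            "text": body.get_text(" ", strip=True),
        })
    return posts


# Invented sample markup so the parsing step can run without network access.
SAMPLE_HTML = """
<div class="post"><span class="author">alice</span>
  <div class="content">First <b>message</b></div></div>
<div class="post"><span class="author">bob</span>
  <div class="content">Second message</div></div>
<div class="post"><div class="content">no author here</div></div>
"""

if __name__ == "__main__":
    for post in extract_posts(SAMPLE_HTML):
        print(post)
```

In practice `fetch_html(url)` would supply the input to `extract_posts`. Keeping download and parsing separate lets the parsing logic be checked on saved pages before any live resource is contacted.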

Full Text