Implementation of Web Scraping on Google Search Engine for Text Collection Into Structured 2D List

Tresna Maulana Fahrudin, Prismahardi Aji Riyantoko, Kartika Maulida Hindrayani

doi:10.31315/telematika.v20i2.9575

Open Access

https://doi.org/10.31315/telematika.v20i2.9575

Journal: Telematika
Publication Date: Jun 30, 2023
License type: CC BY-NC-SA 4.0

Abstract

Purpose: This research proposes an implementation of web scraping on the Google Search Engine to collect text into a structured 2D list.

Design/methodology/approach: Data collection through web scraping is implemented in two main stages: an HTML parsing process that extracts links (URLs) from Google Search Engine result pages, and an HTML parsing process that extracts the body text from the website page behind each collected link.

Findings/result: The input queries were chosen to reflect recent issues and news in Indonesia, for example prominent figures such as the President, the month of Ramadan and Idul Fitri, the stadium riot tragedy and natural disasters, rising prices of basic commodities, oil, and gold, as well as other news. The number of links obtained per query ranged from 56 to 151, and the processing time to obtain the links ranged from 1 minute 6.3 seconds for the fastest query to 2 minutes 49.1 seconds for the slowest. The scraped links for these queries came from Wikipedia, Detik, Kompas, the Election Supervisory Body (Bawaslu), CNN Indonesia, the General Election Commission (KPU), Pikiran Rakyat, and others.

Originality/value/state of the art: Building on previous research, this study provides an alternative way to collect links and text from web scraping results in the form of a 2D list structure. Lists in the Python programming language can store character sequences as strings, can be accessed by index, and allow text to be manipulated efficiently.
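
The two stages described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses only the standard library's `html.parser`, it works on in-memory HTML strings rather than live Google results, and it assumes each row of the 2D list holds a link and the body text found at that link. The names `LinkExtractor`, `TextExtractor`, `build_2d_list`, `fetch_page`, and the sample pages are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Stage 1: collect absolute http(s) links from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only absolute links and skip duplicates.
        if href and href.startswith(("http://", "https://")) and href not in self.links:
            self.links.append(href)


class TextExtractor(HTMLParser):
    """Stage 2: collect text content, ignoring script and style blocks."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)


def build_2d_list(results_html, fetch_page):
    """Return [[url, body_text], ...]; fetch_page maps a URL to its HTML (assumed helper)."""
    return [[url, extract_text(fetch_page(url))] for url in extract_links(results_html)]


# Hypothetical stand-ins for a search result page and the pages it links to.
SAMPLE_RESULTS = (
    '<div><a href="https://example.com/a">A</a>'
    '<a href="/local">skip</a>'
    '<a href="https://example.com/b">B</a></div>'
)
SAMPLE_PAGES = {
    "https://example.com/a": "<html><body><p>First page.</p><script>var x = 1;</script></body></html>",
    "https://example.com/b": "<html><body><h1>Second</h1><p>page text.</p></body></html>",
}

rows = build_2d_list(SAMPLE_RESULTS, SAMPLE_PAGES.get)
print(rows[0][0])  # link stored in the first row
print(rows[1][1])  # body text stored in the second row
```

Each row is addressed by index, so `rows[i][0]` is the i-th link and `rows[i][1]` is its text. In a real run, `fetch_page` would be replaced by an HTTP request function, with attention to the target sites' terms of use and request rate.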

Full Text