Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web

Albert Weichselbraun

doi:10.21105/joss.03557

Open Access


Abstract

Inscriptis provides a library, command line client and Web service for converting HTML to plain text. Its development has been triggered by the need to obtain accurate text representations for knowledge extraction tasks that preserve the spatial alignment of text without drawing upon heavyweight, browser-based solutions such as Selenium. In contrast to related software packages, Inscriptis (i) provides a layout-aware conversion of HTML that more closely resembles the rendering obtained from standard Web browsers; and (ii) supports annotation rules, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes. These unique features ensure that downstream knowledge extraction components can operate on accurate text representations, and may even use information on the semantics and structure of the original HTML document.

Highlights

Inscriptis was developed to obtain accurate text representations for knowledge extraction tasks that preserve the spatial alignment of text without drawing upon heavyweight, browser-based solutions such as Selenium (Huggins et al., 2021). In contrast to related software packages, Inscriptis:
1. provides a layout-aware conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and better preserves the spatial arrangement of text elements;
2. supports annotation rules, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes used for controlling structure and layout in the original HTML document.
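
The idea of an annotation rule, a user-provided mapping from HTML tags to labels that is applied while the text is extracted, can be illustrated with a small sketch. The code below does not use Inscriptis or reproduce its API; it relies only on Python's standard html.parser module, and the names ANNOTATION_RULES, AnnotatingExtractor and annotate are hypothetical. It also makes no attempt at layout-aware rendering and simply concatenates the text nodes.

```python
from html.parser import HTMLParser

# Hypothetical rule format for illustration only: tag name -> list of labels.
ANNOTATION_RULES = {"h1": ["heading"], "b": ["emphasis"]}


class AnnotatingExtractor(HTMLParser):
    """Collect text and record (start, end, label) spans for tags named in the rules."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.chunks = []       # text pieces collected so far
        self.length = 0        # total length of the collected text
        self.open_tags = []    # stack of (tag, start offset) for tags with rules
        self.annotations = []  # resulting (start, end, label) tuples

    def handle_starttag(self, tag, attrs):
        # Remember where the text of an annotated tag begins.
        if tag in self.rules:
            self.open_tags.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the innermost open tag if it matches and emit its labels.
        if self.open_tags and self.open_tags[-1][0] == tag:
            _, start = self.open_tags.pop()
            for label in self.rules[tag]:
                self.annotations.append((start, self.length, label))

    def handle_data(self, data):
        self.chunks.append(data)
        self.length += len(data)


def annotate(html, rules):
    """Return the extracted text and the list of annotation spans."""
    parser = AnnotatingExtractor(rules)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks), parser.annotations


text, spans = annotate("<h1>Title</h1><p>Some <b>bold</b> text</p>", ANNOTATION_RULES)
for start, end, label in spans:
    print(label, repr(text[start:end]))
```

Each span refers to character offsets in the extracted text, so a downstream knowledge extraction component could use the labels without re-parsing the HTML. A layout-aware converter would additionally have to keep these offsets consistent with any whitespace it inserts to reproduce the page layout.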

Summary

Introduction

Inscriptis provides a library, command line client and Web service for converting HTML to plain text. Its development was triggered by the need to obtain accurate text representations for knowledge extraction tasks that preserve the spatial alignment of text without drawing upon heavyweight, browser-based solutions such as Selenium (Huggins et al., 2021).

Journal: Journal of Open Source Software
Publication Date: Oct 16, 2021
Citations: 5
License type: CC BY 4.0
