Abstract

Web pages are not purely text, nor are they solely HTML. This paper surveys HTML web pages, examining not only their textual content but also, with particular emphasis, higher-order visual features and supplementary technologies. Using a crawler with a rendering engine developed in-house, we collect data on a pseudo-random sample of web pages. We first gather several basic attributes to verify the collection process and confirm certain assumptions about web page text. Next, we examine the distribution of different types of page content (text, images, plug-in objects, and forms) in terms of rendered visual area. We then break these content types down into a detailed view of how each is used, including the prevalence and usage of scripts and styles. We conclude that more complex page elements play a significant and underestimated role in the visually attractive, media-rich, and highly interactive web pages that are currently being added to the World Wide Web.

