RSS feeds behavior analysis, structure and vocabulary

Cedric Du Mouza, Michel Scholl, Nelly Vouzoukidou, Nicolas Travers, Vassilis Christophides, Zeinab Hmedeh

doi:10.1108/ijwis-06-2014-0023

Abstract

Purpose – This paper presents a thorough analysis of three complementary features of real-scale really simple syndication (RSS)/Atom feeds, namely publication activity, item characteristics and textual vocabulary, which the authors believe are crucial for emerging Web 2.0 applications. Previous work on the statistical characteristics of RSS/Atom feeds does not provide a precise and up-to-date characterization of feed behavior and content, a characterization that can be used to benchmark the effectiveness and efficiency of various Web syndication processing and analysis techniques.

Design/methodology/approach – The authors' empirical study relies on a large-scale testbed acquired over an eight-month campaign from 2010. They collected a total of 10,794,285 items originating from 8,155 productive feeds. The authors analyze in depth feed productivity (types and bandwidth), content (XML, text and duplicates) and textual content (vocabulary and buzz-words).

Findings – The findings of the study are as follows: 17 per cent of feeds produce 97 per cent of the items; feed publication rates are formally characterized using a modified power law; the most popular textual elements are the title and description, with an average size of 52 terms; cumulative item size follows a lognormal distribution, varying greatly with feed type; 47 per cent of the feed-published items share the same description; the vocabulary largely falls outside WordNet (only 4 per cent of terms belong to it); vocabulary growth is characterized using Heaps' law and the number of term occurrences by a stretched exponential distribution; and the ranking of frequent terms does not vary significantly.

Research limitations/implications – The results can support modeling the capacities of dedicated Web applications, defining benchmarks and optimizing Publish/Subscribe index structures.

Practical implications – The study opens many possibilities for tuning Web applications, such as: an RSS crawler designed with a resource allocator and a refreshing strategy that uses Gini values and their evolution to predict bursts for each feed, according to the category and class of the targeted feeds; an indexing structure that matches the textual content of items and takes into account item size according to the targeted feeds, the size of the vocabulary and term occurrences, updates of the vocabulary and the evolution of term ranks, and the correction of typos and misspellings; and filtering that prunes items with duplicate content across different feeds and uses term correlation to easily detect replicates.

Originality/value – A content-oriented analysis of dynamic Web information.
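As an illustration of the Gini values mentioned under practical implications, the following minimal sketch computes a Gini coefficient over the number of items a single feed publishes per time period. The abstract does not state how the authors define or compute their Gini values, so the per-period counting, the function name `gini` and the sample data are all assumptions made for illustration only.

```python
def gini(counts):
    """Gini coefficient of non-negative values.

    Returns 0.0 for a perfectly even series; values closer to 1
    indicate that a few periods account for most of the items.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula on the ascending-sorted values.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical per-day item counts for two feeds (not data from the study).
steady_feed = [5, 5, 5, 5]     # publishes evenly
bursty_feed = [0, 0, 0, 100]   # publishes everything in one burst

print(gini(steady_feed))
print(gini(bursty_feed))
```

Under this assumption, a crawler could refresh a feed with a low value on a regular schedule and reserve more adaptive polling for feeds whose value is high or rising.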

Journal: International Journal of Web Information Systems
Publication Date: Aug 12, 2014
Citations: 49
