Abstract

Web log files, storing user activity on a server, may grow at the pace of hundreds of megabytes a day, or even more, on popular sites. They are usually archived, as this enables further analysis, e.g., for detecting attacks or other server abuse patterns. In this work we present a specialized lossless Apache web log preprocessor and test it in combination with several popular general-purpose compressors. Our method works on individual fields of log data (each storing information such as the client's IP address, date/time, requested file or query, download size in bytes, etc.), and utilizes compression techniques such as finding and extracting common prefixes and suffixes, dictionary-based phrase sequence substitution, move-to-front coding, and more. The test results show the proposed transform improves the average compression ratio 2.70 times in the case of gzip and 1.86 times in the case of bzip2.
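
To make the field-oriented idea concrete, below is a minimal sketch, assuming a simplified Apache combined log format: it splits one log line into fields and applies move-to-front coding to a per-field token stream. It illustrates two of the techniques named above and is not the authors' implementation; the regex and the open-alphabet MTF variant are assumptions made for the example.

import re

# Apache Combined Log Format: IP, identity, user, [timestamp], "request",
# status, size, "referer", "user agent". This regex is a simplification.
LOG_PATTERN = re.compile(
    r'(\S+) (\S+) (\S+) \[(.*?)\] "(.*?)" (\d+) (\S+) "(.*?)" "(.*?)"'
)

def split_fields(line):
    """Split one log line into its fields; returns None on a malformed line."""
    m = LOG_PATTERN.match(line)
    return list(m.groups()) if m else None

def mtf_encode(tokens):
    """Move-to-front coding (open-alphabet variant: a first occurrence is
    emitted as the current list length). Frequently repeated tokens get
    small indices, which gzip or bzip2 then compress well."""
    alphabet = []          # recency list, most recently seen token first
    out = []
    for tok in tokens:
        if tok in alphabet:
            i = alphabet.index(tok)
            alphabet.pop(i)
        else:
            i = len(alphabet)
        out.append(i)
        alphabet.insert(0, tok)
    return out

line = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0200] '
        '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"')
print(split_fields(line))
# Column-wise processing: collect, e.g., the referer field across many lines
# and MTF-code that stream; repeated values become runs of small numbers.
print(mtf_encode(["a.com", "b.com", "a.com", "a.com", "b.com"]))  # [0, 1, 1, 0, 1]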

Highlights

  • Plain text, as a medium for data conveyance and storage, is living its second youth

  • Later variants add a dictionary substitution for words found in a prepass, and compact encoding of numbers, dates, times and IP addresses (a sketch of the IP packing idea follows this list)

  • The current work is an extension of our previous attempts [2,3] to design a transform suitable for efficient compression of web log files
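
As an illustration of the compact-encoding highlight, the following sketch packs a dotted-decimal IPv4 address into four raw bytes. The paper does not disclose its exact binary layout here, so this scheme is an assumption; dates, times and plain numbers can be packed analogously (e.g., a timestamp as seconds since an epoch).

import struct

def encode_ipv4(text):
    """Pack 'a.b.c.d' (7-15 bytes of text) into 4 raw bytes."""
    parts = [int(p) for p in text.split(".")]
    if len(parts) != 4 or not all(0 <= p <= 255 for p in parts):
        raise ValueError("not a dotted-decimal IPv4 address")
    return struct.pack("4B", *parts)

def decode_ipv4(data):
    return ".".join(str(b) for b in struct.unpack("4B", data))

packed = encode_ipv4("203.0.113.7")
assert decode_ipv4(packed) == "203.0.113.7"
print(len("203.0.113.7"), "->", len(packed), "bytes")  # 11 -> 4 bytes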

Summary

INTRODUCTION

Plain text, as a medium for data conveyance and storage, is living its second youth. It is enough to mention the XML format and web languages (HTML, XHTML, CSS, web scripts, etc.) to support this claim, but a more complete list should include DNA and protein sequence databases, mail folders, plain-text newsgroup archives, IRC archives, and so on. Redundancy increases the costs of data transmission and storage, and can also slow down query handling. It should be stressed that specialized methods, even if limited to text preprocessing before running a general-purpose compressor, can achieve compression ratios significantly better than universal compression algorithms, with more or less retained (or even decreased) computational requirements for data encoding and decoding [1]. Prior to handling any queries, the log archive must be decompressed. This is a disadvantage, but on the other hand, non-queriable compression algorithms reach better compression ratios and are simpler. A side goal of the current work is to stress how inappropriate the widely used (in log storage and analysis systems) Deflate method is when the data to compress are typical large log files. A preliminary version of the current work was presented in [3].
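
The kind of baseline comparison this side goal rests on can be reproduced in a few lines: compress the same raw log with Deflate (the algorithm behind gzip and zlib) and with bzip2, and compare the ratios. This is only a measurement sketch; "access.log" is a placeholder path, and the levels shown are the libraries' maximum settings.

import bz2, zlib
from pathlib import Path

data = Path("access.log").read_bytes()
for name, packed in [("deflate", zlib.compress(data, 9)),
                     ("bzip2", bz2.compress(data, 9))]:
    print(f"{name}: {len(data) / len(packed):.2f}x")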

SOURCES OF REDUNDANCY IN WEB LOGS
RELATED WORK
APACHE WEB LOG FORMAT
OUR ALGORITHM
EXPERIMENTAL RESULTS
CONCLUSIONS AND FUTURE WORK