Abstract

Web log files, storing user activity on a server, may grow at the pace of hundreds of megabytes a day, or even more, on popular sites. They are usually archived, as this enables further analysis, e.g., for detecting attacks or other server abuse patterns. In this work we present a specialized lossless Apache web log preprocessor and test it in combination with several popular general-purpose compressors. Our method works on individual fields of log data (each storing information such as the client's IP address, date/time, requested file or query, download size in bytes, etc.), and utilizes compression techniques such as finding and extracting common prefixes and suffixes, dictionary-based phrase sequence substitution, move-to-front coding, and more. The test results show the proposed transform improves the average compression ratio 2.70 times in the case of gzip and 1.86 times in the case of bzip2.
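
To make the field-oriented idea concrete, below is a minimal sketch, assuming a simplified Apache combined log format: it splits one log line into fields and applies move-to-front coding to a per-field token stream. It illustrates two of the techniques named above and is not the authors' implementation; the regex and the open-alphabet MTF variant are assumptions made for the example.

import re

# Apache Combined Log Format: IP, identity, user, [timestamp], "request",
# status, size, "referer", "user agent". This regex is a simplification.
LOG_PATTERN = re.compile(
    r'(\S+) (\S+) (\S+) \[(.*?)\] "(.*?)" (\d+) (\S+) "(.*?)" "(.*?)"'
)

def split_fields(line):
    """Split one log line into its fields; returns None on a malformed line."""
    m = LOG_PATTERN.match(line)
    return list(m.groups()) if m else None

def mtf_encode(tokens):
    """Move-to-front coding (open-alphabet variant: a first occurrence is
    emitted as the current list length). Frequently repeated tokens get
    small indices, which gzip or bzip2 then compress well."""
    alphabet = []          # recency list, most recently seen token first
    out = []
    for tok in tokens:
        if tok in alphabet:
            i = alphabet.index(tok)
            alphabet.pop(i)
        else:
            i = len(alphabet)
        out.append(i)
        alphabet.insert(0, tok)
    return out

line = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0200] '
        '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"')
print(split_fields(line))
# Column-wise processing: collect, e.g., the referer field across many lines
# and MTF-code that stream; repeated values become runs of small numbers.
print(mtf_encode(["a.com", "b.com", "a.com", "a.com", "b.com"]))  # [0, 1, 1, 0, 1]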

Highlights

  • Plain text, as a medium for data conveyance and storage, is living its second youth

  • Later variants add a dictionary substitution for words found in a prepass, and compact encoding of numbers, dates, times and IP addresses (a sketch of the IP packing idea follows this list)

  • The current work is an extension of our previous attempts [2,3] to design a transform suitable for efficient compression of web log files
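
As an illustration of the compact-encoding highlight, the following sketch packs a dotted-decimal IPv4 address into four raw bytes. The paper does not disclose its exact binary layout here, so this scheme is an assumption; dates, times and plain numbers can be packed analogously (e.g., a timestamp as seconds since an epoch).

import struct

def encode_ipv4(text):
    """Pack 'a.b.c.d' (7-15 bytes of text) into 4 raw bytes."""
    parts = [int(p) for p in text.split(".")]
    if len(parts) != 4 or not all(0 <= p <= 255 for p in parts):
        raise ValueError("not a dotted-decimal IPv4 address")
    return struct.pack("4B", *parts)

def decode_ipv4(data):
    return ".".join(str(b) for b in struct.unpack("4B", data))

packed = encode_ipv4("203.0.113.7")
assert decode_ipv4(packed) == "203.0.113.7"
print(len("203.0.113.7"), "->", len(packed), "bytes")  # 11 -> 4 bytes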

Summary

INTRODUCTION

Plain text, as a medium for data conveyance and storage, is living its second youth. It is enough to mention the XML format and web languages (HTML, XHTML, CSS, web scripts, etc.) to support this claim, but a more complete list should include DNA and protein sequence databases, mail folders, plain-text newsgroup archives, IRC archives, and so on. Redundancy increases the costs of data transmission and storage, and can also slow down query handling. It should be stressed that specialized methods, even if limited to text preprocessing before running a general-purpose compressor, can achieve compression ratios significantly better than universal compression algorithms, with more or less retained (or even decreased) computational requirements for data encoding and decoding [1]. Prior to handling any queries, the log archive must be decompressed. This is a disadvantage, but on the other hand, non-queriable compression algorithms reach better compression ratios and are simpler. A side goal of the current work is to stress how inappropriate the widely used (in log storage and analysis systems) Deflate method is when the data to compress are typical large log files. A preliminary version of the current work was presented in [3].
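
The kind of baseline comparison this side goal rests on can be reproduced in a few lines: compress the same raw log with Deflate (the algorithm behind gzip and zlib) and with bzip2, and compare the ratios. This is only a measurement sketch; "access.log" is a placeholder path, and the levels shown are the libraries' maximum settings.

import bz2, zlib
from pathlib import Path

data = Path("access.log").read_bytes()
for name, packed in [("deflate", zlib.compress(data, 9)),
                     ("bzip2", bz2.compress(data, 9))]:
    print(f"{name}: {len(data) / len(packed):.2f}x")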

SOURCES OF REDUNDANCY IN WEB LOGS
RELATED WORK
APACHE WEB LOG FORMAT
OUR ALGORITHM
EXPERIMENTAL RESULTS
CONCLUSIONS AND FUTURE WORK