Input Sanitization Research Articles

Due to limited time and resources, web software engineers need support in identifying vulnerable code. A practical approach to predicting vulnerable code would enable them to prioritize security auditing efforts. In this paper, we propose using a set of hybrid (static+dynamic) code attributes that characterize input validation and input sanitization code patterns and are expected to be significant indicators of web application vulnerabilities. Because static and dynamic program analyses complement each other, both techniques are used to extract the proposed attributes in an accurate and scalable way. Current vulnerability prediction techniques rely on the availability of data labeled with vulnerability information for training. For many real-world applications, past vulnerability data is often unavailable or at least incomplete. Hence, to address both the situation where labeled past data is fully available and the situation where it is not, we apply both supervised and semi-supervised learning when building vulnerability predictors based on hybrid code attributes. Given that semi-supervised learning is entirely unexplored in this domain, we describe how to use this learning scheme effectively for vulnerability prediction. We performed empirical case studies on seven open-source projects where we built and evaluated supervised and semi-supervised models. When cross-validated with fully available labeled data, the supervised models achieve an average of 77 percent recall and 5 percent probability of false alarm for predicting SQL injection, cross-site scripting, remote code execution, and file inclusion vulnerabilities. With a low amount of labeled data, the semi-supervised model showed, on average, 24 percent higher recall and 3 percent lower probability of false alarm than the supervised model, suggesting that semi-supervised learning may be a preferable solution for many real-world applications where vulnerability data is missing.
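The two measures quoted in this abstract, recall and probability of false alarm, can be illustrated with a minimal sketch. It assumes the definitions commonly used in prediction studies (recall = TP / (TP + FN), probability of false alarm = FP / (FP + TN)); the abstract itself does not spell them out, and the function names and sample labels below are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP, TN for binary labels (1 = vulnerable, 0 = not)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fn, fp, tn


def recall(tp, fn):
    """Share of truly vulnerable items that the predictor flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def prob_false_alarm(fp, tn):
    """Share of non-vulnerable items that the predictor wrongly flagged."""
    return fp / (fp + tn) if (fp + tn) else 0.0


# Hypothetical labels, for illustration only (not data from the paper).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(actual, predicted)
print(f"recall = {recall(tp, fn):.2f}")
print(f"pf     = {prob_false_alarm(fp, tn):.2f}")
```

A good predictor pushes recall toward 1 while keeping the probability of false alarm near 0, which is the trade-off the reported averages describe.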

String-manipulating programs are an important class of programs with applications in malware detection, graphics, input sanitization for Web security, and large-scale HTML processing. This paper extends prior work on BEK, an expressive domain-specific language for writing string-manipulating programs, with algorithmic insights that make BEK both analyzable and data-parallel. By analyzable we mean that, unlike in most general-purpose programming languages, many algebraic properties of a BEK program are decidable (e.g., one can check whether two programs commute or compute the inverse of a program). By data-parallel we mean that a BEK program can compute on arbitrary subsections of its input in parallel, thus exploiting parallel hardware. This latter requirement is particularly important for programs that operate on large data: without data parallelism, a programmer cannot hide the latency of reading data from various storage media (e.g., reading a terabyte of data from a modern hard drive takes about 3 hours). With a data-parallel approach, the system can split data across multiple disks and thus hide the latency of reading the data. A BEK program is expressive: a programmer can use conditionals, switch statements, and registers (or local variables) to implement common string-manipulating programs. Unfortunately, this expressivity induces data dependencies, which are an obstacle to parallelism. The key contribution of this paper is an algorithm that automatically removes these data dependencies by mapping a BEK program into an intermediate format consisting of symbolic transducers, which extend classical transducers with symbolic predicates and symbolic assignments. We present a novel algorithm, which we call exploration, that performs symbolic loop unrolling of these transducers to obtain simplified versions of the original program. We show how these simplified versions can then be lifted to a stateless form, and from there compiled to data-parallel hardware.
To evaluate the efficacy of our approach, we demonstrate up to 8x speedups for a number of real-world BEK programs (e.g., HTML encoder and decoder) on data-parallel hardware. To the best of our knowledge, these are the first data-parallel implementations of these programs. To validate that our approach is correct, we use an automatic testing technique to compare our generated code to the original implementations and find no semantic deviations.
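The following sketch illustrates why a stateless string transformation is data-parallel. It is not BEK and does not use symbolic transducers; it assumes Python's standard `html.escape` as a stand-in for an HTML encoder, and the helpers `split_chunks` and `encode_parallel` are hypothetical names introduced for illustration. Because each output fragment depends only on the current input character, the input can be cut at arbitrary positions, each piece encoded independently, and the results concatenated.

```python
import html
from concurrent.futures import ThreadPoolExecutor


def split_chunks(text, n_chunks):
    """Cut text into at most n_chunks contiguous pieces at arbitrary offsets."""
    size = max(1, -(-len(text) // n_chunks))  # ceiling division
    return [text[i:i + size] for i in range(0, len(text), size)]


def encode_parallel(text, n_chunks=4):
    """HTML-encode each chunk independently, then concatenate in order."""
    chunks = split_chunks(text, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # pool.map preserves input order, so a plain join is enough.
        encoded = list(pool.map(html.escape, chunks))
    return "".join(encoded)


sample = '<a href="x">Tom & Jerry\'s</a>' * 10

# Chunked encoding agrees with encoding the whole string at once.
assert encode_parallel(sample, n_chunks=3) == html.escape(sample)
print(encode_parallel("<b>&</b>", n_chunks=2))
```

The thread pool here only demonstrates the correctness property, not a speedup. The same trick does not apply directly to a decoder, since an entity such as `&amp;` may straddle a chunk boundary; that kind of dependency is what the paper's lifting to a stateless form is meant to eliminate.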

Articles published on Input Sanitization

Web Application Vulnerability Prediction Using Hybrid Program Analysis and Machine Learning

Data-Parallel String-Manipulating Programs

Predicting SQL injection and cross site scripting vulnerabilities through mining input sanitization patterns

Structural Learning of Attack Vectors for Generating Mutated XSS Attacks

Sound and precise analysis of web applications for injection vulnerabilities

