Abstract

Machine learning methods are widely used to identify common, repeatedly occurring bugs and code vulnerabilities. The performance of a machine-learned model is bounded by the quality and quantity of its training data and by the model's capability to extract and capture the essential information of the problem domain. Unfortunately, there is a shortage of high-quality samples for training code vulnerability detection models, and existing machine learning methods are inadequate at capturing code vulnerability patterns. We present Developer (Detecting codE VulnerabilitiEs at the Large scale by learning from OPen sourcE Repositories), a novel learning framework for building code vulnerability detection models. To address the data scarcity challenge, Developer automatically gathers training samples from open-source projects and applies constraint rules to filter out noisy data, improving the quality of the collected samples. The collected data provides many real-world vulnerable code samples that complement those available in standard vulnerability databases. To build an effective code vulnerability detection model, Developer employs a convolutional neural network architecture with attention mechanisms to extract a code representation from the program's abstract syntax tree. The extracted representation is then fed to a downstream network, a bidirectional long short-term memory (BiLSTM) architecture, to predict whether the target code contains a vulnerability. We apply Developer to identify vulnerabilities at the program source-code level. Our evaluation shows that Developer outperforms state-of-the-art methods by uncovering more vulnerabilities with a lower false-positive rate.
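
As a rough illustration of what learning from a program's abstract syntax tree takes as input, the sketch below flattens an AST into a depth-first sequence of node-type names, the kind of structural token stream a neural encoder could consume. It is an assumption-laden stand-in rather than Developer's actual pipeline: it uses Python's standard `ast` module, and the function name `ast_node_sequence` and the sample snippet are hypothetical.

```python
import ast
from typing import List


def ast_node_sequence(source: str) -> List[str]:
    """Return the AST node-type names of `source` in depth-first pre-order.

    Hypothetical helper for illustration only; not the paper's method.
    """
    tree = ast.parse(source)
    sequence: List[str] = []

    def visit(node: ast.AST) -> None:
        # Record the node type first, then descend into its children.
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return sequence


# Hypothetical sample: a function that slices a buffer by a caller-supplied length.
snippet = "def copy(buf, n):\n    return buf[:n]\n"
tokens = ast_node_sequence(snippet)
print(tokens)
```

In a full model, each node-type name would typically be mapped to an embedding vector before being passed to the representation-learning network.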
