Mining the Web for Intelligent Problem Solving for Programmers

Xin Rong

doi:10.1145/2835776.2855091

Abstract

Programming can be hard to learn and master. Novice programmers often struggle with terminology, concepts, or competing solutions to the same problem, with little idea of how to choose the best one. Professional programmers often spend considerable time learning to use third-party libraries, APIs, or an unfamiliar piece of code. Although programmers can turn to search engines or question-and-answer websites for help, the problem-solving process often takes multiple iterations and can be time-consuming. An integrated system that can recognize a programmer's difficulties and provide contextualized solutions is thus desirable, as it may significantly reduce the manual effort required in the troubleshooting loop. Ideally, a programmer should be able to interact with such an intelligent system using natural language, much as they document code or communicate with peers. However, using automatic natural language processing techniques to address programming questions is very difficult, mainly for the following reasons: (1) terms and common expressions vary greatly across domains and individual programmers, making it difficult to associate relevant concepts with one another; (2) the solution to a user's programming problem often involves multiple steps or different resources, which calls for a deep understanding of the relations and dependencies among the possible solutions, as well as of the user's own ability to apply those solutions; (3) the documents in the training data usually mix general-domain expressions with mentions of variables, functions, and classes, as well as source code, making low-level text processing difficult; (4) evaluating the system generally requires skilled experts to provide ground truth, which is expensive and often unreliable.
We address the above difficulties and build an intelligent programming helper system by mining the massive amount of programming-related data available online, including question-and-answer websites, tutorials, blogs, and code repositories. Specifically, the study involves three important components. First, we use information extraction techniques to extract common programming tasks, issues, and solutions from the Web data, and establish connections between these extracted elements by leveraging their discrete or distributed representations (e.g., using neural embedding models). Such techniques have been shown to be useful in helping general users solve problems that require interacting with complex software applications through a natural language interface. Second, we study how to handle complicated problems that require multiple steps to solve. The existing troubleshooting instances documented online are collectively modeled as a heterogeneous network, on which random walk paths can be exploited to recommend solutions. Third, we study how to personalize the problem-solving process for users with varying levels of skill and background knowledge. In particular, each user's past adoptions of technologies and the adoption behavior in their social community can be jointly leveraged to provide appropriate recommendations of technologies, and may even promote innovations (e.g., new algorithms) in the process. Collectively, these three components form an integrated solution to computer-assisted problem solving for programmers driven by big data, and may have an impact on a range of areas, including information extraction, language modeling, natural language understanding, automatic problem solving, and social network analysis.
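
To make the second component more concrete, the following is a minimal sketch of how random walks over a heterogeneous network of issues, tasks, and solutions might be used to rank candidate solutions. It is an illustration only, not the system described in the abstract: the toy edges, the node naming scheme (`issue:`, `task:`, `solution:` prefixes), the function names, and the choice of a random walk with restart (with a restart probability of 0.15) are all assumptions introduced here.

```python
from collections import defaultdict

# Hypothetical toy network: each edge links two typed nodes that co-occur
# in some documented troubleshooting instance. The type is encoded in the
# prefix of the node name purely for illustration.
EDGES = [
    ("issue:import_error", "task:install_package"),
    ("issue:import_error", "solution:pip_install"),
    ("task:install_package", "solution:pip_install"),
    ("task:install_package", "solution:use_virtualenv"),
    ("issue:version_conflict", "solution:use_virtualenv"),
    ("issue:version_conflict", "task:install_package"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of node pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def random_walk_with_restart(graph, start, restart=0.15, iterations=50):
    """Approximate visit probabilities of a walk that restarts at `start`.

    Each iteration spreads a node's probability mass evenly over its
    neighbors, then returns a fixed fraction of the mass to `start`.
    """
    scores = {node: 0.0 for node in graph}
    scores[start] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            share = (1.0 - restart) * mass / len(graph[node])
            for neighbor in graph[node]:
                updated[neighbor] += share
        updated[start] += restart
        scores = updated
    return scores


def recommend_solutions(graph, start, top_k=2):
    """Rank solution-typed nodes by their proximity to `start`."""
    scores = random_walk_with_restart(graph, start)
    solutions = [(n, s) for n, s in scores.items() if n.startswith("solution:")]
    return sorted(solutions, key=lambda pair: pair[1], reverse=True)[:top_k]


if __name__ == "__main__":
    graph = build_graph(EDGES)
    for name, score in recommend_solutions(graph, "issue:import_error"):
        print(f"{name}\t{score:.3f}")
```

In this sketch, solutions that are reachable from the queried issue through many short paths receive higher scores, so a directly linked solution tends to outrank one that is only reachable through an intermediate task. A real system would presumably weight edges by type and frequency, which this example does not attempt.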
