Abstract
Human computation has traditionally been an essential mechanism for providing training data and feedback to machine learning algorithms. Until a decade ago, human input was collected mainly from machine learning experts or via controlled user studies with small groups of people. With the rapid development of Internet technologies, human computation became applicable to problems that require large-scale training data. Crowdsourcing is a form of human computation facilitated by online platforms, which in their simplest form serve as shared marketplaces for both crowd requesters and crowd workers.

This dissertation focuses on two aspects of integrating crowdsourcing into the process of building and improving machine learning algorithms and systems. First, it studies how human supervision can be efficiently leveraged to generate training labels for new machine learning models and algorithms. Second, it explores the impact of human intervention in assisting machine learning experts with troubleshooting and improving existing systems composed of multiple machine learning components.

While crowdsourcing opens promising opportunities for supporting machine learning techniques, it also poses new challenges for both human supervision and intervention in intelligent systems. As opposed to expert input, crowdsourced data may be noisy, which lowers the quality of the collected labels and of the corresponding predictions. Noise arises from subjectivity, ambiguous task design, human error, and insufficient worker qualifications. To accommodate quality control measures that account for noise, machine learning models need to be appropriately adapted to represent and interpret crowd data sources. Moreover, due to the large size of datasets and the large design space of machine learning models, crowdsourced supervision and intervention can be costly and often infeasible. Cost optimization mechanisms are therefore necessary for scaling the crowdsourcing process and making it affordable even for complex tasks with high data requirements.

To tackle these two challenges of (possibly) noisy and costly crowd input, this thesis contributes quality control and cost optimization techniques for hybrid crowd-machine systems that learn from, and are improved by, human-generated data. The first contribution of the thesis is a crowd model, which we call the Access Path Model, that jointly addresses label aggregation and cost optimization for making new predictions. In contrast to previous work, the Access Path Model relies on group-based representations of the crowd, called access paths. This high-level abstraction allows the model to capture correlations among worker answers in addition to individual worker profiles, which enables robust decisions with meaningful confidence even in the presence of noise and sparse worker participation. Moreover, it allows for efficient crowd access optimization schemes, which plan the budget allocation across diverse access paths so as to maximize the information gain for new predictions.
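To illustrate the intuition behind group-based aggregation, the following is a minimal Python sketch, not the thesis's actual formulation of the Access Path Model: the names (AccessPath, aggregate_label), the per-path accuracy estimates, and the log-odds combination rule are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class AccessPath:
    """A group of workers reached through the same channel (illustrative)."""
    name: str
    accuracy: float  # estimated probability that this path's vote is correct
    votes: list      # binary labels (0/1) collected via this path

def aggregate_label(paths):
    """Combine per-path majority votes in log-odds space.

    Each access path contributes a single vote (its internal majority),
    weighted by the log-odds of its estimated accuracy, so that correlated
    workers within the same path are not double-counted.
    """
    log_odds = 0.0
    for p in paths:
        if not p.votes:
            continue
        majority = 1 if 2 * sum(p.votes) >= len(p.votes) else 0
        weight = math.log(p.accuracy / (1.0 - p.accuracy))
        log_odds += weight if majority == 1 else -weight
    prob_positive = 1.0 / (1.0 + math.exp(-log_odds))
    return (1 if prob_positive >= 0.5 else 0), prob_positive

# Example: three hypothetical paths with different reliabilities
paths = [
    AccessPath("experts", 0.9, [1, 1, 0]),
    AccessPath("generalists", 0.7, [0, 0, 1, 0]),
    AccessPath("search_engine", 0.6, [1]),
]
label, confidence = aggregate_label(paths)
print(label, round(confidence, 3))  # -> 1 0.853
```

Treating each path as one weighted vote is what makes the confidence meaningful under sparse participation: a single path flooded with correlated answers cannot dominate the decision.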
Closely related to this contribution, we then investigate cost optimization strategies for the early stages of collecting training data for a new model. In this context, we propose the B-Leafs algorithm, which dynamically acquires data for feature-based classification models. B-Leafs naturally trades off exploration and exploitation in crowd access decisions and overcomes the challenge of data insufficiency via model sampling and parameter credibility checks.

The Access Path Model and the B-Leafs algorithm provide quality control and cost optimization for building a single machine learning model from crowdsourced labels. Towards a deeper integration of human computation with complete intelligent systems, the final contribution of this thesis is a troubleshooting methodology for integrative computational pipelines composed of multiple machine learning components. The goal of the methodology is to guide system designers in making decisions that improve the quality of existing systems. For this purpose, the methodology uses human computation to simulate component fixes that cannot be generated otherwise. The simulated fixes are injected back into the system execution, which allows for a systematic analysis of the potential impact of individual and joint fixes on the overall system output quality. This human-assisted methodology is a powerful tool for better understanding complex systems and for prioritizing research and engineering efforts towards future system enhancements.
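As a schematic sketch of the fix-injection idea, the code below shows how human-provided (simulated) fixes could be swapped into a pipeline and their individual and joint impact measured. This is an illustration under assumptions, not the methodology's actual tooling: run_pipeline, fix_impact, oracle_fixes, and the toy accuracy metric are hypothetical names, and the exhaustive subset enumeration stands in for whatever analysis the methodology prescribes.

```python
from itertools import combinations

def run_pipeline(components, example):
    """Run a pipeline of (name, function) stages, threading state through."""
    state = dict(example)
    for _name, component in components:
        state = component(state)
    return state

def evaluate(output, gold):
    """Toy quality metric: 1.0 if the final prediction matches the gold answer."""
    return 1.0 if output.get("prediction") == gold else 0.0

def fix_impact(components, dataset, oracle_fixes):
    """Measure output quality when subsets of components are replaced by
    human-provided (simulated) fixes.

    oracle_fixes maps a component name to a function that returns the
    corrected intermediate output for a given example.
    """
    results = {}
    names = [name for name, _ in components]
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            # Patch in the oracle fix for every component in this subset.
            pipeline = [
                (name, oracle_fixes[name] if name in subset else comp)
                for name, comp in components
            ]
            score = sum(
                evaluate(run_pipeline(pipeline, example), gold)
                for example, gold in dataset
            ) / len(dataset)
            results[subset] = score
    return results
    # e.g. results[("parser",)] - results[()] isolates the marginal gain
    # of fixing one component, while larger subsets expose joint effects.
```

Comparing the score of each fixed subset against the unfixed baseline is what lets a system designer prioritize: a component whose fix yields little gain in isolation may still matter when fixed jointly with a downstream stage.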