Crowdsourced Data Management: Industry and Academic Perspectives

Abstract

Crowdsourcing and human computation enable organizations to accomplish tasks that are currently not possible for fully automated techniques to complete, or require more flexibility and scalability than traditional employment relationships can facilitate. In the area of data processing, companies have benefited from crowd workers on platforms such as Amazon’s Mechanical Turk or Upwork to complete tasks as varied as content moderation, web content extraction, entity resolution, and video/audio/image processing. Several academic researchers from diverse areas ranging from the social sciences to computer science have embraced crowdsourcing as a research area, resulting in algorithms and systems that improve crowd work quality, latency, or cost. Given the relative nascence of the field, the academic and the practitioner communities have largely operated independently of each other for the past decade, rarely exchanging techniques and experiences. In this book, we aim to narrow the gap between academics and practitioners. On the academic side, we summarize the state of the art in crowd-powered algorithms and system design tailored to large-scale data processing. On the industry side, we survey 13 industry users (e.g., Google, Facebook, Microsoft) and 4 marketplace providers of crowd work (e.g., CrowdFlower, Upwork) to identify how hundreds of engineers and tens of millions of dollars are invested in various crowdsourcing solutions. Throughout the book, we hope to simultaneously introduce academics to real problems that practitioners encounter every day, and provide a survey of the state of the art for practitioners to incorporate into their designs. Through our surveys, we also highlight the fact that crowd-powered data processing is a large and growing field. Over the next decade, we believe that most technical organizations will in some way benefit from crowd work, and hope that this book can help guide the effective adoption of crowdsourcing across these organizations.

Similar Papers
  • Research Article
  • Cited by 62
  • 10.1016/j.compchemeng.2019.04.028
Process Systems Engineering: Academic and industrial perspectives
  • Apr 27, 2019
  • Computers & Chemical Engineering
  • Ignacio E Grossmann + 1 more


  • Conference Article
  • 10.1109/icrito.2016.7784951
Generalized classification rules for entity identification
  • Sep 1, 2016
  • Umesh S Bhoskar + 1 more

One of the essential tasks in data integration is entity resolution (ER), which recognizes records that belong to the same entity. Entity resolution is referred to by many other terms, such as duplicate detection and pattern matching. Nowadays, activities such as information integration, information retrieval, crowdsourcing, and pay-as-you-go have involved users in carrying out ER tasks, such as identifying whether two entity descriptions refer to the same entity. Previous ER work involves clustering and comparison approaches that rest on certain assumptions; ER quality suffers when those assumptions do not hold. In our approach, we present a new set of entity rules where each rule enumerates all possibilities to identify the correct entity of the records. Additionally, we propose an extended approach (GenR) for efficient and effective rule generation using a specialized form of term-based entropy measure. We experimentally evaluated the proposed approach using a dataset with a large number of records and datasets with different data characteristics. We report promising empirical results which demonstrate performance improvement using a term-based quality measure.
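The abstract does not spell out its term-based entropy measure, but the general idea can be sketched: terms whose values are spread evenly across records carry more discriminating power for ER rules than near-constant terms. A minimal illustration in Python (the sample records and the plain Shannon-entropy scoring are hypothetical, not taken from the paper):

```python
from collections import Counter
from math import log2

def term_entropy(values):
    """Shannon entropy of a term's value distribution across records.
    Higher entropy suggests the term discriminates better between entities."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

records = [
    {"name": "J. Smith", "city": "Austin"},
    {"name": "John Smith", "city": "Austin"},
    {"name": "A. Jones", "city": "Boston"},
]
# 'name' varies across all records, 'city' mostly repeats,
# so 'name' gets the higher discriminating score
scores = {t: term_entropy([r[t] for r in records]) for t in ("name", "city")}
```

A rule generator could then prefer high-entropy terms when enumerating match conditions, which is one plausible reading of the term-based quality measure described above.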

  • Research Article
  • Cited by 68
  • 10.14778/2824032.2824062
Argonaut
  • Aug 1, 2015
  • Proceedings of the VLDB Endowment
  • Daniel Haas + 3 more

Crowdsourced workflows are used in research and industry to solve a variety of tasks. The databases community has used crowd workers in query operators/optimization and for tasks such as entity resolution. Such research utilizes microtasks where crowd workers are asked to answer simple yes/no or multiple-choice questions with little training. Typically, microtasks are used with voting algorithms to combine redundant responses from multiple crowd workers to achieve result quality. Microtasks are powerful, but fail in cases where larger context (e.g., domain knowledge) or significant time investment is needed to solve a problem, for example in large-document structured data extraction. In this paper, we consider context-heavy data processing tasks that may require many hours of work, and refer to such tasks as macrotasks. Leveraging the infrastructure and worker pools of existing crowdsourcing platforms, we automate macrotask scheduling, evaluation, and pay scales. A key challenge in macrotask-powered work, however, is evaluating the quality of a worker's output, since ground truth is seldom available and redundancy-based quality control schemes are impractical. We present Argonaut, a framework that improves macrotask-powered work quality using hierarchical review. Argonaut uses a predictive model of worker quality to select trusted workers to perform review, and a separate predictive model of task quality to decide which tasks to review. Finally, Argonaut can identify the ideal trade-off between a single phase of review and multiple phases of review given a constrained review budget in order to maximize overall output quality. We evaluate an industrial use of Argonaut to power a structured data extraction pipeline that has utilized over half a million hours of crowd worker input to complete millions of macrotasks. We show that Argonaut can capture up to 118% more errors than random spot-check reviews in review budget-constrained environments with up to two review layers.
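As a rough sketch of the budgeted-review idea described above (not Argonaut's actual models): assume a task-quality model has already produced an error probability per task, and spend the review budget on the most suspect tasks instead of random spot checks. The function name, scores, and task IDs below are illustrative only.

```python
def select_for_review(task_scores, budget):
    """Given predicted error probabilities per task (from a task-quality
    model assumed to be trained elsewhere), spend the review budget on
    the tasks most likely to contain errors."""
    ranked = sorted(task_scores, key=task_scores.get, reverse=True)
    return ranked[:budget]

# hypothetical model outputs: probability that each task has an error
scores = {"t1": 0.9, "t2": 0.1, "t3": 0.6, "t4": 0.05}
to_review = select_for_review(scores, budget=2)  # picks t1 and t3
```

This targeted selection is the intuition behind the reported gain over random spot checks: a fixed budget catches more errors when it is concentrated on tasks the model already distrusts.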

  • Cited by 110
  • 10.2196/jmir.9330
Mapping of Crowdsourcing in Health: Systematic Review
  • May 15, 2018
  • Journal of Medical Internet Research
  • Perrine Créquit + 4 more

Background: Crowdsourcing involves obtaining ideas, needed services, or content by soliciting Web-based contributions from a crowd. The 4 types of crowdsourced tasks (problem solving, data processing, surveillance or monitoring, and surveying) can be applied in the 3 categories of health (promotion, research, and care).
Objective: This study aimed to map the different applications of crowdsourcing in health to assess the fields of health that are using crowdsourcing and the crowdsourced tasks used. We also describe the logistics of crowdsourcing and the characteristics of crowd workers.
Methods: MEDLINE, EMBASE, and ClinicalTrials.gov were searched for available reports from inception to March 30, 2016, with no restriction on language or publication status.
Results: We identified 202 relevant studies that used crowdsourcing, including 9 randomized controlled trials, of which only one had posted results at ClinicalTrials.gov. Crowdsourcing was used in health promotion (91/202, 45.0%), research (73/202, 36.1%), and care (38/202, 18.8%). The 4 most frequent areas of application were public health (67/202, 33.2%), psychiatry (32/202, 15.8%), surgery (22/202, 10.9%), and oncology (14/202, 6.9%). Half of the reports (99/202, 49.0%) referred to data processing, 34.6% (70/202) referred to surveying, 10.4% (21/202) referred to surveillance or monitoring, and 5.9% (12/202) referred to problem solving. Labor market platforms (eg, Amazon Mechanical Turk) were used in most studies (190/202, 94%). The crowd workers’ characteristics were poorly reported, and crowdsourcing logistics were missing from two-thirds of the reports. When reported, the median size of the crowd was 424 (first and third quartiles: 167-802); crowd workers’ median age was 34 years (32-36). Crowd workers were mainly recruited nationally, particularly in the United States. For many studies (58.9%, 119/202), previous experience in crowdsourcing was required, and passing a qualification test or training was seldom needed (11.9% of studies; 24/202). For half of the studies, monetary incentives were mentioned, mostly less than US $1 to perform the task. The time needed to perform the task was mostly less than 10 min (58.9% of studies; 119/202). Data quality validation was used in 54/202 studies (26.7%), mainly by attention-check questions or by replicating the task with several crowd workers.
Conclusions: The use of crowdsourcing, which allows access to a large pool of participants as well as saving time in data collection, lowering costs, and speeding up innovations, is increasing in health promotion, research, and care. However, the description of crowdsourcing logistics and crowd workers’ characteristics is frequently missing in study reports and needs to be precisely reported to better interpret the study findings and replicate them.

  • Conference Article
  • Cited by 6
  • 10.1109/icebe.2014.25
In-house Crowdsourcing-Based Entity Resolution: Dealing with Common Names
  • Nov 1, 2014
  • Morteza Saberi + 3 more

Entity Resolution (ER) is one of the techniques used to disambiguate the various manifestations of the same object to improve search results in databases. Recently, crowdsourcing has been utilized to improve entity resolution, with positive impact when searching for particular information in a database. In this paper, we consider the domain of Customer Relationship Management (CRM) and utilize crowdsourcing to enrich the process of achieving ER. Specifically, our focus is to identify the right customer that has been manifested in various ways under a common name in a database using the In-house Crowdsourcing-based Entity Resolution (ICER) approach. ICER takes the list of possible duplicates into consideration (which are pre-determined) and identifies the pair of records that has the maximum impact in achieving ER. Then, this pair is crowdsourced to Customer Service Representatives (CSRs) to obtain their input (labeling). ICER incorporates the principles of Human Intelligence Tasks (HITs), aiming to keep the questions asked of the CSRs to a minimum. Two ICER approaches are proposed in this study, based on probabilistic (a modified approach of Whang et al.) and active learning schemas. The applicability of the proposed ICER approaches and a comparison of their results are highlighted using an example.
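A detail worth illustrating is why labeling the highest-impact pair first pays off: once CSRs label a few pairs, transitive closure answers many remaining questions for free, which is what keeps the number of HITs low. A minimal union-find sketch of that inference (this is the general mechanism, not the ICER algorithm itself; record names are made up):

```python
class UnionFind:
    """Tracks which records have been merged into the same entity."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # path halving keeps lookups near-constant time
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def same(self, a, b):
        return self.find(a) == self.find(b)

uf = UnionFind()
uf.union("rec1", "rec2")   # CSR labels rec1 and rec2 as the same customer
uf.union("rec2", "rec3")   # CSR labels rec2 and rec3 as the same customer
# rec1 == rec3 is now inferred by transitivity; no HIT is needed for it
```

Choosing the pair whose answer collapses the largest number of candidate pairs this way is one plausible reading of "maximum impact" in the abstract above.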

  • Research Article
  • Cited by 1
  • 10.3389/fdata.2024.1296552
A scalable MapReduce-based design of an unsupervised entity resolution system
  • Mar 1, 2024
  • Frontiers in Big Data
  • Nicholas Kofi Akortia Hagan + 3 more

Traditional data curation processes typically depend on human intervention. As data volume and variety grow exponentially, organizations are striving to increase the efficiency of their data processes by automating manual processes and making them as unsupervised as possible. An additional challenge is to make these unsupervised processes scalable to meet the demands of increased data volume. This paper describes the parallelization of an unsupervised entity resolution (ER) process. ER is a component of many different data curation processes because it clusters records from multiple data sources that refer to the same real-world entity, such as the same customer, patient, or product. The ability to scale ER processes is particularly important because the computational effort of ER increases quadratically with data volume. The Data Washing Machine (DWM) is a previously proposed unsupervised ER system which clusters references from diverse data sources. This work aims to overcome the single-threaded nature of the DWM by adopting the parallelism of Hadoop MapReduce. However, the proposed parallelization method can be applied to both supervised systems, where matching rules are created by experts, and unsupervised systems, where expert intervention is not required. The DWM uses an entropy measure to self-evaluate the quality of record clustering. The current single-threaded implementations of the DWM in Python and Java are not scalable beyond a few thousand records and rely on large, shared memory. The objective of this research is to solve the two major shortcomings of the current design of the DWM, the use of shared memory and the lack of scalability, by leveraging the power of Hadoop MapReduce. We propose the Hadoop Data Washing Machine (HDWM), a MapReduce implementation of the legacy DWM. The scalability of the proposed system is demonstrated using publicly available ER datasets. Based on results from our experiment, we conclude that HDWM can cluster from thousands to millions of equivalent references using multiple computational nodes with independent RAM and CPU cores.
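MapReduce-based ER typically tames the quadratic cost with blocking: the map phase emits each record under a blocking key, and the reduce phase compares records only within a block. A toy single-process sketch of that map/reduce contract (the blocking key and the match rule below are illustrative, not the DWM's actual logic):

```python
from collections import defaultdict
from itertools import combinations

def map_phase(records):
    # Emit (blocking key, record); here the key is the first three
    # letters of the name, a deliberately crude blocking choice.
    for rec_id, name in records:
        yield name[:3].lower(), (rec_id, name)

def reduce_phase(grouped):
    # Compare only within a block: global quadratic cost becomes
    # quadratic per (much smaller) block.
    for key, recs in grouped.items():
        for (id1, n1), (id2, n2) in combinations(recs, 2):
            if n1.lower() == n2.lower():   # toy match rule
                yield (id1, id2)

records = [(1, "Smith"), (2, "smith"), (3, "Smythe"), (4, "Jones")]
grouped = defaultdict(list)          # the shuffle step, simulated locally
for k, v in map_phase(records):
    grouped[k].append(v)
matches = list(reduce_phase(grouped))  # only (1, 2) match
```

In a real Hadoop deployment the shuffle distributes each key's block to a reducer with its own RAM and CPU, which is how HDWM removes the shared-memory bottleneck described above.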

  • Conference Article
  • Cited by 6
  • 10.1109/icacci.2016.7732034
Entropy based informative content density approach for efficient web content extraction
  • Sep 1, 2016
  • Manjusha Annam + 1 more

Web content extraction is a popular technique for extracting the main content from web pages while discarding the irrelevant content. Extracting only the relevant content is a challenging task, since it is difficult to determine which parts of a web page are relevant and which are not. Among the existing web content extraction methods, density-based content extraction is one popular approach. However, density-based methods suffer from poor efficiency, especially when pages contain little information and long stretches of noise. We propose a web content extraction technique built on an Entropy-based Informative Content Density algorithm (EICD). The proposed EICD algorithm initially analyzes higher text-density content, then performs entropy-based analysis on the selected features. The key idea of EICD is to use information entropy to represent the knowledge that correlates with the amount of informative content in a page. The proposed method is validated through simulation, and the results are promising.
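As a rough illustration of the entropy intuition behind EICD (not the paper's actual algorithm): boilerplate blocks repeat a few navigation words and thus have low word-distribution entropy, while informative content is both longer and more varied. The scoring function and sample strings here are hypothetical:

```python
from collections import Counter
from math import log2

def block_score(text):
    """Toy content score: word-distribution entropy weighted by length.
    Repetitive navigation text scores low; varied article prose scores high."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    n = len(words)
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    return entropy * n

nav = "home home about contact home"
article = "the study evaluates entropy based extraction of informative page content"
# the article block outscores the repetitive navigation block
```

A full extractor would compute such scores per DOM block and keep only the high-scoring regions; combining the score with raw text density is one plausible reading of the EICD design.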

  • Conference Article
  • Cited by 7
  • 10.1145/2661334.2661401
Evaluating the accessibility of crowdsourcing tasks on Amazon's mechanical turk
  • Jan 1, 2014
  • Rocío Calvo + 2 more

Crowd work web sites such as Amazon Mechanical Turk enable individuals to work from home, which may be useful for people with disabilities. However, the web sites for finding and performing crowd work tasks must be accessible if people with disabilities are to use them. We performed a heuristic analysis of one crowd work site, Amazon's Mechanical Turk, using the Web Content Accessibility Guidelines 2.0. This paper presents the accessibility problems identified in our analysis and offers suggestions for making crowd work platforms more accessible.

  • Book Chapter
  • Cited by 4
  • 10.1007/978-3-319-99987-6_1
Exploring Spark-SQL-Based Entity Resolution Using the Persistence Capability
  • Jan 1, 2018
  • Xiao Chen + 5 more

Entity Resolution (ER) is a task to identify records that refer to the same real-world entities. A naive way to solve ER tasks is to calculate the similarity of the Cartesian product of all records, which is called pair-wise ER and leads to quadratic time complexity. Faced with an exploding data volume, pair-wise ER is challenged to achieve high efficiency and scalability. To tackle this challenge, parallel computing is proposed for speeding up the ER process. Due to the difficulty of distributed programming, big data processing frameworks are often used as tools to ease the realization of parallel ER, supporting data partitioning, workload balancing, and fault tolerance. However, the efficiency and scalability of parallel ER are also influenced by the adopted framework. In the area of parallel ER, the adoption of Apache Spark, a general framework supporting in-memory computation, is still not widely studied. Furthermore, though Apache Spark provides both low-level (RDD-based) and high-level (Dataset-based) APIs, to date only RDD-based APIs have been adopted in parallel ER research. In this paper, we have implemented a Spark-SQL-based ER process and explored its persistence capability to assess the performance benefits. We have evaluated its speedup and compared its efficiency to Spark-RDD-based ER. We observed that different persistence options have a large impact on the efficiency of Spark-SQL-based ER, requiring careful consideration when choosing one. By adopting the best persistence option, the efficiency of our Spark-SQL-based ER implementation is improved by up to 3 times on different datasets, over a baseline without any persistence option or with misconfigured persistence.

  • Research Article
  • Cited by 3
  • 10.3758/s13428-022-01864-x
Chatbot Language - crowdsource perceptions and reactions to dialogue systems to inform dialogue design decisions
  • Jun 14, 2022
  • Behavior Research Methods
  • Birgit Popp + 2 more

Conversational User Interfaces (CUI) are widely used, with about 1.8 billion users worldwide in 2020. For designing and building CUI, dialogue designers have to decide on how the CUI communicates with users and what dialogue strategies to pursue (e.g. reactive vs. proactive). Dialogue strategies can be evaluated in user tests by comparing user perceptions and reactions to different dialogue strategies. Simulating CUI and running them online, for example on crowdsourcing websites, is an attractive avenue to collecting user perceptions and reactions, as they can be gathered time- and cost-effectively. However, developing and deploying a CUI on a crowdsourcing platform can be laborious and requires technical proficiency from researchers. We present Chatbot Language (CBL) as a framework to quickly develop and deploy CUI on crowdsourcing platforms without requiring a technical background. CBL is a library with specialized CUI functionality, which is based on the high-level language JavaScript. In addition, CBL provides scripts that use the API of the crowdsourcing platform Mechanical Turk (MT) in order to (a) create MT Human Intelligence Tasks (HITs) and (b) retrieve the results of those HITs. We used CBL to run experiments on MT and present a sample workflow as well as an example experiment. CBL is freely available and we discuss how CBL can be used now and may be further developed in the future.

  • Research Article
  • 10.2174/1874110x01408010462
A Traceable Data Fusion Based on Data Provenance
  • Dec 31, 2014
  • The Open Cybernetics & Systemics Journal
  • Zhao Qiang + 3 more

Data fusion is a hot topic in data integration and includes at least two stages: entity resolution and data conflict resolution. However, the existing fusion process is not transparent, and the fusion stages are isolated. In this paper, we propose a traceable data fusion mechanism based on data provenance which can trace the data sources of fusion results and the evolutionary process. The mechanism mainly targets the entity resolution and data conflict resolution stages. We represent the provenance of data origin using PI-CS, which is more accurate because PI-CS can record the intermediate process of data evolution. In order to record the evolution process of data fusion, we propose two transformation provenances: entity resolution provenance and data conflict resolution provenance, which respectively record the evolution process of entity resolution and data conflict resolution. Finally, we give an example to validate the availability of the traceable mechanism for data fusion.

  • Conference Article
  • Cited by 11
  • 10.1145/3132847.3132876
Select Your Questions Wisely
  • Nov 6, 2017
  • Vijaya Krishna Yalavarthi + 2 more

Crowdsourcing is becoming increasingly important in entity resolution due to the inherent complexity of tasks such as image clustering and natural language processing. Humans can provide more insightful information for these difficult problems compared to machine-based automatic techniques. Nevertheless, human workers can make mistakes due to lack of domain expertise or seriousness, ambiguity, or even malicious intent. The bulk of the literature usually deals with human errors via majority voting or by assigning a universal error rate over crowd workers. However, such approaches are incomplete, and often inconsistent, because the expertise of crowd workers is diverse with possible biases, thereby making it largely inappropriate to assume a universal error rate for all workers over all crowdsourcing tasks. We mitigate the above challenges by considering an uncertain graph model, where the edge probability between two records A and B denotes the ratio of crowd workers who voted YES on the question of whether A and B are the same entity. To reflect independence across different crowdsourcing tasks, we apply the notion of possible worlds, and develop parameter-free algorithms for both next-crowdsourcing and entity resolution tasks. In particular, for next crowdsourcing, we identify the record pair that maximally increases the reliability of the current clustering. Since reliability takes into account the connectedness inside and across all clusters, this metric is more effective in deciding next questions, in comparison with state-of-the-art works, which consider local features, such as individual edges, paths, or nodes, to select next crowdsourcing questions. Based on detailed empirical analysis over real-world datasets, we find that our proposed solution, PERC (probabilistic entity resolution with imperfect crowd), improves the quality by 15% and reduces the overall cost by 50% for crowdsourcing-based entity resolution.
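The uncertain-graph construction in the abstract is easy to make concrete: each edge probability is simply the YES-vote fraction for a record pair. As a simplified stand-in for PERC's reliability-based selection (which is cluster-aware and more sophisticated), the sketch below just asks about the most uncertain edge next; the votes and the distance-to-0.5 heuristic are illustrative assumptions, not the paper's method:

```python
def edge_probabilities(votes):
    """Edge probability between two records = fraction of workers who
    voted YES that they are the same entity (the uncertain-graph model)."""
    return {pair: ans.count("yes") / len(ans) for pair, ans in votes.items()}

def next_question(probs):
    """Simplified heuristic: crowdsource the most uncertain edge next,
    i.e. the one with probability closest to 0.5."""
    return min(probs, key=lambda p: abs(probs[p] - 0.5))

votes = {("a", "b"): ["yes", "yes", "no"],
         ("a", "c"): ["yes", "no", "no", "no"],
         ("b", "c"): ["yes", "yes", "yes"]}
probs = edge_probabilities(votes)
ask_next = next_question(probs)  # ("a", "b") is the least settled edge
```

PERC instead scores candidate pairs by how much their answer would raise the reliability of the whole clustering over possible worlds, but both approaches share the goal of spending the next question where the graph is least certain.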

  • Research Article
  • 10.1177/193758671300600309
Design Collaboration: Practice and Academic Perspectives
  • Apr 1, 2013
  • HERD: Health Environments Research & Design Journal
  • D Kirk Hamilton

I have been an advocate of evidence-based, or research-influenced, design process and increased rigor in practice for nearly two decades. After 30 years of active practice as a hospital architect, I am currently beginning my tenth year in the academic arena. After experience in both arenas, it is clear to me that there are important differences in the way practitioners and academics view design problems.

Differences of Perception: A stereotype might be that design professionals are practical while academics are theoretical. This is an exaggerated generalization that doesn't entirely ring true. As Kurt Lewin said, "There is nothing quite so practical as a good theory" (1951, p. 169), and in my personal opinion, healthcare design is woefully short of good, strong theory. We do have strong theory from Roger Ulrich (1992) on supportive design, and something close to theory from John Reiling (2005) about design for safety, as well as recommendations that verge on theory from Janet Carpman (2001) about wayfinding.

One obvious difference in perspective between practice and academia is the pace. The world of practice, with its service to clients and the constraints of budget and schedule, moves swiftly and demands immediate answers, or in the absence of an answer, at least a thoughtful decision based on best practice. The pace in the world of academia is measured by semesters and time divided across multiple commitments to teaching and research. Graduate student assistants, while smart and inexpensive, are only available part of the time. In the academic world, there is a mission to find answers, but not at the pace demanded by real-life projects.

There are also differences in the perception of rigor. Architects work hard to gather information and to do what is right for their clients. Academic researchers are likely to feel that what an architect calls "research" consists of exploring the professional (not scholarly) literature, referencing documentation of a firm's past experience, and referring to catalogues of manufacturers' biased descriptions of their products. To an academic researcher, this does not constitute sufficient rigor. An academic will want to have searched for all relevant scholarly literature, and to have interpreted the findings for a project's unique circumstances.

An example from my own experience is what I considered while designing critical care units during my practice years. I would hold meetings with nurses, physicians, respiratory therapists, pharmacists, and other representatives to ask what was needed, how they worked, and what they wanted in the way of process improvement. I would ask the nurses to show me an empty room, and ask them to tell me what they thought of its features. As a practitioner, I never entered a room with a patient in it. Now, as an academic researcher, I shadow critical care nurses with permission from the Institutional Review Board (IRB) and I follow the nurses into patient rooms over the course of a 12-hour shift, while wearing scrubs and a hospital badge. Although I make no notes about patients, their conditions, or their families, and the nurses are anonymous in my recording, through close and careful observation I gain a far more thorough understanding of how these nurses use the features of that designed environment.

There is also a difference in access to scholarship. Academic and university-based researchers have ready access to the world of scholarly literature. They may also have access to graduate student assistants, skilled at searching the library for research papers, and who work at comparatively low wages. The practitioner, on the other hand, struggles to find the articles, and must pay $20-$40 to download a single paper before reading more than an abstract. In spite of their differences, or perhaps because of them, collaboration between practitioners and academics is possible and desirable. …

  • Research Article
  • Cited by 14
  • 10.1371/journal.pone.0134978
Lessons Learned from Crowdsourcing Complex Engineering Tasks
  • Sep 18, 2015
  • PLoS ONE
  • Matthew Staffelbach + 6 more

Crowdsourcing: Crowdsourcing is the practice of obtaining needed ideas, services, or content by requesting contributions from a large group of people. Amazon Mechanical Turk is a web marketplace for crowdsourcing microtasks, such as answering surveys and image tagging. We explored the limits of crowdsourcing by using Mechanical Turk for a more complicated task: analysis and creation of wind simulations.
Harnessing Crowdworkers for Engineering: Our investigation examined the feasibility of using crowdsourcing for complex, highly technical tasks. This was done to determine if the benefits of crowdsourcing could be harnessed to accurately and effectively contribute to solving complex real-world engineering problems. Of course, untrained crowds cannot be used as a mere substitute for trained expertise. Rather, we sought to understand how crowd workers can be used as a large pool of labor for a preliminary analysis of complex data.
Virtual Wind Tunnel: We compared the skill of the anonymous crowd workers from Amazon Mechanical Turk with that of civil engineering graduate students, making a first pass at analyzing wind simulation data. For the first phase, we posted analysis questions to Amazon crowd workers and to two groups of civil engineering graduate students. A second phase of our experiment instructed crowd workers and students to create simulations on our Virtual Wind Tunnel website to solve a more complex task.
Conclusions: With a sufficiently comprehensive tutorial and compensation similar to typical crowdsourcing wages, we were able to enlist crowd workers to effectively complete longer, more complex tasks with competence comparable to that of graduate students with more comprehensive, expert-level knowledge. Furthermore, more complex tasks require increased communication with the workers. As tasks become more complex, the employment relationship begins to become more akin to outsourcing than crowdsourcing. Through this investigation, we were able to stretch and explore the limits of crowdsourcing as a tool for solving complex problems.

  • Research Article
  • Cited by 52
  • 10.1016/j.addr.2019.01.009
Bridging the gaps between academic research and industrial product developments of lipid-based formulations.
  • Jan 22, 2019
  • Advanced Drug Delivery Reviews
  • René Holm

