The Trouble With Big Data by Jennifer Edmond, Nicola Horsley, Jörg Lehmann and Mike Priddy

Abstract

Tannistha Samanta reviews Jennifer Edmond, Nicola Horsley, Jörg Lehmann and Mike Priddy’s The Trouble With Big Data (Bloomsbury), focusing on the Knowledge Complexity (KPLEX) Project’s research into big data’s failure points.

Similar Papers
  • Research Article
  • Cited by 4
  • 10.1111/ppe.12971
Rigour and reproducibility in perinatal and paediatric epidemiologic research using big data
  • Mar 23, 2023
  • Paediatric and perinatal epidemiology
  • Anna Nguyen + 1 more


  • Research Article
  • Cited by 11
  • 10.12948/issn14531305/21.1.2017.05
A Maturity Analysis of Big Data Technologies
  • Mar 30, 2017
  • Informatica Economica
  • Radu Boncea + 3 more

Big Data overview: driven by the need to generate business value, enterprises have started to adopt Big Data solutions, migrating from classical databases and data stores that lack flexibility and are insufficiently optimized [1]. Changes in the environment make big data analytics attractive to all types of organizations, while market conditions make it practical: the combination of simplified development models, commoditization, a wider palette of data management tools, and low-cost utility computing has effectively lowered the barrier to entry [2]. The concept addresses large volumes of complex, rapidly growing data sets that may come from different autonomous sources. In recent approaches, Big Data is characterized by principles known as the four Vs: Volume, Variety, Velocity and Veracity [3]; some argue for accepting further characteristics, such as Value. Each day more businesses realize that Big Data is relevant, as applications automatically generate large volumes of data from different data sources, centralized or autonomous. Since traditional databases hit limitations when this data needs to be analyzed, dedicated solutions must be considered. Important Big Data solutions include:

  • Apache HBase/Hadoop is based on Google's BigTable distributed storage system and runs on top of Hadoop as a distributed, scalable big data store. HBase can therefore leverage the Hadoop Distributed File System (HDFS) and benefit from Hadoop's MapReduce programming model, combining the scalability of Hadoop with real-time data access as a key/value store and the deep analytic capabilities of MapReduce [4]. HBase allows querying for individual records as well as deriving aggregate analytic reports across massive amounts of data. It can host large tables with billions of rows and millions of columns, running across a cluster of commodity hardware. HBase is composed of three types of servers in a master/slave architecture: Region Servers serve data for reads and writes, and clients communicate with them directly, while region assignment and DDL operations (creating and deleting tables) are handled by the HBase Master process.

  • Apache Cassandra is a distributed database for administering and managing large amounts of structured data across multiple servers while providing a highly available service with no single point of failure. It offers continuous availability, linear-scale performance, and data distribution across multiple data centers and cloud availability zones. Cassandra inherits its data architecture from Google's BigTable and borrows its distribution mechanisms from Amazon's Dynamo. The nodes in a Cassandra cluster are completely symmetrical, all having identical responsibilities, and Cassandra employs consistent hashing to partition and replicate data. It can handle large amounts of data and thousands of concurrent users or operations per second across multiple data centers. A hierarchy of caching mechanisms and carefully orchestrated disk I/O ensure speed and data safety: write operations are sent first to a persistent commit log (ensuring a durable write), then to a write-back cache called a memtable; when the memtable fills, it is flushed to a sorted string table (SSTable) on disk. A Cassandra cluster is organized as a ring and uses a partitioning strategy to distribute data evenly.

  • Redis is an in-memory data structure store used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, HyperLogLogs and geospatial indexes with radius queries. Redis stores all data in RAM, allowing very fast reads and writes; it runs extremely efficiently in memory and handles high-velocity data, needing only standard servers to deliver millions of operations per second with sub-millisecond latency. …
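The Cassandra write path described above (durable commit log, then memtable, then SSTable flush) can be sketched minimally. All class and variable names here are illustrative stand-ins, not Cassandra's actual internals:

```python
# Minimal sketch of the Cassandra-style write path described above:
# writes go to a durable commit log, then a memtable; when the memtable
# fills, it is flushed to an immutable, sorted SSTable "on disk".

class SketchStore:
    def __init__(self, memtable_limit=3):
        self.commit_log = []      # durable, append-only (here: a list)
        self.memtable = {}        # write-back cache keyed by row key
        self.sstables = []        # immutable sorted runs
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. durable append
        self.memtable[key] = value             # 2. buffer in memory
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # 3. flush the memtable as a sorted string table (SSTable)
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # newest data first: memtable, then SSTables newest-to-oldest
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

store = SketchStore()
for i in range(5):
    store.write(f"row{i}", i)
print(store.read("row0"), len(store.sstables))  # row0 now comes from an SSTable
```

The same shape explains why Cassandra writes are fast: the hot path is an append plus an in-memory update, with sorting deferred to the flush.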

  • Research Article
  • 10.55662/jst.2024.5104
Big Data Analytics-Driven Project Management Strategies
  • Jan 11, 2024
  • Journal of Science & Technology
  • Muhammad Zahaib Nabeel

The integration of Artificial Intelligence (AI) and Big Data Analytics (BDA) in project management has become a critical enabler of efficiency in managing large-scale, complex projects. This research paper delves into how AI-driven big data analytics can revolutionize traditional project management methodologies by introducing dynamic scheduling, real-time risk prediction, and automated task prioritization strategies. These advanced techniques, which leverage machine learning (ML) models and extensive historical project data, enable a shift from reactive to proactive project management, ensuring that risks and resource constraints are identified and addressed before they impact project delivery. By analyzing massive datasets, including historical performance metrics, resource availability, and project timelines, AI-driven systems can forecast delays, assess risk levels dynamically, and adapt schedules in real-time. This proactive approach facilitates better decision-making, optimized resource allocation, and improved project outcomes. The study is anchored on the premise that the sheer volume of data generated in large-scale projects often overwhelms traditional project management systems. By incorporating AI and BDA, project managers can better utilize this data, turning it into actionable insights that inform intelligent decision-making. Machine learning algorithms, particularly those specializing in predictive analytics, are capable of identifying patterns that elude human analysis, allowing for the accurate forecasting of project risks, schedule slippage, and task dependencies. This ability to predict potential issues, such as resource bottlenecks or unforeseen delays, enables project teams to implement mitigative actions in advance, thus reducing the likelihood of project failure. Furthermore, dynamic scheduling is a key focus of this research, as AI-powered models can continuously adjust project timelines based on real-time data. 
These models consider variables such as resource utilization rates, task dependencies, and evolving project constraints, offering adaptive scheduling mechanisms that evolve throughout the project lifecycle. The automated task prioritization system, powered by BDA, ensures that the most critical tasks receive the appropriate level of attention at the right time, improving project performance and enhancing resource efficiency. Through natural language processing (NLP) and advanced data mining techniques, AI models can also analyze project documentation and communication channels to detect potential risks and suggest task adjustments. The paper also discusses the application of AI in risk prediction, focusing on how AI models can analyze risk factors from historical data, including resource constraints, financial limitations, and market volatility, to produce risk profiles that project managers can use for strategic planning. Real-time risk assessments, made possible by the integration of AI and BDA, can help project teams stay ahead of potential disruptions. This allows for more accurate contingency planning and reduces the overall risk to project timelines and budgets. Practical applications of these AI-driven strategies are presented through case studies of large-scale projects in various industries, including construction, information technology, and healthcare. These case studies demonstrate how AI-powered analytics have been successfully implemented to enhance project efficiency, optimize resource allocation, and minimize risks in complex projects. The study underscores the importance of integrating these technologies into modern project management frameworks to cope with the increasing complexity of projects in today’s fast-paced business environment. While the potential benefits of AI and BDA in project management are substantial, this paper also addresses the challenges associated with their implementation. 
One significant challenge is the quality and availability of data required to train AI models effectively. Incomplete or inaccurate data can lead to unreliable forecasts, compromising the project’s success. Additionally, the paper explores the issues of data privacy and security in AI-driven project management systems, highlighting the need for robust data governance frameworks to ensure the ethical use of AI technologies. Another key consideration is the resistance to change within organizations, where traditional project management methods are deeply ingrained. The paper emphasizes the need for a cultural shift towards data-driven decision-making and suggests strategies for fostering an environment conducive to AI adoption. This includes training project management teams to work alongside AI systems and fostering collaboration between AI experts and project managers to ensure smooth implementation and operation. Finally, this research outlines future trends in AI and BDA for project management, suggesting that further advancements in AI technologies, such as reinforcement learning and more sophisticated natural language processing algorithms, will drive the next generation of intelligent project management systems. These future systems are expected to be even more adept at handling the complexities of large-scale projects, offering real-time solutions to unforeseen challenges and adapting dynamically to changing project requirements.
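The abstract's central claim, that historical task data can flag at-risk work before it slips, can be illustrated with a toy example. The data and the nearest-centroid "model" below are hypothetical stand-ins for the paper's ML pipeline, not its actual method:

```python
# Hedged sketch: learn from historical task records (resource
# utilization, dependency count) which tasks slipped, then flag
# at-risk tasks in a new plan by nearest historical centroid.

from statistics import mean

# (utilization %, dependency count, slipped?) - synthetic history
history = [
    (95, 6, True), (90, 5, True), (88, 7, True),
    (40, 1, False), (55, 2, False), (60, 1, False),
]

def centroid(rows):
    return (mean(r[0] for r in rows), mean(r[1] for r in rows))

slipped = centroid([r for r in history if r[2]])
on_time = centroid([r for r in history if not r[2]])

def at_risk(utilization, deps):
    # classify by squared distance to the nearer historical centroid
    d_slip = (utilization - slipped[0]) ** 2 + (deps - slipped[1]) ** 2
    d_ok = (utilization - on_time[0]) ** 2 + (deps - on_time[1]) ** 2
    return d_slip < d_ok

print(at_risk(92, 6))  # heavily loaded task with many dependencies
print(at_risk(45, 1))  # lightly loaded task
```

A real system would use richer features and a trained model, but the proactive shift the paper describes is exactly this: a prediction made before the slippage, not a report after it.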

  • Research Article
  • Cited by 222
  • 10.1016/j.ijproman.2015.02.006
Managing change in the delivery of complex projects: Configuration management, asset information and ‘big data’
  • Mar 20, 2015
  • International Journal of Project Management
  • Jennifer Whyte + 2 more


  • Research Article
  • Cited by 6
  • 10.1109/access.2020.3036285
Simulation-Based Sensitivity Analysis for Evaluating Factors Affecting Bus Service Reliability: A Big and Smart Data Implementation
  • Jan 1, 2020
  • IEEE Access
  • Seyed Mohammad Hossein Moosavi + 3 more

Service quality is a significant concern for both providers and users of public transportation, and it is crucial for transit agencies to clearly recognize the causes of unreliability before adopting any improvement strategy. However, the main causes of bus service unreliability have not been well investigated, and existing studies have three main limitations in this regard. First, public transport networks and traffic conditions are highly complex systems, and most existing models cannot accurately determine the relationship between service irregularity and its impact factors. Second, the definition of “big data” has been neglected: most studies focus on only one source of large-scale data when determining the causes of unreliability. Third, bus service unreliability can significantly affect users' perception of public transport; a number of studies have recommended that reliability be evaluated from both the service providers' and the users' perspective, yet the impact of service unreliability on passengers' perception has not been well investigated. Consequently, we propose a novel simulation-based sensitivity analysis for evaluating the main causes of bus service unreliability using a combination of three different sources of big data. Moreover, for the first time, we developed the simulation model in RStudio, an open-source and powerful coding environment. According to the results, the level of reliability on Route U32 showed the highest sensitivity to headway variations: waiting time can be decreased by 61% if bus operators reduce the headway variation by 25% of the actual observed data, and big gaps and bus bunching could almost be eliminated by decreasing headway variations. Moreover, the terminal departure policy could significantly improve passenger waiting time, which can be decreased by 36% when almost all buses depart the terminal on time.
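The headway-variation effect the abstract reports follows the classic result that, for randomly arriving passengers, expected wait is (H/2)(1 + CV²), where H is the mean headway and CV its coefficient of variation. This toy simulation (not the paper's R model; distributions are illustrative) shows why reducing variation alone cuts waiting time:

```python
# Compare mean passenger wait under noisy vs. regular headways.
# Passengers arrive uniformly in time, so a headway h contributes
# h passengers (in expectation), each waiting h/2 on average.

import random
random.seed(0)

def mean_wait(headways):
    total_wait = sum(h * h / 2 for h in headways)
    total_pax = sum(headways)          # arrivals proportional to gap length
    return total_wait / total_pax      # equals E[h^2] / (2 E[h])

# same 10-minute mean headway, different variation
base = [max(0.5, random.gauss(10, 4)) for _ in range(10000)]
tight = [max(0.5, random.gauss(10, 1)) for _ in range(10000)]

print(round(mean_wait(base), 2), round(mean_wait(tight), 2))
```

With mean headway fixed, the wait exceeds H/2 = 5 minutes by a factor of (1 + CV²), so tightening headway regularity reduces waiting even when no extra buses run, which is the mechanism behind the paper's 61% figure.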

  • Conference Article
  • Cited by 1
  • 10.1109/csicc52343.2021.9420593
Speed up Cassandra read path by using Coordinator Cache
  • Mar 3, 2021
  • Latifa Azizi Vakili + 1 more

The fast-increasing amount of massive and complex data on today's Internet, called Big Data, requires sophisticated, comprehensive and highly operational databases. NoSQL databases are designed to fulfill Big Data requirements, and choosing an appropriate NoSQL database among the various solutions to cover and manage large volumes of data, in both quantity and quality, is itself a big challenge. Cassandra is a distributed NoSQL database designed for managing very large amounts of structured and unstructured data spread across many commodity servers, while providing highly available services with no single point of failure. The Cassandra system was designed to run on cheap commodity hardware and handle high write throughput without sacrificing read efficiency. This paper first presents an overview of NoSQL databases, Big Data, and IoT data as a controversial and complicated source of data in Big Data. It then focuses on read-request issues in Cassandra's read path and suggests a model to reduce the time of a read request (read query) coming from the client side to the Cassandra database. In this model we added a cache, called the Coordinator cache, to Cassandra's controlling nodes. Using a real dataset, we analyze Cassandra's existing read path against the suggested read-path model and compare the time of a read query before and after applying the model. The results show that using the Coordinator cache together with the key cache offered by Cassandra speeds up data read requests. The Coordinator cache requires no extra memory, because the Cassandra Coordinator node does not store anything while performing controlling tasks over replica nodes, and its potential memory space can be used for the introduced Coordinator cache.
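The Coordinator-cache idea can be sketched in a few lines: the coordinator answers repeated reads locally instead of forwarding every request to a replica. The classes and the replica "network" below are illustrative, not Cassandra's implementation:

```python
# Sketch of a coordinator that caches read results, saving
# round trips to replica nodes for repeated keys.

class Replica:
    def __init__(self, data):
        self.data = data
        self.reads_served = 0
    def get(self, key):
        self.reads_served += 1       # stands in for a network round trip
        return self.data.get(key)

class Coordinator:
    def __init__(self, replica, cache_size=128):
        self.replica = replica
        self.cache = {}
        self.cache_size = cache_size
    def read(self, key):
        if key in self.cache:        # hit: no replica round trip
            return self.cache[key]
        value = self.replica.get(key)
        if len(self.cache) < self.cache_size:
            self.cache[key] = value
        return value

replica = Replica({"user:1": "alice", "user:2": "bob"})
coord = Coordinator(replica)
for _ in range(5):
    coord.read("user:1")             # only the first read hits the replica
print(replica.reads_served)
```

This mirrors the paper's memory argument: the coordinator role otherwise stores nothing during a read, so the cache occupies memory that would sit idle.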

  • Book Chapter
  • Cited by 18
  • 10.1007/978-3-030-04203-5_2
The Role of IoT and Big Data in Modern Technological Arena: A Comprehensive Study
  • Dec 31, 2018
  • Sushree Bibhuprada B Priyadarshini + 2 more

In the current era of technology, the adoption of the Internet of Things (IoT) is rising rapidly, with a proliferation of exciting application prospects and practical uses. Fundamentally, IoT refers to a system of computing devices, persons or animals ascribed unique identifiers, whose data is transmitted without human-to-computer or human-to-human intervention. IoT emerged from the merging of micro-electro-mechanical systems, microservices, wireless technologies and the Internet; this merging helps bridge information technology and operational technology, thereby enabling machine-generated data to be analyzed on a technological platform. Big data, in turn, denotes the large volumes of structured and unstructured data associated with day-to-day life. The amount of data that can be generated and preserved at a global level is mind-boggling; however, the relevance of big data lies not in how much data one possesses but in what one does with it. The current chapter throws light on IoT and big data, their relevance, data sources, big data applications, IoT architecture and security challenges, standards and protocols for IoT, single points of failure, IoT code, etc.

  • Research Article
  • Cited by 7
  • 10.3390/electronics12244894
Securing Big Data Exchange: An Integrated Blockchain Framework for Full-Lifecycle Data Trading with Trust and Dispute Resolution
  • Dec 5, 2023
  • Electronics
  • Chuangming Zhou + 4 more

In the era of big data, facilitating efficient data flow is of paramount importance. Governments and enterprises worldwide have been investing in the big data industry, promoting data sharing and trading. However, existing data trading platforms often suffer from issues like privacy breaches, single points of failure, data tampering, and non-transparent transactions due to their reliance on centralized servers. To address these challenges, blockchain-based big data transaction models have been proposed. However, these models often lack system integrity and fail to fully meet user requirements while ensuring adequate security. To overcome these limitations, this paper presents an Ethereum-based big data trading model that establishes a comprehensive and secure trading system. The model aims to provide users with more convenient, secure, and professional services. Through the utilization of smart contracts, users can efficiently match data and negotiate prices online while ensuring secure data delivery through encryption technologies. Additionally, the model introduces a trusted third-party entity that offers professional data evaluation services and actively safeguards user data ownership in the event of disputes. The implementation of the model includes the development of smart contracts and the necessary machine learning code, followed by rigorous testing and validation. The experimental results validate the effectiveness and reliability of our proposed model, demonstrating its potential to ensure effective and secure big data trading.
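The escrow flow the abstract describes, where payment is held by a contract until encrypted data is delivered and a third party can resolve disputes, can be sketched as a toy state machine. This is a hypothetical illustration in Python, not the paper's Ethereum contracts:

```python
# Toy full-lifecycle data-trade escrow: funds are released to the
# seller only after delivery, and an arbiter can force a refund.

class DataEscrow:
    def __init__(self, buyer, seller, price):
        self.buyer, self.seller, self.price = buyer, seller, price
        self.state = "created"
        self.payload = None

    def deposit(self, amount):
        assert self.state == "created" and amount == self.price
        self.state = "funded"

    def deliver(self, encrypted_payload):
        assert self.state == "funded"
        self.payload = encrypted_payload
        self.state = "delivered"

    def confirm(self):
        # buyer accepts: funds released to the seller
        assert self.state == "delivered"
        self.state = "settled"
        return (self.seller, self.price)

    def dispute(self, arbiter_refunds_buyer):
        # trusted third party resolves a delivered-but-contested trade
        assert self.state == "delivered"
        self.state = "refunded" if arbiter_refunds_buyer else "settled"

deal = DataEscrow("buyer", "seller", 100)
deal.deposit(100)
deal.deliver(b"ciphertext-bytes")
print(deal.confirm())
```

In the actual model these transitions are smart-contract functions, so no centralized server can tamper with the state or withhold funds unilaterally.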

  • Conference Article
  • Cited by 15
  • 10.4043/26275-ms
Big Data Analytics for Predictive Maintenance Modeling: Challenges and Opportunities
  • Oct 27, 2015
  • OTC Brasil
  • I H F Santos + 17 more

Big data analytics, applied in industry to leverage data collection, processing and analysis, can allow a better understanding of a production system's abnormal behavior. This knowledge is essential for adopting a proactive maintenance approach instead of conventional time-based strategies, leading to a paradigm shift towards Condition-Based Maintenance (CBM), since decisions are now based on huge, diverse and dynamic amounts of data as a means to optimize operational costs. This paper reports an investigation of the emerging aspects in the design and implementation of big data analytics solutions for offshore installations in order to enable predictive maintenance practices. Condition-based maintenance focuses on performing interventions based on the actual and future states (health) of a system by monitoring the underlying deterioration processes. One of the building blocks of a CBM design and implementation is the prognostic approach/system, which aims to detect, classify and predict critical failures. Considering the massive amounts of data available from a Stationary Production Unit (SPU), techniques that properly deal with such a big data scenario become essential: parallel processing to ingest, transform and analyze different kinds of data on a near-real-time basis allows the construction of a valuable tool for implementing CBM. This paper presents a comparison of different approaches to RUSBoost and Random Forest (RF) classification in constructing a prognostic system for a specific class of turbogenerator failures from a chosen Petrobras Floating Production Storage and Offloading (FPSO) unit. Besides the comparison of different classifiers, a contribution of this work lies in the use of data acquired not only from machine sensors (telemetry data) but also from non-structured data regarding the most critical failures, acquired from official reports such as operators' machine event annotations. These reported annotations were correlated with telemetry data to identify real critical failures while avoiding false positives.
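The prognostic setup can be illustrated with a toy ensemble: telemetry features labelled by (here synthetic) annotations, classified by bagged decision stumps. This tiny ensemble is a stand-in for the paper's RUSBoost / Random Forest models; the data and thresholds are invented:

```python
# Toy failure-prognosis ensemble: bootstrap-sampled decision stumps
# voting on synthetic (vibration, temperature) telemetry.

import random
random.seed(1)

# synthetic telemetry labelled 1 = failure, 0 = healthy
data = [((random.gauss(8, 1), random.gauss(90, 5)), 1) for _ in range(50)] + \
       [((random.gauss(3, 1), random.gauss(60, 5)), 0) for _ in range(50)]

def train_stump(sample):
    # pick the feature whose mean-value threshold best splits the sample
    best = None
    for f in (0, 1):
        thr = sum(x[f] for x, _ in sample) / len(sample)
        acc = sum((x[f] > thr) == bool(y) for x, y in sample) / len(sample)
        if best is None or acc > best[2]:
            best = (f, thr, acc)
    return best[:2]

def predict(forest, x):
    votes = sum(x[f] > thr for f, thr in forest)
    return int(votes > len(forest) / 2)

# bagging: each stump sees a bootstrap sample of the training data
forest = [train_stump(random.choices(data, k=len(data))) for _ in range(15)]

print(predict(forest, (8.5, 92)), predict(forest, (2.5, 58)))
```

A production system would add the class-imbalance handling that motivates RUSBoost, since real failure events are rare relative to healthy telemetry.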

  • Conference Article
  • Cited by 69
  • 10.1109/icufn.2017.7993927
Blockchain based approach to enhance big data authentication in distributed environment
  • Jan 1, 2017
  • Nazri Abdullah + 2 more

Existing authentication protocols for Big Data systems such as Apache Hadoop are based on Kerberos. In the Kerberos protocol, numerous security issues remain unsolved; replay attacks, DDoS and single points of failure are some examples. These indicate potential security vulnerabilities and Big Data risks in using Hadoop. This paper presents the drawbacks of Kerberos implementations and identifies authentication requirements that can enhance the security of Big Data in distributed environments. The proposed enhancement is based on the rising technology of blockchain, which overcomes the shortcomings of Kerberos.
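The blockchain property such proposals rely on is that records are chained by hashes, so tampering with any past authentication event is detectable and no single server holds the only authoritative copy. A minimal, illustrative sketch (not the paper's protocol):

```python
# Hash-chained log of authentication events: each block commits to
# the hash of its predecessor, so rewriting history breaks the chain.

import hashlib
import json

def block_hash(block):
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append(chain, event):
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"event": event, "prev": prev})

def verify(chain):
    for i in range(1, len(chain)):
        if chain[i]["prev"] != block_hash(chain[i - 1]):
            return False
    return True

chain = []
append(chain, "alice authenticated at node-1")
append(chain, "bob authenticated at node-2")
print(verify(chain))                          # intact chain verifies

chain[0]["event"] = "mallory authenticated"   # tamper with history
print(verify(chain))                          # tampering is detected
```

Distributing copies of such a chain across nodes is what removes the single point of failure that a central Kerberos key distribution center represents.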

  • Research Article
  • Cited by 77
  • 10.1016/j.comnet.2021.107994
A blockchain-based trading system for big data
  • Mar 13, 2021
  • Computer Networks
  • Donghui Hu + 4 more


  • Front Matter
  • Cited by 19
  • 10.1097/apo.0000000000000399
Blockchain Technology for Ophthalmology: Coming of Age?
  • Jul 1, 2021
  • Asia-Pacific Journal of Ophthalmology
  • Wei Yan Ng + 9 more


  • Research Article
  • Cited by 171
  • 10.1016/j.jpdc.2022.01.030
A distributed intrusion detection system to detect DDoS attacks in blockchain-enabled IoT network
  • Feb 18, 2022
  • Journal of Parallel and Distributed Computing
  • Randhir Kumar + 5 more


  • Research Article
  • Cited by 2
  • 10.1079/pavsnnr201510028
Integrated pest management in temperate horticulture: seeing the wood for the trees.
  • Jan 1, 2015
  • CABI Reviews
  • C D Harvey

Owing to the decreasing availability of synthetic pesticides, there is an urgent need for developing and improving alternative pest control methods in horticulture. Integrated Pest Management (IPM) aims to reduce and control the damage caused by pest organisms by making use of ecological interactions between the pest, its antagonists and the environment. IPM usually involves combined use of pesticides, pest antagonists, mass trapping and environmental manipulation. This gives rise to potentially negative interference amongst these components as well as with other environmental and crop-related factors. Such interference has the potential to reduce IPM efficacy, especially as the use of IPM is broadened and intensified. Evidence for such interference among components of IPM is briefly reviewed and the need for a research agenda that investigates such interference experimentally is discussed along with the potential for using 'big data' generated in IPM to conduct meta-analyses and construct powerful models for IPM. These approaches to research and data management should support the expansion and improvement of Decision Support Systems (DSS) for IPM practitioners that combine databases, expert networks and models. The success of DSS based on increasingly complex and extensive knowledge and data greatly depends on their accessibility, ease of use and whether they produce clear outputs that support decision-making by growers and consultants. The aim must be to improve IPM efficacy, predictability, cost-effectiveness and sustainability, while still finding ways of helping IPM practitioners identify IPM strategies that are optimal for their needs amongst an increasing number of options.

  • Research Article
  • Cited by 13
  • 10.1017/s1471068414000131
Efficient Computation of the Well-Founded Semantics over Big Data
  • Jul 1, 2014
  • Theory and Practice of Logic Programming
  • Ilias Tachmazidis + 2 more

Data originating from the Web, sensor readings and social media result in increasingly huge datasets. This so-called Big Data comes with new scientific and technological challenges while creating new opportunities, hence the increasing interest in academia and industry. Traditionally, logic programming has focused on complex knowledge structures/programs, so the question arises whether and how it can work in the face of Big Data. In this paper, we examine how the well-founded semantics can process huge amounts of data through mass parallelization. More specifically, we propose and evaluate a parallel approach using the MapReduce framework. Our experimental results indicate that our approach is scalable and that the well-founded semantics can be applied to billions of facts. To the best of our knowledge, this is the first work that addresses large-scale nonmonotonic reasoning without the restriction of stratification for predicates of arbitrary arity.
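The mass-parallelization idea can be illustrated with one round of rule application expressed in MapReduce style: the map phase emits facts keyed on the join variable, and the reduce phase performs the join per key. The rule and data are a toy stand-in for the paper's well-founded-semantics computation:

```python
# One MapReduce-style round for the rule
#   ancestor(X, Z) <- parent(X, Y), ancestor(Y, Z)
# Map: key each fact on the join variable Y. Reduce: join per key.

from collections import defaultdict

parent = {("a", "b"), ("b", "c")}
ancestor = set(parent)               # base case: every parent is an ancestor

def map_phase(parent, ancestor):
    buckets = defaultdict(lambda: ([], []))
    for x, y in parent:
        buckets[y][0].append(x)      # parent(X, Y) keyed on Y, left side
    for y, z in ancestor:
        buckets[y][1].append(z)      # ancestor(Y, Z) keyed on Y, right side
    return buckets

def reduce_phase(buckets):
    out = set()
    for lefts, rights in buckets.values():
        out.update((x, z) for x in lefts for z in rights)
    return out

new = reduce_phase(map_phase(parent, ancestor))
print(sorted(new))                   # newly derived ancestor facts
```

Because each key's bucket is joined independently, the reduce phase parallelizes across machines, and the round is repeated until no new facts appear; the paper's contribution is making this work for nonmonotonic rules, which a simple positive rule like this one does not capture.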
