Big Data Overview

Driven by the need to generate business value, enterprises have started to adopt Big Data solutions, migrating away from classical databases and data stores that lack flexibility and are not sufficiently optimized [1]. Changes in the environment make big data analytics attractive to all types of organizations, while market conditions make it practical: the combination of simplified development models, commoditization, a wider palette of data management tools, and low-cost utility computing has effectively lowered the barrier to entry [2]. The concept addresses large volumes of complex data and rapidly growing data sets that may come from different autonomous sources.

In recent approaches, Big Data is characterized by principles known as the 4V: Volume, Variety, Velocity, and Veracity [3]. There are also arguments for accepting further principles as Big Data characteristics, such as Value.

Each day, more businesses realize that Big Data is relevant, as their applications automatically generate large volumes of data from different data sources, centralized or autonomous. Since traditional databases hit limitations when this data needs to be analyzed, dedicated solutions must be considered.

Important Big Data solutions:

* Apache HBase/Hadoop is based on Google's BigTable distributed storage system and runs on top of Hadoop as a distributed and scalable big data store. This means that HBase can leverage the distributed processing paradigm of the Hadoop Distributed File System (HDFS) and benefit from Hadoop's MapReduce programming model. It combines the scalability of Hadoop with the real-time data access of a key/value store and the deep analytic capabilities of MapReduce [4]. HBase allows querying for individual records as well as deriving aggregate analytic reports across massive amounts of data. It can host large tables with billions of rows and millions of columns, and run across a cluster of commodity hardware.
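HBase's access pattern, point gets plus range scans over lexicographically sorted row keys, can be sketched with a minimal in-memory simulation. The class and method names below are hypothetical illustrations, not the real HBase client API:

```python
import bisect

class MiniHTable:
    """Toy model of an HBase table: rows kept sorted by row key,
    supporting point gets and range scans. Names are hypothetical;
    this is a sketch, not the HBase client API."""

    def __init__(self):
        self._keys = []   # sorted row keys
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        return self._rows.get(row_key)

    def scan(self, start_row, stop_row):
        """Yield (row_key, row) pairs with start_row <= key < stop_row."""
        lo = bisect.bisect_left(self._keys, start_row)
        hi = bisect.bisect_left(self._keys, stop_row)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]

table = MiniHTable()
table.put("user#001", "cf:name", "Ana")
table.put("user#002", "cf:name", "Bob")
table.put("user#010", "cf:name", "Eve")
print(table.get("user#002"))                             # {'cf:name': 'Bob'}
print([k for k, _ in table.scan("user#001", "user#010")])  # stop row excluded
```

Keeping rows sorted by key is what lets HBase serve both individual record lookups and contiguous range scans efficiently, the same property that MapReduce jobs exploit when splitting a table into key ranges.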
HBase is composed of three types of servers in a master-slave architecture. Region Servers are responsible for serving data for reads and writes; when accessing data, clients communicate with the HBase Region Servers directly. Region assignment and DDL operations (creating and deleting tables) are handled by the HBase Master process, while ZooKeeper maintains the live state of the cluster.

* Apache Cassandra is a distributed database used for the administration and management of large amounts of structured data across multiple servers, providing a highly available service with no single point of failure. It offers features such as continuous availability, linearly scalable performance, and data distribution across multiple data centers and cloud availability zones. Cassandra inherits its data architecture from Google's BigTable and borrows its distribution mechanisms from Amazon's Dynamo. The nodes in a Cassandra cluster are completely symmetrical, all having identical responsibilities, and Cassandra employs consistent hashing to partition and replicate data. It can handle large amounts of data and thousands of concurrent users or operations per second across multiple data centers. Cassandra uses a hierarchy of caching mechanisms and carefully orchestrated disk I/O to ensure both speed and data safety: write operations are sent first to a persistent commit log (ensuring a durable write), then to a write-back cache called a memtable; when the memtable fills, it is flushed to a sorted string table (SSTable) on disk. A Cassandra cluster is organized as a ring, and it uses a partitioning strategy to distribute data evenly.

* Redis is an in-memory data structure store used as a database, cache, and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, HyperLogLogs, and geospatial indexes with radius queries. Redis stores all data in RAM, allowing very fast reads and writes.
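The consistent hashing that Cassandra uses to partition and replicate data around its ring can be sketched as a toy model. Cassandra itself defaults to the Murmur3 partitioner; MD5 is used here only because it is in the standard library, and the node names are invented:

```python
import bisect
import hashlib

def token(key: str) -> int:
    """Hash a key onto the ring (MD5 here for brevity; Cassandra
    uses Murmur3 by default)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Toy consistent-hash ring: a key belongs to the first node whose
    token follows the key's token clockwise, and is replicated to the
    next rf - 1 distinct nodes around the ring."""

    def __init__(self, nodes, rf=3):
        self.rf = rf
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key):
        start = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        return [self._ring[(start + i) % len(self._ring)][1]
                for i in range(min(self.rf, len(self._ring)))]

ring = Ring(["node-a", "node-b", "node-c", "node-d"], rf=3)
print(ring.replicas("user:42"))  # three distinct replica nodes
```

Because every node hashes onto the same ring and applies the same placement rule, any node can locate a key's replicas without a central coordinator, which is what makes Cassandra's symmetrical, masterless design possible.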
It runs extremely efficiently in memory and handles high-velocity data, requiring only simple standard servers to deliver millions of operations per second with sub-millisecond latency. …
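The sorted sets with range queries mentioned above can be illustrated with a small in-memory simulation. The method names mirror the Redis commands ZADD and ZRANGEBYSCORE, but this is a sketch, not the redis-py client API:

```python
import bisect

class MiniZSet:
    """Toy model of a Redis sorted set: members kept ordered by score,
    supporting range-by-score queries. A simulation, not redis-py."""

    def __init__(self):
        self._scores = {}   # member -> score
        self._sorted = []   # sorted list of (score, member)

    def zadd(self, member, score):
        if member in self._scores:
            self._sorted.remove((self._scores[member], member))
        self._scores[member] = score
        bisect.insort(self._sorted, (score, member))

    def zrangebyscore(self, lo, hi):
        """Members with lo <= score <= hi, in ascending score order."""
        start = bisect.bisect_left(self._sorted, (lo, ""))
        out = []
        for score, member in self._sorted[start:]:
            if score > hi:
                break
            out.append(member)
        return out

board = MiniZSet()
board.zadd("alice", 120)
board.zadd("bob", 95)
board.zadd("carol", 150)
print(board.zrangebyscore(100, 200))  # ['alice', 'carol']
```

A typical use is a leaderboard: scores update in place via zadd, and a single range query returns every member inside a score band, which is the kind of operation Redis serves at in-memory speed.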