Enormous amounts of data are being generated at high speed by a variety of sources such as mobile devices, social media, machine logs, and the many sensors surrounding us. All around the world, we produce vast amounts of data, and the volume of generated data is growing exponentially at an unprecedented rate. The pace of data generation is being accelerated further by the growth of new technologies and paradigms such as the Internet of Things (IoT).
What is Big Data and How Is It Changing?
Big data is defined by the dimensions of the data. Data sets are considered “big data” if they rank high on the following three distinct dimensions: volume, velocity, and variety. Value and veracity are two other “V” dimensions that have been added to the big data literature in recent years. Additional Vs are frequently proposed, but these five Vs are widely accepted by the community and can be described as follows:
- Velocity: the speed at which the data is generated
- Volume: the amount of data that is generated
- Variety: the diversity or different types of the data
- Value: the worth of the data or the value it has
- Veracity: the quality, accuracy, or trustworthiness of the data
Large volumes of data are generally available in either structured or unstructured formats. Structured data can be generated by machines or humans, follows a specific schema or model with clearly defined data types, and is usually stored in databases. Numbers, dates and times, and strings are a few examples of structured data that may be stored in database columns. In contrast, unstructured data does not have a predefined schema or model. Text files, log files, social media posts, mobile data, and media are all examples of unstructured data.
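To make the distinction concrete, the following Python sketch contrasts a structured record with a fixed schema against an unstructured log line that must be parsed before it can be queried. The `SensorReading` type, the log format, and the regular expression are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class SensorReading:
    """Structured data: every field has a name and a declared type."""
    sensor_id: int
    recorded_at: datetime
    temperature_c: float


# Unstructured data: free text with no predefined schema (hypothetical log line).
LOG_LINE = "2019-03-01 12:00:05 sensor 17 reported temperature 21.5C"

# Hypothetical pattern that imposes a schema on the free text above.
LOG_PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) sensor (?P<id>\d+) "
    r"reported temperature (?P<temp>-?\d+(?:\.\d+)?)C"
)


def parse_log_line(line: str) -> Optional[SensorReading]:
    """Convert one log line into a structured record, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return SensorReading(
        sensor_id=int(match.group("id")),
        recorded_at=datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S"),
        temperature_c=float(match.group("temp")),
    )


reading = parse_log_line(LOG_LINE)
print(reading)
```

Once the line has been parsed, its fields can be stored in typed database columns like any other structured record. Lines that do not fit the assumed pattern return `None`, which hints at why unstructured sources usually need dedicated ingestion tooling.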
Based on a report provided by Gartner, an international research and consulting organization, the application of advanced big data analytics is part of the Gartner Top 10 Strategic Technology Trends for 2019, and is expected to drive new business opportunities. The same report also predicts that more than 40% of data science tasks will be automated by 2020, which will likely require new big data tools and paradigms.
By 2017, global internet usage had reached 47% of the world’s population, according to an infographic provided by DOMO. This indicates that an increasing number of people are starting to use mobile phones and that more and more devices are being connected to each other via smart cities, wearable devices, Internet of Things (IoT), fog computing, and edge computing paradigms. As internet usage spikes and other technologies such as social media, IoT devices, mobile phones, and autonomous devices (e.g., robots, drones, vehicles, and appliances) continue to grow, our lives will become more connected than ever and generate unprecedented amounts of data, all of which will require new technologies for processing.
The Scale of Data Generated by Everyday Interactions
At a large scale, the data generated by everyday interactions is staggering. According to research conducted by DOMO, in every minute of 2018, Google conducted 3,877,140 searches, YouTube users watched 4,333,560 videos, Twitter users sent 473,400 tweets, Instagram users posted 49,380 photos, Netflix users streamed 97,222 hours of video, and Amazon shipped 1,111 packages. This is just a small glimpse of a much larger picture involving other sources of big data. It seems like the internet is pretty busy, doesn’t it? Moreover, mobile traffic is expected to grow tremendously beyond its present numbers, and the world’s internet population is growing significantly year over year. By 2020, the report anticipates that 1.7 MB of data will be created per person per second. Big data is getting even bigger.
At a small scale, the data generated on a daily basis by a small business, a startup company, or a single sensor such as a surveillance camera is also huge. For example, a typical IP camera in a surveillance system at a shopping mall or a university campus generates 15 frames per second and requires roughly 100 GB of storage per day. Consider the storage and computing requirements if the number of cameras is scaled to tens or hundreds.
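The scaling exercise can be worked through in a few lines of Python. The sketch below uses only the two figures quoted above (15 frames per second and roughly 100 GB per camera per day). It assumes decimal units (1 GB = 10^9 bytes) and a constant recording rate, which real cameras with variable bitrates will not match exactly.

```python
SECONDS_PER_DAY = 24 * 60 * 60
FRAMES_PER_SECOND = 15         # figure quoted in the text
GB_PER_CAMERA_PER_DAY = 100    # figure quoted in the text; 1 GB = 10**9 bytes (assumption)


def frames_per_day(fps: int = FRAMES_PER_SECOND) -> int:
    """Number of frames one camera records in 24 hours at a constant rate."""
    return fps * SECONDS_PER_DAY


def average_frame_size_kb() -> float:
    """Average storage per frame in kilobytes implied by the two quoted figures."""
    total_bytes = GB_PER_CAMERA_PER_DAY * 10**9
    return total_bytes / frames_per_day() / 1000


def daily_storage_tb(cameras: int) -> float:
    """Daily storage in terabytes for a given number of identical cameras."""
    return cameras * GB_PER_CAMERA_PER_DAY / 1000


print(f"frames per camera per day: {frames_per_day():,}")
print(f"average frame size: {average_frame_size_kb():.1f} KB")
for cameras in (1, 10, 100):
    print(f"{cameras:>3} cameras: {daily_storage_tb(cameras):.1f} TB/day")
```

Under these assumptions, a single camera records about 1.3 million frames per day at roughly 77 KB each. Ten cameras need about 1 TB of storage per day, and one hundred cameras need about 10 TB per day, before any retention period is taken into account.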
Big Data in the Scientific Community
Scientific projects such as CERN, which conducts research on what the universe is made of, also generate massive amounts of data. The Large Hadron Collider (LHC) at CERN is the world’s largest and most powerful particle accelerator. It consists of a 27-kilometer ring of superconducting magnets along with some additional structures to accelerate and boost the energy of particles along the way.
As the beams circulate, particles collide inside the LHC detectors roughly 1 billion times per second, which generates around 1 petabyte of raw digital “collision event” data per second. This unprecedented volume of data is a great challenge that cannot be handled by CERN’s current infrastructure. To work around this, the generated raw data is filtered and only the “important” events are processed to reduce the volume of data. Consider the challenging processing requirements for this task.
The four big LHC experiments, named ALICE, ATLAS, CMS, and LHCb, are among the biggest generators of data at CERN, and the rate of the data processed and stored on servers by these experiments is expected to reach about 25 GB/s (gigabytes per second). As of June 29, 2017, the CERN Data Center announced that they had passed the 200 petabytes milestone of data archived permanently in their storage units.
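The idea of filtering a stream so that only “important” events reach storage can be illustrated with a small Python sketch. This is not CERN’s actual trigger system. The `CollisionEvent` type, the simulated energy values, and the 300 GeV threshold are all hypothetical and serve only to show the pattern of discarding most events before they are stored.

```python
import random
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class CollisionEvent:
    event_id: int
    energy_gev: float  # hypothetical measure of how "interesting" an event is


def simulate_events(count: int, seed: int = 0) -> Iterator[CollisionEvent]:
    """Yield simulated events one at a time; most have low energy, a few have high energy."""
    rng = random.Random(seed)
    for event_id in range(count):
        # Exponentially distributed energies with a mean of 50 (illustrative only).
        yield CollisionEvent(event_id, rng.expovariate(1 / 50.0))


def keep_important(
    events: Iterable[CollisionEvent], threshold_gev: float
) -> Iterator[CollisionEvent]:
    """Pass through only events at or above the threshold, without buffering the stream."""
    for event in events:
        if event.energy_gev >= threshold_gev:
            yield event


TOTAL_EVENTS = 100_000
kept = list(keep_important(simulate_events(TOTAL_EVENTS), threshold_gev=300.0))
print(f"kept {len(kept)} of {TOTAL_EVENTS} events")
```

Because both functions are generators, events are examined one at a time and the rejected ones are never held in memory, which is the property that matters when the raw stream is far larger than what can be stored.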
Why Big Data Tools are Required
The scale of the data generated by well-known corporations, small-scale organizations, and scientific projects is growing at an unprecedented rate. The scenarios above make this clear, and the scale of this data keeps getting bigger.
On the one hand, the mountain of generated data presents tremendous processing, storage, and analytics challenges that need to be carefully considered and handled. On the other hand, traditional Relational Database Management Systems (RDBMS) and data processing tools cannot manage this massive amount of data efficiently once its scale reaches terabytes or petabytes. Fortunately, big data tools and paradigms such as Hadoop and MapReduce are available to resolve these big data challenges.
Analyzing big data and gaining insights from it can help organizations make smart business decisions and improve their operations. This can be done by uncovering hidden patterns in the data and using them to reduce operational costs and increase profits. Because of this, big data analytics plays a crucial role in many domains such as healthcare, manufacturing, and banking by resolving data challenges and enabling organizations to move faster.
Big Data Analytics Tools
Since the compute, storage, and network requirements for working with large data sets are beyond the limits of a single computer, there is a need for paradigms and tools to crunch and process data through clusters of computers in a distributed fashion. More and more computing power and massive storage infrastructure are required for processing this massive data either on-premises or, more typically, at the data centers of cloud service providers.
In addition to the required infrastructure, various tools and components must be brought together to solve big data problems. The Hadoop ecosystem is just one of the platforms helping us work with massive amounts of data and discover useful patterns for businesses.
Below is a list of some of the tools available and a description of their roles in processing big data:
- MapReduce: A distributed computing paradigm developed to process vast amounts of data in parallel by splitting a big task into smaller map and reduce tasks.
- HDFS: The Hadoop Distributed File System is a distributed storage and file system used by Hadoop applications.
- YARN: The resource management and job scheduling component in the Hadoop ecosystem.
- Spark: An in-memory data processing framework for batch and near-real-time workloads.
- Pig/Hive: Scripting (Pig) and SQL-like querying (Hive) tools for data processing that reduce the complexity of writing MapReduce programs.
- HBase, MongoDB, Elasticsearch: Examples of a few NoSQL databases.
- Mahout, Spark ML: Tools for running scalable machine learning algorithms in a distributed fashion.
- Flume, Sqoop, Logstash: Data integration and ingestion of structured and unstructured data.
- Kibana: A tool to visualize Elasticsearch data.
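To illustrate the MapReduce paradigm from the list above, here is a minimal single-process word count in plain Python. It is a sketch of the map, shuffle, and reduce phases only and does not use the Hadoop API. The function names are hypothetical, and a real cluster would run the map and reduce tasks in parallel across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key so each key reaches one reducer."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence over a collection of documents."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


documents = ["big data tools", "big data is big"]
print(word_count(documents))
```

Each call to `map_phase` depends only on its own document, and each call to `reduce_phase` depends only on its own key. That independence is what allows a framework to distribute the work across a cluster.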
To summarize, we are generating a massive amount of data in our everyday lives, and that amount continues to grow. Having the data alone does not improve an organization unless it is analyzed and its value is discovered for business intelligence. It is not possible to mine and process this mountain of data with traditional tools, so we use big data pipelines to help us ingest, process, analyze, and visualize these tremendous amounts of data.
Faruk Caglar received his PhD from the Electrical Engineering and Computer Science Department at Vanderbilt University. He is a researcher in the fields of Cloud Computing, Big Data, Internet of Things (IoT), and Machine Learning, and a solution architect for cloud-based applications. He has published several scientific papers and serves as a reviewer for peer-reviewed journals and conferences. He also provides professional consultancy in his research field.