Let’s learn about BIG DATA: how big MNCs like Google, Facebook, and Instagram store and manage thousands of terabytes of data with high speed and high efficiency

Sweta Sardar
6 min read · Sep 17, 2020


What is BIG DATA?

Big data is not a technology; it is, first of all, a storage problem in the data world. A widely used definition (originally from Gartner) puts it this way: “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
In other words, Big Data refers to data sets so large and complex that they have to be processed and analyzed with special tools to uncover valuable information that can benefit businesses and organizations.

Have you ever noticed how much data is produced every day?

In the digital and computing world, information is generated and collected at a rate that rapidly exceeds what traditional systems can handle. Currently, over 2 billion people worldwide are connected to the Internet, and over 5 billion individuals own mobile phones. By 2020, 50 billion devices were expected to be connected to the Internet. As information is transferred and shared at nearly the speed of light over optic-fiber and wireless networks, both the volume of data and the speed of market growth keep increasing.

Types of Big Data:

1. Structured data

2. Unstructured data

3. Semi-structured data
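
To make the distinction concrete, here is the same (made-up) user record in all three forms:

```text
Structured (fixed schema, e.g. a row in a relational table):
  id, name, city
  101, Asha, Kolkata

Semi-structured (has tags/keys, but no rigid schema), e.g. JSON:
  { "id": 101, "name": "Asha", "interests": ["photos", "travel"] }

Unstructured (no inherent schema: free text, images, audio, video):
  "Asha uploaded a vacation video from Kolkata and tagged three friends."
```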

The Three V’s of Big Data:

Volume: The amount of data matters. With big data, you’ll have to process high volumes of low-density, unstructured data. This can be data of unknown value, such as Twitter data feeds, clickstreams on a web page or a mobile app, or data from sensor-enabled equipment. For some organizations, this might be tens of terabytes of data; for others, it may be hundreds of petabytes.

Velocity: Velocity is the fast rate at which data is received and (perhaps) acted on. Normally, the highest-velocity data streams directly into memory rather than being written to disk. Some internet-enabled smart products operate in real time or near real time and require real-time evaluation and action. With the growth of the Internet of Things, data streams into businesses at an unprecedented speed and must be handled in a timely manner. RFID tags, sensors and smart meters are driving the need to deal with these torrents of data in near-real time.

Variety: Variety refers to the many types of data that are available. Data comes in all kinds of formats: from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audio, stock ticker data and financial transactions.

We know Google can answer almost any kind of question. Google currently processes over 40,000 searches every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year.

Social media is one of the main platforms for gaining and sharing knowledge. We use lots of social media platforms like Facebook, Instagram, LinkedIn, YouTube, Netflix and many more.

Facebook: 2.5 billion pieces of content and 500+ terabytes ingested every day
“Over 2.5 quintillion bytes of data are created every single day, and it’s only going to grow from there. By 2020, it’s estimated that 1.7 MB of data will be created every second for every person on earth.” That pace is only accelerating with the growth of the Internet of Things (IoT). Over the last two years alone, 90 percent of the data in the world was generated.

Instagram: 95 million photos and videos are shared on Instagram per day. Over 40 billion photos and videos have been shared on the platform since its inception.

LinkedIn: LinkedIn is the world’s preeminent social network for professionals. Members create online résumés, listing their current and previous job roles, their skills, and their education. They can network with other LinkedIn members, who are searchable by those criteria and more.

- 660 million LinkedIn users, spread over 200 countries, as of November 2019
- LinkedIn users spend an average of 10 minutes 20 seconds on the site daily, visiting 8.5 pages (or, by another measure, 7 minutes 18 seconds and 7.99 pages)

So let’s see how we can solve this big data problem:

“Architecture of Distributed Storage”:

Hadoop is a software framework that helps us build this “Architecture of Distributed Storage”.

If we have one laptop with 1 TB of storage and we want to store 50 GB more, we can simply buy a hard disk and store it there. But companies receive huge amounts of data every day, and a single machine doesn’t have the resources to store it; a company may need to store something like 400 TB at a time. Yes, a company could buy 500 TB of storage and solve the capacity problem, but then it hits a bigger problem: I/O. A single disk can only read and write data at a limited speed, so loading or processing hundreds of terabytes through one machine takes an impractically long time. To solve this, nowadays almost all companies use distributed storage: the data is split across many machines, which read and write their pieces in parallel.
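
As a back-of-the-envelope sketch of why this matters (assuming a sequential read speed of roughly 200 MB/s per disk, a typical figure for a commodity hard drive; both the data size and the disk speed here are illustrative assumptions):

```java
// Rough arithmetic: how long does it take to read 400 TB sequentially?
public class IoBottleneck {
    public static void main(String[] args) {
        double dataTb = 400.0;                  // total data to read, in TB
        double diskMbPerSec = 200.0;            // assumed per-disk read speed
        double totalMb = dataTb * 1024 * 1024;  // TB -> MB

        double oneDiskHours = totalMb / diskMbPerSec / 3600.0;
        System.out.printf("1 disk:    %.0f hours (~%.0f days)%n",
                oneDiskHours, oneDiskHours / 24.0);   // ~583 hours, ~24 days

        int nodes = 100;                        // 100 disks reading in parallel
        System.out.printf("%d disks: %.1f hours%n", nodes, oneDiskHours / nodes);
    }
}
```

Capacity scales by buying more disks, but a single disk’s read/write speed does not; splitting the data across many machines that read in parallel is what brings the time down from weeks to hours.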

HADOOP

Hadoop is an open-source, Java-based framework used for storing and processing big data. The data is stored on inexpensive commodity servers that run as clusters. Its distributed file system enables concurrent processing and fault tolerance. Developed by Doug Cutting and Michael J. Cafarella, Hadoop uses the MapReduce programming model for faster, parallel processing and retrieval of data from its nodes. The framework is managed by the Apache Software Foundation and is licensed under the Apache License 2.0.

Hadoop runs on large data sets distributed across clusters of commodity computers. A computer cluster consists of multiple nodes that are connected to each other via a network and act as a single system. The main node is called the MASTER node or NAME NODE, and the rest of the nodes are known as DATA NODEs (slaves); this topology is known as a master-slave cluster.
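
As a minimal sketch of how such a cluster is wired together (the hostname and values below are placeholders, not a real deployment): every node’s core-site.xml points at the NameNode, and hdfs-site.xml tells HDFS how many DataNodes should hold a copy of each block.

```xml
<!-- core-site.xml (on every node): where is the master/NameNode? -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: replicate each block to 3 DataNodes for fault tolerance -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```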

Hadoop solves two key challenges with traditional databases:

1. Capacity: Hadoop stores large volumes of data.
By using a distributed file system called HDFS (Hadoop Distributed File System), the data is split into chunks and saved across clusters of commodity servers. As these commodity servers are built with simple hardware configurations, they are economical and easily scalable as the data grows.
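
From an application’s point of view, saving a file into HDFS looks like writing to one big file system; the splitting into blocks and the spreading across DataNodes happen behind the scenes. Here is a minimal sketch using Hadoop’s Java FileSystem API (the cluster address and file path are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address of the NameNode (master node).
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/events.log"))) {
            // To us this is one logical file; HDFS transparently splits it
            // into blocks (128 MB by default) and replicates the blocks
            // across the DataNodes of the cluster.
            out.writeBytes("one record of many\n");
        }
    }
}
```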

2. Speed: Hadoop stores and retrieves data faster.
Hadoop uses the MapReduce functional programming model to perform parallel processing across data sets. So, when a query is sent to the database, instead of handling data sequentially, tasks are split and concurrently run across distributed servers. Finally, the output of all tasks is collated and sent back to the application, drastically improving the processing speed.
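
To make the “split tasks, run concurrently, collate” idea concrete, here is the classic word-count job in Hadoop’s Java MapReduce API (essentially the canonical example from the Hadoop tutorial): map tasks run in parallel on each block of the input and emit (word, 1) pairs, and reduce tasks collate the partial counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: each mapper sees one chunk of the input and emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: all counts for the same word arrive together; sum them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The combiner line is an optimization: it pre-aggregates counts on each mapper’s machine so that less data crosses the network to the reducers.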

Benefits of Hadoop for Big Data:

- Resilience
- Scalability
- Low cost
- Speed
- Data diversity

THANK YOU
