๐™ƒ๐™ค๐™ฌ ๐˜ฝ๐™„๐™‚ ๐˜ฟ๐˜ผ๐™๐˜ผ ๐™˜๐™๐™–๐™ฃ๐™œ๐™ž๐™ฃ๐™œ ๐™ฉ๐™๐™š ๐™ฌ๐™–๐™ฎ๐™จ ๐™ฉ๐™ค ๐™ข๐™–๐™ฃ๐™ž๐™ฅ๐™ช๐™ก๐™–๐™ฉ๐™š ๐™ฉ๐™๐™š ๐™™๐™–๐™ฉ๐™–

Digambar Nandrekar
7 min read · Mar 13, 2021

What is BIG DATA?

Big Data simply means a huge amount of data. It has become a domain in the IT sector consisting of the technologies that deal with the large volumes of data generated every second. It is not a technology in itself but a problem that has arisen from the huge amount of data generated in today’s world.

“90% of the world’s total data was generated in the last two years.”

“Big Data” is more of a marketing term. It implies that data today is so big that you cannot analyze all of it at once, because it is far larger than the memory (RAM) available to hold it. This data is generated with high volume, velocity, and variety.

WHEN IS BIG DATA USED?

  • When there is a high volume of unstructured data, big data techniques are used in almost every case.
  • When there are large amounts of structured or semi-structured data, big data helps derive insights through analytics models.
  • Big data also helps in structuring data and getting answers through queries, so it is used for querying data as well.

WHO USES BIG DATA?

  • All industry segments, from social media to health services, are using it.
  • Hospitality / Hotel / Travel: applications and websites use it to understand customer needs and to set their pricing models and travel packages accordingly.
  • Health Industry: from predicting ailments to medication, and for designing health kits and health insurance packages and providing the necessary health care, the health industry is using big data.
  • Retail businesses like Amazon, Walmart, and many FMCG companies use big data to understand customer behavior and build suitable offers that increase their sales.
  • Banking and Financial Services: understanding customers’ patterns and transactions in order to provide loans and credit cards, and predicting fraudulent transactions so they can be stopped in real time.
  • Government: with Aadhaar and a huge database on the population, the government also uses big data for census calculations, for providing subsidies, and for planning government schemes.

Types Of Big Data

Big Data can be found in three forms:

  1. Structured Data: data that has a proper, fixed structure associated with it. For example, the data in databases, CSV files, and Excel spreadsheets can be referred to as structured data.
  2. Unstructured Data: data that does not have any structure associated with it at all. For example, image files, audio files, and video files can be referred to as unstructured data.
  3. Semi-structured Data: data that does not follow a fixed schema but still carries some organizing information, such as headers, tags, or key-value pairs. For example, the data in emails, log files, and Word documents can be referred to as semi-structured data.
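
As a rough illustration of the three forms, the sketch below reads one tiny sample of each using only Python's standard library. The sample values and the `describe` helper are hypothetical and exist only for this example; they do not come from any real dataset.

```python
import csv
import io
import json

# Structured: every row follows the same fixed columns (sample CSV text).
csv_text = "id,name,age\n1,Asha,30\n2,Ravi,25\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing keys, but records may differ in shape.
json_text = (
    '[{"user": "asha", "tags": ["a", "b"]},'
    ' {"user": "ravi", "email": "r@example.com"}]'
)
records = json.loads(json_text)

# Unstructured: raw bytes with no fields to query
# (a stand-in for image, audio, or video content).
blob = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0x01])


def describe():
    """Summarise what can be read from each sample without extra processing."""
    return {
        "structured_columns": list(rows[0].keys()),
        "semi_structured_keys": [sorted(r.keys()) for r in records],
        "unstructured_size_bytes": len(blob),
    }


print(describe())
```

The structured sample exposes the same columns for every row, the semi-structured records each carry their own set of keys, and the unstructured sample offers nothing beyond its size until some further processing is applied.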

The Four V’s of Big Data are as follows:

📌 Volume of Big Data:

The name Big Data itself refers to an enormous size. The size of the data plays a crucial role in determining the value that can be drawn from it, and whether a particular dataset counts as Big Data at all depends on its volume. Hence, ‘Volume’ is one characteristic that must be considered when dealing with Big Data.

📌 Velocity of Big Data:

The term ‘velocity’ refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines its real potential.

Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, and mobile devices. The flow of data is massive and continuous.

📌 Variety of Big Data:

Variety refers to the heterogeneous sources and nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, and so on is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.

📌 Veracity of Big Data:

In the context of big data, veracity is essentially the accuracy of the data: not just the quality of the data itself, but how trustworthy its source, type, and processing are.

Importance of Big Data:

Traditional database management systems fail to handle Big Data. New tools and technologies such as Apache Hadoop and Apache Spark have emerged to deal with it. Hadoop is the most popular among them.

Let us take an example of Facebook. Suppose it wants to know which ad works best for the people with college degrees.

Now let’s say there are 200,000,000 Facebook users with college degrees, and each has been served 100 ads. This means there are 20,000,000,000 events of interest, and each event consists of several features, such as the color, what the ad is about, and whether it shows a man or a woman.

Let’s say there are 50 such features in each ad. This means we now have 1,000,000,000,000 units of data to work on. If each unit of data were only 100 bytes, Facebook would have 100,000,000,000,000 bytes, about 100 TB (roughly 93,000 GiB), of data to compute on and analyze.
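
The arithmetic in this example can be checked with a few lines of Python. The inputs are the figures used above; the variable names are only illustrative, and the 100 bytes per unit is the same assumption the example makes.

```python
# Figures taken from the Facebook example above.
users = 200_000_000        # users with college degrees
ads_per_user = 100         # ads served to each user
features_per_ad = 50       # features recorded per ad event
bytes_per_unit = 100       # assumed size of one unit of data

events = users * ads_per_user            # ad events of interest
units = events * features_per_ad         # individual units of data
total_bytes = units * bytes_per_unit     # raw size in bytes

total_tb = total_bytes / 10**12          # decimal terabytes
total_gib = total_bytes / 2**30          # binary gibibytes

print(f"events: {events:,}")
print(f"units:  {units:,}")
print(f"size:   {total_tb:.0f} TB (about {total_gib:,.0f} GiB)")
```

Changing any one input, for example the bytes per unit, shows how quickly the total grows, which is the point of the example.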

This is a small sample of the millions of cases that Facebook has to analyze daily. Such a large sample of data helps Facebook extract results in the form of patterns, behavior, likes, and so on very accurately, so that the right ads are shown to the right people at the right time.

This is the reason behind Facebook’s success. Facebook, Google, and Yahoo were pioneers of Big Data technologies. Following their success, every company wants to become the next Facebook, which is the reason behind the rise in Big Data’s popularity. There are many other applications of Big Data in sectors such as health care, defense, and security.

Every challenge has a solution: the solution to the Big Data problem is “Distributed Storage Solutions”

Consider this simple example: you have 3 laptops or 3 storage servers, typically known as Slave Nodes. Each one is connected over the network to one main laptop, typically known as the Master Node. Now suppose each server has 150 MB of storage. If 320 MB of data arrives, we cannot store it on any single server, and this is where Distributed Storage comes into play: the three servers together offer 450 MB, so the data can be split across them.

  • The master always receives the data and distributes it among the slaves. That means we no longer have to worry about volume problems: no matter how big the data is, we can distribute it across the slaves, and we do not need to purchase bigger storage.
  • Since we are not purchasing bigger storage, our costs also decrease. We can purchase many small storage servers and attach them to the master. If the data grows further in the future, we purchase more storage servers and keep attaching them to the master.
  • The final point is speed. Suppose one storage server takes 1 minute to store 10 GB of data. With 10 storage servers working in parallel (1 GB on each), storing the same 10 GB takes only a few seconds, about 6 seconds in the ideal case. It is also not only about storing the data but about how fast you can read it: with 10 storage servers in parallel, reading the same 10 GB takes only a few seconds, whereas reading it from a single storage server takes around 1 minute. These are simple examples; in actual industry deployments these architectures are much bigger, with many components attached to each other.
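
The storage and speed points above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any particular product works: `distribute` and `ideal_parallel_seconds` are hypothetical helpers, the master simply fills nodes in order, and the timing assumes a perfectly even split with no network or coordination overhead. Real systems such as HDFS split data into blocks and keep replicas, which this sketch ignores.

```python
def distribute(data_mb, node_capacities_mb):
    """Fill nodes in order and return how many MB each node stores.

    Raises ValueError when the cluster as a whole is too small.
    """
    if data_mb > sum(node_capacities_mb):
        raise ValueError("not enough total capacity; attach more nodes")
    placement = []
    remaining = data_mb
    for capacity in node_capacities_mb:
        stored = min(capacity, remaining)
        placement.append(stored)
        remaining -= stored
    return placement


def ideal_parallel_seconds(single_node_seconds, node_count):
    """Best-case time when work is split evenly and nodes run in parallel."""
    return single_node_seconds / node_count


# 320 MB arriving at a master with three 150 MB slave nodes.
print(distribute(320, [150, 150, 150]))

# 10 GB takes 60 s on one server; split across 10 servers in the ideal case.
print(ideal_parallel_seconds(60, 10))
```

When the data outgrows the cluster, `distribute` raises an error, which corresponds to the point where more storage servers have to be attached to the master.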

What are the best Big Data Tools?

Here is a list of the top 10 big data tools:

  • Apache Hadoop
  • Apache Spark
  • Flink
  • Apache Storm
  • Apache Cassandra
  • MongoDB
  • Kafka
  • Tableau
  • RapidMiner
  • R Programming

What is Hadoop?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware (that is, inexpensive, widely available standard machines rather than specialized high-end servers). It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.
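
Hadoop’s processing layer follows the MapReduce model: each node maps over its own piece of the data, the intermediate results are grouped by key, and a reduce step combines them. The sketch below imitates that flow inside a single Python process to show the idea. It does not use any Hadoop API, and `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical names chosen for this example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run map over every chunk, then shuffle and reduce the results."""
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(pairs))


# Each string stands in for a block of a file stored on a different node.
chunks = ["big data is big", "data is everywhere"]
print(word_count(chunks))
```

In a real cluster the map calls would run on the nodes that hold each block, so the data does not have to be moved to one machine before it is processed.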

Conclusion:

With the various technologies it covers, Big Data helps almost every company or sector that aspires to grow. Analyzing the large datasets associated with a company’s events can give it insights to increase customer satisfaction. Big MNCs like Google, Facebook, and Instagram store, manage, and manipulate thousands of terabytes of data with high speed and high efficiency using distributed storage solutions, and manage that data using Big Data tools like Hadoop.

✨✨ Thank you for reading my article ✨✨
