Big Data… But what is it?

Dorjey Sherpa
Published in Analytics Vidhya
5 min read · Oct 20, 2020


To talk about Big Data, we must first talk about data. With rapid technological advancement, data comes in all shapes and sizes: structured data, unstructured data, and slightly structured data, also known as semi-structured data. Structured data is fairly easy to manage, search and filter for analysis because of its row-and-column layout, and it requires less storage. Unstructured data, by contrast, is complex, comes in many different file formats, is hard to manage and requires more storage.

Structured Data vs. Unstructured Data: what are they and why care? from Lawtomated

But these data usually fall into two bins: Traditional Data and Big data.

Traditional Data

Traditional data is what most people are familiar with, both inside and outside of data science. It is structured data, usually maintained in a fixed format (.xlsx, .sql, .csv, .json, etc.), which is easy to manipulate and clean. Traditional data systems can be large (gigabytes or terabytes, where 1 Terabyte = 1,000 Gigabytes) and are usually used to process and understand one’s own organization, company or customer standings.

Big Data

Talk about Big Data has been increasing over the last 10 years and is slowly making its way into the mainstream. IDC even projects that the Global DataSphere will hold over 175 zettabytes by 2025 (1 zettabyte = 1,000,000,000,000 Gigabytes).

Source: Data Age 2025 from IDC Global DataSphere

As people entering the data science field, we too should know what Big Data is and how it differs from Traditional Data.

Three Vs of Big Data

Big Three of Big Data

The Three “V”s are Volume, Velocity and Variety. They are widely accepted as the defining characteristics of Big Data, a framing coined by Doug Laney.

Volume:

Big data… is big. It ranges from Petabytes (1 Petabyte = 1,000,000 Gigabytes) to Zettabytes (1 Zettabyte = 1,000,000 Petabytes). But size alone is not enough to call it Big Data. Big Data is relatively new compared to traditional data, and it arises from technological advancements that have made collecting data so easy.
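To make the unit conversions above concrete, here is a minimal Python sketch. It assumes decimal (SI) units, where each step up is a factor of 1,000; the table name UNITS_IN_BYTES and the helper convert are made up for this illustration.

```python
# Decimal (SI) storage units expressed in bytes (assumption: powers of 1,000).
UNITS_IN_BYTES = {
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}


def convert(amount, from_unit, to_unit):
    """Convert an amount of data between two decimal storage units."""
    return amount * UNITS_IN_BYTES[from_unit] / UNITS_IN_BYTES[to_unit]


# The conversions quoted in this article.
print(convert(1, "terabyte", "gigabyte"))
print(convert(1, "petabyte", "gigabyte"))
print(convert(1, "zettabyte", "petabyte"))
print(convert(1, "zettabyte", "gigabyte"))
```

Keeping everything in bytes internally means any pair of units can be converted with a single multiplication and division.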

Velocity:

Velocity is the rate at which the data grows, and Big Data needs high velocity. It’s easy to see how 4 petabytes of new data every day from Facebook can become big data (4 petabytes x 365 days = 1,460 petabytes/year)…
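The arithmetic above can be sketched in a few lines of Python. This assumes a constant daily rate, which is a simplification, and the function name yearly_volume_pb is only illustrative.

```python
def yearly_volume_pb(daily_pb, days=365):
    """Petabytes accumulated in a year, assuming a constant daily rate."""
    return daily_pb * days


# 4 petabytes per day, the Facebook figure quoted above.
per_year_pb = yearly_volume_pb(4)

# 1 exabyte = 1,000 petabytes in decimal units.
per_year_eb = per_year_pb / 1000

print(per_year_pb, "petabytes per year")
print(per_year_eb, "exabytes per year")
```

In practice the daily rate itself keeps growing, so a fixed-rate estimate like this one is better read as a lower bound.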

As mentioned, big data is on the rise due to technological advancements, and the MRI is a good example. MRIs are now seven times faster, which yields images at shorter time intervals, which in turn means more data for doctors and data scientists to analyze before coming to sound conclusions.

Variety:

Variety refers to the variety of data formats. Facebook is a great example of a Big Data company. Have you ever taken one of those “Which (x) are you?” quizzes or “personality tests”? You click through the answers, submit for the “result”, and get feedback that makes you feel some type of emotion. But behind the scenes, all your answers are collected and stored, most likely in a table, which is structured data. This is a very, very, very small portion of the Big Data. Facebook users also post and send photos, audio, video and messages, both public and private, and all of this is data that Facebook needs to store on its servers. So most of the 4 new petabytes of data received per day are unstructured or semi-structured.
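A small Python sketch can show how these three kinds of data differ in practice. All of the records below are hypothetical and invented for illustration; they are not Facebook’s actual formats.

```python
import json

# Hypothetical quiz answers: every row has the same columns (structured).
quiz_rows = [
    {"user_id": 1, "question": "Q1", "answer": "B"},
    {"user_id": 2, "question": "Q1", "answer": "D"},
]

# Hypothetical post: labeled fields, but nested and optional (semi-structured).
post_json = (
    '{"user_id": 1, "text": "Hello!", '
    '"media": [{"type": "photo", "file": "img_001.jpg"}]}'
)

# Hypothetical photo: raw bytes with no fields at all (unstructured).
photo_bytes = bytes([0xFF, 0xD8, 0xFF, 0xE0])


def same_columns(rows):
    """Return True when every row has exactly the same set of keys."""
    return len({frozenset(row) for row in rows}) == 1


# Structured data can be filtered by column directly.
answers_b = [row["user_id"] for row in quiz_rows if row["answer"] == "B"]

# Semi-structured data must be parsed first, and fields may be missing.
post = json.loads(post_json)
media_types = [item["type"] for item in post.get("media", [])]

print(same_columns(quiz_rows), answers_b, media_types, len(photo_bytes))
```

The quiz table can be queried by column right away, the post needs parsing and defensive lookups, and the photo offers nothing to query until some other tool extracts meaning from its bytes.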

gif from GIPHY

More “V”s?

The Three “V”s above are widely accepted, but there is still debate about what other characteristics data needs in order to be considered big data.

IBM considers Veracity, the quality/certainty of the data, to also be a factor, making it Four “V”s. If data is meaningless, it is just taking up space that could hold meaningful data. But how does one know what data is meaningless? That is why companies prefer candidates with domain knowledge.

SAS (Statistical Analysis System) Institute believes there are five “V”s that define big data, adding Veracity and Variability to the original three. Variety and Variability are different: Variety looks at the different data formats, whereas Variability looks at the “flow” of the data.

George Firican, Founder of Lightsondata, believes there to be 10 “V”s to describe Big Data. You can read about the 10 “V”s here.

However, the man who coined the “Three ‘V’s of Big Data”, Doug Laney, said the following in an interview with Gregory Piatetsky from KDnuggets:

“Yes, others have suggested other V’s like veracity, but these are not measures of bigness, so are not really definitional characteristics of Big Data. Nonetheless, they are important considerations for most data. In fact, some colleagues and I came up with 12 Vs of data that can be used to ensure various aspects of data are managed and leveraged appropriately.” — Doug Laney

Gif from GIFSFORUM.com

So what is big data? In the end, Big data is big, unstructured and growing at a crazy rate. As for the other V’s, check with your industry focus and trust your own domain knowledge. However, regardless of what the “V”s are, the data needs to bring Value to the grand scheme of data collecting: a greater understanding of customer activity or behavior to increase customer satisfaction/retention, methods to reduce company spending, ways to increase efficiency, etc.

If you are still uncertain about what is what and how the V’s differ, then check out this article.

Want to know more about accessing Big data?

Hadoop and Spark are two of the many leading tools for Big Data. Check out What is Hadoop? by Oyetoke Tobi Emmanuel or High Level Overview of Apache Spark by Eric Girouard.

Optional Fun Readings about Data Science:

If you have Facebook and want to know more about its stats, click here, and if you want to know more about Facebook’s data structures (Big and Small), click here.

Digitization of the World by David Reinsel, John Gantz and John Rydning (very fun read about the potential future of the general population)
