Big Data Demystified

(Reading time: approx. 5 minutes)

Big data is raw data generated by real-life activities. Each time you click around a web browser or talk on the phone or drive your car or get in an elevator, you are adding to the 2.5 quintillion bytes of data being generated in the wild every day. So why is big data such a big deal? Big data helps us see the world more clearly and act accordingly.

Let’s demystify this enigmatic term.

 

Forget the math.  Ask the right questions.

You will eventually need to know about neural networks, supervised and unsupervised machine learning, MapReduce, multivariate distributions, k-means clustering and so on. But for now, don’t get distracted. Stay focused on the one question that matters and ask yourself, ‘What is it about the world that I am trying to see more clearly?’

The questions that are relevant to your situation should be the starting point and driving force behind your big data project. They are your compass. Invest some time and effort in creating a list of the most important questions that need answering, and validate them frequently to make sure you are asking the right ones. Don’t be surprised if other relevant questions arise as you mine your data. After all, exploration and discovery are the whole point of data mining.

 

Small Data

To make sense of your big data, you first have to understand your ‘small’ data. Small data is a single measurement or observation: the individual atomic unit of data. Collectively, these units make up your data warehouse. Data at the atomic level is either nominal or ordinal. In plain English, this means that data organizes the world in two ways: naming (nominal data) and ordering (ordinal data).

Names denote something’s uniqueness or membership in a group. Names give things an identity. For instance, my social security number identifies me uniquely. Any group I belong to, such as a customer segment, also identifies me (albeit not uniquely). I belong to the ‘male, over 40 in Toronto’ customer segment along with every other male over 40 in Toronto. Note that identity does not have a mathematical value, even when it is numeric. Calculating the average social security number does not result in anything meaningful because this data is nominal.

Ordering things denotes sequence, importance or magnitude. Customer ratings, such as 4 out of 5 stars, are an example of ordinality. The rating denotes the magnitude of a customer’s satisfaction with a product or service. This value is mathematical, and calculating the average customer rating on a product yields a meaningful result. Ordinal values are anonymous and necessarily so. The math becomes awkward if you start treating one instance of the number 7 differently than another 7.
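To make the distinction concrete, here is a minimal Python sketch. The customer records are made up for illustration; the point is only that names get counted and grouped, while ordered values can be averaged.

```python
from collections import Counter
from statistics import mean

# Hypothetical records: "segment" is a name, "rating" is an ordered value.
customers = [
    {"segment": "male, over 40 in Toronto", "rating": 4},
    {"segment": "female, under 30 in Toronto", "rating": 5},
    {"segment": "male, over 40 in Toronto", "rating": 3},
]

# Names can be counted and grouped, but averaging them makes no sense.
segment_counts = Counter(c["segment"] for c in customers)

# Ordered values can be summarized numerically.
average_rating = mean(c["rating"] for c in customers)

print(segment_counts)
print(average_rating)
```

Asking for the ‘average segment’ is as meaningless as the average social security number, which is why the sketch only counts segments.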

Thinking of data in the small sense will help you examine your big data with a keener eye. You will have a better chance of asking the right questions. Most importantly, you will see that your data has context. Context is anything that gives data meaning. Just as the number 125 is meaningless without a unit of measure (125 mph, 125 lbs, 125 centimeters), data is meaningless without context.

And this is where big data comes in.

 

Big Data

Big data is the mother lode of context. Big data is simply lots and lots of small data. And each atomic unit of small data has a relationship with all the other atomic units. This relationship is what gives the data context, and this context is the key to gaining insights. More context = more insight.

Figure 1: k-means clustering

To illustrate, let’s look at three types of insight: grouping, correlation and prediction.

Grouping is useful because it identifies a segment of data that has something relevant in common. For example, in Figure 1, let’s assume the points on the graph represent customer locations and the ‘+’ symbols are store locations. Each coloured area represents a group of customers who share the same closest store. A customer segment that has geography as its common feature gives businesses insight into where their customer touchpoints are and has implications for things like new store locations, delivery territories and outdoor ad placement. For you tech heads, grouping is accomplished using clustering algorithms such as k-means.
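For the tech heads, here is a minimal sketch of k-means in plain Python. The coordinates and starting centres are invented, and the function names are my own. Note that k-means computes its own group centres from the data, so they will not necessarily line up with real store locations.

```python
from math import dist

def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))

def k_means(points, centroids, iterations=10):
    """Basic k-means: assign points to the nearest centre, then re-centre."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: label every point with its closest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a group is empty
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return [nearest(p, centroids) for p in points], centroids

# Hypothetical customer locations (x, y) and two made-up starting centres.
customer_locations = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
                      (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centres = k_means(customer_locations, [(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centres)
```

Real projects would normally reach for a library implementation and choose the starting centres more carefully; this sketch is only meant to show the two alternating steps.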

Correlation is useful in seeing how different factors relate. Figure 2 is a scatter plot of individuals by income and age. You can clearly see how closely the points follow the ‘line of best fit’, showing a strong correlation between these two factors.

Figure 2: Income-Age Scatter Plot

Correlations between numeric factors are measured mathematically with a correlation coefficient, and co-occurrence patterns are mined with ‘frequent pattern’ algorithms such as Top K Parallel FPGrowth, but both are also often found visually.

So please, LOOK at your data. Find as many ways as is practical to visualize your data. It’s an effective way to find patterns and it’s fun.
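Once your eyes have spotted a pattern like the one in Figure 2, you can put a number on it. The sketch below computes the Pearson correlation coefficient, one common measure of how closely two factors move together. The ages and incomes are made up and are not the data behind Figure 2.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient: +1 is a perfect upward line, -1 a perfect downward one."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)

# Hypothetical sample: age in years, income in thousands of dollars.
ages = [25, 30, 35, 40, 45, 50]
incomes = [38, 45, 51, 60, 64, 72]

r = pearson_r(ages, incomes)
print(round(r, 3))  # a value close to 1 means a strong positive correlation
```

A value near 0 would mean the points are scattered with no clear line through them.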

Predictions require you or a machine to find correlations first. These correlations are used to teach a computer how certain factors relate. The computer can then predict the value of a missing factor given the known value of the others. If the points on the graph in Figure 2 were used as a training set to teach a computer how income and age relate, the computer could be taught to predict a person’s age based on their income or their income based on their age. This is an example of supervised machine learning and is accomplished mathematically using regression algorithms. Accuracy depends in part on the size of the training set (points on the graph). In general, the larger and more representative the training set, the more accurate the predictions will be. Remember, more context = more insight.
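Here is a minimal sketch of that idea using simple least-squares linear regression, the math behind a ‘line of best fit’. The training points are invented for illustration, and a real model would use far more data and more factors than age alone.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training set: age in years, income in thousands of dollars.
ages = [25, 30, 35, 40, 45, 50]
incomes = [38, 45, 51, 60, 64, 72]

# "Training" is just finding the line that best fits the known points.
slope, intercept = fit_line(ages, incomes)

# Prediction: estimate the income of a 42-year-old who is not in the training set.
predicted_income = slope * 42 + intercept
print(round(predicted_income, 1))
```

Swapping the two lists in the call to fit_line would give a line that predicts age from income instead.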

 

Big Data in the real world

Now that you know some theory, let’s take a look at how it all works in the real world. At Hubba, we apply data science to help businesses and their customers see each other more clearly. We do this by putting data in the proper context.

For customers, this means receiving accurate, useful and beautiful information about the products and services they care about when they ask for the information.  Hubba technology also gives customers a voice. For retailers and brands, this means seeing who their customers are, where their customers are, what customers think of their brand and what their customers care about in the world besides their product.

Grouping algorithms are used to identify the best customers so that brands and retailers can reward them to maximize loyalty.

Correlation algorithms are used to deliver the right message to the right person at the right time, at scale.

And most importantly, prediction algorithms are used to determine customer intent.

This last application of data science is my personal favourite. By seeing what customers have browsed, reviewed or purchased, retailers and brands using Hubba can find correlations, predict customer behaviour and take effective action. And by applying machine learning to the growing volume of customer data, prediction accuracy should keep improving over time.

Figure 3: Income Map

The map in Figure 3 shows the average income by postal code within a 5 km radius of a shopping mall near my home. What if brands could see which areas of the city their customers live in or, even better, where their competitors’ customers live? What if brands and retailers could see which customers were most likely to make a purchase and know which incentives would trigger conversion? I am excited to report that this technology exists today.

Hubba is built from the ground up to solve the formula: Data + Context = x Intent, the holy grail of retail.

 

So why is big data such a big deal?

Big data helps machines, systems and people operate more productively and with fewer errors.  And with so much new data being created daily, inefficiencies and errors will continuously shrink over time through machine learning.

Although I admit it’s convenient having Amazon make good book recommendations, I’m most excited about how big data will help us solve the really important problems like curing diseases, preventing crime and accurately predicting natural disasters.

Do you have any save-the-world, pie-in-the-sky, big data projects you would like to see built?  I’d love to hear from you.

 

Inbae Ahn

Inbae is the CTO at Hubba. Since 2001, he has been building software and leading teams for the world’s largest (and smallest) organizations. He believes technology’s sole purpose is to serve people by solving problems. Inbae is a StartupWeekend NEXT Instructor and a startup Advisor to Hovr.it, Parachute Software, RadicalRadical, Hak Studio, Bitmaker Labs and Plascii. His personal mission is to build community and make Toronto a world-class technology hub.