In a data-driven organisation, everyone speaks the data dialect

This is a link to a great 18-minute presentation recorded at the OfferZen MERGE Conference.  Brittany Wagenaar and Robyn Steenekamp’s talk on “Why Everyone’s a Data Scientist” explains how Luno is transforming into a data-driven organisation by leveraging the concept of the Data Science Citizen.  The Luno team is scaling and increasingly uses data to drive decision-making.  Employees who can read, reason and argue with data are fluent in the data dialect that empowers data-driven decisions.  Why is this important to them?  “The true power of data is harnessed when we incorporate domain knowledge into data analysis.”

Listen to how they’ve made data accessible, understandable and easy to use for everyone in the company.


Why Data Wrangling is the key to Data Science

To know why Data Wrangling is the key to Data Science, first we need to agree on what Data Science is.

What is Data Science?

At Data Dialect we describe Data Science as “The science of solving problems using data.”

The Head of Decision Intelligence at Google, Cassie Kozyrkov, gives a similar definition:

“Data science is the discipline of making data useful.  Its three subfields involve mining large amounts of information for inspiration (analytics), making decisions wisely based on limited information (statistics), and using patterns in data to automate tasks (ML/AI).”

Whichever area of data science you venture into, access to enough high-quality data is the foundation for success in your effort.  On the flip side, limited, low-quality data is bound to lead to limited, low-quality results – a kind of “garbage in, garbage out”.

What is enough data?

Analysts have been working with data for decades, even centuries – predicting, forecasting, looking back, finding patterns.  The data used in traditional analysis is mostly structured, of low complexity and drawn from a defined sample group, derived from traditional data sources.  Because everyone draws on the same traditional sources, the results are crowded, overtraded and prone to overfitting.

Big data, on the other hand, is mostly unstructured, and results derived from big data are not binary.  Much of big data is found in alternative sources like geolocation, online behaviour, consumer spending, web scraping, the Internet of Things and flight data, to name but a few.

Free online public databases like Google Dataset Search, the World Bank Open Data site and the USA government’s open data portal, among many others, also give access to large collections of data.

All this data put together allows us to create mosaic pictures of the world around us.

Using multiple data sources does mean combining data with different structures, from different locations, with different measures of quality and in different formats.

Quantity is no longer a problem – consistent high quality is.

What is high quality data?

One of the fathers of data science, Hadley Wickham, defines high-quality data as data that fulfils the requirements of “Tidy Data”.  He defines “Tidy Data” as datasets that are arranged such that:

  • Each variable is a column
  • Each observation is a row
  • Each type of observational unit is a table
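To make the three rules concrete, here is a minimal sketch in Python with pandas (the farm names, months and figures are invented for illustration; the same reshape maps directly to `tidyr::pivot_longer` in R).  A table with one column per month breaks the first rule, because the variable “month” is hidden in the column headers; melting it produces a tidy table.

```python
import pandas as pd

# A "messy" table: rainfall readings with one column per month,
# so the variable "month" is spread across the column headers.
messy = pd.DataFrame({
    "farm": ["Riverbend", "Hilltop"],
    "Jan": [120, 95],
    "Feb": [80, 60],
})

# Tidy form: each variable (farm, month, rainfall_mm) is a column,
# and each observation (one farm in one month) is a row.
tidy = messy.melt(id_vars="farm", var_name="month", value_name="rainfall_mm")
print(tidy)
```

The tidy table has four rows – one per farm-and-month observation – which is exactly the shape analytics, statistics and ML tools expect.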

Here is the big leap.  There are masses of unstructured – or dirty – datasets available to us.  For this dirty data to be made useful, whether through analytics, statistics or machine learning, it needs to be reordered into tidy (alas, not tiny) datasets.  To get from dirty to tidy, the data needs to be obtained from a combination of structures, scrubbed to clean out the noise and explored to understand its quality.  In short, Data Wrangling is what takes data from dirty to tidy and of high quality.

What is Data Wrangling?

Data Wrangling is as much an art as it is a science.  This is mostly because there is no ‘one size fits all’ recipe to follow – rather a couple of broad steps backed by principles and techniques.

There is no guarantee that a dataset will fit, or be able to answer, the business need after grinding through the wrangling mill.  Data Wrangling will however produce a clean dataset, highlighting its strengths and weaknesses and allowing for an informed decision on next steps.

1. Obtaining the data

Data can be obtained in multiple formats.  The aim of obtaining the data is to get it into an accessible format.  It could mean unzipping zipped files, merging similar files, picking information from text files, decoding graphical files, scraping information from websites or obtaining information from streams like Twitter or WhatsApp.

Obtaining the data should involve as little data scrubbing as possible – just enough to consolidate the data from multiple files into a single view or table per observational unit, while gaining some understanding of what kind of data is available.  This light scrubbing could include removing blank columns and rows, identifying similar columns across the files and renaming them to match, removing columns that are not common across the files and converting dates to a format common across the files.

The data is consolidated but still “dirty”.

As an example:

Say an agricultural group wants to do water supply planning and is looking for rainfall patterns in the catchment area.  Searching the net, there is a government site giving monthly average rainfall for the area in .xls format, some local farmers have a community site where they’ve recorded daily rainfall figures for each farm in a web table, and the weather bureau has weekly figures stored in .ods format.

Obtaining the data would involve downloading each of these, converting the .xls, web table and .ods files to a common format and consolidating them into one or more relational tables.
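A rough sketch of that consolidation step in Python with pandas (file names, column names and figures are all invented for illustration – in practice `pd.read_excel` and `pd.read_html` would load the downloaded files):

```python
import pandas as pd

# Stand-ins for the three downloaded sources.  In practice these would
# come from pd.read_excel("gov_monthly.xls"), pd.read_html(...) for the
# farmers' web table, and pd.read_excel("bureau_weekly.ods", engine="odf").
gov = pd.DataFrame({"Date": ["2023-01-31"], "AvgRain": [110.0]})
farmers = pd.DataFrame({"day": ["2023-01-05"], "mm": [12.0]})
bureau = pd.DataFrame({"week_ending": ["2023-01-07"], "rain_mm": [30.0]})

# Light scrubbing: rename similar columns to match, convert dates to a
# common type, then consolidate into one table that records its source.
parts = []
for name, df, date_col, rain_col in [
    ("government", gov, "Date", "AvgRain"),
    ("farmers", farmers, "day", "mm"),
    ("bureau", bureau, "week_ending", "rain_mm"),
]:
    part = df.rename(columns={date_col: "date", rain_col: "rainfall_mm"})
    part["date"] = pd.to_datetime(part["date"])
    part["source"] = name
    parts.append(part)

rainfall = pd.concat(parts, ignore_index=True)
print(rainfall)
```

The result is a single consolidated table – still “dirty”, but in one accessible format and ready for scrubbing.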


2. Scrubbing the data

Scrubbing or cleaning a dataset is an iterative process.  Typical tasks include:

  • merging similar columns and deriving new ones
  • renaming columns and setting column types
  • correcting spelling mistakes and aligning the text case of entries
  • reviewing missing entries – either deleting columns or rows with too many gaps, imputing or deriving entries based on logic or statistical formulae, or highlighting known missing values
  • managing outliers by either removing them or highlighting them as known outliers

Statistical analysis and data visualisation come in very handy to identify the ‘dirty’ spots that need some scrubbing.  They also assist in testing how ‘clean’ the data is once scrubbed.  Scrubbing the data requires tools that can search and update large datasets in seconds, select and update data based on formulae, and slice and group large data sections.

Scrubbing also needs business domain knowledge, as filtering and imputing data have a business impact and imply business decisions.

Per our example:

Some of the rainfall figures are exceptionally high.  We cross-check the three sources of rainfall by plotting the time series.  This allows us to determine whether these outliers are typos or real cloudbursts.  Additional domain knowledge of seasonal patterns, or a web search for recorded floods, will also come in handy.  If we conclude the outlier is indeed a typo, we need to decide whether to discard the reading or use the existing data to derive a value for that day that fits the seasonal pattern.
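As a sketch of this outlier step in Python with pandas (the readings are invented for illustration), one robust approach is to flag points that sit far from the median using the median absolute deviation, and then impute rather than silently discard:

```python
import pandas as pd

# Daily rainfall readings with one suspect entry: 250 mm in a single
# day is far outside the rest of the series.
readings = pd.Series([12.0, 0.0, 8.0, 250.0, 5.0, 11.0])

# Flag outliers as points more than 3 median absolute deviations (MAD)
# from the median - a robust rule that one extreme value cannot skew.
median = readings.median()
mad = (readings - median).abs().median()
is_outlier = (readings - median).abs() > 3 * mad

# One possible treatment: keep the flag for transparency, and impute
# the outlier from the median of the remaining readings.
cleaned = readings.mask(is_outlier, readings[~is_outlier].median())
print(cleaned.tolist())
```

Whether to impute, discard or keep a flagged reading remains a business decision – the code only makes the candidates visible.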

3. Exploring the data

Exploratory data analysis, aka EDA, makes no changes to the data.  Rather, it investigates the quality of the data, identifying weaknesses and patterns that will impact the quality of any result in which the data is used.  The product of EDA is a recommendation on how the data can be applied, what to avoid and whether the data is suitable for the business needs.

Exploring the data involves analysing relationships between variables by applying statistical and data visualisation methods like correlation matrices, heat maps, conditional filtering and plotting line, bar and point graphs.  Exploration could mean binning and plotting time series and overlaying multi-dimensional graphs.

In our example:

Comparing weekly trends, the weather bureau’s and the farmers’ data have a low correlation because their area demarcations and points of record differ significantly.  More information is necessary to be able to compare the two sources.
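A minimal sketch of such a correlation check in Python with pandas (the weekly figures are invented for illustration; `cor()` in R does the same job):

```python
import pandas as pd

# Weekly rainfall totals from two sources over the same six weeks.
weekly = pd.DataFrame({
    "bureau": [30.0, 12.0, 0.0, 45.0, 20.0, 5.0],
    "farmers": [10.0, 33.0, 28.0, 2.0, 15.0, 40.0],
})

# A correlation matrix is a quick first check of whether two sources
# move together: values near 1 suggest agreement, while values near 0
# or below suggest they measure different things - or different places.
corr = weekly.corr()
print(corr.loc["bureau", "farmers"])
```

A low or negative coefficient does not say which source is wrong; it only tells us the two cannot be compared directly without more information.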

4. And back again

The process of data wrangling is iterative.  Scrubbing and exploring a dataset often reveals the need for further data.  For instance, if data relates to rainfall in different locations, further information on how a location is defined and described would make the rainfall data more complete.  So it’s back to obtaining location data, scrubbing and exploring.

Where to get the Data Wrangling key to Data Science

Every venture into Data Science starts with Data Wrangling applied to the data.  Data Wrangling in turn requires skills – technical skills to obtain the data but also statistical knowledge to scrub and explore the data.

It also requires tools that can manage these operations.  The tool set needs to be able to manage the volume of data and have the functionality to apply complex data manipulations, while providing a reproducible record of the manipulations that preserves the integrity of the data.

At Data Dialect we specialise in the art of Data Wrangling, powered by the large library of functionality available in R.  R is open source and has a large academic following, with an ever-expanding library of functions that gives it the strong statistical backing needed in Data Wrangling.

Data Wrangling is time and resource intensive.  It is estimated to take around 75 to 80% of the time in a data science project.  At Data Dialect we’ve designed the Data Dialect Framework that allows you to ringfence and outsource Data Wrangling to us.  Outsourcing Data Wrangling will allow you to focus on the data science value-add to make data useful to your business.

The Data Science evolution

Data has been around for centuries, and computers have been in our homes and offices for decades – so why all this talk about Data Science now?

Because data science is the “sexiest job of the twenty-first century”.  Since an article by this title was published in the revered Harvard Business Review of October 2012, data science has become more than a catchy buzzword.  Data Science is developing and growing so fast, it could leave Moore’s Law breathless.

What is Data Science?

Simply put – it is the science of solving problems using data.  Here at Data Dialect we’ve distilled our data science work into four simple steps:

  1. Define a question
  2. Source the related data
  3. Apply an algorithm to the data to answer the question
  4. Tell a story to communicate the result

So – using loads and loads of data and computing power, data science solves problems with mathematical statistics like regression analysis and hypothesis testing, using different variables simultaneously in experimentation.  But… mathematical statistics has been around for decades.  Why are we only talking about Data Science now?

For Data Science to evolve from Modern Mathematical Statistics it needed two more ingredients: Data and Computing Power.

Since the early 2000s the world around us has become more and more digitised.  In the previous century, websites were little more than digital pamphlets.  Then Web 2.0 emerged, and websites changed into a shared experience.

In 2003 we got MySpace, in 2004 Facebook, in 2005 YouTube.  More commercial sites followed – Twitter, LinkedIn, Snapchat, Skype, Instagram, Tinder, Netflix, AirBnB, Uber.  These sites invited the person behind the keyboard to contribute, post, comment, upload, share – to have a footprint in the digital landscape.

In addition, the Internet of Things evolved, where data readings are recorded and used to trigger events.  The internet created and shaped an ecosystem of data.

In addition, companies have digitised their client data, and interactions and transactions now happen digitally instead of via paperwork – a flood of data has become available.

Data creation has grown to the point where – to paraphrase a quote reportedly from Google’s Chief Economist Hal Varian – we create more data every two days than was created between the dawn of civilisation and 2003.

We have Big Data!  All of this resulted in data too big for traditional computing power.

But what has happened to traditional computing power?

    • In the 1960s we had computing power confined to specialised data centres.
    • In the 1980s computing power was liberated to the personal computer, reaching domestic households and office workers’ desktops.
    • Then, in the early-to-mid 2000s, distributed computing power came alive in cloud infrastructure.

Maybe big data came about because of the enhanced computing power the cloud offered, maybe the cloud came about because big data demanded it, maybe the two fed on each other.  Either way, mathematical statistics got what it needed to evolve into data science – Big Data and Big Computing Power.