Why Data Wrangling is the key to Data Science

To know why Data Wrangling is the key to Data Science, first we need to agree on what Data Science is.

What is Data Science?

At Data Dialect we describe Data Science as “The science of solving problems using data.”

The Head of Decision Intelligence at Google, Cassie Kozyrkov, gives it a similar definition:

“Data science is the discipline of making data useful.  Its three subfields involve mining large amounts of information for inspiration (analytics), making decisions wisely based on limited information (statistics), and using patterns in data to automate tasks (ML/AI).”

Whichever area of data science you venture into, access to enough high-quality data is the foundation of a successful effort.  On the flip side, limited, low-quality data is bound to lead to limited, low-quality results – a case of “garbage in, garbage out”.

What is enough data?

Analysts have been working with data for decades, even centuries – predicting, forecasting, looking back, finding patterns.  The data used in traditional analysis is mostly structured, low in complexity and drawn from a defined sample group, derived from traditional data sources.  Because so many analysts draw on the same traditional sources, the results tend to be crowded and overfitted.

Big data, on the other hand, is mostly unstructured, and results derived from big data are rarely black and white.  Much of it is found in alternative sources like geolocation, online behaviour, consumer spending, web scraping, the internet of things, flight data and behavioural data, to name but a few.

Free public online databases like Google Dataset Search, the World Bank Open Data site and the US government’s open data portal, among many others, also give access to large collections of data.

All this data put together allows us to create a mosaic picture of the world around us.

Using multiple data sources does mean combining data with different structures, from different locations, of different quality and in different formats.

Quantity is no longer a problem – consistent high Quality is.

What is high quality data?

One of the fathers of data science, Hadley Wickham, defines high-quality data as data that fulfils the requirements of “Tidy Data” – datasets that are arranged such that:

  • Each variable is a column
  • Each observation is a row
  • Each type of observational unit is a table

Here is the big leap.  There are masses of unstructured – or dirty – datasets available to us.  For this dirty data to be made useful, whether with analytics, statistics or machine learning, it needs to be reordered into tidy (alas, not tiny) datasets.  To get from dirty to tidy, the data needs to be obtained from a combination of structures, scrubbed to clean out the noise and explored to understand its quality.  In short, Data Wrangling is what gets us from dirty data to tidy, high-quality data.
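
As a small illustration, here is a sketch in R (the language we use for Data Wrangling, more on that below) of the move from a “dirty” wide table to a tidy one.  The figures and column names are made up for demonstration only; the tidyr package does the reshaping:

  library(tidyr)

  # A "dirty" wide table: one column per month, rainfall in mm (made-up figures)
  rainfall_wide <- data.frame(
    farm = c("Hillside", "Riverbend"),
    Jan  = c(120, 95),
    Feb  = c(80, 110)
  )

  # Tidy it: each variable (farm, month, rainfall_mm) becomes a column,
  # each observation (one farm in one month) becomes a row
  rainfall_tidy <- pivot_longer(
    rainfall_wide,
    cols      = c(Jan, Feb),
    names_to  = "month",
    values_to = "rainfall_mm"
  )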

What is Data Wrangling?

Data Wrangling is as much an art as it is a science.  This is mostly because there is no ‘one size fits all’ recipe to follow – rather a few broad steps backed by principles and techniques.

There is no guarantee that a dataset will fit or be able to answer the business need after grinding through the wrangling mill.  Data Wrangling will, however, produce a clean dataset, highlighting its strengths and weaknesses and allowing for an informed decision on next steps.

1. Obtaining the data

Data can be obtained in multiple formats. The aim of this step is to get the data into an accessible format.  It could involve unzipping archived files, merging similar files, picking information from text files, decoding graphical files, scraping information from web sites or pulling information from streams like Twitter or WhatsApp.

Obtaining the data involves as little data scrubbing as possible – just enough to consolidate the data from multiple files into a single view or table per observation unit and to gain some understanding of what kind of data is available.  This light scrubbing could include removing blank columns and rows, identifying similar columns across the files and renaming them to match, removing columns that are not common across the files and converting dates to a common format.

The data is consolidated but still “dirty”.

As an example:

Say an agricultural group wants to do water supply planning and is looking for rainfall patterns in the catchment area.  Searching the net, there is a government site giving monthly average rainfall for the area in .xls format, some local farmers have a community site where they’ve recorded daily rainfall figures for each farm in a web table, and the weather bureau has weekly figures stored in .ods format.

Obtaining the data would involve downloading each of these, converting the .xls, web table and .ods files to a common format and consolidating them into one or more relational tables.
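
As a rough sketch of this step in R, using the readxl, readODS, rvest and dplyr packages (the file names and URL below are made up, standing in for the real sources), obtaining and consolidating could look something like:

  library(readxl)   # .xls / .xlsx files
  library(readODS)  # .ods files
  library(rvest)    # web tables
  library(dplyr)

  # Hypothetical file names and URL, for illustration only
  gov_monthly   <- read_excel("government_monthly_rainfall.xls")
  bureau_weekly <- read_ods("weather_bureau_weekly.ods")
  farmers_daily <- read_html("https://example.org/farm-rainfall") %>%
    html_element("table") %>%
    html_table()

  # Common format: one table per observation unit (monthly, weekly, daily)
  rainfall_tables <- list(
    government = as_tibble(gov_monthly),
    bureau     = as_tibble(bureau_weekly),
    farms      = as_tibble(farmers_daily)
  )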


2. Scrubbing the data

Scrubbing or cleaning a dataset is an iterative process.  Typical steps include merging similar columns, deriving new columns, renaming columns, setting column types, correcting spelling mistakes, aligning the text case of entries, reviewing missing entries – either deleting columns or rows with too many missing entries, imputing or deriving entries based on logic or statistical formulae, or flagging known missing values – and managing outliers by either removing them or flagging them as known outliers.

Statistical analysis and data visualisation come in very handy to identify the ‘dirty’ spots that need some scrubbing.  They also assist in testing how ‘clean’ the data is once scrubbed.  Scrubbing the data requires tools that can search and update large datasets in seconds, select and update data based on formulae, and slice and group large sections of data.

Scrubbing also needs business domain knowledge, as filtering and imputing data have a business impact and imply business decisions.
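
To make this concrete, here is a hedged sketch in R with dplyr and stringr of what a few scrubbing passes over the farmers’ daily table might look like.  The raw column names (Farm_Name, Date, Rainfall), the date format and the outlier threshold are all assumptions made for illustration:

  library(dplyr)
  library(stringr)

  farms_clean <- rainfall_tables$farms %>%
    rename(farm = Farm_Name, rainfall_mm = Rainfall) %>%  # align column names
    mutate(
      farm        = str_to_title(str_trim(farm)),         # trim spaces, align text case
      rainfall_mm = as.numeric(rainfall_mm),              # set column types
      date        = as.Date(Date, format = "%d/%m/%Y")    # convert to a common date format
    ) %>%
    select(-Date) %>%
    filter(!is.na(date)) %>%                              # drop rows missing the key field
    mutate(outlier = rainfall_mm > 200)                   # flag (not delete) suspect readings; threshold is illustrative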

Per our example:

Some of the rainfall figures are exceptionally high.  We cross-check the three sources of rainfall by plotting the time series.  This allows us to determine whether these outliers are typos or real cloudbursts.  Additional domain knowledge of seasonal patterns or a web search for recorded floods will also come in handy.  If we conclude the outlier is indeed a typo, we need to decide whether to discard the reading or to use the existing data to derive a value for that day that fits the seasonal pattern.
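
A sketch of such a cross-check in R with ggplot2, reusing the hypothetical farms_clean table from the scrubbing sketch above:

  library(dplyr)
  library(ggplot2)

  # Plot each farm's daily series to eyeball suspect spikes
  ggplot(farms_clean, aes(x = date, y = rainfall_mm, colour = farm)) +
    geom_line() +
    geom_point(data = filter(farms_clean, outlier), colour = "red") +
    labs(title = "Daily rainfall by farm", y = "Rainfall (mm)")

  # List the flagged readings for manual review against the other sources
  filter(farms_clean, outlier)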

3. Exploring the data

Exploratory data analysis, aka EDA, makes no changes to the data.  Rather, it investigates the quality of the data, identifying weaknesses and patterns that will impact the quality of any result built on it.  The product of EDA is a recommendation on how the data can be applied, what to avoid and whether the data is suitable for the business need.

Exploring the data involves analysing relationships between variables by applying statistical and data visualisation methods like correlation matrices, heat maps, conditional filtering and plotting of line, bar and point graphs. Exploration could mean binning and plotting time series and overlaying multi-dimensional graphs.
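
As an illustration in R, assuming a hypothetical rainfall_weekly table in which the three sources have already been aggregated to a common weekly grain (one numeric column per source), a correlation matrix and a simple heat map could be produced like this:

  library(dplyr)
  library(tidyr)
  library(ggplot2)

  # Correlation matrix across the three (assumed) weekly rainfall columns
  corr <- rainfall_weekly %>%
    select(gov_mm, bureau_mm, farms_mm) %>%
    cor(use = "pairwise.complete.obs")

  # A simple heat map of the correlation matrix
  as.data.frame(corr) %>%
    mutate(source_a = rownames(corr)) %>%
    pivot_longer(-source_a, names_to = "source_b", values_to = "correlation") %>%
    ggplot(aes(x = source_b, y = source_a, fill = correlation)) +
    geom_tile() +
    labs(title = "Correlation between rainfall sources")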

In our example:

Comparing weekly trends, the weather bureau and the farmers’ data have a low correlation because their area demarcations and points of record differ significantly.  More information is needed before the two sources can be compared.
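
One way to start making the two sources comparable, sketched here in R with dplyr and lubridate (reusing the hypothetical farms_clean table and assuming the bureau table has been scrubbed to columns week and bureau_mm), is to roll the farmers’ daily readings up to the bureau’s weekly grain before correlating:

  library(dplyr)
  library(lubridate)

  # Aggregate the farmers' daily readings to weekly totals
  farms_weekly <- farms_clean %>%
    mutate(week = floor_date(date, unit = "week")) %>%
    group_by(week) %>%
    summarise(farms_mm = sum(rainfall_mm, na.rm = TRUE))

  # Join on week and measure how closely the two sources agree
  comparison <- inner_join(farms_weekly, bureau_weekly, by = "week")
  cor(comparison$farms_mm, comparison$bureau_mm, use = "complete.obs")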

4. And back again

The process of data wrangling is iterative. Scrubbing and exploring a dataset often reveals the need for further data. For instance, if data relates to rainfall in different locations, further information on how a location is defined and described would make the rainfall data more complete. So back to obtaining location data, scrubbing and exploring.

Where to get the Data Wrangling key to Data Science

Every venture into Data Science starts with Data Wrangling.  Data Wrangling in turn requires skills – technical skills to obtain the data, but also statistical knowledge to scrub and explore it.

It also requires tools that can manage these operations.  The tool set needs to be able to handle the volume of data and have the functionality to apply complex data manipulations, while providing a reproducible record of the manipulations that preserves the integrity of the data.

At Data Dialect we specialise in the art of Data Wrangling, powered by the large library of functionality available in R.  R is open source and has a large academic following, with an ever-expanding library of packages giving it the strong statistical backing needed for Data Wrangling.

Data Wrangling is time- and resource-intensive.  It is estimated to take around 75 to 80% of the time in a data science project.  At Data Dialect we’ve designed the Data Dialect Framework, which allows you to ring-fence and outsource Data Wrangling to us.  Outsourcing Data Wrangling will allow you to focus on the data science value-add to make data useful to your business.