Why Data Wrangling is the key to Data Science

To know why Data Wrangling is the key to Data Science, first we need to agree on what Data Science is.

What is Data Science?

At Data Dialect we describe Data Science as “The science of solving problems using data.”

The Head of Decision Intelligence at Google, Cassie Kozyrkov, gives a similar definition:

“Data science is the discipline of making data useful.  Its three subfields involve mining large amounts of information for inspiration (analytics), making decisions wisely based on limited information (statistics), and using patterns in data to automate tasks (ML/AI).”

Whichever area of data science you venture into, access to enough high-quality data is the foundation of success.  On the flip side, limited, low-quality data is bound to lead to limited, low-quality results – a kind of “garbage in, garbage out”.

What is enough data?

Analysts have been working with data for decades, even centuries – predicting, forecasting, looking back, finding patterns.  The data used in traditional analysis is mostly structured, low in complexity, drawn from a defined sample group and derived from traditional data sources.  This traditional sourcing leads to overtraded, crowded and overfitted results.

Big data, on the other hand, is mostly unstructured, and the results derived from it are not binary.  Much of big data is found in alternative sources like geolocation, online behaviour, consumer spending, web scraping, the Internet of Things, flight data and behavioural data, to name but a few.

Free online public databases like Google Dataset Search, the World Bank Open Data site and the US government’s open data portal, among many others, also give access to large collections of data.

All this data, put together, allows us to create a mosaic picture of the world around us.

Using multiple data sources does mean combining data with different structures, from different locations, with different measures of quality and in different formats.

Quantity is no longer a problem – consistent high quality is.

What is high quality data?

One of the fathers of data science, Hadley Wickham, defines high-quality data as data that fulfils the requirements of “Tidy Data”.  He defines “Tidy Data” as datasets that are arranged such that:

  • Each variable is a column
  • Each observation is a row
  • Each type of observational unit is a table

Here is the big leap.  There are masses of unstructured – or dirty – datasets available to us.  For this dirty data to be made useful, whether with analytics, statistics or machine learning, it needs to be reordered into tidy (alas, not tiny) datasets.  To get from dirty to tidy, the data needs to be obtained from a combination of structures, scrubbed to clean out the noise and explored to understand its quality.  In short, Data Wrangling is needed to get from dirty to tidy, high-quality data.
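To make the leap concrete, here is a minimal sketch of tidying in Python with pandas (the post's own tooling is R, and the table and column names are invented): a wide rainfall table hides the "farm" variable in its headers, and melting it gives one observation per row.

```python
import pandas as pd

# Untidy: one column per farm, so the "farm" variable is hidden in the headers
wide = pd.DataFrame({
    "date": ["2019-10-12", "2019-10-13"],
    "farm_a_mm": [4.0, 0.0],
    "farm_b_mm": [6.5, 1.2],
})

# Tidy: each variable (date, farm, rainfall_mm) is a column,
# and each row is one observation (one farm on one day)
tidy = wide.melt(id_vars="date", var_name="farm", value_name="rainfall_mm")
tidy["farm"] = tidy["farm"].str.replace("_mm", "", regex=False)
```

The same reshape is what `tidyr::pivot_longer` does in R.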

What is Data Wrangling?

Data Wrangling is as much an art as it is a science.  This is mostly because there is no ‘one size fits all’ recipe to follow – rather a couple of broad steps backed by principles and techniques.

There is no guarantee that a dataset will fit or be able to answer the business need after grinding through the wrangling mill.  Data Wrangling will however produce a clean dataset, highlighting its strengths and weaknesses, allowing for an informed decision on next steps.

1. Obtaining the data

Data can be obtained in multiple formats. The aim of obtaining the data is to get it into an accessible format.  This could involve unzipping zipped files, merging similar files, picking information from text files, decoding graphical files, scraping information from websites or obtaining information from streams like Twitter or WhatsApp.

Obtaining the data should involve as little scrubbing as possible – just enough to consolidate the data from multiple files into a single view or table per observational unit, while gaining some understanding of what kind of data is available.  This light scrubbing could include removing blank columns and rows, identifying similar columns across the files and renaming them to match, removing columns that are not common across the files, and converting dates to a format common across the files.

The data is consolidated but still “dirty”.

As an example:

Say an agricultural group wants to do water-supply planning and is looking for rainfall patterns in the catchment area.  Searching the net, they find a government site giving monthly average rainfall for the area in .xls format; some local farmers have a community site where they’ve recorded daily rainfall figures for each farm in a web table; and the weather bureau has weekly figures stored in .ods format.

Obtaining the data would involve downloading each of these, converting the .xls, web table and .ods files to a common format and consolidating them into one or more relational tables.
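That consolidation step could be sketched in Python with pandas (the post's toolset is R; the column names and values below are invented stand-ins for the three downloaded sources):

```python
import pandas as pd

# Stand-ins for the three downloaded sources; in practice these would come
# from pd.read_excel(...) for the .xls and .ods files and pd.read_html(...)
# for the farmers' web table
gov = pd.DataFrame({"Month": ["2019-10"], "AvgRain": [42.0]})
farmers = pd.DataFrame({"Day": ["12/10/2019"], "mm": [4.0], "Farm": ["A"]})
bureau = pd.DataFrame({"week_ending": ["2019-10-18"], "rain_mm": [11.3]})

def standardise(df, date_col, rain_col, source):
    """Rename the date and rainfall columns to a shared schema,
    parse the dates and tag each row with its source."""
    out = df.rename(columns={date_col: "date", rain_col: "rainfall_mm"})
    out["date"] = pd.to_datetime(out["date"], dayfirst=True)
    out["source"] = source
    return out[["date", "rainfall_mm", "source"]]

# Consolidated, but still "dirty": monthly, daily and weekly readings mixed
rainfall = pd.concat([
    standardise(gov, "Month", "AvgRain", "government"),
    standardise(farmers, "Day", "mm", "farmers"),
    standardise(bureau, "week_ending", "rain_mm", "bureau"),
], ignore_index=True)
```

Note the light touch: columns are renamed and dates unified, but nothing is filtered or imputed yet – that belongs to scrubbing.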


2. Scrubbing the data

Scrubbing or cleaning a dataset is an iterative process.  It includes merging similar columns, deriving new columns, renaming columns, setting column types, correcting spelling mistakes, aligning the text case of entries, and reviewing missing entries – either deleting columns or rows with too many missing entries, imputing or deriving entries based on logic or statistical formulae, or highlighting known missing values – as well as managing outliers by either removing them or highlighting them as known outliers.

Statistical analysis and data visualisation come in very handy to identify the ‘dirty’ spots that need some scrubbing.  They also assist in testing how ‘clean’ the data is once scrubbed.  Scrubbing the data requires tools that can search and update large datasets in seconds, select and update data based on formulae, and slice and group large data sections.

Scrubbing also needs business domain knowledge, as filtering and imputing data have a business impact and imply business decisions.

Per our example:

Some of the rainfall figures are exceptionally high.  We cross-check the three sources of rainfall by plotting the time series.  This allows us to determine whether these outliers are typos or real cloudbursts.  Additional domain knowledge of seasonal patterns, or a web search for recorded floods, will also come in handy.  If we conclude the outlier is indeed a typo, we need to decide whether to discard the reading or use the existing data to derive a value for that day that fits the seasonal pattern.
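A sketch of that outlier check in Python with pandas (the readings are invented, and the IQR fence used here is a common heuristic, not a method the post prescribes):

```python
import pandas as pd

# Hypothetical daily rainfall readings (mm) with one suspicious spike
rain = pd.Series([4.0, 0.0, 6.5, 300.0, 1.2, 3.8])

# Flag readings above the upper IQR fence (Tukey's rule) as candidate
# outliers to cross-check against the other sources
q1, q3 = rain.quantile(0.25), rain.quantile(0.75)
fence = q3 + 1.5 * (q3 - q1)
outliers = rain > fence

# If cross-checking confirms a typo, one option is to blank the reading
# and interpolate a value that fits its neighbours
rain_clean = rain.mask(outliers).interpolate()
```

Whether to interpolate, impute from the seasonal pattern or simply discard the reading is exactly the kind of business decision the post mentions.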

3. Exploring the data

Exploratory data analysis, aka EDA, makes no changes to the data.  Rather, it investigates the quality of the data, identifying weaknesses and patterns that will impact the quality of any result in which the data is used.  The product of EDA is a recommendation on how the data can be applied, what to avoid and whether the data is suitable for the business needs.

Exploring the data involves analysing relationships between variables by applying statistical and data visualisation methods like correlation matrices, heat maps, conditional filtering and plotting of line, bar and point graphs. Exploration could mean binning and plotting time series and overlaying multi-dimensional graphs.

In our example:

Comparing weekly trends, the weather bureau’s and the farmers’ data have a low correlation because their area demarcations and points of record differ significantly.  More information is necessary to be able to compare the two sources.
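As an illustration, such a cross-source check might look like this in Python with pandas (the weekly figures are invented; a near-zero or negative correlation would flag the mismatch described above):

```python
import pandas as pd

# Hypothetical weekly rainfall totals (mm) from two of the sources
weekly = pd.DataFrame({
    "bureau_mm":  [11.3, 2.0, 25.4, 0.0, 8.8],
    "farmers_mm": [4.1, 30.0, 2.2, 18.5, 0.3],
})

# A correlation matrix is one quick EDA check on how well sources agree;
# no data is changed, only its quality is assessed
corr = weekly.corr()
agreement = corr.loc["bureau_mm", "farmers_mm"]
```

A low `agreement` value does not say which source is wrong – it says the two cannot yet be compared, which is the EDA recommendation in this example.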

4. And back again

The process of data wrangling is iterative. Scrubbing and exploring a dataset often reveals the need for further data. For instance, if the data relates to rainfall in different locations, further information on how a location is defined and described would make the rainfall data more complete. So it is back to obtaining location data, scrubbing and exploring.

Where to get the Data Wrangling key to Data Science

Every venture into Data Science starts with Data Wrangling applied to the data.  Data Wrangling in turn requires skills – technical skills to obtain the data, but also statistical knowledge to scrub and explore it.

It also requires tools that can manage these operations.  The toolset needs to be able to manage large volumes of data and have the functionality to apply complex data manipulations, while providing a reproducible record of the manipulations that preserves the integrity of the data.

At Data Dialect we specialise in the art of Data Wrangling, powered by the large library of functionality available in R.  R is open source and has a large academic following, with an ever-expanding library of functions giving it the strong statistical backing needed for Data Wrangling.

Data Wrangling is time and resource intensive.  It is estimated to take around 75 to 80% of the time in a data science project.  At Data Dialect we’ve designed the Data Dialect Framework, which allows you to ringfence and outsource Data Wrangling to us.  Outsourcing Data Wrangling allows you to focus on the data science value-add to make data useful to your business.

Where are the South African expats?


Where in the world do all the South African expats live?  Can the use of South African slang on Twitter give us a clue?  Let’s harvest some tweets and see…

The aim is to

  • Determine the continents and countries where South African expats are staying.
  • Map the change in locations over time in monthly intervals.

Methodology to get there

In this proof of concept the origin of tweets containing specific South African words will be used to determine the location of South African expats.

  • Harvest tweets containing specific South African slang words. The slang words were originally sourced from Brand South Africa’s website. The list was reduced to words that are uniquely South African in any context; ultimately 17 words were included.
  • Refer to the location field in the Twitter record to determine the origin of each tweet.
  • Map the continent and country of origin of the tweets, excluding Africa.

Test quality of results

  • The majority of tweets per word need to come from South Africa, or else the word is not uniquely South African.
  • In addition, the resulting countries should be similar to surveys conducted by the site wheredidwego.com, which lists the following 10 countries as the top destinations: New Zealand, Australia, UK, Canada, USA, Netherlands, Ireland, Germany, UAE, Qatar.


The location field is a free text field. Many users do not complete it, which means those tweets have to be discarded. Where the location field is completed, the entries vary. Each of these entries needs to be mapped to a valid country and continent. The expectation was that the need to map would decrease as more mappings were added.

After harvesting tweets for 14 days, there were still around 300 locations that needed to be mapped each day, which made the exercise unsustainable beyond a proof of concept. If the proof of concept passes, the next step is to connect to the Google Maps API to allow a more efficient determination of the location origin of a tweet.
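The manual mapping step might be sketched like this in plain Python (the lookup entries and profile locations below are invented examples, not the harvested data):

```python
# Hand-maintained lookup from free-text profile locations to
# (country, continent); in the exercise this grew by ~300 entries a day
location_map = {
    "london, england": ("UK", "Europe"),
    "sydney": ("Australia", "Oceania"),
    "nyc": ("USA", "North America"),
}

def map_location(raw):
    """Return (country, continent) for a raw profile location, or None
    when the field is empty or unmapped and the tweet is set aside."""
    if not raw or not raw.strip():
        return None
    return location_map.get(raw.strip().lower())

tweets = [{"location": "NYC"}, {"location": ""}, {"location": "Sydney"}]
mapped = [map_location(t["location"]) for t in tweets]
```

A geocoding API would replace the hand-built dictionary, which is why it is the proposed next step.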

Resulting data

These are the results from the 14 day exercise – 12 October to 26 October 2019.

All but one of the words in scope are used more than 50% of the time by tweeters from South Africa, validating the words as typically South African.


The only exception is the word Rooibos. Although less than 50% of the tweets containing it come from South Africa, Africa is still the largest single source of such tweets, and Rooibos is therefore still included as a typically South African word.


The exercise fails the test of likely countries appearing on the list. Refer to the list of countries and continents represented in the double pie chart.


The two most popular countries listed on the site wheredidwego.com for South Africans to migrate to are Australia and New Zealand. Referring to the chart, however, Australia and New Zealand are dwarfed by the USA and the UK in the Twitter stats.

It is possible that countries have different cultures when it comes to sending tweets, and that South African expats in the USA and UK tweet significantly more than their cousins in Australia and New Zealand. The location origin of tweets cannot be taken as a single indicator of where South Africans have settled.

What next ?

It is interesting to note that if Australia and New Zealand are taken out of the equation, the lists of most popular countries are very similar. This should be explored. In addition, if a more accurate and efficient determination of origin using the Google Maps API is used, the result can be measured with more confidence.

The content of this post was generated with an R .rmd file.  The plot was generated from data originally stored in the local MySQL database of harvested tweets.  The data was pulled from the local MySQL database to dynamically generate the document.  The code used to generate this blog can be found on GitHub.  The SQL database details have been masked.