To know why Data Wrangling is the key to Data Science, first we need to agree on what Data Science is.
What is Data Science?
At Data Dialect we describe Data Science as “The science of solving problems using data.”
The Head of Decision Intelligence at Google, Cassie Kozyrkov, gives it a similar definition:
“Data science is the discipline of making data useful. Its three subfields involve mining large amounts of information for inspiration (analytics), making decisions wisely based on limited information (statistics), and using patterns in data to automate tasks (ML/AI).”
Whichever area of data science you venture into, access to enough high-quality data is the foundation of success. On the flip side, limited, low-quality data is bound to lead to limited, low-quality results – a case of “garbage in, garbage out”.
What is enough data?
Analysts have been working with data for decades, even centuries – predicting, forecasting, looking back, finding patterns. The data used in traditional analysis is mostly structured, low in complexity and drawn from a defined sample group, sourced through traditional channels. Because everyone draws on the same traditional sources, the results tend to be crowded, overtraded and overfitted.
Big data, on the other hand, is mostly unstructured, and the results derived from it are rarely binary. Much of it comes from alternative sources such as geolocation, online behaviour, consumer spending, web scraping, the Internet of Things, flight data and behavioural data, to name but a few.
Free public databases such as Google Dataset Search, the World Bank Open Data site and the US government’s open data portal, among many others, also give access to large collections of data.
All this data put together allows us to create a mosaic picture of the world around us.
Using multiple data sources does mean combining data with different structures, from different locations, in different formats and of varying quality.
Quantity is no longer the problem – consistent, high quality is.
What is high quality data?
One of the fathers of data science, Hadley Wickham, defines high-quality data as data that fulfils the requirements of “Tidy Data”. He defines “Tidy Data” as datasets arranged such that:
- Each variable is a column
- Each observation is a row
- Each type of observational unit is a table
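To make these rules concrete, here is a minimal sketch in Python with pandas; the stations, months and values are purely illustrative:

```python
import pandas as pd

# Hypothetical "untidy" table: one row per station, one column per month,
# so the variable "month" is hidden in the column headers.
untidy = pd.DataFrame({
    "station": ["A", "B"],
    "jan_mm": [120, 95],
    "feb_mm": [80, 60],
})

# Tidy form: each variable (station, month, rainfall_mm) is a column,
# and each observation (one station-month reading) is a row.
tidy = untidy.melt(id_vars="station", var_name="month", value_name="rainfall_mm")
tidy["month"] = tidy["month"].str.replace("_mm", "", regex=False)
print(tidy)
```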
Here is the big leap. There are masses of unstructured – or dirty – datasets available to us. For this dirty data to be made useful, whether for analytics, statistics or machine learning, it needs to be reordered into tidy (alas, not tiny) datasets. To get from dirty to tidy, the data needs to be obtained from a combination of structures, scrubbed to clean out the noise, and explored to understand its quality. In short, Data Wrangling is what turns dirty data into tidy, high-quality data.
What is Data Wrangling?
Data Wrangling is as much an art as it is a science. This is mostly because there is no ‘one size fits all’ recipe to follow – rather a couple of broad steps backed by principles and techniques.
There is no guarantee that a dataset will fit or be able to answer the business need after grinding through the wrangling mill. Data Wrangling will, however, produce a clean dataset and highlight its strengths and weaknesses, allowing for an informed decision on next steps.
- Obtaining the data
Data arrives in multiple formats, and the aim of this step is to get it into an accessible format. That could mean unzipping compressed files, merging similar files, extracting information from text files, decoding graphical files, scraping web sites, or pulling information from streams such as Twitter or WhatsApp.
Obtaining the data involves only as much light scrubbing as is needed to consolidate the data from multiple files into a single view or table per observational unit, and to gain some understanding of what kind of data is available. This light scrubbing could include removing blank columns and rows, identifying similar columns across the files and renaming them to match, removing columns that are not common across the files, and converting dates to a common format – see the sketch below.
The data is consolidated but still “dirty”.
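A minimal sketch of such a light scrub in Python with pandas; the file names, column names and rename mappings are hypothetical:

```python
import pandas as pd

def light_scrub(df: pd.DataFrame, rename_map: dict, keep: list, date_col: str) -> pd.DataFrame:
    """One pass of the light scrubbing described above, applied to a single source."""
    df = df.dropna(axis="columns", how="all")    # remove completely blank columns
    df = df.dropna(axis="index", how="all")      # remove completely blank rows
    df = df.rename(columns=rename_map)           # rename similar columns to match across files
    df = df[keep].copy()                         # keep only the columns common to all files
    df[date_col] = pd.to_datetime(df[date_col])  # convert dates to one common format
    return df

# Hypothetical usage: two files with slightly different headers,
# consolidated into a single view per observational unit.
common_cols = ["date", "site", "rainfall_mm"]
frames = [
    light_scrub(pd.read_csv("source_a.csv"),
                {"Date": "date", "Station": "site", "Rain (mm)": "rainfall_mm"},
                common_cols, "date"),
    light_scrub(pd.read_csv("source_b.csv"),
                {"obs_date": "date", "location": "site", "rain_mm": "rainfall_mm"},
                common_cols, "date"),
]
consolidated = pd.concat(frames, ignore_index=True)
```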
As an example:
Say an agricultural group wants to do water-supply planning and is looking for rainfall patterns in the catchment area. Searching the net, it finds a government site giving monthly average rainfall for the area in .xls format, a community site where local farmers have recorded daily rainfall figures for each farm in a web table, and a weather bureau site with weekly figures stored in .ods format.
Obtaining the data would involve downloading each of these, converting the .xls, web table and .ods files to a common format and consolidating them into one or more relational tables.
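The obtaining step for this example might look roughly like the following sketch, again in Python with pandas; the file names, the URL and the column handling are assumptions made for illustration:

```python
import pandas as pd

# Government site: monthly averages in a legacy .xls workbook (needs the xlrd engine).
monthly = pd.read_excel("gov_monthly_rainfall.xls")

# Farmers' community site: daily figures published as an HTML table
# (read_html returns a list of all tables found on the page).
daily = pd.read_html("https://example.org/farm-rainfall")[0]

# Weather bureau: weekly figures in an .ods spreadsheet (needs the odfpy engine).
weekly = pd.read_excel("bureau_weekly_rainfall.ods", engine="odf")

# Tag each source with its reporting interval and stack them into one
# relational table; aligning column names and units is part of the scrub.
for df, interval in [(monthly, "monthly"), (daily, "daily"), (weekly, "weekly")]:
    df["interval"] = interval
rainfall = pd.concat([monthly, daily, weekly], ignore_index=True)
```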