Data Science is the science of solving problems using data. Data Dialect has adopted four steps for managing a data science workflow:
1. Define the business question
2. Source the related data
3. Apply an algorithm to the data to answer the question
4. Tell a story to communicate the result
Step 2, sourcing the data, is a tedious and often the most time-consuming task in any data science project. The Data Dialect framework was designed to allow a business to outsource this step.
Sourcing the data is done through three iterative flows, applying data wrangling in the R scripting language:
Obtaining structured or unstructured data to produce a raw data set,
Scrubbing the raw data into a tidy data set, and
Exploring the tidy data set, delivering a Data Quality Report.
The data sourcing step has a clear start and end, with defined deliverables in the form of scripts, data files and documented reports.
Once wrangling is done and the data sourcing step is complete, the tidy data opens possibilities for many business applications:
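The three flows above (obtain, scrub, explore) can be sketched in a few lines. The framework itself works in R; the sketch below uses Python with only the standard library, purely as an illustration. The sample data, column names and function names are hypothetical and are not taken from Data Dialect's own scripts, and the "report" is a deliberately small stand-in for a full Data Quality Report.

```python
import csv
import io

# Hypothetical raw input; a real project would read from a file, database or API.
RAW_CSV = """Incident ID, Suburb ,Response Minutes
1, Athlone ,12
2,Belhar,
2,Belhar,
3, ,7
"""


def obtain(text):
    """Obtain: read the source as-is into a raw data set (a list of dicts)."""
    return list(csv.DictReader(io.StringIO(text)))


def scrub(raw_rows):
    """Scrub: tidy column names, trim values, mark blanks as missing,
    and drop exact duplicate rows."""
    tidy, seen = [], set()
    for row in raw_rows:
        clean = {}
        for name, value in row.items():
            key = name.strip().lower().replace(" ", "_")
            value = (value or "").strip()
            clean[key] = value if value else None
        fingerprint = tuple(clean.items())
        if fingerprint not in seen:
            seen.add(fingerprint)
            tidy.append(clean)
    return tidy


def explore(tidy_rows):
    """Explore: summarise the tidy data set as a simple data quality report."""
    columns = list(tidy_rows[0]) if tidy_rows else []
    return {
        "rows": len(tidy_rows),
        "missing": {c: sum(r[c] is None for r in tidy_rows) for c in columns},
        "distinct": {
            c: len({r[c] for r in tidy_rows if r[c] is not None}) for c in columns
        },
    }


raw = obtain(RAW_CSV)    # raw data set, kept unchanged
tidy = scrub(raw)        # tidy data set
report = explore(tidy)   # data quality summary
print(report)
```

Keeping each flow in its own function mirrors the idea of separate deliverables: the raw rows, the tidy rows and the report can each be saved and re-created by rerunning the script.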
Building a data pipeline
Interpreting the data
Performing statistical inference
Performing regression analysis
Applying machine learning
The Data Dialect framework is closely aligned with two widely published data science models:
the OSEMiN (Obtain, Scrub, Explore, Model, and iNterpret) process, originally developed by the Business Intelligence community but since adopted by the Data Science community and widely used to manage the flow of data science projects, and
the Team Data Science Process (TDSP), which Microsoft recommends to clients using the Microsoft Azure platform.
The results of the Data Dialect framework are reproducible, and the resulting work is owned by the client.
Where possible, every step in the data manipulation is recorded in a script.
If any step in the data manipulation needs to be done manually, it is clearly documented.
The raw and tidy data files are saved into a format accessible to the client.
Any decision on the quality of the data is made in consultation with the client and clearly documented.
See our Wrangling of the Fire Incidents for the City of Cape Town, where we’ve applied the Data Dialect framework to source the data.
All the R scripts applied, along with the data files they produce, are available on GitHub.