What are the patterns in fire incidents in the City of Cape Town that could assist in planning capacity to supply water for firefighting?
The City of Cape Town publishes data on their Open Data Portal. As the site explains:
The City of Cape Town makes data available that has been approved for use in terms of the Open Data Policy. Access to City information helps to increase transparency, as well as benefit the wider community and other stakeholders.
To address the business question, we’ve taken the raw data available on Fire Incidents in the City of Cape Town from January 2009 to March 2018.
Following the Data Dialect Framework the data is obtained and consolidated into a into a single Raw Data frame (FireData on GitHub). The raw data is scrubbed into a Tidy Data frame (see FireDataClean in GitHub). The tidy data is explored for relationships as described in the Data Quality report.
The business question is reviewed in the Data Visualisation report. The report highlights :
the use of water to respond to fire incidents over time
the use of water to respond to fire incidents in different areas of the City of Cape Town
the use of water to respond to in different classes of fire incidents
It points out the relationship between the Cause and the Resolution of the fires.
The Data Visualisation report also highlights issues with the clarity and reliability of the data, with questions for the City of Cape Town on how the incidents were recorded and classified.
Before producing the Data Visualisation report, the data needed to be obtained, scrubbed and explored.
We applied the three steps of Data Wrangling completing the Source the Data step in the Data Dialect Framework.
In the Obtain step the zipped Safety and Security – Fire Incidents source files listed on the City of Cape Town Open Data Portal are downloaded, unzipped and merged into a Raw Data frame (FireData on GitHub).
When the City of Cape Town published the data, it was saved in an ODS format and zipped.
To access the data, the zipped source files are manually downloaded and unzipped. Once unzipped, two scripts are applied:
This is a simple script that reads the ODS files and save them as R objects. It performs no other data manipulation. Reading all the ODS files takes over 2 hours, but once this script is run, the resulting data frames are accessible for an r-script to read within seconds.
The aim of the script is to merge the different data frames into a single Raw Data frame (FireData on GitHub).
The script reads the data frames saved in R objects. Before merging the data is lightly scrubbed, just enough so they can merge into a single data frame.
The scrubbing includes:
Removing blank columns and rows from each file
Identify similar columns in the files and rename to match
Removing columns that is not common across the files
Converting Dates to a format common across the files
The incidents for Q4 2016 do not have Cause and Sub-Council columns and the incidents for Q1 2017 do not have a Sub-Council column. These columns were added with blank values to allow the joining of all the quarterly files.
In the resulting Raw Data frame (FireData on GitHub), each incident contains 12 fields that were identified as common to all incidents:
|A_CreatedAt||The date of the incident. For some incidents it includes the time|
|B_Service_Request||The service request number|
|C_WaterUsage||Liters of water used to resolve the incident|
|D_Category||The category of the incident|
|E_Sub_Category||The subcategory of the incident|
|F_Suburb||The suburb where the incident occurred|
|F_Town||The town within the municipality where the incident occurred|
|G_District||District of the fire station that responded|
|G_SubCouncil||One of 20 City of Cape Town sub councils the incident relates to|
|H_Cause||Cause of the fire|
|I_Resolution||Resolution of the fire|
|J_Temperature||The temperature interval for the day of the incident|
Although merged into a single Raw Data frame (FireData on GitHub), the data is still considered dirty:
Some data fields like the creation date have inconsistent formats.
Manually entered fields like Suburb and Town has many spelling mistakes.
There are many missing values that are either blank or populated with N/A.
Some columns have similar but different entries using different combinations of upper- and lower-case letters and abbreviations.
All these issues are addressed in the next step.
The scrubbing script (WranglingFireData_Cleanup.R available on GitHub) is time-consuming to develop. Once developed however, the script runs in a few seconds. The script uses logic to determine the correct values, correcting spelling mistakes from correction input files, accepting missing values and discard corrupted entries. It reads spelling correction and mapping information from three different comma delimited files.
To get from Raw Data to Tidy Data, the script performs the following tasks :
Align entries where case differs.
Align entries where text is very similar.
Discard the one line that does not contain a date.
Because Suburb and Town were completed manually, it contains many typos, missing data and miss matches. The scrubbing script needs to do several updates to these two columns.
Correct spelling mistakes for Suburbs. (refer to SuburbsCorrect.csv on GitHub)
Remove special characters and correct spelling mistakes for Towns. (refer to TownsCorrection.csv on GitHub)
Apply iterative logic to correct and complete the Town and Suburb entries. It assumes a Town is an area that has more than 2000 entries in the original Town column. All other entries are mapped as Suburbs within a Town.
The output is a Tidy Data frame (see FireDataClean in GitHub) of over 105 000 incidents recorded over more than 9 years.
The service request number is unique.
There are still several N/A entries for specific fields (see the graphs in the Data Quality report).
The data formats are standardised
The date of the incident is recoded in the A_CreatedAt column as date/time.
The total water used per incident is recorded in the C_WaterUsage column as numeric.
The rest of the incident information is recorded in 10 factor columns.
The exploration script makes no changes to the file. It investigates the quality of the data, identifying weaknesses and patterns in the data that will impact the quality of the analysis results. It recommends how the data can be applied and what to avoid.
The three data wrangling steps are now complete. With the Source the Data step in the Data Dialect Framework completed, the following artifacts are available:
The Raw Data frame (FireData on GitHub)
The Tidy Data frame (FireDataClean on GitHub)
The four scripts used to produce the data frames and report (all available on GitHub)
The data is now ready to be used by the business as input to Interpret or for further Statistical inference, Regression analysis or Machine learning. It is ready to combine with other tidy data frames containing related information like dam levels or city growth.
The business can also decide, based on the Data Quality report that the set data is not the correct data to answer the business question or the data is incomplete and can not be used in any analytical excercise, terminating the project at the end of the Source the Data step.
To respond to the business question, we’ve chosen to Interpret the Tidy Data frame.
Using the advice from the Data Quality report the FireIncidentsDataVisualisation.Rmd script (available on GitHub) is developed and used to read in the Tidy Data frame and produce the Data Visualisation report.
This exercise of Data Wrangling and Data Visualisation was derived from data published on the City of Cape Town data portal without any further input from the City of Cape Town.
The analysis was done to demonstrate of the Data Dialect Framework taking raw unstructured data and preparing tidy data for further analytical use. The results were not applied for any further commercial use.