Business Question

What are the patterns in fire incidents in the City of Cape Town that could assist in planning capacity to supply water for firefighting?

Data Source

The City of Cape Town publishes data on their Open Data Portal. As the site explains:

The City of Cape Town makes data available that has been approved for use in terms of the Open Data Policy. Access to City information helps to increase transparency, as well as benefit the wider community and other stakeholders.

To address the business question, we’ve taken the raw data available on Fire Incidents in the City of Cape Town from January 2009 to March 2018.

Answer to the business question

Following the Data Dialect Framework the data is obtained and consolidated into a into a single Raw Data frame (FireData on GitHub). The raw data is scrubbed into a Tidy Data frame (see FireDataClean in GitHub). The tidy data is explored for relationships as described in the Data Quality report.

The business question is reviewed in the Data Visualisation report. The report highlights :

It points out the relationship between the Cause and the Resolution of the fires.

The Data Visualisation report also highlights issues with the clarity and reliability of the data, with questions for the City of Cape Town on how the incidents were recorded and classified.

How was this done ? (Wrangling the Data)

Before producing the Data Visualisation report, the data needed to be obtained, scrubbed and explored.

We applied the three steps of Data Wrangling completing the Source the Data step in the Data Dialect Framework.

1. Obtain the Data

In the Obtain step the zipped Safety and Security – Fire Incidents source files listed on the City of Cape Town Open Data Portal are downloaded, unzipped and merged into a Raw Data frame (FireData on GitHub).

When the City of Cape Town published the data, it was saved in an ODS format and zipped.

To access the data, the zipped source files are manually downloaded and unzipped. Once unzipped, two scripts are applied:

  1. WranglingFireData_ReadIn.R (available on GitHub)

This is a simple script that reads the ODS files and save them as R objects. It performs no other data manipulation. Reading all the ODS files takes over 2 hours, but once this script is run, the resulting data frames are accessible for an r-script to read within seconds.

  1. WranglingFireData_Merge.R (available on GitHub)

The aim of the script is to merge the different data frames into a single Raw Data frame (FireData on GitHub).

The script reads the data frames saved in R objects. Before merging the data is lightly scrubbed, just enough so they can merge into a single data frame.

The scrubbing includes:

  • Removing blank columns and rows from each file

  • Identify similar columns in the files and rename to match

  • Removing columns that is not common across the files

  • Converting Dates to a format common across the files

  • The incidents for Q4 2016 do not have Cause and Sub-Council columns and the incidents for Q1 2017 do not have a Sub-Council column. These columns were added with blank values to allow the joining of all the quarterly files.

In the resulting Raw Data frame (FireData on GitHub), each incident contains 12 fields that were identified as common to all incidents:

Field Description
A_CreatedAt The date of the incident. For some incidents it includes the time
B_Service_Request The service request number
C_WaterUsage Liters of water used to resolve the incident
D_Category The category of the incident
E_Sub_Category The subcategory of the incident
F_Suburb The suburb where the incident occurred
F_Town The town within the municipality where the incident occurred
G_District District of the fire station that responded
G_SubCouncil One of 20 City of Cape Town sub councils the incident relates to
H_Cause Cause of the fire
I_Resolution Resolution of the fire
J_Temperature The temperature interval for the day of the incident

Although merged into a single Raw Data frame (FireData on GitHub), the data is still considered dirty:

  • Some data fields like the creation date have inconsistent formats.

  • Manually entered fields like Suburb and Town has many spelling mistakes.

  • There are many missing values that are either blank or populated with N/A.

  • Some columns have similar but different entries using different combinations of upper- and lower-case letters and abbreviations.

All these issues are addressed in the next step.

2. Scrub the Data

The scrubbing script (WranglingFireData_Cleanup.R available on GitHub) is time-consuming to develop. Once developed however, the script runs in a few seconds. The script uses logic to determine the correct values, correcting spelling mistakes from correction input files, accepting missing values and discard corrupted entries. It reads spelling correction and mapping information from three different comma delimited files.

The scrubbing script takes the Raw Data frame (see FireData in GitHub) as input and produces a Tidy Data frame (see FireDataClean in GitHub).

To get from Raw Data to Tidy Data, the script performs the following tasks :

  1. Align entries where case differs.

  2. Align entries where text is very similar.

  3. Discard the one line that does not contain a date.

  4. Align the sub council entries (refer to SubCouncils.csv on GitHub and City of Cape Town list of subcouncils website)

Because Suburb and Town were completed manually, it contains many typos, missing data and miss matches. The scrubbing script needs to do several updates to these two columns.

  1. Correct spelling mistakes for Suburbs. (refer to SuburbsCorrect.csv on GitHub)

  2. Remove special characters and correct spelling mistakes for Towns. (refer to TownsCorrection.csv on GitHub)

  3. Apply iterative logic to correct and complete the Town and Suburb entries. It assumes a Town is an area that has more than 2000 entries in the original Town column. All other entries are mapped as Suburbs within a Town.

The output is a Tidy Data frame (see FireDataClean in GitHub) of over 105 000 incidents recorded over more than 9 years.

  • The service request number is unique.

  • There are still several N/A entries for specific fields (see the graphs in the Data Quality report).

The data formats are standardised

  • The date of the incident is recoded in the A_CreatedAt column as date/time.

  • The total water used per incident is recorded in the C_WaterUsage column as numeric.

  • The rest of the incident information is recorded in 10 factor columns.

3. Explore the Data

The exploration script makes no changes to the file. It investigates the quality of the data, identifying weaknesses and patterns in the data that will impact the quality of the analysis results. It recommends how the data can be applied and what to avoid.

The FireIncidentsDataQuality.Rmd script (available on GitHub) takes the Tidy Data frame (FireDataClean on GitHub) as input and produces the Data Quality report in html format.

The three data wrangling steps are now complete. With the Source the Data step in the Data Dialect Framework completed, the following artifacts are available:

  1. The Raw Data frame (FireData on GitHub)

  2. The Tidy Data frame (FireDataClean on GitHub)

  3. The Data Quality report

  4. The four scripts used to produce the data frames and report (all available on GitHub)

  • WranglingFireData_Merge.R

  • WranglingFireData_ReadIn.R

  • WranglingFireData_Cleanup.R

  • FireIncidentsDataQuality.Rmd

The data is now ready to be used by the business as input to Interpret or for further Statistical inference, Regression analysis or Machine learning. It is ready to combine with other tidy data frames containing related information like dam levels or city growth.

The business can also decide, based on the Data Quality report that the set data is not the correct data to answer the business question or the data is incomplete and can not be used in any analytical excercise, terminating the project at the end of the Source the Data step.

To respond to the business question, we’ve chosen to Interpret the Tidy Data frame.

How was this done (Interpreting the Data)

Using the advice from the Data Quality report the FireIncidentsDataVisualisation.Rmd script (available on GitHub) is developed and used to read in the Tidy Data frame and produce the Data Visualisation report.

Where the Data Quality report looked at the completeness of the data, the Data Visualisation report demonstrates how the relationships in the data can be interpreted to address the business question.

The FireIncidentsDataVisualisation.Rmd script (available on GitHub), takes the Tidy Data frame (FireDataClean on GitHub) as input and produces the Data Visualisation report in html format.

This exercise of Data Wrangling and Data Visualisation was derived from data published on the City of Cape Town data portal without any further input from the City of Cape Town.

The analysis was done to demonstrate of the Data Dialect Framework taking raw unstructured data and preparing tidy data for further analytical use. The results were not applied for any further commercial use.