1 TABLE OF CONTENTS

1 TABLE OF CONTENTS
2 PROLOGUE
3 SYNOPSIS
4 STORM EVENTS DATASET
- 4.1 General Information
- 4.2 Points Of Interest
  - 4.2.1 Changes in the composition of weather event types
  - 4.2.2 Eligibility criteria for inclusion of weather events in the dataset
5 PRELIMINARY ACTIVITIES
6 DATA PROCESSING
7 PROCESSED DATA
8 HARM ON POPULATION HEALTH
9 HARM ON ECONOMY
10 RESULTS
- 10.1 Question 1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
- 10.2 Question 2: Across the United States, which types of events have the greatest economic consequences?
11 REPRODUCIBILITY DETAILS
12 LICENSE
13 REFERENCES

2 PROLOGUE

To give the reader some context about this project, some general information is provided:

2.1 About The Assignment
2.2 About The Main Script
2.3 About The Report

A summary of the analysis is not included in this chapter; it can be found in the chapter SYNOPSIS.

2.1 About The Assignment

This project was created for the 2nd peer-graded assignment of:

Course 5: Reproducible Research,
from Data Science Specialization,
by Johns Hopkins University,
at Coursera

The course is taught by:

Jeff Leek, PhD
Roger D. Peng, PhD
Brian Caffo, PhD

As stated by the instructors of the course:

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.

The assignment asks to address two questions:

Your data analysis must address the following questions:

Question 1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Question 2: Across the United States, which types of events have the greatest economic consequences?

based on the observations from the supplied dataset:

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.

Some quite general guidelines and a tip were provided:

Consider writing your report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, there is no need to make any specific recommendations in your report.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

It was deliberately decided to adopt a more educational approach, aiming to produce a well-justified and self-explanatory product that can serve as a guide for a beginner on how a basic pipeline can be constructed in order to obtain a report with an analysis from scratch.

All the requirements for the assignment were followed, with one exception:

Due to the book-like structure adopted for the report, it was considered more appropriate to place the SYNOPSIS not immediately after the title but as a separate chapter after the PROLOGUE.

2.2 About The Main Script

The main script RepRes_____analysis.Rmd, which contains the code used to conduct the analysis, can be found in the GitHub repository https://github.com/jzstats/Reproducible-Research--2nd-Assignment, which hosts all the material relevant to this project.

When knitted directly from RStudio, it produces the Markdown file RepRes_____analysis.md with the analysis.

In addition, it was rendered with the script render_____RepRes_analysis.R (as explained in the following section of this chapter, 2.3 About The Report) to produce a bookdown variation that was uploaded to RPubs and used to populate the webpage that was created to showcase this project.

2.3 About The Report

The main Rmd file, RepRes_analysis.Rmd, which contains the code to conduct the analysis and produces the Markdown document RepRes_analysis.md, was rendered with the script render_____RepRes_analysis.R to create a bookdown version of the report with the analysis, which is hosted at the webpage created to showcase this project:

Report
- A more visually appealing and practical (thanks to the side panel with the table of contents) book-like version of the report, powered by the rmdformats library. It was produced by rendering RepRes_analysis.Rmd with the script render_____RepRes_analysis.R. This is the version that was uploaded to RPubs at this link.

3 SYNOPSIS

The U.S. National Oceanic and Atmospheric Administration’s (NOAA) Storm Events Database was explored to identify the most harmful weather event types, among the weather phenomena defined in NATIONAL WEATHER SERVICE INSTRUCTION 10-1605, AUGUST 17, 2007 (in chapter 7), with respect to population health and the economy.

The raw data was loaded in R from the supplied file and preprocessed, the target data subset was extracted, in-record validation was conducted, the majority of missing values were imputed (via a deterministic and conservative approach), the observations were cross-validated, and finally the table with the processed data was created, which contained all the information needed to address the two questions of interest:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

For the first question, the harm on population health by each weather event type was evaluated (separately) based on the average impact of the observations that resulted in non-zero damage over each of the three perspectives (fatalities, injuries and casualties) that were considered to be of importance.

Similarly for the second question, the harm on economy by each weather event type was evaluated (separately) based on the average impact from the observations that resulted in non-zero damage over each of the three perspectives (property damage, crop damage and economic damage) that were considered to be of importance.

For both questions, the main criterion used to rank the included weather event types (from the most harmful to the least) for each perspective was the overall average damage, computed from the observations that caused non-zero damage with respect to that perspective. In addition, for each of the included weather event types, the average for the 90% of cases with the lowest impact was reported alongside the average for the 10% of cases with the highest impact. This provides a more complete and insightful ‘picture’ of the consequences of each weather event type, because for all perspectives the distributions of the majority of weather event types were highly positively skewed.
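
The three averages described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper name summarise_impact() is hypothetical, the toy values are invented, and the way the cases are split into the lowest 90% and the highest 10% (by rounded case count) is an assumption, since the exact splitting rule used in the analysis is not shown in this chapter.

```r
# Toy data: hypothetical non-zero fatality counts for one weather event type.
# The values are illustrative only and are not taken from the dataset.
fatalities <- c(1, 1, 1, 1, 2, 2, 3, 3, 5, 40)

# Hypothetical helper: overall mean, mean of the lowest-impact cases
# and mean of the highest-impact cases.
summarise_impact <- function(x, upper_share = 0.10) {
  # Sort ascending so that the highest-impact cases are at the end.
  x <- sort(x)
  n <- length(x)
  # Number of cases assigned to the highest-impact group (at least one).
  # Assumption: the split is made by rounded case count.
  n_upper <- max(1L, as.integer(round(n * upper_share)))
  c(
    mean_overall    = mean(x),
    mean_lowest_90  = mean(head(x, n - n_upper)),
    mean_highest_10 = mean(tail(x, n_upper))
  )
}

impact_summary <- summarise_impact(fatalities)
print(impact_summary)
```

With a positively skewed sample like this one, the overall mean sits well above the mean of the lowest 90% of cases, which is why reporting both groups gives a fuller picture than the overall mean alone.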

The analysis was structured, performed and documented in such a way that it fortifies the reproducibility of the report and explains every required detail, so that even a non-expert can follow the procedure and understand the thought process behind the decision making at each stage.

4 STORM EVENTS DATASET

To conduct the analysis for this project, the file with the raw data repdata_data_StormData.csv.bz2 was used, which contains data from the Storm Events Dataset gathered and made publicly available by the U.S. National Oceanic and Atmospheric Administration (NOAA).

Some general information as well as two points of interest about the dataset:

4.2.1 Changes in the composition of weather event types
4.2.2 Eligibility criteria for inclusion of weather events in the dataset

were discussed to provide the necessary insights in order to understand why the decisions that govern the approach adopted in this analysis were made.

4.1 General Information

The version of the dataset used in this analysis contains observations for the severe weather events that happened (or, more accurately, began) from January 1950 to November 2011 in the United States.

Further details about the dataset used in this analysis can be found in the supplemental material provided with the instructions of the assignment:

NATIONAL WEATHER SERVICE INSTRUCTION 10-1605 (AUGUST 17, 2007) (also available at the GitHub repository created to support this project through this link)
Storm Data Faq Page (also available at the GitHub repository created to support this project through this link)

For additional information on the Storm Events Dataset, as well as an updated and cleaner version of the data, with observations from January 1950 up to January 2020 (at the time this report was produced, but it is expected to continue updating), it is recommended to visit and explore:

NOAA’s Storm Events Dataset official webpage

Finally, a document with detailed information about the history of the dataset was available at NOAA’s Storm Events Dataset webpage for the version history:

The History of the Storm Events Database (also available at the GitHub repository created to support this project through this link)

4.2 Points Of Interest

In order to understand why some of the decisions which govern the approach adopted in this analysis were made, it is essential to take into account two crucial facts with respect to the observations recorded in the dataset:

4.2.1 Changes in the composition of weather event types
- Both the composition of the weather events types that were recorded in the dataset and the way the data was entered in the system (the data entry procedure and the database software) changed several times across the years.
4.2.2 Eligibility criteria for inclusion of weather events in the dataset
- Not every weather event that occurred in the period that the dataset spans was automatically eligible to be recorded in the dataset. Only those that caused harm (either to population health or to the economy) or gathered public interest were recorded.

4.2.1 Changes in the composition of weather event types

Through the years, as the publicity of the dataset soared, several aspects governing the data collection procedure changed in order to expand, enrich and fortify the quality of the data.

As a result, the number of defined weather event types that were collected increased several times, starting from just one (TORNADO) for the first few years and expanding to 48 defined weather event types at the time the dataset used in this analysis was created. Consequently, there are inconsistencies in the composition of weather event types between different periods that could affect the integrity of the analysis.

Furthermore, for the period from 1996 to 2000, while the number of weather event types being recorded had already increased significantly, the values for the weather event type were entered through a free text field, resulting in more than 950 distinct entries.

For this reason, it was decided to use for the analysis only the observations since January 2001. From that point on, a drop-down menu replaced the free text field for entering the weather event type, so the majority of observations do not suffer from such problems and the weather event types they contain cover the majority of the latest defined weather event types.
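
A minimal sketch of such a date-based cut in base R is given below. It assumes that the begin date is stored as text in the "month/day/year hour:minute:second" layout; the toy rows and the column name begin_date are hypothetical, and the actual extraction used in the analysis may differ.

```r
# Toy rows with begin dates stored as text (illustrative values only).
toy_events <- data.frame(
  EVTYPE   = c("TORNADO", "TSTM WIND", "FLOOD", "HAIL"),
  BGN_DATE = c("4/18/1950 0:00:00", "7/4/1998 0:00:00",
               "1/1/2001 0:00:00", "11/28/2011 0:00:00"),
  stringsAsFactors = FALSE
)

# Parse only the date part; the trailing time text is ignored by the format.
toy_events$begin_date <- as.Date(toy_events$BGN_DATE, format = "%m/%d/%Y")

# Keep only the observations that began on or after 1 January 2001.
events_since_2001 <- toy_events[
  toy_events$begin_date >= as.Date("2001-01-01"),
]
print(events_since_2001)
```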

4.2.2 Eligibility criteria for inclusion of weather events in the dataset

Out of all weather events that happened in the period from January 2001 to November 2011 in the United States and were classified as one of the types that were recorded (in the period they occurred), only those that belonged to at least one of the following three groups were eligible to be included in the dataset:

The occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce.
Rare, unusual, weather phenomena that generate media attention.
Other significant meteorological events, such as record maximum or minimum temperatures or precipitation that occur in connection with another event.

An important implication of the above policy must be highlighted:

Of all the weather phenomena that happened in the period from January 2001 to November 2011 in the United States and were of a type that was recorded at the time they occurred, the dataset contains only those that either resulted in harm (to population health or to the economy) or gathered high publicity.
In contrast, all the weather phenomena that happened in the period from January 2001 to November 2011 in the United States and neither caused any harm (to population health or to the economy) nor gathered high public interest were ignored, even if they were of a type that was recorded at the time they occurred.

Consequently, any conclusion made about a weather event type in general will inevitably be biased: it will overestimate the harm caused (either to population health or to the economy), because by design the available sample is not representative of the overall population of weather phenomena (of the types that were recorded).

For this reason it was decided to use for the analysis:

Only the subset of observations that resulted in non-zero harm with respect to each of the perspectives of interest (fatalities, injuries and casualties) in order to determine the most harmful weather event types for the population health.
Only the subset of observations that resulted in non-zero harm with respect to each of the perspectives of interest (property damage, crop damage and economic damage) in order to determine the most harmful weather event types for the economy.
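
The idea of averaging only over the observations with non-zero harm, separately for each perspective, can be sketched in base R as follows. The toy values are invented, the helper mean_nonzero() is hypothetical, and treating casualties as the sum of fatalities and injuries is an assumption made for this illustration.

```r
# Toy observations (illustrative values only).
toy_harm <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "FLOOD", "HEAT", "HEAT"),
  FATALITIES = c(0, 2, 0, 1, 3),
  INJURIES   = c(0, 0, 4, 0, 10),
  stringsAsFactors = FALSE
)

# Assumption: casualties are the sum of fatalities and injuries.
toy_harm$CASUALTIES <- toy_harm$FATALITIES + toy_harm$INJURIES

# Hypothetical helper: the mean over the non-zero values only.
# It returns NaN when a group has no non-zero value at all.
mean_nonzero <- function(x) mean(x[x > 0])

# Apply the helper to each perspective separately, per weather event type.
average_nonzero_harm <- aggregate(
  cbind(FATALITIES, INJURIES, CASUALTIES) ~ EVTYPE,
  data = toy_harm,
  FUN = mean_nonzero
)
print(average_nonzero_harm)
```

Note that each perspective uses its own subset: a record with injuries but no fatalities contributes to the injuries and casualties averages but not to the fatalities average.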

5 PRELIMINARY ACTIVITIES

This chapter executes four preliminary tasks to ensure that the working directory and the R session are ready for the analysis, setting them up where needed and possible:

5.1 Set The Random Seed
- Sets a random seed to make the random events reproducible.
5.2 Load All Required Libraries
- Loads all libraries required to conduct the analysis and produce the report.
5.3 Create All Required Directories
- Creates (if it doesn’t exist) a directory tree (at the working directory) in which the output files will be exported.
5.4 Access The File With The Raw Data
- Downloads the file with the raw data, repdata_data_StormData.csv.bz2 in the working directory, if it doesn’t already exist.

5.1 Set The Random Seed

In an attempt to fortify the reproducibility of the random events, the number 1234567890 was explicitly chosen and set as the random seed.

# Select a random seed.
selected_random_seed <- 1234567890

# Set the selected random seed.
set.seed(selected_random_seed)

Note that the only random events that took place in this analysis were the assignment of random positions for the labels at the plots:

Plot 1.1.4
Plot 1.2.4
Plot 1.3.4
Plot 2.1.4
Plot 2.2.4
Plot 2.3.4

by the function geom_label_repel() from the ggrepel library.
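
A minimal sketch of what setting the seed guarantees, using base R's runif() as a stand-in for the random label placement (the draws are purely illustrative and are not part of the original analysis):

```r
# Draw three random numbers after setting the seed.
set.seed(1234567890)
first_draw <- runif(3)

# Reset the same seed and draw again.
set.seed(1234567890)
second_draw <- runif(3)

# The two sequences are identical, so any step that depends on them
# produces the same result on every run.
print(identical(first_draw, second_draw))
```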

5.2 Load All Required Libraries

Loads all libraries required to conduct the analysis and produce the report.

# Load all required libraries.
library(tools)
library(rmarkdown)
library(knitr)
library(kableExtra)
library(magrittr)
library(DT)
library(rmdformats)
library(data.table)
library(validate)
library(stringr)
library(moments)
library(ggplot2)
library(ggrepel)
library(grid)
library(gridExtra)

Note that the library:

rmdformats
- which was only used to produce the Report

is not essential to conduct the analysis.
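
Before loading the libraries, it can be useful to check which of them are actually installed. The following is a sketch of such a check and not part of the original script; the variable names are hypothetical.

```r
# The libraries loaded by the main script.
required_packages <- c(
  "tools", "rmarkdown", "knitr", "kableExtra", "magrittr", "DT",
  "rmdformats", "data.table", "validate", "stringr", "moments",
  "ggplot2", "ggrepel", "grid", "gridExtra"
)

# Check, without attaching anything, whether each package is available.
is_installed <- vapply(
  X = required_packages,
  FUN = requireNamespace,
  FUN.VALUE = logical(1),
  quietly = TRUE
)

# Report the packages that would need to be installed first.
missing_packages <- required_packages[!is_installed]
if (length(missing_packages) > 0) {
  message(
    "Packages not installed: ",
    paste(missing_packages, collapse = ", ")
  )
}
```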

5.3 Create All Required Directories

During the execution of the main script, RepRes_analysis.Rmd, several outputs are produced (which are also included in the report), mostly to further enhance the reproducibility of the analysis.

All those files are exported to appropriate sub-directories inside the directory named outputs, which is created in the working directory.

# Create a list with the paths to all sub-directories 
# of the directory tree for the outputs of this analysis
directory_tree_____outputs <- list(
    "filepath_____outputs_____processed_data" = 
        file.path("outputs", "processed_data"),
    "filepath_____outputs_____harm_on_population_health_____figures" = 
        file.path("outputs", "harm_on_population_health", "figures"),
    "filepath_____outputs_____harm_on_population_health_____results" = 
        file.path("outputs", "harm_on_population_health", "results"),
    "filepath_____outputs_____harm_on_economy_____figures" = 
        file.path("outputs", "harm_on_economy", "figures"),
    "filepath_____outputs_____harm_on_economy_____results" = 
        file.path("outputs", "harm_on_economy", "results"),
    "filepath_____outputs_____reproducibility_support_____r_session" = 
        file.path("outputs", "reproducibility_support", "r_session"),
    "filepath_____outputs_____reproducibility_support_____MD5_checksums" =
        file.path("outputs", "reproducibility_support", "MD5_checksums")
    
)

# Create the directory tree for the outputs of the analysis.  
invisible(lapply(
    X = directory_tree_____outputs,
    FUN = function(filepath_of_subdirectory) {
        if ( ! dir.exists(filepath_of_subdirectory) ) {
            dir.create(filepath_of_subdirectory, recursive = TRUE)
        }
    }
))


# Check if all subdirectories of the directory for the outputs of the analysis 
# were successfully created.
do_the_directories_exists <- vapply(
    X = directory_tree_____outputs,
    FUN = dir.exists,
    FUN.VALUE = logical(1)
)

# If any of the sub-directories
# required for the outputs of the analysis
# could not be created, the process terminates.
if (any(!do_the_directories_exists)) {
    stop(
        "\n",
        "Failed to create the directories: ", "\n",
        paste0("\t", directory_tree_____outputs[!do_the_directories_exists], "\n"),
        "The process is aborted for now.", "\n",
        "Please rerun the script or create the required sub-directories manually.", 
        "\n"
    )
}

If any of the sub-directories in the directory tree for the outputs of the analysis cannot be created, the process terminates.

5.4 Access The File With The Raw Data

The file with name repdata_data_StormData.csv.bz2, which contains data from the Storm Events Dataset was supplied for this assignment and used to conduct the analysis.

If the file doesn’t already exist in the working directory, an attempt is made to download it automatically.

# Path to the file with the compressed raw data.
filepath_____unprocessed_data <- "repdata_data_StormData.csv.bz2"

# The link supplied by the instructions of the assignment
# to download the file with the compressed raw data.
url_to_download_the_data_file <- 
  "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

# Check if the file 'repdata_data_StormData.csv.bz2', 
# with the compressed raw data is available at the working directory. 
## if it doesn't exist...
if ( !file.exists(filepath_____unprocessed_data) ) {

  message(
    "\n", 
    "The file, '", filepath_____unprocessed_data, "'", "\n", 
    "doesn't exist at the working directory.",
    "\n"
  )
  message(
    "\n", "Trying to download the file, ", "\n",
    "'", filepath_____unprocessed_data, "' ", "\n",
    "with the raw data from the url: ", "\n",
    "\t", "'",  url_to_download_the_data_file, "'"
  )
  
  ### ...an attempt is made to download it from the link supplied by assignment
  try(
    download.file(
      url = url_to_download_the_data_file,
      destfile = filepath_____unprocessed_data)
  )
  
  ## Checks if the file 'repdata_data_StormData.csv.bz2' 
  ## was successfully downloaded.  
  ### in case the file is not found at the working directory 
  ### after the attempt to download 
  ### the process terminates with an informative message 
  ### that explains the situation to the user
  if ( !file.exists(filepath_____unprocessed_data)  ) {
    stop(
      "\n", 
      "Failed to download the required file,", "\n",
      "'", filepath_____unprocessed_data, "'", "\n",
      "with the raw data.", "\n",
      "The process is aborted for now."
    )
  } 
}

If the download fails, the process terminates.
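
Since the directory tree for the outputs includes a folder for MD5 checksums, one way to confirm that a file has the expected content is to compute its checksum with tools::md5sum(). The sketch below is an assumption about how such a check could look, not the procedure used in the main script; it works on a small temporary stand-in file rather than on the real data file, and no reference checksum for the real file is implied.

```r
# Create a small stand-in file (the real data file is not needed here).
stand_in_file <- tempfile(fileext = ".csv")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,0"), stand_in_file)

# Compute the MD5 checksum twice; the same content gives the same checksum.
checksum_first  <- unname(tools::md5sum(stand_in_file))
checksum_second <- unname(tools::md5sum(stand_in_file))
same_content <- identical(checksum_first, checksum_second)

# Change the content; the checksum changes as well.
cat("HAIL,0\n", file = stand_in_file, append = TRUE)
checksum_changed <- unname(tools::md5sum(stand_in_file))

print(c(same_content = same_content,
        changed_detected = checksum_changed != checksum_first))
```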

6 DATA PROCESSING

The data processing pipeline started with the supplied file repdata_data_StormData.csv.bz2, which contained raw data from the Storm Events Dataset, and produced the table with the processed data.

The pipeline consists of seven distinct stages:

6.1 Load The Raw Data In R
- The table with the raw data was created by loading in R the raw data from the supplied compressed file, with all variables coerced to character type. Post validation was conducted and an overview of the table with the raw data was presented.
6.2 Preprocess The Raw Data
- To create the table with the preprocessed data from the table with the raw data, prerequisites were verified for the variables required for the analysis; these variables were then selected and coerced to their appropriate types, and a key was set for the table. Post validation was conducted and an overview of the table with the preprocessed data was presented.
6.3 Extract The Target Data Subset
- From the table with the preprocessed data, only the subset of observations for the weather events that happened in the period from 2001 to 2011 and caused non-zero fatalities, injuries, property damage or crop damage was extracted. Post validation was conducted and an overview of the table with the target data subset was presented.
6.4 Conduct In-Record Data Validation
- The values of each variable in the table with the target data subset were validated against appropriate constraints for each column separately (independently of the other variables), and the entries that were found invalid were substituted with NAs to create the table with the in-record validated data. Post validation was conducted and an overview of the table with the in-record validated data was presented.
6.5 Impute Missing Values
- The missing values of each variable in the table with the in-record validated data were examined, and those that could be retrieved (in a deterministic and conservative way) were imputed to produce the table with the imputed data. Post validation was conducted and an overview of the table with the imputed data was presented.
6.6 Conduct Cross-Record Data Validation
- Each observation from the table with the imputed data was validated against appropriate constraints that spanned all available variables, and only those that were found valid were used to create the table with the cross-record validated data. Post validation was conducted and an overview of the table with the cross-record validated data was presented.
6.7 Produce The Processed Data
- From the table with the cross-record validated data, by appropriately transforming the available information, the table with the processed data was created, which contained the variables required to identify the most harmful weather event types with respect to population health and the economy. Post validation was conducted.
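
As an illustration of the in-record validation stage, the sketch below checks one column on its own and replaces invalid entries with NA. The rule used here (a count must be a non-negative whole number) and the toy values are assumptions made for this example; the actual constraints are defined in the corresponding section of the report.

```r
# Toy column stored as text, as in the table with the raw data
# (illustrative values only; two of them are deliberately invalid).
toy_counts <- data.frame(
  FATALITIES = c("0.00", "2.00", "-1.00", "abc"),
  stringsAsFactors = FALSE
)

# Convert to numeric; text that is not a number becomes NA.
fatalities_numeric <- suppressWarnings(as.numeric(toy_counts$FATALITIES))

# Assumed constraint: a valid count is a non-negative whole number.
is_valid <- !is.na(fatalities_numeric) &
  fatalities_numeric >= 0 &
  fatalities_numeric == round(fatalities_numeric)

# Substitute the invalid entries with NA, keep the valid ones.
toy_counts$FATALITIES_validated <- ifelse(
  is_valid, fatalities_numeric, NA_real_
)
print(toy_counts)
```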

At each stage of the data processing procedure any fact that played a major role was highlighted and examined when it was needed, in compliance with the spirit of the assignment, aiming to supply all the facts necessary to understand how and why the decision making behind this analysis happened in order to create a well justified and documented, reproducible report.

6.1 Load The Raw Data In R

Summary

The raw data was loaded in R from the supplied file repdata_data_StormData.csv.bz2 (which contains data from the Storm Events Dataset) to create the table with the raw data, which was then post validated, and some basic facts about it were highlighted.

Steps

6.1.1 Create the table with the raw data
- Reads the file repdata_data_StormData.csv.bz2 in R, to create the table with the raw data.
6.1.2 Conduct post validation for the table with the raw data
- Ensures that the raw data was loaded correctly.
6.1.3 Overview of the table with the raw data
- Presents some basic facts about the table with the raw data.

6.1.1 Create the table with the raw data

The raw data was loaded in R directly from the supplied file repdata_data_StormData.csv.bz2 (a CSV file compressed via the bzip2 algorithm), with all variables deliberately coerced to character type in order to ensure that no information was lost or altered as a side effect of coercion. The first row of the file includes headers that were used to automatically assign the names of all the variables in the table with the raw data that was created.

# Load the raw data in R from the supplied file:
#     "repdata_data_StormData.csv.bz2"
# and create the table with the raw data.
raw_data <- fread(
  ## the file is expected to exist at the working directory
  file = filepath_____unprocessed_data,
  ## the variables in the file are separated via a comma
  sep = ",",
  ## the first row of the file contains the names of the variables
  header = TRUE,
  ## all variables were deliberately loaded as character type
  ## to avoid any loss or alteration of information as a side effect of coercion
  colClasses = "character"
)
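
The effect of loading everything as character can be seen on a tiny example. The sketch below uses base R's read.csv() as a stand-in for fread() so that it runs without extra packages; the two-row CSV text is invented for illustration.

```r
# A tiny CSV with a time value that has a leading zero (illustrative only).
csv_text <- "BGN_TIME,FATALITIES\n0130,0.00\n0145,2.00"

# Letting the reader guess the types turns the times into integers,
# which silently drops the leading zeros.
guessed_types <- read.csv(text = csv_text)

# Reading every column as character keeps the values exactly as written.
character_types <- read.csv(text = csv_text, colClasses = "character")

print(guessed_types$BGN_TIME)
print(character_types$BGN_TIME)
```

Reading as character postpones every type decision to an explicit, documented coercion step later in the pipeline.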

6.1.2 Conduct post validation for the table with the raw data

The table with the raw data was post validated to ensure that the data from the file was loaded in R correctly.
Three simple constraints were applied:

It should contain 37 variables.
It should contain 902297 observations.
The type of all variables should be ‘character’.

(The expected number of variables and the expected number of observations were acquired interactively before the execution of the main script RepRes_analysis.Rmd and were then used to form the constraints for the post validation.)

# Create a validator with constraints for the validity of the loaded raw data.
V_____loaded_raw_data <- validator(
  ## create a character vector that captures the class of each variable
  classes_of_varirables := vapply(
    X = ., 
    ### although it is not expected to get an output 
    ### with more than one element for the class of each variable, 
    ### in general it is possible to happen, 
    ### so proper care is taken to collapse the elements of such vector 
    ### in a single element so that the vapply() function 
    ### won't fail with an error in such case
    FUN = function(x) paste(class(x), collapse = ","), 
    FUN.VALUE = character(1)
  ),
  "expected_number_of_variables" = ( length(.) == 37 ),
  "expected_number_of_observations" = ( nrow(.) == 902297 ),
  "expected_variable_types" = ( classes_of_varirables == "character" )
)

# Confront the table with the raw data with the validator
# which contains the constraints for the validity of raw data.
CF_____loaded_raw_data <- confront(dat = raw_data, V_____loaded_raw_data)
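
For readers unfamiliar with the validate library, the same three checks can be approximated in base R. This is a simplified sketch on a toy table, not a replacement for the validator above; the helper name check_loaded_table() and the toy values are hypothetical.

```r
# Toy table loaded "as character" (illustrative values only).
toy_raw_data <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL"),
  FATALITIES = c("0.00", "1.00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper that mirrors the three constraints:
# number of variables, number of observations and variable types.
check_loaded_table <- function(tbl, n_variables, n_observations) {
  # Collapse multi-element class vectors so that vapply() always
  # receives exactly one string per variable.
  classes <- vapply(
    X = tbl,
    FUN = function(x) paste(class(x), collapse = ","),
    FUN.VALUE = character(1)
  )
  c(
    expected_number_of_variables    = length(tbl) == n_variables,
    expected_number_of_observations = nrow(tbl) == n_observations,
    expected_variable_types         = all(classes == "character")
  )
}

checks_passed <- check_loaded_table(toy_raw_data, 2, 2)
print(checks_passed)
```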

The table with the raw data was valid.

# Create a kable with the results of post validation 
# for the table with the raw data. 
kable(
  x = summary(CF_____loaded_raw_data)[
    , 
    c("name", "items", "passes", "fails", "nNA", "error", "warning")
    ], 
  caption = paste0(
    "TABLE 6.1.3-1: ", 
    "The results of post validation for the table with the raw data."
  )
) %>% 
  kable_styling(
    bootstrap_options = c(
      "striped", "hover", "condensed", "responsive", "bordered"
    ), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

TABLE 6.1.3-1: The results of post validation for the table with the raw data.
name	items	passes	error	warning
expected_number_of_variables	1	1	FALSE	FALSE
expected_number_of_observations	1	1	FALSE	FALSE
expected_variable_types	37	37	FALSE	FALSE

6.1.3 Overview of the table with the raw data

The table with the raw data contained 37 variables that were all of type ‘character’ and 902297 observations.

# Print the structure of the table with the raw data.
str(raw_data)

## Classes 'data.table' and 'data.frame':   902297 obs. of  37 variables:
##  $ STATE__   : chr  "1.00" "1.00" "1.00" "1.00" ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : chr  "97.00" "3.00" "57.00" "89.00" ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : chr  "0.00" "0.00" "0.00" "0.00" ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: chr  "0.00" "0.00" "0.00" "0.00" ...
##  $ COUNTYENDN: chr  "" "" "" "" ...
##  $ END_RANGE : chr  "0.00" "0.00" "0.00" "0.00" ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : chr  "14.00" "2.00" "0.10" "0.00" ...
##  $ WIDTH     : chr  "100.00" "150.00" "123.00" "100.00" ...
##  $ F         : chr  "3" "2" "2" "2" ...
##  $ MAG       : chr  "0.00" "0.00" "0.00" "0.00" ...
##  $ FATALITIES: chr  "0.00" "0.00" "0.00" "0.00" ...
##  $ INJURIES  : chr  "15.00" "0.00" "2.00" "2.00" ...
##  $ PROPDMG   : chr  "25.00" "2.50" "25.00" "2.50" ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : chr  "0.00" "0.00" "0.00" "0.00" ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : chr  "3040.00" "3042.00" "3340.00" "3458.00" ...
##  $ LONGITUDE : chr  "8812.00" "8755.00" "8742.00" "8626.00" ...
##  $ LATITUDE_E: chr  "3051.00" "0.00" "0.00" "0.00" ...
##  $ LONGITUDE_: chr  "8806.00" "0.00" "0.00" "0.00" ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : chr  "1.00" "2.00" "3.00" "4.00" ...
##  - attr(*, ".internal.selfref")=<externalptr>

There were no missing values (coded as NAs) at any of the variables it contained, but there were a lot of empty values which probably represent missing values. For some of the variables, a suspiciously large or small number of distinct values was observed.

# Create a kable to highlight some facts 
# about the variables at the table with the raw data.
kable(
  data.table(
    "Variable" = names(raw_data),
    "Number of Distinct Values" = vapply(
      X = raw_data, 
      FUN = function(x) length(unique(x)), 
      FUN.VALUE = integer(1)
      ),
    "Number of NAs" = vapply(
      X = raw_data, 
      FUN = function(x) sum(is.na(x)),
      FUN.VALUE = integer(1)
      ),
    # from the output with the structure of the table with the raw data 
    # it was found that some empty values exist at some variables 
    # and the exact number that each of them contains was reported
    "Number of Empty Values" = vapply(
      X = raw_data, 
      FUN = function(x) sum(x == ""), 
      FUN.VALUE = integer(1))
    
  ),
  caption = paste0(
    "TABLE 6.1.3-2: ", 
    "Facts about the variables at the table with the raw data."
  )
) %>% 
  kable_styling(
    bootstrap_options = c(
      "striped", "hover", "condensed", "responsive", "bordered"
    ), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
    general = paste0(
      "The table with the raw data contains 37 variables ", 
      "that are all of type 'character'", "\n",
      "and 902297 observations."
    )
  )

TABLE 6.1.3-2: Facts about the variables at the table with the raw data.
Variable	Number of Distinct Values	Number of NAs	Number of Empty Values
STATE__	70	0	0
BGN_DATE	16335	0	0
BGN_TIME	3608	0	0
TIME_ZONE	22	0	0
COUNTY	557	0	0
COUNTYNAME	29601	0	1589
STATE	72	0	0
EVTYPE	985	0	0
BGN_RANGE	272	0	0
BGN_AZI	35	0	547332
BGN_LOCATI	54429	0	287743
END_DATE	6663	0	243411
END_TIME	3647	0	238978
COUNTY_END	1	0	0
COUNTYENDN	1	0	902297
END_RANGE	266	0	0
END_AZI	24	0	724837
END_LOCATI	34506	0	499225
LENGTH	568	0	0
WIDTH	293	0	0
F	7	0	843563
MAG	226	0	0
FATALITIES	52	0	0
INJURIES	200	0	0
PROPDMG	1390	0	0
PROPDMGEXP	19	0	465934
CROPDMG	432	0	0
CROPDMGEXP	9	0	618413
WFO	542	0	142069
STATEOFFIC	250	0	248769
ZONENAMES	25112	0	594029
LATITUDE	1781	0	47
LONGITUDE	3841	0	0
LATITUDE_E	1729	0	40
LONGITUDE_	3778	0	0
REMARKS	436906	0	287433
REFNUM	902297	0	0
Note:
The table with the raw data contains 37 variables that are all of type ‘character’ and 902297 observations.
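
The distinction between NAs and empty strings can be illustrated with a minimal base R sketch. The table `toy_raw` and its values are hypothetical, and recoding empty strings as NA is only an assumption about how such values could be treated, not a step of this analysis.

```r
# Toy stand-in for the raw data (hypothetical values, all columns character).
toy_raw <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD"),
  CROPDMGEXP = c("", "K", ""),
  stringsAsFactors = FALSE
)

# Count the NAs and the empty strings per variable.
na_counts    <- vapply(toy_raw, function(x) sum(is.na(x)), integer(1))
empty_counts <- vapply(toy_raw, function(x) sum(x == ""), integer(1))

# Recode the empty strings as NA, so that they are counted as missing.
toy_recoded <- toy_raw
toy_recoded[toy_recoded == ""] <- NA
recoded_na_counts <- vapply(
  toy_recoded, function(x) sum(is.na(x)), integer(1)
)

print(rbind(na_counts, empty_counts, recoded_na_counts))
```

Before the recoding the NA counts are all zero although two values of CROPDMGEXP are empty; after the recoding the same two values are reported as missing.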

6.2 Preprocess The Raw Data

Summary

From the table with the raw data which contains 37 variables, only 9 were selected to create the table with preprocessed data and proceed with this analysis:

REFNUM : an id that uniquely identifies each observation
BGN_DATE : the date when each weather event began
EVTYPE : the type of each weather event
FATALITIES : the number of fatalities
INJURIES : the number of injuries
PROPDMG : the magnitude of the damage caused to property, expressed in thousands, millions or billions of dollars, depending on the corresponding indicator value at the variable PROPDMGEXP
PROPDMGEXP : an indicator value that denotes whether the corresponding magnitude at the variable PROPDMG refers to thousands, millions or billions of dollars
CROPDMG : the magnitude of the damage caused to crops, expressed in thousands, millions or billions of dollars, depending on the corresponding indicator value at the variable CROPDMGEXP
CROPDMGEXP : an indicator value that denotes whether the corresponding magnitude at the variable CROPDMG refers to thousands, millions or billions of dollars
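
How a magnitude and its indicator combine into a dollar amount can be sketched as follows. The helper `to_dollars` and the vector `multipliers` are hypothetical names, and the sketch assumes that only the indicators K, M and B are meaningful; any other indicator value (including the empty string) yields NA here, which is a simplification and not necessarily how this analysis treats them.

```r
# Assumed mapping of the indicator values to multipliers:
# K = thousands, M = millions, B = billions of dollars.
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: combine a magnitude with its indicator.
# Indicators that are not in the mapping produce NA.
to_dollars <- function(magnitude, indicator) {
  unname(magnitude * multipliers[indicator])
}

# The pair (25, "K") corresponds to the first observation of the raw data.
to_dollars(c(25, 2.5), c("K", "M"))

# An empty indicator is not in the mapping, so the result is NA.
to_dollars(25, "")
```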

Because all variables at the table with the raw data were (deliberately) loaded as type ‘character’, the format of the character string values that they contained had to be verified before they were coerced to their appropriate type.

After it was verified that its values uniquely identify each observation, the variable REFNUM was set as the key for the table with the preprocessed data.

Finally, post validation was conducted and some facts about the table with the preprocessed data were highlighted.

Steps

6.2.1 Verify the prerequisites for the selected variables
- Checks two key points for the values of selected variables from the table with the raw data:
  - 6.2.1.1 Verify the coercibility of the values for the selected variables
    - The character values at the selected variables were checked to verify that their format was compatible with the variable type that each of them should be coerced to.
  - 6.2.1.2 Verify the uniqueness of the key values
    - The values of the variable that was intended to be used as the key of the table with the preprocessed data were checked to verify if they uniquely identify each observation.
6.2.2 Create the table with the preprocessed data
- Creates the table with the preprocessed data by selecting the required variables, coercing them to their appropriate type and setting a key for the table.
6.2.3 Conduct post validation for the table with the preprocessed data
- Ensures that the raw data was preprocessed correctly.
6.2.4 Overview of the table with the preprocessed data
- Presents some basic facts about the table with the preprocessed data.

6.2.1 Verify the prerequisites for the selected variables

Two key points were checked for the values of the selected variables from the table with raw data before proceeding to create the table with the preprocessed data:

6.2.1.1 Verify the coercibility of the values for the selected variables
- Checks if the format of the character string values of the selected variables from the table with the raw data is compatible with the (variable) type that each of them should be coerced to.
6.2.1.2 Verify the uniqueness of the key values
- Checks if the values of the variable REFNUM (when coerced to type ‘integer’), which was intended to be used as the key of the table with the preprocessed data, uniquely identify each observation.

6.2.1.1 Verify the coercibility of the values for the selected variables

The format of the character string values of the selected variables (REFNUM, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP) from the table with the raw data was checked to verify that it was compatible with the new variable type that each of them should be coerced to. The purpose was to avoid losing information without noticing it: a value that is incompatible with the new variable type is automatically substituted by NA during coercion.
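
The side effect described above can be reproduced with a minimal base R sketch. The vector `values` is hypothetical; the pattern is the one used for the integer variables in the validator of this subsubsection.

```r
# Hypothetical character values: two well-formed, one malformed, one empty.
values <- c("15.00", "0.00", "n/a", "")

# Check the format before the coercion.
has_expected_format <- grepl("^\\d{1,}\\.00$", values)

# Coerce anyway; the incompatible values are substituted by NA
# (the warning is suppressed here only to keep the output short).
coerced <- suppressWarnings(as.integer(values))

print(data.frame(values, has_expected_format, coerced))
```

In this toy case the values that fail the format check are exactly the ones that become NA after the coercion, which is why the check is worth running first.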

The variables EVTYPE, BGN_DATE, PROPDMGEXP and CROPDMGEXP were already in their appropriate type (which is ‘character’), so no further adjustments were needed. On the other hand the variables REFNUM, FATALITIES and INJURIES had to be coerced from ‘character’ type to ‘integer’, while the type of the remaining two variables, PROPDMG and CROPDMG had to change from ‘character’ to ‘double’.

A validation was conducted to verify that:

the values of the variable REFNUM can be coerced to ‘integer’ type
the values of the variable FATALITIES can be coerced to ‘integer’ type
the values of the variable INJURIES can be coerced to ‘integer’ type
the values of the variable PROPDMG can be coerced to ‘double’ type
the values of the variable CROPDMG can be coerced to ‘double’ type

# Create a validator with the constraints needed to verify 
# that the formats of the character string values 
# at the selected variables from the table with the raw data 
# are compatible with the variable types that they should be coerced to.
V_____coercibible_format_of_the_character_string_values <- 
  validator(
    "REFNUM_value_is_coercible_to_integer" = 
      ( grepl("^\\d{1,}\\.00$", REFNUM) ),
    "FATALITIES_value_is_coercible_to_integer" = 
      ( grepl("^\\d{1,}\\.00$", FATALITIES) ),
    "INJURIES_value_is_coercible_to_integer" = 
      ( grepl("^\\d{1,}\\.00$", INJURIES) ),
    "PROPDMG_value_is_coercible_to_double" = 
      ( grepl("^\\d{1,}\\.\\d{2}$", PROPDMG) ),
    "CROPDMG_value_is_coercible_to_double" = 
      ( grepl("^\\d{1,}\\.\\d{2}$", CROPDMG) )
  )

# Confront the table with the raw data with the validator 
# which contains the constraints for the formats of the character string values 
# at the selected variables from the table with the raw data.
CF_____coercibible_format_of_the_character_string_values <- 
  confront(
    dat = raw_data,
    V_____coercibible_format_of_the_character_string_values
  )

The values of all selected variables were found to be compatible with the new type that each of them should be coerced to.

# Create a kable to present the results of the validation 
# for the format of the character string values 
# at the selected variables from the table with the raw data.
kable(
  x = summary(
    CF_____coercibible_format_of_the_character_string_values
  )[, c("name", "items", "passes", "fails", "nNA", "error", "warning")],
  caption = paste0(
    "Table 6.2.1.1-1: ",
    "The results of the validation ", 
    "for the compatibility of the format of the character string values ", 
    "at the selected variables from the table with raw data ",
    "with the appropriate type that each of them should be coerced to, ",
    "at the table of preprocessed data."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

Table 6.2.1.1-1: The results of the validation for the compatibility of the format of the character string values at the selected variables from the table with raw data with the appropriate type that each of them should be coerced to, at the table of preprocessed data.
name	items	passes	error	warning
REFNUM_value_is_coercible_to_integer	902297	902297	FALSE	FALSE
FATALITIES_value_is_coercible_to_integer	902297	902297	FALSE	FALSE
INJURIES_value_is_coercible_to_integer	902297	902297	FALSE	FALSE
PROPDMG_value_is_coercible_to_double	902297	902297	FALSE	FALSE
CROPDMG_value_is_coercible_to_double	902297	902297	FALSE	FALSE

6.2.1.2 Verify the uniqueness of the key values

The variable REFNUM, coerced to its proper type (that is ‘integer’), was expected to uniquely identify each observation, making it an excellent choice for the key of the table with the preprocessed data (as well as for the rest of the tables that were generated at the following stages of the data processing pipeline).

Before setting REFNUM as the key, the claim that it uniquely identifies each observation was checked, to avoid surprises that might jeopardize the reproducibility of the analysis.

# Create a validator for the uniqueness of values at the variable REFNUM.
V_____uniqueness_of_values_for_the_key_of_the_table <- validator(
  "value_uniquely_identifies_the_observation" = (
    # the values of the variable REFNUM will be first coerced 
    # to their appropriate variable type, which is 'integer' 
    # and then checked for uniqueness
    as.integer(REFNUM) %in% 
      names(table(as.integer(REFNUM)))[table(as.integer(REFNUM)) == 1]
  )
)

# Confront the table with raw data with the validator 
# for the uniqueness of values at REFNUM variable 
CF_____uniqueness_of_values_for_the_key_of_the_table <- confront(
  dat = raw_data,
  V_____uniqueness_of_values_for_the_key_of_the_table
)

All values of the variable REFNUM were found to be distinct, and consequently they uniquely identify each observation.
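
For comparison, the same kind of check can be written in base R with `duplicated()` and `anyDuplicated()`. This is a sketch on hypothetical vectors, not the check that was run above; `is_unique_per_value` and `is_valid_key` are hypothetical helper names.

```r
# Hypothetical key candidates: one without and one with a repeated value.
ids_distinct <- c(1L, 2L, 3L, 4L)
ids_repeated <- c(1L, 2L, 2L, 4L)

# Per-value check: TRUE where the value occurs exactly once.
is_unique_per_value <- function(x) {
  !(duplicated(x) | duplicated(x, fromLast = TRUE))
}

# Overall check: no NAs and no repeated values.
is_valid_key <- function(x) {
  !anyNA(x) && anyDuplicated(x) == 0L
}

is_unique_per_value(ids_repeated)
is_valid_key(ids_distinct)
is_valid_key(ids_repeated)
```

The per-value variant returns one result per observation, like the rule in the validator; the overall variant returns a single TRUE or FALSE.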

# Create a kable to present the results from the validation 
# for the uniqueness of each value at the variable REFNUM 
# at the table with the raw data. 
kable(
  x = summary(CF_____uniqueness_of_values_for_the_key_of_the_table)[
    , c("name", "items", "passes", "fails", "nNA", "error", "warning")
    ], 
  caption = paste0(
    "Table 6.2.1.2-1: ",
    "The results from the validation ",
    "for the uniqueness of values from REFNUM variable ", 
    "at the table with the raw data."
  )
) %>% 
  kable_styling(
    bootstrap_options = c(
      "striped", "hover", "condensed", "responsive", "bordered"
    ), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
    general = paste0(
      "The values at REFNUM variable were coerced to 'integer' type ", "\n",
      "before checking if they uniquely identify each observation."
    )
  )

Table 6.2.1.2-1: The results from the validation for the uniqueness of values from REFNUM variable at the table with the raw data.
name	items	passes	fails	nNA	error	warning
value_uniquely_identifies_the_observation	902297	902297	0	0	FALSE	FALSE
Note:
The values at REFNUM variable were coerced to ‘integer’ type before checking if they uniquely identify each observation.

6.2.2 Create the table with the preprocessed data

Having identified the variables from the table with the raw data (REFNUM, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP) that were required to proceed with the analysis, and having verified that they satisfied the necessary prerequisites, the table with the preprocessed data was created by selecting those 9 variables, coercing them to their appropriate type:

REFNUM was selected and coerced from ‘character’ type to ‘integer’
BGN_DATE was selected (no coercion happened as it was already of the proper type, ‘character’)
EVTYPE was selected (no coercion happened as it was already of the proper type, ‘character’)
FATALITIES was selected and coerced from ‘character’ type to ‘integer’
INJURIES was selected and coerced from ‘character’ type to ‘integer’
PROPDMG was selected and coerced from ‘character’ type to ‘double’
PROPDMGEXP was selected (no coercion happened as it was already of the proper type, ‘character’)
CROPDMG was selected and coerced from ‘character’ type to ‘double’
CROPDMGEXP was selected (no coercion happened as it was already of the proper type, ‘character’)

and finally setting the variable REFNUM as the key of the table.

# Create the table with the preprocessed data 
# with the selected variables from the table with the raw data, 
# coerced to their appropriate type.
preprocessed_data <- raw_data[
  ,
  list(
    "REFNUM" = as.integer(REFNUM),
    "BGN_DATE" = BGN_DATE,
    "EVTYPE" = EVTYPE,
    "FATALITIES" = as.integer(FATALITIES),
    "INJURIES" = as.integer(INJURIES),
    "PROPDMG" = as.double(PROPDMG),
    "PROPDMGEXP" = PROPDMGEXP,
    "CROPDMG" = as.double(CROPDMG),
    "CROPDMGEXP" = CROPDMGEXP
  )
  ]

# Set REFNUM as the key of the table with the preprocessed data
setkey(preprocessed_data, REFNUM)

6.2.3 Conduct post validation for the table with the preprocessed data

The table with the preprocessed data was post validated to ensure that:

all the variables required for the analysis, and only those, were included
all the observations from the table with the raw data were transferred
each of the selected variables was coerced to its appropriate type
no missing values were introduced as a result of the coercion
REFNUM was set as the key of the table

# Create a vector with the names of the expected variables 
# at the table with the preprocessed data.
expected_variables_at_the_table_with_preprocessed_data <- c(
  "REFNUM", "BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", 
  "PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP"
)

# Create a validator for the post validation of the preprocessed data.
V_____post_validation_of_table_with_preprocessed_data <- validator(
  # check if the table contains all and only the required variables 
  "all_and_only_the_required_variables_are_included" = 
    ( names(.) == expected_variables_at_the_table_with_preprocessed_data ),
  # check if all the observations were included.
  "all_observations_were_transfered" = nrow(.) == nrow(raw_data),
  # checks if each variable is coerced to its appropriate type
  "REFNUM_is_integer" = 
    ( paste(class(.[["REFNUM"]]), collapse = ",") == "integer" ),
  "BGN_DATE_is_character" = 
    ( paste(class(.[["BGN_DATE"]]), collapse = ",") == "character" ),
  "EVTYPE_is_character" = 
    ( paste(class(.[["EVTYPE"]]), collapse = ",") == "character" ),
  "FATALITIES_is_integer" = 
    ( paste(class(.[["FATALITIES"]]), collapse = ",") == "integer" ),
  "INJURIES_is_integer" = 
    ( paste(class(.[["INJURIES"]]), collapse = ",") == "integer" ),
  "PROPDMG_is_numeric" = 
    ( paste(class(.[["PROPDMG"]]), collapse = ",") == "numeric" ),
  "PROPDMGEXP_is_character" = 
    ( paste(class(.[["PROPDMGEXP"]]), collapse = ",") == "character" ),
  "CROPDMG_is_numeric" = 
    ( paste(class(.[["CROPDMG"]]), collapse = ",") == "numeric" ),
  "CROPDMGEXP_is_character" = 
    ( paste(class(.[["CROPDMGEXP"]]), collapse = ",") == "character" ),
  # check that no missing values were introduced as a result of coercion
  "no_missing_values_introduced" = ( mean(complete.cases(.)) == 1 ),
  # checks if the REFNUM is set as the key of the table
  "REFNUM_is_the_key_of_the_table" = ( attributes(.)[["sorted"]] == "REFNUM" )
)

# Confront the table with the preprocessed data with the validator 
# which contains the constraints for the validity of preprocessed data. 
CF_____post_validation_of_table_with_preprocessed_data <- confront(
  dat = preprocessed_data,
  V_____post_validation_of_table_with_preprocessed_data
)

The table with the preprocessed data was valid.

# Create a kable to present the results of post validation 
# for table with the preprocessed data. 
kable(
  x = summary(CF_____post_validation_of_table_with_preprocessed_data)[
    , c("name", "items", "passes", "fails", "nNA", "error", "warning")
    ], 
  caption = paste0(
    "Table 6.2.3-1: ",
    "The results of post validation for the table with the preprocessed data."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

Table 6.2.3-1: The results of post validation for the table with the preprocessed data.
name	items	passes	error	warning
all_and_only_the_required_variables_are_included	9	9	FALSE	FALSE
all_observations_were_transfered	1	1	FALSE	FALSE
REFNUM_is_integer	1	1	FALSE	FALSE
BGN_DATE_is_character	1	1	FALSE	FALSE
EVTYPE_is_character	1	1	FALSE	FALSE
FATALITIES_is_integer	1	1	FALSE	FALSE
INJURIES_is_integer	1	1	FALSE	FALSE
PROPDMG_is_numeric	1	1	FALSE	FALSE
PROPDMGEXP_is_character	1	1	FALSE	FALSE
CROPDMG_is_numeric	1	1	FALSE	FALSE
CROPDMGEXP_is_character	1	1	FALSE	FALSE
no_missing_values_introduced	1	1	FALSE	FALSE
REFNUM_is_the_key_of_the_table	1	1	FALSE	FALSE

6.2.4 Overview of the table with the preprocessed data

The table with the preprocessed data contained 9 variables and 902297 observations.

The variable REFNUM was set as the key of the table.

# Print the structure of the table with the preprocessed data.
str(preprocessed_data)

## Classes 'data.table' and 'data.frame':   902297 obs. of  9 variables:
##  $ REFNUM    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES: int  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : int  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

6.3 Extract The Target Data Subset

Summary

From all the observations that the table with the preprocessed data contains, only the subset of the weather phenomena that began in 2001 or later and resulted in non-zero harm either to population health (caused fatalities or injuries) or to economy (caused property damage or crop damage) was used for this analysis (for the reasons that were discussed in detail at the section 4.2 Points Of Interest about the Storm Events Dataset).

First, the consistency of the format of dates at the BGN_DATE variable (which indicates when each weather phenomenon began) was checked, as the variable was intended to be used for the identification. Then the eligible observations for the target data subset were identified and extracted to create the table with the target data subset.

Finally post validation was conducted and some facts about the table with the target data subset were highlighted.

Steps

6.3.1 Identify the target subset of observations
- Verifies prerequisites and identifies the eligible observations for the table with the target data subset:
  - 6.3.1.1 Verify the consistency of date format
    - Verifies that the character string format of the values at BGN_DATE variable is consistent.
  - 6.3.1.2 Identify the eligible observations
    - Identifies the eligible observations for the table with the target data subset by their key value.
6.3.2 Create the table with the target data subset
- Creates the table with the target data subset by identifying and extracting the eligible observations by their key values.
6.3.3 Conduct post validation for the table with the target data subset
- Ensures that the target data was extracted correctly.
6.3.4 Overview of the table with the target data subset
- Presents some basic facts about the table with the target data subset.

6.3.1 Identify the target subset of observations

Out of all available observations, it was decided to use for the analysis only the subset of the weather phenomena that began in 2001 or later (due to the implications of changes in the composition of weather event types) and resulted in non-zero harm either to population health (caused fatalities or injuries) or to economy (caused property damage or crop damage) (due to the implications of the eligibility criteria for inclusion of weather events in the dataset).

The format of the date values at BGN_DATE variable from the table with preprocessed data had to be checked for consistency across all observations, before it was used to form the first of the two constraints.

The eligible observations were finally identified by their key value (denoted by the variable REFNUM).

6.3.1.1 Verify the consistency of date format
- Verifies that the character string format of the values at BGN_DATE variable is consistent.
6.3.1.2 Identify the eligible observations
- Identifies by their key value the observations at the table with preprocessed data that began in 2001 or later and resulted in non-zero harm either to population health (caused fatalities or injuries) or to economy (caused property damage or crop damage)

6.3.1.1 Verify the consistency of date format

The year in the value of date at the variable BGN_DATE was intended to be used as one of the two criteria to identify the eligible observations for the target data subset at the next subsubsection.

That is why it is crucial at this point to verify that the date values are in the expected format, which, as indicated by the overview of the table with preprocessed data (as well as some interactive examination), seems to be:

M/D/YYYY 0:00:00
- M stands for the 1 or 2 digits of the month
- D stands for the 1 or 2 digits of the day
- YYYY stands for the 4 digits of the year
- the value of year is followed by a space
- 0:00:00 is a dummy part that stands for the time
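
The behaviour of such a format check can be tried out on a few sample strings. In the sketch below the vector `dates` is hypothetical (its first element has the form seen in the overview of the preprocessed data); the pattern is the one used in the validator of this subsubsection.

```r
# Hypothetical date strings: the first two follow the expected format,
# the last two deviate from it (ISO format, 2-digit year).
dates <- c(
  "4/18/1950 0:00:00",
  "12/31/2011 0:00:00",
  "2011-04-27",
  "4/18/50 0:00:00"
)

# 1-2 digits / 1-2 digits / 4 digits, a space and the dummy time part.
pattern <- "^\\d{1,2}/\\d{1,2}/\\d{4} 0:00:00$"

has_expected_format <- grepl(pattern, dates)
print(data.frame(dates, has_expected_format))
```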

# Create a validator for the format 
# of the character string values of the dates.
V____expected_character_string_format_for_begin_date <- validator(
  "expected_character_string_format_of_date" = 
    grepl("^\\d{1,2}/\\d{1,2}/\\d{4} 0:00:00$", BGN_DATE)
)

# Confront the table with the preprocessed data with the validator 
# for the format of the character string values of dates.
CF____expected_character_string_format_for_begin_date <- confront(
  dat = preprocessed_data,
  V____expected_character_string_format_for_begin_date
)

Indeed all values for dates were found to be in the expected format.

# Create a kable to present the results of the confrontation 
# for the format of the character string values of dates.
kable(
  x = summary(CF____expected_character_string_format_for_begin_date)[
    , c("name", "items", "passes", "fails", "nNA", "error", "warning")
    ],
  caption = paste0(
    "Table 6.3.1.1-1: ",
    "The results of the validation ",
    "for the format of the character string values of dates ", 
    "from the variable BGN_DATE at the table with preprocessed data." 
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

Table 6.3.1.1-1: The results of the validation for the format of the character string values of dates from the variable BGN_DATE at the table with preprocessed data.
name	items	passes	fails	nNA	error	warning
expected_character_string_format_of_date	902297	902297	0	0	FALSE	FALSE

6.3.1.2 Identify the eligible observations

According to the discussion for the two points of interest for the Storm Events Dataset only a subset of observations will be used for this analysis. This target data subset includes only the observations which refer to weather phenomena that simultaneously satisfy the following two criteria:

began in 2001 or later, due to the implications of changes in the composition of weather event types
- the year (extracted from the date value of the BGN_DATE variable and coerced to integer) must be equal to or larger than 2001
resulted in non-zero harm either to population health (caused fatalities or injuries) or to economy (caused property damage or crop damage), due to the implications of the eligibility criteria for inclusion of weather events in the dataset
- the value of at least one of the variables FATALITIES, INJURIES, PROPDMG and CROPDMG must be positive
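
The two criteria can be sketched in base R on a toy table. The table `toy` and all its values are hypothetical, and the year is extracted here with `sub()` and a capture group, which is only one of several ways to do it.

```r
# Toy preprocessed data (hypothetical values).
toy <- data.frame(
  BGN_DATE   = c("4/18/1950 0:00:00", "5/22/2011 0:00:00",
                 "1/1/2001 0:00:00", "7/4/2005 0:00:00"),
  FATALITIES = c(0L, 1L, 0L, 0L),
  INJURIES   = c(15L, 0L, 0L, 0L),
  PROPDMG    = c(25, 0, 0, 2.5),
  CROPDMG    = c(0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Criterion 1: the year of the begin date is 2001 or later.
year <- as.integer(
  sub("^\\d{1,2}/\\d{1,2}/(\\d{4}) .*$", "\\1", toy$BGN_DATE)
)
from_2001_and_later <- year >= 2001L

# Criterion 2: at least one of the four harm variables is positive.
non_zero_harm <- toy$FATALITIES > 0 | toy$INJURIES > 0 |
  toy$PROPDMG > 0 | toy$CROPDMG > 0

# An observation is eligible only if it satisfies both criteria.
eligible <- from_2001_and_later & non_zero_harm
print(cbind(toy["BGN_DATE"], from_2001_and_later, non_zero_harm, eligible))
```

The first toy observation caused harm but began before 2001, and the third began in 2001 but caused no harm, so neither is eligible.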

# Create a validator with the eligibility criteria 
# for the inclusion of observations at the target data subset. 
V____criteria_for_target_data_subset_of_observations <- validator(
  "begin_date_from_2001_and_later" = (
    as.integer(
      str_extract(
        string = BGN_DATE,
        pattern = "(?<=^\\d{1,2}/\\d{1,2}/)\\d{4}"
      )
    ) %in%
      c(2001:2011)
  ),
  "non_zero_damage_to_population_health_or_economy" = (
    (as.integer(FATALITIES) > 0) |
      (as.integer(INJURIES) > 0) |
      (as.double(PROPDMG) > 0) |
      (as.double(CROPDMG) > 0)
  )
)

# Confront the table with the preprocessed data with the validator 
# with the eligibility criteria for the inclusion of observations 
# at the target data subset.
CF____criteria_for_target_data_subset_of_observations <- confront(
  dat = preprocessed_data,
  V____criteria_for_target_data_subset_of_observations
)

Out of the 902297 observations of the table with preprocessed data:

488692 observations refer to weather phenomena that began in 2001 or later
254633 observations refer to weather phenomena that resulted in non-zero harm either to population health (caused fatalities or injuries) or to economy (caused property damage or crop damage)

# Create a kable to present the results of the test 
# for the inclusion of observations at the target data subset.
kable( 
  x = summary(CF____criteria_for_target_data_subset_of_observations)[
    , c("name", "items", "passes", "fails", "nNA", "error", "warning")
    ],
  caption = paste0(
    "Table 6.3.1.2-1: ",
    "The results for the eligibility criteria for inclusion of observations ",
    "from the table with the preprocessed data in the target data subset."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

Table 6.3.1.2-1: The results for the eligibility criteria for inclusion of observations from the table with the preprocessed data in the target data subset.
name	items	passes	fails	nNA	error	warning
begin_date_from_2001_and_later	902297	488692	413605	0	FALSE	FALSE
non_zero_damage_to_population_health_or_economy	902297	254633	647664	0	FALSE	FALSE

The observations that satisfied both criteria for inclusion in the target data subset were identified by their key value (denoted by the variable REFNUM).

# Identify the observations eligible to be included in the target data subset 
# by their key value denoted by the variable REFNUM.
criterion_by_REFNUM_____eligible_observations_for_the_target_data_subset <- with(
  data = CF____criteria_for_target_data_subset_of_observations[["._value"]],
  expr = preprocessed_data[
    begin_date_from_2001_and_later &
      non_zero_damage_to_population_health_or_economy,
    REFNUM
    ]
)

Exactly 144826 observations were found eligible to be included in the table with the target data subset.

# Create a table that presents the number of observations 
# that were found eligible to be included in the target data subset.
kable(
  x = data.table(
    "Number of Eligible Observations for the Target Data Subset" = 
      length(
        criterion_by_REFNUM_____eligible_observations_for_the_target_data_subset
      )
  ),
  caption = paste0(
    "Table 6.3.1.2-2: ",
    "The number of observations that were found eligible ", "\n",
    "to get included in the table with the target data subset."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

Table 6.3.1.2-2: The number of observations that were found eligible to get included in the table with the target data subset.
Number of Eligible Observations for the Target Data Subset
144826

6.3.2 Create the table with the target data subset

From the table with the preprocessed data, the table with the target data subset was created by including only those observations that simultaneously satisfied two criteria:

began in 2001 or later (due to the implications of changes in the composition of weather event types)
resulted in non-zero harm either to population health (caused fatalities or injuries) or to economy (caused property damage or crop damage) (due to the implications of the eligibility criteria for inclusion of weather events in the dataset)

The observations were identified and extracted by their key value (denoted by the variable REFNUM).

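# Create the table with the target data subset 
# by including only the observations that were found eligible 
# from those included at the table with the preprocessed data.
# The REFNUM values are wrapped in .() so that they are joined on the key; 
# a bare numeric vector in i would be interpreted as row numbers.
target_data_subset <- preprocessed_data[
  .(criterion_by_REFNUM_____eligible_observations_for_the_target_data_subset)
  ]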
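
When a keyed data.table is subset with numbers, it is worth keeping in mind that a numeric vector in `i` selects rows by position, while `.()` requests a join on the key. The two coincide only when the key values happen to equal the row numbers. The sketch below shows the difference on a hypothetical table `toy`, whose key values deliberately differ from the row numbers.

```r
library(data.table)

# Toy keyed table (hypothetical values); the key values are not 1, 2, 3.
toy <- data.table(
  REFNUM = c(10L, 20L, 30L),
  EVTYPE = c("TORNADO", "HAIL", "FLOOD")
)
setkey(toy, REFNUM)

# A numeric i selects by row number: this is the 2nd row (REFNUM 20).
by_position <- toy[2L]

# .() joins on the key: these are the rows with REFNUM 30 and 10.
by_key <- toy[.(c(30L, 10L))]

print(by_position)
print(by_key)
```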

6.3.3 Conduct post validation for the table with the target data subset

Post validation was conducted to verify that every observation in the table with the target data subset was eligible.

The same constraints that had identified the eligible observations in the table with the preprocessed data were applied to the table with the target data subset.

# The table with the target data subset was post validated 
# to verify that it contained only eligible observations 
# from the table with the preprocessed data.
CF____post_validation_of_target_data_subset_table <- confront(
  dat = target_data_subset,
  # The validator that was created and used to identify 
  # the eligible observations for the target data subset 
  # was used to ensure the validity of the table with the target data subset.
  V____criteria_for_target_data_subset_of_observations
)

All observations contained at the table with the target data subset were eligible.

# Create a kable to present the results of post validation 
# for the table with the target data subset.
kable(
  x = summary(CF____post_validation_of_target_data_subset_table)[
    , c("name", "items", "passes", "fails", "nNA", "error", "warning")
    ],
  caption = paste0(
    "Table 6.3.3-1: ",
    "The results of the post validation ", 
    "from the table with the target data subset."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
    general = paste0(
      "The same constrains that were used to identify the eligible observations ", 
      "from the table with the preprocessed data, ", "\n",
      "were used for the post validation of the observations ", 
      "at the table with the target data subset."
    )
  )

Table 6.3.3-1: The results of the post validation from the table with the target data subset.
name	items	passes	fails	nNA	error	warning
begin_date_from_2001_and_later	144826	144826	0	0	FALSE	FALSE
non_zero_damage_to_population_health_or_economy	144826	144826	0	0	FALSE	FALSE
Note:
The same constrains that were used to identify the eligible observations from the table with the preprocessed data, were used for the post validation of the observations at the table with the target data subset.


6.3.4 Overview of the table with the target data subset

The table with the target data subset contained 9 variables and 144826 observations.

The variable REFNUM was set as the key of the table.

# Print the structure of the table with target data subset.
str(target_data_subset)

## Classes 'data.table' and 'data.frame':   144826 obs. of  9 variables:
##  $ REFNUM    : int  413607 413608 413609 413610 413611 413612 413613 413614 413615 413616 ...
##  $ BGN_DATE  : chr  "1/19/2001 0:00:00" "1/19/2001 0:00:00" "1/19/2001 0:00:00" "1/19/2001 0:00:00" ...
##  $ EVTYPE    : chr  "TSTM WIND" "TSTM WIND" "TSTM WIND" "TSTM WIND" ...
##  $ FATALITIES: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ INJURIES  : int  0 0 0 0 0 0 0 4 0 0 ...
##  $ PROPDMG   : num  10 8 2 15 5 3 10 450 150 3 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

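The variables EVTYPE, PROPDMGEXP and CROPDMGEXP contained more distinct values than the Storm Data Documentation permits: 97, 4 and 4 respectively (see the table that follows), against 48, 3 and 3 permitted values.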

# Create a kable to highlight some facts about the target data subset table.
kable(
  x = data.table(
    "Variable Name" = names(target_data_subset),
    "Number of Distinct Values" = vapply(
      X = target_data_subset, 
      FUN = function(x) length(unique(x[!is.na(x)])), 
      FUN.VALUE = integer(1))
  ),
  caption = paste0(
    "Table 6.3.4: ",
    "Facts about the variables at the table with the target data subset."
  )
) %>% 
  kable_styling(
    bootstrap_options = c(
      "striped", "hover", "condensed", "responsive", "bordered"
    ), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
    general = paste0(
      "The table with the target data subset contains 9 variables ", "\n",
      "and 144826 observations without any missing value (coded as NA)."
    )
  )

Table 6.3.4: Facts about the variables at the table with the target data subset.
Variable Name	Number of Distinct Values
REFNUM	144826
BGN_DATE	3746
EVTYPE	97
FATALITIES	31
INJURIES	101
PROPDMG	1162
PROPDMGEXP	4
CROPDMG	269
CROPDMGEXP	4
Note:
The table with the target data subset contains 9 variables and 144826 observations without any missing value (coded as NA).
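
The per-variable count of distinct values used for the table above can be sketched in base R as follows. The toy table is hypothetical. Note that an empty string is a value like any other and is counted, whereas NA is excluded.

```r
# Hypothetical toy table; NA marks a missing entry, "" an empty entry.
toy <- data.frame(
  EVTYPE     = c("HAIL", "TSTM WIND", "HAIL", NA),
  PROPDMGEXP = c("K", "K", "M", ""),
  stringsAsFactors = FALSE
)

# Count the distinct values of each column, ignoring NAs.
# vapply() enforces that every column yields exactly one integer.
n_distinct <- vapply(
  X = toy,
  FUN = function(x) length(unique(x[!is.na(x)])),
  FUN.VALUE = integer(1)
)
n_distinct
```

Here EVTYPE yields 2 distinct values and PROPDMGEXP yields 3, because the empty string is counted alongside K and M.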


6.4 Conduct In-Record Data Validation

Summary

In the in-record data validation stage, the values of each variable in the table with the target data subset were examined independently of the values of the other variables. Invalid entries were identified and replaced by missing values (coded as NA) to create the table with the in-record validated data.

Finally, post validation was conducted and some facts about the table with the in-record validated data were highlighted.

Steps

6.4.1 Introduce information from the Storm Data Documentation
- Creates constants with the permitted values of some variables, which are used to form the validity constraints.
  - 6.4.1.1 Valid values for the EVTYPE variable
    - The valid values for EVTYPE variable were introduced.
  - 6.4.1.2 Valid values for the PROPDMGEXP variable
    - The valid values for PROPDMGEXP variable were introduced.
  - 6.4.1.3 Valid values for the CROPDMGEXP variable
    - The valid values for CROPDMGEXP variable were introduced.
6.4.2 Conduct in-record data validation for each variable
- Identifies the invalid values for each variable.
6.4.3 Create the table with the in-record validated data
- Creates the table with the in-record validated data by substituting NAs for all values that were identified as invalid.
6.4.4 Conduct post validation for the table with the in-record validated data
- Ensures that all values of each variable at the table with the in-record validated data are valid.
6.4.5 Overview of the table with the in-record validated data
- Presents some basic facts about the table with the in-record validated data.


6.4.1 Introduce information from the Storm Data Documentation

Constants holding the valid values for the variables EVTYPE, PROPDMGEXP and CROPDMGEXP (as stated in the Storm Data Documentation) were created and used to form the respective constraints.

6.4.1.1 Valid values for the EVTYPE variable
- The 48 valid values for EVTYPE variable were introduced.
6.4.1.2 Valid values for the PROPDMGEXP variable
- The 3 valid values for PROPDMGEXP variable were introduced.
6.4.1.3 Valid values for the CROPDMGEXP variable
- The 3 valid values for CROPDMGEXP variable were introduced.


6.4.1.1 Valid values for the EVTYPE variable

The entries of the variable EVTYPE according to the NATIONAL WEATHER SERVICE INSTRUCTION 10-1605, AUGUST 17, 2007 (at chapter 7), must take one of the 48 character values that correspond to the defined weather event types:

ASTRONOMICAL LOW TIDE
AVALANCHE
BLIZZARD
COASTAL FLOOD
COLD/WIND CHILL
DEBRIS FLOW
DENSE FOG
DENSE SMOKE
DROUGHT
DUST DEVIL
DUST STORM
EXCESSIVE HEAT
EXTREME COLD/WIND CHILL
FLASH FLOOD
FLOOD
FROST/FREEZE

FUNNEL CLOUD
FREEZING FOG
HAIL
HEAT
HEAVY RAIN
HEAVY SNOW
HIGH SURF
HIGH WIND
HURRICANE/TYPHOON
ICE STORM
LAKE-EFFECT SNOW
LAKESHORE FLOOD
LIGHTNING
MARINE HAIL
MARINE HIGH WIND
MARINE STRONG WIND

MARINE THUNDERSTORM WIND
RIP CURRENT
SEICHE
SLEET
STORM SURGE/TIDE
STRONG WIND
THUNDERSTORM WIND
TORNADO
TROPICAL DEPRESSION
TROPICAL STORM
TSUNAMI
VOLCANIC ASH
WATERSPOUT
WILDFIRE
WINTER STORM
WINTER WEATHER

# Create a vector that includes the 48 values of the defined weather event types.
defined_event_types <- c(
  "ASTRONOMICAL LOW TIDE", 
  "AVALANCHE", 
  "BLIZZARD", 
  "COASTAL FLOOD", 
  "COLD/WIND CHILL", 
  "DEBRIS FLOW", 
  "DENSE FOG", 
  "DENSE SMOKE", 
  "DROUGHT", 
  "DUST DEVIL", 
  "DUST STORM",
  "EXCESSIVE HEAT",
  "EXTREME COLD/WIND CHILL", 
  "FLASH FLOOD", 
  "FLOOD", 
  "FROST/FREEZE", 
  "FUNNEL CLOUD", 
  "FREEZING FOG", 
  "HAIL", 
  "HEAT",
  "HEAVY RAIN", 
  "HEAVY SNOW", 
  "HIGH SURF", 
  "HIGH WIND", 
  "HURRICANE/TYPHOON", 
  "ICE STORM", 
  "LAKE-EFFECT SNOW", 
  "LAKESHORE FLOOD", 
  "LIGHTNING",
  "MARINE HAIL", 
  "MARINE HIGH WIND",
  "MARINE STRONG WIND", 
  "MARINE THUNDERSTORM WIND", 
  "RIP CURRENT", 
  "SEICHE", 
  "SLEET", 
  "STORM SURGE/TIDE",
  "STRONG WIND", 
  "THUNDERSTORM WIND",
  "TORNADO", 
  "TROPICAL DEPRESSION", 
  "TROPICAL STORM", 
  "TSUNAMI", 
  "VOLCANIC ASH", 
  "WATERSPOUT", 
  "WILDFIRE",
  "WINTER STORM", 
  "WINTER WEATHER"
)
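
A vector of permitted values like the one above is typically used for a membership test. The following base R sketch uses a shortened, illustrative subset of the event types and invented recorded values; the names `permitted`, `recorded` and `is_valid` are hypothetical.

```r
# A shortened, illustrative subset of the permitted event types.
permitted <- c("HAIL", "RIP CURRENT", "THUNDERSTORM WIND", "TORNADO")

# Recorded values, some of which deviate from the permitted spelling.
recorded <- c("HAIL", "TSTM WIND", "RIP CURRENTS", "TORNADO", "tornado")

# Base R's %in% performs an exact, case-sensitive match.
is_valid <- recorded %in% permitted

# The entries that would be flagged as invalid.
recorded[!is_valid]
```

Because the match is exact, abbreviations ("TSTM WIND"), plural forms ("RIP CURRENTS") and lower-case spellings ("tornado") are all flagged.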


6.4.1.2 Valid values for the PROPDMGEXP variable

The variable PROPDMGEXP indicates whether the magnitude of the property damage (denoted by the PROPDMG variable) refers to thousands, millions or billions of dollars. According to NATIONAL WEATHER SERVICE INSTRUCTION 10-1605, AUGUST 17, 2007 (at chapter 2.7), its entries must take one of the following 3 character values:

K which corresponds to thousands of dollars
M which corresponds to millions of dollars
B which corresponds to billions of dollars

# Create a vector that includes the defined values for the variable PROPDMGEXP.
defined_property_damage_exponents <- c("K", "M", "B")


6.4.1.3 Valid values for the CROPDMGEXP variable

The variable CROPDMGEXP indicates whether the magnitude of the crop damage (denoted by the CROPDMG variable) refers to thousands, millions or billions of dollars. According to NATIONAL WEATHER SERVICE INSTRUCTION 10-1605, AUGUST 17, 2007 (at chapter 2.7), its entries must take one of the following 3 character values:

K which corresponds to thousands of dollars
M which corresponds to millions of dollars
B which corresponds to billions of dollars

# Create a vector that includes the defined values for the variable CROPDMGEXP.
defined_crop_damage_exponents <- c("K", "M", "B")
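
To make the meaning of the three magnitude codes concrete, the sketch below scales a damage figure to dollars. It is an illustration only, with invented figures and hypothetical names (`multiplier`, `damage`, `exponent`); it assumes the usual reading of K, M and B given above and is not taken from the report's own code.

```r
# Multipliers implied by the three documented magnitude codes.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical damage figures and their magnitude codes.
damage   <- c(10, 2.5, 1.2)
exponent <- c("K", "M", "B")

# Look up each code by name and scale the figure to dollars.
damage_in_dollars <- damage * unname(multiplier[exponent])
damage_in_dollars
```

A figure of 10 with code K therefore stands for 10,000 dollars, and 2.5 with code M for 2,500,000 dollars. A code that is not among the three names would yield NA in this lookup.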


6.4.2 Conduct in-record data validation for each variable

The constraints for the in-record validation of each variable in the table with the target data subset combined some ‘common world knowledge’ with the information that the Storm Data Documentation provides about the valid values of the variables.

Specifically:

The REFNUM variable’s values must be unique for each observation.
The BGN_DATE variable’s values must contain a year part from 2001 up to 2011.
The EVTYPE variable’s values must be one of the 48 defined event types.
The FATALITIES variable’s values must be non-negative.
The INJURIES variable’s values must be non-negative.
The PROPDMG variable’s values must be non-negative.
The PROPDMGEXP variable’s values must be K, M or B.
The CROPDMG variable’s values must be non-negative.
The CROPDMGEXP variable’s values must be K, M or B.

It was not strictly necessary to test every constraint here. Some had already been verified in previous stages of the data processing procedure: the uniqueness of the values of the key variable REFNUM, the year in the BGN_DATE variable lying between 2001 and 2011, and the non-negativity of the variables FATALITIES, INJURIES, PROPDMG and CROPDMG. These tests were nevertheless included in order to provide a detailed and complete overview of all the in-record constraints on the entries of each variable in the validated data table.

Only the variables EVTYPE, PROPDMGEXP and CROPDMGEXP actually needed validation at this stage, as they were the ones that had not been checked properly yet.

# Create a validator with constrains for the in-record validation 
# for the values of each variable.
V____constrains_for_the_in_record_data_validation <- validator(
  "REFNUM" = ( REFNUM %in% names(table(REFNUM))[table(REFNUM) == 1] ),
  "BGN_DATE" = ( 
    as.integer(
      str_extract(BGN_DATE, "(?<=^\\d{1,2}/\\d{1,2}/)\\d{4}(?= 0:00:00$)")
    ) %in%
      c(2001:2011)
  ),
  "EVTYPE" = ( EVTYPE %in% defined_event_types ),
  "FATALITIES" = ( FATALITIES >= 0 ),
  "INJURIES" = ( INJURIES >= 0 ),
  "PROPDMG" = ( PROPDMG >= 0 ),
  "PROPDMGEXP" = ( PROPDMGEXP %in% defined_property_damage_exponents ),
  "CROPDMG" = ( CROPDMG >= 0 ),
  "CROPDMGEXP" = ( CROPDMGEXP %in% defined_crop_damage_exponents )
)

# Confront the table with target data subset with the validator 
# which contains the constrains for the in-record data validation. 
CF____constrains_for_the_in_record_data_validation <- confront(
  dat = target_data_subset,
  V____constrains_for_the_in_record_data_validation
)
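
The BGN_DATE rule above extracts the year with a look-around pattern through stringr. The same idea can be sketched in base R with a capture group instead; the sample dates and the names `dates`, `year` and `in_range` are hypothetical.

```r
# Sample dates in the "month/day/year 0:00:00" layout used by BGN_DATE.
dates <- c("1/19/2001 0:00:00", "12/31/2011 0:00:00", "6/1/1998 0:00:00")

# Capture the four-digit year that follows "month/day/" and precedes " 0:00:00".
year <- as.integer(
  sub("^\\d{1,2}/\\d{1,2}/(\\d{4}) 0:00:00$", "\\1", dates, perl = TRUE)
)

# The constraint: the year must lie between 2001 and 2011.
in_range <- year %in% 2001:2011
```

If a date does not follow the expected layout, `sub()` returns it unchanged and `as.integer()` then yields NA with a warning, so malformed dates do not pass the range check.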

According to the results of the in-record data validation, a substantial number of invalid values were found in the variables EVTYPE (32775 of 144826, about 22.6%), PROPDMGEXP (4158, about 2.9%) and CROPDMGEXP (55041, about 38.0%).

# Create a kable to present the results of the in-record data validation 
# with the table with target data subset.
kable(
  x = summary(CF____constrains_for_the_in_record_data_validation)[
    , c("name", "items", "passes", "fails", "nNA", "error", "warning")
    ],
  caption = paste0(
    "Table 6.4.2-1: ",
    "The results of the in-record data validation ",
    "for the table with the target data subset."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

Table 6.4.2-1: The results of the in-record data validation for the table with the target data subset.
name	items	passes	fails	nNA	error	warning
REFNUM	144826	144826	0	0	FALSE	FALSE
BGN_DATE	144826	144826	0	0	FALSE	FALSE
EVTYPE	144826	112051	32775	0	FALSE	FALSE
FATALITIES	144826	144826	0	0	FALSE	FALSE
INJURIES	144826	144826	0	0	FALSE	FALSE
PROPDMG	144826	144826	0	0	FALSE	FALSE
PROPDMGEXP	144826	140668	4158	0	FALSE	FALSE
CROPDMG	144826	144826	0	0	FALSE	FALSE
CROPDMGEXP	144826	89785	55041	0	FALSE	FALSE

The invalid values for each variable were identified by their key value (denoted by the variable REFNUM).

# Identify the values that were found invalid for each variable 
# by the key value. 
criterion_by_REFNUM_____invalid_values_of_each_variable <- 
  lapply(
    X = CF____constrains_for_the_in_record_data_validation[["._value"]],
    FUN = function(x) {
      target_data_subset[!x, REFNUM]
    }
  )


6.4.3 Create the table with the in-record validated data

The table with the in-record validated data was obtained by substituting NAs for all values of the table with the target data subset that were identified as invalid.

The invalid values of each variable were located and substituted by their key value (denoted by the variable REFNUM).

# Create a working copy of the table with the target data subset.
in_record_validated_data <- copy(target_data_subset)

# Create the table with the in-record validated data
# by substituting NAs for the invalid values of each variable.
for ( var_name in names(criterion_by_REFNUM_____invalid_values_of_each_variable) ) {
  set(x = in_record_validated_data,
      i = which(
        in_record_validated_data$REFNUM %in%
          criterion_by_REFNUM_____invalid_values_of_each_variable[[var_name]]
      ),
      j = var_name,
      value = NA
  )
}
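
The substitution loop above can be mirrored in base R on a toy table. The data, the keys in `invalid_keys` and the other names are hypothetical; the sketch only illustrates the by-key replacement, one variable at a time, on a copy of the input.

```r
# Toy stand-in for the target data subset (values invented for illustration).
toy <- data.frame(
  REFNUM     = c(1L, 2L, 3L),
  EVTYPE     = c("HAIL", "TSTM WIND", "TORNADO"),
  PROPDMGEXP = c("K", "", "M"),
  stringsAsFactors = FALSE
)

# Keys of the invalid entries, one list element per variable.
invalid_keys <- list(EVTYPE = 2L, PROPDMGEXP = 2L)

# Work on a copy so that the original table stays intact.
validated <- toy
for (var_name in names(invalid_keys)) {
  rows <- validated$REFNUM %in% invalid_keys[[var_name]]
  validated[rows, var_name] <- NA
}
validated
```

Keeping the original table untouched matters for the later imputation stage, which needs the original invalid values to decide on a substitution.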


6.4.4 Conduct post validation for the table with the in-record validated data

Post validation was conducted to verify that the values of the variables in the table with the in-record validated data were valid according to the same constraints that had been used to identify the invalid values of each variable in the table with the target data subset.

# The table with the in-record validated data was post validated to verify
# that all values of each variable it contained were valid.
CF____post_validation_of_the_table_with_the_in_record_validated_data <- 
  confront(
    dat = in_record_validated_data,
    V____constrains_for_the_in_record_data_validation
)

No value of any variable in the table with the in-record validated data failed its constraint; the entries that had been invalid were now counted as missing (column nNA).

# Create a kable to present the results of the post validation 
# for the table with the in-record validated data.  
kable(
  x = summary(CF____post_validation_of_the_table_with_the_in_record_validated_data)[
    , c("name", "items", "passes", "fails", "nNA", "error", "warning")
    ],
  caption = paste0(
    "Table 6.4.4-1: ",
    "The results of post validation ", 
    "for the table with the in-record validated data."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
    general = paste0(
      "The same constrains that were used to identify the invalid values ", 
      "of each variable at the table with the target data subset, ", "\n",
      "were used for the post validation of the observations ", 
      "at the table with the in-record validated data."
    )
  )

Table 6.4.4-1: The results of post validation for the table with the in-record validated data.
name	items	passes	fails	nNA	error	warning
REFNUM	144826	144826	0	0	FALSE	FALSE
BGN_DATE	144826	144826	0	0	FALSE	FALSE
EVTYPE	144826	112051	0	32775	FALSE	FALSE
FATALITIES	144826	144826	0	0	FALSE	FALSE
INJURIES	144826	144826	0	0	FALSE	FALSE
PROPDMG	144826	144826	0	0	FALSE	FALSE
PROPDMGEXP	144826	140668	0	4158	FALSE	FALSE
CROPDMG	144826	144826	0	0	FALSE	FALSE
CROPDMGEXP	144826	89785	0	55041	FALSE	FALSE
Note:
The same constrains that were used to identify the invalid values of each variable at the table with the target data subset, were used for the post validation of the observations at the table with the in-record validated data.


6.4.5 Overview of the table with the in-record validated data

The table with the in-record validated data contained 9 variables and 144826 observations.

The variable REFNUM was set as the key of the table.

# Print the structure of the table with the in-record validated data.
str(in_record_validated_data)

## Classes 'data.table' and 'data.frame':   144826 obs. of  9 variables:
##  $ REFNUM    : int  413607 413608 413609 413610 413611 413612 413613 413614 413615 413616 ...
##  $ BGN_DATE  : chr  "1/19/2001 0:00:00" "1/19/2001 0:00:00" "1/19/2001 0:00:00" "1/19/2001 0:00:00" ...
##  $ EVTYPE    : chr  NA NA NA NA ...
##  $ FATALITIES: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ INJURIES  : int  0 0 0 0 0 0 0 4 0 0 ...
##  $ PROPDMG   : num  10 8 2 15 5 3 10 450 150 3 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  NA NA NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

The in-record validation procedure introduced many missing values into the variables EVTYPE, PROPDMGEXP and CROPDMGEXP, but the number of distinct values no longer indicated any obvious abnormalities.

# Create a kable to highlight some facts 
# about the table with the in-record validated data.
kable(
  x = data.table(
    "Variable" = names(in_record_validated_data),
    "Number of Distinct Values" = vapply(
      X = in_record_validated_data, 
      FUN = function(x) length(unique(x[!is.na(x)])), 
      FUN.VALUE = integer(1)
    ),
    "Number of NAs" = vapply(
      X = in_record_validated_data, 
      FUN = function(x) sum(is.na(x)), 
      FUN.VALUE = integer(1)),
    "Percentage of NAs" = 
      vapply(
        X = in_record_validated_data, 
        FUN = function(x) mean(is.na(x)), 
        FUN.VALUE =double(1))
  ),
  caption = "Table 6.4.5-1: Facts about the table with in-record validated data."
) %>% 
  kable_styling(
    bootstrap_options = c(
      "striped", "hover", "condensed", "responsive", "bordered"
      ), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
    general = paste0(
      "The table with the in-record validated data contained 9 variables ", 
      "and 144826 observations."
    )
  )

Table 6.4.5-1: Facts about the table with in-record validated data.
Variable	Number of Distinct Values	Number of NAs	Percentage of NAs
REFNUM	144826	0	0.0000000
BGN_DATE	3746	0	0.0000000
EVTYPE	46	32775	0.2263061
FATALITIES	31	0	0.0000000
INJURIES	101	0	0.0000000
PROPDMG	1162	0	0.0000000
PROPDMGEXP	3	4158	0.0287103
CROPDMG	269	0	0.0000000
CROPDMGEXP	3	55041	0.3800492
Note:
The table with the in-record validated data contained 9 variables and 144826 observations.
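
The column labelled "Percentage of NAs" in the table above holds proportions between 0 and 1, since `mean(is.na(x))` is the number of NAs divided by the number of observations. The short sketch below reproduces the figures from the reported counts and converts them to percentages; the names `n_total`, `n_missing`, `proportion` and `percentage` are hypothetical.

```r
# Counts reported in the table above.
n_total   <- 144826L
n_missing <- c(EVTYPE = 32775L, PROPDMGEXP = 4158L, CROPDMGEXP = 55041L)

# mean(is.na(x)) equals n_missing / n_total, a proportion in [0, 1].
proportion <- n_missing / n_total

# Multiply by 100 to express the same figures as percentages.
percentage <- round(100 * proportion, 2)
percentage
```

Expressed as percentages, about 22.63% of the EVTYPE entries, 2.87% of the PROPDMGEXP entries and 38.00% of the CROPDMGEXP entries are missing.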


6.5 Impute Missing Values

Summary

In this stage of the data processing procedure, an attempt was made to maximize the amount of information available for the analysis by imputing plausible values for some of the missing values in the table with the in-record validated data.

There were 3 variables (EVTYPE, PROPDMGEXP and CROPDMGEXP) that contained NAs, all of which were introduced through the in-record data validation stage.

A conservative deterministic approach was followed, which imputed a missing value only when there was almost no doubt about the value to impute. The majority of those entries were successfully restored in this way.

However, for the variable EVTYPE there is no guarantee that the imputed values are error-free. The associations were based on the invalid values found in the table with the target data subset (the ones that were substituted by NAs) and on the information available in NATIONAL WEATHER SERVICE INSTRUCTION 10-1605, AUGUST 17, 2007 (at chapter 7), and they were made by the analyst, who has no expertise in weather or meteorology.

On the other hand, the values imputed for the variables PROPDMGEXP and CROPDMGEXP are almost certainly correct. Even if they are not, the results of the analysis are not affected in any significant way, as they all correspond to observations that resulted in 0 property damage and 0 crop damage respectively, while the analysis focused on the weather events that caused non-zero harm.

# Create a working copy of the table with the in-record validated data,
# in which the missing values of the variables EVTYPE, PROPDMGEXP and CROPDMGEXP
# will be imputed to get the table with the imputed data.
imputed_data <- copy(in_record_validated_data)

Steps

6.5.1 Impute missing values at the variable EVTYPE
- Imputes the missing values at the variable EVTYPE with plausible substitutions based on the invalid values they replaced:
  - 6.5.1.1 Examine the invalid values from the variable EVTYPE
    - Examines the invalid values that have been substituted by NAs.
  - 6.5.1.2 Associate plausible substitutions to the invalid values from the variable EVTYPE
    - Associates the invalid values with plausible substitutions.
  - 6.5.1.3 Identify the imputable missing values at the variable EVTYPE
    - Identifies the imputable missing values by their key value, according to the associations.
  - 6.5.1.4 Substitute the imputable missing values at the variable EVTYPE
    - Substitutes the imputable missing values with valid ones.
6.5.2 Impute missing values at the variable PROPDMGEXP
- Imputes the missing values at the variable PROPDMGEXP with plausible substitutions based on the invalid values they replaced:
  - 6.5.2.1 Examine the invalid values from the variable PROPDMGEXP
    - Examines the invalid values that have been substituted by NAs.
  - 6.5.2.2 Associate plausible substitutions to the invalid values from the variable PROPDMGEXP
    - Associates the invalid values with plausible substitutions.
  - 6.5.2.3 Identify the imputable missing values at the variable PROPDMGEXP
    - Identifies the imputable missing values by their key value, according to the associations.
  - 6.5.2.4 Substitute the imputable missing values at the variable PROPDMGEXP
    - Substitutes the imputable missing values with valid ones.
6.5.3 Impute missing values at the variable CROPDMGEXP
- Imputes the missing values at the variable CROPDMGEXP with plausible substitutions based on the invalid values they replaced:
  - 6.5.3.1 Examine the invalid values from the variable CROPDMGEXP
    - Examines the invalid values that have been substituted by NAs.
  - 6.5.3.2 Associate plausible substitutions to the invalid values from the variable CROPDMGEXP
    - Associates the invalid values with plausible substitutions.
  - 6.5.3.3 Identify the imputable missing values at the variable CROPDMGEXP
    - Identifies the imputable missing values by their key value, according to the associations.
  - 6.5.3.4 Substitute the imputable missing values at the variable CROPDMGEXP
    - Substitutes the imputable missing values with valid ones.
6.5.4 Conduct post validation for the table with the imputed data
- Ensures that all values of each variable at the table with the imputed data are valid.
6.5.5 Overview of the table with the imputed data
- Presents some basic facts about the table with the imputed data.


6.5.1 Impute missing values at the variable EVTYPE

The invalid values of the variable EVTYPE in the table with the target data subset (before they were substituted by NAs at the in-record data validation stage) were examined and associated with plausible valid substitutions. The observations whose missing values corresponded to a successfully associated substitution were identified by their key values and imputed.

6.5.1.1 Examine the invalid values from the variable EVTYPE
- Examines the invalid values that have been substituted by NAs.
6.5.1.2 Associate plausible substitutions to the invalid values from the variable EVTYPE
- Associates the invalid values with plausible substitutions.
6.5.1.3 Identify the imputable missing values at the variable EVTYPE
- Identifies the imputable missing values by their key value, according to the associations.
6.5.1.4 Substitute the imputable missing values at the variable EVTYPE
- Substitutes the imputable missing values with valid ones.


6.5.1.1 Examine the invalid values from the variable EVTYPE

For the variable EVTYPE at the table with the in-record validated data, out of the total 144826 observations 32775 (22.63%) were NAs.

# Create a kable to present information on the missing values
# for the variable EVTYPE at the table with the in-record validated data.
kable(
  x = data.table(
    "Variable" = "EVTYPE",
    "Total Number of Values" = length(in_record_validated_data$EVTYPE),
    "Number of Missing Values" = sum(is.na(in_record_validated_data$EVTYPE)),
    "Percentage of Missing Values" = mean(is.na(in_record_validated_data$EVTYPE))
  ),
  caption = paste0(
    "Table 6.5.1.1-1: ",
    "Information on the missing values ", 
    "for the variable EVTYPE at the table with the target data subset."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

Table 6.5.1.1-1: Information on the missing values for the variable EVTYPE at the table with the target data subset.
Variable	Total Number of Values	Number of Missing Values	Percentage of Missing Values
EVTYPE	144826	32775	0.2263061

These 32775 missing values in the table with the in-record validated data corresponded to 51 distinct invalid entries in the table with the target data subset, before they were substituted by NAs at the in-record data validation stage.

# Create a kable to present information on the distinct invalid values 
# for the variable EVTYPE at the table with the target data subset 
# that were substituted by NAs at the in-record validation stage.
kable(
  x = target_data_subset[
    is.na(in_record_validated_data$EVTYPE), .N, by = EVTYPE][
      order(N,decreasing = TRUE)],
  col.names = c("Invalid Values", "Number of Occurrences"),
  caption = paste0(
    "Table 6.5.1.1-2: ",
    "Information on the distinct invalid values ", 
    "for the variable EVTYPE at the table with the target data subset ",
    "which got substituted by NAs at the in-record validation stage."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

Table 6.5.1.1-2: Information on the distinct invalid values for the variable EVTYPE at the table with the target data subset which got substituted by NAs at the in-record validation stage.
Invalid Values	Number of Occurrences
TSTM WIND	31453
LANDSLIDE	189
WINTER WEATHER/MIX	139
WILD/FOREST FIRE	132
RIP CURRENTS	115
URBAN/SML STREAM FLD	115
MARINE TSTM WIND	109
TSTM WIND/HAIL	108
STORM SURGE	86
HEAVY SURF/HIGH SURF	50
HURRICANE	38
LIGHT SNOW	38
FOG	32
WIND	26
EXTREME COLD	24
DRY MICROBURST	17
HEAVY SURF	12
MIXED PRECIPITATION	12
COASTAL FLOODING	11
ASTRONOMICAL HIGH TIDE	8
STRONG WINDS	6
SNOW	5
FREEZE	4
SMALL HAIL	4
GUSTY WINDS	4
MUDSLIDE	3
HIGH SEAS	3
SNOW SQUALLS	3
EXTREME WINDCHILL	3
WINTER WEATHER MIX	2
FALLING SNOW/ICE	2
ROUGH SEAS	2
LIGHT FREEZING RAIN	2
LATE SEASON SNOW	1
THUNDERSTORM	1
ROGUE WAVE	1
NON-TSTM WIND	1
NON TSTM WIND	1
OTHER	1
LAKE EFFECT SNOW	1
MUD SLIDE	1
BRUSH FIRE	1
BLOWING DUST	1
GUSTY WIND	1
HIGH WATER	1
HIGH SURF ADVISORY	1
HAZARDOUS SURF	1
COLD WEATHER	1
WHIRLWIND	1
ICE ON ROAD	1
DROWNING	1
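
A frequency listing like the one above can also be produced in base R. The sketch below tabulates a handful of invented values and orders them by decreasing frequency; `invalid_values` and `counts` are hypothetical names.

```r
# Hypothetical original values of entries that were replaced by NA.
invalid_values <- c(
  "TSTM WIND", "LANDSLIDE", "TSTM WIND", "FOG", "TSTM WIND", "LANDSLIDE"
)

# Tabulate the distinct values and order them by decreasing frequency.
counts <- sort(table(invalid_values), decreasing = TRUE)
counts
```

Sorting by frequency puts the values that account for most of the missing entries first, which is where an association to a defined event type recovers the most observations.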


6.5.1.2 Associate plausible substitutions to the invalid values from the variable EVTYPE

To impute the corresponding NAs, associations were made from the invalid entries to defined weather event types.

Some of the associations were based solely on the invalid values, which corresponded directly to defined event types because they seem to be either typos (e.g. “RIP CURRENTS” instead of “RIP CURRENT”) or abbreviations of the expected values (e.g. “TSTM WIND” instead of “THUNDERSTORM WIND”).
For the others, where a variation of a defined value had been supplied (e.g. “URBAN/SML STREAM FLD” instead of “HEAVY RAIN”), the description of each of the 48 event types in NATIONAL WEATHER SERVICE INSTRUCTION 10-1605, AUGUST 17, 2007 (at chapter 7) was taken into account.

Nevertheless, it is stressed that the associations depend mainly on ‘common sense’ judgment (rather than solid professional expertise) and are in no way guaranteed to be error-free, despite the best efforts made to impute only the most obvious cases.

In total, 28 distinct invalid values were associated with one of the 48 defined weather event types:

‘COASTAL FLOODING’ –> COASTAL FLOOD
‘COLD WEATHER’ –> COLD/WIND CHILL
‘LANDSLIDE’ –> DEBRIS FLOW
‘MUDSLIDE’ –> DEBRIS FLOW
‘MUD SLIDE’ –> DEBRIS FLOW
‘DROWNING’ –> DROUGHT
‘EXTREME COLD’ –> EXTREME COLD/WIND CHILL
‘EXTREME WINDCHILL’ –> EXTREME COLD/WIND CHILL
‘FREEZE’ –> FROST/FREEZE
‘SMALL HAIL’ –> HAIL
‘URBAN/SML STREAM FLD’ –> HEAVY RAIN
‘HEAVY SURF/HIGH SURF’ –> HIGH SURF
‘HEAVY SURF’ –> HIGH SURF
‘HAZARDOUS SURF’ –> HIGH SURF
‘ HIGH SURF ADVISORY’ –> HIGH SURF
‘HURRICANE’ –> HURRICANE/TYPHOON
‘LAKE EFFECT SNOW’ –> LAKE-EFFECT SNOW
‘MARINE TSTM WIND’ –> MARINE THUNDERSTORM WIND
‘RIP CURRENTS’ –> RIP CURRENT
‘STORM SURGE’ –> STORM SURGE/TIDE
‘STRONG WINDS’ –> STRONG WIND
‘TSTM WIND’ –> THUNDERSTORM WIND
‘DRY MICROBURST’ –> THUNDERSTORM WIND
‘THUNDERSTORM’ –> THUNDERSTORM WIND
‘WILD/FOREST FIRE’ –> WILDFIRE
‘BRUSH FIRE’ –> WILDFIRE
‘WINTER WEATHER/MIX’ –> WINTER WEATHER
‘WINTER WEATHER MIX’ –> WINTER WEATHER

(The 15th invalid value contains 3 spaces before ‘HIGH SURF ADVISORY’, but after rendering only 1 space appears, most likely because HTML collapses consecutive whitespace.)

# Create a list with the associations made from the invalid entries 
# for the variable EVTYPE at the table with the target data subset 
# to defined weather event types.
associations_on_defined_event_types <- list(
  "COASTAL FLOOD" = c("COASTAL FLOODING"),
  "COLD/WIND CHILL" = c("COLD WEATHER"),
  "DEBRIS FLOW" = c("LANDSLIDE", "MUDSLIDE", "MUD SLIDE"),
  "DROUGHT" = c("DROWNING"),
  "EXTREME COLD/WIND CHILL" = c("EXTREME COLD", "EXTREME WINDCHILL"),
  "FROST/FREEZE" = c("FREEZE"),
  "HAIL" = c("SMALL HAIL"),
  "HEAVY RAIN" = c("URBAN/SML STREAM FLD"),
  "HIGH SURF" = c(
    "HEAVY SURF/HIGH SURF", "HEAVY SURF", "HAZARDOUS SURF",
    "   HIGH SURF ADVISORY"),
  "HURRICANE/TYPHOON" = c("HURRICANE"),
  "LAKE-EFFECT SNOW" = c("LAKE EFFECT SNOW"),
  "MARINE THUNDERSTORM WIND" = c("MARINE TSTM WIND"),
  "RIP CURRENT" = c("RIP CURRENTS"),
  "STORM SURGE/TIDE" = c("STORM SURGE"),
  "STRONG WIND" = c("STRONG WINDS"),
  "THUNDERSTORM WIND" = c("TSTM WIND", "DRY MICROBURST", "THUNDERSTORM"),
  "WILDFIRE" = c("WILD/FOREST FIRE", "BRUSH FIRE"),
  "WINTER WEATHER" = c("WINTER WEATHER/MIX", "WINTER WEATHER MIX")
)
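One possible way to apply such a list is sketched below in base R on a reduced copy of the associations. The names `lookup_defined_event_type` and `raw_values` are hypothetical and not part of the original script, which uses a validator instead. The sketch inverts the list into a named vector that maps each invalid value to its defined event type.

```r
# Reduced copy of the association list (three of the 18 entries above).
associations <- list(
  "DEBRIS FLOW" = c("LANDSLIDE", "MUDSLIDE", "MUD SLIDE"),
  "RIP CURRENT" = c("RIP CURRENTS"),
  "THUNDERSTORM WIND" = c("TSTM WIND", "DRY MICROBURST", "THUNDERSTORM")
)

# Invert the list: the names are the invalid values,
# the elements are the defined event types.
lookup_defined_event_type <- setNames(
  rep(names(associations), times = lengths(associations)),
  unlist(associations, use.names = FALSE)
)

# Hypothetical raw entries; "FOG" has no association and therefore yields NA.
raw_values <- c("TSTM WIND", "MUD SLIDE", "FOG", "RIP CURRENTS")
substituted_values <- unname(lookup_defined_event_type[raw_values])
print(substituted_values)
```

Values without an association stay NA, which mirrors the treatment of the ambiguous entries.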

On the other hand, there were 23 distinct invalid values that could not be associated (with relatively high confidence) with any of the 48 defined event types:

‘FOG’
‘LIGHT SNOW’
‘WIND’
‘LIGHT FREEZING RAIN’
‘MIXED PRECIPITATION’
‘ASTRONOMICAL HIGH TIDE’
‘GUSTY WINDS’
‘SNOW’
‘HIGH SEAS’
‘ROUGH SEAS’
‘SNOW SQUALLS’
‘FALLING SNOW/ICE’
‘GUSTY WIND’
‘HIGH WATER’
‘OTHER’
‘BLOWING DUST’
‘ICE ON ROAD’
‘LATE SEASON SNOW’
‘NON TSTM WIND’
‘NON-TSTM WIND’
‘ROGUE WAVE’
‘WHIRLWIND’
‘TSTM WIND/HAIL’

# Create a list with the distinct invalid values of the variable EVTYPE 
# at the table with the target data subset that couldn't be safely associated 
# with any of the defined weather event types.
ambiguous_entries_on_EVTYPE <- list(
  "FOG", 
  "LIGHT SNOW", 
  "WIND", 
  "LIGHT FREEZING RAIN",
  "MIXED PRECIPITATION", 
  "ASTRONOMICAL HIGH TIDE", 
  "GUSTY WINDS", 
  "SNOW",
  "HIGH SEAS", 
  "ROUGH SEAS", 
  "SNOW SQUALLS",
  "FALLING SNOW/ICE", 
  "GUSTY WIND", 
  "HIGH WATER", 
  "OTHER", 
  "BLOWING DUST",
  "ICE ON ROAD", 
  "LATE SEASON SNOW",
  "NON TSTM WIND", 
  "NON-TSTM WIND", 
  "ROGUE WAVE",
  "WHIRLWIND", 
  "TSTM WIND/HAIL"
)
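As a consistency check (a sketch only, not part of the original script), the two groups of invalid values can be compared to confirm that they do not overlap and that together they account for all 51 distinct invalid values. The vector names `associated_values` and `ambiguous_values` are hypothetical. The values are copied from the two lists above.

```r
# The 28 invalid values that were associated with a defined event type.
associated_values <- c(
  "COASTAL FLOODING", "COLD WEATHER", "LANDSLIDE", "MUDSLIDE", "MUD SLIDE",
  "DROWNING", "EXTREME COLD", "EXTREME WINDCHILL", "FREEZE", "SMALL HAIL",
  "URBAN/SML STREAM FLD", "HEAVY SURF/HIGH SURF", "HEAVY SURF",
  "HAZARDOUS SURF", "   HIGH SURF ADVISORY", "HURRICANE", "LAKE EFFECT SNOW",
  "MARINE TSTM WIND", "RIP CURRENTS", "STORM SURGE", "STRONG WINDS",
  "TSTM WIND", "DRY MICROBURST", "THUNDERSTORM", "WILD/FOREST FIRE",
  "BRUSH FIRE", "WINTER WEATHER/MIX", "WINTER WEATHER MIX"
)

# The 23 invalid values that could not be safely associated.
ambiguous_values <- c(
  "FOG", "LIGHT SNOW", "WIND", "LIGHT FREEZING RAIN", "MIXED PRECIPITATION",
  "ASTRONOMICAL HIGH TIDE", "GUSTY WINDS", "SNOW", "HIGH SEAS", "ROUGH SEAS",
  "SNOW SQUALLS", "FALLING SNOW/ICE", "GUSTY WIND", "HIGH WATER", "OTHER",
  "BLOWING DUST", "ICE ON ROAD", "LATE SEASON SNOW", "NON TSTM WIND",
  "NON-TSTM WIND", "ROGUE WAVE", "WHIRLWIND", "TSTM WIND/HAIL"
)

n_associated <- length(associated_values)
n_ambiguous <- length(ambiguous_values)

# No value may appear in both groups or twice within a group.
n_overlap <- length(intersect(associated_values, ambiguous_values))
n_duplicated <- sum(duplicated(c(associated_values, ambiguous_values)))

print(c(n_associated, n_ambiguous, n_overlap, n_duplicated))
```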


6.5.1.3 Identify the imputable missing values at the variable EVTYPE

After the associations for the invalid entries of the variable EVTYPE had been established, the observations that contained retrievable values were identified.

# Create a validator to identify which observations for the variable EVTYPE 
# contain a missing value that corresponds to one of the invalid values 
# at the table with the target data subset that can be retrieved 
# according to the list with the associations. 
V_________identification_test_of_imputable_missing_values_for_EVTYPE <- validator(
  "COASTAL FLOOD" = EVTYPE %in% associations_on_defined_event_types[["COASTAL FLOOD"]],
  "COLD/WIND CHILL" = EVTYPE %in% associations_on_defined_event_types[["COLD/WIND CHILL"]],
  "DEBRIS FLOW" = EVTYPE %in% associations_on_defined_event_types[["DEBRIS FLOW"]],
  "DROUGHT" = EVTYPE %in% associations_on_defined_event_types[["DROUGHT"]],
  "EXTREME COLD/WIND CHILL" = EVTYPE %in% associations_on_defined_event_types[["EXTREME COLD/WIND CHILL"]],
  "FROST/FREEZE" = EVTYPE %in% associations_on_defined_event_types[["FROST/FREEZE"]],
  "HAIL" = EVTYPE %in% associations_on_defined_event_types[["HAIL"]],
  "HEAVY RAIN" = EVTYPE %in% associations_on_defined_event_types[["HEAVY RAIN"]],
  "HIGH SURF" = EVTYPE %in% associations_on_defined_event_types[["HIGH SURF"]],
  "HURRICANE/TYPHOON" = EVTYPE %in% associations_on_defined_event_types[["HURRICANE/TYPHOON"]],
  "LAKE-EFFECT SNOW" = EVTYPE %in% associations_on_defined_event_types[["LAKE-EFFECT SNOW"]],
  "MARINE THUNDERSTORM WIND" = EVTYPE %in% associations_on_defined_event_types[["MARINE THUNDERSTORM WIND"]],
  "RIP CURRENT" = EVTYPE %in% associations_on_defined_event_types[["RIP CURRENT"]],
  "STORM SURGE/TIDE" = EVTYPE %in% associations_on_defined_event_types[["STORM SURGE/TIDE"]],
  "STRONG WIND" = EVTYPE %in% associations_on_defined_event_types[["STRONG WIND"]],
  "THUNDERSTORM WIND" = EVTYPE %in% associations_on_defined_event_types[["THUNDERSTORM WIND"]],
  "WILDFIRE" = EVTYPE %in% associations_on_defined_event_types[["WILDFIRE"]],
  "WINTER WEATHER" = EVTYPE %in% associations_on_defined_event_types[["WINTER WEATHER"]]
)

# Confront the table with the target data subset with the validator with 
# the criteria for the association of invalid entries for the variable EVTYPE.
CF_________identification_test_of_imputable_missing_values_for_EVTYPE <- confront(
  dat = target_data_subset[is.na(in_record_validated_data[["EVTYPE"]])],
  V_________identification_test_of_imputable_missing_values_for_EVTYPE
)

Out of the total 32775 missing values for the variable EVTYPE at the table with the in-record validated data, 32520 (99.22%) could be imputed, while only 255 (0.78%) could not be safely associated with any of the 48 defined event types.

# Create a kable to present information 
# on the imputable and not imputable missing values at the variable EVTYPE.
kable(
  x = data.table(
    "variable" = "EVTYPE",
    "n_missing" = sum(is.na(in_record_validated_data$EVTYPE)),
    "n_imputable" = sum(
      vapply(
        X = CF_________identification_test_of_imputable_missing_values_for_EVTYPE[["._value"]],
        FUN = sum,
        FUN.VALUE = integer(1)
      )
    )
  )[,"n_not_imputable" := n_missing - n_imputable][
      , "perc_imputable" := n_imputable/n_missing][
        ,"perc_not_imputable" := n_not_imputable/n_missing], 
  col.names = c(
    "Variable",
    "Number of Missing Values",
    "Number of Imputable Missing Values",
    "Number of Not Imputable Missing Values",
    "Percentage of Imputable Missing Values",
    "Percentage of Not Imputable Missing Values"
  ),
  caption = paste0(
    "Table 6.5.1.3-1: ",
    "Information on the imputable and not imputable missing values ", 
    "at the variable EVTYPE."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

Table 6.5.1.3-1: Information on the imputable and not imputable missing values at the variable EVTYPE.
Variable	Number of Missing Values	Number of Imputable Missing Values	Number of Not Imputable Missing Values	Percentage of Imputable Missing Values	Percentage of Not Imputable Missing Values
EVTYPE	32775	32520	255	0.9922197	0.0077803

The imputable missing values were distributed, according to the associations, across 18 defined event types.

# Create a kable to present the results from the identification 
# of the imputable missing values 
# by each of the associated defined weather event types. 
kable(
  x = summary(CF_________identification_test_of_imputable_missing_values_for_EVTYPE)[
    , c("name", "items", "passes", "fails", "nNA", "error", "warning")
    ],
  caption = paste0(
    "Table 6.5.1.3-2: ",
    "Information on the number of invalid values ",
    "that can be imputed by one of the 48 defined weather event types ",
    "for the variable EVTYPE at the table with the imputed data."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
    general = paste0(
      "The subset of the 32775 observations with missing values ", "\n",
      "was used for the identification of imputable invalid values."
    )
  )

Table 6.5.1.3-2: Information on the number of invalid values that can be imputed by one of the 48 defined weather event types for the variable EVTYPE at the table with the imputed data.
name	items	passes	fails	nNA	error	warning
COASTAL.FLOOD	32775	11	32764	0	FALSE	FALSE
COLD.WIND.CHILL	32775	1	32774	0	FALSE	FALSE
DEBRIS.FLOW	32775	193	32582	0	FALSE	FALSE
DROUGHT	32775	1	32774	0	FALSE	FALSE
EXTREME.COLD.WIND.CHILL	32775	27	32748	0	FALSE	FALSE
FROST.FREEZE	32775	4	32771	0	FALSE	FALSE
HAIL	32775	4	32771	0	FALSE	FALSE
HEAVY.RAIN	32775	115	32660	0	FALSE	FALSE
HIGH.SURF	32775	64	32711	0	FALSE	FALSE
HURRICANE.TYPHOON	32775	38	32737	0	FALSE	FALSE
LAKE.EFFECT.SNOW	32775	1	32774	0	FALSE	FALSE
MARINE.THUNDERSTORM.WIND	32775	109	32666	0	FALSE	FALSE
RIP.CURRENT	32775	115	32660	0	FALSE	FALSE
STORM.SURGE.TIDE	32775	86	32689	0	FALSE	FALSE
STRONG.WIND	32775	6	32769	0	FALSE	FALSE
THUNDERSTORM.WIND	32775	31471	1304	0	FALSE	FALSE
WILDFIRE	32775	133	32642	0	FALSE	FALSE
WINTER.WEATHER	32775	141	32634	0	FALSE	FALSE
Note:
The subset of the 32775 observations with missing values was used for the identification of imputable invalid values.
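The totals reported in Table 6.5.1.3-1 can be cross-checked against the per-type counts of Table 6.5.1.3-2. The sketch below copies the `passes` column by hand (the vector name `passes` is hypothetical). It relies on each invalid value being associated with exactly one defined event type, so that no observation is counted twice.

```r
# Number of imputable observations per defined event type,
# copied from the "passes" column of Table 6.5.1.3-2.
passes <- c(
  "COASTAL FLOOD" = 11, "COLD/WIND CHILL" = 1, "DEBRIS FLOW" = 193,
  "DROUGHT" = 1, "EXTREME COLD/WIND CHILL" = 27, "FROST/FREEZE" = 4,
  "HAIL" = 4, "HEAVY RAIN" = 115, "HIGH SURF" = 64, "HURRICANE/TYPHOON" = 38,
  "LAKE-EFFECT SNOW" = 1, "MARINE THUNDERSTORM WIND" = 109,
  "RIP CURRENT" = 115, "STORM SURGE/TIDE" = 86, "STRONG WIND" = 6,
  "THUNDERSTORM WIND" = 31471, "WILDFIRE" = 133, "WINTER WEATHER" = 141
)

n_missing <- 32775
n_imputable <- sum(passes)
n_not_imputable <- n_missing - n_imputable

# Percentages rounded to two decimals, as quoted in the text.
perc_imputable <- round(100 * n_imputable / n_missing, 2)
perc_not_imputable <- round(100 * n_not_imputable / n_missing, 2)

print(c(n_imputable, n_not_imputable, perc_imputable, perc_not_imputable))
```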

The key value (denoted by the REFNUM variable) for the observations that successfully got associated with a defined weather event type was identified.

# Identify the missing values that can be imputed 
# with one of the defined event type values 
# by their key value (denoted by the variable REFNUM)
criterion_by_REFNUM_____imputable_entries_for_EVTYPE <- lapply(
  X = CF_________identification_test_of_imputable_missing_values_for_EVTYPE[["._value"]],
  FUN = function(x) in_record_validated_data[is.na(EVTYPE), REFNUM][x]
)
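The indexing used above can be illustrated with a toy example. All keys and logical vectors below are hypothetical and only stand in for the validator results: each logical vector marks which of the observations with a missing value passed the test for one event type, and it is used to pick the matching keys.

```r
# Hypothetical keys (REFNUM) of five observations with a missing EVTYPE.
refnum_of_missing <- c(101L, 102L, 103L, 104L, 105L)

# Hypothetical test results: one logical vector per defined event type,
# aligned with the keys above.
passes_by_event_type <- list(
  "RIP CURRENT" = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  "WILDFIRE" = c(FALSE, FALSE, TRUE, FALSE, FALSE)
)

# Logical indexing keeps only the keys of the observations that passed.
keys_by_event_type <- lapply(
  X = passes_by_event_type,
  FUN = function(x) refnum_of_missing[x]
)
print(keys_by_event_type)
```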


6.5.1.4 Substitute the imputable missing values at the variable EVTYPE

For each of the 28 distinct invalid values at the variable EVTYPE, the associated value was imputed at the corresponding observations, which were identified by their key value.

# Impute the retrievable missing values at the EVTYPE variable 
# with the defined values
for (i in seq_along(criterion_by_REFNUM_____imputable_entries_for_EVTYPE)) {
  set(
    x = imputed_data,
    i = which(
      imputed_data$REFNUM %in%
        criterion_by_REFNUM_____imputable_entries_for_EVTYPE[[i]]
    ),
    j = "EVTYPE",
    value = names(associations_on_defined_event_types)[i]
  )
}
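The same loop can be sketched on a toy table. The example below is only an illustration: it uses a base R data.frame instead of `data.table::set`, and the table `toy_data` and the keys are hypothetical.

```r
# Hypothetical table with five observations whose EVTYPE is missing.
toy_data <- data.frame(
  REFNUM = 101:105,
  EVTYPE = NA_character_,
  stringsAsFactors = FALSE
)

# Hypothetical keys of the imputable observations per defined event type.
keys_by_event_type <- list("RIP CURRENT" = c(101L, 104L), "WILDFIRE" = 103L)

# For each defined event type, write its name into the matching rows.
for (i in seq_along(keys_by_event_type)) {
  rows <- toy_data$REFNUM %in% keys_by_event_type[[i]]
  toy_data$EVTYPE[rows] <- names(keys_by_event_type)[i]
}
print(toy_data)
```

Rows whose key appears in none of the groups keep their NA.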


6.5.2 Impute missing values at the variable PROPDMGEXP

The invalid values for the variable PROPDMGEXP at the table with the target data subset (before they were substituted by NAs at the in-record data validation stage) were examined, and associations were made to plausible valid substitutions. The observations with missing values that corresponded to successfully associated plausible substitutions were identified by their key values and imputed.

6.5.2.1 Examine the invalid values from the variable PROPDMGEXP
- Examined the invalid values that have been substituted by NAs.
6.5.2.2 Associate plausible substitutions to the invalid values from the variable PROPDMGEXP
- Associated the invalid values with plausible substitutions.
6.5.2.3 Identify the imputable missing values at the variable PROPDMGEXP
- Identified the imputable missing values according to associations by their key value.
6.5.2.4 Substitute the imputable missing values at the variable PROPDMGEXP
- Substituted the imputable missing values with valid ones.


6.5.2.1 Examine the invalid values from the variable PROPDMGEXP

For the variable PROPDMGEXP, at the table with the in-record validated data, out of the total 144826 observations, 4158 (2.87%) were NAs.

# Create a kable to present information on the missing values 
# for the variable PROPDMGEXP at the table with the in-record validated data.
kable(
  x = data.table(
    "Variable" = "PROPDMGEXP",
    "Total Number of Values" = length(in_record_validated_data$PROPDMGEXP),
    "Number of Missing Values" = sum(is.na(in_record_validated_data$PROPDMGEXP)),
    "Percentage of Missing Values" = mean(is.na(in_record_validated_data$PROPDMGEXP))
  ),
  caption = paste0(
    "Table 6.5.2.1-1: ",
    "Information on missing values for the variable PROPDMGEXP ", 
    "at the table with the in-record validated data."
    )
  
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

Table 6.5.2.1-1: Information on missing values for the variable PROPDMGEXP at the table with the in-record validated data.
Variable	Total Number of Values	Number of Missing Values	Percentage of Missing Values
PROPDMGEXP	144826	4158	0.0287103

Those 4158 missing values at the table with the in-record validated data, corresponded to empty values at the table with the target data subset before they got substituted by NAs at the in-record data validation stage.

# Create a kable to present the distinct invalid values 
# of the variable PROPDMGEXP at the table with the target data subset 
# that were substituted by NAs at the in-record data validation stage.
kable(
  x = target_data_subset[
    REFNUM %in% in_record_validated_data[
      is.na(PROPDMGEXP), REFNUM
      ],
    list(
      "distinct_values" = PROPDMGEXP
    )
    ][
      ,
      .N,
      distinct_values
      ],
  col.names = c(
    "Distinct Values", 
    "Number of Observations"
  ),
  caption = paste0(
    "Table 6.5.2.1-2: ",
    "The distinct invalid values for the variable 'PROPDMGEXP' ", 
    "that were substituted by NAs at the in-record data validation stage."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

Table 6.5.2.1-2: The distinct invalid values for the variable ‘PROPDMGEXP’ that were substituted by NAs at the in-record data validation stage.
Distinct Values	Number of Observations
	4158


6.5.2.2 Associate plausible substitutions to the invalid values from the variable PROPDMGEXP

A single association (which works perfectly, as shown in the next subsubsection) was made for the missing values that corresponded to empty values:

The entries that correspond to property damage with zero magnitude (denoted by the value 0 at the variable PROPDMG) could be associated with any of the valid values (“K”, “M”, “B”), since a magnitude of zero yields zero damage regardless of the multiplier.
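The reasoning can be checked with a line of arithmetic. The sketch below assumes that the valid values stand for thousands, millions and billions of dollars. The vector `multiplier` is a hypothetical helper, not an object from the original script.

```r
# Assumed multipliers for the valid values of PROPDMGEXP.
multiplier <- c("K" = 1e3, "M" = 1e6, "B" = 1e9)

# A magnitude of zero gives zero dollars whichever multiplier is chosen.
property_damage_in_dollars <- 0 * multiplier
print(property_damage_in_dollars)
```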


6.5.2.3 Identify the imputable missing values at the variable PROPDMGEXP

The observations that satisfied the criterion imposed by the association made for the invalid values from the variable PROPDMGEXP were identified.

# Create a validator with the criterion to identify 
# the imputable missing values for the variable PROPDMGEXP.
V_____imputable_missing_values_at_the_variable_PROPDMGEXP <- validator(
  "imputable_missing_values_at_PROPDMGEXP" = ( 
    (PROPDMG == 0) & is.na(PROPDMGEXP) 
  )
)

# Confront the subset of observations with missing values 
# in the variable PROPDMGEXP at the in-record validated data 
# with the validator with the criterion to identify 
# the imputable missing values at the variable PROPDMGEXP. 
CF_____imputable_missing_values_at_the_variable_PROPDMGEXP <- 
  confront(
    dat = in_record_validated_data[is.na(PROPDMGEXP)],
    V_____imputable_missing_values_at_the_variable_PROPDMGEXP
  )

All missing values at the variable PROPDMGEXP (4158 in total) corresponded to observations for which the magnitude of property damage (denoted by the variable PROPDMG) was zero.

# Create a kable to present the number of imputable missing values for the 
# variable PROPDMGEXP at the table with the in-record validated data subset.
kable(
  x = summary(CF_____imputable_missing_values_at_the_variable_PROPDMGEXP)[
    , c("name", "items", "passes", "fails", "nNA", "error", "warning")
    ],
  caption = paste0(
    "Table 6.5.2.3-1: ",
    "Results from identification of imputable missing values ", 
    "at the variable PROPDMGEXP."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
    general = paste0(
      "The subset of the 4158 observations with missing values ", 
      "was used for the identification of imputable invalid values"
    )
  )

Table 6.5.2.3-1: Results from identification of imputable missing values at the variable PROPDMGEXP.
name	items	passes	fails	nNA	error	warning
imputable_missing_values_at_PROPDMGEXP	4158	4158	0	0	FALSE	FALSE
Note:
The subset of the 4158 observations with missing values was used for the identification of imputable invalid values

The key values (denoted by the variable REFNUM) of the observations for which the missing values at the variable PROPDMGEXP could be retrieved were identified.

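# Identify the observations for which the missing value 
# at the variable PROPDMGEXP can be safely imputed, 
# by their key value denoted by the variable REFNUM.
# Note: all 4158 observations passed the criterion (see Table 6.5.2.3-1), 
# so the keys of all observations with a missing PROPDMGEXP are selected.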
criterion_by_REFNUM_____imputable_missing_values_at_the_variable_PROPDMGEXP <- 
  with(
    data = CF_____imputable_missing_values_at_the_variable_PROPDMGEXP[["._value"]],
    expr = in_record_validated_data[is.na(PROPDMGEXP), REFNUM]
  )


6.5.2.4 Substitute the imputable missing values at the variable PROPDMGEXP

The value “K” was imputed to all observations with imputable missing values at the variable PROPDMGEXP (which were identified by their key value).

# Set the imputable missing values at the variable PROPDMGEXP 
# with the value "K".
set(
  x = imputed_data,
  i = which(
    imputed_data$REFNUM %in% 
      criterion_by_REFNUM_____imputable_missing_values_at_the_variable_PROPDMGEXP
  ),
  j = "PROPDMGEXP",
  value = "K"
)

(They could have been substituted by any of the valid values (“K”, “M” or “B”) for the variable PROPDMGEXP without changing the fact that they refer to $0 property damage.)


6.5.3 Impute missing values at the variable CROPDMGEXP

6.5.3.1 Examine the invalid values from the variable CROPDMGEXP
- Examined the invalid values that have been substituted by NAs.
6.5.3.2 Associate plausible substitutions to the invalid values from the variable CROPDMGEXP
- Associated the invalid values with plausible substitutions.
6.5.3.3 Identify the imputable missing values at the variable CROPDMGEXP
- Identified the imputable missing values according to associations by their key value.
6.5.3.4 Substitute the imputable missing values at the variable CROPDMGEXP
- Substituted the imputable missing values with valid ones.


6.5.3.1 Examine the invalid values from the variable CROPDMGEXP

For the variable CROPDMGEXP, at the table with the in-record validated data, out of the total 144826 observations, 55041 (38.00%) were NAs.

# Create a kable to present information on the missing values 
# for the variable CROPDMGEXP at the table with the in-record validated data.
kable(
  x = data.table(
    "Variable" = "CROPDMGEXP",
    "Total Number of Values" = length(in_record_validated_data$CROPDMGEXP),
    "Number of Missing Values" = sum(is.na(in_record_validated_data$CROPDMGEXP)),
    "Percentage of Missing Values" = mean(is.na(in_record_validated_data$CROPDMGEXP))
  ),
  caption = paste0(
    "Table 6.5.3.1-1: ",
    "Information on missing values for the variable CROPDMGEXP ", 
    "at the table with the in-record validated data."
    )
  
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

Table 6.5.3.1-1: Information on missing values for the variable CROPDMGEXP at the table with the in-record validated data.
Variable	Total Number of Values	Number of Missing Values	Percentage of Missing Values
CROPDMGEXP	144826	55041	0.3800492

Those 55041 missing values at the table with the in-record validated data, corresponded to empty values at the table with the target data subset before they got substituted by NAs at the in-record data validation stage.

# Create a kable to present the distinct invalid values 
# of the variable CROPDMGEXP at the table with the target data subset 
# that were substituted by NAs at the in-record data validation stage.
kable(
  x = target_data_subset[
    REFNUM %in% in_record_validated_data[
      is.na(CROPDMGEXP), REFNUM
      ],
    list(
      "distinct_values" = CROPDMGEXP
    )
    ][
      ,
      .N,
      distinct_values
      ],
  col.names = c(
    "Distinct Values", 
    "Number of Observations"
  ),
  caption = paste0(
    "Table 6.5.3.1-2: ",
    "The distinct invalid values for the variable 'CROPDMGEXP' ", 
    "that were substituted by NAs at the in-record data validation stage."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

Table 6.5.3.1-2: The distinct invalid values for the variable ‘CROPDMGEXP’ that were substituted by NAs at the in-record data validation stage.
Distinct Values	Number of Observations
	55041


6.5.3.2 Associate plausible substitutions to the invalid values from the variable CROPDMGEXP

A single association (which works perfectly, as shown in the next subsubsection) was made for the missing values that corresponded to empty values:

The entries that correspond to crop damage with zero magnitude (denoted by the value 0 at the variable CROPDMG) could be associated with any of the valid values (“K”, “M”, “B”), since a magnitude of zero yields zero damage regardless of the multiplier.


6.5.3.3 Identify the imputable missing values at the variable CROPDMGEXP

The observations that satisfied the criterion imposed by the association made for the invalid values from the variable CROPDMGEXP were identified.

# Create a validator with the criterion to identify 
# the imputable missing values for the variable CROPDMGEXP.
V_____imputable_missing_values_at_the_variable_CROPDMGEXP <- validator(
  "imputable_missing_values_at_CROPDMGEXP" = ( 
    (CROPDMG == 0) & is.na(CROPDMGEXP) 
  )
)

# Confront the subset of observations with missing values 
# in the variable CROPDMGEXP at the in-record validated data 
# with the validator with the criterion to identify 
# the imputable missing values at the variable CROPDMGEXP. 
CF_____imputable_missing_values_at_the_variable_CROPDMGEXP <- 
  confront(
    dat = in_record_validated_data[is.na(CROPDMGEXP)],
    V_____imputable_missing_values_at_the_variable_CROPDMGEXP
  )

All missing values at the variable CROPDMGEXP (55041 in total) corresponded to observations for which the magnitude of crop damage (denoted by the variable CROPDMG) was zero.

# Create a kable to present the number of imputable missing values for the 
# variable CROPDMGEXP at the table with the in-record validated data subset.
kable(
  x = summary(CF_____imputable_missing_values_at_the_variable_CROPDMGEXP)[
    , c("name", "items", "passes", "fails", "nNA", "error", "warning")
    ],
  caption = paste0(
    "Table 6.5.3.3-1: ",
    "Results from identification of imputable missing values ", 
    "at the variable CROPDMGEXP."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
    general = paste0(
      "The subset of the 55041 observations with missing values ",
      "was used for the identification of imputable invalid values."
    )
  )

Table 6.5.3.3-1: Results from identification of imputable missing values at the variable CROPDMGEXP.
name	items	passes	fails	nNA	error	warning
imputable_missing_values_at_CROPDMGEXP	55041	55041	0	0	FALSE	FALSE
Note:
The subset of the 55041 observations with missing values was used for the identification of imputable invalid values.

The key values (denoted by the variable REFNUM) of the observations for which the missing values at the variable CROPDMGEXP could be retrieved were identified.

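# Identify the observations for which the missing value 
# at the variable CROPDMGEXP can be safely imputed, 
# by their key value denoted by the variable REFNUM.
# Note: all 55041 observations passed the criterion (see Table 6.5.3.3-1), 
# so the keys of all observations with a missing CROPDMGEXP are selected.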
criterion_by_REFNUM_____imputable_missing_values_at_the_variable_CROPDMGEXP <- 
  with(
    data = CF_____imputable_missing_values_at_the_variable_CROPDMGEXP[["._value"]],
    expr = in_record_validated_data[is.na(CROPDMGEXP), REFNUM]
  )


6.5.3.4 Substitute the imputable missing values at the variable CROPDMGEXP

The value “K” was imputed to all observations with imputable missing values at the variable CROPDMGEXP (which were identified by their key value).

# Set the imputable missing values at the variable CROPDMGEXP 
# with the value "K".
set(
  x = imputed_data,
  i = which(
    imputed_data$REFNUM %in% 
      criterion_by_REFNUM_____imputable_missing_values_at_the_variable_CROPDMGEXP
  ),
  j = "CROPDMGEXP",
  value = "K"
)

(They could have been substituted by any of the valid values (“K”, “M” or “B”) for the variable CROPDMGEXP without changing the fact that they refer to $0 crop damage.)
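The effect of this imputation can be sketched on a few made-up observations. The table `toy_data`, the vector `multiplier` and the assumption that “K”, “M” and “B” stand for thousands, millions and billions of dollars are illustrative only. Base R is used here instead of `data.table::set`.

```r
# Hypothetical observations: two with zero crop damage and a missing
# exponent, one with a valid exponent.
toy_data <- data.frame(
  CROPDMG = c(0, 0, 5),
  CROPDMGEXP = c(NA, NA, "M"),
  stringsAsFactors = FALSE
)

# Assumed multipliers for the valid values of CROPDMGEXP.
multiplier <- c("K" = 1e3, "M" = 1e6, "B" = 1e9)

# Impute "K" only where the magnitude is zero and the exponent is missing.
imputable <- toy_data$CROPDMG == 0 & is.na(toy_data$CROPDMGEXP)
toy_data$CROPDMGEXP[imputable] <- "K"

# The imputed rows still amount to zero dollars of crop damage.
crop_damage_in_dollars <-
  toy_data$CROPDMG * unname(multiplier[toy_data$CROPDMGEXP])
print(crop_damage_in_dollars)
```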


6.5.4 Conduct post validation for the table with the imputed data

Post validation was conducted to verify that the values of the variables at the table with the imputed data were valid according to the same constraints that were used to identify the invalid values for each variable at the table with the target data subset.

# The table with the imputed data was post validated to verify 
# that all values for each of the variables it contained were valid.
CF____post_validation_of_the_table_with_the_imputed_data <- confront(
  dat = imputed_data,
  V____constrains_for_the_in_record_data_validation
)

All values for each variable at the table with the imputed data were valid.

# Present the results of the post validation for the table with imputed data
kable(
  x = summary(CF____post_validation_of_the_table_with_the_imputed_data)[
    , c("name", "items", "passes", "fails", "nNA", "error", "warning")
    ],
  caption = paste0(
    "Table 6.5.4-1: ", 
    "The results of post validation for the table with the imputed data."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
     general = paste0(
      "The same constraints that were used to identify the invalid values ", 
      "of each variable at the table with the target data subset, ", "\n",
      "were used for the post validation of the observations ", 
      "at the table with the imputed data."
    )
  )

Table 6.5.4-1: The results of post validation for the table with the imputed data.
name	items	passes	fails	nNA	error	warning
REFNUM	144826	144826	0	0	FALSE	FALSE
BGN_DATE	144826	144826	0	0	FALSE	FALSE
EVTYPE	144826	144571	0	255	FALSE	FALSE
FATALITIES	144826	144826	0	0	FALSE	FALSE
INJURIES	144826	144826	0	0	FALSE	FALSE
PROPDMG	144826	144826	0	0	FALSE	FALSE
PROPDMGEXP	144826	144826	0	0	FALSE	FALSE
CROPDMG	144826	144826	0	0	FALSE	FALSE
CROPDMGEXP	144826	144826	0	0	FALSE	FALSE
Note:
The same constraints that were used to identify the invalid values of each variable at the table with the target data subset, were used for the post validation of the observations at the table with the imputed data.
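The figures in Table 6.5.4-1 can be checked for internal consistency: for every variable, the passes, fails and NAs should add up to the number of items, and only EVTYPE should retain missing values. The sketch below copies the table by hand into a hypothetical data.frame named `post_validation`.

```r
# Values copied from Table 6.5.4-1.
post_validation <- data.frame(
  name = c(
    "REFNUM", "BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES",
    "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"
  ),
  items = 144826,
  passes = c(
    144826, 144826, 144571, 144826, 144826,
    144826, 144826, 144826, 144826
  ),
  fails = 0,
  nNA = c(0, 0, 255, 0, 0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Every row must satisfy passes + fails + nNA == items.
rows_add_up <- with(post_validation, passes + fails + nNA == items)

# Variables that still contain missing values after the imputation.
variables_with_missing_values <-
  post_validation$name[post_validation$nNA > 0]

print(all(rows_add_up))
print(variables_with_missing_values)
```

The 255 remaining NAs at EVTYPE match the number of missing values that could not be safely imputed.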


6.5.5 Overview of the table with the imputed data

The table with the imputed data contained 9 variables and 144826 observations.

The variable REFNUM was set as the key of this table.

# Print the structure of the table with the imputed data.
str(imputed_data)

## Classes 'data.table' and 'data.frame':   144826 obs. of  9 variables:
##  $ REFNUM    : int  413607 413608 413609 413610 413611 413612 413613 413614 413615 413616 ...
##  $ BGN_DATE  : chr  "1/19/2001 0:00:00" "1/19/2001 0:00:00" "1/19/2001 0:00:00" "1/19/2001 0:00:00" ...
##  $ EVTYPE    : chr  "THUNDERSTORM WIND" "THUNDERSTORM WIND" "THUNDERSTORM WIND" "THUNDERSTORM WIND" ...
##  $ FATALITIES: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ INJURIES  : int  0 0 0 0 0 0 0 4 0 0 ...
##  $ PROPDMG   : num  10 8 2 15 5 3 10 450 150 3 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "K" "K" "K" "K" ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

The only missing values left were the 255 at the variable EVTYPE (those that couldn’t be safely imputed). The number of distinct values at each of the variables didn’t indicate the presence of obvious abnormalities.

# Create a kable to highlight some facts about the table with the imputed data.
kable(
  x = data.table(
    "Variable" = names(imputed_data),
    "Number of Distinct Values" = vapply(imputed_data, function(x) length(unique(x[!is.na(x)])), integer(1)),
    "Number of Missing Values" = vapply(
      X = imputed_data, 
      FUN = function(x) sum(is.na(x)), 
      FUN.VALUE = integer(1)),
    "Percentage of Missing Values" = vapply(
      X = imputed_data, 
      FUN = function(x) mean(is.na(x)),
      FUN.VALUE = double(1))
  ),
  caption = paste0(
    "Table 6.5.5-1: Facts about the table with the imputed data."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
    general = paste0(
      "The table with the imputed data contained 9 variables ", 
      "and 144826 observations."
    )
  )

Table 6.5.5-1: Facts about the table with the imputed data.
Variable	Number of Distinct Values	Number of Missing Values	Percentage of Missing Values
REFNUM	144826	0	0.0000000
BGN_DATE	3746	0	0.0000000
EVTYPE	47	255	0.0017607
FATALITIES	31	0	0.0000000
INJURIES	101	0	0.0000000
PROPDMG	1162	0	0.0000000
PROPDMGEXP	3	0	0.0000000
CROPDMG	269	0	0.0000000
CROPDMGEXP	3	0	0.0000000
Note:
The table with the imputed data contained 9 variables and 144826 observations.


6.6 Conduct Cross-Record Data Validation

Summary

Each observation in the table with the imputed data was checked to verify that its entries were valid across all variables simultaneously. The observations that passed this check were used to create the table with the cross-record validated data.

Steps

6.6.1 Identify all valid observations
- Identifies the valid observations according to a criterion that spans across all variables.
6.6.2 Create the table with the cross-record validated data
- Creates the table with the cross-record validated data extracting only the valid observations.
6.6.3 Conduct post validation for table with the cross-record validated data
- Ensures that all observations at the table with the cross-record validated data are valid.
6.6.4 Overview of the table with the cross-record validated data
- Presents some basic facts about the table with the cross-validated data.


6.6.1 Identify all valid observations

A single constraint that spanned all available variables of the table with the imputed data was created and used to identify the valid observations.

Specifically, each observation had to satisfy all 4 criteria below simultaneously in order to be considered valid:

The id must be unique (and non-missing).
The weather event type value must be one of the defined weather events (and non-missing).
The year must be in the period from 2001 to 2011 (and non-missing).
There must be non-zero harm either to population health or to economy, so:
- either fatalities must be positive (and non-missing),
- or injuries must be positive (and non-missing),
- or property damage (in dollars) must be retrievable and positive (and non-missing),
- or crop damage (in dollars) must be retrievable and positive (and non-missing).
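
To make the third criterion concrete, the short sketch below recovers the year from date strings written in the BGN_DATE format and checks it against the period of interest. It is only an illustration: it relies on base R date parsing rather than on the regular expression used in the validator, and the vector example_dates is made up.

```r
# Hypothetical date strings in the same format as BGN_DATE.
example_dates <- c("1/19/2001 0:00:00", "12/31/2011 0:00:00", "6/5/1999 0:00:00")

# Parse the month/day/year part; characters after the matched
# format (here the time of day) are ignored by the parser.
parsed_dates <- as.Date(example_dates, format = "%m/%d/%Y")

# Extract the year as an integer and test the period from 2001 to 2011.
years <- as.integer(format(parsed_dates, "%Y"))
year_is_valid <- years %in% 2001:2011

print(data.frame(example_dates, years, year_is_valid))
```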

# Create a validator that contains a single constraint 
# for the validity of each observation 
# that spans all variables.  
V_____cross_record_constrains <- validator(
  "valid_observations" = (
    ( REFNUM %in% names(table(REFNUM)[table(REFNUM) == 1]) ) &
      ( !is.na(EVTYPE) ) & 
      ( as.integer(str_extract(BGN_DATE, "(?<=^\\d{1,2}/\\d{1,2}/)\\d{4}(?= 0:00:00$)")) %in%
          c(2001:2011) ) &
      (
        ( FATALITIES > 0 ) |
          ( INJURIES > 0 ) |
          ( PROPDMG > 0 & !is.na(PROPDMGEXP) ) |
          ( CROPDMG > 0 & !is.na(CROPDMGEXP) )
      )
  )
)

# The imputed data table was confronted 
# with the validator that identifies 
# the observations that contain valid values across all variables. 
CF_____cross_record_constrains <- confront(
  dat = imputed_data,
  V_____cross_record_constrains
)

Out of the total of 144826 observations in the table with the imputed data, 144571 were valid across all variables and only 255 were invalid.

# Present the result of cross-record validation 
# for the observations contained at the table with imputed data.  
kable(
  x = summary(CF_____cross_record_constrains)[
    , c("name", "items", "passes", "fails", "nNA", "error", "warning")
    ],
  caption = paste0(
    "Table 6.6.1-1: ",
    "The table contains the results of the cross-record data validation ",
    "for the observation contained at the imputed data table."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

Table 6.6.1-1: The table contains the results of the cross-record data validation for the observation contained at the imputed data table.
name	items	passes	fails	nNA	error	warning
valid_observations	144826	144571	255	0	FALSE	FALSE

The value of the key (denoted by the variable REFNUM) was used to identify the observations that were valid.

# Identify the valid observations found through the cross-record validation 
# by their key value.  
criterion_by_REFNUM_____cross_validated_observations <- imputed_data[
  CF_____cross_record_constrains[["._value"]][["valid_observations"]],
  REFNUM
  ]


6.6.2 Create the table with the cross-record validated data

From the table with the imputed data, the table with the cross-record validated data was created, by including only the observations that contained valid (and non-missing) values across all variables.

# Create the table with cross-record data validation 
# by using only the valid observations.
cross_validated_data <- imputed_data[
  REFNUM %in% criterion_by_REFNUM_____cross_validated_observations
  ]


6.6.3 Conduct post validation for table with the cross-record validated data

Post validation was conducted to verify that all observations in the table with the cross-record validated data were valid according to the same constraint that was used to identify the valid observations in the table with the imputed data.

# The table with the cross-record validated data was post validated to verify 
# that all observations were valid.
CF_________post_validation_of_cross_validated_data <- confront(
  dat = cross_validated_data,
  V_____cross_record_constrains
)

All the observations at the table with the cross-record validated data were valid.

# Create a kable to present the results of the post validation 
# for the table with the cross-record validated data.
kable(
  x = summary(CF_________post_validation_of_cross_validated_data)[
    , c("name", "items", "passes", "fails", "nNA", "error", "warning")
    ],
  caption = paste0(
    "Table 6.6.3-1: ",
    "Presents the result of the post validation ", 
    "for the table with cross validated data."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

Table 6.6.3-1: Presents the result of the post validation for the table with cross validated data.
name	items	passes	fails	nNA	error	warning
valid_observations	144571	144571	0	0	FALSE	FALSE


6.6.4 Overview of the table with the cross-record validated data

The table with the cross-validated data contained 9 variables and 144571 observations.

The variable REFNUM was set as the key of this table.

# Print the structure of the table with the cross-record validated data.
str(cross_validated_data)

## Classes 'data.table' and 'data.frame':   144571 obs. of  9 variables:
##  $ REFNUM    : int  413607 413608 413609 413610 413611 413612 413613 413614 413615 413616 ...
##  $ BGN_DATE  : chr  "1/19/2001 0:00:00" "1/19/2001 0:00:00" "1/19/2001 0:00:00" "1/19/2001 0:00:00" ...
##  $ EVTYPE    : chr  "THUNDERSTORM WIND" "THUNDERSTORM WIND" "THUNDERSTORM WIND" "THUNDERSTORM WIND" ...
##  $ FATALITIES: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ INJURIES  : int  0 0 0 0 0 0 0 4 0 0 ...
##  $ PROPDMG   : num  10 8 2 15 5 3 10 450 150 3 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "K" "K" "K" "K" ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

All the observations in the table with the cross-record validated data are complete, as indicated by the results of the post validation for this table.


6.7 Produce The Processed Data

Summary

After the target data for the period of interest had been identified, validated and imputed, the variables of the table with the cross-record validated data were transformed to construct the table with the processed data. This table contained all the information needed to proceed with the analysis and address the two questions of interest.

Steps

6.7.1 Create the table with the processed data
- Transforms the variables from the table with the cross-record validated data to create the table with the processed data.
6.7.2 Conduct post validation for the table with the processed data
- Ensures that all observations at the table with the processed data are valid.


6.7.1 Create the table with the processed data

The following transformations were applied to the variables of the table with the cross-record validated data in order to construct the table with the processed data:

the variable REFNUM was transferred unchanged
the variable BGN_DATE was omitted
the variable EVTYPE was transferred and renamed to EVENT_TYPE
the variable FATALITIES was transferred unchanged
the variable INJURIES was transferred unchanged
the variables FATALITIES and INJURIES were added in order to create the variable CASUALTIES
the variables PROPDMG (that denoted the magnitude of property damage) and PROPDMGEXP (that indicated if the value of PROPDMG referred to thousands, millions or billions) were combined to retrieve the property damage in dollars and create the variable PROPERTY_DAMAGE
the variables CROPDMG (that denoted the magnitude of crop damage) and CROPDMGEXP (that indicated if the value of CROPDMG referred to thousands, millions or billions) were combined to retrieve the crop damage in dollars and create the variable CROP_DAMAGE
the variables PROPERTY_DAMAGE and CROP_DAMAGE were added in order to create the variable ECONOMIC_DAMAGE

# Create the table with the processed data 
# from the information contained 
# at the table with cross-record validated data.
processed_data <- cross_validated_data[
  ,
  list(
    # REFNUM variable doesn't need to change
    "REFNUM" = REFNUM,
    # EVTYPE variable should be renamed to EVENT_TYPE
    "EVENT_TYPE" = EVTYPE,
    # FATALITIES variable doesn't need to change
    "FATALITIES" = FATALITIES,
    # INJURIES variable doesn't need to change
    "INJURIES" = INJURIES,
    # PROPERTY_DAMAGE is created by combining the information
    # from the PROPDMG variable which denotes the magnitude of property damage
    # and the PROPDMGEXP variable that indicates if the magnitude
    # refers to thousands (K), millions (M) or billions (B) of dollars
    "PROPERTY_DAMAGE" = (function(magnitude, coded_exponent, code_dictionary) {
      recoded_exponent <- str_replace_all(
        string = coded_exponent,
        code_dictionary
      ) %>%
        as.integer()
      ## the magnitude is multiplied by a coefficient
      ## with base 10 raised to the appropriate power
      ## (3 for thousands, 6 for millions or 9 for billions)
      ## to retrieve the value of property damage 
      reconstructed_number <- magnitude * 10^recoded_exponent
    })(PROPDMG, PROPDMGEXP, c("K" = "3", "M" = "6", "B" = "9")),
    # CROP_DAMAGE is created by combining the information
    # from the CROPDMG variable which denotes the magnitude of crop damage
    # and the CROPDMGEXP variable that indicates if the magnitude
    # refers to thousands (K), millions (M) or billions (B) of dollars
    "CROP_DAMAGE" = (function(magnitude, coded_exponent, code_dictionary) {
      recoded_exponent <- str_replace_all(
        string = coded_exponent,
        code_dictionary
      ) %>%
        as.integer()
      ## the magnitude is multiplied by a coefficient
      ## with base 10 raised to the appropriate power
      ## (3 for thousands, 6 for millions or 9 for billions)
      ## to retrieve the value of crop damage 
      reconstructed_number <- magnitude * 10^recoded_exponent
    })(CROPDMG, CROPDMGEXP, c("K" = "3", "M" = "6", "B" = "9"))
  )
  ][
    ,
    # Create a variable with the number of casualties
    # caused by each weather event type 
    # by adding the fatalities and injuries
    CASUALTIES := FATALITIES + INJURIES][
      ,
      # Create a variable with the economic damage
      # caused by each weather event type 
      # by adding the property damage and crop damage
      ECONOMIC_DAMAGE := PROPERTY_DAMAGE + CROP_DAMAGE
      ][
        ,
        # Re-arrange the order of the variables 
        list(
          REFNUM, EVENT_TYPE, 
          FATALITIES, INJURIES, CASUALTIES,
          PROPERTY_DAMAGE, CROP_DAMAGE, ECONOMIC_DAMAGE
        )
        ]
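
The conversion of a magnitude and a coded exponent into dollars can be illustrated on a few made-up values. The sketch below uses a named lookup vector instead of str_replace_all, so it is an alternative illustration rather than the code of the report; the names exponent_lookup, magnitude and coded_exponent are hypothetical.

```r
# Power of 10 that corresponds to each exponent code.
exponent_lookup <- c("K" = 3, "M" = 6, "B" = 9)

# Made-up magnitudes with their coded exponents.
magnitude <- c(10, 2.5, 1)
coded_exponent <- c("K", "M", "B")

# Look up the power for each code and rebuild the amount in dollars.
damage_in_dollars <- unname(magnitude * 10^exponent_lookup[coded_exponent])

print(damage_in_dollars)
```

With these inputs, 10 "K" corresponds to ten thousand dollars, 2.5 "M" to two and a half million dollars and 1 "B" to one billion dollars.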


6.7.2 Conduct post validation for the table with the processed data

Post validation was conducted to verify that all observations in the table with the processed data were valid across all the variables it contained.

One constraint was created and used, which consists of three parts that must hold simultaneously for each observation:

- The key of each observation, denoted by the variable REFNUM, must be unique.
- The event type of each observation, denoted by the variable EVENT_TYPE, must be one of the 48 defined weather event types according to the NATIONAL WEATHER SERVICE INSTRUCTION 10-1605, AUGUST 17, 2007 (chapter 7).
- At least one of the six variables that indicate the harm (either to population health or to economy), denoted by the variables FATALITIES, INJURIES, CASUALTIES, PROPERTY_DAMAGE, CROP_DAMAGE or ECONOMIC_DAMAGE, must be positive.
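
Because the three parts must hold simultaneously, the way the logical operators are grouped matters. In R the operator & is evaluated before |, so the six harm comparisons have to be wrapped in their own parentheses before they are combined with the first two parts. A minimal sketch with made-up logical values (all names are hypothetical):

```r
# Made-up outcomes of the individual checks for one observation.
key_is_unique <- FALSE
event_type_is_defined <- TRUE
harm_fatalities <- FALSE
harm_injuries <- TRUE

# Without grouping, & is applied first and the result is then OR-ed
# with the remaining harm check, so a failed key check is masked.
ungrouped <- key_is_unique & event_type_is_defined & harm_fatalities | harm_injuries

# With grouping, the harm checks form a single part of the conjunction.
grouped <- key_is_unique & event_type_is_defined & (harm_fatalities | harm_injuries)

print(c(ungrouped = ungrouped, grouped = grouped))
```

In this example the ungrouped expression evaluates to TRUE although the key check failed, while the grouped expression evaluates to FALSE.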

# Create a validator that contains a constraint 
# that spans all variables contained 
# in the table with the processed data 
# for each observation it includes.
V_____post_validation_of_table_with_the_processed_data <- validator(
  "valid_observation" = (
    # Part 1: the key must be unique.
    ( REFNUM %in% names(table(REFNUM)[table(REFNUM) == 1]) ) &
      # Part 2: the event type must be one of the defined event types.
      ( EVENT_TYPE %in% defined_event_types ) &
      # Part 3: at least one harm variable must be positive
      # (grouped in parentheses because & is evaluated before |).
      (
        ( FATALITIES > 0 ) |
          ( INJURIES > 0 ) |
          ( CASUALTIES > 0 ) |
          ( PROPERTY_DAMAGE > 0 ) |
          ( CROP_DAMAGE > 0 ) |
          ( ECONOMIC_DAMAGE > 0 )
      )
  )
)

# Confront the table with the processed data with 
# validator that verifies the validity of each observation it contains 
# across all variables.
CF_____post_validation_of_table_with_the_processed_data <- confront(
  dat = processed_data,
  V_____post_validation_of_table_with_the_processed_data
)

All the 144571 observations included in the processed data table were found to satisfy the condition.

# Create a kable to present the results of post validation 
# for the table with the processed data.
kable(
  x = summary(CF_____post_validation_of_table_with_the_processed_data)[
    , c("name", "items", "passes", "fails", "nNA", "error", "warning")
    ],
  caption = paste0(
    "Table 6.7.2-1: ",
    "The results of post validation for the table with the processed data."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  )

Table 6.7.2-1: The results of post validation for the table with the processed data.
name	items	passes	fails	nNA	error	warning
valid_observation	144571	144571	0	0	FALSE	FALSE


7 PROCESSED DATA

The table with the processed data (which was the result of the data processing pipeline) contains all the information that was used in the chapters:

8 HARM ON POPULATION HEALTH
9 HARM ON ECONOMY

in order to address the two questions of interest for this analysis.

Details about the variables it contains and a short overview are presented in this chapter.

Finally, in order to assist any attempt to reproduce the analysis, a file with the processed data was exported to serve as a checkpoint.


7.1 Information For The Table With The Processed Data

There are 8 variables in the table with the processed data:

REFNUM (int) : a value that uniquely identifies each observation and was used as the key of the table
EVENT_TYPE (chr) : the type of each weather event
FATALITIES (int) : the number of fatalities
INJURIES (int) : the number of injuries
CASUALTIES (int) : the number of casualties (injuries and fatalities)
PROPERTY_DAMAGE (num) : the property damage in dollars
CROP_DAMAGE (num) : the crop damage in dollars
ECONOMIC_DAMAGE (num): the economic damage in dollars (property damage and crop damage)


7.2 Overview Of The Table With The Processed Data

The processed data consists of 8 variables and 144571 observations.

The variable REFNUM was set as the key of this table.

# Print the structure of the table with the processed data.
str(processed_data)

## Classes 'data.table' and 'data.frame':   144571 obs. of  8 variables:
##  $ REFNUM         : int  413607 413608 413609 413610 413611 413612 413613 413614 413615 413616 ...
##  $ EVENT_TYPE     : chr  "THUNDERSTORM WIND" "THUNDERSTORM WIND" "THUNDERSTORM WIND" "THUNDERSTORM WIND" ...
##  $ FATALITIES     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ INJURIES       : int  0 0 0 0 0 0 0 4 0 0 ...
##  $ CASUALTIES     : int  0 0 0 0 0 0 0 4 0 0 ...
##  $ PROPERTY_DAMAGE: num  10000 8000 2000 15000 5000 3000 10000 450000 150000 3000 ...
##  $ CROP_DAMAGE    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ECONOMIC_DAMAGE: num  10000 8000 2000 15000 5000 3000 10000 450000 150000 3000 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

All the observations included in the processed data table are complete.

# Create a kable to present the number of complete cases 
# at the table with the processed data.
kable(
    x = data.table(
        "Percentage Of Complete Observations" = 
            paste0(mean(complete.cases(processed_data))*100, "%")
    ),
    caption = paste0(
        "Table 7.2-1: ",
        "The percentage of complete observations ",
        "at the table with the processed data."
    )
) %>% 
    kable_styling(
        bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
        full_width = FALSE,
        fixed_thead = TRUE
    )

Table 7.2-1: The percentage of complete observations at the table with the processed data.
Percentage Of Complete Observations
100%

The number of distinct values complies with what was expected for each variable.

# Create a kable to present the number of distinct values 
# for each variable at the table with the processed data.
kable(
    x = data.table(
        "Variable" = names(processed_data),
        "Number of Distinct Values" = 
            vapply(
                X = processed_data, 
                FUN = function(x) length(unique(x[!is.na(x)])), 
                FUN.VALUE = integer(1)
            )
    ),
    caption = paste0(
        "Table 7.2-2: ",
        "The number of distinct values ",
        "for each variable at the table with the processed data."
    )
) %>% 
    kable_styling(
        bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
        full_width = FALSE,
        fixed_thead = TRUE
    ) %>% 
  footnote(
    general = paste0(
      "The table with the processed data consists of 8 variables ", "\n",
      "and 144571 observations."
    )
  )

Table 7.2-2: The number of distinct values for each variable at the table with the processed data.
Variable	Number of Distinct Values
REFNUM	144571
EVENT_TYPE	47
FATALITIES	31
INJURIES	101
CASUALTIES	113
PROPERTY_DAMAGE	1369
CROP_DAMAGE	331
ECONOMIC_DAMAGE	1647
Note:
The table with the processed data consists of 8 variables and 144571 observations.


7.3 Export The Table With The Processed Data

The table with the processed data was exported with saveRDS (as a serialized R object, despite the .R extension of the file), in the sub-directory of the working directory:

outputs –> processed_data

with filename:

table_with_the_processed_data.R

# Supply the filepath at which the table 
# with the processed data will be exported.
filepath_____processed_data <-
    file.path(
        directory_tree_____outputs[[
            "filepath_____outputs_____processed_data"
            ]],
        "table_with_the_processed_data.R"
    )

# Export the table with the processed data
# as a serialized R object.
saveRDS(
    object = processed_data,
    file = filepath_____processed_data
)

The main reason for exporting a file with the processed data was to supply a checkpoint for any attempt to reproduce the analysis.
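
As an illustration of how such a checkpoint can be used, the sketch below writes a small made-up table with saveRDS and restores it with readRDS. The temporary file and the table example_table are hypothetical stand-ins for the exported file and the processed data.

```r
# Hypothetical stand-in for the table with the processed data.
example_table <- data.frame(
  REFNUM = 1:3,
  FATALITIES = c(0L, 2L, 1L)
)

# Write the table to a temporary file as a serialized R object.
checkpoint_file <- tempfile(fileext = ".rds")
saveRDS(object = example_table, file = checkpoint_file)

# Restore the table from the checkpoint and compare it with the original.
restored_table <- readRDS(file = checkpoint_file)
print(identical(example_table, restored_table))
```

A reader who wants to resume the analysis from the checkpoint would call readRDS on the exported file instead of re-running the data processing chapter.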


8 HARM ON POPULATION HEALTH

In this chapter an attempt was made to quantify the harm on population health based on the information from the table with the processed data.

The harm on population health was examined from three perspectives:

The harm on population health with respect to fatalities caused by each weather event type, based on the observations for weather events that resulted in non-zero fatalities in the United States in the period from 2001 to 2011.
The harm on population health with respect to injuries caused by each weather event type, based on the observations for weather events that resulted in non-zero injuries in the United States in the period from 2001 to 2011.
The harm on population health with respect to casualties (sum of fatalities and injuries) caused by each weather event type, based on the observations for weather events that resulted in non-zero casualties in the United States in the period from 2001 to 2011.

The weather event types for which fewer than 10 observations with non-zero harm were available with respect to a perspective of interest were omitted (from the analysis of that particular perspective), to avoid highly misleading statistics. Consequently, the subset of weather event types that were included differs between the three perspectives.

Because, for all perspectives, the values of interest for the observations of most weather event types were highly positively skewed, it was considered important to examine them from three different aspects in order to obtain an insightful picture of their consequences:

The overall harm on population health caused by each weather event type.
The harm on population health caused by the 90% of cases with the lowest impact of each weather event type.
The harm on population health caused by the 10% of cases with the highest impact of each weather event type.

For every aspect the sample size, the skewness and the mean of the values that encapsulated the harm with respect to each perspective were summarized by each weather event type and reported.
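
The idea of separating the 90% of cases with the lowest impact from the 10% of cases with the highest impact can be sketched on made-up data. The example below assumes a rank-based split within each event type; this is only one possible way to form the two groups and is not necessarily the procedure used in the report. The table toy_data and the column IMPACT_GROUP are hypothetical.

```r
library(data.table)

# Made-up fatalities for two hypothetical event types (10 events each).
toy_data <- data.table(
  EVENT_TYPE = rep(c("TYPE A", "TYPE B"), each = 10),
  FATALITIES = c(1, 1, 1, 1, 1, 2, 2, 3, 4, 50,
                 1, 1, 1, 1, 1, 1, 1, 1, 2, 10)
)

# Within each event type, rank the events by fatalities and label
# the top 10% of ranks as the highest-impact group.
toy_data[
  ,
  IMPACT_GROUP := ifelse(
    rank(FATALITIES, ties.method = "first") > 0.9 * .N,
    "highest 10%",
    "lowest 90%"
  ),
  by = EVENT_TYPE
]

# Summarize the sample size and the mean by event type and impact group.
toy_summary <- toy_data[
  ,
  list(N = .N, MEAN = mean(FATALITIES)),
  by = list(EVENT_TYPE, IMPACT_GROUP)
]

print(toy_summary)
```

In this toy example a single extreme event dominates the overall mean of each type, which is why the two groups are worth reporting separately when the values are highly positively skewed.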

The results obtained for the harm on population health by each weather event type were presented at the section 10.1 Question 1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health? of the chapter 10 RESULTS.

For each of the three perspectives that were examined for the harm on population health by each weather event type a multiplot was created to visualize the respective results. Those multiplots constitute the three parts of the Figure 1, which was composed and presented at the subsection 10.1.1 Overview of results for the harm on population health of the chapter 10 RESULTS.

(In compliance with the restrictions of the assignment, according to which at least 1 but no more than 3 figures should be included in the report, the multiplots as well as the elementary plots that they contain were NOT displayed separately and can ONLY be examined as PARTs of the Figure 1 at the subsection 10.1.1 Overview of results for the harm on population health of the chapter 10 RESULTS.)


8.1 Harm On Population Health With Respect To Fatalities By Each Weather Event Type

Summary

The required variables and the target data subset of observations for the harm on population health with respect to fatalities were extracted from the table with the processed data, and processed to create a new variable that divided the observations for each of the included weather event types into two complementary groups:

the 90% of observations with the lowest impact
the 10% of observations with the highest impact

before the information for the harm on population health with respect to fatalities was summarized by each weather event type.

Three aspects were examined:

The overall average number of fatalities by each weather event type.
The average number of fatalities by each weather event type for the 90% of cases with the lowest impact.
The average number of fatalities by each weather event type for the 10% of cases with the highest impact.

For each aspect, the average number of fatalities by each weather event type, the number of its available observations (based on which the average was computed) and their skewness were examined.

The overall average number of fatalities was used as the main criterion to determine which weather events caused the most harm on population health with respect to fatalities. It is important, however, to take into account the other two aspects that were presented in order to obtain a more insightful and complete ‘picture’ of their consequences (especially given the fact that, for most of the weather event types, the fatalities were highly positively skewed).

The table with the results for the harm on population health with respect to fatalities by each weather event type was presented at the subsection 10.1.2 Most harmful event types with respect to fatalities of the chapter 10 RESULTS.

Finally the Multiplot 1.1 was created to visualize the results for the harm on population health with respect to fatalities by each weather event type.

(Note that neither the Multiplot 1.1 nor the elementary plots that it contains were presented in this section due to the restrictions imposed by the assignment to include in the report at least 1 but no more than 3 figures. It can be examined at the subsection 10.1.1 Overview of results for the harm on population health at the chapter 10 RESULTS, where the Figure 1 was presented, of which the Multiplot 1.1 constitutes the PART 1.)

Steps

8.1.1 Extract the target data for harm on population health with respect to fatalities
- The target data subset of observations needed to evaluate the harm on population health with respect to fatalities by each weather event type was extracted from the table with the processed data.
8.1.2 Process the target data for harm on population health with respect to fatalities
- The table with target data subset for the harm on population health with respect to fatalities was processed to create the table with processed data for the harm on population health with respect to fatalities.
8.1.3 Summarize the processed data for harm on population health with respect to fatalities by each weather event type
- The harm on population health with respect to fatalities by each weather event type was evaluated over various aspects.
8.1.4 Visualize the results of the summary for the harm on population health with respect to fatalities by each weather event type
- The Multiplot 1.1 that presents the results of the summary for the harm on population health with respect to fatalities by each weather event type was created.
  - 8.1.4.1 Create the components of Multiplot 1.1
    - Creates the four elementary plots that constitute the Multiplot 1.1:
      - 8.1.4.1.1 Create The Plot 1.1.1
        
        Displays the overall average number of fatalities caused by each weather event type based on all the cases of weather events that resulted in non-zero fatalities.
      - 8.1.4.1.2 Create The Plot 1.1.2
        
        Displays the average number of fatalities caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero fatalities.
      - 8.1.4.1.3 Create The Plot 1.1.3
        
        Displays the average number of fatalities caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero fatalities.
      - 8.1.4.1.4 Create The Plot 1.1.4
        
        Displays a comparison for each weather event type, of the average number of fatalities for the 90% of its observations with the lowest impact versus the average number of fatalities for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero fatalities.
  - 8.1.4.2 Compose the Multiplot 1.1
    - Combines the four elementary plots to create the Multiplot 1.1.


8.1.1 Extract the target data for harm on population health with respect to fatalities

In order to examine the harm on population health with respect to fatalities caused by each weather event type, the variables REFNUM, EVENT_TYPE and FATALITIES were selected from the table with the processed data and only the observations that refer to weather events that resulted in non-zero fatalities were extracted.

Furthermore, in an attempt to avoid highly misleading statistics due to the small number of observations for some of the weather event types, a lower bound of 10 weather events that caused non-zero fatalities (for each of the included weather event types) was selected (subjectively by the analyst) and applied.

Although this lower bound may seem (and generally is) too low to yield trustworthy statistics, it was considered to be “good enough” taking into account that:

the analysis focuses on describing historical data without trying to make inferences, which would demand substantially bigger samples. Even so, a statistic based on fewer than 10 observations could not be taken seriously, especially in cases (such as this analysis) where the distribution of fatalities for each weather event type was skewed.
the period of about a decade (from 2001 to 2011) covered by the observations used in the analysis is relatively short for producing big samples of weather events that caused non-zero fatalities for some of the weather event types. Thus, if a higher bound (such as 100 or 300 observations) had been selected to get more robust statistics, the majority of weather event types would have been excluded, making the results of the analysis trivial.

# Extract the required variables and the target data subset of observations 
# for the harm on population health with respect to fatalities.
target_data_____harm_on_population_health_____fatalities <- processed_data[
  ## Extract only the observations that have resulted in non-zero fatalities.
  FATALITIES > 0,
  ## Select only the relevant variables. 
  list(REFNUM, EVENT_TYPE, FATALITIES)
  ][
    ### Keep only the observations that correspond to the weather event types 
    ### for which there are at least 10 weather events available.
    EVENT_TYPE %in% 
      names(table(EVENT_TYPE)[table(EVENT_TYPE) >= 10])
    ]

The table with the target data for the harm on population health with respect to fatalities consists of 3175 observations.

# Print the structure of the table with the target data subset 
# for the harm on population health with respect to fatalities.
str(target_data_____harm_on_population_health_____fatalities)

## Classes 'data.table' and 'data.frame':   3175 obs. of  3 variables:
##  $ REFNUM    : int  413652 413757 413763 413862 414153 414183 414184 414187 414200 414267 ...
##  $ EVENT_TYPE: chr  "THUNDERSTORM WIND" "TORNADO" "HIGH WIND" "THUNDERSTORM WIND" ...
##  $ FATALITIES: int  1 2 1 1 1 1 1 1 1 2 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

The variable EVENT_TYPE includes 26 distinct weather event types, for most of which the variable FATALITIES was highly positively skewed.
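
As a point of reference, a moment-based sample skewness can be sketched in base R as below. This is only an illustration: the helper name `moment_skewness` and the toy vectors are hypothetical, and the `skewness()` function called in this report may come from a package that applies a different estimator or correction.

```r
# Hypothetical helper (not part of the analysis): moment-based sample skewness,
# i.e. the third central moment divided by the second central moment
# raised to the power 3/2.
moment_skewness <- function(x) {
  deviations <- x - mean(x)
  m2 <- mean(deviations^2)
  m3 <- mean(deviations^3)
  m3 / m2^(3 / 2)
}

# Toy vectors (made up for illustration, not taken from the dataset).
symmetric_counts <- c(1, 2, 3, 4, 5)
right_tailed_counts <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 25)

skew_symmetric <- moment_skewness(symmetric_counts)
skew_right_tailed <- moment_skewness(right_tailed_counts)
```

With this helper, a symmetric vector has zero skewness and a vector with a long right tail has positive skewness. When all values are equal, the second central moment is zero and the helper returns NaN.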

# Create a kable to present some facts about the table with the target data 
# for the harm on population health with respect to fatalities.
kable(
  x = target_data_____harm_on_population_health_____fatalities[
    order(EVENT_TYPE), 
    list(
      "N" = .N, 
      "SKEWNESS" = round(skewness(FATALITIES), 4)
    ), 
    by = EVENT_TYPE
    ],
  caption = paste0(
    "Table 8.1.1-1: ",
    "Facts about the table with the target data subset of observations ", 
    "for the harm on population health with respect to fatalities."
  )
) %>% 
  kable_styling(
    bootstrap_options = c(
      "striped", "hover", "condensed", "responsive", "bordered"
    ), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
    general = "The skewness was rounded to 4 decimal places."
  )

Table 8.1.1-1: Facts about the table with the target data subset of observations for the harm on population health with respect to fatalities.
EVENT_TYPE	N	SKEWNESS
AVALANCHE	129	2.2979
BLIZZARD	15	2.6185
COLD/WIND CHILL	75	2.9759
DEBRIS FLOW	11	1.6608
EXCESSIVE HEAT	296	5.4405
EXTREME COLD/WIND CHILL	103	4.5318
FLASH FLOOD	392	8.0755
FLOOD	187	5.0049
HEAT	127	4.1476
HEAVY RAIN	34	2.5950
HEAVY SNOW	18	0.9923
HIGH SURF	86	2.2931
HIGH WIND	92	3.4457
HURRICANE/TYPHOON	23	2.1981
ICE STORM	20	2.7519
LIGHTNING	387	5.3156
MARINE STRONG WIND	12	1.7889
MARINE THUNDERSTORM WIND	12	2.3158
RIP CURRENT	384	5.3801
STRONG WIND	90	2.6667
THUNDERSTORM WIND	195	6.4762
TORNADO	339	13.5732
TROPICAL STORM	20	3.8434
WILDFIRE	31	2.6290
WINTER STORM	51	0.9436
WINTER WEATHER	46	3.7781
Note:
The skewness was rounded to 4 decimal places.

It is worth noting that the weather event types with the most observations also showed the highest skewness of fatalities. This suggests that the corresponding distributions of fatalities have a heavy tail, which could not be observed when only a few observations were available.
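
The counts and skewness values of Table 8.1.1-1 can be used to check two of the statements made in this subsection: how many weather event types would remain under a stricter lower bound, and whether the number of observations and the skewness tend to increase together. The following is a minimal base R sketch; the object names are illustrative and the vectors were copied by hand from the table.

```r
# Values copied from Table 8.1.1-1 (in the table's alphabetical order of EVENT_TYPE).
n_obs <- c(129, 15, 75, 11, 296, 103, 392, 187, 127, 34, 18, 86, 92,
           23, 20, 387, 12, 12, 384, 90, 195, 339, 20, 31, 51, 46)
skew <- c(2.2979, 2.6185, 2.9759, 1.6608, 5.4405, 4.5318, 8.0755, 5.0049,
          4.1476, 2.5950, 0.9923, 2.2931, 3.4457, 2.1981, 2.7519, 5.3156,
          1.7889, 2.3158, 5.3801, 2.6667, 6.4762, 13.5732, 3.8434, 2.6290,
          0.9436, 3.7781)

# Number of weather event types that would remain under each lower bound.
types_kept <- vapply(
  c("10" = 10, "100" = 100, "300" = 300),
  function(bound) sum(n_obs >= bound),
  integer(1)
)

# Rank correlation between the number of observations and the skewness.
rank_correlation <- cor(n_obs, skew, method = "spearman")
```

With the counts of the table, 10 of the 26 weather event types have at least 100 observations and only 4 have at least 300.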

8.1.2 Process the target data for harm on population health with respect to fatalities

To create the table with the processed data for the harm on population health with respect to fatalities from the corresponding target data subset, a new variable was created that divides the observations of each included weather event type into two complementary levels:

one that contains the 90% of cases with the lowest impact
the other that contains the 10% of cases with the highest impact

This decision was made because of the high skewness observed in the variable FATALITIES for most weather event types, which indicates that the underlying distributions of such phenomena have a heavy tail that causes this heterogeneity in the observations. As a result, most of the cases with non-zero fatalities caused only a small number of fatalities, while the few cases with the highest impact caused many.

Given that the average number of fatalities was used to determine which weather event types were the most harmful to population health (with respect to fatalities), and that the average poorly represents highly skewed variables because it is strongly affected by the most extreme values, it was considered necessary to examine the two subsets created by those levels in order to obtain an insightful picture.

# Create the table with the processed data 
# for the harm on population health with respect to fatalities.
processed_data_____harm_on_population_health_____fatalities <- 
  target_data_____harm_on_population_health_____fatalities[
    ,
    ## Create a new variable that divides the observations
    ## for each weather event type into two complementary groups:  
    ##   - the 90% of weather events that resulted in the lowest fatalities
    ##   - the 10% of weather events that resulted in the highest fatalities
    BIN_GROUP_PER_EVENT_TYPE := (function(x, p_bins) {
      
      # add 0 and 1 to the percentiles supplied in the argument 'p_bins'
      # (as the lowest and the highest break respectively)
      # and sort all the breaks in ascending order
      p_bins_increasing <- sort(c(0, p_bins, 1))
      
      # create the character strings that label the bins delimited by the breaks;
      # these labels will be the values of the new variable
      bin_labels <- paste0("(", p_bins_increasing[-length(p_bins_increasing)]*100,
                           "% - ", p_bins_increasing[-1]*100, "%]")
      
      # identify the number of occurrences that correspond to each label
      n_times <- vapply(2:length(p_bins_increasing),
                        function(i) {
                          as.integer(floor(length(x) * p_bins_increasing[i]) -
                                       floor(length(x) * p_bins_increasing[i - 1]))
                        }, integer(1))
      
      # repeat each label as many times as its number of occurrences
      x_bins_expanded <- rep(x = bin_labels, times = n_times)
      
      # reorder the labels to match the positions of the values in 'x'
      # (order(order(x)) gives the rank of each value, ties broken by position)
      x_bins_expanded_reordered <- x_bins_expanded[order(seq_along(x)[order(x)])]
      
      ## Coerce the character vector with the labels of bins to an ordered factor
      ## (the levels are set explicitly so that they always follow the bin order)
      x_bins_factor <- factor(x_bins_expanded_reordered, levels = bin_labels, ordered = TRUE)
      
    })(FATALITIES, 0.9)
    , by = EVENT_TYPE
  ][
    ## Coerce the variable EVENT_TYPE to a factor
    , EVENT_TYPE := as.factor(EVENT_TYPE) 
  ]
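
The effect of the split can be illustrated on a small made-up vector. The sketch below is a simplified base R analogue for the single 90% break, not the code used in the analysis; the helper name `split_90_10` and the toy values are assumptions.

```r
# Hypothetical helper: label the floor(0.9 * n) smallest values as the low bin
# and the remaining values as the high bin (ties are broken by position).
split_90_10 <- function(x) {
  n_low <- floor(length(x) * 0.9)
  value_rank <- rank(x, ties.method = "first")
  factor(
    ifelse(value_rank <= n_low, "(0% - 90%]", "(90% - 100%]"),
    levels = c("(0% - 90%]", "(90% - 100%]"),
    ordered = TRUE
  )
}

# Toy fatalities for ten events of a single weather event type.
toy_fatalities <- c(1, 1, 2, 1, 1, 1, 7, 1, 3, 1)
toy_bins <- split_90_10(toy_fatalities)

# Overall mean versus the mean of each bin.
overall_mean <- mean(toy_fatalities)
bin_means <- tapply(toy_fatalities, toy_bins, mean)
```

Here the overall mean is 1.9, while the nine low-impact events average about 1.33 and the single high-impact event has 7 fatalities, which shows how one extreme value pulls the overall average upwards.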

The table with the processed data for the harm on population health with respect to fatalities contains 4 variables:

REFNUM (int) : an id that uniquely identifies each observation
EVENT_TYPE (Factor w/ 26 levels) : the type of each weather event
FATALITIES (int) : the number of fatalities
BIN_GROUP_PER_EVENT_TYPE (Ord.factor w/ 2 levels) : a factor that divides the observations for each weather event type to two complementary levels, one with the 90% of observations with the lowest impact and another with the 10% of observations with the highest impact.

and 3175 observations.

# Print the structure of the table with the processed data 
# for the harm on population health with respect to fatalities.
str(processed_data_____harm_on_population_health_____fatalities)

## Classes 'data.table' and 'data.frame':   3175 obs. of  4 variables:
##  $ REFNUM                  : int  413652 413757 413763 413862 414153 414183 414184 414187 414200 414267 ...
##  $ EVENT_TYPE              : Factor w/ 26 levels "AVALANCHE","BLIZZARD",..: 21 22 13 21 16 10 16 19 13 22 ...
##  $ FATALITIES              : int  1 2 1 1 1 1 1 1 1 2 ...
##  $ BIN_GROUP_PER_EVENT_TYPE: Ord.factor w/ 2 levels "(0% - 90%]"<"(90% - 100%]": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

8.1.3 Summarize the processed data for harm on population health with respect to fatalities by each weather event type

To evaluate the harm on population health by each weather event type with respect to fatalities a simplistic approach was adopted :

the weather event types were ranked from the most harmful to the least based on the overall average number of fatalities of the weather events that resulted in non-zero fatalities

The overall average number of fatalities caused by each weather event type was initially examined along with the skewness of the number of fatalities for each weather event type. In most cases the skewness was high (or even extremely high), so it was possible that the overall mean misrepresented the consequences of each weather event type.

For that reason, the average number of fatalities for the 90% of weather events with the lowest impact and the average for the 10% of weather events with the highest impact were also computed and compared.

Note that, for many weather event types, the average for the 10% of cases with the highest impact is based on only a few observations, so the corresponding mean values should be interpreted with caution.
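
To make the caution above concrete, the size of the high-impact group can be derived from the number of observations N of a weather event type, assuming the floor-based split applied when BIN_GROUP_PER_EVENT_TYPE was created. The sketch below uses a few values of N taken from Table 8.1.1-1; the object names are illustrative.

```r
# Number of observations for a few weather event types (N from Table 8.1.1-1).
n_obs <- c(
  "DEBRIS FLOW" = 11, "BLIZZARD" = 15, "WINTER STORM" = 51, "FLASH FLOOD" = 392
)

# Size of each group under the floor-based 90% / 10% split.
n_low <- floor(n_obs * 0.9)
n_high <- n_obs - n_low
```

For weather event types close to the lower bound of 10 observations, the high-impact group holds only one or two events, so its mean and skewness carry very little information.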

# Create the table with the summary for the harm on population health 
# with respect to fatalities for each weather event type.
summary_____harm_on_population_health______fatalities <- 
  processed_data_____harm_on_population_health_____fatalities[
  ,
  list(
    ## The total number of observation by each weather event type.
    "N" = .N,
    ## The average number of fatalities caused by each weather event type.
    "AVRG" = round(mean(FATALITIES), 2),
    ## The skewness of fatalities for the observations by each weather event type.
    "SKEWNESS" = round(skewness(FATALITIES), 4),
    ## The number of observations for the 90% of cases with the lowest impact 
    ## by each weather event type.
    "N_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , .N],
    ## The average number of fatalities caused by each weather event type 
    ## for the 90% of cases with the lowest impact.
    "AVRG_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , round(mean(FATALITIES), 2)],
    ## The skewness of fatalities for the 90% of cases with the lowest impact 
    ## by each weather event type.
    "SKEWNESS_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , round(skewness(FATALITIES), 4)],
    ## The number of observations for the 10% of cases with the highest impact 
    ## by each weather event type.
    "N_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , .N],
    ## The average number of fatalities caused by each weather event type 
    ## for the 10% of cases with the highest impact.
    "AVRG_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , round(mean(FATALITIES), 2)],
    ## The skewness of fatalities for the 10% of cases with the highest impact 
    ## by each weather event type.
    "SKEWNESS_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , round(skewness(FATALITIES), 4)]
  ),
  by = "EVENT_TYPE"
  ][
    ## The average number of fatalities determines the order in which 
    ## the ranks are assigned, from the most harmful weather event type 
    ## to the least (the rows of the table themselves are not reordered).
    order(-AVRG),
    ## Create a variable with the rank of the harmfulness of each weather event type.
    RANK := 1:length(EVENT_TYPE)
    ][
      ,
      ## Reorder the variables at the table.
      list(
        RANK, EVENT_TYPE, N, AVRG, SKEWNESS, N_LOW, AVRG_LOW, SKEWNESS_LOW, N_HIGH, AVRG_HIGH, SKEWNESS_HIGH
      )
      ]
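
The rank assignment can be mimicked in base R on a toy table. This is a sketch under the assumption that the sub-assignment above writes the ranks to the rows in the order given by `order(-AVRG)` while leaving the row order of the table untouched; the table `toy_summary` and its values are hypothetical.

```r
# Toy summary (hypothetical values), listed in alphabetical order of event type.
toy_summary <- data.frame(
  EVENT_TYPE = c("A", "B", "C", "D"),
  AVRG = c(1.20, 3.50, 1.05, 2.10)
)

# Assign rank 1 to the largest average, rank 2 to the next one, and so on,
# writing the ranks into the rows selected by order() without moving the rows.
toy_summary$RANK <- NA_integer_
toy_summary$RANK[order(-toy_summary$AVRG)] <- seq_len(nrow(toy_summary))
```

In this toy table the type "B" receives rank 1 and the type "C" receives rank 4, while the rows stay in their original alphabetical order.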

The table with the summary for the harm on population health with respect to fatalities by each weather event type, which was created in this section, was presented in subsection 10.1.2 Most harmful event types with respect to fatalities of chapter 10 RESULTS.

The table with the summary for the harm on population health with respect to fatalities by each weather event type was exported with saveRDS() (as a serialized R object, stored under an .R filename) to the following folder of the working directory:

outputs –> harm_on_population_health –> results

with filename:

summary_____harm_on_population_health______fatalities.R

# Supply the filepath at which the table with the summary
# for the harm on population health will be exported.
filepath_____summary_____harm_on_population_health______fatalities <-
  file.path(
    directory_tree_____outputs[[
      "filepath_____outputs_____harm_on_population_health_____results"
    ]],
    "summary_____harm_on_population_health______fatalities.R"
  )

# Export the table with the summary for the harm on population health
# with respect to fatalities.
saveRDS(
  object = summary_____harm_on_population_health______fatalities,
  file = filepath_____summary_____harm_on_population_health______fatalities
)

The main reason for exporting the file with the summary for the harm on population health with respect to fatalities by each weather event type was to supply a checkpoint for any attempts to reproduce the analysis.
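
A reader who reproduces the analysis could compare a regenerated summary with the exported checkpoint along the following lines. This is a minimal sketch that uses a temporary file and a made-up table instead of the real output path and summary; the names `toy_summary` and `checkpoint_path` are hypothetical.

```r
# Hypothetical stand-in for the exported summary table.
toy_summary <- data.frame(
  EVENT_TYPE = c("TYPE A", "TYPE B"),
  AVRG = c(4.1, 2.3)
)

# Write the object to a temporary file and read it back.
# saveRDS() stores a serialized R object regardless of the file extension.
checkpoint_path <- tempfile(fileext = ".R")
saveRDS(object = toy_summary, file = checkpoint_path)
restored_summary <- readRDS(file = checkpoint_path)

# Compare the restored object with the original one.
checkpoint_matches <- identical(restored_summary, toy_summary)
```

Because the file holds a serialized object rather than R source code, it has to be loaded with readRDS() and not with source().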

8.1.4 Visualize the results of the summary for the harm on population health with respect to fatalities by each weather event type

From the table with the summary for the harm on population health by each weather event type with respect to fatalities the Multiplot 1.1 was created to present an overview of the results for the three different aspects that were examined for this perspective.

The following elementary plots were created:

8.1.4.1.1 Create The Plot 1.1.1
- Displays the overall average number of fatalities caused by each weather event type based on all the cases of weather events that resulted in non-zero fatalities.
8.1.4.1.2 Create The Plot 1.1.2
- Displays the average number of fatalities caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero fatalities.
8.1.4.1.3 Create The Plot 1.1.3
- Displays the average number of fatalities caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero fatalities.
8.1.4.1.4 Create The Plot 1.1.4
- Displays a comparison for each weather event type, of the average number of fatalities for the 90% of its observations with the lowest impact versus the average number of fatalities for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero fatalities.

which were then combined in order to obtain the Multiplot 1.1.

The Multiplot 1.1 constitutes PART 1 of Figure 1, which displays the overview of the harm on population health by each weather event type.

(Note that neither the Multiplot 1.1 nor the elementary plots that it contains were presented in this section, because the assignment restricts the report to at least 1 but no more than 3 figures. The Multiplot 1.1 can be examined in subsection 10.1.1 Overview of results for the harm on population health of chapter 10 RESULTS, where Figure 1 was presented, of which the Multiplot 1.1 constitutes PART 1.)

8.1.4.1 Create the components of Multiplot 1.1

Four elementary plots were created to visualize the results for the aspects that were examined for the harm on population health with respect to fatalities by each weather event type.

8.1.4.1.1 Create The Plot 1.1.1
- Displays the overall average number of fatalities caused by each weather event type based on all the cases of weather events that resulted in non-zero fatalities.
8.1.4.1.2 Create The Plot 1.1.2
- Displays the average number of fatalities caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero fatalities.
8.1.4.1.3 Create The Plot 1.1.3
- Displays the average number of fatalities caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero fatalities.
8.1.4.1.4 Create The Plot 1.1.4
- Displays a comparison for each weather event type, of the average number of fatalities for the 90% of its observations with the lowest impact versus the average number of fatalities for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero fatalities.

The elementary plots were used to compose the Multiplot 1.1.

8.1.4.1.1 Create The Plot 1.1.1

The Plot 1.1.1 displays the overall average number of fatalities caused by each weather event type, taking into account only the observations that resulted in non-zero fatalities.

The skewness of the number of fatalities for the observations of each weather event type (based on which the overall average number of fatalities was computed) was encoded in the color of the point and line associated with each of them.

# Create the Elementary Plot 1.1.1 that displays 
# the overall average number of fatalities 
# by each weather event type for all cases. 
elementary_plot_1_1_1 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_population_health______fatalities,
    mapping = aes(
      x = AVRG,
      ### Reverse the order of the factors for the EVENT_TYPE variable 
      ### to make them displayed alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE)
        )
      ) 
    )
  ) +
  ## Draw a square shaped point to the position that corresponds to 
  ## the average number of fatalities caused by each weather event type, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(color = SKEWNESS),
    shape = 15, 
    size = 4.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average number of fatalities.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG, 
      group = EVENT_TYPE, 
      color = SKEWNESS
    )
    ) +
  ## Draw a number that indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) based on the overall average 
  ## number of fatalities it caused, inside the square point 
  ## that displays the average.
  geom_text(
    mapping = aes(
      label = RANK
    ), 
    size = 2.5
  ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average number of fatalities for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 1.1 will be composed from the four elementary plots. 
    limits = c(-2, 14), 
    midpoint = 7, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
  ) +
  ## Supply descriptive labels.  
  labs(
    title = "Plot 1.1.1", 
    subtitle = "Aspect: Overall",
    x = "Average Number of Fatalities\n",
    y = "Weather Event Types \n"
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    )
  )
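
Because the same limits `c(-2, 14)` are reused for the color scales of all the plots of PART 1, it can help to confirm that every skewness value falls inside them: with the default out-of-bounds handling of ggplot2 continuous scales, values outside the limits are treated as missing and drawn in the NA color. The following is a minimal base R check that uses made-up values in place of the columns SKEWNESS, SKEWNESS_LOW and SKEWNESS_HIGH; the object names are illustrative.

```r
# Hypothetical skewness values; NaN stands for a group whose values are all equal.
toy_skewness <- c(overall = 13.57, low = 1.90, high = -0.40, constant_group = NaN)

# Limits shared by the color scales of the elementary plots.
scale_limits <- c(-2, 14)

# Flag the values that fall inside the limits (NaN is never inside).
inside_limits <- !is.na(toy_skewness) &
  toy_skewness >= scale_limits[1] &
  toy_skewness <= scale_limits[2]
```

Any entry flagged FALSE would appear in the NA color of the scale rather than on the gradient.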

8.1.4.1.2 Create The Plot 1.1.2

The Plot 1.1.2 displays, for each weather event type, the average number of fatalities for the 90% of cases with the lowest impact among the observations that resulted in non-zero fatalities.

Each weather event type was matched with a number that represents its rank, from the most harmful to the least with respect to population health, based on the overall average number of fatalities it caused (so the rank is NOT based on the average number of fatalities for the 90% of cases with the lowest impact).

The skewness of the number of fatalities for the observations of each weather event type (based on which the average number of fatalities for the 90% of cases with the lowest impact was computed) was encoded in the color of the point and line associated with each of them.

# Create the Elementary Plot 1.1.2 that displays 
# the average number of fatalities by each weather event type 
# for the 90% of its cases with the lowest impact.
elementary_plot_1_1_2 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_population_health______fatalities,
    mapping = aes(
      x = AVRG_LOW,
      ### Reverse the order of the factors for the EVENT_TYPE variable 
      ### to display them alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE)
        )
      ) 
    )
  ) +
  ## Draw a circle shaped point to the position that corresponds to 
  ## the average number of fatalities caused by each weather event type
  ## for the 90% of its cases with the lowest impact, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(
      color = SKEWNESS_LOW
    ), 
    size = 3.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average number of fatalities 
  ## for the 90% of its cases with the lowest impact.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG_LOW, 
      group = EVENT_TYPE, 
      color = SKEWNESS_LOW
    )
  ) +
  ## Draw a number that indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) based on the overall average 
  ## number of fatalities it caused, inside the circle point 
  ## that displays the average.
  geom_text(
    mapping = aes(
      label = RANK
    ), 
    size = 2
    ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average number of fatalities for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 1.1 will be composed from the four elementary plots.
    limits = c(-2, 14), 
    midpoint = 7, 
    low = "lightgreen",
    mid = "orange",
    high = "purple"
    ) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 1.1.2",
    subtitle = "Aspect: 90% of cases with the lowest impact",
    x = paste0(
      "Average Number of Fatalities for the 90% ", "\n",
      "of Observations with the Lowest Impact" 
    )
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    ),
    ### Remove the text, ticks and title of the y axis.
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank()
  )
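
The reversal of the factor levels used for the y axis can be checked in isolation. ggplot2 places the first level of a discrete y variable at the bottom of the axis, so reversing the levels makes the alphabetically first weather event type appear at the top. The following is a minimal sketch with a few event types; the object names are illustrative.

```r
# A few weather event types as a factor (levels are sorted alphabetically by default).
event_type <- factor(c("TORNADO", "AVALANCHE", "HEAT", "FLOOD"))

# Same values, with the order of the levels reversed.
event_type_reversed <- factor(event_type, levels = rev(levels(event_type)))
```

Only the order of the levels changes; the values attached to each observation stay the same.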

8.1.4.1.3 Create The Plot 1.1.3

The Plot 1.1.3 displays, for each weather event type, the average number of fatalities for the 10% of cases with the highest impact among the observations that resulted in non-zero fatalities.

Each weather event type was matched with a number that represents its rank, from the most harmful to the least with respect to population health, based on the overall average number of fatalities it caused (so the rank is NOT based on the average number of fatalities for the 10% of cases with the highest impact).

The skewness of the number of fatalities for the observations of each weather event type (based on which the average number of fatalities for the 10% of cases with the highest impact was computed) was encoded in the color of the point and line associated with each of them.

# Create the Elementary Plot 1.1.3 that displays 
# the average number of fatalities by each weather event type 
# for the 10% of its cases with the highest impact.
elementary_plot_1_1_3 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_population_health______fatalities,
    mapping = aes(
      x = AVRG_HIGH,
      ### Reverse the order of the factors for the EVENT_TYPE variable 
      ### to display them alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE)
        )
      ) 
    )
  ) +
  ## Draw a diamond shaped point to the position that corresponds to 
  ## the average number of fatalities caused by each weather event type
  ## for the 10% of its cases with the highest impact, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(
      color = SKEWNESS_HIGH
    ), 
    shape = 18, 
    size = 4.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average number of fatalities 
  ## for the 10% of its cases with the highest impact.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG_HIGH, 
      group = EVENT_TYPE, 
      color = SKEWNESS_HIGH
    )
  ) +
  ## Draw a number that indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) based on the overall average 
  ## number of fatalities it caused, inside the diamond point 
  ## that displays the average.
  geom_text(
    mapping = aes(
      label = RANK
    ),
    size = 2
  ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average number of fatalities for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 1.1 will be composed from the four elementary plots.
    limits = c(-2, 14), 
    midpoint = 7, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
  ) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 1.1.3",
    subtitle ="Aspect: 10% of cases with the highest impact",
    x = paste0(
      "Average Number of Fatalities for the 10% ", "\n", 
      "of Observations with the Highest Impact" 
    )
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    ),
    ### Remove the text, ticks and title of the y axis 
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank()
  )

8.1.4.1.4 Create The Plot 1.1.4

The Plot 1.1.4 displays a compact overview of all three aspects that were examined for the harm on population health with respect to fatalities.

For each weather event type, the average number of fatalities for the 90% of cases with the lowest impact was plotted against the average number of fatalities for the 10% of cases with the highest impact.

# Create the Elementary Plot 1.1.4 that displays 
# by each weather event type the comparison of 
# the average number of fatalities 
# for the 90% of cases with the lowest impact
# versus the average number of fatalities 
# for the 10% of cases with the highest impact.
elementary_plot_1_1_4 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_population_health______fatalities,
    mapping = aes(
      x = AVRG_HIGH, 
      y = AVRG_LOW
    )
  ) +
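  ## Draw a point for each weather event type, positioned by its two averages
  ## and filled according to the overall skewness of its fatalities.
  geom_point(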
    mapping = aes(
      fill = SKEWNESS
    ), 
    shape = 21
  ) +
  ## Draw a label with a number that indicates the rank assigned 
  ## to each weather event type (from the most harmful to the least) 
  ## based on the overall average number of fatalities it caused.
  geom_label_repel(
    mapping = aes(
      label = RANK, 
      fill = SKEWNESS
    ),
    size = 2.5
  ) +
  ## Adjust the scale for the fill of each label.
  scale_fill_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average number of fatalities for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 1.1 will be composed from the four elementary plots.
    limits = c(-2, 14),
    midpoint = 7, 
    low = "lightgreen",
    mid = "orange", 
    high = "purple"
    ) +
  ## Set proper limits to the plot.
  xlim(c(1, 18)) +
  ylim(c(0.75, 2)) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 1.1.4",
    subtitle = paste0(
      "Comparison of the average number of fatalities ", 
      "for the 90% of observations with the lowest impact ", 
      "versus the average number of fatalities ", 
      "for the 10% of observations with highest impact. "
    ),
    x = paste0(
      "Average Number of Fatalities by each Weather Event Type ", 
      "for the 10% of its Observations with the Highest Impact"
    ),
    y = paste0(
      "Average Number of Fatalities by each Weather Event Type ", "\n", 
      "for the 90% of its Observations with the Lowest Impact."
    ),
    ### Add a descriptive label for the legend.
    fill = paste0(
      "The color indicates the skewness ",
      "of fatalities for each weather event type ",
      "(the color scale is common to all four plots of PART 1). ", "\n",
      "When the color is gray, the skewness was indeterminable ",
      "because all observations for that weather event type ",
      "took the same value."
    )
  ) +
  ## Select a theme.
  theme_linedraw() +
  ## Customize the selected theme.
  theme(
    ### Adjust the legend.
    legend.position = "bottom",
    legend.direction = "horizontal",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    )
  )

8.1.4.2 Compose the Multiplot 1.1

The four elementary plots, which were created from the results of the summary for the harm on population health with respect to fatalities by each weather event type, were combined into a single multiplot that displays the complete picture for this perspective.

# Create a multiplot that displays the overview of the summary 
# for the harm on population health with respect to fatalities
# by each weather event type.
multiplot_1_1 <- arrangeGrob(
  grobs = list(
      
    # Title
    textGrob(
      label = paste0(
        "\n",
        "PART 1: Harm on population health by each weather event type ", 
        "with respect to fatalities ", "\n", 
        "based on the cases of weather events ", 
        "that resulted in non-zero fatalities.", "\n", 
        "\n"
      ),
       gp=gpar(
         fontsize = 16, 
         fontface = "bold"
       )
    ),
    
    # Subtitle
    textGrob(
      label = paste0(
          "\n", 
          "The results include only the weather event types, ", 
          "for which at least 10 observations ", 
          "that resulted in non-zero fatalities were available. ", "\n",
          "The number associated with each weather event type ", 
          "represents the rank (from the most harmful to the least) ", 
          "which was assigned based on the overall average number of fatalities.", "\n",
          "Because for most of the weather event types ", 
          "high positive skewness was observed for the number of fatalities, ",
          "the average of the 90% of cases with lowest impact ", "\n",
          "and the 10% of cases with highest impact were reported ", 
          "to provide a more representative picture of their consequences.","\n",
          "\n"
      ),
       gp=gpar(
         fontsize = 14, 
         fontface = "bold"
       )
    ),
    
    # ELEMENTARY PLOT 1.1.1
    # Elementary plot for the average number of fatalities 
    # by each weather event type for all cases.
    elementary_plot_1_1_1,
    
    # ELEMENTARY PLOT 1.1.2
    # Elementary plot for the average number of fatalities 
    # by each weather event type for 90% of cases with the lowest impact.
    elementary_plot_1_1_2,
    
    # ELEMENTARY PLOT 1.1.3
    # Elementary plot for the average number of fatalities 
    # by each weather event type for 10% of cases with the highest impact.
    elementary_plot_1_1_3,
    
    # ELEMENTARY PLOT 1.1.4
    # Elementary Plot 1.1.4 for the comparison of 
    # the average number of fatalities 
    # for the 90% of cases with the lowest impact versus 
    # the 10% of cases with the highest impact.
    elementary_plot_1_1_4
  ),
  # Set the layout for these elementary plots
  layout_matrix = 
    matrix(
      c(1,1,1,1,1,1,1,1,1,
        2,2,2,2,2,2,2,2,2,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6
      ),
      byrow = TRUE, 
      nrow = 13
    )
)
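
As a quick sanity check on the layout, the matrix can be rebuilt on its own in base R and inspected. The sketch below is only illustrative: the object names `layout_check` and `cells_per_grob` are hypothetical and are not part of the main script. It shows the dimensions of the matrix and how many cells each entry of `grobs` occupies.

```r
# Rebuild the layout matrix of Multiplot 1.1 (values copied from the call above).
layout_check <- matrix(
  c(rep(1, 9),
    rep(2, 9),
    rep(c(3, 3, 3, 3, 3, 4, 4, 5, 5), times = 5),
    rep(c(NA, 6, 6, 6, 6, 6, 6, 6, 6), times = 6)),
  byrow = TRUE,
  nrow = 13
)

# Dimensions of the layout: rows by columns.
dim(layout_check)

# Number of cells assigned to each grob index (NA cells stay empty).
cells_per_grob <- table(layout_check, useNA = "ifany")
print(cells_per_grob)
```

Each index in the matrix refers to the position of an element in `grobs`, so 1 is the title, 2 the subtitle, 3 to 5 are the Plots 1.1.1 to 1.1.3 placed side by side, and 6 is the Plot 1.1.4, which spans the lower rows next to an empty first column.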

(Note that the Multiplot 1.1 was NOT presented in this section because the assignment restricts the report to at least 1 but no more than 3 figures. It can be examined in subsection 10.1.1 Overview of results for the harm on population health of chapter 10 RESULTS, where Figure 1 is presented, of which the Multiplot 1.1 constitutes PART 1.)

8.2 Harm On Population Health With Respect To Injuries By Each Weather Event Type

Summary

The required variables and the target data subset of observations for the harm on population health with respect to injuries were extracted from the table with the processed data and processed to create a new variable that divides the observations for each of the included weather event types into two complementary groups:

the 90% of observations with the lowest impact
the 10% of observations with the highest impact

before the information for the harm on population health with respect to injuries was summarized by each weather event type.

Three aspects were examined:

The overall average number of injuries by each weather event type.
The average number of injuries by each weather event type for the 90% of cases with the lowest impact.
The average number of injuries by each weather event type for the 10% of cases with the highest impact.

For each aspect, the average number of injuries by each weather event type, the number of its available observations (based on which the average was computed) and their skewness were examined.

The overall average number of injuries was used as the main criterion to determine which weather event types caused the most harm to population health with respect to injuries. It is important, however, to take the other two aspects into account in order to obtain a more insightful and complete picture of their consequences, especially since the injuries were highly positively skewed for most of the weather event types.

The table with the results for the harm on population health with respect to injuries by each weather event type was presented in subsection 10.1.3 Most harmful event types with respect to injuries of chapter 10 RESULTS.

Finally, the Multiplot 1.2 was created to visualize the results of the harm on population health with respect to injuries by each weather event type.

(Note that neither the Multiplot 1.2 nor the elementary plots that it contains were presented in this section because the assignment restricts the report to at least 1 but no more than 3 figures. They can be examined in subsection 10.1.1 Overview of results for the harm on population health of chapter 10 RESULTS, where Figure 1 is presented, of which the Multiplot 1.2 constitutes PART 2.)

Steps

8.2.1 Extract the target data for harm on population health with respect to injuries
- The target data subset of observations needed to evaluate the harm on population health with respect to injuries by each weather event type was extracted from the table with the processed data.
8.2.2 Process the target data for harm on population health with respect to injuries
- The table with target data subset for the harm on population health with respect to injuries was processed to create the table with processed data for the harm on population health with respect to injuries.
8.2.3 Summarize the processed data for harm on population health with respect to injuries by each weather event type
- The harm on population health with respect to injuries by each weather event type was evaluated over various aspects.
8.2.4 Visualize the results of the summary for the harm on population health with respect to injuries by each weather event type
- The Multiplot 1.2 that presents the results of the summary for the harm on population health with respect to injuries by each weather event type was created.
  - 8.2.4.1 Create the components of Multiplot 1.2
    - Creates the four elementary plots that constitute the Multiplot 1.2:
      - 8.2.4.1.1 Create The Plot 1.2.1
        
        Displays the overall average number of injuries caused by each weather event type based on all the cases of weather events that resulted in non-zero injuries.
      - 8.2.4.1.2 Create The Plot 1.2.2
        
        Displays the average number of injuries caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero injuries.
      - 8.2.4.1.3 Create The Plot 1.2.3
        
        Displays the average number of injuries caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero injuries.
      - 8.2.4.1.4 Create The Plot 1.2.4
        
        Displays a comparison for each weather event type, of the average number of injuries for the 90% of its observations with the lowest impact versus the average number of injuries for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero injuries.
  - 8.2.4.2 Compose the Multiplot 1.2
    - Combines the four elementary plots to create the Multiplot 1.2.

8.2.1 Extract the target data for harm on population health with respect to injuries

In order to examine the harm on population health with respect to injuries caused by each weather event type, the variables REFNUM, EVENT_TYPE and INJURIES were selected from the table with the processed data and only the observations that refer to weather events that resulted in non-zero injuries were extracted.

Furthermore, to avoid highly misleading statistics caused by the small number of observations for some weather event types, a lower bound of 10 weather events that caused non-zero injuries (for each of the included weather event types) was selected (subjectively by the analyst) and applied.

Although this lower bound may seem (and generally is) too small to yield trustworthy statistics, it was considered “good enough”, taking into account that:

the analysis focuses on describing historical data without trying to make inferences that would demand substantially bigger samples. Even so, a statistic based on fewer than 10 observations could not be taken seriously, especially in cases (such as this analysis) where the distribution of injuries for each weather event type was skewed.
a period of 10 years (from 2001 to 2011), in which the observations used in the analysis occurred, is a relatively short time to produce big samples of weather events that caused non-zero injuries for some of the weather event types. Thus, if a higher bound such as 100 or 300 had been selected to get more robust statistics, the majority of weather event types would have been excluded, making the results of the analysis trivial.

# Extract the required variables and the target data subset of observations 
# for the harm on population health with respect to injuries.
target_data_____harm_on_population_health_____injuries <- processed_data[
  ## Extract only the observations that have resulted in non-zero injuries.
  INJURIES > 0,
  ## Select only the relevant variables. 
  list(REFNUM, EVENT_TYPE, INJURIES)
  ][
    ### Keep only the observations that correspond to the weather event types 
    ### for which there are at least 10 weather events available.
    EVENT_TYPE %in% 
      names(table(EVENT_TYPE)[table(EVENT_TYPE) >= 10])
    ]
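
The group-size filter can be illustrated on a toy example. The sketch below uses base R instead of data.table, and both the data frame `toy` and its values are hypothetical (they are not taken from the Storm Events dataset); it only shows the idea of keeping the event types that have at least 10 observations.

```r
# Toy data (hypothetical values, not taken from the Storm Events dataset).
toy <- data.frame(
  EVENT_TYPE = c(rep("TORNADO", 12), rep("HAIL", 10), rep("DUST DEVIL", 3)),
  INJURIES   = c(1:12, rep(1, 10), c(2, 5, 1))
)

# Count the observations available for each weather event type.
n_per_type <- table(toy$EVENT_TYPE)

# Keep only the weather event types with at least 10 observations.
kept_types <- names(n_per_type)[n_per_type >= 10]
toy_filtered <- toy[toy$EVENT_TYPE %in% kept_types, ]

# The type with only 3 observations is dropped.
sort(unique(toy_filtered$EVENT_TYPE))
```

The expression `names(table(EVENT_TYPE)[table(EVENT_TYPE) >= 10])` in the chunk above follows the same logic, evaluated inside the data.table.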

The table with the target data for the harm on population health with respect to injuries consists of 5581 observations.

# Print the structure of the table with the target data subset 
# for the harm on population health with respect to injuries.
str(target_data_____harm_on_population_health_____injuries)

## Classes 'data.table' and 'data.frame':   5581 obs. of  3 variables:
##  $ REFNUM    : int  413614 413649 413652 413663 413737 413743 413746 413757 413763 413795 ...
##  $ EVENT_TYPE: chr  "TORNADO" "THUNDERSTORM WIND" "THUNDERSTORM WIND" "THUNDERSTORM WIND" ...
##  $ INJURIES  : int  4 2 4 1 6 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

The variable EVENT_TYPE includes 27 distinct weather event types, for most of which the variable INJURIES was highly positively skewed.

# Create a kable to present some facts about the table with the target data 
# for the harm on population health with respect to injuries.
kable(
  x = target_data_____harm_on_population_health_____injuries[
    order(EVENT_TYPE), 
    list(
      "N" = .N, 
      "SKEWNESS" = round(skewness(INJURIES), 4)
    ), 
    by = EVENT_TYPE
    ],
  caption = paste0(
    "Table 8.2.1-1: ",
    "Facts about the table with the target data subset of observations ", 
    "for the harm on population health with respect to injuries."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
    general = "The skewness was rounded to 4 decimal places."
  )

Table 8.2.1-1: Facts about the table with the target data subset of observations for the harm on population health with respect to injuries.
EVENT_TYPE	N	SKEWNESS
AVALANCHE	80	3.2455
BLIZZARD	12	2.0441
DEBRIS FLOW	12	0.6818
DENSE FOG	20	1.4182
DUST DEVIL	10	1.8590
DUST STORM	22	1.5095
EXCESSIVE HEAT	86	4.1751
FLASH FLOOD	190	9.4282
FLOOD	61	4.6609
HAIL	109	5.8015
HEAT	36	2.1619
HEAVY RAIN	50	4.0900
HEAVY SNOW	31	4.3682
HIGH SURF	54	5.7692
HIGH WIND	220	10.7119
HURRICANE/TYPHOON	15	2.7730
ICE STORM	25	3.4714
LIGHTNING	1411	6.6360
MARINE THUNDERSTORM WIND	11	2.2867
RIP CURRENT	149	4.5935
STRONG WIND	142	2.9883
THUNDERSTORM WIND	1236	9.0224
TORNADO	1252	16.3086
TROPICAL STORM	19	3.8833
WILDFIRE	230	5.8510
WINTER STORM	51	3.1228
WINTER WEATHER	47	4.1679
Note:
The skewness was rounded to 4 decimal places.

It is worth noting that the weather event types with the highest numbers of observations also tended to show the highest skewness for the number of injuries. This indicates that the corresponding distributions of injuries have heavy tails that could not be observed when only a few observations were available.
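
This section does not show which package supplies `skewness()`. Assuming it is the usual moment-based sample skewness (the third central moment divided by the second central moment raised to the power 3/2, as computed for example by the `moments` package), a base R sketch looks as follows. The helper name `skewness_sketch` is hypothetical.

```r
# Moment-based sample skewness: m3 / m2^(3/2).
# Assumption: this matches the definition used by the report's skewness().
skewness_sketch <- function(x) {
  centered <- x - mean(x)
  m2 <- mean(centered^2)  # second central moment
  m3 <- mean(centered^3)  # third central moment
  m3 / m2^(3 / 2)
}

# One extreme case among many small ones: strongly positive skewness.
skewness_sketch(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 30))

# Symmetric values: skewness of zero.
skewness_sketch(c(1, 2, 3, 4, 5))

# All observations identical: 0 / 0, so the skewness is undefined (NaN).
skewness_sketch(c(2, 2, 2, 2))
```

The last case is the one described in the legend of the plots, where the skewness could not be determined because all observations took the same value.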

8.2.2 Process the target data for harm on population health with respect to injuries

To create the table with the processed data for the harm on population health with respect to injuries from the corresponding target data subset for this perspective, a new variable was created that divides the observations for each of the included weather event types in two complementary levels:

one that contains the 90% of cases with lowest impact
the other that contains the 10% of cases with highest impact

This decision was made because of the high skewness observed in the variable INJURIES for most weather event types, which indicates that the underlying distributions of such phenomena have heavy tails that cause this heterogeneity in the observations. As a result, the majority of cases that resulted in non-zero injuries caused only a small number of injuries, while the few cases with the highest impact caused many.

Given that the average number of injuries is used to determine which weather event types were the most harmful to population health (with respect to injuries), and that the average poorly represents highly skewed variables because it is strongly affected by the most extreme values, it was considered necessary to examine the subsets created by those two levels in order to obtain an insightful picture.

# Create the table with the processed data 
# for the harm on population health with respect to injuries.
processed_data_____harm_on_population_health_____injuries <- 
  target_data_____harm_on_population_health_____injuries[
    ,
    ## Create a new variable that divides the observations
    ## for each weather event type into two complementary groups:  
    ##   - the 90% of weather events that resulted in the fewest injuries
    ##   - the 10% of weather events that resulted in the most injuries
    BIN_GROUP_PER_EVENT_TYPE := (function(x, p_bins) {
      
      # add 0 and 1 to the start and the end, respectively, of the 
      # percentiles supplied in the argument 'p_bins' 
      # and sort the result in ascending order
      p_bins_increasing <- sort(c(0, p_bins, 1))
      
      # create the character strings that label the bins defined by the 
      # argument 'p_bins'; they become the values of the new variable
      bin_labels <- paste0("(", p_bins_increasing[-length(p_bins_increasing)]*100,
                           "% - ", p_bins_increasing[-1]*100, "%]")
      
      # identify the number of occurrences that correspond to each label
      n_times <- vapply(2:length(p_bins_increasing),
                        function(i) {
                          as.integer(floor(length(x) * p_bins_increasing[i]) -
                                       floor(length(x) * p_bins_increasing[i - 1]))
                        }, integer(1))
      
      # repeat each label as many times as its number of occurrences
      x_bins_expanded <- rep(x = bin_labels, times = n_times)
      
      # reorder the labels to match the positions of the values in 'x'
      x_bins_expanded_reordered <- x_bins_expanded[order(seq_along(x)[order(x)])]
      
      ## Coerce the character vector with the labels of bins to an ordered factor
      ## (the levels are given explicitly, so the bin order does not depend on sorting)
      x_bins_factor <- factor(x_bins_expanded_reordered, levels = bin_labels, ordered = TRUE)
      
    })(INJURIES, 0.9)
    , by = EVENT_TYPE
  ][
    ## Coerce the EVENT_TYPE variable to a factor
    , EVENT_TYPE := as.factor(EVENT_TYPE) 
  ]
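
To see how the binning rule above behaves, the same counting logic can be replayed on a small vector. The helper `bin_sketch` below is a hypothetical, simplified base R rewrite that handles two bins only; it is not the function used in the main script, and the `injuries` values are made up.

```r
# Simplified two-bin version of the rule: the floor(n * p) smallest values
# go to the lower bin, the remaining values go to the upper bin.
bin_sketch <- function(x, p = 0.9) {
  n_low <- floor(length(x) * p)
  bin_labels <- c(
    paste0("(0% - ", p * 100, "%]"),
    paste0("(", p * 100, "% - 100%]")
  )
  # Rank of each value in ascending order (ties are broken by position).
  position <- order(order(x))
  factor(
    ifelse(position <= n_low, bin_labels[1], bin_labels[2]),
    levels = bin_labels,
    ordered = TRUE
  )
}

# Twelve hypothetical injury counts for one weather event type.
injuries <- c(4, 2, 4, 1, 6, 1, 1, 1, 1, 1, 50, 3)
bins <- bin_sketch(injuries)

# Ten observations fall in the lower bin and two in the upper bin.
table(bins)

# The two observations with the highest impact, in their original order.
injuries[bins == "(90% - 100%]"]
```

Because the split is based on counts rather than on a quantile threshold, tied values can end up on different sides of the cut, but the two groups always have the intended sizes.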

The table with the processed data for the harm on population health with respect to injuries contains 4 variables:

REFNUM (int) : an id that uniquely identifies each observation
EVENT_TYPE (Factor w/ 27 levels) : the type of each weather event
INJURIES (int) : the number of injuries
BIN_GROUP_PER_EVENT_TYPE (Ord.factor w/ 2 levels) : a factor that divides the observations for each weather event type into two complementary levels, one with the 90% of observations with the lowest impact and another with the 10% of observations with the highest impact.

and 5581 observations.

# Print the structure of the table with the processed data 
# for the harm on population health with respect to injuries.
str(processed_data_____harm_on_population_health_____injuries)

## Classes 'data.table' and 'data.frame':   5581 obs. of  4 variables:
##  $ REFNUM                  : int  413614 413649 413652 413663 413737 413743 413746 413757 413763 413795 ...
##  $ EVENT_TYPE              : Factor w/ 27 levels "AVALANCHE","BLIZZARD",..: 23 22 22 22 22 22 22 23 15 18 ...
##  $ INJURIES                : int  4 2 4 1 6 1 1 1 1 1 ...
##  $ BIN_GROUP_PER_EVENT_TYPE: Ord.factor w/ 2 levels "(0% - 90%]"<"(90% - 100%]": 1 1 1 1 2 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

8.2.3 Summarize the processed data for harm on population health with respect to injuries by each weather event type

To evaluate the harm on population health by each weather event type with respect to injuries, a simplistic approach was adopted:

the weather event types were ranked from the most harmful to the least based on the overall average number of injuries of the weather events that resulted in non-zero injuries

The overall average number of injuries caused by each weather event type was initially examined along with the skewness of the number of injuries for each weather event type. In most cases the skewness was high (or even extremely high), so it was possible that the overall mean misrepresented the consequences of each weather event type.

That is why the average number of injuries for the 90% of weather events with the lowest impact and the average number of injuries for the 10% of weather events with the highest impact were also computed and examined.

Note that, for the average that refers to the 10% of cases with the highest impact, only a few observations were available for many weather event types, so the corresponding mean values should be interpreted with caution.
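
How few observations end up in the top group can be worked out directly from the counts in Table 8.2.1-1. The sketch below assumes the floor-based rule from subsection 8.2.2, where the lower group receives floor(0.9 * N) observations; the object names are illustrative only.

```r
# Observation counts taken from Table 8.2.1-1 for a few weather event types.
n_obs <- c(
  "DUST DEVIL" = 10,
  "BLIZZARD"   = 12,
  "AVALANCHE"  = 80,
  "LIGHTNING"  = 1411
)

# Size of the "(90% - 100%]" group: N - floor(0.9 * N).
n_high <- n_obs - floor(0.9 * n_obs)
n_high
```

For the event types that barely pass the lower bound of 10 observations, the average for the 10% of cases with the highest impact is therefore based on one or two weather events.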

# Create the table with the summary for the harm on population health 
# with respect to injuries for each weather event type.
summary_____harm_on_population_health______injuries <- 
  processed_data_____harm_on_population_health_____injuries[
  ,
  list(
    ## The total number of observations by each weather event type.
    "N" = .N,
    ## The average number of injuries caused by each weather event type.
    "AVRG" = round(mean(INJURIES), 2),
    ## The skewness of injuries for the observations by each weather event type.
    "SKEWNESS" = round(skewness(INJURIES), 4),
    ## The number of observations for the 90% of cases with the lowest impact 
    ## by each weather event type.
    "N_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , .N],
    ## The average number of injuries caused by each weather event type 
    ## for the 90% of cases with the lowest impact.
    "AVRG_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , round(mean(INJURIES), 2)],
    ## The skewness of injuries for the 90% of cases with the lowest impact 
    ## by each weather event type.
    "SKEWNESS_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , round(skewness(INJURIES), 4)],
    ## The number of observations for the 10% of cases with the highest impact 
    ## by each weather event type.
    "N_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , .N],
    ## The average number of injuries caused by each weather event type 
    ## for the 10% of cases with the highest impact.
    "AVRG_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , round(mean(INJURIES), 2)],
    ## The skewness of injuries for the 10% of cases with the highest impact 
    ## by each weather event type.
    "SKEWNESS_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , round(skewness(INJURIES), 4)]
  ),
  by = "EVENT_TYPE"
  ][
    ## The average number of injuries is used to rank the weather event types
    ## from the most harmful to the least.
    order(-AVRG),
    ## Create a variable with the rank of the harmfulness of each weather event type.
    RANK := 1:length(EVENT_TYPE)
    ][
      ,
      ## Reorder the variables at the table.
      list(
        RANK, EVENT_TYPE, N, AVRG, SKEWNESS, N_LOW, AVRG_LOW, SKEWNESS_LOW, N_HIGH, AVRG_HIGH, SKEWNESS_HIGH
      )
      ]
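
The three averages in the summary are linked: the overall average is the count-weighted average of the two group averages. The base R sketch below checks this on made-up values for a single hypothetical weather event type; the object names are illustrative and do not appear in the main script.

```r
# Twelve hypothetical injury counts for one weather event type.
injuries <- c(1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 6, 50)
n <- length(injuries)
n_low <- floor(0.9 * n)

# Split the sorted values into the two complementary groups.
sorted <- sort(injuries)
low  <- sorted[seq_len(n_low)]
high <- sorted[(n_low + 1):n]

avrg      <- mean(injuries)
avrg_low  <- mean(low)
avrg_high <- mean(high)

# The overall mean equals the count-weighted mean of the group means.
recombined <- (length(low) * avrg_low + length(high) * avrg_high) / n
c(AVRG = avrg, AVRG_LOW = avrg_low, AVRG_HIGH = avrg_high, RECOMBINED = recombined)
```

In the summary table the averages are rounded to 2 decimal places, so the same check on its columns (N_LOW, AVRG_LOW, N_HIGH, AVRG_HIGH, N) should hold only approximately.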

The results of the table with the summary for the harm on population health with respect to injuries by each weather event type that was created in this section were presented at the subsection 10.1.3 Most harmful event types with respect to injuries of the chapter 10 RESULTS.

The table with the summary for the harm on population health with respect to injuries by each weather event type was exported (as a serialized R object, written with saveRDS) to the following folder of the working directory:

outputs –> harm_on_population_health –> results

with filename:

summary_____harm_on_population_health______injuries.R

# Supply the filepath at which the table with the summary
# for the harm on population health will be exported.
filepath_____summary_____harm_on_population_health______injuries <-
  file.path(
    directory_tree_____outputs[[
      "filepath_____outputs_____harm_on_population_health_____results"
    ]],
    "summary_____harm_on_population_health______injuries.R"
  )

# Export the table with the summary for the harm on population health
# with respect to injuries.
saveRDS(
  object = summary_____harm_on_population_health______injuries,
  file = filepath_____summary_____harm_on_population_health______injuries
)
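
A file written with `saveRDS()` is read back with `readRDS()`, whatever its extension is. The sketch below shows the round trip on a toy table written to a temporary file; the object names and the values are hypothetical and only stand in for the real summary.

```r
# Toy stand-in for the summary table (hypothetical values).
toy_summary <- data.frame(
  EVENT_TYPE = c("TORNADO", "HAIL"),
  AVRG       = c(12.5, 1.75)
)

# Write the object to a temporary file that also uses the .R extension.
checkpoint_path <- tempfile(fileext = ".R")
saveRDS(object = toy_summary, file = checkpoint_path)

# Restore the object and compare it with the original.
restored <- readRDS(file = checkpoint_path)
identical(toy_summary, restored)
```

Anyone reproducing the analysis can restore the exported checkpoint the same way and compare it with their own result, for example with `all.equal()`. Note that the file is not an R script, so it cannot be loaded with `source()`.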

The main reason for exporting the file with the summary for the harm on population health with respect to injuries by each weather event type was to supply a checkpoint for any attempts to reproduce the analysis.

8.2.4 Visualize the results of the summary for the harm on population health with respect to injuries by each weather event type

From the table with the summary for the harm on population health by each weather event type with respect to injuries the Multiplot 1.2 was created to present an overview of the results for the three different aspects that were examined for this perspective.

Four elementary plots were created:

8.2.4.1.1 Create The Plot 1.2.1
- Displays the overall average number of injuries caused by each weather event type based on all the cases of weather events that resulted in non-zero injuries.
8.2.4.1.2 Create The Plot 1.2.2
- Displays the average number of injuries caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero injuries.
8.2.4.1.3 Create The Plot 1.2.3
- Displays the average number of injuries caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero injuries.
8.2.4.1.4 Create The Plot 1.2.4
- Displays a comparison for each weather event type, of the average number of injuries for the 90% of its observations with the lowest impact versus the average number of injuries for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero injuries.

which were then combined in order to obtain the Multiplot 1.2.

It constitutes the PART 2 of the Figure 1 that displays the overview of the harm on population health by each weather event type.

(Note that neither the Multiplot 1.2 nor the elementary plots that it contains were presented in this section because the assignment restricts the report to at least 1 but no more than 3 figures. They can be examined in subsection 10.1.1 Overview of results for the harm on population health of chapter 10 RESULTS, where Figure 1 is presented, of which the Multiplot 1.2 constitutes PART 2.)

8.2.4.1 Create the components of Multiplot 1.2

Creates four elementary plots to visualize the results for the aspects that were examined for the harm on population health with respect to injuries by each weather event type.

8.2.4.1.1 Create The Plot 1.2.1
- Displays the overall average number of injuries caused by each weather event type based on all the cases of weather events that resulted in non-zero injuries.
8.2.4.1.2 Create The Plot 1.2.2
- Displays the average number of injuries caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero injuries.
8.2.4.1.3 Create The Plot 1.2.3
- Displays the average number of injuries caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero injuries.
8.2.4.1.4 Create The Plot 1.2.4
- Displays a comparison for each weather event type, of the average number of injuries for the 90% of its observations with the lowest impact versus the average number of injuries for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero injuries.

8.2.4.1.1 Create The Plot 1.2.1

The Plot 1.2.1 displays the overall average number of injuries caused by each weather event type, taking into account only the observations that resulted in non-zero injuries.

The skewness of the number of injuries for the observations of each weather event type (based on which the overall average number of injuries was computed) is encoded in the color of the point and line associated with each of them.

# Create the Elementary Plot 1.2.1 that displays 
# the overall average number of injuries 
# by each weather event type for all cases. 
elementary_plot_1_2_1 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_population_health______injuries,
    mapping = aes(
      x = AVRG,
      ### Reverse the order of the factors for the EVENT_TYPE variable 
      ### to make them displayed alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE)
        )
      ) 
    )
  ) +
  ## Draw a square shaped point to the position that corresponds to 
  ## the average number of injuries caused by each weather event type, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(color = SKEWNESS),
    shape = 15, 
    size = 4.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average number of injuries.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG, 
      group = EVENT_TYPE, 
      color = SKEWNESS
    )
    ) +
  ## Draw a number that indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) based on the overall average 
  ## number of injuries it caused, inside the square point 
  ## that displays the average.
  geom_text(
    mapping = aes(
      label = RANK
    ), 
    size = 2.5
  ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average number of injuries for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This allows a single common legend to be included when 
    ### Multiplot 1.2 is composed from the four elementary plots. 
    limits = c(-2, 17), 
    midpoint = 7, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
  ) +
  ## Supply descriptive labels.  
  labs(
    title = "Plot 1.2.1", 
    subtitle = "Aspect: Overall",
    x = "Average Number of Injuries\n",
    y = "Weather Event Types \n"
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    )
  )
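
The fixed limits of the color scale only work if every skewness value falls inside them, since ggplot2 continuous scales by default treat values outside the limits as missing and draw them with the color reserved for missing values. The base R sketch below checks this for the overall skewness values listed in Table 8.2.1-1. It does not cover the skewness of the two subgroups, which is not listed in this section, and the object names are illustrative.

```r
# Overall skewness values copied from Table 8.2.1-1 (27 weather event types).
skewness_overall <- c(
  3.2455, 2.0441, 0.6818, 1.4182, 1.8590, 1.5095, 4.1751, 9.4282, 4.6609,
  5.8015, 2.1619, 4.0900, 4.3682, 5.7692, 10.7119, 2.7730, 3.4714, 6.6360,
  2.2867, 4.5935, 2.9883, 9.0224, 16.3086, 3.8833, 5.8510, 3.1228, 4.1679
)

# Limits passed to scale_color_gradient2() in the chunk above.
color_limits <- c(-2, 17)

# Smallest and largest overall skewness.
range(skewness_overall)

# TRUE when every value can be mapped to a color of the gradient.
all(skewness_overall >= color_limits[1] & skewness_overall <= color_limits[2])
```

The largest value belongs to TORNADO, which explains why the upper limit for the injuries plots is higher than the one used for the fatalities plots.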

8.2.4.1.2 Create The Plot 1.2.2

The Elementary Plot 1.2.2 displays the average number of injuries caused by each weather event type for the 90% of its cases with the lowest impact, among the observations that resulted in non-zero injuries.

The weather event types were matched with a number that represents the rank assigned to each of them, from the most harmful to the least with respect to population health, based on the overall average number of injuries they caused (so it is NOT based on the average number of injuries caused by the 90% of cases with the lowest impact of each weather event type).

The skewness of the number of injuries for the observations of each weather event type (based on which the average number of injuries for the 90% of cases with the lowest impact was computed) is encoded in the color of the point and line associated with each of them.

# Create the Elementary Plot 1.2.2 that displays 
# the average number of injuries by each weather event type 
# for the 90% of its cases with the lowest impact.
elementary_plot_1_2_2 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_population_health______injuries,
    mapping = aes(
      x = AVRG_LOW,
      ### Reverse the order of the factors for the EVENT_TYPE variable 
      ### to display them alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE)
        )
      ) 
    )
  ) +
  ## Draw a circle shaped point to the position that corresponds to 
  ## the average number of injuries caused by each weather event type
  ## for the 90% of its cases with the lowest impact, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(
      color = SKEWNESS_LOW
    ), 
    size = 3.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average number of injuries 
  ## for the 90% of its cases with the lowest impact.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG_LOW, 
      group = EVENT_TYPE, 
      color = SKEWNESS_LOW
    )
  ) +
  ## Draw a number that indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) based on the overall average 
  ## number of injuries it caused, inside the circle shaped point 
  ## that displays the average.
  geom_text(
    mapping = aes(
      label = RANK
    ), 
    size = 2
    ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average number of injuries for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 1.2 will be composed from the four elementary plots.
    limits = c(-2, 17), 
    midpoint = 7, 
    low = "lightgreen",
    mid = "orange",
    high = "purple"
    ) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 1.2.2",
    subtitle = "Aspect: 90% of cases with the lowest impact",
    x = paste0(
      "Average Number of Injuries for the 90% ", "\n",
      "of Observations with the Lowest Impact" 
    )
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    ),
    ### Remove the text, ticks and title of the y axis.
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank()
  )
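
The level reversal used for the y axis in the plot above can be illustrated in isolation. The following is a minimal base R sketch on a hypothetical three-level factor (the toy values are assumptions, not data from the report). It relies on the fact that ggplot2 places the first level of a discrete y axis at the bottom, so reversing the levels shows them alphabetically from top to bottom.

```r
# Toy event types (hypothetical); the real EVENT_TYPE factor has more levels.
event_type <- factor(c("TORNADO", "FLOOD", "HAIL"))

# By default the levels are sorted alphabetically.
default_levels <- levels(event_type)

# Reversing the levels changes only the ordering of the levels,
# not the value stored for each observation.
event_type_reversed <- factor(event_type, levels = rev(levels(event_type)))
reversed_levels <- levels(event_type_reversed)

print(default_levels)
print(reversed_levels)
```

Only the order in which the categories are drawn is affected; the observations keep their values.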

8.2.4.1.3 Create The Plot 1.2.3

Elementary Plot 1.2.3 displays, for each weather event type, the average number of injuries over the 10% of its cases with the highest impact, computed only from the observations that resulted in non-zero injuries.

Each weather event type is labeled with its rank (from the most harmful to the least with respect to population health). The rank is based on the overall average number of injuries it caused, NOT on the average over the 10% of its cases with the highest impact.

The skewness of the number of injuries over the observations from which this average was computed was encoded in the color of the point and line associated with each weather event type.

# Create the Elementary Plot 1.2.3 that displays 
# the average number of injuries by each weather event type 
# for the 10% of its cases with the highest impact.
elementary_plot_1_2_3 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_population_health______injuries,
    mapping = aes(
      x = AVRG_HIGH,
      ### Reverse the order of the factors for the EVENT_TYPE variable 
      ### to display them alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE)
        )
      ) 
    )
  ) +
  ## Draw a diamond shaped point to the position that corresponds to 
  ## the average number of injuries caused by each weather event type
  ## for the 10% of its cases with the highest impact, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(
      color = SKEWNESS_HIGH
    ), 
    shape = 18, 
    size = 4.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average number of injuries 
  ## for the 10% of its cases with the highest impact.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG_HIGH, 
      group = EVENT_TYPE, 
      color = SKEWNESS_HIGH
    )
  ) +
  ## Draw a number that indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) based on the overall average 
  ## number of injuries it caused, inside the diamond shaped point 
  ## that displays the average.
  geom_text(
    mapping = aes(
      label = RANK
    ),
    size = 2
  ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average number of injuries for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 1.2 will be composed from the four elementary plots.
    limits = c(-2, 17), 
    midpoint = 7, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
  ) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 1.2.3",
    subtitle = "Aspect: 10% of cases with the highest impact",
    x = paste0(
      "Average Number of Injuries for the 10% ", "\n", 
      "of Observations with the Highest Impact" 
    )
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    ),
    ### Remove the text, ticks and title of the y axis 
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank()
  )

8.2.4.1.4 Create The Plot 1.2.4

Plot 1.2.4 displays a compact overview of all three aspects that were examined for the harm on population health with respect to injuries.

For each weather event type, the average number of injuries for the 90% of cases with the lowest impact is plotted against the average number of injuries for the 10% of cases with the highest impact.

# Create the Elementary Plot 1.2.4 that displays 
# by each weather event type the comparison of 
# the average number of injuries 
# for the 90% of cases with the lowest impact
# versus the average number of injuries 
# for the 10% of cases with the highest impact.
elementary_plot_1_2_4 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_population_health______injuries,
    mapping = aes(
      x = AVRG_HIGH, 
      y = AVRG_LOW
    )
  ) +
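  ## Draw a filled circle at the position that corresponds to the two averages 
  ## of each weather event type, of which the fill indicates the skewness 
  ## (variable SKEWNESS).
  geom_point(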
    mapping = aes(
      fill = SKEWNESS
    ), 
    shape = 21
  ) +
  ## Draw a label with a number that indicates the rank assigned 
  ## to each weather event type (from the most harmful to the least) 
  ## based on the overall average number of injuries it caused.
  geom_label_repel(
    mapping = aes(
      label = RANK, 
      fill = SKEWNESS
    ),
    size = 2.5
  ) +
  ## Adjust the scale for the fill of each label.
  scale_fill_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average number of injuries for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 1.2 will be composed from the four elementary plots.
    limits = c(-2, 17),
    midpoint = 7, 
    low = "lightgreen",
    mid = "orange", 
    high = "purple"
    ) +
  ## Set proper limits to the plot.
  xlim(c(-20, 550)) +
  ylim(c(-1, 17)) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 1.2.4",
    subtitle = paste0(
      "Comparison of the average number of injuries ", 
      "for the 90% of observations with the lowest impact ", 
      "versus the average number of injuries ", 
      "for the 10% of observations with the highest impact. "
    ),
    x = paste0(
      "Average Number of Injuries by each Weather Event Type ", 
      "for the 10% of its Observations with the Highest Impact"
    ),
    y = paste0(
      "Average Number of Injuries by each Weather Event Type ", "\n", 
      "for the 90% of its Observations with the Lowest Impact."
    ),
    ### Add a descriptive label for the legend.
    fill = paste0(
      "The color indicates the skewness ",
      "of injuries for each weather event type. ",
      "(the same color scale is used for all four plots of PART 2) ", "\n",
      "When the color of a bar is gray, the skewness was indeterminable ",
      "due to the fact that all observations for that weather event type ",
      "took the same value."
    )
  ) +
  ## Select a theme.
  theme_linedraw() +
  ## Customize the selected theme.
  theme(
    ### Adjust the legend.
    legend.position = "bottom",
    legend.direction = "horizontal",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    )
  )

8.2.4.2 Compose the Multiplot 1.2

The four elementary plots that were created from the results of the summary for the harm on population health with respect to injuries by each weather event type were combined into a single multiplot that displays the complete picture for this perspective.

# Create a multiplot that displays the overview of the summary 
# for the harm on population health with respect to injuries
# by each weather event type.
multiplot_1_2 <- arrangeGrob(
  grobs = list(
      
    # Title
    textGrob(
      label = paste0(
        "\n",
        "PART 2: Harm on population health by each weather event type ", 
        "with respect to injuries ", "\n", 
        "based on the cases of weather events ", 
        "that resulted in non-zero injuries.", "\n", 
        "\n"
      ),
       gp=gpar(
         fontsize = 16, 
         fontface = "bold"
       )
    ),
    
    # Subtitle
    textGrob(
      label = paste0(
          "\n", 
          "The results include only the weather event types, ", 
          "for which at least 10 observations ", 
          "that resulted in non-zero injuries were available. ", "\n",
          "The number associated with each weather event type ", 
          "represents the rank (from the most harmful to the least) ", 
          "which was assigned based on the overall average number of injuries.", "\n",
          "Because for most of the weather event types ", 
          "high positive skewness was observed for the number of injuries, ",
          "the average of the 90% of cases with lowest impact ", "\n",
          "and the 10% of cases with highest impact were reported ", 
          "to provide a more representative picture of their consequences.","\n",
          "\n"
      ),
       gp=gpar(
         fontsize = 14, 
         fontface = "bold"
       )
    ),
    
    # Plot 1.2.1
    # Elementary plot for the average number of injuries 
    # by each weather event type for all cases.
    elementary_plot_1_2_1,
    
    # ELEMENTARY PLOT 1.2.2
    # Elementary plot for the average number of injuries 
    # by each weather event type for 90% of cases with the lowest impact.
    elementary_plot_1_2_2,
    
    # ELEMENTARY PLOT 1.2.3
    # Elementary plot for the average number of injuries 
    # by each weather event type for 10% of cases with the highest impact.
    elementary_plot_1_2_3,
    
    # ELEMENTARY PLOT 1.2.4
    # Elementary Plot 1.2.4 for the comparison of 
    # the average number of injuries 
    # for the 90% of cases with the lowest impact versus 
    # the 10% of cases with the highest impact.
    elementary_plot_1_2_4
  ),
  # Set the layout of the title, the subtitle and the elementary plots.
  layout_matrix = 
    matrix(
      c(1,1,1,1,1,1,1,1,1,
        2,2,2,2,2,2,2,2,2,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6
      ),
      byrow = TRUE, 
      nrow = 13
    )
)
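
How the layout matrix above allots space can be checked without drawing anything. The sketch below rebuilds the same 13 x 9 matrix in base R (rep() is used only for brevity) and counts the cells assigned to each entry of the grobs list. Entries 1 and 2 are the title and subtitle, 3 to 5 are the Elementary Plots 1.2.1 to 1.2.3, 6 is the Elementary Plot 1.2.4, and NA marks cells that are left empty. The object names are illustrative and do not appear in the report.

```r
# Rebuild the layout matrix row by row (same values as in the call above).
layout <- matrix(
  c(rep(1, 9),
    rep(2, 9),
    rep(c(3, 3, 3, 3, 3, 4, 4, 5, 5), times = 5),
    rep(c(NA, 6, 6, 6, 6, 6, 6, 6, 6), times = 6)),
  byrow = TRUE,
  nrow = 13
)

# Number of cells occupied by each grob index (NA = empty cells).
cells_per_grob <- table(layout, useNA = "ifany")

# Number of layout rows spanned by each grob index.
rows_per_grob <- sapply(1:6, function(i) {
  sum(rowSums(layout == i, na.rm = TRUE) > 0)
})

print(cells_per_grob)
print(rows_per_grob)
```

Under this layout the title and subtitle each take one full row, the three lollipop plots share five rows side by side, and the comparison plot takes the last six rows with one empty column on its left.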

(Note that the Multiplot 1.2 is NOT presented in this section, due to the restriction imposed by the assignment to include in the report at least 1 but no more than 3 figures. It can be examined in the subsection 10.1.1 Overview of results for the harm on population health of the chapter 10 RESULTS, where the Figure 1 is presented, of which the Multiplot 1.2 constitutes the PART 2.)

8.3 Harm On Population Health With Respect To Casualties By Each Weather Event Type

Summary

The required variables and the target data subset of observations for the harm on population health with respect to casualties were extracted from the table with the processed data. A new variable was then created that divides the observations of each included weather event type into two complementary groups:

the 90% of observations with the lowest impact
the 10% of observations with the highest impact

The information for the harm on population health with respect to casualties was then summarized by each weather event type.

Three aspects were examined:

The overall average number of casualties by each weather event type.
The average number of casualties by each weather event type for the 90% of cases with the lowest impact.
The average number of casualties by each weather event type for the 10% of cases with the highest impact.

For each aspect, the average number of casualties by each weather event type, the number of its available observations (based on which the average was computed) and their skewness were examined.

The overall average number of casualties was used as the main criterion to determine which weather event types caused the most harm on population health with respect to casualties. The other two aspects should also be taken into account to obtain a more insightful and complete picture of their consequences, especially because the casualties were highly positively skewed for most weather event types.

The table with the results for the harm on population health with respect to casualties by each weather event type is presented in the subsection 10.1.4 Most harmful event types with respect to casualties of the chapter 10 RESULTS.

Finally, the Multiplot 1.3 was created to visualize the results of the harm on population health with respect to casualties by each weather event type.

(Note that neither the Multiplot 1.3 nor the elementary plots that it contains are presented in this section, due to the restriction imposed by the assignment to include in the report at least 1 but no more than 3 figures. It can be examined in the subsection 10.1.1 Overview of results for the harm on population health of the chapter 10 RESULTS, where the Figure 1 is presented, of which the Multiplot 1.3 constitutes the PART 3.)

Steps

8.3.1 Extract the target data for harm on population health with respect to casualties
- The target data subset of observations needed to evaluate the harm on population health with respect to casualties by each weather event type was extracted from the table with the processed data.
8.3.2 Process the target data for harm on population health with respect to casualties
- The table with the target data subset for the harm on population health with respect to casualties was processed to create the table with processed data for the harm on population health with respect to casualties.
8.3.3 Summarize the processed data for harm on population health with respect to casualties by each weather event type
- The harm on population health with respect to casualties by each weather event type was evaluated over various aspects.
8.3.4 Visualize the results of the summary for the harm on population health with respect to casualties by each weather event type
- The Multiplot 1.3 that presents the results of the summary for the harm on population health with respect to casualties by each weather event type was created.
  - 8.3.4.1 Create the components of Multiplot 1.3
    - Creates the four elementary plots that constitute the Multiplot 1.3:
      - 8.3.4.1.1 Create The Plot 1.3.1
        
        Displays the overall average number of casualties caused by each weather event type based on all the cases of weather events that resulted in non-zero casualties.
      - 8.3.4.1.2 Create The Plot 1.3.2
        
        Displays the average number of casualties caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero casualties.
      - 8.3.4.1.3 Create The Plot 1.3.3
        
        Displays the average number of casualties caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero casualties.
      - 8.3.4.1.4 Create The Plot 1.3.4
        
        Displays a comparison for each weather event type, of the average number of casualties for the 90% of its observations with the lowest impact versus the average number of casualties for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero casualties.
  - 8.3.4.2 Compose the Multiplot 1.3
    - Combines the four elementary plots to create the Multiplot 1.3.

8.3.1 Extract the target data for harm on population health with respect to casualties

In order to examine the harm on population health with respect to casualties caused by each weather event type, the variables REFNUM, EVENT_TYPE and CASUALTIES were selected from the table with the processed data and only the observations that refer to weather events that resulted in non-zero casualties were extracted.

Furthermore, to avoid highly misleading statistics due to the small number of observations for some of the weather event types, a lower bound of 10 weather events that caused non-zero casualties (for each of the included weather event types) was selected (subjectively by the analyst) and applied.

Although this lower bound may seem (and generally is) too small to yield trustworthy statistics, it was considered to be “good enough”, taking into account that:

the analysis focuses on describing historical data without trying to make inferences that would demand substantially bigger samples. Even so, a statistic based on fewer than 10 observations could not be taken seriously, especially in cases (such as this analysis) where the distribution of casualties for each weather event type is skewed.
a period of 10 years (from 2001 to 2011), in which the observations used in the analysis occurred, is a relatively short time to produce big samples of weather events that caused non-zero casualties for some of the weather event types. Thus, if a higher bound had been selected to get more robust statistics (such as samples of 100 or 300), the majority of weather event types would have been excluded, making the results of the analysis trivial.

# Extract the required variables and the target data subset of observations 
# for the harm on population health with respect to casualties.
target_data_____harm_on_population_health_____casualties <- processed_data[
  ## Extract only the observations that have resulted in non-zero casualties.
  CASUALTIES > 0,
  ## Select only the relevant variables. 
  list(REFNUM, EVENT_TYPE, CASUALTIES)
  ][
    ### Keep only the observations that correspond to the weather event types 
    ### for which there are at least 10 weather events available.
    EVENT_TYPE %in% 
      names(table(EVENT_TYPE)[table(EVENT_TYPE) >= 10])
    ]
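
The effect of the lower bound can be tried on a toy vector. The following base R sketch uses made-up counts (12, 10 and 3 events for three event types, chosen purely for illustration) and applies the same rule as the code above: keep only the observations whose event type occurs at least 10 times.

```r
# Hypothetical event types with made-up frequencies.
event_type <- c(rep("TORNADO", 12), rep("FLOOD", 10), rep("DUST DEVIL", 3))

# Count the observations available for each event type.
counts <- table(event_type)

# Names of the event types that reach the lower bound of 10 observations.
kept_types <- names(counts)[counts >= 10]

# Keep only the observations that belong to those event types.
kept <- event_type[event_type %in% kept_types]

print(kept_types)
print(length(kept))
```

The event type with only 3 observations is dropped as a whole, so no statistic is ever computed from it.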

The table with the target data for the harm on population health with respect to casualties consists of 7936 observations.

# Print the structure of the table with the target data subset 
# for the harm on population health with respect to casualties.
str(target_data_____harm_on_population_health_____casualties)

## Classes 'data.table' and 'data.frame':   7936 obs. of  3 variables:
##  $ REFNUM    : int  413614 413649 413652 413663 413737 413743 413746 413757 413763 413795 ...
##  $ EVENT_TYPE: chr  "TORNADO" "THUNDERSTORM WIND" "THUNDERSTORM WIND" "THUNDERSTORM WIND" ...
##  $ CASUALTIES: int  4 2 5 1 6 1 1 3 2 1 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

The variable EVENT_TYPE includes 30 distinct weather event types, for most of which the variable CASUALTIES was highly positively skewed.

It is worth noting that the weather event types with the highest number of observations also tended to show the highest skewness for the values of casualties, which suggests that the corresponding distributions of casualties have a heavy tail that could not be observed when only a few observations were available.
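
To make the notion of skewness concrete, the sketch below computes a moment-based skewness (third central moment divided by the second central moment raised to the power 3/2) in base R. The helper name `moment_skewness` is hypothetical, and this definition is an assumption: the `skewness()` function called in the report comes from a package that is not shown in this section, and other estimators apply bias corrections that give slightly different values.

```r
# Moment-based sample skewness (an assumed definition, see the note above).
moment_skewness <- function(x) {
  centered <- x - mean(x)
  m2 <- mean(centered^2)  # second central moment
  m3 <- mean(centered^3)  # third central moment
  m3 / m2^1.5
}

# A symmetric sample has zero skewness.
skew_symmetric <- moment_skewness(c(1, 2, 3, 4, 5))

# One extreme value among small ones gives positive skewness.
skew_heavy_tail <- moment_skewness(c(1, 1, 1, 1, 20))

# When all values are equal, both moments are zero and the result is NaN.
skew_constant <- moment_skewness(c(2, 2, 2, 2))

print(c(skew_symmetric, skew_heavy_tail, skew_constant))
```

The last case shows why the skewness can be indeterminable for an event type whose observations all take the same value.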

# Create a kable to present some facts about the table with the target data 
# for the harm on population health with respect to casualties.
kable(
  x = target_data_____harm_on_population_health_____casualties[
    order(EVENT_TYPE), 
    list(
      "N" = .N, 
      "SKEWNESS" = round(skewness(CASUALTIES), 4)
    ), 
    by = EVENT_TYPE
    ],
  caption = paste0(
    "Table 8.3.1-1: ",
    "Facts about the table with the target data subset of observations ", 
    "for the harm on population health with respect to casualties."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
    general = "The skewness was rounded to 4 decimal places."
  )

Table 8.3.1-1: Facts about the table with the target data subset of observations for the harm on population health with respect to casualties.
EVENT_TYPE	N	SKEWNESS
AVALANCHE	180	2.3975
BLIZZARD	22	2.3705
COLD/WIND CHILL	76	5.0297
DEBRIS FLOW	19	2.2183
DENSE FOG	20	1.3831
DUST DEVIL	12	2.1224
DUST STORM	23	1.5025
EXCESSIVE HEAT	350	8.3298
EXTREME COLD/WIND CHILL	107	4.3053
FLASH FLOOD	540	14.4341
FLOOD	231	9.3312
HAIL	110	5.8303
HEAT	154	5.2894
HEAVY RAIN	75	5.0249
HEAVY SNOW	45	5.2993
HIGH SURF	119	8.3730
HIGH WIND	279	11.3363
HURRICANE/TYPHOON	33	4.4573
ICE STORM	38	4.3115
LIGHTNING	1657	6.9576
MARINE STRONG WIND	16	1.9270
MARINE THUNDERSTORM WIND	17	2.3442
RIP CURRENT	475	6.9329
STRONG WIND	211	3.0745
THUNDERSTORM WIND	1364	9.4260
TORNADO	1327	17.6038
TROPICAL STORM	34	5.3288
WILDFIRE	244	6.5566
WINTER STORM	84	3.9675
WINTER WEATHER	74	5.2237
Note:
The skewness was rounded to 4 decimal places.

8.3.2 Process the target data for harm on population health with respect to casualties

To create the table with the processed data for the harm on population health with respect to casualties from the corresponding target data subset for this perspective, a new variable was created that divides the observations for each of the included weather event types in two complementary levels:

one that contains the 90% of cases with lowest impact
the other that contains the 10% of cases with highest impact

This decision was made due to the high skewness that was observed for the values of the variable CASUALTIES for most weather event types, which indicates that the underlying distributions of such phenomena have a heavy tail that causes this heterogeneity in the observations. As a result, a small number of casualties was observed for the majority of cases that resulted in non-zero casualties, while the few cases with the highest impact caused many casualties.

The average number of casualties is used to determine which weather event types were the most harmful to population health (with respect to casualties), but the average does not represent well the distribution of a variable with high skewness, as it is strongly affected by the most extreme values. It was therefore considered necessary to examine the subsets created by those two levels in order to obtain an insightful picture.

# Create the table with the processed data 
# for the harm on population health with respect to casualties.
processed_data_____harm_on_population_health_____casualties <- 
  target_data_____harm_on_population_health_____casualties[
    ,
    ## Create a new variable that divides the observations
    ## for each weather event type into two complementary groups:  
    ##   - the 90% of weather events that resulted in the lowest casualties
    ##   - the 10% of weather events that resulted in the highest casualties
    BIN_GROUP_PER_EVENT_TYPE := (function(x, p_bins) {
      
      # add 0 and 1 to the start and the end, respectively, 
      # of the percentiles supplied at the argument 'p_bins' 
      # and sort them in ascending order
      p_bins_increasing <- sort(c(0, p_bins, 1))
      
      # create the character strings that label the bins defined by the values supplied at 
      # the argument 'p_bins'; these will be the values of the new variable
      bin_labels <- paste0("(", p_bins_increasing[-length(p_bins_increasing)]*100,
                           "% - ", p_bins_increasing[-1]*100, "%]")
      
      # identify the number of occurrences that correspond to each label
      n_times <- vapply(2:length(p_bins_increasing),
                        function(i) {
                          as.integer(floor(length(x) * p_bins_increasing[i]) -
                                       floor(length(x) * p_bins_increasing[i - 1]))
                        }, integer(1))
      
      # repeat each label as many times as it occurs
      x_bins_expanded <- rep(x = bin_labels, times = n_times)
      
      # reorder the labels to match the positions of the values in 'x':
      # order(order(x)) gives the rank of each value (ties broken by position)
      x_bins_expanded_reordered <- x_bins_expanded[order(order(x))]
      
      # coerce the character vector with the bin labels to an ordered factor;
      # 'levels' (rather than 'labels') keeps the bins in their intended order
      x_bins_factor <- factor(x_bins_expanded_reordered, levels = bin_labels, ordered = TRUE)
      
    })(CASUALTIES, 0.9), 
    by = EVENT_TYPE
  ][
    ## Coerce the EVENT_TYPE variable to a factor.
    , EVENT_TYPE := as.factor(EVENT_TYPE) 
  ]
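
The binning logic above can be followed step by step on a single toy group. The sketch below is a simplified base R version for one vector of 10 made-up casualty counts and a single cut at 90%; it is an illustration under those assumptions, not a replacement for the function used in the report.

```r
# Made-up casualties for one hypothetical weather event type (10 events).
casualties <- c(3, 1, 40, 2, 1, 5, 1, 2, 1, 7)

bin_labels <- c("(0% - 90%]", "(90% - 100%]")

# Number of observations that fall into the lower bin: floor(10 * 0.9) = 9.
n <- length(casualties)
n_low <- floor(n * 0.9)

# Labels as they would appear if the values were sorted in ascending order.
labels_sorted <- rep(bin_labels, times = c(n_low, n - n_low))

# Rank of each value in the original order (ties broken by position).
position_in_sorted_order <- order(order(casualties))

# Give each observation the label that belongs to its rank.
bin <- factor(
  labels_sorted[position_in_sorted_order],
  levels = bin_labels,
  ordered = TRUE
)

print(data.frame(casualties, bin))
```

Only the event with 40 casualties lands in the upper bin; tied values are assigned by their position, so the split always has exactly floor(0.9 * n) observations in the lower bin.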

The table with the processed data for the harm on population health with respect to casualties contains 4 variables:

REFNUM (int) : an id that uniquely identifies each observation
EVENT_TYPE (Factor w/ 30 levels) : the type of each weather event
CASUALTIES (int) : the number of casualties
BIN_GROUP_PER_EVENT_TYPE (Ord.factor w/ 2 levels) : a factor that divides the observations for each weather event type into two complementary levels, one with the 90% of observations with the lowest impact and another with the 10% of observations with the highest impact.

and 7936 observations.

# Print the structure of the table with the processed data 
# for the harm on population health with respect to casualties.
str(processed_data_____harm_on_population_health_____casualties)

## Classes 'data.table' and 'data.frame':   7936 obs. of  4 variables:
##  $ REFNUM                  : int  413614 413649 413652 413663 413737 413743 413746 413757 413763 413795 ...
##  $ EVENT_TYPE              : Factor w/ 30 levels "AVALANCHE","BLIZZARD",..: 26 25 25 25 25 25 25 26 17 20 ...
##  $ CASUALTIES              : int  4 2 5 1 6 1 1 3 2 1 ...
##  $ BIN_GROUP_PER_EVENT_TYPE: Ord.factor w/ 2 levels "(0% - 90%]"<"(90% - 100%]": 1 1 2 1 2 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

8.3.3 Summarize the processed data for harm on population health with respect to casualties by each weather event type

To evaluate the harm on population health by each weather event type with respect to casualties, a simple approach was adopted:

the weather event types were ranked from the most harmful to the least based on the overall average number of casualties of the weather events that resulted in non-zero casualties

The overall average number of casualties caused by each weather event type was initially examined along with the skewness of the number of casualties for each weather event type. In most cases the skewness was high (or even extremely high), so it was possible that the overall mean misrepresented the consequences of each weather event type.

For this reason, the average number of casualties for the 90% of weather events with the lowest impact and the average number of casualties for the 10% of weather events with the highest impact were also computed and examined.

Note that for the average number of casualties over the 10% of cases with the highest impact, only a few observations were available for many weather event types, so the corresponding mean values should be interpreted with caution.
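
A small numeric example shows why the overall mean alone can mislead for skewed data. The sketch below uses 10 made-up casualty counts (nine events with 1 casualty and one event with 91) and compares the three averages examined in this section; the values are chosen only for illustration.

```r
# Made-up casualties for one hypothetical weather event type.
casualties <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 91)

# Sort the values and split them at the 90% mark.
sorted <- sort(casualties)
n_low <- floor(length(sorted) * 0.9)

mean_overall <- mean(sorted)                 # all cases
mean_low <- mean(sorted[seq_len(n_low)])     # 90% with the lowest impact
mean_high <- mean(sorted[-seq_len(n_low)])   # 10% with the highest impact

print(c(overall = mean_overall, low = mean_low, high = mean_high))
```

Here the overall mean is 10, although nine out of ten events caused a single casualty. The high-impact mean rests on one observation, which is the situation the caution above refers to.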

# Create the table with the summary for the harm on population health 
# with respect to casualties for each weather event type.
summary_____harm_on_population_health______casualties <- 
  processed_data_____harm_on_population_health_____casualties[
  ,
  list(
    ## The total number of observation by each weather event type.
    "N" = .N,
    ## The average number of casualties caused by each weather event type.
    "AVRG" = round(mean(CASUALTIES), 2),
    ## The skewness of casualties for the observations by each weather event type.
    "SKEWNESS" = round(skewness(CASUALTIES), 4),
    ## The number of observations for the 90% of cases with the lowest impact 
    ## by each weather event type.
    "N_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , .N],
    ## The average number of casualties caused by each weather event type 
    ## for the 90% of cases with the lowest impact.
    "AVRG_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , round(mean(CASUALTIES), 2)],
    ## The skewness of casualties for the 90% of cases with the lowest impact 
    ## by each weather event type.
    "SKEWNESS_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , round(skewness(CASUALTIES), 4)],
    ## The number of observations for the 10% of cases with the highest impact 
    ## by each weather event type.
    "N_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , .N],
    ## The average number of casualties caused by each weather event type 
    ## for the 10% of cases with the highest impact.
    "AVRG_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , round(mean(CASUALTIES), 2)],
    ## The skewness of casualties for the 10% of cases with the highest impact 
    ## by each weather event type.
    "SKEWNESS_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , round(skewness(CASUALTIES), 4)]
  ),
  by = "EVENT_TYPE"
  ][
    ## The average number of casualties is used to rank the weather event types
    ## from the most harmful to the least (the rank is assigned by reference,
    ## so the rows of the table themselves are not reordered).
    order(-AVRG),
    ## Create a variable with the harmfulness rank of each weather event type.
    RANK := 1:length(EVENT_TYPE)
    ][
      ,
      ## Reorder the variables at the table.
      list(
        RANK, EVENT_TYPE, N, AVRG, SKEWNESS, N_LOW, AVRG_LOW, SKEWNESS_LOW, N_HIGH, AVRG_HIGH, SKEWNESS_HIGH
      )
      ]

The table with the summary for the harm on population health with respect to casualties by each weather event type that was created in this section was presented in the subsection 10.1.4 Most harmful event types with respect to casualties of the chapter 10 RESULTS.

The table with the summary for the harm on population health with respect to casualties by each weather event type was exported with saveRDS (as a serialized R object, despite the .R extension), in the folder of the working directory:

outputs –> harm_on_population_health –> results

with filename:

summary_____harm_on_population_health______casualties.R

# Supply the filepath at which the table with the summary
# for the harm on population health will be exported.
filepath_____summary_____harm_on_population_health______casualties <-
  file.path(
    directory_tree_____outputs[[
      "filepath_____outputs_____harm_on_population_health_____results"
      ]],
    "summary_____harm_on_population_health______casualties.R"
  )

# Export the table with the summary for the harm on population health
# with respect to casualties.
saveRDS(
  object = summary_____harm_on_population_health______casualties,
  file = filepath_____summary_____harm_on_population_health______casualties
)

The main reason for exporting the file with the summary for the harm on population health with respect to casualties by each weather event type was to supply a checkpoint for any attempts to reproduce the analysis.
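
As a rough sketch of how such a checkpoint could be used, the example below writes a small table with saveRDS and restores it with readRDS. The table contents and the file name summary_checkpoint.R are hypothetical, and a temporary directory stands in for the outputs folder of the project.

```r
# A small hypothetical summary table (illustrative values only).
summary_table <- data.frame(
  RANK = 1:2,
  EVENT_TYPE = c("TORNADO", "FLOOD"),
  AVRG = c(12.5, 3.2)
)

# Hypothetical checkpoint location inside a temporary directory.
checkpoint_path <- file.path(tempdir(), "summary_checkpoint.R")

# Export the table as a serialized R object.
saveRDS(object = summary_table, file = checkpoint_path)

# Restore the table; readRDS returns the object, so it can be
# assigned to any name.
restored_table <- readRDS(file = checkpoint_path)

# Check that the restored table matches the exported one.
print(identical(summary_table, restored_table))
```

Because the file holds a serialized object rather than R source code, it has to be restored with readRDS and cannot be run with source, whatever its extension.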


8.3.4 Visualize the results of the summary for the harm on population health with respect to casualties by each weather event type

From the table with the summary for the harm on population health by each weather event type with respect to casualties the Multiplot 1.3 was created to present an overview of the results for the three different aspects that were examined for this perspective.

Four elementary plots were created:

8.3.4.1.1 Create The Plot 1.3.1
- Displays the overall average number of casualties caused by each weather event type based on all the cases of weather events that resulted in non-zero casualties.
8.3.4.1.2 Create The Plot 1.3.2
- Displays the average number of casualties caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero casualties.
8.3.4.1.3 Create The Plot 1.3.3
- Displays the average number of casualties caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero casualties.
8.3.4.1.4 Create The Plot 1.3.4
- Displays a comparison for each weather event type, of the average number of casualties for the 90% of its observations with the lowest impact versus the average number of casualties for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero casualties.

which were then combined in order to obtain the Multiplot 1.3.

It constitutes the PART 3 of the Figure 1 that displays the overview of the harm on population health by each weather event type.

(Note that neither the Multiplot 1.3 nor the elementary plots that it contains were presented in this section, due to the restriction imposed by the assignment to include in the report at least 1 but no more than 3 figures. They can be examined in the subsection 10.1.1 Overview of results for the harm on population health of the chapter 10 RESULTS, where the Figure 1 was presented, of which the Multiplot 1.3 constitutes the PART 3.)


8.3.4.1 Create the components of Multiplot 1.3

Creates four elementary plots to visualize the results for the aspects that were examined for the harm on population health with respect to casualties by each weather event type.

8.3.4.1.1 Create The Plot 1.3.1
- Displays the overall average number of casualties caused by each weather event type based on all the cases of weather events that resulted in non-zero casualties.
8.3.4.1.2 Create The Plot 1.3.2
- Displays the average number of casualties caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero casualties.
8.3.4.1.3 Create The Plot 1.3.3
- Displays the average number of casualties caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero casualties.
8.3.4.1.4 Create The Plot 1.3.4
- Displays a comparison for each weather event type, of the average number of casualties for the 90% of its observations with the lowest impact versus the average number of casualties for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero casualties.


8.3.4.1.1 Create The Plot 1.3.1

The Plot 1.3.1 displays the overall average number of casualties caused by each weather event type, taking into account only the observations that resulted in non-zero casualties.

The skewness of the number of casualties for the observations of each weather event type (based on which the overall average number of casualties was computed) was encoded in the color of the point and line associated with each of them.

# Create the Elementary Plot 1.3.1 that displays 
# the overall average number of casualties 
# by each weather event type for all cases. 
elementary_plot_1_3_1 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_population_health______casualties,
    mapping = aes(
      x = AVRG,
      ### Reverse the order of the factors for the EVENT_TYPE variable 
      ### to display them alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE)
        )
      ) 
    )
  ) +
  ## Draw a square shaped point to the position that corresponds to 
  ## the average number of casualties caused by each weather event type, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(color = SKEWNESS),
    shape = 15, 
    size = 4.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average number of casualties.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG, 
      group = EVENT_TYPE, 
      color = SKEWNESS
    )
    ) +
  ## Draw a number that indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) based on the overall average
  ## number of casualties it caused inside the square point 
  ## that displays the average.
  geom_text(
    mapping = aes(
      label = RANK
    ), 
    size = 2.5
  ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average number of casualties for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 1.3 will be composed from the four elementary plots. 
    limits = c(-2, 18), 
    midpoint = 8, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
  ) +
  ## Supply descriptive labels.  
  labs(
    title = "Plot 1.3.1", 
    subtitle = "Aspect: Overall",
    x = "Average Number of Casualties\n",
    y = "Weather Event Types \n"
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    )
  )
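
The fixed limits of the color scale are only safe if every skewness value falls inside them, because by default a continuous ggplot2 scale treats values outside its limits as missing and draws them in the grey NA color. A minimal base R check is sketched below; the skewness values are hypothetical and only the limits c(-2, 18) are taken from the code above.

```r
# Hypothetical skewness values for the three aspects
# (illustrative only, NOT the values of the report).
skewness_values <- data.frame(
  SKEWNESS = c(1.2, 9.8, 17.5),
  SKEWNESS_LOW = c(-0.4, 0.9, 2.1),
  SKEWNESS_HIGH = c(0.3, 4.6, NA)
)

# Limits shared by the color scales of all the elementary plots.
scale_limits <- c(-2, 18)

# Range of all the skewness values, ignoring missing ones.
observed_range <- range(unlist(skewness_values), na.rm = TRUE)

# TRUE when every value can be displayed with the shared scale.
fits_in_scale <-
  observed_range[1] >= scale_limits[1] &&
  observed_range[2] <= scale_limits[2]

print(observed_range)
print(fits_in_scale)
```

If the check returned FALSE after an update of the data, the limits (and probably the midpoint) would have to be widened in all four plots at once to keep a single common legend valid.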


8.3.4.1.2 Create The Plot 1.3.2

The Plot 1.3.2 displays, for each weather event type, the average number of casualties for the 90% of cases with the lowest impact, out of all the observations that resulted in non-zero casualties.

Each weather event type was labeled with its rank (from the most harmful to the least with respect to population health), which was assigned based on the overall average number of casualties it caused (so it is NOT based on the average number of casualties caused by the 90% of cases with the lowest impact of each weather event type).
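
The difference between the two possible rankings can be sketched with a toy table. The event types and averages below are hypothetical, and base R rank is used here instead of the data.table syntax of the report.

```r
# Hypothetical averages for three weather event types
# (illustrative values only).
summary_table <- data.frame(
  EVENT_TYPE = c("A", "B", "C"),
  AVRG = c(4.0, 9.5, 2.1),
  AVRG_LOW = c(2.5, 1.2, 1.9)
)

# Rank used for the labels: based on the overall average
# (1 = most harmful).
summary_table$RANK <- rank(-summary_table$AVRG, ties.method = "first")

# Alternative rank based on the 90% of cases with the lowest impact,
# computed only for comparison.
rank_by_low <- rank(-summary_table$AVRG_LOW, ties.method = "first")

print(summary_table)
print(rank_by_low)
```

In this toy table, event type B is ranked first overall but last among the low-impact averages, so the label printed in the plot does not have to agree with the position of the point along the x axis.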

The skewness of the number of casualties for the observations of each weather event type (based on which the average number of casualties for the 90% of cases with the lowest impact was computed) was encoded in the color of the point and line associated with each of them.

# Create the Elementary Plot 1.3.2 that displays 
# the average number of casualties by each weather event type 
# for the 90% of its cases with the lowest impact.
elementary_plot_1_3_2 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_population_health______casualties,
    mapping = aes(
      x = AVRG_LOW,
      ### Reverse the order of the factors for the EVENT_TYPE variable 
      ### to display them alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE)
        )
      ) 
    )
  ) +
  ## Draw a circle shaped point to the position that corresponds to 
  ## the average number of casualties caused by each weather event type
  ## for the 90% of its cases with the lowest impact, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(
      color = SKEWNESS_LOW
    ), 
    size = 3.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average number of casualties 
  ## for the 90% of its cases with the lowest impact.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG_LOW, 
      group = EVENT_TYPE, 
      color = SKEWNESS_LOW
    )
  ) +
  ## Draw a number that indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) based on the overall average
  ## number of casualties it caused inside the circular point 
  ## that displays the average.
  geom_text(
    mapping = aes(
      label = RANK
    ), 
    size = 2
    ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average number of casualties for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 1.3 will be composed from the four elementary plots.
    limits = c(-2, 18), 
    midpoint = 8, 
    low = "lightgreen",
    mid = "orange",
    high = "purple"
    ) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 1.3.2",
    subtitle = "Aspect: 90% of cases with the lowest impact",
    x = paste0(
      "Average Number of Casualties for the 90% ", "\n",
      "of Observations with the Lowest Impact" 
    )
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    ),
    ### Remove the text, ticks and title of the y axis 
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank()
  )


8.3.4.1.3 Create The Plot 1.3.3

The Plot 1.3.3 displays, for each weather event type, the average number of casualties for the 10% of cases with the highest impact, out of all the observations that resulted in non-zero casualties.

Each weather event type was labeled with its rank (from the most harmful to the least with respect to population health), which was assigned based on the overall average number of casualties it caused (so it is NOT based on the average number of casualties caused by the 10% of cases with the highest impact of each weather event type).

The skewness of the number of casualties for the observations of each weather event type (based on which the average number of casualties for the 10% of cases with the highest impact was computed) was encoded in the color of the point and line associated with each of them.

# Create the Elementary Plot 1.3.3 that displays 
# the average number of casualties by each weather event type 
# for the 10% of its cases with the highest impact.
elementary_plot_1_3_3 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_population_health______casualties,
    mapping = aes(
      x = AVRG_HIGH,
      ### Reverse the order of the factors for the EVENT_TYPE variable 
      ### to display them alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE)
        )
      ) 
    )
  ) +
  ## Draw a diamond shaped point to the position that corresponds to 
  ## the average number of casualties caused by each weather event type
  ## for the 10% of its cases with the highest impact, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(
      color = SKEWNESS_HIGH
    ), 
    shape = 18, 
    size = 4.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average number of casualties 
  ## for the 10% of its cases with the highest impact.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG_HIGH, 
      group = EVENT_TYPE, 
      color = SKEWNESS_HIGH
    )
  ) +
  ## Draw a number that indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) based on the overall average
  ## number of casualties it caused inside the diamond shaped point 
  ## that displays the average.
  geom_text(
    mapping = aes(
      label = RANK
    ),
    size = 2
  ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average number of casualties for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 1.3 will be composed from the four elementary plots.
    limits = c(-2, 18), 
    midpoint = 8, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
  ) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 1.3.3",
    subtitle = "Aspect: 10% of cases with the highest impact",
    x = paste0(
      "Average Number of Casualties for the 10% ", "\n", 
      "of Observations with the Highest Impact" 
    )
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    ),
    ### Remove the text, ticks and title of the y axis 
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank()
  )


8.3.4.1.4 Create The Plot 1.3.4

The Plot 1.3.4 displays a compact overview of all three aspects that were examined for the harm on population health with respect to casualties.

For each weather event type, the average number of casualties for the 90% of cases with the lowest impact (y axis) was plotted against the average number of casualties for the 10% of cases with the highest impact (x axis), while the fill color encodes the overall skewness.

# Create the Elementary Plot 1.3.4 that displays 
# by each weather event type the comparison of 
# the average number of casualties 
# for the 90% of cases with the lowest impact
# versus the average number of casualties 
# for the 10% of cases with the highest impact.
elementary_plot_1_3_4 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_population_health______casualties,
    mapping = aes(
      x = AVRG_HIGH, 
      y = AVRG_LOW
    )
  ) +
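  ## Draw a filled circle at the position that corresponds to the two averages
  ## of each weather event type, of which the fill indicates 
  ## the overall skewness of its observations.
  geom_point(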
    mapping = aes(
      fill = SKEWNESS
    ), 
    shape = 21
  ) +
  ## Draw a label with a number that indicates the rank assigned 
  ## to each weather event type (from the most harmful to the least) 
  ## based on the overall average number of casualties it caused.
  geom_label_repel(
    mapping = aes(
      label = RANK, 
      fill = SKEWNESS
    ),
    size = 2.5
  ) +
  ## Adjust the scale for the fill of each label.
  scale_fill_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average number of casualties for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 1.3 will be composed from the four elementary plots.
    limits = c(-2, 18),
    midpoint = 8, 
    low = "lightgreen",
    mid = "orange", 
    high = "purple"
    ) +
  ## Set proper limits to the plot.
  xlim(c(0, 320)) +
  ylim(c(0.5, 7)) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 1.3.4",
    subtitle = paste0(
      "Comparison of the average number of casualties ", 
      "for the 90% of observations with the lowest impact ", 
      "versus the average number of casualties ", 
      "for the 10% of observations with highest impact. "
    ),
    x = paste0(
      "Average Number of Casualties by each Weather Event Type ", 
      "for the 10% of its Observations with the Highest Impact"
    ),
    y = paste0(
      "Average Number of Casualties by each Weather Event Type ", "\n", 
      "for the 90% of its Observations with the Lowest Impact."
    ),
    ### Add a descriptive label for the legend.
    fill = paste0(
      "The color indicates the skewness ",
      "of casualties for each weather event type. ",
      "(the color scale is common to all four plots of PART 3) "
    )
  ) +
  ## Select a theme.
  theme_linedraw() +
  ## Customize the selected theme.
  theme(
    ### Adjust the legend.
    legend.position = "bottom",
    legend.direction = "horizontal",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    )
  )


8.3.4.2 Compose the Multiplot 1.3

The four elementary plots that were created from the results of the summary for the harm on population health with respect to casualties by each weather event type were combined to construct a single multiplot that displays the complete picture for this perspective.

# Create a multiplot that displays the overview of the summary 
# for the harm on population health with respect to casualties
# by each weather event type.
multiplot_1_3 <- arrangeGrob(
  grobs = list(
      
    # Title
    textGrob(
      label = paste0(
        "\n",
        "PART 3: Harm on population health by each weather event type ", 
        "with respect to casualties ", "\n", 
        "based on the cases of weather events ", 
        "that resulted in non-zero casualties.", "\n", 
        "\n"
      ),
       gp=gpar(
         fontsize = 16, 
         fontface = "bold"
       )
    ),
    
    # Subtitle
    textGrob(
      label = paste0(
          "\n", 
          "The results include only the weather event types, ", 
          "for which at least 10 observations ", 
          "that resulted in non-zero casualties were available. ", "\n",
          "The number associated with each weather event type ", 
          "represents the rank (from the most harmful to the least) ", 
          "which was assigned based on the overall average number of casualties.", "\n",
          "Because for most of the weather event types ", 
          "high positive skewness was observed for the number of casualties, ",
          "the average of the 90% of cases with lowest impact ", "\n",
          "and the 10% of cases with highest impact were reported ", 
          "to provide a more representative picture of their consequences.","\n",
          "\n"
      ),
       gp=gpar(
         fontsize = 14, 
         fontface = "bold"
       )
    ),
    
    # ELEMENTARY PLOT 1.3.1
    # Elementary plot for the average number of casualties 
    # by each weather event type for all cases.
    elementary_plot_1_3_1,
    
    # ELEMENTARY PLOT 1.3.2
    # Elementary plot for the average number of casualties 
    # by each weather event type for 90% of cases with the lowest impact.
    elementary_plot_1_3_2,
    
    # ELEMENTARY PLOT 1.3.3
    # Elementary plot for the average number of casualties 
    # by each weather event type for 10% of cases with the highest impact.
    elementary_plot_1_3_3,
    
    # ELEMENTARY PLOT 1.3.4
    # Elementary Plot 1.3.4 for the comparison of 
    # the average number of casualties 
    # for the 90% of cases with the lowest impact versus 
    # the 10% of cases with the highest impact.
    elementary_plot_1_3_4
  ),
  # Set the layout for these elementary plots
  layout_matrix = 
    matrix(
      c(1,1,1,1,1,1,1,1,1,
        2,2,2,2,2,2,2,2,2,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6
      ),
      byrow = TRUE, 
      nrow = 13
    )
)
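
The layout matrix above can be inspected on its own in base R to see how much of the grid each element receives. The sketch below rebuilds the same 13 x 9 matrix with rep; the reading of the numbers as positions in the grobs list (1 = title, 2 = subtitle, 3 to 6 = the four elementary plots) and of NA as an empty cell follows the way arrangeGrob is called in this section.

```r
# Rebuild the layout matrix row by row.
layout_matrix <- matrix(
  c(
    rep(1, times = 9),
    rep(2, times = 9),
    rep(c(3, 3, 3, 3, 3, 4, 4, 5, 5), times = 5),
    rep(c(NA, 6, 6, 6, 6, 6, 6, 6, 6), times = 6)
  ),
  byrow = TRUE,
  nrow = 13
)

# Dimensions of the grid (rows, columns).
print(dim(layout_matrix))

# Number of grid cells occupied by each element of the grobs list.
cells_per_grob <- table(layout_matrix)
print(cells_per_grob)

# Number of cells that are left empty.
empty_cells <- sum(is.na(layout_matrix))
print(empty_cells)
```

The counts show that the Plot 1.3.4 (element 6) takes the largest share of the grid, and that the first column of its rows is left empty, presumably to align it with the plotting area of the Plot 1.3.1 above it.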

(Note that the Multiplot 1.3 was NOT presented in this section, due to the restriction imposed by the assignment to include in the report at least 1 but no more than 3 figures. It can be examined in the subsection 10.1.1 Overview of results for the harm on population health of the chapter 10 RESULTS, where the Figure 1 was presented, of which the Multiplot 1.3 constitutes the PART 3.)


9 HARM ON ECONOMY

In this chapter an attempt was made to quantify the harm on economy based on the information from the table with the processed data.

The harm on economy was examined over three perspectives:

The harm on economy with respect to property damage caused by each weather event type, based on the observations for weather events that resulted in non-zero property damage in the United States in the period from 2001 to 2011.
The harm on economy with respect to crop damage caused by each weather event type, based on the observations for weather events that resulted in non-zero crop damage in the United States in the period from 2001 to 2011.
The harm on economy with respect to economic damage (sum of property damage and crop damage) caused by each weather event type, based on the observations for weather events that resulted in non-zero economic damage in the United States in the period from 2001 to 2011.

The weather event types for which fewer than 10 observations with non-zero harm were available with respect to a perspective of interest were omitted (from the analysis of that particular perspective), to avoid highly misleading statistics. Consequently, the subset of weather event types that was included is different for each of the three perspectives.
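
The exclusion rule can be sketched in base R as follows. The toy table, its values and the object names are hypothetical; only the variable names EVENT_TYPE and PROPERTY_DAMAGE and the threshold of 10 observations come from the report.

```r
# Hypothetical observations (illustrative values only).
events <- data.frame(
  EVENT_TYPE = c(rep("FLOOD", 12), rep("TSUNAMI", 3), rep("HAIL", 10)),
  PROPERTY_DAMAGE = c(rep(5000, 12), rep(20000, 3), rep(0, 2), rep(1000, 8))
)

# Keep only the observations with non-zero property damage.
nonzero_events <- events[events$PROPERTY_DAMAGE > 0, ]

# Count the remaining observations of each weather event type.
observations_per_type <- table(nonzero_events$EVENT_TYPE)

# Keep the weather event types with at least 10 such observations.
kept_types <- names(observations_per_type)[observations_per_type >= 10]
filtered_events <- nonzero_events[nonzero_events$EVENT_TYPE %in% kept_types, ]

print(observations_per_type)
print(kept_types)
```

Note that HAIL has 10 observations in the toy table but only 8 with non-zero damage, so it is dropped: the threshold applies after the zero-damage events have been removed, which is also why the retained event types can differ between perspectives.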

For each perspective, three aspects were examined:

The overall harm on economy caused by each weather event type.
The harm on economy caused by the 90% of cases with the lowest impact of each weather event type.
The harm on economy caused by the 10% of cases with the highest impact of each weather event type.

For every aspect, the sample size, the skewness and the mean of the values that encapsulated the harm with respect to each perspective were summarized by each weather event type and reported.

The results obtained for the harm on economy by each weather event type were presented at the section 10.2 Question 2 : Across the United States, which types of events have the greatest economic consequences? of the chapter 10 RESULTS.

(In compliance with the restrictions of the assignment, according to which at least 1 but no more than 3 figures should be included in the report, the multiplots as well as the elementary plots that they contain were NOT displayed separately and can ONLY be examined as PARTs of the Figure 2 at the subsection 10.2.1 Overview of results for the harm on economy of the chapter 10 RESULTS.)


9.1 Harm On Economy With Respect To Property Damage By Each Weather Event Type

Summary

The required variables and the target data subset of observations for the harm on economy with respect to property damage were extracted from the table with the processed data, and processed to create a new variable that divided the observations for each of the included weather event types into two complementary groups:

the 90% of observations with the lowest impact
the 10% of observations with the highest impact

before the information for the harm on economy with respect to property damage was summarized by each weather event type.

Three aspects were examined:

The overall average property damage by each weather event type.
The average property damage by each weather event type for the 90% of cases with the lowest impact.
The average property damage by each weather event type for the 10% of cases with the highest impact.

For each aspect, the average property damage by each weather event type, the number of its available observations (based on which the average was computed) and their skewness were examined.

The overall average property damage was used as the main criterion to determine which weather events caused the most harm on economy with respect to property damage, but it is important to take into account the other two aspects that were presented in order to obtain a more insightful and complete ‘picture’ of their consequences (especially given the fact that, for most of the weather event types, the property damage was highly positively skewed).

The table with the results for the harm on economy with respect to property damage by each weather event type was presented in the subsection 10.2.2 Most harmful event types with respect to property damage of the chapter 10 RESULTS.

Finally the Multiplot 2.1 was created to visualize the results of the harm on economy with respect to property damage by each weather event type.

(Note that neither the Multiplot 2.1 nor the elementary plots that it contains were presented in this section, due to the restriction imposed by the assignment to include in the report at least 1 but no more than 3 figures. They can be examined in the subsection 10.2.1 Overview of results for the harm on economy of the chapter 10 RESULTS, where the Figure 2 was presented, of which the Multiplot 2.1 constitutes the PART 1.)

Steps

9.1.1 Extract the target data for harm on economy with respect to property damage
- The target data subset of observations needed to evaluate the harm on economy with respect to property damage by each weather event type was extracted from the table with the processed data.
9.1.2 Process the target data for harm on economy with respect to property damage
- The table with target data subset for the harm on economy with respect to property damage was processed to create the table with processed data for the harm on economy with respect to property damage.
9.1.3 Summarize the processed data for harm on economy with respect to property damage by each weather event type
- The harm on economy with respect to property damage by each weather event type was evaluated over various aspects.
9.1.4 Visualize the results of the summary for the harm on economy with respect to property damage by each weather event type
- The Multiplot 2.1 that presents the results of the summary for the harm on economy with respect to property damage by each weather event type was created.
  - 9.1.4.1 Create the components of Multiplot 2.1
    - Creates the four elementary plots that constitute the Multiplot 2.1:
      - 9.1.4.1.1 Create The Plot 2.1.1
        
        Displays the overall average property damage caused by each weather event type based on all the cases of weather events that resulted in non-zero property damage.
      - 9.1.4.1.2 Create The Plot 2.1.2
        
        Displays the average property damage caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero property damage.
      - 9.1.4.1.3 Create The Plot 2.1.3
        
        Displays the average property damage caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero property damage.
      - 9.1.4.1.4 Create The Plot 2.1.4
        
        Displays a comparison for each weather event type, of the average property damage for the 90% of its observations with the lowest impact versus the average property damage for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero property damage.
  - 9.1.4.2 Compose the Multiplot 2.1
    - Combines the four elementary plots to create the Multiplot 2.1.

9.1.1 Extract the target data for harm on economy with respect to property damage

In order to examine the harm on economy with respect to property damage caused by each weather event type, the variables REFNUM, EVENT_TYPE and PROPERTY_DAMAGE were selected from the table with the processed data and only the observations that refer to weather events that resulted in non-zero property damage were extracted.

Furthermore, to avoid highly misleading statistics caused by the small number of observations for some of the weather event types, a lower bound of 10 weather events that caused non-zero property damage (for each of the included weather event types) was selected (subjectively by the analyst) and applied.

Although this lower bound may seem (and generally is) too small to yield trustworthy statistics, it was considered to be “good enough” for two reasons:

- The analysis focuses on describing historical data without trying to make inferences, which would demand substantially bigger samples. Even so, a statistic based on fewer than 10 observations could not be taken seriously, especially in cases (such as in this analysis) where the distribution of property damage for each weather event type was skewed.
- The observations used in the analysis occurred within a period of 11 years (from 2001 to 2011), which is a relatively short time to produce big samples of weather events that caused non-zero property damage for some of the weather event types. Thus, if a higher bound (such as 100 or 300 observations) had been selected to get more robust statistics, the majority of weather event types would have been excluded, making the results of the analysis trivial.

# Extract the required variables and the target data subset of observations 
# for the harm on economy with respect to property damage.
target_data_____harm_on_economy_____property_damage <- processed_data[
  ## Extract only the observations that have resulted in non-zero property damage.
  PROPERTY_DAMAGE > 0,
  ## Select only the relevant variables. 
  list(REFNUM, EVENT_TYPE, PROPERTY_DAMAGE)
  ][
    ### Keep only the observations that correspond to the weather event types 
    ### for which there are at least 10 weather events available.
    EVENT_TYPE %in% 
      names(table(EVENT_TYPE)[table(EVENT_TYPE) >= 10])
    ]
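
The same kind of filter can be tried on a small scale. The following is a minimal sketch in base R, using a made-up data frame (`toy_events`) and a hypothetical lower bound of 3 instead of 10. It is not the code of the analysis, only an illustration of keeping the event types that have enough observations with non-zero property damage.

```r
# Toy data (invented for illustration only).
toy_events <- data.frame(
  EVENT_TYPE = c("HAIL", "HAIL", "HAIL", "HAIL", "TSUNAMI", "TSUNAMI",
                 "FLOOD", "FLOOD", "FLOOD"),
  PROPERTY_DAMAGE = c(100, 0, 250, 50, 1000, 0, 10, 20, 30)
)

# Hypothetical lower bound (the analysis uses 10).
min_events <- 3

# Keep only the events that resulted in non-zero property damage.
non_zero <- toy_events[toy_events$PROPERTY_DAMAGE > 0, ]

# Count the remaining events per type and keep the types that reach the bound.
counts <- table(non_zero$EVENT_TYPE)
kept_types <- names(counts)[counts >= min_events]
toy_target <- non_zero[non_zero$EVENT_TYPE %in% kept_types, ]

print(toy_target)
```

In this toy example TSUNAMI is dropped, because only one of its two events caused non-zero property damage.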

The table with the target data for the harm on economy with respect to property damage consists of 136928 observations.

# Print the structure of the table with the target data subset 
# for the harm on economy with respect to property damage.
str(target_data_____harm_on_economy_____property_damage)

## Classes 'data.table' and 'data.frame':   136928 obs. of  3 variables:
##  $ REFNUM         : int  413607 413608 413609 413610 413611 413612 413613 413614 413615 413616 ...
##  $ EVENT_TYPE     : chr  "THUNDERSTORM WIND" "THUNDERSTORM WIND" "THUNDERSTORM WIND" "THUNDERSTORM WIND" ...
##  $ PROPERTY_DAMAGE: num  10000 8000 2000 15000 5000 3000 10000 450000 150000 3000 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

The variable EVENT_TYPE includes 37 distinct weather event types, for most of which the variable PROPERTY_DAMAGE was highly positively skewed.

# Create a kable to present some facts about the table with the target data 
# for the harm on economy with respect to property damage.
kable(
  x = target_data_____harm_on_economy_____property_damage[
    order(EVENT_TYPE), 
    list(
      "N" = .N, 
      "SKEWNESS" = round(skewness(PROPERTY_DAMAGE), 4)
    ), 
    by = EVENT_TYPE
    ],
  caption = paste0(
    "Table 9.1.1-1: ",
    "Facts about the table with the target data subset of observations ", 
    "for the harm on economy with respect to property damage."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
    general = "The skewness was rounded to 4 decimal places."
  )

Table 9.1.1-1: Facts about the table with the target data subset of observations for the harm on economy with respect to property damage.
EVENT_TYPE	N	SKEWNESS
AVALANCHE	33	3.4882
BLIZZARD	129	10.5403
COASTAL FLOOD	152	4.5996
COLD/WIND CHILL	14	1.5907
DEBRIS FLOW	189	6.0565
DENSE FOG	56	3.7347
DROUGHT	30	4.9802
DUST DEVIL	60	2.4345
DUST STORM	60	3.7794
EXCESSIVE HEAT	20	4.0309
EXTREME COLD/WIND CHILL	22	4.0178
FLASH FLOOD	13902	61.0935
FLOOD	7072	83.9862
FROST/FREEZE	18	1.7679
HAIL	14584	69.4449
HEAVY RAIN	836	11.4264
HEAVY SNOW	573	7.0114
HIGH SURF	76	5.0462
HIGH WIND	3851	37.6952
HURRICANE/TYPHOON	107	4.9333
ICE STORM	410	8.6732
LAKE-EFFECT SNOW	195	13.1024
LIGHTNING	6162	22.3701
MARINE HIGH WIND	18	3.8120
MARINE STRONG WIND	34	5.3773
MARINE THUNDERSTORM WIND	127	10.0994
STORM SURGE/TIDE	131	9.6344
STRONG WIND	3179	51.6282
THUNDERSTORM WIND	73657	167.8966
TORNADO	8552	55.2385
TROPICAL DEPRESSION	35	5.4232
TROPICAL STORM	363	18.5864
TSUNAMI	14	2.7176
WATERSPOUT	12	3.0130
WILDFIRE	832	15.4642
WINTER STORM	930	29.7861
WINTER WEATHER	493	9.2933
Note:
The skewness was rounded to 4 decimal places.

It is worth noting that the weather event types with the highest number of observations also had the highest skewness for the values of property damage. This indicates that the corresponding distribution of property damage has a heavy tail that could not be observed when only a few observations were available.
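
Part of this pattern is purely arithmetic: the moment-based sample skewness of n observations cannot exceed (n - 2) / sqrt(n - 1), so small samples cannot show very high skewness. The sketch below illustrates that bound in base R. `sample_skewness` is a hypothetical helper written for this illustration, and it may differ from the estimator behind the `skewness` call used in the analysis.

```r
# Moment-based sample skewness (hypothetical helper; an assumption,
# not necessarily the estimator used in the analysis).
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Largest value the moment-based skewness can take for a sample of size n.
max_skewness <- function(n) (n - 2) / sqrt(n - 1)

# The most extreme sample of size n: one large value, all the others equal.
sample_sizes <- c(12, 100, 10000)
extreme_skewness <- vapply(
  sample_sizes,
  function(n) sample_skewness(c(rep(0, n - 1), 1)),
  numeric(1)
)

print(data.frame(
  N = sample_sizes,
  MAX_SKEWNESS = max_skewness(sample_sizes),
  EXTREME_SAMPLE = extreme_skewness
))
```

The extreme sample reaches the bound for every sample size, and the bound itself grows with the number of observations.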

9.1.2 Process the target data for harm on economy with respect to property damage

To create the table with the processed data for the harm on economy with respect to property damage from the corresponding target data subset, a new variable was created that divides the observations of each included weather event type into two complementary levels:

- one that contains the 90% of cases with the lowest impact
- the other that contains the 10% of cases with the highest impact

This decision was made because of the high skewness observed in the values of the variable PROPERTY_DAMAGE for most weather event types, which indicates that the underlying distributions of such phenomena have heavy tails that cause this heterogeneity in the observations. As a result, small property damage was observed for the majority of cases that resulted in non-zero property damage, while the few cases with the highest impact caused very large property damage.

The average property damage is used to determine which weather event types were the most harmful to economy (with respect to property damage), but the average does not represent well the distribution of highly skewed variables, because it is strongly affected by the most extreme values. It was therefore considered necessary to examine the subsets created by those two levels in order to obtain an insightful picture.

# Create the table with the processed data 
# for the harm on economy with respect to property damage.
processed_data_____harm_on_economy_____property_damage <- 
  target_data_____harm_on_economy_____property_damage[
    ,
    ## Create a new variable that divides the observations
    ## for each weather event type into two complementary groups:
    ##   - the 90% of weather events that resulted in the lowest property damage
    ##   - the 10% of weather events that resulted in the highest property damage
    BIN_GROUP_PER_EVENT_TYPE := (function(x, p_bins) {
      
      # add 0 and 1 to the start and the end, respectively,
      # of the percentiles supplied in the argument 'p_bins'
      # and sort them in ascending order
      p_bins_increasing <- sort(c(0, p_bins, 1))
      
      # create the character strings that label the bins defined by the values
      # supplied in the argument 'p_bins'; they become the values of the new variable
      bin_labels <- paste0("(", p_bins_increasing[-length(p_bins_increasing)]*100,
                           "% - ", p_bins_increasing[-1]*100, "%]")
      
      # identify the number of occurrences that correspond to each label
      n_times <- vapply(2:length(p_bins_increasing),
                        function(i) {
                          as.integer(floor(length(x) * p_bins_increasing[i]) -
                                       floor(length(x) * p_bins_increasing[i - 1]))
                        }, integer(1))
      
      # repeat each label as many times as its number of occurrences
      x_bins_expanded <- rep(x = bin_labels, times = n_times)
      
      # reorder the labels to match the positions of the values in 'x'
      x_bins_expanded_reordered <- x_bins_expanded[order(seq_along(x)[order(x)])]
      
      ## Coerce the character vector with the labels of bins to a factor
      x_bins_factor <- factor(x_bins_expanded_reordered, labels = bin_labels, ordered = TRUE)
      
    })(PROPERTY_DAMAGE, 0.9), 
    by = EVENT_TYPE
  ][
    ## Coerce the variable EVENT_TYPE to factor
    , EVENT_TYPE := as.factor(EVENT_TYPE) 
  ]
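
The binning step above can be checked on a small vector. The following is a minimal sketch in base R with made-up values. `split_low_high` is a hypothetical helper that assumes a single cut point and ranks tied values in order of appearance; it is a simplified stand-in, not the function used in the analysis.

```r
# Hypothetical helper: label the floor(n * p) smallest values as the low bin
# and the remaining values as the high bin.
split_low_high <- function(x, p = 0.9) {
  n_low <- floor(length(x) * p)
  # Position of each value in the sorted vector (ties broken by appearance).
  position <- rank(x, ties.method = "first")
  bin_labels <- c(
    paste0("(0% - ", p * 100, "%]"),
    paste0("(", p * 100, "% - 100%]")
  )
  factor(
    ifelse(position <= n_low, bin_labels[1], bin_labels[2]),
    levels = bin_labels,
    ordered = TRUE
  )
}

# Toy property damage values (invented for illustration only).
toy_damage <- c(5000, 1000, 250000, 2000, 3000, 1000, 8000, 4000, 10000, 1500)
toy_bins <- split_low_high(toy_damage)

print(data.frame(PROPERTY_DAMAGE = toy_damage, BIN = toy_bins))
```

With 10 observations, 9 of them fall in the low bin and only the largest value (250000) falls in the high bin.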

The table with the processed data for the harm on economy with respect to property damage contains 136928 observations of 4 variables:

- REFNUM (int) : an id that uniquely identifies each observation
- EVENT_TYPE (Factor w/ 37 levels) : the type of each weather event
- PROPERTY_DAMAGE (num) : the property damage in dollars
- BIN_GROUP_PER_EVENT_TYPE (Ord.factor w/ 2 levels) : a factor that divides the observations for each weather event type into two complementary levels, one with the 90% of observations with the lowest impact and another with the 10% of observations with the highest impact.

# Print the structure of the table with the processed data 
# for the harm on economy with respect to property damage.
str(processed_data_____harm_on_economy_____property_damage)

## Classes 'data.table' and 'data.frame':   136928 obs. of  4 variables:
##  $ REFNUM                  : int  413607 413608 413609 413610 413611 413612 413613 413614 413615 413616 ...
##  $ EVENT_TYPE              : Factor w/ 37 levels "AVALANCHE","BLIZZARD",..: 29 29 29 29 29 29 29 30 29 29 ...
##  $ PROPERTY_DAMAGE         : num  10000 8000 2000 15000 5000 3000 10000 450000 150000 3000 ...
##  $ BIN_GROUP_PER_EVENT_TYPE: Ord.factor w/ 2 levels "(0% - 90%]"<"(90% - 100%]": 1 1 1 1 1 1 1 1 2 1 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

9.1.3 Summarize the processed data for harm on economy with respect to property damage by each weather event type

To evaluate the harm on economy by each weather event type with respect to property damage, a simplistic approach was adopted:

- the weather event types were ranked from the most harmful to the least, based on the overall average property damage of the weather events that resulted in non-zero property damage

The overall average property damage caused by each weather event type was initially examined along with the skewness of the property damage for each weather event type. In most cases the skewness was high (or even extremely high), so it was possible that the overall mean misrepresented the consequences of each weather event type.

For that reason, the average property damage for the 90% of weather events with the lowest impact and the average property damage for the 10% of weather events with the highest impact were also computed and compared.

Note that, for many weather event types, only a few observations were available for the 10% of the cases with the highest impact, so the corresponding mean values should be interpreted with caution.

# Create the table with the summary for the harm on economy 
# with respect to property damage for each weather event type.
summary_____harm_on_economy______property_damage <- 
  processed_data_____harm_on_economy_____property_damage[
  ,
  list(
    ## The total number of observation by each weather event type.
    "N" = .N,
    ## The average property damage caused by each weather event type.
    "AVRG" = round(mean(PROPERTY_DAMAGE), 0),
    ## The skewness of property damage for the observations by each weather event type.
    "SKEWNESS" = round(skewness(PROPERTY_DAMAGE), 4),
    ## The number of observations for the 90% of cases with the lowest impact 
    ## by each weather event type.
    "N_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , .N],
    ## The average property damage caused by each weather event type 
    ## for the 90% of cases with the lowest impact.
    "AVRG_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , round(mean(PROPERTY_DAMAGE), 0)],
    ## The skewness of property damage for the 90% of cases with the lowest impact 
    ## by each weather event type.
    "SKEWNESS_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , round(skewness(PROPERTY_DAMAGE), 4)],
    ## The number of observations for the 10% of cases with the highest impact 
    ## by each weather event type.
    "N_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , .N],
    ## The average property damage caused by each weather event type 
    ## for the 10% of cases with the highest impact.
    "AVRG_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , round(mean(PROPERTY_DAMAGE), 0)],
    ## The skewness of property damage for the 10% of cases with the highest impact 
    ## by each weather event type.
    "SKEWNESS_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , round(skewness(PROPERTY_DAMAGE), 4)]
  ),
  by = "EVENT_TYPE"
  ][
    ## The average property damage is used to rank the weather event types
    ## from the most harmful to the least. The ranks are assigned in decreasing
    ## order of AVRG; the rows of the table themselves are not reordered.
    order(-AVRG),
    ## Create a variable with the harmfulness rank of each weather event type.
    RANK := 1:length(EVENT_TYPE)
    ][
      ,
      ## Reorder the variables at the table.
      list(
        RANK, EVENT_TYPE, N, AVRG, SKEWNESS, N_LOW, AVRG_LOW, SKEWNESS_LOW, N_HIGH, AVRG_HIGH, SKEWNESS_HIGH
      )
      ]
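
To see why the overall average alone can mislead for skewed data, consider the following minimal sketch in base R. The values are invented for illustration, and the 90% / 10% split is done by hand rather than with the binning variable of the analysis.

```r
# Toy property damage values (invented): nine small losses and one huge loss.
toy_damage <- c(1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 1000000)

# Sort the values and find how many belong to the 90% with the lowest impact.
sorted_damage <- sort(toy_damage)
n_low <- floor(length(sorted_damage) * 0.9)

# The three averages that the summary table reports for each event type.
avrg <- mean(sorted_damage)
avrg_low <- mean(sorted_damage[seq_len(n_low)])
avrg_high <- mean(sorted_damage[-seq_len(n_low)])

print(c(AVRG = avrg, AVRG_LOW = avrg_low, AVRG_HIGH = avrg_high))
```

Here the overall average (102700) is more than thirty times the average of the 90% of cases with the lowest impact (3000), because a single extreme value dominates it.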

The results of the table with the summary for the harm on economy with respect to property damage by each weather event type that was created in this section were presented at the subsection 10.2.2 Most harmful event types with respect to property damage of the chapter 10 RESULTS.

The table with the summary for the harm on economy with respect to property damage by each weather event type was exported (as a serialized R object, using saveRDS) in the folder of the working directory:

outputs --> harm_on_economy --> results

with filename:

summary_____harm_on_economy______property_damage.R

In addition, a txt file that contains the MD5 hash of the exported file was created and saved in the same directory with filename:

summary_____harm_on_economy______property_damage.R-----(MD5 HASH).txt

# Supply the filepath at which the table with the summary
# for the harm on economy will be exported.
filepath_____summary_____harm_on_economy______property_damage <-
  file.path(
    directory_tree_____outputs[[
      "filepath_____outputs_____harm_on_economy_____results"
    ]],
    "summary_____harm_on_economy______property_damage.R"
  )

# Export the table with the summary for the harm on economy
# with respect to property damage.
saveRDS(
  object = summary_____harm_on_economy______property_damage,
  file = filepath_____summary_____harm_on_economy______property_damage
)
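
The code that writes the MD5 hash file is not shown in this section. A minimal sketch of how such a checkpoint could be produced and verified follows. It uses `tools::md5sum` from the standard R distribution and writes to a temporary directory; all object and file names in it are hypothetical.

```r
# Toy summary table (invented for illustration only).
toy_summary <- data.frame(
  EVENT_TYPE = c("FLOOD", "HAIL"),
  AVRG = c(2500, 1200)
)

# Export the table to a temporary location (hypothetical file name).
toy_path <- file.path(tempdir(), "toy_summary.R")
saveRDS(object = toy_summary, file = toy_path)

# Compute the MD5 hash of the exported file and store it in a txt file.
toy_hash <- unname(tools::md5sum(toy_path))
hash_path <- file.path(tempdir(), "toy_summary_md5.txt")
writeLines(toy_hash, con = hash_path)

# Later, verify the checkpoint: the stored hash must match the file's hash.
hash_matches <- identical(readLines(hash_path), unname(tools::md5sum(toy_path)))
restored_summary <- readRDS(toy_path)

print(hash_matches)
```

Anyone reproducing the analysis could compare the hash of their own exported file with the stored one in the same way.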

The main reason for exporting the file with the summary for the harm on economy with respect to property damage by each weather event type was to supply a checkpoint for any attempts to reproduce the analysis.

9.1.4 Visualize the results of the summary for the harm on economy with respect to property damage by each weather event type

From the table with the summary for the harm on economy by each weather event type with respect to property damage the Multiplot 2.1 was created to present an overview of the results for the three different aspects that were examined for this perspective.

Four elementary plots were created:

9.1.4.1.1 Create The Plot 2.1.1
- Displays the overall average property damage caused by each weather event type based on all the cases of weather events that resulted in non-zero property damage.
9.1.4.1.2 Create The Plot 2.1.2
- Displays the average property damage caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero property damage.
9.1.4.1.3 Create The Plot 2.1.3
- Displays the average property damage caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero property damage.
9.1.4.1.4 Create The Plot 2.1.4
- Displays a comparison for each weather event type, of the average property damage for the 90% of its observations with the lowest impact versus the average property damage for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero property damage.

which were then combined in order to obtain the Multiplot 2.1.

It constitutes the PART 1 of the Figure 2 that displays the overview of the harm on economy by each weather event type.

(Note that neither the Multiplot 2.1 nor the elementary plots that it contains were presented in this section due to the restrictions imposed by the assignment to include in the report at least 1 but no more than 3 figures. It can be examined at the subsection 10.2.1 Overview of results for the harm on economy of the chapter 10 RESULTS.)

9.1.4.1 Create the components of Multiplot 2.1

Four elementary plots were created to visualize the results for the aspects that were examined for the harm on economy with respect to property damage by each weather event type.

9.1.4.1.1 Create The Plot 2.1.1
- Displays the overall average property damage caused by each weather event type based on all the cases of weather events that resulted in non-zero property damage.
9.1.4.1.2 Create The Plot 2.1.2
- Displays the average property damage caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero property damage.
9.1.4.1.3 Create The Plot 2.1.3
- Displays the average property damage caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero property damage.
9.1.4.1.4 Create The Plot 2.1.4
- Displays a comparison for each weather event type, of the average property damage for the 90% of its observations with the lowest impact versus the average property damage for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero property damage.

9.1.4.1.1 Create The Plot 2.1.1

The Plot 2.1.1 displays the overall average property damage caused by each weather event type, taking into account only the observations that resulted in non-zero property damage.

The skewness of the property damage for the observations of each weather event type (based on which the overall average property damage was computed) is encoded in the color of the bar associated with each of them.

# Create the Elementary Plot 2.1.1 that displays 
# the overall average property damage 
# by each weather event type for all cases. 
elementary_plot_2_1_1 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_economy______property_damage,
    mapping = aes(
      x = AVRG,
      ### Reverse the order of the factors for the EVENT_TYPE variable 
      ### to make them displayed alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE)
        )
      ) 
    )
  ) +
  ## Draw a square shaped point to the position that corresponds to 
  ## the average property damage caused by each weather event type, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(color = SKEWNESS),
    shape = 15, 
    size = 4.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average property damage.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG, 
      group = EVENT_TYPE, 
      color = SKEWNESS
    )
    ) +
  ## Draw a number that indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) based on the overall average 
  ## property damage it caused, inside the square point 
  ## that displays the average.
  geom_text(
    mapping = aes(
      label = RANK
    ), 
    size = 2.5
  ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average property damage for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 2.1 will be composed from the four elementary plots. 
    limits = c(-5, 170), 
    midpoint = 70, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
  ) +
  ## Supply descriptive labels.  
  labs(
    title = "Plot 2.1.1", 
    subtitle = "Aspect: Overall",
    x = "Average Number of Property Damage\n",
    y = "Weather Event Types \n"
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    )
  )
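
The reversal of the factor levels used for the y axis can be checked in isolation. In the minimal sketch below (base R, made-up values), the first level ends up last, which is what lets an axis that places its first level at the bottom read alphabetically from top to bottom.

```r
# Toy event types (invented for illustration only).
toy_types <- factor(c("HAIL", "FLOOD", "TORNADO", "FLOOD"))

# Rebuild the factor with its levels in reverse order; the values are unchanged.
reversed_types <- factor(
  x = toy_types,
  levels = rev(x = levels(x = toy_types))
)

print(levels(toy_types))
print(levels(reversed_types))
```

Only the order of the levels changes; each observation keeps its original event type.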

9.1.4.1.2 Create The Plot 2.1.2

The Elementary Plot 2.1.2 displays the average property damage for the 90% of cases with the lowest impact caused by each weather event type, from all the observations that resulted in non-zero property damage.

The weather event types were matched with a number that represents the rank which was assigned to each of them from the most harmful to the least with respect to economy, based on the overall average property damage they caused (so it is NOT based on the average property damage caused by the 90% of cases with the lowest impact of each weather event type).

The skewness of the property damage for the observations of each weather event type (based on which the average property damage for the 90% of cases with the lowest impact was computed) had been encoded in the color of the bar associated with each of them.

# Create the Elementary Plot 2.1.2 that displays 
# the average property damage by each weather event type 
# for the 90% of its cases with the lowest impact.
elementary_plot_2_1_2 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_economy______property_damage,
    mapping = aes(
      x = AVRG_LOW,
      ### Reverse the order of the factors for the EVENT_TYPE variable 
      ### to display them alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE)
        )
      ) 
    )
  ) +
  ## Draw a circle shaped point to the position that corresponds to 
  ## the average property damage caused by each weather event type
  ## for the 90% of its cases with the lowest impact, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(
      color = SKEWNESS_LOW
    ), 
    size = 3.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average property damage 
  ## for the 90% of its cases with the lowest impact.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG_LOW, 
      group = EVENT_TYPE, 
      color = SKEWNESS_LOW
    )
  ) +
  ## Draw a number that indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) based on the overall average 
  ## property damage it caused, inside the circle point 
  ## that displays the average.
  geom_text(
    mapping = aes(
      label = RANK
    ), 
    size = 2
    ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average property damage for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 2.1 will be composed from the four elementary plots.
    limits = c(-5, 170), 
    midpoint = 70, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
    ) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 2.1.2",
    subtitle = "Aspect: 90% of cases with the lowest impact",
    x = paste0(
      "Average Number of Property Damage for the 90% ", "\n",
      "of Observations with the Lowest Impact" 
    )
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    ),
    ### Remove the text, ticks and title of the y axis.
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank()
  )

9.1.4.1.3 Create The Plot 2.1.3

The Plot 2.1.3 displays the average property damage for the 10% of cases with the highest impact caused by each weather event type, from all the observations that resulted in non-zero property damage.

The weather event types were matched with a number that represents the rank which was assigned to each of them from the most harmful to the least with respect to economy, based on the overall average property damage they caused (so it is NOT based on the average property damage caused by the 10% of cases with the highest impact of each weather event type).

The skewness of the property damage for the observations of each weather event type (based on which the average property damage for the 10% of cases with the highest impact was computed) had been encoded in the color of the bar associated with each of them.

# Create the Elementary Plot 2.1.3 that displays 
# the average property damage by each weather event type 
# for the 10% of its cases with the highest impact.
elementary_plot_2_1_3 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_economy______property_damage,
    mapping = aes(
      x = AVRG_HIGH,
      ### Reverse the order of the factors for the EVENT_TYPE variable 
      ### to display them alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE)
        )
      ) 
    )
  ) +
  ## Draw a diamond shaped point to the position that corresponds to 
  ## the average property damage caused by each weather event type
  ## for the 10% of its cases with the highest impact, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(
      color = SKEWNESS_HIGH
    ), 
    shape = 18, 
    size = 4.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average property damage 
  ## for the 10% of its cases with the highest impact.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG_HIGH, 
      group = EVENT_TYPE, 
      color = SKEWNESS_HIGH
    )
  ) +
  ## Draw a number that indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) based on the overall average 
  ## property damage it caused, inside the diamond point 
  ## that displays the average.
  geom_text(
    mapping = aes(
      label = RANK
    ),
    size = 2
  ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average property damage for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 2.1 will be composed from the four elementary plots.
    limits = c(-5, 170), 
    midpoint = 70, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
  ) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 2.1.3",
    subtitle = "Aspect: 10% of cases with the highest impact",
    x = paste0(
      "Average Number of Property Damage for the 10% ", "\n", 
      "of Observations with the Highest Impact" 
    )
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    ),
    ### Remove the text, ticks and title of the y axis 
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank()
  )


9.1.4.1.4 Create The Plot 2.1.4

The Plot 2.1.4 displays a compact overview of all three aspects that were examined for the harm on economy with respect to property damage.

For each weather event type, the average property damage for the 90% of cases with the lowest impact was plotted against the average property damage for the 10% of cases with the highest impact.

# Create the Elementary Plot 2.1.4 that displays 
# by each weather event type the comparison of 
# the average property damage 
# for the 90% of cases with the lowest impact
# versus the average property damage 
# for the 10% of cases with the highest impact.
elementary_plot_2_1_4 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_economy______property_damage,
    mapping = aes(
      x = AVRG_HIGH, 
      y = AVRG_LOW
    )
  ) +
  geom_point(
    mapping = aes(
      fill = SKEWNESS
    ), 
    shape = 21
  ) +
  ## Draw a label with a number that indicates the rank assigned 
  ## to each weather event type (from the most harmful to the least) 
  ## based on the overall average property damage it caused.
  geom_label_repel(
    mapping = aes(
      label = RANK, 
      fill = SKEWNESS
    ),
    size = 2.5
  ) +
  ## Adjust the scale for the fill of each label.
  scale_fill_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average property damage for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 2.1 will be composed from the four elementary plots.
    limits = c(-5, 170), 
    midpoint = 70, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
  ) +
  ## Set proper limits to the plot.
  xlim(c(-0.5e9, 6e9)) +
  ylim(c(-1e7, 8.5e7)) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 2.1.4",
    subtitle = paste0(
      "Comparison of the average property damage ", 
      "for the 90% of observations with the lowest impact ", 
      "versus the average property damage ", 
      "for the 10% of observations with highest impact. "
    ),
    x = paste0(
      "Average Number of Property Damage by each Weather Event Type ", 
      "for the 10% of its Observations with the Highest Impact"
    ),
    y = paste0(
      "Average Number of Property Damage by each Weather Event Type ", "\n", 
      "for the 90% of its Observations with the Lowest Impact."
    ),
    ### Add a descriptive label for the legend.
    fill = paste0(
      "The color indicates the skewness ",
      "of property damage for each weather event type. ",
      "(the color scale is common to all four plots of PART 1) "
    )
  ) +
  ## Select a theme.
  theme_linedraw() +
  ## Customize the selected theme.
  theme(
    ### Adjust the legend.
    legend.position = "bottom",
    legend.direction = "horizontal",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    )
  )


9.1.4.2 Compose the Multiplot 2.1

The four elementary plots that were created from the results of the summary for the harm on economy with respect to property damage by each weather event type, were combined to construct a single multiplot that displays the complete picture for this perspective.

# Create a multiplot that displays the overview of the summary 
# for the harm on economy with respect to property damage
# by each weather event type.
multiplot_2_1 <- arrangeGrob(
  grobs = list(
      
    # Title
    textGrob(
      label = paste0(
        "\n",
        "PART 1: Harm on economy by each weather event type ", 
        "with respect to property damage ", "\n", 
        "based on the cases of weather events ", 
        "that resulted in non-zero property damage.", "\n", 
        "\n"
      ),
       gp=gpar(
         fontsize = 16, 
         fontface = "bold"
       )
    ),
    
    # Subtitle
    textGrob(
      label = paste0(
          "\n", 
          "The results include only the weather event types, ", 
          "for which at least 10 observations ", 
          "that resulted in non-zero property damage were available. ", "\n",
          "The number associated with each weather event type ", 
          "represents the rank (from the most harmful to the least) ", 
          "which was assigned based on the overall average property damage.", "\n",
          "Because for most of the weather event types ", 
          "high positive skewness was observed for the property damage, ",
          "the average of the 90% of cases with lowest impact ", "\n",
          "and the 10% of cases with highest impact were reported ", 
          "to provide a more representative picture of their consequences.","\n",
          "\n"
      ),
       gp=gpar(
         fontsize = 14, 
         fontface = "bold"
       )
    ),
    
    # Plot 2.1.1
    # Elementary plot for the average property damage 
    # by each weather event type for all cases.
    elementary_plot_2_1_1,
    
    # Plot 2.1.2
    # Elementary plot for the average property damage 
    # by each weather event type for 90% of cases with the lowest impact.
    elementary_plot_2_1_2,
    
    # Plot 2.1.3
    # Elementary plot for the average property damage 
    # by each weather event type for 10% of cases with the highest impact.
    elementary_plot_2_1_3,
    
    # Plot 2.1.4
    # Elementary Plot 2.1.4 for the comparison of 
    # the average property damage 
    # for the 90% of cases with the lowest impact versus 
    # the 10% of cases with the highest impact.
    elementary_plot_2_1_4
  ),
  # Set the layout for these elementary plots
  layout_matrix = 
    matrix(
      c(1,1,1,1,1,1,1,1,1,
        2,2,2,2,2,2,2,2,2,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6
      ),
      byrow = TRUE, 
      nrow = 13
    )
)
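
To see how much of the grid each component of the multiplot receives, the layout matrix can be inspected on its own. The following is a minimal sketch in base R that rebuilds the same 13 x 9 matrix and counts the cells assigned to each entry of the `grobs` list; the names `layout_matrix` and `cells_per_grob` are illustrative and do not appear in the original script.

```r
# Rebuild the layout matrix used for the multiplot (13 rows x 9 columns).
# Each number is the position of a grob in the `grobs` list
# (1 = title, 2 = subtitle, 3 to 6 = the four elementary plots);
# NA leaves the cell empty.
layout_matrix <- matrix(
  c(
    rep(1, 9),
    rep(2, 9),
    rep(c(3, 3, 3, 3, 3, 4, 4, 5, 5), times = 5),
    rep(c(NA, rep(6, 8)), times = 6)
  ),
  byrow = TRUE,
  nrow = 13
)

# Count how many cells of the grid each grob occupies.
cells_per_grob <- table(layout_matrix, useNA = "ifany")
print(dim(layout_matrix))
print(cells_per_grob)
```

The first three plots share rows 3 to 7 in a 5:2:2 split of the columns, while the fourth plot takes the last six rows, with the first column left empty.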

(Note that the Multiplot 2.1 was NOT presented in this section due to the restriction imposed by the assignment to include in the report at least 1 but no more than 3 figures. It can be examined at the subsection 10.2.1 Overview of results for the harm on economy of the chapter 10 RESULTS, where the Figure 2 was presented, of which the Multiplot 2.1 constitutes the PART 1.)


9.2 Harm On Economy With Respect To Crop Damage By Each Weather Event Type

Summary

The required variables and the target data subset of observations for the harm on economy with respect to crop damage were extracted from the table with the processed data, and processed to create a new variable that divides the observations for each of the included weather event types into two complementary groups:

the 90% of observations with the lowest impact
the 10% of observations with the highest impact

before the information for the harm on economy with respect to crop damage was summarized by each weather event type.

Three aspects were examined:

The overall average crop damage by each weather event type.
The average crop damage by each weather event type for the 90% of cases with the lowest impact.
The average crop damage by each weather event type for the 10% of cases with the highest impact.

For each aspect, the average crop damage by each weather event type, the number of its available observations (based on which the average was computed) and their skewness were examined.

The overall average crop damage was used as the main criterion to determine which weather events caused the most harm on economy with respect to crop damage, but it is important to take into account the other two aspects that were presented in order to obtain a more insightful and complete ‘picture’ of their consequences (especially given that, for most of the weather event types, the crop damage was highly positively skewed).

The table with the results for the harm on economy with respect to crop damage by each weather event type is presented at the subsection 10.2.3 Most harmful event types with respect to crop damage of the chapter 10 RESULTS.

Finally, the Multiplot 2.2 was created to visualize the results of the harm on economy with respect to crop damage by each weather event type.

(Note that neither the Multiplot 2.2 nor the elementary plots that it contains were presented in this section due to the restriction imposed by the assignment to include in the report at least 1 but no more than 3 figures. It can be examined at the subsection 10.2.1 Overview of results for the harm on economy of the chapter 10 RESULTS, where the Figure 2 was presented, of which the Multiplot 2.2 constitutes the PART 2.)

Steps

9.2.1 Extract the target data for harm on economy with respect to crop damage
- The target data subset of observations needed to evaluate the harm on economy with respect to crop damage by each weather event type was extracted from the table with the processed data.
9.2.2 Process the target data for harm on economy with respect to crop damage
- The table with target data subset for the harm on economy with respect to crop damage was processed to create the table with processed data for the harm on economy with respect to crop damage.
9.2.3 Summarize the processed data for harm on economy with respect to crop damage by each weather event type
- The harm on economy with respect to crop damage by each weather event type was evaluated over various aspects.
9.2.4 Visualize the results of the summary for the harm on economy with respect to crop damage by each weather event type
- The Multiplot 2.2 that presents the results of the summary for the harm on economy with respect to crop damage by each weather event type was created.
  - 9.2.4.1 Create the components of Multiplot 2.2
    - Creates the four elementary plots that constitute the Multiplot 2.2:
      - 9.2.4.1.1 Create The Plot 2.2.1
        
        Displays the overall average crop damage caused by each weather event type based on all the cases of weather events that resulted in non-zero crop damage.
      - 9.2.4.1.2 Create The Plot 2.2.2
        
        Displays the average crop damage caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero crop damage.
      - 9.2.4.1.3 Create The Plot 2.2.3
        
        Displays the average crop damage caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero crop damage.
      - 9.2.4.1.4 Create The Plot 2.2.4
        
        Displays a comparison for each weather event type, of the average crop damage for the 90% of its observations with the lowest impact versus the average crop damage for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero crop damage.
  - 9.2.4.2 Compose the Multiplot 2.2
    - Combines the four elementary plots to create the Multiplot 2.2.


9.2.1 Extract the target data for harm on economy with respect to crop damage

In order to examine the harm on economy with respect to crop damage caused by each weather event type, the variables REFNUM, EVENT_TYPE and CROP_DAMAGE were selected from the table with the processed data and only the observations that refer to weather events that resulted in non-zero crop damage were extracted.

Furthermore, in an attempt to avoid highly misleading statistics due to the small number of observations for some of the weather event types, a lower bound of 10 weather events that caused non-zero crop damage (for each of the included weather event types) was selected (subjectively by the analyst) and applied.

Although this lower bound may seem (and generally is) too small to yield trustworthy statistics, it was considered “good enough” given that:

the analysis focuses on describing historical data and does not attempt inferences that would demand substantially larger samples; even so, a statistic based on fewer than 10 observations could not be taken seriously, especially in cases (such as this analysis) where the distribution of crop damage for each weather event type is skewed.
the period of 10 years (from 2001 to 2011) covered by the observations used in the analysis is relatively short to produce large samples of weather events that caused non-zero crop damage for some of the weather event types. Thus, if a higher bound such as 100 or 300 observations had been selected to obtain more robust statistics, the majority of weather event types would have been excluded, making the results of the analysis trivial.

# Extract the required variables and the target data subset of observations 
# for the harm on economy with respect to crop damage.
target_data_____harm_on_economy_____crop_damage <- processed_data[
  ## Extract only the observations that have resulted in non-zero crop damage.
  CROP_DAMAGE > 0,
  ## Select only the relevant variables. 
  list(REFNUM, EVENT_TYPE, CROP_DAMAGE)
  ][
    ### Keep only the observations that correspond to the weather event types 
    ### for which there are at least 10 weather events available.
    EVENT_TYPE %in% 
      names(table(EVENT_TYPE)[table(EVENT_TYPE) >= 10])
    ]
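
The effect of the lower bound can be checked on a small made-up table. The sketch below uses a base R data frame instead of the data.table used in the analysis, and the toy values and the names `toy`, `min_obs`, `kept_types` and `toy_filtered` are assumptions introduced only for illustration.

```r
# Toy data: event type "A" has 12 records with non-zero damage, "B" has only 3.
toy <- data.frame(
  EVENT_TYPE = c(rep("A", 12), rep("B", 3)),
  CROP_DAMAGE = c(seq(1000, 12000, by = 1000), 500, 700, 900),
  stringsAsFactors = FALSE
)

# Illustrative threshold, set to the same value as the lower bound in the text.
min_obs <- 10

# Count the records per event type and keep the names that reach the bound.
counts <- table(toy$EVENT_TYPE)
kept_types <- names(counts[counts >= min_obs])

# Keep only the rows that belong to the retained event types.
toy_filtered <- toy[toy$EVENT_TYPE %in% kept_types, ]

print(kept_types)
print(nrow(toy_filtered))
```

Only the rows of event type "A" survive, which mirrors how event types with fewer than 10 qualifying observations drop out of the target data.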

The table with the target data for the harm on economy with respect to crop damage consists of 12177 observations.

# Print the structure of the table with the target data subset 
# for the harm on economy with respect to crop damage.
str(target_data_____harm_on_economy_____crop_damage)

## Classes 'data.table' and 'data.frame':   12177 obs. of  3 variables:
##  $ REFNUM     : int  413886 413890 413893 415001 415205 415230 415477 415533 415652 416062 ...
##  $ EVENT_TYPE : chr  "HAIL" "HAIL" "HAIL" "HAIL" ...
##  $ CROP_DAMAGE: num  3000 3000 3000 5000 2500 3000 5000 2500 100000 30000 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

The variable EVENT_TYPE includes 16 distinct weather event types, for most of which the variable CROP_DAMAGE was highly positively skewed.

It is worth noting that the weather event types with the highest number of observations also showed the highest skewness in the values of crop damage, which indicates that the corresponding distribution of crop damage has a heavy tail that could not be observed when only a few observations were available.
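
One way to see why small samples cannot reveal a heavy tail is to look at the skewness of a sample in which a single value is extreme and all the others are identical. The sketch below is an illustration under stated assumptions: the package that provides `skewness()` is not shown in this section, so a moment-based version is defined here under the hypothetical name `sample_skewness`, and it may differ slightly from the definition used in the analysis.

```r
# Moment-based sample skewness: the third central moment divided by
# the second central moment raised to the power 3/2.
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^(3 / 2)
}

# Skewness of a sample of size n in which one value is extreme
# and the remaining n - 1 values are identical.
one_outlier_skewness <- function(n) {
  sample_skewness(c(rep(1000, n - 1), 1e6))
}

# Compare three sample sizes with the closed form (n - 2) / sqrt(n - 1)
# that this particular configuration reduces to.
sizes <- c(10, 100, 1000)
observed <- vapply(sizes, one_outlier_skewness, numeric(1))
closed_form <- (sizes - 2) / sqrt(sizes - 1)

print(round(observed, 4))
print(round(closed_form, 4))
```

For this configuration the skewness grows with the number of observations, which is consistent with the pattern in the table below: the event types with only 10 or 11 observations show the smallest skewness values.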

# Create a kable to present some facts about the table with the target data 
# for the harm on economy with respect to crop damage.
kable(
  x = target_data_____harm_on_economy_____crop_damage[
    order(EVENT_TYPE), 
    list(
      "N" = .N, 
      "SKEWNESS" = round(skewness(CROP_DAMAGE), 4)
    ), 
    by = EVENT_TYPE
    ],
  caption = paste0(
    "Table 9.2.1-1: ",
    "Facts about the table with the target data subset of observations ", 
    "for the harm on economy with respect to crop damage."
  )
) %>% kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
    general = "The skewness was rounded to 4 decimal places."
  )

Table 9.2.1-1: Facts about the table with the target data subset of observations for the harm on economy with respect to crop damage.
EVENT_TYPE	N	SKEWNESS
DROUGHT	158	4.9333
EXTREME COLD/WIND CHILL	11	1.6402
FLASH FLOOD	1296	13.5455
FLOOD	1263	19.0535
FROST/FREEZE	106	5.8134
HAIL	5590	18.5382
HEAVY RAIN	75	7.8538
HIGH WIND	123	7.5985
HURRICANE/TYPHOON	48	5.6962
LIGHTNING	50	6.2946
STRONG WIND	94	8.5291
THUNDERSTORM WIND	2321	13.4840
TORNADO	889	27.0249
TROPICAL STORM	52	3.4070
WILDFIRE	91	5.3055
WINTER STORM	10	2.6305
Note:
The skewness was rounded to 4 decimal places.


9.2.2 Process the target data for harm on economy with respect to crop damage

To create the table with the processed data for the harm on economy with respect to crop damage from the corresponding target data subset for this perspective, a new variable was created that divides the observations for each of the included weather event types into two complementary levels:

one that contains the 90% of cases with lowest impact
the other that contains the 10% of cases with highest impact

This decision was made because of the high skewness observed in the values of the variable CROP_DAMAGE for most weather event types, which indicates that the underlying distributions of such phenomena have heavy tails that cause this heterogeneity in the observations. As a result, small crop damage was observed for the majority of cases that resulted in non-zero crop damage, while the few cases with the highest impact caused very large crop damage.

Given that the average crop damage is used to determine which weather event types were the most harmful to the economy (with respect to crop damage), and that the average does not represent well the distribution of highly skewed variables because it is strongly affected by the most extreme values, it was considered necessary to examine the subsets created by those two levels in order to obtain an insightful picture.

# Create the table with the processed data 
# for the harm on economy with respect to crop damage.
processed_data_____harm_on_economy_____crop_damage <- 
  target_data_____harm_on_economy_____crop_damage[
    ,
    ## Create a new variable that divides the observations
    ## for each weather event type into two complementary groups:
    ##   - the 90% of weather events that resulted in the lowest crop damage
    ##   - the 10% of weather events that resulted in the highest crop damage
    BIN_GROUP_PER_EVENT_TYPE := (function(x, p_bins) {
      
      # add 0 and 1 to the percentiles supplied in the argument 'p_bins'
      # (as the start and the end of the range respectively)
      # and sort them in ascending order
      p_bins_increasing <- sort(c(0, p_bins, 1))
      
      # create the character strings that label the bins defined by the values
      # supplied in the argument 'p_bins'; they will be the values of the new variable
      bin_labels <- paste0("(", p_bins_increasing[-length(p_bins_increasing)]*100,
                           "% - ", p_bins_increasing[-1]*100, "%]")
      
      # identify the number of occurrences that correspond to each label
      n_times <- vapply(2:length(p_bins_increasing),
                        function(i) {
                          as.integer(floor(length(x) * p_bins_increasing[i]) -
                                       floor(length(x) * p_bins_increasing[i - 1]))
                        }, integer(1))
      
      # multiply each label with the number of its occurrences
      x_bins_expanded <- rep(x = bin_labels, times = n_times)
      
      # order the labels to match the values of the corresponding vector
      x_bins_expanded_reordered <- x_bins_expanded[order(seq_along(x)[order(x)])]
      
      ## Coerce the character vector with the labels of bins to an ordered factor,
      ## fixing the order of the levels explicitly to the order of the bins
      x_bins_factor <- factor(x_bins_expanded_reordered, levels = bin_labels, ordered = TRUE)
      
    })(CROP_DAMAGE, 0.9), 
    by = EVENT_TYPE
  ][
    ## Coerce the EVENT_TYPE variable to factor
    , EVENT_TYPE := as.factor(EVENT_TYPE) 
  ]
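
The binning step can be tried on a single small vector to see what it produces. The sketch below is a standalone base R restatement of the same floor-based counting, under the hypothetical name `bin_by_percentile`; the toy vector is made up for illustration, and the result is then used to compare the overall mean with the means of the two groups.

```r
# Label each value of x with the percentile bin it falls in.
# The number of values per bin is derived with floor(), as in the helper above.
bin_by_percentile <- function(x, p_bins) {
  cuts <- sort(c(0, p_bins, 1))
  bin_labels <- paste0(
    "(", cuts[-length(cuts)] * 100, "% - ", cuts[-1] * 100, "%]"
  )
  n_times <- diff(floor(length(x) * cuts))
  labels_sorted <- rep(bin_labels, times = n_times)
  # order(order(x)) gives the position of each value in the sorted vector,
  # so every value receives the label of the bin that contains its position.
  factor(labels_sorted[order(order(x))], levels = bin_labels, ordered = TRUE)
}

# Toy vector of 10 values with one extreme value at position 3.
x <- c(5, 1, 100, 3, 2, 4, 8, 7, 6, 9)
bins <- bin_by_percentile(x, 0.9)

# Compare the overall mean with the mean of each group.
overall_mean <- mean(x)
group_means <- tapply(x, bins, mean)

print(data.frame(x = x, bin = bins))
print(overall_mean)
print(group_means)
```

In this toy case the single extreme value lands in the upper bin and pulls the overall mean well above the mean of the other nine values, which is the effect that motivates reporting the two groups separately.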

The table with the processed data for the harm on economy with respect to crop damage contains 4 variables:

REFNUM (int) : an id that uniquely identifies each observation
EVENT_TYPE (Factor w/ 16 levels) : the type of each weather event
CROP_DAMAGE (num) : the crop damage in dollars
BIN_GROUP_PER_EVENT_TYPE (Ord.factor w/ 2 levels) : a factor that divides the observations for each weather event type into two complementary levels, one with the 90% of observations with the lowest impact and another with the 10% of observations with the highest impact.

and 12177 observations.

# Print the structure of the table with the processed data 
# for the harm on economy with respect to crop damage.
str(processed_data_____harm_on_economy_____crop_damage)

## Classes 'data.table' and 'data.frame':   12177 obs. of  4 variables:
##  $ REFNUM                  : int  413886 413890 413893 415001 415205 415230 415477 415533 415652 416062 ...
##  $ EVENT_TYPE              : Factor w/ 16 levels "DROUGHT","EXTREME COLD/WIND CHILL",..: 6 6 6 6 6 6 6 6 3 8 ...
##  $ CROP_DAMAGE             : num  3000 3000 3000 5000 2500 3000 5000 2500 100000 30000 ...
##  $ BIN_GROUP_PER_EVENT_TYPE: Ord.factor w/ 2 levels "(0% - 90%]"<"(90% - 100%]": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"


9.2.3 Summarize the processed data for harm on economy with respect to crop damage by each weather event type

To evaluate the harm on economy by each weather event type with respect to crop damage, a simple approach was adopted:

the weather event types were ranked from the most harmful to the least based on the overall average crop damage of the weather events that resulted in non-zero crop damage

The overall average crop damage caused by each weather event type was initially examined along with the skewness of the crop damage for each weather event type. In most cases the skewness was high (or even extremely high), so it was possible that the overall mean misrepresented the consequences of each weather event type.

For this reason, the average crop damage for the 90% of weather events with the lowest impact and the average crop damage for the 10% of weather events with the highest impact were also computed and examined.

Note that for the average crop damage of the 10% of cases with the highest impact, only a few observations were available for many weather event types, so the corresponding mean values should be interpreted with caution.

# Create the table with the summary for the harm on economy 
# with respect to crop damage for each weather event type.
summary_____harm_on_economy______crop_damage <- 
  processed_data_____harm_on_economy_____crop_damage[
  ,
  list(
    ## The total number of observation by each weather event type.
    "N" = .N,
    ## The average crop damage caused by each weather event type.
    "AVRG" = round(mean(CROP_DAMAGE), 0),
    ## The skewness of crop damage for the observations by each weather event type.
    "SKEWNESS" = round(skewness(CROP_DAMAGE), 4),
    ## The number of observations for the 90% of cases with the lowest impact 
    ## by each weather event type.
    "N_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , .N],
    ## The average crop damage caused by each weather event type 
    ## for the 90% of cases with the lowest impact.
    "AVRG_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , round(mean(CROP_DAMAGE), 0)],
    ## The skewness of crop damage for the 90% of cases with the lowest impact 
    ## by each weather event type.
    "SKEWNESS_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , round(skewness(CROP_DAMAGE), 4)],
    ## The number of observations for the 10% of cases with the highest impact 
    ## by each weather event type.
    "N_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , .N],
    ## The average crop damage caused by each weather event type 
    ## for the 10% of cases with the highest impact.
    "AVRG_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , round(mean(CROP_DAMAGE), 0)],
    ## The skewness of crop damage for the 10% of cases with the highest impact 
    ## by each weather event type.
    "SKEWNESS_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , round(skewness(CROP_DAMAGE), 4)]
  ),
  by = "EVENT_TYPE"
  ][
    ## The average crop damage is used to visit the rows of the table
    ## from the most harmful weather event type to the least
    ## (the assignment below does not change the row order of the table itself).
    order(-AVRG),
    ## Create a variable with the rank of the harmfulness of each weather event type.
    RANK := 1:length(EVENT_TYPE)
    ][
      ,
      ## Reorder the variables at the table.
      list(
        RANK, EVENT_TYPE, N, AVRG, SKEWNESS, N_LOW, AVRG_LOW, SKEWNESS_LOW, N_HIGH, AVRG_HIGH, SKEWNESS_HIGH
      )
      ]
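
The ranking rule can be illustrated on a few made-up records. This is a minimal sketch in base R rather than data.table, with hypothetical values and a hypothetical `summary_toy` table; it only shows that rank 1 goes to the event type with the highest average.

```r
# Toy records for three event types (values are made up).
toy <- data.frame(
  EVENT_TYPE = rep(c("FLOOD", "HAIL", "TORNADO"), each = 4),
  CROP_DAMAGE = c(
    100, 200, 300, 400,
    10, 20, 30, 40,
    1000, 2000, 3000, 6000
  )
)

# Average damage and number of records per event type.
avg <- tapply(toy$CROP_DAMAGE, toy$EVENT_TYPE, mean)
summary_toy <- data.frame(
  EVENT_TYPE = names(avg),
  N = as.vector(table(toy$EVENT_TYPE)),
  AVRG = as.vector(avg)
)

# Rank 1 is assigned to the highest average (most harmful).
summary_toy$RANK <- rank(-summary_toy$AVRG, ties.method = "first")

print(summary_toy[order(summary_toy$RANK), ])
```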

The results of the table with the summary for the harm on economy with respect to crop damage by each weather event type that was created in this section were presented at the subsection 10.2.3 Most harmful event types with respect to crop damage of the chapter 10 RESULTS.

The table with the summary for the harm on economy with respect to crop damage by each weather event type was exported with saveRDS() (as a serialized R object, despite the .R file extension), in the folder of the working directory:

outputs –> harm_on_economy –> results

with filename:

summary_____harm_on_economy______crop_damage.R

# Supply the filepath at which the table with the summary
# for the harm on economy will be exported.
filepath_____summary_____harm_on_economy______crop_damage <-
  file.path(
    directory_tree_____outputs[[
      "filepath_____outputs_____harm_on_economy_____results"
    ]],
    "summary_____harm_on_economy______crop_damage.R"
  )

# Export the table with the summary for the harm on economy
# with respect to crop damage.
saveRDS(
  object = summary_____harm_on_economy______crop_damage,
  file = filepath_____summary_____harm_on_economy______crop_damage
)

The main reason for exporting the file with the summary for the harm on economy with respect to crop damage by each weather event type was to supply a checkpoint for any attempts to reproduce the analysis.
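
To use such a checkpoint, the exported file has to be read back with `readRDS()`, because it was written with `saveRDS()` and is not an R script even though its name ends in `.R`. The sketch below shows the round trip on a small made-up table written to a temporary file; `toy_summary` and `checkpoint_path` are illustrative names, not objects of the analysis.

```r
# A small stand-in for the summary table (values are made up).
toy_summary <- data.frame(
  EVENT_TYPE = c("HAIL", "FLOOD"),
  AVRG = c(25, 250)
)

# Write the object to a temporary file that mimics the ".R" extension.
checkpoint_path <- tempfile(fileext = ".R")
saveRDS(object = toy_summary, file = checkpoint_path)

# Read the object back and check that nothing changed in the round trip.
restored <- readRDS(file = checkpoint_path)
print(identical(restored, toy_summary))

# Remove the temporary file.
unlink(checkpoint_path)
```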


9.2.4 Visualize the results of the summary for the harm on economy with respect to crop damage by each weather event type

From the table with the summary for the harm on economy by each weather event type with respect to crop damage the Multiplot 2.2 was created to present an overview of the results for the three different aspects that were examined for this perspective.

Four elementary plots were created:

9.2.4.1.1 Create The Plot 2.2.1
- Displays the overall average crop damage caused by each weather event type based on all the cases of weather events that resulted in non-zero crop damage.
9.2.4.1.2 Create The Plot 2.2.2
- Displays the average crop damage caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero crop damage.
9.2.4.1.3 Create The Plot 2.2.3
- Displays the average crop damage caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero crop damage.
9.2.4.1.4 Create The Plot 2.2.4
- Displays a comparison for each weather event type, of the average crop damage for the 90% of its observations with the lowest impact versus the average crop damage for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero crop damage.

which were then combined in order to obtain the Multiplot 2.2.

It constitutes the PART 2 of the Figure 2 that displays the overview of the harm on economy by each weather event type.

(Note that neither the Multiplot 2.2 nor the elementary plots that it contains were presented in this section due to the restrictions imposed by the assignment to include in the report at least 1 but no more than 3 figures. It can be examined at the subsection 10.2.1 Overview of results for the harm on economy of the chapter 10 RESULTS.)


9.2.4.1 Create the components of Multiplot 2.2

Creates four elementary plots to visualize the results for the aspects that were examined for the harm on economy with respect to crop damage by each weather event type.

9.2.4.1.1 Create The Plot 2.2.1
- Displays the overall average crop damage caused by each weather event type based on all the cases of weather events that resulted in non-zero crop damage.
9.2.4.1.2 Create The Plot 2.2.2
- Displays the average crop damage caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero crop damage.
9.2.4.1.3 Create The Plot 2.2.3
- Displays the average crop damage caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero crop damage.
9.2.4.1.4 Create The Plot 2.2.4
- Displays a comparison for each weather event type, of the average crop damage for the 90% of its observations with the lowest impact versus the average crop damage for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero crop damage.


9.2.4.1.1 Create The Plot 2.2.1

The Plot 2.2.1 displays the overall average crop damage caused by each weather event type, taking into account only the observations that resulted in non-zero crop damage.

The skewness of the crop damage for the observations of each weather event type (based on which the overall average crop damage was computed) is encoded in the color of the point and line associated with each of them.
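
Because the color scale of the crop damage plots is given fixed limits of 0 and 28, it is useful to confirm that the skewness values fall inside them; to the best of my knowledge, ggplot2 by default treats values outside the limits of a continuous scale as missing. The sketch below checks only the overall skewness values reported in Table 9.2.1-1 (the values for the two subgroups are not listed in this section and are therefore not checked); the variable names are illustrative.

```r
# Overall skewness per event type, copied from Table 9.2.1-1.
overall_skewness <- c(
  4.9333, 1.6402, 13.5455, 19.0535, 5.8134, 18.5382, 7.8538, 7.5985,
  5.6962, 6.2946, 8.5291, 13.4840, 27.0249, 3.4070, 5.3055, 2.6305
)

# Limits used for the color scale of the crop damage plots.
scale_limits <- c(0, 28)

# Flag the values that lie inside the limits.
inside <- overall_skewness >= scale_limits[1] &
  overall_skewness <= scale_limits[2]

print(range(overall_skewness))
print(all(inside))
```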

# Create the Elementary Plot 2.2.1 that displays 
# the overall average crop damage 
# by each weather event type for all cases. 
elementary_plot_2_2_1 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_economy______crop_damage,
    mapping = aes(
      x = AVRG,
      ### Reverse the order of the factors for the EVENT_TYPE variable 
      ### to make them displayed alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE)
        )
      ) 
    )
  ) +
  ## Draw a square shaped point to the position that corresponds to 
  ## the average crop damage caused by each weather event type, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(color = SKEWNESS),
    shape = 15, 
    size = 4.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average crop damage.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG, 
      group = EVENT_TYPE, 
      color = SKEWNESS
    )
    ) +
  ## Draw, inside the square point that displays the average, a number that
  ## indicates the rank assigned to each weather event type
  ## (from the most harmful to the least) based on the overall average
  ## crop damage it caused.
  geom_text(
    mapping = aes(
      label = RANK
    ), 
    size = 2.5
  ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average crop damage for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 2.2 will be composed from the four elementary plots. 
    limits = c(0, 28), 
    midpoint = 14, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
  ) +
  ## Supply descriptive labels.  
  labs(
    title = "Plot 2.2.1", 
    subtitle = "Aspect: Overall",
    x = "Average Crop Damage\n",
    y = "Weather Event Types \n"
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    )
  )
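
The reversal of the factor levels used for the y axis above can be tried in isolation. The following is a minimal sketch in base R, assuming a toy vector of event types (the values are hypothetical and are not taken from the dataset). ggplot2 places the first level of a discrete y axis at the bottom, which is why the levels are reversed to read alphabetically from top to bottom.

```r
# Toy event types (hypothetical values, for illustration only).
event_type <- factor(c("FLOOD", "HAIL", "TORNADO", "HAIL"))

# By default the levels are sorted alphabetically.
levels(event_type)
# "FLOOD" "HAIL" "TORNADO"

# Reverse the levels without changing the underlying values.
event_type_rev <- factor(
  x = event_type,
  levels = rev(x = levels(x = event_type))
)
levels(event_type_rev)
# "TORNADO" "HAIL" "FLOOD"
```

Only the order of the levels changes. The value stored for each observation stays the same.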

9.2.4.1.2 Create The Plot 2.2.2

Plot 2.2.2 displays, for each weather event type, the average crop damage for the 90% of cases with the lowest impact, based only on the observations that resulted in non-zero crop damage.

Each weather event type was labeled with its rank (from the most harmful to the least with respect to economy). The rank was based on the overall average crop damage it caused, NOT on the average crop damage of the 90% of its cases with the lowest impact.

The skewness of the crop damage for the observations of each weather event type (from which the average crop damage for the 90% of cases with the lowest impact was computed) was encoded in the color of the point and the line associated with it.

# Create the Elementary Plot 2.2.2 that displays 
# the average crop damage by each weather event type 
# for the 90% of its cases with the lowest impact.
elementary_plot_2_2_2 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_economy______crop_damage,
    mapping = aes(
      x = AVRG_LOW,
      ### Reverse the order of the levels of the EVENT_TYPE factor 
      ### so that they are displayed alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE))
      ) 
    )
  ) +
  ## Draw a circle shaped point to the position that corresponds to 
  ## the average crop damage caused by each weather event type
  ## for the 90% of its cases with the lowest impact, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(
      color = SKEWNESS_LOW
    ), 
    size = 3.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average crop damage 
  ## for the 90% of its cases with the lowest impact.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG_LOW, 
      group = EVENT_TYPE, 
      color = SKEWNESS_LOW
    )
  ) +
  ## Draw, inside the circle shaped point that displays the average, 
  ## the number that indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) 
  ## based on the overall average crop damage it caused.
  geom_text(
    mapping = aes(
      label = RANK
    ), 
    size = 2
  ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average crop damage for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 2.2 will be composed from the four elementary plots.
    limits = c(0, 28), 
    midpoint = 14, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
    ) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 2.2.2",
    subtitle = "Aspect: 90% of cases with the lowest impact",
    x = paste0(
      "Average Crop Damage for the 90% ", "\n",
      "of Observations with the Lowest Impact" 
    )
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    ),
    ### Remove the text, ticks and title of the y axis.
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank()
  )
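
The rank labels drawn in Plot 2.2.2 follow the overall average, not the average shown on its x axis. The following is a minimal sketch in base R of that distinction, assuming a toy summary table with hypothetical values. Only the column names AVRG, AVRG_LOW and RANK mirror the ones used above, and `rank()` is used here for illustration; how the report computes RANK is shown in the summary step, not here.

```r
# Toy summary table (hypothetical values, for illustration only).
summary_toy <- data.frame(
  EVENT_TYPE = c("A", "B", "C"),
  AVRG       = c(5e5, 2e6, 1e5),
  AVRG_LOW   = c(4e4, 1e4, 9e4)
)

# Rank from the most harmful (1) to the least, based on the overall average.
summary_toy$RANK <- rank(-summary_toy$AVRG, ties.method = "min")
summary_toy$RANK
# 2 1 3

# A ranking based on the 90% of cases with the lowest impact would differ.
rank_by_low <- rank(-summary_toy$AVRG_LOW, ties.method = "min")
rank_by_low
# 2 3 1
```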

9.2.4.1.3 Create The Plot 2.2.3

Plot 2.2.3 displays, for each weather event type, the average crop damage for the 10% of cases with the highest impact, based only on the observations that resulted in non-zero crop damage.

Each weather event type was labeled with its rank (from the most harmful to the least with respect to economy). The rank was based on the overall average crop damage it caused, NOT on the average crop damage of the 10% of its cases with the highest impact.

The skewness of the crop damage for the observations of each weather event type (from which the average crop damage for the 10% of cases with the highest impact was computed) was encoded in the color of the point and the line associated with it.

# Create the Elementary Plot 2.2.3 that displays 
# the average crop damage by each weather event type 
# for the 10% of its cases with the highest impact.
elementary_plot_2_2_3 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_economy______crop_damage,
    mapping = aes(
      x = AVRG_HIGH,
      ### Reverse the order of the levels of the EVENT_TYPE factor 
      ### so that they are displayed alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE))
      ) 
    )
  ) +
  ## Draw a diamond shaped point to the position that corresponds to 
  ## the average crop damage caused by each weather event type
  ## for the 10% of its cases with the highest impact, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(
      color = SKEWNESS_HIGH
    ), 
    shape = 18, 
    size = 4.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average crop damage 
  ## for the 10% of its cases with the highest impact.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG_HIGH, 
      group = EVENT_TYPE, 
      color = SKEWNESS_HIGH
    )
  ) +
  ## Draw, inside the diamond shaped point that displays the average, 
  ## the number that indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) 
  ## based on the overall average crop damage it caused.
  geom_text(
    mapping = aes(
      label = RANK
    ),
    size = 2
  ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average crop damage for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 2.2 will be composed from the four elementary plots.
    limits = c(0, 28), 
    midpoint = 14, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
  ) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 2.2.3",
    subtitle = "Aspect: 10% of cases with the highest impact",
    x = paste0(
      "Average Crop Damage for the 10% ", "\n", 
      "of Observations with the Highest Impact" 
    )
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    ),
    ### Remove the text, ticks and title of the y axis 
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank()
  )

9.2.4.1.4 Create The Plot 2.2.4

Plot 2.2.4 displays a compact overview of all three aspects that were examined for the harm on economy with respect to crop damage.

For each weather event type, the average crop damage for the 90% of cases with the lowest impact was plotted against the average crop damage for the 10% of cases with the highest impact.

# Create the Elementary Plot 2.2.4 that displays 
# by each weather event type the comparison of 
# the average crop damage 
# for the 90% of cases with the lowest impact
# versus the average crop damage 
# for the 10% of cases with the highest impact.
elementary_plot_2_2_4 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_economy______crop_damage,
    mapping = aes(
      x = AVRG_HIGH, 
      y = AVRG_LOW
    )
  ) +
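  ## Draw a filled circle at the position that corresponds to the pair of 
  ## averages of each weather event type, of which the fill indicates 
  ## the skewness of its observations.
  geom_point(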
    mapping = aes(
      fill = SKEWNESS
    ), 
    shape = 21
  ) +
  ## Draw a label with a number that indicates the rank assigned 
  ## to each weather event type (from the most harmful to the least) 
  ## based on the overall average crop damage it caused.
  geom_label_repel(
    mapping = aes(
      label = RANK, 
      fill = SKEWNESS
    ),
    size = 2.5
  ) +
  ## Adjust the scale for the fill of each label.
  scale_fill_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average crop damage for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 2.2 will be composed from the four elementary plots.
    limits = c(0, 28), 
    midpoint = 14, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
  ) +
  ## Set proper limits to the plot.
  xlim(c(-0.25e8, 5.2e8)) +
  ylim(c(-0.2e7, 1.5e7)) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 2.2.4",
    subtitle = paste0(
      "Comparison of the average crop damage ", 
      "for the 90% of observations with the lowest impact ", 
      "versus the average crop damage ", 
      "for the 10% of observations with the highest impact."
    ),
    x = paste0(
      "Average Crop Damage by each Weather Event Type ", 
      "for the 10% of its Observations with the Highest Impact"
    ),
    y = paste0(
      "Average Crop Damage by each Weather Event Type ", "\n", 
      "for the 90% of its Observations with the Lowest Impact"
    ),
    ### Add a descriptive label for the legend.
    fill = paste0(
      "The color indicates the skewness ",
      "of crop damage for each weather event type ",
      "(the color scale is common to all four plots of PART 2). ", "\n",
      "When the color is gray, the skewness was indeterminable ",
      "due to the fact that all observations for that weather event type ",
      "took the same value."
    )
  ) +
  ## Select a theme.
  theme_linedraw() +
  ## Customize the selected theme.
  theme(
    ### Adjust the legend.
    legend.position = "bottom",
    legend.direction = "horizontal",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    )
  )

9.2.4.2 Compose the Multiplot 2.2

The four elementary plots that were created from the results of the summary for the harm on economy with respect to crop damage by each weather event type were combined into a single multiplot that displays the complete picture for this perspective.

# Create a multiplot that displays the overview of the summary 
# for the harm on economy with respect to crop damage
# by each weather event type.
multiplot_2_2 <- arrangeGrob(
  grobs = list(
      
    # Title
    textGrob(
      label = paste0(
        "\n",
        "PART 2: Harm on economy by each weather event type ", 
        "with respect to crop damage ", "\n", 
        "based on the cases of weather events ", 
        "that resulted in non-zero crop damage.", "\n", 
        "\n"
      ),
       gp=gpar(
         fontsize = 16, 
         fontface = "bold"
       )
    ),
    
    # Subtitle
    textGrob(
      label = paste0(
          "\n", 
          "The results include only the weather event types, ", 
          "for which at least 10 observations ", 
          "that resulted in non-zero crop damage were available. ", "\n",
          "The number associated with each weather event type ", 
          "represents the rank (from the most harmful to the least) ", 
          "which was assigned based on the overall average crop damage.", "\n",
          "Because for most of the weather event types ", 
          "high positive skewness was observed for the crop damage, ",
          "the average of the 90% of cases with lowest impact ", "\n",
          "and the 10% of cases with highest impact were reported ", 
          "to provide a more representative picture of their consequences.","\n",
          "\n"
      ),
       gp=gpar(
         fontsize = 14, 
         fontface = "bold"
       )
    ),
    
    # ELEMENTARY PLOT 2.2.1
    # Elementary plot for the average crop damage 
    # by each weather event type for all cases.
    elementary_plot_2_2_1,
    
    # ELEMENTARY PLOT 2.2.2
    # Elementary plot for the average crop damage 
    # by each weather event type for 90% of cases with the lowest impact.
    elementary_plot_2_2_2,
    
    # ELEMENTARY PLOT 2.2.3
    # Elementary plot for the average crop damage 
    # by each weather event type for 10% of cases with the highest impact.
    elementary_plot_2_2_3,
    
    # ELEMENTARY PLOT 2.2.4
    # Elementary plot for the comparison of 
    # the average crop damage 
    # for the 90% of cases with the lowest impact versus 
    # the 10% of cases with the highest impact.
    elementary_plot_2_2_4
  ),
  # Set the layout for these elementary plots.
  layout_matrix = 
    matrix(
      c(1,1,1,1,1,1,1,1,1,
        2,2,2,2,2,2,2,2,2,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6
      ),
      byrow = TRUE, 
      nrow = 12
    )
)
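
To see how much of the page each component receives, the layout matrix can be inspected on its own. The following is a minimal sketch in base R that rebuilds the same 12 x 9 matrix and counts the cells assigned to each entry. In a `layout_matrix` of gridExtra, each number refers to the position of a grob in the `grobs` list and `NA` leaves the cell empty. The name `layout_toy` is introduced here only for illustration.

```r
# Rebuild the layout matrix used for the Multiplot 2.2.
layout_toy <- matrix(
  c(
    rep(1, times = 9),                             # row 1: title
    rep(2, times = 9),                             # row 2: subtitle
    rep(c(3, 3, 3, 3, 3, 4, 4, 5, 5), times = 4),  # rows 3-6: plots 2.2.1-2.2.3
    rep(c(NA, rep(6, times = 8)), times = 6)       # rows 7-12: plot 2.2.4
  ),
  byrow = TRUE,
  nrow = 12
)

dim(layout_toy)
# 12 9

# Number of cells occupied by each component (NA marks the empty cells).
table(layout_toy, useNA = "ifany")
```

The title and the subtitle each span one full row, the first three elementary plots share rows 3 to 6, and the Plot 2.2.4 fills the lower half except for the first column.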

(Note that the Multiplot 2.2 was NOT presented in this section due to the restriction imposed by the assignment to include in the report at least 1 but no more than 3 figures. It can be examined at the subsection 10.2.1 Overview of results for the harm on economy of the chapter 10 RESULTS, where the Figure 2 was presented, of which the Multiplot 2.2 constitutes the PART 2.)

9.3 Harm On Economy With Respect To Economic Damage By Each Weather Event Type

Summary

The required variables and the target data subset of observations for the harm on economy with respect to economic damage were extracted from the table with the processed data. A new variable was then created that divided the observations of each included weather event type into two complementary groups:

the 90% of observations with the lowest impact
the 10% of observations with the highest impact

The information for the harm on economy with respect to economic damage was then summarized by each weather event type.

Three aspects were examined:

The overall average economic damage by each weather event type.
The average economic damage by each weather event type for the 90% of cases with the lowest impact.
The average economic damage by each weather event type for the 10% of cases with the highest impact.

For each aspect, the average economic damage by each weather event type, the number of its available observations (based on which the average was computed) and their skewness were examined.

The overall average economic damage was used as the main criterion to determine which weather events caused the most harm on economy with respect to economic damage. It is important, however, to take into account the other two aspects that were presented in order to obtain a more insightful and complete ‘picture’ of their consequences (especially given the fact that for most of the weather event types the economic damage was highly positively skewed).

The table with the results for the harm on economy with respect to economic damage by each weather event type was presented at the subsection 10.2.4 Most harmful event types with respect to economic damage of the chapter 10 RESULTS.

Finally, the Multiplot 2.3 was created to visualize the results of the harm on economy with respect to economic damage by each weather event type.

(Note that neither the Multiplot 2.3 nor the elementary plots that it contains were presented in this section due to the restriction imposed by the assignment to include in the report at least 1 but no more than 3 figures. They can be examined at the subsection 10.2.1 Overview of results for the harm on economy of the chapter 10 RESULTS, where the Figure 2 was presented, of which the Multiplot 2.3 constitutes the PART 3.)

Steps

9.3.1 Extract the target data for harm on economy with respect to economic damage
- The target data subset of observations needed to evaluate the harm on economy with respect to economic damage by each weather event type was extracted from the table with the processed data.
9.3.2 Process the target data for harm on economy with respect to economic damage
- The table with target data subset for the harm on economy with respect to economic damage was processed to create the table with processed data for the harm on economy with respect to economic damage.
9.3.3 Summarize the processed data for harm on economy with respect to economic damage by each weather event type
- The harm on economy with respect to economic damage by each weather event type was evaluated over various aspects.
9.3.4 Visualize the results of the summary for the harm on economy with respect to economic damage by each weather event type
- The Multiplot 2.3 that presents the results of the summary for the harm on economy with respect to economic damage by each weather event type was created.
  - 9.3.4.1 Create the components of Multiplot 2.3
    - Creates the four elementary plots that constitute the Multiplot 2.3:
      - 9.3.4.1.1 Create The Plot 2.3.1
        
        Displays the overall average economic damage caused by each weather event type based on all the cases of weather events that resulted in non-zero economic damage.
      - 9.3.4.1.2 Create The Plot 2.3.2
        
        Displays the average economic damage caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero economic damage.
      - 9.3.4.1.3 Create The Plot 2.3.3
        
        Displays the average economic damage caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero economic damage.
      - 9.3.4.1.4 Create The Plot 2.3.4
        
        Displays a comparison for each weather event type, of the average economic damage for the 90% of its observations with the lowest impact versus the average economic damage for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero economic damage.
  - 9.3.4.2 Compose the Multiplot 2.3
    - Combines the four elementary plots to create the Multiplot 2.3.

9.3.1 Extract the target data for harm on economy with respect to economic damage

In order to examine the harm on economy with respect to economic damage caused by each weather event type, the variables REFNUM, EVENT_TYPE and ECONOMIC_DAMAGE were selected from the table with the processed data and only the observations that refer to weather events that resulted in non-zero economic damage were extracted.

Furthermore, to avoid highly misleading statistics due to the small number of observations for some of the weather event types, a lower bound of 10 weather events that caused non-zero economic damage was applied to each of the included weather event types. This bound was selected subjectively by the analyst.

Although this lower bound may seem (and generally is) too small to yield trustworthy statistics, it was considered to be “good enough” for two reasons:

The analysis focuses on describing historical data without trying to make inferences that would demand substantially bigger samples. Even so, any statistic based on fewer than 10 observations could not be taken seriously, especially in cases (such as in this analysis) where the distribution of economic damage for each weather event type was skewed.
The period from 2001 to 2011, in which the observations used in the analysis occurred, is relatively short to produce big samples of weather events that caused non-zero economic damage for some of the weather event types. Thus, if a higher bound such as 100 or 300 observations had been selected to get more robust statistics, the majority of weather event types would have been excluded, making the results of the analysis trivial.

# Extract the required variables and the target data subset of observations 
# for the harm on economy with respect to economic damage.
target_data_____harm_on_economy_____economic_damage <- processed_data[
  ## Extract only the observations that have resulted in non-zero economic damage.
  ECONOMIC_DAMAGE > 0,
  ## Select only the relevant variables. 
  list(REFNUM, EVENT_TYPE, ECONOMIC_DAMAGE)
  ][
    ### Keep only the observations that correspond to the weather event types 
    ### for which there are at least 10 weather events available.
    EVENT_TYPE %in% 
      names(table(EVENT_TYPE)[table(EVENT_TYPE) >= 10])
    ]
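
The two filtering steps above (keep the non-zero observations, then keep the event types with at least 10 of them) can be checked on a small example. The following is a minimal sketch in base R rather than data.table, assuming a toy table with hypothetical values. Only the column names mirror the ones used in the report.

```r
# Toy data (hypothetical values, for illustration only):
# 12 observations of HAIL and 3 of TSUNAMI, one of which caused no damage.
toy <- data.frame(
  EVENT_TYPE      = c(rep("HAIL", 12), rep("TSUNAMI", 3)),
  ECONOMIC_DAMAGE = c(seq(1000, 12000, by = 1000), 5e5, 0, 2e6),
  stringsAsFactors = FALSE
)

# Step 1: keep only the observations with non-zero economic damage.
nonzero <- toy[toy$ECONOMIC_DAMAGE > 0, ]

# Step 2: keep only the event types with at least 10 such observations.
counts     <- table(nonzero$EVENT_TYPE)
kept_types <- names(counts)[counts >= 10]
target     <- nonzero[nonzero$EVENT_TYPE %in% kept_types, ]

kept_types
# "HAIL"
nrow(target)
# 12
```

TSUNAMI is dropped because only 2 of its observations caused non-zero damage. The threshold is therefore applied to the non-zero observations, not to all observations of an event type.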

The table with the target data for the harm on economy with respect to economic damage consists of 140236 observations.

# Print the structure of the table with the target data subset 
# for the harm on economy with respect to economic damage.
str(target_data_____harm_on_economy_____economic_damage)

## Classes 'data.table' and 'data.frame':   140236 obs. of  3 variables:
##  $ REFNUM         : int  413607 413608 413609 413610 413611 413612 413613 413614 413615 413616 ...
##  $ EVENT_TYPE     : chr  "THUNDERSTORM WIND" "THUNDERSTORM WIND" "THUNDERSTORM WIND" "THUNDERSTORM WIND" ...
##  $ ECONOMIC_DAMAGE: num  10000 8000 2000 15000 5000 3000 10000 450000 150000 3000 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

The variable EVENT_TYPE includes 37 distinct weather event types, for most of which the variable ECONOMIC_DAMAGE was highly positively skewed.

It is worth noting that the weather event types with the highest number of observations also had the highest skewness of economic damage. This suggests that the corresponding distribution of economic damage has a heavy tail that could not be observed when only a few observations were available.

# Create a kable to present some facts about the table with the target data 
# for the harm on economy with respect to economic damage.
kable(
  x = target_data_____harm_on_economy_____economic_damage[
    order(EVENT_TYPE), 
    list(
      "N" = .N, 
      "SKEWNESS" = round(skewness(ECONOMIC_DAMAGE), 4)
    ), 
    by = EVENT_TYPE
    ],
  caption = paste0(
    "Table 9.3.1-1: ", 
    "Facts about the table with the target data subset of observations ", 
    "for the harm on economy with respect to economic damage."
  )
) %>% 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"), 
    full_width = FALSE,
    fixed_thead = TRUE
  ) %>% 
  footnote(
    general = "The skewness was rounded to 4 decimal places."
  )

Table 9.3.1-1: Facts about the table with the target data subset of observations for the harm on economy with respect to economic damage.
EVENT_TYPE	N	SKEWNESS
AVALANCHE	33	3.4882
BLIZZARD	129	10.5403
COASTAL FLOOD	152	4.5996
COLD/WIND CHILL	16	1.2895
DEBRIS FLOW	189	5.6453
DENSE FOG	56	3.7347
DROUGHT	171	4.6871
DUST DEVIL	60	2.4345
DUST STORM	62	5.4939
EXCESSIVE HEAT	21	4.2483
EXTREME COLD/WIND CHILL	32	3.5596
FLASH FLOOD	13954	58.0040
FLOOD	7368	85.7213
FROST/FREEZE	120	6.1949
HAIL	16305	72.6945
HEAVY RAIN	883	19.2418
HEAVY SNOW	573	7.0098
HIGH SURF	76	5.0462
HIGH WIND	3863	37.0482
HURRICANE/TYPHOON	108	4.7929
ICE STORM	410	8.6435
LAKE-EFFECT SNOW	195	13.1024
LIGHTNING	6199	22.3186
MARINE HIGH WIND	18	3.8120
MARINE STRONG WIND	34	5.3773
MARINE THUNDERSTORM WIND	128	10.1387
STORM SURGE/TIDE	131	9.6344
STRONG WIND	3251	53.9812
THUNDERSTORM WIND	74183	166.2756
TORNADO	8782	55.9160
TROPICAL DEPRESSION	35	5.4232
TROPICAL STORM	370	18.7288
TSUNAMI	14	2.7178
WATERSPOUT	12	3.0130
WILDFIRE	878	15.6629
WINTER STORM	931	29.8022
WINTER WEATHER	494	19.5434
Note:
The skewness was rounded to 4 decimal places.
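
For readers who want to relate the SKEWNESS column to a formula, the following is a minimal sketch in base R of the moment-based sample skewness, m3 / m2^(3/2). This is an assumption made for illustration: the package that supplies `skewness()` is not shown in this section and it may use a slightly different estimator. The helper `sample_skewness` is hypothetical.

```r
# Moment-based sample skewness (hypothetical helper, for illustration only).
sample_skewness <- function(x) {
  m2 <- mean((x - mean(x))^2)  # second central moment
  m3 <- mean((x - mean(x))^3)  # third central moment
  m3 / m2^(3 / 2)
}

# A symmetric sample has zero skewness.
skew_symmetric <- sample_skewness(c(1, 2, 3, 4, 5))

# One extreme value among many small ones gives positive skewness.
skew_heavy_tail <- sample_skewness(c(rep(1000, 9), 1e6))

# When all values are equal the result is 0 / 0, that is NaN.
skew_constant <- sample_skewness(rep(5000, 10))
```

The last case is the one the legend of the plots refers to: the skewness cannot be determined when all observations of a weather event type take the same value.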

9.3.2 Process the target data for harm on economy with respect to economic damage

To create the table with the processed data for the harm on economy with respect to economic damage from the corresponding target data subset for this perspective, a new variable was created that divides the observations for each of the included weather event types in two complementary levels:

one that contains the 90% of cases with lowest impact
the other that contains the 10% of cases with highest impact

This decision was made due to the high skewness that was observed for the values of the variable ECONOMIC_DAMAGE for most weather event types, which indicates that the underlying distributions of such phenomena have a heavy tail that causes this heterogeneity in the observations. As a result, small economic damage was observed for the majority of cases that resulted in non-zero economic damage, while the few cases with the highest impact caused very large economic damage.

The average economic damage was used to determine which weather event types were the most harmful to economy (with respect to economic damage). Because the average is highly affected by the most extreme values, it does not represent well the distribution of variables with high skewness. It was therefore considered necessary to examine the subsets created by those two levels in order to obtain an insightful picture.

# Create the table with the processed data 
# for the harm on economy with respect to economic damage.
processed_data_____harm_on_economy_____economic_damage <- 
  target_data_____harm_on_economy_____economic_damage[
    ,
    ## Create a new variable that divides the observations
    ## of each weather event type into two complementary groups:  
    ##   - the 90% of weather events that resulted in the lowest economic damage
    ##   - the 10% of weather events that resulted in the highest economic damage
    BIN_GROUP_PER_EVENT_TYPE := (function(x, p_bins) {
      
      # Add 0 and 1 to the percentiles supplied in the argument 'p_bins' 
      # (as the lower and the upper boundary respectively) 
      # and sort them in ascending order.
      p_bins_increasing <- sort(c(0, p_bins, 1))
      
      # Create the labels of the bins from the boundaries above. 
      # These labels will be the values of the new variable.
      bin_labels <- paste0("(", p_bins_increasing[-length(p_bins_increasing)]*100,
                           "% - ", p_bins_increasing[-1]*100, "%]")
      
      # Compute the number of observations that fall in each bin.
      n_times <- vapply(2:length(p_bins_increasing),
                        function(i) {
                          as.integer(floor(length(x) * p_bins_increasing[i]) -
                                       floor(length(x) * p_bins_increasing[i - 1]))
                        }, integer(1))
      
      # Repeat each label as many times as the number of its observations.
      x_bins_expanded <- rep(x = bin_labels, times = n_times)
      
      # Reorder the labels to match the values of the input vector 
      # (the index below is equivalent to order(order(x)), 
      # which gives the rank of each value).
      x_bins_expanded_reordered <- x_bins_expanded[order(seq_along(x)[order(x)])]
      
      # Coerce the character vector with the labels of the bins 
      # to an ordered factor.
      x_bins_factor <- factor(x_bins_expanded_reordered, labels = bin_labels, ordered = TRUE)
      
    })(ECONOMIC_DAMAGE, 0.9), 
    by = EVENT_TYPE
  ][
    ## Coerce the EVENT_TYPE variable to a factor.
    , EVENT_TYPE := as.factor(EVENT_TYPE) 
  ]
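
The effect of the binning function on the observations of a single weather event type can be illustrated on a small vector. The following is a minimal sketch in base R, assuming hypothetical damage values. The helper `split_low_high` is not part of the report. It is a simplified stand-in for the two-bin case that uses `rank()` with `ties.method = "first"`, which gives the same positions as `order(order(x))`.

```r
# Simplified two-bin split (hypothetical helper, for illustration only).
split_low_high <- function(x, p = 0.9) {
  # Number of observations assigned to the low-impact bin.
  n_low <- floor(length(x) * p)
  # Position of each value in the sorted vector (ties broken by first occurrence).
  position <- rank(x, ties.method = "first")
  bin_labels <- c(
    paste0("(0% - ", p * 100, "%]"),
    paste0("(", p * 100, "% - 100%]")
  )
  factor(
    ifelse(position <= n_low, bin_labels[1], bin_labels[2]),
    levels = bin_labels,
    ordered = TRUE
  )
}

# Twelve toy observations for one weather event type.
damage <- c(5000, 1e6, 2000, 3000, 8000, 1000, 4000, 7000, 6000, 9000, 2500, 3e6)
bins <- split_low_high(damage)

# floor(12 * 0.9) = 10 observations fall in the low bin and 2 in the high bin.
table(bins)
```

Because the counts are obtained with `floor()`, the low-impact bin can contain slightly fewer than 90% of the observations when the group size is not a multiple of 10.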

The table with the processed data for the harm on economy with respect to economic damage contains 4 variables:

REFNUM (int) : an id that uniquely identifies each observation
EVENT_TYPE (Factor w/ 37 levels) : the type of each weather event
ECONOMIC_DAMAGE (num) : the economic damage
BIN_GROUP_PER_EVENT_TYPE (Ord.factor w/ 2 levels) : a factor that divides the observations for each weather event type into two complementary levels, one with the 90% of observations with the lowest impact and another with the 10% of observations with the highest impact.

and 140236 observations.

# Print the structure of the table with the processed data 
# for the harm on economy with respect to economic damage.
str(processed_data_____harm_on_economy_____economic_damage)

## Classes 'data.table' and 'data.frame':   140236 obs. of  4 variables:
##  $ REFNUM                  : int  413607 413608 413609 413610 413611 413612 413613 413614 413615 413616 ...
##  $ EVENT_TYPE              : Factor w/ 37 levels "AVALANCHE","BLIZZARD",..: 29 29 29 29 29 29 29 30 29 29 ...
##  $ ECONOMIC_DAMAGE         : num  10000 8000 2000 15000 5000 3000 10000 450000 150000 3000 ...
##  $ BIN_GROUP_PER_EVENT_TYPE: Ord.factor w/ 2 levels "(0% - 90%]"<"(90% - 100%]": 1 1 1 1 1 1 1 1 2 1 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "REFNUM"

9.3.3 Summarize the processed data for harm on economy with respect to economic damage by each weather event type

To evaluate the harm on economy by each weather event type with respect to economic damage, a simple approach was adopted:

the weather event types were ranked from the most harmful to the least based on the overall average economic damage of the weather events that resulted in non-zero economic damage.

The overall average economic damage caused by each weather event type was examined first, along with the skewness of the economic damage for each weather event type. In most cases the skewness was high (or even extremely high), so the overall mean could misrepresent the typical consequences of a weather event type.

For that reason, the average economic damage for the 90% of weather events with the lowest impact and the average economic damage for the 10% of weather events with the highest impact were also computed and examined.

Note that for many weather event types only a few observations fall in the 10% of cases with the highest impact, so the corresponding mean values should be interpreted with caution.
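
A small illustration of why a highly skewed distribution makes the overall mean unrepresentative, and of how the 90% versus 10% split helps. The values are made up, and `skewness_sketch` is a hypothetical moment-based helper written only for this example; the `skewness()` function called in the report presumably comes from an add-on package that is not shown in this section.

```r
# Hypothetical moment-based skewness (third standardized moment).
skewness_sketch <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up damages: nine modest events and one extreme event.
toy_damage <- c(1000, 2000, 2000, 3000, 3000, 4000, 5000, 5000, 5000, 1e7)

# Overall mean, dominated by the single extreme event.
overall_mean <- mean(toy_damage)

# Split the sorted values into the lowest 90% and the highest 10%.
sorted <- sort(toy_damage)
n_low <- floor(length(sorted) * 0.9)
mean_low <- mean(sorted[seq_len(n_low)])
mean_high <- mean(sorted[-seq_len(n_low)])

c(overall = overall_mean, low_90 = mean_low, high_10 = mean_high,
  skewness = skewness_sketch(toy_damage))
```

Here the overall mean is far above what nine of the ten events caused, while the two subgroup means describe the typical case and the extreme case separately.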

# Create the table with the summary for the harm on economy 
# with respect to economic damage for each weather event type.
summary_____harm_on_economy______economic_damage <- 
  processed_data_____harm_on_economy_____economic_damage[
  ,
  list(
    ## The total number of observations by each weather event type.
    "N" = .N,
    ## The average economic damage caused by each weather event type.
    "AVRG" = round(mean(ECONOMIC_DAMAGE), 0),
    ## The skewness of economic damage for the observations by each weather event type.
    "SKEWNESS" = round(skewness(ECONOMIC_DAMAGE), 4),
    ## The number of observations for the 90% of cases with the lowest impact 
    ## by each weather event type.
    "N_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , .N],
    ## The average economic damage caused by each weather event type 
    ## for the 90% of cases with the lowest impact.
    "AVRG_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , round(mean(ECONOMIC_DAMAGE), 0)],
    ## The skewness of economic damage for the 90% of cases with the lowest impact 
    ## by each weather event type.
    "SKEWNESS_LOW" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(0% - 90%]" , round(skewness(ECONOMIC_DAMAGE), 4)],
    ## The number of observations for the 10% of cases with the highest impact 
    ## by each weather event type.
    "N_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , .N],
    ## The average economic damage caused by each weather event type 
    ## for the 10% of cases with the highest impact.
    "AVRG_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , round(mean(ECONOMIC_DAMAGE), 0)],
    ## The skewness of economic damage for the 10% of cases with the highest impact 
    ## by each weather event type.
    "SKEWNESS_HIGH" = .SD[BIN_GROUP_PER_EVENT_TYPE == "(90% - 100%]" , round(skewness(ECONOMIC_DAMAGE), 4)]
  ),
  by = "EVENT_TYPE"
  ][
    ## The average economic damage is used to order the rows of the table
    ## from the most harmful weather event type to the least.
    order(-AVRG),
    ## Create a variable with the rank of the harmfulness of each weather event type.
    RANK := 1:length(EVENT_TYPE)
    ][
      ,
      ## Reorder the variables at the table.
      list(
        RANK, EVENT_TYPE, N, AVRG, SKEWNESS, N_LOW, AVRG_LOW, SKEWNESS_LOW, N_HIGH, AVRG_HIGH, SKEWNESS_HIGH
      )
      ]
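
The ranking step above assigns the ranks to the rows selected in decreasing order of AVRG. A minimal base R sketch of the same idea on a toy table follows; the table `toy_summary` and its values are invented for illustration, and base R is used here only so that the example runs without any add-on package.

```r
# Toy summary table with made-up averages.
toy_summary <- data.frame(
  EVENT_TYPE = c("FLOOD", "HAIL", "TORNADO"),
  AVRG = c(250000, 40000, 900000)
)

# Visit the rows from the largest average to the smallest and
# give them the ranks 1, 2, 3, ... in that order.
toy_summary$RANK <- NA_integer_
toy_summary$RANK[order(-toy_summary$AVRG)] <- seq_len(nrow(toy_summary))

# The same ranks can be obtained directly with rank().
rank_check <- rank(-toy_summary$AVRG, ties.method = "first")

toy_summary
```

The rows keep their original position; only the new RANK column reflects the ordering.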

The summary table for the harm on economy with respect to economic damage by each weather event type, which was created in this section, is presented at the subsection 10.2.4 Most harmful event types with respect to economic damage of the chapter 10 RESULTS.

The table with the summary for the harm on economy with respect to economic damage by each weather event type was exported with saveRDS() (as a serialized R object, although the filename carries the .R extension), in the folder of the working directory:

outputs –> harm_on_economy –> results

with filename:

summary_____harm_on_economy______economic_damage.R

# Supply the filepath at which the table with the summary
# for the harm on economy will be exported.
filepath_____summary_____harm_on_economy______economic_damage <-
  file.path(
    directory_tree_____outputs[[
      "filepath_____outputs_____harm_on_economy_____results"
    ]],
    "summary_____harm_on_economy______economic_damage.R"
  )

# Export the table with the summary for the harm on economy
# with respect to economic damage.
saveRDS(
  object = summary_____harm_on_economy______economic_damage,
  file = filepath_____summary_____harm_on_economy______economic_damage
)
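
Because the table is written with `saveRDS()`, the exported file holds a serialized R object and has to be read back with `readRDS()` rather than with `source()`, whatever its extension. A minimal round trip is sketched below; the toy table and the temporary file path are assumptions made for this example and do not refer to the actual output folder.

```r
# Toy table standing in for the summary table.
toy_table <- data.frame(
  EVENT_TYPE = c("FLOOD", "HAIL"),
  AVRG = c(250000, 40000)
)

# Write to a temporary file that, like the report, uses the .R extension.
toy_path <- file.path(tempdir(), "toy_summary.R")
saveRDS(object = toy_table, file = toy_path)

# Read the serialized object back and compare it with the original.
restored <- readRDS(file = toy_path)
round_trip_ok <- identical(toy_table, restored)

# Clean up the temporary file.
unlink(toy_path)
```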

9.3.4 Visualize the results of the summary for the harm on economy with respect to economic damage by each weather event type

From the summary table for the harm on economy by each weather event type with respect to economic damage, the Multiplot 2.3 was created to present an overview of the results for the three aspects that were examined for this perspective.

Four elementary plots were created:

9.3.4.1.1 Create The Plot 2.3.1
- Displays the overall average economic damage caused by each weather event type based on all the cases of weather events that resulted in non-zero economic damage.
9.3.4.1.2 Create The Plot 2.3.2
- Displays the average economic damage caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero economic damage.
9.3.4.1.3 Create The Plot 2.3.3
- Displays the average economic damage caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero economic damage.
9.3.4.1.4 Create The Plot 2.3.4
- Displays a comparison for each weather event type, of the average economic damage for the 90% of its observations with the lowest impact versus the average economic damage for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero economic damage.

which were then combined in order to obtain the Multiplot 2.3.

It constitutes the PART 3 of the Figure 2 that displays the overview of the harm on economy by each weather event type.

9.3.4.1 Create the components of Multiplot 2.3

Creates four elementary plots to visualize the results for the aspects that were examined for the harm on economy with respect to economic damage by each weather event type.

9.3.4.1.1 Create The Plot 2.3.1
- Displays the overall average economic damage caused by each weather event type based on all the cases of weather events that resulted in non-zero economic damage.
9.3.4.1.2 Create The Plot 2.3.2
- Displays the average economic damage caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero economic damage.
9.3.4.1.3 Create The Plot 2.3.3
- Displays the average economic damage caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero economic damage.
9.3.4.1.4 Create The Plot 2.3.4
- Displays a comparison for each weather event type, of the average economic damage for the 90% of its observations with the lowest impact versus the average economic damage for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero economic damage.

9.3.4.1.1 Create The Plot 2.3.1

The Plot 2.3.1 displays the overall average economic damage caused by each weather event type, taking into account only the observations that resulted in non-zero economic damage.

The skewness of the economic damage for the observations of each weather event type (from which the overall average economic damage was computed) was encoded in the color of the point and line associated with each of them.

# Create the Elementary Plot 2.3.1 that displays 
# the overall average economic damage 
# by each weather event type for all cases. 
elementary_plot_2_3_1 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_economy______economic_damage,
    mapping = aes(
      x = AVRG,
      ### Reverse the order of the factors for the EVENT_TYPE variable 
      ### to make them displayed alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE)
        )
      ) 
    )
  ) +
  ## Draw a square shaped point to the position that corresponds to 
  ## the average economic damage caused by each weather event type, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(color = SKEWNESS),
    shape = 15, 
    size = 4.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average economic damage.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG, 
      group = EVENT_TYPE, 
      color = SKEWNESS
    )
    ) +
  ## Draw, inside the square point that displays the average, a number that 
  ## indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) based on the overall average 
  ## economic damage it caused.
  geom_text(
    mapping = aes(
      label = RANK
    ), 
    size = 2.5
  ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average economic damage for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 2.3 will be composed from the four elementary plots. 
    limits = c(-5, 170), 
    midpoint = 70, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
  ) +
  ## Supply descriptive labels.  
  labs(
    title = "Plot 2.3.1", 
    subtitle = "Aspect: Overall",
    x = "Average Economic Damage\n",
    y = "Weather Event Types \n"
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    )
  )
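
The level reversal used for the y axis can be checked in isolation. ggplot2 places the first level of a discrete y axis at the bottom, so reversing the levels makes the alphabetical order run from top to bottom. The toy factor below is invented for illustration.

```r
# Toy factor whose levels are sorted alphabetically by default.
toy_types <- factor(c("HAIL", "AVALANCHE", "FLOOD"))
levels(toy_types)

# Rebuild the factor with the levels in reverse order;
# the values themselves are unchanged.
reversed <- factor(toy_types, levels = rev(levels(toy_types)))
levels(reversed)
```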

9.3.4.1.2 Create The Plot 2.3.2

The Plot 2.3.2 displays, for each weather event type, the average economic damage for the 90% of cases with the lowest impact, among the observations that resulted in non-zero economic damage.

The weather event types were matched with a number that represents the rank assigned to each of them, from the most harmful to the least with respect to economy, based on the overall average economic damage they caused
(so it is NOT based on the average economic damage caused by the 90% of cases with the lowest impact of each weather event type).

The skewness of the economic damage for the observations of each weather event type (from which the average economic damage for the 90% of cases with the lowest impact was computed) was encoded in the color of the point and line associated with each of them.

# Create the Elementary Plot 2.3.2 that displays 
# the average economic damage by each weather event type 
# for the 90% of its cases with the lowest impact.
elementary_plot_2_3_2 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_economy______economic_damage,
    mapping = aes(
      x = AVRG_LOW,
      ### Reverse the order of the factors for the EVENT_TYPE variable 
      ### to display them alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE)
        )
      ) 
    )
  ) +
  ## Draw a circle shaped point to the position that corresponds to 
  ## the average economic damage caused by each weather event type
  ## for the 90% of its cases with the lowest impact, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(
      color = SKEWNESS_LOW
    ), 
    size = 3.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average economic damage 
  ## for the 90% of its cases with the lowest impact.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG_LOW, 
      group = EVENT_TYPE, 
      color = SKEWNESS_LOW
    )
  ) +
  ## Draw, inside the circle point that displays the average, a number that 
  ## indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) based on the overall average 
  ## economic damage it caused.
  geom_text(
    mapping = aes(
      label = RANK
    ), 
    size = 2
    ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average economic damage for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 2.3 will be composed from the four elementary plots.
    limits = c(-5, 170), 
    midpoint = 70, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
    ) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 2.3.2",
    subtitle = "Aspect: 90% of cases with the lowest impact",
    x = paste0(
      "Average Economic Damage for the 90% ", "\n",
      "of Observations with the Lowest Impact" 
    )
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    ),
    ### Remove the text, ticks and title of the y axis.
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank()
  )
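
The shared limits `c(-5, 170)` only work as intended if every skewness value in the three columns falls inside them, because by default a continuous ggplot2 scale treats values outside its limits as missing. A quick sanity check could look like the sketch below; the numbers in `toy_summary` are invented and only the column names follow the summary table.

```r
# Toy stand-in for the three skewness columns of the summary table.
toy_summary <- data.frame(
  SKEWNESS = c(12.4, 160.2, 3.1),
  SKEWNESS_LOW = c(1.8, 4.6, -0.7),
  SKEWNESS_HIGH = c(2.9, 35.0, 0.4)
)

# Limits shared by the color scales of all elementary plots.
shared_limits <- c(-5, 170)

# Smallest and largest skewness over the three columns.
observed_range <- range(unlist(toy_summary), na.rm = TRUE)

# TRUE when no value would fall outside the shared limits.
all_within <- observed_range[1] >= shared_limits[1] &&
  observed_range[2] <= shared_limits[2]
```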

9.3.4.1.3 Create The Plot 2.3.3

The Plot 2.3.3 displays, for each weather event type, the average economic damage for the 10% of cases with the highest impact, among the observations that resulted in non-zero economic damage.

The weather event types were matched with a number that represents the rank assigned to each of them, from the most harmful to the least with respect to economy, based on the overall average economic damage they caused
(so it is NOT based on the average economic damage caused by the 10% of cases with the highest impact of each weather event type).

The skewness of the economic damage for the observations of each weather event type (from which the average economic damage for the 10% of cases with the highest impact was computed) was encoded in the color of the point and line associated with each of them.

# Create the Elementary Plot 2.3.3 that displays 
# the average economic damage by each weather event type 
# for the 10% of its cases with the highest impact.
elementary_plot_2_3_3 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_economy______economic_damage,
    mapping = aes(
      x = AVRG_HIGH,
      ### Reverse the order of the factors for the EVENT_TYPE variable 
      ### to display them alphabetically from top to bottom.
      y = factor(
        x = EVENT_TYPE, 
        levels = rev(x = levels(x = EVENT_TYPE)
        )
      ) 
    )
  ) +
  ## Draw a diamond shaped point to the position that corresponds to 
  ## the average economic damage caused by each weather event type
  ## for the 10% of its cases with the highest impact, 
  ## of which the color indicates the skewness of observations 
  ## based on which each average was computed.
  geom_point(
    mapping = aes(
      color = SKEWNESS_HIGH
    ), 
    shape = 18, 
    size = 4.5
  ) +
  ## Draw a line that visually associates each weather event type 
  ## with its respective average economic damage 
  ## for the 10% of its cases with the highest impact.
  geom_linerange(
    mapping = aes(
      xmin = 0, 
      xmax = AVRG_HIGH, 
      group = EVENT_TYPE, 
      color = SKEWNESS_HIGH
    )
  ) +
  ## Draw, inside the diamond point that displays the average, a number that 
  ## indicates the rank assigned to each weather event type 
  ## (from the most harmful to the least) based on the overall average 
  ## economic damage it caused.
  geom_text(
    mapping = aes(
      label = RANK
    ),
    size = 2
  ) +
  ## Adjust the scale for the color of each point.
  scale_color_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average economic damage for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 2.3 will be composed from the four elementary plots.
    limits = c(-5, 170), 
    midpoint = 70, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
  ) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 2.3.3",
    subtitle ="Aspect: 10% of cases with the highest impact",
    x = paste0(
      "Average Economic Damage for the 10% ", "\n", 
      "of Observations with the Highest Impact" 
    )
  ) +
  ## Select a theme.
  theme_linedraw() + 
  ## Customize the selected theme.
  theme(
    ### Remove the legend.
    legend.position = "none",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    ),
    ### Remove the text, ticks and title of the y axis 
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.y = element_blank()
  )

9.3.4.1.4 Create The Plot 2.3.4

The Plot 2.3.4 displays a compact overview of all three aspects that were examined for the harm on economy with respect to economic damage.

For each weather event type, the average economic damage for the 90% of cases with the lowest impact (y axis) was plotted against the average economic damage for the 10% of cases with the highest impact (x axis). The label of each point shows the rank based on the overall average economic damage and its fill encodes the overall skewness.

# Create the Elementary Plot 2.3.4 that displays 
# by each weather event type the comparison of 
# the average economic damage 
# for the 90% of cases with the lowest impact
# versus the average economic damage 
# for the 10% of cases with the highest impact.
elementary_plot_2_3_4 <-
  ## Supply the constant arguments for the aesthetics of all included geoms.
  ggplot(
    data = summary_____harm_on_economy______economic_damage,
    mapping = aes(
      x = AVRG_HIGH, 
      y = AVRG_LOW
    )
  ) +
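  ## Draw a filled circle for each weather event type, 
  ## of which the fill indicates the overall skewness of its observations.
  geom_point(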
    mapping = aes(
      fill = SKEWNESS
    ), 
    shape = 21
  ) +
  ## Draw a label with a number that indicates the rank assigned 
  ## to each weather event type (from the most harmful to the least) 
  ## based on the overall average economic damage it caused.
  geom_label_repel(
    mapping = aes(
      label = RANK, 
      fill = SKEWNESS
    ),
    size = 2.5
  ) +
  ## Adjust the scale for the fill of each label.
  scale_fill_gradient2(
    ### Choose such limits and midpoint for the colorbar of the legend
    ### that they can be used unchanged to correctly display 
    ### the skewness of the observations based on which 
    ### the average economic damage for all three aspects: 
    ###   1. overall
    ###   2. 90% of cases with the lowest impact 
    ###   3. 10% of cases with the highest impact
    ### was computed. 
    ### This will allow to include only one common legend when the 
    ### Multiplot 2.3 will be composed from the four elementary plots.
    limits = c(-5, 170), 
    midpoint = 70, 
    low = "lightgreen", 
    mid = "orange", 
    high = "purple"
    ) +
  ## Set proper limits to the plot.
  xlim(c(-0.5e9, 8.5e9)) +
  ylim(c(-1e7, 9.5e7)) +
  ## Supply descriptive labels. 
  labs(
    title = "Plot 2.3.4",
    subtitle = paste0(
      "Comparison of the average economic damage ", 
      "for the 90% of observations with the lowest impact ", 
      "versus the average economic damage ", 
      "for the 10% of observations with highest impact. "
    ),
    x = paste0(
      "Average Economic Damage by each Weather Event Type ", 
      "for the 10% of its Observations with the Highest Impact"
    ),
    y = paste0(
      "Average Economic Damage by each Weather Event Type ", "\n", 
      "for the 90% of its Observations with the Lowest Impact."
    ),
    ### Add a descriptive label for the legend.
    fill = paste0(
      "The color indicates the skewness ",
      "of economic damage for each weather event type. ",
      "(the color scale is common to all four plots of PART 3) "
    )
  ) +
  ## Select a theme.
  theme_linedraw() +
  ## Customize the selected theme.
  theme(
    ### Adjust the legend.
    legend.position = "bottom",
    legend.direction = "horizontal",
    ### Adjust the title.
    plot.title = element_text(
      size = 12,
      face = "bold"
    ),
    ### Adjust the subtitle.
    plot.subtitle = element_text(
      size = 10
    )
  )

9.3.4.2 Compose the Multiplot 2.3

The four elementary plots that were created from the summary for the harm on economy with respect to economic damage by each weather event type were combined into a single multiplot that displays the complete picture for this perspective.

# Create a multiplot that displays the overview of the summary 
# for the harm on economy with respect to economic damage
# by each weather event type.
multiplot_2_3 <- arrangeGrob(
  grobs = list(
      
    # Title
    textGrob(
      label = paste0(
        "\n",
        "PART 3: Harm on economy by each weather event type ", 
        "with respect to economic damage ", "\n", 
        "based on the cases of weather events ", 
        "that resulted in non-zero economic damage.", "\n", 
        "\n"
      ),
       gp=gpar(
         fontsize = 16, 
         fontface = "bold"
       )
    ),
    
    # Subtitle
    textGrob(
      label = paste0(
          "\n", 
          "The results include only the weather event types, ", 
          "for which at least 10 observations ", 
          "that resulted in non-zero economic damage were available. ", "\n",
          "The number associated with each weather event type ", 
          "represents the rank (from the most harmful to the least) ", 
          "which was assigned based on the overall average economic damage.", "\n",
          "Because for most of the weather event types ", 
          "high positive skewness was observed for the economic damage, ",
          "the average of the 90% of cases with lowest impact ", "\n",
          "and the 10% of cases with highest impact were reported ", 
          "to provide a more representative picture of their consequences.","\n",
          "\n"
      ),
       gp=gpar(
         fontsize = 14, 
         fontface = "bold"
       )
    ),
    
    # Plot 2.3.1
    # Elementary plot for the average economic damage 
    # by each weather event type for all cases.
    elementary_plot_2_3_1,
    
    # Plot 2.3.2
    # Elementary plot for the average economic damage 
    # by each weather event type for 90% of cases with the lowest impact.
    elementary_plot_2_3_2,
    
    # Plot 2.3.3
    # Elementary plot for the average economic damage 
    # by each weather event type for 10% of cases with the highest impact.
    elementary_plot_2_3_3,
    
    # Plot 2.3.4
    # Elementary plot for the comparison of 
    # the average economic damage 
    # for the 90% of cases with the lowest impact versus 
    # the 10% of cases with the highest impact.
    elementary_plot_2_3_4
  ),
  # Set the layout for these elementary plots
  layout_matrix = 
    matrix(
      c(1,1,1,1,1,1,1,1,1,
        2,2,2,2,2,2,2,2,2,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        3,3,3,3,3,4,4,5,5,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6,
        NA,6,6,6,6,6,6,6,6
      ),
      byrow = TRUE, 
      nrow = 13
    )
)
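
In the layout matrix, each integer refers to the position of a grob in the `grobs` list and `NA` leaves a cell empty, so the relative heights and widths follow from how many rows and columns each number occupies. The sketch below rebuilds the same 13 x 9 layout block by block in base R; the name `layout_sketch` is an assumption made for this example, which is only meant to make the structure easier to read.

```r
layout_sketch <- rbind(
  # Row 1: title across all nine columns.
  matrix(1, nrow = 1, ncol = 9),
  # Row 2: subtitle across all nine columns.
  matrix(2, nrow = 1, ncol = 9),
  # Rows 3-7: the three side-by-side plots (5, 2 and 2 columns wide).
  matrix(rep(c(3, 3, 3, 3, 3, 4, 4, 5, 5), times = 5),
         nrow = 5, byrow = TRUE),
  # Rows 8-13: the comparison plot, with an empty first column.
  matrix(rep(c(NA, rep(6, 8)), times = 6),
         nrow = 6, byrow = TRUE)
)

# Number of rows and columns of the layout.
dim(layout_sketch)

# Number of cells taken by each grob (and by empty cells).
table(layout_sketch, useNA = "ifany")
```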

(Note that the Multiplot 2.3 was NOT presented in this section due to the restriction imposed by the assignment to include in the report at least 1 but no more than 3 figures. It can be examined at the subsection 10.2.1 Overview of results for the harm on economy of the chapter 10 RESULTS, where the Figure 2 was presented, of which the Multiplot 2.3 constitutes the PART 3.)

10 RESULTS

The unprocessed raw data from the file repdata_data_StormData.csv.bz2, which contains observations from the Storm Events Dataset created and made publicly available by the U.S. National Oceanic and Atmospheric Administration (NOAA), was processed to obtain the table with the processed data (through a processing pipeline described in detail at the chapter 6 DATA PROCESSING).

The table with the processed data contains valid observations for weather events that happened in the United States in the period from 2001 to 2011 and caused harm either to population health (resulted in fatalities or injuries) or to economy (resulted in property or crop damage). Based on this table, the results of this analysis were produced for the two questions of interest set by the assignment (for which the guidelines can be found at the section 2.1 About The Assignment). They are presented in the following sections of this chapter:

10.1 Question 1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
10.2 Question 2 : Across the United States, which types of events have the greatest economic consequences?

10.1 Question 1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

In an attempt to identify the most harmful weather event types with respect to population health, three different perspectives were examined (the analysis can be found at the chapter 8 HARM ON POPULATION HEALTH).

A short overview of the results is presented at the subsection:

10.1.1 Overview of results for the harm on population health

Further details for each of the three perspectives are available at the following subsections:

10.1.2 Most harmful event types with respect to fatalities
10.1.3 Most harmful event types with respect to injuries
10.1.4 Most harmful event types with respect to casualties

Note that the results must be read in the following context in order to be meaningful:

The results for any perspective (fatalities, injuries or casualties) refer specifically to the harm that was caused when harm with respect to that particular perspective was observed.

(In other words, the results do not describe the harm to be expected whenever a weather phenomenon of an included weather event type occurs, regardless of whether it caused harm with respect to the perspective examined.)

In addition, only the weather event types with at least 10 observations of weather events that resulted in non-zero harm with respect to the perspective examined were included, so the composition of weather event types differs among the three perspectives.
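
The inclusion rule described above (at least 10 observations with non-zero harm per weather event type) can be sketched on a toy table as follows. The data, the object names and the use of base R are assumptions made for illustration; the filtering actually applied in this report is part of the chapters that precede the results.

```r
# Toy table in which every row is an event with non-zero harm.
toy_events <- data.frame(
  EVENT_TYPE = rep(c("FLOOD", "HAIL", "TSUNAMI"), times = c(12, 10, 3)),
  FATALITIES = 1
)

# Count the observations available for each weather event type.
counts <- table(toy_events$EVENT_TYPE)

# Keep only the types with at least 10 observations.
eligible_types <- names(counts)[counts >= 10]
toy_filtered <- toy_events[toy_events$EVENT_TYPE %in% eligible_types, ]

eligible_types
```

Because the counts are taken separately for fatalities, injuries and casualties, a type can pass the threshold for one perspective and fail it for another, which is why the three perspectives do not cover the same set of weather event types.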

For each perspective, it was considered appropriate to present three aspects in order to give an informative picture of the consequences caused by each weather event type:

the overall average harm
the average harm of the 90% of cases with the lowest impact
the average harm of the 10% of cases with the highest impact

The number of observations and their skewness were summarized by weather event type for every aspect and are presented along with the corresponding average.

Although the overall average harm was used as the primary criterion to determine the most harmful events, it should be examined along with the average harm for the two other subgroups, especially when the overall skewness for a weather event type of interest is high.

10.1.1 Overview of results for the harm on population health

To display an overview of the results for the harm on population health by each weather event type, the Figure 1 was created.

# Compose the Figure 1, by combining
#   - Multiplot 1.1
#   - Multiplot 1.2
#   - Multiplot 1.3
figure_1 <- arrangeGrob(
  grobs = list(
      ## TITLE
      textGrob(
          label = paste0(
              "FIGURE 1: SUMMARY OF HARM ON POPULATION HEALTH BY EACH WEATHER EVENT TYPE."
          ),
          gp=gpar(
              fontsize = 20, 
              fontface = "bold"
          )
      ),
      ## PART 1.
      multiplot_1_1,
      ## PART 2
      multiplot_1_2,
      ## PART 3
      multiplot_1_3,
      ## CAPTION
      textGrob(
          label = paste0(
              "\n",
              "All details on the source data, the data processing procedure and other ",
              "aspects of the analysis from which these results were obtained ", "\n",
              "are available at the associated github repository: ",
              "https://github.com/jzstats/Reproducible-Research--2nd-Assignment",
            "\n"
        ),
        gp=gpar(
            fontsize = 14
        )
    )
  ),
  layout_matrix = matrix(
      data = c(1,
               2,
               2,
               2,
               2,
               2,
               2,
               NA,
               3,
               3,
               3,
               3,
               3,
               3,
               NA,
               4,
               4,
               4,
               4,
               4,
               4,
               5
      ),
      byrow = TRUE,
      ncol = 1
  )
)
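The single-column layout matrix above controls how much vertical space each part receives. Assuming the rows keep equal heights (no `heights` argument is supplied in the call), the share of each grob is simply the number of rows it occupies. The short sketch below uses base R only, and the variable names are made up for the illustration.

```r
# The same row allocation as in the layout matrix of Figure 1:
# 1 = title, 2-4 = the three multiplots, 5 = caption, NA = empty spacer row.
layout_rows <- c(1, rep(2, 6), NA, rep(3, 6), NA, rep(4, 6), 5)

# Fraction of the total figure height taken by each element.
height_share <- table(layout_rows, useNA = "ifany") / length(layout_rows)

print(round(height_share, 3))
```

With 22 rows in total, each multiplot takes 6/22 of the height, while the title, the caption and each spacer take 1/22.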

Figure 1 consists of three parts, one for each of the three perspectives examined:

PART 1
- Contains the Multiplot 1.1 which was constructed at the subsection 8.1.4 Visualize the results of the summary for the harm on population health with respect to fatalities by each weather event type and displays the results for the harm on population health with respect to fatalities by each weather event type for all the aspects that were examined. It consists of four plots:
  - Plot 1.1.1
    - Displays the overall average number of fatalities caused by each weather event type based on all the cases of weather events that resulted in non-zero fatalities.
  - Plot 1.1.2
    - Displays the average number of fatalities caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero fatalities.
  - Plot 1.1.3
    - Displays the average number of fatalities caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero fatalities.
  - Plot 1.1.4
    - Displays a comparison for each weather event type, of the average number of fatalities for the 90% of its observations with the lowest impact versus the average number of fatalities for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero fatalities.
PART 2
- Contains the Multiplot 1.2 which was constructed at the subsection 8.2.4 Visualize the results of the summary for the harm on population health with respect to injuries by each weather event type and displays the results for the harm on population health with respect to injuries by each weather event type for all the aspects that were examined. It consists of four plots:
  - Plot 1.2.1
    - Displays the overall average number of injuries caused by each weather event type based on all the cases of weather events that resulted in non-zero injuries.
  - Plot 1.2.2
    - Displays the average number of injuries caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero injuries.
  - Plot 1.2.3
    - Displays the average number of injuries caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero injuries.
  - Plot 1.2.4
    - Displays a comparison for each weather event type, of the average number of injuries for the 90% of its observations with the lowest impact versus the average number of injuries for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero injuries.
PART 3
- Contains the Multiplot 1.3 which was constructed at the subsection 8.3.4 Visualize the results of the summary for the harm on population health with respect to casualties by each weather event type and displays the results for the harm on population health with respect to casualties by each weather event type for all the aspects that were examined. It consists of four plots:
  - Plot 1.3.1
    - Displays the overall average number of casualties caused by each weather event type based on all the cases of weather events that resulted in non-zero casualties.
  - Plot 1.3.2
    - Displays the average number of casualties caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero casualties.
  - Plot 1.3.3
    - Displays the average number of casualties caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero casualties.
  - Plot 1.3.4
    - Displays a comparison for each weather event type, of the average number of casualties for the 90% of its observations with the lowest impact versus the average number of casualties for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero casualties.

# Display the Figure 1 
grid.draw(figure_1)

The Figure 1 was exported (as a png file), in the folder of the working directory:

outputs –> harm_on_population_health –> figures

with filename:

figure_1.png

# Export Figure 1
ggsave(
  filename = "figure_1.png",
  plot = figure_1,
  device = "png",
  path = directory_tree_____outputs[[
    "filepath_____outputs_____harm_on_population_health_____figures"
    ]],
  width = 15,
  height = 50,
  limitsize = FALSE
)

10.1.2 Most harmful event types with respect to fatalities

According to the summary of the harm on population health with respect to fatalities by each weather event type (obtained in the section 8.1 Harm On Population Health With Respect To Fatalities By Each Weather Event Type), out of the 26 included weather event types (for each of which at least 10 observations that resulted in non-zero fatalities in the United States in the period from 2001 to 2011 were available), 7 stand out:

When a weather event of type TORNADO resulted in fatalities, it caused about 3.4 fatalities on average (based on 339 observations that had extreme positive skewness equal to 13.5732). For 9 out of 10 times of such cases, an average of 1.88 fatalities was observed (based on the 90% of cases with the lower impact for which 305 observations were available, that had moderate positive skewness equal to 1.812), while for the remaining 1 out of 10 times it caused around 17 fatalities on average (based on the 10% of cases with the higher impact for which 34 observations were available, that had high positive skewness equal to 4.9099).
When a weather event of type DEBRIS FLOW resulted in fatalities, it caused about 3.36 fatalities on average (based on 11 observations that had moderate positive skewness equal to 1.6608). For 9 out of 10 times of such cases, an average of 1.44 fatalities was observed (based on the 90% of cases with the lower impact for which only 9 observations were available, that had moderate positive skewness equal to 2.0673), while for the remaining 1 out of 10 times it caused around 12 fatalities on average (based on the 10% of cases with the higher impact for which only 2 observations were available, that had skewness equal to 0).
When a weather event of type HURRICANE/TYPHOON resulted in fatalities, it caused about 2.96 fatalities on average (based on 23 observations that had moderate positive skewness equal to 2.1981). For 9 out of 10 times of such cases, an average of 1.95 fatalities was observed (based on the 90% of cases with the lower impact for which 20 observations were available, that had moderate positive skewness equal to 1.6605), while for the remaining 1 out of 10 times it caused around 9.67 fatalities on average (based on the 10% of cases with the higher impact for which only 3 observations were available, that had low positive skewness equal to 0.7071).
When a weather event of type EXCESSIVE HEAT resulted in fatalities, it caused about 2.89 fatalities on average (based on 296 observations that had high positive skewness equal to 5.4405). For 9 out of 10 times of such cases, an average of 1.51 fatalities was observed (based on the 90% of cases with the lower impact for which 266 observations were available, that had moderate positive skewness equal to 1.9625), while for the remaining 1 out of 10 times it caused around 15.17 fatalities on average (based on the 10% of cases with the higher impact for which 30 observations were available, that had moderate positive skewness equal to 1.6149).
When a weather event of type WILDFIRE resulted in fatalities, it caused about 2.61 fatalities on average (based on 31 observations that had moderate positive skewness equal to 2.629). For 9 out of 10 times of such cases, an average of 1.59 fatalities was observed (based on the 90% of cases with the lower impact for which 27 observations were available, that had moderate positive skewness equal to 1.2688), while for the remaining 1 out of 10 times it caused around 9.5 fatalities on average (based on the 10% of cases with the higher impact for which only 4 observations were available, that had low negative skewness equal to -0.278).
When a weather event of type TROPICAL STORM resulted in fatalities, it caused about 2.5 fatalities on average (based on 20 observations that had high positive skewness equal to 3.8434). For 9 out of 10 times of such cases, an average of 1.33 fatalities was observed (based on the 90% of cases with the lower impact for which 18 observations were available, that had moderate positive skewness equal to 2.3814), while for the remaining 1 out of 10 times it caused around 13 fatalities on average (based on the 10% of cases with the higher impact for which only 2 observations were available, that had skewness equal to 0).
When a weather event of type HEAT resulted in fatalities, it caused about 1.81 fatalities on average (based on 127 observations that had high positive skewness equal to 4.1476). For 9 out of 10 times of such cases, an average of 1.26 fatalities was observed (based on the 90% of cases with the lower impact for which 114 observations were available, that had moderate positive skewness equal to 1.912), while for the remaining 1 out of 10 times it caused around 6.62 fatalities on average (based on the 10% of cases with the higher impact for which 13 observations were available, that had moderate positive skewness equal to 1.4602).
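The verbal skewness labels used above (low, moderate, high, extreme) can be reproduced with a small helper. The cutoffs 1, 3 and 7 in the sketch below are assumptions: they are consistent with the labels reported in this chapter, but the cutoffs actually used in the analysis are not stated here, and the function name `label_skewness` is made up for the illustration.

```r
# Label skewness by magnitude and direction.
# The cutoffs are illustrative assumptions, not the analysis' own values.
label_skewness <- function(skewness, cutoffs = c(1, 3, 7)) {
  magnitude <- c("low", "moderate", "high", "extreme")[
    findInterval(abs(skewness), cutoffs) + 1
  ]
  direction <- ifelse(skewness < 0, "negative", "positive")
  paste(magnitude, direction)
}

# Skewness values taken from the summaries reported in this section.
labels <- label_skewness(c(13.5732, 1.812, 4.9099, -0.278))
print(labels)
```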

# Create an interactive table to present the results
# for the harm on population health with respect to fatalities.
datatable(
  data = summary_____harm_on_population_health______fatalities[order(RANK)],
  caption = paste0(
    "Table 10.1.2-1: ",
    "Harm on Population Health with respect to Fatalities ", 
    "by Each Weather Event Type"
  ),
  colnames = c(
    "RANK", 
    "WEATHER EVENT TYPE", 
    "NUMBER OF ALL AVAILABLE OBSERVATIONS", 
    "AVERAGE NUMBER OF FATALITIES FOR ALL AVAILABLE CASES", 
    "SKEWNESS IN FATALITIES FOR ALL AVAILABLE CASES", 
    "NUMBER OF OBSERVATIONS FOR THE 90% OF CASES WITH LOWEST IMPACT", 
    "AVERAGE NUMBER OF FATALITIES FOR THE 90% OF CASES WITH LOWEST IMPACT",
    "SKEWNESS IN FATALITIES FOR THE 90% OF CASES WITH LOWEST IMPACT", 
    "NUMBER OF OBSERVATIONS FOR THE 10% OF CASES WITH HIGHEST IMPACT", 
    "AVERAGE NUMBER OF FATALITIES FOR THE 10% OF CASES WITH HIGHEST IMPACT",
    "SKEWNESS IN FATALITIES FOR THE 10% OF CASES WITH HIGHEST IMPACT"
  ),
  rownames = FALSE,
  escape = TRUE,
  options=list(
    pageLength = nrow(summary_____harm_on_population_health______fatalities),
    dom = "t",
    initComplete = JS(
      "function(settings, json) {",
      "$(this.api().table().header()).css({'font-size': '50%'});",
      "}"
    )
  )
) %>% 
  formatStyle(columns = c(1, 3:11), fontSize = '70%')%>% 
  formatStyle(columns = 2, fontSize = '35%')

10.1.3 Most harmful event types with respect to injuries

According to the summary of the harm on population health with respect to injuries by each weather event type (obtained in the section 8.2 Harm On Population Health With Respect To Injuries By Each Weather Event Type), out of the 27 included weather event types (for each of which at least 10 observations that resulted in non-zero injuries in the United States in the period from 2001 to 2011 were available), 3 stand out.

Specifically:

When a weather event of type HURRICANE/TYPHOON resulted in injuries, it caused about 86.07 injuries on average (based on 15 observations that had moderate positive skewness equal to 2.773). For 9 out of 10 times of such cases, an average of 15 injuries was observed (based on the 90% of cases with the lower impact for which 13 observations were available, that had moderate positive skewness equal to 2.8806), while for the remaining 1 out of 10 times it caused around 548 injuries on average (based on the 10% of cases with the higher impact for which only 2 observations were available, that had skewness equal to 0).
When a weather event of type EXCESSIVE HEAT resulted in injuries, it caused about 37.7 injuries on average (based on 86 observations that had high positive skewness equal to 4.1751). For 9 out of 10 times of such cases, an average of 16.48 injuries was observed (based on the 90% of cases with the lower impact for which 77 observations were available, that had moderate positive skewness equal to 1.2674), while for the remaining 1 out of 10 times it caused around 219.22 injuries on average (based on the 10% of cases with the higher impact for which only 9 observations were available, that had low positive skewness equal to 0.7763).
When a weather event of type HEAT resulted in injuries, it caused about 33.94 injuries on average (based on 36 observations that had moderate positive skewness equal to 2.1619). For 9 out of 10 times of such cases, an average of 13.56 injuries was observed (based on the 90% of cases with the lower impact for which 32 observations were available, that had moderate positive skewness equal to 2.4589), while for the remaining 1 out of 10 times it caused around 197 injuries on average (based on the 10% of cases with the higher impact for which only 4 observations were available, that had moderate negative skewness equal to -1.0869).

# Create an interactive table to present the results
# for the harm on population health with respect to injuries.
datatable(
  data = summary_____harm_on_population_health______injuries[order(RANK)],
  caption = paste0(
    "Table 10.1.3-1: ",
    "Harm on Population Health with respect to Injuries ", 
    "by Each Weather Event Type"
  ),
  colnames = c(
    "RANK", 
    "WEATHER EVENT TYPE", 
    "NUMBER OF ALL AVAILABLE OBSERVATIONS", 
    "AVERAGE NUMBER OF INJURIES FOR ALL AVAILABLE CASES", 
    "SKEWNESS IN INJURIES FOR ALL AVAILABLE CASES", 
    "NUMBER OF OBSERVATIONS FOR THE 90% OF CASES WITH LOWEST IMPACT", 
    "AVERAGE NUMBER OF INJURIES FOR THE 90% OF CASES WITH LOWEST IMPACT",
    "SKEWNESS IN INJURIES FOR THE 90% OF CASES WITH LOWEST IMPACT", 
    "NUMBER OF OBSERVATIONS FOR THE 10% OF CASES WITH HIGHEST IMPACT", 
    "AVERAGE NUMBER OF INJURIES FOR THE 10% OF CASES WITH HIGHEST IMPACT",
    "SKEWNESS IN INJURIES FOR THE 10% OF CASES WITH HIGHEST IMPACT"
  ),
  rownames = FALSE,
  escape = TRUE,
  options=list(
    pageLength = nrow(summary_____harm_on_population_health______injuries),
    dom = "t",
    initComplete = JS(
      "function(settings, json) {",
      "$(this.api().table().header()).css({'font-size': '50%'});",
      "}"
    )
  )
) %>% 
  formatStyle(columns = c(1, 3:11), fontSize = '70%')%>% 
  formatStyle(columns = 2, fontSize = '35%')

10.1.4 Most harmful event types with respect to casualties

According to the summary of the harm on population health with respect to casualties by each weather event type (obtained in the section 8.3 Harm On Population Health With Respect To Casualties By Each Weather Event Type), out of the 30 included weather event types (for each of which at least 10 observations that resulted in non-zero casualties in the United States in the period from 2001 to 2011 were available), 7 stand out.

Specifically:

When a weather event of type HURRICANE/TYPHOON resulted in casualties, it caused about 41.18 casualties on average (based on 33 observations that had high positive skewness equal to 4.4573). For 9 out of 10 times of such cases, an average of 3.93 casualties was observed (based on the 90% of cases with the lower impact for which 29 observations were available, that had moderate positive skewness equal to 2.1573), while for the remaining 1 out of 10 times it caused around 311.25 casualties on average (based on the 10% of cases with the higher impact for which only 4 observations were available, that had low positive skewness equal to 0.7473).
When a weather event of type EXCESSIVE HEAT resulted in casualties, it caused about 11.71 casualties on average (based on 350 observations that had extreme positive skewness equal to 8.3298). For 9 out of 10 times of such cases, an average of 2.85 casualties was observed (based on the 90% of cases with the lower impact for which 315 observations were available, that had moderate positive skewness equal to 2.7042), while for the remaining 1 out of 10 times it caused around 91.43 casualties on average (based on the 10% of cases with the higher impact for which 35 observations were available, that had moderate positive skewness equal to 2.7186).
When a weather event of type TORNADO resulted in casualties, it caused about 11.67 casualties on average (based on 1327 observations that had extreme positive skewness equal to 17.6038). For 9 out of 10 times of such cases, an average of 4.29 casualties was observed (based on the 90% of cases with the lower impact for which 1194 observations were available, that had moderate positive skewness equal to 1.936), while for the remaining 1 out of 10 times it caused around 77.9 casualties on average (based on the 10% of cases with the higher impact for which 133 observations were available, that had high positive skewness equal to 6.2215).
When a weather event of type DUST STORM resulted in casualties, it caused about 9.7 casualties on average (based on 23 observations that had moderate positive skewness equal to 1.5025). For 9 out of 10 times of such cases, an average of 6.35 casualties was observed (based on the 90% of cases with the lower impact for which 20 observations were available, that had moderate positive skewness equal to 1.2737), while for the remaining 1 out of 10 times it caused around 32 casualties on average (based on the 10% of cases with the higher impact for which only 3 observations were available, that had low positive skewness equal to 0.4703).
When a weather event of type HEAT resulted in casualties, it caused about 9.43 casualties on average (based on 154 observations that had high positive skewness equal to 5.2894). For 9 out of 10 times of such cases, an average of 1.7 casualties was observed (based on the 90% of cases with the lower impact for which 138 observations were available, that had moderate positive skewness equal to 2.459), while for the remaining 1 out of 10 times it caused around 76.12 casualties on average (based on the 10% of cases with the higher impact for which 16 observations were available, that had low positive skewness equal to 0.9965).
When a weather event of type TROPICAL STORM resulted in casualties, it caused about 9.32 casualties on average (based on 34 observations that had high positive skewness equal to 5.3288). For 9 out of 10 times of such cases, an average of 1.9 casualties was observed (based on the 90% of cases with the lower impact for which 30 observations were available, that had moderate positive skewness equal to 1.4887), while for the remaining 1 out of 10 times it caused around 65 casualties on average (based on the 10% of cases with the higher impact for which only 4 observations were available, that had moderate positive skewness equal to 1.1226).
When a weather event of type DENSE FOG resulted in casualties, it caused about 7.6 casualties on average (based on 20 observations that had moderate positive skewness equal to 1.3831). For 9 out of 10 times of such cases, an average of 5.83 casualties was observed (based on the 90% of cases with the lower impact for which 18 observations were available, that had low positive skewness equal to 0.5675), while for the remaining 1 out of 10 times it caused around 23.5 casualties on average (based on the 10% of cases with the higher impact for which only 2 observations were available, that had skewness equal to 0).

# Create an interactive table to present the results
# for the harm on population health with respect to casualties.
datatable(
  data = summary_____harm_on_population_health______casualties[order(RANK)],
  caption = paste0(
    "Table 10.1.4-1: ",
    "Harm on Population Health with respect to Casualties ", 
    "by Each Weather Event Type"
  ),
  colnames = c(
    "RANK", 
    "WEATHER EVENT TYPE", 
    "NUMBER OF ALL AVAILABLE OBSERVATIONS", 
    "AVERAGE NUMBER OF CASUALTIES FOR ALL AVAILABLE CASES", 
    "SKEWNESS IN CASUALTIES FOR ALL AVAILABLE CASES", 
    "NUMBER OF OBSERVATIONS FOR THE 90% OF CASES WITH LOWEST IMPACT", 
    "AVERAGE NUMBER OF CASUALTIES FOR THE 90% OF CASES WITH LOWEST IMPACT",
    "SKEWNESS IN CASUALTIES FOR THE 90% OF CASES WITH LOWEST IMPACT", 
    "NUMBER OF OBSERVATIONS FOR THE 10% OF CASES WITH HIGHEST IMPACT", 
    "AVERAGE NUMBER OF CASUALTIES FOR THE 10% OF CASES WITH HIGHEST IMPACT",
    "SKEWNESS IN CASUALTIES FOR THE 10% OF CASES WITH HIGHEST IMPACT"
  ),
  rownames = FALSE,
  escape = TRUE,
  options=list(
    pageLength = nrow(summary_____harm_on_population_health______casualties),
    dom = "t",
    initComplete = JS(
      "function(settings, json) {",
      "$(this.api().table().header()).css({'font-size': '50%'});",
      "}"
    )
  )
) %>% 
  formatStyle(columns = c(1, 3:11), fontSize = '70%')%>% 
  formatStyle(columns = 2, fontSize = '35%')

10.2 Question 2 : Across the United States, which types of events have the greatest economic consequences?

In an attempt to identify the most harmful weather event types with respect to the economy, three different perspectives were examined (the analysis can be found in the chapter 9 HARM ON ECONOMY).

A short overview of the results was presented at the subsection:

10.2.1 Overview of results for the harm on economy

Further details for each of the three perspectives are available in the following subsections:

10.2.2 Most harmful event types with respect to property damage
10.2.3 Most harmful event types with respect to crop damage
10.2.4 Most harmful event types with respect to economic damage

It is highlighted that the results must be evaluated under the following context in order to be meaningful:

The results for a perspective (property damage, crop damage or economic damage) refer specifically to the harm that was caused when harm with respect to that perspective was observed.
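The difference between the two readings can be illustrated with made-up numbers. The sketch below compares the average harm over the events that caused harm (the quantity reported in this chapter) with the average over all events of the same type. The values and variable names are assumptions for the illustration only.

```r
# Made-up property damage (in dollars) for 10 events of one type;
# only 2 of the 10 events caused any damage.
property_damage <- c(0, 0, 0, 0, 0, 0, 0, 0, 20000, 60000)

# Average damage given that damage was observed (what the results report).
mean_given_damage <- mean(property_damage[property_damage > 0])

# Average damage over every event of the type (NOT what the results report).
mean_all_events <- mean(property_damage)

print(c(given_damage = mean_given_damage, all_events = mean_all_events))
```

Here the first average is 40000 and the second is 8000, so a type that rarely causes damage can still rank high in these results if its damaging events are severe.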

For each perspective, it was considered appropriate to present three aspects, in order to give an insightful picture of the consequences caused by each weather event type:

the overall average harm
the average harm of 90% of cases with lower impact
the average harm of 10% of cases with higher impact

The number of observations and their skewness were summarized for each weather event type and for every aspect, and are presented along with the corresponding average.

10.2.1 Overview of results for the harm on economy

Figure 2 was created to display an overview of the results for the harm on economy by each weather event type.

# Compose the Figure 2, by combining
#   - Multiplot 2.1
#   - Multiplot 2.2
#   - Multiplot 2.3
figure_2 <- arrangeGrob(
  grobs = list(
      ## TITLE
      textGrob(
          label = paste0(
              "FIGURE 2: SUMMARY OF HARM ON ECONOMY BY EACH WEATHER EVENT TYPE."
          ),
          gp=gpar(
              fontsize = 20, 
              fontface = "bold"
          )
      ),
      ## PART 1.
      multiplot_2_1,
      ## PART 2
      multiplot_2_2,
      ## PART 3
      multiplot_2_3,
      ## CAPTION
      textGrob(
          label = paste0(
              "\n",
              "All details on the source data, the data processing procedure and other ",
              "aspects of the analysis from which these results were obtained ", "\n",
              "are available at the associated github repository: ",
              "https://github.com/jzstats/Reproducible-Research--2nd-Assignment",
            "\n"
        ),
        gp=gpar(
            fontsize = 14
        )
    )
  ),
  layout_matrix = matrix(
      data = c(1,
               2,
               2,
               2,
               2,
               2,
               2,
               NA,
               3,
               3,
               3,
               3,
               3,
               NA,
               4,
               4,
               4,
               4,
               4,
               4,
               5
      ),
      byrow = TRUE,
      ncol = 1
  )
)

Figure 2 consists of three parts, one for each of the three perspectives examined:

PART 1
- Contains the Multiplot 2.1 which was constructed at the subsection 9.1.4 Visualize the results of the summary for the harm on economy with respect to property damage by each weather event type and displays the results for the harm on economy with respect to property damage by each weather event type for all the aspects that were examined. It consists of four plots:
  - Plot 2.1.1
    - Displays the overall average property damage caused by each weather event type based on all the cases of weather events that resulted in non-zero property damage.
  - Plot 2.1.2
    - Displays the average property damage caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero property damage.
  - Plot 2.1.3
    - Displays the average property damage caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero property damage.
  - Plot 2.1.4
    - Displays a comparison for each weather event type, of the average property damage for the 90% of its observations with the lowest impact versus the average property damage for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero property damage.
PART 2
- Contains the Multiplot 2.2 which was constructed at the subsection 9.2.4 Visualize the results of the summary for the harm on economy with respect to crop damage by each weather event type and displays the results for the harm on economy with respect to crop damage by each weather event type for all the aspects that were examined. It consists of four plots:
  - Plot 2.2.1
    - Displays the overall average crop damage caused by each weather event type based on all the cases of weather events that resulted in non-zero crop damage.
  - Plot 2.2.2
    - Displays the average crop damage caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero crop damage.
  - Plot 2.2.3
    - Displays the average crop damage caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero crop damage.
  - Plot 2.2.4
    - Displays a comparison for each weather event type, of the average crop damage for the 90% of its observations with the lowest impact versus the average crop damage for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero crop damage.
PART 3
- Contains the Multiplot 2.3 which was constructed at the subsection 9.3.4 Visualize the results of the summary for the harm on economy with respect to economic damage by each weather event type and displays the results for the harm on economy with respect to economic damage by each weather event type for all the aspects that were examined. It consists of four plots:
  - Plot 2.3.1
    - Displays the overall average economic damage caused by each weather event type based on all the cases of weather events that resulted in non-zero economic damage.
  - Plot 2.3.2
    - Displays the average economic damage caused by each weather event type based on 90% of weather events with the lowest impact (for each weather event type) that resulted in non-zero economic damage.
  - Plot 2.3.3
    - Displays the average economic damage caused by each weather event type based on 10% of weather events with the highest impact (for each weather event type) that resulted in non-zero economic damage.
  - Plot 2.3.4
    - Displays a comparison for each weather event type, of the average economic damage for the 90% of its observations with the lowest impact versus the average economic damage for the 10% of its observations with the highest impact based only on the weather events that resulted in non-zero economic damage.

# Display the Figure 2
grid.draw(figure_2)

The Figure 2 was exported (as a png file), in the folder of the working directory:

outputs –> harm_on_economy –> figures

with filename:

figure_2.png

# Export Figure 2
ggsave(
  filename = "figure_2.png",
  plot = figure_2,
  device = "png",
  path = directory_tree_____outputs[[
    "filepath_____outputs_____harm_on_economy_____figures"
    ]],
  width = 15,
  height = 50,
  limitsize = FALSE
)

10.2.2 Most harmful event types with respect to property damage

According to the summary of the harm on economy with respect to property damage by each weather event type (obtained in the section 9.1 Harm On Economy With Respect To Property Damage By Each Weather Event Type), out of the 37 included weather event types (for each of which at least 10 observations that resulted in non-zero property damage in the United States in the period from 2001 to 2011 were available), 2 stand out:

When a weather event of type HURRICANE/TYPHOON resulted in property damage, it caused about $676,106,028 of property damage on average (based on 107 observations that had high positive skewness equal to 4.9333). For 9 out of 10 times of such cases, an average of $81,701,511 of property damage was observed (based on the 90% of cases with the lower impact for which 96 observations were available, that had high positive skewness equal to 3.4556), while for the remaining 1 out of 10 times it caused around $5,863,636,364 of property damage on average (based on the 10% of cases with the higher impact for which 11 observations were available, that had moderate positive skewness equal to 1.5154).
When a weather event of type STORM SURGE/TIDE resulted in property damage, it caused about $364,969,183 of property damage on average (based on 131 observations that had extreme positive skewness equal to 9.6344). For 9 out of 10 times of such cases, an average of $749,256 of property damage was observed (based on the 90% of cases with the lower impact for which 117 observations were available, that had moderate positive skewness equal to 2.9093), while for the remaining 1 out of 10 times it caused around $3,408,807,143 of property damage on average (based on the 10% of cases with the higher impact for which 14 observations were available, that had moderate positive skewness equal to 2.7389).
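Averages of this size are easier to compare when printed with thousands separators. One possible way to do so in base R is sketched below. The helper name `format_dollars` is made up for the illustration and is not part of the analysis code.

```r
# Format dollar amounts with a leading "$" and thousands separators.
format_dollars <- function(amount) {
  paste0(
    "$",
    format(amount, big.mark = ",", scientific = FALSE, trim = TRUE)
  )
}

# Three of the average property damage values reported above.
formatted <- format_dollars(c(676106028, 5863636364, 749256))
print(formatted)
```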

# Create an interactive table to present the results
# for the harm on economy with respect to property damage.
datatable(
  data = summary_____harm_on_economy______property_damage[order(RANK)],
  caption = paste0(
    "Table 10.2.2-1: ",
    "Harm on Economy with respect to Property Damage ",
    "by Each Weather Event Type"
  ),
  colnames = c(
    "RANK", 
    "WEATHER EVENT TYPE", 
    "NUMBER OF ALL AVAILABLE OBSERVATIONS", 
    "AVERAGE PROPERTY DAMAGE FOR ALL AVAILABLE CASES", 
    "SKEWNESS IN PROPERTY DAMAGE FOR ALL AVAILABLE CASES", 
    "NUMBER OF OBSERVATIONS FOR THE 90% OF CASES WITH LOWEST IMPACT", 
    "AVERAGE PROPERTY DAMAGE FOR THE 90% OF CASES WITH LOWEST IMPACT",
    "SKEWNESS IN PROPERTY DAMAGE FOR THE 90% OF CASES WITH LOWEST IMPACT", 
    "NUMBER OF OBSERVATIONS FOR THE 10% OF CASES WITH HIGHEST IMPACT", 
    "AVERAGE PROPERTY DAMAGE FOR THE 10% OF CASES WITH HIGHEST IMPACT",
    "SKEWNESS IN PROPERTY DAMAGE FOR THE 10% OF CASES WITH HIGHEST IMPACT"
  ),
  rownames = FALSE,
  escape = TRUE,
  options=list(
    pageLength = nrow(summary_____harm_on_economy______property_damage),
    dom = "t",
    initComplete = JS(
      "function(settings, json) {",
      "$(this.api().table().header()).css({'font-size': '50%'});",
      "}"
    )
  )
) %>%
  formatStyle(columns = c(1, 3:11), fontSize = '50%') %>%
  formatStyle(columns = 2, fontSize = '35%')


10.2.3 Most harmful event types with respect to crop damage

The summary of the harm on the economy with respect to crop damage by each weather event type (obtained in section 9.2 Harm On Economy With Respect To Crop Damage By Each Weather Event Type) covers 16 weather event types, each with at least 10 observations that resulted in non-zero crop damage in the United States in the period from 2001 to 2011. Of these, 2 stand out:

When a weather event of type HURRICANE/TYPHOON resulted in crop damage, it caused about $63,684,017 of crop damage on average (based on 48 observations with high positive skewness of 5.6962). In the 90% of cases with the lowest impact, the average crop damage was $13,275,181 (43 observations, moderate positive skewness of 2.4986), while in the 10% of cases with the highest impact it was around $497,200,000 (only 5 observations, moderate positive skewness of 1.3378).
When a weather event of type DROUGHT resulted in crop damage, it caused about $42,389,146 of crop damage on average (based on 158 observations with high positive skewness of 4.9333). In the 90% of cases with the lowest impact, the average crop damage was $11,981,373 (142 observations, moderate positive skewness of 2.3645), while in the 10% of cases with the highest impact it was around $312,258,125 (16 observations, moderate positive skewness of 1.8881).

# Create an interactive table to present the results
# for the harm on economy with respect to crop damage.
datatable(
  data = summary_____harm_on_economy______crop_damage[order(RANK)],
  caption = paste0(
    "Table 10.2.3-1: ",
    "Harm on Economy with respect to Crop Damage ",
    "by Each Weather Event Type"
  ),
  colnames = c(
    "RANK", 
    "WEATHER EVENT TYPE", 
    "NUMBER OF ALL AVAILABLE OBSERVATIONS", 
    "AVERAGE CROP DAMAGE FOR ALL AVAILABLE CASES", 
    "SKEWNESS IN CROP DAMAGE FOR ALL AVAILABLE CASES", 
    "NUMBER OF OBSERVATIONS FOR THE 90% OF CASES WITH LOWEST IMPACT", 
    "AVERAGE CROP DAMAGE FOR THE 90% OF CASES WITH LOWEST IMPACT",
    "SKEWNESS IN CROP DAMAGE FOR THE 90% OF CASES WITH LOWEST IMPACT", 
    "NUMBER OF OBSERVATIONS FOR THE 10% OF CASES WITH HIGHEST IMPACT", 
    "AVERAGE CROP DAMAGE FOR THE 10% OF CASES WITH HIGHEST IMPACT",
    "SKEWNESS IN CROP DAMAGE FOR THE 10% OF CASES WITH HIGHEST IMPACT"
  ),
  rownames = FALSE,
  escape = TRUE,
  options=list(
    pageLength = nrow(summary_____harm_on_economy______crop_damage),
    dom = "t",
    initComplete = JS(
      "function(settings, json) {",
      "$(this.api().table().header()).css({'font-size': '50%'});",
      "}"
    )
  )
) %>%
  formatStyle(columns = c(1, 3:11), fontSize = '50%') %>%
  formatStyle(columns = 2, fontSize = '35%')


10.2.4 Most harmful event types with respect to economic damage

The summary of the harm on the economy with respect to economic damage by each weather event type (obtained in section 9.3 Harm On Economy With Respect To Economic Damage By Each Weather Event Type) covers 16 weather event types, each with at least 10 observations that resulted in non-zero economic damage in the United States in the period from 2001 to 2011. Of these, 2 stand out:

When a weather event of type HURRICANE/TYPHOON resulted in economic damage, it caused about $698,149,795 of economic damage on average (based on 108 observations with high positive skewness of 4.7929). In the 90% of cases with the lowest impact, the average economic damage was $92,388,431 (97 observations, high positive skewness of 3.0615), while in the 10% of cases with the highest impact it was around $6,039,863,636 (11 observations, moderate positive skewness of 1.3803).
When a weather event of type STORM SURGE/TIDE resulted in economic damage, it caused about $364,975,672 of economic damage on average (based on 131 observations with extreme positive skewness of 9.6344). In the 90% of cases with the lowest impact, the average economic damage was $756,521 (117 observations, moderate positive skewness of 2.898), while in the 10% of cases with the highest impact it was around $3,408,807,143 (14 observations, moderate positive skewness of 2.7389).
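
As a quick consistency check on these figures, the overall average should equal the observation-weighted average of the two group averages. The sketch below applies this to the HURRICANE/TYPHOON values quoted above; the variable names are hypothetical and the check is an illustration, not part of the original script.

```r
# Group sizes and group averages reported for HURRICANE/TYPHOON
# (economic damage, in dollars).
n_lower <- 97
mean_lower <- 92388431
n_upper <- 11
mean_upper <- 6039863636

# The overall mean is the weighted mean of the group means,
# with the group sizes as weights.
combined_mean <- (n_lower * mean_lower + n_upper * mean_upper) /
  (n_lower + n_upper)
combined_mean

# Difference from the reported overall average of 698149795.
# A small difference is expected because the group averages are rounded.
combined_mean - 698149795
```

The same check can be applied to any row of the table, since each row reports both group sizes and both group averages.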

# Create an interactive table to present the results
# for the harm on economy with respect to economic damage.
datatable(
  data = summary_____harm_on_economy______economic_damage[order(RANK)],
  caption = paste0(
    "Table 10.2.4-3: ",
    "Harm on Economy with respect to Economic Damage ",
    "by Each Weather Event Type"
  ),
  colnames = c(
    "RANK", 
    "WEATHER EVENT TYPE", 
    "NUMBER OF ALL AVAILABLE OBSERVATIONS", 
    "AVERAGE ECONOMIC DAMAGE FOR ALL AVAILABLE CASES", 
    "SKEWNESS IN ECONOMIC DAMAGE FOR ALL AVAILABLE CASES", 
    "NUMBER OF OBSERVATIONS FOR THE 90% OF CASES WITH LOWEST IMPACT", 
    "AVERAGE ECONOMIC DAMAGE FOR THE 90% OF CASES WITH LOWEST IMPACT",
    "SKEWNESS IN ECONOMIC DAMAGE FOR THE 90% OF CASES WITH LOWEST IMPACT", 
    "NUMBER OF OBSERVATIONS FOR THE 10% OF CASES WITH HIGHEST IMPACT", 
    "AVERAGE ECONOMIC DAMAGE FOR THE 10% OF CASES WITH HIGHEST IMPACT",
    "SKEWNESS IN ECONOMIC DAMAGE FOR THE 10% OF CASES WITH HIGHEST IMPACT"
  ),
  rownames = FALSE,
  escape = TRUE,
  options=list(
    pageLength = nrow(summary_____harm_on_economy______economic_damage),
    dom = "t",
    initComplete = JS(
      "function(settings, json) {",
      "$(this.api().table().header()).css({'font-size': '50%'});",
      "}"
    )
  )
) %>%
  formatStyle(columns = c(1, 3:11), fontSize = '50%') %>%
  formatStyle(columns = 2, fontSize = '35%')


11 REPRODUCIBILITY DETAILS

To support any attempt to reproduce this report, several details are provided in addition to the structure of the script and the detailed description of the procedure it follows.

Specifically, in this chapter, information is supplied about:

the R session
the R options
the MD5 checksums of some important files
the random seed


11.1 Session Info

The operating system, the R version, and the versions of the libraries used to create this report are listed below to support any attempt to reproduce it.

# Captures the session info.
session_info <- sessionInfo()

# Display the session info.
session_info

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 18.3
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=el_GR.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=el_GR.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=el_GR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      tools     stats     graphics  grDevices utils     datasets  methods  
## [9] base     
## 
## other attached packages:
##  [1] rsconnect_0.8.16  gridExtra_2.3     ggrepel_0.8.2     ggplot2_3.3.0    
##  [5] moments_0.14      stringr_1.4.0     validate_0.9.3    data.table_1.12.8
##  [9] DT_0.13           magrittr_1.5      kableExtra_1.1.0  knitr_1.28       
## [13] rmdformats_0.3.7  rmarkdown_2.1    
## 
## loaded via a namespace (and not attached):
##  [1] settings_0.2.4    tidyselect_1.1.0  xfun_0.13         purrr_0.3.4      
##  [5] colorspace_1.4-1  vctrs_0.3.0       htmltools_0.4.0   viridisLite_0.3.0
##  [9] yaml_2.2.1        rlang_0.4.6       R.oo_1.23.0       pillar_1.4.4     
## [13] glue_1.4.0        withr_2.2.0       R.utils_2.9.2     lifecycle_0.2.0  
## [17] munsell_0.5.0     gtable_0.3.0      rvest_0.3.5       R.methodsS3_1.8.0
## [21] htmlwidgets_1.5.1 evaluate_0.14     labeling_0.3      crosstalk_1.1.0.1
## [25] curl_4.3          highr_0.8         Rcpp_1.0.4.6      readr_1.3.1      
## [29] scales_1.1.1      jsonlite_1.6.1    webshot_0.5.2     farver_2.0.3     
## [33] hms_0.5.3         packrat_0.5.0     digest_0.6.25     stringi_1.4.6    
## [37] bookdown_0.18     dplyr_0.8.5       tibble_3.0.1      crayon_1.3.4     
## [41] pkgconfig_2.0.3   ellipsis_0.3.0    xml2_1.3.2        assertthat_0.2.1 
## [45] httr_1.4.1        rstudioapi_0.11   R6_2.4.1          compiler_3.6.3

An object with the session information was also exported to the following subdirectory of the working directory:

outputs –> reproducibility_support –> r_session

with filename:

session_info.R

# Supply the filepath to which the file with the session info will be exported.
filepath_____session_info <- file.path(
  directory_tree_____outputs[[
    "filepath_____outputs_____reproducibility_support_____r_session"
  ]],
  "session_info.R"
)

# Export the session info. Note that saveRDS() writes a serialized R object,
# so the file must be read back with readRDS() despite its .R extension.
saveRDS(
  object = session_info,
  file = filepath_____session_info
)
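
Because the object is written with saveRDS(), the exported file is a serialized R object rather than a script, so it cannot be loaded with source(). A minimal sketch of the round trip is shown below. It is an illustration under assumptions, not part of the original script: it writes to a temporary directory instead of the report's outputs folder, and the variable names are hypothetical.

```r
# Hypothetical location; in the report the file is exported under
# outputs/reproducibility_support/r_session of the working directory.
example_path <- file.path(tempdir(), "session_info.R")

# Write the session info of the current session, as the report does.
saveRDS(object = sessionInfo(), file = example_path)

# Read it back with readRDS(); the result is again a "sessionInfo" object.
restored_session_info <- readRDS(example_path)

# Inspect individual fields, e.g. the R version and the attached packages.
restored_session_info$R.version$version.string
names(restored_session_info$otherPkgs)
```

Comparing these fields with the listing above is a quick way to see how far a new environment differs from the one that produced the report.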


11.2 Options

The R options that were active while the script that produced the report was executed are supplied to support any attempt to reproduce the report.

# Captures the R options. 
r_options <- options()

# Displays the R options
r_options

## $add.smooth
## [1] TRUE
## 
## $askpass
## function (prompt) 
## {
##     .Call("rs_askForPassword", prompt)
## }
## <environment: 0x55b8df09dda0>
## 
## $asksecret
## function (name, title = name, prompt = paste(name, ":", sep = "")) 
## {
##     result <- .Call("rs_askForSecret", name, title, prompt, .rs.isPackageInstalled("keyring"), 
##         .rs.hasSecret(name))
##     if (is.null(result)) 
##         stop("Ask for secret operation was cancelled.")
##     result
## }
## <environment: 0x55b8df09dda0>
## 
## $bitmapType
## [1] "cairo"
## 
## $browser
## function (url) 
## {
##     .Call("rs_browseURL", url)
## }
## <environment: 0x55b8ddacd658>
## 
## $browserNLdisabled
## [1] FALSE
## 
## $buildtools.check
## function (action) 
## {
##     if (identical(.Platform$pkgType, "mac.binary.mavericks")) {
##         .Call("rs_canBuildCpp")
##     }
##     else {
##         if (!.Call("rs_canBuildCpp")) {
##             .rs.installBuildTools(action)
##             FALSE
##         }
##         else {
##             TRUE
##         }
##     }
## }
## <environment: 0x55b8df09dda0>
## 
## $buildtools.with
## function (code) 
## {
##     .rs.addRToolsToPath()
##     on.exit(.rs.restorePreviousPath(), add = TRUE)
##     force(code)
## }
## <environment: 0x55b8df09dda0>
## 
## $CBoundsCheck
## [1] FALSE
## 
## $check.bounds
## [1] FALSE
## 
## $citation.bibtex.max
## [1] 1
## 
## $connectionObserver
## $connectionObserver$connectionOpened
## function (type, host, displayName, icon = NULL, connectCode, 
##     disconnect, listObjectTypes, listObjects, listColumns, previewObject, 
##     connectionObject, actions = NULL) 
## {
##     if (!inherits(listObjectTypes, "function")) {
##         stop("listObjectTypes must be a function returning a list of object types", 
##             call. = FALSE)
##     }
##     promote <- function(name, l) {
##         if (length(l) == 0) 
##             return(list())
##         if (is.null(l$contains)) {
##             return(list(list(name = name, icon = l$icon, contains = "data")))
##         }
##         else {
##             return(unlist(append(list(list(list(name = name, 
##                 icon = l$icon, contains = names(l$contains)))), 
##                 lapply(names(l$contains), function(name) {
##                   promote(name, l$contains[[name]])
##                 })), recursive = FALSE))
##         }
##         return(list())
##     }
##     objectTree <- listObjectTypes()
##     objectTypes <- lapply(names(objectTree), function(name) {
##         promote(name, objectTree[[name]])
##     })[[1]]
##     connection <- list(type = type, host = host, displayName = displayName, 
##         icon = icon, connectCode = connectCode, disconnect = disconnect, 
##         objectTypes = objectTypes, listObjects = listObjects, 
##         listColumns = listColumns, previewObject = previewObject, 
##         actions = actions, connectionObject = connectionObject)
##     class(connection) <- "rstudioConnection"
##     .rs.validateConnection(connection)
##     cacheKey <- paste(connection$type, connection$host, .Call("rs_generateShortUuid"), 
##         sep = "_")
##     assign(cacheKey, value = connection, envir = .rs.activeConnections)
##     invisible(.Call("rs_connectionOpened", connection))
## }
## <environment: 0x55b8e0039108>
## 
## $connectionObserver$connectionClosed
## function (type, host, ...) 
## {
##     .rs.validateCharacterParams(list(type = type, host = host))
##     name <- .rs.findConnectionName(type, host)
##     if (!is.null(name)) 
##         rm(list = name, envir = .rs.activeConnections)
##     invisible(.Call("rs_connectionClosed", type, host))
## }
## <environment: 0x55b8e0039108>
## 
## $connectionObserver$connectionUpdated
## function (type, host, hint, ...) 
## {
##     .rs.validateCharacterParams(list(type = type, host = host, 
##         hint = hint))
##     invisible(.Call("rs_connectionUpdated", type, host, hint))
## }
## <environment: 0x55b8e0039108>
## 
## 
## $continue
## [1] "+ "
## 
## $contrasts
##         unordered           ordered 
## "contr.treatment"      "contr.poly" 
## 
## $datatable.alloccol
## [1] 1024
## 
## $datatable.allow.cartesian
## [1] FALSE
## 
## $datatable.auto.index
## [1] TRUE
## 
## $datatable.dfdispatchwarn
## [1] TRUE
## 
## $datatable.old.unique.by.key
## [1] FALSE
## 
## $datatable.optimize
## [1] Inf
## 
## $datatable.print.class
## [1] FALSE
## 
## $datatable.print.colnames
## [1] "auto"
## 
## $datatable.print.keys
## [1] FALSE
## 
## $datatable.print.nrows
## [1] 100
## 
## $datatable.print.rownames
## [1] TRUE
## 
## $datatable.print.topn
## [1] 5
## 
## $datatable.use.index
## [1] TRUE
## 
## $datatable.verbose
## [1] FALSE
## 
## $datatable.warnredundantby
## [1] TRUE
## 
## $defaultPackages
## [1] "datasets"  "utils"     "grDevices" "graphics"  "stats"     "methods"  
## 
## $demo.ask
## [1] "default"
## 
## $deparse.cutoff
## [1] 60
## 
## $device
## function (width = 7, height = 7, ...) 
## {
##     grDevices::pdf(NULL, width, height, ...)
## }
## <bytecode: 0x55b8e1318000>
## <environment: namespace:knitr>
## 
## $device.ask.default
## [1] FALSE
## 
## $digits
## [1] 7
## 
## $download.file.method
## [1] "libcurl"
## 
## $dplyr.show_progress
## [1] TRUE
## 
## $dvipscmd
## [1] "dvips"
## 
## $echo
## [1] TRUE
## 
## $editor
## [1] "vi"
## 
## $encoding
## [1] "native.enc"
## 
## $error
## (function () 
## {
##     .rs.recordTraceback(TRUE, 5, .rs.enqueueError)
## })()
## 
## $example.ask
## [1] "default"
## 
## $expressions
## [1] 5000
## 
## $ggvis.renderer
## [1] "svg"
## 
## $help.search.types
## [1] "vignette" "demo"     "help"    
## 
## $help.try.all.packages
## [1] FALSE
## 
## $help_type
## [1] "html"
## 
## $HTTPUserAgent
## [1] "RStudio Desktop (1.2.5033); R (3.6.3 x86_64-pc-linux-gnu x86_64 linux-gnu)"
## 
## $httr_oauth_cache
## [1] NA
## 
## $httr_oob_default
## [1] FALSE
## 
## $internet.info
## [1] 2
## 
## $keep.parse.data
## [1] TRUE
## 
## $keep.parse.data.pkgs
## [1] FALSE
## 
## $keep.source
## [1] TRUE
## 
## $keep.source.pkgs
## [1] FALSE
## 
## $knitr.in.progress
## [1] TRUE
## 
## $knitr.table.format
## [1] "html"
## 
## $locatorBell
## [1] TRUE
## 
## $mailer
## [1] "mailto"
## 
## $matprod
## [1] "default"
## 
## $max.print
## [1] 1000
## 
## $menu.graphics
## [1] FALSE
## 
## $na.action
## [1] "na.omit"
## 
## $nwarnings
## [1] 50
## 
## $OutDec
## [1] "."
## 
## $pager
## function (files, header, title, delete.file) 
## {
##     for (i in 1:length(files)) {
##         if ((i > length(header)) || !nzchar(header[[i]])) 
##             fileTitle <- title
##         else fileTitle <- header[[i]]
##         .Call("rs_showFile", fileTitle, files[[i]], delete.file)
##     }
## }
## <environment: 0x55b8df09dda0>
## 
## $page_viewer
## function (url, title = "RStudio Viewer", self_contained = FALSE) 
## {
##     if (!is.character(url) || (length(url) != 1)) 
##         stop("url must be a single element character vector.", 
##             call. = FALSE)
##     if (!is.character(title) || (length(title) != 1)) 
##         stop("title must be a single element character vector.", 
##             call. = FALSE)
##     if (!is.logical(self_contained) || (length(self_contained) != 
##         1)) 
##         stop("self_contained must be a single element logical vector.", 
##             call. = FALSE)
##     invisible(.Call("rs_showPageViewer", url, title, self_contained))
## }
## <environment: 0x55b8ddacd658>
## 
## $papersize
## [1] "a4"
## 
## $PCRE_limit_recursion
## [1] NA
## 
## $PCRE_study
## [1] 10
## 
## $PCRE_use_JIT
## [1] TRUE
## 
## $pdfviewer
## [1] "/usr/bin/xdg-open"
## 
## $pkgType
## [1] "source"
## 
## $plumber.swagger.url
## function (url) 
## {
##     invisible(.Call("rs_plumberviewer", url, getwd(), 3))
## }
## <environment: 0x55b8df09dda0>
## attr(,"plumberViewerType")
## [1] 3
## 
## $printcmd
## [1] "/usr/bin/lpr"
## 
## $profvis.keep_output
## [1] TRUE
## 
## $profvis.print
## function (x) 
## {
##     envir <- as.environment(which(search() == "tools:rstudio"))
##     eval(substitute(.rs.profilePrint(x), list(x = x)), envir = envir)
## }
## <environment: 0x55b8dd66cb80>
## 
## $profvis.prof_extension
## [1] ".Rprof"
## 
## $profvis.prof_output
## [1] "/home/rick/Documents/training/coursera/spec__data_science_specialization/Reproducible-Research--2nd-Assignment/.Rproj.user/30C9779D/profiles-cache"
## 
## $prompt
## [1] "> "
## 
## $readr.show_progress
## [1] TRUE
## 
## $repos
##                          CRAN 
## "https://cloud.r-project.org" 
## 
## $restart
## function (afterRestartCommand = "") 
## {
##     afterRestartCommand <- paste(as.character(afterRestartCommand), 
##         collapse = "\n")
##     .Call("rs_restartR", afterRestartCommand, PACKAGE = "(embedding)")
## }
## <environment: 0x55b8df09dda0>
## 
## $reticulate.repl.hook
## function (buffer, contents, trimmed) 
## {
##     if (buffer$empty()) {
##         if (grepl("^[?]", trimmed)) {
##             text <- substring(trimmed, 2)
##             .Call("rs_showPythonHelp", text, PACKAGE = "(embedding)")
##             return(TRUE)
##         }
##         reHelp <- "help\\((.*)\\)"
##         if (grepl(reHelp, trimmed)) {
##             text <- gsub(reHelp, "\\1", trimmed)
##             .Call("rs_showPythonHelp", text, PACKAGE = "(embedding)")
##             return(TRUE)
##         }
##     }
##     FALSE
## }
## <environment: 0x55b8df09dda0>
## 
## $reticulate.repl.initialize
## function () 
## {
##     builtins <- reticulate::import_builtins(convert = FALSE)
##     help <- builtins$help
##     .rs.setVar("reticulate.help", builtins$help)
##     builtins$help <- function(...) {
##         dots <- list(...)
##         if (length(dots) == 0) {
##             message("Error: Interactive Python help not available within RStudio")
##             return()
##         }
##         help(...)
##     }
##     if (requireNamespace("png", quietly = TRUE) && reticulate::py_module_available("matplotlib")) {
##         matplotlib <- reticulate::import("matplotlib", convert = TRUE)
##         backend <- matplotlib$get_backend()
##         if (!identical(tolower(backend), "agg")) {
##             sys <- reticulate::import("sys", convert = TRUE)
##             if ("matplotlib.backends" %in% names(sys$modules)) 
##                 matplotlib$pyplot$switch_backend("agg")
##             else matplotlib$use("agg", warn = FALSE, force = TRUE)
##         }
##         plt <- matplotlib$pyplot
##         .rs.setVar("reticulate.matplotlib.show", plt$show)
##         plt$show <- .rs.reticulate.matplotlib.showHook
##     }
## }
## <environment: 0x55b8df09dda0>
## 
## $reticulate.repl.teardown
## function () 
## {
##     builtins <- reticulate::import_builtins(convert = FALSE)
##     builtins$help <- .rs.getVar("reticulate.help")
##     show <- .rs.getVar("reticulate.matplotlib.show")
##     if (!is.null(show)) {
##         matplotlib <- reticulate::import("matplotlib", convert = TRUE)
##         plt <- matplotlib$pyplot
##         plt$show <- show
##     }
## }
## <environment: 0x55b8df09dda0>
## 
## $rl_word_breaks
## [1] " \t\n\"\\'`><=%;,|&{()}"
## 
## $rsconnect.http.timeout
## [1] 5
## 
## $rsconnect.max.bundle.files
## [1] 10000
## 
## $rsconnect.max.bundle.size
## [1] 3145728000
## 
## $rstudio.notebook.executing
## [1] FALSE
## 
## $scipen
## [1] 0
## 
## $shinygadgets.showdialog
## function (caption, url, width = NULL, height = NULL) 
## {
##     if (!is.character(caption) || (length(caption) != 1)) 
##         stop("caption must be a single element character vector.", 
##             call. = FALSE)
##     if (!is.character(url) || (length(url) != 1)) 
##         stop("url must be a single element character vector.", 
##             call. = FALSE)
##     if (is.null(width)) 
##         width <- 600
##     if (is.null(height)) 
##         height <- 600
##     if (!is.numeric(width) || (length(width) != 1)) 
##         stop("width must be a single element numeric vector.", 
##             call. = FALSE)
##     if (!is.numeric(height) || (length(height) != 1)) 
##         stop("height must be a single element numeric vector.", 
##             call. = FALSE)
##     invisible(.Call("rs_showShinyGadgetDialog", caption, url, 
##         width, height))
## }
## <environment: 0x55b8ddacd658>
## 
## $shiny.launch.browser
## function (url) 
## {
##     invisible(.Call("rs_shinyviewer", url, getwd(), 3))
## }
## <environment: 0x55b8df09dda0>
## attr(,"shinyViewerType")
## [1] 3
## 
## $show.coef.Pvalues
## [1] TRUE
## 
## $show.error.messages
## [1] TRUE
## 
## $show.signif.stars
## [1] TRUE
## 
## $str
## $str$strict.width
## [1] "no"
## 
## $str$digits.d
## [1] 3
## 
## $str$vec.len
## [1] 4
## 
## $str$drop.deparse.attr
## [1] TRUE
## 
## $str$formatNum
## function (x, ...) 
## format(x, trim = TRUE, drop0trailing = TRUE, ...)
## <environment: 0x55b8debadb80>
## 
## 
## $str.dendrogram.last
## [1] "`"
## 
## $stringsAsFactors
## [1] TRUE
## 
## $terminal.manager
## $terminal.manager$terminalActivate
## function (id = NULL, show = TRUE) 
## {
##     if (!is.null(id) && (!is.character(id) || (length(id) != 
##         1))) 
##         stop("'id' must be NULL or a character vector of length one")
##     if (!is.logical(show)) 
##         stop("'show' must be TRUE or FALSE")
##     .Call("rs_terminalActivate", id, show)
##     invisible(NULL)
## }
## <environment: 0x55b8df09dda0>
## 
## $terminal.manager$terminalCreate
## function (caption = NULL, show = TRUE, shellType = NULL) 
## {
##     if (!is.null(caption) && (!is.character(caption) || (length(caption) != 
##         1))) 
##         stop("'caption' must be NULL or a character vector of length one")
##     if (is.null(show) || !is.logical(show)) 
##         stop("'show' must be a logical vector")
##     if (!is.null(shellType) && (!is.character(shellType) || (length(shellType) != 
##         1))) 
##         stop("'shellType' must be NULL or a character vector of length one")
##     validShellType = TRUE
##     if (!is.null(shellType)) {
##         validShellType <- tolower(shellType) %in% c("default", 
##             "win-cmd", "win-ps", "win-git-bash", "win-wsl-bash", 
##             "custom")
##     }
##     if (!validShellType) 
##         stop("'shellType' must be NULL, or one of 'default', 'win-cmd', 'win-ps', 'win-git-bash', 'win-wsl-bash', or 'custom'.")
##     .Call("rs_terminalCreate", caption, show, shellType)
## }
## <environment: 0x55b8df09dda0>
## 
## $terminal.manager$terminalClear
## function (id) 
## {
##     if (is.null(id) || !is.character(id) || length(id) != 1) 
##         stop("'id' must be a character vector of length one")
##     .Call("rs_terminalClear", id)
##     invisible(NULL)
## }
## <environment: 0x55b8df09dda0>
## 
## $terminal.manager$terminalList
## function () 
## {
##     .Call("rs_terminalList")
## }
## <environment: 0x55b8df09dda0>
## 
## $terminal.manager$terminalContext
## function (id) 
## {
##     if (is.null(id) || !is.character(id) || (length(id) != 1)) 
##         stop("'id' must be a single element character vector")
##     .Call("rs_terminalContext", id)
## }
## <environment: 0x55b8df09dda0>
## 
## $terminal.manager$terminalBuffer
## function (id, stripAnsi = TRUE) 
## {
##     if (is.null(id) || !is.character(id) || (length(id) != 1)) 
##         stop("'id' must be a single element character vector")
##     if (is.null(stripAnsi) || !is.logical(stripAnsi)) 
##         stop("'stripAnsi' must be a logical vector")
##     .Call("rs_terminalBuffer", id, stripAnsi)
## }
## <environment: 0x55b8df09dda0>
## 
## $terminal.manager$terminalVisible
## function () 
## {
##     .Call("rs_terminalVisible")
## }
## <environment: 0x55b8df09dda0>
## 
## $terminal.manager$terminalBusy
## function (id) 
## {
##     if (is.null(id) || !is.character(id)) 
##         stop("'id' must be a character vector")
##     .Call("rs_terminalBusy", id)
## }
## <environment: 0x55b8df09dda0>
## 
## $terminal.manager$terminalRunning
## function (id) 
## {
##     if (is.null(id) || !is.character(id)) 
##         stop("'id' must be a character vector")
##     .Call("rs_terminalRunning", id)
## }
## <environment: 0x55b8df09dda0>
## 
## $terminal.manager$terminalKill
## function (id) 
## {
##     if (is.null(id) || !is.character(id)) 
##         stop("'id' must be a character vector")
##     .Call("rs_terminalKill", id)
##     invisible(NULL)
## }
## <environment: 0x55b8df09dda0>
## 
## $terminal.manager$terminalSend
## function (id, text) 
## {
##     if (!is.character(text)) 
##         stop("'text' should be a character vector", call. = FALSE)
##     if (is.null(id) || !is.character(id) || length(id) != 1) 
##         stop("'id' must be a character vector of length one")
##     .Call("rs_terminalSend", id, text)
##     invisible(NULL)
## }
## <environment: 0x55b8df09dda0>
## 
## $terminal.manager$terminalExecute
## function (command, workingDir = NULL, env = character(), show = TRUE) 
## {
##     if (is.null(command) || !is.character(command) || (length(command) != 
##         1)) 
##         stop("'command' must be a single element character vector")
##     if (!is.null(workingDir) && (!is.character(workingDir) || 
##         (length(workingDir) != 1))) 
##         stop("'workingDir' must be a single element character vector")
##     if (!is.null(env) && !is.character(env)) 
##         stop("'env' must be a character vector")
##     if (is.null(show) || !is.logical(show)) 
##         stop("'show' must be a logical vector")
##     .Call("rs_terminalExecute", command, workingDir, env, show)
## }
## <environment: 0x55b8df09dda0>
## 
## $terminal.manager$terminalExitCode
## function (id) 
## {
##     if (is.null(id) || !is.character(id) || (length(id) != 1)) 
##         stop("'id' must be a single element character vector")
##     .Call("rs_terminalExitCode", id)
## }
## <environment: 0x55b8df09dda0>
## 
## 
## $texi2dvi
## [1] "/usr/bin/texi2dvi"
## 
## $tikzMetricsDictionary
## [1] "RepRes_analysis-tikzDictionary"
## 
## $timeout
## [1] 60
## 
## $try.outFile
## A connection with                            
## description "output"        
## class       "textConnection"
## mode        "wr"            
## text        "text"          
## opened      "opened"        
## can read    "no"            
## can write   "yes"           
## 
## $ts.eps
## [1] 1e-05
## 
## $ts.S.compat
## [1] FALSE
## 
## $unzip
## [1] "/usr/bin/unzip"
## 
## $useFancyQuotes
## [1] FALSE
## 
## $verbose
## [1] FALSE
## 
## $viewer
## function (url, height = NULL) 
## {
##     if (!is.character(url) || (length(url) != 1)) 
##         stop("url must be a single element character vector.", 
##             call. = FALSE)
##     if (identical(height, "maximize")) 
##         height <- -1
##     if (!is.null(height) && (!is.numeric(height) || (length(height) != 
##         1))) 
##         stop("height must be a single element numeric vector or 'maximize'.", 
##             call. = FALSE)
##     invisible(.Call("rs_viewer", url, height))
## }
## <environment: 0x55b8ddacd658>
## 
## $warn
## [1] 0
## 
## $warning.length
## [1] 1000
## 
## $width
## [1] 92

An object with the R options was also exported to the following subdirectory of the working directory:

outputs –> reproducibility_support –> r_session

with filename:

r_options.R

# Supply the filepath to which the file with the R options will be exported.
filepath_____r_options <- file.path(
  directory_tree_____outputs[[
    "filepath_____outputs_____reproducibility_support_____r_session"
  ]],
  "r_options.R"
)

# Export the R options. Note that saveRDS() writes a serialized R object,
# so the file must be read back with readRDS() despite its .R extension.
saveRDS(
  object = r_options,
  file = filepath_____r_options
)

## Warning in saveRDS(object = r_options, file = filepath_____r_options): 'package:stats' may
## not be available when loading
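
When reproducing the report, the options that matter most are the plain value options rather than the RStudio callback functions. The sketch below compares a few of the values printed above with the current session. It is a hypothetical illustration: `changed_options()` is not part of the original script, and the list of options to compare is an arbitrary selection. One difference to expect on newer installations is `stringsAsFactors`, whose default changed from TRUE to FALSE in R 4.0.0.

```r
# Hypothetical helper: return the names of the options whose saved value
# differs from the value in the current session (or is missing there).
changed_options <- function(saved, current = options()) {
  same <- vapply(
    names(saved),
    function(name) isTRUE(all.equal(saved[[name]], current[[name]])),
    logical(1)
  )
  names(saved)[!same]
}

# A few value options exactly as printed in the listing above.
reported_options <- list(
  digits = 7,
  OutDec = ".",
  scipen = 0,
  stringsAsFactors = TRUE,
  width = 92
)

# Names of the reported options that differ in the current session.
changed_options(reported_options)
```

In practice the `saved` argument could instead be the full list restored with readRDS() from the exported r_options.R file.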


11.3 MD5 Checksums

To make it easy to verify the integrity of some important files that were either imported or exported during the execution of the script that produces the report, their MD5 checksums were computed and exported as txt files with the help of a purpose-built utility function, export_md5sums().

Three txt files with MD5 checksums were created:

unprocessed_data_____MD5_checksum.txt
processed_data_____MD5_checksum.txt
results_____MD5_checksum.txt

and exported at the subdirectory of the working directory:

output –> reproducibility_support –> MD5_checksums

The original files with the MD5 checksums have been uploaded to GitHub.

back to start of this section
back to start of this chapter
back to TABLE OF CONTENTS

11.3.1 Create a utility function to export MD5 checksums

To create the txt files with the MD5 checksums, the utility function export_md5sum() was created and used.

It takes two arguments:

target_files
- the paths to the files of which we want to compute the MD5 checksums
output_file
- the path at which the txt file with the MD5 checksums of the target files will be exported

Upon execution, it creates a txt file at the path denoted by the argument ‘output_file’ in which it stores the MD5 checksums of the files found at the paths supplied via the ‘target_files’ argument.

The txt file consists of one row for each target file, which:

begins with the MD5 checksum
followed by two spaces
ends with the path of the file to which the MD5 checksum corresponds

# utility function: export_md5sum()
#
# Creates and exports a txt file with the MD5 checksums of some target files.  
#
# Arguments:
#  'target_files'  :  A character vector with the paths of the target files,
#                     of which the MD5 checksums will be computed.
#                     All supplied files must exist.
#
#  'output_file'   :  A character string with the path to the file 
#                     which will be created to store the MD5 checksums
#                     of the target files.
#                     The output file must end with the txt extension.
#                     Any number of directories can be included in the path 
#                     prior to the filename; they will be created 
#                     if they don't exist. 
#
# Return: 
#  If the function executes correctly, it returns a named vector 
#  with the MD5 checksums of the target files, 
#  named after their corresponding paths. 

# Define a utility function to use in order to compute and export 
# the MD5 checksums of the files of interest. 
export_md5sum <- function(target_files, output_file = "MD5.txt") {

  # Check the validity of the supplied arguments.
  ## A single character string with a txt extension must have been supplied 
  ## as the value of the 'output_file' argument.
  stopifnot(
    is.character(output_file) &&
      ( length(output_file) == 1 ) &&
      ( tools::file_ext(output_file) == "txt" )
  )
  ## A character vector with an arbitrary number of EXISTING files 
  ## must have been supplied as the value of the 'target_files' argument. 
  do_all_target_files_exist <- file.exists(target_files) & !dir.exists(target_files)
  if (!all(do_all_target_files_exist)) {
    not_existing_target_files <- target_files[!do_all_target_files_exist]
    stop(
      "\n",
      "The following supplied target files do not exist: ", "\n",
      paste("\t", not_existing_target_files, "\n", sep = "")
    )
  }

  # Computes the MD5 checksums of the target files.
  md5_checksums_of_target_files <- tools::md5sum(target_files)

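  # Create the content that will be written inside the output file:
  # one line per target file, with the checksum, two spaces and the path.
  # paste0() is used so that no extra separator is added around the two spaces.
  content_of_output_file <- paste0(
    unname(md5_checksums_of_target_files),
    "  ",
    names(md5_checksums_of_target_files)
  )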

  # If the value of output file contains some directory names,
  # they are identified and created (including any missing parent directories).
  dest_dir <- dirname(output_file)
  if (!dir.exists(dest_dir)) {
    dir.create(dest_dir, recursive = TRUE)
  }
  # A blank output file is created.
  file.create(output_file)

  # If the output file was successfully created...
  if (file.exists(output_file)) {
    # ...it gets populated with contents.
    con_to_output_file <- file(output_file, open = "wt")
    writeLines(text = content_of_output_file, con = con_to_output_file)
    close(con = con_to_output_file)
  } else {
    # ...else the operation fails and the execution stops.
    stop(
      "\n",
      "Failed to create the output file at the path:", "\n",
      "\t", output_file,
      "\n"
    )
  }

  # Return (invisibly) the named vector with the MD5 checksums.
  invisible(md5_checksums_of_target_files)
}
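
The row format described above (checksum, two spaces, path) can be illustrated on its own. This is a minimal sketch that uses a throwaway file in the temporary directory, an assumption made for the example, rather than any of the files of the analysis:

```r
# Hypothetical target file, created only for this illustration.
example_file <- file.path(tempdir(), "example_target.txt")
writeLines("hello", con = example_file)

# tools::md5sum() returns a character vector named after the file paths.
checksums <- tools::md5sum(example_file)

# One line per file: checksum, two spaces, path.
checksum_lines <- paste0(unname(checksums), "  ", names(checksums))
checksum_lines
```

Each resulting line starts with the 32 hexadecimal characters of the checksum, which makes the checksum easy to extract again later.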

back to start of this subsection
back to start of this section
back to start of this chapter
back to TABLE OF CONTENTS

11.3.2 MD5 checksum of the input file with the unprocessed data

The input file with the unprocessed data, repdata_data_StormData.csv.bz2, was downloaded from the link that was supplied in the instructions of the assignment.

The same file that was downloaded and used to produce the original report for this analysis was also uploaded to GitHub and can be accessed from the following link:

repdata_data_StormData.csv.bz2 on GitHub

A txt file with the MD5 checksum of the input file, repdata_data_StormData.csv.bz2, was exported at the subdirectory of the working directory:

output –> reproducibility_support –> MD5_checksums

with name:

unprocessed_data_____MD5_checksum.txt

# Supply the filepath at which the txt file with MD5 checksum 
# of the file with the unprocessed data will be exported.
filepath_____unprocessed_data_____MD5_checksum <- file.path(
  directory_tree_____outputs[[
    "filepath_____outputs_____reproducibility_support_____MD5_checksums"
  ]],
  "unprocessed_data_____MD5_checksum.txt"
) 

# Create and export a txt file with MD5 checksum 
# of the file with the unprocessed data.
export_md5sum(
  target_files = filepath_____unprocessed_data,
  output_file = filepath_____unprocessed_data_____MD5_checksum
)

To verify the input file with the unprocessed data, repdata_data_StormData.csv.bz2, compare the MD5 checksum in the file unprocessed_data_____MD5_checksum.txt that was exported when you reproduced the analysis with the original, which was uploaded to GitHub and can be accessed through the following link:

original unprocessed_data_____MD5_checksum.txt
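
The comparison can also be done programmatically. The following is a minimal sketch under stated assumptions: verify_md5() is a hypothetical helper that is not part of the original script, it assumes the checksum is the first 32 characters of the first line of the checksum file, and the temporary files stand in for the real data file and its checksum file.

```r
# Hypothetical helper (not part of the original script): checks whether a file
# matches the checksum recorded on the first line of a checksum file.
verify_md5 <- function(target_file, checksum_file) {
  recorded_line <- readLines(checksum_file, n = 1)
  # The checksum is the first 32 hexadecimal characters of the line.
  recorded_checksum <- substr(recorded_line, 1, 32)
  actual_checksum <- unname(tools::md5sum(target_file))
  identical(tolower(recorded_checksum), actual_checksum)
}

# Illustration with temporary files instead of the real data file.
data_file <- file.path(tempdir(), "data_example.csv")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,1"), con = data_file)
checksum_file <- file.path(tempdir(), "data_example_MD5_checksum.txt")
writeLines(
  paste0(unname(tools::md5sum(data_file)), "  ", data_file),
  con = checksum_file
)

verify_md5(data_file, checksum_file)
```

The helper returns TRUE only when the recomputed checksum equals the recorded one, so any change to the file's contents would make it return FALSE.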

back to start of this subsection
back to start of this section
back to start of this chapter
back to TABLE OF CONTENTS

11.3.3 MD5 checksum of the output file with the processed data

An R file with the table with the processed data was exported through the execution of the script.

The original file has been uploaded to GitHub and can be accessed from the following link:

table_with_the_processed_data.R

A txt file with the MD5 checksum of the output R file, table_with_the_processed_data.R, was exported at the subdirectory of the working directory:

output –> reproducibility_support –> MD5_checksums

with name:

processed_data_____MD5_checksum.txt

# Supply the filepath at which the txt file with MD5 checksum 
# of the file with the table with the processed data will be exported.
filepath_____processed_data_____MD5_checksum <- file.path(
  directory_tree_____outputs[[
    "filepath_____outputs_____reproducibility_support_____MD5_checksums"
  ]],
  "processed_data_____MD5_checksum.txt"
)

# Create and export a txt file with MD5 checksum 
# of the file with the table with the processed data.
export_md5sum(
  target_files = filepath_____processed_data,
  output_file = filepath_____processed_data_____MD5_checksum
)

To verify the table with the processed data, table_with_the_processed_data.R, compare the MD5 checksum in the file processed_data_____MD5_checksum.txt that was exported when you reproduced the analysis with the original, which was uploaded to GitHub and can be accessed through the following link:

original processed_data_____MD5_checksum.txt

back to start of this subsection
back to start of this section
back to start of this chapter
back to TABLE OF CONTENTS

11.3.4 MD5 checksum of the output files with the results

The results obtained in this analysis consist of 6 summary tables.

Three of them hold the results for the harm on population health (one for each of the three perspectives examined) and were exported as R files through the execution of the script, with names:

summary_____harm_on_population_health______fatalities.R
- for the table with the summary for harm on population health with respect to fatalities
summary_____harm_on_population_health______injuries.R
- for the table with the summary for harm on population health with respect to injuries
summary_____harm_on_population_health______casualties.R
- for the table with the summary for harm on population health with respect to casualties

The other three hold the results for the harm on economy (one for each of the three perspectives examined) and were exported as R files through the execution of the script, with names:

summary_____harm_on_economy______property_damage.R
- for the table with the summary for harm on economy with respect to property damage
summary_____harm_on_economy______crop_damage.R
- for the table with the summary for harm on economy with respect to crop damage
summary_____harm_on_economy______economic_damage.R
- for the table with the summary for harm on economy with respect to economic damage

A txt file with the MD5 checksums of all 6 output R files described above was exported at the subdirectory of the working directory:

output –> reproducibility_support –> MD5_checksums

with name:

results_____MD5_checksum.txt

# Supply the filepath at which the txt file with MD5 checksums 
# of the files with the tables with the results will be exported.
filepath_____results_____MD5_checksum <- file.path(
  directory_tree_____outputs[[
    "filepath_____outputs_____reproducibility_support_____MD5_checksums"
  ]],
  "results_____MD5_checksum.txt"
)

# Create and export a txt file with MD5 checksums 
# of the files with the tables with the results.
export_md5sum(
  target_files = c(
    filepath_____summary_____harm_on_population_health______fatalities,
    filepath_____summary_____harm_on_population_health______injuries,
    filepath_____summary_____harm_on_population_health______casualties,
    filepath_____summary_____harm_on_economy______property_damage,
    filepath_____summary_____harm_on_economy______crop_damage,
    filepath_____summary_____harm_on_economy______economic_damage
  ),
  output_file = filepath_____results_____MD5_checksum
)

To verify the results, compare the MD5 checksums in the file results_____MD5_checksum.txt that was exported when you reproduced the analysis with the original, which was uploaded to GitHub and can be accessed through the link:

original results_____MD5_checksum.txt
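
When a checksum file lists several files, the recorded paths may differ between machines (an assumption; it depends on how the paths were supplied), so comparing the files line by line can fail even when every checksum matches. The sketch below compares only the checksums, keyed by file name. read_checksums() is a hypothetical helper, and the two checksum files, their paths and their checksum values are made up for the illustration.

```r
# Hypothetical helper: extracts the checksums from a checksum file, keyed by
# the base name of each file so that machine-specific directories are ignored.
read_checksums <- function(checksum_file) {
  lines <- readLines(checksum_file)
  checksums <- substr(lines, 1, 32)
  paths <- trimws(substring(lines, 33))
  names(checksums) <- basename(paths)
  checksums[order(names(checksums))]
}

# Illustration with two made-up checksum files that list the same files
# under different directories and in a different order.
original_file <- file.path(tempdir(), "original_MD5_checksum.txt")
reproduced_file <- file.path(tempdir(), "reproduced_MD5_checksum.txt")
writeLines(c(
  "0123456789abcdef0123456789abcdef  /home/author/outputs/a.R",
  "fedcba9876543210fedcba9876543210  /home/author/outputs/b.R"
), con = original_file)
writeLines(c(
  "fedcba9876543210fedcba9876543210  /tmp/run/outputs/b.R",
  "0123456789abcdef0123456789abcdef  /tmp/run/outputs/a.R"
), con = reproduced_file)

identical(read_checksums(original_file), read_checksums(reproduced_file))
```

Sorting by file name before the comparison keeps the result independent of the order in which the files were listed.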

back to start of this subsection
back to start of this section
back to start of this chapter
back to TABLE OF CONTENTS

11.4 Random Seed

At the beginning of the analysis, the random seed was set to 1234567890 to enhance the reproducibility of the report.

If the procedure has been reproduced correctly (with respect to random events), at this point it is expected to produce a sample from the standard normal distribution with the following 5 values:

-2.2152999
0.4738228
-0.4869480
-0.5343663
1.3206245

# Create 5 random values from the standard normal distribution 
# to check the reproducibility of random events.  
expected_values_of_final_random_event <- rnorm(5)

# Display the 5 random values. 
expected_values_of_final_random_event

## [1] -0.5343663  1.3206245  1.5558662  2.6298662 -0.2373495

However, keep in mind that the only random events that took place through the execution of the script that produces this report happened at the creation of the plots:

Plot 1.1.4
Plot 1.2.4
Plot 1.3.4
Plot 2.1.4
Plot 2.2.4
Plot 2.3.4

by the function geom_label_repel(), which assigns the positions of the labels randomly.

So even if the random seed is not the same, only the labels in those plots should be in different places, while the actual results are expected to be identical.
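
The values drawn at the end of a script depend both on the seed and on how many random draws were made before that point. A minimal sketch of this behaviour in base R, using the seed stated above (the variable names are only illustrative):

```r
# The seed value is the one stated in the report.
set.seed(1234567890)
first_run <- rnorm(5)

# Resetting the seed restarts the stream, so the same values come back.
set.seed(1234567890)
second_run <- rnorm(5)
identical(first_run, second_run)

# Without resetting, the next draw continues the stream and differs.
third_run <- rnorm(5)
identical(first_run, third_run)
```

This is why a final check of this kind is sensitive to any extra random draw made earlier in the script, such as the random placement of plot labels.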

back to start of this section
back to start of this chapter
back to TABLE OF CONTENTS

12 LICENSE

The script RepRes_analysis.Rmd with the code to conduct the analysis, as well as any of the results and outputs obtained when it is executed, can be used freely for any purpose under the terms of the MIT License.

Copyright (c)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

back to start of this chapter
back to TABLE OF CONTENTS

13 REFERENCES

NATIONAL WEATHER SERVICE INSTRUCTION 10-1605, AUGUST 17, 2007. URL https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf

Storm Data FAQ Page (2008). URL https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf

The History of the Storm Events Database (2014). URL https://www.ncdc.noaa.gov/stormevents/versions.jsp

R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

RStudio Team (2019). RStudio: Integrated Development for R. RStudio, Inc., Boston, MA URL http://www.rstudio.com/.

JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone (2020). rmarkdown: Dynamic Documents for R. R package version 2.1. URL https://rmarkdown.rstudio.com.

Yihui Xie and J.J. Allaire and Garrett Grolemund (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC. ISBN 9781138359338. URL https://bookdown.org/yihui/rmarkdown.

Yihui Xie (2020). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.28.

Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963

Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595

Hao Zhu (2019). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.1.0. https://CRAN.R-project.org/package=kableExtra

Stefan Milton Bache and Hadley Wickham (2014). magrittr: A Forward-Pipe Operator for R. R package version 1.5. https://CRAN.R-project.org/package=magrittr

Yihui Xie, Joe Cheng and Xianying Tan (2020). DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.13. https://CRAN.R-project.org/package=DT

Julien Barnier (2020). rmdformats: HTML Output Formats and Templates for ‘rmarkdown’ Documents. R package version 0.3.7. https://CRAN.R-project.org/package=rmdformats

Matt Dowle and Arun Srinivasan (2019). data.table: Extension of data.frame. R package version 1.12.8. https://CRAN.R-project.org/package=data.table

van der Loo M, de Jonge E (2019). “Data Validation Infrastructure for R.” Journal of Statistical Software, Accepted for publication. <URL:https://CRAN.R-project.org/package=validate>.

Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. https://CRAN.R-project.org/package=stringr

Lukasz Komsta and Frederick Novomestky (2015). moments: Moments, cumulants, skewness, kurtosis and related tests. R package version 0.14. https://CRAN.R-project.org/package=moments

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

Kamil Slowikowski (2020). ggrepel: Automatically Position Non-Overlapping Text Labels with ‘ggplot2’. R package version 0.8.2. https://CRAN.R-project.org/package=ggrepel

Baptiste Auguie (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.3. https://CRAN.R-project.org/package=gridExtra

Yan Holtz (2018). PIMP MY RMD: GitHub Page with tips on refining an rmarkdown document. URL https://holtzy.github.io/Pimp-my-rmd/#github-link

back to start of this chapter
back to TABLE OF CONTENTS

END OF THE REPORT
