NYT bias revisited

Introduction

This project is a reboot of the report “Airline Accidents and Media Bias: New York Times 1978-1994,” which was published 14 March 1996. The summary is available on the site AirSafe.com at http://www.airsafe.com/nyt_bias.htm, and the full report for the original study is available at http://www.airsafe.com/analyze/nyt_bias.pdf. That report, as well as this revised one, was done to address a common misperception that the news media were biased with respect to what kinds of airline accidents and other kinds of fatal airline events attract attention.

The New York Times (NYT) was chosen to represent media coverage of aviation events because of its stature as a nationally important newspaper with influence over the content of newspapers and other media in the United States and abroad. The NYT refers to itself as a newspaper of record that is both an important source of current news and analysis and a chronicle of those events that are important in American and world affairs. The NYT, specifically the information contained in the New York Times Index, was chosen in part because during the time frame of this study, the NYT was a newspaper that significantly affected how major print and broadcast organizations covered issues dealing with air transportation.

The period 1978 to 1994 was also chosen because it coincided with the deregulation of the domestic US airline industry. With deregulation came a substantial increase in the number of flights and significant changes in air carriers and their business strategies. Along with these changes came an increased interest in issues related to air safety on the part of the flying public, the airline industry, and the federal government.

The original analysis supported the following conclusions:

Events either in the US or involving US carriers had a disproportionately larger share of media coverage than non-US events.
Events involving jet airliners were more likely to be reported and had more media coverage than the corresponding group of propeller aircraft events.
Fatal events were reported with greater likelihood as the magnitude of the number of fatalities increased.

Purpose of the reboot

This reboot has two purposes. The first is to update the fatal event data from the original study, taking advantage of the more comprehensive information about airline events that has become available since the original study was published, so that the analyses can be performed on a more complete set of event data.

The second is to use the statistical analysis routines built into the R software package to carry out a study that not only provides an overview of what kinds of events received a disproportionate amount of coverage, but also uses statistical models to predict two things: the likelihood that a particular accident would receive any coverage by the New York Times, and the number of articles that a particular event may generate.

The assumptions in this study were slightly different from those of the earlier study, due in part to a greater sensitivity to the role that deliberate actions like hijackings have played in driving airline safety and security related policies, regulations, and practices in the years since the events of 9/11. While the data covers the period before 9/11, it was also a period when hijackings and other deliberate actions that led to passenger deaths were a serious concern.

The key assumptions about New York Times coverage were that the following categories of fatal passenger events would have a greater likelihood of being covered, and would have a disproportionate number of articles published, compared to events that were not in that category:

Events in the US or involving aircraft registered in the US.
Events due to deliberate actions such as sabotage, hijacking, and military action.
US-related events that were due to deliberate actions.
Events involving jet airliners.

In addition to addressing whether the prior assumptions were supported by the data, two kinds of regression analyses were performed on the data. The first was a logistic regression to identify the characteristics of events that receive any New York Times coverage, and the second was a regression to identify the characteristics associated with the magnitude of that coverage.

Data

Two kinds of data were used in the study: information about airline accidents and other events that led to one or more passenger deaths during the time period covered by the study, and information about those events that was included in the New York Times Index for that same period. Airline events of interest were those that resulted in the death of at least one passenger who was not a hijacker, saboteur, or stowaway.

In addition to the New York Times data, there was a set of fatal event data covering events that led to one or more passenger deaths on airliners between 1978 and 1994. The data were in the following CSV files:

New York Times Index news coverage data - http://www.airsafe.com/analyze/accident_nyt_data.csv
Updated fatal event data - http://www.airsafe.com/analyze/fatal_event_data.csv

The fatal event data used in the original study is also available at http://www.airsafe.com/analyze/flight_intl_data.csv.csv.

A more complete description of the data used in the analysis, including a data dictionary that defines the variables used in the following analysis, is available at http://www.airsafe.com/analyze/nyt_bias-data.pdf.

Both the fatal event and the New York Times spreadsheets used a common index variable called “Accident.ID” to uniquely identify each fatal event and the coverage associated with that event.

# Updated fatal event data 
updated_events = read.csv("fatal_event_data.csv")
updated_events_stdby = updated_events # raw data in reserve

# New York Times Index data
nyt_data = read.csv("accident_nyt_data.csv")
nyt_data_stdby = nyt_data # raw data in reserve
# str(nyt_data)

Data processing for fatal event data

Both sets of data were loaded, and the data frame with the updated event data was processed in the following ways to limit the data to what was relevant to the study:

A logical vector was added to identify events that were due to deliberate action (sabotage, hijacking, and military action)
The following columns were eliminated: Tail.Number,
Dates were converted into a format readable by R
The column name for dates was changed from “Flt.Intl.Accident.Date” to “Date”

Adding logical vector for deliberate actions

Deliberate actions are fatal events due to causes such as hijacking, sabotage, military action, and other deliberate acts of terrorism or mayhem that led to the death of at least one passenger who was not a person who caused the event. The events were identified through references made on the site AirSafe.com, specifically the following online resources:

After the deliberate action events were identified, the logical vector “deliberate” was added to the data frame containing the event data from the original study, as well as the event data from the revised study.

# From the raw records file, the following Accident.ID values are associated with
# sabotage, hijacking, missile strikes, bombings, and other deliberate acts
# Deliberate 107, 509
# Hijack 81, 539, 519, 520, 246, 248, 250, 279, 280, 330, 663, 401, 529
# Bomb 49, 537, 145, 197, 243, 244, 253, 278, 283, 315, 316, 486, 487
# Shot down 45, 51, 3, 73, 143, 245, 281
# Suicide 254

# Reviewed numerous resources in aviation-safety.net and AirSafe.com to find
# events due to sabotage, hijack, military action, and other deliberate acts,
# and matched these with the raw fatal event data frame

non_accident = c(107, 509, 81, 539, 519, 520, 246, 248, 250, 279, 280, 330, 663, 401, 529, 49, 537, 145, 197, 243, 244, 253, 278, 283, 315, 316, 486, 487, 45, 51, 3, 73, 143, 245, 281, 254)

deliberate_b = (updated_events$Accident.ID %in% non_accident)
updated_events$deliberate = deliberate_b

Changing and editing the date information in the fatal events data frame

Several changes were made to the event data frame, including the following:

Renaming the date column from “Flt.Intl.Accident.Date” to “Date”
Converting dates into a format suitable for R
Eliminating non-passenger flights from the events data frame.
Adding a column called “USA.Event”, which is a logical vector identifying events that occurred on US territory, or events that involved aircraft registered in the US.
Replacing missing values coded with the number 999 with NA.
Eliminating data frame rows that were outside of the time period 1978-1994.
Eliminating any events with missing values (coded as NA).
Removing aircraft events involving unknown aircraft types, or that involved selected aircraft models that were not typically used as airliners.
Adding a logical variable called “nyc” to identify events with a connection to New York City. These include events involving airlines based in the New York area, specifically those where the “Carrier” variable contained “TWA”, “Pan Am”, or “Chautauqua”, and events where the “Location” variable contained “New York”, “JFK”, “LGA”, or “Newark”.
Adding a logical variable called “covered” that is TRUE if a fatal event was associated with at least one NYT article.
Adding a column to the fatal event data frame that has the number of articles in the New York Times Index data frame that are associated with that fatal event.

# Rename date column
colnames(updated_events)[colnames(updated_events)=="Flt.Intl.Accident.Date"] <- "Date" 
# str(updated_events)

# Convert dates
updated_events$Date=as.Date(as.character(updated_events$Date),"%m/%d/%Y")
# str(updated_events)

# Restrict events to passenger flights
updated_events = subset(updated_events,Passenger.Flight==TRUE)

# Add logical vector for event in the USA or if it involved a US-registered airline
updated_events$USA.Event = updated_events$Country.of.Accident=="USA"| updated_events$Country.of.Registration=="USA" 

# The logical vector for jet transport is updated_events$Jet.Transport. A review shows that if an event is coded as jet,
# the value of the 'Airplane' variable begins with one of c('7', 'A3', 'BAC', 'BAe 1', 'BAe1', 'Cara','Citation', 'DC10', 'DC8', 'DC9', 'F100', 'F28', 'Gulf', 'IL62','L1011', 'Lear', 'MD', 'Trident', 'Tu1', 'VC10', 'Yak')


# Replace missing values coded with number 999 with NA
updated_events[updated_events==999] = NA


# Restrict events to period 1978-1994
updated_events=subset(updated_events, Date >= as.Date("1978-01-01") & Date < as.Date("1995-01-01"))


# Restrict to complete cases
updated_events = updated_events[complete.cases(updated_events),]

# Remove single aircraft events involving unknown aircraft types, or selected aircraft models.
# Removed types: "UNK", "MU2B-60", "Learjet 25D", "DC3", "Beech Bonanza", "Piper Navajo", and "PA31"

updated_events = subset(updated_events, Aircraft !="DC3" & Aircraft != "UNK" & Aircraft !="Beech Bonanza" & Aircraft != "Learjet 25D"  & Aircraft != "MU2B-60" & Aircraft != "Piper Navajo" & Aircraft != "PA31")

nyc_airline_b  = grepl("pan am|twa|Chautauqua",updated_events$Carrier,ignore.case=TRUE )
nyc_location_b = grepl("LGA|New York|Newark",updated_events$Location,ignore.case=TRUE )
nyc_b = nyc_airline_b|nyc_location_b
updated_events$nyc = nyc_b
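
As a small illustration of how the pattern matching above behaves, the following sketch applies the same two grepl() calls to a made-up data frame. The carrier and location strings are hypothetical and are not taken from the study data. Note that grepl() matches substrings, so any carrier or location string that merely contains one of the patterns would also be flagged.

```r
# Hypothetical toy rows; carrier and location strings are made up for illustration
toy <- data.frame(
  Carrier  = c("Pan Am", "Aeroflot", "TWA", "Delta"),
  Location = c("Lockerbie, Scotland", "Moscow", "St. Louis", "New York, LGA"),
  stringsAsFactors = FALSE
)

# Same patterns as in the analysis; ignore.case makes "pan am" match "Pan Am"
airline_b  <- grepl("pan am|twa|Chautauqua", toy$Carrier, ignore.case = TRUE)
location_b <- grepl("LGA|New York|Newark", toy$Location, ignore.case = TRUE)

# An event is flagged if either the carrier or the location matches
toy$nyc <- airline_b | location_b
toy
```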


# Logical vector added to the updated_events data frame to indicate which events had both
# usable data and were also the subject of at least one New York Times article

covered_b = updated_events$Accident.ID %in% nyt_data$Accident.ID
updated_events$covered = covered_b

# Add a column to the updated events data frame for the associated article counts from nyt_data
updated_events$Articles=0

# Will input nyt_data values for "Articles"
for (n in 1:nrow(updated_events)){
      if (covered_b[n]==TRUE){
              updated_events$Articles[n] = nyt_data[nyt_data$Accident.ID==updated_events$Accident.ID[n],"Articles"] 
              }
}
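
The loop above could also be written without explicit iteration. The sketch below is one possible vectorized version using match(), shown on two made-up data frames that stand in for updated_events and nyt_data. It assumes each Accident.ID appears at most once in the coverage data, which is an assumption and not something verified here.

```r
# Hypothetical toy data frames standing in for updated_events and nyt_data
events <- data.frame(Accident.ID = c(3, 45, 81, 107))
nyt    <- data.frame(Accident.ID = c(81, 3), Articles = c(12, 2))

# match() gives the row of nyt for each event, or NA when there is no coverage
row_in_nyt <- match(events$Accident.ID, nyt$Accident.ID)

# Covered events are those with a matching row; uncovered events get 0 articles
events$covered  <- !is.na(row_in_nyt)
events$Articles <- ifelse(events$covered, nyt$Articles[row_in_nyt], 0)
events
```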

PRIOR ASSUMPTIONS ABOUT THE DATA

The prior assumptions about New York Times coverage were confirmed by a review of the updated fatal event data.

# Events in the US or involving airlines registered in the US.

paste("US events (occurring in the US or on US-registered aircraft) - ", format(100*sum(updated_events$USA.Event)/nrow(updated_events), digits=3),"% of all events, ", format(100*sum(updated_events$USA.Event & updated_events$covered)/sum(updated_events$covered), digits=3), "% of all covered events, and ", format(100*sum(updated_events[updated_events$USA.Event & updated_events$covered,"Articles"])/sum(updated_events$Articles), digits=3), "% of all articles.", sep="")

## [1] "US events (occurring in the US or on US-registered aircraft) - 19.1% of all events, 28% of all covered events, and 57.6% of all articles."

paste("- Events due to deliberate actions such as sabotage, hijacking, and military action - ", format(100*sum(updated_events$deliberate)/nrow(updated_events), digits=2),"% of all events, ", format(100*sum(updated_events$deliberate & updated_events$covered)/sum(updated_events$covered), digits=3), "% of all covered events, and ", format(100*sum(updated_events[updated_events$deliberate & updated_events$covered,"Articles"])/sum(updated_events$Articles), digits=3), "% of all articles.", sep="")

## [1] "- Events due to deliberate actions such as sabotage, hijacking, and military action - 7.4% of all events, 10.7% of all covered events, and 50.1% of all articles."

# High interest event - USA and deliberate action
paste("- US-related events that were due to deliberate actions - ", format(100*sum(updated_events$USA.Event & updated_events$deliberate)/nrow(updated_events), digits=2),"% of all events, ", format(100*sum(updated_events[updated_events$USA.Event & updated_events$deliberate,"covered"])/sum(updated_events$covered), digits=2), "% of all covered events, and ", format(100*sum(updated_events[updated_events$USA.Event & updated_events$deliberate,"Articles"])/sum(updated_events$Articles), digits=3), "% of all articles.", sep="")

## [1] "- US-related events that were due to deliberate actions - 1.2% of all events, 2.1% of all covered events, and 22.6% of all articles."

# Jet transport events
# Note: the numerator of the first percentage below is restricted to covered jet events
paste("- Events involving jet airliners - ", format(100*sum(updated_events[updated_events$Jet.Transport,"covered"])/nrow(updated_events), digits=3),"% of all events, ", format(100*sum(updated_events[updated_events$Jet.Transport,"covered"])/sum(updated_events$covered), digits=3), "% of all covered events, and ", format(100*sum(updated_events[updated_events$Jet.Transport,"Articles"])/sum(updated_events$Articles), digits=3), "% of all articles.", sep="")

## [1] "- Events involving jet airliners - 37.9% of all events, 63.7% of all covered events, and 90.8% of all articles."
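
Each of the four summaries above reports the same three shares for a category of events. A small helper function is one way to compute them consistently. The sketch below is illustrative only: the function name coverage_shares and the toy vectors are hypothetical, and the first share is computed from all flagged events, whether covered or not.

```r
# Share of all events, of covered events, and of all articles for a logical flag
coverage_shares <- function(flag, covered, articles) {
  c(events   = 100 * sum(flag) / length(flag),
    covered  = 100 * sum(flag & covered) / sum(covered),
    articles = 100 * sum(articles[flag]) / sum(articles))
}

# Hypothetical toy data: 5 events, the first 2 belong to the category of interest
flag     <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
articles <- c(6, 0, 3, 1, 0)
covered  <- articles > 0

shares <- coverage_shares(flag, covered, articles)
round(shares, 1)
```

With the study data, a call such as coverage_shares(updated_events$USA.Event, updated_events$covered, updated_events$Articles) would follow the same pattern.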

EXPLORATORY DATA ANALYSIS

The initial review of the data showed that the distribution of articles was highly skewed, with the majority of events having one or fewer articles. This is shown by the table of values, the statistical summary of the article distribution, the histogram of the “Articles” variable, and the histogram of the log of that variable.

# Article distribution
paste("There were a total of ", nrow(updated_events), " events in the updated data frame for the period 1978-1994 that also had sufficient information available for use in this analysis, of which ", sum(updated_events$covered), " were the subject of at least one New York Times article from that period. ", "There were a total of ", sum(updated_events$Articles), " articles written about these events.", sep="")

## [1] "There were a total of 486 events in the updated data frame for the period 1978-1994 that also had sufficient information available for use in this analysis, of which 289 were the subject of at least one New York Times article from that period. There were a total of 2730 articles written about these events."

paste("Table showing the distribution of how many events had a particular number of articles written")

## [1] "Table showing the distribution of how many events had a particular number of articles written"

table(updated_events$Articles)

## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17 
## 197 145  38  19  16  10   4   3   6   3   4   2   1   1   2   1   3   1 
##  18  19  20  21  25  26  27  29  34  35  38  41  43  48  49  50  57  60 
##   2   1   1   2   2   1   1   2   1   1   2   1   1   1   2   1   1   1 
##  97 105 110 161 359 371 
##   1   1   1   1   1   1

paste("Statistical summary of the distribution of articles written")

## [1] "Statistical summary of the distribution of articles written"

summary(updated_events$Articles)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   5.617   2.000 371.000

hist(updated_events$Articles)

# log(0) is -Inf and hist() drops non-finite values, so events with zero articles are omitted here
hist(log(updated_events$Articles))
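
Because many events have zero articles, the plain log is not finite for a large part of the data. One common alternative, shown here only as a sketch and not as part of the original analysis, is to plot log(1 + x) using log1p(), which keeps the zero counts in the histogram. The article counts below are hypothetical.

```r
# Hypothetical article counts with a similar skew (many zeros, a few large values)
articles <- c(0, 0, 0, 1, 1, 2, 5, 40, 371)

log_plain <- log(articles)     # -Inf for the zero counts
log_plus1 <- log1p(articles)   # log(1 + x), which is 0 for the zero counts

# Number of values that a histogram can actually display in each case
sum(is.finite(log_plain))
sum(is.finite(log_plus1))

hist(log_plus1, main = "log(1 + Articles)", xlab = "log(1 + Articles)")
```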

REGRESSION ANALYSIS OF NEW YORK TIMES COVERAGE

Two regression models were created to identify the independent variables that could be used to predict two things: the likelihood that a particular fatal event would result in one or more New York Times articles, and the number of articles associated with a fatal event.

Both regressions used a randomly selected portion of the fatal events to train the model, and tested the model against the remainder of the data frame. This was done twice, once for the fatal events data that was used in the original study, and also on the updated set of event data.

For the first regression, the variable “covered”, which was the logical vector indicating if an event received coverage, was the dependent variable, and the following twelve variables were the independent variables: “Fatal.Crew”, “Fatal.Pax”, “Fatal.Other”, “Total.Fatal”, “Total.Crew”, “Total.Pax”, “Total.On.Board”, “Jet.Transport”, “Scheduled”, “USA.Event”, “nyc”, and “deliberate”.

As a first step, a correlation table was used to find pairs of independent variables that were highly correlated, specifically those where the magnitude of the correlation was greater than 0.7.

abs(cor(updated_events[,c("Fatal.Crew", "Fatal.Pax", "Fatal.Other", "Total.Fatal", "Total.Crew", "Total.Pax", "Total.On.Board", "Jet.Transport" , "Scheduled", "USA.Event", "nyc", "deliberate")]))>0.7

##                Fatal.Crew Fatal.Pax Fatal.Other Total.Fatal Total.Crew
## Fatal.Crew           TRUE      TRUE       FALSE        TRUE      FALSE
## Fatal.Pax            TRUE      TRUE       FALSE        TRUE      FALSE
## Fatal.Other         FALSE     FALSE        TRUE       FALSE      FALSE
## Total.Fatal          TRUE      TRUE       FALSE        TRUE      FALSE
## Total.Crew          FALSE     FALSE       FALSE       FALSE       TRUE
## Total.Pax           FALSE     FALSE       FALSE       FALSE       TRUE
## Total.On.Board      FALSE     FALSE       FALSE       FALSE       TRUE
## Jet.Transport       FALSE     FALSE       FALSE       FALSE      FALSE
## Scheduled           FALSE     FALSE       FALSE       FALSE      FALSE
## USA.Event           FALSE     FALSE       FALSE       FALSE      FALSE
## nyc                 FALSE     FALSE       FALSE       FALSE      FALSE
## deliberate          FALSE     FALSE       FALSE       FALSE      FALSE
##                Total.Pax Total.On.Board Jet.Transport Scheduled USA.Event
## Fatal.Crew         FALSE          FALSE         FALSE     FALSE     FALSE
## Fatal.Pax          FALSE          FALSE         FALSE     FALSE     FALSE
## Fatal.Other        FALSE          FALSE         FALSE     FALSE     FALSE
## Total.Fatal        FALSE          FALSE         FALSE     FALSE     FALSE
## Total.Crew          TRUE           TRUE         FALSE     FALSE     FALSE
## Total.Pax           TRUE           TRUE         FALSE     FALSE     FALSE
## Total.On.Board      TRUE           TRUE         FALSE     FALSE     FALSE
## Jet.Transport      FALSE          FALSE          TRUE     FALSE     FALSE
## Scheduled          FALSE          FALSE         FALSE      TRUE     FALSE
## USA.Event          FALSE          FALSE         FALSE     FALSE      TRUE
## nyc                FALSE          FALSE         FALSE     FALSE     FALSE
## deliberate         FALSE          FALSE         FALSE     FALSE     FALSE
##                  nyc deliberate
## Fatal.Crew     FALSE      FALSE
## Fatal.Pax      FALSE      FALSE
## Fatal.Other    FALSE      FALSE
## Total.Fatal    FALSE      FALSE
## Total.Crew     FALSE      FALSE
## Total.Pax      FALSE      FALSE
## Total.On.Board FALSE      FALSE
## Jet.Transport  FALSE      FALSE
## Scheduled      FALSE      FALSE
## USA.Event      FALSE      FALSE
## nyc             TRUE      FALSE
## deliberate     FALSE       TRUE

The focus of the study was coverage of events dealing with passenger fatalities, so variables not directly related to passengers were eliminated if they were highly correlated with passenger-related variables, or if the variable was a linear combination of other variables. The following combinations were highly correlated:

Fatal.Crew - Fatal.Pax, Total.Fatal
Total.Fatal - Fatal.Pax
Total.Crew - Total.Pax
Total.On.Board - Total.Crew, Total.Pax
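
Reading highly correlated pairs off a large TRUE/FALSE table is error-prone, so it can help to list them directly. The following is a minimal sketch on made-up data, where the variable names a, b, and c are hypothetical. It keeps only the upper triangle of the correlation matrix so that each pair is reported once and the diagonal is ignored.

```r
# Hypothetical toy data: b is nearly a copy of a, and c is independent noise
set.seed(1)
a <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.1), c = rnorm(100))

cm <- abs(cor(toy))

# Upper triangle only: each pair appears once and self-correlations are skipped
high <- which(cm > 0.7 & upper.tri(cm), arr.ind = TRUE)

pairs <- data.frame(var1    = rownames(cm)[high[, "row"]],
                    var2    = colnames(cm)[high[, "col"]],
                    abs_cor = cm[high])
pairs
```

The same approach could be applied to the twelve-variable correlation matrix used in this analysis.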

Running a second correlation with four variables removed (Fatal.Crew, Total.Fatal, Total.Crew, and Total.On.Board) showed that there were no highly correlated variables remaining.

abs(cor(updated_events[,c( "Fatal.Pax", "Fatal.Other",   "Total.Pax",  "Jet.Transport" , "Scheduled", "USA.Event", "nyc", "deliberate")]))>0.7

##               Fatal.Pax Fatal.Other Total.Pax Jet.Transport Scheduled
## Fatal.Pax          TRUE       FALSE     FALSE         FALSE     FALSE
## Fatal.Other       FALSE        TRUE     FALSE         FALSE     FALSE
## Total.Pax         FALSE       FALSE      TRUE         FALSE     FALSE
## Jet.Transport     FALSE       FALSE     FALSE          TRUE     FALSE
## Scheduled         FALSE       FALSE     FALSE         FALSE      TRUE
## USA.Event         FALSE       FALSE     FALSE         FALSE     FALSE
## nyc               FALSE       FALSE     FALSE         FALSE     FALSE
## deliberate        FALSE       FALSE     FALSE         FALSE     FALSE
##               USA.Event   nyc deliberate
## Fatal.Pax         FALSE FALSE      FALSE
## Fatal.Other       FALSE FALSE      FALSE
## Total.Pax         FALSE FALSE      FALSE
## Jet.Transport     FALSE FALSE      FALSE
## Scheduled         FALSE FALSE      FALSE
## USA.Event          TRUE FALSE      FALSE
## nyc               FALSE  TRUE      FALSE
## deliberate        FALSE FALSE       TRUE

PREDICTING NEW YORK TIMES COVERAGE USING LOGISTIC REGRESSION

A logistic regression model was used to predict the probability that a particular event would generate one or more articles. The model was trained on 75% of the fatal event data and tested on the remaining 25%. The accuracy of the model was compared against the accuracy of guessing that all of the events would either get coverage or not get coverage, based on the likelihood that events in the training set received coverage. For example, if 55% of the events in the training set were covered, the model would have useful predictive value if it was accurate more than 55% of the time on the test set.
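
The baseline described above is the accuracy obtained by always guessing the more common outcome. As a minimal sketch, it can be computed from a logical vector of outcomes. The function name baseline_accuracy and the toy vector are hypothetical, and the toy vector mirrors the 55% example in the text.

```r
# Accuracy of always guessing the more common outcome in a logical vector
baseline_accuracy <- function(covered) {
  p <- mean(covered)   # proportion of events that were covered
  max(p, 1 - p)        # guess "covered" if p > 0.5, otherwise "not covered"
}

# Hypothetical training outcomes: 55 of 100 events covered
covered_train <- rep(c(TRUE, FALSE), times = c(55, 45))
baseline_accuracy(covered_train)
```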

The first step is to split the data, and to put the training set into the following generalized linear model:

glm(covered~Fatal.Pax+Fatal.Other+Total.Pax+Jet.Transport+Scheduled+deliberate+USA.Event+nyc, data=updated_events, family="binomial")

# Install caret package for creating data splits within groups of the data and for the confusionMatrix function
install.packages("caret", repos="http://cran.rstudio.com/")

library(caret)

## Warning: package 'caret' was built under R version 3.1.3

## Loading required package: lattice
## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.1.3

# Install and load caTools package for splitting data
install.packages("caTools", repos="http://cran.rstudio.com/")

library(caTools)

# Randomly split data
set.seed(1)
split = sample.split(updated_events$covered, SplitRatio = 0.75)
train = subset(updated_events, split == TRUE)
test = subset(updated_events, split == FALSE)
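
The split above is done on the outcome variable so that the training and test sets have about the same proportion of covered events. The sketch below shows the general idea of such a stratified 75/25 split in base R on a made-up outcome vector. It is a stand-in for illustration and is not the caTools implementation.

```r
# Hypothetical outcome vector: 60 covered events and 40 uncovered events
set.seed(1)
covered <- rep(c(TRUE, FALSE), times = c(60, 40))

# Sample 75% of the row indices within each outcome class
idx_by_class <- split(seq_along(covered), covered)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(0.75 * length(i)))
}))
in_train <- seq_along(covered) %in% train_idx

# The proportion of covered events is the same in both parts
mean(covered[in_train])
mean(covered[!in_train])
```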

# Initial iteration of the regression model on the training set

model_a = glm(covered~Fatal.Pax+Fatal.Other+Total.Pax+Jet.Transport+Scheduled+deliberate+USA.Event+nyc, data=train, family="binomial")
summary(model_a)

## 
## Call:
## glm(formula = covered ~ Fatal.Pax + Fatal.Other + Total.Pax + 
##     Jet.Transport + Scheduled + deliberate + USA.Event + nyc, 
##     family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0824  -0.7771   0.2643   0.7152   1.7442  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -0.775116   0.442486  -1.752  0.07982 .  
## Fatal.Pax          0.024175   0.005857   4.128 3.66e-05 ***
## Fatal.Other       -0.031693   0.022617  -1.401  0.16112    
## Total.Pax          0.002410   0.003222   0.748  0.45435    
## Jet.TransportTRUE  0.955847   0.351311   2.721  0.00651 ** 
## ScheduledTRUE     -0.516427   0.442115  -1.168  0.24277    
## deliberateTRUE     1.378151   0.583122   2.363  0.01811 *  
## USA.EventTRUE      2.645156   0.445504   5.937 2.90e-09 ***
## nycTRUE           -0.567923   1.189082  -0.478  0.63292    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 492.87  on 364  degrees of freedom
## Residual deviance: 358.47  on 356  degrees of freedom
## AIC: 376.47
## 
## Number of Fisher Scoring iterations: 6

The regression model was run on the training set several times, each time eliminating the least explanatory variable, until all the remaining variable coefficients were significant at the 0.05 level.

The variables were removed in the following sequence: “Scheduled”, “nyc”, “Total.Pax”, and “Fatal.Other”. This resulted in the following model using the independent variables “Fatal.Pax”, “Jet.Transport”, “deliberate”, and “USA.Event”:

model_b = glm(covered~Fatal.Pax+Jet.Transport+deliberate+USA.Event, data=train, family="binomial")
summary(model_b)

## 
## Call:
## glm(formula = covered ~ Fatal.Pax + Jet.Transport + deliberate + 
##     USA.Event, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0340  -0.7967   0.2714   0.7309   1.7028  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -1.182443   0.199659  -5.922 3.17e-09 ***
## Fatal.Pax          0.024701   0.005633   4.385 1.16e-05 ***
## Jet.TransportTRUE  1.056924   0.279155   3.786 0.000153 ***
## deliberateTRUE     1.249575   0.543160   2.301 0.021416 *  
## USA.EventTRUE      2.537494   0.416908   6.086 1.15e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 492.87  on 364  degrees of freedom
## Residual deviance: 362.68  on 360  degrees of freedom
## AIC: 372.68
## 
## Number of Fisher Scoring iterations: 6
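
The step-by-step elimination described above was done by refitting the model manually. A loop along the following lines could automate a similar procedure. This is a sketch under stated assumptions: it uses simulated data with hypothetical variables x1, x2, and x3, and it ranks terms with the likelihood-ratio tests from drop1() instead of the z-test p-values shown in the summary() output, so it may not always remove variables in the same order.

```r
# Hypothetical simulated data: y depends on x1 only; x2 and x3 are noise
set.seed(1)
n <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rbinom(n, 1, 0.5) == 1)
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * sim$x1)) == 1

model <- glm(y ~ x1 + x2 + x3, data = sim, family = "binomial")

repeat {
  # Stop if no predictors are left
  if (length(attr(terms(model), "term.labels")) == 0) break
  tests <- drop1(model, test = "Chisq")   # likelihood-ratio test for each term
  pvals <- tests[-1, "Pr(>Chi)"]          # the first row is "<none>"
  names(pvals) <- rownames(tests)[-1]
  # Stop once every remaining term is significant at the 0.05 level
  if (max(pvals) <= 0.05) break
  # Otherwise drop the least significant term and refit
  worst <- names(which.max(pvals))
  model <- update(model, as.formula(paste(". ~ . -", worst)))
}

formula(model)
```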

The model was then run against the training data to see how accurately it could predict whether an event received coverage.

first_base = predict(model_b, newdata=train, type="response")
predict_base_results = confusionMatrix(factor(first_base>0.5, levels=c(FALSE,TRUE)), factor(train$covered, levels=c(FALSE,TRUE))) # Accuracy = 80.3%
predict_base_results

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE   120   44
##      TRUE     28  173
##                                           
##                Accuracy : 0.8027          
##                  95% CI : (0.7582, 0.8423)
##     No Information Rate : 0.5945          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5978          
##  Mcnemar's Test P-Value : 0.0771          
##                                           
##             Sensitivity : 0.8108          
##             Specificity : 0.7972          
##          Pos Pred Value : 0.7317          
##          Neg Pred Value : 0.8607          
##              Prevalence : 0.4055          
##          Detection Rate : 0.3288          
##    Detection Prevalence : 0.4493          
##       Balanced Accuracy : 0.8040          
##                                           
##        'Positive' Class : FALSE           
##

paste("The accuracy of this model against the training set was ", format(100*unlist(predict_base_results[[3]])[1], digits=3),"% compared to the baseline guessing rate of ", format(100*unlist(predict_base_results[[3]])[5], digits=3), "%.", sep="")

## [1] "The accuracy of this model against the training set was 80.3% compared to the baseline guessing rate of 59.5%."

The process was repeated for the test set of data to see how the model worked with a ‘new’ set of fatal events.

second_base = predict(model_b, newdata=test,type="response")
predict_base_resultsb = confusionMatrix(factor(second_base>0.5, levels=c(FALSE,TRUE)), factor(test$covered, levels=c(FALSE,TRUE))) # Accuracy = 76.0%
predict_base_resultsb

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE    38   18
##      TRUE     11   54
##                                           
##                Accuracy : 0.7603          
##                  95% CI : (0.6743, 0.8332)
##     No Information Rate : 0.595           
##     P-Value [Acc > NIR] : 9.868e-05       
##                                           
##                   Kappa : 0.5138          
##  Mcnemar's Test P-Value : 0.2652          
##                                           
##             Sensitivity : 0.7755          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.6786          
##          Neg Pred Value : 0.8308          
##              Prevalence : 0.4050          
##          Detection Rate : 0.3140          
##    Detection Prevalence : 0.4628          
##       Balanced Accuracy : 0.7628          
##                                           
##        'Positive' Class : FALSE
##

paste("The accuracy of this model against the test set was ", format(100*unlist(predict_base_resultsb[[3]])[1], digits=3),"% compared to the baseline guessing rate of ", format(100*unlist(predict_base_resultsb[[3]])[5], digits=3), "%.", sep="")

## [1] "The accuracy of this model against the test set was 76% compared to the baseline guessing rate of 59.5%."

The results for the test data were similar to those for the training data. Accuracy on the test data was somewhat lower (76.0% versus 80.3%), but in both cases the model did better than the 59.5% baseline of guessing the most likely outcome (coverage by the New York Times) for every fatal event.

PREDICTING THE NUMBER OF NEW YORK TIMES ARTICLES WITH A POISSON REGRESSION MODEL

As was shown earlier, the number of articles is highly skewed, with over three quarters of the events getting two or fewer articles, and a handful getting dozens or hundreds of articles. A generalized linear model from the Poisson family was used to create a model to predict the number of articles that a particular event may generate.

One can use the glm() function with the Poisson family to fit a model to the article counts in the study data. Because the Poisson family uses a log link by default, glm() models the log of the expected count, so exp(glm estimate) gives the expected number of articles given the characteristics of the fatal event.
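
The relationship between the log-scale estimate and the expected count can be illustrated with a small sketch. The data below are simulated and the names (x, y, fit) are hypothetical; none of this comes from the study data.

```r
# Simulated data for illustration only (not the study data).
set.seed(1)
n <- 200
x <- rbinom(n, 1, 0.5)                      # hypothetical binary predictor
y <- rpois(n, lambda = exp(0.2 + 1.1 * x))  # hypothetical article counts

fit <- glm(y ~ x, family = "poisson")

# type = "link" returns the linear predictor, which is on the log scale.
log_scale <- predict(fit, type = "link")

# type = "response" returns the expected count.
count_scale <- predict(fit, type = "response")

# Exponentiating the log-scale estimate recovers the expected count.
same_scale <- isTRUE(all.equal(exp(log_scale), count_scale))
same_scale
```

In practice, predict(..., type = "response") saves the explicit call to exp().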

Starting with the same eight independent variables identified earlier, the model was rerun, dropping non-significant variables, until the remaining coefficients were all significant at the 0.05 level. The resulting model was:

glm(Articles~Fatal.Pax+Fatal.Other+Total.Pax+Jet.Transport+Scheduled+deliberate+USA.Event, data=updated_events, family="poisson")

Since each group was of size one (number of events for each row of data), there was no offset in this model.
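
To see why no offset is needed when every row represents a single event, note that an offset enters the model as log(exposure), and log(1) is zero. A minimal sketch on simulated data follows; the names exposure, fit_plain, and fit_offset are hypothetical.

```r
# Simulated data for illustration only (not the study data).
set.seed(2)
n <- 100
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))

# One event per row, so the exposure is 1 everywhere.
exposure <- rep(1, n)

fit_plain  <- glm(y ~ x, family = "poisson")
fit_offset <- glm(y ~ x + offset(log(exposure)), family = "poisson")

# log(1) = 0, so the offset adds nothing and the coefficients agree.
same_coefs <- isTRUE(all.equal(coef(fit_plain), coef(fit_offset)))
same_coefs
```

An offset would matter only if rows aggregated differing numbers of events.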

In the first iteration, only the “nyc” variable was not significant at the 0.05 level, and the second iteration produced the following model:

# Model number of articles with Poisson distribution

poisson1 = glm(Articles~Fatal.Pax+Fatal.Other+Total.Pax+Jet.Transport+Scheduled+deliberate+USA.Event, data=updated_events, family="poisson")
summary(poisson1)

## 
## Call:
## glm(formula = Articles ~ Fatal.Pax + Fatal.Other + Total.Pax + 
##     Jet.Transport + Scheduled + deliberate + USA.Event, family = "poisson", 
##     data = updated_events)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -13.3329   -1.2544   -0.8128    0.4336   22.5594  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -1.5087046  0.1115741 -13.522  < 2e-16 ***
## Fatal.Pax          0.0094750  0.0002728  34.732  < 2e-16 ***
## Fatal.Other       -0.0784692  0.0053332 -14.713  < 2e-16 ***
## Total.Pax         -0.0009282  0.0002758  -3.365 0.000766 ***
## Jet.TransportTRUE  1.3031353  0.0737573  17.668  < 2e-16 ***
## ScheduledTRUE      0.3885384  0.0891167   4.360  1.3e-05 ***
## deliberateTRUE     2.1789960  0.0442886  49.200  < 2e-16 ***
## USA.EventTRUE      2.2337461  0.0476698  46.859  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 11691.5  on 485  degrees of freedom
## Residual deviance:  3049.5  on 478  degrees of freedom
## AIC: 3895.7
## 
## Number of Fisher Scoring iterations: 6

This model implies that among the four categorical variables, “deliberate”, “Jet.Transport”, and “USA.Event” have the greatest influence on increasing the number of articles. When it comes to fatalities, more passenger fatalities are associated with a higher number of articles, while the number of surviving passengers, as well as the number of fatalities among people who were neither passengers nor crew members, is associated with fewer articles.
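
Because the model is on the log scale, exponentiating a coefficient gives the multiplicative change in the expected number of articles for a one-unit change in that variable, holding the others fixed. The sketch below applies this to the estimates in the coefficient table above. The example event at the end is invented purely for illustration and is not a case from the study.

```r
# Estimates copied from the coefficient table above.
est <- c("(Intercept)"     = -1.5087046,
         Fatal.Pax         =  0.0094750,
         Fatal.Other       = -0.0784692,
         Total.Pax         = -0.0009282,
         Jet.TransportTRUE =  1.3031353,
         ScheduledTRUE     =  0.3885384,
         deliberateTRUE    =  2.1789960,
         USA.EventTRUE     =  2.2337461)

# Multiplicative effect on the expected article count per unit change.
rate_ratio <- exp(est[-1])
round(rate_ratio, 3)

# Hypothetical event (values invented for illustration): a scheduled
# US jet flight, not deliberate, 120 passengers aboard, 100 passenger
# fatalities, and no other fatalities.
event <- c("(Intercept)" = 1, Fatal.Pax = 100, Fatal.Other = 0,
           Total.Pax = 120, Jet.TransportTRUE = 1, ScheduledTRUE = 1,
           deliberateTRUE = 0, USA.EventTRUE = 1)

# Expected articles = exp(sum of coefficient * value).
expected_articles <- exp(sum(est * event[names(est)]))
expected_articles
```

Ratios above 1 (for example, “deliberate” and “USA.Event”) multiply the expected count upward, while ratios below 1 (“Fatal.Other” and “Total.Pax”) shrink it.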

Discussion

While this study focused on the period 1978-1994, it can be extended to prior and subsequent periods. It may, however, be difficult to use this model for periods prior to the 1970s, when hijackings and other deliberate actions became a regular policy and media concern, or for the period after 9/11, when the media and public policy focus on aviation security went through a significant and sustained change.