---
title: "New York Times coverage bias revisited"
author: "Todd Curtis"
date: "July 26, 2015"
output: html_document
---
##Introduction##

This project is a reboot of the report "Airline Accidents and Media Bias: New York Times 1978-1994," which was published 14 March 1996. The summary is available on the site AirSafe.com at [http://www.airsafe.com/nyt_bias.htm](http://www.airsafe.com/nyt_bias.htm), and the full report for the original study is available at [http://www.airsafe.com/analyze/nyt_bias.pdf](http://www.airsafe.com/analyze/nyt_bias.pdf). That report, as well as this revised one, was done to address a common misperception that the news media were biased with respect to what kinds of airline accidents and other kinds of fatal airline events attract attention.

The New York Times (NYT) was chosen to represent media coverage of aviation events because of its stature as a nationally important newspaper with influence over the content of newspapers and other media in the United States and abroad. The NYT refers to itself as a newspaper of record that is both an important source of current news and analysis and a chronicle of those events that are important in American and world affairs. The NYT, specifically the information contained in the New York Times Index, was chosen in part because, during the time frame of this study, it was a newspaper that significantly affected how major print and broadcast organizations covered issues dealing with air transportation.

The period 1978 to 1994 was also chosen because it coincided with the deregulation of the domestic US airline industry. With deregulation came a substantial increase in the number of flights and significant changes in air carriers and their business strategies. Along with these changes came an increased interest in issues related to air safety on the part of the flying public, the airline industry, and the federal government.

The original analysis supported the following conclusions: 

- Events either in the US or involving US carriers had a disproportionately larger share of media coverage than non-US events.

- Events involving jet airliners were more likely to be reported and had more media coverage than the corresponding group of propeller aircraft events.

- Fatal events were reported with greater likelihood as the magnitude of the number of fatalities increased.

##Purpose of the reboot##

This reboot has two purposes. The first was to update the fatal event data from the original study, taking advantage of the more comprehensive information about airline events that has become available since the original study was published, so that the analyses could be performed on a more complete set of event data.

The second was to use the statistical analysis routines built into the R software package to go beyond an overview of what kinds of events received a disproportionate amount of coverage, and to use statistical models to predict two things: the likelihood that a particular event would receive any coverage by the New York Times, and the number of articles that a particular event may generate.

The assumptions in this study were slightly different from those of the earlier study, due in part to a greater sensitivity to the role that deliberate actions like hijackings have played in driving airline safety and security policies, regulations, and practices in the years since the events of 9/11. While the data cover the period before 9/11, that was also a period when hijackings and other deliberate actions that led to passenger deaths were a serious concern.

The key assumptions about New York Times coverage were that the following categories of fatal passenger events would have a greater likelihood of being covered, and would have a disproportionate number of articles published compared to events that were not in that category:

- Events in the US or involving aircraft registered in the US.

- Events due to deliberate actions such as sabotage, hijacking, and military action.

- US-related events that were due to deliberate actions.

- Events involving jet airliners.

In addition to addressing whether the prior assumptions were supported by the data, two kinds of regression analyses were performed on the data. The first was a logistic regression to identify the characteristics of events that received any New York Times coverage, and the second was a regression to identify the characteristics associated with the magnitude of that coverage.

##Data##

There were two kinds of data used in the study: information about airline accidents and other events that led to one or more passenger deaths during the time period covered by the study, and information about those events that was included in the New York Times Index for that same period. Airline events of interest were those that resulted in the death of at least one passenger who was not a hijacker, saboteur, or stowaway.

In addition to the New York Times data, there was a set of fatal event data from events that led to one or more passenger deaths on airliners between 1978 and 1994. The data was in the following CSV files:

- New York Times Index news coverage data - [http://www.airsafe.com/analyze/accident_nyt_data.csv](http://www.airsafe.com/analyze/accident_nyt_data.csv)

- Updated fatal event data - [http://www.airsafe.com/analyze/fatal_event_data.csv](http://www.airsafe.com/analyze/fatal_event_data.csv)

The fatal event data used in the original study is also available at [http://www.airsafe.com/analyze/flight_intl_data.csv.csv](http://www.airsafe.com/analyze/flight_intl_data.csv.csv).

A more complete description of the data used in the analysis, including a data dictionary that defined the variables used in the following analysis, is available at [http://www.airsafe.com/analyze/nyt_bias-data.pdf](http://www.airsafe.com/analyze/nyt_bias-data.pdf).

Both the fatal event and the New York Times spreadsheets used a common index variable called "Accident.ID" to uniquely identify each fatal event and the coverage associated with that event.


```{r, echo=TRUE}


# Updated fatal event data 
updated_events = read.csv("http://www.airsafe.com/analyze/fatal_event_data.csv")
updated_events_stdby = updated_events # raw data in reserve

# New York Times Index data
nyt_data = read.csv("http://www.airsafe.com/analyze/accident_nyt_data.csv")
nyt_data_stdby = nyt_data # raw data in reserve
# str(nyt_data)

```

###Data processing for fatal event data###

Both sets of data were loaded, and the data frame with the updated event data was processed in the following ways to limit the data to what was relevant to the study:

- A logical vector was added to identify events that were due to deliberate action (sabotage, hijacking, and military action)


- The following columns were eliminated: Tail.Number, 

- Dates were converted into a format readable by R

- The column name for dates was changed from "Flt.Intl.Accident.Date" to "Date" 


####Adding logical vector for deliberate actions####

Deliberate actions are fatal events due to causes such as hijacking, sabotage, military action, and other deliberate acts of terrorism or mayhem that led to the death of at least one passenger who was not a person who caused the event. The events were identified through references made on the site AirSafe.com and elsewhere, specifically the following online resources:

- http://www.airsafe.com/events/hijack.htm
- http://www.airsafenews.com/2014/07/jet-airliner-passengers-killed-by.html
- http://www.airsafe.com/plane-crash/deliberate.htm
- http://en.wikipedia.org/wiki/List_of_airliner_shootdown_incidents
- http://en.wikipedia.org/wiki/List_of_aircraft_hijackings
- http://en.wikipedia.org/wiki/Category:Airliner_bombings
- http://news.aviation-safety.net/2015/03/26/list-of-aircraft-accidents-and-incidents-deliberately-caused-by-pilots/
- http://aviation-safety.net/database/events/event.php?code=SE
- http://aviation-safety.net/database/record.php?id=19851124-0
- http://avstop.com/history/majorevents/hijackings.html

After the deliberate action events were identified, the logical vector "deliberate" was added to the data frame containing the event data from the original study, as well as the event data from the revised study.

```{r, echo=TRUE}

# From raw records file, following Accident.ID values associated with 
# sabotage, hijacking, missile strikes, bombings, and other deliberate acts
# Deliberate 107, 509
# Hijack 81, 539, 519, 520, 246, 248, 250, 279, 280, 330, 663, 401, 529
# Bomb 49, 537, 145, 197, 243, 244, 253, 278, 283, 315, 316, 486, 487
# Shot down 45, 51, 3, 73, 143, 245, 281
# Suicide 254

# Reviewed numerous resources in aviation-safety.net and AirSafe.com to find
# Events due to sabotage, hijack, military action, and other deliberate acts,
# Matched these with the raw fatal event data frame

non_accident = c(107, 509,                                                        # deliberate
                 81, 539, 519, 520, 246, 248, 250, 279, 280, 330, 663, 401, 529,  # hijack
                 49, 537, 145, 197, 243, 244, 253, 278, 283, 315, 316, 486, 487,  # bomb
                 45, 51, 3, 73, 143, 245, 281,                                    # shot down
                 254)                                                             # suicide

deliberate_b = (updated_events$Accident.ID %in% non_accident)
updated_events$deliberate = deliberate_b


```

####Changing and editing the date information in the fatal events data frame ####

Several changes were made to the event data frame, including the following:

- Renaming the date column from "Flt.Intl.Accident.Date" to "Date" 

- Converting dates into a format suitable for R

- Eliminating non-passenger flights from the events data frame.

- Adding a column called "USA.Event", which is a logical vector identifying events that occurred on US territory, or events that involved aircraft registered in the US.

- Replacing missing values coded with number 999 with NA.

- Eliminating data frame rows that were outside of the time period 1978-1994.

- Eliminating any events with missing values (coded as NA).

- Removing aircraft events involving unknown aircraft types, or that involved selected aircraft models that were not typically used as airliners. 

- Adding a logical variable called "nyc" to identify events with a connection to New York City. These include events involving airlines based in the New York area, specifically those where the "Carrier" variable contained "TWA", "Pan Am", or "Chautauqua", and events where the "Location" variable contained "New York", "LGA", or "Newark".

- Adding a logical variable called "covered" that is TRUE if a fatal event was associated with at least one NYT article.

- Adding a column to the fatal event data frame that has the number of articles in the New York Times Index data frame that are associated with that fatal event.


```{r, echo=TRUE}

# Rename date column
colnames(updated_events)[colnames(updated_events)=="Flt.Intl.Accident.Date"] <- "Date" 
# str(updated_events)

# Convert dates
updated_events$Date=as.Date(as.character(updated_events$Date),"%m/%d/%Y")
# str(updated_events)

# Restrict events to passenger flights
updated_events = subset(updated_events,Passenger.Flight==TRUE)

# Add logical vector for event in the USA or if it involved a US-registered airline
updated_events$USA.Event = updated_events$Country.of.Accident=="USA"| updated_events$Country.of.Registration=="USA" 

# Jet transports are flagged by the logical vector updated_events$Jet.Transport. A review showed that, if coded as jet,
# the value for the 'Airplane' variable begins with c('7', 'A3', 'BAC', 'BAe 1', 'BAe1', 'Cara','Citation', 'DC10', 'DC8', 'DC9', 'F100', 'F28', 'Gulf', 'IL62','L1011', 'Lear', 'MD', 'Trident', 'Tu1', 'VC10', 'Yak')


# Replace missing values coded with number 999 with NA
updated_events[updated_events==999] = NA


# Restrict events to period 1978-1994 (1 January 1978 through 31 December 1994, inclusive)
updated_events=subset(updated_events, Date >= as.Date("1978-01-01") & Date <= as.Date("1994-12-31"))


# Restrict to complete cases
updated_events = updated_events[complete.cases(updated_events),]

# Remove events involving unknown aircraft types ("UNK") or selected aircraft models that were
# not typically used as airliners: "DC3", "Beech Bonanza", "Learjet 25D", "MU2B-60", "Piper Navajo", and "PA31"

updated_events = subset(updated_events, !(Aircraft %in% c("DC3", "UNK", "Beech Bonanza", "Learjet 25D", "MU2B-60", "Piper Navajo", "PA31")))

nyc_airline_b  = grepl("pan am|twa|Chautauqua",updated_events$Carrier,ignore.case=TRUE )
nyc_location_b = grepl("LGA|New York|Newark",updated_events$Location,ignore.case=TRUE )
nyc_b = nyc_airline_b|nyc_location_b
updated_events$nyc = nyc_b


# Logical vector added to the updated_events data frame to indicate which events had
# usable data and were also the subject of at least one New York Times article

covered_b = updated_events$Accident.ID %in% nyt_data$Accident.ID
updated_events$covered = covered_b

# Add an "Articles" column to the updated events data frame, filled from nyt_data
# Events with no matching Accident.ID in nyt_data keep a count of 0
updated_events$Articles=0

# Look up the nyt_data row for each event and copy its "Articles" value
nyt_row = match(updated_events$Accident.ID, nyt_data$Accident.ID)
updated_events$Articles[covered_b] = nyt_data$Articles[nyt_row[covered_b]]

```

####PRIOR ASSUMPTIONS ABOUT THE DATA####

The prior assumptions about New York Times coverage were confirmed by a review of the updated fatal event data.

```{r, echo=TRUE}
# Events in the US or involving aircraft registered in the US

paste("- US events (occurring in the US or on US-registered aircraft) - ", format(100*sum(updated_events$USA.Event)/nrow(updated_events), digits=3),"% of all events, ", format(100*sum(updated_events$USA.Event & updated_events$covered)/sum(updated_events$covered), digits=3), "% of all covered events, and ", format(100*sum(updated_events[updated_events$USA.Event & updated_events$covered,"Articles"])/sum(updated_events$Articles), digits=3), "% of all articles.", sep="") 

paste("- Events due to deliberate actions such as sabotage, hijacking, and military action - ", format(100*sum(updated_events$deliberate)/nrow(updated_events), digits=2),"% of all events, ", format(100*sum(updated_events$deliberate & updated_events$covered)/sum(updated_events$covered), digits=3), "% of all covered events, and ", format(100*sum(updated_events[updated_events$deliberate & updated_events$covered,"Articles"])/sum(updated_events$Articles), digits=3), "% of all articles.", sep="") 

# High interest event - USA and deliberate action
paste("- US-related events that were due to deliberate actions - ", format(100*sum(updated_events$USA.Event & updated_events$deliberate)/nrow(updated_events), digits=2),"% of all events, ", format(100*sum(updated_events[updated_events$USA.Event & updated_events$deliberate,"covered"])/sum(updated_events$covered), digits=2), "% of all covered events, and ", format(100*sum(updated_events[updated_events$USA.Event & updated_events$deliberate,"Articles"])/sum(updated_events$Articles), digits=3), "% of all articles.", sep="") 

# Jet transport events
paste("- Events involving jet airliners - ", format(100*sum(updated_events$Jet.Transport)/nrow(updated_events), digits=3),"% of all events, ", format(100*sum(updated_events[updated_events$Jet.Transport,"covered"])/sum(updated_events$covered), digits=3), "% of all covered events, and ", format(100*sum(updated_events[updated_events$Jet.Transport,"Articles"])/sum(updated_events$Articles), digits=3), "% of all articles.", sep="") 

```
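
The three percentages reported for each category above follow the same pattern: the category's share of all events, of all covered events, and of all articles. As a minimal sketch of that calculation, the block below uses a small invented data frame (`toy_events`) and a hypothetical helper (`coverage_shares`); neither is part of the study data or the original analysis.

```r
# Toy data invented for illustration only; not study data
toy_events <- data.frame(
  USA.Event = c(TRUE, FALSE, TRUE, FALSE),
  covered   = c(TRUE, FALSE, TRUE, TRUE),
  Articles  = c(6, 0, 3, 1)
)

# Hypothetical helper: percentage shares for the events where `flag` is TRUE
coverage_shares <- function(events, flag) {
  c(
    pct_events   = 100 * sum(flag) / nrow(events),
    pct_covered  = 100 * sum(flag & events$covered) / sum(events$covered),
    pct_articles = 100 * sum(events$Articles[flag]) / sum(events$Articles)
  )
}

toy_shares <- coverage_shares(toy_events, toy_events$USA.Event)
print(round(toy_shares, 1))
```

A category is over-represented in coverage when its share of covered events or of articles exceeds its share of all events.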


###EXPLORATORY DATA ANALYSIS###

The initial review of the data showed that the distribution of the articles was very skewed, with the majority of the events having one or fewer articles. This is shown by the table of values, the statistical summary of the article distribution, the histogram of the "Articles" variable, and the histogram of the log of that variable.
```{r, echo=TRUE}

# Article distribution
paste("There were a total of ", nrow(updated_events), " events in the updated data frame for the period 1978-1994 that also had sufficient information available for use in this analysis, of which ", sum(updated_events$covered), " were the subject of at least one New York Times article from that period. ", "There were a total of ", sum(updated_events$Articles), " articles written about these events.", sep="") 

paste("Table showing how many events had a particular number of articles written")
table(updated_events$Articles)

paste("Statistical summary of the distribution of articles written")
summary(updated_events$Articles)

hist(updated_events$Articles)
# log(0) is -Inf and hist() drops non-finite values, so this plot omits events with no articles
hist(log(updated_events$Articles))

```

###REGRESSION ANALYSIS OF NEW YORK TIMES COVERAGE###

Two regression models were created to identify the independent variables that could be used to predict two things: the likelihood that a particular fatal event would result in one or more New York Times articles, and the number of articles associated with a fatal event.

Both regressions used a randomly selected portion of the fatal events to train the model, and tested the model against the remainder of the data frame. This was done twice, once for the fatal events data that was used in the original study, and also on the updated set of event data.
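
The idea of a random split that keeps the proportion of covered events the same in the training and test sets can be sketched in base R as follows. This is only an illustration under stated assumptions: `stratified_split` is a hypothetical helper, the outcome vector is invented, and the 75/25 ratio is chosen for the example. It is not the implementation used by any particular package.

```r
set.seed(1)

# Invented outcome: 60 covered events and 40 events without coverage
toy_covered <- rep(c(TRUE, FALSE), times = c(60, 40))

# Hypothetical helper: sample the same fraction from each outcome level
stratified_split <- function(outcome, ratio = 0.75) {
  in_train <- logical(length(outcome))
  for (level in unique(outcome)) {
    idx <- which(outcome == level)
    n_train <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

toy_in_train <- stratified_split(toy_covered)

# Share of covered events is the same in both parts of the split
print(mean(toy_covered[toy_in_train]))
print(mean(toy_covered[!toy_in_train]))
```

Keeping the outcome proportions equal means the baseline guessing rate is the same for the training and test sets, which makes the two accuracy figures directly comparable.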

For the first regression, the variable "covered", which was the logical vector indicating if an event received coverage, was the dependent variable, and the following twelve variables were the independent variables: "Fatal.Crew", "Fatal.Pax", "Fatal.Other", "Total.Fatal", "Total.Crew", "Total.Pax", "Total.On.Board", "Jet.Transport", "Scheduled", "USA.Event", "nyc", and "deliberate".

As a first step, a correlation table was used to find pairs of independent variables that were highly correlated, specifically those where the magnitude of the correlation was greater than 0.7.

```{r, echo=TRUE}

abs(cor(updated_events[,c("Fatal.Crew", "Fatal.Pax", "Fatal.Other", "Total.Fatal", "Total.Crew", "Total.Pax", "Total.On.Board", "Jet.Transport" , "Scheduled", "USA.Event", "nyc", "deliberate")]))>0.7
```
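
A logical matrix like the one above can be hard to scan when there are many variables. As a sketch of one way to list only the highly correlated pairs by name, the block below works on a small invented data frame (`toy_vars`); the values are illustrative and are not study data.

```r
# Invented values; Total.On.Board is constructed to track Total.Pax closely
toy_vars <- data.frame(
  Total.Pax      = c(10, 50, 120, 200, 75, 30),
  Total.On.Board = c(14, 56, 128, 212, 81, 34),
  Fatal.Other    = c(0, 3, 0, 1, 0, 2)
)

cor_mat <- abs(cor(toy_vars))

# Keep each pair once (upper triangle) and only where |r| exceeds 0.7
high <- which(cor_mat > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1    = rownames(cor_mat)[high[, "row"]],
  var2    = colnames(cor_mat)[high[, "col"]],
  abs_cor = cor_mat[high]
)
print(high_pairs)
```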

The focus of the study was coverage of events dealing with passenger fatalities, so variables not directly related to passengers were eliminated if they were highly correlated to passenger-related variables, or if the variable were a linear combination of other variables. The following combinations were highly correlated:

- Fatal.Crew - Fatal.Pax, Total.Fatal
- Total.Fatal - Total.Pax
- Total.Crew - Total.Pax
- Total.On.Board - Total.Crew, Total.Pax

Running a second correlation with these four variables ("Fatal.Crew", "Total.Fatal", "Total.Crew", and "Total.On.Board") removed showed that there were no highly correlated variables remaining.

```{r, echo=TRUE}

abs(cor(updated_events[,c( "Fatal.Pax", "Fatal.Other",   "Total.Pax",  "Jet.Transport" , "Scheduled", "USA.Event", "nyc", "deliberate")]))>0.7

```
####PREDICTING NEW YORK TIMES COVERAGE USING LOGISTIC REGRESSION####

A logistic regression model was used to predict the probability that a particular event would generate one or more articles. The model was trained on 75% of the fatal event data, and tested on the remaining 25%. The accuracy of the model would be compared against the accuracy of guessing that all of the events would either get coverage or not get coverage, based on the likelihood that events in the training set received coverage. For example, if 55% of the events in the training set were covered, the model would have useful predictive value if it was accurate more than 55% of the time on the test set.
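
The comparison against the guessing baseline can be illustrated with a few invented outcomes. In this sketch, all vectors are made up for the example and the variable names are hypothetical; the point is only to show how the baseline rate and the model accuracy are each computed.

```r
# Invented outcomes for illustration only; not study data
toy_train_covered  <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
toy_test_covered   <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
toy_test_predicted <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

# Baseline: always guess the more common outcome in the training set
majority_class <- mean(toy_train_covered) >= 0.5
baseline_accuracy <- mean(toy_test_covered == majority_class)

# Model: fraction of test events where the prediction matches the outcome
model_accuracy <- mean(toy_test_predicted == toy_test_covered)

print(c(baseline = baseline_accuracy, model = model_accuracy))
```

A model is only useful as a predictor if its accuracy on the test set exceeds the baseline figure.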

The first step is to split the data, and to put the training set into the following generalized linear model:

glm(covered~Fatal.Pax+Fatal.Other+Total.Pax+Jet.Transport+Scheduled+deliberate+USA.Event+nyc, data=train, family="binomial")


```{r, echo=TRUE}

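# Load caret package for creating data splits within groups of the data and for the confusionMatrix function
# (installed first only if it is missing)
if (!requireNamespace("caret", quietly=TRUE)) install.packages("caret", repos="http://cran.rstudio.com/")
library(caret)

# Load caTools package for splitting data (installed first only if it is missing)
if (!requireNamespace("caTools", quietly=TRUE)) install.packages("caTools", repos="http://cran.rstudio.com/")
library(caTools)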

# Randomly split data
set.seed(1)
split = sample.split(updated_events$covered, SplitRatio = 0.75)
train = subset(updated_events, split == TRUE)
test = subset(updated_events, split == FALSE)

# Initial iteration of the regression model on the training set

model_a = glm(covered~Fatal.Pax+Fatal.Other+Total.Pax+Jet.Transport+Scheduled+deliberate+USA.Event+nyc, data=train, family="binomial")
summary(model_a)

```

The regression model was run on the training set several times, each time eliminating the least explanatory variable, until all the remaining variable coefficients were significant at the 0.05 level.

The variables were removed in the following sequence: "Scheduled", "nyc", "Total.Pax", "Fatal.Other". This resulted in the following model using the independent variables "Fatal.Pax", "Jet.Transport", "deliberate", and "USA.Event":

```{r, echo=TRUE}
model_b = glm(covered~Fatal.Pax+Jet.Transport+deliberate+USA.Event, data=train, family="binomial")
summary(model_b)
```
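
The manual elimination procedure described above can also be written as a loop. The following is a sketch on simulated data, not the procedure actually run for this report: `backward_eliminate` is a hypothetical helper, the data frame `toy` is simulated, and the code assumes that every predictor contributes exactly one coefficient (numeric or logical predictors, with none dropped for collinearity).

```r
set.seed(1)
n <- 300

# Simulated data: y depends on x1 and x2 but not on noise
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x1 - 1.5 * toy$x2)) == 1

# Hypothetical helper: refit, dropping the predictor with the largest
# p-value, until every remaining coefficient is significant at alpha
backward_eliminate <- function(data, response, predictors, alpha = 0.05) {
  while (length(predictors) > 0) {
    fit <- glm(reformulate(predictors, response = response),
               data = data, family = "binomial")
    tab <- coef(summary(fit))[-1, , drop = FALSE]  # drop the intercept row
    p_values <- tab[, "Pr(>|z|)"]
    if (max(p_values) < alpha) return(fit)
    predictors <- predictors[-which.max(p_values)]
  }
  # No predictor survived: fall back to the intercept-only model
  glm(reformulate("1", response = response), data = data, family = "binomial")
}

final_fit <- backward_eliminate(toy, "y", c("x1", "x2", "noise"))
print(attr(terms(final_fit), "term.labels"))
```

Dropping one variable at a time matters because removing a variable changes the estimates and p-values of those that remain.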

The model was then run against the training data to see how accurately it could predict whether an event received coverage.

```{r, echo=TRUE}

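first_base = predict(model_b, newdata=train, type="response")
# confusionMatrix() requires factors with matching levels, so both logical vectors are converted
predict_base_results = confusionMatrix(factor(first_base>0.5, levels=c(FALSE,TRUE)), factor(train$covered, levels=c(FALSE,TRUE)), positive="TRUE") # Accuracy = 80.3%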
predict_base_results

paste("The accuracy of this model against the training set was ", format(100*unlist(predict_base_results[[3]])[1], digits=3),"% compared to the baseline guessing rate of ", format(100*unlist(predict_base_results[[3]])[5], digits=3), "%.", sep="")

```
The process was repeated for the test set of data to see how the model worked with a 'new' set of fatal events. 

```{r, echo=TRUE}

second_base = predict(model_b, newdata=test,type="response")
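# confusionMatrix() requires factors with matching levels, so both logical vectors are converted
predict_base_resultsb = confusionMatrix(factor(second_base>0.5, levels=c(FALSE,TRUE)), factor(test$covered, levels=c(FALSE,TRUE)), positive="TRUE")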
predict_base_resultsb

paste("The accuracy of this model against the test set was ", format(100*unlist(predict_base_resultsb[[3]])[1], digits=3),"% compared to the baseline guessing rate of ", format(100*unlist(predict_base_resultsb[[3]])[5], digits=3), "%.", sep="")

```
The results for the test data were similar to those for the training data, with the accuracy on the test data being a bit lower than on the training data. The prediction accuracy for both was better than guessing the most likely outcome (coverage by the New York Times) for every fatal event.


####PREDICTING THE NUMBER OF NEW YORK TIMES ARTICLES WITH A POISSON REGRESSION MODEL####

As was shown earlier, the number of articles is highly skewed, with over three quarters of the events getting two or fewer articles, and a handful getting dozens or hundreds of articles. A generalized linear model from the Poisson family was used to create a model to predict the number of articles that a particular event may generate. 

One can use a variation of the glm() function to provide a model based on the articles generated by the study data. Since the glm() function gives a model of the log of the output, exp(glm estimate) gives the expected number of articles given the characteristics of the fatal event. 
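
To make the log scale concrete, the sketch below fits a Poisson model to six invented events with a single logical predictor. The data frame `toy_counts` and its values are assumptions made for the example and are not study data.

```r
# Invented counts for illustration only
toy_counts <- data.frame(
  deliberate = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE),
  Articles   = c(1, 2, 3, 8, 10, 12)
)

toy_fit <- glm(Articles ~ deliberate, data = toy_counts, family = "poisson")

# The linear predictor is on the log scale, so exp() gives the expected count
new_event <- data.frame(deliberate = TRUE)
log_estimate <- predict(toy_fit, newdata = new_event, type = "link")
expected_articles <- exp(log_estimate)

# type = "response" applies the same transformation directly
expected_direct <- predict(toy_fit, newdata = new_event, type = "response")

# exp() of a coefficient is a multiplicative effect on the expected count
rate_ratio <- exp(coef(toy_fit)["deliberateTRUE"])

print(c(expected = unname(expected_articles), ratio = unname(rate_ratio)))
```

With a single logical predictor the fitted expected counts equal the group means of the toy data (2 and 10 articles), so the multiplicative effect of the predictor is about 5.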

Starting with the same eight independent variables identified earlier, the model was rerun, dropping variables that were not significant, until the remaining coefficients were all significant at the 0.05 level. The resulting model was:

glm(Articles~Fatal.Pax+Fatal.Other+Total.Pax+Jet.Transport+Scheduled+deliberate+USA.Event, data=updated_events, family="poisson")

Since each group was of size one (number of events for each row of data), there was no offset in this model.

In the first iteration, only the "nyc" variable was not significant at the 0.05 level, and the second iteration produced the following model:


```{r, echo=TRUE}

# Model number of articles with Poisson distribution

poisson1 = glm(Articles~Fatal.Pax+Fatal.Other+Total.Pax+Jet.Transport+Scheduled+deliberate+USA.Event, data=updated_events, family="poisson")
summary(poisson1)
```

This model implies that among the four categorical variables, "deliberate", "Jet.Transport", and "USA.Event" have the greatest influence on increasing the number of articles. When it comes to fatalities, more passenger fatalities are associated with a higher number of articles, while the number of surviving passengers, as well as the number of fatalities of people who were neither passengers nor crew members, are associated with fewer articles.

##Discussion##

While this study focused on the period 1978-1994, it can be extended to prior and subsequent periods. However, it may be difficult to use this model for periods prior to the 1970s, before hijackings and other deliberate actions became a regular policy and media concern, or for the period after 9/11, when the media and public policy focus on aviation security went through a significant and sustained change.