Can We Predict Injury in a Vehicle Crash?

Yeraldina Estrella
Jan 11, 2021

Introduction

Driving or riding in a car is generally considered more dangerous than flying. How often do you hear that someone was injured on a plane? We are far more likely to hear about injuries from a vehicle crash. Road safety has been a critical issue for many decades. Since vehicle crashes are fairly common, what can we learn from historical data to help prevent collisions? Perhaps the answer is to increase the number of traffic control devices, or to ensure that everyone operating a vehicle is licensed and in a good mental state. Would it be possible to build a machine learning model that predicts injury in a vehicle crash and reveals which factors drive that prediction?

Exploratory Data Analysis

This project combines two datasets from NYC Open Data: the Motor Vehicle Collisions - Crashes dataset (1.74M records) and the Motor Vehicle Collisions - Vehicles dataset (3.49M records). After data exploration, feature engineering and data cleaning, the combined dataset was reduced to about 300K rows and 24 features. Preliminary data analysis was performed to get an overview of the dataset. As shown in figure 1 below, which compares crash data from 2018 through 2020, collisions are more frequent in Brooklyn and Queens than in the other boroughs. Collision counts changed only gradually between 2018 and 2019, then dropped sharply in 2020. As many may recall, NYC was shut down for most of 2020 due to the COVID-19 pandemic, so traffic dropped and collisions dropped with it.
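To make the combining step concrete, here is a minimal sketch in plain Python of an inner join between crash-level and vehicle-level records. The column names (`collision_id`, `persons_injured`, `vehicle_type`), the tiny sample rows, and the way the binary `injured` target is derived are all assumptions made for illustration; the original notebook may have done this differently, most likely with a dataframe library.

```python
# Hypothetical miniature versions of the two tables (illustrative values only).
crashes = [
    {"collision_id": 1, "borough": "BROOKLYN", "persons_injured": 0},
    {"collision_id": 2, "borough": "QUEENS", "persons_injured": 2},
]
vehicles = [
    {"collision_id": 1, "vehicle_type": "Sedan"},
    {"collision_id": 1, "vehicle_type": "Taxi"},
    {"collision_id": 2, "vehicle_type": "SUV"},
    {"collision_id": 3, "vehicle_type": "Bus"},  # no matching crash record
]


def inner_join(crashes, vehicles, key="collision_id"):
    """Attach crash-level fields to every vehicle row that shares the key."""
    crash_by_id = {row[key]: row for row in crashes}
    merged = []
    for vehicle in vehicles:
        crash = crash_by_id.get(vehicle[key])
        if crash is not None:  # drop vehicles without a crash record
            merged.append({**crash, **vehicle})
    return merged


merged = inner_join(crashes, vehicles)

# Assumed target definition: 1 if anyone was injured in the crash, else 0.
for row in merged:
    row["injured"] = int(row["persons_injured"] > 0)

print(len(merged), [row["injured"] for row in merged])
```

Because one crash can involve several vehicles, the joined table has one row per vehicle, which is one reason row counts change when the two sources are combined.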

Predictive Modeling

To predict injury from a vehicle crash, I fit a Logistic Regression model and an XGBoost classifier, then compared their results to determine which approach performs better. The majority-class baseline is 76%: a model that always predicts "not injured" would be right 76% of the time.
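The majority-class baseline can be computed directly from the labels. The sketch below uses a made-up label list with the same 76/24 split reported above; the function name `majority_baseline` is hypothetical.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Illustrative labels: 0 = not injured, 1 = injured (76% / 24% split).
labels = [0] * 76 + [1] * 24
label, baseline_accuracy = majority_baseline(labels)
print(label, baseline_accuracy)
```

Any model worth keeping should beat this number, which is why accuracy on its own can look better than it is on an imbalanced target.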

Logistic Regression

The Logistic Regression model scored a validation accuracy of 75.6%, slightly below the 76% majority-class baseline. The coefficients of the Logistic Regression model are shown below:

The coefficients show that the features Zip_code, crash time, pre crash and vehicle type contribute to the prediction in both directions: positive coefficients push the prediction toward injury, and negative coefficients push it away.
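As a rough guide to reading such coefficients: in logistic regression, exponentiating a coefficient gives the factor by which the odds of the positive class change when that feature increases by one unit. The sketch below uses invented coefficient values and hypothetical feature names (`night_crash`, `parked_pre_crash`); they are not the values from this model.

```python
import math


def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(intercept, coefs, features):
    """Probability of injury for one row under a logistic regression model."""
    z = intercept + sum(coefs[name] * value for name, value in features.items())
    return sigmoid(z)


# Hypothetical fitted parameters, for illustration only.
intercept = -1.2
coefs = {"night_crash": 0.4, "parked_pre_crash": -0.9}

# exp(coefficient) > 1 raises the odds of injury; < 1 lowers them.
odds_ratios = {name: math.exp(c) for name, c in coefs.items()}

p_day = predict_proba(intercept, coefs, {"night_crash": 0, "parked_pre_crash": 0})
p_night = predict_proba(intercept, coefs, {"night_crash": 1, "parked_pre_crash": 0})
print(odds_ratios, p_day, p_night)
```

The size of a coefficient is only comparable across features when the features are on similar scales, so standardizing inputs before comparing weights is a common precaution.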

Classification with XGBoost Classifier

XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. After fitting and evaluating the model, I found an overall accuracy of 78% on the training data and 77% on the testing data.
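To show what "gradient boosting" means in practice, here is a deliberately simplified sketch in plain Python. It is not XGBoost: it boosts one-split trees (stumps) on a single feature and uses plain gradient steps on the log loss, whereas XGBoost grows deeper trees and adds second-order information and regularization. The toy data and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue  # a split must put rows on both sides
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    if best is None:
        raise ValueError("feature has a single value; no split possible")
    return best[1:]


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Each round fits a stump to the current errors and adds it to the ensemble."""
    stumps, scores = [], [0.0] * len(xs)
    for _ in range(rounds):
        # Negative gradient of the log loss: label minus predicted probability.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        scores = [s + learning_rate * (lv if x <= t else rv) for x, s in zip(xs, scores)]
    return stumps


def predict(stumps, x, learning_rate=0.5):
    score = sum(learning_rate * (lv if x <= t else rv) for t, lv, rv in stumps)
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical toy data: one numeric feature and a binary injury label.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
stumps = boost(xs, ys)
train_accuracy = sum(predict(stumps, x) == y for x, y in zip(xs, ys)) / len(ys)
print(train_accuracy)
```

On real data the same accuracy calculation would be run on a held-out test set as well, since a small gap between training and testing accuracy is what suggests the model is not badly overfit.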

The Area Under the Curve (AUC) measures the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. The higher the AUC, the better the model is at distinguishing between the positive and negative classes. To evaluate my model, I used the ROC curve and found a ROC AUC score of 0.7465 (74.65%), as shown in the graphs below.
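One way to build intuition for this number: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python on a tiny made-up example; in practice a library routine would normally be used instead of this quadratic loop.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Illustrative labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect ranking.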

ROC curves summarize the trade-off between the true positive rate and the false positive rate for a predictive model using different probability thresholds. ROC curves are appropriate when the observations are balanced between each class, whereas precision-recall curves are appropriate for imbalanced datasets. With 76% of the rows in the not-injured class, this dataset leans toward the imbalanced case.

Precision-recall curves summarize the trade-off between the true positive rate (recall) and the positive predictive value (precision) for a predictive model using different probability thresholds.
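Each point on a ROC or precision-recall curve comes from a single probability threshold. The sketch below, using made-up imbalanced labels and scores, computes the three quantities at two thresholds; the function names are hypothetical.

```python
def confusion_counts(y_true, y_score, threshold):
    """Count true/false positives and negatives at one probability threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rates(y_true, y_score, threshold):
    tp, fp, fn, tn = confusion_counts(y_true, y_score, threshold)
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,      # true positive rate
        "fpr": fp / (fp + tn) if fp + tn else 0.0,         # false positive rate
        "precision": tp / (tp + fp) if tp + fp else 1.0,   # positive predictive value
    }


# Illustrative imbalanced data: 6 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.65, 0.9]

at_050 = rates(y_true, y_score, 0.5)
at_080 = rates(y_true, y_score, 0.8)
print(at_050)
print(at_080)
```

At the 0.5 threshold the false positive rate is only about 0.33 while precision is 0.5, which illustrates why precision-recall curves can be more informative than ROC curves when negatives heavily outnumber positives.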

To see the overall importance of each feature in predicting the class, I used permutation importance. In the resulting ranking, features toward the top are the most important and those toward the bottom matter least. As the graphs below show, vehicle damage, pre crash, contributing factors and point of impact are among the top five indicators for predicting injury, whereas crash day and on street name have no importance in predicting injury from a vehicle collision.
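Permutation importance works by shuffling one feature column at a time and measuring how much the model's score drops. Here is a minimal plain-Python sketch of that idea; the toy "model" is a hand-written rule standing in for a fitted classifier, and the data is invented. The original analysis presumably used a library implementation.

```python
import random


def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column is shuffled independently."""
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(base - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def model(row):
    """Hypothetical stand-in for a fitted classifier: uses only column 0."""
    return 1 if row[0] >= 2 else 0


# Illustrative data: column 0 drives the label, column 1 is noise.
X = [[0, 5], [1, 3], [2, 5], [3, 1], [0, 2], [3, 4]]
y = [0, 0, 1, 1, 0, 1]
importances = permutation_importance(model, X, y)
print(importances)
```

A feature the model never uses gets an importance of exactly zero, which matches the reading of bottom-ranked features as having no effect on the prediction.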

Conclusion

By predicting the factors that lead to injury in a vehicle crash, authorities can focus their resources on reducing vehicle crashes. By identifying important features such as contributing factors, authorities and vehicle manufacturers can implement new procedures and technologies to avoid collisions and thus injuries. The baseline for the classification problem is 76% and the model's prediction accuracy on the test set is 77%, a modest one-point improvement. The XGBoost model chosen in the above analysis may therefore be appropriate for predicting injury.

Notebook
