Traffic_accident_Severity_Classification_with_Pyspark

  1. Create_undersampled_data
    • Data pre-processing
      • Handles null values (filling NAs, dropping rows, and dropping columns)
      • Randomizes the dataset and splits it by severity into 4 CSV files.
      • These files are imported in Work_final_balanced with a limit of 5000 rows per severity class, so the labels are balanced equally.
      • Randomizing before repartitioning is important to preserve diversity in date, city, state, ...
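The shuffle-then-undersample step above can be sketched outside Spark. This is only an illustration of the balancing logic, not the notebook's actual code; the column name `Severity` and the demo data are assumptions:

```python
import random
from collections import defaultdict

def undersample_by_severity(rows, label_key="Severity", limit=5000, seed=42):
    """Shuffle rows, then keep at most `limit` rows per severity class.

    Shuffling BEFORE partitioning mirrors the README's point: each class
    sample stays diverse in date, city, state, etc.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # randomize first
    kept = defaultdict(list)
    for row in rows:
        label = row[label_key]
        if len(kept[label]) < limit:   # cap each class at `limit` rows
            kept[label].append(row)
    return kept  # {severity: balanced list of rows}

# Tiny demo with a hypothetical schema (4 severity classes, 100 rows).
data = [{"Severity": str(1 + i % 4), "City": f"c{i}"} for i in range(100)]
balanced = undersample_by_severity(data, limit=10)
```

In the project itself this happens with Spark DataFrame operations and the four per-severity CSV exports; the sketch just makes the shuffle-then-cap order explicit.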
  2. WorkFinal_balanced
    • Data pre-processing
      • Initiate StringIndexer (label indexing) and OneHotEncoder stages from the pyspark.ml.feature library (transform categorical features into numerical representations)
      • Initiate a VectorAssembler stage from the pyspark.ml.feature library (combine the feature columns into a single vector column)
      • Build a pipeline with the StringIndexer, OneHotEncoder, and VectorAssembler stages.
      • Fit the pipeline on the data and transform the data.
    • Machine Learning models
      • Train a logistic regression model, a decision tree classifier, and a random forest classifier
      • Perform hyperparameter tuning for the logistic regression and decision tree classifiers
    • Evaluate the models
      • Evaluate the models per label using true positive rate, false positive rate, and F-measure
      • Evaluate the models overall using F1 score, true positive rate, false positive rate, precision, recall, and Hamming loss
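The per-label evaluation described above can be sketched in plain Python. In the notebook these numbers presumably come from PySpark's evaluators (e.g. `pyspark.mllib.evaluation.MulticlassMetrics`), so this one-vs-rest version is only illustrative; the demo labels and predictions are made up:

```python
def per_label_rates(y_true, y_pred, labels):
    """One-vs-rest TP rate, FP rate, and F-measure for each label."""
    out = {}
    for lab in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == lab and p == lab)
        fn = sum(1 for t, p in pairs if t == lab and p != lab)
        fp = sum(1 for t, p in pairs if t != lab and p == lab)
        tn = sum(1 for t, p in pairs if t != lab and p != lab)
        tpr = tp / (tp + fn) if tp + fn else 0.0  # recall for this label
        fpr = fp / (fp + tn) if fp + tn else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * prec * tpr / (prec + tpr) if prec + tpr else 0.0
        out[lab] = {"TPR": tpr, "FPR": fpr, "F": f1}
    return out

# Hypothetical predictions over the 4 severity classes.
y_true = [1, 1, 2, 2, 3, 4]
y_pred = [1, 2, 2, 2, 3, 3]
metrics = per_label_rates(y_true, y_pred, labels=[1, 2, 3, 4])
```

Per-label rates matter here because even after undersampling, a model can look fine on overall accuracy while missing an entire severity class; the one-vs-rest breakdown exposes that.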
  3. Visualizations accidents 2016-2021
    • Tableau visualizations of the data
      • Top 5 cities with the most accidents
      • Total number of accidents per year, month, and weekday
      • Number of accidents per weather condition and temperature in Celsius
      • Impact of road elements on the number of accidents
      • Dynamic time-series visualization of the number of accidents per month in each state

Link to the original dataset: US Accidents (2016 - 2021).