Skip to content
Permalink
03e073a9d4
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Go to file
 
 
Cannot retrieve contributors at this time
25 lines (22 sloc) 1.54 KB

Traffic_accident_Severity_Classification_with_Pyspark

  1. Create_undersampled_data
    • Data preprocessing
      • Handles null values (filling na, dropping rows and dropping columns)
      • Randomizes the dataset and splits by severity into 4 CSV files.
      • These are imported in Work_final_balanced with a 5000 limit, balancing the labels equaly.
      • Randomizing before repartition is important for diversity in date, city, state,...
  2. WorkFinal_balanced
    • Data preprocessing
      • Iniciate LabelIndexer and OneHotEncoder classes of pyspark.ml.feature library (transform categorical features into numerical representations)
      • Iniciate a VectorAssembler class of pyspark.ml.feature library (combining the columns into a single column)
      • Build a pipeline with LabelIndexer, OneHotEncoder, and VectorAssembler.
      • Fit the data on the pipeline and transform the data.
    • Machine Learning models
      • Train a logistic regression model, decision tree classifier, and random forest classifier
      • Perform Hyperparameter tunning for Logistic regression and Decision tree classifier
    • Evaluate the models
      • Evaluate the models using True positive rate per label, False Positive rate per label, and F Measure per label
      • Evaluate the model using F1 Score, True positive rate, False positive rate, Precison, Recall, and Hamming Loss
  3. Vizualizations accidents 2016-2021
    • Tableau visualizations of the data

Link to the original Dataset: US Accidents (2016 - 2021).