Update README.md
lopesoll committed Feb 28, 2023
1 parent 5ba7fb5 commit 03e073a9d404ce9bf08fc652c7a7749db38f0f71
Showing 1 changed file with 14 additions and 0 deletions.
@@ -6,6 +6,20 @@
- Randomizes the dataset and splits by severity into 4 CSV files.
- These are imported in Work_final_balanced with a 5000 limit, balancing the labels equally.
- Randomizing before the split is important for diversity in date, city, state, ...
2. WorkFinal_balanced
- Data preprocessing
- Instantiate the StringIndexer and OneHotEncoder classes from the pyspark.ml.feature library (transforming categorical features into numerical representations)
- Instantiate a VectorAssembler from the pyspark.ml.feature library (combining the feature columns into a single vector column)
- Build a pipeline with the StringIndexer, OneHotEncoder, and VectorAssembler stages.
- Fit the data on the pipeline and transform the data.
- Machine Learning models
- Train a logistic regression model, decision tree classifier, and random forest classifier
- Perform hyperparameter tuning for the logistic regression and decision tree classifiers
- Evaluate the models
- Evaluate the models using true positive rate per label, false positive rate per label, and F-measure per label
- Evaluate the models using F1 score, true positive rate, false positive rate, precision, recall, and Hamming loss
3. Vizualizations accidents 2016-2021
- Tableau visualizations of the data
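
The shuffle-then-split step from item 1 can be sketched in plain Python. This is an illustration only, not the notebook's code: the function name `balanced_split`, the `Severity` column name, and the toy rows are assumptions. Each label keeps at most `limit` rows, so the result is balanced only when every label has at least `limit` rows available.

```python
import random
from collections import defaultdict


def balanced_split(rows, label_key="Severity", limit=5000, seed=42):
    """Shuffle rows, group them by label, keep at most `limit` rows per label."""
    rng = random.Random(seed)
    shuffled = list(rows)   # copy so the caller's list is left untouched
    rng.shuffle(shuffled)   # randomize before splitting to mix dates, cities, states
    groups = defaultdict(list)
    for row in shuffled:
        bucket = groups[row[label_key]]
        if len(bucket) < limit:
            bucket.append(row)
    return dict(groups)


# Toy data (hypothetical): label 2 is over-represented, as in an imbalanced dataset.
severities = [1, 2, 2, 2, 3, 4, 2, 2, 1, 3, 4, 4]
rows = [{"Severity": s, "City": f"city{i}"} for i, s in enumerate(severities)]
parts = balanced_split(rows, limit=2)
```

Each value in `parts` could then be written to its own CSV file, one per severity level.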

Link to the original Dataset: [US Accidents (2016 - 2021)](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents).
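
For reference, the per-label metrics listed under "Evaluate the models" can be computed from true and predicted labels as in the sketch below. It is a minimal plain-Python illustration, not the notebook's evaluation code; the function names and the toy labels are assumptions. For single-label multiclass predictions, Hamming loss reduces to the fraction of misclassified rows.

```python
def per_label_metrics(y_true, y_pred):
    """Return true positive rate, false positive rate, and F-measure per label."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    metrics = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        tn = n - tp - fp - fn
        tpr = tp / (tp + fn) if tp + fn else 0.0        # recall for this label
        fpr = fp / (fp + tn) if fp + tn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + tpr
        f_measure = 2 * precision * tpr / denom if denom else 0.0
        metrics[label] = {"tpr": tpr, "fpr": fpr, "f_measure": f_measure}
    return metrics


def hamming_loss(y_true, y_pred):
    """Fraction of rows whose predicted label differs from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels (hypothetical): four severity classes, two misclassified rows.
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 2, 2, 2, 3, 4, 4, 4]
metrics = per_label_metrics(y_true, y_pred)
loss = hamming_loss(y_true, y_pred)
```

Here label 1 has a true positive rate of 0.5 (one of its two rows is predicted as 2), and the Hamming loss is 0.25 (two of eight rows are wrong).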
