From 03e073a9d404ce9bf08fc652c7a7749db38f0f71 Mon Sep 17 00:00:00 2001
From: "Lucas Lopes Oliveira (lopesoll)" <lopesoll@coventry.ac.uk>
Date: Tue, 28 Feb 2023 13:22:35 +0000
Subject: [PATCH] Update README.md

---
 README.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/README.md b/README.md
index 9cc00ad..fd89829 100644
--- a/README.md
+++ b/README.md
@@ -6,6 +6,20 @@
      - Randomizes the dataset and splits by severity into 4 CSV files.
      - These are imported in Work_final_balanced with a 5000 limit, balancing the labels equaly.
      - Randomizing before repartition is important for diversity in date, city, state,... 
+2. WorkFinal_balanced
+   - Data preprocessing
+     - Instantiate the StringIndexer and OneHotEncoder classes from pyspark.ml.feature (these transform categorical features into numerical representations).
+     - Instantiate a VectorAssembler from pyspark.ml.feature (this combines the feature columns into a single vector column).
+     - Build a pipeline with StringIndexer, OneHotEncoder, and VectorAssembler.
+     - Fit the pipeline on the data and transform the data.
+   - Machine Learning models
+     - Train a logistic regression model, decision tree classifier, and random forest classifier
+     - Perform hyperparameter tuning for the logistic regression and decision tree classifiers
+   - Evaluate the models
+     - Evaluate the models per label using true positive rate, false positive rate, and F-measure
+     - Evaluate the models using F1 score, true positive rate, false positive rate, precision, recall, and Hamming loss
+3. Vizualizations accidents 2016-2021
+   - Tableau visualizations of the data 
 
 Link to the original Dataset: [US Accidents (2016 - 2021)](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents).