Update README.md

lopesoll · Feb 28, 2023 · 03e073a9d404ce9bf08fc652c7a7749db38f0f71 · 03e073a
1 parent 5ba7fb5
commit 03e073a9d404ce9bf08fc652c7a7749db38f0f71
Showing 1 changed file with 14 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -6,6 +6,20 @@
      - Randomizes the dataset and splits by severity into 4 CSV files.
      - These are imported in Work_final_balanced with a 5000 limit, balancing the labels equaly.
      - Randomizing before repartition is important for diversity in date, city, state,... 
+2. WorkFinal_balanced
+   - Data preprocessing
+     - Iniciate LabelIndexer and OneHotEncoder classes of pyspark.ml.feature library (transform categorical features into numerical representations) 
+     - Iniciate a VectorAssembler class of pyspark.ml.feature library (combining the columns into a single column)
+     - Build a pipeline with LabelIndexer, OneHotEncoder, and VectorAssembler.
+     - Fit the data on the pipeline and transform the data. 
+   - Machine Learning models
+     - Train a logistic regression model, decision tree classifier, and random forest classifier
+     - Perform Hyperparameter tunning for Logistic regression and Decision tree classifier
+   - Evaluate the models
+     - Evaluate the models using True positive rate per label, False Positive rate per label, and F Measure per label
+     - Evaluate the model using F1 Score, True positive rate, False positive rate, Precison, Recall, and Hamming Loss 
+3. Vizualizations accidents 2016-2021
+   - Tableau visualizations of the data 
 
 Link to the original Dataset: [US Accidents (2016 - 2021)](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents).