From 03e073a9d404ce9bf08fc652c7a7749db38f0f71 Mon Sep 17 00:00:00 2001
From: "Lucas Lopes Oliveira (lopesoll)"
Date: Tue, 28 Feb 2023 13:22:35 +0000
Subject: [PATCH] Update README.md

---
 README.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/README.md b/README.md
index 9cc00ad..fd89829 100644
--- a/README.md
+++ b/README.md
@@ -6,6 +6,20 @@
     - Randomizes the dataset and splits by severity into 4 CSV files.
     - These are imported in Work_final_balanced with a 5000 limit, balancing the labels equaly.
     - Randomizing before repartition is important for diversity in date, city, state,...
+2. WorkFinal_balanced
+    - Data preprocessing
+        - Instantiate the StringIndexer and OneHotEncoder classes from pyspark.ml.feature (they transform categorical features into numerical representations)
+        - Instantiate the VectorAssembler class from pyspark.ml.feature (it combines the feature columns into a single vector column)
+        - Build a pipeline with StringIndexer, OneHotEncoder, and VectorAssembler.
+        - Fit the pipeline on the data and transform the data.
+    - Machine learning models
+        - Train a logistic regression model, a decision tree classifier, and a random forest classifier
+        - Perform hyperparameter tuning for the logistic regression and decision tree classifiers
+    - Evaluate the models
+        - Evaluate the models per label using true positive rate, false positive rate, and F-measure
+        - Evaluate the models overall using F1 score, true positive rate, false positive rate, precision, recall, and Hamming loss
+3. Vizualizations accidents 2016-2021
+    - Tableau visualizations of the data
 
 Link to the original Dataset: [US Accidents (2016 - 2021)](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents).
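
Note on the evaluation bullets added above: the per-label true positive rate, false positive rate, and F-measure, together with the Hamming loss, can be illustrated with a minimal plain-Python sketch. This is not the notebook's code, which presumably relies on Spark's own evaluators. The function names `per_label_metrics` and `hamming_loss` and the toy severity labels are assumptions made for illustration only.

```python
def per_label_metrics(y_true, y_pred):
    """Compute one-vs-rest TPR, FPR and F1 for every label seen in the data."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    metrics = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        tn = n - tp - fp - fn
        # Guard every ratio against an empty denominator.
        tpr = tp / (tp + fn) if (tp + fn) else 0.0        # recall for this label
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        f1 = (2 * precision * tpr / (precision + tpr)) if (precision + tpr) else 0.0
        metrics[label] = {"tpr": tpr, "fpr": fpr, "f1": f1}
    return metrics


def hamming_loss(y_true, y_pred):
    """Fraction of predictions that differ from the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)


# Hypothetical toy data: four severity classes, two samples each.
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 2, 2, 2, 3, 4, 4, 4]

report = per_label_metrics(y_true, y_pred)
loss = hamming_loss(y_true, y_pred)
for label, values in report.items():
    print(label, values)
print("hamming loss:", loss)
```

For single-label multiclass predictions such as these, the Hamming loss equals one minus the accuracy, so it adds the most information when read next to the per-label rates rather than on its own.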