{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Is the weather good to play outside?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The folder `datasets` contains two files:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"- `weather.numeric.csv`:\n",
"\n",
"```\n",
"temperature,humidity,windy,play\n",
"85,85,0,no\n",
"80,90,1,no\n",
"83,86,0,yes\n",
"70,96,0,yes\n",
"68,80,0,yes\n",
"65,70,1,no\n",
"64,65,1,yes\n",
"72,95,0,no\n",
"69,70,0,yes\n",
"75,80,0,yes\n",
"75,70,1,yes\n",
"72,90,1,yes\n",
"81,75,0,yes\n",
"71,91,1,no\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- `weather.nominal.csv`:\n",
"\n",
"```\n",
"outlook,temperature,humidity,windy,play\n",
"sunny,hot,high,FALSE,no\n",
"sunny,hot,high,TRUE,no\n",
"overcast,hot,high,FALSE,yes\n",
"rainy,mild,high,FALSE,yes\n",
"rainy,cool,normal,FALSE,yes\n",
"rainy,cool,normal,TRUE,no\n",
"overcast,cool,normal,TRUE,yes\n",
"sunny,mild,high,FALSE,no\n",
"sunny,cool,normal,FALSE,yes\n",
"rainy,mild,normal,FALSE,yes\n",
"sunny,mild,normal,TRUE,yes\n",
"overcast,mild,high,TRUE,yes\n",
"overcast,hot,normal,FALSE,yes\n",
"rainy,mild,high,TRUE,no\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use Decision Tress from the `scikit-learn` library to create accurate models, first for the numerical dataset, then for the nominal dataset.\n",
"\n",
"Explain your reasoning, and justify any choices of the hyperparameters (and/or run experiments to find the optimal ones).\n",
"\n",
"Use the provided datasets for training, and create testing datasets based on your experience.\n",
"\n",
"Evaluate your models, and use visualisation to show the trees and any relevant plots."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Marking scheme\n",
"\n",
"|Item|Mark|\n",
"|:----|---:|\n",
"|**Numerical dataset**:||\n",
"|Explanation, Justification|/4|\n",
"|DT model|/3|\n",
"|Evaluation|/3|\n",
"|**Nominal dataset**:||\n",
"|Explanation, Justification|/4|\n",
"|DT model|/3|\n",
"|Evaluation|/3|\n",
"|||\n",
"|**Total**: |/20|\n"
]
},
{
"cell_type": "code",
"execution_count": 192,
"metadata": {
"ExecuteTime": {
"end_time": "2022-10-25T11:37:52.790405Z",
"start_time": "2022-10-25T11:37:50.972952Z"
}
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 193,
"metadata": {
"ExecuteTime": {
"end_time": "2022-10-25T11:37:52.837819Z",
"start_time": "2022-10-25T11:37:52.790405Z"
}
},
"outputs": [],
"source": [
"data1 = pd.read_csv('datasets/weather.numeric.csv')\n",
"data2 = pd.read_csv('datasets/weather.nominal.csv')"
]
},
{
"cell_type": "code",
"execution_count": 194,
"metadata": {
"ExecuteTime": {
"end_time": "2022-10-25T11:37:52.890705Z",
"start_time": "2022-10-25T11:37:52.843562Z"
}
},
"outputs": [
{
"data": {
"text/plain": " temperature humidity windy play\n0 85 85 0 no\n1 80 90 1 no\n2 83 86 0 yes\n3 70 96 0 yes\n4 68 80 0 yes\n5 65 70 1 no\n6 64 65 1 yes\n7 72 95 0 no\n8 69 70 0 yes\n9 75 80 0 yes\n10 75 70 1 yes\n11 72 90 1 yes\n12 81 75 0 yes\n13 71 91 1 no",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>temperature</th>\n <th>humidity</th>\n <th>windy</th>\n <th>play</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>85</td>\n <td>85</td>\n <td>0</td>\n <td>no</td>\n </tr>\n <tr>\n <th>1</th>\n <td>80</td>\n <td>90</td>\n <td>1</td>\n <td>no</td>\n </tr>\n <tr>\n <th>2</th>\n <td>83</td>\n <td>86</td>\n <td>0</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>3</th>\n <td>70</td>\n <td>96</td>\n <td>0</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>4</th>\n <td>68</td>\n <td>80</td>\n <td>0</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>5</th>\n <td>65</td>\n <td>70</td>\n <td>1</td>\n <td>no</td>\n </tr>\n <tr>\n <th>6</th>\n <td>64</td>\n <td>65</td>\n <td>1</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>7</th>\n <td>72</td>\n <td>95</td>\n <td>0</td>\n <td>no</td>\n </tr>\n <tr>\n <th>8</th>\n <td>69</td>\n <td>70</td>\n <td>0</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>9</th>\n <td>75</td>\n <td>80</td>\n <td>0</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>10</th>\n <td>75</td>\n <td>70</td>\n <td>1</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>11</th>\n <td>72</td>\n <td>90</td>\n <td>1</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>12</th>\n <td>81</td>\n <td>75</td>\n <td>0</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>13</th>\n <td>71</td>\n <td>91</td>\n <td>1</td>\n <td>no</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"execution_count": 194,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data1"
]
},
{
"cell_type": "code",
"execution_count": 195,
"metadata": {
"ExecuteTime": {
"end_time": "2022-10-25T11:37:52.922669Z",
"start_time": "2022-10-25T11:37:52.895684Z"
}
},
"outputs": [
{
"data": {
"text/plain": " outlook temperature humidity windy play\n0 sunny hot high False no\n1 sunny hot high True no\n2 overcast hot high False yes\n3 rainy mild high False yes\n4 rainy cool normal False yes\n5 rainy cool normal True no\n6 overcast cool normal True yes\n7 sunny mild high False no\n8 sunny cool normal False yes\n9 rainy mild normal False yes\n10 sunny mild normal True yes\n11 overcast mild high True yes\n12 overcast hot normal False yes\n13 rainy mild high True no",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>outlook</th>\n <th>temperature</th>\n <th>humidity</th>\n <th>windy</th>\n <th>play</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>sunny</td>\n <td>hot</td>\n <td>high</td>\n <td>False</td>\n <td>no</td>\n </tr>\n <tr>\n <th>1</th>\n <td>sunny</td>\n <td>hot</td>\n <td>high</td>\n <td>True</td>\n <td>no</td>\n </tr>\n <tr>\n <th>2</th>\n <td>overcast</td>\n <td>hot</td>\n <td>high</td>\n <td>False</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>3</th>\n <td>rainy</td>\n <td>mild</td>\n <td>high</td>\n <td>False</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>4</th>\n <td>rainy</td>\n <td>cool</td>\n <td>normal</td>\n <td>False</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>5</th>\n <td>rainy</td>\n <td>cool</td>\n <td>normal</td>\n <td>True</td>\n <td>no</td>\n </tr>\n <tr>\n <th>6</th>\n <td>overcast</td>\n <td>cool</td>\n <td>normal</td>\n <td>True</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>7</th>\n <td>sunny</td>\n <td>mild</td>\n <td>high</td>\n <td>False</td>\n <td>no</td>\n </tr>\n <tr>\n <th>8</th>\n <td>sunny</td>\n <td>cool</td>\n <td>normal</td>\n <td>False</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>9</th>\n <td>rainy</td>\n <td>mild</td>\n <td>normal</td>\n <td>False</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>10</th>\n <td>sunny</td>\n <td>mild</td>\n <td>normal</td>\n <td>True</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>11</th>\n <td>overcast</td>\n <td>mild</td>\n <td>high</td>\n <td>True</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>12</th>\n <td>overcast</td>\n <td>hot</td>\n <td>normal</td>\n <td>False</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>13</th>\n <td>rainy</td>\n <td>mild</td>\n <td>high</td>\n <td>True</td>\n <td>no</td>\n </tr>\n 
</tbody>\n</table>\n</div>"
},
"execution_count": 195,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data2"
]
},
{
"cell_type": "markdown",
"source": [
"# Numeric Dataset"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"### Explanation/Justification"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"The numeric dataset was a relatively straightforward implementation using Scikit-learn, as the inputs from the csv set are numeric integers and require no preprocessing. So in the below code, I simply set my column names and imported the data using panda (skipping the first row to allow for title row), and then seperated out the features into my X DataFrame and my \"play\" labels into my Y DataFrame. I then used Scikit-learn's \"train_test_split\" function to separate this data into train and test data (70% train, 30% test) and shuffled the data once. I opted for a 70/30 split as with such a small pool of data, we require a large proportion of training data while still allowing for tests (I tried with lower percentage of test data but this doesn't provide enough inputs to perform the test). I shuffle the data only once as the numeric data is already unordered in the csv file. I then simply create my classifier and fit the train data to it. Once training is completed, I use the test data to evaluate the success of my model and use Scikit-learn's metrics library to print out the model accuracy.\n",
"\n",
"Once all is done, I plot my tree using basic parameters (as we are working on a very simple problem) and use the graphviz package to save the output as a pdf (to have data available for external use). These pdfs are already included in the project file, however if you would like to save a new pdf to test with your own parameters, please download graphviz at the following link and add it to PATH (for Windows users): https://graphviz.org/download/"
],
"metadata": {
"collapsed": false
}
},
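{
"cell_type": "markdown",
"metadata": {},
"source": [
"The split described above can be sketched on its own. This is a minimal illustration rather than the notebook's exact cell: the 14 rows are inlined from `weather.numeric.csv` so the snippet runs without the `datasets` folder, and the variable names are chosen for the example. It shows that `test_size=0.3` on 14 rows gives 9 training rows and 5 test rows, and that `random_state` is a seed for the shuffle (the same seed reproduces the same split), not a count of shuffles.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Rows copied from weather.numeric.csv so the sketch is self-contained.\n",
"rows = [\n",
"    (85, 85, 0, 'no'), (80, 90, 1, 'no'), (83, 86, 0, 'yes'), (70, 96, 0, 'yes'),\n",
"    (68, 80, 0, 'yes'), (65, 70, 1, 'no'), (64, 65, 1, 'yes'), (72, 95, 0, 'no'),\n",
"    (69, 70, 0, 'yes'), (75, 80, 0, 'yes'), (75, 70, 1, 'yes'), (72, 90, 1, 'yes'),\n",
"    (81, 75, 0, 'yes'), (71, 91, 1, 'no'),\n",
"]\n",
"df = pd.DataFrame(rows, columns=['temperature', 'humidity', 'windy', 'play'])\n",
"X = df[['temperature', 'humidity', 'windy']]\n",
"y = df['play']\n",
"\n",
"# test_size=0.3 on 14 rows: scikit-learn rounds the test set up to 5 rows.\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)\n",
"print(len(X_train), len(X_test))  # 9 5\n",
"\n",
"# Repeating the call with the same seed selects the same test rows.\n",
"_, X_test_again, _, _ = train_test_split(X, y, test_size=0.3, random_state=1)\n",
"print(list(X_test.index) == list(X_test_again.index))  # True\n",
"```\n",
"\n",
"Changing `random_state` to another integer would give a different, but equally reproducible, split."
]
},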
{
"cell_type": "markdown",
"source": [
"### DT Model"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 196,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.4\n",
"Tree PDF saved in project file\n"
]
},
{
"data": {
"text/plain": "<Figure size 640x480 with 1 Axes>",
"image/png": "\n"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.tree import export_graphviz\n",
"from sklearn.tree import plot_tree\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn import metrics\n",
"from sklearn import preprocessing\n",
"import graphviz\n",
"\n",
"columns = ['temperature', 'humidity', 'windy', 'play']\n",
"numeric_data = pd.read_csv('datasets/weather.numeric.csv', names=columns, skiprows=1)\n",
"\n",
"features = ['temperature', 'humidity', 'windy']\n",
"X = numeric_data[features] # Features\n",
"Y = numeric_data.play\n",
"\n",
"X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)\n",
"\n",
"clf = DecisionTreeClassifier()\n",
"clf.fit(X_train,Y_train)\n",
"\n",
"Y_pred = clf.predict(X_test)\n",
"\n",
"print(\"Accuracy:\",metrics.accuracy_score(Y_test, Y_pred))\n",
"\n",
"plot_tree(clf)\n",
"\n",
"dot_data = export_graphviz(clf, out_file=None)\n",
"graph = graphviz.Source(dot_data)\n",
"graph.render(\"numeric_data\")\n",
"print(\"Tree PDF saved in project file\")"
],
"metadata": {
"collapsed": false
}
},
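{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plotted tree is easier to read when its nodes are labelled. The following is a sketch under assumptions, not the notebook's own cell: it fits a tree on all 14 inlined rows, fixes `random_state=0` purely so the output is repeatable, and writes to a hypothetical file name, `numeric_tree_labelled.png`. `export_text` gives a plain-text view of the same tree that needs neither matplotlib nor Graphviz.\n",
"\n",
"```python\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree\n",
"\n",
"# Rows copied from weather.numeric.csv so the sketch is self-contained.\n",
"rows = [\n",
"    (85, 85, 0, 'no'), (80, 90, 1, 'no'), (83, 86, 0, 'yes'), (70, 96, 0, 'yes'),\n",
"    (68, 80, 0, 'yes'), (65, 70, 1, 'no'), (64, 65, 1, 'yes'), (72, 95, 0, 'no'),\n",
"    (69, 70, 0, 'yes'), (75, 80, 0, 'yes'), (75, 70, 1, 'yes'), (72, 90, 1, 'yes'),\n",
"    (81, 75, 0, 'yes'), (71, 91, 1, 'no'),\n",
"]\n",
"features = ['temperature', 'humidity', 'windy']\n",
"df = pd.DataFrame(rows, columns=features + ['play'])\n",
"\n",
"# random_state=0 is an assumption made only to keep the sketch repeatable.\n",
"clf = DecisionTreeClassifier(random_state=0).fit(df[features], df['play'])\n",
"\n",
"# Plain-text rendering of the fitted tree.\n",
"print(export_text(clf, feature_names=features))\n",
"\n",
"# Labelled plot: feature names on the splits, class names on the nodes.\n",
"fig, ax = plt.subplots(figsize=(10, 6))\n",
"class_names = [str(c) for c in clf.classes_]\n",
"plot_tree(clf, feature_names=features, class_names=class_names, filled=True, ax=ax)\n",
"fig.savefig('numeric_tree_labelled.png')  # hypothetical output file name\n",
"```\n",
"\n",
"The same `feature_names` and `class_names` arguments are also accepted by `export_graphviz`, so the saved PDF could be labelled in the same way."
]
},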
{
"cell_type": "markdown",
"source": [
"# Nominal Dataset"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"### Explanation/Justification"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"My implementation for the nominal dataset is very similar to my numeric implementation as much of the same functions and steps are used in both cases when building decision trees with Scikit-Learn. The main differences we have is an extra value in the training data and the need to preprocess the input into a format that works with classifier decision trees.\n",
"\n",
"When making a decision tree, training data must be integers, as our input data here is nominal we need to find a solution to convert our strings into ints in a useful way (not hashmap as this can cause memory management issues). So, we are using an encoder, 3 common encoding solutions exist, Ordinal Encoding, One-Hot Encoding and Dummy Variable Encoding. The Ordinal Encoder encodes our string inputs into ordered integers (if 3 options, then values are 0,1,2), a One-Hot Encoder turns strings into arrays of bits that each represent a possible value of that category (if 3 options, then values are [1,0,0], [0,1,0], [0,0,1]) and Dummy Variable Encoder acts similarly to One-Hot but removes redundant values by dropping unnecessary bits (if 3 options, then values are [1,0], [0,1], [0,0]). For our use case, we will use One-Hot for encoding as we do not have naturally ordered data and Dummy is only useful for much more complicated problems.\n",
"\n",
"To do encoding, we simply import Scikit-learn's preprocessing library, create our encoder (handle_unknown is set to ignore so that any input that doesn't fit is skipped instead of raising an error) and use it's fit_transform function to convert our data into an encoded form. We do this for both the X training data and X testing data.\n"
],
"metadata": {
"collapsed": false
}
},
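{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three encodings described above can be compared on a single column. This is a minimal sketch, not the notebook's own cell: the tiny `outlook`, `train` and `test` frames are made up for illustration, and Dummy Variable Encoding is approximated here with `OneHotEncoder(drop='first')`. The last part shows an encoder fitted on training data only and then reused on test data that contains an unseen category.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder\n",
"\n",
"# Hypothetical one-column frame with the three outlook values.\n",
"outlook = pd.DataFrame({'outlook': ['sunny', 'overcast', 'rainy']})\n",
"\n",
"# Categories are sorted alphabetically: overcast, rainy, sunny.\n",
"# Ordinal: overcast -> 0, rainy -> 1, sunny -> 2.\n",
"print(OrdinalEncoder().fit_transform(outlook).ravel())\n",
"\n",
"# One-hot: one column per category, exactly one 1 per row.\n",
"print(OneHotEncoder().fit_transform(outlook).toarray())\n",
"\n",
"# Dummy-style: the first category (overcast) is dropped and becomes all zeros.\n",
"print(OneHotEncoder(drop='first').fit_transform(outlook).toarray())\n",
"\n",
"# Fit on the training data only, then reuse the fitted encoder on the test data.\n",
"train = pd.DataFrame({'outlook': ['sunny', 'overcast', 'sunny']})\n",
"test = pd.DataFrame({'outlook': ['overcast', 'rainy']})  # 'rainy' is unseen\n",
"enc = OneHotEncoder(handle_unknown='ignore')\n",
"train_encoded = enc.fit_transform(train).toarray()\n",
"test_encoded = enc.transform(test).toarray()\n",
"\n",
"# Both arrays have two columns (overcast, sunny); the unseen 'rainy' row is all zeros.\n",
"print(train_encoded.shape, test_encoded.shape)\n",
"print(test_encoded)\n",
"```\n",
"\n",
"Calling `fit_transform` on the test frame instead would rebuild the columns from the test categories alone, so they would no longer line up with what the tree was trained on."
]
},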
{
"cell_type": "markdown",
"source": [
"### DT Model"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 197,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.4\n",
"Tree PDF saved in project file\n"
]
},
{
"data": {
"text/plain": "<Figure size 640x480 with 1 Axes>",
"image/png": "\n"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"columns = ['outlook','temperature', 'humidity', 'windy', 'play']\n",
"\n",
"nominal_data = pd.read_csv('datasets/weather.nominal.csv', names=columns, skiprows=1)\n",
"\n",
"features = ['outlook','temperature', 'humidity', 'windy']\n",
"X = nominal_data[features] # Features\n",
"Y = nominal_data.play\n",
"\n",
"X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)\n",
"\n",
"clf = DecisionTreeClassifier()\n",
"\n",
"enc = preprocessing.OneHotEncoder(handle_unknown='ignore')\n",
"X_train = enc.fit_transform(X_train)\n",
"\n",
"clf.fit(X_train,Y_train)\n",
"\n",
"X_test = enc.fit_transform(X_test)\n",
"Y_pred = clf.predict(X_test)\n",
"\n",
"print(\"Accuracy:\",metrics.accuracy_score(Y_test, Y_pred))\n",
"\n",
"plot_tree(clf)\n",
"\n",
"dot_data = export_graphviz(clf, out_file=None)\n",
"graph = graphviz.Source(dot_data)\n",
"graph.render(\"nominal_data\")\n",
"print(\"Tree PDF saved in project file\")"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusion"
]
},
{
"cell_type": "markdown",
"source": [
"For both of our decision trees, we have a fluctuating accuracy between 0.2 and 0.4, this means that when performing tests, the tree predicts the output correctly only between 20% and 40% of the time. As a rule of thumb, in machine learning, a result that can be considered a \"good accuracy\" level is at least 60% to 70% as this will lead to the model predicting correctly a majority of the time. Here, our models aren't performing accurately enough as there is a significant lack of data both for training and testing. To correct this, a new model can be created with much more data, below is a copy of our numeric implementation but with 50 rows of data instead of 14. New rows were added and some data was duplicated. As our quantity of data is much higher, I have reduced the percentage of test data to 20% and have shuffled the data an extra time."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 198,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 1.0\n",
"Tree PDF saved in project file\n"
]
},
{
"data": {
"text/plain": "<Figure size 640x480 with 1 Axes>",
"image/png": "\n"
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"columns = ['temperature', 'humidity', 'windy', 'play']\n",
"numeric_data = pd.read_csv('datasets/weather.numericBetter.csv', names=columns, skiprows=1)\n",
"\n",
"features = ['temperature', 'humidity', 'windy']\n",
"X = numeric_data[features] # Features\n",
"Y = numeric_data.play\n",
"\n",
"X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)\n",
"\n",
"clf = DecisionTreeClassifier()\n",
"clf.fit(X_train,Y_train)\n",
"\n",
"Y_pred = clf.predict(X_test)\n",
"\n",
"print(\"Accuracy:\",metrics.accuracy_score(Y_test, Y_pred))\n",
"\n",
"plot_tree(clf)\n",
"\n",
"dot_data = export_graphviz(clf, out_file=None)\n",
"graph = graphviz.Source(dot_data)\n",
"graph.render(\"numeric_data\")\n",
"print(\"Tree PDF saved in project file\")"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"As we can see, increasing the amount of data has made the accuracy jump up to 1.0, so the model should be trained to predict an outcome properly 100% of the time."
],
"metadata": {
"collapsed": false
}
},
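{
"cell_type": "markdown",
"metadata": {},
"source": [
"The accuracy figures discussed in this conclusion can be put in context with some simple arithmetic. The sketch below is an illustration under assumptions, not part of the original experiments: it inlines the 14 `play` labels from the provided dataset, computes the accuracy of always predicting the most common label, and lists which accuracy values are possible at all with a test set of 5 rows (30% of 14) or 10 rows (20% of 50, taking the 50-row figure from the text above).\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# The play column of the provided 14-row dataset.\n",
"play = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',\n",
"        'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']\n",
"\n",
"# Accuracy of always predicting the most common label (9 of 14 rows are 'yes').\n",
"counts = Counter(play)\n",
"baseline = counts.most_common(1)[0][1] / len(play)\n",
"print(round(baseline, 3))  # 0.643\n",
"\n",
"# With n test rows, accuracy can only take the values k / n.\n",
"for n_test in (5, 10):\n",
"    print(n_test, [k / n_test for k in range(n_test + 1)])\n",
"```\n",
"\n",
"On 5 test rows the accuracy moves in steps of 0.2, so a single prediction separates 0.2 from 0.4, which is why results on such a small test set should be read with caution."
]
},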
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# List of references\n",
"\n",
"- Jason Brownlee. (2020). Ordinal and One-Hot Encodings for Categorical Data. https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/\n",
"- Scikit Learn. (2022). 6.3.4.Encoding categorical features. https://scikit-learn.org/stable/modules/preprocessing.html\n",
"- Avinash Navlani. (2018). Decision Tree Classification in Python Tutorial. https://www.datacamp.com/tutorial/decision-tree-classification-python"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
},
"vscode": {
"interpreter": {
"hash": "6d1e45cadc3597bb8b6600530fbdf8c3eefe919a24ef54d9d32b318795b772e0"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}