The provided code snippet and project overview suggest a data analytics coursework project that analyzes and visualizes transportation-related data using Python libraries such as pandas, Dask, Plotly, and scikit-learn. Below is an overview of the project and the skills acquired from it:
Overview of the Project:
Data Loading and Cleaning:
The project begins by loading data from two CSV files (Trips_Full_data.csv and trips_by_distance.csv) with Dask, which can handle datasets too large to fit in memory. Data cleaning consists of dropping rows with missing values (dropna()).

Exploratory Data Analysis (EDA):
Question 1: Compute the number of unique years (Year of Date) and the mean number of trips (Trips 1-25 Miles) per year.
Question 2: Select trips that meet specific conditions (Number of Trips 10-25 > 10,000,000), then visualize them as scatter plots with Plotly.

Parallel Computing with Dask:
Computation time is measured with different numbers of processors (10 and 20) to evaluate performance. Dask's Client provides the parallel processing.

Machine Learning Analysis:
scikit-learn's LinearRegression is used to build a predictive model. The attempt to fit the model on columns from Trips_Full_data.csv failed because of incompatible data types (ValueError: could not convert string to float).

Skills Acquired:
Data Handling with Dask: Efficiently handling large datasets that exceed memory capacity using Dask dataframes.
Data Cleaning and Preparation: Applying data cleaning techniques such as dropping missing values (dropna()).
Exploratory Data Analysis (EDA): Conducting basic statistical analyses and visualizations to gain insights into the dataset.
Visualization with Plotly: Creating interactive visualizations (scatter plots) to represent data trends.
Parallel Computing with Dask: Understanding parallel computing concepts and implementing them in practice with Dask's distributed computing capabilities.
Machine Learning Fundamentals: Building a simple predictive model using scikit-learn's LinearRegression.

Conclusion:
Despite the error in the machine learning part, the project demonstrates proficiency in handling and analyzing large datasets, performing basic statistical computations, visualizing data trends, and using parallel processing for improved performance. Resolving the machine learning error would require making the data compatible with the model: the string-valued columns must be converted to numeric types, or excluded, during preprocessing before the model is fitted.
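The preprocessing mentioned in the conclusion could look roughly like the sketch below. It is not the project's actual code. The toy data, the column "Trips 25-100 Miles", the helper name to_numeric_frame, and the guess that the offending strings are numbers with thousands separators or placeholders such as "n/a" are all assumptions made for illustration, since the source does not say which columns or values triggered the ValueError.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the real data. Column names and values are hypothetical;
# the source does not show which columns caused the error.
raw = pd.DataFrame({
    "Trips 1-25 Miles": ["1,000", "2,000", "3,000", "n/a", "5,000"],
    "Trips 25-100 Miles": ["100", "200", "300", "400", "500"],
})


def to_numeric_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce every column to numbers and drop rows that cannot be parsed."""
    cleaned = df.apply(
        lambda col: pd.to_numeric(
            # Assumption: commas are thousands separators, so strip them.
            col.astype(str).str.replace(",", "", regex=False),
            errors="coerce",  # unparseable strings become NaN
        )
    )
    return cleaned.dropna()


numeric = to_numeric_frame(raw)

X = numeric[["Trips 1-25 Miles"]]   # 2-D feature matrix, as scikit-learn expects
y = numeric["Trips 25-100 Miles"]   # 1-D target

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)
```

With errors="coerce", values that cannot be parsed become NaN and the affected rows are then dropped, which keeps the fit from failing but discards data. Whether that is acceptable depends on how many such rows the real files contain, which the source does not state.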