5011CEM-2324JANMAY/5011

5011

This repository contains a data analytics coursework project that analyzes and visualizes transportation-related data using Python libraries including Pandas, Dask, Plotly, and scikit-learn. An overview of the project and the skills acquired from it follows.

Overview of the Project:

Data Loading and Cleaning:

The project begins by loading data from two CSV files (Trips_Full_data.csv and trips_by_distance.csv) with Dask, chosen for its ability to handle large datasets. Data cleaning consists of dropping rows with missing values (dropna()).

Exploratory Data Analysis (EDA):

Question 1: Compute the number of unique years (Year of Date) and the mean number of trips (Trips 1-25 Miles) per year.
Question 2: Select trips that meet a specific condition (Number of Trips 10-25 > 10,000,000), then visualize them as scatter plots with Plotly.

Parallel Computing with Dask:

Computation time is measured with different numbers of processors (10 and 20). Dask's Client provides the parallel processing.

Machine Learning Analysis:

scikit-learn's LinearRegression is used to build a predictive model. The attempt to fit the model on columns from Trips_Full_data.csv failed with an error caused by incompatible data types (ValueError: could not convert string to float).

Skills Acquired:

Data Handling with Dask: Learned how to efficiently handle large datasets that exceed memory capacity using Dask dataframes.
Data Cleaning and Preparation: Data cleaning techniques such as dropping missing values (dropna()).
Exploratory Data Analysis (EDA): Conducting basic statistical analyses and visualizations to gain insights into the dataset.
Visualization with Plotly: Creating interactive visualizations (scatter plots) to represent data trends.
Parallel Computing with Dask: Understanding parallel computing concepts and their practical implementation using Dask's distributed computing capabilities.
Machine Learning Fundamentals: Introduction to building a simple predictive model using scikit-learn's LinearRegression.

Conclusion:

Despite the error in the machine learning part, the project demonstrates proficiency in handling and analyzing large datasets, performing basic statistical computations, visualizing data trends, and using parallel processing for improved performance. Addressing the machine learning error would involve preprocessing the data, for example converting string columns to numeric types, before model fitting.
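One possible way to address the ValueError from the machine learning section is to coerce the columns to numeric types before fitting. The sketch below assumes the strings contain thousands separators or placeholder text, which is one common cause of this error but is not confirmed by the overview. The sample values, the choice of target column, and the helper `to_numeric_frame` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up string data that would trigger "could not convert string to float"
# if passed to LinearRegression directly.
raw = pd.DataFrame({
    "Trips 1-25 Miles": ["1,000", "2,000", "3,000", "n/a", "5,000"],
    "Trips 25-100 Miles": ["100", "200", "300", "400", "500"],
})


def to_numeric_frame(df):
    """Strip thousands separators, coerce to numbers, drop unparseable rows."""
    cleaned = df.apply(
        lambda col: pd.to_numeric(
            col.astype(str).str.replace(",", "", regex=False),
            errors="coerce",  # unparseable strings become NaN
        )
    )
    return cleaned.dropna()


numeric = to_numeric_frame(raw)

X = numeric[["Trips 1-25 Miles"]]   # feature matrix (2-D)
y = numeric["Trips 25-100 Miles"]   # target vector (1-D)

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)
```

Coercing with `errors="coerce"` and then calling `dropna()` discards rows that cannot be parsed; whether dropping or imputing is appropriate depends on the real data.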
