NYC cab trips duration

Machine learning for transport analysis and prediction. Predict trip duration of NYC cabs.

Blog link
Machine Learning in Python
Tools: Jupyter Notebook, Python, NumPy + Pandas + Datetime + Plotly.express + Matplotlib + Math + Seaborn + Bokeh + Scikit-learn.

Question

How long do New York City cabs take to travel?

What I learned:

Data mining and aggregation
Data manipulation, outlier cleaning and preparation for analysis
Fitting data to ML model
Data analysis and visualization
Machine learning

Methodology

Data preparation includes the following steps: removal of outlier trips with 0 or more than 7 passengers, removal of trips longer than 3 hours (10800 s), and removal of trips outside the NYC boundary (longitude -74.03 to -73.75, latitude 40.63 to 40.9).
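The filtering steps above can be sketched with pandas as follows. This is a minimal sketch, not the notebook's code: the column names (`passenger_count`, `trip_duration`, `pickup_longitude`, and so on) are assumptions, and the four-row sample is made up for illustration.

```python
import pandas as pd

# Thresholds taken from the data preparation steps described above.
LON_MIN, LON_MAX = -74.03, -73.75
LAT_MIN, LAT_MAX = 40.63, 40.9
MAX_DURATION_S = 10800  # 3 hours


def filter_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Keep trips with 1-7 passengers, at most 3 hours long, inside the NYC box.

    Column names are assumed, not taken from the original notebook.
    """
    mask = (
        df["passenger_count"].between(1, 7)
        & (df["trip_duration"] <= MAX_DURATION_S)
        & df["pickup_longitude"].between(LON_MIN, LON_MAX)
        & df["dropoff_longitude"].between(LON_MIN, LON_MAX)
        & df["pickup_latitude"].between(LAT_MIN, LAT_MAX)
        & df["dropoff_latitude"].between(LAT_MIN, LAT_MAX)
    )
    return df[mask].reset_index(drop=True)


# Hand-made sample: one valid trip, one with 0 passengers,
# one longer than 3 hours, one picked up outside the bounding box.
sample = pd.DataFrame(
    {
        "passenger_count": [1, 0, 2, 3],
        "trip_duration": [600, 500, 20000, 700],
        "pickup_longitude": [-73.98, -73.98, -73.98, -75.00],
        "pickup_latitude": [40.75, 40.75, 40.75, 40.75],
        "dropoff_longitude": [-73.95, -73.95, -73.95, -73.95],
        "dropoff_latitude": [40.78, 40.78, 40.78, 40.78],
    }
)

filtered = filter_trips(sample)
retained_share = len(filtered) / len(sample)  # share of trips kept by the filters
print(f"kept {len(filtered)} of {len(sample)} trips ({retained_share:.0%})")
```

Building one boolean mask and applying it once keeps the original dataframe untouched, so the retained share can be computed directly from the two lengths.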
The share of trips kept after filtering comes to 99.85%. Next, we compute the difference between the pickup and dropoff geo-coordinates and add it to the dataframe, and calculate the length of 1 degree in km at a specific latitude with the Haversine formula (haversine(θ) = sin²(θ/2)). At the latitude of NYC (40.5), one degree of longitude is equal to 84553 m, and one geo-minute is equal to 1.42126. We then remove rows with a distance larger than one minute, extract month, day, hour, and day of week from the pickup datetime column, and split the dataset into train and test parts. Data testing.
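As an illustration of the Haversine step, the sketch below computes the great-circle distance between two points and uses it to measure one degree of longitude at latitude 40.5. The function name `haversine_m` and the mean Earth radius of 6,371,000 m are assumptions made for this example; the notebook may use different values.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_M):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # hav(d/r) = hav(dphi) + cos(phi1) * cos(phi2) * hav(dlambda),
    # where hav(theta) = sin^2(theta / 2)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * radius * math.asin(math.sqrt(a))


# Length of one degree of longitude at the latitude used in the text.
one_degree_lon_m = haversine_m(40.5, -74.0, 40.5, -73.0)
print(f"1 degree of longitude at 40.5 N is about {one_degree_lon_m / 1000:.1f} km")
```

With this radius the result lands close to the 84553 m quoted above; a different radius shifts it slightly.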
The prediction with linear regression gives a median absolute error of 291.0918991901533 seconds.
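The feature extraction, train/test split, and linear regression evaluation described above might look like the sketch below. It runs on synthetic data generated in the block itself, so the error it prints has no relation to the 291-second figure; the column names, the 80/20 split, and the feature list are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned trips table (not real data).
rng = np.random.default_rng(0)
n = 500
offsets_s = rng.integers(0, 180 * 24 * 3600, n)
trips = pd.DataFrame(
    {
        "pickup_datetime": pd.Timestamp("2016-01-01")
        + pd.to_timedelta(offsets_s, unit="s"),
        "distance_km": rng.uniform(0.2, 10.0, n),
    }
)
trips["trip_duration"] = 120 + 150 * trips["distance_km"] + rng.normal(0, 60, n)

# Month, day, hour, and day of week from the pickup datetime column.
dt = trips["pickup_datetime"].dt
trips["month"] = dt.month
trips["day"] = dt.day
trips["hour"] = dt.hour
trips["day_of_week"] = dt.dayofweek

features = ["distance_km", "month", "day", "hour", "day_of_week"]
X_train, X_test, y_train, y_test = train_test_split(
    trips[features], trips["trip_duration"], test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
medae = median_absolute_error(y_test, model.predict(X_test))
print(f"median absolute error: {medae:.1f} s")
```

The median absolute error is less sensitive to a few very long trips than the mean absolute error, which suits duration data with a long right tail.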
The next step is to load the RandomForest regressor and fit the model, compute the absolute error metric for RandomForest regression on the test set, and get predictions for the full dataset. Evaluating prediction accuracy.
The R2 score of the prediction is 0.7957444370115131.
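The RandomForest step and its R2 evaluation could be sketched as follows. Again the data is synthetic and the hyperparameters (`n_estimators=100`, fixed random seeds) are assumptions, so the printed score will not match the value reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import median_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic trips: duration depends on distance, with slower rush-hour traffic.
rng = np.random.default_rng(42)
n = 1000
distance_km = rng.uniform(0.2, 10.0, n)
hour = rng.integers(0, 24, n)
rush = ((hour >= 7) & (hour <= 9)) | ((hour >= 16) & (hour <= 19))
duration_s = 120 + distance_km * np.where(rush, 240, 150) + rng.normal(0, 60, n)

X = np.column_stack([distance_km, hour])
X_train, X_test, y_train, y_test = train_test_split(
    X, duration_s, test_size=0.2, random_state=0
)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Evaluate on the held-out test part.
pred_test = forest.predict(X_test)
medae = median_absolute_error(y_test, pred_test)
r2 = r2_score(y_test, pred_test)

# Prediction for the full dataset, as in the methodology.
full_pred = forest.predict(X)
print(f"median absolute error: {medae:.1f} s, R2: {r2:.3f}")
```

Scoring on the held-out part rather than on the full dataset avoids an optimistic R2, since the forest has already seen the training rows.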
To improve the trained model further, we could add data about weather conditions and traffic jams, and cluster trips by zip code.

Code

Data visualization and analysis

Pearson chart to identify outliers in time-related and geography-related data for data cleaning

Data Visualization of pickup and dropoff geo-coordinates with sns distplot, which shows that outliers exist in both directions.

Data Visualization of pickup and dropoff geo-coordinates with sns distplot after removal of remote geo-coordinates and framing the dataset to the NYC boundaries.

Data Visualization of pickup and dropoff of cab clients by vendor type.

Map of trips with Plotly.

Map of trips with Plotly, divided into 59 clusters, equal to the number of Neighborhood Boards in New York City.

Bar chart of trip frequency by number of passengers.
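The 59-cluster map listed above implies grouping trip coordinates into clusters. The text does not say which algorithm was used, so the sketch below assumes k-means from scikit-learn and runs it on random points inside the NYC bounding box from the methodology; both the algorithm choice and the data are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

N_CLUSTERS = 59  # number of clusters mentioned in the map caption

# Random pickup points inside the NYC bounding box (not real trips).
rng = np.random.default_rng(0)
n_points = 2000
coords = np.column_stack(
    [
        rng.uniform(40.63, 40.9, n_points),     # latitude
        rng.uniform(-74.03, -73.75, n_points),  # longitude
    ]
)

# k-means is an assumed choice; the original notebook may cluster differently.
kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0).fit(coords)
labels = kmeans.labels_            # cluster id for every point
centers = kmeans.cluster_centers_  # one (lat, lon) centre per cluster
print(f"{len(set(labels))} clusters, first centre: {centers[0]}")
```

The resulting `labels` array can be passed as the color argument of a Plotly scatter map to reproduce a cluster-colored view of the trips.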

RomanDataLab/NYC_cabs_trip_duration_ML
