# Predicting Gender from Boston Marathon Finishing Times

In Spring 2020, I participated in a semester-long Data Science course offered by Forge (formerly HackCville). Throughout this course, I developed data science skills such as data cleaning, machine learning, and web scraping in Python.

For my final project in the Data Science course, I wanted to build upon the machine learning skills I learned throughout the semester. As an avid runner, I was interested in topics related to health and fitness. While browsing through Kaggle, I stumbled upon a dataset with information about the Boston Marathon. I chose the most recent dataset from 2017 and decided to predict gender using marathon finishing times and other variables. The data on Kaggle is linked here and my code on GitHub is linked here.

# Data Cleaning

To begin, I focused on cleaning the data. Each row in the dataset represents a finisher in the 2017 Boston Marathon. There were over 20 columns with variables such as Name, Gender, Country, Division, 5K time, Half Marathon time, and Marathon finishing time. There were also numerous ‘Unnamed’ columns and columns containing only NaN values. I deleted these columns and then focused on the time data.
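This cleanup step can be sketched in pandas; the frame below is a hypothetical miniature of the real dataset, invented just to show the two drops:

```python
import pandas as pd
import numpy as np

# Hypothetical miniature of the marathon data; the real file has 20+ columns.
df = pd.DataFrame({
    "Name": ["Runner A", "Runner B"],
    "M/F": ["M", "F"],
    "Unnamed: 8": [0, 1],          # auto-generated junk column
    "Proj Time": [np.nan, np.nan], # column that is entirely NaN
})

# Drop the auto-generated 'Unnamed' columns, then any column that is all NaN.
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]
df = df.dropna(axis=1, how="all")

print(df.columns.tolist())  # ['Name', 'M/F']
```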

The dataset provided split times for each finisher, including 5K, 10K, Half Marathon, and 30K, along with the Boston Marathon finishing time. The times were presented in an HH:MM:SS format, so I converted them to datetime and then to minutes. For example, to convert the 5K time, I converted the column to datetime and then created a new column with the time in minutes using a lambda function.
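A minimal sketch of that conversion, assuming a 5K column of HH:MM:SS strings (the sample times are made up):

```python
import pandas as pd

df = pd.DataFrame({"5K": ["00:17:30", "00:20:00", "00:25:15"]})  # HH:MM:SS strings

# Parse the strings as datetimes, then extract total minutes with a lambda.
df["5K"] = pd.to_datetime(df["5K"], format="%H:%M:%S")
df["5K Minutes"] = df["5K"].apply(lambda t: t.hour * 60 + t.minute + t.second / 60)

print(df["5K Minutes"].tolist())  # [17.5, 20.0, 25.25]
```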

# Exploratory Data Analysis

Next, I moved on to exploratory data analysis. I created a series of visualizations to give me a better idea of the data I was working with.

First, I explored the age distribution of the 2017 Boston Marathon finishers. The largest proportion of runners were between the ages of 45 and 50, although there were peaks around ages 35 and 40 as well. This distribution surprised me; I had expected the majority of runners to be in their 20s and 30s.
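A histogram along these lines can be sketched with matplotlib; the ages here are synthetic stand-ins, not the real finisher data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for finisher ages (the real data came from the Kaggle file).
rng = np.random.default_rng(0)
ages = rng.normal(45, 10, 1000).clip(18, 80)

fig, ax = plt.subplots()
ax.hist(ages, bins=range(18, 85, 5), edgecolor="black")
ax.set_xlabel("Age")
ax.set_ylabel("Number of Finishers")
ax.set_title("Age Distribution of 2017 Boston Marathon Finishers")
fig.savefig("age_distribution.png")
```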

Next, I created a scatter plot to demonstrate the relationship between 5K finishing times and Boston Marathon finishing times. I split the data by gender and assigned different colors to each. As expected, there was a strong, positive correlation between 5K and Marathon finishing times. However, there were potential outliers.
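The gender-colored scatter plot might be built like this sketch, with synthetic times standing in for the real 5K and marathon columns:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in: marathon time roughly tracks 9x the 5K time plus noise.
rng = np.random.default_rng(1)
n = 200
k5 = rng.normal(22, 3, n)                    # 5K time, minutes
marathon = k5 * 9 + rng.normal(0, 15, n)     # marathon time, minutes
gender = rng.choice(["M", "F"], n)

fig, ax = plt.subplots()
for g, color in [("M", "tab:blue"), ("F", "tab:orange")]:
    mask = gender == g
    ax.scatter(k5[mask], marathon[mask], c=color, label=g, alpha=0.6)
ax.set_xlabel("5K Time (minutes)")
ax.set_ylabel("Marathon Time (minutes)")
ax.legend()
fig.savefig("5k_vs_marathon.png")
```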

Finally, I created two bar plots, separated by gender, showing the proportion of 2017 Boston Marathon finishers from the five most represented countries: the United States, Canada, Great Britain, Mexico, and China.

# Variable Creation

After exploring the relationships between different variables, I created some variables of my own. The objective of this project is to predict gender using 2017 Boston Marathon finishing times and additional variables, so I wanted to create variables that would make my predictions more accurate. I created two new variables:

- I multiplied the half marathon finishing time by 2 and stored the result in a new column called Predicted Time from Half.
- Similarly, I multiplied the average Boston Marathon pace by 26.2 (the race distance in miles) and stored the result in a new column called Predicted Time from Pace.
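The two engineered variables can be sketched as follows, with made-up times and paces in place of the real columns:

```python
import pandas as pd

# Hypothetical sample rows: half marathon time in minutes, pace in minutes/mile.
df = pd.DataFrame({
    "Half": [90.0, 105.5],
    "Pace": [7.0, 8.2],
})

# Doubling the half marathon time approximates the full marathon time.
df["Predicted Time from Half"] = df["Half"] * 2
# Average pace times the 26.2-mile distance gives another estimate.
df["Predicted Time from Pace"] = df["Pace"] * 26.2

print(df)
```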

As expected, these predictions were very close to the actual marathon finishing times. I proceeded with caution, knowing that using these variables may not give me the best indicator of how well my model is performing.

# Machine Learning Models

I tried three different models to predict gender: logistic regression, K-nearest neighbors, and a random forest classifier. The variables I included in my models were 5K time, Half Marathon time, Division, Marathon Pace, and Boston Marathon finishing time. I compared the F1 scores of all three models to determine which one fit best.

Before I began modeling, I used LabelEncoder to transform the y_train and y_test label sets, encoding the gender labels as integers.
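A minimal sketch of that encoding step, with toy label lists:

```python
from sklearn.preprocessing import LabelEncoder

y_train = ["M", "F", "F", "M"]
y_test = ["F", "M"]

# Fit on the training labels, then apply the same mapping to both splits.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

print(y_train_enc.tolist())  # [1, 0, 0, 1]  ('F' -> 0, 'M' -> 1)
```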

## Logistic Regression

My first model was logistic regression, and it was fairly successful. I used a grid search to determine the best combination of hyperparameters for my model. The four hyperparameters I tuned were C, penalty, solver, and multi_class. The most successful logistic regression model was:
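The fitted model appeared as an image in the original post. As a runnable sketch of a comparable grid search, here is a version with synthetic data and illustrative parameter values (not the actual winning hyperparameters; multi_class is omitted because it is deprecated in recent scikit-learn):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the five marathon features used in the post.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative grid; liblinear supports both l1 and l2 penalties.
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring="f1", cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(f1_score(y_test, grid.predict(X_test)))
```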

The F1 score for my logistic regression model was 0.721196. This was the lowest score of the three models.

## K-Nearest Neighbors

The K-nearest neighbors model proved to be the most accurate. I tuned three hyperparameters: n_neighbors, p, and n_jobs. The most successful model was:
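The winning estimator was likewise shown as an image. A hedged sketch of such a search, on synthetic data with illustrative values (n_jobs only parallelizes and does not affect accuracy, so it is set directly rather than searched):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the marathon features.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Search neighborhood size and distance metric (p=1 Manhattan, p=2 Euclidean).
param_grid = {"n_neighbors": [3, 5, 11, 21], "p": [1, 2]}
grid = GridSearchCV(KNeighborsClassifier(n_jobs=-1), param_grid,
                    scoring="f1", cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(f1_score(y_test, grid.predict(X_test)))
```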

The F1 score for my K-nearest neighbors model was 0.924229. This was the highest F1 score out of all three models.

## Random Forest Classifier

The random forest classifier was the most fun to work with. I tuned multiple hyperparameters, including max_depth, random_state, min_samples_split, max_features, and verbose. The best model I generated was:
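Again, the best estimator was shown as an image. A sketch of a comparable search on synthetic data, with illustrative grid values (random_state and verbose do not affect accuracy, so they are fixed on the estimator rather than searched):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the marathon features.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative values for the tree-shape hyperparameters named above.
param_grid = {
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 10],
    "max_features": ["sqrt", None],
}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                    scoring="f1", cv=3)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(f1_score(y_test, grid.predict(X_test)))
```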

The F1 score for the random forest model was 0.909221. With these hyperparameters, it was almost as accurate as the K-nearest neighbors model.

# Conclusions

The F1 score is the harmonic mean of precision and recall; it ranges from 0 to 1, with 1 being the best value. My most successful model was the K-nearest neighbors model, with an F1 score of 0.924229. In other words, this model did a good job of predicting gender from the input variables. However, I must be cautious about evaluating my models solely with the F1 score and should consider other performance tools such as confusion matrices.

Throughout this project, I learned a lot about the challenges of data science. I anticipated that the data would be fairly easy to work with, but I was wrong. I was tasked with doing a significant amount of data cleaning before I could even begin visually exploring the data. Additionally, I had multiple complications with the types of variables used in the models.

My original goal was to predict Boston Marathon finishing times from the other variables, including gender, but I flipped the question to simplify the analysis. In the future, I would like to perform the analysis the other way and predict marathon finishing times. Overall, I gained valuable experience in machine learning and look forward to refining these skills in future projects.