In Spring 2020, I participated in a semester-long Data Science course offered by Forge (formerly HackCville). Throughout this course, I developed data science skills such as data cleaning, machine learning and web scraping in Python.
The second project for this course was a machine learning project focused on predicting the quality of wine from variables such as citric acid, density, pH and alcohol. To see who could build the most successful model, we created a Kaggle competition that scored each competitor’s model using the F1 score. The data on Kaggle is linked here and my code on GitHub is linked here.
Luckily, the dataset presented to us had already been cleaned (a pleasant surprise!) so there was no further cleaning I had to do.
Exploratory Data Analysis
To explore the data, I used Seaborn to generate a pairplot of all the variables. The pairplot visualizes the pairwise relationships between variables. Although no relationship stood out strongly, the two strongest were between fixed acidity and density, and between fixed acidity and pH.
Machine Learning Models
I split the dataset into train and test sets so I could evaluate the performance of each model on data that it wasn’t trained on. I created three different models to predict the quality of wine: logistic regression, K-nearest neighbors, and random forest classifiers. I used all 11 variables for prediction in the three models. To determine which model was best at prediction, I computed the F1 score.
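The split-and-score workflow can be sketched with scikit-learn as follows. This is not my exact competition code: the data is a synthetic stand-in with 11 features and a binary label, and the 20% test size, the stratified split and the helper name `holdout_f1` are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (assumption): 11 features, binary label.
X, y = make_classification(n_samples=500, n_features=11, n_informative=6, random_state=0)

# Hold out 20% of the rows (assumed size); stratify keeps the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

def holdout_f1(model):
    """Fit on the training split and return the F1 score on the held-out split."""
    model.fit(X_train, y_train)
    return f1_score(y_test, model.predict(X_test))

score = holdout_f1(LogisticRegression(max_iter=1000))
print(f"held-out F1: {score:.3f}")
```

Any of the three classifiers can be passed to the helper, which keeps the comparison on the same held-out rows.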
The first model I created was a logistic regression model. I used three hyperparameters in my model: C, solver and penalty. To fit the best model, I used GridSearch and concluded that this was the most successful logistic regression model:
The F1 score for this model was 0.752411, so the model did a decent job of predicting the quality of wine.
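For readers who want to try a similar search, here is a minimal sketch using scikit-learn's `GridSearchCV` over the same three hyperparameters. The candidate values in `param_grid`, the 5-fold cross-validation and the synthetic binary data are assumptions for illustration, not the grid I actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the wine data (assumption): 11 features, binary label.
X, y = make_classification(n_samples=500, n_features=11, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical candidate values; both solvers support the l1 and l2 penalties.
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "solver": ["liblinear", "saga"],
    "penalty": ["l1", "l2"],
}

# Try every combination with 5-fold cross-validation, ranking by F1.
search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

# The best combination is refit on the full training split by default.
test_score = f1_score(y_test, search.best_estimator_.predict(X_test))
print(search.best_params_, round(test_score, 3))
```

Because the search only sees the training split, the test score remains an honest estimate of how the chosen model performs on unseen data.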
The K-nearest neighbors model was my least successful model. I included two hyperparameters in my model: n_neighbors and p. The model was:
The F1 score for this model was 0.715189, so it performed slightly worse than the logistic regression model.
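To illustrate what the two hyperparameters control, the sketch below compares a few combinations with cross-validation. In scikit-learn, `n_neighbors` is the number of neighbors that vote and `p` is the power of the Minkowski distance (1 for Manhattan, 2 for Euclidean). The candidate values and the synthetic data are assumptions, not the settings from my model.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the wine data (assumption): 11 features, binary label.
X, y = make_classification(n_samples=500, n_features=11, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Mean cross-validated F1 for each (n_neighbors, p) pair, on the training split only.
scores = {}
for k, p in product([3, 5, 11], [1, 2]):
    knn = KNeighborsClassifier(n_neighbors=k, p=p)
    scores[(k, p)] = cross_val_score(knn, X_train, y_train, cv=5, scoring="f1").mean()

best_k, best_p = max(scores, key=scores.get)
print(f"best n_neighbors={best_k}, p={best_p}")
```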
Random Forest Classifier
The random forest classifier model was the best model out of the three. I used five hyperparameters: max_depth, random_state, min_samples_split, max_features and verbose. The model was:
The F1 score for this model was the highest of the three at 0.805194. This model did a good job at predicting the quality of wine.
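A random forest with these five arguments could be set up roughly as below. The specific values are placeholders rather than the ones I submitted, and the data is synthetic. One thing worth knowing: `random_state` and `verbose` do not change what the forest can learn. The first fixes the random seed so results are reproducible, and the second only controls how much logging is printed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (assumption): 11 features, binary label.
X, y = make_classification(n_samples=500, n_features=11, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical settings for the five arguments named in the text.
settings = dict(
    max_depth=10,           # cap on the depth of each tree
    random_state=42,        # fixed seed, for reproducibility only
    min_samples_split=4,    # minimum samples needed to split a node
    max_features="sqrt",    # features considered at each split
    verbose=0,              # no progress logging
)

forest_a = RandomForestClassifier(**settings).fit(X_train, y_train)
forest_b = RandomForestClassifier(**settings).fit(X_train, y_train)

pred_a = forest_a.predict(X_test)
pred_b = forest_b.predict(X_test)
forest_f1 = f1_score(y_test, pred_a)
print(f"held-out F1: {forest_f1:.3f}")
```

Fitting the forest twice with the same seed gives identical predictions, which is why fixing `random_state` makes scores comparable between runs.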
The F1 score is the harmonic mean of precision and recall and ranges from 0 to 1, with its best value at 1. My most successful model was the random forest classifier, with an F1 score of 0.805194. In other words, this model did a good job of predicting wine quality from the input variables. However, I should be cautious about judging my models solely by the F1 score and should also consider other performance measures, such as confusion matrices, before drawing final conclusions.
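To make the metric concrete, here is a small worked example in plain Python that builds the four confusion-matrix counts for a binary problem and computes F1 as 2PR / (P + R), where P is precision and R is recall. The labels are made up for the example and the function names are my own.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 4 positives and 4 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
f1 = f1_from_counts(tp, fp, fn)
print(tp, fp, fn, tn, round(f1, 4))  # 2 1 2 3 0.5714
```

Here precision is 2/3 and recall is 1/2, so F1 is 4/7, a little below their simple average of 7/12. The confusion counts also show where the errors come from (two missed positives, one false alarm), which a single F1 number hides.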
Since Kaggle calculated the F1 score of our models with only 20% of the test data, its values differed slightly from the ones I calculated. Compared to the other 12 participants in the competition, my model was the most successful at predicting wine quality on that 20% of the test data, with a mean F1 score of 0.84375.
This was the first machine learning project that I completed independently, and I was very excited not only to develop new skills but also to place first in the Kaggle competition! I learned the differences between machine learning models, how to fit the best model by tuning hyperparameters and how to compare models using the F1 score. In future projects, I would like to tune hyperparameters on all models and use more performance metrics to evaluate them.