Basic Modeling in scikit-learn
This is the memo of the 11th course (23 courses in all) of ‘Machine Learning Scientist with Python’ skill track.
You can find the original course on DataCamp.
Course Description
Machine learning models are easier to implement now than ever before. Without proper validation, the results of running new data through a model might not be as accurate as expected. Model validation allows analysts to confidently answer the question, "How good is your model?" We will answer this question for classification models using the complete set of tic-tac-toe endgame scenarios, and for regression models using fivethirtyeight's ultimate Halloween candy power ranking dataset. In this course, we will cover the basics of model validation, discuss various validation techniques, and begin to develop tools for creating validated and high-performing models.
Models tend to have higher accuracy on observations they have seen before. In the candy dataset, predicting the popularity of Skittles will likely have higher accuracy than predicting the popularity of Andes Mints; Skittles is in the dataset, and Andes Mints is not.
You've built a model based on 50 candies using the dataset X_train and need to report how accurate the model is at predicting the popularity of the 50 candies the model was built on, and of the 35 candies (X_test) it has never seen. You will use the mean absolute error, mae(), as the accuracy metric.
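A minimal sketch of that comparison, assuming the candy features and targets are already split into X_train, X_test, y_train, and y_test, and that mae refers to scikit-learn's mean_absolute_error:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae

# Assumed split of the candies: 50 for training, 35 held out for testing
model = RandomForestRegressor(n_estimators=50, random_state=1111)
model.fit(X_train, y_train)

# Error on candies the model has already seen (training data)
train_error = mae(y_train, model.predict(X_train))
# Error on candies the model has never seen (testing data)
test_error = mae(y_test, model.predict(X_test))

print(f"Train MAE: {train_error:.2f}, Test MAE: {test_error:.2f}")
```

Expect the training error to be noticeably lower than the testing error, since the model has already seen the training candies.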
When models perform differently on training and testing data, you should look to model validation to ensure you have the best performing model. In the next lesson, you will start building models to validate.
Predictive tasks fall into one of two categories: regression or classification. In the candy dataset, the outcome is a continuous variable describing how often the candy was chosen over another candy in a series of 1-on-1 match-ups. To predict this value (the win-percentage), you will use a regression model.
In this exercise, you will specify a few parameters using a random forest regression model, rfr.
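A sketch of how this can look (the parameter values here are illustrative, not the course's): parameters can be passed at initialization or updated on the model object afterwards.

```python
from sklearn.ensemble import RandomForestRegressor

# Initialize the model with a couple of parameters
rfr = RandomForestRegressor(n_estimators=50, random_state=1111)

# Parameters can also be updated after the model object has been created
rfr.n_estimators = 100
rfr.max_depth = 6
```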
You have updated parameters after the model was initialized. This approach is helpful when you need to update parameters. Before making predictions, let’s see which candy characteristics were most important to the model.
Although some candy attributes, such as chocolate, may be extremely popular, it doesn't mean they will be important to model prediction. After a random forest model has been fit, you can review the model's .feature_importances_ attribute to see which variables had the biggest impact. You can check how important each variable was in the model by looping over the feature importance array using enumerate().

If you are unfamiliar with Python's enumerate() function, it can loop over a list while also creating an automatic counter.
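A short sketch of that loop, assuming rfr has already been fit and X_train is a DataFrame whose columns are the candy characteristics:

```python
# Print how important each candy characteristic was to the fitted model
for i, importance in enumerate(rfr.feature_importances_):
    print(f"{X_train.columns[i]}: {importance:.3f}")
```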
No surprise here: chocolate is the most important variable. The .feature_importances_ attribute is a great way to see which variables were important to your random forest model.
In model validation, it is often important to know more about the predictions than just the final classification. When predicting who will win a game, most people are also interested in how likely it is a team will win.
| Probability | Prediction | Meaning |
| --- | --- | --- |
| < .50 | 0 | Team Loses |
| ≥ .50 | 1 | Team Wins |
In this exercise, you look at the methods .predict() and .predict_proba() using the tic_tac_toe dataset. The first method gives a prediction of whether Player One will win the game, and the second provides the probability of Player One winning. Use rfc as the random forest classification model.
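A sketch of the two methods, assuming rfc has already been fit and X_test holds the held-out Tic-Tac-Toe games:

```python
# Class predictions: 1 if Player One is predicted to win, 0 otherwise
predictions = rfc.predict(X_test)
print(predictions[:5])

# Probability estimates: one [P(lose), P(win)] pair per observation
predicted_probabilities = rfc.predict_proba(X_test)
print(predicted_probabilities[:5])

# Count of games where Player One is predicted to win
print((predictions == 1).sum())
```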
You can see there were 563 observations where Player One was predicted to win the Tic-Tac-Toe game. Also note that the predicted_probabilities array contains lists with only two values because there are only two possible responses (win or lose). Remember these two methods, as you will use them a lot throughout this course.
In this exercise, you use various methods to recall which parameters were used in a model.
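Two common ways to do this with a scikit-learn estimator, sketched here with the classifier rfc:

```python
# All parameters and their current values, returned as a dictionary
print(rfc.get_params())

# A single parameter can also be read directly as an attribute
print(rfc.n_estimators)
```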
Recalling which parameters were used will be helpful going forward. Model validation and performance rely heavily on which parameters were used, and there is no way to replicate a model without keeping track of the parameters used!
This exercise reviews the four modeling steps discussed throughout this chapter using a random forest classification model. You will:
Create a random forest classification model.
Fit the model using the tic_tac_toe dataset.
Make predictions on whether Player One will win (1) or lose (0) the current game.
Finally, you will evaluate the overall accuracy of the model.
Let’s get started!
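Putting the four steps together in one sketch, assuming the tic_tac_toe features and labels are already split into X_train, X_test, y_train, and y_test:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Step 1: create the random forest classification model
rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Step 2: fit the model on the training data
rfc.fit(X_train, y_train)

# Step 3: predict whether Player One wins (1) or loses (0)
predictions = rfc.predict(X_test)
print(predictions[:5])

# Step 4: evaluate the overall accuracy of the model
print(accuracy_score(y_test, predictions))
```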
That’s all the steps! Notice the first five predictions were all 1, indicating that Player One is predicted to win all five of those games. You also see the model accuracy was only 82%.
Let’s move on to Chapter 2 and increase our model validation toolbox by learning about splitting datasets, standard accuracy metrics, and the bias-variance tradeoff.
Replicating model performance is vital in model validation. Replication is also important when sharing models with co-workers, reusing models on new data, or asking questions on a coding Q&A website. You might use such a site to ask other coders about model errors, output, or performance. The best way to do this is to replicate your work by reusing model parameters.
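One simple way to replicate a model, sketched here with a hypothetical fitted classifier rfc, is to rebuild it from its recorded parameters:

```python
from sklearn.ensemble import RandomForestClassifier

# Capture the exact parameters of an existing model ...
params = rfc.get_params()

# ... and rebuild an identical, unfitted copy elsewhere
rfc_copy = RandomForestClassifier(**params)
```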