ML-Classification Study with PyCaret

Automate key steps to evaluate and compare machine learning algorithms with PyCaret, an open-source machine learning library.

R.Caliskan · 6 min read · May 24, 2021

Anyone who works with data will tell you that most of their time goes into cleaning it. Sometimes we complain that there is no time left to implement machine learning models. We chase answers to many questions: which machine learning model should I use, which one gives optimal results, which hyperparameters score highest, how do I tune them… and so on.
In this article, I will introduce a magic Python library that helps with all of these problems and saves a lot of time with just a few lines of code. You have already noticed the title of the article and guessed that I am talking about PyCaret.

PyCaret is an open-source Python machine learning library, inspired by the caret package in R. Its own homepage describes it briefly:

“PyCaret is an open-source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within minutes in your choice of notebook environment... It is a simple and easy-to-use machine learning library that will help you to perform end-to-end ML experiments with fewer lines of code...” ¹

The library's strong point is that a lot can be achieved with very few lines of code and very little manual configuration. PyCaret automates many steps of a machine learning project, such as defining the data transforms to perform, evaluating and comparing standard models, tuning model hyperparameters, and more.

PyCaret is basically a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, CatBoost, LightGBM, spaCy, Optuna, Hyperopt, Ray, and many more, all brought together in one well-organized interface.

Now let’s test what this library can do on an example dataset. I will apply the models to the Palmer Penguins dataset, which Allison Horst released as an R package and which has become a popular alternative to the lovely Iris dataset.

As always, we first install PyCaret and import the necessary libraries.

#!pip install pycaret[full]
import numpy as np
import pandas as pd
from pycaret.classification import *

The next step is loading the penguins dataset with the pandas library.

df = pd.read_csv('penguins.csv')
df.info()
RangeIndex: 333 entries, 0 to 332
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   species            333 non-null    object
 1   island             333 non-null    object
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    int64
 5   body_mass_g        333 non-null    int64
 6   sex                333 non-null    object
dtypes: float64(2), int64(2), object(3)

The dataset consists of four numerical features, two categorical features, and one target column (sex).

The PyCaret workflow always starts with the setup() function, which prepares the entire machine learning pipeline environment. It must therefore be executed before any other function.

clf = setup(
    data=df,
    target='sex',
    session_id=44,
    train_size=0.8
)

The setup function can be customized as you wish; it has many parameters. The defaults are also fine, so I call it with just the necessary arguments. setup performs all the data preprocessing steps and also splits the data into training and test sets.
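setup() accepts dozens of optional arguments for preprocessing. As a hedged sketch of what a more customized call could look like (the parameter names below exist in PyCaret 2.x, but the values are purely illustrative, not what I used):

# Illustrative setup call with a few optional preprocessing arguments
# (example values only; this article uses the defaults shown above).
clf = setup(
    data=df,
    target='sex',
    session_id=44,
    train_size=0.8,
    normalize=True,                               # scale numeric features
    categorical_features=['species', 'island'],   # declare categoricals explicitly
    fix_imbalance=False,                          # resample skewed target classes if True
    fold=10                                       # folds for downstream cross-validation
)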

And after this, the magic of PyCaret begins: a single line of code, compare_models(), compares nearly 20 models and returns the results as a table.

best_model = compare_models()
[Image: Results of the compared models]

This function trains every model in the model library and scores it using k-fold cross-validation for metric evaluation. The output is a scoring table that shows the average results across the folds (10 by default) along with the training time.
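compare_models() itself takes a few handy arguments. A minimal sketch, assuming PyCaret 2.x (fold, sort, and n_select are documented parameters, and pull() fetches the last printed scoring table):

# Compare with 5 folds, sort the table by F1, and keep the top three models.
top3 = compare_models(fold=5, sort='F1', n_select=3)

# Grab the scoring table of the last-run function as a pandas DataFrame.
results = pull()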

Create Best Model

Creating a model in any module is as simple as writing create_model. It takes only one required parameter: the model name as a string.

# create an Extreme Gradient Boosting model
xgboost = create_model("xgboost")
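If you are unsure which string ID a model has, the models() helper lists every estimator available in the current module:

# List all available models; the DataFrame index holds the string IDs.
all_models = models()
print(all_models[['Name', 'Reference']])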

Hyperparameter Tuning

When a model is created with the create_model function, it is trained with its default hyperparameters. To tune them, the tune_model() function is used.

tuned_xgboost = tune_model(xgboost)
[Image: Results after tuning the hyperparameters]

This function automatically tunes a model’s hyperparameters using a random grid search over a pre-defined search space. The output prints a scoring grid that shows the results by fold. To search a custom grid instead, pass the custom_grid parameter to the tune_model function.
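As a sketch of what a custom grid could look like (the XGBoost parameter names are standard, but the value ranges here are purely illustrative):

# Illustrative custom search space; tune_model samples candidates from it.
params = {
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1, 0.3],
    'n_estimators': [100, 200, 500]
}
tuned_xgboost = tune_model(
    xgboost,
    custom_grid=params,
    n_iter=20,        # number of random candidates to evaluate
    optimize='F1'     # metric used to pick the best candidate
)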

Analyze Model

Analyzing the performance of a trained machine learning model is an integral step in any machine learning workflow, and with PyCaret it is as simple as writing plot_model(). The function takes the trained model object and the type of plot as a string.

plot_model(tuned_xgboost, plot='confusion_matrix')
plot_model(tuned_xgboost, plot='boundary')
plot_model(tuned_xgboost, plot='class_report')
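Instead of calling plot_model() once per plot, you can also use evaluate_model(), which renders an interactive widget with all available plots (in notebook environments):

# One interactive widget with every available analysis plot.
evaluate_model(tuned_xgboost)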

Predict and Finalize Model

The test set consists of the remaining 20% of the data that PyCaret split off automatically during setup; checking performance on it is important to see that the model is not overfitting. Now we will use the predict_model() function to predict on this unseen data.

# Make predictions on the test set
predict_model(tuned_xgboost)
[Image: Results on the unseen test data]

The finalize_model() function fits the model on the complete dataset, including the test/hold-out sample (20% in this case). Its purpose is to train the model on all available data before it is deployed to production.

# Finalize the model
finalized_xgboost = finalize_model(tuned_xgboost)

Once the model is finalized, the entire dataset, including the test/hold-out set, has been used for training. If the model is then used for predictions on the hold-out set, the results will be misleading, because you would be predicting on the same data the model was trained on.
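Fresh predictions should therefore come from genuinely new data. A minimal sketch, where new_penguins is a hypothetical DataFrame with the same feature columns (here faked by sampling rows from df and dropping the target):

# Hypothetical new observations with the same feature columns as df.
new_penguins = df.drop(columns=['sex']).sample(5, random_state=44)

# predict_model appends Label (and Score) columns with the predictions.
predictions = predict_model(finalized_xgboost, data=new_penguins)
print(predictions[['Label', 'Score']])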

Save and Load Model

PyCaret’s built-in save_model() function lets us save the entire transformation pipeline and the trained model object as a transferable binary pickle file for later use.

# Save the final model
save_model(finalized_xgboost, 'penguins_xgboost_v4')

save_model() returns the entire pipeline together with the file name; its (truncated) output looks like this:

(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='species',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_stra...
                  interaction_constraints='', learning_rate=0.4,
                  max_delta_step=0, max_depth=8,
                  min_child_weight=3, missing=nan,
                  monotone_constraints='()', n_estimators=140,
                  n_jobs=-1, num_parallel_tree=1,
                  objective='multi:softprob', random_state=44,
                  reg_alpha=0.1, reg_lambda=1e-07,
                  scale_pos_weight=12.100000000000001, subsample=1,
                  tree_method='auto', use_label_encoder=True,
                  validate_parameters=1, verbosity=0)]],
          verbose=False),
 'finalized_xgboost_model_2021-04-11 12:39.pkl')

To load a saved model later, in the same or a different environment, we use PyCaret’s load_model() function and can then easily apply the saved model to new, unseen data for prediction.
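A minimal sketch of reloading the pipeline and scoring new data with it (reusing the hypothetical new_penguins frame from above; load_model takes the file name without the .pkl extension):

# Reload the full pipeline from disk.
loaded_model = load_model('penguins_xgboost_v4')

# The loaded object includes all transformations, so raw feature columns go straight in.
predictions = predict_model(loaded_model, data=new_penguins)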

If you have time and plenty of compute (GPUs or TPUs), you can also try ensemble and stacked models with PyCaret with great joy.
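As a hedged sketch of what that looks like (blend_models and stack_models are part of PyCaret’s classification API; the choice of base models here is arbitrary):

# Train a few base models to combine.
lr = create_model('lr')
rf = create_model('rf')

# Voting ensemble of the base models.
blender = blend_models(estimator_list=[lr, rf, tuned_xgboost])

# Stacked ensemble: base-model predictions feed a meta-model.
stacker = stack_models(estimator_list=[lr, rf, tuned_xgboost])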

You can explore all of these experiments and many more through the resources below. Be happy and keep coding with PyCaret…

Resources:

¹ PyCaret homepage: https://pycaret.org
