Using Snowpark and Model Registry for Machine Learning – Part 2

After preparing our data in the first part of this mini-series, we can now turn our attention to the training of machine learning (ML) models and how to manage the different models and versions, as well as how to use them for inference.

 

 

Snowflake continues to innovate with features tailored for data scientists and engineers. Among its many tools, the Model Registry feature stands out as a game-changer in simplifying model management and collaboration. In this second part of our mini-series, we’ll delve into the feature’s details, exploring how it streamlines the lifecycle management of machine learning models within a single, unified schema-level object. By providing a centralised repository for storing, versioning, and deploying models, Model Registry enables efficient model management, offers the flexibility to work with different machine learning packages, and makes it easy to use the stored models for predictions.

 

We’ll walk you through its core functionalities and benefits by continuing with the real-world use case presented in the previous blog post. Whether you’re a data scientist, engineer, or business stakeholder, understanding how to leverage this feature will unlock new possibilities for driving innovation and competitiveness within your organisation.

 

 

Training Machine Learning Models in Snowpark

 

With our data prepared and in place, we can start training three distinct models: ‘LogisticRegression’ and ‘XGBClassifier’ from the snowflake.ml.modeling package, and ‘GradientBoostingRegressor’ from scikit-learn. After training, we’ll use the Model Registry feature to securely store these trained models in our Snowflake database.

 

It’s important to note that, during this phase of the blog post, we won’t be looking at model tuning or validation techniques. Our primary objective is to explore the possibilities that Model Registry offers, emphasising its role in efficiently managing and organising machine learning models within the Snowflake ecosystem.

 

We will start by creating our train and test data:

 

# We sample the data so that 80% of the rows for each RATING value (1 and 0) end up in the training set
train_sdf = snowpark_df.sample_by("RATING", {"1": 0.8, "0": 0.8})
train_sdf = train_sdf.cache_result()

# The test data will be the rows from the table that are not in train_sdf
test_sdf = snowpark_df.minus(train_sdf)

 

As previously observed, the distribution of our target variable within our dataset is somewhat imbalanced: 68.6% (BAD) – 31.4% (GOOD). To enhance the quality of our training data, we’ll use SMOTE (from the imbalanced-learn package), an oversampling technique designed to balance the class distribution in a dataset:

 

# Load the training data into a pandas dataframe
train_pdf = train_sdf.to_pandas()

# Define features and label
feature_cols = train_sdf.columns
feature_cols.remove('RATING')
target_col = 'RATING'

X = train_pdf[feature_cols]
y = train_pdf[target_col]

# Oversample the minority class via SMOTE
from imblearn.over_sampling import SMOTE
X_balance, y_balance = SMOTE().fit_resample(X, y)

# Combine the returned values into a single pandas dataframe
X_balance[target_col] = y_balance

# Persist the dataframes in Snowflake tables
session.sql('DROP TABLE IF EXISTS DATA_TRAIN').collect()
session.write_pandas(X_balance, table_name="DATA_TRAIN", auto_create_table=True)
test_sdf.write.save_as_table(table_name='DATA_TEST', mode='overwrite')

 

Now our training data is balanced:

 

import snowflake.snowpark.functions as F

train_sdf = session.table("DATA_TRAIN")

tot = train_sdf.count()
train_sdf.group_by('RATING').count().sort('RATING')\
                        .with_column('PER',F.col('COUNT')/tot*100)\
                        .show()

 

[Table: class distribution of the balanced training data (RATING, COUNT, PER)]

 

Model 1: LogisticRegression

Let’s train our first model using the ‘LogisticRegression’ function inside the Snowflake ML modelling package. For more information about how to train ML models with Snowpark, check out the Snowflake documentation:

 

from snowflake.snowpark.session import Session
import snowflake.snowpark.functions as F

from snowflake.ml.modeling.linear_model import LogisticRegression
from snowflake.ml.modeling.metrics import *

import json
import pandas as pd
import seaborn as sns



feature_cols = train_sdf.columns
feature_cols.remove('RATING')
target_col = 'RATING'

lr = LogisticRegression(
    C=0.8, 
    solver='lbfgs',
    random_state=0, 
    input_cols=feature_cols, 
    label_cols=target_col, 
    output_cols=['PREDICTION']
    )
lr.fit(train_sdf)

 

A big advantage of snowflake.ml.modeling is its effortless integration with scikit-learn. Models trained within the Snowflake environment can be easily converted to the scikit-learn format. This flexibility highlights the adaptability of Snowflake’s ML functionalities, providing a bridge between different frameworks, and guaranteeing a versatile approach to model deployment:

 

lr_local = lr.to_sklearn()
lr_local

 

Now we can check our model’s metrics:

 

from snowflake.snowpark.functions import udf
session.custom_package_usage_config = {"enabled": True}

#Prediction values using test data
scored_snowml_sdf = lr.predict(test_sdf)

#Let's create a confusion matrix to see our results:
cf_matrix = confusion_matrix(df=scored_snowml_sdf, y_true_col_name='RATING', y_pred_col_name='PREDICTION')

sns.heatmap(cf_matrix, annot=True, fmt='.0f', cmap='Blues')

 

[Image: confusion matrix for the LogisticRegression model]

 

And some of the model’s metrics values:

 

print('Accuracy:', accuracy_score(df=scored_snowml_sdf, y_true_col_names='RATING', y_pred_col_names='PREDICTION'))
print('Precision:', precision_score(df=scored_snowml_sdf, y_true_col_names='RATING', y_pred_col_names='PREDICTION'))
print('Recall:', recall_score(df=scored_snowml_sdf, y_true_col_names='RATING', y_pred_col_names='PREDICTION'))
print('F1:', f1_score(df=scored_snowml_sdf, y_true_col_names='RATING', y_pred_col_names='PREDICTION'))

 

 

Model 2: XGBClassifier

We’ll follow the same steps as before but now with the ‘XGBClassifier’ model:

 

from snowflake.ml.modeling.xgboost import XGBClassifier
from snowflake.ml.modeling.metrics import *


feature_cols = train_sdf.columns
feature_cols.remove('RATING')
target_col = 'RATING'

xgbmodel = XGBClassifier(
    random_state=0, 
    input_cols=feature_cols, 
    label_cols=target_col, 
    output_cols=['PREDICTION']
    )

xgbmodel.fit(train_sdf)

# Obtaining and plotting a simple confusion matrix
from snowflake.snowpark.functions import udf
session.custom_package_usage_config = {"enabled": True}

scored_snowml_sdf_xgboost = xgbmodel.predict(test_sdf)

cf_matrix_xgboost = confusion_matrix(df=scored_snowml_sdf_xgboost, y_true_col_name='RATING', y_pred_col_name='PREDICTION')

sns.heatmap(cf_matrix_xgboost, annot=True, fmt='.0f', cmap='Blues')

# Printing the metrics
print('Accuracy:', accuracy_score(df=scored_snowml_sdf_xgboost, y_true_col_names='RATING', y_pred_col_names='PREDICTION'))
print('Precision:', precision_score(df=scored_snowml_sdf_xgboost, y_true_col_names='RATING', y_pred_col_names='PREDICTION'))
print('Recall:', recall_score(df=scored_snowml_sdf_xgboost, y_true_col_names='RATING', y_pred_col_names='PREDICTION'))
print('F1:', f1_score(df=scored_snowml_sdf_xgboost, y_true_col_names='RATING', y_pred_col_names='PREDICTION'))

 

Displayed below are the confusion matrix and the metrics results:

 

Model 3: GradientBoostingRegressor

For our final model, we will use the GradientBoostingRegressor estimator from scikit-learn. To integrate this package, we have to convert our train and test data into pandas format:

 

from sklearn.ensemble import GradientBoostingRegressor
import numpy as np

train_pd = train_sdf.to_pandas()

y_train = train_pd.loc[:, 'RATING']
x_train = train_pd.loc[:, train_pd.columns != 'RATING']

gbm_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0, loss='squared_error')

gbm_model.fit(x_train, y_train)

# Obtaining and plotting a simple confusion matrix
from snowflake.snowpark.functions import udf
from sklearn.metrics import confusion_matrix

session.custom_package_usage_config = {"enabled": True}

test_pd = test_sdf.to_pandas()
predict_values = test_pd.loc[:, test_pd.columns != 'RATING']

y_test = test_pd.loc[:, 'RATING']
# The regressor returns continuous values, so we round them to obtain class labels
y_predict = abs(np.around(gbm_model.predict(predict_values), 0))

cf_matrix_gbm = confusion_matrix(y_test, y_predict)

sns.heatmap(cf_matrix_gbm, annot=True, fmt='.0f', cmap='Blues')

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy:', accuracy_score(y_test, y_predict))
print('Precision:', precision_score(y_test, y_predict))
print('Recall:', recall_score(y_test, y_predict))
print('F1:', f1_score(y_test, y_predict))

 

Here are the confusion matrix and the metrics results:

 

Using Model Registry to Deploy ML Models

Now that our machine learning models have been trained, we will use the new Model Registry feature. As an integral component of Snowpark ML Operations (MLOps), this feature acts as a secure hub for managing models and their associated metadata within Snowflake, irrespective of their origin. This cutting-edge feature elevates machine learning models to the status of first-class schema-level objects in Snowflake, facilitating easy discovery and utilisation across your organisation. With Snowpark Model Registry, you can establish registries and seamlessly store models, harnessing the ML capabilities of the platform.

 

The models stored within the registry can have multiple versions, and you can designate a specific version as the default. Another advantage is that it supports different types of models (besides Snowpark ML), like scikit-learn, PyTorch or TensorFlow, so you can create different models and use different packages, but keep them together in the same object.
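For instance, once a model has several versions, one of them can be promoted to be the default. The snippet below is a minimal sketch, assuming a Registry object like the `reg` we create further below and the `default` property exposed by the Python API:

# Minimal sketch: managing the versions of a registered model
m = reg.get_model("XGBClassifier")

m.show_versions()   # list every version registered for this model
m.default = "v2"    # designate a specific version as the default
print(m.default)    # returns the default ModelVersion object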

 

Let’s get back to our demo and register the three trained models using Model Registry. We will start by creating our registry object:

 

import snowflake.snowpark as snp
from snowflake.ml.registry import Registry

# Reuse the connection parameters dictionary (state_dict) from the first part of this series
session = snp.Session.builder.configs(state_dict["connection_parameters"]).create()

reg = Registry(session=session,
               database_name=state_dict['connection_parameters']['database'],
               schema_name=state_dict['connection_parameters']['schema'])

 

With this code we create a registry object pointing to the database and schema where our machine learning models will be registered.

 

It is important to note that, even though the registry object has been created, it will not be shown as a schema object in Snowflake (as, for example, Tables, Stages or Procedures are):

 

[Screenshot: schema objects in Snowflake]

 

Now we can register our first model ‘LogisticRegression’:

 

model_ref = reg.log_model(
    lr,
    model_name="LogisticRegression",
    version_name="v1",
    comment = "First version of LogisticRegression for JUICES_RATINGS", #optional
    metrics = {"Accuracy": 0.74} #optional
)

 

When registering a model, only ‘model’, ‘model_name’ and ‘version_name’ are required arguments. However, we can also add a comment for the model and some specific metrics (in dictionary format). For more information about Model Registry arguments, check the Snowflake documentation.

 

Now that we’ve registered the model, we can open a SQL worksheet in Snowflake and use the following SQL code to check our registered models:

 

[Screenshot: SQL worksheet in Snowflake showing the registered model]

 

Now we’ll add the other two models we trained:

 

# Registering XGBMODEL

model_ref = reg.log_model(
    xgbmodel,
    model_name="XGBClassifier",
    version_name="v1",
    comment = "First version of XGBClassifier for JUICE_RATINGS", #optional
    metrics = {"Accuracy": 0.71} #optional
)


# Registering the GBM model: for scikit-learn models we also need to provide the sample_input_data argument

test_pd = test_sdf.to_pandas()
sample_values = test_pd.loc[:, test_pd.columns != 'RATING']

model_ref = reg.log_model(
    gbm_model,
    model_name="GBM",
    version_name="v1",
    sample_input_data=sample_values,
    comment = "First version of GradientBoostingRegressor for JUICE_RATINGS", #optional
    metrics = {"Accuracy": 0.72} #optional
)

 

For all models other than Snowpark ML and MLflow models, either the ‘sample_input_data’ or the ‘signatures’ argument must be provided. These arguments are used to infer the feature names and types that the model expects.
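As an illustration of the ‘signatures’ alternative, here is a rough sketch, assuming the ModelSignature, FeatureSpec and DataType classes from snowflake.ml.model.model_signature and purely hypothetical feature names; it declares the model’s inputs and outputs explicitly instead of passing sample data:

from snowflake.ml.model.model_signature import DataType, FeatureSpec, ModelSignature

# FEATURE_1 and FEATURE_2 are placeholders for the columns the model was trained on
predict_signature = ModelSignature(
    inputs=[
        FeatureSpec(name="FEATURE_1", dtype=DataType.DOUBLE),
        FeatureSpec(name="FEATURE_2", dtype=DataType.DOUBLE),
    ],
    outputs=[FeatureSpec(name="PREDICTION", dtype=DataType.DOUBLE)],
)

model_ref = reg.log_model(
    gbm_model,
    model_name="GBM",
    version_name="v2",
    signatures={"predict": predict_signature},  # keyed by the method the signature describes
)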

 

Now let’s check that our models are registered in our schema:

 

[Screenshot: the three models registered in the schema]

 

Since models are first-class, schema-level objects within Snowflake, SQL commands like DROP MODEL, SHOW MODELS, and ALTER MODEL are available for managing them.
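For example (using the model names from this post; the exact clauses available may vary with your Snowflake version):

SHOW MODELS;                                    -- list the models registered in the current schema
ALTER MODEL LOGISTICREGRESSION SET COMMENT = 'Baseline model for JUICES_RATINGS';
DROP MODEL GBM;                                 -- remove the model and all of its versions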

 

Another important aspect when registering models is the requirement for each model to have a unique combination of name and version, distinct from those already registered. An error will occur if this condition is not met:

 

[Screenshot: error raised when registering a model with a name and version that already exist]

 

 

Using Model Registry for Inference

With our models now successfully registered, we are ready to utilise these model objects for predictive tasks. A notable advantage offered by Model Registry is its inherent flexibility, enabling users to easily select the desired model for use. To illustrate this, let’s start making predictions using our registered models, beginning with the creation of our model object in Jupyter:

 

from snowflake.ml.registry import Registry

model = Registry(session=session,
               database_name=state_dict['connection_parameters']['database'],
               schema_name=state_dict['connection_parameters']['schema'])

 

A method of a specific model version is invoked via the ‘.run()’ method: we specify the name of the function to be called and provide a DataFrame containing the inference data. The method is executed within a Snowflake warehouse. Let’s look at an example of this process by applying it to the ‘LogisticRegression’ model:

 

#Using the LogisticRegression model
LogReg = model.get_model("LogisticRegression").version("v1")

test_values = test_sdf #test values must be a Snowpark DataFrame

#We run the predict function
LogReg.run(test_values, function_name='predict')

 

We obtain this DataFrame containing the prediction as a result:

 

[Screenshot: resulting DataFrame with the PREDICTION column]

 

Now, if we want to use the ‘XGBClassifier’ model instead, we only have to change the desired model in our object, given that the test data is already a Snowpark DataFrame:

 

#Using the XGBClassifier model
XGBClass = model.get_model("XGBClassifier").version("v1")

#We run the predict function
XGBClass.run(test_values, function_name='predict')

 

 

Finally, we will use the ‘GBM’ model. As this model was created with scikit-learn, the test data must be provided as a pandas DataFrame. The ‘predict’ function employed here is the one from scikit-learn, not Snowpark, which is why the output DataFrame differs from the previous ones:

 

#Using the GBM model
gbm = model.get_model("GBM").version("v1")

#Test values must be a pandas dataframe without the target column
test_pd = test_sdf.to_pandas()
test_values_pd = test_pd.loc[:, test_pd.columns != 'RATING']

gbm.run(test_values_pd, function_name='predict')

 

 

Another exceptional feature of Model Registry is its wide range of prediction functions, extending beyond the conventional ‘predict’ method. For instance, the ‘XGBClassifier’ model can seamlessly substitute ‘predict’ with the ‘predict_proba’ function. This introduces an additional layer of flexibility for data scientists, empowering them to customise predictions to meet specific needs:

 

#We run the predict_proba function
XGBClass.run(test_values, function_name='predict_proba')

 

 

From this registry we can get other information, like metrics, comments, or other registered model metadata.
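For example (a short sketch using the `reg` Registry object created earlier and the metadata methods of the Python API):

# Inspect the metadata stored for a registered model version
xgb_v1 = reg.get_model("XGBClassifier").version("v1")

print(xgb_v1.show_metrics())   # metrics dictionary logged at registration time, e.g. {'Accuracy': 0.71}
print(xgb_v1.comment)          # the comment attached to this version

# List all models registered in the current database and schema
reg.show_models()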

 

 

Creating a Function to Register Our Models

Having explored some of the key functionalities of Model Registry, it’s not difficult to get an idea of its immense capabilities, and the toolkit for creating functions to train, fine-tune, store, and manage various ML models is extensive. Imagine crafting a function capable of training an ‘XGBoost’ model with any given dataset, a function that not only calculates metrics but also facilitates the storage of the model, metrics, and associated metadata within our designated schema object in the Snowflake database. The code snippet below is a good example of this comprehensive functionality, highlighting the versatility and efficiency of Model Registry’s offerings:

 

# Arguments definition

import snowflake.snowpark as snp

session = snp.Session.builder.configs(state_dict["connection_parameters"]).create()

feature_cols = train_sdf.columns
feature_cols.remove('RATING')
target_col = 'RATING'
output_cols = ['PREDICTION']
train_data = "DATA_TRAIN"
test_data = "DATA_TEST"
random_state = 0
database_name = state_dict['connection_parameters']['database']
schema_name = state_dict['connection_parameters']['schema']
model_name = "XGB_CLASSIFER_FROM_FUNCTION"
version_name = "V4"
commentary = "Test model for the xgboost_model_registry function"

# Creation of the function

def xgboost_model_registry(session: snp.Session,
                           feature_cols: list,
                           target_col: str,
                           output_cols: list,
                           train_data: str,
                           test_data: str,
                           random_state: int,
                           database_name: str,
                           schema_name: str,
                           model_name: str,
                           version_name: str,
                           comment: str = "") -> str:
    
    import snowflake.snowpark as snp
    from snowflake.ml.modeling.xgboost import XGBClassifier
    from snowflake.ml.modeling.metrics import accuracy_score, precision_score, recall_score, f1_score
    
    data_train = session.table(train_data)
    data_test = session.table(test_data)
    
    # We create the XGBClassifier model
    xgbmodel = XGBClassifier(
        random_state=random_state,
        input_cols=feature_cols,
        label_cols=target_col,
        output_cols=output_cols
    )

    xgbmodel.fit(data_train)                        
    
    #Predicted values using test_data                       
    predicted_test = xgbmodel.predict(data_test)
    
    # Metrics of the trained XGBClassifier model
    
    acc = accuracy_score(df=predicted_test, y_true_col_names=target_col, y_pred_col_names=output_cols[0])
    precision = precision_score(df=predicted_test, y_true_col_names=target_col, y_pred_col_names=output_cols[0])
    recall = recall_score(df=predicted_test, y_true_col_names=target_col, y_pred_col_names=output_cols[0])               
    F1 = f1_score(df=predicted_test, y_true_col_names=target_col, y_pred_col_names=output_cols[0])                       
    print('Accuracy:', acc)
    print('Precision:', precision)
    print('Recall:', recall)
    print('F1:', F1)

    #Model Registry
    from snowflake.ml.registry import Registry

    reg = Registry(session=session,
                    database_name=database_name,
                    schema_name=schema_name)

    model_ref = reg.log_model(
        xgbmodel,
        model_name=model_name,
        version_name=version_name,
        metrics={"Accuracy": acc,
                "Precision":precision,
                "Recall":recall,
                "F1":F1},
        comment=comment
    )
    return "Model successfully registered"

# Now we call our function
xgboost_model_registry(session,feature_cols, target_col, output_cols, train_data, test_data, random_state, database_name,
                       schema_name, model_name, version_name, commentary)

 

As we can see in our Snowflake schema, the new model has successfully been trained and registered in Snowflake, along with the metrics and the comment:

 

[Screenshot: the new model registered in the Snowflake schema, with its metrics and comment]

 

The possibilities are immense. We could introduce new arguments to choose between various ML models, incorporate diverse cross-validation methods, and add arguments for hyperparameter tuning, while keeping all that information inside the same object in our Snowflake database. What’s more, once these configurations have been set up, they can be wrapped in stored procedures, allowing seamless invocation within Snowflake and also facilitating integration into broader workflows. Model Registry’s functionality provides a robust framework for tailoring and orchestrating sophisticated ML processes.
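As a rough sketch of that idea (the procedure name, stage and package list below are assumptions, not part of the original post), the helper above could be wrapped in a Snowpark stored procedure and then invoked directly from SQL:

from snowflake.snowpark import Session

def train_and_register_xgb(session: Session) -> str:
    # Thin wrapper around the helper defined above, using fixed example arguments
    return xgboost_model_registry(
        session, feature_cols, target_col, output_cols,
        "DATA_TRAIN", "DATA_TEST", 0,
        database_name, schema_name,
        "XGB_CLASSIFIER_SPROC", "V1",          # hypothetical model name and version
        "Registered from a stored procedure",
    )

session.sproc.register(
    func=train_and_register_xgb,
    name="TRAIN_AND_REGISTER_XGB",             # hypothetical procedure name
    packages=["snowflake-snowpark-python", "snowflake-ml-python"],
    is_permanent=True,
    stage_location="@ML_MODELS_STAGE",         # hypothetical stage for the procedure's code
    replace=True,
)

# The whole training-and-registration pipeline can now be run from SQL:
#   CALL TRAIN_AND_REGISTER_XGB();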

 

Calling Model Methods in SQL

In earlier sections, we demonstrated how to use a registered model for inference with Python. However, Model Registry offers another remarkable capability: it can be leveraged not only in Jupyter notebooks but also within SQL environments. To call a model, employ the following code:

 

[Screenshot: SQL code to call a registered model]
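The screenshot is not reproduced here, but as a rough sketch based on Snowflake’s documented syntax for calling model methods in SQL (the feature columns are hypothetical placeholders), calling the default version of a registered model looks like this:

-- General pattern: SELECT <model_name>!<method_name>(<input columns>) FROM <table>;
SELECT LOGISTICREGRESSION!PREDICT(FEATURE_1, FEATURE_2) FROM DATA_TEST;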

 

To invoke a specific version, you’d need to use an alias, like this:
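The original screenshot is missing here, but a hedged sketch of the documented alias syntax, again with placeholder feature columns, would look like this:

-- Bind an alias to a specific model version, then call its method through the alias
WITH lr_v1 AS MODEL LOGISTICREGRESSION VERSION V1
SELECT lr_v1!PREDICT(FEATURE_1, FEATURE_2) FROM DATA_TEST;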

 

 

Let’s look at some examples using our registered models. Imagine we need to use our ‘XGBClassifier’ model to get predictions using our ‘data_test’ table saved in Snowflake. We can use the following code:

[Screenshot: SQL code calling the XGBClassifier model on the data_test table]
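As the screenshot is not shown, here is a sketch under the same assumptions (hypothetical feature columns standing in for the ones the model was trained on):

-- Run the default version of the XGBClassifier model against the test table
SELECT XGBCLASSIFIER!PREDICT(FEATURE_1, FEATURE_2, FEATURE_3) FROM DATA_TEST;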

 

As you can see, we have to specify the names of our input columns within the method call, in this case ‘predict()’. The method returns a JSON object for each row, with its fields sorted alphabetically. However, if we’d prefer to retrieve only the prediction value instead of the entire JSON response, we can simply select ‘:PREDICTION’ in our statement:
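For instance, still assuming hypothetical feature columns, extracting only the prediction value could look like this:

-- Keep just the PREDICTION field from the JSON object returned by the method
SELECT XGBCLASSIFIER!PREDICT(FEATURE_1, FEATURE_2, FEATURE_3):PREDICTION AS PREDICTION
FROM DATA_TEST;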

 

Now, we will employ the ‘GBM’ model using the ‘with model’ code structure, enabling the selection of any version of the specified model. It’s important to note that this ‘GBM’ model was generated from the sklearn library, so the table result produced by the ‘predict()’ function will differ from the one before, which was created using Snowflake ML.
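A hedged sketch of that ‘with model’ structure, under the same assumptions about column names, is shown below; since the underlying model is a scikit-learn regressor, the returned object contains the regressor’s raw output rather than the PREDICTION column produced by the Snowflake ML models:

-- Select a specific version of the scikit-learn GBM model via an alias and run predict
WITH gbm_v1 AS MODEL GBM VERSION V1
SELECT gbm_v1!PREDICT(FEATURE_1, FEATURE_2, FEATURE_3) FROM DATA_TEST;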

 

 

 

Conclusions

 

In addition to its rich array of functions tailored for data transformation, Snowpark offers a wide range of capabilities for training machine learning models, going far beyond its native ML package. Able to integrate seamlessly with well-known libraries like scikit-learn, Snowpark lets data scientists leverage their preferred tools and methodologies, ensuring compatibility and interoperability with existing workflows. This versatility not only enhances the platform’s appeal but also reinforces its position as a comprehensive solution for ML model development within the Snowflake environment.

 

The introduction of the Model Registry object within Snowflake is a paradigm shift in model management. This feature not only modernises the registration, organisation, and versioning of ML models but also facilitates the incorporation of diverse packages outside Snowpark’s native offerings. It is important to highlight Model Registry’s integration with the inference process, facilitating swift model swaps for predictive tasks while also significantly enhancing agility and efficiency in model deployment. Moreover, the schema-object nature of Model Registry opens the door to seamless integration within SQL queries, adding another layer of flexibility and ease of use to the model management workflow.

 

In conclusion, the combination of Snowpark and the Model Registry object presents a compelling solution for end-to-end machine learning model development and deployment within the Snowflake ecosystem, positioning organisations for even greater success in their data-driven initiatives. Reach out to our team of dedicated experts, ready to harness Snowflake’s powerful ecosystem to drive innovation and transform data strategies into actionable success stories for you!

 

Pablo D
pablo.doniga@clearpeaks.com