
Obtaining SOTA ML Model Observability

Ryoji Kuwae Neto
6 min read · Jul 18, 2022


Hey! What’s up?

It was Sunday night, and I was wondering where I should focus my studies, when I remembered that I needed to explore a (not so) new model monitoring platform that we've been seeing around, called Arize. "But what is model observability?", you might ask. We'll talk a little bit about that first, and then introduce where Arize comes in.

What is Observability and why does it matter?

Observability is something that we do unconsciously every microsecond of our day, and it has nothing to do with vision specifically; it is more about sensing the environment.

Answering things like "Is it cold outside?" or "Will it rain?" just by feeling the air and the humidity on our skin is so trivial that we never think of these as observations of the environment we're in.

For systems, we need the same capability in order to keep track of everything that is happening. This is extremely important, and it's easy to understand when we think of Formula 1 drivers or SpaceX astronauts: they need to measure every inch of their systems in order to stay alive and accomplish their goals.

For ML systems, we need three types of observability: infrastructure, data and model observability, as shown in the image below by Aparna Dhinakaran. She did an amazing job specifying the three types of observability that you must know in an ML system, and you can check it out here. [1]

Three types of observability in ML systems, by Aparna Dhinakaran

Given that, how do we include this type of observability in our system? Should we code every part of it?

Depending on how picky you are, you may end up with a giant mess of things to take into account, which requires knowledge in distinct areas: infrastructure engineering isn't the same skill set as data engineering or machine learning.

Luckily, there's a tool that can give us that easily: Arize. You can also look at Monalabs or Whylabs if you need other options, but I'll stick to Arize because it's the one I liked the most.

Let’s get our hands dirty

Generating Data

In order to make use of whatever Arize has to offer, we should simulate that we've trained an ML model, tested it, and already deployed it.

Assume that we did develop an ML model. We'll simulate this by creating fake datasets, each containing 10 random and different features based on a specific seed for reproducibility, and also generating the target (satisfied/unsatisfied customer).

With this, we emulate the training, evaluation and production datasets. The latter should be slightly different from the first two, so that we can emulate a data drift.

!pip install 'arize[MimicExplainer]' scikit-learn pandas numpy

import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

## Generating the training and test datasets with the make_classification function
X, y = make_classification(n_samples=10000, n_features=10, random_state=42)
df = pd.DataFrame(X, columns=[f"col_{i}" for i in range(10)])
df["target"] = pd.Series(y)
df['target_label'] = ["satisfied" if (i==1) else "unsatisfied" for i in y]

## Generating fake score probabilities to simulate the real-world scenario
df["score"] = np.random.random(10000)
df['score_label'] = ["satisfied" if (round(i)==1) else "unsatisfied" for i in df.score]

df = df.reset_index()
train, test = train_test_split(df, test_size=0.2, stratify=df["target"], random_state=42)

By now, we have both train and test datasets.

## Emulating the production dataset with slight changes and a different random state
X, y = make_classification(n_samples=3000, n_features=10, shift=0.2, scale=1.2, random_state=123)
prod_df = pd.DataFrame(X, columns=[f"col_{i}" for i in range(10)])
prod_df["target"] = pd.Series(y)
prod_df['target_label'] = ["satisfied" if (i==1) else "unsatisfied" for i in y]
prod_df["score"] = pd.Series(np.random.random(3000))
prod_df['score_label'] = ["satisfied" if (round(i)==1) else "unsatisfied" for i in prod_df.score]
prod_df = prod_df.reset_index()
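As a quick aside (not part of the Arize workflow), we can sanity-check locally that the simulated drift is really there. Here's a minimal sketch using a two-sample Kolmogorov-Smirnov test per feature; scipy comes along as a scikit-learn dependency:

```python
from scipy.stats import ks_2samp  # scipy is already installed as a scikit-learn dependency
from sklearn.datasets import make_classification

# Re-generate the two feature sets with the same parameters used above
X_train, _ = make_classification(n_samples=10000, n_features=10, random_state=42)
X_prod, _ = make_classification(n_samples=3000, n_features=10,
                                shift=0.2, scale=1.2, random_state=123)

# Two-sample Kolmogorov-Smirnov test per feature: a small p-value suggests
# the production distribution differs from the training one, i.e., drift
pvalues = [ks_2samp(X_train[:, i], X_prod[:, i]).pvalue for i in range(10)]
drifted = [i for i, p in enumerate(pvalues) if p < 0.01]
print(f"{len(drifted)} of 10 features look drifted")
```

The shift and scale applied to the production data should make most features fail the test, which is exactly the kind of signal Arize will surface for us automatically.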

Ok, now we’re ready to start.

Connecting to Arize

Run the code below to start the connection, as shown in the Arize documentation. [2]

from arize.pandas.logger import Client, Schema
from arize.utils.types import Environments, ModelTypes
API_KEY = 'YOUR_API_KEY'
SPACE_KEY = 'YOUR_SPACE_KEY'
arize_pandas_client = Client(space_key=SPACE_KEY, api_key=API_KEY)

Capturing dataset column names

# Example feature columns from the sample dataframe; swap these out to reflect your own data!
feature_column_names = [col for col in train.columns if col.startswith('col_')]
prediction_label_column_name = "score_label"
prediction_score_column_name = "score"
actual_label_column_name = "target_label"
actual_score_column_name = "target"

Logging the datasets

Uploading train dataset

# Log the predictions using arize_pandas_client.log
# Be sure to set up your own model parameters in the method call below
schema = Schema(
    prediction_id_column_name="index",
    # timestamp_column_name="YOUR_PREDICTION_TS_COLUMN_NAME", # optional timestamp column
    prediction_label_column_name=prediction_label_column_name,
    actual_label_column_name=actual_label_column_name,
    prediction_score_column_name=prediction_score_column_name,
    actual_score_column_name=actual_score_column_name,
    feature_column_names=feature_column_names
)
res = arize_pandas_client.log(
    dataframe=train,
    model_id='custom_model_test',
    model_version='v1',
    model_type=ModelTypes.SCORE_CATEGORICAL, # or ModelTypes.NUMERIC, see docs.arize.com for more info
    environment=Environments.TRAINING, # see docs.arize.com for more info
    schema=schema,
    surrogate_explainability=True
)

Uploading Test Dataset

# Log the predictions using arize_pandas_client.log
# Be sure to set up your own model parameters in the method call below
schema = Schema(
    prediction_id_column_name="index",
    # timestamp_column_name="YOUR_PREDICTION_TS_COLUMN_NAME", # optional timestamp column
    prediction_label_column_name=prediction_label_column_name,
    actual_label_column_name=actual_label_column_name,
    prediction_score_column_name=prediction_score_column_name,
    actual_score_column_name=actual_score_column_name,
    feature_column_names=feature_column_names
)
res = arize_pandas_client.log(
    dataframe=test,
    model_id='custom_model_test',
    model_version='v1',
    model_type=ModelTypes.SCORE_CATEGORICAL, # or ModelTypes.NUMERIC, see docs.arize.com for more info
    environment=Environments.VALIDATION, # see docs.arize.com for more info
    batch_id='1',
    schema=schema,
    surrogate_explainability=True
)
if res.status_code == 200:
    print("✅ You have successfully logged the validation set to Arize")
else:
    print(f"logging failed with response code {res.status_code}, {res.text}")

Uploading Production Dataset

# Log the predictions using arize_pandas_client.log
# Be sure to set up your own model parameters in the method call below
schema = Schema(
    prediction_id_column_name="index",
    # timestamp_column_name="YOUR_PREDICTION_TS_COLUMN_NAME", # optional timestamp column
    prediction_label_column_name=prediction_label_column_name,
    actual_label_column_name=actual_label_column_name,
    prediction_score_column_name=prediction_score_column_name,
    actual_score_column_name=actual_score_column_name,
    feature_column_names=feature_column_names
)
res = arize_pandas_client.log(
    dataframe=prod_df,
    model_id='custom_model_test',
    model_version='v1',
    model_type=ModelTypes.SCORE_CATEGORICAL, # or ModelTypes.NUMERIC, see docs.arize.com for more info
    environment=Environments.PRODUCTION, # see docs.arize.com for more info
    schema=schema,
    surrogate_explainability=True
)
if res.status_code == 200:
    print("✅ You have successfully logged the production set to Arize")
else:
    print(f"logging failed with response code {res.status_code}, {res.text}")

After all of this we should be able to see something like:

Success! Check out your data at https://app.arize.com/organizations/XXXXXXXXXXXXXXXXXXXXXX=/spaces/XXXXXXXXX/models/modelName/custom_model_test?selectedTab=dataIngestion ✅ You have successfully logged production set to Arize

Checking Arize interface

Now we can go to the Arize website and check the Models panel.

After clicking on it, we can explore a lot of what the tool has already built-in and optimized for a data scientist/machine learning engineer, such as feature performance, target/data drift and much more.

I'll walk through a little of what can be seen there and why I'm so amazed by this tool.

- General metrics for target and predictions
- General metrics for features
- Performance breakdown by feature, with detailed performance on slices of data, like you can get with IBM FreaAI
- More detail on features, where we can see the drift that we simulated
- A quite easy-to-deploy model performance dashboard
- Feature importance generated by surrogate explainability at upload time
- The possibility to explore fairness issues (here I didn't actually configure this for this specific dataset)
- Payoff for a custom business metric that you can code and save, and compare between model training and production
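To give an idea of what such a custom business metric could look like, here is a minimal sketch of a payoff computation over the labels we generated earlier. The payoff values and the `total_payoff` helper are hypothetical illustrations, not something Arize provides:

```python
import pandas as pd

# Hypothetical payoff matrix, keyed by (actual, predicted) label pairs.
# The dollar values are made up for illustration only.
PAYOFF = {
    ("satisfied", "satisfied"): 10,     # correctly identified happy customer
    ("satisfied", "unsatisfied"): -5,   # retention offer wasted on a happy customer
    ("unsatisfied", "satisfied"): -50,  # missed churn risk, most costly mistake
    ("unsatisfied", "unsatisfied"): 0,  # correctly flagged, no extra cost
}

def total_payoff(df: pd.DataFrame) -> float:
    """Sum the payoff over every (actual, predicted) pair in the dataframe."""
    return float(sum(PAYOFF[(a, p)]
                     for a, p in zip(df["target_label"], df["score_label"])))

# Tiny worked example:
example = pd.DataFrame({
    "target_label": ["satisfied", "unsatisfied", "satisfied"],
    "score_label":  ["satisfied", "satisfied",   "unsatisfied"],
})
print(total_payoff(example))  # 10 + (-50) + (-5) = -45.0
```

Comparing this number between the training and production dataframes is exactly the kind of train-versus-production comparison the dashboard lets you save and track.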

One might say, "Ok, but we can already do all of that with code, and with much more flexibility". I partially agree, being so fond of coding myself. But you must accept that it speeds things up a lot by delivering everything built-in and without coding effort, just like GitHub Copilot helps us in general (without having to search Stack Overflow for simple, but often forgettable, things).

That's it for today. I hope you've enjoyed exploring a little bit of ML model observability in our hands-on with the Arize platform. There's so much that I haven't covered, such as the NLP domain, their UMAP functionality, and other dashboards, but I encourage you to explore a little and check out their docs.

See you next time!

Note: all of the screenshots taken belong to Arize.ai and its beautifully designed UI, which will make our lives much better when presenting models to stakeholders.

References

[1] Aparna Dhinakaran. The Three Types of Observability Your System Needs (Jun-2022). Towards Data Science.

[2] Arize.AI. Python Batch Docs
