<a id="top"></a>
<img width="40%" alt="Bluelight AI Logo" href="https://bluelightai.com/" src="https://github.com/BlueLightAI/cobalt-examples/blob/main/assets/blai-logo-light.png?raw=true">

# Use Cobalt to Pick the Best Model for Your E-Commerce Needs
<a href="https://bluelightai.com/contact">Give Feedback</a> | <a href="https://bluelightai.com/">Our Website</a> | <a href="https://docs.cobalt.bluelightai.com/">Cobalt Docs</a> | <a href="https://bluelightaicom.slack.com/archives/C0807BUJ4KE">Slack Channel</a> 

**Last update:** 2025-01-13 (Created: 2024-11-15)

## Introduction


At [BluelightAI](https://bluelightai.com/) we are thrilled to help you identify the best model for your use case!

**Business Context for This Notebook**: 

An ecommerce retailer is spending millions of dollars bringing customers to their website, obtaining inventory and optimizing their models. Here, they use BluelightAI Cobalt to compare two different prospective retrieval models on their customer product search dataset before deploying one.


**Model and Dataset Details**

We compare the retrieval performance of an [SBERT](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html) and an [E5](https://huggingface.co/intfloat/e5-base-v2) model on a popular ecommerce benchmark [dataset](https://huggingface.co/datasets/Marqo/marqo-GS-10M) from Marqo. 

The E5 model was fine-tuned on this dataset using [Marqtune](https://www.marqo.ai/blog/introducing-marqtune), Marqo's ecommerce fine-tuning platform.

### Install dependencies

For setup instructions, see the [Cobalt docs](https://docs.cobalt.bluelightai.com/).

In [None]:
# %pip install cobalt-ai[embeddings]

### Import libraries

In [None]:
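import warnings

import pandas as pd

import cobalt
from cobalt.embedding_models import SentenceTransformerEmbeddingModel

# Uncomment if you still need to set up your Cobalt license.
# cobalt.setup_license()

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")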

**Why BluelightAI Cobalt:** 

The time you have to understand and fix a model's errors is limited, expensive, and hard to scale to the size of your dataset. Cobalt automates the otherwise painful step of looking for patterns in how a model is performing. We also make comparing models on your dataset easy.


1. We identify groups of customer queries (inputs into your machine learning model) that have similar natural language using [TDA](https://www.nature.com/articles/srep01236) 👥

2. We provide an easy Pandas DataFrame table so you can do model comparisons on these groups of user queries. 💡

3. This helps you to do risk analysis, model improvement, and model selection so that you can deploy the best possible model for your use case 📊

#### Data Prep

For each query, we simply need an evaluation or performance score from your current or prospective model(s).
- Common [evaluation metrics](https://weaviate.io/blog/retrieval-evaluation-metrics) for search retrieval include Precision, Recall, MRR, NDCG, etc.
- Business evaluation scores often include add-to-cart rate, purchase rate, clickthrough rate, etc.

In this notebook we use a common evaluation score called [NDCG](https://www.marqo.ai/blog/what-is-normalized-discounted-cumulative-gain-ndcg) which can evaluate a product search model based on annotated data. We computed this performance score for each query beforehand, and we'll download them here.

In [None]:
from urllib.request import urlretrieve

base_path = "https://examples.cobalt.bluelightai.com/marqo-gs-10m/v1"
e5_results_file = "training_epoch_1_ndcg_per_query.csv"
sbert_results_file = "ndcg_per_query_gs_100k_training_2024-10-23_mini_lm_l6_v2.csv"
urlretrieve(f"{base_path}/{e5_results_file}", e5_results_file)
urlretrieve(f"{base_path}/{sbert_results_file}", sbert_results_file)

Note: The score per query has a best value of 1, and a worst value of 0.
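
To make the 0-to-1 range concrete, here is a minimal sketch of how an NDCG score can be computed for a single query. It uses the linear-gain form of DCG, which is one common variant; the function names and relevance labels are made up for illustration, and this is not necessarily how the scores in the downloaded files were produced.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevance labels listed in ranked order."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal ordering; 0.0 if nothing is relevant."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance labels for the top four results of two rankings.
perfect_ranking = [3, 2, 1, 0]  # most relevant product first
poor_ranking = [0, 1, 2, 3]     # most relevant product last

print(ndcg(perfect_ranking))  # the ideal ordering scores 1.0
print(ndcg(poor_ranking))     # a worse ordering scores strictly between 0 and 1
```

A ranking that puts the most relevant products first scores 1, and the score falls as relevant products are pushed further down the list.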

In [None]:
# Note: despite its name, `e5_base_query_evals` holds the SBERT (all-MiniLM-L6-v2) results.
e5_base_query_evals = pd.read_csv(sbert_results_file, index_col=0)
e5_base_query_evals.head()

In [None]:
marqo_e5_query_evals = pd.read_csv(e5_results_file, index_col=0)
marqo_e5_query_evals.head()

In [None]:
print(f"There are {len(e5_base_query_evals)} queries in the SBERT results")
print(f"There are {len(marqo_e5_query_evals)} queries in the fine-tuned E5 results")
print("The queries are the same in both datasets and the rows are aligned")

Without Cobalt, a common approach is to compare models by their average score over the whole dataset:

In [None]:
# `e5_base_query_evals` was loaded from the SBERT results file, so label it as SBERT.
sbert_print = e5_base_query_evals["ndcg_score"].mean().round(2)
print(f"The SBERT model had an average NDCG score of {sbert_print} on this dataset")
e5_fine_tuned_print = marqo_e5_query_evals["ndcg_score"].mean().round(2)
print(
    f"The fine-tuned E5 model had an average NDCG score of {e5_fine_tuned_print} on this dataset"
)

***Limitations of Current Approaches:***
- An average over the whole dataset does not show where your model is performing poorly.

- Reviewing queries one at a time to understand and improve model performance does not scale.
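
The first limitation can be illustrated with a small sketch. The group names and scores below are invented for the example and do not come from the Marqo dataset; the point is only that a reasonable-looking overall average can coexist with one group that performs badly.

```python
from statistics import mean

# Hypothetical per-query scores, keyed by a made-up query group.
scores_by_group = {
    "phone cases": [0.90, 0.95, 0.85, 0.90],
    "kitchen gadgets": [0.90, 0.80, 0.85, 0.95],
    "garden tools": [0.20, 0.30, 0.25, 0.15],
}

# The single number a whole-dataset average would report.
all_scores = [s for scores in scores_by_group.values() for s in scores]
overall = mean(all_scores)

# The same data broken down by group.
per_group = {group: mean(scores) for group, scores in scores_by_group.items()}
worst_group = min(per_group, key=per_group.get)

print(f"overall mean: {overall:.2f}")
for group, value in per_group.items():
    print(f"{group}: {value:.2f}")
print(f"weakest group: {worst_group}")
```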

**How BluelightAI Cobalt addresses these limitations:**

1. Automatically identify problematic groups of data in your model, saving days or weeks of troubleshooting effort.

2. Quickly compare models and assess the deployment risk of each one on your dataset.

#### Data Prep: Compare Models on the same Queries

We can combine our dataframes since the queries are identical and aligned
 (i.e., row 2 is the query for "Customizable Buttons for Men" in both dataframes).

In [None]:
model_comparison_df = e5_base_query_evals.copy()
model_comparison_df = model_comparison_df.rename(columns={"ndcg_score": "sbert_ndcg"})
model_comparison_df["e5_ndcg"] = marqo_e5_query_evals["ndcg_score"]
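
Combining by position only works when the rows really are aligned. As a hedged sketch, the check below uses two toy dataframes (stand-ins for the real result tables, with column names mirroring the notebook's) and fails loudly before combining if the query columns differ.

```python
import pandas as pd

# Toy stand-ins for the two per-query result tables.
toy_sbert = pd.DataFrame({"query": ["red shoes", "usb cable"], "ndcg_score": [0.4, 0.9]})
toy_e5 = pd.DataFrame({"query": ["red shoes", "usb cable"], "ndcg_score": [0.7, 0.8]})

# Stop early if the rows do not describe the same queries in the same order.
if not toy_sbert["query"].equals(toy_e5["query"]):
    raise ValueError("Query columns differ; merge on the query text instead of by position.")

# Safe to combine column-wise once alignment is confirmed.
toy_combined = toy_sbert.rename(columns={"ndcg_score": "sbert_ndcg"})
toy_combined["e5_ndcg"] = toy_e5["ndcg_score"]
print(toy_combined)
```

If the check fails on your own data, `DataFrame.merge` on the query column is a safer way to combine the two tables.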

#### And Now... BluelightAI Cobalt 🔥

1. We find groups of user queries that have similar natural language using [TDA](https://www.nature.com/articles/srep01236) 👥 🔗

2. We then illuminate the performance of these groups on each of your models 🔎

3. This makes identifying problematic groups and comparing models quick and easy

We'll start by getting embeddings for all the queries. If you already have embeddings, you can skip this step.

In [None]:
m = SentenceTransformerEmbeddingModel("all-MiniLM-L6-v2")
# Set `device` to match your hardware: "mps" is for Apple Silicon GPUs;
# use "cuda" for NVIDIA GPUs or "cpu" if no GPU is available.
embedding = m.embed(model_comparison_df["query"].tolist(), device="mps")
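
If the notebook will run on different machines, the device can be chosen at runtime instead of hard-coded. The helper below is a hypothetical convenience function (not part of Cobalt) that assumes PyTorch is the backend and falls back to the CPU when no GPU is detected or PyTorch is not installed.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports as available."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed; fall back to the CPU.
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no `mps` backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

The result can then be passed as `device=pick_device()` in the `m.embed(...)` call.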

Now we'll load our data into Cobalt along with the embeddings.

In [None]:
# First load your dataframe into a `CobaltDataset`.
ds = cobalt.CobaltDataset(model_comparison_df)

# And add the embedding to the dataset, using the "cosine" similarity metric.
ds.add_embedding_array(embedding, metric="cosine", name="sbert")

# Create a Workspace to do the analysis.
w = cobalt.Workspace(ds)
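
For intuition about the "cosine" metric named above, here is a small standalone sketch of cosine similarity on toy vectors. It illustrates the general formula only and makes no claim about how Cobalt computes it internally.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a, twice the length
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: no shared direction
```

Because the measure depends on direction and not length, two queries whose embeddings point the same way count as similar regardless of vector magnitude.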

Now we'll make sure Cobalt knows about the two models we're comparing. In particular, we need to tell it where to find the performance metric(s) for each model.

In [None]:
ds.add_model(
    input_columns="query",
    name="sbert",
    performance_columns=[
        {"name": "ndcg", "column": "sbert_ndcg", "lower_values_are_better": False},
    ],
)
ds.add_model(
    input_columns="query",
    name="e5",
    performance_columns=[
        {"name": "ndcg", "column": "e5_ndcg", "lower_values_are_better": False},
    ],
)

Now for the TDA step. We'll ask Cobalt to build a multiresolution graph of the queries. This graph has a number of different _levels_ representing the data at different levels of detail. Each level is a graph in itself, where each node corresponds to a group of similar queries. As the level increases, nodes become larger, each grouping more queries together.

In [None]:
g = w.new_graph()

# We will look at level 20 in the graph to start.
# This is a fairly coarse resolution, so we will have larger groups than if we had a lower level.
groups = w.get_graph_level(g, level=20, name="level_20")

# With this method, we can find distinctive keywords for the queries in each group.
groups.compute_group_keywords(col="query", set_names=True)
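
To build intuition for why higher levels give larger groups, here is a deliberately simplified toy. It is not Cobalt's algorithm: it groups made-up 1-D points by a distance threshold, and the thresholds stand in for increasingly coarse levels.

```python
def groups_at_threshold(points, threshold):
    """Group sorted 1-D points: neighbours at most `threshold` apart share a group."""
    ordered = sorted(points)
    grouped = [[ordered[0]]]
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous <= threshold:
            grouped[-1].append(current)  # close enough: extend the current group
        else:
            grouped.append([current])    # too far: start a new group
    return grouped


toy_points = [0.0, 0.1, 0.2, 1.0, 1.1, 3.0]  # hypothetical 1-D "embeddings"

# Larger thresholds play the role of coarser levels.
sizes_by_level = []
for toy_level, threshold in enumerate([0.05, 0.15, 1.0, 2.0]):
    sizes = [len(group) for group in groups_at_threshold(toy_points, threshold)]
    sizes_by_level.append(sizes)
    print(toy_level, sizes)
```

As the threshold grows, there are fewer groups and each holds more points, which mirrors the trade-off between a low and a high `level` in the call above.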

We can now take a look at the groups the graph gives us and their keywords by just inspecting the `groups` object.


In [None]:
groups

Now let's compare the models! The `compare_models()` method will give us a dataframe with both models' performance for each subset.

In [None]:
results = groups.compare_models(ds.models, metrics=["ndcg"], select_best_model=True)

The resulting table has a column indicating the best model for each metric. It also has a column indicating the performance uplift of the best model over the second-place model.

Here we will look at the groups where the E5 model did best.

In [None]:
results[results["best_model_ndcg"] == "e5"].sort_values(
    "ndcg_margin", ascending=False
).head(10)
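
To make the "best model" and "margin" columns concrete, here is a rough sketch of the idea on invented data. It assumes the margin is the difference between the best and second-best group means, which is an assumption for illustration and not a statement of how `compare_models()` is implemented.

```python
from statistics import mean

# Hypothetical per-query NDCG scores for two models within two made-up groups.
toy_scores = {
    "group A": {"sbert": [0.5, 0.6, 0.7], "e5": [0.8, 0.9, 0.7]},
    "group B": {"sbert": [0.9, 0.8, 0.7], "e5": [0.6, 0.7, 0.5]},
}


def best_and_margin(scores_by_model):
    """Return (best model name, gap between the best and second-best mean score)."""
    means = {model: mean(scores) for model, scores in scores_by_model.items()}
    ranked = sorted(means, key=means.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best, means[best] - means[runner_up]


for group_name, by_model in toy_scores.items():
    best, margin = best_and_margin(by_model)
    print(f"{group_name}: best={best}, margin={margin:.2f}")
```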

We can access the details for a group with a particular name by indexing into `groups`.

In [None]:
groups["rfid, rfid blocking, blocking"]

And now let's look at the groups where the SBERT model did better. Looks like the improvement margin is lower, which makes sense as the SBERT model has lower performance on average than the E5 model. Still, there are a number of groups where the SBERT model's performance appears to significantly beat the E5 model.

In [None]:
results[results["best_model_ndcg"] == "sbert"].sort_values(
    "ndcg_margin", ascending=False
).head(10)

In [None]:
groups["soda, soda maker, carbonation"]

### Possible Next Steps:
- **Risk Analysis**
    1. For your most important customer query patterns, weigh whether each model's performance is good enough for deployment.

    2. You can evaluate more prospective models with BluelightAI until your minimum performance bar is met, or

    3. You can make targeted improvements to the models on the queries you are concerned about.

- **Curate your Training Data**
    1. Ensure that your dataset is comprehensive and representative of the real-world scenarios your model will face.
    2. Curate and improve your data for fine-tuning your model.

    [Marqtune](https://www.marqo.ai/blog/introducing-marqtune) helps with fine-tuning ecommerce models.

    [BluelightAI](https://bluelightai.com/) can help you track performance at each of your fine-tuning model checkpoints for each of your queries and their associated groups.

Feel free to email support@bluelightai.com for enhancements 💪 or troubleshooting 🙏

<div style="display: flex; align-items: center; justify-content: space-between;">
    <div style="flex: 1; text-align: left;">
        <a href="#top" style="text-decoration: none; color: inherit;"> 
            <h3>Top of Page</h3> 
        </a>
    </div>
    <div style="flex: 1; text-align: right;">
        <img width="50%" alt="Bluelight AI Logo" href="https://bluelightai.com/" src="https://github.com/BlueLightAI/cobalt-examples/blob/main/assets/blai-logo-light.png?raw=true" style="float: right;">
    </div>
</div>