## Intro

At [BlueLightAI](https://bluelightai.com/) we are **thrilled** to help you identify the best model for your use case!

**What we do:**
1. Our algorithm finds natural groups in your queries dataset 👥 🤝

2. We illuminate their performance rates for model comparison 💡 📊

Easily compare any pair of models, like a base model 📍 and its **fine-tuned Marqtune** checkpoint 🚀

**In this notebook**: we compare the retrieval performance of a fine-tuned embedding model and its base model on the ecommerce dataset (gs_100k_training.csv) provided in Marqtune's examples.

## Why are the results impactful?

**Going beyond average performance metrics**

The average performance on your whole dataset is a good start, but what about:
- Low performing query types?
- Query types that didn't improve?

**Quickly assess** whether your model:
- Is risky or ready to deploy based on real user queries
- Needs more fine-tuning in Marqtune
- Is better or worse than another base model

## Setup
Sample setup steps to reproduce this specific notebook are in the cells below. To install the most up-to-date version of Cobalt, see the [Cobalt docs](https://docs.cobalt.bluelightai.com/setup.html#installation).

In [None]:
# At the time of this writing, Python 3.12 was set up in a fresh conda virtual environment
# in the terminal prior to installing cobalt:

# conda create -y --name cobalt-env
# conda activate cobalt-env
# conda install -y python=3.12

In [None]:
# Uncomment and run this cell if you have not yet set up Cobalt and sentence_transformers.
# The '%' is not needed if pip is run in the terminal.

# %pip install cobalt-ai
# %pip install sentence_transformers==3.3.1
# 3.3.1 is the version of sentence_transformers used for this notebook at the time of writing

# import cobalt
# cobalt.register_license()  # one-time free trial registration

**How:** We built a hierarchical clustering algorithm with roots in Topological Data Analysis.

1. We take in unstructured data about your machine learning model (text, image, etc.)

2. We embed it (or you can supply your own embedding, such as a use-case-specific embedding for ecommerce!)

3. Our algorithm outputs intuitive group labels for your data as a dataframe.

4. You can then easily compare performance across models using the groups on your data.

**In this notebook**, we use SBERT to embed your text queries, but we handle any kind of embeddings!
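To make step 4 concrete, here is a minimal sketch of comparing two models per group with plain pandas. The data is a toy example, and the `group` column is assigned by hand purely for illustration; it stands in for the labels that the clustering step would produce. The score column names mirror the ones used later in this notebook.

```python
import pandas as pd

# Toy per-query scores. The "group" column is hand-assigned for illustration
# and stands in for labels produced by a clustering step.
queries = pd.DataFrame(
    {
        "query": ["wine cellar", "cellar cooling", "riding boots", "boys boots"],
        "group": ["cellar", "cellar", "boots", "boots"],
        "score_epoch_1": [0.0, 0.2, 0.9, 0.7],
        "score_epoch_14": [1.0, 0.8, 0.5, 0.7],
    }
)

# Average each model's score within every group and count the queries.
per_group = queries.groupby("group").agg(
    query_count=("query", "size"),
    score_epoch_1=("score_epoch_1", "mean"),
    score_epoch_14=("score_epoch_14", "mean"),
)

# Positive values mean the later checkpoint did better on that group.
per_group["fine_tuning_impact"] = (
    per_group["score_epoch_14"] - per_group["score_epoch_1"]
)

print(per_group.sort_values("fine_tuning_impact"))
```

Sorting by the impact column puts the groups that regressed at the top, which is often the first thing worth inspecting.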


## Data Prep

For each query we simply need a per-sample performance rate
- e.g., for search retrieval it could be NDCG, purchase rate, click-through rate, etc.

In [None]:
import warnings

import pandas as pd

import cobalt
from cobalt.embedding_models import SentenceTransformerEmbeddingModel
from cobalt.lab.generate_interpretable_dataframe import get_interpretable_groups

warnings.filterwarnings("ignore")

In this notebook we precomputed these [NDCG](https://www.marqo.ai/blog/what-is-normalized-discounted-cumulative-gain-ndcg) performance rates using Marqo's data and tools.

The predictions came from using **Marqtune** and **Marqo Cloud** Vector Databases

Note: The "Score" per query has a best value of 1, and a worst value of 0 (NDCG metric)
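For readers new to the metric, the sketch below shows one common way to compute NDCG for a single query from a ranked list of relevance labels. It uses the linear-gain form of DCG with no rank cutoff; the exact variant and cutoff used to produce the precomputed scores are not stated in this notebook, so treat this as an illustration of the 0-to-1 range rather than a reproduction of those numbers. The function names are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each relevance is discounted by log2(rank + 1)."""
    return sum(
        rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering; 0.0 if nothing is relevant."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Relevance labels are listed in the order the model ranked the results.
print(ndcg([1, 0, 0]))  # the relevant item is ranked first
print(ndcg([0, 1, 0]))  # the relevant item is ranked second
print(ndcg([0, 0, 0]))  # nothing relevant was retrieved
```

A query whose only relevant item is ranked first scores 1, and the score drops as that item moves down the ranking.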

In [None]:
from urllib.request import urlretrieve

base_path = "https://examples.cobalt.bluelightai.com/marqo-gs-10m/v1"

epoch_1_file = "training_epoch_1_ndcg_per_query.csv"
epoch_14_file = "training_epoch_14_ndcg_per_query.csv"

urlretrieve(f"{base_path}/{epoch_1_file}", epoch_1_file)
urlretrieve(f"{base_path}/{epoch_14_file}", epoch_14_file)

In [None]:
epoch_14_ndcg_per_query_df = pd.read_csv("training_epoch_14_ndcg_per_query.csv")
epoch_14_ndcg_per_query_df = epoch_14_ndcg_per_query_df.drop(columns=["Score"])
epoch_14_ndcg_per_query_df.head(1)

In [None]:
epoch_1_ndcg_per_query_df = pd.read_csv("training_epoch_1_ndcg_per_query.csv")
epoch_1_ndcg_per_query_df = epoch_1_ndcg_per_query_df.drop(
    columns=["Unnamed: 0", "Score"]
)
epoch_1_ndcg_per_query_df.head(1)

#### Data Prep: Compare Models on the same Queries

We can combine our dataframes since the queries are identical.

This allows us to see the scores for each model (i.e., base epoch 1 vs. fine-tuned epoch 14).
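The next cell copies the epoch 14 scores over by row index, which relies on both files listing the queries in the same order. If that ordering is ever in doubt, an explicit join on the query text is one alternative. The sketch below uses toy data and assumes each query string appears exactly once per file, which has not been verified for this dataset; `validate="one_to_one"` makes pandas raise an error if that assumption fails.

```python
import pandas as pd

# Toy stand-ins for the two per-query score files, deliberately in different row order.
base = pd.DataFrame(
    {"query": ["red shoes", "wine cellar", "boys boots"], "ndcg_score": [0.2, 0.0, 0.9]}
)
tuned = pd.DataFrame(
    {"query": ["wine cellar", "boys boots", "red shoes"], "ndcg_score": [0.8, 0.7, 0.2]}
)

# Join on the query text; raises MergeError if a query is duplicated on either side.
joined = base.merge(
    tuned,
    on="query",
    suffixes=("_epoch_1", "_epoch_14"),
    validate="one_to_one",
)

joined["fine_tuning_impact"] = (
    joined["ndcg_score_epoch_14"] - joined["ndcg_score_epoch_1"]
)
print(joined)
```

The join costs a little more than a column assignment, but it fails loudly instead of silently pairing scores from different queries.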

In [None]:
model_comparison_df = epoch_1_ndcg_per_query_df.copy()
model_comparison_df = model_comparison_df.rename(
    columns={"ndcg_score": "score_epoch_1"}
)
model_comparison_df["score_epoch_14"] = epoch_14_ndcg_per_query_df["ndcg_score"]

In [None]:
model_comparison_df["fine_tuning_impact"] = (
    model_comparison_df["score_epoch_14"] - model_comparison_df["score_epoch_1"]
)

In [None]:
model_comparison_df.head()

We can see in the plot below:
- Many queries got worse from this fine-tuning run
- Many queries had around the same performance

Next, we show you which groups of queries this is happening to 🔦

In [None]:
model_comparison_df["fine_tuning_impact"].plot.hist(
    title="Impact Per Query from Fine Tuning",
    xlabel="Raw Change NDCG from Fine Tuning",
    ylabel="Count",
    bins=50,
)
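A histogram gives the shape of the change; a quick count can put numbers on "got worse" and "around the same". The sketch below buckets a toy series of impact values using a tolerance of 0.05, which is an arbitrary threshold chosen for illustration and not a value from this notebook. The same code could be pointed at `model_comparison_df["fine_tuning_impact"]`.

```python
import pandas as pd

# Toy impact values (epoch 14 score minus epoch 1 score).
impact = pd.Series([-0.30, -0.02, 0.0, 0.0, 0.01, 0.45])

# Hypothetical tolerance: changes within +/- 0.05 count as "about the same".
tolerance = 0.05

direction = pd.cut(
    impact,
    bins=[-float("inf"), -tolerance, tolerance, float("inf")],
    labels=["worse", "about the same", "better"],
)

# Fraction of queries falling into each bucket.
shares = direction.value_counts(normalize=True)
print(shares)
```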

#### And Now... 🔥

1. We compute intuitive group labels on your queries

2. and illuminate their performance on each model 🔎

In [None]:
# First load your dataframe into a `CobaltDataset`.
ds = cobalt.CobaltDataset(model_comparison_df)

# In this case, embed your data with a specific version of SBERT.
# You can embed your data with your choice of model.
m = SentenceTransformerEmbeddingModel("all-MiniLM-L6-v2")

# Using the embedding model above, embed your data. You can specify GPU-acceleration here.
embedding = m.embed(model_comparison_df["query"].tolist(), device="cpu")

# And add the embedding to the dataset, using the "cosine" similarity metric.
ds.add_embedding_array(embedding, metric="cosine", name="sbert")

In [None]:
results, workspace, keywords_per_level = get_interpretable_groups(
    ds,
    text_column_name="query",
    n_gram_range="up_to_bigrams",
    min_level=0,
    max_level=20,
    max_keywords=3,
    return_intermediates=True,
)

In [None]:
graph = workspace.graphs["New Graph"]

### Observing the Results 🧠

Note: "Score" has a best value of 1, and a worst value of 0 (NDCG metric)

**Starting Small**: Some queries got worse from fine-tuning!

(see the negative impact going from epoch 1 to epoch 14)

In [None]:
results.sort_values(by=["fine_tuning_impact"]).head()

#### Bigger Groups with the "level" column:

- Higher values in the "level" column retrieve larger groups from your source data

- Each level contains all of the unique points from the source, so combine levels with caution

- Levels are part of our clustering algorithm's design and act as "zoom" levels on patterns in the data

Easily navigate the clustering in the dataframe:
- Filter the results by a minimum query_count
- Sort for largest impact!
- Etcetera
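The caution about combining levels can be seen with a toy results frame. In the sketch below, the labels and counts are made up, and the column names follow the ones used in this notebook. Because every level covers the full set of source queries, the counts within one level add up to the dataset size, while adding counts across levels counts each query more than once.

```python
import pandas as pd

# Toy results: 10 source queries grouped at two zoom levels (made-up labels and counts).
toy_results = pd.DataFrame(
    {
        "Label": ["a", "b", "c", "d", "a, b", "c, d"],
        "level": [0, 0, 0, 0, 1, 1],
        "query_count": [2, 3, 1, 4, 5, 5],
    }
)

# Within a single level the counts add up to the dataset size.
per_level = toy_results.groupby("level")["query_count"].sum()
print(per_level)

# Summing across levels double counts every query.
print(toy_results["query_count"].sum())
```

This is why the filters in this notebook pin `level` to a single value before sorting or aggregating.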

In [None]:
results[(results["level"] == 10) & (results["query_count"] > 10)].sort_values(
    by=["fine_tuning_impact"]
)

In [None]:
from cobalt.lab.neighbors import get_raw_subset_with_label

#### Inspecting the Original Samples for a group

- For any "Label" you want to understand more about, pass it and its "level" column below (for uniqueness)

In [None]:
see_label = "boys equestrian, equestrian boots, equestrian"  # Insert Here
# Make sure this matches your row of interest from the results dataframe
level_column_for_see_label = 10

In [None]:
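# Filter on both the label and its level so the match is unique.
results[
    (results["Label"] == see_label) & (results["level"] == level_column_for_see_label)
].head(1)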

Simply run the next cell to see the matching source data!

In [None]:
raw_data = get_raw_subset_with_label(
    coarseness=level_column_for_see_label,
    label=see_label,
    g=graph,
    ds=ds,
    keywords_per_level=keywords_per_level,
)
raw_data.df

#### Super Positive Example

Fine-tuning with Marqtune had a huge positive impact of +0.29 for "cellars, cellar, cellar temperature" points

Many individual queries went from 0 to hero! (close to max NDCG of 1)


**Note**: Some queries still have huge room for improvement

e.g., "Freestanding cellars" and "Cellar Cooling" still have a score of 0

In [None]:
see_label = "cellars, cellar, controlled cellars"  # Insert Here
level_column_for_see_label = (
    10  # Make sure this matches your row of interest from the results dataframe
)

In [None]:
results[
    (results["Label"] == see_label) & (results["level"] == level_column_for_see_label)
].head(1)

In [None]:
raw_data = get_raw_subset_with_label(
    coarseness=level_column_for_see_label,
    label=see_label,
    g=graph,
    ds=ds,
    keywords_per_level=keywords_per_level,
)
raw_data.df

In [None]:
raw_data.df[raw_data.df["fine_tuning_impact"] == 0.0]

How to Improve your Model 💪
- Try fine-tuning longer in Marqtune
- Vary your fine-tuning hyperparameters
- Compare to other base-models
- Curate your training data

Feel free to email support@bluelightai.com for enhancements 💪 or troubleshooting 🙏