# Synthetic drift example

Here we construct a very basic example of data drift. We start by choosing five cluster centers and assigning data points to clusters at random. The baseline set draws from clusters 0 through 3. The comparison set is built in two halves: the first half draws from all five clusters, while the second half leaves out clusters 0 and 1 and draws only from clusters 2 through 4.

In [None]:
import numpy as np
import pandas as pd

from cobalt import CobaltDataset, DatasetSplit, Workspace

In [None]:
cluster_centers = np.random.randn(5, 25)
baseline_cluster_ids = np.random.randint(0, 4, 1000)
comparison_cluster_ids_1 = np.random.randint(0, 5, 1000)
comparison_cluster_ids_2 = np.random.randint(2, 5, 1000)
comparison_cluster_ids = np.concatenate(
    [comparison_cluster_ids_1, comparison_cluster_ids_2]
)
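
As a quick sanity check on the sampling ranges, the sketch below counts how many points land in each cluster for the three draws. It is self-contained and uses a seeded `np.random.default_rng` generator for reproducibility, which is an addition here; the names `baseline_ids`, `first_half_ids`, and `second_half_ids` are illustrative stand-ins for the arrays above.

```python
import numpy as np

# Seeded generator: an assumption added for reproducibility, not part of the cell above.
rng = np.random.default_rng(0)

# Same ranges as above; the upper bound is exclusive.
baseline_ids = rng.integers(0, 4, 1000)     # clusters 0..3
first_half_ids = rng.integers(0, 5, 1000)   # clusters 0..4
second_half_ids = rng.integers(2, 5, 1000)  # clusters 2..4

draws = {
    "baseline": baseline_ids,
    "first half": first_half_ids,
    "second half": second_half_ids,
}
for name, ids in draws.items():
    # minlength=5 keeps a slot for every cluster, even those with zero points.
    counts = np.bincount(ids, minlength=5)
    print(f"{name:>12}: {counts}")
```

Cluster 4 should have a zero count in the baseline draw, and clusters 0 and 1 should have zero counts in the second half.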

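We then generate the data by adding Gaussian noise (standard deviation 0.9) to each point's cluster center. Although this is a significant amount of noise relative to the unit-variance coordinates of the centers, the clusters remain easy to distinguish because the differences between centers accumulate across all 25 dimensions.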

In [None]:
baseline_data = cluster_centers[baseline_cluster_ids, :]
baseline_data += 0.9 * np.random.randn(*baseline_data.shape)
comparison_data = cluster_centers[comparison_cluster_ids, :]
comparison_data += 0.9 * np.random.randn(*comparison_data.shape)
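
One way to check that the clusters really are separable is to assign each noisy point to its nearest true center and see how often that recovers the original cluster id. The sketch below does this on freshly sampled data with the same shapes and noise level. The seeded generator and the variable names are assumptions for illustration.

```python
import numpy as np

# Seeded generator: an assumption added for reproducibility.
rng = np.random.default_rng(0)

centers = rng.standard_normal((5, 25))
ids = rng.integers(0, 5, 2000)
# Fancy indexing returns a copy, so adding noise does not modify `centers`.
data = centers[ids] + 0.9 * rng.standard_normal((2000, 25))

# Squared Euclidean distance from every point to every center, shape (2000, 5).
sq_dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)

# Assign each point to its nearest center and compare with the true id.
nearest = sq_dists.argmin(axis=1)
accuracy = (nearest == ids).mean()
print(f"nearest-center accuracy: {accuracy:.3f}")
```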

We'll load this all into a dataframe and assign some arbitrary timestamps. We'll pretend the baseline set all came from one point in time, while the comparison set is spread evenly over a month. We'll also add the cluster ids.

In [None]:
col_names = [f"feat_{i}" for i in range(25)]
baseline_df = pd.DataFrame(baseline_data, columns=col_names)
baseline_df["cluster"] = baseline_cluster_ids
baseline_df["timestamp"] = pd.Timestamp("2023-01-01")
comparison_df = pd.DataFrame(comparison_data, columns=col_names)
comparison_df["cluster"] = comparison_cluster_ids
comparison_df["timestamp"] = pd.date_range(
    "2023-02-01", "2023-03-01", periods=len(comparison_df)
)
# cast the baseline timestamps to the comparison column's dtype before
# concatenating; this avoids a bug in some recent versions of pandas
baseline_df["timestamp"] = baseline_df["timestamp"].astype(
    comparison_df["timestamp"].dtype
)
df = pd.concat([baseline_df, comparison_df], axis=0, ignore_index=True)
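
Because the comparison timestamps are evenly spaced and the rows are in sampling order, the drift also shows up as a change over time. The sketch below rebuilds only the cluster ids and timestamps of the comparison set, then tabulates cluster counts before and after the midpoint of the date range. The `demo` frame, the `period` column, and the seeded generator are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator: an assumption added for reproducibility.
rng = np.random.default_rng(0)

# First half draws from clusters 0..4, second half from clusters 2..4.
cluster_ids = np.concatenate(
    [rng.integers(0, 5, 1000), rng.integers(2, 5, 1000)]
)
timestamps = pd.date_range(
    "2023-02-01", "2023-03-01", periods=len(cluster_ids)
)
demo = pd.DataFrame({"cluster": cluster_ids, "timestamp": timestamps})

# 2023-02-15 is the midpoint of the 28-day range.
midpoint = pd.Timestamp("2023-02-15")
demo["period"] = np.where(
    demo["timestamp"] < midpoint, "first half", "second half"
)

# Rows are clusters, columns are the two time periods.
counts = pd.crosstab(demo["cluster"], demo["period"])
print(counts)
```

Clusters 0 and 1 should have zero counts in the second-half column.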

We convert the cluster column to a categorical type so that the cluster ids show up as a categorical variable rather than a numerical one.

In [None]:
df["cluster"] = df["cluster"].astype("category")

Now we prepare this data table for Cobalt by creating a `CobaltDataset`. We add the sampled data array as an embedding, using the Euclidean metric. We also create a `DatasetSplit` object describing the division between the original (baseline) data and the new (comparison) data: rows 0 to 999 are the baseline and rows 1000 to 2999 are the comparison set.

In [None]:
dataset = CobaltDataset(df)
dataset.add_embedding_array(
    np.concatenate([baseline_data, comparison_data]),
    metric="euclidean",
    name="numeric_cols",
)
split = DatasetSplit(
    dataset, {"baseline": range(1000), "comparison": range(1000, 3000)}
)
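
The hard-coded ranges rely on the baseline rows coming first in the concatenated table. A small sketch of deriving the same ranges from the set sizes, and checking that they cover every row exactly once, might look like the following. It is plain Python and does not call Cobalt; `n_baseline`, `n_comparison`, and `split_indices` are illustrative names.

```python
# Sizes used in this example; in the notebook these could be taken from
# len(baseline_df) and len(comparison_df).
n_baseline, n_comparison = 1000, 2000

split_indices = {
    "baseline": range(n_baseline),
    "comparison": range(n_baseline, n_baseline + n_comparison),
}

# Every row index should appear in exactly one group.
all_rows = [i for rows in split_indices.values() for i in rows]
covers_all_rows = sorted(all_rows) == list(range(n_baseline + n_comparison))
print(f"split covers every row exactly once: {covers_all_rows}")
```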

Now it's just a matter of instantiating the UI and seeing what we get.

In [None]:
w = Workspace(dataset, split)
w.ui

The colors on the graph indicate regions that primarily come from the baseline dataset (blue) or from the new comparison data (yellow). The large yellow cluster indicates a new region of the data space that was not covered by the original data distribution. This is what we expect from the construction, since cluster 4 appears only in the comparison set.
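
The same effect can be approximated outside the UI with a nearest-neighbor check. The sketch below is not how Cobalt builds or colors its graph; it is an illustrative stand-in, assuming a simple k-nearest-neighbor rule with `k = 10`. For each comparison point it finds the nearest neighbors among all points and measures what fraction of them come from the baseline set. It regenerates data with a seeded generator, so all names here are assumptions.

```python
import numpy as np

# Seeded generator: an assumption added for reproducibility.
rng = np.random.default_rng(0)

centers = rng.standard_normal((5, 25))
base_ids = rng.integers(0, 4, 1000)
comp_ids = np.concatenate(
    [rng.integers(0, 5, 1000), rng.integers(2, 5, 1000)]
)
base = centers[base_ids] + 0.9 * rng.standard_normal((1000, 25))
comp = centers[comp_ids] + 0.9 * rng.standard_normal((2000, 25))

points = np.concatenate([base, comp])
is_baseline = np.arange(len(points)) < len(base)

# Squared Euclidean distances from each comparison point to all points,
# using ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b to avoid a large 3-D array.
sq_norms = (points ** 2).sum(axis=1)
sq_dists = (
    sq_norms[len(base):, None] + sq_norms[None, :] - 2.0 * comp @ points.T
)

# Exclude each comparison point from its own neighbor list.
rows = np.arange(len(comp))
sq_dists[rows, rows + len(base)] = np.inf

# Indices of the k nearest neighbors (unordered) for each comparison point.
k = 10
neighbors = np.argpartition(sq_dists, k, axis=1)[:, :k]
baseline_frac = is_baseline[neighbors].mean(axis=1)

for c in range(5):
    frac = baseline_frac[comp_ids == c].mean()
    print(f"cluster {c}: mean baseline fraction among neighbors = {frac:.2f}")

frac_new = baseline_frac[comp_ids == 4].mean()
frac_old = baseline_frac[comp_ids != 4].mean()
```

Points from cluster 4 should have a noticeably lower baseline fraction than points from the other clusters, which mirrors the yellow region in the graph.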