Cobalt API

class cobalt.Workspace(dataset: CobaltDataset, split: DatasetSplit | SplitDescriptor | None = None, auto_graph: bool = True, run_server: bool | None = None)

Bases: object

Encapsulates analysis done with a dataset and models.

ui

A user interface that can be used to interact with the data, models, and other analysis.

run_auto_group_analysis

Whether to automatically run a group analysis of the data and models when the UI is opened, if no analysis has yet been run.

Initialize a Workspace.

Parameters:
  • dataset – The CobaltDataset to use for the analysis.

  • split – A division of the dataset into predetermined groups, e.g. test/train.

  • auto_graph – Whether to automatically run the graph creation.

  • run_server – Whether to run a web server to host images. If None (default), will run a server unless a Colab environment is detected.

The dataset split can be provided in a number of different ways.

static from_arrays(model_inputs: List | ndarray | DataFrame, model_predictions: ndarray, ground_truth: ndarray | None, task: str = 'classification', embedding: ndarray | None = None, embeddings: List[ndarray] | None = None, embedding_metric: str | None = None, embedding_metrics: List[str] | None = None, split: DatasetSplit | SplitDescriptor | None = None)

Returns a Workspace object constructed from user-defined arrays.

Parameters:
  • model_inputs – the data evaluated by the model.

  • model_predictions – the model’s predictions corresponding to model_inputs.

  • ground_truth – ground truths for model_inputs.

  • task – the model task, e.g. “classification” (the default).

  • embedding – embedding array to include.

  • embeddings – list of embedding arrays to use.

  • embedding_metric – embedding metric corresponding to embedding.

  • embedding_metrics – list of metrics corresponding to embeddings.

  • split – an optional dataset split.

At most one of embedding or embeddings (and the corresponding embedding_metric or embedding_metrics) should be provided.
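The "at most one" rule above can be illustrated with a small argument check. This is a minimal sketch, not Cobalt's implementation: `normalize_embedding_args` is a hypothetical helper, and how Cobalt itself reacts when both forms are passed is not stated in this reference.

```python
from typing import Any, List, Optional, Tuple


def normalize_embedding_args(
    embedding: Optional[Any] = None,
    embeddings: Optional[List[Any]] = None,
    embedding_metric: Optional[str] = None,
    embedding_metrics: Optional[List[str]] = None,
) -> Tuple[List[Any], List[Optional[str]]]:
    """Hypothetical helper: merge the singular and plural forms into lists."""
    if embedding is not None and embeddings is not None:
        raise ValueError("Pass at most one of `embedding` or `embeddings`.")
    if embedding is not None:
        # The singular array pairs with the singular metric.
        return [embedding], [embedding_metric]
    if embeddings is not None:
        if embedding_metrics is None:
            metrics: List[Optional[str]] = [None] * len(embeddings)
        else:
            metrics = list(embedding_metrics)
        if len(metrics) != len(embeddings):
            raise ValueError("`embedding_metrics` must match `embeddings` in length.")
        return list(embeddings), metrics
    # Neither form was given: no embeddings at all.
    return [], []


arrays, metrics = normalize_embedding_args(
    embeddings=[[[0.0, 1.0]], [[1.0, 0.0]]],
    embedding_metrics=["euclidean", "cosine"],
)
print(len(arrays), metrics)  # 2 ['euclidean', 'cosine']
```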

view_table(subset: List[int] | CobaltDataSubset | None = None, display_columns: List[str] | None = None, max_rows: int | None = None)

Returns a visualization of the dataset table.

static analyze(subset: CobaltDataSubset) Tuple[DataFrame, DataFrame]

Compute numerical and categorical statistics for the given subset.

Returns:

A tuple (numerical_statistics, categorical_statistics) giving summary statistics for numerical and categorical features in the dataset.

static feature_compare(group_1: CobaltDataSubset, group_2: CobaltDataSubset | Literal['all', 'rest', 'neighbors'], numerical_features: List[str] | None = None, categorical_features: List[str] | None = None, numerical_test: Literal['t-test'] = 't-test', categorical_test: Literal['G-test'] = 'G-test', include_nan: bool = False)

Compare the distributions of features between two subsets.

get_groups() Dict[str, CobaltDataSubset]

Get a dictionary with the currently saved groups.

Returns:

A dictionary mapping group names to CobaltDataSubsets. This dictionary is a snapshot of the currently saved groups and will not update if the groups are changed in the UI.

property saved_groups: Dict[str, CobaltDataSubset]

A dictionary of the currently saved groups.

This does not include groups selected by algorithms like find_failure_groups(), only groups saved manually in the UI or with Workspace.add_group().

Note that this is a copy of the current collection of groups; modifying it will not change the set of groups stored in the Workspace or displayed in the UI.

add_group(name: str, group: CobaltDataSubset)

Add a group to the collection of saved groups.

property graphs: Dict[str, HierarchicalPartitionGraph]

The graphs that have been created and saved.

add_graph(name: str, graph: HierarchicalPartitionGraph, subset: CobaltDataSubset, init_max_nodes: int = 500, init_max_degree: float = 15.0)

Add a graph to self.graphs.

Parameters:
  • name (str) – A name for the graph.

  • graph – The graph to add.

  • subset – The subset of the self.dataset this graph is constructed from.

  • init_max_nodes – The maximum number of nodes to show in the initial view of this graph.

  • init_max_degree – The maximum average node degree for the initial view of this graph.

new_graph(name: str | None = None, subset: str | CobaltDataSubset | None = None, embedding: int | str | Embedding = 0, metric: str | None = None, init_max_nodes: int = 500, init_max_degree: float = 15.0, **kwargs) HierarchicalPartitionGraph

Create a new graph from a specified subset.

The resulting graph will be returned and added to the Workspace.

Parameters:
  • name – The name to give the graph in self.graphs. If None, a name is chosen automatically.

  • subset – The subset of the dataset to include in the graph. If a string, will try to use a subset with that name from the dataset split or the saved groups (in that order). Otherwise, should be a CobaltDataSubset.

  • embedding – The embedding to use to generate the graph. May be specified as an index into self.dataset.embeddings, the name of the embedding, or an Embedding object.

  • metric – The distance metric to use when constructing the graph. If none is provided, will use the metric specified by the embedding.

  • init_max_nodes – The maximum number of nodes to show in the initial view of this graph.

  • init_max_degree – The maximum average node degree for the initial view of this graph.

  • **kwargs – Any additional keyword parameters will be interpreted as parameters to construct a GraphSpec object.

get_graph_level(graph: str | HierarchicalPartitionGraph, level: int, name: str | None = None) GroupCollection

Create a GroupCollection from a specified level of a graph.

This method is experimental and its interface may be changed in the future.

Parameters:
  • graph – Name of the graph to use, or the graph object itself.

  • level – The level of the graph to use for the groups. One group will be created for each node in the graph.

  • name – An optional name for the GroupCollection.

get_graph_levels(graph: str | HierarchicalPartitionGraph, min_level: int, max_level: int, name_prefix: str | None = None) Dict[int, GroupCollection]

Create GroupCollections for a range of levels of a graph.

All levels between min_level and max_level will be used. The return value is a dict mapping levels to GroupCollections.

This method is experimental and its interface may be changed in the future.

Parameters:
  • graph – Name of the graph to use, or the graph object itself.

  • min_level – The lowest level of the graph to use for the groups.

  • max_level – The highest level of the graph to use for the groups.

  • name_prefix – If provided, the GroupCollection for level i will be named “{name_prefix}_{i}”.
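The shape of the returned mapping and the naming scheme can be sketched as follows. This is only an illustration: `level_names` is a hypothetical helper, it assumes both min_level and max_level are included, and it uses None as a stand-in for whatever name is used when no prefix is given.

```python
from typing import Dict, Optional


def level_names(
    min_level: int, max_level: int, name_prefix: Optional[str] = None
) -> Dict[int, Optional[str]]:
    """Map each level in [min_level, max_level] to its collection name."""
    names = {}
    for i in range(min_level, max_level + 1):  # assumption: inclusive range
        names[i] = f"{name_prefix}_{i}" if name_prefix is not None else None
    return names


print(level_names(2, 4, "clusters"))
# {2: 'clusters_2', 3: 'clusters_3', 4: 'clusters_4'}
```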

add_evaluation_metric_values(name: str, metric_values: ArrayLike, model: int = 0, lower_values_are_better: bool = True)

Add values for a custom evaluation metric.

Parameters:
  • name – A name for this evaluation metric. This will be used to name a column in the dataset where these values will be stored, as well as to name the metric itself.

  • metric_values – An arraylike with one value for each data point in the dataset.

  • model – The index of the model that this metric evaluates.

  • lower_values_are_better – If True, Cobalt will interpret lower values of this metric as positive; otherwise, it will interpret higher values as positive.

find_drifted_groups(reference_group: str | CobaltDataSubset, comparison_group: str | CobaltDataSubset, embedding: int = 0, relative_prevalence_threshold: float = 2, p_value_threshold: float = 0.05, min_size: int = 5, run_name: str | None = None, config: Dict[str, Any] | None = None, manual: bool = True, visible: bool = True, generate_group_descriptions: bool = True, model: int = 0) GroupResultsCollection

Return groups in the comparison group that are underrepresented in the reference group.

Parameters:
  • reference_group – The reference subset of the data, e.g. the training set.

  • comparison_group – The subset of the data that may have regions that are not well represented in the reference set. This may be a test dataset or production data.

  • embedding – The embedding to use for the analysis. If none is provided, will use the default dataset embedding. (If one does not exist, will raise an error.)

  • relative_prevalence_threshold

    How much more common points from comparison_group need to be in a group relative to the overall average for it to be considered drifted. This is computed by comparing the ratio of comparison points to reference points in a group, compared with the ratio in the overall dataset. If the overall balance of points is 1:1 from each group and relative_prevalence_threshold = 2, a drifted group will have at least a 2:1 balance in favor of data points from the comparison set. If the overall ratio of points is 1:2 comparison : reference, then a drifted group will need to have at least a 1:1 ratio.

    Choose this value based on what amount of overrepresentation of the comparison group would be meaningful to you. Under the default parameter of 2, the interpretation is roughly that for any returned group, points from the comparison subset are at least twice as common as they would be in a random sample of data points.

  • p_value_threshold – Used in a significance test that the prevalence of points from the comparison group is at least as high as required based on the value of relative_prevalence_threshold.

  • min_size – The minimum number of data points a drifted region must contain. Smaller regions are dropped from the result.

  • run_name – A name under which to store the results. If one is not provided, it will be chosen automatically.

  • config – A dictionary containing further configuration parameters that will be passed to the underlying algorithm.

  • manual – Used internally to signal whether the drift analysis was created by the user.

  • visible – Whether to show the results of this analysis in the UI.

  • generate_group_descriptions – Whether to generate statistical and textual descriptions of returned groups. True by default, but consider setting to False for large datasets with many columns, as this process can be very time consuming.

  • model – Index of the model whose error metric will be shown with the returned groups.

Returns:

A GroupResultsCollection object containing the discovered drifted groups and the parameters used by the algorithm.
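The arithmetic behind relative_prevalence_threshold, as described above, can be worked through in a few lines. This is a minimal sketch of the ratio comparison only; the function names are hypothetical, and the significance test controlled by p_value_threshold is not modeled.

```python
def relative_prevalence(
    group_cmp: int, group_ref: int, total_cmp: int, total_ref: int
) -> float:
    """Comparison:reference ratio in a group, divided by the overall ratio."""
    if group_ref == 0:
        # A group with no reference points is treated as maximally drifted here.
        return float("inf")
    group_ratio = group_cmp / group_ref
    overall_ratio = total_cmp / total_ref
    return group_ratio / overall_ratio


def meets_threshold(
    group_cmp: int,
    group_ref: int,
    total_cmp: int,
    total_ref: int,
    threshold: float = 2.0,
) -> bool:
    return relative_prevalence(group_cmp, group_ref, total_cmp, total_ref) >= threshold


# Overall balance 1:1, group balance 2:1 in favor of the comparison set.
print(meets_threshold(20, 10, 500, 500))   # True
# Overall balance 1:2 (comparison : reference), group balance 1:1.
print(meets_threshold(10, 10, 500, 1000))  # True
# Overall balance 1:1, group balance 1:1: no overrepresentation.
print(meets_threshold(10, 10, 500, 500))   # False
```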

property drifted_groups: Dict[str, GroupResultsCollection]

The collection of all drifted group analysis results.

find_failure_groups(method: Literal['superlevel'] = 'superlevel', subset: str | CobaltDataSubset | None = None, model: int = 0, embedding: int | str | Embedding = 0, failure_metric: str | Series | None = None, min_size: int = 1, min_failures: int = 3, config: Dict[str, Dict] | None = None, run_name: str | None = None, manual: bool = True, visible: bool = True, generate_group_descriptions: bool = True) GroupResultsCollection

Run an analysis to find failure groups in the dataset.

Saves the results in self.failure_groups under run_name.

Parameters:
  • method – Algorithm to use for finding failure groups. Currently only “superlevel” is supported.

  • subset – The subset of the data on which to perform the analysis. If none is provided, will use the entire dataset.

  • model – Index of the model for which failure groups should be found.

  • embedding – The embedding to use for the analysis. If none is provided, will use the default dataset embedding. (If one does not exist, will raise an error.)

  • failure_metric – The performance metric to use. If a string, will use the model performance metric with that name; otherwise, must be a Pandas Series, with length either equal to the length of the specified subset, or the whole dataset. If a Series is passed, it will be added to the dataset as a model evaluation metric.

  • min_size – The minimum size for a returned failure group. Smaller groups will be discarded.

  • min_failures – The minimum number of failures in a returned failure group. Groups with fewer failures will be discarded. The default of 3 is intended to make failure patterns easier to spot. This applies only to classification tasks.

  • config – A dictionary containing further configuration parameters that will be passed to the underlying algorithm.

  • run_name – A name under which to store the results. If one is not provided, it will be chosen automatically.

  • manual – Used internally to signal whether the failure group analysis was created by the user.

  • visible – Whether to show the results of this analysis in the UI.

  • generate_group_descriptions – Whether to generate statistical and textual descriptions of returned groups. True by default, but consider setting to False for large datasets with many columns, as this process can be very time consuming.

Returns:

A GroupResultsCollection object containing the discovered failure groups and the parameters used by the algorithm.

property failure_groups: Dict[str, GroupResultsCollection]

The collection of all failure group analysis results.

find_clusters(method: Literal['modularity'] = 'modularity', subset: str | CobaltDataSubset | None = None, graph: HierarchicalPartitionGraph | None = None, embedding: int | str | Embedding = 0, min_group_size: int = 1, max_n_groups: int = 100, min_n_groups: int = 1, config: Dict[str, Any] | None = None, run_name: str | None = None, manual: bool = True, visible: bool = True, generate_group_descriptions: bool = True) GroupResultsCollection

Run an analysis to find natural clusters in the dataset.

Saves the results in self.clustering_results under run_name.

Parameters:
  • method – Algorithm to use for finding clusters. Currently only “modularity” is supported.

  • subset – The subset of the data on which to perform the analysis. If none is provided, will use the entire dataset.

  • graph – A graph to use for the clustering. If none is provided, will create a new graph based on the specified embedding. Note that if a graph is provided, it must be built on the subset specified by the subset parameter.

  • embedding – The embedding to use to create a graph if none is provided. If none is provided, will use the default dataset embedding. (If one does not exist, will raise an error.)

  • min_group_size – The minimum size for a returned cluster.

  • max_n_groups – The maximum number of clusters to return.

  • min_n_groups – The minimum number of clusters to return.

  • config – A dictionary containing further configuration parameters that will be passed to the underlying algorithm.

  • run_name – A name under which to store the results. If one is not provided, it will be chosen automatically.

  • manual – Used internally to signal whether the clustering analysis was created by the user.

  • visible – Whether to show the results of this analysis in the UI.

  • generate_group_descriptions – Whether to generate statistical and textual descriptions of returned clusters. True by default, but consider setting to False for large datasets with many columns, or when a large number of clusters is desired, as this process can be very time consuming.

Returns:

A GroupResultsCollection object containing the discovered clusters and the parameters used by the algorithm.

property clustering_results: Dict[str, GroupResultsCollection]

Results from all previous runs of the clustering algorithm.

export_groups_as_dataframe() DataFrame

Exports saved groups as a DataFrame.

The columns of the resulting DataFrame are named after the saved groups, and the column for each group contains a boolean mask indicating which data points in the dataset belong to that group.

import_groups_from_dataframe(df: DataFrame)

Imports groups from a DataFrame with one column for each group.

The name of each column will be used as the name for the group, and the entries in the column will be interpreted as boolean values indicating the membership of each data point in that group.
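The boolean-mask layout used by the export and import methods can be sketched with plain Python containers. The helpers below are hypothetical stand-ins that use a dict of lists in place of a DataFrame, and group names and indices are made up for illustration.

```python
from typing import Dict, List


def groups_to_mask_columns(
    groups: Dict[str, List[int]], n_rows: int
) -> Dict[str, List[bool]]:
    """One column per group; entry i is True when row i belongs to the group."""
    columns = {}
    for name, indices in groups.items():
        members = set(indices)
        columns[name] = [i in members for i in range(n_rows)]
    return columns


def mask_columns_to_groups(columns: Dict[str, List[bool]]) -> Dict[str, List[int]]:
    """Invert the layout: recover row indices from each boolean column."""
    return {
        name: [i for i, flag in enumerate(mask) if flag]
        for name, mask in columns.items()
    }


groups = {"noisy": [0, 3], "blurry": [1, 3, 4]}
columns = groups_to_mask_columns(groups, n_rows=5)
print(columns["noisy"])                           # [True, False, False, True, False]
print(mask_columns_to_groups(columns) == groups)  # True
```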

auto_analysis(ref: str | CobaltDataSubset, cmp: str | CobaltDataSubset, model: int = 0, embedding: int | str | Embedding = 0, failure_metric: str | Series | None = None, min_size: int = 3, min_failures: int = 3, config: Dict[str, Dict] | None = None, run_name: str | None = None, manual: bool = True, visible: bool = True)

Returns an analysis of errors and warnings with the data and model.

Parameters:
  • ref – The subset of the data on which to do the reference analysis. Users should typically pass in the training dataset.

  • cmp – The subset of the data on which to do the comparison analysis. Users may pass in a test dataset, or a production dataset.

  • model – the index of the model object you want to consider.

  • embedding – The embedding to use for the analysis. If none is provided, will use the default dataset embedding. (If one does not exist, will raise an error.)

  • failure_metric – The failure metric used to find error patterns.

  • min_size – The minimum size of a returned group.

  • min_failures – The minimum number of failures in a failure group, for a classification task.

  • config – A dictionary containing further configuration parameters that will be passed to the underlying algorithm.

  • run_name – A name under which to store the results. If one is not provided, it will be chosen automatically.

  • manual – Used internally to signal whether the analysis was created by the user.

  • visible – Whether to show the results of this analysis in the UI.

Returns:

A dictionary with keys “summaries” and “groups”.

Under “summaries” is a tuple of two DataFrames. The first is a table summarizing the discovered error groups; the second is a table summarizing the discovered warning groups.

Under “groups” is a tuple of two lists of CobaltDataSubsets, the first listing the error groups, and the second listing the warning groups.
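Unpacking a result with this structure might look like the sketch below. The dictionary here is a hand-built placeholder with strings standing in for the DataFrames and CobaltDataSubsets; it is not actual output from auto_analysis, and `unpack_auto_analysis` is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple


def unpack_auto_analysis(
    result: Dict[str, Tuple[Any, Any]]
) -> Tuple[Any, Any, List[Any], List[Any]]:
    """Split the result into error/warning summaries and error/warning groups."""
    error_summary, warning_summary = result["summaries"]
    error_groups, warning_groups = result["groups"]
    return error_summary, warning_summary, error_groups, warning_groups


# Placeholder values only, to show the shape of the return value.
result = {
    "summaries": ("<error summary table>", "<warning summary table>"),
    "groups": (["error group A", "error group B"], ["warning group A"]),
}

_, _, error_groups, warning_groups = unpack_auto_analysis(result)
print(len(error_groups), len(warning_groups))  # 2 1
```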

add_column(key: str, data, is_categorical: bool | Literal['auto'] = 'auto')

Add or replace a column in the dataset.

Parameters:
  • key – Name of the column to add.

  • data – ArrayLike of values to store in the column. Must have length equal to the length of the dataset.

  • is_categorical – Whether the column values should be treated as categorical. If “auto” (the default), will autodetect.

class cobalt.UI(workspace: Workspace, dataset: CobaltDataset, table_image_size: Tuple[int, int] = (80, 80))

Bases: object

Create a UI visualizing a Workspace.

Parameters:
  • workspace – the Workspace object that this UI will visualize

  • dataset – the CobaltDataset being analyzed

  • table_image_size – for datasets with images, the (height, width) size in pixels that these will be shown in the data table.

property overview
property discover
enable_persistent_labels()

Enable persistent labels (labels that are shown without hovering).

Requires ui.build() to already have been run. Expected behavior: as you zoom in and out of the visualization, labels appear and disappear, much as they do in Google Maps.

disable_persistent_labels()

Disable persistent labels (labels that are shown without hovering).

Requires ui.build() to already have been run.

refresh_coloring_and_table()
get_graph_and_clusters() Tuple[Graph, List[CobaltDataSubset]]

Return the current graph and the datapoints that belong to each node.

Returns:

A tuple (Graph, List[CobaltDataSubset]) containing the current graph as a networkx Graph and a list of the data points that each node represents.

Note that the graph has the same number of nodes as the number of elements in the list.

get_current_graph() HierarchicalPartitionGraph

Return the currently shown graph.

get_graph_selection() CobaltDataSubset

Return the current subset selected in the graph.

get_current_graph_source_data() CobaltDataSubset

Return the current dataset being displayed in the current graph.

Returns:

A CobaltDataSubset of the data represented by the graph. Note that if sub-sampling is enabled, this may not be the entire dataset.

build()

Construct the UI.

This normally happens automatically when the UI object appears as an output in a notebook cell.

class cobalt.CobaltDataset(dataset: DataFrame, metadata: DatasetMetadata | None = None, models: List[ModelMetadata] | None = None, embeddings: List[Embedding] | None = None, name: str | None = None, arrays: Dict[str, ndarray] | None = None)

Bases: DatasetBase

Foundational object for a Cobalt analysis.

Encapsulates all necessary information regarding the data, metadata, and model outputs associated with an analysis.

name

Optional string for dataset name

property metadata: DatasetMetadata

A DatasetMetadata object containing the metadata for this dataset.

property models: List[ModelMetadata]

The models associated with this dataset.

Each ModelMetadata object represents potential outcome, prediction, and error columns.

get_categorical_columns() List[str]
set_column_text_type(column: str, input_type: TextDataType)

Set the type for a text column in the dataset.

Options include “long_text”, which means the data in the column will be subject to keyword analysis but will not be available for coloring, and “short_text”, which prevents keyword analysis but allows categorical coloring.

property array_names: List[str]

Names of the arrays stored in this dataset.

add_model(input_columns: str | List[str] | None = None, target_column: str | List[str] | None = None, prediction_column: str | List[str] | None = None, task: str | ModelTask = 'custom', performance_columns: List[str | dict] | None = None, name: str | None = None)

Add a new model.

Parameters:
  • input_columns – The column(s) in the dataset that the model takes as input.

  • target_column – The column(s) in the dataset with the target values for the model outputs.

  • prediction_column – The column(s) in the dataset with the model’s outputs.

  • task – The task the model performs. This determines which performance metrics are calculated automatically. The default is “custom”, which does not compute any performance metrics. Other options are “regression” and “classification”.

  • performance_columns – Columns of the dataset containing pointwise model performance metrics. This can be used to add extra custom performance metrics for the model.

  • name – An optional name for the model. If one is not provided, a unique id will be generated.

property df: DataFrame

Returns a pd.DataFrame of the underlying data for this dataset.

select_col(col: str) Series

Return the values for column col of this dataset.

set_column(key: str, data, is_categorical: bool | Literal['auto'] = 'auto')

Add or replace a column in the dataset.

Parameters:
  • key – Name of the column to add.

  • data – ArrayLike of values to store in the column. Must have length equal to the length of the dataset.

  • is_categorical – Whether the column values should be treated as categorical. If “auto” (the default), will autodetect.

add_media_column(paths: List[str], local_root_path: str | None = None, column_name: str | None = None)

Add a media column to the dataset.

Parameters:
  • paths – A list or other array-like object containing the paths to the media file for each data point in the dataset.

  • local_root_path – A root path for all the paths in paths

  • column_name – The name for the column in the dataset that should store the media file paths.

get_array(key: str) ndarray

Get an array from the dataset.

add_array(key: str, array: ndarray | csr_array)

Add a new array to the dataset.

Will raise an error if an array with the given name already exists.

add_embedding_array(embedding: ndarray | Any, metric: str = 'euclidean', name: str | None = None)

Add an embedding to the dataset.

Parameters:
  • embedding – An array or arraylike object containing the embedding values. Should be two-dimensional and have the same number of rows as the dataset.

  • metric – The preferred distance metric to use with this embedding. Defaults to “euclidean”; “cosine” is another useful option.

  • name – An optional name for the embedding.

add_embedding(embedding: Embedding)

Add an Embedding object.

property embedding_metadata: List[Embedding]

The Embedding objects associated with this dataset.

embedding_names() List[str]

Return the available embedding names.

get_embedding_metadata_by_name(name: str) Embedding
compute_model_performance_metrics()

Compute the performance metrics for each model in dataset.

Adds columns to the dataset storing the computed metrics, and updates the ModelMetadata.error_column attributes correspondingly.

time_range(start_time: Timestamp, end_time: Timestamp) CobaltDataSubset

Return a CobaltDataSubset within a time range.

Parameters:
  • start_time – A pd.Timestamp marking the start of the time window.

  • end_time – A pd.Timestamp marking the end of the time window.

Returns:

A CobaltDataSubset consisting of datapoints within the range [start_time, end_time).
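The half-open interval [start_time, end_time) means the start is included and the end is excluded. A minimal sketch of that selection rule, using the standard library's datetime as a stand-in for pd.Timestamp and a hypothetical helper name:

```python
from datetime import datetime
from typing import List


def indices_in_time_range(
    timestamps: List[datetime], start_time: datetime, end_time: datetime
) -> List[int]:
    """Row indices whose timestamp t satisfies start_time <= t < end_time."""
    return [i for i, t in enumerate(timestamps) if start_time <= t < end_time]


timestamps = [datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 3)]
selected = indices_in_time_range(
    timestamps, datetime(2024, 1, 1), datetime(2024, 1, 3)
)
print(selected)  # [0, 1]
```

The row stamped exactly at end_time is left out, so consecutive windows that share a boundary do not overlap.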

subset(indices: ArrayLike) CobaltDataSubset

Returns a CobaltDataSubset consisting of rows indexed by indices.

as_subset()

Returns all rows of this CobaltDataset as a CobaltDataSubset.

sample(max_samples: int, random_state: int | None = None) CobaltDataSubset

Return a CobaltDataSubset containing up to max_samples sampled rows.

Up to max_samples rows will be sampled without replacement and returned as a CobaltDataSubset. If fewer rows exist than max_samples, all rows are returned.

Parameters:
  • max_samples – The maximum number of samples to pull.

  • random_state – An optional integer to be used as a seed for random sampling.

Returns:

A CobaltDataSubset representing up to max_samples randomly sampled datapoints.
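The sampling rule described above (without replacement, capped at the number of available rows, optionally seeded) can be sketched with the standard library. `sample_indices` is a hypothetical helper, and sorting the result is a readability choice of this sketch rather than documented Cobalt behavior.

```python
import random
from typing import List, Optional


def sample_indices(
    n_rows: int, max_samples: int, random_state: Optional[int] = None
) -> List[int]:
    """Pick up to max_samples distinct row indices from range(n_rows)."""
    rng = random.Random(random_state)
    k = min(max_samples, n_rows)  # fewer rows than requested: take them all
    return sorted(rng.sample(range(n_rows), k))


picked = sample_indices(n_rows=10, max_samples=3, random_state=0)
print(len(picked), len(set(picked)))  # 3 3
print(sample_indices(n_rows=4, max_samples=10))  # [0, 1, 2, 3]
```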

to_dict() dict

Save all information in this dataset to a dict.

to_json() str

Serialize this dataset to a JSON string.

classmethod from_json(serialized_data: str) CobaltDataset

Deserialize a JSON string into a dataset.

classmethod from_dict(data) CobaltDataset

Instantiate a CobaltDataset from a dictionary representation.

save(file_path: str | PathLike) str

Write this dataset to a .json file.

Returns the path written to.

classmethod load(file_path: str) CobaltDataset

Load a saved dataset from a .json file.

create_rich_media_table(break_newlines: bool = True, highlight_terms: Dict[str, List[str]] | None = None, run_server: bool | None = False) DataFrame

Returns a media table with image columns rendered as HTML columns.

property embedding_arrays: List[ndarray]

A list of the raw arrays for each embedding.

Deprecated. Get an embedding object from CobaltDataset.embedding_metadata and use Embedding.get() instead.

filter(condition: str) CobaltDataSubset

Returns subset where condition evaluates to True in the DataFrame.

Parameters:

condition – String predicate that is evaluated using the pd.eval function.

Returns:

Selected Subset of type CobaltDataSubset

Example

>>> import pandas as pd
>>> import cobalt
>>> df = pd.DataFrame({'a': [1, 2, 3, 4]})
>>> ds = cobalt.CobaltDataset(df)
>>> subset = ds.filter('a > 2')
>>> len(subset)
2
get_embedding(index: int = 0) ndarray

Return the embedding associated with this CobaltDataset.

get_embedding_array_by_name(name: str) ndarray
get_image_columns() List[str]

Gets image columns.

get_model_performance_data(metric: str, model_index: int) ndarray

Returns computed performance metric.

get_summary_statistics(categorical_max_unique_count: int = 10) Tuple[DataFrame, DataFrame]

Returns summary statistics for each feature in the dataset.

mask(m: ArrayLike) CobaltDataSubset

Return a CobaltDataSubset consisting of rows at indices where m is nonzero.
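"Nonzero" here covers both boolean masks and numeric ones. A minimal sketch of the index selection, using a hypothetical helper and plain lists in place of an array:

```python
from typing import List, Sequence


def mask_to_indices(m: Sequence) -> List[int]:
    """Row positions where the mask entry is nonzero (True counts as nonzero)."""
    return [i for i, value in enumerate(m) if value != 0]


print(mask_to_indices([True, False, True]))  # [0, 2]
print(mask_to_indices([0, 1, 0, 2, 1]))      # [1, 3, 4]
```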

property model
overall_model_performance_score(metric: str, model_index: int) float

Computes the mean model performance score.

overall_model_performance_scores(model_index: int) Dict[str, float]

Computes the performance score for each available metric.

class cobalt.CobaltDataSubset(source: CobaltDataset, indices: ndarray | List[int])

Bases: DatasetBase

Represents a subset of a CobaltDataset.

Should in general be constructed by calling the subset() method (or other similar methods) on a CobaltDataset or CobaltDataSubset.

In principle, this could have repeated data points, since there is no check for duplicates.

source

The CobaltDataset of which this is a subset.

indices

np.ndarray of integer row indices defining the subset.

property metadata: DatasetMetadata

A DatasetMetadata object containing the metadata for this dataset.

property models: List[ModelMetadata]

The models associated with this dataset.

Each ModelMetadata object represents potential outcome, prediction, and error columns.

property array_names: List[str]
get_categorical_columns() List[str]
property df: DataFrame

Returns a pd.DataFrame of the data represented by this data subset.

select_col(col: str) Series

Return the pd.Series for column col of this data subset.

get_array(key: str) ndarray
subset(indices: ArrayLike) CobaltDataSubset

Returns a subset obtained via indexing into self.df.

Tracks the dependency on self.source_dataset.

concatenate(dataset: CobaltDataSubset) CobaltDataSubset

Add another data subset to this one. Does not check for overlaps.

Returns:

A new CobaltDataSubset object containing points from self and the passed dataset.

Raises:

ValueError – if self and dataset have different parent datasets.

difference(dataset: CobaltDataSubset) CobaltDataSubset

Returns the subset of self that is not contained in dataset.

Raises:

ValueError – if self and dataset have different parent datasets.

intersect(dataset: CobaltDataSubset) CobaltDataSubset

Returns the intersection of self with dataset.

Raises:

ValueError – if self and dataset have different parent datasets.

to_dataset() CobaltDataset

Converts this subset to a standalone CobaltDataset.

Returns:

This subset as a standalone CobaltDataset.

Return type:

CobaltDataset

intersection_size(dataset: CobaltDataSubset) int

Returns the size of the intersection of self with dataset.

Somewhat more efficient than len(self.intersect(dataset)).

Raises:

ValueError – if self and dataset have different parent datasets.

complement() CobaltDataSubset

Returns the complement of this set in its source dataset.

as_mask_on(base_subset: CobaltDataSubset) ndarray[bool]

Returns mask of self on another subset.

Raises:

ValueError – if self is not a subset of base_subset.

as_mask() ndarray[bool]

Returns mask of self on self.source_dataset.
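The set-style methods above can be pictured as operations on lists of row indices into a shared source dataset. The sketch below uses hypothetical free functions on plain lists and assumes the subsets contain no repeated indices; it illustrates the semantics, not Cobalt's implementation.

```python
from typing import List


def difference(a: List[int], b: List[int]) -> List[int]:
    """Indices of a that do not appear in b."""
    other = set(b)
    return [i for i in a if i not in other]


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Indices of a that also appear in b."""
    other = set(b)
    return [i for i in a if i in other]


def complement(a: List[int], source_size: int) -> List[int]:
    """Indices of the source dataset that are not in a."""
    members = set(a)
    return [i for i in range(source_size) if i not in members]


def as_mask(a: List[int], source_size: int) -> List[bool]:
    """Boolean mask over the source dataset marking the rows in a."""
    members = set(a)
    return [i in members for i in range(source_size)]


a, b = [0, 1, 2, 3], [2, 3, 4]
print(difference(a, b))      # [0, 1]
print(intersect(a, b))       # [2, 3]
print(len(intersect(a, b)))  # 2
print(complement(a, 6))      # [4, 5]
print(as_mask(b, 6))         # [False, False, True, True, True, False]
```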

is_subset(other: CobaltDataSubset) bool
embedding_names() List[str]

Return the available embedding names.

property embedding_metadata: List[Embedding]
get_embedding_metadata_by_name(name: str) Embedding
get_model_performance_metrics()

Retrieve and aggregate performance metrics for each model in the subset.

This method iterates over each model and retrieves its overall performance scores.

Returns:

A dictionary structured as {model_name: {metric_name: metric_value}}, where metric_value is the computed score for each metric.

Return type:

dict
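A nested dictionary of this shape is easy to flatten into rows for printing or tabulation. The model names, metric names, and values below are made up for illustration, and `flatten_metrics` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def flatten_metrics(
    metrics: Dict[str, Dict[str, float]]
) -> List[Tuple[str, str, float]]:
    """Turn {model_name: {metric_name: value}} into (model, metric, value) rows."""
    return [
        (model_name, metric_name, value)
        for model_name, scores in metrics.items()
        for metric_name, value in scores.items()
    ]


# Illustrative values only, not output from a real model.
metrics = {
    "model_a": {"accuracy": 0.9, "error": 0.1},
    "model_b": {"accuracy": 0.8, "error": 0.2},
}
rows = flatten_metrics(metrics)
print(rows[0])    # ('model_a', 'accuracy', 0.9)
print(len(rows))  # 4
```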

get_classifier(model_type: Literal['svm', 'knn', 'rf'] | Callable[[CobaltDataSubset, CobaltDataSubset, int], Classifier] = 'knn', embedding_index: int = 0, global_set: CobaltDataSubset | None = None, params: Dict | None = None)

Build a Classifier to distinguish this subset from the rest of the data.

The classifier takes data point embeddings as an input; the specific embedding to be used can be selected by the user.

This is an experimental method and interfaces may change.

Parameters:
  • model_type – a string representing the type of model to be trained

  • embedding_index – which embedding from self.embeddings to use as inputs

  • global_set – the ambient dataset that the classifier should distinguish this subset from. If not provided, the classifier will distinguish self from self.source_dataset

  • params – a dict of keyword arguments to be passed to the classifier constructor

get_graph_inputs(embedding: int | str | Embedding) Tuple[ndarray, Embedding]
to_json() str
classmethod from_json(serialized_data) CobaltDataSubset
create_rich_media_table(break_newlines: bool = True, highlight_terms: Dict[str, List[str]] | None = None, run_server: bool | None = False) DataFrame

Returns a media table with image columns rendered as HTML columns.

property embedding_arrays: List[ndarray]

A list of the raw arrays for each embedding.

Deprecated. Get an embedding object from CobaltDataset.embedding_metadata and use Embedding.get() instead.

filter(condition: str) CobaltDataSubset

Returns subset where condition evaluates to True in the DataFrame.

Parameters:

condition – String predicate that is evaluated using the pd.eval function.

Returns:

Selected Subset of type CobaltDataSubset

Example

>>> df = pd.DataFrame({'a': [1, 2, 3, 4]})
>>> ds = cobalt.CobaltDataset(df)
>>> subset = ds.filter('a > 2')
>>> len(subset)
2
get_embedding(index: int = 0) ndarray

Return the embedding associated with this CobaltDataset.

get_embedding_array_by_name(name: str) ndarray
get_image_columns() List[str]

Gets image columns.

get_model_performance_data(metric: str, model_index: int) ndarray

Returns computed performance metric.

get_summary_statistics(categorical_max_unique_count: int = 10) Tuple[DataFrame, DataFrame]

Returns summary statistics for each feature in the dataset.

mask(m: ArrayLike) CobaltDataSubset

Return a CobaltDataSubset consisting of rows at indices where m is nonzero.

property model
overall_model_performance_score(metric: str, model_index: int) float

Computes the mean model performance score.

overall_model_performance_scores(model_index: int) Dict[str, float]

Computes the performance score for each available metric.

sample(max_samples: int, random_state: int | None = None) CobaltDataSubset

Return a CobaltDataSubset containing up to max_samples sampled rows.

Up to max_samples rows will be sampled without replacement and returned as a CobaltDataSubset. If fewer rows exist than max_samples, all rows are returned.

Parameters:
  • max_samples – An integer indicating the maximum number of samples to pull.

  • random_state – An optional integer to be used as a seed for random sampling.

Returns:

A CobaltDataSubset representing up to max_samples randomly sampled datapoints.
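The sampling behavior described above (without replacement, capped at the number of available rows, optionally seeded) can be sketched with the standard library. This is not Cobalt's implementation; sample_indices is a hypothetical helper operating on a plain list of row indices.

```python
import random
from typing import List, Optional, Sequence


def sample_indices(indices: Sequence[int], max_samples: int,
                   random_state: Optional[int] = None) -> List[int]:
    """Hypothetical helper: draw up to max_samples indices without replacement."""
    rng = random.Random(random_state)       # a fixed seed gives reproducible draws
    k = min(max_samples, len(indices))      # never ask for more rows than exist
    return rng.sample(list(indices), k)


rows = list(range(10))
picked = sample_indices(rows, 4, random_state=0)
everything = sample_indices(rows, 50, random_state=0)  # fewer rows than max_samples
print(len(picked), len(everything))
```

In this sketch the rows come back in sampled order when max_samples exceeds the number of rows; whether Cobalt preserves the original row order in that case is not stated above.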

class cobalt.ModelMetadata(outcome_columns: List[str], prediction_columns: List[str], task: ModelTask, input_columns: List[str] | None = None, error_columns: List[str] | None = None, evaluation_metrics: Sequence[EvaluationMetric | Dict] | None = None, name: str | None = None)

Bases: object

to_dict() dict
to_json() str
classmethod from_json(serialized_data) ModelMetadata
classmethod from_dict(data: dict) ModelMetadata
property prediction_column

Returns the first prediction column if len(prediction_columns) > 0, else None.

property outcome_column

Returns the first outcome column if len(outcome_columns) > 0, else None.

add_metric_column(metric_name: str, column: str, lower_values_are_better: bool = True)

Add a column from the dataset as a performance metric for this model.

Parameters:
  • metric_name – The name for the metric. If you want to compare different models using this metric, use the same name for the metric in each.

  • column – The name of the column in the dataset that contains the values of this metric for the model.

  • lower_values_are_better – Whether lower or higher values of the metric indicate better performance.

performance_metrics() Dict[str, EvaluationMetric]

Return the relevant performance metrics for this model.

The returned functions have a standard signature and return type.

Each function takes the arguments preds and target, in that order.

Each function returns a dictionary whose keys are the names of the performance metrics (submetrics) and whose values are the point-wise computations of the corresponding metric on each data point.
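To make the described signature concrete, here is a minimal sketch of a function with that shape: it takes preds and target in that order and returns a dictionary of point-wise values. The function name and the metric name "absolute error" are assumptions for illustration, and this is not Cobalt's EvaluationMetric interface.

```python
from typing import Dict, List, Sequence


def absolute_error_metric(preds: Sequence[float],
                          target: Sequence[float]) -> Dict[str, List[float]]:
    """Hypothetical metric function: one value per data point, keyed by metric name."""
    return {"absolute error": [abs(p - t) for p, t in zip(preds, target)]}


pointwise = absolute_error_metric([1.0, 2.5, 4.0], [1.0, 2.0, 6.0])
# An overall score can then be taken as the mean of the point-wise values.
overall = sum(pointwise["absolute error"]) / len(pointwise["absolute error"])
print(pointwise, overall)
```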

performance_metric_keys() List[str]
get_performance_metric_for(key: str) EvaluationMetric
calculate_performance_metric(metric_name: str, dataset: DatasetBase) np.ndarray
overall_performance_metric(metric_name: str, dataset: DatasetBase) float
overall_performance_metrics(dataset: DatasetBase) Dict[str, float]
get_confusion_matrix(dataset: DatasetBase, normalize_mode: bool | Literal['all', 'index', 'columns'] = 'index', selected_classes: List[str] | None = None) pd.DataFrame | None

Calculate the confusion matrix for the model if applicable.

Parameters:
  • dataset – The dataset containing the outcomes and predictions.

  • normalize_mode – Specifies the normalization mode for the confusion matrix.

  • selected_classes – Specifies the classes to include in the matrix, with all others aggregated as “other”.

Returns:

Confusion matrix as a DataFrame, or None if not applicable.

Return type:

Optional[pd.DataFrame]

Raises:

ValueError – If the model task is not classification.
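The sketch below illustrates one plausible reading of the parameters: classes outside selected_classes are relabeled as "other", and the counts are normalized along each row. It assumes that rows correspond to outcomes, columns to predictions, and that normalize_mode='index' means row normalization as in pandas.crosstab; none of this is confirmed by the text above, and the function is a stand-in that does not call Cobalt.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def confusion_matrix(outcomes: Sequence[str], predictions: Sequence[str],
                     selected_classes: Optional[Sequence[str]] = None,
                     normalize_rows: bool = True) -> Dict[str, Dict[str, float]]:
    """Hypothetical helper returning {outcome_class: {predicted_class: value}}."""
    def relabel(label: str) -> str:
        # Classes that were not selected are aggregated under "other".
        if selected_classes is not None and label not in selected_classes:
            return "other"
        return label

    pairs = Counter((relabel(o), relabel(p)) for o, p in zip(outcomes, predictions))
    labels = sorted({label for pair in pairs for label in pair})
    matrix = {}
    for row in labels:
        counts = {col: float(pairs.get((row, col), 0)) for col in labels}
        total = sum(counts.values())
        if normalize_rows and total > 0:
            counts = {col: value / total for col, value in counts.items()}
        matrix[row] = counts
    return matrix


outcomes = ["cat", "cat", "dog", "bird", "fish"]
predictions = ["cat", "dog", "dog", "cat", "fish"]
cm = confusion_matrix(outcomes, predictions, selected_classes=["cat", "dog"])
print(cm)
```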

get_statistic_metrics(dataset: DatasetBase, selected_classes: List[str] | None = None)

Return a DataFrame containing recall, precision, F1 score, and accuracy for each class.

This method uses the model’s confusion matrix and can filter metrics to only selected classes. Metrics calculated include recall, precision, F1 score, and accuracy.

Parameters:
  • dataset – The dataset to compute the confusion matrix.

  • selected_classes – List of classes to include in the metrics calculation. If None, metrics for all classes are calculated.

Returns:

A DataFrame with recall, precision, F1 score, and accuracy for each class.

Return type:

pd.DataFrame
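For reference, the standard per-class definitions of recall, precision, and F1 can be computed from outcome/prediction pairs as sketched below. The helper name and the convention of returning 0.0 when a denominator is zero are assumptions; per-class accuracy is left out because its exact definition is not given above.

```python
from typing import Dict, Sequence


def per_class_metrics(outcomes: Sequence[str],
                      predictions: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Hypothetical helper: recall, precision and F1 for every class."""
    result = {}
    for c in sorted(set(outcomes) | set(predictions)):
        pairs = list(zip(outcomes, predictions))
        tp = sum(1 for o, p in pairs if o == c and p == c)   # true positives
        fn = sum(1 for o, p in pairs if o == c and p != c)   # false negatives
        fp = sum(1 for o, p in pairs if o != c and p == c)   # false positives
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        result[c] = {"recall": recall, "precision": precision, "f1": f1}
    return result


class_scores = per_class_metrics(["cat", "cat", "dog", "dog"],
                                 ["cat", "dog", "dog", "dog"])
print(class_scores)
```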

class cobalt.DatasetMetadata(media_columns: List[MediaInformationColumn] | None = None, timestamp_columns: List[str] | None = None, hidable_columns: List[str] | None = None, default_columns: List[str] | None = None, other_metadata_columns: List[str] | None = None, default_topic_column: str | None = None)

Bases: object

Encapsulates various metadata about a CobaltDataset.

media_columns

Optional list of MediaInformationColumns.

timestamp_columns

Optional list of timestamp column name strings.

hidable_columns

Optional list of hidable column name strings.

default_columns

Optional list containing the names of columns to display by default in an interactive data table.

other_metadata_columns

Optional list of column name strings.

data_types

Dict mapping column names to DatasetColumnMetadata objects.

timestamp_column(index=0) str

Return the (string) name of the index-th timestamp column.

has_timestamp_column() bool
property long_text_columns: List[str]

Columns containing large amounts of text data.

These are candidates for topic or keyword analysis.

property default_topic_column: str | None

Default column to use for topic analysis.

If len(self.long_text_columns) == 0, will always be None.

property embeddings
add_embedding(*_)
to_dict() dict
to_json() str
classmethod from_json(serialized_data) DatasetMetadata
classmethod from_dict(data: dict) DatasetMetadata
class cobalt.MediaInformationColumn(column_name: str, file_type: str, host_directory: str, is_remote=False)

Bases: Column

Represent a column containing information about media files.

column_name

Name of the column in the dataframe.

Type:

str

file_type

A string indicating the file type, e.g. its extension.

Type:

str

host_directory

Path or URL where the file is located.

Type:

str

is_remote

Whether the file is remote.

is_image_type()
autoname_media_visualization_column() dict

Autoname media column.

to_dict() dict
to_json() str
classmethod from_json(serialized_data) MediaInformationColumn
get_path_to_media(run_server: bool)
class cobalt.Embedding(name=None)

Bases: ABC

Encapsulates a dataset embedding.

property name
abstract property dimension: int

The dimension of the embedding.

abstract get(dataset: DatasetBase) np.ndarray

Get the values of this embedding for a dataset.

property distance_metrics: List[str]

Suggested distance metrics for use with this embedding.

abstract get_available_distance_metrics() List[str]

Return the list of distance metrics that could be used.

class cobalt.ArrayEmbedding(array_name: str, dimension: int, metric: str, name: str | None = None)

Bases: Embedding

An embedding stored in an array associated with a Dataset.

array_name

The name of the array in the dataset storing the embedding values.

property dimension: int

The dimension of the embedding.

get(dataset: DatasetBase) np.ndarray

Return a np.ndarray of the embedding rows at specified indices.

Parameters:

dataset – Data(sub)set for which to get the embedding values.

Returns:

The np.ndarray containing the embedding values for the rows in the given dataset.

get_available_distance_metrics() List[str]

Return the list of distance metrics that could be used.

to_dict() dict
to_json() str
classmethod from_json(serialized_data) ArrayEmbedding
classmethod from_dict(data: dict) ArrayEmbedding
property distance_metrics: List[str]

Suggested distance metrics for use with this embedding.

property name
class cobalt.ColumnEmbedding(columns: List[str], metric: str, name=None)

Bases: Embedding

Represents an embedding as a column range.

columns

List of strings naming the columns to include in this embedding.

get(dataset: DatasetBase) np.ndarray

Return a np.ndarray of the embedding rows at specified indices.

Only columns specified in the columns attribute are included.

Parameters:

dataset – Data(sub)set for which to get the embedding values.

Returns:

The np.ndarray containing the embedding values for the rows in the given dataset.

get_available_distance_metrics() List[str]

Return the list of distance metrics that could be used.

property dimension: int

The dimension of the embedding.

to_dict() dict
to_json() str
classmethod from_json(serialized_data) ColumnEmbedding
classmethod from_dict(data: dict) ColumnEmbedding
property distance_metrics: List[str]

Suggested distance metrics for use with this embedding.

property name
class cobalt.DatasetSplit(dataset: CobaltDataset, split: Sequence[int] | Sequence[CobaltDataSubset | List[int] | ndarray] | Dict[str, CobaltDataSubset | List[int] | ndarray] | None = None, train: CobaltDataSubset | List[int] | ndarray | None = None, test: CobaltDataSubset | List[int] | ndarray | None = None, prod: CobaltDataSubset | List[int] | ndarray | None = None)

Bases: dict

The DatasetSplit object can contain any number of user-defined subsets of data.

This can be used to separate out training data from production data, or a baseline dataset from a comparison set, or labeled from unlabeled data, or any number of divisions. These subsets are stored as a dictionary of CobaltDataSubsets, each with a name. When an object that is not a CobaltDataSubset is added to the dictionary, it is automatically converted to a subset by calling dataset.subset(). This means that the split can be created or updated by simply adding lists of data point indices.

There are a few special subset names that will be given extra meaning by Cobalt: “train”, “test”, and “prod”. The “train” subset is meant to include data that was used to train the model under consideration, the “test” subset data that was originally used to evaluate that model, and “prod” data collected later, e.g. when the model is in production. If specified, these subsets will be used in automated failure mode and problem analyses.

Construct a DatasetSplit object.

Parameters:
  • dataset – The CobaltDataset that this separates into subsets.

  • split

    A collection of subsets. Can be given as any of the following:

    • a sequence of integers indicating how many data points fall in each split

    • a sequence of subsets

    • a dict mapping subset names to subsets.

    Subsets can be provided either as CobaltDataSubset objects or as arrays of indices into dataset. If none is provided, a single subset named “all” will be created, containing all data points.

    There are three special names for subsets, “train”, “test”, and “prod”, which are used to inform the automatic model analysis. These can also be passed as keyword parameters for convenience, e.g. DatasetSplit(dataset, train=np.arange(1000), prod=np.arange(1000,2000)).
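The accepted forms of split can be pictured as all reducing to a dictionary of named index lists. The sketch below shows one way such a normalization might work. It is not Cobalt's code: the function name, the generated names "subset_0", "subset_1", and the assumption that a sequence of sizes describes consecutive blocks of rows are all hypothetical.

```python
from collections.abc import Mapping
from typing import Dict, List


def normalize_split(n_points: int, split=None, **named) -> Dict[str, List[int]]:
    """Hypothetical helper turning the accepted split forms into {name: indices}."""
    if split is None:
        subsets = {}
    elif isinstance(split, Mapping):
        # A dict mapping subset names to index collections.
        subsets = {name: list(idx) for name, idx in split.items()}
    elif all(isinstance(s, int) for s in split):
        # A sequence of sizes, read here as consecutive blocks of rows (assumption).
        subsets, start = {}, 0
        for i, size in enumerate(split):
            subsets[f"subset_{i}"] = list(range(start, start + size))
            start += size
    else:
        # A sequence of index collections.
        subsets = {f"subset_{i}": list(idx) for i, idx in enumerate(split)}
    # Keyword arguments such as train=, test= and prod= add named subsets.
    for name, idx in named.items():
        if idx is not None:
            subsets[name] = list(idx)
    if not subsets:
        subsets = {"all": list(range(n_points))}
    return subsets


by_sizes = normalize_split(6, [4, 2])
by_name = normalize_split(6, train=range(0, 3), prod=range(3, 6))
default = normalize_split(3)
print(by_sizes, by_name, default)
```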

property has_multiple_subsets: bool

Whether this split has multiple disjoint subsets that can be compared.

property comparable_subset_pairs: List[Tuple[Tuple[str, CobaltDataSubset], Tuple[str, CobaltDataSubset]]]

Returns a list of pairs of disjoint subsets in this split, with names.

Each pair is returned in both orders.
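A small sketch of what "pairs of disjoint subsets, in both orders" could look like when subsets are represented as plain index lists. The helper name and data are made up for illustration, and the ordering of the returned list in Cobalt itself is not specified above.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def comparable_pairs(subsets: Dict[str, List[int]]) -> List[Tuple[tuple, tuple]]:
    """Hypothetical helper: every ordered pair of named subsets that share no rows."""
    return [
        ((a, subsets[a]), (b, subsets[b]))
        for a, b in permutations(subsets, 2)           # ordered pairs, so both orders appear
        if set(subsets[a]).isdisjoint(subsets[b])      # keep only non-overlapping subsets
    ]


split = {"train": [0, 1, 2], "test": [3, 4], "recent": [4, 5]}
pairs = comparable_pairs(split)
pair_names = [(first[0], second[0]) for first, second in pairs]
print(pair_names)
```

Here "test" and "recent" share row 4, so they are never paired with each other.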

property names: List[str]

Names of subsets in this split.

property train: CobaltDataSubset | None

The training subset, if it exists.

property test: CobaltDataSubset | None

The testing subset, if it exists.

property prod: CobaltDataSubset | None

The production subset, if it exists.

clear() None.  Remove all items from D.
copy() a shallow copy of D
fromkeys(value=None, /)

Create a new dictionary with keys from iterable and values set to value.

get(key, default=None, /)

Return the value for key if key is in the dictionary, else default.

items() a set-like object providing a view on D's items
keys() a set-like object providing a view on D's keys
pop(k[, d]) v, remove specified key and return the corresponding value.

If the key is not found, return the default if given; otherwise, raise a KeyError.

popitem()

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault(key, default=None, /)

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update([E, ]**F) None.  Update D from dict/iterable E and F.

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]

If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v

In either case, this is followed by: for k in F: D[k] = F[k]

values() an object providing a view on D's values
class cobalt.ProblemGroup(name: str, subset: ~cobalt.schema.dataset.CobaltDataSubset, problem_description: str, metrics: ~typing.Dict[str, float], summary: str = '', severity: float = 1.0, group_details: ~cobalt.problem_group.schema.GroupDisplayInfo = <factory>, group_type: ~cobalt.cobalt_types.GroupType = GroupType.failure, visible: bool = True, run_id: ~uuid.UUID | None = None)

Bases: Group

A group representing a problem with a model.

name: str

A name for this group.

subset: CobaltDataSubset

The subset of data where the problem is located.

problem_description: str

A brief description of the problem.

metrics: Dict[str, float]

A dictionary of relevant model performance metrics on the subset.

summary: str = ''

A brief summary of the group attributes.

severity: float = 1.0

A score representing the degree of seriousness of the problem.

Used to sort a collection of groups. Typically corresponds to the value of a performance metric on the group, and in general is only comparable within the result set of a single algorithm run.

group_details: GroupDisplayInfo

Details about the group to be displayed.

group_type: GroupType = 'Failure Group'

The type of group represented, e.g. drifted or high-error.

visible: bool = True
run_id: UUID | None = None
class cobalt.GroupResultsCollection(name: str, run_type: RunType, source_data: CobaltDataset, group_type: GroupType, algorithm: str, params: dict, groups=None, visible: bool = True, run_id: UUID | None = None)

Bases: object

Contains the results of a group analysis on a dataset.

run_type: RunType

Whether the algorithm was run manually by the user or automatically by Cobalt.

groups: List[ProblemGroup]

A list of objects containing the discovered groups, with metadata (e.g. descriptions, model performance metrics) alongside each group.

run_id: UUID

A unique ID for this collection of groups.

name: str

A name for the collection of results. May be referred to as a “run name”, since it corresponds to a particular run of an algorithm.

source_data: CobaltDataSubset

The data(sub)set used for the analysis, as a CobaltDataSubset object.

group_type: GroupType

What each group in the collection represents, e.g. a failure group or a cluster.

algorithm: str

The algorithm used to produce the groups.

params: Dict

Parameters passed to the group-finding algorithm.

visible: bool

Whether the groups should be displayed in the UI.

property raw_groups: List[CobaltDataSubset]

The groups as a list of CobaltDataSubset objects.

Omits the descriptive metadata.

summary(model: ModelMetadata | None = None, production_subset: CobaltDataSubset | None = None) DataFrame

Create a tabular summary of the groups in this collection.

Parameters:
  • model – A ModelMetadata object whose performance metrics will be computed for the groups.

  • production_subset – If provided, will calculate the fraction of data points in each group that fall in this subset.

class cobalt.MultiResolutionGraph

Bases: ABC

A graph with multiple resolution scales.

There are n_levels different graphs arranged in a hierarchical way. Each node of the graph at level i represents a subset of data points, and is a subset of some node of the graph at each level j > i.
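The nesting property can be checked mechanically. In the sketch below, each level is represented as a list of nodes and each node as a list of data point ids; this representation and the function name are assumptions, not the MultiResolutionGraph interface. Checking adjacent levels is enough, since containment is transitive.

```python
from typing import List, Sequence


def is_nested(levels: Sequence[Sequence[Sequence[int]]]) -> bool:
    """Hypothetical check: every node at level i lies inside some node at level i + 1."""
    for fine, coarse in zip(levels, levels[1:]):
        coarse_sets = [set(node) for node in coarse]
        for node in fine:
            if not any(set(node) <= candidate for candidate in coarse_sets):
                return False
    return True


hierarchy = [
    [[0], [1], [2], [3]],   # level 0: finest resolution
    [[0, 1], [2, 3]],       # level 1
    [[0, 1, 2, 3]],         # level 2: coarsest resolution
]
broken = [
    [[0, 1], [2, 3]],
    [[0], [1, 2, 3]],       # [0, 1] is not contained in any node here
]
print(is_nested(hierarchy), is_nested(broken))
```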

abstract property levels: List[AbstractGraph]

A list of graphs representing the dataset at multiple resolution scales.

to_dicts()

Produces a dictionary representation of every graph in self.levels.

n_levels: int
source_dataset: MapperMatrix | None
neighbor_graph: NeighborGraph
class cobalt.HierarchicalPartitionGraph(neighbor_graph: NeighborGraph, hierarchical_partition: HierarchicalDataPartition | None = None, filters: List[DataPartition] | None = None, L_coarseness: int | None = None, L_connectivity: int | None = None, distance_threshold: float = inf, affinity: Literal['slpi', 'exponential', 'gaussian'] = 'slpi')

Bases: MultiResolutionGraph

A MultiResolutionGraph built with a hierarchical partition.

property levels: List[DisjointPartitionGraph]

A list of graphs representing the dataset at multiple resolution scales.

property n_levels
to_dicts()

Produces a dictionary representation of every graph in self.levels.

source_dataset: MapperMatrix | None
neighbor_graph: NeighborGraph
class cobalt.DisjointPartitionGraph(neighbor_graph: NeighborGraph, filters: List[DataPartition], L_coarseness: int, L_connectivity: int, distance_threshold: float = inf, affinity: Literal['slpi', 'exponential', 'gaussian'] = 'slpi')

Bases: AbstractGraph

A graph whose nodes represent disjoint subsets of a dataset.

property edge_list: List[Tuple[int, int]]

List of tuples (i,j) representing edges i->j.

Edges should be interpreted as undirected, and only the direction with i < j will be included in the list.

property edge_mtx: ndarray

A list of edges in numpy array form.

Array is of shape (n_edges, 2), and edge_mtx[k, :] = [i, j] for an edge i->j.

property edge_weights: ndarray

Nonnegative weights for each edge.

property edges: List[Dict]

A list of dictionaries representing data for each edge.

The dictionary for an edge will contain at least “source”, “target”, and “weight” keys, but may contain additional data.

property n_edges: int

Number of edges in the graph.

property node_membership: ndarray

An array with one entry for each data point indicating the node to which that data point belongs.

property nodes: Sequence[Collection]

A list with one entry for each node, where entry i contains the data point ids represented in node i.

to_dict() Dict

Representation of the graph as a dictionary.

Has keys “nodes” and “edges”, where the entry for “nodes” is self.nodes, and the entry for “edges” is self.edges.
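The relationships between nodes, node_membership, edge_list, edge_weights, edges, and the to_dict() output can be sketched with plain Python data. The values below are invented, and the sketch assumes data point ids run from 0 to n_points - 1; it mirrors the shapes described above rather than calling Cobalt.

```python
# Three nodes that partition six data points (ids 0..5).
nodes = [[0, 1, 2], [3, 4], [5]]
# Undirected edges, listed once with i < j, and one weight per edge.
edge_list = [(0, 1), (1, 2)]
edge_weights = [0.8, 0.3]

# node_membership: for each data point, the index of the node that contains it.
node_membership = [None] * sum(len(members) for members in nodes)
for node_id, members in enumerate(nodes):
    for point in members:
        node_membership[point] = node_id

# edges: one dictionary per edge with at least source, target and weight.
edges = [
    {"source": i, "target": j, "weight": w}
    for (i, j), w in zip(edge_list, edge_weights)
]

graph_dict = {"nodes": nodes, "edges": edges}
print(node_membership)
print(graph_dict)
```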

source_dataset: MapperMatrix | None
class cobalt.GraphSpec(X: ndarray, metric: str, M: int | None = None, K: int | None = None, min_nbrs: int | None = None, L_coarseness: int = 20, L_connectivity: int = 20, filters: Sequence[FilterSpec] = ())

Bases: object

A set of parameters for creating a graph.

X: ndarray

The source data. Should be an array of shape (n_points, n_dims).

metric: str

The name of the distance metric to use to create the graph.

M: int | None = None

The number of nearest neighbors to compute for each data point.

K: int | None = None

The number of mutual nearest neighbors to keep for each data point.

min_nbrs: int | None = None

The minimum number of neighbors to keep for each data point.

L_coarseness: int = 20

The number of neighbors to keep for each data point when clustering data points into graph nodes.

L_connectivity: int = 20

The number of neighbors to keep for each data point when connecting nodes in the graph.

filters: Sequence[FilterSpec] = ()

A (possibly empty) list of FilterSpec objects that describe filters to apply to the graph.

class cobalt.FilterSpec(f_vals: ndarray, n_bins: int = 10, bin_method: Literal['rng', 'uni'] = 'rng', pruning_method: Literal['bin', 'pct'] = 'bin', pruning_threshold: int | float = 1)

Bases: object

A set of parameters for a filter on a graph.

Separates the dataset into n_bins bins, based on the values of f_vals for each data point. Data points within each bin are clustered to form nodes, and are linked together if they are in nearby bins.

f_vals: ndarray

An array of values, one for each data point.

n_bins: int = 10

The number of bins to separate the dataset into.

bin_method: Literal['rng', 'uni'] = 'rng'

Either “rng” or “uni”. If “rng”, the bins will have equal width; if “uni” they will have equal numbers of data points.

pruning_method: Literal['bin', 'pct'] = 'bin'

Either “bin” or “pct”. If “bin”, will only allow edges between nodes from nearby bins. If “pct”, will only allow edges between nodes whose percentile difference for f_vals is within the given threshold.

pruning_threshold: int | float = 1

The maximum distance two nodes can be apart while still being connected.
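The difference between the two bin_method options described above can be seen on a small skewed example. The two helpers below are illustrative stand-ins written for this sketch, not Cobalt functions; how Cobalt treats ties and bin edges is not specified here.

```python
from typing import List, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Bins of equal width over the value range (the behavior described for "rng")."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # fall back to 1.0 if all values are equal
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_count_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Bins holding roughly equal numbers of points (the behavior described for "uni")."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


f_vals = [0.0, 1.0, 2.0, 3.0, 100.0, 101.0]
width_bins = equal_width_bins(f_vals, 2)
count_bins = equal_count_bins(f_vals, 2)
print(width_bins)
print(count_bins)
```

With skewed values, equal-width bins put four points in the first bin and two in the second, while equal-count bins put three in each.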

cobalt.load_tabular_dataset(df: DataFrame, embeddings: DataFrame | ndarray | List[str] | Literal['numeric_cols', 'rf'] | None = None, rf_source_columns: List[str] | None = None, metadata_df: DataFrame | None = None, timestamp_col: str | None = None, outcome_col: str | None = None, prediction_col: str | None = None, other_metadata: List[str] | None = None, hidden_cols: List[str] | None = None, baseline_column: str | None = None, baseline_end_time: Timestamp | None = None, baseline_indices: List[int] | ndarray | None = None, split_column: str | None = None, embedding_metric: str = 'euclidean', task: Literal['classification', 'regression'] | None = None, model_name: str | None = None) Tuple[CobaltDataset, DatasetSplit]

Loads tabular data from a pandas DataFrame into a CobaltDataset.

Separate dataframes are used to specify the source data, embedding columns, and (optionally) metadata columns.

Note: This function is deprecated. Users should transition to constructing a CobaltDataset directly from a DataFrame.

Parameters:
  • df – A pandas.DataFrame containing the tabular source data.

  • embeddings – Specifies which data to use as embedding columns. May be of type pandas.DataFrame, np.ndarray, List[str], or may be “numeric_cols” or “rf”, indicating that the numeric columns contained in df should be used as the embedding vectors, or that a random forest embedding will be generated, respectively.

  • rf_source_columns – Columns to use in the random forest embedding. If embeddings == ‘rf’ then use rf_source_columns as the input columns for the RF embedding. If embeddings == ‘rf’ and rf_source_columns is None then the random forest embedding columns will default to all of the numerical columns within the dataframe (df).

  • metadata_df – Optional pandas.DataFrame containing additional metadata columns. All specified columns may be in either df or metadata_df. All other (non-hidden) columns in this dataframe will be treated as if they were in other_metadata.

  • timestamp_col – String name of the column containing datapoint timestamps.

  • outcome_col – String name of the column containing a numeric or categorical outcome variable. (E.g., the variable ‘y’.)

  • prediction_col – String name of the column containing model predictions. (E.g., the variable ‘ŷ’.)

  • other_metadata – Optional list of strings indicating other metadata columns in df. The Workspace may use this information to decide what to display, e.g. as options for coloring a visualization.

  • hidden_cols – Optional list of strings indicating columns that will not be displayed in Cobalt TableViews.

  • baseline_column – Optional string name of an indicator (boolean) column marking datapoints as belonging to the baseline set. One of three options for specifying the baseline set, along with baseline_end_time and baseline_indices.

  • baseline_end_time – An optional pd.Timestamp; datapoints whose value in timestamp_col is less than or equal to this value will be marked as baseline. Ignored if baseline_column is specified.

  • baseline_indices – Optional list or np.ndarray of row indices, indicating datapoints belonging to the baseline set. Ignored if baseline_column or baseline_end_time is set.

  • split_column – The name of a categorical column containing labels of which split of the dataset each data point belongs to. These splits will be available as data sources in the UI.

  • embedding_metric – String indicating the type of metric to be used with the specified data embedding. Default: “euclidean”.

  • task – The type of task performed by the model being debugged. Can currently be set to either “regression” or “classification”.

  • model_name – A string name to refer to the model being analyzed.

Returns:

A (CobaltDataset, DatasetSplit) tuple.

Raises:
  • ValueError – timestamp_col was not specified or was of incorrect type.

  • ValueError – None of baseline_column, baseline_end_time, or baseline_indices was specified.

  • ValueError – The number of embedding vectors does not exactly match the number of datapoints.

  • ValueError – outcome_col or prediction_col dtypes do not match.

  • ValueError – Mismatch between outcome_type and detected dtype of outcome or prediction columns.
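The precedence among the three baseline options (baseline_column first, then baseline_end_time, then baseline_indices) can be sketched as follows. Rows are represented as plain dictionaries and the helper name is hypothetical; this illustrates the documented ordering only and is not the loader's actual code.

```python
from datetime import datetime
from typing import List, Optional, Sequence


def baseline_rows(rows: Sequence[dict],
                  baseline_column: Optional[str] = None,
                  timestamp_col: Optional[str] = None,
                  baseline_end_time: Optional[datetime] = None,
                  baseline_indices: Optional[Sequence[int]] = None) -> List[int]:
    """Hypothetical helper: resolve the baseline set using the documented precedence."""
    if baseline_column is not None:
        # Highest priority: a boolean indicator column.
        return [i for i, row in enumerate(rows) if row[baseline_column]]
    if baseline_end_time is not None:
        # Next: rows whose timestamp is at or before the cutoff.
        return [i for i, row in enumerate(rows) if row[timestamp_col] <= baseline_end_time]
    if baseline_indices is not None:
        # Last: explicit row indices.
        return list(baseline_indices)
    raise ValueError("none of the baseline options was specified")


rows = [
    {"ts": datetime(2024, 1, 1), "is_baseline": False},
    {"ts": datetime(2024, 2, 1), "is_baseline": True},
    {"ts": datetime(2024, 3, 1), "is_baseline": False},
]
cutoff = datetime(2024, 2, 1)
by_time = baseline_rows(rows, timestamp_col="ts", baseline_end_time=cutoff)
by_column = baseline_rows(rows, baseline_column="is_baseline",
                          timestamp_col="ts", baseline_end_time=cutoff)
by_index = baseline_rows(rows, baseline_indices=[2])
print(by_time, by_column, by_index)
```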

cobalt.get_tabular_embeddings(df: DataFrame, model_name: Literal['rf'] | None = None, outcome: str | None = None) Tuple[ndarray, str, str]

Create an embedding array based on the given df and embedding method.

Note that the design of this function is in flux. Currently supports generating embeddings via a random forest model.

Parameters:
  • df – pandas.DataFrame containing the data.

  • model_name – String naming the model used to generate the embedding. Currently “rf” (random forest) is the only named option.

  • outcome – String name of the desired outcome column in df, for method == “model” embeddings.

Returns:

a tuple (embedding_array, metric, name).

cobalt.check_license()

Check the configured license key and print the result.

cobalt.setup_api_client()

Set up the API client by updating or adding the API key to the JSON config file.

cobalt.get_api_client(api_name: str = 'openai')

Get the API client by loading the API key from the JSON config or environment variables.

cobalt.setup_license()

Prompts for a license key and sets it in the configuration file.

The license key will be saved in ~/.config/cobalt/cobalt.json.

cobalt.register_license()

Registers this installation of Cobalt for noncommercial or trial usage.

Requests your name and email address and configures a license key. If you have already registered Cobalt on a different computer, this will link your computer with the previous registration.

Lab Functionality

The lab submodule contains preliminary and experimental functionality.

APIs in this module are subject to change without warning. Please contact us with any questions or feedback.

cobalt.lab.describe_groups_multiresolution(ds: CobaltDataset, text_column_name: str, n_gram_range: str | Tuple, aggregation_columns: List[str] | None = None, min_level: int = 0, max_level: int | None = None, max_keywords: int = 3, aggregation_method: Literal['all', 'mean'] | List[Callable] = 'mean', return_intermediates: bool = False) Tuple[DataFrame, Workspace, Dict[int, Dict[int, str]]]

Returns a summary of groups in a set of texts.

This builds a multiresolution graph from the embeddings provided in the input dataset, and for a range of coarseness levels, computes a keyword description of the text contained in each node, and returns this information in a DataFrame.

Optionally also returns a Workspace object that can be used to access the graph and explore the results further.

Parameters:
  • ds (CobaltDataset) – Dataset (containing an embedding of the text data)

  • text_column_name (str) – Column containing text data for keyword analysis.

  • n_gram_range (Union[str, Tuple]) – Whether to analyze keywords with unigrams, bigrams, or a combination.

  • aggregation_columns – Columns in ds to aggregate.

  • min_level (int) – Minimum graph level to output cluster labels for.

  • max_level (int) – Maximum graph level to output cluster labels for.

  • max_keywords (int) – Maximum number of keywords to find for each cluster.

  • max_neighbors (int) – Maximum number of neighbors to return in table.

  • aggregation_method – Method(s) to aggregate columns by.

  • return_intermediates (bool) – Whether to return intermediate results.

Returns:

A tuple consisting of a pd.DataFrame per level with the labels for each cluster, a Workspace object and the raw labels per level per node.

class cobalt.GroupCollection(source_dataset: CobaltDataset, indices: Sequence[Sequence[int]], name: str | None = None, group_type: GroupType = GroupType.any)

A collection of groups from a source CobaltDataset.

A group consists of a subset of data points together with some metadata about the subset. This metadata can include things like:

  • A name for the group

  • Distinctive keywords for the group

  • Model performance metrics on the group

  • Distinctive features for the group

The schema for metadata is defined in the GroupMetadata class.

The groups in a collection are stored in a specific order, and can be accessed by indexing, e.g. collection[0] to get the first group. If a group has been assigned a name, it can also be accessed by name, e.g. collection["group name"]. This will return the CobaltDataSubset containing the data points in the group. To access the metadata for a group, index into collection.metadata in the same way.

It should not usually be necessary to manually instantiate GroupCollection objects, but they will be returned by various Cobalt methods and functions.

The GroupCollection interface is under development and changes may be made in the near future.

classmethod from_groups(groups: Sequence[GroupMetadata])

Create a GroupCollection from a list of GroupMetadata objects.

classmethod from_subset_collection(subsets: SubsetCollection)

Promote a SubsetCollection to a GroupCollection.

This allows adding metadata to each subset.

property metadata: GroupMetadataIndexer

Get a group together with its metadata.

compute_group_keywords(col: str | None = None, n_keywords: int = 10, set_names: bool = False, **kwargs)

Find distinctive keywords for each group and store them in the group metadata.

Parameters:
  • col – The column containing text from which to extract keywords.

  • n_keywords – The number of keywords to find for each group.

  • set_names – If True, will set each group’s name based on the discovered keywords, using the default parameters to set_names_from_keywords().

set_names_from_keywords(col: str, n_keywords: int = 3, delimiter: str = ', ', min_match_rate: float = 0.0)

Set names for each group based on already-computed keywords.

Names groups with a string containing a number of the top keywords found for that group.

If two groups would end up with the same name, groups after the first will be named with a number to ensure names are unique.

Parameters:
  • col – The column whose keywords should be used to create the group names.

  • n_keywords – The number of keywords to use to form each name.

  • delimiter – The character(s) that should separate keywords from each other in the group names.

  • min_match_rate – The minimum fraction of data points in the group that should contain a keyword in order for it to be used in the group name.
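A rough sketch of how names might be assembled from ranked keywords using these parameters. The input format (a ranked list of (keyword, match_rate) pairs per group), the helper name, and the " (2)" suffix used to keep names unique are all assumptions; the text above only says that later duplicates are numbered.

```python
from typing import Dict, List, Sequence, Tuple


def names_from_keywords(group_keywords: Sequence[Sequence[Tuple[str, float]]],
                        n_keywords: int = 3, delimiter: str = ", ",
                        min_match_rate: float = 0.0) -> List[str]:
    """Hypothetical helper: build one unique name per group from its top keywords."""
    names: List[str] = []
    seen: Dict[str, int] = {}
    for keywords in group_keywords:
        # Drop keywords that match too few rows, then keep the top n_keywords.
        kept = [word for word, rate in keywords if rate >= min_match_rate][:n_keywords]
        base = delimiter.join(kept)
        seen[base] = seen.get(base, 0) + 1
        # The first group keeps the plain name; later duplicates get a number.
        names.append(base if seen[base] == 1 else f"{base} ({seen[base]})")
    return names


groups = [
    [("refund", 0.9), ("late", 0.6), ("box", 0.1)],
    [("refund", 0.8), ("late", 0.5)],
    [("login", 0.7), ("reset", 0.4)],
]
names = names_from_keywords(groups, n_keywords=2)
strict = names_from_keywords(groups, n_keywords=3, min_match_rate=0.5)
print(names)
print(strict)
```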

set_names_sequential(prefix: str | None = None, prefix_source: Literal['group_type', 'collection_name'] = 'group_type', sep: str = ' ')

Set names for each group sequentially with a prefix string.
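A rough sketch of sequential naming follows. The helper is hypothetical and not Cobalt's implementation: it assumes the name is the prefix, the separator, and a running number, that numbering starts at 1, and that when no prefix is given a fallback taken from the group type or collection name (per prefix_source) is used. None of these details are confirmed by this reference.

```python
from typing import List, Optional


def sequential_names(
    n_groups: int,
    prefix: Optional[str] = None,
    fallback_prefix: str = "Group",  # assumed to come from group_type or collection name
    sep: str = " ",
    start: int = 1,  # assumed starting number
) -> List[str]:
    """Return names of the form '<prefix><sep><number>' for n_groups groups."""
    chosen = prefix if prefix is not None else fallback_prefix
    return [f"{chosen}{sep}{i}" for i in range(start, start + n_groups)]


default_names = sequential_names(3)
custom_names = sequential_names(2, prefix="cluster", sep="-")
```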

aggregate_col(col: str, method: Literal['mean', 'sum', 'mode'] | Callable[[Series], Any] | None = None) → Sequence[float]

Aggregate the values of a column within each subset using the specified method.
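Per-subset aggregation can be sketched in plain Python. This is an illustration under stated assumptions, not Cobalt's code: plain lists stand in for the column and for each subset's indices, aggregate_per_subset is a hypothetical helper, and the behavior of method=None is not modeled because this reference does not describe it.

```python
from statistics import mean, mode
from typing import Any, Callable, List, Sequence, Union

# Named methods from the signature mapped to standard-library equivalents.
AGGREGATORS = {"mean": mean, "sum": sum, "mode": mode}


def aggregate_per_subset(
    values: Sequence[Any],
    subsets: Sequence[Sequence[int]],
    method: Union[str, Callable[[List[Any]], Any]] = "mean",
) -> List[Any]:
    """Apply one aggregation to the column values of each subset."""
    func = AGGREGATORS[method] if isinstance(method, str) else method
    return [func([values[i] for i in indices]) for indices in subsets]


column = [1.0, 3.0, 2.0, 4.0, 6.0]
subsets = [[0, 1], [2, 3, 4]]
means = aggregate_per_subset(column, subsets, "mean")
maxima = aggregate_per_subset(column, subsets, max)  # custom callable
```

The result has one entry per subset, in the same order as the subsets.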

concatenate() → CobaltDataSubset

Concatenate all subsets in the collection.

evaluate_model(model: ModelMetadata | str, metrics: Sequence[str] | None = None) → DataFrame

Produce a dataframe containing model performance metrics for each group.

Parameters:
  • model – The name of the model to evaluate, or a ModelMetadata object.

  • metrics – Names of the metrics to evaluate on the model. By default, will use all metrics defined for the model.

get_array(key: str) → Sequence[ndarray]

Retrieve the slice of an array for each subset.

is_pairwise_disjoint()

Return True if there are no overlaps between subsets, False otherwise.
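The check can be illustrated on raw index lists. The function below is a hypothetical sketch of the idea, assuming each subset is represented by the indices of its data points; it is not Cobalt's implementation.

```python
from typing import Sequence, Set


def indices_pairwise_disjoint(subsets: Sequence[Sequence[int]]) -> bool:
    """Return True if no data point index appears in more than one subset."""
    seen: Set[int] = set()
    for indices in subsets:
        current = set(indices)
        if seen & current:  # an index was already claimed by an earlier subset
            return False
        seen |= current
    return True


disjoint = indices_pairwise_disjoint([[0, 1], [2, 3]])
overlapping = indices_pairwise_disjoint([[0, 1], [1, 2]])
```

Tracking a single running set of seen indices avoids comparing every pair of subsets explicitly.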

select_col(col: str) → Sequence[Series]

Retrieve the values of a column on each subset.

compare_models(models: Sequence[ModelMetadata | str], metrics: List[str], select_best_model: bool = True, statistical_test: Literal['t-test', 'wilcoxon'] | None = None) → DataFrame

Produce a dataframe comparing two or more models on each group.

Evaluates each specified metric for each model on each group and stores the values in a column named “model_name_metric_name”. If select_best_model is True, also includes a column indicating the best model for each group with respect to each metric, as well as the change in performance compared to the next-best model. If statistical_test is specified, also tests whether the performance difference between the two models on each group is statistically significant. The resulting p-values are not currently adjusted for multiple comparisons.
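The shape of such a comparison table can be sketched with plain dictionaries, one per group. This is an illustration only: compare_rows is hypothetical, the column names "best_model_<metric>" and "margin_<metric>" are invented for the example, and the sketch assumes that higher metric values are better. Only the "model_name_metric_name" pattern comes from the description above.

```python
from typing import Dict, List


def compare_rows(scores: Dict[str, List[float]], metric: str) -> List[dict]:
    """Build one row per group from per-model, per-group metric values."""
    n_groups = len(next(iter(scores.values())))
    rows = []
    for g in range(n_groups):
        # One column per model, named "<model_name>_<metric_name>".
        row = {f"{model}_{metric}": vals[g] for model, vals in scores.items()}
        # Rank models on this group (assumption: higher is better).
        ranked = sorted(scores, key=lambda m: scores[m][g], reverse=True)
        row[f"best_model_{metric}"] = ranked[0]  # hypothetical column name
        # Gap between the best and the next-best model on this group.
        row[f"margin_{metric}"] = scores[ranked[0]][g] - scores[ranked[1]][g]
        rows.append(row)
    return rows


example_scores = {"baseline": [0.5, 0.75], "tuned": [0.75, 0.25]}
example_rows = compare_rows(example_scores, "accuracy")
```

Because any p-values are reported unadjusted, a caller testing many groups may want to apply a multiple-comparison correction before interpreting them.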

class cobalt.schema.group_collection.GroupMetadata(subset: CobaltDataSubset, name: str | None = None, metrics: Dict[str, float] = <factory>, description: str | None = None, display_info: GroupDisplayInfo = <factory>, keywords: Dict[str, GroupKeywords] = <factory>, group_type: GroupType = <GroupType.any: 'Group'>)

subset: CobaltDataSubset

The data points included in this group.

name: str | None = None

The group’s name. Should be unique within a SubsetCollection.

metrics: Dict[str, float]

Relevant numeric metrics for this group.

description: str | None = None

A short description of the contents of the group.

display_info: GroupDisplayInfo

Information to be displayed in the group explorer in the UI.

keywords: Dict[str, GroupKeywords]

Distinctive keywords found in text columns in the group.

group_type: GroupType = 'Group'

Describes the semantic meaning of the group in context.

class cobalt.SubsetCollection(source_dataset: CobaltDataset, indices: Sequence[Sequence[int]], name: str | None = None)

A collection of subsets of a CobaltDataset.
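The indices argument holds one sequence of data point indices per subset. How such index lists relate to per-subset values and to a combined subset can be sketched as follows; both helpers are hypothetical, plain lists stand in for dataset columns, and whether Cobalt's concatenate() removes duplicate points from overlapping subsets is not stated here, so the sketch keeps them.

```python
from typing import Any, List, Sequence


def slice_per_subset(values: Sequence[Any], indices: Sequence[Sequence[int]]) -> List[List[Any]]:
    """Return the values belonging to each subset, one list per subset."""
    return [[values[i] for i in subset] for subset in indices]


def concatenate_indices(indices: Sequence[Sequence[int]]) -> List[int]:
    """Join all subsets' indices in order (duplicates kept in this sketch)."""
    return [i for subset in indices for i in subset]


labels = ["a", "b", "c", "d"]
subset_indices = [[0, 2], [3]]
per_subset_labels = slice_per_subset(labels, subset_indices)
all_indices = concatenate_indices(subset_indices)
```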

select_col(col: str) → Sequence[Series]

Retrieve the values of a column on each subset.

aggregate_col(col: str, method: Literal['mean', 'sum', 'mode'] | Callable[[Series], Any] | None = None) → Sequence[float]

Aggregate the values of a column within each subset using the specified method.

get_array(key: str) → Sequence[ndarray]

Retrieve the slice of an array for each subset.

concatenate() → CobaltDataSubset

Concatenate all subsets in the collection.

is_pairwise_disjoint()

Return True if there are no overlaps between subsets, False otherwise.