# Inspecting Model Performance

## Introduction to Model Metrics View

The Model Metrics View, accessible by clicking the bar chart icon (<img src="/files/IeYV79XvHp3LxmyzTEWZ" alt="" data-size="line">) in the left-hand navigation bar, is the primary way to interact with your models' aggregate performance in Aquarium.

This page will cover how to use the different features within Aquarium so your team can analyze your model's performance and more efficiently find insights in the underlying datasets.

**To get started, select a dataset and at least one inference set in the project path underneath the top navigation bar.**

<figure><img src="/files/d11F7LcUz8eqdIY0uuc0" alt=""><figcaption><p>Example of what your view will look like with a dataset and an inference selected</p></figcaption></figure>

The Model Metrics View is split into two tabs, each covered in more detail in later sections:

#### **Scenarios**

{% hint style="info" %}
Using Scenarios requires setting up model performance segments within your dataset. [Learn more about organizing your data with Segments.](/aquarium/working-in-aquarium/organizing-your-data.md)
{% endhint %}

The Scenarios tab provides a summary view of your model's performance against pre-defined subsets of your dataset.

* Scenarios allow you to define a target threshold for your models' performance against a known set of frames, and then evaluate all inference sets against those thresholds. This may be as simple as reaching a target F1 score on the test set, or as complex as multi-metric pass/fail regression tests against domain-specific problems.

#### **Metrics**

The Metrics tab provides a high-level overview of the overall dataset or an individual segment. It offers additional drill-through capability beyond the Scenarios view, including:

* A [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) that provides per-class precision, recall, and F1 metrics.
* A [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) that lays out the number of each confusion in the dataset.
* A [precision-recall (PR) curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html) that illustrates the tradeoff between precision and recall given a confidence threshold.
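
To make the relationship between these views concrete, here is a minimal sketch of how per-class precision, recall, and F1 follow from confusion-matrix cell counts. It is plain Python for illustration only, not Aquarium's implementation; the function names and the sample labels are hypothetical.

```python
from collections import Counter

def confusion_counts(labels, predictions):
    """Count (ground-truth, predicted) pairs; each key is one confusion-matrix cell."""
    return Counter(zip(labels, predictions))

def classification_report(labels, predictions):
    """Return {class: {"precision", "recall", "f1"}} derived from the cell counts."""
    cells = confusion_counts(labels, predictions)
    report = {}
    for cls in sorted(set(labels) | set(predictions)):
        tp = cells[(cls, cls)]
        # False positives: predicted as cls, labeled as something else.
        fp = sum(n for (gt, pred), n in cells.items() if pred == cls and gt != cls)
        # False negatives: labeled as cls, predicted as something else.
        fn = sum(n for (gt, pred), n in cells.items() if gt == cls and pred != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return report

# Hypothetical sample data.
labels      = ["plane", "plane", "plane", "bird", "bird"]
predictions = ["plane", "plane", "bird",  "bird", "plane"]
report = classification_report(labels, predictions)
```

The off-diagonal cells (for example, a `plane` labeled object predicted as `bird`) are the "confusions" that the confusion matrix counts.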

## Configuring Thresholds for Model Performance Segments

When you create a Model Performance type segment, you can set thresholds for both precision and recall to easily evaluate the performance of an inference set. These thresholds are especially useful when used with Regression Test type segments because your teams can query the API and retrieve a pass/fail result.

<details>

<summary>Example Payload From Querying the Results of a Regression Test</summary>

The payload always includes a similar result for your entire dataset, named `All Frames`. Each `succeeds` value is `true` or `false` depending on whether the score meets the threshold criteria.

```json
[
   {
      "segment_name":"Far away planes",
      "frame_count":12,
      "precision":{
         "score":0.9716981132075472,
         "threshold":0.8,
         "succeeds":true
      },
      "recall":{
         "score":0.7410071942446043,
         "threshold":0.8,
         "succeeds":false
      },
      "f1_score":{
         "score":0.8408163265306122,
         "threshold":0.7,
         "succeeds":true
      },
      "false_positives":{
         "score":0
      },
      "false_negatives":{
         "score":33
      },
      "total_confusions":{
         "score":3
      },
      "segment_type":"Regression Test",
      "uuid":"UUID_Value"
   }
]
```

</details>
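
If your team consumes these results in a script (for example, as a deployment gate), a sketch like the following could turn a payload into a pass/fail decision. It assumes only the payload shape shown in the example above; the inline payload is a trimmed, hypothetical sample, the helper names are made up for illustration, and how you fetch the payload from the API is not shown here.

```python
import json

# Hypothetical, trimmed payload with the same shape as the example above.
payload = json.loads("""
[
  {
    "segment_name": "Far away planes",
    "precision": {"score": 0.97, "threshold": 0.8, "succeeds": true},
    "recall":    {"score": 0.74, "threshold": 0.8, "succeeds": false},
    "f1_score":  {"score": 0.84, "threshold": 0.7, "succeeds": true},
    "segment_type": "Regression Test"
  }
]
""")

# Metrics that carry a threshold and a "succeeds" flag in the example payload.
THRESHOLDED_METRICS = ("precision", "recall", "f1_score")

def failing_metrics(segment):
    """List the thresholded metrics whose "succeeds" flag is false."""
    return [
        name for name in THRESHOLDED_METRICS
        if name in segment and segment[name].get("succeeds") is False
    ]

def regression_failures(results):
    """Map each failing Regression Test segment to its failing metrics."""
    failures = {}
    for segment in results:
        if segment.get("segment_type") != "Regression Test":
            continue
        failed = failing_metrics(segment)
        if failed:
            failures[segment["segment_name"]] = failed
    return failures

failures = regression_failures(payload)
```

An empty `failures` dictionary would mean every regression test met its thresholds.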

### Setting a Threshold for a Model Performance Segment

There are two ways to reach the page where you set a model performance segment's thresholds.

1. From a segment overview page, click on the Metrics tab.

<figure><img src="/files/jgo1ZfiZK6mA8LVbQ65a" alt=""><figcaption><p>Metrics tab on Segment Overview page</p></figcaption></figure>

2\. From the Model Metrics View, click on the fly out button in one of the Scenario cards.

<figure><img src="/files/3tUFJc8rrvTUHBa8bo1O" alt=""><figcaption><p>Button to take you from the Model Metrics Scenarios view to the Specific Metrics page to set thresholds</p></figcaption></figure>

On the metrics page, for each metric, you'll see a dot plotted representing each inference set related to the base labeled dataset.

<figure><img src="/files/eCZQaCYjgqbq3gEIM541" alt=""><figcaption><p>Arrow highlighting each inference set down below for each metric</p></figcaption></figure>

Once you have navigated to the Metrics page, to set a threshold:

1. Click the gear button (<img src="/files/1nMTNKimhT1vdkrtz4Da" alt="" data-size="line">).
2. Enter a number or use the arrows to set the desired threshold value, from 0.0 to 1.0 (*you must type the leading 0, e.g. 0.8*).
3. Click anywhere on screen outside the input box for the value to take effect.

<figure><img src="/files/gpnRNoTIB5dTQ0jJJ7Pg" alt=""><figcaption><p>How to set a threshold</p></figcaption></figure>

Once you set a threshold, your values turn green or red depending on whether your inference set metrics are over or under that threshold.

Once you set the thresholds, you'll also see dotted lines that represent the threshold values superimposed on the PR curves in the Model Metrics View:

<figure><img src="/files/1XCZ6B3eGLFcbgL3tlIZ" alt=""><figcaption><p>On the left you can see an example segment with Thresholds set and one with no thresholds on the right</p></figcaption></figure>

## Scenarios Tab

The Scenarios tab summarizes your models' performance across all defined Model Performance Segments.

Model Performance Segments are grouped into three primary categories:

* **Splits**
  * Always includes a segment card for All Frames in the dataset.
  * Typically the test, training and validation subsets of your dataset.
* **Regression Tests**
  * Sets of frames within your dataset on which the model must meet a certain threshold in order to be considered for deployment.
  * Regression tests might be tied to overall business goals, specific model development experiments, domain specific difficult performance scenarios, etc.
* **Scenarios**
  * Any other subset of frames you'd like to evaluate the model's performance on (e.g. data source, labeling provider, embedding clusters, etc.)

From the Scenarios tab, select up to two inference sets to compare model performance.

* Initially, the metrics calculations respect the project-wide default IOU and confidence settings.
* To override the defaults, use the metrics settings to adjust the confidence and IOU thresholds for both models, or for either model independently.

![Comparing two inference sets](/files/XO4FUPR2vj3hfxSOE3Xr)

Click the fly out button to open the Segment Details view.&#x20;

From here you can:

* View the performance of any uploaded inference set compared to the segment-specific precision, recall and F1 target thresholds.
* Modify the metric target thresholds.
* Manage the segment's elements and segment metadata.

![Access Segment details.](/files/y0m74QyMKVTqtzmop88U)

Click anywhere in the segment card to open the Metrics tab, pre-filtered to only the frames in that specific segment.

![View metrics for a single Segment.](/files/sPS6uf9cEdqESg3mGPey)

## Metrics Tab

Here, you can see a high-level overview of the model's performance on the base dataset or scenario subset in a few forms:

* A [precision-recall (PR) curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html) that illustrates the tradeoff between precision and recall given a confidence threshold.
* A [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) that provides per-class precision, recall, and F1 metrics.
* A [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) that lays out the number of each confusion in the dataset.

You can move the slider for parameters like confidence threshold and IOU matching threshold, and the metrics will recompute to reflect the model's performance with those parameters. You can also change the Metric Class.
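
As a rough illustration of why the metrics change when you move the confidence slider, the sketch below recomputes precision and recall at different confidence thresholds. It is a simplified model, not Aquarium's implementation: it assumes each detection has already been matched (or not) to a ground-truth object at a fixed IOU threshold, and the function name and sample data are hypothetical.

```python
def precision_recall_at(detections, num_ground_truth, confidence_threshold):
    """Precision and recall using only detections at or above the threshold.

    detections: list of (confidence, matched) pairs, where matched is True
    when the detection was paired with a ground-truth object (assumed to be
    decided beforehand at a fixed IOU threshold).
    """
    kept = [matched for confidence, matched in detections
            if confidence >= confidence_threshold]
    true_positives = sum(1 for matched in kept if matched)
    # Convention chosen for this sketch: no kept detections means no false positives.
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    return precision, recall

# Hypothetical detections for a dataset with 4 ground-truth objects.
detections = [(0.95, True), (0.80, True), (0.60, False), (0.40, True), (0.20, False)]

low_threshold = precision_recall_at(detections, 4, 0.3)    # keeps 4 detections
high_threshold = precision_recall_at(detections, 4, 0.9)   # keeps 1 detection
```

Raising the threshold here trades recall for precision, which is the tradeoff the PR curve plots across all thresholds.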

<div><figure><img src="/files/ankGVviUwyUTY1i4xDfA" alt=""><figcaption><p>Example of changing the Metric Class</p></figcaption></figure> <figure><img src="/files/4zRMnoQ11K2MsEAu06gG" alt=""><figcaption><p>Adjust Metrics Thresholds</p></figcaption></figure></div>

#### Many Class Metrics View

When the number of classes exceeds 25, the model metrics view switches from the grids above to a filterable table format for better legibility.

![](/files/-MQZ59qxjRGpJ3WHT1vK)

Selecting a row in the Classification Report (left) filters the Confusion Matrix (right) to only rows where either GT or Prediction matches the selected class. Selecting a row in the Confusion Matrix shows examples of that type of confusion, sorted by the amount of confusion.
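
The filtering behavior described above can be sketched as follows. This is an illustration only: the row layout `(gt, prediction, count)` and the function name are hypothetical, and ordering the filtered rows by count is an assumption made for readability rather than documented behavior.

```python
def filter_confusion_rows(rows, selected_class):
    """Keep rows where either the GT or the prediction matches the selected class.

    rows: list of (gt, prediction, count) tuples, one per confusion-matrix cell.
    The result is ordered by count, largest first (an assumption of this sketch).
    """
    matching = [row for row in rows if selected_class in (row[0], row[1])]
    return sorted(matching, key=lambda row: row[2], reverse=True)

# Hypothetical table contents.
rows = [
    ("car", "car", 120),
    ("bus", "truck", 7),
    ("truck", "car", 4),
    ("car", "truck", 9),
    ("bus", "bus", 40),
]

car_rows = filter_confusion_rows(rows, "car")
```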

## Understanding the Confusion Matrix

The confusion matrix in the Metrics tab is extremely useful for identifying label/data quality issues. While we have a [whole guide](/aquarium/common-workflows/assess-data-quality.md) on how to move through a data quality workflow, this section will focus specifically on how to use all the features of the matrix.

For this example we&#x20;

### Filtering Buttons

<figure><img src="/files/vFHoH2g6TraYrQ9QTtqm" alt=""><figcaption><p>Filtering buttons in the top left corner of the screen</p></figcaption></figure>

In the Metrics tab, these filtering buttons allow you to quickly filter and view subsets of your dataset.

For example, when you click on **FN** (False Negatives), the corresponding cells are highlighted in the matrix, and the matching examples populate below the matrix for review.

<figure><img src="/files/FOm93026TxL1D7LtfrQi" alt=""><figcaption><p>Interacting with the FN button</p></figcaption></figure>

The buttons **Confused As** and **Confused From** reveal a dropdown where you can filter on specific classes. When selecting a class, you'll again see cells in the matrix highlighted and examples that meet the criteria populating down below.

<figure><img src="/files/1OiD3yOkDNmsXbYK69AB" alt=""><figcaption><p>Interacting with the Confused As dropdown</p></figcaption></figure>

### Toggle Buttons

<figure><img src="/files/EP5Ctqd0DUo3dJGKsR6h" alt=""><figcaption><p>Toggle buttons above the confusion matrix</p></figcaption></figure>

#### Absolute/Percentage Toggle Buttons

The toggle buttons on the left allow you to change the value displayed in each cell:

* **Absolute** is the number of crops that meet the label/inference class criteria.
* **Percentage** depends on which option is selected on the other toggle button (row, column, value).
  * If **Row** is selected, the cell percentage represents the count of the crops in the cell compared to the row total.
  * If **Column** is selected, the cell percentage represents the count of the crops in the cell compared to the column total.
  * If **Value** is selected, the cell percentage represents the count of the crops in the cell compared to the total number of crops across all classes.

<figure><img src="/files/Tiwm357EN8EyCoedjwuL" alt=""><figcaption><p>Absolute/Percentage Toggle</p></figcaption></figure>
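
The three percentage modes differ only in the denominator. Here is a minimal sketch, assuming the matrix is a list of rows of absolute counts; the function name and the sample matrix are hypothetical, and this is not Aquarium's implementation.

```python
def cell_percentages(matrix, mode):
    """Convert absolute counts to percentages of the row, column, or overall total."""
    row_totals = [sum(row) for row in matrix]
    column_totals = [sum(column) for column in zip(*matrix)]
    grand_total = sum(row_totals)

    result = []
    for i, row in enumerate(matrix):
        out_row = []
        for j, count in enumerate(row):
            if mode == "row":
                denominator = row_totals[i]
            elif mode == "column":
                denominator = column_totals[j]
            elif mode == "value":
                denominator = grand_total
            else:
                raise ValueError(f"unknown mode: {mode!r}")
            # Guard against empty rows or columns.
            out_row.append(100.0 * count / denominator if denominator else 0.0)
        result.append(out_row)
    return result

# Hypothetical 2-class matrix of absolute counts.
matrix = [[8, 2],
          [1, 9]]

by_row = cell_percentages(matrix, "row")
by_column = cell_percentages(matrix, "column")
by_value = cell_percentages(matrix, "value")
```

With this sample, the top-left cell is 80% of its row, about 88.9% of its column, and 40% of all crops, which is why the same cell can look very different under each toggle.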

#### Row/Column/Value Toggle Buttons

The toggle buttons above the confusion matrix change the way you see the cell values colored and the numbers that display on cell hover.

<figure><img src="/files/4CSTI6ByUSfwZZKN24Tf" alt=""><figcaption><p>When you toggle between row/column/value, the cell's denominator changes on hover to reflect the total row value, column value, or total count value</p></figcaption></figure>

The darker the color, the larger the percentage of the Row/Column/Overall Value that specific cell represents.&#x20;

Also, depending on the toggled option, the denominator that is displayed on cell hover will reflect the total count per row, column, or for the entire dataset.

### Comparing Two Inference Sets

In the Model Metrics View it is possible to compare two inference sets at once.&#x20;

<figure><img src="/files/uxV5Epaqol2ftpDhjdUE" alt=""><figcaption><p>You can see two inference sets selected</p></figcaption></figure>

When two inference sets are selected, both the value displayed in a cell and the values displayed when hovering over a cell differ from what you see with a single inference set.

{% hint style="info" %}
Note that the inference set selected first in the drop-down at the top is the one listed above the confusion matrix, and the selection order affects the results shown in the confusion matrix.
{% endhint %}

<figure><img src="/files/DxgU6126Bq9Ul2WWG0AG" alt=""><figcaption><p>An example of confusion matrix with two inference sets selected </p></figcaption></figure>

In the matrix pictured above, the darker the blue the better, and the darker the red the worse.

How a cell is colored depends on whether its value lies on the main diagonal or off the diagonal.

Here we mean the diagonal that represents the correct predictions from the model:

<figure><img src="/files/kCb3rrQ9fBDK7n5TKa5a" alt=""><figcaption><p>The diagonal we are referring to</p></figcaption></figure>

**On the diagonal**, each value is the change in the number of correct classifications in the second inference set compared to the first. A positive value is an improvement, so positive numbers on the diagonal are colored blue. Likewise, negative numbers on the diagonal signify a decrease in performance and are colored red.

**For any value outside the diagonal**, the colors mean the opposite because each off-diagonal cell represents a specific kind of error. Positive numbers represent more misclassifications in the second inference set than in the first, so they are colored red. Negative numbers represent fewer errors, which means better performance, so they are colored blue.
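
The coloring rule above can be summarized in a short sketch. It assumes both confusion matrices are square lists of rows with the same class ordering, and it treats a zero change as "neutral", which is an assumption of this example rather than documented behavior. The function names and sample counts are hypothetical.

```python
def delta_matrix(first, second):
    """Per-cell change in count: second inference set minus first."""
    return [
        [b - a for a, b in zip(first_row, second_row)]
        for first_row, second_row in zip(first, second)
    ]

def cell_color(row, column, delta):
    """Blue for an improvement, red for a regression, per the rule above."""
    if delta == 0:
        return "neutral"  # assumption: unchanged cells get no color
    on_diagonal = row == column
    # On the diagonal more is better; off the diagonal fewer is better.
    improved = delta > 0 if on_diagonal else delta < 0
    return "blue" if improved else "red"

# Hypothetical counts for two inference sets over the same two classes.
first = [[50, 2],
         [5, 40]]
second = [[48, 12],
          [3, 44]]

deltas = delta_matrix(first, second)
colors = [
    [cell_color(i, j, delta) for j, delta in enumerate(row)]
    for i, row in enumerate(deltas)
]
```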

#### Understanding the Values on Hover

<figure><img src="/files/lfiChrJ0n9BsUhtLix1n" alt=""><figcaption><p>Zoomed-in image of a cell from the images above</p></figcaption></figure>

Looking at the image above, when comparing two inference sets the format of the message on hover for a cell is slightly different.&#x20;

The message for this cell reads:

```
delta - straight + 10 (2 -> 12) 
```

This reads as: for objects classified as delta but labeled straight, the second inference set had 10 more of these failures than the first inference set. The first inference set had 2 examples of this particular failure and the second selected inference set has 12 examples of this failure. (2 **+ 10** = 12)
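
To check your reading of a hover message, the arithmetic can be reproduced with a small helper like the one below. It mirrors the message shape from the example above; the function name is hypothetical, and how a negative change is rendered in the UI is an assumption.

```python
def hover_message(predicted, labeled, first_count, second_count):
    """Render the change between two inference sets in the shape shown above."""
    delta = second_count - first_count
    sign = "+" if delta >= 0 else "-"  # assumption: negatives render as "- N"
    return f"{predicted} - {labeled} {sign} {abs(delta)} ({first_count} -> {second_count})"

message = hover_message("delta", "straight", 2, 12)
```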

<figure><img src="/files/ONzmoiFkjDboO9Mm57W8" alt=""><figcaption><p>How to determine which inference set is first vs. second</p></figcaption></figure>


