Analyzing Model Inferences

Once you've uploaded a model's inferences on a dataset through the Aquarium API, you can then begin to analyze your model's performance and more efficiently find insights in the underlying datasets.

Ultimately, the goal of most ML teams is to improve their model's performance, so it's important to understand where your model is doing well or badly in order to improve it. You can also use model performance to surface the parts of the dataset that deserve more attention, either because the model is performing badly there or because there's a problem with the underlying data.

To get started, select a dataset and at least one inference set in the project path underneath the top navigation bar.

Interacting with Inference Sets in Aquarium

The Model Metrics View, accessible from the left hand navigation bar, is the primary way to interact with your models' aggregate performance in Aquarium. The Model Metrics View is split into two tabs:


Using Scenarios requires setting up model performance segments within your dataset. Learn more about organizing your data with Segments.

  • The Scenarios tab provides a summary view of your model's performance against pre-defined subsets of your dataset.

  • Scenarios allow you to define a target threshold for your models' performance against a known set of frames, and then evaluate all inference sets against those thresholds. This may be as simple as reaching a target F1 score on the test set, or as complex as multi-metric pass/fail regression tests against domain-specific problems.


  • The Metrics tab provides a high-level overview of the overall dataset or an individual segment, and offers additional drill-through capability beyond the Scenarios view.


The Scenarios tab summarizes your models' performance across all defined Model Performance Segments.

Model Performance Segments are grouped into three primary categories:

  • Splits

    • Always includes a segment card for All Frames in the dataset.

    • Typically the test, training and validation subsets of your dataset.

  • Regression Tests

    • Sets of frames within your dataset on which the model must meet a certain performance threshold in order to be considered for deployment.

    • Regression tests might be tied to overall business goals, specific model development experiments, difficult domain-specific scenarios, etc.

  • Scenarios

    • Any other subset of frames you'd like to evaluate the model's performance on (e.g. data source, labeling provider, embedding clusters, etc.)

From the Scenarios tab, select up to two inference sets to compare model performance.

  • Initially, the metrics calculations will respect the project-wide default IOU and confidence settings.

  • Otherwise, use the metrics settings to adjust the confidence and IOU thresholds for both models together, or for either model independently.

Click the fly-out button to open the Segment Details view. From here you can:

  • View the performance of any uploaded inference set compared to the segment-specific precision, recall and F1 target thresholds,

  • Modify the metric target thresholds,

  • Manage the segment's elements and segment metadata.
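The threshold comparison described above can be sketched as a simple pass/fail check. This is an illustrative sketch, not the Aquarium API; the segment name, metric names, and all numbers are hypothetical.

```python
# Illustrative sketch (not the Aquarium API): checking an inference set's
# metrics against a segment's precision/recall/F1 target thresholds.

def evaluate_segment(metrics: dict, targets: dict) -> dict:
    """Return a pass/fail result for each target metric."""
    return {name: metrics.get(name, 0.0) >= threshold
            for name, threshold in targets.items()}

# Hypothetical targets for a regression-test segment.
segment_targets = {"precision": 0.90, "recall": 0.85, "f1": 0.87}
inference_metrics = {"precision": 0.93, "recall": 0.81, "f1": 0.866}

results = evaluate_segment(inference_metrics, segment_targets)
segment_passed = all(results.values())  # fails here: recall is below target
```

An inference set passes the segment only when every metric clears its threshold, which is what makes these segments usable as deployment gates.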

Click anywhere in the segment card to open the Metrics tab, pre-filtered to only the frames in that specific segment.


Here, you can see a high-level overview of the model's performance on the base dataset in a few forms.

You can move the sliders for parameters like the confidence threshold and IOU matching threshold, and the metrics will recompute to reflect the model's performance with those parameters.
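To see why the metrics change as the sliders move, here is a minimal sketch (not Aquarium's internal implementation) of how confidence and IOU thresholds determine which predictions count as true positives. The boxes, confidences, and the greedy matching strategy are all illustrative assumptions.

```python
# Illustrative sketch: both sliders change which predictions count as
# true positives, so precision/recall/F1 must be recomputed.

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def prf1(preds, labels, conf_thresh, iou_thresh):
    """Greedy one-to-one matching of predictions to labels."""
    kept = sorted((p for p in preds if p["conf"] >= conf_thresh),
                  key=lambda p: p["conf"], reverse=True)
    unmatched, tp = list(labels), 0
    for p in kept:
        best = max(unmatched, key=lambda l: iou(p["box"], l), default=None)
        if best is not None and iou(p["box"], best) >= iou_thresh:
            tp += 1
            unmatched.remove(best)
    fp, fn = len(kept) - tp, len(labels) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

labels = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [
    {"box": (1, 1, 10, 10), "conf": 0.9},    # good match to first label
    {"box": (50, 50, 60, 60), "conf": 0.4},  # low-confidence false positive
]
```

With the confidence slider at 0.5, the low-confidence false positive is filtered out and precision improves; lowering it to 0.3 reintroduces that detection and precision drops, which is exactly the trade-off the sliders let you explore.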

You can also click into a cell in the confusion matrix to see examples of those types of confusions and sort them by the amount of confusion.

Many Class Metrics View

When the number of classes exceeds 25, the model metrics view switches from the grids above to a filterable table format for better legibility.

Selecting a row in the Classification Report (left) will filter the Confusion Matrix (right) to only rows where either GT or Prediction matches the selected class. Selecting a row in the Confusion Matrix will show examples of those types of confusions, and sort by the amount of confusion.
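The filtering behavior above can be sketched with plain Python. This is an illustrative sketch of the concept, not Aquarium's implementation; the class names and counts are hypothetical.

```python
# Illustrative sketch: restrict a confusion matrix to rows where either
# the ground-truth (GT) class or the predicted class matches a selection.
from collections import Counter

pairs = [  # (gt, prediction) for each matched detection
    ("car", "car"), ("car", "truck"), ("truck", "car"),
    ("pedestrian", "pedestrian"), ("cyclist", "pedestrian"),
]
confusions = Counter(pairs)

selected = "car"
filtered = {pair: n for pair, n in confusions.items() if selected in pair}

# Rank the selected class's confusions, most confused first.
ranked = sorted(filtered.items(), key=lambda kv: kv[1], reverse=True)
```

For a dataset with many classes, this kind of filter keeps the table legible by showing only the rows relevant to the class you selected.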

Surfacing Labeling Errors

Aquarium makes it easy to find the places where your model disagrees most with the data. Once you click on a confusion matrix box, you can sort examples to see high confidence disagreements, low confidence agreements, or by other common factors like IOU or box size.

These "high loss" examples tend to expose areas where the model is making egregious mistakes, or places where the model is right and the underlying labels are wrong! In the following example, which shows the most confident cases where the model detected cars with no corresponding label, most issues are due to missing labels on cars that the model correctly detects!

Whenever you upload a new model inference set to Aquarium, we highly recommend looking through some of these high loss examples to see if you have any labeling mistakes.
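The core of this review workflow is a simple sort. Here is an illustrative sketch (with hypothetical frame names and confidences) of ordering unmatched predictions by confidence so the likeliest missing labels surface first:

```python
# Illustrative sketch: false positives (predictions with no matching
# label), sorted by model confidence. The most confident disagreements
# are the best candidates for missing or incorrect labels.

false_positives = [
    {"frame": "frame_012", "class": "car", "conf": 0.97},
    {"frame": "frame_007", "class": "car", "conf": 0.55},
    {"frame": "frame_031", "class": "car", "conf": 0.88},
]

# Review the highest-confidence disagreements first.
suspects = sorted(false_positives, key=lambda fp: fp["conf"], reverse=True)
```

Reviewing even the top handful of such examples after each inference upload is usually enough to catch systematic labeling gaps.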

Finding Model Failure Patterns

You can color datapoints in the embedding view based on model precision, recall, and F1. This lets you identify trends in model performance by finding which parts of the dataset the model performs particularly well or badly on.

When switching to crop embeddings, you can color datapoints by confusion type to identify object-level failure patterns.

By clicking entries in the class legend on the left, you can toggle the visualization to show only false positive scenarios.

This makes it easy to identify a cluster of false positive detections on the same object across multiple different frames.
