Analyzing Model Inferences

Once you've uploaded a model's inferences on a dataset through the Aquarium API, you can then begin to analyze your model's performance and more efficiently find insights in the underlying datasets.

Ultimately, the goal of most ML teams is to improve model performance, so it's important to understand where your model does well and where it does poorly. Model inferences can also be used to surface which parts of the dataset deserve more attention, either because the model is performing badly there or because there's a problem with the underlying data.
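If you haven't created an inference set yet, that's done through the Python client (see Uploading Data > Model Inferences). The snippet below is only a rough sketch of that flow; the method names (`create_inferences`, `add_inference_2d_bbox`, and friends) and every ID shown should be treated as assumptions, so confirm the exact signatures against the Python Client API Docs.

```python
import aquariumlearning as al

# NOTE: method and parameter names below are paraphrased from the upload
# guides and may differ; treat them as assumptions and confirm against
# the Python Client API Docs.
al_client = al.Client()
al_client.set_credentials(api_key="YOUR_API_KEY")

inferences = al.Inferences()

# One inferences frame per labeled frame, keyed by the same frame_id.
inf_frame = al.InferencesFrame(frame_id="frame_001")
inf_frame.add_inference_2d_bbox(
    sensor_id="camera_front",      # hypothetical sensor id
    label_id="frame_001_inf_0",    # hypothetical inference id
    classification="car",
    top=100, left=250, width=80, height=60,
    confidence=0.92,               # drives the confidence threshold slider
)
inferences.add_frame(inf_frame)

# Hypothetical project / dataset / inference set names.
al_client.create_inferences(
    "my_project", "my_dataset",
    inferences=inferences,
    inferences_id="model_v1",
)
```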

To get started, select a dataset and at least one inference set in the project path underneath the top navigation bar.

Interacting with Inference Sets in Aquarium

The Model Metrics View, accessible from the left-hand navigation bar, is the primary way to interact with your models' aggregate performance in Aquarium. It is split into two tabs:

Scenarios

Using Scenarios requires setting up Model Performance Segments within your dataset. Learn more about organizing your data with Segments.

  • The Scenarios tab provides a summary view of your model's performance against pre-defined subsets of your dataset.

  • Scenarios allow you to define target thresholds for your models' performance against a known set of frames, and then evaluate all inference sets against those thresholds. This may be as simple as reaching a target F1 score on the test set, or as complex as multi-metric pass/fail regression tests against domain-specific problems (a minimal sketch of this idea follows below).
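Conceptually, a pass/fail regression test is just a set of per-segment metric thresholds that an inference set's computed metrics must clear. The sketch below illustrates that idea in plain Python; the segment names, metrics, and threshold values are all made up for illustration.

```python
# Illustrative only: evaluate computed metrics against per-segment targets.
# Segment names and threshold values here are hypothetical examples.
TARGETS = {
    "test_split": {"f1": 0.80},
    "night_intersections": {"precision": 0.85, "recall": 0.75},
}

def passes(segment: str, computed: dict) -> bool:
    """Return True if every targeted metric meets or beats its threshold."""
    return all(computed[metric] >= threshold
               for metric, threshold in TARGETS[segment].items())

print(passes("test_split", {"f1": 0.83, "precision": 0.90, "recall": 0.78}))
# True: the F1 target of 0.80 is met.
print(passes("night_intersections", {"f1": 0.78, "precision": 0.88, "recall": 0.70}))
# False: recall of 0.70 is below the 0.75 target.
```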

Metrics

  • The Metrics tab provides a high-level overview of the overall dataset or an individual segment. Metrics provides additional drill-through capability over the Scenarios view, including:

    • A classification report that provides per-class precision, recall, and F1 metrics.

    • A confusion matrix that lays out the number of each confusion in the dataset.

    • A precision-recall (PR) curve that illustrates the tradeoff between precision and recall given a confidence threshold.
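If it helps to reason about these views offline, the same three artifacts map directly onto standard scikit-learn utilities for a classification problem. The snippet below is a generic sketch using toy labels and confidences, not an Aquarium API call.

```python
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_recall_curve,
)

# Toy ground-truth labels and predicted labels.
y_true = ["car", "car", "pedestrian", "car", "pedestrian", "cyclist"]
y_pred = ["car", "pedestrian", "pedestrian", "car", "car", "cyclist"]

# Per-class precision / recall / F1 (the classification report).
print(classification_report(y_true, y_pred, zero_division=0))

# Counts of each GT-vs-prediction confusion (the confusion matrix).
print(confusion_matrix(y_true, y_pred, labels=["car", "cyclist", "pedestrian"]))

# Precision/recall tradeoff as the confidence threshold sweeps (the PR curve),
# shown for a binary "is it a car?" score.
is_car = [1, 1, 0, 1, 0, 0]
car_confidence = [0.95, 0.40, 0.20, 0.88, 0.55, 0.10]
precision, recall, thresholds = precision_recall_curve(is_car, car_confidence)
```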

Scenarios

The Scenarios tab summarizes your models' performance across all defined Model Performance Segments.

Model Performance Segments are grouped into three primary categories:

  • Splits

    • Always includes a segment card for All Frames in the dataset.

    • Typically the test, training and validation subsets of your dataset.

  • Regression Tests

    • Sets of frames within your dataset on which the model must meet a certain performance threshold in order to be considered for deployment.

    • Regression tests might be tied to overall business goals, specific model development experiments, domain-specific scenarios that are known to be difficult, etc.

  • Scenarios

    • Any other subset of frames you'd like to evaluate the model's performance on (e.g. data source, labeling provider, embedding clusters, etc.)

From the Scenarios tab, select up to two inference sets to compare model performance.

  • Initially, the metrics calculations will respect the project-wide default IOU and confidence settings.

  • You can then use the metrics settings to adjust the confidence and IOU thresholds for both models together, or for either model independently.

Click the fly-out button to open the Segment Details view. From here you can:

  • View the performance of any uploaded inference set compared to the segment-specific precision, recall, and F1 target thresholds.

  • Modify the metric target thresholds.

  • Manage the segment's elements and segment metadata.

Click anywhere in the segment card to open the Metrics tab, pre-filtered to only the frames in that specific segment.

Metrics

Here, you can see a high-level overview of the model's performance on the base dataset in a few forms:

  • A classification report that provides per-class precision, recall, and F1 metrics.

  • A confusion matrix that lays out the number of each confusion in the dataset.

  • A precision-recall (PR) curve that illustrates the tradeoff between precision and recall given a confidence threshold.

You can also move the sliders for parameters like the confidence threshold and IOU matching threshold, and the metrics will recompute to reflect the model's performance with those settings (a simplified sketch of this matching step follows below).
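Under the hood, those two sliders control the matching step for object detection: predictions below the confidence threshold are dropped, and a remaining prediction only counts as a true positive if it overlaps a same-class label with IOU at or above the matching threshold. The sketch below is a simplified, greedy stand-in for that logic rather than Aquarium's actual metrics pipeline (see Metrics Methodology for the authoritative definitions).

```python
def iou(a, b):
    """IOU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def match_counts(preds, labels, conf_thresh=0.5, iou_thresh=0.5):
    """Greedy matching: returns (true positives, false positives, false negatives)."""
    preds = [p for p in preds if p["confidence"] >= conf_thresh]
    unmatched = list(labels)
    tp = 0
    for p in sorted(preds, key=lambda p: -p["confidence"]):
        best = max(
            (l for l in unmatched if l["class"] == p["class"]),
            key=lambda l: iou(p["box"], l["box"]),
            default=None,
        )
        if best is not None and iou(p["box"], best["box"]) >= iou_thresh:
            tp += 1
            unmatched.remove(best)
    return tp, len(preds) - tp, len(unmatched)
```

Raising the confidence threshold typically trades false positives for false negatives, which is exactly the tradeoff the PR curve summarizes.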

You can also click into a cell in the confusion matrix to see examples of that type of confusion, and sort cells by the amount of confusion.

Many Class Metrics View

When the number of classes exceeds 25, the model metrics view switches from the grids above to a filterable table format for better legibility.

Selecting a row in the Classification Report (left) will filter the Confusion Matrix (right) to only rows where either the GT or the Prediction matches the selected class. Selecting a row in the Confusion Matrix will show examples of that type of confusion, sorted by the amount of confusion.
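The filtering behavior is easy to picture as a query over a long-form confusion table. The pandas sketch below uses made-up classes and counts purely to illustrate what selecting a class does.

```python
import pandas as pd

# Hypothetical long-form confusion table: one row per (GT, prediction) pair.
conf = pd.DataFrame(
    {
        "gt": ["car", "car", "truck", "bus", "car"],
        "pred": ["car", "truck", "truck", "car", "bus"],
        "count": [950, 40, 810, 12, 7],
    }
)

# Selecting "car" in the Classification Report keeps only confusions where
# either the ground truth or the prediction is "car".
selected = "car"
filtered = conf[(conf["gt"] == selected) | (conf["pred"] == selected)]
print(filtered.sort_values("count", ascending=False))
```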

Surfacing Labeling Errors

Aquarium makes it easy to find the places where your model disagrees most with the data. Once you click on a confusion matrix box, you can sort examples by confidence to see high-confidence disagreements or low-confidence agreements, or by other common factors like IOU or box size.

These "high loss" examples tend to expose areas where the model is making egregious mistakes, or places where the model is right and the underlying label is wrong! In the following example, showing the most confident cases where the model detected cars with no corresponding label, most issues turn out to be missing labels on cars that the model correctly detects!
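For intuition, this "high loss" sort is just ordering one confusion matrix cell by model confidence. A plain-Python version of that ordering for the cell in this example (predicted car, no matching label) might look like the sketch below; the frame IDs and confidences are made up.

```python
# Hypothetical unmatched "car" detections (the false-positive cell of the matrix).
unmatched_car_detections = [
    {"frame_id": "frame_103", "confidence": 0.97},
    {"frame_id": "frame_044", "confidence": 0.51},
    {"frame_id": "frame_210", "confidence": 0.89},
]

# High-confidence detections with no corresponding label are the most likely
# to be missing labels rather than genuine model mistakes, so review them first.
for det in sorted(unmatched_car_detections, key=lambda d: -d["confidence"]):
    print(det["frame_id"], det["confidence"])
```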

Whenever you upload a new model inference set to Aquarium, we highly recommend looking through some of these high loss examples to see if you have any labeling mistakes.

Finding Model Failure Patterns

You can color datapoints in the embedding view based on model precision, recall, and F1. This lets you identify trends in model performance by finding the parts of the dataset where the model does particularly well or particularly badly.

When switching to crop embeddings, you can color datapoints by confusion type to identify object-level failure patterns.

By clicking the entries in the class legend on the left, we can toggle the visualization to only show false positive scenarios.

We can identify a cluster of false positive detections on the same object across multiple different frames.

Coloring by metric can also surface performance that looks better than it really is. For example, the model may appear to have very good accuracy on pedestrians in one subset of the dataset, but upon further inspection, those frames all come from the same scene: the model is overfitting to that scene.
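The coloring itself is the familiar trick of mapping a per-point value onto a 2D embedding scatter. The matplotlib sketch below uses made-up embedding coordinates and per-frame recall values purely to illustrate what the embedding view is doing; in Aquarium the embeddings and metrics are computed for you.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Made-up 2D embedding coordinates and a per-frame metric (e.g. recall).
xy = rng.normal(size=(500, 2))
recall = rng.uniform(0.0, 1.0, size=500)

# Low-recall points stand out as a distinct color band; tight clusters of them
# are candidates for a failure pattern worth turning into a segment.
plt.scatter(xy[:, 0], xy[:, 1], c=recall, cmap="viridis", s=8)
plt.colorbar(label="per-frame recall")
plt.title("Embedding view colored by model recall (illustrative)")
plt.show()
```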