Updating Datasets
How to update datasets after upload
Overview
This page walks through best practices for updating your datasets and inference sets once they have been uploaded.
When updating a dataset or inference set, the new or modified data must use the same classmap as originally defined in the project.
Fully Versioned w/ Edit History
As a dataset grows and changes, Aquarium maintains a history of every previous version. This has many benefits, including:
Reproducible experiment results. If an experiment produced inferences that were evaluated against version X of the dataset, it can be evaluated and explored against that version, even if the dataset continues to be updated in the background.
Time-travel / rollbacks. Do you want to know what the dataset looked like before a major relabeling effort? Did that effort introduce problems that you want to undo? Load up a previous version at any time!
Edit histories / Audit logs. Each entry is versioned by its ID, so you can always look up the full history for a given image and see each modification made to its labels.
Versioning through Checkpoints
Aquarium released a feature named checkpoints that allows you to freeze the state of the dataset's frames and labels as of a point in time.
Use checkpoints to manage versions of your dataset over time and measure the impact of improving data quality or acquiring new data.
For more information regarding checkpoints, check out this page.
Streaming Inserts + Partial Success
Mutable datasets also allow you to upload data in a streaming format -- for example, one at a time as you receive images back from a labeling provider. If one batch of updates encounters an error, only those will fail, and the rest of the dataset will be processed and available for users.
Updating Frame Data
The following sections discuss how to add frames, delete frames, and update existing frames.
Adding Frames to an Existing Dataset and Inference Set
The process is almost identical to the one you used for the initial upload, except that you upload frames with new frame_ids and use create_or_update_dataset() or create_or_update_inferences() instead of create_dataset() or create_inferences().
The steps are exactly the same as those in the Uploading Data steps; the difference is that the frame_ids are new and unique when added to a LabeledDataset or Inferences object.
When updating a dataset, make sure you're uploading data to existing project, dataset, and/or inference set names in the client API.
Example code (extremely similar for create_or_update_inferences()):
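A minimal sketch of what such an upload script might look like. The project and dataset names, image URL, and the add_image()/add_label_2d_bbox() parameters are illustrative assumptions; check the client reference for the exact signatures.

```python
import aquariumlearning as al

al_client = al.Client()
al_client.set_credentials(api_key="YOUR_API_KEY")  # assumed auth setup

dataset = al.LabeledDataset()

# A NEW, unique frame_id is what distinguishes an added frame
# from an update to an existing one.
frame = al.LabeledFrame(frame_id="frame_00101")
frame.add_image(image_url="https://example.com/images/frame_00101.jpg")
frame.add_label_2d_bbox(
    label_id="frame_00101_label_0",
    classification="car",  # must exist in the project's classmap
    top=50, left=100, width=200, height=150,
)
dataset.add_frame(frame)

# create_or_update_dataset() instead of create_dataset(), pointed at
# the EXISTING project and dataset names, so the new frames are appended.
al_client.create_or_update_dataset(
    project_name="my_project",
    dataset_name="my_dataset",
    dataset=dataset,
)
```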
Once the data has been uploaded, you can use the dropdown in the top right corner of the screen to view your dataset versions by upload time.
Deleting/Removing Frames in a Dataset and Inference Set
To remove a frame from a dataset, use the delete_frame(frame_id) function in the Aquarium client.
Example usage:
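A minimal deletion sketch. We assume here that delete_frame() is also scoped by project and dataset names; the frame IDs are hypothetical, and the exact signature may differ in your client version.

```python
import aquariumlearning as al

al_client = al.Client()
al_client.set_credentials(api_key="YOUR_API_KEY")

frame_ids_to_delete = ["frame_00034", "frame_00078"]  # hypothetical IDs

for frame_id in frame_ids_to_delete:
    al_client.delete_frame(
        project_name="my_project",  # assumed scoping arguments
        dataset_name="my_dataset",
        frame_id=frame_id,
    )
```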
Once you run a script like the one above to delete frames, you'll see a message in your console similar to a normal data upload, and an orange spinner next to your dataset name in the UI while the frames are being deleted.
You can also use the version dropdown in the top right to view changes after a deletion. To view your dataset as it was prior to a deletion, select the appropriate date in the dropdown.
Updating Existing Frames In a Dataset and Inference Set
To update existing frame data in your dataset, specify the original frame_id when re-uploading the frame so that Aquarium can link it to the original. To update frame data, use the update_dataset_frames() method; it is useful for bulk-updating frame metadata: sensor data, external metadata, etc. Any new information provided is appended to the frame as a new version, and the previous state of the frame remains available as an old version.
Example usage:
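A sketch of a bulk metadata update, assuming update_dataset_frames() accepts a list of re-built frames keyed by their original frame_ids. The parameter names and the add_user_metadata() helper are illustrative assumptions.

```python
import aquariumlearning as al

al_client = al.Client()
al_client.set_credentials(api_key="YOUR_API_KEY")

# Re-use the ORIGINAL frame_id so Aquarium links this update to the
# existing frame and records it as a new version.
frame = al.LabeledFrame(frame_id="frame_00101")
frame.add_user_metadata("weather", "rainy")  # assumed metadata helper

al_client.update_dataset_frames(
    project_name="my_project",
    dataset_name="my_dataset",
    frames=[frame],  # bulk update: pass every modified frame at once
)
```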
Once you update the metadata, you can also use the dropdown in the top right corner to view your data before and after the update.
Updating Label Data
The following sections discuss how to add or modify labels and how to delete labels.
Adding or Modifying Labels
When you want to add new labels to an existing frame, or modify existing labels in one, use update_dataset_labels().
This function works with a class called UpdateGTLabelSet. This object is very similar to a frame object and is used for adding labels to or modifying labels on existing frames.
Create an UpdateGTLabelSet object for each frame you are modifying. Then, just as you initially called a function to add a bounding box to the frame, do the same with the UpdateGTLabelSet object.
Add each UpdateGTLabelSet object to a list and pass that list into the update_dataset_labels() function.
To demonstrate an example of how this will work, we will modify this frame and label.
The code snippet below shows how to modify the single label for the frame pictured above; in practice you will likely loop through your data to apply updates, so your code may differ:
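The steps above can be sketched as follows, assuming UpdateGTLabelSet exposes the same bounding-box helper as a frame object. The IDs, coordinates, and the update_label_sets parameter name are illustrative assumptions.

```python
import aquariumlearning as al

al_client = al.Client()
al_client.set_credentials(api_key="YOUR_API_KEY")

# One UpdateGTLabelSet per frame being modified, keyed by the
# EXISTING frame_id.
label_set = al.UpdateGTLabelSet(frame_id="frame_00101")

# Re-using an existing label_id modifies that label; a new label_id
# would add a new label to the frame.
label_set.add_label_2d_bbox(
    label_id="frame_00101_label_0",
    classification="truck",  # corrected class; must be in the classmap
    top=48, left=95, width=210, height=160,
)

al_client.update_dataset_labels(
    project_name="my_project",
    dataset_name="my_dataset",
    update_label_sets=[label_set],  # assumed parameter name
)
```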
Once successfully run, you can see the newly modified label reflected in the UI!
Deleting Labels
Currently in Aquarium, to delete a label you replace the entire existing frame with the correct set of labels, minus the labels you'd like to delete/remove. The steps look almost identical to the initial label upload process:
1. Create a new LabeledDataset object.
2. For each frame that has a label you would like to delete, create a LabeledFrame object, making sure to set the update_type parameter to ADD.
3. Add all the correct/desired labels, minus the ones you wish to delete, to the LabeledFrame object.
4. Add the LabeledFrame object to the LabeledDataset using add_frame().
5. Finally, use create_or_update_dataset(), passing in your project name, dataset name, and the LabeledDataset.
You only need to create LabeledFrames for the frames that have labels that need updating. You don't have to complete this process for every frame in your dataset!
Creating the LabeledFrame object with update_type ADD rewrites the labels associated with the frame. You'll be able to view this change in the frame's history, with both the old and new labels.
Example below shows the original frame version with four total labels (green boxes):
This image shows the after where we have removed all but one label:
Example code block below:
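A sketch of the steps above: the frame is re-created with its original frame_id and update_type="ADD", carrying only the labels to keep. All names, IDs, and coordinates are illustrative.

```python
import aquariumlearning as al

al_client = al.Client()
al_client.set_credentials(api_key="YOUR_API_KEY")

dataset = al.LabeledDataset()

# Re-create only the frames whose labels need changing, using the
# ORIGINAL frame_id and update_type="ADD" so the label set is rewritten.
frame = al.LabeledFrame(frame_id="frame_00101", update_type="ADD")

# Add back only the labels you want to KEEP; any omitted labels
# are effectively deleted.
frame.add_label_2d_bbox(
    label_id="frame_00101_label_0",
    classification="car",
    top=50, left=100, width=200, height=150,
)

dataset.add_frame(frame)

al_client.create_or_update_dataset(
    project_name="my_project",
    dataset_name="my_dataset",
    dataset=dataset,
)
```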
Notes & Limitations
There are a few things to be aware of when using mutable datasets:
Some Dataset Attributes Are Still Immutable
This change allows all elements of a dataset (frames, metadata values, labels, bounding box geometry, etc.) to be added / updated / deleted, but they must still be compatible with the dataset as a whole.
Most notably, the following dataset attributes must remain consistent over time:
Set of known valid label classes
User provided metadata field schemas
Embedding source (i.e., embeddings are expected to be compatible across all frames in the dataset)
We plan to support changes to all of these in the future. Please let us know if any of them are particularly valuable for you.
Inference Sets Are Pinned to a Specific Dataset Version
When an inference set is uploaded, it will be pinned to a specific version of the labeled dataset, which will default to the most up-to-date version at the time of submission.
Updates to the inference set itself will show up in the UI, but updates to the base dataset (ground truth) won't be incorporated.
Metrics shown will be computed against those pinned dataset labels, and any visualizations of the ground truth will be from that specific version.
Segment Elements are Versioned
Elements (frames, labels, inferences) added to segments correspond to the specific version of the element when it was added to the segment. They do not automatically update to the latest version of the element within the dataset. This is intentional, but has tradeoffs:
At any time, you can review elements in segments as they were when they were created. This makes it easy to reproduce issues in datasets across members of the team, even as the labels or associated frames change.
Because you're viewing the version of the dataset as it was when it was added to the segment, reviewing a segment after a data quality issue has been corrected may show that the issue is still present, despite the current version of the dataset being correct. Use the segment state tracking features in Aquarium (and archive old segments that have been resolved) to mitigate any potential confusion.
For example, you might create a segment with example labels of "bounding box too loose." If you go re-label those boxes and update them in the dataset, the segment will still contain the original (poorly drawn) labels, with an icon indicating that it belongs to an older version of the dataset.
It will be available for viewing, but some features (like within-dataset similarity search) may be disabled for out-of-date elements.
Monitoring Upload Status
Similar to batch uploads, you'll be able to view the status of your streaming uploads in the web app.
If you go to the Project Details page, you'll see a Streaming Uploads tab (previous batch uploads under your project will still be visible under Uploads):
Each upload ID corresponds to a subset of your dataset/inference set (with the associated frame count + label count).
To view more details on which specific frames/labels are present in a given upload, you can click on the Status (e.g. DONE). A pop-up will appear with the following info:
In the case of a failed upload, you can debug via the Errors section (which exposes frame-specific debug logs), and download this info to determine which frames/crops may need to be re-uploaded.
If you are running into an error and the error logs are not sufficient to understand how to fix the issue, please reach out to the Aquarium team and we can help resolve your problem.
Migrating Projects with Immutable Datasets to Mutable
This section only applies if you ever created a dataset in BATCH mode. Any upload after June 2022 uses STREAMING mode by default, and you will not need this section.
Any scripts that make calls to client.create_dataset or client.create_or_update_dataset no longer need to pass the pipeline_mode argument in order to use streaming mode, as "STREAMING" is now the default. However, if you would prefer to continue using batch mode for your uploads, you will now have to specify pipeline_mode="BATCH".
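For example (project and dataset names are placeholders, and we assume pipeline_mode is passed as a keyword argument):

```python
import aquariumlearning as al

al_client = al.Client()
al_client.set_credentials(api_key="YOUR_API_KEY")

dataset = al.LabeledDataset()  # build frames as usual before uploading

# Streaming mode is the default: no pipeline_mode argument needed.
al_client.create_or_update_dataset(
    project_name="my_project",
    dataset_name="my_dataset",
    dataset=dataset,
)

# Legacy batch mode now has to be requested explicitly.
al_client.create_or_update_dataset(
    project_name="my_project",
    dataset_name="my_legacy_dataset",
    dataset=dataset,
    pipeline_mode="BATCH",
)
```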
You may have existing immutable datasets that were uploaded via batch mode, and want to convert them to mutable datasets.
If you go to the "Datasets" tab of your "Project Details" page, each of the listed legacy datasets should now have a new teal "Clone as New Mutable Dataset" button:
When you click this button, the cloning will begin:
After a minute or so, if you refresh the page, the new dataset will appear with the prefix "MUTABLE_". The old dataset will also have a tooltip that points to the new dataset:
Depending on the size of your original dataset, it may take some more time for this new mutable dataset to be fully processed and become viewable in "Explore" view.
NOTE: A dataset's corresponding inference sets will not be automatically cloned for now, but can be uploaded to the mutable dataset using the Aquarium client. Please contact us if you have questions about migrating inference sets.