# Updating Datasets

## Overview

This page walks through best practices for updating your datasets and inference sets after they have been uploaded.

{% hint style="danger" %}
When updating a dataset or inference set, the new or modified data must use the same **classmap** that was originally defined in the project.
{% endhint %}
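A quick pre-flight check can catch classmap mismatches before an upload. The sketch below is plain Python and does not call the Aquarium client. The helper `find_unknown_classes` and the class names are hypothetical, and it assumes you can list your project's class names yourself.

```python
from typing import Iterable, List, Set


def find_unknown_classes(
    project_classes: Iterable[str], label_classes: Iterable[str]
) -> List[str]:
    """Return the label classes that are missing from the project's classmap."""
    known: Set[str] = set(project_classes)
    return sorted({name for name in label_classes if name not in known})


# Hypothetical class names, for illustration only
project_classes = ["straight", "swept", "delta"]
new_label_classes = ["straight", "delta", "variable_swept"]

unknown = find_unknown_classes(project_classes, new_label_classes)
if unknown:
    print(f"Not in the project's classmap: {unknown}")
```

Running a check like this before building your update makes it easier to fix class names locally instead of debugging a failed upload.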

### Fully Versioned w/ Edit History

As a dataset grows and changes, Aquarium maintains a history of every previous version. This has many benefits, including:

* **Reproducible experiment results.** If an experiment produced inferences that were evaluated against version X of the dataset, it can be evaluated and explored against that version, even if the dataset continues to be updated in the background.
* **Time-travel / rollbacks.** Do you want to know what the dataset looked like before a major relabeling effort? Did that effort introduce problems that you want to undo? Load up a previous version at any time!
* **Edit histories / Audit logs.** Each entry is versioned by its ID, so you can always look up the full history for a given image and see each modification made to its labels.
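To make the idea of per-entry version history concrete, here is a toy model of an append-only history with "as of" lookups. It is only an illustration of the concept and says nothing about how Aquarium stores versions internally. The class `VersionedEntry` and its methods are hypothetical.

```python
import bisect
from typing import Any, Dict, List, Optional


class VersionedEntry:
    """Toy append-only history for one entry (not Aquarium's implementation)."""

    def __init__(self) -> None:
        self._timestamps: List[int] = []
        self._values: List[Dict[str, Any]] = []

    def write(self, timestamp: int, value: Dict[str, Any]) -> None:
        """Append a new version; older versions are never overwritten."""
        if self._timestamps and timestamp < self._timestamps[-1]:
            raise ValueError("timestamps must be non-decreasing")
        self._timestamps.append(timestamp)
        self._values.append(value)

    def as_of(self, timestamp: int) -> Optional[Dict[str, Any]]:
        """Return the latest version written at or before `timestamp`."""
        idx = bisect.bisect_right(self._timestamps, timestamp)
        return self._values[idx - 1] if idx else None


entry = VersionedEntry()
entry.write(10, {"classification": "straight"})
entry.write(20, {"classification": "swept"})

# Looking up an earlier time returns the earlier version
print(entry.as_of(15))
```

Because writes only append, any experiment that recorded the time it ran can be replayed against the data as it looked at that moment.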

### Versioning through Checkpoints

Aquarium released a feature named **checkpoints** that allows you to freeze the state of the dataset's frames and labels as of a point in time.

Use checkpoints to manage versions of your dataset over time and measure the impact of improving data quality or acquiring new data.

{% hint style="info" %}
For more information regarding **checkpoints**, check out [this](/aquarium/concepts/dataset-checkpoints.md) page.
{% endhint %}

### Streaming Inserts + Partial Success

Mutable datasets also allow you to upload data in a *streaming* format, for example one image at a time as you receive them back from a labeling provider. If one batch of updates encounters an error, only that batch will fail; the rest of the dataset will be processed and available for users.
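You can mirror this behavior on the client side by uploading in small batches and recording which ones fail, so they can be retried later. The sketch below is plain Python: `upload_in_batches` and the `upload_batch` callback are hypothetical names. In practice, the callback would build a `LabeledDataset` for the batch and call `create_or_update_dataset()`.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def upload_in_batches(
    frame_ids: Sequence[str],
    upload_batch: Callable[[List[str]], None],
    batch_size: int = 100,
) -> Tuple[List[str], List[str]]:
    """Upload batch by batch and return (succeeded_ids, failed_ids)."""
    succeeded: List[str] = []
    failed: List[str] = []
    for batch in chunked(frame_ids, batch_size):
        try:
            upload_batch(batch)
        except Exception as exc:  # keep going so other batches still upload
            print(f"Batch starting at {batch[0]!r} failed: {exc}")
            failed.extend(batch)
        else:
            succeeded.extend(batch)
    return succeeded, failed
```

The returned `failed_ids` list can be passed back into the same function once the underlying problem is fixed.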

## Updating Frame Data

The following sections discuss how to:

* [Add frames](#adding-frames-to-an-existing-dataset-and-inference-set)
* [Delete frames](#deleting-removing-frames-in-a-dataset-and-inference-set)
* [Edit frame data](#updating-existing-frames-in-a-dataset-and-inference-set)

### Adding Frames to an Existing Dataset and Inference Set

The process is almost identical to the initial upload: upload frames with a new `frame_id`, and use [`create_or_update_dataset()`](https://aquarium-not-pypi.web.app/aquariumlearning/docs/#aquariumlearning.Client.create_or_update_dataset) or [`create_or_update_inferences()`](https://aquarium-not-pypi.web.app/aquariumlearning/docs/#aquariumlearning.Client.create_or_update_inferences) instead of `create_dataset()` or `create_inferences()`.

The steps are the same as those in the Uploading Data guide. The only difference is that each `frame_id` must be new and unique when added to a [LabeledDataset](https://aquarium-not-pypi.web.app/aquariumlearning/docs/#aquariumlearning.LabeledDataset) or [Inferences](https://aquarium-not-pypi.web.app/aquariumlearning/docs/#aquariumlearning.Inferences) object.
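Since reusing an existing `frame_id` targets the existing frame rather than adding a new one, it can help to validate IDs before uploading. The sketch below is plain Python and assumes you keep your own record of the frame IDs already uploaded. The helper `check_new_frame_ids` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def check_new_frame_ids(
    existing_ids: Iterable[str], new_ids: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (ids already in the dataset, ids repeated within new_ids)."""
    existing = set(existing_ids)
    candidates = list(new_ids)
    collisions = sorted({fid for fid in candidates if fid in existing})
    duplicates = sorted(fid for fid, n in Counter(candidates).items() if n > 1)
    return collisions, duplicates


# Hypothetical frame ids, for illustration only
collisions, duplicates = check_new_frame_ids(
    existing_ids=["frame_001", "frame_002"],
    new_ids=["frame_002", "frame_003", "frame_003"],
)
print("Already uploaded:", collisions)
print("Repeated in this batch:", duplicates)
```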

When updating a dataset, make sure you upload to existing project, dataset, and/or inference set names in the client API.

Example code (extremely similar for `create_or_update_inferences()`):

{% code lineNumbers="true" %}

```python
al_client.create_or_update_dataset(
    # project name of existing project with same classmap
    EXISTING_PROJECT_NAME, 
    # dataset name of existing dataset
    AL_DATASET, 
    dataset=dataset
)
```

{% endcode %}

Once the data has been uploaded, you can use the dropdown in the top right corner of your screen to view your dataset versions by upload time.

<figure><img src="/files/mMoM0qyPlEUYQskfvvNG" alt=""><figcaption><p>Dropdown to view prior versions of dataset based on upload time</p></figcaption></figure>

### Deleting/Removing Frames in a Dataset and Inference Set

To remove a frame from a dataset, you will use the `delete_frame(`*`frame_id`*`)` function in the Aquarium client.

Example usage:

{% code lineNumbers="true" %}

```python
import aquariumlearning as al

al_client = al.Client()
al_client.set_credentials(api_key='YOURKEY')

labeled_dataset = al.LabeledDataset()

# list the frame ids to delete (replace the placeholder below)
frame_ids = ['']

# just like you would call add_frame to add to a labeled dataset,
# we call delete_frame to build an object we can pass to the client
# so it knows what to delete
for frame_id in frame_ids:
    labeled_dataset.delete_frame(frame_id)

# use create_or_update_dataset instead of create_dataset
al_client.create_or_update_dataset(
    project_id='PROJECT_NAME',
    dataset_id='DATASET_NAME',
    dataset=labeled_dataset
)
```

{% endcode %}

Once you run a script like the one above to delete frames, you'll see a message in your console similar to the one for a normal data upload, and an orange spinner next to your dataset name in the UI while the frames are being deleted.

<div><figure><img src="/files/JJXk1OweNGIhTa2Tdrd1" alt=""><figcaption></figcaption></figure> <figure><img src="/files/D6tE6uGLV4j9hYDdz8Eq" alt=""><figcaption><p>How your dataset will appear upon a deletion request</p></figcaption></figure></div>

You can also use the dropdown in the top right to compare versions of your dataset before and after a deletion. To view your dataset as it was before a deletion, select the appropriate date in the dropdown.

<figure><img src="/files/QHn7ayV3Ofirbg9R209a" alt=""><figcaption><p>Changing versions using the dropdown before and after a deletion based on timestamp</p></figcaption></figure>

### Updating Existing Frames In a Dataset and Inference Set

To **update existing frame data in your dataset**, specify the original `frame_id` when reuploading that frame, so that Aquarium can link it to the original. To update frame data we use the method `update_dataset_frames()`; this method is useful for bulk-updating frame metadata: sensor data, external metadata, etc. Any new information provided will be appended to the frame as a new version. The previous state of the frame will continue to be available as an old version.

Example usage:

{% code lineNumbers="true" %}

```python
import aquariumlearning as al

al_client = al.Client()
al_client.set_credentials(api_key='YOURAPIKEY')

# create a list to hold the frames to modify
# unlike other functions, we don't add to a LabeledDataset
# for updating frame-level features we use a list
labeled_frame_list_to_modify = []

# create a labeled frame using an existing frame_id
# make sure you add the 'MODIFY' param
frame = al.LabeledFrame(frame_id='FRAME_ID', update_type='MODIFY')

# in this example we are adding a new metadata field
frame.add_user_metadata('test_metadata_field', "test_added_metadata_value")

# add frame object with new metadata field to the list
labeled_frame_list_to_modify.append(frame)

# call the client to push changes
al_client.update_dataset_frames(
    project_id='PROJECT_NAME',
    dataset_id='DATASET_NAME',
    update=labeled_frame_list_to_modify
)
```

{% endcode %}
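For bulk updates, the single-frame pattern in the example above extends to a loop over many frames. The sketch below shows one way to organize that loop. The helper `build_frame_updates` is hypothetical, and the frame constructor is passed in as `make_frame` so the sketch itself has no dependency on the client.

```python
from typing import Any, Callable, Dict, List


def build_frame_updates(
    metadata_by_frame: Dict[str, Dict[str, Any]],
    make_frame: Callable[[str], Any],
) -> List[Any]:
    """Build one frame object per frame_id and attach its metadata fields."""
    frames = []
    for frame_id, fields in metadata_by_frame.items():
        frame = make_frame(frame_id)
        for key, value in fields.items():
            frame.add_user_metadata(key, value)
        frames.append(frame)
    return frames


# With the Aquarium client imported as `al`, make_frame could be:
#   lambda fid: al.LabeledFrame(frame_id=fid, update_type='MODIFY')
# and the returned list could be passed as `update=` to update_dataset_frames().
```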

Once you update the metadata, you can also use the dropdown in the top right corner to view your data before and after the update.

<figure><img src="/files/kubCn8dq8lK6RKN07k2x" alt=""><figcaption><p>Showing new metadata field at the frame level after running the code example above</p></figcaption></figure>

## Updating Label Data

The following sections discuss how to:

* [Add and modify labels](#adding-or-modifying-labels)

### Adding or Modifying Labels

When you want to add new labels to an existing frame or modify existing labels in an existing frame, the function to use is [`update_dataset_labels()`](https://aquarium-not-pypi.web.app/aquariumlearning/docs/#aquariumlearning.Client.update_dataset_labels).

This function works with a class called [`UpdateGTLabelSet`](https://aquarium-not-pypi.web.app/aquariumlearning/docs/#aquariumlearning.UpdateGTLabelSet). This object is very similar to a frame object, and is used in cases of adding/modifying existing frames.

Create one `UpdateGTLabelSet` object for each frame you are modifying. Then, just as you called a function to add a bounding box to the frame during the initial upload, call the equivalent function on the `UpdateGTLabelSet` object.

Add each `UpdateGTLabelSet` object to a list, and pass that list into the `update_dataset_labels()` function.
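If your label edits arrive as a flat list (for example, rows exported from a labeling tool), grouping them by frame first makes it straightforward to create one `UpdateGTLabelSet` per frame. The sketch below is plain Python. The helper `group_label_edits_by_frame` and the dictionary keys are assumptions for illustration, not part of the Aquarium client.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_label_edits_by_frame(
    edits: List[Dict[str, Any]]
) -> Dict[str, List[Dict[str, Any]]]:
    """Group flat label edits by frame_id, preserving input order."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for edit in edits:
        grouped[edit["frame_id"]].append(edit)
    return dict(grouped)


# Hypothetical edits, for illustration only
edits = [
    {"frame_id": "frame_a", "label_id": "frame_a_gt_0", "top": 10},
    {"frame_id": "frame_b", "label_id": "frame_b_gt_0", "top": 5},
    {"frame_id": "frame_a", "label_id": "frame_a_gt_1", "top": 42},
]

for frame_id, frame_edits in group_label_edits_by_frame(edits).items():
    # Here you would create one UpdateGTLabelSet(frame_id=frame_id)
    # and call add_2d_bbox(...) once per edit in frame_edits.
    print(frame_id, [e["label_id"] for e in frame_edits])
```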

To demonstrate how this works, we will modify the frame and label shown below.

<figure><img src="/files/5TWcaOXzm23Ye4swhAKE" alt=""><figcaption><p>You can see an example of the label we are modifying above</p></figcaption></figure>

The code snippet below modifies the one label on the one frame pictured above. In practice you will likely loop through your data to apply updates, so your code may differ:

{% code lineNumbers="true" %}

```python
import aquariumlearning as al

# configure Aquarium client
al_client = al.Client()
al_client.set_credentials(api_key='YOUR_API_KEY')

# specify project name and dataset name you will be working with
PROJECT_NAME = 'Rareplanes_Wingtype_Project'
DATASET_NAME = 'initial_train_labels'

# for the sake of example code
# this part will likely be looped but we have grabbed the frame id
# and label id pictured above
frame_id = '100_1040010039437200_tile_460'
label_id = '100_1040010039437200_tile_460_gt_0'

# define the list we will pass to update_dataset_labels
update_gt_label_set_list = []

# define an UpdateGTLabelSet for the frame that contains our label
update_gt_label_set = al.UpdateGTLabelSet(frame_id=frame_id)

# modify an existing label, identified by its label id
# if the label id doesn't exist in the dataset, a new label will be added
# (the uppercase names below are placeholders for your own values)
update_gt_label_set.add_2d_bbox(
    label_id=label_id,
    classification=CLASSIFICATION,
    top=NEW_TOP_VALUE,
    left=NEW_LEFT_VALUE,
    width=NEW_WIDTH_VALUE,
    height=NEW_HEIGHT_VALUE,
    user_attrs=DICT_OF_METADATA
)

# add our modified UpdateGTLabelSet object to the list
update_gt_label_set_list.append(update_gt_label_set)

al_client.update_dataset_labels(PROJECT_NAME, DATASET_NAME, update_gt_label_set_list)
```

{% endcode %}

Once successfully run, you can see the newly modified label reflected in the UI!

<figure><img src="/files/nQxL3R3d2ZlDMetgztRr" alt=""><figcaption><p>You can see for the same frame id and same label id, the label has moved from its initial position</p></figcaption></figure>

### Deleting Labels

Currently, to delete a label in Aquarium, you replace the existing frame in full, supplying its set of labels minus the ones you want to remove.

The steps to delete labels look almost identical to the initial label upload process:

1. Create a new `LabeledDataset` object
2. For each frame that has a label you would like to delete, create a `LabeledFrame` object making sure to set the **`update_type`** parameter to **`ADD`**
3. Add all the correct/desired labels minus the ones you wish to delete to the `LabeledFrame` object
4. Add the `LabeledFrame` object to the `LabeledDataset` using `add_frame()`
5. Finally, call `create_or_update_dataset()`, passing in your project name, dataset name, and the `LabeledDataset`

{% hint style="success" %}
**You only need to create `LabeledFrame` objects for the frames that have labels that need updating.** You don't have to complete this process for every frame in your dataset!
{% endhint %}

Creating the `LabeledFrame` object with `update_type='ADD'` lets us rewrite the labels associated with the frame. The change appears in the frame's history, where you can view both the old and new labels.
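Because deletion works by re-adding the labels you want to keep, the first step is usually to compute that "keep" set. The sketch below is plain Python and assumes you have your own record of each frame's current labels. The helper `labels_to_keep` and the label IDs are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def labels_to_keep(
    current_labels: List[Dict[str, Any]], label_ids_to_delete: Iterable[str]
) -> List[Dict[str, Any]]:
    """Return the current labels minus those whose label_id is marked for deletion."""
    to_delete = set(label_ids_to_delete)
    return [label for label in current_labels if label["label_id"] not in to_delete]


# Hypothetical frame with four labels, three of which should be removed
current = [
    {"label_id": f"frame_gt_{i}", "classification": "straight"} for i in range(4)
]
kept = labels_to_keep(current, ["frame_gt_1", "frame_gt_2", "frame_gt_3"])
print([label["label_id"] for label in kept])
```

Each label in `kept` would then be added back to the new `LabeledFrame` before uploading.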

Example below shows the original frame version with four total labels (green boxes):

<figure><img src="/files/JJH15FiObAxuPOdTNMcT" alt=""><figcaption><p>Original frame with 4 total labels</p></figcaption></figure>

This image shows the frame afterward, with all but one label removed:

<figure><img src="/files/RLrbFtif3M4P25VSJvBm" alt=""><figcaption><p>After code block below is run, one label remains</p></figcaption></figure>

#### Example code block below:

{% code lineNumbers="true" %}

```python
import aquariumlearning as al

al_client = al.Client()
al_client.set_credentials(api_key='YOUR_API_KEY')

PROJECT_NAME = 'PROJECT_NAME'
DATASET_NAME = 'DATASET_NAME'

# in the example images above, this would be 51_104001003D4C9C00_tile_264
FRAME_ID = 'FRAME_ID'

# create new dataset
labeled_dataset = al.LabeledDataset()

# create new labeled frame object, remember the update_type param
new_labeled_frame = al.LabeledFrame(frame_id=FRAME_ID, update_type='ADD')

# add the labels you want to keep
# you can use the same label ids or new label ids
# (TOP_VAL, LEFT_VAL, WIDTH_VAL, HEIGHT_VAL are placeholders)
new_labeled_frame.add_label_2d_bbox(
    label_id='LABEL_ID',
    classification='straight',
    top=TOP_VAL,
    left=LEFT_VAL,
    width=WIDTH_VAL,
    height=HEIGHT_VAL
)

# add your image to the frame
new_labeled_frame.add_image(image_url='ADD_IMAGE_SOURCE_URL')

# add your frame to dataset
labeled_dataset.add_frame(new_labeled_frame)

# upload newly created dataset using create_or_update_dataset
al_client.create_or_update_dataset(PROJECT_NAME, DATASET_NAME, dataset=labeled_dataset)
```

{% endcode %}

## Notes & Limitations

There are a few things to be aware of when using mutable datasets:

### Some Dataset Attributes Are Still Immutable

Mutable datasets allow all elements of a dataset (frames, metadata values, labels, bounding box geometry, etc.) to be added, updated, or deleted, but they must still be compatible with the dataset as a whole.

Most notably, the following dataset attributes must remain consistent over time:

* Set of known valid label classes
* User provided metadata field schemas
* Embedding source (i.e., embeddings are expected to be compatible between all frames in the dataset)

We plan to support changes to all of these in the future. Please let us know if any of them are particularly valuable for you.

### Inference Sets Are Pinned to a Specific Dataset Version

**When an inference set is uploaded, it will be pinned to a specific version of the labeled dataset**, which will default to the most up-to-date version at the time of submission.

Updates to the inference set itself will show up in the UI, but updates to the base dataset (ground truth) won't be incorporated.

Metrics shown will be computed against those pinned dataset labels, and any visualizations of the ground truth will be from that specific version.

### Segment Elements are Versioned

Elements (frames, labels, inferences) added to [segments](/aquarium/working-in-aquarium/organizing-your-data.md) correspond to the specific version of the element when it was added to the segment. They do not automatically update to the latest version of the element within the dataset. This is intentional, but has tradeoffs:

* At any time, you can review elements in segments as they were when they were created. This makes it easy to reproduce issues in datasets across members of the team, even as the labels or associated frames change.
* **Because you're viewing the version of the dataset as it was when it was added to the segment**, reviewing a segment after a data quality issue has been corrected may show that the issue is still present, despite the current version of the dataset being correct. Use the segment state tracking features in Aquarium (and archive old segments that have been resolved) to mitigate any potential confusion.

For example, you might create a segment with example labels of "bounding box too loose." If you go re-label those boxes and update them in the dataset, the segment will still contain the original (poorly drawn) labels, with an icon indicating that it belongs to an older version of the dataset.

![Warning icon indicating that an issue element is out-of-date.](/files/-MdSR0s-Sq48RbWzMtAQ)

It will be available for viewing, but some features (like within-dataset similarity search) may be disabled for out-of-date elements.

![](/files/-MdSRDMFPieLNhwnVkno)

### Monitoring Upload Status

Similar to batch uploads, you'll be able to view the status of your streaming uploads in the web app.

If you go to the *Project Details* page, you'll see a *Streaming Uploads* tab (previous batch uploads under your project will still be visible under *Uploads*):

![](/files/-MdPw3qCzo5Tau9kHzhd)

Each upload ID corresponds to a subset of your dataset/inference set (with the associated frame count + label count).

To view more details on which specific frames/labels are present in a given upload, you can click on the *Status* (e.g. DONE). A pop-up will appear with the following info:

![](/files/-MdSPtNzb1KqhHiIMfT5)

In the case of a failed upload, you can debug via the **Errors** section (which exposes frame-specific debug logs), and download this info to determine which frames/crops may need to be re-uploaded.

![](/files/-MdSrFwDmppnHom3asGN)

If you are running into an error and the error logs are not sufficient to understand how to fix the issue, please reach out to the Aquarium team and we can help resolve your problem.

### Migrating Projects with Immutable Datasets to Mutable

{% hint style="warning" %}
This section only applies if you ever created a Dataset in `BATCH` mode. Any update after June 2022 is uploaded in `STREAMING` mode by default and you will not need this section.
{% endhint %}

{% hint style="info" %}
Any scripts that make calls to `client.create_dataset` or `client.create_or_update_dataset` no longer need to pass the `pipeline_mode` argument in order to use streaming mode, as `"STREAMING"` is now the default argument. However, if you would prefer to continue using batch mode for your uploads, you will now have to specify `pipeline_mode="BATCH"`.
{% endhint %}
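If you have scripts that need to support both modes, one option is to build the upload arguments in a single place. This is a minimal sketch: the helper `upload_kwargs` is hypothetical, and the keyword names follow the `create_or_update_dataset()` examples on this page.

```python
from typing import Any, Dict


def upload_kwargs(
    project_id: str,
    dataset_id: str,
    dataset: Any,
    use_batch_mode: bool = False,
) -> Dict[str, Any]:
    """Build keyword arguments for an upload call.

    pipeline_mode is only added for batch uploads, since streaming
    is the default and needs no argument.
    """
    kwargs: Dict[str, Any] = {
        "project_id": project_id,
        "dataset_id": dataset_id,
        "dataset": dataset,
    }
    if use_batch_mode:
        kwargs["pipeline_mode"] = "BATCH"
    return kwargs


# With a configured client and a LabeledDataset, usage might look like:
#   al_client.create_or_update_dataset(**upload_kwargs('PROJECT_NAME', 'DATASET_NAME', dataset))
```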

You may have existing immutable datasets that were uploaded via batch mode, and want to convert them to mutable datasets.

If you go to the "Datasets" tab of your "Project Details" page, each of the listed legacy datasets should now have a new teal *"Clone as New Mutable Dataset"* button:

![](/files/-MfJsEmuThmbXaykErNE)

When you click this button, the cloning will begin:

![](/files/-MfJsu3WngVu058jEV65)

After a minute or so, if you refresh the page, the new dataset will appear with the prefix "MUTABLE\_". The old dataset will also have a tooltip that points to the new dataset:

![](/files/-MfJtnW1Bcvacs2_pOvD)

Depending on the size of your original dataset, it may take some more time for this new mutable dataset to be fully processed and become viewable in "Explore" view.

{% hint style="warning" %}
**NOTE:** A dataset's corresponding inference sets will **not** be automatically cloned for now, but can be uploaded to the mutable dataset using the Aquarium client. Please contact us if you have questions about migrating inference sets.
{% endhint %}

