Conversation

AlbertvanHouten (Contributor) commented Oct 20, 2025

This pull request introduces the new experimental dataset into the dataset classes of OTX. Much of the logic, such as the management of image color channels and the tiling implementation, has been migrated to Datumaro.

Summary

How to test

Checklist

  • The PR title and description are clear and descriptive
  • I have manually tested the changes
  • All changes are covered by automated tests
  • All related issues are linked to this PR (if applicable)
  • Documentation has been updated (if applicable)

AlbertvanHouten and others added 16 commits September 4, 2025 10:10
Signed-off-by: Albert van Houten <[email protected]>
Co-authored-by: Leonardo Lai <[email protected]>
…4751)

Signed-off-by: Albert van Houten <[email protected]>
Co-authored-by: Grégoire Payen de La Garanderie <[email protected]>
…ing_extensions into feature/datumaro

# Conflicts:
#	library/src/otx/data/dataset/base_new.py
#	library/src/otx/data/dataset/classification_new.py
#	library/src/otx/data/dataset/detection_new.py
#	library/src/otx/data/dataset/instance_segmentation_new.py
#	library/src/otx/data/dataset/keypoint_detection_new.py
#	library/src/otx/data/dataset/segmentation_new.py
#	library/src/otx/data/entity/sample.py
Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
Co-authored-by: Albert van Houten <[email protected]>
github-actions bot added the DEPENDENCY, TEST, and BUILD labels Oct 20, 2025
Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
…ing_extensions into feature/datumaro

Signed-off-by: Albert van Houten <[email protected]>

# Conflicts:
#	library/pyproject.toml
#	library/src/otx/data/dataset/anomaly.py
#	library/src/otx/data/dataset/base.py
#	library/src/otx/data/dataset/classification.py
#	library/src/otx/data/dataset/detection.py
#	library/src/otx/data/dataset/instance_segmentation.py
#	library/src/otx/data/dataset/keypoint_detection.py
#	library/src/otx/data/dataset/segmentation.py
#	library/src/otx/data/dataset/tile.py
#	library/src/otx/data/factory.py
#	library/src/otx/data/module.py
#	library/src/otx/data/transform_libs/torchvision.py
#	library/tests/unit/data/samplers/test_class_incremental_sampler.py
#	library/tests/unit/data/utils/test_utils.py
AlbertvanHouten changed the title from "Feature/datumaro" to "Experimental datumaro implementation" Oct 29, 2025
AlbertvanHouten marked this pull request as ready for review October 29, 2025 09:24
AlbertvanHouten requested a review from a team as a code owner October 29, 2025 09:24
Copilot AI review requested due to automatic review settings October 29, 2025 09:24
Copilot AI left a comment

Pull Request Overview

This PR implements an experimental Datumaro dataset integration for OTX, transitioning from legacy Datumaro components to the new experimental Dataset API. The changes introduce a new sample-based architecture while maintaining compatibility with existing OTX functionality.

Key changes:

  • Migration from legacy Datumaro components to experimental Dataset API with schema-based conversion
  • Introduction of new OTXSample-based data entities with PyTree registration for TorchVision compatibility
  • Replacement of legacy polygon handling with numpy ragged arrays for better performance (see the sketch after this list)
  • Comprehensive test updates and new test implementations for the updated dataset architecture
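
As a sketch of the ragged-array idea (the exact layout in the PR is an assumption here), polygons with differing vertex counts can be stored in an object-dtype NumPy array whose elements are (N, 2) vertex arrays:

import numpy as np

# Outer container uses dtype=object so each polygon may have a different
# number of vertices; every element is an (N, 2) float array.
polygons = np.empty(2, dtype=object)
polygons[0] = np.array([[10, 10], [50, 10], [50, 50], [10, 50]], dtype=np.float32)
polygons[1] = np.array([[60, 60], [100, 60], [100, 100], [60, 100], [80, 120]], dtype=np.float32)

for poly in polygons:
    assert poly.ndim == 2 and poly.shape[1] == 2  # every polygon is (N, 2)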

Reviewed Changes

Copilot reviewed 59 out of 59 changed files in this pull request and generated 4 comments.

Summary per file:

  • pyproject.toml: updates the Datumaro dependency to the experimental branch version
  • library/src/otx/data/dataset/*.py: implements new dataset classes using experimental Datumaro with the sample-based architecture
  • library/src/otx/data/entity/sample.py: adds new OTXSample classes with PyTree registration for transform compatibility
  • library/tests/unit/types/test_label.py: updates label tests to use the new hierarchical label categories
  • library/tests/unit/data/transform_libs/test_torchvision.py: converts polygon handling from Datumaro objects to numpy arrays


Comment on lines +143 to +148
fake_polygons = np.array(
    [
        np.array([[10, 10], [50, 10], [50, 50], [10, 50]]),  # Rectangle polygon for first object
        np.array([[60, 60], [100, 60], [100, 100], [60, 100]]),  # Rectangle polygon for second object
    ]
)
Copilot AI commented Oct 29, 2025

The polygon data structure has changed from Datumaro Polygon objects to numpy ragged arrays, but the test doesn't validate that the new format maintains the same geometric properties. Consider adding assertions to verify that the polygon areas and vertex coordinates are equivalent between the old and new formats.
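
A sketch of such an assertion, assuming the legacy format was Datumaro's flat [x0, y0, x1, y1, ...] point list (the helper below is hypothetical, not part of the PR):

import numpy as np

def polygon_area(points: np.ndarray) -> float:
    # Shoelace formula over an (N, 2) vertex array.
    x, y = points[:, 0], points[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

legacy_points = [10, 10, 50, 10, 50, 50, 10, 50]  # Datumaro-style flat list
new_points = np.array([[10, 10], [50, 10], [50, 50], [10, 50]], dtype=np.float32)
assert np.allclose(np.asarray(legacy_points, dtype=np.float32).reshape(-1, 2), new_points)
assert np.isclose(polygon_area(new_points), 1600.0)  # 40 x 40 rectangle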

-    p = np.asarray(polygon.points)
-    p[0::2] = width - p[0::2]
-    return p.tolist()
+def revert_hflip(polygon: np.ndarray, width: int) -> np.ndarray:
Copilot AI commented Oct 29, 2025

The function modifies the input polygon array in-place, which could cause issues if the original array is used elsewhere. The function should create a copy before modification to avoid unintended side effects.

Suggested change:
 def revert_hflip(polygon: np.ndarray, width: int) -> np.ndarray:
+    polygon = polygon.copy()
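
For reference, a complete copy-based version could look like this (a sketch assuming the new polygons are (N, 2) vertex arrays; the PR's actual function body is not shown in this hunk):

import numpy as np

def revert_hflip(polygon: np.ndarray, width: int) -> np.ndarray:
    polygon = polygon.copy()               # work on a copy; keep the caller's array intact
    polygon[:, 0] = width - polygon[:, 0]  # mirror the x coordinates
    return polygon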

    rescaled_polygons[i] = scaled_points
else:
    # Handle empty or invalid polygons
    rescaled_polygons[i] = np.array([[0, 0], [0, 0], [0, 0]], dtype=np.float32)
Copilot AI commented Oct 29, 2025

[nitpick] Creating a dummy polygon with three identical points may not be the best approach for handling empty/invalid polygons. Consider using an empty array or a clearly marked invalid polygon structure that can be properly filtered out later in the pipeline.

Suggested change:
-rescaled_polygons[i] = np.array([[0, 0], [0, 0], [0, 0]], dtype=np.float32)
+rescaled_polygons[i] = np.empty((0, 2), dtype=np.float32)

Contributor commented:

why exactly 3 points BTW? Why not just empty tensor or 0 tensor?

Contributor Author replied:

Not entirely sure why Greg applied it in this manner, but I guess there is a validation step somewhere that requires 3 points for a valid polygon.
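
A sketch of the filter-based alternative from the nitpick above, assuming invalid polygons are represented as empty (0, 2) arrays (the helper name is illustrative):

import numpy as np

def drop_invalid_polygons(polygons: list[np.ndarray]) -> list[np.ndarray]:
    # A valid polygon needs at least 3 vertices; empty (0, 2) markers are dropped.
    return [p for p in polygons if p.shape[0] >= 3]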

# Conflicts:
#	library/tests/unit/data/dataset/test_base.py
#	library/tests/unit/data/test_tiling.py
AlbertvanHouten and others added 18 commits December 8, 2025 16:33
Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 98 out of 99 changed files in this pull request and generated 9 comments.



Comment on lines +113 to +115
object.__setattr__(
    label_group, "labels", [id_to_name_mapping.get(label, label) for label in label_group.labels]
)
Copilot AI commented Dec 16, 2025

Using object.__setattr__ to mutate frozen dataclass attributes suggests that the upstream dataclass (HierarchicalLabelCategories or LabelGroup) is frozen. Consider whether these objects should be mutable by design or if there's a better approach (e.g., creating new instances rather than mutating in place) to avoid circumventing immutability guarantees.
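
A self-contained sketch of the non-mutating alternative, assuming LabelGroup is a frozen dataclass with a labels field (the stand-in class below is illustrative, not the upstream definition):

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class LabelGroup:  # stand-in for the assumed frozen upstream dataclass
    name: str
    labels: list[str]

id_to_name_mapping = {"0": "cat", "1": "dog"}
groups = [LabelGroup(name="animals", labels=["0", "1"])]
# Create new instances instead of mutating frozen dataclasses in place.
groups = [
    replace(g, labels=[id_to_name_mapping.get(lbl, lbl) for lbl in g.labels])
    for g in groups
]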

Comment on lines +129 to +131
if img.ndim == 3 and img.shape[0] <= 4 and img.shape[0] < min(img.shape[1:]):
    # Image is in CHW format, transpose to HWC
    return np.ascontiguousarray(img.transpose(1, 2, 0))
Copilot AI commented Dec 16, 2025

The heuristic assumes that if the first dimension is ≤4 and smaller than other dimensions, the image is in CHW format. This may fail for edge cases where an image legitimately has height or width of 4 or less. Consider adding additional validation or a more robust channel detection mechanism to prevent incorrect transpositions.
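
One way to make the conversion less heuristic is to let callers state the layout explicitly and keep the shape check only as a fallback; a sketch (the function name and parameter are assumptions, not the PR's API):

import numpy as np

def to_hwc(img: np.ndarray, channels_last: bool | None = None) -> np.ndarray:
    if img.ndim != 3:
        return img
    if channels_last is None:
        # Fallback: the same shape heuristic used in the PR.
        channels_last = not (img.shape[0] <= 4 and img.shape[0] < min(img.shape[1:]))
    return img if channels_last else np.ascontiguousarray(img.transpose(1, 2, 0))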

def _is_native_torchvision_transform(self, transform: tvt_v2.Transform) -> bool:
    """Check if the transform is a native torchvision transform."""
    module = type(transform).__module__
    return module.startswith("torchvision.")
Copilot AI commented Dec 16, 2025

The check relies on the module name starting with 'torchvision.'. If third-party packages or custom transforms mimic this naming convention, they may be incorrectly classified. Consider a more robust mechanism, such as checking against a known set of torchvision transform types or using an interface/base class check.

Suggested change:
-return module.startswith("torchvision.")
+# Only treat as native if defined in torchvision.transforms.v2 or its _transforms submodule
+return module == "torchvision.transforms.v2" or module == "torchvision.transforms.v2._transforms"
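
Another option along the same lines is to test class membership against what torchvision.transforms.v2 actually exports, rather than matching module names; a sketch (not the PR's implementation):

import torchvision.transforms.v2 as tvt_v2

# Collect the transform classes exported by torchvision.transforms.v2 once.
_NATIVE_TRANSFORMS = {
    obj
    for obj in vars(tvt_v2).values()
    if isinstance(obj, type) and issubclass(obj, tvt_v2.Transform)
}

def is_native_torchvision_transform(transform: tvt_v2.Transform) -> bool:
    return type(transform) in _NATIVE_TRANSFORMS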

Comment on lines +158 to +161
if (ph, pw) == (tw, th):
    pm = pm.transpose(-1, -2)
# Else if target seems swapped relative to prediction, transpose target
elif (th, tw) == (pw, ph):
Copilot AI commented Dec 16, 2025

The dimension alignment logic attempts to fix spatial mismatches by transposing. However, this heuristic may not handle all cases correctly, especially if dimensions are neither equal nor swapped. Consider logging a warning when dimensions mismatch and a transpose is applied, so that unexpected dimension mismatches can be detected and investigated.

Suggested change:
 if (ph, pw) == (tw, th):
+    log.warning(
+        f"Predicted mask shape {pm.shape} does not match target mask shape {tm.shape}. "
+        "Transposing predicted mask to align spatial dimensions."
+    )
     pm = pm.transpose(-1, -2)
 # Else if target seems swapped relative to prediction, transpose target
 elif (th, tw) == (pw, ph):
+    log.warning(
+        f"Target mask shape {tm.shape} does not match predicted mask shape {pm.shape}. "
+        "Transposing target mask to align spatial dimensions."
+    )

Comment on lines 112 to 114
"RandomPhotometricDistort",
"RandomGaussianBlur",
"RandomGaussianNoise",
Copilot AI commented Dec 16, 2025

The removed 'TopdownAffine' augmentation from the configurable_augs list may indicate that this augmentation is no longer supported or has been moved elsewhere. Ensure that test coverage for this augmentation exists in another test or that its removal is intentional and documented.

Contributor commented:

Why did you remove TopdownAffine from the configurable augs list? We actually provide the possibility to turn this augmentation on/off in Geti.

AlbertvanHouten (Contributor Author) replied Dec 19, 2025:

iirc, TopdownAffine can't be an optional augmentation since it is responsible for resizing the image. Otherwise, at least for this test, the images ended up the wrong size.

Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
Signed-off-by: Albert van Houten <[email protected]>
…ing_extensions into feature/datumaro

Signed-off-by: Albert van Houten <[email protected]>

# Conflicts:
#	library/uv.lock
Contributor Author commented:

@kprokofi please review if these changes make sense. The old logic only worked when tiling with square images.

Contributor replied:

I think the padding does nothing here. We resize and pad to the same size.

AlbertvanHouten (Contributor Author) replied Dec 19, 2025:

Without the padding, tiling only works if the image is square (which it was previously in the integration test). I have replaced it with another dataset, since the old one's polygons were self-intersecting, which isn't supported; the new dataset has rectangular images, which is what broke this particular tiling recipe.

Comment on lines +144 to +150
# Unit test models separately using --forked flag so that each is run in a separate process.
# Running these models after each other from the same process can cause segfaults
- name: Unit testing models
  working-directory: library
  run: uv run pytest tests/unit --cov --cov-report=xml  # old command (replaced)
  env:
    OMP_NUM_THREADS: "1"
  run: uv run pytest tests/unit/backend/native/models --cov --cov-report=xml --cov-append --forked  # new command
Contributor commented:

Do you have any idea what causes the segfaults?

Contributor Author replied:

Unclear to me. I've tried a few different things to see if it would fix it, like clearing up pytorch memory usage, but this was the only way that I could fix it.

Comment on lines +53 to +54
# TODO(gdlg): Add support for ignore_index again
ignore_index: int = 255, # noqa: ARG003
Contributor commented:

There is a TODO here. Do you know what the purpose of this ignore_index parameter is?

Contributor replied:

The ignore index is necessary for semantic segmentation tasks: it marks pixels that should be ignored during training.

Contributor Author replied:

Is there any use case for the ignore index being different from the value used for background pixels? Or is it always the same?

kprokofi (Contributor) replied Dec 19, 2025:

It is always the same if we don't count background as a training class. Sometimes (in some datasets), ignore_index = -1 is used to indicate invalid pixels instead of 255.

However, in datasets like Pascal VOC we use the background in training and mark it as class 0.
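
For context on the thread above, ignore_index is the target value a segmentation loss skips; a minimal sketch with PyTorch's CrossEntropyLoss (shapes are illustrative):

import torch

criterion = torch.nn.CrossEntropyLoss(ignore_index=255)
logits = torch.randn(1, 4, 8, 8)     # (batch, num_classes, H, W)
target = torch.full((1, 8, 8), 255)  # every pixel starts as "ignore"
target[0, :4, :] = 1                 # label some pixels so the loss is defined
loss = criterion(logits, target)     # pixels equal to 255 contribute nothing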

kprokofi (Contributor) left a comment

First round of comments

     self,
     batch_tile_preds: list[OTXPredBatch],
-    batch_tile_attrs: list[list[dict]],
+    batch_tile_attrs: list[list[TileInfo]],
Contributor commented:

Since you are renaming tile_attrs to tile_info, could you change it everywhere for consistency?

-    for pred_mask, target_mask in zip(preds.masks, inputs.masks)
-]
+aligned: list[dict[str, Any]] = []
+for pred_mask, target_mask in zip(preds.masks, inputs.masks):
Contributor commented:

Why not just pm, tm then? (for pm, tm in ..)

if tm.ndim == 2:
    tm = tm.unsqueeze(0)

# Align spatial dimensions if swapped (e.g., [1, W, H] vs [1, H, W])
Contributor commented:

Why can the spatial dimensions be misaligned and swapped? The code here looks like a hack to me. We need to align targets and predictions at the dataset preparation / output preprocessing stage, not here.

Contributor Author replied:

Let me recheck this after my break. I have made many changes to get training to work, and probably some of these checks can be removed. I do think it's still good to make some of this code more robust against channel orders.



@register_pytree_node
class InstanceSegmentationSampleWithMask(OTXSample):
Contributor commented:

  1. Can't we combine InstanceSegmentationSample and InstanceSegmentationSampleWithMask? The difference is in the mask field only.
  2. Why do we still include polygons in InstanceSegmentationSampleWithMask?
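
For readers unfamiliar with the mechanism under discussion: registering a sample class as a PyTree node lets torchvision v2 transforms traverse its tensor leaves. A rough sketch, assuming registration goes through torch.utils._pytree (a private module whose API can vary across PyTorch versions; names and fields are illustrative, not the PR's implementation):

import torch
from torch.utils import _pytree as pytree

class Sample:
    def __init__(self, image: torch.Tensor, masks: torch.Tensor) -> None:
        self.image, self.masks = image, masks

pytree.register_pytree_node(
    Sample,
    lambda s: ((s.image, s.masks), None),  # flatten: leaves plus static context
    lambda leaves, ctx: Sample(*leaves),   # unflatten: rebuild from the leaves
)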

     cls,
     task: OTXTaskType,
-    dm_subset: DmDataset,
+    dm_subset: DmDataset | DatasetNew,
Contributor commented:

Is it for backward compatibility? I don't think we need that. cc @leoll2
Same for "convert_from_legacy".

GetiTune will depend on the new Datumaro Dataset; Geti classic has no dependency on the develop branch. OTX uses the file system, not Datumaro directly. I propose supporting only the latest version if there are no blockers.

AlbertvanHouten (Contributor Author) replied Dec 19, 2025:

Legacy datumaro datasets can still be passed here since the conversion to the new datumaro dataset schema happens in this factory. We will remove this in the future, but as long as we still support the legacy datasets in datumaro, it also makes sense to keep it here. If you already have a new datumaro Dataset, you can initialize the task's dataset immediately.

It also reduces the scope for this PR a bit.

elif isinstance(image, torch.Tensor):
    # Handle tensor in HWC format - convert to CHW
    # If the last dimension is <= 4 and smaller than other dims, it's likely HWC format
    if image.ndim == 3 and image.shape[-1] <= 4 and image.shape[-1] < min(image.shape[:-1]):
Contributor commented:

Why do we need to convert tensors from CHW (what is actually needed for training and transforms) to HWC?

if isinstance(img, np.ndarray):
    # Check if the numpy array is in CHW format (channels should be <= 4 typically)
    # If the first dimension is small (<=4) and smaller than other dimensions, it's likely CHW format
    if img.ndim == 3 and img.shape[0] <= 4 and img.shape[0] < min(img.shape[1:]):
Contributor commented:

Same comment as above. I think we should always stick to CHW format.
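
A tensor-side sketch of that idea, keeping everything in CHW and converting only at explicit boundaries (the function name and parameter are illustrative, not the PR's API):

import torch

def ensure_chw(image: torch.Tensor, channels_last: bool) -> torch.Tensor:
    # Convert HWC -> CHW only when the caller says the input is channels-last,
    # instead of guessing from the shape.
    if image.ndim == 3 and channels_last:
        return image.permute(2, 0, 1).contiguous()
    return image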


kprokofi (Contributor) left a comment

Second round of comments



 @classmethod
-def from_dm_label_groups(cls, dm_label_categories: LabelCategories) -> LabelInfo:
+def from_dm_label_groups(cls, dm_label_categories: HierarchicalLabelCategories) -> LabelInfo:
Contributor commented:

Since we deleted this in OTXDataset and define LabelInfo directly, could you clean up this class? Specifically, we don't need "from_dm_label_groups", "from_dm_label_groups_arrow", etc.

if torch.cuda.is_available():
    torch.cuda.empty_cache()
if is_xpu_available():
    import intel_extension_for_pytorch as ipex
Contributor commented:

We don't use IPEX. It is outdated.


Signed-off-by: Albert van Houten <[email protected]>

Labels

  • BUILD
  • DEPENDENCY: any changes in any dependencies (new dep or its version) should be produced via Change Request on PM
  • Geti Tune Backend: issues related to Geti Tune Studio backend
  • TEST: any changes in tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Integrate OTX with the new dataset class

8 participants