
[Discussion] How do we want to handle torchvision.prototype.features.Feature's? #5045

Open

@pmeier

This issue is meant to spark a discussion about how we want to handle Feature's in the future. There are a lot of open questions, which I'll try to summarize below, giving my opinion on each. You can find the current implementation under torchvision.prototype.features.

What are Feature's?

Feature's are subclasses of torch.Tensor and their purpose is threefold:

  1. Through their type, e.g. Image, they carry information about the kind of data they hold. The prototype transformations (torchvision.prototype.transforms) use this information to automatically dispatch an input to the correct kernel.
  2. They can optionally carry additional meta data that might be needed for transforming the feature. For example, most geometric transformations can only be performed on bounding boxes if the size of the corresponding image is known (a minimal sketch of this design follows the list).
  3. They provide a convenient interface for feature specific functionality, for example transforming the format of a bounding box.
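
To make this concrete, here is a minimal sketch of how such a tensor subclass could work. The meta data handling shown here is an assumption for illustration, not the actual prototype implementation:

```python
import torch

class Feature(torch.Tensor):
    # Hypothetical sketch: wrap the data in a tensor subclass and
    # stash additional meta data on the instance.
    def __new__(cls, data, **meta):
        feature = torch.as_tensor(data).as_subclass(cls)
        feature._meta = meta
        return feature

class BoundingBox(Feature):
    @property
    def image_size(self):
        # Meta data needed by geometric transformations (purpose 2).
        return self._meta["image_size"]

box = BoundingBox([10, 20, 50, 60], image_size=(256, 256))
assert isinstance(box, torch.Tensor)  # transforms can dispatch on the type (purpose 1)
assert box.image_size == (256, 256)
```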

There are currently three Feature's implemented

  • Image,
  • BoundingBox, and
  • Label,

but in the future we should add at least three more:

  • SemanticSegmentationMask,
  • InstanceSegmentationMask, and
  • Video.

What is the policy for adding new Feature's?

We could allow subclassing of Feature's. On the one hand, this would make it easier for datasets to conveniently bundle meta data. For example, the COCO dataset could return a CocoLabel, which in addition to the default Label.category could also have a super_category field. On the other hand, this would mean that the transforms need to handle feature subclasses well, for example by treating a CocoLabel the same as a Label (see the sketch below).
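
To make the trade-off concrete, a CocoLabel built on the hypothetical Feature sketch from above could look like this:

```python
class Label(Feature):
    # Hypothetical sketch, reusing the Feature base from above.
    @property
    def category(self):
        return self._meta["category"]

class CocoLabel(Label):
    # Dataset-specific subclass bundling extra COCO meta data.
    @property
    def super_category(self):
        return self._meta["super_category"]

label = CocoLabel(17, category="cat", super_category="animal")
assert isinstance(label, Label)  # Label-based dispatch keeps working
```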

I see two downsides with that:

  1. What if a transform needs the additional meta data carried by a feature subclass? Imagine I've added a special transformation that needs CocoLabel.super_category. Although on the surface it supports plain Label's, it will fail for them at runtime.
  2. Documenting custom features is more complicated than documenting a separate field in the sample dictionary of a dataset.

Thus, I'm leaning towards only having a few base classes.

From what data should a Feature be instantiable?

For some features, like Image or Video, there are established non-tensor objects that carry the same kind of data. Should these features know how to handle them? For example, should something like Image(PIL.Image.open(...)) work?

My vote is for yes. IMO this is very convenient, and the semantics are not unexpected compared to passing the data directly, e.g. Image(torch.rand(3, 256, 256)).
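
A sketch of what that could look like; pil_to_tensor is the existing helper from torchvision.transforms.functional, while the Image class itself is hypothetical:

```python
import PIL.Image
import torch
from torchvision.transforms.functional import pil_to_tensor

class Image(torch.Tensor):
    # Hypothetical sketch: accept raw tensors as well as PIL images.
    def __new__(cls, data):
        if isinstance(data, PIL.Image.Image):
            data = pil_to_tensor(data)
        return torch.as_tensor(data).as_subclass(cls)

image = Image(torch.rand(3, 256, 256))  # from raw tensor data
# Image(PIL.Image.open(...)) would work the same way
```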

Should Feature's have a fixed shape?

Consider the following table:

| Feature | .shape |
| --- | --- |
| Image | (*, C, H, W) |
| Label | (*) |
| BoundingBox | (*, 4) |
| SemanticSegmentationMask | (*, H, W) or (*, C, H, W) |
| InstanceSegmentationMask | (*, N, H, W) |
| Video | (*, T, C, H, W) |

(For SemanticSegmentationMask I'm not sure about the shape yet: having an extra channel dimension makes the tensor unnecessarily large, but it aligns well with segmentation image files, which are usually stored as RGB.)

Should we fix the shape to that of a single feature, i.e. remove the * from the table above, or should we only require the trailing dimensions to be correct?

My vote is for a flexible shape, since otherwise batching is not possible. For example, if we fix bounding boxes to the shape (4,), a transformation would need to transform N bounding boxes individually, while for the shape (N, 4) it could make use of parallelism (see the sketch below).
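
For example, a horizontal flip kernel written against the flexible (*, 4) shape handles a single box and a batch with the same code (the XYXY format and function name here are just assumptions for illustration):

```python
import torch

def hflip_bounding_boxes(boxes, image_width):
    # boxes has shape (*, 4) in XYXY format; works for (4,) and (N, 4) alike
    x1, y1, x2, y2 = boxes.unbind(-1)
    return torch.stack([image_width - x2, y1, image_width - x1, y2], dim=-1)

single = torch.tensor([10.0, 20.0, 50.0, 60.0])  # shape (4,)
batch = torch.tensor([[10.0, 20.0, 50.0, 60.0], [30.0, 40.0, 70.0, 80.0]])  # shape (2, 4)
assert hflip_bounding_boxes(single, 256).shape == (4,)
assert hflip_bounding_boxes(batch, 256).shape == (2, 4)
```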

On the same note, if we go for the flexible shape, do we keep the singular name of the feature? For example, do we still regard a batch of images with shape (B, C, H, W) as an Image, or should we go for the plural Images in general? My vote is for always keeping the singular, since I've often seen something like:

```python
for image, target in DataLoader(dataset, batch_size=4):
    ...
```

Should Feature's have a fixed dtype?

This makes sense for InstanceSegmentationMask, which should always be torch.bool. For all the other features I'm unsure. My gut says to use a default dtype, but also to allow other dtypes (see the sketch below).
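
A default-but-overridable dtype could be a per-class attribute; this is just a variation of the hypothetical Feature sketch from above, kept self-contained:

```python
import torch

class Feature(torch.Tensor):
    default_dtype = None  # hypothetical per-class default

    def __new__(cls, data, *, dtype=None):
        # An explicitly passed dtype wins over the class default.
        tensor = torch.as_tensor(data, dtype=dtype or cls.default_dtype)
        return tensor.as_subclass(cls)

class InstanceSegmentationMask(Feature):
    default_dtype = torch.bool  # masks default to torch.bool

mask = InstanceSegmentationMask(torch.zeros(2, 5, 5))
assert mask.dtype is torch.bool
```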

What meta data should Feature's carry?

IMO, this really depends on the decision above about fixed vs. flexible shapes. If we go for fixed shapes, a feature can basically carry any information. If we go for flexible shapes instead, we should only allow meta data that is shared by all elements of a batched feature. For example, BoundingBox.image_size is fine, but Label.category is not.

What methods should Feature's provide?

For now I've only included typical conversion methods, but of course this list is not exhaustive.

| Feature | method(s) |
| --- | --- |
| Image | .to_dtype(), .to_colorspace() |
| Label | .to_str() |
| BoundingBox | .to_format() |
| InstanceSegmentationMask | .to_semantic() |
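
As one concrete example, BoundingBox.to_format() could delegate to the existing torchvision.ops.box_convert; the way the format is tracked as meta data here is again hypothetical:

```python
import torch
from torchvision.ops import box_convert

class BoundingBox(torch.Tensor):
    # Hypothetical sketch: track the box format as meta data.
    def __new__(cls, data, *, format="xyxy"):
        box = torch.as_tensor(data).as_subclass(cls)
        box.format = format
        return box

    def to_format(self, format):
        converted = box_convert(self, in_fmt=self.format, out_fmt=format)
        return BoundingBox(converted, format=format)

box = BoundingBox([10.0, 20.0, 50.0, 60.0], format="xyxy")
assert box.to_format("xywh").tolist() == [10.0, 20.0, 40.0, 40.0]
```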

cc @bjuncek
