Model Compositionality #6
Description
One of the dominant scenarios for text is to use a pre-trained encoder (RoBERTa, BERT, XLM-R, etc.) and attach a task-specific head on top of it (classification head, language modeling head, POS tagging head, Q&A head, etc.). I believe this is true for vision as well (and likely for audio too, @mthrok?). To the best of my knowledge (please correct me if I am mistaken), vision currently provides a factory function for every possible combination thereof. This approach is somewhat limiting in terms of scalability and comes with boilerplate-code overhead. Versioning could also be a bit redundant if we replicate the same weights class for the encoder part across each combination.
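For concreteness, here is a minimal sketch of what such composition could look like; `ComposedModel` and `ClassificationHead` are hypothetical names for illustration, not an existing API:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Task-specific head: classifies from the encoder's output features."""
    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.linear = nn.Linear(embed_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Use the first-token ([CLS]-style) representation for classification.
        return self.linear(features[:, 0, :])

class ComposedModel(nn.Module):
    """One generic encoder + head wrapper instead of a class per combination."""
    def __init__(self, encoder: nn.Module, head: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.head = head

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(tokens))

# Any encoder producing (batch, seq, embed_dim) features could then be reused:
# model = ComposedModel(roberta_encoder, ClassificationHead(768, num_classes=2))
```

With this shape, the pre-trained encoder weights would be versioned once and shared across tasks, rather than duplicated per encoder/head pairing.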
I wonder what folks think about extending this framework to support model composition?
As a reference, HF also explicitly provides classes for every combination. Here is one example for a RoBERTa encoder + Q&A task.
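For instance, the `transformers` library ships a dedicated class per encoder/head combination:

```python
from transformers import RobertaForQuestionAnswering

# One class per combination: an analogous class exists for each task
# (RobertaForSequenceClassification, RobertaForTokenClassification,
# RobertaForMaskedLM, ...), and the same set is replicated for BERT,
# XLM-R, and every other encoder.
model = RobertaForQuestionAnswering.from_pretrained("roberta-base")
```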