Specified Field dtype <torchtext.legacy.data.pipeline.Pipeline object at ...> can not be used with use_vocab=False because we do not know how to numericalize it. #1581

## ❓ Questions and Help

**Description**

I am trying to implement a sequence (multi-output) regression task using `torchtext`, but I am getting the error in the title. 

torch version: 1.10.1
torchtext version: 0.11.1

Here's how I proceed: 

**Given.** Sequential data (my own) of the form:
```
   text    label
    'w1'    '[0.1, 0.3, 0.1]' 
    'w2'    '[0.74, 0.4, 0.65]'  
    'w3'    '[0.21, 0.56, 0.23]' 
<empty line denoting the beginning of a new sentence>
    ...       ...
```
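
For illustration, a minimal sketch of how data in this shape could be read with plain Python, independent of `torchtext`. The names `parse_label` and `read_sentences` are hypothetical, and the sketch assumes a tab between the two columns and label strings written as Python-style lists; the real files may differ.

```python
import ast


def parse_label(raw):
    """Turn a string such as '[0.1, 0.3, 0.1]' into a list of floats."""
    values = ast.literal_eval(raw.strip())
    return [float(v) for v in values]


def read_sentences(lines, sep="\t"):
    """Group (word, label) pairs into sentences separated by empty lines.

    Assumption: each non-empty line is '<word><sep><label list>'.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # An empty line marks the start of a new sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        word, raw_label = line.split(sep, 1)
        current.append((word, parse_label(raw_label)))
    if current:
        sentences.append(current)
    return sentences
```

Each sentence then becomes a list of `(word, [float, float, float])` pairs, which keeps the per-token label vectors aligned with their tokens.
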
**TorchText Fields to read this data.** (works perfectly)

```
import torch
import torchtext
from torchtext.legacy import data
from torchtext.legacy import datasets


# Use a vocab for the text; later on, numericalization is based on pre-trained vectors.
TEXT = data.Field(use_vocab=True, lower=True)

# I don't think I need a vocab for the labels, because the output is a list of doubles.
# The preprocessing Pipeline transforms each label from string(list(doubles))
# to a torch.Tensor of doubles. `removeBracets` is a helper not shown here.
LABEL = data.Field(is_target=True,
                   use_vocab=False,
                   unk_token=None,
                   preprocessing=data.Pipeline(
                       lambda x: torch.tensor(list(map(float, removeBracets(x).split(' '))),
                                              dtype=torch.double)),
                   dtype=torch.DoubleTensor)  # the label is a tensor of doubles

fields = [("text", TEXT), ("label", LABEL)]
```

Since I have sequential data, I used `datasets.SequenceTaggingDataset` to split the data into training, validation and testing sets.

```
train, valid, test = datasets.SequenceTaggingDataset.splits(
    path='./data/',
    train=train_path,
    validation=validate_path,
    test=test_path,
    fields=fields)
```
Then, I use a pre-trained embedding to build the vocab for the `TEXT` `Field`, e.g.

``` 
TEXT.build_vocab(train, vectors="glove.840B.300d")
```

After that, I use `BucketIterator` to create batches of the training data efficiently.

```
train_iterator, valid_iterator = data.BucketIterator.splits(
    (train, valid),
    device=DEVICE,
    batch_size=BATCH_SIZE,
    sort_key=lambda x: len(x.text),
    repeat=False,
    sort=True)  # for validation/testing, better set it to False
```
Everything works up to this point. However, when I try to iterate over `train_iterator`,

```
batch = next(iter(train_iterator))
print("text", batch.text)
print("label", batch.label)
```

I get the following error:

```
    229         """
    230         padded = self.pad(batch)
--> 231         tensor = self.numericalize(padded, device=device)
    232         return tensor
    233 

PATH_TO\torchtext\legacy\data\field.py in numericalize(self, arr, device)
    340                     "use_vocab=False because we do not know how to numericalize it. "
    341                     "Please raise an issue at "
--> 342                     "https://github.com/pytorch/text/issues".format(self.dtype))
    343             numericalization_func = self.dtypes[self.dtype]
    344             # It doesn't make sense to explicitly coerce to a numeric type if

ValueError: Specified Field dtype <torchtext.legacy.data.pipeline.Pipeline object at 0x0XXXXXXXX> can not be used with use_vocab=False because we do not know how to numericalize it. Please raise an issue at https://github.com/pytorch/text/issues
```
I looked into issue #609. Unlike in that issue, I need a numericalization for labels of the form `list(torch.DoubleTensor)`. Do you have any suggestions?
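
To make the desired numericalization concrete, here is a rough sketch in plain Python of what a batch of such labels would need to look like before it can become a single tensor: every sentence padded to the same length with a pad vector. The function name `pad_label_batch` and the choice of `0.0` as pad value are assumptions for illustration, not part of `torchtext`, and the sketch assumes a non-empty batch whose label vectors all share one dimension.

```python
def pad_label_batch(batch, pad_value=0.0):
    """Pad a batch of label sequences to a common length.

    `batch` is a list of sentences; each sentence is a list of label
    vectors (lists of floats). Returns the padded batch and the
    original sentence lengths.
    """
    lengths = [len(seq) for seq in batch]
    max_len = max(lengths)
    dim = len(batch[0][0])  # assumes every label vector has this size
    padded = []
    for seq in batch:
        # Append one fresh pad vector per missing position.
        padding = [[pad_value] * dim for _ in range(max_len - len(seq))]
        padded.append(list(seq) + padding)
    return padded, lengths


batch = [
    [[0.1, 0.3, 0.1], [0.74, 0.4, 0.65]],
    [[0.21, 0.56, 0.23]],
]
padded, lengths = pad_label_batch(batch)
```

The resulting nested list has a regular shape of `[batch, max_len, dim]`, so it could be handed to `torch.tensor(padded, dtype=torch.double)`; the returned lengths allow the padded positions to be masked out of a regression loss.
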
