Specified Field dtype <torchtext.legacy.data.pipeline.Pipeline object at ...> can not be used with use_vocab=False because we do not know how to numericalize it. #1581

Open
@MSiba

❓ Questions and Help

Description

I am trying to implement a sequence (multi-output) regression task using torchtext, but I am getting the error in the title.

torch version: 1.10.1
torchtext version: 0.11.1

Here's how I proceed:

Given: my own sequential data, of the form:

   text    label
    'w1'    '[0.1, 0.3, 0.1]' 
    'w2'    '[0.74, 0.4, 0.65]'  
    'w3'    '[0.21, 0.56, 0.23]' 
<empty line denoting the beginning of a new sentence>
    ...       ...

I define TorchText Fields to read this data (this step works without errors):

import torch
import torchtext
from torchtext.legacy import data
from torchtext.legacy import datasets


# use torchtext.vocab and, later on, numericalization based on pre-trained vectors
TEXT = data.Field(use_vocab=True, lower=True)

LABEL = data.Field(
    is_target=True,
    use_vocab=False,  # I don't think I need a vocab for my task, because the output is a list of doubles
    unk_token=None,
    # This Pipeline transforms labels from string(list(doubles)) to torch.Tensor(doubles).
    # removeBracets is a helper whose definition is not shown here.
    preprocessing=data.Pipeline(
        lambda x: torch.tensor(list(map(float, removeBracets(x).split(' '))),
                               dtype=torch.double)),
    dtype=torch.DoubleTensor)  # the label is a tensor of doubles

fields = [("text", TEXT), ("label", LABEL)]
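Since the definition of removeBracets is not shown, here is a minimal, self-contained sketch of the string-to-floats step on its own. It assumes every label looks like the sample rows above (a bracketed, comma-separated list, possibly wrapped in quotes). The name `parse_label` is hypothetical and is not part of torchtext.

```python
import json


def parse_label(raw):
    """Convert a string such as '[0.1, 0.3, 0.1]' into a list of floats."""
    # Drop surrounding whitespace and optional quote characters around the list.
    cleaned = raw.strip().strip("'\"")
    # A bracketed, comma-separated list of numbers is valid JSON.
    values = json.loads(cleaned)
    return [float(v) for v in values]


print(parse_label("[0.74, 0.4, 0.65]"))
```

Under this assumption, the resulting list could then be passed to `torch.tensor(..., dtype=torch.double)` inside the preprocessing step.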

Since I have sequential data, I used datasets.SequenceTaggingDataset to split the data into training, validation and testing sets.

train, valid, test = datasets.SequenceTaggingDataset.splits(
    path='./data/',
    train=train_path,
    validation=validate_path,
    test=test_path,
    fields=fields)
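To make the expected file layout concrete, the sketch below groups lines into sentences in plain Python: one token and its label per line, with a blank line closing a sentence. It illustrates the format only and is not torchtext's implementation. The function name `read_sentences` is hypothetical, and the tab separator between the two columns is an assumption about the data files.

```python
def read_sentences(lines, separator="\t"):
    """Group 'token<sep>label' lines into sentences split on blank lines."""
    sentences = []
    tokens, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        # Split only once so that spaces or separators inside the label survive.
        token, label = line.split(separator, 1)
        tokens.append(token)
        labels.append(label)
    if tokens:
        # The last sentence may not be followed by a blank line.
        sentences.append((tokens, labels))
    return sentences


sample = [
    "w1\t[0.1, 0.3, 0.1]",
    "w2\t[0.74, 0.4, 0.65]",
    "",
    "w3\t[0.21, 0.56, 0.23]",
]
for tokens, labels in read_sentences(sample):
    print(tokens, labels)
```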

Then, I use a pre-trained embedding to build the vocab for the TEXT Field, e.g.

TEXT.build_vocab(train, vectors="glove.840B.300d")

After that, I use BucketIterator to create batches of the training data efficiently.

train_iterator, valid_iterator = data.BucketIterator.splits(
    (train, valid),
    device=DEVICE,
    batch_size=BATCH_SIZE,
    sort_key=lambda x: len(x.text),
    repeat=False,
    sort=True)  # for validation/testing, better set it to False

Everything works up to this point. However, when I try to iterate over train_iterator,

batch = next(iter(train_iterator))
print("text", batch.text)
print("label", batch.label)

I get the following error:

    229         """
    230         padded = self.pad(batch)
--> 231         tensor = self.numericalize(padded, device=device)
    232         return tensor
    233 

PATH_TO\torchtext\legacy\data\field.py in numericalize(self, arr, device)
    340                     "use_vocab=False because we do not know how to numericalize it. "
    341                     "Please raise an issue at "
--> 342                     "https://github.com/pytorch/text/issues".format(self.dtype))
    343             numericalization_func = self.dtypes[self.dtype]
    344             # It doesn't make sense to explicitly coerce to a numeric type if

ValueError: Specified Field dtype <torchtext.legacy.data.pipeline.Pipeline object at 0x0XXXXXXXX> can not be used with use_vocab=False because we do not know how to numericalize it. Please raise an issue at https://github.com/pytorch/text/issues

I looked at issue #609. Unlike that issue, I need a numericalization for labels that are lists of torch.DoubleTensor (one vector of doubles per token). Do you have any suggestions?
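To illustrate what such a numericalization would have to produce, here is a minimal sketch in plain Python. It pads every sentence's list of label vectors to the longest sentence in the batch, so that the batch becomes rectangular. The name `pad_label_sequences` and the zero padding value are assumptions for illustration, not something torchtext provides.

```python
def pad_label_sequences(batch, pad_value=0.0):
    """Pad each sentence (a list of label vectors) to the longest one in the batch."""
    max_len = max(len(seq) for seq in batch)
    dim = len(batch[0][0])  # assumes every label vector has the same length
    padded = []
    for seq in batch:
        missing = max_len - len(seq)
        # Append one fresh padding vector per missing time step.
        padded.append(list(seq) + [[pad_value] * dim for _ in range(missing)])
    return padded


batch = [
    [[0.1, 0.3, 0.1], [0.74, 0.4, 0.65]],  # sentence with two tokens
    [[0.21, 0.56, 0.23]],                  # sentence with one token
]
print(pad_label_sequences(batch))
```

The padded nested list has shape (batch, max_len, dim) and could be handed to `torch.tensor(padded, dtype=torch.double)`. The padded positions would need to be masked out when computing the regression loss.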
