❓ Questions and Help
Description
I am trying to implement a sequence (multi-output) regression task using torchtext, but I am getting the error in the title.
torch version: 1.10.1
torchtext version: 0.11.1
Here's how I proceed:
Given sequential data (my own data) of the form:
text label
'w1' '[0.1, 0.3, 0.1]'
'w2' '[0.74, 0.4, 0.65]'
'w3' '[0.21, 0.56, 0.23]'
<empty line denoting the beginning of a new sentence>
... ...
I define TorchText Fields to read this data (this works perfectly):
import torch
import torchtext
from torchtext.legacy import data
from torchtext.legacy import datasets

# Use torchtext.vocab and, later on, numericalization based on pre-trained vectors.
TEXT = data.Field(use_vocab=True, lower=True)

LABEL = data.Field(
    is_target=True,
    use_vocab=False,  # I don't think that I need a vocab for my task, because the output is a list of doubles
    unk_token=None,
    # This Pipeline transforms labels from string(list(doubles)) to torch.Tensor(doubles).
    # removeBracets is my own helper (not shown here).
    preprocessing=data.Pipeline(
        lambda x: torch.tensor(list(map(float, removeBracets(x).split(' '))),
                               dtype=torch.double)),
    dtype=torch.DoubleTensor)  # the label is a tensor of doubles

fields = [("text", TEXT), ("label", LABEL)]
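The label preprocessing step above turns a string such as '[0.1, 0.3, 0.1]' into numbers. A minimal standalone sketch of that conversion is shown below. It assumes the raw label is a plain Python-style list literal without surrounding quotes; `parse_label` is a hypothetical name used for illustration and is not the `removeBracets` helper from the post.

```python
import ast


def parse_label(raw):
    """Turn a label string such as '[0.1, 0.3, 0.1]' into a list of floats.

    Hypothetical helper for illustration; assumes `raw` is a list literal.
    """
    values = ast.literal_eval(raw.strip())  # safely evaluates the list literal
    return [float(v) for v in values]


print(parse_label('[0.74, 0.4, 0.65]'))
```

Parsing the literal as a whole avoids depending on whether the values are separated by spaces, commas, or both.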
Since I have sequential data, I used datasets.SequenceTaggingDataset to split the data into training, validation and testing sets.
train, valid, test = datasets.SequenceTaggingDataset.splits(
    path='./data/',
    train=train_path,
    validation=validate_path,
    test=test_path,
    fields=fields)
Then, I use a pre-trained embedding to build the vocab for the TEXT Field, e.g.
TEXT.build_vocab(train, vectors="glove.840B.300d")
After that, I use BucketIterator to create batches of the training data efficiently.
train_iterator, valid_iterator = data.BucketIterator.splits(
    (train, valid),
    device=DEVICE,
    batch_size=BATCH_SIZE,
    sort_key=lambda x: len(x.text),
    repeat=False,
    sort=True)  # for validation/testing, better set it to False
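To illustrate why sorting by sentence length helps, here is a simplified pure-Python sketch of the idea behind length-based batching. It is not BucketIterator's actual implementation; `bucket_batches` is a hypothetical function that only sorts the examples by length and cuts them into consecutive batches, so that sentences of similar length end up together and need little padding.

```python
def bucket_batches(examples, batch_size, key=len):
    """Sort examples by length and cut them into consecutive batches."""
    ordered = sorted(examples, key=key)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


sentences = [["w1", "w2", "w3"], ["w4"], ["w5", "w6"], ["w7", "w8", "w9", "w10"]]
for batch in bucket_batches(sentences, batch_size=2):
    # Print the sentence lengths in each batch.
    print([len(s) for s in batch])
```

With the four example sentences, the two shortest share one batch and the two longest share the other.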
Everything works perfectly till now. However, when I try to iterate over train_iterator,
batch = next(iter(train_iterator))
print("text", batch.text)
print("label", batch.label)
I get the following error:
229 """
230 padded = self.pad(batch)
--> 231 tensor = self.numericalize(padded, device=device)
232 return tensor
233
PATH_TO\torchtext\legacy\data\field.py in numericalize(self, arr, device)
340 "use_vocab=False because we do not know how to numericalize it. "
341 "Please raise an issue at "
--> 342 "https://github.com/pytorch/text/issues".format(self.dtype))
343 numericalization_func = self.dtypes[self.dtype]
344 # It doesn't make sense to explicitly coerce to a numeric type if
ValueError: Specified Field dtype <torchtext.legacy.data.pipeline.Pipeline object at 0x0XXXXXXXX> can not be used with use_vocab=False because we do not know how to numericalize it. Please raise an issue at https://github.com/pytorch/text/issues
I looked into issue #609. Unlike in that issue, I need to find a numericalization for the labels, which are of the form list(torch.DoubleTensor). Do you have any suggestions?
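One possible direction, offered as a sketch rather than a confirmed fix, is to pad the per-token label vectors outside of the Field and build the tensor directly. The pure-Python function below is hypothetical (`pad_label_batch` is not part of torchtext). It assumes a non-empty batch in which every token has a label vector of the same width, and it pads shorter sentences with zero vectors.

```python
def pad_label_batch(batch, pad_value=0.0):
    """Pad each sentence's label vectors to the longest sentence in the batch.

    batch: list of sentences; each sentence is a list of equal-width float vectors.
    Returns (padded, lengths); every entry of padded has max_len vectors.
    """
    width = len(batch[0][0])  # assumes the first sentence has at least one token
    max_len = max(len(sentence) for sentence in batch)
    padded = [
        [list(vec) for vec in sentence]
        + [[pad_value] * width for _ in range(max_len - len(sentence))]
        for sentence in batch
    ]
    lengths = [len(sentence) for sentence in batch]
    return padded, lengths


labels = [
    [[0.1, 0.3, 0.1], [0.74, 0.4, 0.65]],  # sentence with two tokens
    [[0.21, 0.56, 0.23]],                  # sentence with one token
]
padded, lengths = pad_label_batch(labels)
print(padded)
print(lengths)
```

The padded nested list is rectangular, so `torch.tensor(padded, dtype=torch.double)` would yield a tensor of shape (batch, max_len, width). The returned lengths can be used to mask the padded positions in the loss.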