fix_length in data.Field does not truncate the sequence #1324

Open
@JTWang2000

Description

I am trying to use Field and TabularDataset to process text sequence input. To make every input a fixed length, I set fix_length=MAX_SEQ_LEN in the Field.

MAX_SEQ_LEN = 128

# Fields
label_field = Field(sequential=False, use_vocab=False, batch_first=True, dtype=torch.float)
text_field = Field(use_vocab=False, tokenize=tokenizer.encode, lower=False,
                   include_lengths=False, batch_first=True, fix_length=MAX_SEQ_LEN)
fields = [('label', label_field), ('text', text_field)]

# TabularDataset
train, valid, test = TabularDataset.splits(path=path, train='train.csv', validation='valid.csv',
                                           test='test.csv', format='CSV', fields=fields, skip_header=True)

# Iterators
train_iter = BucketIterator(train, batch_size=16, sort_key=lambda x: len(x.text),
                            device=device, train=True, sort=True, sort_within_batch=True)
valid_iter = BucketIterator(valid, batch_size=16, sort_key=lambda x: len(x.text),
                            device=device, train=True, sort=True, sort_within_batch=True)
test_iter = Iterator(test, batch_size=16, device=device, train=False, shuffle=False, sort=False)

However, when I run the code, I get this warning:

Token indices sequence length is longer than the specified maximum sequence length for this model (262 > 256). Running this sequence through the model will result in indexing errors

From what I have read, fix_length should truncate the input to that length, but it does not seem to do so for me.
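A possible explanation, offered as an assumption rather than a confirmed diagnosis: the warning text looks like the one a Hugging Face transformers tokenizer prints from inside encode, which runs while the examples are loaded. fix_length, as far as I understand, is only applied later, when a batch is padded. If so, the warning fires before any truncation can happen. The sketch below shows truncating at tokenization time instead. StubTokenizer is a hypothetical stand-in used only to keep the example self-contained; a real tokenizer is assumed to accept the max_length and truncation arguments in the same way.

```python
from functools import partial

MAX_SEQ_LEN = 128


class StubTokenizer:
    """Hypothetical stand-in mimicking an encode(text, max_length=, truncation=) call."""

    def encode(self, text, max_length=None, truncation=False):
        # Fake token ids: one integer per whitespace-separated word.
        ids = [len(word) for word in text.split()]
        if truncation and max_length is not None:
            ids = ids[:max_length]  # cut the sequence while tokenizing
        return ids


tokenizer = StubTokenizer()

# Bind the truncation arguments so the callable still takes a single string,
# which is the shape Field(tokenize=...) expects.
tokenize = partial(tokenizer.encode, max_length=MAX_SEQ_LEN, truncation=True)

long_text = "token " * 300
untruncated = tokenizer.encode(long_text)  # what tokenize=tokenizer.encode produces
truncated = tokenize(long_text)            # truncated before the Field ever sees it

print(len(untruncated), len(truncated))
```

With a real tokenizer, the same partial could be passed as tokenize= in the Field definition above. fix_length would then only have to pad shorter sequences, and sort_key=lambda x: len(x.text) would see the truncated lengths.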

This happens both locally and on Google Colab:
Local: Mac m1 Big Sur 11.3.1
pytorch: 1.8.0
torchtext: 0.6.0
python: 3.8.10

Google Colab:
pytorch: 1.8.1+cu101
torchtext: 0.9.1

It really bothers me!!! Looking forward to the solution! Thanks!!!
