Description
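Hello! I hope you can help me. I'm trying to build a char-CNN + word-BiLSTM model for NER with torchtext. Instead of keeping a word-level vocabulary for WORD and a character-level vocabulary for CHAR, calling .build_vocab() on the NestedField appears to produce a single vocabulary, built from the NestedField's data, that is shared with the field nested inside it. Both fields report 102 entries (see the last two lines below). I think this is a bug.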
------------------------------FIELDS------------------------------
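# Assumes the legacy torchtext API; in torchtext 0.9-0.11 these names
# live in torchtext.legacy.data instead.
from torchtext.data import BucketIterator, Field, NestedField, TabularDataset

# Word-level field
WORD = Field(
    sequential=True,
    lower=False
)
# Character-level field; note that WORD itself is passed as the nesting field
CHAR = NestedField(
    WORD,
    preprocessing=lambda x: [list(i) for i in x],
)
TAG = Field(
    sequential=True,
    use_vocab=True,
    lower=False,
    unk_token=None
)
# JSON key -> (attribute name, field)
FIELDS = {
    'word': ('word', WORD),
    'char': ('char', CHAR),
    'tag': ('tag', TAG)
}
train_data, dev_data, test_data = TabularDataset.splits(
    path='data',
    train='train.json',
    validation='dev.json',
    test='test.json',
    format='json',
    fields=FIELDS
)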
------------------------------VOCABS------------------------------
WORD.build_vocab(
    train_data,
    max_size=VOCAB_SIZE,
    min_freq=MIN_FREQ,
    vectors=VECTORS
)
CHAR.build_vocab(
    train_data,
    max_size=100,
    min_freq=2,
)
TAG.build_vocab(train_data)
------------------------------ITERATORS------------------------------
train_iterator, dev_iterator, test_iterator = BucketIterator.splits(
    (train_data, dev_data, test_data),
    batch_size=BATCH_SIZE,
    device=DEVICE,
    sort=False
)
len(WORD.vocab.itos) # 102
len(CHAR.vocab.itos) # 102
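
A small helper along these lines could make the comparison more explicit than the two length checks above. It is only a sketch: `compare_vocabs` is a hypothetical name, not part of torchtext, and it assumes each vocabulary exposes its tokens as a plain list (as `vocab.itos` does here). The data in the example is toy data, not output from the run above.

```python
def compare_vocabs(word_itos, char_itos):
    """Report how two index-to-string lists relate (hypothetical helper)."""
    return {
        # True when both arguments are the very same list object
        "same_object": word_itos is char_itos,
        # True when the tokens match position by position
        "identical": list(word_itos) == list(char_itos),
        # Number of distinct tokens present in both lists
        "shared_tokens": len(set(word_itos) & set(char_itos)),
        # Lengths of the two lists, in argument order
        "sizes": (len(word_itos), len(char_itos)),
    }


# Toy data only; these are not the vocabularies from the run above.
specials = ["<unk>", "<pad>"]

# Case 1: a word-level list and a character-level list that are independent
separate = compare_vocabs(specials + ["cat", "dog"], specials + list("acdgot"))

# Case 2: the same list passed for both, mimicking a shared vocabulary
shared_list = specials + list("acdgot")
shared = compare_vocabs(shared_list, shared_list)

print(separate)
print(shared)
```

With the fields above, it would be called as `compare_vocabs(WORD.vocab.itos, CHAR.vocab.itos)`. If `same_object` or `identical` comes back `True`, both fields are holding the same vocabulary rather than two that merely have the same size.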