
FixedLengthTokenizer wrong tokenization with utf-8 extended characters #3714

Closed
@jtremiel

Description

Bug description
When working with a UTF-8 flat file that contains extended characters, the line is tokenized incorrectly because String.substring is used instead of working with byte arrays. For example, a character such as "è" is encoded as two bytes (so two "positions" in the file), but when the line is handled as a String it occupies only one position, so the field boundaries shift and the tokens come out wrong.

Environment
All versions

Steps to reproduce
Use a fixed-length file with several fields and put text such as "aleè" in one of the field values.
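The mismatch is easy to see in isolation. Below is a minimal sketch (not the actual FixedLengthTokenizer code) using a hypothetical layout where the first field spans bytes 0-4 and the second spans bytes 5-6: slicing the String with those byte offsets steals a character from the next field, while slicing the byte array keeps the layout intact.

```java
import java.nio.charset.StandardCharsets;

public class FixedLengthUtf8Demo {
    public static void main(String[] args) {
        // Hypothetical record: field1 = "aleè" (4 chars, 5 bytes in UTF-8),
        // field2 = "xy". A byte-oriented layout declares field1 as bytes 0-4.
        String line = "aleèxy";
        byte[] bytes = line.getBytes(StandardCharsets.UTF_8);

        System.out.println("chars = " + line.length()); // chars = 6
        System.out.println("bytes = " + bytes.length);  // bytes = 7

        // Char-based substring with the byte offsets shifts the boundary:
        System.out.println(line.substring(0, 5)); // "aleèx" -- steals 'x' from field2

        // Byte-based slicing preserves the declared layout:
        String f1 = new String(bytes, 0, 5, StandardCharsets.UTF_8); // "aleè"
        String f2 = new String(bytes, 5, 2, StandardCharsets.UTF_8); // "xy"
        System.out.println(f1 + " | " + f2);
    }
}
```

Because "è" (U+00E8) occupies two bytes in UTF-8, the char count (6) and byte count (7) disagree, so any tokenizer that applies byte-based column ranges to a Java String will misalign every field after the first multi-byte character.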
