
FixedLengthTokenizer wrong tokenization with utf-8 extended characters #3714

Closed
@jtremiel

Description

Bug description
When working with a UTF-8 flat file that contains extended characters, the line is tokenized incorrectly because String.substring is used instead of working with byte arrays. For example, a character such as "è" is encoded as two bytes (so two "positions" in the file), but when the line is handled as a String it occupies only one position, so the field boundaries shift and the tokens come out wrong.

Environment
All versions

Steps to reproduce
Use a fixed-length file with several fields and put text such as "aleè" in one of the field values.
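The mismatch is easy to see in isolation. Below is a minimal sketch (not the actual FixedLengthTokenizer code) using a hypothetical layout where the first field spans bytes 0-4 and the second spans bytes 5-6: slicing the String with those byte offsets steals a character from the next field, while slicing the byte array keeps the layout intact.

```java
import java.nio.charset.StandardCharsets;

public class FixedLengthUtf8Demo {
    public static void main(String[] args) {
        // Hypothetical record: field1 = "aleè" (4 chars, 5 bytes in UTF-8),
        // field2 = "xy". A byte-oriented layout declares field1 as bytes 0-4.
        String line = "aleèxy";
        byte[] bytes = line.getBytes(StandardCharsets.UTF_8);

        System.out.println("chars = " + line.length()); // chars = 6
        System.out.println("bytes = " + bytes.length);  // bytes = 7

        // Char-based substring with the byte offsets shifts the boundary:
        System.out.println(line.substring(0, 5)); // "aleèx" -- steals 'x' from field2

        // Byte-based slicing preserves the declared layout:
        String f1 = new String(bytes, 0, 5, StandardCharsets.UTF_8); // "aleè"
        String f2 = new String(bytes, 5, 2, StandardCharsets.UTF_8); // "xy"
        System.out.println(f1 + " | " + f2);
    }
}
```

Because "è" (U+00E8) occupies two bytes in UTF-8, the char count (6) and byte count (7) disagree, so any tokenizer that applies byte-based column ranges to a Java String will misalign every field after the first multi-byte character.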
