Need help with serving JSON with UTF8 string property #302

@mosolovsa

Hi! I need help with serving JSON that contains a UTF-8 string.

As far as I understand, the JSON response is currently serialized by crow::json::dump_internal:

inline void dump_internal(const wvalue& v, std::string& out) const

which, for strings, calls dump_string and in turn crow::json::escape:

case type::String: dump_string(v.s, out); break;

inline void dump_string(const std::string& str, std::string& out) const

inline void escape(const std::string& str, std::string& ret)

Commit cdd6139 removed the 0 <= c && check and changed char to unsigned char, so the logic stayed the same: escape invisible chars (0 < ch < 32).
Commit df41cbe by @lcsdavid changed unsigned char to auto (i.e. char).

char - type for character representation which can be most efficiently processed on the target system (has the same representation and alignment as either signed char or unsigned char, but is always a distinct type).

I'm not sure exactly what problem that was supposed to solve, but now it is not only invisible chars that get escaped (0 < char < 0x20, https://www.asciitable.com/): the bytes of Unicode sequences may be escaped too (if auto -> char resolves to a signed char on the target architecture).

Every byte of a multibyte UTF-8 sequence has its most significant bit set (i.e. 0x80), so as a signed char it is always negative under two's complement (almost every architecture), and ch < 0x20 is now true for every byte of any non-ASCII Unicode symbol.
https://en.wikipedia.org/wiki/UTF-8#Encoding
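
To make the signedness effect concrete, here is a minimal sketch. The helper name count_flagged is hypothetical and is not part of Crow; it only counts how many bytes a ch < 0x20 check would flag when each byte is viewed through a given character type.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Crow): counts the bytes of `s` that a
// `ch < 0x20` check would flag when each byte is viewed as type CharT.
template <typename CharT>
std::size_t count_flagged(const std::string& s)
{
    std::size_t n = 0;
    for (char raw : s)
    {
        // Reinterpret the byte as the character type under test.
        const CharT ch = static_cast<CharT>(raw);
        // ch is promoted to int here, so a negative signed char stays negative.
        if (ch < 0x20)
            ++n;
    }
    return n;
}
```

With the two-byte UTF-8 encoding of a Cyrillic letter (bytes 0xD0 0x9F), count_flagged<signed char> flags both bytes, while count_flagged<unsigned char> flags none. Plain char behaves like whichever of the two the platform chooses.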

The original project's solution was to store UTF-8 sequences in std::string:
ipkn/crow#189

But with the commit mentioned above, that solution no longer works.

The middleware just adds UTF-8 headers to text, and I'm not sure the middleware is the right place to undo these escapes.
#202

I have almost no grasp of the codebase, but it seems to me that it would be nice to have a customization point for defining the escape function, or to introduce a new JSON value type raw_string that would not be escaped later. Maybe default
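
One possible shape for such a customization point is sketched below. Everything in it (escape_fn, no_escape, dump_string_with) is a hypothetical name made up for illustration; none of it exists in Crow, and it is only meant to show the idea of passing the escape policy in from outside.

```cpp
#include <functional>
#include <string>

// Hypothetical escape policy type: appends an escaped form of the input
// to the output buffer. Not an existing Crow type.
using escape_fn = std::function<void(const std::string&, std::string&)>;

// Pass-through policy: copies the bytes unchanged. The caller is then
// responsible for the content already being valid inside a JSON string.
inline void no_escape(const std::string& in, std::string& out)
{
    out += in;
}

// Writes a quoted JSON string using whichever escape policy is supplied.
inline void dump_string_with(const std::string& str, std::string& out,
                             const escape_fn& esc)
{
    out.push_back('"');
    esc(str, out);
    out.push_back('"');
}
```

A raw_string value type would amount to always choosing the pass-through policy for that value, which trades safety for control: quotes and backslashes in the input would no longer be escaped either.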

I could make a PR, given some high-level guidance about what would be acceptable in this situation.

For now, I've just reverted to unsigned char and that works fine for me. Could someone kindly explain why that's wrong? Elaborating on how it should be done would be even better. Thanks!
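
For reference, the unsigned char approach can be sketched roughly as follows. This is a standalone illustration under the assumption that only control bytes below 0x20, quotes, and backslashes need escaping; escape_sketch is a hypothetical name and this is not Crow's actual escape implementation.

```cpp
#include <string>

// Sketch of a JSON string escape routine (not Crow's actual code).
// Each byte is compared as unsigned char, so UTF-8 bytes >= 0x80 are
// never mistaken for control characters.
inline void escape_sketch(const std::string& str, std::string& ret)
{
    static const char hex[] = "0123456789abcdef";
    ret.reserve(ret.size() + str.size() + 10);
    for (char raw : str)
    {
        const unsigned char c = static_cast<unsigned char>(raw);
        switch (c)
        {
            case '"':  ret += "\\\""; break;
            case '\\': ret += "\\\\"; break;
            case '\n': ret += "\\n";  break;
            case '\b': ret += "\\b";  break;
            case '\f': ret += "\\f";  break;
            case '\r': ret += "\\r";  break;
            case '\t': ret += "\\t";  break;
            default:
                if (c < 0x20)
                {
                    // Remaining control bytes become \u00XX.
                    ret += "\\u00";
                    ret += hex[c >> 4];
                    ret += hex[c & 0x0f];
                }
                else
                {
                    // Printable ASCII and all UTF-8 bytes pass through.
                    ret += raw;
                }
        }
    }
}
```

The explicit cast makes the comparison independent of whether plain char is signed on the target platform, which is the property the auto version loses.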

Labels: bug