utils: plugins: Provide a choice for encoding raw UTF-8 strings #10665

Status: Open (60 commits to merge into base: master)


@cosmo0920 (Contributor) commented Jul 31, 2025

This change is needed because RFC 8259 permits both of the following for JSON strings:

  1. Escaping multibyte characters (as \uXXXX sequences)
  2. Emitting raw UTF-8 strings

ref: https://datatracker.ietf.org/doc/html/rfc8259#section-8

Currently, we implement only pattern 1. However, the RFC also permits raw UTF-8 strings,
so we need to offer an option that leaves strings unescaped, especially for multibyte characters.
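
To make the two patterns concrete, here is a minimal sketch in C. The `json_escape` helper and its `escape_unicode` flag are hypothetical names written for illustration only; this is not Fluent Bit's encoder. It assumes the input is UTF-8, rejects malformed lead or continuation bytes, and does not check for overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not the Fluent Bit API): write `in` as the body of a
 * JSON string into `out`. When escape_unicode is non-zero, non-ASCII code
 * points are written as backslash-u hex escapes (pattern 1); otherwise their
 * UTF-8 bytes are copied unchanged (pattern 2).
 * Returns the number of bytes written (excluding the NUL), or -1 if the
 * input is malformed or `out` is too small.
 */
static int json_escape(const char *in, char *out, size_t cap, int escape_unicode)
{
    const unsigned char *p = (const unsigned char *) in;
    size_t o = 0;

    assert(in != NULL && out != NULL);

    while (*p) {
        unsigned long cp;
        int len;
        int i;

        /* Decode one UTF-8 sequence into a code point. */
        if (*p < 0x80)                { cp = *p;        len = 1; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; len = 2; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; len = 3; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; len = 4; }
        else { return -1; }
        for (i = 1; i < len; i++) {
            if ((p[i] & 0xC0) != 0x80) {
                return -1;
            }
            cp = (cp << 6) | (p[i] & 0x3F);
        }

        /* Worst case per code point: a 12-byte surrogate pair plus the NUL. */
        if (o + 13 > cap) {
            return -1;
        }

        if (cp == '"' || cp == '\\') {
            /* Always escaped, in both modes. */
            out[o++] = '\\';
            out[o++] = (char) cp;
        }
        else if (cp < 0x20) {
            /* Control characters are always escaped, in both modes. */
            o += (size_t) snprintf(out + o, cap - o, "\\u%04lx", cp);
        }
        else if (cp < 0x80 || !escape_unicode) {
            /* ASCII, or raw mode: copy the original bytes through. */
            memcpy(out + o, p, (size_t) len);
            o += (size_t) len;
        }
        else if (cp < 0x10000) {
            o += (size_t) snprintf(out + o, cap - o, "\\u%04lx", cp);
        }
        else {
            /* Outside the Basic Multilingual Plane: UTF-16 surrogate pair. */
            cp -= 0x10000;
            o += (size_t) snprintf(out + o, cap - o, "\\u%04lx\\u%04lx",
                                   0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
        }
        p += len;
    }
    out[o] = '\0';
    return (int) o;
}
```

For the first two characters of the test message (应用, UTF-8 bytes E5 BA 94 E7 94 A8), the escaped mode produces the 12 ASCII characters \u5e94\u7528, while the raw mode copies the 6 input bytes unchanged. A quote or backslash is escaped in both modes.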

Closes #10631.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
[SERVICE]
    Flush 1
    Daemon off
    Log_Level debug
    grace   2
    grace_input   1
    json.escape_unicode Off


[INPUT]
    Name         tail
    path          my_app_gbk.log
    Tag           dynamic_logs
    generic.encoding GBK
    Refresh_Interval  5
    Skip_Long_Lines On
    read_from_head true

[OUTPUT]
    Name        file
    Match       *
    File        test3.log

with the following GBK-encoded file:

{"message": "应用启动成功", "level": "info"}
  • Debug log output from testing the change
Fluent Bit v4.1.0
* Copyright (C) 2015-2025 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _             ___  _____ 
|  ___| |                | |   | ___ (_) |           /   ||  _  |
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __/ /| || |/' |
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / / /_| ||  /| |
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /\___  |\ |_/ /
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/     |_(_)___/ 


[2025/07/31 20:35:30] [ info] Configuration:
[2025/07/31 20:35:30] [ info]  flush time     | 1.000000 seconds
[2025/07/31 20:35:30] [ info]  grace          | 2 seconds
[2025/07/31 20:35:30] [ info]  daemon         | 0
[2025/07/31 20:35:30] [ info] ___________
[2025/07/31 20:35:30] [ info]  inputs:
[2025/07/31 20:35:30] [ info]      tail
[2025/07/31 20:35:30] [ info] ___________
[2025/07/31 20:35:30] [ info]  filters:
[2025/07/31 20:35:30] [ info] ___________
[2025/07/31 20:35:30] [ info]  outputs:
[2025/07/31 20:35:30] [ info]      file.0
[2025/07/31 20:35:30] [ info] ___________
[2025/07/31 20:35:30] [ info]  collectors:
[2025/07/31 20:35:30] [ info] [fluent bit] version=4.1.0, commit=3db7314ecb, pid=3084
[2025/07/31 20:35:30] [debug] [engine] coroutine stack size: 36864 bytes (36.0K)
[2025/07/31 20:35:30] [ info] [storage] ver=1.2.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2025/07/31 20:35:30] [ info] [simd    ] disabled
[2025/07/31 20:35:30] [ info] [cmetrics] version=1.0.5
[2025/07/31 20:35:30] [ info] [ctraces ] version=0.6.6
[2025/07/31 20:35:30] [ info] [input:tail:tail.0] initializing
[2025/07/31 20:35:30] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2025/07/31 20:35:30] [debug] [tail:tail.0] created event channels: read=25 write=26
[2025/07/31 20:35:30] [debug] [input:tail:tail.0] flb_tail_fs_stat_init() initializing stat tail input
[2025/07/31 20:35:30] [debug] [input:tail:tail.0] scanning path my_app_gbk.log
[2025/07/31 20:35:30] [debug] [input:tail:tail.0] inode=191383172 with offset=0 appended as my_app_gbk.log
[2025/07/31 20:35:30] [debug] [input:tail:tail.0] scan_glob add(): my_app_gbk.log, inode 191383172
[2025/07/31 20:35:30] [debug] [input:tail:tail.0] 1 new files found on path 'my_app_gbk.log'
[2025/07/31 20:35:30] [debug] [file:file.0] created event channels: read=32 write=33
[2025/07/31 20:35:30] [ info] [output:file:file.0] worker #0 started
[2025/07/31 20:35:30] [ info] [sp] stream processor started
[2025/07/31 20:35:30] [ info] [engine] Shutdown Grace Period=2, Shutdown Input Grace Period=1
[2025/07/31 20:35:30] [debug] [input:tail:tail.0] [static files] processed 45b
[2025/07/31 20:35:30] [debug] [input:tail:tail.0] inode=191383172 file=my_app_gbk.log promote to TAIL_EVENT
[2025/07/31 20:35:30] [debug] [input:tail:tail.0] [static files] processed 0b, done
[2025/07/31 20:35:31] [debug] [task] created task=0x600003cbc000 id=0 OK
[2025/07/31 20:35:31] [debug] [output:file:file.0] task_id=0 assigned to thread #0
[2025/07/31 20:35:31] [debug] [out flush] cb_destroy coro_id=0
[2025/07/31 20:35:31] [debug] [task] destroy task=0x600003cbc000 (task_id=0)
^C[2025/07/31 20:35:33] [engine] caught signal (SIGINT)
[2025/07/31 20:35:33] [ info] [input] pausing tail.0
[2025/07/31 20:35:33] [ info] [output:file:file.0] thread worker #0 stopping...
[2025/07/31 20:35:33] [ info] [output:file:file.0] thread worker #0 stopped
[2025/07/31 20:35:33] [debug] [input:tail:tail.0] inode=191383172 removing file name my_app_gbk.log

Then, the generated file contains:

$ cat test3.log
dynamic_logs: [1753961970.958898000, {"log":"{\"message\": \"应用启动成功\", \"level\": \"info\"}"}]
  • Attached Valgrind output that shows no leaks or memory corruption was found

Output from leaks, the memory leak detector on macOS:

Process:         fluent-bit [3121]
Path:            /Users/USER/*/fluent-bit
Load Address:    0x10215c000
Identifier:      fluent-bit
Version:         0
Code Type:       ARM64
Platform:        macOS
Parent Process:  leaks [3120]
Target Type:     live task

Date/Time:       2025-07-31 20:36:15.809 +0900
Launch Time:     2025-07-31 20:36:09.331 +0900
OS Version:      macOS 15.5 (24F74)
Report Version:  7
Analysis Tool:   /Applications/Xcode.app/Contents/Developer/usr/bin/leaks
Analysis Tool Version:  Xcode 16.4 (16F6)

Physical footprint:         4577K
Physical footprint (peak):  4673K
Idle exit:                  untracked
----

leaks Report Version: 4.0, multi-line stacks
Process 3121: 761 nodes malloced for 79 KB
Process 3121: 0 leaks for 0 total leaked bytes.

From valgrind:

==41087== 
==41087== HEAP SUMMARY:
==41087==     in use at exit: 0 bytes in 0 blocks
==41087==   total heap usage: 3,328 allocs, 3,328 frees, 1,022,286 bytes allocated
==41087== 
==41087== All heap blocks were freed -- no leaks are possible
==41087== 
==41087== For lists of detected and suppressed errors, rerun with: -s
==41087== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • New Features

    • Added a global configuration option to enable or disable Unicode escaping in JSON output.
    • JSON serialization across all plugins and outputs now respects the json.escape_unicode setting.
  • Bug Fixes

    • Enhanced Unicode and special character escaping for consistent and accurate JSON serialization.
  • Refactor

    • Split string escaping into two modes: full Unicode escaping and minimal escaping.
    • Added a SIMD-accelerated path for raw string escaping to improve performance.
    • Updated all relevant components to support the new Unicode escaping configuration.
  • Tests

    • Extended test coverage to validate both Unicode-escaped and raw JSON output modes for correctness and compatibility.
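
As an illustration of the minimal-escaping mode listed above, the following scalar sketch finds the first byte that still has to be escaped when raw UTF-8 is allowed. The helper name `first_escape_needed` is hypothetical and this is not the code in the PR; a SIMD path would presumably evaluate the same per-byte condition on a block of bytes at a time. The condition itself follows RFC 8259 section 7: only the quotation mark, the backslash, and control characters below 0x20 must be escaped, so bytes of multibyte UTF-8 sequences (0x80 and above) pass through untouched.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the index of the first byte in buf[0..len)
 * that must be escaped inside a JSON string even in raw UTF-8 mode
 * (a control byte below 0x20, a double quote, or a backslash),
 * or len if no byte needs escaping.
 */
static size_t first_escape_needed(const char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char) buf[i];

        if (c < 0x20 || c == '"' || c == '\\') {
            return i;
        }
    }
    return len;
}
```

When the function returns `len`, the whole buffer can be copied to the output in one step, which is the case a fast path is meant to make cheap.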

cosmo0920 added 30 commits July 31, 2025 13:11
Signed-off-by: Hiroshi Hatake <[email protected]>
@braydonk (Contributor) left a comment

out_stackdriver change LGTM

(I didn't know my checkmark would be green lol consider that a CODEOWNER approval only)

@@ -532,7 +532,7 @@ static int store_session_get(struct flb_calyptia *ctx,
}

/* decode */
-json = flb_msgpack_raw_to_json_sds(buf, size);
+json = flb_msgpack_raw_to_json_sds(buf, size, FLB_TRUE); /* TODO: could be ASCII? */
Contributor

Should we resolve this?

Contributor Author

I'm not sure, but with the current usage no multibyte characters are exchanged between the dedicated service and this plugin.

@patrick-stephens added the ok-package-test (Run PR packaging tests) label Aug 3, 2025
Labels: docs-required, ok-package-test (Run PR packaging tests)

Successfully merging this pull request may close these issues:
Fluent Bit fails to ingest logs containing Chinese characters

3 participants