jinja : implement mixed type object keys #18955
Conversation
ngxson
left a comment
I'm rethinking this approach a bit; we should probably drop the map::unordered, keep all the data ordered inside std::vector<std::pair<value, value>>, and always perform a linear search when accessing an object.
Ofc, that will be slower, but realistically a template in the wild never has more than 50 or even 100 keys inside an object, so it's probably fine.
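As a rough sketch of the suggestion (with `value` stubbed as `std::string`, since the engine's actual value type isn't shown in this thread), the ordered-vector-with-linear-search lookup could look like:

```cpp
#include <string>
#include <utility>
#include <vector>

// Illustrative only: "value" is stubbed as std::string here; the real
// engine's value type is not part of this sketch.
using value  = std::string;
using object = std::vector<std::pair<value, value>>; // insertion-ordered

// Linear search: O(n), but fine for the <100 keys seen in real templates.
static const value * object_get(const object & obj, const value & key) {
    for (const auto & [k, v] : obj) {
        if (k == key) {
            return &v;
        }
    }
    return nullptr; // key not present
}
```

The trade-off is exactly as described above: lookups go from amortized O(1) to O(n), but insertion order is preserved for free and no hashing of keys is required at all.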
if (std::all_of(vec.begin(), vec.end(), [&](auto ikv) -> bool {
    return is_hashable(std::get<2>(ikv));
})) {
    key_type = "NamedTuple";
I think kwarg is a good alternative to NamedTuple
Btw, the
Phew, this led me down quite a few rabbit holes (how to use hash specialization in a class of the specialized type, the C++ standard's lacking-hash_combine drama, virtual equality overload pitfalls, what have you). Now it finally all makes sense and even compiles, but unfortunately does not work; more fun for tomorrow! :)
Massive refactoring done, now has proper value hashing and equality operators (equivalence and strict (non-)equality). Will be easy to add further operators for sorting in a follow-up PR. Lessons learned:
Hopefully not too many mistakes made,
Actually, it's the template that's faulty (not sure why this worked before?); we set:
{%- if not tools is defined %}
    {%- set tools = none %}
{%- endif %}
{%- if tools is not none and (message == user_messages[-1]) %}
    {{- "[AVAILABLE_TOOLS] [" }}
    {%- for tool in tools %}
        ...
    {%- endfor %}
    {{- "[/AVAILABLE_TOOLS]" }}
{%- endif %}
verified with transformers
ngxson
left a comment
Pretty close! Just need to optimize it a little bit by avoiding relying too much on strings.
Btw, your hash_bytes with seed is in some ways an implementation of hash_combine; you can combine the hashes of multiple elements by:
size_t cur_hash = 0; // or maybe something else
for (auto elem : arr) {
    cur_hash = hash_bytes(/* seed */ cur_hash, elem.unique_hash());
}
return cur_hash;
Doing str() or as_repr() uses a lot more memory. An example can be:
my_tuple = ("some string",)
my_outer_tuple = (my_tuple, my_tuple, my_tuple, my_tuple, my_tuple)
Calculating the hash of my_outer_tuple will now use 5 times more memory than needed, because as_repr needs to allocate memory for all strings, even when they point to the same memory.
Yep, that was the point, but I see I made a footgun: one should never start with an initial seed (unless it is a previous hash).
Very good point, I'll look into your suggestions and see what can be done.
virtual string as_string() const override {
    std::ostringstream ss;
    ss << "{";
    for (size_t i = 0; i < val_obj.size(); i++) {
        if (i > 0) ss << ", ";
        auto & [key, val] = val_obj.at(i);
        ss << value_to_string_repr(key) << ": " << value_to_string_repr(val);
    }
    ss << "}";
    return ss.str();
}
Just note that I initially didn't allow as_repr or to_string to be recursive, because it can go into an infinite loop if the object/array entity points back to itself:
obj = {}
obj["a"] = obj
# or even harder to detect, nested circular
obj["a"] = {"b": obj}
A bit ironically, this is not actually classified as a vulnerability. Programs sometimes really are coded this way, and it is indeed a very common practice in high-level languages like javascript or python:
var node = {"parent": null}
node["child"] = {
"parent" : node,
"child": null,
};
// now, JSON.stringify(node) will throw:
// TypeError: Converting circular structure to JSON
Btw, may be worth a fix for to_json as it can also get stuck on this case. IIRC javascript simply throws an error if it detects a circular reference in an object.
Yeah, very aware of this, decided to ignore it for now. :)
Nested circulars are usually bypassed by keeping a reference to every encountered object that can nest during processing, and simply skipping/printing ... when coming across one that has already been visited. Will follow up if no one else does.
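A follow-up along those lines could look roughly like this sketch (the `node` type here is hypothetical, not the engine's actual value classes): keep a set of visited pointers on the current path and emit `...` on re-entry.

```cpp
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the engine's object/array values.
struct node {
    std::string name;
    std::vector<const node *> children;
};

// Cycle-safe stringification: track nodes on the current path and
// print "..." instead of recursing when one is revisited.
static void to_string_rec(const node & n, std::unordered_set<const node *> & seen,
                          std::ostringstream & out) {
    if (!seen.insert(&n).second) {
        out << "...";  // already on the current path: circular reference
        return;
    }
    out << n.name << "[";
    for (size_t i = 0; i < n.children.size(); i++) {
        if (i > 0) out << ", ";
        to_string_rec(*n.children[i], seen, out);
    }
    out << "]";
    seen.erase(&n); // only block true cycles, not shared (DAG) nodes
}

static std::string to_string(const node & n) {
    std::unordered_set<const node *> seen;
    std::ostringstream out;
    to_string_rec(n, seen, out);
    return out.str();
}
```

Erasing the node on the way back out means shared subtrees still print normally; only genuine cycles degrade to `...`, matching python's `{...}` repr behavior rather than javascript's throw.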
I'll work on this a bit, will push some optimizations directly here
static size_t hash_bytes(size_t seed, void const * bytes, size_t len, Args&&... args) noexcept
{
    static_assert(sizeof...(args) % 2 == 0);
    static constexpr size_t prime = size_t_digits == 64 ? 0x100000001b3 : 0x01000193;

    unsigned char const * c = static_cast<unsigned char const *>(bytes);
    unsigned char const * const e = c + len;

    for (; c < e; ++c) {
        seed = (seed ^ *c) * prime;
    }

    if constexpr (sizeof...(args) > 0) {
        seed = hash_bytes(seed, std::forward<Args>(args)...);
    }

    return seed;
}
just thinking out loud in math terms: if we consider hash_bytes(seed, data, size) as a (surjective) function f(s, x) with s = seed and x = tuple(data, size)
given input data x0, x1, x2:
s = initial_seed
hash = f(f(f(s, x0), x1), x2)
and your "convenient" function hash_bytes(x0, x1, ...) can be defined as g:
g() = s
g(x0) = f(s, x0)
g(x0, x1) = f(f(s, x0), x1)
...
therefore, g is still surjective and g(x0, x1) != g(x1, x0) which is so far so good.
but now, consider: f(x0 ~ x1) == g(x0, x1), with ~ the "concatenation" operation. As long as x0 ~ x1 == x2 ~ x3, then g(x0, x1) == g(x2, x3), which is expected when calculating the hash of string parts (where each part is different but the concatenated version is the same). Technically speaking, this makes g no longer a good hash function, since we have a known set of collisions. But in our context this is acceptable; FNV-1a is not that good anyway.
(note: the property f(x0 ~ x1) == g(x0, x1) can be derived from the implementation and can be considered a postulate in this context)
in practice, to prevent this, most hash functions use some kind of internal state such that hash(s, x) = output(s, mix(x)), so that hashing the concatenation no longer collides with chaining the parts. we can implement this easily but it's not really necessary; we are not doing a cryptographic hash anyway.
End of the math part. Now, in terms of speed, what I think could be nice & fun is to make hash_bytes operate on blocks of data instead of one byte at a time. Because one update depends on the prior state, the CPU will have a hard time with out-of-order execution; a simple vectorization should help a lot. Just the case of x0 ~ x1 will be a bit complicated. I'll see how to do that.
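The concatenation property above can be checked with a minimal standalone sketch (plain 64-bit FNV-1a with the standard constants; illustrative, not the PR's actual code): chaining parts through the seed is byte-for-byte identical to hashing the concatenation.

```cpp
#include <cstddef>

// Minimal 64-bit FNV-1a with an explicit seed, mirroring f(s, x) above.
static size_t fnv1a(size_t seed, const char * bytes, size_t len) {
    const size_t prime = 0x100000001b3ULL;
    for (size_t i = 0; i < len; i++) {
        seed = (seed ^ static_cast<unsigned char>(bytes[i])) * prime;
    }
    return seed;
}

// With s the FNV offset basis 0xcbf29ce484222325:
//   g("ab", "c") = fnv1a(fnv1a(s, "ab", 2), "c", 1)
//   g("a", "bc") = fnv1a(fnv1a(s, "a", 1), "bc", 2)
// both equal f(s, "abc") = fnv1a(s, "abc", 3) -- the known collision set.
```

This is exactly the "known set of collisions" described above: any two splits of the same byte sequence chain to the same value.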
FNV-1a certainly has its flaws; it was mainly chosen for its simplicity and chainability, and I think it performs well enough here in terms of quality and speed.
Feel free to improve upon it if you wish though. :)
So I ended up refactoring this into a new struct called hasher; it is inspired by nodejs's crypto.createHash():
// Old notation:
size_t seed = hash_bytes(0, data0, len0);
seed = hash_bytes(seed, data1, len1);
// ...
size_t output = hash_bytes(seed, dataN, lenN);

// New notation:
hasher hash = hasher().update(data0, len0).update(data1, len1); // ...
size_t output = hash.digest();
With this notation we could in theory implement an "internal state", though I won't add it because it's unnecessary (just mentioning it here for completeness).
The new notation should have the exact same mathematical properties as the old one (reflected by the new test cases in test-jinja).
The result will be different from the old one though, because we now process blocks of N bytes at once (on a 64-bit system, 64/8 = 8 bytes); I expect most compilers will know how to vectorize it. If the data is not a multiple of the block size, the remainder is buffered.
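As a rough sketch of that builder-style API (byte-at-a-time FNV-1a for clarity; the actual version works on 8-byte blocks and buffers the remainder, and all names here are assumptions, not the merged code):

```cpp
#include <cstddef>

// Sketch of a chainable hasher matching the update()/digest() notation
// above. Byte-at-a-time for clarity; the real implementation processes
// 8-byte blocks for speed.
struct hasher {
    size_t state = 0xcbf29ce484222325ULL; // 64-bit FNV-1a offset basis

    hasher & update(const void * data, size_t len) {
        const size_t prime = 0x100000001b3ULL;
        const unsigned char * c = static_cast<const unsigned char *>(data);
        for (size_t i = 0; i < len; i++) {
            state = (state ^ c[i]) * prime;
        }
        return *this; // enables hasher().update(..).update(..)
    }

    size_t digest() const { return state; }
};
```

For example, `hasher().update("ab", 2).update("c", 1).digest()` equals `hasher().update("abc", 3).digest()`, preserving the chaining property of the old hash_bytes notation.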
I'll do another pass tomorrow to see if any other clean up is needed. In the meantime, feel free to review my last commit and lmk if anything seems off to you.
for (const auto & val : val_arr) {
    const size_t val_hash = val->unique_hash().digest();
    hash.update(&val_hash, sizeof(size_t));
}
We should allow unique_hash to be passed a hasher so that arrays and objects can just update it instead of hashing the hashes.
Forgot to leave a comment here, but that won't work in this case: ["ab", "c"] vs ["a", "bc"]
The case of reusing state is mostly useful for string parts; I don't think it's valid anywhere else.
I don't think that's actually a problem, as the typeid is hashed in between those.
Mathematically speaking, hashing the digest of each element vs letting each element update the hasher are (quasi-)equivalent in our case, because digest() only does one job: add padding to the input data.
My current version is still mathematically equivalent to adding a typeid in between, while being simpler than having to add a version of unique_hash that takes another hasher as input.
Also, this code path is quite rarely used in practice (only when using a tuple as key), so I prefer to keep it simple for now; it should still be efficient enough done this way.
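The "hash the digests" argument can be illustrated with a small sketch (plain FNV-1a helper and hypothetical hash_array name, not the PR's types): feeding each element's fixed-width digest preserves element boundaries, so ["ab", "c"] and ["a", "bc"] no longer collide the way raw chaining would.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Plain seeded 64-bit FNV-1a over raw bytes.
static size_t fnv1a(size_t seed, const void * bytes, size_t len) {
    const size_t prime = 0x100000001b3ULL;
    const unsigned char * c = static_cast<const unsigned char *>(bytes);
    for (size_t i = 0; i < len; i++) {
        seed = (seed ^ c[i]) * prime;
    }
    return seed;
}

// Hash an array by feeding each element's fixed-width digest rather than
// its raw bytes; the digest acts as padding that preserves boundaries.
static size_t hash_array(const std::vector<std::string> & arr) {
    const size_t basis = 0xcbf29ce484222325ULL;
    size_t h = basis;
    for (const auto & s : arr) {
        const size_t d = fnv1a(basis, s.data(), s.size());
        h = fnv1a(h, &d, sizeof(d));
    }
    return h;
}
```

Had `hash_array` streamed the raw bytes instead, the two arrays would chain to the same value, which is exactly the ["ab", "c"] vs ["a", "bc"] collision mentioned above.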
ngxson
left a comment
Feel free to merge when you're ready
Seems to need
Ok, all good, will merge when CI is done.
* implement mixed type object keys
* add tests
* refactor
* minor fixes
* massive refactor
* add more tests
* forgotten tuples
* fix array/object is_hashable
* correct (albeit broken) jinja responses verified with transformers
* improved hashing and equality
* refactor hash function
* more exhaustive test case
* clean up
* cont
* cont (2)
* missing cstring
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Allow all hashable types as object keys, taking care to replicate special python/jinja behavior between int/float/bool.
Fixed array/object output with `string` filter.
Fixed object `tojson` output (did not properly escape key string).
Fixed object item order when replacing an item.