
feat(dump): enrich dump output with type annotations for JSON and enums for YAML #595

Open
metthal wants to merge 1 commit into VirusTotal:main from
metthal:feat/yr-dump-improvements

metthal (Contributor) commented Mar 15, 2026

Motivation

We recently wanted to start consuming the output of yr dump, but I realized that it is hard to do so while preserving some of the information YARA-X already operates with. For example: which ints should be propagated further as hex? What is the value behind an enum or a set of flags?

Some information was already propagated for bytes and timestamp values through the encoding key, but I wanted to make this more general and also avoid possible name clashes in the future (hence $type, as described later).

I know this adds a lot more noise to the JSON and breaks current consumers. If you don't want to break them, I'm all for hiding this behind an additional flag (or even making it a completely different output format, like rich-json or something).

It's just that YARA-X already contains this info, so when I considered my options, all I came up with was:

  • Keep this mapping on the consumer side as well, either manually or by processing proto files from YARA-X - felt like unnecessary machinery for this task
  • Parse it from YAML, which contains more information - I would need a parser that retains comments, and it would still feel like a workaround
  • Enrich JSON with more information, albeit producing bigger JSONs - it felt like the best choice

In the end, I like that the JSON output is self-contained: it carries all the information, and consumers can decide what to do with it. They can either scrap it completely, or use it to construct their own views over the data.

What changed

This change adds more information about the dumped values in JSON format in the form of type annotations. What is added:

  • New field $type (named to avoid clashes) tells consumers to adjust either the decoding or the presentation of the data
  • All existing encoding fields (currently timestamp and base64) are converted to the new $type fields
  • $type = "hex" is used for ints which should be presented as hex ints
  • $type = "enum" is used for enums, providing both the symbolic name and the value
  • $type = "flags" is used for flags, giving both the final value and a list of flags with their symbolic names and flag values
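To illustrate how a consumer might use these annotations, here is a minimal Python sketch. The function name `present` is made up for this example. The node shapes are assumptions: every annotated node is taken to be an object with `$type` and `value` keys, enums add a `name` key, and flags add a `flags` list of `{"name", "value"}` objects. The exact layout produced by the PR may differ, and the sketch only handles the flat form (no annotation nested inside another).

```python
import base64
from datetime import datetime, timezone


def present(node):
    """Render one dumped value for display, honouring its `$type` annotation.

    Assumed node shapes (not confirmed by the PR text):
      {"$type": "hex", "value": 255}
      {"$type": "enum", "value": 1, "name": "SOME_NAME"}
      {"$type": "flags", "value": 3, "flags": [{"name": "A", "value": 1}, ...]}
      {"$type": "timestamp", "value": 0}
      {"$type": "base64", "value": "aGVsbG8="}
    """
    if not isinstance(node, dict) or "$type" not in node:
        return node  # plain value, no annotation to apply

    kind = node["$type"]
    value = node["value"]

    if kind == "hex":
        return hex(value)
    if kind == "enum":
        return f'{node["name"]} ({value})'
    if kind == "flags":
        names = " | ".join(flag["name"] for flag in node["flags"])
        return f"{names} ({hex(value)})"
    if kind == "timestamp":
        # Interpreted as seconds since the Unix epoch, rendered in UTC.
        return datetime.fromtimestamp(value, tz=timezone.utc).isoformat()
    if kind == "base64":
        return base64.b64decode(value)
    return value  # unknown annotation: fall back to the raw payload
```

A consumer that does not care about presentation can ignore `$type` and read `value` directly.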

What has also changed:

  • YAML output now also prints a comment with the enum value, in decimal or hex depending on the formatting chosen for the field in the proto

metthal (Contributor, Author) commented Mar 15, 2026

Okay, I realized that I forgot to cover the case where an enum can be hex in JSON, and once I started solving it I no longer really liked this approach. It was easy to implement, so I did it, but now I would end up with something like

{"$type": "enum", "value": {"$type": "hex", "value": 123}, "name": "ENUM_VALUE1"}

I just feel like that's too much. I'll probably reconsider whether this option:

Keep this mapping also on consumer side, either manually or by processing proto files from YARA-X - felt like unnecessary machinery for this task

is not the better approach after all, because the machinery it requires turns out to be less than the machinery needed to parse what I proposed.
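To make the parsing cost concrete, the following Python sketch shows the kind of recursive walk a consumer would need just to get back to plain values once annotations can nest, as in the enum-with-hex example above. The helper name `strip_annotations` and the field names `machine` and `sections` in the usage note are hypothetical, and the sketch assumes every annotated node keeps its payload under a `value` key.

```python
def strip_annotations(doc):
    """Recursively collapse every `$type`-annotated node to its plain payload.

    Assumes each annotated node is a dict with "$type" and "value" keys,
    where "value" may itself be another annotated node.
    """
    if isinstance(doc, list):
        return [strip_annotations(item) for item in doc]
    if isinstance(doc, dict):
        if "$type" in doc:
            # Annotated node: keep only the payload, which may be annotated too.
            return strip_annotations(doc["value"])
        # Ordinary object: recurse into each field.
        return {key: strip_annotations(val) for key, val in doc.items()}
    return doc  # scalar: nothing to strip
```

For example, `strip_annotations({"machine": {"$type": "enum", "value": {"$type": "hex", "value": 123}, "name": "ENUM_VALUE1"}})` yields `{"machine": 123}`. Payloads that need decoding, such as base64 strings, are returned undecoded.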

metthal (Contributor, Author) commented Mar 15, 2026

I'd still propose at least renaming encoding to something like $encoding, and adding the possibility to print enum values as comments in the YAML output (optionally in hex). So I'm leaving this PR open, but maybe let's discuss it first.

plusvic (Member) commented Mar 16, 2026

I like the idea of providing additional information about the type or formatting options for each field. My only concern so far is that changing the actual format breaks backward compatibility.

An alternative solution could be providing all that information through a separate channel. I mean, instead of having all that type information embedded in the same JSON that contains the data, we could provide it in some other JSON that acts as a schema for the data produced by modules. Think of it as metadata that describes all fields in a module. That would allow us to keep the output of the yr dump command as is, while still giving you a mechanism for obtaining more information about the type and formatting options for each field.
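As a rough illustration of the separate-channel idea, the Python sketch below joins unchanged dump data with a sidecar description of its fields. Everything about the sidecar is hypothetical: no such schema format exists in YARA-X according to this thread, and the dotted-path keys, the `kind` and `names` entries, the function name `apply_schema`, and the field names `machine` and `sections.size` are invented for the example.

```python
def apply_schema(data, schema, path=""):
    """Present plain dump data using a separate, hypothetical field schema.

    `schema` maps dotted field paths to entries such as
      {"kind": "hex"}
      {"kind": "enum", "names": {"123": "ENUM_VALUE1"}}
    List items share the path of the list that contains them.
    """
    if isinstance(data, dict):
        return {
            key: apply_schema(val, schema, f"{path}.{key}" if path else key)
            for key, val in data.items()
        }
    if isinstance(data, list):
        return [apply_schema(item, schema, path) for item in data]

    info = schema.get(path)
    if info is None:
        return data  # field not described: leave untouched
    if info["kind"] == "hex":
        return hex(data)
    if info["kind"] == "enum":
        # JSON object keys are strings, so look the value up by its string form.
        return info["names"].get(str(data), data)
    return data
```

With `schema = {"machine": {"kind": "enum", "names": {"123": "ENUM_VALUE1"}}, "sections.size": {"kind": "hex"}}`, the data `{"machine": 123, "sections": [{"size": 16}]}` is presented as `{"machine": "ENUM_VALUE1", "sections": [{"size": "0x10"}]}`, while the dump output itself stays in its current format.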

While writing this I'm thinking that we already have something similar: the reflect API, which is not public yet but will be published at some point. This API is intended as a reflection mechanism that allows obtaining many details about each module. For each field you can currently get its name and type, but it could be extended with more information. You can even get information about functions and their parameters. This API was designed with the language server in mind, but eventually it could become the way to obtain any additional metadata you may need about the fields in a module. In the future you could even get the documentation associated with each field.

So, instead of having a separate mechanism for obtaining information about module fields, I would like to explore the possibility of using the reflection API. We may also add some CLI command that outputs that reflection information in JSON format, but it would be something separate from the current yr dump command. Would this work for you?

PS: Not all your needs are currently covered by the reflection API. For instance, enum fields are reported as integer fields, because YARA actually treats enums as integers with some predefined constant values. But we may add some additional information that tells when an integer field is actually an enum and which values are valid.
