feat(dump): enrich dump output with type annotations for JSON and enums for YAML #595
metthal wants to merge 1 commit into VirusTotal:main
…ms for YAML

This change adds more information about the dumped values in JSON format in the form of type annotations. What is added:

* New field `$type` (to avoid name clashes) is used to instruct consumers of the data to either adjust decoding of the data or adjust its presentation
* All existing `encoding` fields (currently `timestamp` and `base64`) converted to the new `$type` fields
* `$type = "hex"` used for ints which should be encoded as hex ints
* `$type = "enum"` used for enums, providing both the symbolic name and the value
* `$type = "flags"` used for flags, giving both the final value and a list of flags with their symbolic names and flag values

What has also been changed:

* YAML output now also prints a comment with the enum value, in decimal or hex depending on the formatting chosen for the field in the proto
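To illustrate how a consumer could act on these annotations, here is a minimal Python sketch. The text above only names the `$type` key and its possible values; the sibling keys `value`, `name` and `flags`, as well as the sample field names, are assumptions made for illustration and may not match the actual output of this change.

```python
import base64
from datetime import datetime, timezone


def render(node):
    """Recursively turn annotated nodes into display-friendly values.

    Assumed (hypothetical) shape of an annotated node:
      {"$type": "hex", "value": 4096}
      {"$type": "enum", "name": "KIND_B", "value": 2}
      {"$type": "flags", "value": 3, "flags": [{"name": "READ", "value": 1}, ...]}
    """
    if isinstance(node, list):
        return [render(item) for item in node]
    if not isinstance(node, dict):
        return node  # plain scalar, nothing to adjust

    kind = node.get("$type")
    if kind is None:
        # Ordinary object without an annotation: descend into its fields.
        return {key: render(val) for key, val in node.items()}
    if kind == "hex":
        return hex(node["value"])
    if kind == "enum":
        return f'{node["name"]} ({node["value"]})'
    if kind == "flags":
        names = " | ".join(flag["name"] for flag in node["flags"])
        return f'{names} ({hex(node["value"])})'
    if kind == "base64":
        return base64.b64decode(node["value"])
    if kind == "timestamp":
        # Assumes the value is seconds since the Unix epoch.
        return datetime.fromtimestamp(node["value"], tz=timezone.utc).isoformat()
    # Unknown annotation: fall back to the raw value, if there is one.
    return node.get("value")


# Hypothetical annotated document, not real `yr dump` output.
sample = {
    "address": {"$type": "hex", "value": 4096},
    "kind": {"$type": "enum", "name": "KIND_B", "value": 2},
    "perms": {
        "$type": "flags",
        "value": 3,
        "flags": [{"name": "READ", "value": 1}, {"name": "WRITE", "value": 2}],
    },
    "payload": {"$type": "base64", "value": "aGVsbG8="},
    "created": {"$type": "timestamp", "value": 0},
    "count": 7,
}

rendered = render(sample)
```

A consumer that does not care about presentation could instead ignore every key except the assumed `value`, which is the "completely scrap it" option mentioned in the motivation.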
Okay, I realized that I forgot to cover the case where an enum can be hex in JSON, and when I started solving it I found that I no longer really like this approach. It was easy to implement, so I did it, but now that I would end up with something like

I just feel like that's too much. I'll probably reconsider whether this

is not a better approach, because the machinery required for it turns out to be less than the machinery for parsing what I proposed.
I'd still maybe propose changing
I like the idea of providing additional information about the type or formatting options for each field. My only concern so far is that changing the actual format breaks backward compatibility.

An alternative solution could be providing all that information through a separate channel. I mean, instead of having all that type information embedded in the same JSON that contains the data, we could provide it in some other JSON that acts like a schema for the data produced by modules. Think of it as metadata that describes all fields in a module. That would allow us to keep the output of the

While writing this I'm thinking that we already have something similar: the

So, instead of having a separate mechanism for obtaining information about module fields, I would like to explore the possibility of using the reflection API. We may also add some CLI command that outputs that reflection information in JSON format, but it would be something separate from the current

PS: Not all your needs are currently covered by the reflection API. For instance, enum fields are actually reported as integer fields, because YARA actually treats enums as integers with some predefined constant values. But we may add some additional information that tells when an integer field is actually an enum and which are its valid values.
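The separate-channel idea can be sketched roughly as follows: the data JSON stays plain, and a second document describes how each field should be interpreted. This is only an illustration in Python and is not the YARA-X reflection API; the schema layout (the `format`, `enum` and `fields` keys) and all field names are hypothetical.

```python
def apply_schema(data, schema):
    """Combine plain dumped values with hints from a separate schema document.

    `schema` maps field names to hypothetical descriptors:
      {"format": "hex"}             -> show the integer as hex
      {"enum": {"NAME": int, ...}}  -> map the integer to its symbolic name
      {"fields": {...}}             -> descriptor for a nested object
    """
    out = {}
    for field, value in data.items():
        info = schema.get(field, {})
        if isinstance(value, dict):
            # Nested object: recurse with the nested part of the schema.
            out[field] = apply_schema(value, info.get("fields", {}))
        elif info.get("format") == "hex":
            out[field] = hex(value)
        elif "enum" in info:
            # Invert the name -> value table to look up the symbolic name.
            names = {number: name for name, number in info["enum"].items()}
            out[field] = names.get(value, value)
        else:
            out[field] = value  # no hint: keep the value untouched
    return out


# Plain data, as an existing consumer would already receive it (made-up fields).
data = {"address": 4096, "kind": 2, "header": {"size": 64}, "count": 7}

# Separate metadata describing the same fields (made-up layout).
schema = {
    "address": {"format": "hex"},
    "kind": {"enum": {"KIND_A": 1, "KIND_B": 2}},
    "header": {"fields": {"size": {"format": "hex"}}},
}

view = apply_schema(data, schema)
```

The trade-off compared with embedded annotations is that existing consumers keep working unchanged, while consumers that want the richer view need to fetch and apply the second document.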
Motivation
Recently we wanted to start using the data from `yr dump` somehow, but I realized that it's hard to consume it in a way where we would preserve some of the information which YARA-X already operates with. Like which int should be further propagated as hex? What's the value behind the enum/flags?

There was already some information propagated for `bytes` and timestamp values through the `encoding` key, but I wanted to make it more general and also avoid possible name clashes in the future (thus `$type`, as described later).

I know this adds a lot more noise to the JSON and breaks current consumers, and if you don't want to break them, then I'm all for hiding this behind some additional flag (or even making it a completely different output format like `rich-json` or something).

It's just that YARA-X already contains this info, so when I considered my options, all I got was:

`proto` files from YARA-X - felt like unnecessary machinery for this task

In the end, I like that the JSON output is kind of self-contained, where it contains all the information which other consumers can use and decide what to do with it. Either completely scrap it, or make use of it and construct their own views over the data.
What changed
This change adds more information about the dumped values in JSON format in the form of type annotations. What is added:

* New field `$type` (to avoid name clashes) is used to instruct consumers of the data to either adjust decoding of the data or adjust its presentation
* All existing `encoding` fields (currently `timestamp` and `base64`) converted to the new `$type` fields
* `$type = "hex"` used for ints which should be encoded as hex ints
* `$type = "enum"` used for enums, providing both the symbolic name and the value
* `$type = "flags"` used for flags, giving both the final value and a list of flags with their symbolic names and flag values

What has also been changed: