Data filtering on reasoning requirement #2

@tmabraham

Description

Papers like BioMed-R1 observe that many training examples are knowledge-heavy rather than reasoning-heavy, and that filtering out such examples can help training.

Write a lightweight script that, given a HuggingFace dataset such as https://huggingface.co/datasets/open-thoughts/OpenThoughts2-1M or https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K, filters or tags the samples that require reasoning rather than just knowledge recall.
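
As a possible starting point, here is a minimal sketch of the tagging/filtering step. Everything in it is an assumption for illustration: the `question` field name, the `requires_reasoning` tag, the function names, and especially the keyword heuristic, which is a crude placeholder and not the method used in BioMed-R1. A real version would likely swap `needs_reasoning` for a trained classifier or an LLM judge.

```python
import re
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical cue lists: a crude stand-in for a real reasoning-vs-knowledge classifier.
REASONING_CUES = re.compile(
    r"\b(calculate|compute|prove|derive|solve|determine|how many|how much|explain why)\b",
    re.IGNORECASE,
)
RECALL_CUES = re.compile(
    r"^\s*(what is|what are|who is|who was|define|name the|when did|when was)\b",
    re.IGNORECASE,
)


def needs_reasoning(question: str) -> bool:
    """Placeholder heuristic: more reasoning cues than recall cues."""
    reasoning_hits = sum(1 for _ in REASONING_CUES.finditer(question))
    recall_hits = sum(1 for _ in RECALL_CUES.finditer(question))
    return reasoning_hits > recall_hits


def tag_sample(
    sample: Dict,
    text_field: str = "question",  # assumed column name; adjust per dataset
    classifier: Callable[[str], bool] = needs_reasoning,
) -> Dict:
    """Return a copy of the sample with a boolean `requires_reasoning` tag."""
    return {**sample, "requires_reasoning": classifier(sample[text_field])}


def filter_reasoning(
    samples: Iterable[Dict],
    text_field: str = "question",
    classifier: Callable[[str], bool] = needs_reasoning,
) -> Iterator[Dict]:
    """Yield only the samples the classifier marks as reasoning-heavy."""
    for sample in samples:
        if classifier(sample[text_field]):
            yield sample


if __name__ == "__main__":
    demo = [
        {"question": "What is the capital of France?"},
        {"question": "A train travels 120 km in 1.5 hours. Calculate its average speed."},
    ]
    for row in map(tag_sample, demo):
        print(row["requires_reasoning"], "|", row["question"])
```

Because `tag_sample` takes a dict and returns a dict, it should plug into the `datasets` library via `Dataset.map(tag_sample)` after `load_dataset(...)`, with `Dataset.filter` used to drop knowledge-only rows. The column holding the question text differs between datasets, so `text_field` would need to be set accordingly.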
