[BUG] GPU get_json_object does incompatible escaping and error checking #12483

@revans2

Describe the bug
The GPU implementation of get_json_object parses the JSON path in a way that is incompatible with Spark's, and the incompatibility is not documented. It also throws several exceptions on invalid JSON paths, and those are not documented either. For some of these I am fine if we just document the incompatibility. For others we might want to look into fixing them.

Steps/Code to reproduce bug

Expected (Spark on the CPU)

scala> val df = Seq("""{"A": 1, "A.B": 2, "'A": {"B'": 3}}""").toDF
scala> df.repartition(1).selectExpr("""get_json_object(value, "${A}") as something""").show()
+---------+
|something|
+---------+
|     null|
+---------+
scala> df.repartition(1).selectExpr("""get_json_object(value, "$.'A") as something""").show()
+---------+
|something|
+---------+
| {"B'":3}|
+---------+
scala> df.repartition(1).selectExpr("""get_json_object(value, "$.'A.B'") as something""").show()
+---------+
|something|
+---------+
|        3|
+---------+

Actual (on the GPU)

scala> val df = Seq("""{"A": 1, "A.B": 2, "'A": {"B'": 3}}""").toDF

scala> df.repartition(1).selectExpr("""get_json_object(value, "${A}") as something""").show()
...
ai.rapids.cudf.CudfException: CUDF failure at: /home/roberte/src/spark-rapids-jni/thirdparty/cudf/cpp/src/strings/strings_column_view.cpp:47: strings column has no children

scala> df.repartition(1).selectExpr("""get_json_object(value, "$.'A") as something""").show()
...
ai.rapids.cudf.CudfException: CUDF failure at:/home/roberte/src/spark-rapids-jni/thirdparty/cudf/cpp/src/strings/json/json_path.cu:654: Encountered invalid JSONPath input string

scala> df.repartition(1).selectExpr("""get_json_object(value, "$.'A.B'") as something""").show()
+---------+
|something|
+---------+
|        2|
+---------+

Expected behavior
At a minimum we need to document these differences. Ideally we would catch any parsing errors from the JSON path parser and return null in those cases, just as Spark does. We might even want to look into falling back to the CPU if we see a single quote ' in the path.
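
A minimal sketch of the two mitigations suggested above, in plain Scala with no Spark or cuDF dependency. Everything named here (`JsonPathGuard`, `needsCpuFallback`, `evalOrNull`, `toyGpuEval`) is hypothetical and exists only for illustration; `toyGpuEval` is a toy stand-in that throws on a `{` in the path and is not the real parser.

```scala
import scala.util.Try

object JsonPathGuard {
  // Assumption: any single quote in the path is treated as a reason to
  // route the expression to the CPU, as floated in this issue.
  def needsCpuFallback(path: String): Boolean = path.indexOf('\'') >= 0

  // Run the evaluator and map any non-fatal exception to None, which
  // plays the role of SQL null here. A null result also becomes None.
  def evalOrNull(path: String)(eval: String => String): Option[String] =
    Try(Option(eval(path))).toOption.flatten

  // Toy stand-in for a GPU evaluator that throws on a path it rejects.
  def toyGpuEval(path: String): String =
    if (path.contains("{")) {
      throw new IllegalStateException("Encountered invalid JSONPath input string")
    } else {
      "ok"
    }
}
```

With this shape, a path such as `${A}` would yield None instead of surfacing an exception, and a path such as `$.'A` would be flagged before it reaches the GPU at all.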

Single quote escaping, at least in Spark, appears to happen only inside [] operations, such as $['A.B'], which returns 2 on both the GPU and the CPU.
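
For comparison with the reproduction steps, the bracket form can be tried in the same way. This snippet assumes a spark-shell session (so `spark.implicits._` is already in scope for `toDF`) and rebuilds the same one-row DataFrame; per the observation above, the query should show 2 on both the CPU and the GPU. The output is omitted here rather than guessed.

```scala
// Same input row as in the reproduction steps.
val df = Seq("""{"A": 1, "A.B": 2, "'A": {"B'": 3}}""").toDF

// Bracket notation with single quotes selects the key "A.B" literally.
df.repartition(1)
  .selectExpr("""get_json_object(value, "$['A.B']") as something""")
  .show()
```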

Metadata

Assignees: No one assigned

Labels: 0 - Backlog (in queue waiting for assignment), Spark (functionality that helps Spark RAPIDS), bug (something isn't working), libcudf (affects libcudf C++/CUDA code)
