Describe the bug
The GPU implementation of get_json_object parses the JSON path in a way that is incompatible with Spark's, and the incompatibility is not documented. We also throw a number of exceptions on invalid JSON paths, and those are not documented either. For some of these I am fine if we just document the incompatibility. For others we might want to look into fixing them.
Steps/Code to reproduce bug
Expected (on the CPU)
scala> val df = Seq("""{"A": 1, "A.B": 2, "'A": {"B'": 3}}""").toDF
scala> df.repartition(1).selectExpr("""get_json_object(value, "${A}") as something""").show()
+---------+
|something|
+---------+
|     null|
+---------+
scala> df.repartition(1).selectExpr("""get_json_object(value, "$.'A") as something""").show()
+---------+
|something|
+---------+
| {"B'":3}|
+---------+
scala> df.repartition(1).selectExpr("""get_json_object(value, "$.'A.B'") as something""").show()
+---------+
|something|
+---------+
|        3|
+---------+
Actual (on the GPU)
scala> val df = Seq("""{"A": 1, "A.B": 2, "'A": {"B'": 3}}""").toDF
scala> df.repartition(1).selectExpr("""get_json_object(value, "${A}") as something""").show()
...
ai.rapids.cudf.CudfException: CUDF failure at: /home/roberte/src/spark-rapids-jni/thirdparty/cudf/cpp/src/strings/strings_column_view.cpp:47: strings column has no children
scala> df.repartition(1).selectExpr("""get_json_object(value, "$.'A") as something""").show()
...
ai.rapids.cudf.CudfException: CUDF failure at: /home/roberte/src/spark-rapids-jni/thirdparty/cudf/cpp/src/strings/json/json_path.cu:654: Encountered invalid JSONPath input string
scala> df.repartition(1).selectExpr("""get_json_object(value, "$.'A.B'") as something""").show()
+---------+
|something|
+---------+
|        2|
+---------+
Expected behavior
At a minimum we need to document these differences. Ideally we catch any parsing errors from the JSON path parser and return null in those cases, just like Spark does. We might even want to look into falling back to the CPU if we see a single quote ' in the path.
The single quote escaping, at least for Spark, appears to only happen inside of [] operations, like $['A.B'], which returns 2 on both the GPU and the CPU.
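To make the two ideas above a bit more concrete, here is a rough sketch in plain Scala. It is not how the plugin does things today. JsonPathSketch, hasQuoteOutsideBrackets and nullOnFailure are hypothetical names made up for illustration, and the quote check is a simplification that assumes a ] never appears inside a quoted name such as $['a]b'].

```scala
import scala.util.control.NonFatal

object JsonPathSketch {
  // Hypothetical helper: returns true if a single quote shows up outside of a
  // [...] segment. Such paths are where the GPU and CPU results above diverge,
  // so they would be candidates for a CPU fallback.
  // Simplification: a ']' inside a quoted bracket name is not handled.
  def hasQuoteOutsideBrackets(path: String): Boolean = {
    var inBrackets = false
    var found = false
    path.foreach {
      case '[' => inBrackets = true
      case ']' => inBrackets = false
      case '\'' if !inBrackets => found = true
      case _ => ()
    }
    found
  }

  // Hypothetical wrapper: evaluates gpuEval and maps any non-fatal failure to
  // null, mirroring Spark returning null for a path it cannot parse.
  // Assumption: real code would catch only the path parsing error instead of
  // every non-fatal exception.
  def nullOnFailure(gpuEval: => String): String =
    try gpuEval catch { case NonFatal(_) => null }
}
```

With this, "$.'A" and "$.'A.B'" would be flagged for fallback, while "$['A.B']" and "${A}" would not. The "${A}" case would instead rely on the wrapper, or on a narrower catch, to turn the exception into a null.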