Support ArrayType #536
This PR is good! I also saw the same check-failure error, which I could not resolve:
Maybe the E2E test sometimes just cannot find the active SparkSession? In my case, re-running the test made it pass.
@laneser this error is related to #333. Internally Delta will call
@suhsteve
To
spark/src/csharp/Extensions/Microsoft.Spark.Extensions.Delta.E2ETest/DeltaTableTests.cs Line 51 in ce23177
will fix the unit-test ?! which do not call ForPath (use GetActiveSession version)? |
I will try to get this PR in soon!
Thanks @elvaliuliuliu. What work is remaining to make this non-WIP?
Currently, udf takes in
Got it. Yes, let's include it in this PR. Thanks.
@elvaliuliuliu Were you planning on updating this PR?
Sorry, got side-tracked. I will update this PR!
This PR should be ready for review. I think the FC and BC tests failed as expected; it should work with the current code. Please advise. Thank you!
Thanks for working on this @elvaliuliuliu!
}
if (obj.GetType() == typeof(ArrayList))
{
    return CastUnpickledItems.UnpickleArray(unpickledItems);
If you just return here, what happens to the rest of the rows? Would that be OK?
I think it should be okay, since UnpickleArray takes unpickledItems and, within the function itself, deals with all the rows when it's an ArrayList.
If you get an ArrayList, will all the objects in unpickledItems be of type ArrayList?
Maybe we should change this function to IEnumerable<object> GetUnpickledObjects and use yield return.
Do you mean something like below?
internal static IEnumerable<object> GetUnpickledObjects(Stream stream, int messageLength)
{
    byte[] buffer = ArrayPool<byte>.Shared.Rent(messageLength);
    try
    {
        ...
        // Check if unpickler returns ArrayList.
        // If so, it needs to be cast to the appropriate array type using CastToArray.
        foreach (object objArr in (object[])unpickledItems)
        {
            if (objArr.GetType() == typeof(object[]))
            {
                object obj = ((object[])objArr)[0];
                if (obj == null)
                {
                    continue;
                }
                if (obj.GetType() == typeof(ArrayList))
                {
                    yield return CastUnpickledItems.CastToArray(unpickledItems);
                }
                else
                {
                    yield return unpickledItems;
                }
            }
            else
            {
                yield return unpickledItems;
            }
        }
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(buffer);
    }
}
I agree with @imback82. You are iterating through all the object[] entries here, and the consumer of GetUnpickledObjects will also end up iterating through the object[] again. It may be better to make this an IEnumerable.
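To illustrate the point about double iteration, here is a minimal sketch (not the PR's code) contrasting an eager method that walks every row before returning with an iterator that converts each row only when the consumer asks for it. The names LazyRowsSketch, ConvertRow, GetAllRows and GetRows are hypothetical and exist only for this example.

```csharp
using System;
using System.Collections.Generic;

public static class LazyRowsSketch
{
    // Hypothetical stand-in for the per-row conversion step; not the PR's logic.
    private static object ConvertRow(object row) => row;

    // Eager version: walks every row here, and the caller then walks them again.
    public static object[] GetAllRows(object[] unpickledItems)
    {
        var result = new object[unpickledItems.Length];
        for (int i = 0; i < unpickledItems.Length; ++i)
        {
            result[i] = ConvertRow(unpickledItems[i]);
        }
        return result;
    }

    // Lazy version: each row is converted only when the consumer requests it,
    // so the rows are traversed once overall.
    public static IEnumerable<object> GetRows(object[] unpickledItems)
    {
        foreach (object row in unpickledItems)
        {
            yield return ConvertRow(row);
        }
    }

    public static void Main()
    {
        object[] items = { 1, "two", 3.0 };
        foreach (object row in GetRows(items))
        {
            Console.WriteLine(row);
        }
    }
}
```

Note that an iterator body runs only while it is being enumerated, so any resource it holds in a try/finally (such as a rented buffer) is released when enumeration finishes or the enumerator is disposed.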
Sure, will make the change tonight.
I have refactored this part. Please take a look, thanks!
@elvaliuliuliu how's the progress going? Tests are failing.
Working on n-dimensional arrays. Some of the tests are passing, but they fail on certain Spark versions with an error like "System.IO.DirectoryNotFoundException : Could not find a part of the path 'D:\a\1\b\spark-2.3.4-bin-hadoop2.7\RELEASE'". Looking into it.
// Array of Arrays.
{
    Func<Column, Column> udf = Udf<double[][], double>(
Can you pass in multiple arrays, e.g. Udf<double, double[], double[][], double[][][], double>?
Can we also give the user the option to define a udf using ArrayList? Something like Udf<double, ArrayList, ArrayList, ArrayList, double> that will have the same behavior as the udf defined above?
Added tests for multiple arrays. Working on the ArrayList cases as mentioned.
Do we also want to support the option of an ArrayList return value, like Udf<double, ArrayList, ... , ArrayList>?
@imback82 ideas?
I guess it can be useful if you want to chain with a Udf that takes in ArrayList (just for consistency).
Got it, thanks!
The current logic is to cast unpickledItems to a typed array if it contains ArrayList here. Another way might be to cast the input (after unpickling) here, when executing the udf functions.
If we want to support ArrayList, like Udf<ArrayList, ArrayList, double>, I am thinking of either passing commands into GetUnpickledObjects or switching to the latter method. Please lmk which is preferred, or if you have any other suggestions. Thanks.
Can we get everything wrapped up by today?
I am finishing up the following remaining parts now - Updated
Please lmk if there are more comments. Thanks.
foreach (object obj in (object[])unpickledItems)
{
    castUnpickledItems.Add(
        (obj.GetType() == typeof(RowConstructor)) ? obj : CastArray(obj));
I am not sure if falling back to CastArray always works.
I have updated the function name accordingly. Sorry for the confusion. If it is not a RowConstructor, it will fall into the object[] case. CastHelper will decide whether to cast and will cast the object[] as needed.
/// <returns>Typed array after casting.</returns>
public static object CastArray(object obj)
{
    if (obj is object[] objArr)
Does this work with Row[]?
From what I have observed, if a udf takes Row[] like the example here, it will be a RowConstructor, and FromInternal should handle such cases. It should not fall into CastArray or CastHelper.
        return objArr.Select(x => CastHelper(x)).ToArray();
    }

    // Array of arrays.
How about an array of arrays of arrays? Don't you need to handle this recursively?
I have changed the function name and description accordingly to make it clearer. This should be handled recursively, as covered in the tests. Thanks!
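As a rough illustration of what recursive handling of nested arrays can look like, the sketch below converts a (possibly nested) ArrayList into a typed array such as double[][]. It is not the PR's CastHelper: the names NestedArrayCastSketch and ToTypedArray are hypothetical, and the sketch assumes every list is homogeneous and contains no nulls (an empty list falls back to object[]).

```csharp
using System;
using System.Collections;

public static class NestedArrayCastSketch
{
    // Recursively converts an ArrayList (possibly nested) into a typed array.
    // Assumption: lists are homogeneous and contain no null elements.
    public static object ToTypedArray(object value)
    {
        if (!(value is ArrayList list))
        {
            return value; // Leaf value: nothing to convert.
        }

        // Convert the children first so the element type is known.
        var converted = new object[list.Count];
        for (int i = 0; i < list.Count; ++i)
        {
            converted[i] = ToTypedArray(list[i]);
        }

        // Use the first element's runtime type; fall back to object when empty.
        Type elementType = converted.Length > 0 ? converted[0].GetType() : typeof(object);
        Array typed = Array.CreateInstance(elementType, converted.Length);
        for (int i = 0; i < converted.Length; ++i)
        {
            typed.SetValue(converted[i], i);
        }
        return typed;
    }

    public static void Main()
    {
        var nested = new ArrayList
        {
            new ArrayList { 1.0, 2.0 },
            new ArrayList { 3.0 }
        };
        var result = (double[][])ToTypedArray(nested);
        Console.WriteLine(result.Length);
    }
}
```

Because each level converts its children before choosing its own element type, the same function covers arrays of arrays of arrays without special cases for each depth.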
Closing as #670 has been merged.
This PR will add support for ArrayType, which will fix part of #26 as follows:
ArrayType by casting unpickled objects to the appropriate type. Support scenarios like simple array, array of arrays and array of Rows.
ArrayType.