Resolve nuget dependencies for UDFs defined in dotnet-interactive #515
Conversation
```csharp
string replMode =
    EnvironmentUtils.GetEnvironmentVariableAsBool("DOTNET_SPARK_REPL_MODE").ToString();
environmentVars.Put("DOTNET_SPARK_REPL_MODE", replMode);
```
nit: do we always need to set this, or only when it's running a repl? Also, I suggested this name, but "mode" may not be a good name. Maybe DOTNET_SPARK_RUNNING_REPL, or a name that conveys a boolean, may be better?
We only need to set it if it's running the REPL. Modified the code a bit and renamed the env variable.
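For reference, one way the conditional set might look; this is only a sketch, where isRunningRepl is a hypothetical flag and the DOTNET_SPARK_RUNNING_REPL name comes from the suggestion above, not from the final code:

```csharp
// Set the variable only when the driver is actually running a REPL session.
if (isRunningRepl)
{
    environmentVars.Put("DOTNET_SPARK_RUNNING_REPL", true.ToString());
}
```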
Hmm, who do you think should be responsible for setting up this env variable? Should we set it from the extension, or whatever is going to start the dotnet-interactive session (livy)?
Good question. I was thinking whoever is starting the dotnet process, but I guess the extension can still set it as well, to be sure?
```diff
@@ -87,6 +74,15 @@ internal Payload Process(Stream stream)
             }
 
+            payload.IncludeItems = ReadIncludeItems(stream);
+
+#if NETCOREAPP
```
So basically, .NET Framework will not work, right? If so, let's remember to put this in the release note.
I suppose we don't really need this #if. It should compile fine with either framework. The only issue I foresee is that the dotnet-interactive session is only pulling in dotnet and netstandard nugets. The AssemblyProbingPaths would only have paths to these two frameworks, and if a Worker required a net461-specific nuget, then it wouldn't be available.
```csharp
UnpackPackages(sparkFilesPath, unpackPath, metadata.NuGets);

(s_dependencyProvider as IDisposable)?.Dispose();
s_dependencyProvider = new DependencyProvider(AssemblyProbingPaths, NativeProbingRoots);
```
I guess this wouldn't affect the running tasks? Note that stages can run in parallel.
I was under the impression that a Worker process would only be running parallel tasks for a specific stage. Can a Worker process run multiple stages in parallel? If so, will we be locking too often?
Changed the dispose order a bit. I now register a new DependencyProvider for the new set of Assembly/Native probing paths before disposing the previously registered one.
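A small sketch of the register-before-dispose ordering described here; s_dependencyProvider, AssemblyProbingPaths, and NativeProbingRoots are taken from the quoted diff, the rest is illustrative rather than the exact code:

```csharp
// Keep a handle on the currently registered provider.
DependencyProvider oldProvider = s_dependencyProvider;

// Register the provider for the new set of assembly/native probing paths first,
// so there is no window during which no provider is registered.
s_dependencyProvider = new DependencyProvider(AssemblyProbingPaths, NativeProbingRoots);

// Only then dispose the previously registered provider.
(oldProvider as IDisposable)?.Dispose();
```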
```csharp
if (files.Length > 0)
{
    Array.Sort(files);
    return files.Last();
```
The highest file should have accumulated info, right? In that case, should we try to remove the old files proactively or not?
Correct. The highest file should contain the superset of all nuget, assembly, and native probing paths.
We can remove them proactively, but I don't think we'll have that many files. There would only be a new file for every cell/code submission that introduces a new nuget package.
```diff
@@ -153,7 +166,15 @@ internal static Assembly ResolveAssembly(string assemblyName)
             }
         }
 
-        throw new FileNotFoundException($"Assembly '{assemblyName}' file not found '{simpleAsmName}[{string.Join(",", s_extensions)}]' in '{string.Join(",", s_searchPaths)}'");
+        s_logger.LogWarn(
```
Why the behavior change?
Because we threw an exception in AssemblyLoader, the AssemblyLoadContext.Resolving event would stop checking any other event handlers that were registered.
https://docs.microsoft.com/en-us/dotnet/api/system.runtime.loader.assemblyloadcontext.resolving?view=netcore-3.1
"If more than one event handler is registered for this event, the event handlers are called in order until an event handler returns a value that isn't null. Subsequent event handlers are ignored."
If we wanted to keep the same behavior, then we would constantly need to remove and re-add our AssemblyLoader event handler with the AssemblyLoadContext every time we remove/re-add the DependencyProvider.
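To illustrate the point, a minimal sketch of the handler shape this implies; TryFindInSearchPaths is a hypothetical helper and the real AssemblyLoader code differs, but the Resolving contract (handlers are tried in order until one returns non-null) is from the linked docs:

```csharp
using System.Reflection;
using System.Runtime.Loader;

AssemblyLoadContext.Default.Resolving += (AssemblyLoadContext context, AssemblyName name) =>
{
    // Hypothetical lookup over the worker's search paths and extensions.
    Assembly assembly = TryFindInSearchPaths(name);
    if (assembly == null)
    {
        // Warn instead of throwing so that handlers registered later
        // (e.g. by the DependencyProvider) still get a chance to resolve it.
        s_logger.LogWarn($"Assembly '{name.FullName}' not found in the worker search paths.");
    }

    // Returning null lets AssemblyLoadContext fall through to the next registered handler.
    return assembly;
};
```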
If we see this message, what does the user need to do? Is this something that can be ignored by the user?
Hmm. In the batch scenario, if the user sees this warning, it should be followed by an exception thrown by the assembly-loading class that is using this event handler.
In the interactive scenario the following may happen. The user sees this warning. The next event handler, registered by the DependencyProvider, is processed. Then either of the following can occur:
- It finds the assembly in the probing paths defined in the DependencyProvider.
- It is unable to find the assembly and an exception is thrown, similar to the batch scenario.
So, is this noise (since an exception will always be thrown), or will this be helpful for further debugging?
```csharp
{
    s_dependencyProvider?.Dispose();
    s_dependencyProvider = dependencyProvider;
}
```
How should we handle this if TryLoad fails? Should we log or fail early?
Fails as in it returns false, or throws an exception?
If it returns false, since we are not catching any exceptions here anyway.
It will only return false if there are no metadata files (i.e., no nuget dependencies, or Spark did not distribute the files to the worker), or if the cached s_lastFileRead matches the latest metadata file in the SparkFiles path. In either case I don't think we need logs here.
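To make those conditions concrete, a rough sketch of TryLoad as described in this thread; GetMetadataFiles, Load, and the exact member layout are assumptions based on the discussion, not the PR's actual code:

```csharp
internal bool TryLoad()
{
    string[] files = GetMetadataFiles(SparkFiles.GetRootDirectory());
    if (files.Length == 0)
    {
        // No nuget dependencies, or Spark did not distribute the metadata files.
        return false;
    }

    Array.Sort(files);
    string latest = files[files.Length - 1];
    if (latest == s_lastFileRead)
    {
        // The newest metadata file has already been loaded.
        return false;
    }

    Load(latest);            // unpack packages and register probing paths
    s_lastFileRead = latest;
    return true;
}
```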
Regarding "no nuget dependencies or if spark did not distribute the files to the worker": this sounds like a critical error case, no?
How do we determine if it's the former or the latter?
It depends on how you implement TryLoad. If you think it's a critical error, returning false in that case may not work.
I looked at this one more time, and I think it should be ok with the current approach.
Generally LGTM, a few minor comments/questions.
```csharp
}

string expectedFile = "dependencyProviderMetadata_00000000000000000020";
string highestFile =
```
nit: latest
```csharp
DependencyProviderUtils.Metadata deserializedMetadata =
    DependencyProviderUtils.Metadata.Deserialize(serializedFilePath);

Assert.Equal(metadata, deserializedMetadata);
```
Is it ok to depend on EqualTo (new code) to test Serialize/Deserialize?
Sorry, I'm not sure what you're referring to.
This will invoke Metadata.Equals?
Yes, it will call Metadata.Equals.
Can you depend on that function to verify Ser/De, since it could have a bug (not checking all fields, for example)?
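To make the concern concrete, a sketch of field-by-field assertions that would not rely on Metadata.Equals; the property names here are assumptions drawn from this PR's description, not the actual Metadata type:

```csharp
// Compare each serialized field explicitly so a bug in Equals cannot mask a
// round-trip failure.
Assert.Equal(metadata.AssemblyProbingPaths, deserializedMetadata.AssemblyProbingPaths);
Assert.Equal(metadata.NativeProbingPaths, deserializedMetadata.NativeProbingPaths);
Assert.Equal(metadata.NuGets.Length, deserializedMetadata.NuGets.Length);
```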
```csharp
    Directory.GetCurrentDirectory());
if (dependencyProvider.TryLoad())
{
    s_dependencyProvider?.Dispose();
```
What happens if another task is using this s_dependencyProvider and you Dispose at the same time?
Maybe we should make it thread-local? What's the memory usage and perf cost to create/dispose this? Should we implement incremental updates instead of creating a new one every time?
Removed the previous Dispose pattern. Instead, we will update the Metadata files to only include incremental updates and load a new DependencyProvider for each metadata file.
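A sketch of what that per-metadata-file registration could look like, built around the ConcurrentDictionary<string, Lazy<DependencyProvider>> field quoted later in this review; the class, method, and parameter names here are assumptions, not the PR's exact API:

```csharp
using System;
using System.Collections.Concurrent;

internal static class AssemblyLoaderHelperSketch
{
    // One DependencyProvider per metadata file; Lazy ensures each provider is
    // created at most once even when tasks race on the same file.
    private static readonly ConcurrentDictionary<string, Lazy<DependencyProvider>> s_dependencyProviders =
        new ConcurrentDictionary<string, Lazy<DependencyProvider>>();

    internal static void RegisterDependencies(string metadataFilePath, string srcPath, string dstPath)
    {
        // Earlier providers are never disposed, so tasks that are already running
        // keep resolving against the probing paths they started with.
        Lazy<DependencyProvider> provider = s_dependencyProviders.GetOrAdd(
            metadataFilePath,
            path => new Lazy<DependencyProvider>(() => new DependencyProvider(path, srcPath, dstPath)));

        _ = provider.Value;
    }
}
```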
```csharp
        Equals(nugetMetadata);
}

public bool Equals(NuGetMetadata other)
```
private here as well?
updated.
```csharp
{
    private readonly DepManager.DependencyProvider _dependencyProvider;

    internal DependencyProvider(string metadataFile, string src, string dst)
```
nit: metadataFilePath, srcPath, dstPath?
renamed
```csharp
    LoggerServiceFactory.GetLogger(typeof(AssemblyLoaderHelper));

// A mapping between a metadata file's path to its respective DependencyProvider.
private static readonly ConcurrentDictionary<string, Lazy<DependencyProvider>> s_depProvs =
```
Can we spell it out as s_dependencyProviders to be consistent with other names?
It will exceed the 100-char limit; are you okay with that?
renamed
LGTM, thanks @suhsteve!
In a dotnet-interactive REPL session (driver), nuget dependencies will be systematically added via SparkContext.AddFiles(..):
- <packagename>.<version>.nupkg nuget packages
- a dependencyProviderMetadata file, which:
  - contains the native probing paths required by Microsoft.DotNet.DependencyManager.DependencyProvider
  - contains the assembly probing paths required by Microsoft.DotNet.DependencyManager.DependencyProvider
  - contains nuget metadata such as filename, package name, and version to properly unpackage nuget packages
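For illustration only, a hypothetical driver-side sketch of that step, assuming the SparkContext.AddFile API; spark, resolvedNupkgPaths, and metadataFilePath stand in for whatever the dotnet-interactive extension actually produces:

```csharp
// Ship each resolved nuget package and the generated metadata file to the executors.
foreach (string nupkg in resolvedNupkgPaths)      // <packagename>.<version>.nupkg
{
    spark.SparkContext.AddFile(nupkg);
}

spark.SparkContext.AddFile(metadataFilePath);     // dependencyProviderMetadata_<zero padded ulong>
```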
On the Microsoft.Spark.Worker, in order to resolve the nuget dependencies referenced by the dotnet-interactive session, we use the files added via SparkContext.AddFiles(...). These files can be accessed from the Worker by checking the SparkFiles.GetRootDirectory() path. There are two parts:
- Deserialize the dependencyProviderMetadata file. Using the NuGetMetadata, we locate and unpack the .nupkg files to /path/to/working/directory/.nuget/packages/<package name>/<package version>.
- Use the AssemblyProbingPaths and NativeProbingPaths to instantiate a Microsoft.DotNet.DependencyManager.DependencyProvider class. This class will register itself with the AssemblyLoadContext Resolver.

Metadata file names carry a <zero padded ulong> suffix (for example, dependencyProviderMetadata_00000000000000000020), so sorting the file names yields the latest one.
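To tie the pieces together, a rough sketch of that worker-side flow using names quoted elsewhere in this review; GetMetadataFiles and unpackPath are assumptions, while Metadata.Deserialize, UnpackPackages, and the DependencyProvider wrapper mirror the quoted code:

```csharp
string root = SparkFiles.GetRootDirectory();

// Metadata files are named dependencyProviderMetadata_<zero padded ulong>,
// so lexicographic order matches submission order.
foreach (string metadataFile in GetMetadataFiles(root))
{
    DependencyProviderUtils.Metadata metadata =
        DependencyProviderUtils.Metadata.Deserialize(metadataFile);

    // Unpack the shipped .nupkg files into .nuget/packages/<name>/<version>.
    UnpackPackages(root, unpackPath, metadata.NuGets);

    // Wrapper around Microsoft.DotNet.DependencyManager.DependencyProvider that
    // registers the assembly/native probing paths with the AssemblyLoadContext.
    var dependencyProvider = new DependencyProvider(metadataFile, root, unpackPath);
}
```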