[SPARK-33143][PYTHON] Add configurable timeout to python server and client #30389
Conversation
It's still WIP because I'm hesitating over how it should be named, and I would like to see how the tests behave.
Test build #131169 has finished for PR 30389 at commit
Kubernetes integration test starting
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test status success
Test build #131170 has finished for PR 30389 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #131217 has finished for PR 30389 at commit
retest this please
Kubernetes integration test starting
Kubernetes integration test status success
core/src/main/scala/org/apache/spark/internal/config/Python.scala
cc @HyukjinKwon
Test build #131222 has finished for PR 30389 at commit
Kubernetes integration test starting
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #131417 has finished for PR 30389 at commit
  val serverSocket = new ServerSocket(0, 1, InetAddress.getByAddress(Array(127, 0, 0, 1)))
- // Close the socket if no connection in 15 seconds
- serverSocket.setSoTimeout(15000)
+ // Close the socket if no connection in configured seconds
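The diff above replaces a hardcoded 15-second accept timeout with a configured one. As a rough sketch of the same mechanism (plain Python sockets rather than Spark's Scala code, with a hypothetical helper name), a server socket can be told to stop waiting if no client connects in time:

```python
import socket

def serve_once(timeout_seconds=15.0):
    """Open a local server socket that gives up if no client connects
    within the given timeout. Illustrative only -- this mirrors the
    idea of serverSocket.setSoTimeout(...), not Spark's actual code."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    # Analogous to setSoTimeout on the JVM side: accept() raises
    # socket.timeout instead of blocking forever.
    server.settimeout(timeout_seconds)
    try:
        conn, _ = server.accept()
        conn.close()
        return True
    except socket.timeout:
        return False
    finally:
        server.close()
```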
Suggestion: `in configured` -> `in the configured`?
+1, LGTM (only minor comments).
Thank you, @gaborgsomogyi and @HyukjinKwon .
@dongjoon-hyun @gaborgsomogyi, yeah, I think it makes more sense to separate this change #30389 (comment). Let's exclude it from this PR.
I just pushed some changes to remove the test and address the minor comment @gaborgsomogyi :-). Let me just merge this in first. For the others, we could open a separate PR.
Kubernetes integration test starting
Kubernetes integration test status success
- // Close the socket if no connection in 15 seconds
- serverSocket.setSoTimeout(15000)
+ // Close the socket if no connection in the configured seconds
+ val timeout = authHelper.conf.get(PYTHON_AUTH_SOCKET_TIMEOUT).toInt
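The new code reads the timeout from the conf carried by `authHelper` instead of hardcoding it. A minimal sketch of that lookup pattern (the key string here mirrors the `PYTHON_AUTH_SOCKET_TIMEOUT` constant in the diff, but the exact config name and default are assumptions for illustration):

```python
# Hypothetical default mirroring the old hardcoded 15 seconds.
DEFAULT_AUTH_SOCKET_TIMEOUT = 15

def auth_socket_timeout(conf):
    """Look up the socket timeout in a conf mapping, falling back to
    the default when the key is unset (illustrative, not Spark code)."""
    return int(conf.get("spark.python.authenticate.socketTimeout",
                        DEFAULT_AUTH_SOCKET_TIMEOUT))
```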
Okay, actually the test in PythonRDDSuite was related. In the python server error handling test we don't initialize SparkEnv, because we don't create a SparkContext or executor there by default, but SparkEnv.get was used in the previous change, which ended up with an NPE. Now I've made a change to work around it by using SparkConf directly here.
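The NPE described above comes from reading a global environment object that unit tests never initialize. A toy Python illustration of the workaround pattern (all names here are hypothetical stand-ins, not Spark's actual classes):

```python
class AuthHelper:
    """Illustrative stand-in for the auth helper: it carries its own
    conf, so callers don't need a global environment object."""
    def __init__(self, conf):
        self.conf = conf

_global_env = None  # stands in for SparkEnv.get, unset in unit tests

def timeout_from_env():
    # Reading through the global env fails when it was never created,
    # which corresponds to the NPE the comment above describes.
    return _global_env.conf["timeout"]

def timeout_from_helper(helper):
    # The workaround: read from the conf the helper already holds.
    return helper.conf.get("timeout", 15)
```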
Got it~
Kubernetes integration test starting
Test build #131520 has finished for PR 30389 at commit
Test build #131519 has finished for PR 30389 at commit
Kubernetes integration test status success
Thank you for your decision, @HyukjinKwon.
Test build #131527 has finished for PR 30389 at commit
GA passed. I will merge this. Merged to master.
Just came back to this, and I would like to say a big thanks for all the help!
I intend to solve the issue w/
Thanks @gaborgsomogyi!
What changes were proposed in this pull request?
Spark creates a local server to serialize several types of data for Python. The Python code tries to connect to the server immediately after it's created, but there are several system calls in between (these may change between Spark versions):
Under some circumstances, in heavy user environments, these calls can be extremely slow (more than 15 seconds). These issues must be analyzed one by one, but since they are system calls, the underlying OS and/or DNS servers must be debugged and fixed. This is not a trivial task, and at the same time data processing must keep working somehow. In this PR I only intend to add a configuration option for increasing the mentioned timeouts, in order to provide a temporary workaround. The root cause analysis is ongoing, but I think it can vary from case to case.
Because the server side doesn't contain many log entries with which one can measure elapsed time, I've added some.
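To make the slow-connect scenario concrete, here is a hedged client-side sketch (plain Python sockets, not Spark's actual client code): the connect attempt gives up after a configurable timeout instead of hanging indefinitely on slow system calls:

```python
import socket

def connect_with_timeout(port, timeout_seconds):
    """Try to reach a local server, but give up after the configured
    timeout. Illustrative counterpart to the server-side change; the
    function name and error handling are assumptions, not Spark code."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout_seconds)
    try:
        sock.connect(("127.0.0.1", port))
        return sock
    except (socket.timeout, ConnectionRefusedError):
        sock.close()
        return None
```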
Why are the changes needed?
Provide workaround when localhost python server connection timeout appears.
Does this PR introduce any user-facing change?
Yes, new configuration added.
How was this patch tested?
Existing unit tests + manual test.