
Commit c01152d

peshopetrov authored and srowen committed
[SPARK-23182][CORE] Allow enabling TCP keep alive on the RPC connections
## What changes were proposed in this pull request?

Make it possible for the master to enable TCP keep alive on the RPC connections with clients.

## How was this patch tested?

Manually tested.

Added the following:

```
spark.rpc.io.enableTcpKeepAlive  true
```

to spark-defaults.conf.

Observed the following on the Spark master:

```
$ netstat -town | grep 7077
tcp6  0  0 10.240.3.134:7077   10.240.1.25:42851   ESTABLISHED keepalive (6736.50/0/0)
tcp6  0  0 10.240.3.134:44911  10.240.3.134:7077   ESTABLISHED keepalive (4098.68/0/0)
tcp6  0  0 10.240.3.134:7077   10.240.3.134:44911  ESTABLISHED keepalive (4098.68/0/0)
```

Which proves that the keep alive setting is taking effect.

It's currently possible to enable TCP keep alive on the worker / executor, but it is not possible to configure it on other RPC connections. It's unclear to me why this could be the case. Keep alive is more important for the master, to protect it against suddenly departing workers / executors, thus I think it's very important to have it. Particularly, this makes the master resilient when using preemptible worker VMs in GCE. GCE has the concept of shutdown scripts, which it doesn't guarantee to execute. So workers often don't get shut down gracefully, and the TCP connections on the master linger as there's nothing to close them. Thus the need for enabling keep alive.

This enables keep-alive on connections besides the master's connections, but that shouldn't cause harm.

Closes #20512 from peshopetrov/master.

Authored-by: Petar Petrov <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
1 parent 4ff2b94 commit c01152d

2 files changed: +14 -0 lines changed


common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java

Lines changed: 4 additions & 0 deletions
@@ -126,6 +126,10 @@ private void init(String hostToBind, int portToBind) {
       bootstrap.childOption(ChannelOption.SO_SNDBUF, conf.sendBuf());
     }
 
+    if (conf.enableTcpKeepAlive()) {
+      bootstrap.childOption(ChannelOption.SO_KEEPALIVE, true);
+    }
+
     bootstrap.childHandler(new ChannelInitializer<SocketChannel>() {
       @Override
       protected void initChannel(SocketChannel ch) {
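
The hunk above sets `SO_KEEPALIVE` as a child option, so it applies to each accepted connection rather than to the listening socket. A minimal sketch of the same idea using only JDK sockets, with no Netty or Spark on the classpath, might look like the following. The class `KeepAliveSketch` and its method `acceptedKeepAlive` are hypothetical names introduced for illustration; they are not part of this commit.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical stand-alone illustration; not Spark or Netty code.
class KeepAliveSketch {

  /**
   * Accepts one loopback connection and, only when the flag is set, turns on
   * SO_KEEPALIVE for the accepted (server-side) socket. Returns the resulting
   * keep-alive state of that accepted socket.
   */
  static boolean acceptedKeepAlive(boolean enableTcpKeepAlive) {
    try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
         Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
         Socket accepted = server.accept()) {
      if (enableTcpKeepAlive) {
        // Plain-socket analogue of childOption(ChannelOption.SO_KEEPALIVE, true).
        accepted.setKeepAlive(true);
      }
      return accepted.getKeepAlive();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("flag off: keep-alive = " + acceptedKeepAlive(false));
    System.out.println("flag on:  keep-alive = " + acceptedKeepAlive(true));
  }
}
```

Enabling the option only asks the operating system to probe idle connections. How long a connection must be idle before probing starts is an OS-level setting, not something this flag controls.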

common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java

Lines changed: 10 additions & 0 deletions
@@ -42,6 +42,7 @@ public class TransportConf {
   private final String SPARK_NETWORK_IO_RETRYWAIT_KEY;
   private final String SPARK_NETWORK_IO_LAZYFD_KEY;
   private final String SPARK_NETWORK_VERBOSE_METRICS;
+  private final String SPARK_NETWORK_IO_ENABLETCPKEEPALIVE_KEY;
 
   private final ConfigProvider conf;
 
@@ -64,6 +65,7 @@ public TransportConf(String module, ConfigProvider conf) {
     SPARK_NETWORK_IO_RETRYWAIT_KEY = getConfKey("io.retryWait");
     SPARK_NETWORK_IO_LAZYFD_KEY = getConfKey("io.lazyFD");
     SPARK_NETWORK_VERBOSE_METRICS = getConfKey("io.enableVerboseMetrics");
+    SPARK_NETWORK_IO_ENABLETCPKEEPALIVE_KEY = getConfKey("io.enableTcpKeepAlive");
   }
 
   public int getInt(String name, int defaultValue) {
@@ -173,6 +175,14 @@ public boolean verboseMetrics() {
     return conf.getBoolean(SPARK_NETWORK_VERBOSE_METRICS, false);
   }
 
+  /**
+   * Whether to enable TCP keep-alive. If true, the TCP keep-alives are enabled, which removes
+   * connections that are idle for too long.
+   */
+  public boolean enableTcpKeepAlive() {
+    return conf.getBoolean(SPARK_NETWORK_IO_ENABLETCPKEEPALIVE_KEY, false);
+  }
+
   /**
    * Maximum number of retries when binding to a port before giving up.
    */
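
To make the lookup in this hunk concrete, here is a rough sketch of how the new key could resolve for a given module. It assumes that `getConfKey` builds keys as `spark.<module>.<suffix>`; the body of `getConfKey` is not shown in this diff, but that shape matches the `spark.rpc.io.enableTcpKeepAlive` setting used in the commit message. `TransportConfSketch` and its `Map`-backed settings are hypothetical stand-ins for `TransportConf` and `ConfigProvider`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for TransportConf; not the Spark class.
class TransportConfSketch {
  private final String module;
  private final Map<String, String> settings;

  TransportConfSketch(String module, Map<String, String> settings) {
    this.module = module;
    this.settings = settings;
  }

  // Assumption: keys are namespaced per module as "spark.<module>.<suffix>".
  String getConfKey(String suffix) {
    return "spark." + module + "." + suffix;
  }

  // Mirrors the new accessor: an unset key falls back to false.
  boolean enableTcpKeepAlive() {
    String raw = settings.get(getConfKey("io.enableTcpKeepAlive"));
    return raw != null && Boolean.parseBoolean(raw.trim());
  }

  public static void main(String[] args) {
    Map<String, String> settings = new HashMap<>();
    settings.put("spark.rpc.io.enableTcpKeepAlive", "true");

    TransportConfSketch rpc = new TransportConfSketch("rpc", settings);
    TransportConfSketch shuffle = new TransportConfSketch("shuffle", settings);

    System.out.println(rpc.getConfKey("io.enableTcpKeepAlive") + " = " + rpc.enableTcpKeepAlive());
    System.out.println(shuffle.getConfKey("io.enableTcpKeepAlive") + " = " + shuffle.enableTcpKeepAlive());
  }
}
```

Under that assumption, setting the key for the `rpc` module leaves other modules at the default of `false` unless their own keys are set.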
