Description
Currently, TcpListener::accept
forwards to a single instance of the underlying libc call, propagating any encountered errors. This encourages using the function in one of two ways:
- Propagating the error to the
main
function, causing the application to exit. - Sleeping when an error is encountered, and retrying
accept
later.
Unfortunately, both of these are incorrect, any may lead to denial-of-service attacks; in particular, I believe erroneous returns from accept
can be triggered by a client, thus either causing the server to crash or no more connections to be accepted for some period of time. There are four kinds of error that can be returned from accept()
:
- Client / transient errors. These should be handled by immediately calling
accept
again. - Resource exhaustion, or otherwise device failure. These are typically handled by logging the error, sleeping for some time, and then retrying.
- Programmer errors. These should be handled by exiting the application.
- Asynchronous errors (i.e.
WouldBlock
). These should be handled by waiting for further events via the chosen I/O driver, and are generally handled correctly today.
For prior art and further reading, see:
- accept(2)'s problem of trying to return two different sorts of errors, which explains the issue
- Which socket accept() errors are fatal?, a fairly comprehensive summary of the various kinds of error that can occur in
accept()
- man accept(2), which has two relevant excerpts:
-
ECONNABORTED
: A connection has been aborted. -
Linux accept() (and accept4()) passes already-pending network errors on the new socket as an error code from accept(). This behavior differs from other BSD socket implementations. For reliable operation the application should detect the network errors defined for the protocol after accept() and treat them like EAGAIN by retrying. In the case of TCP/IP, these are ENETDOWN, EPROTO, ENOPROTOOPT, EHOSTDOWN, ENONET, EHOSTUNREACH, EOPNOTSUPP, and ENETUNREACH.
-
- WSAAccept, which has two relevant excerpts:
-
WSAECONNREFUSED
: No connection could be made because the target machine actively refused it. This error is returned if the connection request was forcefully rejected as indicated in the return value of the condition function (CF_REJECT). -
WSAECONNRESET
: An existing connection was forcibly closed by the remote host. This error is returned of an incoming connection was indicated, but was subsequently terminated by the remote peer prior to accepting the call.
-
- axum, who follows Hyper 0.14 in treating
ConnectionRefused
,ConnectionAborted
andConnectionReset
as especially non-fatal (and sleeps on the rest)
A proposal: modifying accept()
A possible solution is to internally retry in accept
in the following cases:
- On all platforms,
io::ErrorKind::Interrupted
- On all platforms,
io::ErrorKind::Conection{Aborted, Refused, Reset}
- On Linux only,
ENETDOWN
,EPROTO
,ENOPROTOOPT
,EHOSTDOWN
,ENONET
,EHOSTUNREACH
,EOPNOTSUPP
, andENETUNREACH
This approach differentiates between client errors and resource exhaustion, but not between resource exhaustion and programmer errors. This is, however, far less of a footgun: it is generally not dangerous to treat a programmer error the same way as resource exhaustion. In the method documentation, we should encourage logging and sleeping on errors returned from .accept()
, with an optional clause that certain errors known to be programmer errors may be handled specially if the user so desires.
Why not a separate method or external crate?
The functionality of retrying on transient errors could be provided by another method on TcpListener
, e.g. .accept_retries()
(in a similar vein to write_all
), or by a third-party crate. Whether we choose this option depends on two factors:
- How widespread we believe this incorrect handling of
accept()
is, and how damaging it is. I have not collected data on this, but given that the method docs do not mention the possibility of transient errors at all, I suspect that it is rather frequent. - How likely we believe it is that making this change will break user code. I can’t immediately think of a case in which code would be broken, but it is always possible they exist.