Skip to content

TcpListener::accept not retrying transient errors is a footgun #142557

Open
@SabrinaJewson

Description

@SabrinaJewson

Currently, TcpListener::accept forwards to a single instance of the underlying libc call, propagating any encountered errors. This encourages using the function in one of two ways:

  • Propagating the error to the main function, causing the application to exit.
  • Sleeping when an error is encountered, and retrying accept later.

Unfortunately, both of these are incorrect, any may lead to denial-of-service attacks; in particular, I believe erroneous returns from accept can be triggered by a client, thus either causing the server to crash or no more connections to be accepted for some period of time. There are four kinds of error that can be returned from accept():

  1. Client / transient errors. These should be handled by immediately calling accept again.
  2. Resource exhaustion, or otherwise device failure. These are typically handled by logging the error, sleeping for some time, and then retrying.
  3. Programmer errors. These should be handled by exiting the application.
  4. Asynchronous errors (i.e. WouldBlock). These should be handled by waiting for further events via the chosen I/O driver, and are generally handled correctly today.

For prior art and further reading, see:

  • accept(2)'s problem of trying to return two different sorts of errors, which explains the issue
  • Which socket accept() errors are fatal?, a fairly comprehensive summary of the various kinds of error that can occur in accept()
  • man accept(2), which has two relevant excerpts:
    • ECONNABORTED: A connection has been aborted.

    • Linux accept() (and accept4()) passes already-pending network errors on the new socket as an error code from accept(). This behavior differs from other BSD socket implementations. For reliable operation the application should detect the network errors defined for the protocol after accept() and treat them like EAGAIN by retrying. In the case of TCP/IP, these are ENETDOWN, EPROTO, ENOPROTOOPT, EHOSTDOWN, ENONET, EHOSTUNREACH, EOPNOTSUPP, and ENETUNREACH.

  • WSAAccept, which has two relevant excerpts:
    • WSAECONNREFUSED: No connection could be made because the target machine actively refused it. This error is returned if the connection request was forcefully rejected as indicated in the return value of the condition function (CF_REJECT).

    • WSAECONNRESET: An existing connection was forcibly closed by the remote host. This error is returned of an incoming connection was indicated, but was subsequently terminated by the remote peer prior to accepting the call.

  • axum, who follows Hyper 0.14 in treating ConnectionRefused, ConnectionAborted and ConnectionReset as especially non-fatal (and sleeps on the rest)

A proposal: modifying accept()

A possible solution is to internally retry in accept in the following cases:

  • On all platforms, io::ErrorKind::Interrupted
  • On all platforms, io::ErrorKind::Conection{Aborted, Refused, Reset}
  • On Linux only, ENETDOWN, EPROTO, ENOPROTOOPT, EHOSTDOWN, ENONET, EHOSTUNREACH, EOPNOTSUPP, and ENETUNREACH

This approach differentiates between client errors and resource exhaustion, but not between resource exhaustion and programmer errors. This is, however, far less of a footgun: it is generally not dangerous to treat a programmer error the same way as resource exhaustion. In the method documentation, we should encourage logging and sleeping on errors returned from .accept(), with an optional clause that certain errors known to be programmer errors may be handled specially if the user so desires.

Why not a separate method or external crate?

The functionality of retrying on transient errors could be provided by another method on TcpListener, e.g. .accept_retries() (in a similar vein to write_all), or by a third-party crate. Whether we choose this option depends on two factors:

  1. How widespread we believe this incorrect handling of accept() is, and how damaging it is. I have not collected data on this, but given that the method docs do not mention the possibility of transient errors at all, I suspect that it is rather frequent.
  2. How likely we believe it is that making this change will break user code. I can’t immediately think of a case in which code would be broken, but it is always possible they exist.

Metadata

Metadata

Assignees

No one assigned

    Labels

    I-libs-nominatedNominated for discussion during a libs team meeting.T-libsRelevant to the library team, which will review and decide on the PR/issue.needs-triageThis issue may need triage. Remove it if it has been sufficiently triaged.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions