Setting the right socket timeout for S3 client
by Marat Komarov
TLDR: In urllib3, setting connect_timeout=N
(or a similar option in another HTTP client) may not always result in a straightforward N-second timeout. The actual timeout can be N multiplied by the number of IP addresses announced in DNS for the hostname you’re connecting to. It’s worth keeping this in mind when configuring timeouts so that your code behaves as intended.
Experiment
The combination of requests/urllib3 is widely used as an HTTP client in Python. In fact, when you call the API of any popular cloud provider from Python, such as AWS or Azure, urllib3 is doing the work behind the scenes.
When fine-tuning code that calls third-party APIs, one crucial aspect to consider is socket timeouts. Since many developers write code for low-latency environments, it’s essential to keep timeouts short to ensure fail-fast behavior.
The connect timeout is the number of seconds a client will wait to establish a socket connection to the remote end. The read timeout applies to each recv() operation on a socket.
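In plain urllib3, the two phases can be configured explicitly through its Timeout helper. A minimal sketch:

import urllib3

# give the connect and read phases separate budgets
http = urllib3.PoolManager(timeout=urllib3.Timeout(connect=3.0, read=5.0))
resp = http.request("GET", "https://s3.amazonaws.com/")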
Suppose you’ve weighed all the pros and cons and set the timeout parameter to timeout=(3, 5), expecting a 3-second timeout for connect() and a 5-second timeout for recv(). However, in reality, things can get complicated. For instance, if you try to connect to s3.amazonaws.com on an unstable network, or if the S3 subnets are entirely blocked by a firewall, your request may take up to 24 seconds to give up.
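To make this concrete, here is roughly what such a call looks like with requests:

import requests

# expectation: give up after ~3 s if the host is unreachable;
# reality: up to ~3 s per IP address the hostname resolves to
requests.get("https://s3.amazonaws.com/", timeout=(3, 5))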
Why does it behave like that?
When sending an HTTP request, urllib3 calls BaseHTTPConnection.connect(), which in turn calls socket.create_connection() to establish a network connection; to resolve the DNS name, create_connection() calls getaddrinfo().
And this is where things start to get interesting:
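Here is socket.create_connection(), condensed from CPython’s Lib/socket.py (error messages and the source_address handling trimmed):

def create_connection(address, timeout=_GLOBAL_DEFAULT_TIMEOUT):
    host, port = address
    err = None
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket(af, socktype, proto)
            if timeout is not _GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)  # the full connect timeout, set per attempt
            sock.connect(sa)
            return sock  # first successful attempt wins
        except error as exc:
            err = exc
            if sock is not None:
                sock.close()
    if err is not None:
        raise err  # every address failed; re-raise the last error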
The line for res in getaddrinfo(host, port, 0, SOCK_STREAM) makes the function keep iterating over address entries, attempting to create and connect a socket for each one, until an attempt succeeds.
How many results can getaddrinfo() return? The getaddrinfo man page answers:
There are several reasons why the linked list may have more than one addrinfo structure, including: the network host is multihomed, accessible over multiple protocols (e.g., both AF_INET and AF_INET6); or the same service is available from multiple socket types (one SOCK_STREAM address and another SOCK_DGRAM address, for example).
Indeed, s3.amazonaws.com is multihomed:
$ host s3.amazonaws.com
s3.amazonaws.com has address 52.217.75.30
s3.amazonaws.com has address 52.217.225.224
s3.amazonaws.com has address 52.216.114.141
s3.amazonaws.com has address 54.231.136.184
s3.amazonaws.com has address 3.5.9.148
s3.amazonaws.com has address 52.216.184.197
s3.amazonaws.com has address 54.231.164.152
s3.amazonaws.com has address 52.217.140.216
Google announces 10+ IPv4/IPv6 addresses for storage.googleapis.com, and Azure does the same for its Blob service.
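You can check the resolved address count from Python; the exact numbers will vary with DNS:

import socket

# one getaddrinfo() entry per usable (family, address) combination
infos = socket.getaddrinfo("s3.amazonaws.com", 443, 0, socket.SOCK_STREAM)
addresses = {sockaddr[0] for family, socktype, proto, canonname, sockaddr in infos}
print(len(addresses), "addresses:", sorted(addresses))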
Conclusion
The maximum time spent in socket.create_connection() is the connect timeout multiplied by the number of IP addresses the DNS resolver returns for the hostname: with a 3-second connect timeout and the 8 addresses above, up to 24 seconds.
Another effect of dealing with multihomed hosts is that your HTTP client code likely doesn’t need an extra retry strategy for connection failures, as create_connection() already implements one for you.
Here is the boto3 configuration I use in a low-latency service that provides a blob/log storage interface on top of a cloud storage service.
import botocore.config

# 3 s per connect() attempt (per resolved IP!), 5 s per recv(),
# and no botocore retries on top of create_connection()'s own loop
config = botocore.config.Config(
    connect_timeout=3,
    read_timeout=5,
    retries={"max_attempts": 0},
)
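To apply it, pass the config when constructing the client:

import boto3

s3 = boto3.client("s3", config=config)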