Setting the right socket timeout for S3 client
by Marat Komarov
TLDR: In urllib3, setting connect_timeout=N
(or a similar option in another HTTP client) may not always result in a straightforward N-second timeout. The actual timeout can be N multiplied by the number of IP addresses announced in DNS for the hostname you’re connecting to. It’s worth keeping this in mind when configuring timeouts so that your code behaves as intended.
Experiment
The combination of requests/urllib3 is widely used as an HTTP client in Python. In fact, when you call the API of any popular cloud provider from Python, such as AWS or Azure, urllib3 is doing the work behind the scenes.
When fine-tuning code that calls third-party APIs, one crucial aspect to consider is socket timeouts. Since many developers write code for low-latency environments, it’s essential to keep timeouts short to ensure fail-fast behavior.
The connect timeout is the number of seconds a client will wait to establish a socket connection to the remote end. The read timeout applies to each recv() operation on a socket.
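In plain urllib3, the two phases can be configured explicitly through its Timeout helper. A minimal sketch:

import urllib3

# give the connect and read phases separate budgets
http = urllib3.PoolManager(timeout=urllib3.Timeout(connect=3.0, read=5.0))
resp = http.request("GET", "https://s3.amazonaws.com/")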
Suppose you’ve weighed all the pros and cons and set the timeout parameter to timeout=(3, 5), expecting a 3-second timeout for connect() and a 5-second timeout for recv(). However, in reality, things can get complicated. For instance, if you try to connect to s3.amazonaws.com on an unstable network, or if the S3 subnets are entirely blocked by a firewall, your request may take up to 24 seconds to give up.
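To make this concrete, here is roughly what such a call looks like with requests:

import requests

# expectation: give up after ~3 s if the host is unreachable;
# reality: up to ~3 s per IP address the hostname resolves to
requests.get("https://s3.amazonaws.com/", timeout=(3, 5))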
Why does it behave like that?
When sending an HTTP request, urllib3 calls BaseHTTPConnection.connect(), which in turn calls socket.create_connection() to establish a network connection; to resolve the DNS name, create_connection() calls getaddrinfo().
And this is where things start to get interesting:
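Here is socket.create_connection(), condensed from CPython’s Lib/socket.py (error messages and the source_address handling trimmed):

def create_connection(address, timeout=_GLOBAL_DEFAULT_TIMEOUT):
    host, port = address
    err = None
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket(af, socktype, proto)
            if timeout is not _GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)  # the full connect timeout, set per attempt
            sock.connect(sa)
            return sock  # first successful attempt wins
        except error as exc:
            err = exc
            if sock is not None:
                sock.close()
    if err is not None:
        raise err  # every address failed; re-raise the last error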
The line for res in getaddrinfo(host, port, 0, SOCK_STREAM) makes the function keep iterating over address entries, attempting to create and connect a socket for each one, until an attempt succeeds.
How many results can getaddrinfo() return? The getaddrinfo man page answers:
There are several reasons why the linked list may have more than one addrinfo structure, including: the network host is multihomed, accessible over multiple protocols (e.g., both AF_INET and AF_INET6); or the same service is available from multiple socket types (one SOCK_STREAM address and another SOCK_DGRAM address, for example).
Indeed, s3.amazonaws.com is multihomed:
$ host s3.amazonaws.com
s3.amazonaws.com has address 52.217.75.30
s3.amazonaws.com has address 52.217.225.224
s3.amazonaws.com has address 52.216.114.141
s3.amazonaws.com has address 54.231.136.184
s3.amazonaws.com has address 3.5.9.148
s3.amazonaws.com has address 52.216.184.197
s3.amazonaws.com has address 54.231.164.152
s3.amazonaws.com has address 52.217.140.216
Google announces 10+ IPv4/IPv6 addresses for storage.googleapis.com, and Azure does the same for its Blob service.
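You can check the resolved address count from Python; the exact numbers will vary with DNS:

import socket

# one getaddrinfo() entry per usable (family, address) combination
infos = socket.getaddrinfo("s3.amazonaws.com", 443, 0, socket.SOCK_STREAM)
addresses = {sockaddr[0] for family, socktype, proto, canonname, sockaddr in infos}
print(len(addresses), "addresses:", sorted(addresses))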
Conclusion
The maximum time spent in socket.create_connection() is the connect timeout multiplied by the number of IP addresses the DNS resolver returns for the hostname: with a 3-second connect timeout and the 8 addresses above, up to 24 seconds.
Another effect of dealing with multihomed hosts is that your HTTP client code likely doesn’t need an extra retry strategy for connection failures, as create_connection() already implements one for you.
Here is the boto3 configuration I use in a low-latency service that provides a blob/log storage interface on top of a cloud storage service.
import botocore.config

# 3 s per connect() attempt (per resolved IP!), 5 s per recv(),
# and no botocore retries on top of create_connection()'s own loop
config = botocore.config.Config(
    connect_timeout=3,
    read_timeout=5,
    retries={"max_attempts": 0},
)
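To apply it, pass the config when constructing the client:

import boto3

s3 = boto3.client("s3", config=config)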