Most of our customers use our service for real-time applications with very stringent performance SLAs, and we continuously invest in performance enhancements to make our service faster. One of the technologies we have invested in is eBPF, which we use to accelerate communication between the various microservices in our backend. Since very few articles show comprehensively how to use eBPF, we decided to share what we learned with the broader community. This is the second part of a two-part blog series in which we describe in detail how we leverage eBPF for network acceleration, along with some interesting observations we made along the way.
In the previous blog post we showed how to write eBPF code for socket data redirection and how to use bpftool to inject the resulting eBPF bytecode into the kernel. In this post, we delve into the performance benefits we gained by switching from regular TCP/IP to eBPF.
Using eBPF for Network Acceleration – Performance Evaluation
Now that we can bypass the TCP/IP stack using (the awesome!) eBPF, we need to actually “see” the performance gains. For this purpose we used netperf, a widely used tool for measuring network performance, to evaluate gains in throughput, latency and transaction rate. We evaluated these metrics for the following two cases:
- the traffic goes through the TCP/IP stack
- the traffic bypasses the TCP/IP stack using the eBPF sockhash map redirect
Our test setup was as follows:
- Ubuntu Bionic 18.04.3 LTS on VirtualBox 6.1 running on macOS High Sierra
- Linux Kernel 5.3.0-40-generic
- TCP settings:
  - Congestion control: CUBIC
  - Default buffer size passed to recv() call: 131072 bytes
  - Default buffer size passed to send() call: 16384 bytes
  - TCP Maximum Segment Size (MSS): 65483 bytes
- netperf 2.6.0
- We used the following netperf server and client commands for the throughput measurements, for various send message sizes, each over a 60-second run:
netserver -p 1000
netperf -H 127.0.0.1 -p 1000 -l 60 -- -m $msg_size
- We used the following netperf server and client commands for the latency measurements (50th, 90th and 99th percentile), for various request and response message sizes, each over a 60-second run:
netserver -p 1000
netperf -P 0 -t TCP_RR -H 127.0.0.1 -p 1000 -l 60 -- -r $req_size,$resp_size -o P50_LATENCY,P90_LATENCY,P99_LATENCY
- We used the following netperf server and client commands for the transaction rate measurements, for various request and response message sizes, each over a 60-second run:
netserver -p 1000
netperf -t TCP_RR -H 127.0.0.1 -p 1000 -l 60 -- -r $req_size,$resp_size
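The client commands above differ only in their message sizes, so a sweep is easy to script. The following is a minimal sketch in Python that only builds the command lines, using the flags listed above. The helper names and the example sizes are assumptions made for illustration; they are not the script or the exact sizes used for the measurements in this post.

```python
import shlex

HOST = "127.0.0.1"   # loopback address, as in the commands above
PORT = 1000          # netserver control port, as in the commands above
DURATION_S = 60      # length of each run in seconds


def throughput_cmd(msg_size):
    """Build the netperf client command for one throughput run."""
    return ["netperf", "-H", HOST, "-p", str(PORT), "-l", str(DURATION_S),
            "--", "-m", str(msg_size)]


def latency_cmd(req_size, resp_size):
    """Build the netperf TCP_RR command that reports P50/P90/P99 latency."""
    # The -r and -o values are comma-separated lists without spaces.
    return ["netperf", "-P", "0", "-t", "TCP_RR", "-H", HOST, "-p", str(PORT),
            "-l", str(DURATION_S), "--", "-r", f"{req_size},{resp_size}",
            "-o", "P50_LATENCY,P90_LATENCY,P99_LATENCY"]


if __name__ == "__main__":
    # Hypothetical sweep values, chosen only to show the shape of a sweep.
    for size in (256, 1024, 4096, 16384):
        print(shlex.join(throughput_cmd(size)))
    for resp in (64, 256, 1024):
        print(shlex.join(latency_cmd(64, resp)))
```

Each printed line can be pasted into a shell, or the argument list can be passed to subprocess.run once netserver is listening.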
Fig. 2 Throughput: TCP/IP vs TCP/IP bypass using eBPF sockhash map
Fig. 2 shows the throughput with the eBPF-enabled TCP/IP bypass relative to the throughput through the regular TCP/IP stack. One interesting observation is that, with eBPF, throughput grows linearly with the send message size. This is because there is (almost) no additional overhead when the application passes larger messages to the send call.
Now for the counterintuitive part: one might wonder why the eBPF throughput is worse than the regular TCP/IP path at smaller send message sizes. It turns out the culprit is Nagle’s algorithm, which is enabled by default in the TCP/IP stack. Nagle’s algorithm was introduced to stop small packets from flooding slow networks and causing congestion. Because the algorithm allows only one outstanding (i.e. unacknowledged) TCP segment smaller than the TCP MSS, in our measurement it causes TCP to batch up data for transmission. This batching sends more data per segment, amortizing the overhead, and so outperforms the eBPF sockhash map redirect, whose TCP/IP bypass pays a constant overhead per send call regardless of buffer size (see Fig. 7).
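To get a rough feel for the amortization, the following Python sketch counts how many application messages can ride in one MSS-sized segment when small writes are coalesced. It is a deliberately simplified model that ignores ACK timing, headers and everything else the real stack does; the function name is hypothetical, and only the MSS value comes from our testbed.

```python
MSS_BYTES = 65483  # TCP MSS on our testbed


def messages_per_segment(msg_size, mss=MSS_BYTES):
    """Whole application messages that fit into one full-sized segment.

    Simplified model: assumes small writes are coalesced until a segment
    is full, which is only an approximation of Nagle's algorithm.
    """
    if msg_size <= 0:
        raise ValueError("msg_size must be positive")
    return max(1, mss // msg_size)


if __name__ == "__main__":
    for size in (256, 16384, 65483):
        print(size, messages_per_segment(size))
```

Under this model a 256-byte send size lets 255 messages share the per-segment cost, while a send size equal to the MSS leaves nothing to amortize.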
The regular TCP/IP path loses its batching advantage as the send message size grows toward the TCP MSS: fewer messages fit in one MSS-sized segment (in our testbed the MSS is 65483 bytes) before it is sent to the destination through the kernel TCP/IP stack. At these large send sizes, eBPF, by virtue of its low overhead, far exceeds the throughput of the TCP/IP stack with Nagle’s algorithm enabled.
Next, we repeat our performance runs using netperf with Nagle’s algorithm disabled (using the -D flag):
netserver -p 1000
netperf -H 127.0.0.1 -p 1000 -l 60 -- -m $msg_size -D
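For reference, an application disables Nagle’s algorithm on its own sockets with the TCP_NODELAY socket option, which is what netperf’s -D flag sets. A minimal Python sketch follows; the helper names are ours and purely illustrative.

```python
import socket


def make_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero TCP_NODELAY value turns Nagle's algorithm off for this socket.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nagle_disabled(sock):
    """Return True if TCP_NODELAY is set on the given socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


if __name__ == "__main__":
    with make_nodelay_socket() as sock:
        print("Nagle disabled:", nagle_disabled(sock))
```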
Fig. 3 Throughput: TCP/IP (with Nagle’s algorithm disabled) vs TCP/IP bypass using eBPF sockhash map
With Nagle’s algorithm disabled, the throughput advantage of regular TCP over the eBPF sockhash map bypass disappears completely. The throughput of both the TCP/IP stack and the eBPF sockhash map bypass increases linearly, as expected, with eBPF having a steeper slope than regular TCP/IP because of eBPF’s fixed-cost overhead per send call. This performance gap is more pronounced for larger send message sizes and smaller TCP MSS.
Next, we repeat our performance runs using netperf to measure the latency. We measure the 50th, 90th and 99th percentile of the latency while varying the send and receive buffer (message) sizes. In our results, we use the median latency to show the trendlines clearly, and we show trendlines for request message sizes of 64 bytes and 256 bytes. (Please note: netperf has only one transaction outstanding at a time.)
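To make the request/response pattern concrete, here is a small Python sketch that mimics it over a loopback TCP connection with one transaction outstanding at a time, and computes nearest-rank percentiles from the samples. It is not netperf and will not reproduce our numbers; all function names and default values are assumptions made for illustration.

```python
import math
import socket
import threading
import time


def recv_exact(sock, size):
    """Read exactly `size` bytes from the socket."""
    remaining = size
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        remaining -= len(chunk)


def serve_responses(listener, req_size, resp_size, count):
    """Accept one client and answer `count` requests with fixed-size responses."""
    conn, _ = listener.accept()
    with conn:
        for _ in range(count):
            recv_exact(conn, req_size)
            conn.sendall(b"r" * resp_size)


def measure_round_trips(req_size=64, resp_size=64, count=200):
    """Return per-transaction round-trip times in microseconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
        listener.listen(1)
        server = threading.Thread(
            target=serve_responses, args=(listener, req_size, resp_size, count))
        server.start()
        latencies_us = []
        with socket.create_connection(listener.getsockname()) as client:
            for _ in range(count):
                start = time.perf_counter()
                client.sendall(b"q" * req_size)
                recv_exact(client, resp_size)  # wait before sending the next one
                latencies_us.append((time.perf_counter() - start) * 1e6)
        server.join()
    return latencies_us


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    samples = measure_round_trips(req_size=64, resp_size=256)
    for pct in (50, 90, 99):
        print(f"P{pct}: {percentile(samples, pct):.1f} us")
```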
Fig. 4 Latency: TCP/IP vs TCP/IP bypass using eBPF sockhash map (the R64/R256 prefix denotes a request message size of 64 bytes/256 bytes)
No surprise here: as seen in Fig. 4, the eBPF sockhash map bypass outperforms the regular TCP/IP stack, by almost 50%. This is expected, since eBPF eliminates the protocol-level overhead of TCP/IP (slow start, congestion avoidance, flow control, etc.) by redirecting packets from the transmit queue of the source socket to the receive queue of the destination socket. Also, the request message size has no impact on the latency. We didn’t try to measure latency for send message sizes approaching the TCP MSS.
Next, we repeat our performance runs using netperf to measure the transaction rate.
This should merely be the inverse of the latency measurements, and that is indeed the case: if we simply flip the transaction rate curves, they trend the same way as the latency measurements.
Fig. 5 Transaction rate: TCP/IP vs TCP/IP bypass using eBPF sockhash map
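The inverse relationship between latency and transaction rate follows from having only one transaction outstanding at a time: the rate is roughly one over the mean round-trip time. A minimal Python sketch of the conversion follows. Note that it uses the mean latency, not the median we plot, and the function names and the 50 microsecond input are made up for illustration.

```python
US_PER_SECOND = 1_000_000


def transactions_per_second(mean_latency_us):
    """Transaction rate for a single outstanding request/response at a time."""
    if mean_latency_us <= 0:
        raise ValueError("latency must be positive")
    return US_PER_SECOND / mean_latency_us


def mean_latency_us(rate_tps):
    """Inverse conversion: mean round-trip time for a given transaction rate."""
    if rate_tps <= 0:
        raise ValueError("rate must be positive")
    return US_PER_SECOND / rate_tps


if __name__ == "__main__":
    # Hypothetical value, not a measurement from this post.
    print(transactions_per_second(50.0))
```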
In Fig. 6, we plot the transaction rates for various request message sizes against varying response message sizes. As with the latency measurements, we see that the transaction rate is unaffected by the request message size, for both the regular TCP/IP path and the eBPF sockhash map bypass.
Fig. 6 Transaction rate for different request message sizes: TCP/IP vs TCP/IP bypass using eBPF sockhash map (the R64/R128/R256 prefixes denote request sizes of 64 bytes, 128 bytes and 256 bytes)
In Fig. 7, we plot the time the eBPF sockhash map bypass spent in the kernel during the netperf throughput run. We see that the eBPF overhead is around 1.5 seconds for a send message size of 256 bytes, and that the overhead decreases as the send message size increases.
Fig. 7 eBPF sockhash map redirect: time spent in the kernel
eBPF is indeed a powerful technology that allows unprecedented access to Linux kernel resources from userspace. When applications are latency sensitive, using eBPF to bypass the TCP/IP stack for communication with another application on the same host is a major benefit, as is the case for mutually dependent microservices running in the same pod in a Kubernetes cluster. Specifically, eBPF-based TCP/IP bypass can greatly reduce the latency of microservices that communicate heavily over RPCs and REST APIs. We also see that while eBPF can deliver significant performance gains, using it blindly can have unintended consequences. Specifically, with default settings a regular TCP/IP stack can outperform eBPF when we are only concerned with raw unidirectional throughput. In such a scenario, the application’s sendmsg buffer size needs to be tuned properly.