Sandstorm: even faster TCP

Researchers have worked on improving the performance of TCP since the early 1980s. At that time, many considered that achieving high performance with a software-based TCP implementation was impossible. Several new transport protocols, such as XTP, were designed during this period. Some researchers even explored the possibility of implementing transport protocols in hardware. Hardware-based implementations are attractive when raw performance is the goal, but they are usually too inflexible for a transport protocol. In parallel with this effort, some researchers continued to believe in TCP. Dave Clark and his colleagues demonstrated in [1] that TCP stacks could be optimized to achieve high performance.

TCP implementations continued to evolve in order to achieve even higher performance. In the early 2000s, the advent of Gigabit interfaces brought a tighter coupling between TCP and the hardware on the network interface. Many high-speed network interfaces can compute the TCP checksum in hardware, which reduces the load on the main CPU. Furthermore, high-speed interfaces often support large segment offload. A naive implementation of a TCP/IP stack would send and receive segments and acknowledgements one at a time. For each segment sent or received, such a stack may need to process one interrupt, a very costly operation on current hardware. Large segment offload provides an alternative by exposing to the TCP/IP stack a large segment size, up to 64 KBytes, while the network interface takes care of the smaller segments that actually travel on the wire. By sending and receiving larger segments, the TCP/IP stack minimizes the cost of processing the interrupts and thus maximizes its performance.
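To get a feeling for the gain, the short Python sketch below counts how many units the stack has to handle for a given transfer, with and without large segment offload. The numbers are assumptions chosen for illustration: a 1460-byte MSS (a 1500-byte Ethernet MTU minus 20 bytes of IP header and 20 bytes of TCP header without options), a 10 MiB transfer, and the 64 KBytes limit mentioned above. The function names are hypothetical, and the split function only mimics what an interface might do when it cuts a large segment into wire-sized pieces.

```python
MSS = 1460                 # assumed: 1500-byte MTU - 20 (IP) - 20 (TCP, no options)
OFFLOAD_SIZE = 64 * 1024   # large segment size exposed to the stack (from the text)


def units_handled_by_stack(total_bytes, unit_size):
    """Number of segments the TCP/IP stack must process (integer ceiling)."""
    return (total_bytes + unit_size - 1) // unit_size


def split_into_wire_segments(payload, mss=MSS):
    """Mimic the interface cutting one large segment into MSS-sized pieces."""
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]


transfer = 10 * 1024 * 1024  # assumed 10 MiB transfer

without_offload = units_handled_by_stack(transfer, MSS)
with_offload = units_handled_by_stack(transfer, OFFLOAD_SIZE)

# One 64 KByte segment handed to the interface becomes many wire segments.
wire_segments = split_into_wire_segments(bytes(OFFLOAD_SIZE))

if __name__ == "__main__":
    print("segments seen by the stack without offload:", without_offload)
    print("segments seen by the stack with offload:   ", with_offload)
    print("wire segments per offloaded segment:       ", len(wire_segments))
```

Under these assumptions, the stack handles a few thousand segments without offload and a few hundred with it, so any per-segment cost (such as an interrupt) is paid far less often.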

Another bottleneck from a performance viewpoint is the need to copy data from the application to the kernel. The kernel needs to maintain a copy of the data that the TCP stack sent to be able to retransmit it in case of losses. Different solutions have been proposed and implemented to reduce this memory copy cost. A first approach is to optimize the transmission of files. A networking 101 course would probably explain that to send a file over the network, an application first needs to read the file and then send the data through the socket interface. In this case, data is copied from the kernel (i.e. the buffer cache) to the application memory, only to be immediately copied again into the TCP buffers of the kernel. This is clearly not the most efficient mechanism. A better approach is to use advanced functions like sendfile() on Linux, which allow an application to transmit the contents of an entire file over a TCP connection without any copy in user space. These functions are used by most high-performance web and file servers.
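As an illustration of this approach, here is a minimal Python sketch. It relies on the standard library method socket.sendfile(), which uses the operating system's sendfile() when it is available and silently falls back to plain send() calls otherwise. The helper names send_file and demo are hypothetical, and the demo uses a local socket pair as a stand-in for a TCP connection so that it can run on a single machine.

```python
import os
import socket
import tempfile
import threading


def send_file(sock, path):
    """Send a whole file over a connected stream socket.

    socket.sendfile() lets the kernel move the data when os.sendfile()
    is available, so the file contents do not pass through user space.
    Returns the number of bytes sent.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)


def demo(size=256 * 1024):
    """Send a temporary file through a local socket pair and verify it."""
    payload = os.urandom(size)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name

    # Stand-in for a TCP connection: two connected stream sockets.
    sender, receiver = socket.socketpair()
    received = bytearray()

    def drain():
        # Read until the sender closes its end of the connection.
        with receiver:
            while True:
                chunk = receiver.recv(65536)
                if not chunk:
                    break
                received.extend(chunk)

    worker = threading.Thread(target=drain)
    worker.start()
    try:
        with sender:
            sent = send_file(sender, path)
    finally:
        worker.join()
        os.unlink(path)
    return sent, bytes(received) == payload


if __name__ == "__main__":
    sent, intact = demo()
    print("bytes sent:", sent, "received intact:", intact)
```

The application never reads the file contents itself: it only passes a file object and a socket to the kernel, which is what removes the two copies described above.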

Still, another approach is possible. In 2012, Luigi Rizzo proposed the netmap framework [2]. This API enables applications to efficiently send and receive packets directly through a network interface. It achieves high performance inside applications by batching packets, i.e. the application and the interface exchange groups of packets, and also by using a ring buffer that is shared efficiently between the application and the network interface. Netmap has been used to implement applications that process packets, such as switches and routers.
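The effect of batching on a shared ring can be illustrated with a toy model. The Python sketch below is not the netmap API: the class PacketRing, its sync() method and the send_all() helper are hypothetical names invented for this example. Each call to sync() stands for one costly interaction between the application and the interface, and the sketch simply counts how many such interactions are needed for different batch sizes.

```python
class PacketRing:
    """Toy fixed-size ring of packet slots shared by a producer and a consumer."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots
        self.head = 0        # next slot the producer will fill
        self.tail = 0        # next slot the consumer will drain
        self.sync_calls = 0  # number of (simulated) costly interactions

    def free_slots(self):
        # One slot is kept empty to tell a full ring from an empty one.
        return (self.tail - self.head - 1) % len(self.slots)

    def put(self, packet):
        """Queue one packet; returns False when the ring is full."""
        if self.free_slots() == 0:
            return False
        self.slots[self.head] = packet
        self.head = (self.head + 1) % len(self.slots)
        return True

    def sync(self):
        """Hand every queued packet to the consumer in a single interaction."""
        self.sync_calls += 1
        batch = []
        while self.tail != self.head:
            batch.append(self.slots[self.tail])
            self.tail = (self.tail + 1) % len(self.slots)
        return batch


def send_all(ring, packets, batch_size):
    """Send packets through the ring, syncing once per batch_size packets."""
    delivered = []
    pending = 0
    for packet in packets:
        while not ring.put(packet):
            # Ring full: flush before queuing more packets.
            delivered.extend(ring.sync())
            pending = 0
        pending += 1
        if pending >= batch_size:
            delivered.extend(ring.sync())
            pending = 0
    if pending:
        delivered.extend(ring.sync())
    return delivered


if __name__ == "__main__":
    packets = list(range(1000))
    for batch_size in (1, 32):
        ring = PacketRing(64)
        out = send_all(ring, packets, batch_size)
        print("batch size", batch_size, "->", ring.sync_calls, "sync calls,",
              "order preserved:", out == packets)
```

In this model, sending packets one by one costs one interaction per packet, while batches of 32 packets divide that count by roughly 32 without changing what is delivered.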

Supporting a complete TCP/IP stack on top of the netmap API was the next step. A recent paper written by Ilias Marinos, Robert Watson and Mark Handley [3] shows that it is possible to implement TCP efficiently in user space. The paper describes software libraries that allow applications to use their own TCP/IP stack and to talk directly to the network interface. This user space implementation outperforms regular, but optimized, TCP/IP stacks on both Linux and FreeBSD. Future high-performance servers may be entirely implemented in user space with their own TCP/IP stack included.

Bibliography

[1] D. Clark, V. Jacobson, J. Romkey, H. Salwen, An analysis of TCP processing overhead, IEEE Communications Magazine, Vol. 27, No. 6, 1989

[2] L. Rizzo, netmap: a novel framework for fast packet I/O, Best Paper award at USENIX ATC’12, June 2012

[3] I. Marinos, R. Watson, M. Handley, Network Stack Specialization for Performance, HotNets 2013