xPUBench: Scalable and Energy-Efficient GPU and DPU-Accelerated Network Functions

Today Maxime Vanliefde presented xPUBench: Scalable and Energy-Efficient GPU and DPU-Accelerated Network Functions at #PAM2026

This is a collaboration with Nikita Tyunyayev, Clément Delzotti, Romain Van Hauwaert and Elena Agostini (NVIDIA). We benchmark network packet processing on GPUs, DPUs, CPUs, and combinations of them, hence the name: xPUBench.

We also take a novel angle on energy efficiency, with techniques to reduce GPU power draw (such as lowering the uncore frequency), and show that DPUs are highly efficient, but limited in terms of processing power. When they can keep up with the workload, they are the best choice. For complex workloads, GPUs are the way to go and are even more energy-efficient than CPUs.

Many more findings in the paper!

▶️ paper

PAMO at Middleware’25

PAMO: Pattern Matching Offload for Intrusion Detection Systems
Lukáš Šišmiš, Colin Evrard, Etienne Rivière, Tom Barbette

This week I am going to present PAMO, a modified version of the industry-grade Suricata IDS to support offloading pattern matching to the RegEx engine of the BlueField 2.

We first review and analyse the internals of IDSes, focusing on Suricata with the help of one of its maintainers, Lukáš Šišmiš, who did a 6-month exchange with us at UCLouvain (and continued working on it, as the job was much bigger than initially envisioned).

We then benchmarked what the RegEx engine was capable of.

The answer: 51 Gbps with big packets and not too many rules (we employed the widely used Emerging Threats ruleset). Still, the RXP engine, driven from the ARM cores of the BlueField, provides a huge help and relieves the equivalent of 6 or 7 x86 cores.

But the reality is that an IDS is far from being just about rule matching: industry-grade IDSes have complex processing to decide which rules should even be evaluated.

After overcoming many challenges, we evaluate PAMO using a real trace and a window-based mechanism that accelerates replay through parallel traces while leaving temporal features untouched.

While we reduce the payload prefilter CPU processing time to peanuts, the RXP has a cost that brings the improvement down to 6×. As we can’t beat Amdahl’s law, we get a 40% overall performance increase (with one core).
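The arithmetic behind that Amdahl bound can be sketched as follows. The 34% fraction below is illustrative, not a figure measured in the paper: it is simply the fraction of single-core processing time for which a 6× speedup of the offloaded part reproduces the reported ~1.4× overall gain.

```c
#include <math.h>

// Amdahl's law: overall speedup when a fraction p of the work is
// accelerated by a factor s, while the remaining (1 - p) is unchanged.
double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

With p = 0.343 and s = 6, the formula gives roughly 1.4, matching the reported 40% single-core improvement; no offload of a sub-40% component could have done much better.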

Perhaps the most interesting result is how PAMO improves the performance of Suricata on the BlueField 2 itself. In that mode, the IDS runs entirely on the NIC. As the ARM cores are weaker, the improvement reaches 70%.

Come say hello at Middleware’25 in Nashville, or check out our paper!

2-years post-doc position

ORGANISATION/COMPANY

Université catholique de Louvain

Institute of Information and Communication Technologies, Electronics and Applied Mathematics (ICTEAM)

RESEARCH FIELD

Networking, Systems, High-speed packet processing, SmartNICs, P4, NFV

MINIMUM REQUIRED QUALIFICATIONS

The candidate must have obtained a PhD in computer science before the start date.

APPLICATION DEADLINE

Open (continuous screening process), starting ASAP

LOCATION

Belgium › Louvain-la-Neuve

TYPE OF CONTRACT

Post-Doc fellowship for 1+1 year (yearly contract)

2 years starting 1/1/2026

JOB STATUS

Full-time

HOURS PER WEEK

38

OFFER DESCRIPTION

The applicant will join the ENSG to work on leading-edge research topics at the crossroads of networking and systems to build a sustainable IT infrastructure. The group is composed of the PI, 5 PhD students, and 2 post-docs.

The post-doc will join the PI’s group to conduct research on topics that fit the PI’s area of expertise. The applicant will also be expected to advise master’s students and take a leading role with PhD students. The exact project is open to discussion, but the following proposed project describes well the overall direction of the lab.

The Internet was designed at a time when computers were monolithic devices, transferring data over a network of routers and switches. This paradigm does not match the reality of today’s devices, which are composed of several elements of different natures (CPU cores, RAM, storage, NICs, GPUs, etc.). You will join the study of the fundamental challenges for the Internet to catch up with this paradigm shift. Much like atoms were later refined into a set of particles, the communications of hosts must be reconsidered to enable the next leap in the Internet’s evolution.

We will examine the newfound programmability of the network, i.e., P4 switches and Smart NICs, to enable sub-atomic communications over the Internet by delegating intelligence out of the end “hosts”.

The Smart NIC of a host may essentially act as a transparent multiplexer for the sub-devices of the host, bypassing unneeded CPU transfers, and saving time and energy. The Smart NIC will be aware of power-conservative strategies when assigning requests to cores. The group is also conducting research on efficient CPU-aware software pipelines using dynamic compilation to avoid cache and branch misses.

Programmable switches will similarly act as coordinators of the streams through the edge. They will also steer transfers toward the right particles among the increasingly disaggregated datacenter resources serving a provider’s content. To overcome ossification, they may expose information and negotiate a behavior for each particle’s streams to reach one possible servicing entity through the best paths. This will be possible without going back to the “ends”, therefore enabling particle-to-particle encryption as well as network efficiency.

The resulting low-latency communication will enable future use cases such as cloud gaming and latency-critical workloads, connecting nearby particles that are getting virtually closer thanks to 5G and fiber connectivity. In the long term, the vision pursued by the group will bring competitiveness back to the Internet by standardizing the means for such next-generation sub-atomic communication.

SKILLS/QUALIFICATION

  • Publications in known conferences in the field (CoNEXT, NSDI, SIGCOMM, IMC, PAM, …) or journals (ToN, SIGCOMM CCR, TNSM, …)
  • Good comprehension of computer systems and operating systems. Knowledge of techniques like eBPF, kernel bypass, the Xilinx FPGA ecosystem, DPDK, … is a plus
  • Ease with low-level programming in C and/or Rust

SUBMISSION

Please send to tom.barbette@uclouvain.be:

(a) Curriculum vitae;
(b) A letter of motivation;
(c) Links to Master’s and PhD theses (if already defended);
(d) List of publications and links to PDF (not behind a paywall);
(e) If applicable, links to examples of personal software contributions.

REQUIRED LANGUAGES

ENGLISH: at least B2 (upper intermediate)

French is not required

Other positions

See the range of possibilities at the Efficiency of Networked Systems Group

Multi-End QUIC: A Transport Protocol to Enable One-to-Many Communications

https://dl.acm.org/doi/10.1145/3769700.3771696

El Mehdi Makhroute, Quentin De Coninck, Tom Barbette

The Internet has shifted from an end-to-end paradigm to an end-to-ends paradigm: to serve a provider’s content (e.g. a web page, an app feed, …), the client must connect to many ends to fetch various types of resources like web documents, pictures, scripts, news feeds, videos, etc. However, even the recent QUIC transport protocol has failed to catch up with this change of paradigm, and the user is forced to waste many round-trip times to fetch the entire content from multiple servers. In this paper, we propose Multi-End QUIC, an extension of Multipath QUIC that enables the establishment of sub-streams directly to backend servers or third-party servers. Multi-End QUIC alleviates the need for edge proxies that re-encode and delay those sub-streams, and is able to bypass the relays entirely. This leads to a ~50% latency improvement in a preliminary experiment.

acm

OpenDesc at HotNets’25!

Our paper “OpenDesc: From Static NIC Descriptors to Evolvable Metadata Interfaces” will be presented today at HotNets’25.

In OpenDesc, we propose to use P4 as an interface to define packet descriptors. OpenDesc exposes NIC capabilities and matches them with application intents. We propose a prototype compiler that generates accessors able to directly extract metadata from a negotiated NIC descriptor, without the need for intermediate data structures like sk_buff, xdp_sock, rte_mbuf, and the like.

Read more below!

Flexicast QUIC: Rethinking Multicast for the QUIC Era in SIGCOMM CCR

Our latest paper with Louis Navarre, Quentin De Coninck, Tom Barbette and Olivier Bonaventure has recently appeared in SIGCOMM CCR!

In short: Flexicast QUIC brings multicast back to the Internet by blending it with unicast, all within QUIC. It offers scalability where multicast works, and robustness where it doesn’t — making it a practical transport for the next generation of large-scale applications.

acm ; code

In long:

When distributing live video, software updates, or cloud gaming streams, today’s Internet almost exclusively relies on unicast: each receiver gets its own copy of the data. This is simple and robust but highly inefficient, especially when thousands of receivers consume the same content. The cost is felt both at the sender — which must generate and encrypt per-receiver packets — and in the network, which carries redundant traffic.

Multicast was designed decades ago to solve exactly this problem: a source transmits once, and routers replicate packets along a multicast tree. But despite its promise, multicast never became mainstream on the global Internet. It is difficult to deploy across ISPs, hard to monetize, and fragile — applications always need to fall back to unicast anyway. Most content providers gave up and built massive unicast infrastructures instead.

With the wide deployment of QUIC, a modern transport protocol running above UDP, there is a chance to revisit multicast — not at the IP layer, but directly at the transport layer. This is where Flexicast QUIC comes in.

Flexicast QUIC uses a multicast stream shared between users, plus individual unicast streams to send feedback or act as a fallback. The feedback is used to send more FEC frames to compensate for losses; if a client cannot keep up, unicast is used as a fallback.

The Idea of Flexicast QUIC

Flexicast QUIC, presented in our SIGCOMM CCR 2025 paper, extends Multipath QUIC to combine the efficiency of multicast with the reliability of unicast. The idea is to make multicast flexible:

  • Each receiver gets a unicast QUIC path for control and fallback.
  • The sender also establishes a flexicast flow: a shared, unidirectional path encrypted with a common key and intended for all receivers.
  • If multicast routing is available, this flow is carried efficiently through the network. If not, the sender can still replicate the packets itself and deliver them over unicast.

Receivers can join or leave the flexicast flow at any time. If multicast fails for one receiver, it automatically falls back to unicast — without affecting others. All this happens within the same QUIC connection, so applications don’t need to juggle two protocols.

Key Design Points

Flexicast QUIC builds on QUIC’s extensible design:

  • Per-path keys: unlike regular Multipath QUIC, each unicast path and the flexicast flow use distinct encryption keys.
  • Reliability: acknowledgments are sent over unicast paths, aggregated at the sender to avoid the classic ack implosion problem.
  • Congestion control: the sender maintains per-receiver congestion states. If one receiver drags the group down, it can be removed from the flexicast flow and served over unicast.
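To make the last design point concrete, here is a minimal sketch in C of how a sender might evict a receiver that drags the group down. The structure and function names are invented for illustration; the actual implementation lives in Rust inside quiche and is considerably more involved.

```c
#include <stddef.h>

// Hypothetical per-receiver state at the sender.
struct receiver {
    int id;
    double delivery_rate;   // estimated from this receiver's unicast ACKs
    int on_flexicast;       // 1 = served by the shared flexicast flow
};

// Move receivers that would drag the flexicast flow below its target
// rate over to their own unicast path, leaving the others untouched.
// Returns the number of receivers evicted.
size_t evict_slow_receivers(struct receiver *r, size_t n, double target_rate) {
    size_t evicted = 0;
    for (size_t i = 0; i < n; i++) {
        if (r[i].on_flexicast && r[i].delivery_rate < target_rate) {
            r[i].on_flexicast = 0;  // fall back to unicast delivery
            evicted++;
        }
    }
    return evicted;
}
```

The key property is isolation: eviction only flips the slow receiver's own flag, so the shared flow keeps running at full rate for everyone else.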

Implementation and Results

We implemented Flexicast QUIC in Cloudflare’s quiche library (Rust), adding ~10,000 lines of code and 5,000 lines of tests. The evaluation was run on CloudLab and emulated networks.

Flexicast QUIC scales much better than pure unicast QUIC delivery

Scalability

  • Unicast QUIC: saturates at ~200 receivers (~20 Gbps). CPU is the bottleneck, as every packet must be encrypted per receiver.
  • Flexicast QUIC: supports 1000 receivers and delivers >80 Gbps, over 4× higher throughput than unicast QUIC, with acceptable CPU usage.
  • With a small acknowledgment delay (5 ms), Flexicast QUIC perfectly matches the ideal UDP baseline.

Even without multicast in the network, Flexicast QUIC still helps: the sender can replicate encrypted packets using sendmmsg, which is far cheaper than generating per-receiver packets.
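A minimal sketch of that replication path in C (simplified relative to the actual quiche-based implementation): the payload is encrypted once with the shared flexicast key, then fanned out to all receivers in a single syscall instead of n per-receiver encryptions and n sendto() calls.

```c
#define _GNU_SOURCE
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

// Replicate one already-encrypted datagram to n receivers with a single
// sendmmsg() call. Every message shares the same iovec (the payload is
// never copied); only the destination address differs.
int replicate_datagram(int fd, const void *pkt, size_t len,
                       struct sockaddr_in *dsts, unsigned int n) {
    struct mmsghdr msgs[64];
    struct iovec iov = { .iov_base = (void *)pkt, .iov_len = len };
    if (n > 64)
        return -1;                          // keep the sketch simple
    for (unsigned int i = 0; i < n; i++) {
        memset(&msgs[i], 0, sizeof(msgs[i]));
        msgs[i].msg_hdr.msg_name = &dsts[i];
        msgs[i].msg_hdr.msg_namelen = sizeof(dsts[i]);
        msgs[i].msg_hdr.msg_iov = &iov;
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    return sendmmsg(fd, msgs, n, 0);  // number of datagrams handed to the kernel
}
```

sendmmsg(2) batches the per-datagram syscall cost, which is exactly what makes sender-side replication cheaper than generating per-receiver packets.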

Robustness

We tested Flexicast QUIC with a 5 Mbps live video stream under failing multicast trees. Some receivers randomly lost multicast connectivity, forcing fallback to unicast. Results:

  • Other receivers were unaffected.
  • Video quality stayed excellent (SSIM > 0.99 for 99.4% of frames).
  • Latency remained low, with only small tail increases during recovery.

This shows Flexicast QUIC provides seamless continuity even when multicast is unreliable.

Why It Matters

Flexicast QUIC makes multicast practical again:

  • Efficient: one packet can serve thousands of receivers.
  • Robust: unicast fallback is built-in, so failures don’t break the stream.
  • Practical: works today, even without multicast routers.
  • Deployable: it’s just QUIC — already used by major Internet services.

This makes it promising for content delivery networks, software updates, and live streaming at scale.

What’s Next?

Our future work will explore:

  • Forward Erasure Correction to improve reliability.
  • Smarter flow control for heterogeneous receivers.
  • Source authentication to defend against spoofed multicast traffic.
  • Multiple flexicast flows (e.g., different video bitrates).
  • Dynamic key rotation for large, changing groups.
  • Inter-domain deployment using AMT and TreeDN.
  • New use cases like software updates or gaming.

The source code and experiment scripts are open-source: Flexicast QUIC on GitHub.

ASNI: Rethinking Packet I/O for High-Performance Networking

Accepted paper to be presented at CoNEXT’25 ; Nikita Tyunyayev (UCLouvain), Clément Delzotti (UCLouvain), Haggai Eran (NVIDIA) and Tom Barbette (UCLouvain)

In recent years, bypassing the Linux kernel networking stack has become a common strategy to accelerate packet processing. Techniques like DPDK eliminate kernel overhead, enabling network functions to run at much higher speeds. However, these solutions do not fundamentally change how packets are exchanged between the Network Interface Controller (NIC) and the CPU.

At the core of this communication are descriptors: NIC-specific metadata structures that reference memory buffers and encapsulate information such as VLAN tags, flow IDs, tunnel IDs, timestamps, and L3/L4 protocol information. Because descriptor formats vary across NIC vendors, both the Linux networking stack and user-space drivers like DPDK must translate these proprietary formats into a generic representation. This translation consumes substantial CPU resources, even though most of the metadata is typically unused by applications. Additionally, applications often perform further metadata transformations themselves. X-Change addresses some of those inefficiencies by proposing a unified model that merges driver-level and application-level metadata.
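To make that translation cost concrete, here is a hedged sketch in C with an invented vendor descriptor format (real NIC descriptor layouts differ and are vendor-specific): every received packet pays for copying and re-encoding fields into a generic structure, whether or not the application reads them.

```c
#include <stdint.h>

// Hypothetical vendor-specific RX descriptor; field names are invented
// for illustration only.
struct vendor_rx_desc {
    uint64_t buf_addr;    // reference to the packet's memory buffer
    uint16_t pkt_len;
    uint16_t vlan_tci;
    uint32_t flow_id;
    uint32_t flags;       // bit 0: VLAN present, bit 1: L4 checksum ok
};

// Generic metadata the stack hands to applications, in the spirit of
// DPDK's rte_mbuf fields (simplified here).
struct generic_meta {
    uint16_t pkt_len;
    uint16_t vlan_tci;
    uint32_t flow_id;
    int has_vlan;
    int l4_csum_ok;
};

// The per-packet translation step the paper identifies as overhead.
void translate(const struct vendor_rx_desc *d, struct generic_meta *m) {
    m->pkt_len    = d->pkt_len;
    m->vlan_tci   = d->vlan_tci;
    m->flow_id    = d->flow_id;
    m->has_vlan   = (d->flags & 0x1) != 0;
    m->l4_csum_ok = (d->flags & 0x2) != 0;
}
```

At tens of millions of packets per second, even this handful of loads, shifts, and stores per packet adds up to a substantial share of CPU cycles, which is what motivates skipping or tailoring the translation.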

Ensō introduced a streaming interface that eliminates pointer indirection by delivering packets as a continuous array. While this improves performance through better memory locality and reduced overhead, it introduces challenges when packets need to be processed out of order, requiring costly copying operations. 

Comparing I/O processing models

Introducing ASNI: Application-Specific Network Interface 

In our CoNEXT 2025 paper, we present ASNI, a new approach that builds upon, rather than replaces, the traditional packetized interface. ASNI delivers packets in large, contiguous buffers, each capable of holding multiple packets and their metadata, organized in a format tailored to the specific needs of the application. By offloading the majority of the driver datapath from the CPU to the NIC, ASNI significantly reduces CPU overhead. 

Our design brings three key benefits:

  •     Improved PCIe efficiency through better buffer utilization 
  •     Higher packet and metadata locality, reducing cache misses 
  •     Application-specific metadata layouts and content, avoiding unnecessary transformations 

In NFV scenarios, ASNI outperforms DPDK, the dominant kernel-bypass solution, by serving 2.2× more traffic, demonstrating its effectiveness in high-throughput, low-latency environments. 

paper ; code ; acm

High-Speed Forward Erasure Correction with HIRT

Modern low-latency applications, such as online gaming, remote control systems, and real-time file access, are increasingly sensitive to packet loss and delay. Traditional transport protocols like TCP and QUIC mitigate packet loss through retransmissions, but this leads to additional round-trip delays that can significantly impact tail latency. Our paper “A High-Speed Robust Tunnel using Forward Erasure Correction in Segment Routing”, co-authored with Louis Navarre and François Michel and published at ICNP’24, introduces HIRT, a system that applies Forward Erasure Correction (FEC) at the network layer to proactively recover from packet losses without relying on retransmissions.

HIRT routers create repair packets at the network layer, protecting a whole path at once

HIRT is built on IPv6 Segment Routing (SRv6) and leverages Random Linear Coding (RLC) to inject repair packets into the network path. These packets can be used by the receiver to reconstruct lost data without waiting for retransmissions, thereby reducing tail latency. Unlike traditional FEC schemes embedded at the transport layer, HIRT operates transparently at the network layer, requiring no changes to end-host protocols. This allows network operators to deploy it as a service that selectively protects latency-sensitive traffic across tunnels, including encrypted VPN flows.
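As a simplified illustration of the principle (HIRT actually uses Random Linear Coding over a finite field, which tolerates several losses per window), a single XOR repair packet computed over a window of k source packets lets the receiver rebuild any one lost packet without a retransmission:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Build one XOR repair packet over a window of k equal-length source packets.
void xor_repair(const uint8_t *src[], size_t k, size_t len, uint8_t *repair) {
    memset(repair, 0, len);
    for (size_t i = 0; i < k; i++)
        for (size_t j = 0; j < len; j++)
            repair[j] ^= src[i][j];
}

// Recover the single missing packet of the window: XOR the repair packet
// with the n = k - 1 packets that survived.
void xor_recover(const uint8_t *survivors[], size_t n, const uint8_t *repair,
                 size_t len, uint8_t *out) {
    memcpy(out, repair, len);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < len; j++)
            out[j] ^= survivors[i][j];
}
```

RLC generalizes this by multiplying each source packet by random coefficients in GF(2^8) before summing, so several independent repair packets can recover several losses in the same window.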

The HIRT implementation uses FastClick to scale across multiple cores

We implemented HIRT using FastClick, achieving line-rate processing beyond 60 Gbps on commodity servers, even under >3% packet loss. The system adapts redundancy dynamically based on feedback from the receiver and employs several optimizations, including DPDK kernel bypass, SIMD acceleration, and a multithreaded architecture. Each processing stream is handled independently, avoiding bottlenecks and enabling scalability across cores. HIRT also detects overload conditions and throttles encoding/decoding to avoid contributing to packet drops under high CPU load.

One of HIRT’s evaluation use cases shows how HIRT can recover packets lost on Starlink without waiting for retransmissions

Evaluations over the Starlink LEO satellite network, which exhibits bursty, non-congestion-induced losses, demonstrate HIRT’s practical benefits. It reduced HTTP tail latency, and improved completion times for 1 MB file transfers by 20% compared to loss-based recovery alone. Similarly, latency for NFS read/write operations improved by up to 14%. Compared with prior FEC solutions like Maelstrom, HIRT achieves better recovery at lower overhead, thanks to its adaptive approach and more powerful coding strategy. Its network-layer placement also makes it easier to deploy incrementally without disrupting application stacks.

paper ; source

CLOSED: 3-year PhD position on Network Functions for privacy-preserving monitoring and policy-enforcing systems for a post-growth Internet

CONTEXT

This project is part of the MimPG project funded by INNOVIRIS (Brussels Region). The goal is to limit network growth to meet carbon-reduction targets. Usage might need to be limited at peak times to curb this growth. The question is, therefore, what type of traffic should be prioritized for minimal impact on users. The political side of the question will be addressed with panels of citizens. The applicant will work on monitoring and policy-enforcement network functions that guarantee the privacy of network users, both to inform a regulatory body and to create policies that cannot harm minorities by design.

ORGANISATION/COMPANY

Université catholique de Louvain

Institute of Information and Communication Technologies, Electronics and Applied Mathematics (ICTEAM)

RESEARCH FIELD

Networking, Systems, NFV, Privacy

MINIMUM REQUIRED QUALIFICATIONS

The candidate must have obtained a Master’s degree in computer science before the start date.

APPLICATION DEADLINE

October 1st, 2024

LOCATION

Louvain-La-Neuve, Belgium

TYPE OF CONTRACT

PhD funded for 3 years. A typical PhD in Belgium is 4 years. Hence, the applicant will apply for supplementary funds with the help of the PI.

JOB STATUS

Full-time

HOURS PER WEEK

38

OFFER DESCRIPTION

The applicant will join the ENSG.

The objective of this project is to study opportunities for post-growth metropolitan internet access as a means to reduce the environmental impact of digital technologies. Inspired by Kate Raworth’s Doughnut Economics, post-growth industries strive to operate between the socio-economic floor corresponding to the satisfaction of the basic needs of all actors (individuals’ fundamental rights and social cohesion but also economic viability) and the ecological ceiling provided by planetary boundaries that restrict, e.g., greenhouse-gas emissions and the availability of raw materials.

ENSG is the leader of WP2, which comprises the integration of privacy-preserving techniques with a monitoring system (based on Retina). The monitor will allow for the observation of traffic while guaranteeing the privacy of the users. It will be the technical arm of WP1’s study, which includes the citizens of Brussels. For instance, a regulatory body might ask, “What is taking most of the Internet bandwidth?” with guarantees on the output of the program. When carbon engagements dictate limitations on bandwidth, a policy system will prioritize traffic according to WP1’s decisions, enforcing that no rule harms minorities.

SKILLS/QUALIFICATION

  • Successful student
  • Good grades in the operating systems, computer systems/architecture, and networking courses
  • Knowledge of Rust is a plus; otherwise, comfort with low-level languages like C
  • Autonomous and research-minded

SUBMISSION

Please send to tom.barbette@uclouvain.be:

(a) Curriculum vitae;

(b) Motivation letter;

(c) Transcript of grades (for freshly graduated/graduating students).

We will get back to you; the selection process generally includes a first informal meeting via Teams, followed by an in-person interview (if possible).

REQUIRED LANGUAGES

ENGLISH: at least B1

French is not required but is a plus

Other positions

See the range of possibilities at the Efficiency of Networked Systems Group