
Network Function Accelerators, FACs, NICs and SmartNICs

A Network Interface Card (NIC) is a component which connects computers via networks, these days mostly via IEEE Ethernet. But what makes a NIC a SmartNIC?

With the push for Software-Defined Networking, (mostly open-source) software running on standard server CPUs became a more flexible and cost-effective alternative to custom networking silicon and appliances. However, in the post-Dennard-scaling era, server CPU performance improvements cannot keep up with the increasing computational demand of faster network port speeds.

This widening performance gap creates the need for so-called SmartNICs. SmartNICs not only implement a Domain-Specific Architecture for network processing but also offload portions of the network processing stack from the host CPUs and thereby free up CPU cores to run the "real" application.

According to Gartner, Function Accelerator Cards (FACs) incorporate functions on the NIC that would otherwise have been done on dedicated network appliances. Hence, all FACs are essentially NICs, but not all NICs/SmartNICs are FACs. When deployed properly, FACs can increase bandwidth, reduce transport latencies and improve compute efficiency, which translates to lower energy consumption.

MLE has partnered with FPGA vendors, Fraunhofer Institutes and EMS partners to implement FPGA-based FACs which deliver cost-efficient solutions for ultra-reliable, low-latency, deterministic networking.

Ultra-Reliable, Low-Latency, Deterministic Networking

With ultra-reliable, low-latency, deterministic networking we have borrowed a concept from 5G wireless communication (5G URLLC) and applied it to wired LAN (Local Area Network) and WAN (Wide Area Network) communication:

  • Ultra-Reliable means no packets get lost in transport
  • Low-Latency means that packets get processed by a FAC at a fraction of CPU processing times
  • Deterministic means that there is an upper bound for transport and for processing latency
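
The three properties above can be read as checks over a measurement trace. The following is a minimal sketch and not MLE tooling: the function names, the trace format and the 500 ns bound are assumptions made purely for illustration.

```python
from typing import Iterable, Sequence


def is_ultra_reliable(sent_ids: Iterable[int], received_ids: Iterable[int]) -> bool:
    """Ultra-reliable: every packet that was sent also arrived."""
    return set(sent_ids) <= set(received_ids)


def is_low_latency(fac_latency_ns: float, cpu_latency_ns: float) -> bool:
    """Low-latency: FAC processing takes less time than CPU processing."""
    return fac_latency_ns < cpu_latency_ns


def is_deterministic(latencies_ns: Sequence[float], bound_ns: float) -> bool:
    """Deterministic: no observed latency exceeds the stated upper bound."""
    return all(latency <= bound_ns for latency in latencies_ns)


# Hypothetical trace: packet IDs and per-packet latencies in nanoseconds.
sent = [1, 2, 3, 4]
received = [1, 2, 3, 4]
latencies = [410.0, 395.0, 402.0, 420.0]

print(is_ultra_reliable(sent, received))
print(is_deterministic(latencies, bound_ns=500.0))
```

Note that a trace can only refute an upper bound, not prove it; a guaranteed bound has to come from the design of the datapath itself.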

We do this by combining the TCP protocol, fully accelerated (in FPGA or ASIC using NPAP), with TSN (Time Sensitive Networking) optimized for stream processing at data rates of 10/25/50/100 Gbps. These so-called TCP-TSN-Cores not only give us precise time synchronization but also traffic shaping, traffic scheduling and stream reservation with priorities.

For more information, please refer to the MLE Technical Brief "Deterministic Networking with TCP-TSN-Cores for 10/25/50/100 Gigabit Ethernet" (MLE-TB20201203).

Unique and Cost-Efficient Combination of Open Source

We believe that FPGAs are very well positioned as programmable compute engines for network processing because FPGAs can implement "stream processing" more efficiently than CPUs or GPUs can. In particular, when the networking data stays local to the FPGA fabric, Data-in-Motion processing can be done within 100s of clock cycles (which is 100s of nanoseconds) and results can be sent back a few 100 clock cycles later, an aspect which is referred to as Full-Accelerated In-Network Compute.
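
To make the relation between clock cycles and nanoseconds concrete, here is a small sketch of the conversion. The 250 MHz fabric clock is an assumption chosen for illustration (one cycle then takes 4 ns); actual clock rates depend on the device and the design.

```python
def cycles_to_ns(cycles: int, clock_hz: float) -> float:
    """Convert a number of clock cycles into nanoseconds."""
    return cycles / clock_hz * 1e9


# Assumed fabric clock: 250 MHz, i.e. 4 ns per clock cycle.
FABRIC_CLOCK_HZ = 250e6

for cycles in (100, 300, 500):
    latency_ns = cycles_to_ns(cycles, FABRIC_CLOCK_HZ)
    print(f"{cycles} cycles -> {latency_ns:.0f} ns")
```

Under this assumption, a few hundred cycles correspond to roughly one to two microseconds at most, which stays in the "100s of nanoseconds" range for the lower cycle counts.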

While FPGA technology has been at the forefront of Moore's Law, and modern devices such as AMD/Xilinx Versal Prime, Intel Agilex or Achronix Speedster7t can hold millions of gates, FPGA processing resources must be used wisely when Bill-of-Materials costs are important. Therefore, at MLE we have put together a unique combination of FPGA and open-source software to achieve best-in-class performance while addressing cost metrics more in line with CPU-based SmartNICs. Among the open-source technologies we borrow from are:

High-Level Synthesis plays a vital role in our implementation as it allows MLE and MLE customers to turn algorithms implemented in C/C++/SystemC into efficient FPGA logic which is portable between different FPGA vendors.

To build a high-performance FAC platform, portions of the above have been integrated together with proven 3rd party networking technologies:

Corundum In-Network Compute + TCP Full Accelerator

Corundum is an open-source FPGA-based NIC which features a high-performance datapath between multiple 10/25/50/100 Gigabit Ethernet ports and the PCIe link to the host CPU. Corundum has several unique architectural features: For example, transmit, receive, completion, and event queue states are stored efficiently in block RAM or ultra RAM, enabling support for thousands of individually-controllable queues.

MLE is a contributor to the Corundum project. Please visit our Developer Zone for services and downloads for Corundum full system stacks pre-built for various in-house and off-the-shelf FPGA boards.

MLE combines the Corundum NIC with NPAP, the TCP/UDP/IP Full Accelerator from Fraunhofer HHI, via a so-called TCP Bypass which minimizes the processing latency of network packets: Each packet gets processed in parallel by the Corundum NIC and by NPAP. The moment it can be determined that the packet shall be handled by NPAP (based on IP address and port number), the packet gets invalidated inside the Corundum NIC. If a packet shall not be processed by NPAP, it gets dropped in NPAP and is processed solely by the Corundum NIC.

Fundamentally, this implements network protocol processing in multiple stages: Latency-sensitive network data gets processed using full acceleration, while all other network traffic is handled by a companion CPU and/or by the host CPU.
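
The steering decision behind the TCP Bypass can be modelled in software as follows. This is only a sketch of the decision logic, not the FPGA implementation: in hardware both blocks see every packet in parallel and the losing side invalidates or drops its copy. The class and field names are hypothetical, and matching on destination IP address and destination port is an assumption, since the exact match key is not specified here.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Packet:
    dst_ip: IPv4Address
    dst_port: int
    payload: bytes = b""


class BypassClassifier:
    """Decides which block keeps a packet: the full accelerator or the NIC."""

    def __init__(self, accelerated: Iterable[Tuple[IPv4Address, int]]):
        # (IP address, port) pairs whose traffic is latency sensitive.
        self._accelerated = set(accelerated)

    def owner(self, pkt: Packet) -> str:
        # Match -> the accelerator keeps the packet, the NIC invalidates it.
        if (pkt.dst_ip, pkt.dst_port) in self._accelerated:
            return "NPAP"
        # No match -> the accelerator drops it, the NIC path handles it.
        return "NIC"


classifier = BypassClassifier([(IPv4Address("192.168.1.10"), 5001)])

fast = Packet(IPv4Address("192.168.1.10"), 5001)
other = Packet(IPv4Address("192.168.1.10"), 22)

print(classifier.owner(fast))
print(classifier.owner(other))
```

In this model exactly one of the two paths ends up owning each packet, which mirrors the multi-stage processing described above.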

AMD OpenNIC + TCP Full Accelerator

The OpenNIC project provides an FPGA-based NIC platform with two components, an FPGA shell and a Linux kernel driver. The OpenNIC FPGA shell is equipped with well-defined data and control interfaces and is designed to enable easy integration of user logic:

MLE combines OpenNIC with NPAP, the TCP/UDP/IP Full Accelerator from Fraunhofer HHI, by integrating NPAP into the 250 MHz User Logic Box. Because NPAP features a 128-bit wide bi-directional datapath, this allows processing at line rates of 32 Gbps (128 bits per clock cycle at 250 MHz).
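
The 32 Gbps figure follows directly from datapath width times clock frequency. A short sketch of that arithmetic is shown below; the function name is made up for illustration, and the result is the raw datapath throughput per direction without any protocol overhead.

```python
def datapath_throughput_gbps(width_bits: int, clock_mhz: float) -> float:
    """Raw datapath throughput: bits per cycle times cycles per second."""
    # width_bits * clock_mhz gives Mbit/s; divide by 1000 for Gbit/s.
    return width_bits * clock_mhz / 1000.0


# Values from the text: 128-bit datapath in the 250 MHz User Logic Box.
print(datapath_throughput_gbps(128, 250))  # 32.0
```

The same formula shows what a wider datapath or a faster clock would buy: for example, doubling the width at the same clock doubles the raw throughput.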

Application Areas

MLE's Function Accelerators are of particular value where network bandwidth and latency constraints are key:

  • Wired and Wireless Networking
  • Acceleration of Software-Defined Wide Area Networks (SD-WAN)
    • Video Conferencing
    • Online Gaming
    • Industrial Internet-of-Things (IIoT)
  • Handling of Application Oriented Network Services
  • Mobile 5G User-Plane Function Acceleration
  • Mobile 5G URLLC Core Network Processing with TSN
  • Offloading Open vSwitch (OvS), vRouter, etc.

Key Benefits

The following shows the key benefits of MLE's technology by comparing open-source SD-WAN switching in native CPU software mode against MLE's Ultra-Reliable Low-Latency Deterministic Networking:

Compared with plain CPU software processing, MLE's Ultra-Reliable Low-Latency Deterministic Networking increases network bandwidth and throughput to close to Ethernet line rates, in particular for smaller packets, which reduces the need for over-provisioning within the backbone. In addition, processing latencies can be shortened significantly, which is important, for example, when delivering a lively audio/video conferencing experience over WAN.
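
Why smaller packets are the hard case can be seen from the packet rate that line rate implies. The sketch below uses standard Ethernet framing overhead (7-byte preamble, 1-byte start-of-frame delimiter and 12-byte inter-frame gap, 20 bytes in total); the function names are illustrative and the numbers are generic Ethernet arithmetic, not MLE measurements.

```python
# Per-frame overhead on the wire: preamble (7) + SFD (1) + inter-frame gap (12).
ETHERNET_OVERHEAD_BYTES = 20


def packets_per_second(line_rate_gbps: float, frame_bytes: int) -> float:
    """Maximum frame rate at a given line rate and frame size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return line_rate_gbps * 1e9 / bits_on_wire


def per_packet_budget_ns(line_rate_gbps: float, frame_bytes: int) -> float:
    """Time available to process one frame without falling behind line rate."""
    return 1e9 / packets_per_second(line_rate_gbps, frame_bytes)


for frame_bytes in (64, 512, 1518):
    pps = packets_per_second(10, frame_bytes)
    budget = per_packet_budget_ns(10, frame_bytes)
    print(f"{frame_bytes:5d} B: {pps / 1e6:6.2f} Mpps, {budget:7.1f} ns per packet")
```

At 10 Gigabit Ethernet, minimum-size 64-byte frames arrive at about 14.88 million packets per second, leaving roughly 67 ns per packet, which is why per-packet software overhead dominates for small packets.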

Availability

MLE's Ultra-Reliable Low-Latency Deterministic Networking is available as a licensable full system stack and delivered as an integrated hardware/firmware/software solution. In close collaboration with partners in the FPGA ecosystem, MLE has ported and tested variations of the stack on a growing list of FPGA cards. Currently, this list comprises high-performance 3rd party hardware as well as MLE-designed cost-optimized hardware:

NPAC-Ketch, MLE-designed single-slot FHHL PCIe card
  • Cost-optimized Intel Stratix 10 GX 400 FPGA
  • Optional 4GB DDR4 SO-DIMM attached to Programmable-Logic
  • 4x SFP+ (4x 10 GigE)
  • PCIe 3.1 8 GT/sec x8 lanes
  • 50 Watts TDP passive cooling front-to-back
  • Status: Early Access (as of 2Q2022)

Sidewinder-100, Fidus Systems designed single-slot FHFL PCIe card
  • AMD/Xilinx ZU19EG MPSoC FPGA
  • ARM A53 Quad-Core CPU with 1 GB DDR4 DRAM running Linux
  • 2x DDR4 SO-DIMM (connected to PS & PL)
  • 2x QSFP28 (2x 100 GigE or 8x 25 GigE or 8x 10 GigE)
  • 1x 1 GigE RJ45 (for separate control plane access)
  • PCIe 3.1 8 GT/sec x8 lanes
  • 2x PCIe 3.1 x4 M.2 NVMe SSD
  • 125 Watts TDP active cooling
  • Status: Early Access (as of 2Q2022)

Alveo U280, AMD/Xilinx-designed dual-slot FHFL PCIe card
  • AMD/Xilinx UltraScale+ FPGA
  • 32GB DDR4 DRAM plus 8GB HBM2 DRAM
  • 2x QSFP28 (2x 100 GigE or 4x 25 GigE or 8x 10 GigE)
  • PCIe 4.0 16 GT/sec x8 lanes
  • 225 Watts TDP active cooling
  • Status: Early Access (as of 2Q2022)

N6000-PL, Intel-designed single-slot FHHL PCIe card
  • Intel Agilex AGF014 F Series FPGA
  • 4x 4GB DDR4 SO-DIMM attached to Programmable-Logic
  • 2x QSFP28 (2x 100 GigE or 4x 25 GigE or 8x 10 GigE)
  • PCIe 4.0 16 GT/sec x16 lanes
  • ARM A53 Quad-Core CPU with 1 GB DDR4 DRAM running Linux
  • 125 Watts TDP passive cooling
  • Status: Planned for 3Q2022

NPAC-Yawl, MLE-designed single-slot HHHL PCIe card
  • Cost-optimized low-density SoC-FPGA
  • Optional 4GB DDR4 SO-DIMM attached to Programmable-Logic
  • 1x QSFP28 (4x 25 GigE)
  • 1x 1 GigE RJ45 (for separate control plane access)
  • PCIe 4.0 16 GT/sec x8 lanes
  • 75 Watts TDP passive cooling front-to-back
  • Status: Planned for 4Q2022

Please contact us in case you seek support for other FPGA cards!

Documents and Datasheets

Download the brochure "Ultra-Reliable, Low-Latency, Deterministic Networking".

Download the brochure for the Function Accelerator Card "NPAC-40G".