Alan Elder explores the extremes of code performance and efficiency needed to process millions of packets per second.
Alan Elder is a Principal Software Engineering Manager on Azure for Operators at Microsoft, which he joined through the acquisition of Metaswitch Networks. He has spent more than 15 years in IP networking and currently focuses on high-performance packet processing in software.
QCon London celebrates the transformative power of software and serves as a hub for sharing knowledge and fostering innovation in the developer community. The conference targets technical team leads, architects, engineering directors, and project managers who drive innovation within their teams.
Elder: Claiming the world's fastest code is a bold statement, so let's start by defining what fast means. My background is in networking, where 100 Gig NICs are now the norm. What does a 100 Gig NIC actually mean? With reasonable assumptions about packet size, it works out to roughly 10 million packets per second, which leaves about 100 nanoseconds to process each packet.
A hundred nanoseconds is roughly the time light takes to travel from the stage to the windows at the back of this room; that is the entire processing deadline. In CPU terms it equates to about 300 clock cycles per core for all the work a packet needs. That is what "fast" means in this context.
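As a rough sanity check, here is that arithmetic as a minimal sketch, assuming roughly 1,250-byte packets and a 3 GHz core (both figures are assumptions for illustration; the talk only gives the rounded results):

    #include <cstdio>

    int main() {
        // Assumed figures, for illustration only.
        constexpr double link_bps          = 100e9;     // 100 Gbps NIC
        constexpr double packet_bits       = 1250 * 8;  // ~1,250-byte packets
        constexpr double core_hz           = 3e9;       // ~3 GHz core

        constexpr double packets_per_sec   = link_bps / packet_bits;      // ~10 million
        constexpr double ns_per_packet     = 1e9 / packets_per_sec;       // ~100 ns
        constexpr double clocks_per_packet = core_hz / packets_per_sec;   // ~300 clocks

        std::printf("%.0f packets/s, %.0f ns/packet, %.0f clocks/packet\n",
                    packets_per_sec, ns_per_packet, clocks_per_packet);
        return 0;
    }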
What's on the agenda today? We're going to walk through seven techniques for making code faster, with hands-on examples and code that show how each optimization changes performance. My name is Alan, I've spent about 15 years working in networking, and I'm at Microsoft working on Azure for Operators, building cloud services for network operators.
To give you a relatable scenario, think about using 5G on your phone. Data starts at your phone, travels to a nearby radio tower, and moves through various network points until it reaches the 5G packet core, the brains of the network where the critical processing happens. What happens at the packet core? Incoming data arrives encapsulated in several layers of headers carrying identity and session information that is needed to direct it through the network. The packet core parses that information and decides how each packet should be handled.
That can mean applying policies such as quality of service, network address translation, or lawful intercept. Once the treatment for a packet has been decided, its headers are rewritten and it is sent on its way. This is exactly what happens when you load a site like BBC News from your phone.
Demonstrating with a real 5G packet core would be too complex, so a simplified model is used instead: a packet processing device that matches on header information and rewrites it according to predefined rules. The rest of the talk iterates on the code for this model, improving its performance step by step and measuring each improvement.
The process begins with extracting a lookup key from the packet header, which determines how the packet should be handled. The handling logic starts with a basic lookup using an unordered map (std::unordered_map from the C++ standard library). The match structure returned by the lookup dictates how the packet's header should be rewritten, a step encapsulated in an apply_rewrite function. Although simplistic, initial benchmarks show about 1,000 clocks per packet, which is slow but expected, since nothing has been optimized yet.
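A minimal sketch of that baseline, with hypothetical types and helper names (lookup_key_t, match_t, extract_key, apply_rewrite) standing in for the real code:

    #include <cstdint>
    #include <cstring>
    #include <unordered_map>

    // Hypothetical structures standing in for the real packet-processing code.
    struct lookup_key_t {
        uint64_t hi, lo;
        bool operator==(const lookup_key_t &o) const { return hi == o.hi && lo == o.lo; }
    };
    struct key_hash {
        std::size_t operator()(const lookup_key_t &k) const {
            return k.hi ^ (k.lo * 0x9e3779b97f4a7c15ULL);
        }
    };
    struct match_t {
        uint8_t mask[32];   // header bits to keep
        uint8_t value[32];  // header bits to write in
    };

    using flow_table_t = std::unordered_map<lookup_key_t, match_t, key_hash>;

    // Build the lookup key from the first 16 bytes of the header.
    static lookup_key_t extract_key(const uint8_t *header) {
        lookup_key_t k;
        std::memcpy(&k.hi, header, 8);
        std::memcpy(&k.lo, header + 8, 8);
        return k;
    }

    // Rewrite the 32-byte header: AND with the mask, then OR in the new bits.
    static void apply_rewrite(uint8_t *header, const match_t &m) {
        for (int i = 0; i < 32; i++) {
            header[i] = (header[i] & m.mask[i]) | m.value[i];
        }
    }

    void process_packet(uint8_t *header, const flow_table_t &table) {
        lookup_key_t key = extract_key(header);
        auto it = table.find(key);
        if (it != table.end()) {
            apply_rewrite(header, it->second);
        }
    }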
We'll begin with a straightforward optimization: inlining. Take a frequently executed function such as process_packet, which is called about 10 million times per second, and look at what happens on each call. The call involves not just the useful work (the green box on the slide) but also overhead (the blue boxes): pushing arguments onto the stack or into registers, managing the return address, jumping to the function's code, and cleaning up afterwards, such as resetting the stack.
By inlining the function, we eliminate that overhead and leave only the useful work. In C and C++ this means using the inline keyword, though it's worth noting that inline is only a hint to the compiler. To force inlining, the always_inline attribute can be used, which gives a modest performance improvement. This is just the beginning of the optimization journey.
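A minimal sketch of forcing inlining with GCC/Clang attributes (the function body and field offset are placeholders):

    #include <cstdint>
    #include <cstring>

    // 'inline' alone is only a hint; the always_inline attribute (GCC/Clang)
    // makes the compiler inline the function at every call site.
    __attribute__((always_inline)) static inline uint32_t
    read_header_field(const uint8_t *header)
    {
        uint32_t field;
        std::memcpy(&field, header + 4, sizeof(field));  // placeholder offset
        return field;
    }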
While developing this, I noticed the gains were smaller than expected, which reminded me that I hadn't enabled compiler optimizations. Letting the compiler optimize makes a huge difference; in production you would normally build at the -O3 optimization level. On top of that, specifying the target architecture, Skylake in my case, lets the compiler use instructions specific to that architecture, boosting performance further.
The main things I want to highlight from that list are the features beginning with AVX. AVX instructions are vector instructions, and by specifying the target architecture I allow the compiler to use them; we'll come back to them later because they make a big difference. There's one more flag worth mentioning: -mbranches-within-32B-boundaries. How many people have used flags like this when compiling their projects? Only a few, perhaps. It's quite specific. It exists because of a flaw in Intel chips called the Jump Conditional Code Erratum. Intel shipped a fix, but that fix causes the Decoded ICache to be flushed unless branches are aligned on 32-byte boundaries.
Whenever the Decoded ICache is flushed, you get more ICache misses and your code slows down. On some architectures this one compiler flag is worth around 7% in performance. The broader takeaway is that there are many specialized compiler flags like this; if performance really matters to you, it's worth exploring which ones apply to your system. In our experiments the compiler did an excellent job of optimizing the code, bringing it down to about 400 clocks per packet. That's a big improvement, but the target is 300 clocks per packet, and observant viewers will notice there's plenty of space left on the graph, so there's room to keep improving.
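For illustration, a build using the flags discussed here might look something like this (the file names are placeholders; -O3, -march=skylake, and -mbranches-within-32B-boundaries are real GCC/Clang options):

    g++ -O3 -march=skylake -mbranches-within-32B-boundaries -o packet_bench packet_bench.cpp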
I mentioned these vector instructions earlier, when we let the compiler use them; you can also use them by hand in your own code. What is a vector instruction? It's a form of Single Instruction, Multiple Data (SIMD), the model used extensively in GPUs and a big part of why AI training is so fast. Take the vector instruction vpand as an example: it performs a logical AND across multiple pieces of data at once. With AVX2, vectors are 256 bits long, enough to hold eight 32-bit integers, so a single vpand performs a logical AND across eight integers simultaneously and delivers the result as a vector.
To see how to use these instructions in our own code, let's return to the packet processing path, specifically the apply_rewrite function, which is already marked for inlining. It iterates over all 32 bytes of the header, ANDing each byte with a mask to keep the bits we want and zero out the rest, and then ORing in the bits to be rewritten.
To vectorize this by hand we use Intel intrinsics, functions provided by Intel that map more or less directly onto individual vector instructions. First we load the header data into a vector register, a form that vector instructions can operate on. Then we use vpand to zero out the bits that are about to be overwritten while preserving the rest.
Next we use vpor to OR in the new values, and finally we store the result back to its original location in memory. That is the first approach, using AVX2 instructions. We could go further with AVX-512, which adds ternary logic instructions that combine three binary inputs in a single operation, potentially improving efficiency further.
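A minimal sketch of the hand-vectorized rewrite using AVX2 intrinsics, assuming the same hypothetical 32-byte header and match structure as before (aligned here so the vector loads are valid); it would be built with -march=skylake or -mavx2:

    #include <immintrin.h>
    #include <cstdint>

    struct match_t {
        alignas(32) uint8_t mask[32];   // header bits to keep
        alignas(32) uint8_t value[32];  // header bits to OR in
    };

    static void apply_rewrite_avx2(uint8_t *header, const match_t &m)
    {
        // Load the 32-byte header into a 256-bit vector register.
        __m256i h    = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(header));
        __m256i mask = _mm256_load_si256(reinterpret_cast<const __m256i *>(m.mask));
        __m256i val  = _mm256_load_si256(reinterpret_cast<const __m256i *>(m.value));

        h = _mm256_and_si256(h, mask);  // vpand: keep wanted bits, zero the rest
        h = _mm256_or_si256(h, val);    // vpor: OR in the rewritten bits

        _mm256_storeu_si256(reinterpret_cast<__m256i *>(header), h);
    }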
Despite the theory, the performance gain was negligible. The reason is that the compiler had already vectorized the code itself: the assembly output for the original, non-vectorized function shows the same vpand and vpor operations. Manual vectorization therefore delivered no additional benefit in this case, which says a lot about how good the compiler's optimizer is.
We need to step it up a level now and look at data structures. So far we've used a plain hash map, but a lot of research has gone into structures for fast lookup, and the one we'll explore is the Swiss Table, which came out of Google. Briefly, a Swiss Table is a hash table in which the hash is split into two parts, an H1 and an H2 value. The H1 value selects a group within the table, which takes us to the bucket where the entry we want should live.
Each group also points to a metadata array, which holds the compressed H2 hash values of every entry in that bucket. We scan this array for our H2 value, and a match lets us jump straight to the entry we're after; any time a lookup lets you jump straight to the data, that's a good sign.
To summarize the operation: first, a hash is computed; second, the H1 value takes us to the corresponding metadata array; third, we scan that array with our H2 value to find a matching entry, which leads us directly to our data. This beats a traditional hash map for several reasons. It's highly cache-efficient, needing only two memory indirections: one to reach the metadata array and one to retrieve the actual entry.
Another advantage is that probing on H2 values usually avoids comparing full keys until the final candidate entry is reached. And because the metadata array is compact, holding up to 16 entries, it can be scanned in a single batch with vector instructions, which makes the comparison fast and cuts the time spent probing. The basic operation is the same as any hash map, but the implementation details make it substantially faster.
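A minimal sketch of that vectorized probe over a 16-entry metadata group, using SSE intrinsics (the group size and control-byte layout are simplified relative to Abseil's real implementation):

    #include <immintrin.h>
    #include <cstdint>

    // Returns a bitmask with bit i set if metadata slot i matches the H2 value.
    static uint32_t probe_group(const uint8_t metadata[16], uint8_t h2)
    {
        __m128i ctrl  = _mm_loadu_si128(reinterpret_cast<const __m128i *>(metadata));
        __m128i probe = _mm_set1_epi8(static_cast<char>(h2));  // broadcast H2 into 16 lanes
        __m128i eq    = _mm_cmpeq_epi8(ctrl, probe);           // lane-wise equality compare
        return static_cast<uint32_t>(_mm_movemask_epi8(eq));   // one bit per matching slot
    }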
In the code, we still start by extracting the lookup key, though my Swiss Table implementation uses a slightly different key format. We then do the lookup using the very same hash function, to keep the comparison fair, and apply exactly the same rewrite as before. The question is whether it improves performance, and it does: we drop from roughly 400 to 300 clock cycles per packet, a substantial gain.
Now we get into more complex territory: interleaved processing. I mentioned cache efficiency when discussing the Swiss Table, and interleaving is all about memory latency. How long a memory access takes depends on where the data is: the L1 cache is fast but small, while main RAM can hold far more data but a fetch from it can cost around 200 clocks, a huge chunk of our 300-clock budget. The idea behind interleaving is to hide those memory access delays.
Consider packet processing without interleaving. While processing the first packet we reach a point where we need memory that isn't in the cache, so we pause in what's known as a memory stall, then carry on with the first packet. When we move on to the second packet the same thing happens: another memory stall, another wait, and then we proceed.
With interleaving, the approach changes. We start the first packet, and when we reach the point where we need that memory, instead of waiting we issue a prefetch instruction, which starts pulling the memory into the L1 cache. Rather than resuming the first packet, we switch to processing the second packet, which is independent of the first, so we keep doing useful work.
When the second packet reaches the same point, we issue its prefetch too. While we've been working on the second packet, the memory for the first has been arriving in the L1 cache, so when we return to the first packet it no longer stalls. The same applies to packet 2: the time spent finishing packet 1 is exactly the time its prefetch needs to warm the cache.
The technique saves a lot of time by avoiding repeated memory stalls. To be clear, this is a simplified picture: the CPU already uses instruction-level parallelism, out-of-order execution, and pipelining to hide memory latency. But because we understand what our code intends to do, interleaving lets us do better than those automatic mechanisms manage on their own.
In practical terms, the code changes from a function that processes one packet at a time to a function that processes a burst of packets, say 20. The loop over those 20 packets starts with the same steps as before, extracting the key and computing the hash. The difference comes at the lookup: instead of doing it immediately, we prefetch the Swiss Table metadata array that the lookup will need and carry on with the remaining 19 packets in the burst without delay.
Further into the code there's a second loop that performs the lookups. For the first packet of the burst we built the key, computed its hash, and kicked off the memory fetch while we were still hashing the remaining packets, so by the time we get here the memory is sitting in a warm cache and the lookup is fast. The code doesn't pause at that point either; it keeps following the same pattern all the way through.
Instead of going straight on to the rewrite phase, we first prefetch the data the rewrite will need, namely the match structure, and continue with the rest of the burst. This pattern of prefetching and processing runs throughout the code, and the payoff is dramatic: from about 300 clocks per packet down to roughly 80. It shows just how expensive memory access is, and why cache-efficient structures like the Swiss Table and techniques that avoid memory stalls accelerate performance so much.
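A minimal sketch of that interleaved burst structure, using GCC/Clang's __builtin_prefetch; the helper functions are hypothetical declarations standing in for the real key extraction, hashing, Swiss Table lookup, and rewrite code:

    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t BURST = 20;

    // Hypothetical types and helpers standing in for the real implementation.
    struct lookup_key_t { uint64_t hi, lo; };
    struct match_t      { uint8_t mask[32], value[32]; };

    lookup_key_t   extract_key(const uint8_t *header);
    uint64_t       compute_hash(const lookup_key_t &key);
    const void    *group_for_hash(uint64_t hash);      // address of the Swiss Table metadata
    const match_t *lookup_in_group(uint64_t hash, const lookup_key_t &key);
    void           apply_rewrite(uint8_t *header, const match_t &match);

    void process_burst(uint8_t *headers[BURST])
    {
        lookup_key_t   keys[BURST];
        uint64_t       hashes[BURST];
        const match_t *matches[BURST];

        // Pass 1: extract keys and hashes, and prefetch the metadata each lookup
        // will need instead of doing the lookup immediately.
        for (std::size_t i = 0; i < BURST; i++) {
            keys[i]   = extract_key(headers[i]);
            hashes[i] = compute_hash(keys[i]);
            __builtin_prefetch(group_for_hash(hashes[i]));
        }

        // Pass 2: the metadata is now warm, so the lookups don't stall; prefetch
        // the match structure that the rewrite will need.
        for (std::size_t i = 0; i < BURST; i++) {
            matches[i] = lookup_in_group(hashes[i], keys[i]);
            __builtin_prefetch(matches[i]);
        }

        // Pass 3: apply the rewrites against data that is already in cache.
        for (std::size_t i = 0; i < BURST; i++) {
            if (matches[i] != nullptr) {
                apply_rewrite(headers[i], *matches[i]);
            }
        }
    }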
It would be tempting to stop there, but we push on with the next technique: loop unrolling. After interleaving, the code contains a lot of loops, and every iteration pays for checking the loop condition and updating the index. Unrolling reduces that overhead: a loop that handled one packet per iteration becomes a shorter loop that handles several packets per iteration.
Unrolling by a factor of two, for example, means each iteration processes two packets; assuming the packet count divides evenly, the cost of the loop condition and index updates is halved. Unrolling also places the work for two independent packets next to each other, which gives the execution units more opportunity for instruction-level parallelism and can magnify the gains further.
Let's look at the code, which, to be honest, is a mess; if it came to me for review I'd send it back for revisions. What's happening is that the loop has been unrolled to handle four packets at a time: it extracts four lookup keys, computes four hashes, and prefetches the corresponding Swiss Table sections, the same operations as before but four packets per iteration. There's more code not shown that continues in the same way. One important detail is that bursts don't always contain a multiple of four packets.
Extra code is needed to handle the leftover packets when the burst isn't a multiple of four. The result is another significant improvement, from 80 down to 65 clocks per packet. That may look minor on the graph, but shaving 15 clocks when you're already at 80 is a substantial gain.
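A minimal sketch of the first pass of that loop unrolled by a factor of four, reusing the hypothetical helpers declared in the interleaving sketch above; the trailing loop handles bursts that aren't a multiple of four:

    void first_pass_unrolled(uint8_t *headers[], std::size_t n,
                             lookup_key_t keys[], uint64_t hashes[])
    {
        std::size_t i = 0;

        // Main unrolled loop: four independent packets per iteration, so the
        // loop-condition and index overhead is paid once per four packets and the
        // CPU has more independent work it can execute in parallel.
        for (; i + 4 <= n; i += 4) {
            keys[i]     = extract_key(headers[i]);
            keys[i + 1] = extract_key(headers[i + 1]);
            keys[i + 2] = extract_key(headers[i + 2]);
            keys[i + 3] = extract_key(headers[i + 3]);

            hashes[i]     = compute_hash(keys[i]);
            hashes[i + 1] = compute_hash(keys[i + 1]);
            hashes[i + 2] = compute_hash(keys[i + 2]);
            hashes[i + 3] = compute_hash(keys[i + 3]);

            __builtin_prefetch(group_for_hash(hashes[i]));
            __builtin_prefetch(group_for_hash(hashes[i + 1]));
            __builtin_prefetch(group_for_hash(hashes[i + 2]));
            __builtin_prefetch(group_for_hash(hashes[i + 3]));
        }

        // Remainder loop: bursts don't always contain a multiple of four packets.
        for (; i < n; i++) {
            keys[i]   = extract_key(headers[i]);
            hashes[i] = compute_hash(keys[i]);
            __builtin_prefetch(group_for_hash(hashes[i]));
        }
    }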
Let's pause and take stock. We've gone from 1,000 clocks per packet down to 65, better than a fifteenfold speed-up, by applying six distinct techniques. It's important to understand that these techniques won't always improve performance, because each comes with trade-offs that I've glossed over so far. For example, inlining and loop unrolling both expand the code, which increases program size, makes instruction cache misses more likely, and can slow things down; on my team we've actually scaled back some loop unrolling to reduce the ICache impact. Cache eviction is another risk: prefetch too aggressively and you push useful data out of the small L1 cache before it's used, which hurts performance.
Again, we've dialed back some of our prefetching, sometimes prefetching only into the L3 cache, to balance the benefit against the cost. Another trade-off is AVX throttling: AVX-512 instructions do more work per instruction, but heavy use of them lowers the CPU's clock speed, which can hurt overall performance. Finally there's maintainability: compare the simple code we started with against what interleaving and unrolling produce, and it's clear that once-simple code becomes cumbersome, especially in a language like C, which makes it harder to maintain.
I said at the start there were seven techniques, but I've only shown six. The seventh, and the most important, is to benchmark everything: if you care about performance, you have to measure it continuously and iterate on the results. The aim of this talk isn't to send you away to rewrite all of your code with these techniques; the real question is whether and when you should apply them.
The closing slide plots the techniques on two axes: how easy they are to implement and how big an impact they have on performance. Some are quick to adopt and deliver a significant boost; others offer a substantial impact but add complexity and hurt maintainability. The quadrants are a way of weighing ease of implementation against potential gain when deciding which techniques to bring into your own code.
Some of it is straightforward. Take data structures: you may be wondering whether they're really that easy, since there was clearly a lot of code behind my Swiss Table, and yes, I did have to write a fair amount to implement it. But in most languages today you get these data structures for free. Google pioneered the design; if you're writing C++, the Abseil library, which Google has open-sourced, provides a flat hash map built on a Swiss Table, and in Rust the default HashMap already uses a Swiss Table internally. You may well be using these structures without realizing it.
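A minimal sketch of picking up Abseil's Swiss Table based map, assuming the same hypothetical key and match types as in the earlier sketches (absl::flat_hash_map and AbslHashValue are the real Abseil APIs; everything else is a placeholder):

    #include <cstdint>
    #include <utility>
    #include "absl/container/flat_hash_map.h"

    struct lookup_key_t {
        uint64_t hi, lo;
        bool operator==(const lookup_key_t &o) const { return hi == o.hi && lo == o.lo; }
        // Hook the key into Abseil's hashing framework.
        template <typename H>
        friend H AbslHashValue(H h, const lookup_key_t &k) {
            return H::combine(std::move(h), k.hi, k.lo);
        }
    };
    struct match_t { uint8_t mask[32], value[32]; };

    // Drop-in replacement for the std::unordered_map flow table.
    using flow_table_t = absl::flat_hash_map<lookup_key_t, match_t>;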
Techniques in the other two quadrants, those that are straightforward to implement but low impact, and those that are high impact but hard to implement, require more careful consideration. You need to weigh whether the effort is justified: the cost of developing, maintaining, and deploying the code against the performance benefit at runtime. These strategies are usually only worth it if your application genuinely demands exceptional performance or runs at very large scale.
On a new, greenfield project these techniques are worth considering, but it wouldn't be practical to overhaul all your existing code just to adopt them. And techniques that are both low impact and hard to implement should simply be avoided.
Participant 1: What is your view on frameworks that will automatically generate all this highly optimized code, based on benchmarks that run automatically for you?
Elder: I haven't found any framework that will automatically generate code that performs at the level we've reached here.
Participant 2: I wanted to ask about the prefetch stuff, like how do you know that the prefetch is finished? Are you going through and actually counting instructions? Isn’t it fragile? A similar question for the cache thrashing problem. You have to tune some parameters to know whether you’re thrashing the cache or not, doesn’t it all get fragile when compilers change? How does that actually work?
Elder: Certainly, in our code, we don’t count the cycles. We essentially pipeline all of our code, so that we’re pretty sure that by the time we need the memory, it’s going to be there. There are some great tools out there, where if you are running performance benchmarks, you can get lots of information about whether you are seeing memory stalls. We use VTune. That’s a really powerful tool for telling you if you are hitting memory stalls, and then allowing you to iterate.
This really illustrates the point of performance benchmarking. If you don’t have regressable performance tests that run as part of your CI suite, then, yes, it’s very fragile, because you can regress performance really easily. One of the key things that we do is we invest heavily in that CI/CD performance regression. Every time any code goes into main, it’s running performance tests, and if they don’t pass, the code doesn’t go in.
Participant 3: The seventh technique is benchmarking, but there's a physical law that says measuring something inherently affects it. How does benchmarking improve performance rather than degrade it?
Elder: Measurement certainly does have an impact on performance. We use a few different kinds of benchmarking. One is microbenchmarking, where we build infrastructure around a piece of code and count clock cycles while looping over it repeatedly.
Loop unrolling, for example, showed big improvements in microbenchmarks, but those gains don't always carry over into the larger application because of effects like ICache interference.
Combining microbenchmarks for quick iteration with performance tests of the more complete solution mitigates some of the ways that measurement can mislead you.
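A minimal sketch of the kind of microbenchmark loop described here, reading the x86 time-stamp counter with __rdtsc (the workload and iteration count are placeholders):

    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h>

    // Placeholder workload standing in for process_burst() or similar.
    static void workload()
    {
        static volatile uint64_t sink = 0;
        for (int i = 0; i < 64; i++) {
            sink = sink + static_cast<uint64_t>(i);
        }
    }

    int main()
    {
        constexpr int iterations = 1000000;

        uint64_t start = __rdtsc();              // read the time-stamp counter
        for (int i = 0; i < iterations; i++) {
            workload();
        }
        uint64_t end = __rdtsc();

        std::printf("%.1f clocks per iteration\n",
                    double(end - start) / iterations);
        return 0;
    }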
Participant 4: In high-level programming languages, and I mostly use C#, you don't have this level of control over optimization. What would be a good strategy if I have something that needs to be very performant?
Elder: If you’re using higher level languages, and you don’t have this level of control, what can you do about it?
I think it probably depends on which higher level language you use. Rust is something I use a little bit. I think Rust does actually have ways of calling Intel Intrinsics directly from Rust, for example. I’m not familiar with C#. I think probably when you start a project, you’ve got to make a decision about what language you use. If you’re really looking for something fast and efficient, you choose an appropriate language. You’re not going to use Java or C#.
Participant 5: You've included benchmarking as one of the seven steps, and as you said, in the end you need to microbenchmark at every step. I think a warning should be added about premature optimization, because in various projects I see people start making optimizations without benchmarking first, and in the end the time they put in is wasted because the bottleneck wasn't where they thought it would be, notably when people try to optimize individual instructions.
When dealing with recursive algorithms, for example, one might reach for least-recently-used optimizations, but these are not always effective; in recursive contexts, alternatives like loop unrolling can be significantly more beneficial.
Elder: It's easy to fall into theoretical traps, especially in the early discussions about how to solve a hard problem. Starting to test your code's performance early saves you from doing unnecessary work, and that's something I strongly support.