Joe Rowell delves into the concept of unified memory on contemporary GPUs, detailing the low-level mechanics of its implementation on an x86-64 system, along with various tools to comprehend the operations occurring within a GPU.
Joe Rowell serves as a Founding Engineer at poolside. Before his tenure at poolside, he pursued a PhD in Cryptography at Royal Holloway, University of London under the guidance of Martin Albrecht. His research emphasizes efficiency across all layers of the software stack from both theoretical and practical viewpoints.
Software is pivotal in transforming the world. QCon London promotes software development by enhancing the exchange of knowledge and innovation within the developer community. This practitioner-focused conference caters to technical team leads, architects, engineering directors, and project managers who drive innovation within their teams.
Rowell: How many of you are familiar with GPUs? Before diving into that topic, I’d like to share a brief diversion. This striking red vehicle is an MG, specifically the MGB GT model, produced around the late 1960s. I’ve always had a desire to acquire one because, aside from the rarity of British cars nowadays, there’s the added advantage that if it were to malfunction, I could likely repair it myself. If I found myself stranded on the motorway due to the aging engine failing, I could probably figure out a fix. That’s not something I can claim about this vehicle. This modern Volvo is entirely beyond my comprehension. Should it break down, I would be helpless and reliant on a mechanic to resolve the issue, who would likely just tow it elsewhere.
The reason I mention this is that, upon examining these two vehicles, it’s clear that the way we engage with them hasn’t evolved significantly. This car is equipped with an array of advanced features, such as navigation and a sound system, yet the fundamental driving experience remains unchanged. You’re still using gears, pedals, and steering. I want you to hold onto this thought, as it will resurface throughout this discussion.
I recently embarked on a new position about six months ago, and I think you can imagine my nerves on the first day. I joined a preliminary call, and they turned to me, saying, “Joe, we need your help to enhance this program’s speed.” Hearing this was both exhilarating and daunting. I had hoped to be part of a team that valued its members and operated effectively, but here they were expressing uncertainty: “We don’t know why this is slow, and we’re counting on you to assist us.”
What intensified my anxiety was my complete lack of knowledge regarding GPUs when I started this role. The term CUDA evoked thoughts of a barracuda, which terrified me, as I was completely lost. I had no insights into how it functioned or how to accelerate its performance, much like what Thomas mentioned. Following the natural course, I turned to YouTube for guidance and ended up coding a program that resembles the following.
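The code itself isn't reproduced here, but in rough terms it looked something like the sketch below; the names and sizes are illustrative rather than the exact program from the talk.

```cuda
#include <cstdio>
#include <cstdlib>

// A memset-like kernel: every thread overwrites one element with `val`.
__global__ void my_memset(int *ptr, int val, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        ptr[i] = val;
    }
}

int main() {
    const size_t n = 100;
    // A plain host allocation. Whether the GPU can touch this pointer at all
    // depends on your hardware and driver, which is exactly the problem.
    int *data = static_cast<int *>(malloc(n * sizeof(int)));
    my_memset<<<1, 128>>>(data, 42, n);
    cudaDeviceSynchronize();
    printf("data[0] = %d\n", data[0]);
    free(data);
    return 0;
}
```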
This program is written in a variant of C called CUDA C. You can see that we have a straightforward function akin to memset: we take a pointer and overwrite its contents with a specified value, val. On my first attempt to run it, I allocated 100 integers on the host, passed the pointer to the GPU kernel, and expected results. Curiously, this program is hit or miss: whether it works depends on the combination of hardware and software you happen to be running.
I found this truly fascinating. It struck me as peculiar that this seemingly innocuous program would sometimes work perfectly and at other times fail completely. That leads us to the two main questions for this talk: first, why is this program hit or miss? Second, why was our normally fast program running slowly? Although the two programs differ, they share enough in common that the answers overlap.
To begin with, I would like to briefly explore the distinctions between a CPU and a GPU. When coding for a CPU, you typically manage numerous threads that execute concurrently, requiring explicit commands to enable this behavior. Whether you’re utilizing pthreads or the C++ standard library, you must clearly indicate your intention for parallel execution. In contrast, GPUs handle concurrency in a more implicit manner. When you write your code for a GPU, you adopt a declarative style, specifying that, when executed, it should operate on specific portions of the workload.
To illustrate the differences further, consider this analogy: a CPU functions like an office environment. In an office, numerous individuals are engaged in various independent tasks, each in their designated workspace, focusing on general responsibilities. Conversely, a GPU parallels a factory setting. Within a factory, specialized machinery is designed for specific tasks, such as producing tables or automobiles.
Factories typically concentrate on producing a single item proficiently rather than multiple products. Within this analogy, envision a storage area that contains raw materials, which can be likened to memory. This storage holds resources that are utilized to perform certain tasks. The analogy can be extended: in an office, each employee has their own workspace, whereas in a factory, multiple teams might share the same area, rotating in and out according to the day’s requirements. To summarize this in a more technical context, I want to illustrate the concept of implicit concurrency using a familiar function, which is essentially memset.
Currently, we are stating that each thread will primarily handle 32 elements. Following this, we will divide the pointer’s range into smaller subranges identified by their low and high indexes. With the inclusion of this check, we ensure that we do not overwrite any data. Fundamentally, this approach allows us to achieve concurrency. We define a concise program that partitions our range, and we anticipate that it will execute in parallel.
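Sketched out, the kernel being described looks roughly like this; the names are illustrative, not the exact code from the slide.

```cuda
// Each thread handles a 32-element subrange, bounded by `low` and `high`;
// the final check keeps us from writing past the end of the buffer.
constexpr size_t ELEMS_PER_THREAD = 32;

__global__ void my_memset_chunked(int *ptr, int val, size_t n) {
    size_t tid  = blockIdx.x * blockDim.x + threadIdx.x;
    size_t low  = tid * ELEMS_PER_THREAD;
    size_t high = low + ELEMS_PER_THREAD;
    if (high > n) {
        high = n;   // don't overwrite anything beyond our range
    }
    for (size_t i = low; i < high; ++i) {
        ptr[i] = val;
    }
}
```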
Please excuse my sidetracks, but this one feels significant. Does anyone recognize this car? I don’t expect you to; it might seem irrelevant. This is the Ford Model N. My knowledge of this vehicle is limited to the fact that it predates the Model T. The Model T was the first automobile assembled using a moving assembly line. Prior to the invention of the moving assembly line, vehicles were produced at a fixed spot, where workers would move around and work on the car sequentially. One day, Henry Ford envisioned a scenario where the components would move while the workers remained in place, in contrast to the traditional method where workers were on the move. This idea sparked a tremendous change at the time, resulting in the concept of the moving assembly line, where the workers stay stationary while the components are in motion.
The key takeaway here is that we strongly prefer the moving assembly line whenever feasible. We want an environment where different pieces of hardware can each perform their specific task independently of one another, with each unit focused on one particular function. One way we achieve this is through the concept of a stream. A stream is an ordered sequence of work: you add tasks to it, and they execute in the order you added them. This matters because there are times when certain tasks are waiting on specific conditions.
For example, within a GPU context, there may be a task that predominantly involves reading and writing from memory, which can lead to inefficient use of resources as that process tends to be slow while computation is typically rapid. To optimize this, you need to communicate to the computer that it can handle alternative tasks in the meantime. This concept is facilitated through the use of streams, and we will revisit this topic later.
Here we have another example of code. The cudaStream_t type in CUDA represents a logically ordered sequence of operations: work submitted to a particular stream executes within that stream in the order it was enqueued. The triple angle brackets in the kernel launch are where you say, among other things, which stream the kernel should run on. This is quite a broad simplification. The core idea is that GPUs perform best when the ratio of computation to data movement is high. Returning to our factory analogy, imagine a scenario where the factory workers are frequently waiting for raw materials to be delivered.
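As a rough sketch of what that looks like in practice, reusing the memset-like kernel and data pointer from the earlier sketch (the launch parameters here are illustrative):

```cuda
// Create a stream and queue a kernel onto it. The fourth launch parameter
// inside the angle brackets is the stream the kernel joins.
cudaStream_t stream;
cudaStreamCreate(&stream);

// Work on this stream runs in the order it was enqueued, but it can overlap
// with work queued on other streams.
my_memset<<<4, 128, 0, stream>>>(data, 42, n);

cudaStreamSynchronize(stream);   // wait for everything queued on this stream
cudaStreamDestroy(stream);
```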
These workers are efficient and skilled at their tasks, but if the delivery time of the raw materials is excessive, they may end up being idle for extended periods. A classic example illustrating this is matrix multiplication. When dealing with an n by n matrix, the time complexity is O(n³) and the space requirement is O(n²). In such cases, using GPUs can lead to a considerable performance increase, which is a frequent topic of discussion in performance optimization. However, I have reservations about using big O notation in this context, as it can be misleading.
There are algorithms for matrix multiplication that are asymptotically more efficient yet underperform on GPUs, partly because the constant factors matter a great deal in practice. With that in mind, we are going to write a memory copy function that handles all kinds of copies well. By that I mean: if I hand it memory that lives on the CPU, I expect it to transfer that memory quickly; if the memory lives on a GPU, it should copy that data efficiently too. My goal is simply that the function does the right thing everywhere. Fortunately, most programmers already know memcpy, and CUDA provides its own copy functions alongside it. Ultimately, our task is to determine the appropriate operation to perform in each case.
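One way to do that, sketched under the assumption that unified virtual addressing is available (this is my illustration rather than the exact helper from the talk), is to ask the runtime what kind of memory each pointer is and dispatch accordingly:

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Returns true if `p` looks like ordinary (or pinned) host memory.
static bool is_host_memory(const void *p) {
    cudaPointerAttributes attr{};
    if (cudaPointerGetAttributes(&attr, p) != cudaSuccess) {
        // Older toolkits report an error for plain malloc'd pointers.
        cudaGetLastError();   // clear the error and treat it as host memory
        return true;
    }
    return attr.type == cudaMemoryTypeUnregistered ||
           attr.type == cudaMemoryTypeHost;
}

void generic_copy(void *dst, const void *src, size_t bytes) {
    if (is_host_memory(dst) && is_host_memory(src)) {
        memcpy(dst, src, bytes);   // a pure CPU-to-CPU copy
    } else {
        // cudaMemcpyDefault lets the runtime infer the direction from
        // where the two pointers actually live.
        cudaMemcpy(dst, src, bytes, cudaMemcpyDefault);
    }
}
```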
To frame this discussion, there are three types of memory in a GPU environment that we should pay attention to. The first is memory allocated with cudaMalloc. As the name implies, this function allocates memory on the graphics card, giving you a buffer to store things in. This kind of memory is special in that it is never evicted or migrated: it stays put on the device unless you move it yourself. That is quite different from, say, the memory backing an ordinary malloc allocation, whose physical pages the operating system is free to page out or move as it sees fit.
The second type, which we won't go into much in this talk, comes from cudaMallocHost: it allocates pinned memory in system RAM and maps it so the device can access it through memory mapping. The advantage of this approach is that your graphics card can reach system memory directly, which is a fantastic feature. If I had used this method, the initial bug we encountered would never have arisen. For this conversation, though, let's set that aside and concentrate on the final type: managed memory, also called unified memory.
With managed memory, you allocate memory whose physical backing may live either on a device, like a GPU, or in system memory. The intriguing part is its somewhat ethereal nature: the memory is not fixed in one place or the other. It can genuinely be in flux, migrating between the two depending on what you need and the access patterns your program presents.
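Side by side, the three allocation flavours look something like this (a sketch for orientation, with an arbitrary size):

```cuda
const size_t bytes = 100 * sizeof(int);

int *device_only = nullptr;
cudaMalloc(&device_only, bytes);        // lives on the GPU and never migrates

int *pinned_host = nullptr;
cudaMallocHost(&pinned_host, bytes);    // pinned system RAM the GPU can reach directly

int *managed = nullptr;
cudaMallocManaged(&managed, bytes);     // unified memory: the backing pages migrate on demand

// ... use the buffers ...
cudaFree(device_only);
cudaFreeHost(pinned_host);
cudaFree(managed);                      // managed allocations are also freed with cudaFree
```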
Let's explore how this fascinating process is actually accomplished. You will hear various performance tips, and one of them is to use strace whenever you're unsure what a program is really doing, because it can reveal details you would not otherwise uncover. For instance, I wrote a simple program that used cudaMallocManaged to allocate memory and then ran it under strace. The first thing it does is open a file descriptor to a special device provided by NVIDIA, which acts as the interface between your system memory and device memory. After that, it allocates a sizable chunk of memory via mmap; for those well-versed in binary, you'll recognize that this value is exactly 64 megabytes.
There's no customization here; the request is simply: give me 64 megabytes right now. It then unmaps a portion of that memory, 32 megabytes to be exact. The reason for this particular deallocation is unclear; I haven't found any documentation explaining why it releases those 32 megabytes, but that is what it does. Finally, it remaps the previously allocated memory to match your alignment and the size you requested, and that is the value you see returned. It's worth noting that the mmap calls differ between the first and the later invocations, because the later ones specify exactly where the memory must be mapped.
This process is crucial as it ensures that the pointers remain identical between the system and the device. We are explicitly defining that a specific resource must reside at a particular address, and we need to utilize the file descriptor that we previously opened for that address. Essentially, we are clarifying where this resource should be located. When it’s time to deallocate, the process is straightforward. We simply remap the memory we used before and release what we no longer need. The underlying mechanics involve making a sequence of system calls each time you perform this action, effectively mapping your memory to various locations before eventually removing it.
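If you want to see this for yourself, a tiny program like the following (illustrative, not the exact one from the talk) run under something like `strace -f ./a.out` is enough to surface the sequence of calls described above:

```cuda
#include <cuda_runtime.h>

int main() {
    int *p = nullptr;
    cudaMallocManaged(&p, 100 * sizeof(int));  // triggers the device open plus the mmap/munmap dance
    p[0] = 42;                                 // first touch from the CPU
    cudaFree(p);                               // remaps and releases, as seen at the end of the trace
    return 0;
}
```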
Now that we understand this, I'd like to show what happens in the profiler during this operation. This is the NVIDIA system profiler, an incredibly valuable tool; it provides a lot of insight. Here you can see the program I created, the one we just analyzed with strace. Over on the left, the cudaMallocManaged call shows where we allocate this memory. The 'Fill' corresponds to the memset-like kernel we discussed earlier. When that kernel executes, represented by the blue line at the top, we very quickly hit a page fault.
This page fault occurs because the driver enforces an abstraction when the memory is initially allocated; it does not assign physical storage but instead provides an opaque page. Consequently, when the function runs for the first time and triggers a fault, it must allocate physical storage for this memory. The lengthy bar illustrated here signifies the process of allocating that memory on the device and managing the page fault. Observing this bar reveals that a significant portion of the execution time for this brief function is spent addressing the page fault. While this may vary in different scenarios, it’s easy to envision a situation where this could consume a large amount of your time.
Interestingly, if you were to allocate even more memory, you would observe a similar behavior with some intriguing details. For starters, repeated page faults on writes emerge, but the intervals between them vary. This inconsistency arises because each time a page fault occurs, the hardware attempts to assist by allocating additional memory. This results in a somewhat predictable pattern: initially, the hardware provides minimal support, but with each subsequent fault, it offers increasingly more assistance.
Interestingly, if you follow this line all the way through, you’ll discover that the entire sequence actually repeats itself. The reason for this is that after the last entry, there is a significantly larger gap between faults. I’m not entirely sure why this occurs, but when it does, the hardware seems to forget that it’s encountering a problem, thereby offering less assistance. You’ll notice that this recurring pattern continues to unfold as it progresses.
The CUDA programming guide articulates this concept well, stating that the actual physical location of data is essentially invisible to a program and can change at any moment, whether or not that change is apparent to you. At any time, regardless of your actions, the fundamental physical storage of the memory accessed through that shared pointer can fluctuate. It’s important to ponder this point for a moment, as such occurrences are not common. We witnessed this in Jules’ presentation, where processors can shift, but it’s relatively uncommon for memory to move in a way that is entirely hidden and has significant consequences.
Yes, it can occasionally happen with caches, but that’s a different scenario, as what occurs there is merely a duplication of data. You are essentially storing data and then managing it. In this case, your entire program’s data can shift if you’re working through this pointer. The latter part of the quote hints that access to the data’s virtual address will remain valid and coherent from any processor, irrespective of locality. In fact, I believe this might be the only feasible method to accomplish this, considering the extensive activity and myriad processes currently running. This aspect is particularly intriguing; ensuring that data access remains valid has considerable implications for our program’s performance.
I’ve shared this information, and I’ve also mentioned that functions exist for CUDA copying, so we can use memcpy. If we have a managed pointer on one end and a managed pointer on the other, it should be very quick and effective because this operates on the CPU, where the pointers are accessible. Everything is expected to run smoothly and efficiently. However, we have encountered a significant obstacle—our infamous pitfall. The graph illustrates this clearly; as we increase the size in gigabytes on the x-axis, and the gigabytes per second on the y-axis, the results vary erratically, showcasing substantial fluctuations.
When we transfer a gigabyte, we achieve just over 600 megabytes per second of throughput. Set against this machine's peak transfer rate of roughly 70 gigabytes per second, that works out to around 0.85% of what the hardware can do. I don't know how often you measure your memory bandwidth, but that is truly disappointing. In fact, writing to and from an SSD would be faster than this.
The underlying cause of this issue is fairly intricate. Let’s briefly consult the NVIDIA system profiler. Upon examining the details, it becomes apparent that we encounter CPU page faults. To recap, we previously had memory located on the GPU, and after executing a task, we attempted a memory copy on the CPU. This leads to a page fault, where the CPU tries to perform a copy but realizes it can’t access the relevant page. As we track this across the entire operation, we find that these page faults occur frequently—approximately 37,000 times in this particular trace. It’s astonishing how much of our time is wasted dealing with these page faults.
There’s a notable interaction between the graphics card and the CPU here that reveals we’re restricted by the CPU’s page size. To break down the process of a page fault: The CPU traverses memory, recognizes the absence of a required page, and initiates a request. It allocates memory space and contacts the GPU for the needed memory, which it provides. However, this allocation can only adhere to the system’s page size. Therefore, all the page faults we observe are related to 4-kilobyte pages.
One aspect I neglected to examine is how adjusting the page size might alter the results. While the outcomes would still be similar, they would likely be less severe. A larger page size might yield slightly improved performance since each page fault incurs a syscall cost, complicating the process and resulting in time inefficiencies.
This is portrayed once more here. You can observe that we encounter numerous page faults throughout this duration. This isn’t ideal for our performance. The same scenario is depicted again, but zoomed in, as I aim to illustrate some causality for you. When we delve deeply into this, a regular, repeating occurrence of page faults becomes evident.
Interestingly, these patterns repeat yet are spaced unevenly. For the two instances shown here, there exists a very small gap followed by slightly larger gaps, which continues consistently. The reason behind this involves the presence of purple blocks succeeded by red blocks, indicating that the hardware is attempting to assist. This is the hardware’s way of providing more than what’s requested. Each of the red lines signifies a page fault for 4k, while the purple lines represent a speculative prefetch. It acknowledges the occurrence of a page fault—an unfortunate event—and responds by transferring over additional memory.
As previously mentioned, you can observe that the widths of each of these blocks expand with each subsequent fault. By the time you reach this point, it’s estimated to be around 2 gigabytes being transferred back. In contrast, at the lowest point, it is likely only 64 kilobytes. Each time a misstep occurs, the system attempts to allocate more memory to enhance speed. This process is managed entirely by the hardware, requiring no additional action on your part—it’s handled directly by the device.
You might be wondering, "Joe, you mentioned earlier that these functions already exist. We have cudaMemcpy for transferring data between devices. Why write your own memcpy?" Indeed, using it does give slightly better results. The purple line at the bottom is the case where managed memory is used and we tell it the transfer goes from the device to the CPU. That is marginally more efficient than what we had before; you'll notice a fairly gradual curve here, because it is better at managing prefetches. Conversely, if we look at the other direction, from the host to the device, you'll find a considerably faster line.
We achieve around 10 gigabytes per second in data transfer, largely due to the nature of the pages involved. Interestingly, the GPU has a greater capability for managing larger pages, accommodating sizes up to 2 megabytes. This results in a performance curve that reflects a slight improvement, although it’s still not quite impressive. Ten gigabytes per second, while substantial, is not exceptional.
To recap, if you’re developing code that must handle this in a general manner, my strong recommendation is to steer clear of standard functions and to opt for CUDA functions instead. Notably, even in the least favorable scenario, represented by the purple line here, the performance is approximately double compared to what’s presented in this slide, achieved simply by implementing a one-line change. The underlying operations remain the same. For cases where a greater performance boost is feasible, like the one indicated by the green line, it’s still advantageous to adopt this generic approach.
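For what it's worth, the one-line change being described is, as I understand it, roughly this (with dst, src, and bytes standing in for whatever your copy uses):

```cuda
// Before: the standard library copy, which has to page-fault its way through.
// memcpy(dst, src, bytes);

// After: the CUDA runtime copy. Passing an explicit direction such as
// cudaMemcpyDeviceToHost works, or cudaMemcpyDefault if the code has to stay
// generic about where the two pointers actually live.
cudaMemcpy(dst, src, bytes, cudaMemcpyDefault);
```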
All these challenges stem from the management of page faults and the issue of our physical memory not being aligned with the operational area. You might wonder what would happen if we positioned everything correctly from the outset. Instead of transferring data between the GPU and the CPU, we could relocate the memory to the proper location first. Thankfully, CUDA accommodates this. While I find the term “prefetch” somewhat misleading, as it aligns more with migration, the process essentially involves requesting the movement of memory associated with a specific pointer to the target device. You also have the option to place it on a stream, ensuring consistency with previous operations or awaiting the completion of earlier tasks.
I want to emphasize the necessity of specifying the size during this process. This requirement introduces a clever trick: it allows you to selectively move only certain pages to designated locations. Consequently, you can create an array distributed across multiple devices, including both the CPU and GPU. This flexibility results in a unique form of parallelism. For instance, if desired, you could simultaneously update an array on both the GPU and CPU without issues, thanks to the guarantees in place. If we perform the memory movement solely on the source—copying from one pointer to another and transferring the source pointer to the device—we can observe a significantly improved performance curve. The transfer rate stabilizes around 16 gigabytes per second, which is decent, but then it drops sharply. This decline is contingent on your specific device.
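Concretely, the call being described is cudaMemPrefetchAsync; a small sketch, assuming managed_ptr, bytes, and stream come from the surrounding code:

```cuda
int device_id = 0;
cudaGetDevice(&device_id);   // the GPU we want the pages to live on

// Migrate the whole buffer onto the GPU before a kernel needs it...
cudaMemPrefetchAsync(managed_ptr, bytes, device_id, stream);

// ...or move just the first two megabytes back to system memory, leaving the
// rest of the allocation wherever it currently is.
cudaMemPrefetchAsync(managed_ptr, 2 * 1024 * 1024, cudaCpuDeviceId, stream);
```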
The data discussed in this presentation was collected using an H100, which boasts 80 gigabytes of memory. Initially, as both the prefetched and the destination arrays fit comfortably within memory, everything operates smoothly. However, once we reach the limit where both can no longer coexist in memory, performance dramatically declines. This drop occurs because we re-enter a situation laden with page faults—instances where we attempt to read or write to data that is not currently accessible, leading to a complete halt in memory performance. The issue becomes even more severe when we attempt to prefetch both pointers simultaneously, as illustrated.
Remarkably, during optimal conditions, performance peaks at an impressive 1300 gigabytes per second. This exceptional speed is attributed to efficient data copying occurring entirely within the GPU’s memory. Unfortunately, this efficiency only lasts until neither of the pointers fits within memory anymore, resulting in a catastrophic drop in memory bandwidth. The performance plummets to approximately 600 times slower than before, as we lose the capacity to proactively manage memory. Essentially, we’ve relinquished control, leading us back to previously experienced performance levels.
This situation can be visually represented, showcasing numerous red blocks that signify critical issues. These red areas predominantly highlight read requests, where our system fails to retrieve necessary memory, thus requiring a transfer from host to device. Furthermore, when the system runs out of memory, it must purge some data back into system memory to allow for new reads. This cycle is persistent and problematic, as the system struggles to intelligently decide which data to evict. I’ve witnessed instances where it erroneously removed critical data we had yet to utilize, resulting in significant performance degradation. We are caught in a loop, continually rereading unnecessary data, as demonstrated by the ongoing page faults related to both reads and writes.
The key takeaway from this experience is that we need to manage our data transfers with greater precision. We cannot rely solely on the hardware to handle these operations; we must take action ourselves. Ideally, we want a much more favorable memory access profile. The following example comes from an internal program at poolside, which was the specific issue I was asked to diagnose. The memory accesses look normal at first, yet the picture deteriorates drastically as we progress, ending in a severe bandwidth crash. You can see this in the pronounced spikes in the PCI Express bandwidth measurements, which show us struggling as demand increases, with throughput eventually plummeting to as low as 40 gigabytes per second.
In any situation where your device’s memory is at capacity, you can encounter this graph. If you’re utilizing the cudaMalloc API and your device’s memory is indeed filled to the brim, bear in mind that nothing will be evicted. You can easily shift the position of this graph to the left by allocating more memory on the graphics card. In our example, we repeatedly moved 512 megabytes, which resulted in an incredibly slow bandwidth because it was persistently trying to evict essential data for processing. This behavior can be triggered quite easily.
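One way to trigger the effect deliberately, purely as an illustration (the size here is arbitrary), is to pin down device memory with an ordinary cudaMalloc, which is never evicted, and watch the managed-memory copy start thrashing at a correspondingly smaller working set:

```cuda
void *ballast = nullptr;
cudaMalloc(&ballast, 40ull * 1024 * 1024 * 1024);   // reserve 40 GB of an 80 GB card

// ... run the same managed-memory copy; it now falls off the cliff far earlier ...

cudaFree(ballast);
```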
To address this, we need to manage these data transfers with greater precision, and I'll walk you through how. First, we define a PREFETCH_SIZE corresponding to roughly 2 megabytes of the data we are reading, and create two streams, s1 and s2. The streams matter because they let us order our operations properly. Next, we work out how many prefetches we need, and that gives us the essence of the loop. We begin each step of the copy by prefetching the previously processed data back to the CPU; that is what the cudaCpuDeviceId argument signifies. We have finished an operation on some data, and we are now sending it back to CPU memory.
We must perform this operation specifically because, without it, our performance will severely decline. Additionally, the CPU must be engaged in remapping the data back into its own memory; it cannot be done unilaterally. Information needs to be returned to the CPU, making its involvement necessary. We achieve this by queuing the operations on a designated stream and explicitly returning them to the CPU. The next step is to prefetch the required blocks from the system memory back onto the graphics card. The DeviceId here serves as a placeholder indicating the device ID to which the data is being sent. You will observe that we utilize a different stream for this operation. While it is feasible to employ the same stream, such as s1, doing it this way is generally more efficient since the GPU has tasks queued up already.
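Put together, the loop looks roughly like the sketch below. This is my reconstruction rather than the exact code shown in the talk: src is the managed buffer (as a char pointer), bytes is its size, and the actual per-chunk work is elided.

```cuda
#include <algorithm>
#include <cuda_runtime.h>

constexpr size_t PREFETCH_SIZE = 2 * 1024 * 1024;   // roughly 2 MB per chunk

void chunked_copy(char *src, size_t bytes) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    int device_id = 0;
    cudaGetDevice(&device_id);

    const size_t num_prefetches = (bytes + PREFETCH_SIZE - 1) / PREFETCH_SIZE;
    for (size_t i = 0; i < num_prefetches; ++i) {
        const size_t offset = i * PREFETCH_SIZE;
        const size_t chunk  = std::min(PREFETCH_SIZE, bytes - offset);

        if (i > 0) {
            // Send the chunk we have just finished with back to system memory.
            cudaMemPrefetchAsync(src + offset - PREFETCH_SIZE, PREFETCH_SIZE,
                                 cudaCpuDeviceId, s1);
        }
        // Pull the chunk we are about to use onto the GPU on a second stream,
        // so the card always has work queued up behind the current operation.
        cudaMemPrefetchAsync(src + offset, chunk, device_id, s2);

        // ... do the actual copy or compute on this chunk ...
    }

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```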
This approach allows the GPU to be aware of its pending tasks. If you execute this and unroll the whole sequence, you will reach the final line displayed. You can observe that, in contrast to the previous graph that featured numerous unsightly red lines, the red lines here have nearly vanished. There is a minor one present, but it’s a bug that I haven’t yet identified, so we’ll disregard it for our discussion. The improvements are notable; by strategically prefetching data and returning it to its previous location while effectively managing everything, we achieve a significant enhancement in performance.
While I don’t have the exact statistics displayed, from my spontaneous observations, the custom approach surpasses many standard CUDA functionalities for this task. If you develop your own tailored code and carefully manage memory allocation, you might find that it outperforms the usual library CUDA functions available on the hardware and drivers to which I have access; though, of course, all the typical disclaimers apply.
That being said, let’s revisit the MGB. We’ve covered a substantial amount of ground regarding hardware. It’s fascinating to think back to when we believed having a single pointer would streamline our workflow significantly. Just one pointer shared between devices and system memory would immensely simplify our tasks. It felt like a game changer.
In some ways, this reminds me of the transition from a regular car to a Volvo. We’ve created and introduced features into the world with the intention of keeping things uncomplicated, yet the reality is far more intricate than we anticipated. Here we are on slide 51, and I’ve invested a lot of time discussing concepts that were meant to be straightforward, but, in truth, they are quite complex.
Honestly, at this stage, I think I would prefer to handle the data transfers myself. The reality is that we are essentially coding for a system designed half a century ago. We are, in essence, working with a PDP-11, utilizing methodologies that remain unchanged. We’ve resisted adjustments. For those interested in retro computing, this harks back to discussions about near and far pointers back in the ’90s. There was a push to evolve the language to more accurately reflect the hardware, yet we’ve refrained from doing so. We’ve consciously avoided upgrading our tools to better align with the current technologies we’re employing. I want to underscore this point: it is impossible to definitively ascertain whether this function or that piece of code is functioning correctly. The compiler cannot provide this assurance.
If you imagine these functions as part of a public interface, there is no way to express clearly that the pointer being passed in is something the code can safely operate on: that it won't harm your machine, that everything will work without memory violations, that the hardware can actually handle it. It strikes me as oddly perplexing that you can't simply state, "this argument must be this kind of memory," and have that checked in any meaningful way. We've resisted changing this over time, and the outcomes haven't been positive.
The first step I urge you to take is to profile your code. Specifically, I want you to examine your code’s performance closely. Utilize various tools, and experiment with innovative techniques such as strace. Make an effort to truly grasp why your code behaves the way it does. Though computers can be confusing, there’s usually a reason behind every occurrence. It might be an unexpected event, something unforeseen, or a mystery that eludes us, yet fundamentally, there’s a rationale behind it all.
Reflect on how you can streamline your code. I don’t mean simply rewriting loops or altering structure; rather, consider whether a newcomer to your code can understand it without needing an hour-long presentation explaining its functionality. If that’s not the case, then it’s likely not as simple as it may appear. Most importantly, prioritize performance. When in doubt, remember that performance is key.
Participant 1: Can you clarify the origin of that shared memory mechanism and its architecture? If it ultimately seems easier or more logical for us to handle it ourselves, where does that come from?
Rowell: NVIDIA and AMD both present it as a more straightforward approach to initiate your program. If you have an older CPU application and are looking to transition to a GPU, and you aim to do so early on while monitoring when issues arise, adopting unified memory is advisable. It can simplify certain aspects of your workflow. However, if you are in a setting where performance is critical, it may not be the best option.
Participant 1: It’s mainly for migration purposes or maintaining backwards compatibility.
Rowell: Precisely. There are individuals using it for different reasons as well. For instance, if you require additional memory beyond what your GPU can accommodate. Imagine if your working set is 100 gigs while your device memory is only 80; in such cases, it can be beneficial. Nevertheless, I genuinely believe it’s not superior to handling these tasks manually. Ultimately, we opted to perform the memory transfers ourselves because it did not meet our performance standards and was too challenging to predict.
Participant 2: Did the overhead you encountered in your memory copies consistently relate to page faults? If you were transferring data from the device to the host twice, would the second transfer be quicker?
Rowell: No, in fact, it wasn’t. I will open this profile for you right here, and you can see that our occupancy, which indicates how much our device is being utilized, is actually quite high. This relates to the coherency aspect I mentioned earlier. When working with memory, it’s crucial for the device hardware to maintain a consistent view of that memory across all instances. Even when simply accessing the same page from different locations on the same device, it can cause delays.
Moreover, merely counting the number of page faults won’t suffice. There are certain strategies you can employ when writing your code that can ensure each page you access is only required in one location; however, this does not entirely resolve the underlying issues.