Introduction
Graphics Processing Units (GPUs) are the leading hardware choice for deep learning and machine learning workloads. GPUs accelerate machine learning operations by executing calculations in parallel. Many tasks, especially those that can be expressed as matrix multiplications, see significant speedups out of the box. Tuning operation parameters can yield even greater performance by making better use of GPU resources.
However, deep learning computations can be quite resource-intensive, even with GPU acceleration, and it is common to hit the hardware's limits, resulting in out-of-memory errors. Thankfully, GPUs come with both built-in and external monitoring tools. By tracking metrics such as power draw, utilization rates, and memory usage, users can diagnose problems as they arise.
GPU Bottlenecks and Blockers
Preprocessing in the CPU
Many deep learning frameworks process data transformations on the CPU before transferring the data to the GPU for training. This preprocessing phase can consume as much as 65% of epoch time, as illustrated in a recent study. Transformations on image or text data can become a significant bottleneck that holds back overall performance. Offloading these tasks to the GPU can considerably improve training efficiency.
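To see why this matters, a quick back-of-the-envelope calculation using the 65% figure above gives the best-case speedup from removing preprocessing from the critical path (this is just Amdahl's law applied to the study's number, not a measured result):

```python
def max_speedup(offloaded_fraction: float) -> float:
    """Amdahl's law: upper bound on speedup when a fraction of
    total time is removed from the critical path entirely."""
    return 1.0 / (1.0 - offloaded_fraction)

# If CPU preprocessing takes 65% of epoch time and is fully
# offloaded to (or overlapped with) the GPU, an epoch could
# run up to ~2.86x faster in the ideal case:
print(f"{max_speedup(0.65):.2f}x")
```

In practice the gain is smaller, since the GPU now does extra work, but the calculation shows why a dominant preprocessing stage is worth attacking first.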
What causes Out Of Memory (OOM) errors?
Out-of-memory errors occur when the GPU lacks sufficient memory to handle the assigned task. They typically arise with large inputs such as high-resolution images, with overly large batch sizes, or when multiple processes run concurrently; the exact threshold depends on the GPU RAM available.
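As a rough illustration (the tensor shape and batch size below are made up for the example, not taken from the article), you can estimate a batch's input footprint before launching a job:

```python
def batch_bytes(batch_size: int, channels: int, height: int, width: int,
                bytes_per_element: int = 4) -> int:
    """Approximate memory for one batch of float32 image tensors.
    Ignores activations, gradients, and optimizer state, which often
    dominate -- so this is only a lower bound."""
    return batch_size * channels * height * width * bytes_per_element

# A batch of 256 RGB images at 1024x1024 resolution in float32:
gib = batch_bytes(256, 3, 1024, 1024) / 2**30
print(f"{gib:.1f} GiB just for the inputs")  # 3.0 GiB
```

If a number like this is already a sizable fraction of your card's RAM, an OOM error during training is almost guaranteed, and shrinking the batch or the input resolution is the first lever to pull.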
Suggested solutions for OOM
- Opt for a smaller batch size. Since the number of iterations corresponds to the batches needed for one epoch, reducing the batch size minimizes the data load the GPU must handle in memory during iterations. This is the most prevalent solution for OOM errors.
- If working with image data and applying transformations, consider a library like Kornia, which performs these transformations directly on the GPU.
- Evaluate your data loading methods. Instead of loading all data at once, consider using a DataLoader object, which combines a dataset and a sampler to provide iterable access over the dataset.
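The batching idea behind the last point can be sketched in plain Python (PyTorch's `torch.utils.data.DataLoader` does this for real datasets, adding shuffling and parallel workers on top):

```python
from typing import Iterator, List, Sequence

def batched(dataset: Sequence, batch_size: int) -> Iterator[List]:
    """Yield the dataset one batch at a time, so only batch_size
    samples need to be resident in memory per iteration, instead
    of the whole dataset at once."""
    for start in range(0, len(dataset), batch_size):
        yield list(dataset[start:start + batch_size])

samples = list(range(10))          # stand-in for a real dataset
for batch in batched(samples, 4):  # 3 iterations: sizes 4, 4, 2
    print(batch)
```

Lowering `batch_size` here directly lowers the peak memory per iteration, which is exactly why it is the most common fix for OOM errors.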
Command line tools for monitoring performance
nvidia-smi is a command-line utility that provides detailed information about your GPU’s performance metrics.
nvidia-smi
nvidia-smi, short for NVIDIA System Management Interface, is designed to simplify GPU monitoring and utilization tracking. Users can quickly obtain basic GPU information through this tool. The output displays GPU rank, name, fan speed, temperature, performance state, persistence mode, power draw and limits, and overall GPU utilization. A second output section shows per-process GPU memory consumption for running tasks.
Tips for using nvidia-smi
- Execute nvidia-smi -q -i 0 -d UTILIZATION -l 1 to show unit information and to monitor GPU usage and memory samples in real-time.
- Use the -f or --filename= flag to save command output to a specified file.
- Complete documentation can be accessed through the official NVIDIA site.
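For scripted monitoring, nvidia-smi also offers a machine-readable query mode (--query-gpu with --format=csv). Below is a minimal sketch that parses one such CSV row; the sample row is hard-coded because the parsing is the point here, but in practice it would come from running the command via subprocess on a machine with a GPU:

```python
def parse_gpu_csv(line: str) -> dict:
    """Parse one row of `nvidia-smi --query-gpu=utilization.gpu,memory.used
    --format=csv,noheader` output into numeric fields."""
    util_field, mem_field = (f.strip() for f in line.split(","))
    return {
        "utilization_pct": int(util_field.split()[0]),  # e.g. "87 %"
        "memory_used_mib": int(mem_field.split()[0]),   # e.g. "3154 MiB"
    }

# Example row (values made up for illustration):
sample = "87 %, 3154 MiB"
print(parse_gpu_csv(sample))
```

Logging rows like this once per second (the -l 1 flag above) gives you a time series you can plot against training progress.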
Glances
Glances provides another excellent approach for monitoring GPU activity. Unlike a single nvidia-smi invocation, which prints a static snapshot, Glances offers a continuously updating dashboard for process monitoring. This dynamic view can help identify potential performance problems, alongside relevant CPU utilization stats.
To install Glances, run the command:
pip install glances
To launch the monitoring dashboard, simply execute:
glances
For additional information, refer to the Glances documentation.
Other useful commands
Below are a few other built-in commands useful for monitoring processes on your machine, particularly focusing on CPU utilization:
- top – displays CPU processes and usage metrics.
- free – shows how much system memory is used and how much is available.
- vmstat – reports data related to processes, memory, paging, and CPU activity.
What to check to understand GPU performance in real time
- CPU usage: indicates the percentage of CPU utilization.
- Memory: the amount of system RAM in use, presented in GB.
- GPU memory (used): reveals the current GPU memory in use.
- GPU power draw: signifies the power consumption of the GPU in Watts.
- GPU temperature: shows the temperature of the GPU in degrees Celsius.
- GPU utilization: the percentage of time over the sample period during which one or more kernels were executing on the GPU.
- GPU memory utilization: indicates the active usage percentage of the memory controller.
Closing remarks
This article discussed the various tools available for monitoring GPU usage on both remote and local Linux systems.
Thanks for learning with the DigitalOcean Community. Explore our offerings for computing, storage, networking, and managed databases.