Have you ever experimented with generating images using AI? Achieving high-quality images often relies heavily on crafting a detailed prompt. Personally, I struggle with this aspect and tend to depend on large language models to formulate thorough prompts, which I then use to generate impressive images. For example:

  • Prompt: "Create a stunning aerial view of Bengaluru, India with the city name written in bold, golden font across the top of the image, with the city skyline and Nandi Hills visible in the background."

  • Prompt: "Design an image of the iconic Vidhana Soudha building in Bengaluru, India, with the city name written in a modern, sans-serif font at the bottom of the image, in a sleek and minimalist style."

  • Prompt: "Generate an image of a bustling street market in Bengaluru, India, with the city name written in a playful, cursive font above the scene, in a warm and inviting style."

To produce these images, we utilized the Flux.1 model for image generation and the Llama 3.1-8B-Instruct model for creating prompts. Both models are hosted on a single H100 machine using MIG (Multi-Instance GPU), which we will delve into later.

This blog is not intended as a straightforward tutorial on image generation. Instead, we aim to establish a scalable, secure, and globally accessible architecture for Generative AI (GenAI).

Consider a scenario where a global e-commerce platform must rapidly customize images for users or a content platform producing on-demand AI-generated text across different continents. Such a setup poses various challenges for developers. For instance:

  • GPUs can seem daunting and costly.
  • GenAI tools are at the cutting edge, each requiring specific configurations.
  • Securely connecting backend servers to GenAI servers can be complex.
  • Properly routing users worldwide to the nearest server is a significant hurdle.

This guide is designed to help you address these challenges systematically.

Prerequisites

Before you dive into the demo, ensure you have:

  • A DigitalOcean account.
  • A fundamental understanding of GPUs, cloud networking, VPCs, and load balancing.
  • Basic familiarity with Bash, Docker, and Python.
  • Access to the Flux.1 image generation model on Hugging Face.
  • Access to the Llama 3.1-8B-Instruct model.
  • The vLLM library, which simplifies LLM inference and serving (a minimal usage sketch follows this list).
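
To get a feel for what vLLM provides, here is a minimal sketch of offline inference with its Python API. The model name matches the one used in this guide; the prompt and sampling settings are only illustrative.

```python
# Minimal vLLM offline-inference sketch. Requires a GPU and access to
# the gated meta-llama/Llama-3.1-8B-Instruct weights on Hugging Face.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(
    ["Write a short, vivid image-generation prompt about Bengaluru, India."],
    params,
)
print(outputs[0].outputs[0].text)
```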

High-Level Design

To bring this architecture to life, we designed a distributed system leveraging DigitalOcean’s infrastructure. Our approach begins with a Global Load Balancer (GLB) that manages incoming requests, ensuring minimal latency for users in any region.

Lightweight image generation applications are deployed in three strategic locations (London, New York, and Sydney), each with its own cache to cut down on repeat round trips to the GPU. All components communicate securely over VPC Peering, sending the heavy generation work back to our H100 GPU powerhouse in Toronto.

Components

Lightweight Image Generation App

The Image Generation app is a simple Python Flask application with three primary components (a condensed sketch follows the list):

  1. Detect Location Section: This component makes a dummy request from the browser to the server to identify the user’s location (city and country) and determine which server region handles the request. This location info is displayed to the user and aids in optimizing prompt and image generation.

  2. Prompts Dropdown Section: Once the user’s location is determined, the app checks a cache for existing prompts associated with that area. If it finds relevant prompts, it displays them in a dropdown menu for the user to select. If no cached prompts are available, the app queries the LLM to generate new prompts, which are then cached for later use.

  3. Generated Image Section: When the user selects a prompt, the app first verifies if an image from that prompt is cached. If it exists, the cached image is served, ensuring quicker response times. If no cached image is present, an API call is made to generate a new image, which will also be cached for future requests.
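
Below is a condensed sketch of how these three components might fit together. The route names, in-memory caches, private IP, and the Flux.1 server's /generate endpoint are all illustrative assumptions, not the exact application.

```python
# Condensed sketch of the regional Flask app. Route names, cache
# structure, IPs, and the Flux.1 endpoint are illustrative assumptions.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

REGION = "lon1"  # set per deployment, e.g. lon1 / nyc3 / syd1
LLM_URL = "http://10.100.0.2:8000/v1/chat/completions"  # GPU droplet, private VPC IP
IMAGE_URL = "http://10.100.0.2:8001/generate"           # hypothetical Flux.1 route

prompt_cache = {}  # city -> list of prompts
image_cache = {}   # prompt -> PNG bytes

@app.route("/location")
def location():
    # The browser's "dummy" request lands here; a GeoIP lookup would map
    # the client IP to a city and country. We also report the serving region.
    ip = request.headers.get("X-Forwarded-For", request.remote_addr)
    return jsonify({"ip": ip, "region": REGION})

@app.route("/prompts")
def prompts():
    city = request.args.get("city", "Bengaluru")
    if city not in prompt_cache:  # cache miss: ask the LLM over VPC peering
        resp = requests.post(LLM_URL, json={
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "messages": [{"role": "user",
                          "content": f"Write three detailed image prompts about {city}."}],
        }, timeout=60)
        text = resp.json()["choices"][0]["message"]["content"]
        prompt_cache[city] = [line for line in text.splitlines() if line.strip()]
    return jsonify(prompt_cache[city])

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.json["prompt"]
    if prompt not in image_cache:  # cache miss: call the Flux.1 server
        resp = requests.post(IMAGE_URL, json={"prompt": prompt}, timeout=300)
        image_cache[prompt] = resp.content
    return image_cache[prompt], 200, {"Content-Type": "image/png"}
```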

MIG GPU Component

The MIG (Multi-Instance GPU) feature of NVIDIA GPUs such as the H100 allows a single physical GPU to be divided into multiple independent instances, known as MIG slices. Each slice operates as an isolated GPU with its own compute, memory, and bandwidth resources. This lets us deploy the image generation and prompt generation models side by side while making efficient use of the GPU.
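
In practice, pinning a process to one slice is just a matter of exposing only that slice's UUID before CUDA initializes. A minimal sketch (the UUID is a placeholder; list the real ones with nvidia-smi -L):

```python
# Pin this process to a single MIG slice before any CUDA code runs.
# The UUID below is a placeholder; list real ones with `nvidia-smi -L`.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

import torch  # imported after the env var is set, so it sees only that slice

print(torch.cuda.device_count())      # 1: just this process's slice
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA H100 80GB HBM3 MIG 3g.40gb"
```

With Docker, the same pinning is done by passing the slice UUID to the --gpus flag when starting each container.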

Step-by-Step Setup

To establish this infrastructure, follow these steps:

  1. Spin Up the GPU Droplet: Create a GPU Droplet on DigitalOcean with a single H100 GPU using a pre-built OS image for ML development.

  2. Enable MIG on the H100 GPU: After the GPU Droplet is operational, enable MIG mode on the H100 to partition it into multiple isolated instances, which is essential for running the two models concurrently.

  3. Choose the MIG Profile and Create Instances: With MIG enabled, select suitable profiles for each model you will run.

  4. Set Up Docker Containers on Each MIG Instance: For every MIG instance, run a separate Docker container for each model. This entails downloading and deploying two Docker images: one for image generation (Flux.1-schnell) and one for prompt generation (Llama 3.1 using vLLM).

  5. Deploy Flux.1-schnell for Image Generation: Build or pull the Flux.1-schnell Docker image from its repository and run it on its assigned MIG instance.

  6. Deploy Llama 3.1 using vLLM for Prompt Generation: Pull and run the Docker container that serves the prompt generation model via vLLM. A short smoke test for both services follows these steps.
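
Once both containers are running, a quick sanity check is to call each service in turn. A minimal sketch, assuming vLLM serves its OpenAI-compatible API on port 8000 and the Flux.1 container exposes a simple /generate route on port 8001; the host, ports, and route are assumptions to adapt to your setup.

```python
# Smoke test for the two MIG-hosted services (host, ports, and the
# Flux.1 route are assumptions; adjust them to match your containers).
import requests

GPU_HOST = "10.100.0.2"  # private VPC IP of the Toronto GPU droplet (placeholder)

# 1. Ask the Llama 3.1 server (vLLM's OpenAI-compatible API) for a prompt.
chat = requests.post(f"http://{GPU_HOST}:8000/v1/chat/completions", json={
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user",
                  "content": "Write one detailed image prompt about Bengaluru, India."}],
}, timeout=60).json()
prompt = chat["choices"][0]["message"]["content"]
print("Generated prompt:", prompt)

# 2. Feed that prompt to the Flux.1 container and save the result.
image = requests.post(f"http://{GPU_HOST}:8001/generate",
                      json={"prompt": prompt}, timeout=300)
with open("test.png", "wb") as f:
    f.write(image.content)
print("Saved test.png")
```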

Regional App Instances

The lightweight apps are set up in regional locations (London, New York, Sydney) to efficiently manage user requests, caching commonly accessed prompts and images for faster response times.
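
The cache itself does not need to be elaborate; a small in-process TTL cache is enough for a demo. A sketch follows (the TTL values are arbitrary choices, and a shared store such as Redis would be the natural next step per region):

```python
# Minimal in-process TTL cache sketch; TTL values are arbitrary choices.
import time

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expiry_timestamp, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[0] > time.time():
            return entry[1]
        self.store.pop(key, None)  # drop expired or missing entries
        return None

    def set(self, key, value):
        self.store[key] = (time.time() + self.ttl, value)

prompt_cache = TTLCache(ttl_seconds=3600)   # prompts refresh hourly
image_cache = TTLCache(ttl_seconds=86400)   # images kept for a day
```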

VPC Peering

Establishing VPC Peering provides secure, low-latency communication between the regional app instances and the GPU server in Toronto over a private network.

Global Load Balancer (GLB)

The GLB distributes incoming user requests to the closest regional app instance, optimizing latency and enhancing user experience.

Conclusion

This setup offers a practical foundation for businesses and developers interested in distributed GenAI solutions, whether it be a global e-commerce platform generating custom content or AI-driven content services for a worldwide audience. By employing DigitalOcean’s offerings, this architecture demonstrates how to balance scalability, security, and cost-effectiveness when deploying state-of-the-art AI services.

