Julia Kreger discusses the critical balance between various forces and the significance of Bare Metal in specific situations, alongside the latest trends influencing today’s hardware capabilities and the pivotal role of automation.
Julia Kreger has an extensive career in both solving complex problems and leading teams. She is a Senior Principal Software Engineer at Red Hat, focusing on the deployment and management of Bare Metal systems within OpenStack. She also chairs the board of directors of the Open Infrastructure Foundation, which provides a home for open source infrastructure projects such as OpenStack.
Software is reshaping society. QCon San Francisco fosters the development of software by promoting the dissemination of knowledge and innovation within the developer community. Designed for technical team leaders, architects, engineering directors, and project managers, QCon is a practitioner-driven conference that encourages innovation within teams.
Kreger shares an anecdote: In 2017, her manager contacted her about supporting an event in the Midwest. As any engineer would, she inquired about the specifics. It turned out to be the International Collegiate Programming Contest in Rapid City, South Dakota, just two weeks away.
I groaned, as most people do, because two weeks’ notice on travel is painful. I said, ok, I’m booking. Two weeks later, I landed in Rapid City, two days early. Our hosts at the School of Mines who were hosting the International Collegiate Programming Contest wanted us to meet each other. They actually asked for us to be there two days early. They served us home-style cooking in a conference hall for like 30 people. It was actually awesome, great way to meet people. I ended up sitting at a random table. I asked the obvious non-professor to my right, what do you do? As a conversation starter. He responded that he worked in a data center in Austin, which immediately told me he was an IBM employee as well.
Then he continued to talk about what he did. He said that he managed development clusters of 600 to 2000 bare metal servers. At which point I cringed, because I had a concept of the scale and the pain involved. Then he added that these clusters were basically being redeployed at least every two weeks. I cringed more. The way he was talking, you could tell he was just not happy with his job. It was really coming through.
It was an opportunity to have that human connection where you’re learning about someone and gaining more insight. He shared how he’d been working 60-hour weeks. He lamented how his girlfriend was unhappy with him because they weren’t spending time together, and all those fun things. All rooted in having to deploy these servers with thumb drives. Because it would take two weeks to deploy a cluster, and then the cluster would have to get rebuilt.
Then he shifted gears, he realized that he was making me uncomfortable. He started talking about a toolkit he had recently found. One that allowed him to deploy clusters in hours once he populated all the details required. Now his girlfriend was no longer upset with him. How he was actually now happy, and how his life was actually better. Then he asked me what I did for IBM. As a hint, it was basically university staff, volunteers, and IBM employees at this gathering.
I recounted my involvement in open source communities and my focus on systems automation. Having been in environments filled with hardware racks, I was familiar with the challenges of dealing with bare metal directly. A turning point in our conversation came when he abruptly became overwhelmed with joy. His smile broadened tremendously as he realized he was speaking to the person whose toolkit had significantly eased his workload.
For me, witnessing the positive impact of my work on someone firsthand was unforgettable. The expression of relief and happiness on his face is a lasting memory and continues to inspire me. This motivation is a big part of why I pursue automating these challenging aspects—to prevent the spread of this common pain.
You might be wondering who I am. I am Julia Kreger, a Senior Principal Software Engineer at Red Hat and chair of the board of directors at The OpenInfra Foundation. Over the last decade, I have been dedicated to automating the deployment of physical bare metal machines for various applications. The technologies I work with are incorporated into Red Hat OpenStack and Red Hat OpenShift to facilitate these deployments, tailored to our clients’ needs.
The technologies involved are quite fascinating and offer extensive versatility. It all depends on how you utilize them and the extent of their integration. I am eager to discuss the reasons behind focusing on bare metal, the current market trends, ongoing shifts in computing technology, and three useful tools for tackling these advancements.
We’re in the age of cloud. I don’t think that’s disputed at this point. Why bare metal? The reality is, the cloud has been in existence for 17 years, at least public cloud as we know it today. When you start thinking about what the cloud is, it’s existing technologies, with abstractions and innovations that help make new technologies, all in response to market demand, on other people’s computers.
Why was it a hit? It increased our flexibility. We got self-service, on demand. We weren’t ordering racks of servers, waiting months to get them, and then having to do the setup ourselves anymore. This enabled a shift from a Cap-Ex operating model for businesses to an Op-Ex model. How many people actually understand what Cap-Ex and Op-Ex are? They are shorthand for capital expense and operational expense.
Capital expense is an asset that you have, that you will maintain on your books in accounting. Your auditors will want to see it occasionally. You may have to pay taxes on it. Basically, you’re dealing with depreciation of value. At some point, you may be able to sell it and regain some of that value or may not be able to depending on market conditions. Whereas Op-Ex is really the operational expense of running a business. Employees are operational expenses, although there are some additional categories there of things like benefits.
In thinking about it, I loaded up Google Ngrams, just to model it mentally, because I’ve noticed this shift over time. Looking at the graph of the data through 2019, which is all that’s loaded in Google Ngrams right now, unfortunately, we can see delayed spikes from the various booms in the marketplace and shifts in the market response.
It’s interesting to note the shift where businesses no longer prioritize capital expenditures. While the trend moves towards cloud solutions, traditional capital expenditure models remain essential for certain businesses, especially those with a century of experience in managing such assets in a way that minimizes pain. The necessity to manage data and operations on-premises or in dedicated data centers often arises from stringent security requirements—such as not allowing data to cross a designated physical boundary and restricting access to the facility.
Furthermore, governance plays a critical role in this decision-making process. Legal obligations, whether through contracts with vendors or with clients, might restrict some businesses from migrating to cloud services. Data sovereignty attracts attention because it drives the decision to maintain private data centers and use bare metal infrastructure, even when cloud orchestration technologies are layered on top. Additionally, concerns about data crossing national boundaries and the need for low latency in high-performance tasks such as fluid dynamics simulations, where reliable and repeatable results are critical, often necessitate the use of private data centers.
Market dynamics continually evolve. In an exploration using Google Ngrams, I observed notable increases in terms related to the ‘gig economy’, which reflect changes in our current economic framework. Similarly, themes like ‘economic bubble’ are gaining attention, indicated by a slight rise in their graph representation. Data sovereignty shows patterns nearly opposite to those observed for capital expenditures (Cap-Ex) and operational expenditures (Op-Ex). Additionally, the relevance of edge computing becomes apparent when considering the requirements of self-driving cars, which necessitate rapid processing and low-latency communication with nearby systems.
If a self-driving car needs to apply the brakes, it typically has a decision time of just 30 milliseconds. Literature from the past decade illuminates some shifts in driver behavior that reflect these realities.
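To put the 30-millisecond figure in perspective, a short back-of-the-envelope sketch can convert that time budget into distance travelled. The speeds used here are illustrative assumptions, not figures from the talk.

```python
def distance_travelled_m(speed_kmh: float, budget_ms: float) -> float:
    """Metres covered at a constant speed during a time budget."""
    metres_per_second = speed_kmh * 1000 / 3600
    return metres_per_second * (budget_ms / 1000)


if __name__ == "__main__":
    # Speeds are assumed example values; 30 ms is the budget mentioned above.
    for speed_kmh in (50, 100, 130):
        metres = distance_travelled_m(speed_kmh, 30)
        print(f"{speed_kmh} km/h -> {metres:.2f} m in 30 ms")
```

Any network round trip has to fit inside that same budget, which is why the processing needs to sit close to the vehicle.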
Shifts are also ongoing in the marketplace. An essential point to remember is the constant evolution of computers. Every day, they grow more complex with some vendors introducing new processor features and others adding specialized networking chips. Despite these advancements, the fundamentals of a computer in a data center remain largely unchanged—a box connected by a management network cable, a data path network cable, and hosting applications on an operating system.
However, there has been a noteworthy change towards using specialized, cost-effective hardware for specific domain problems. Technologies like GPUs and FPGAs are increasingly employed to handle specific portions of computational tasks that address particular domain issues. Meanwhile, the diversification of architectures is becoming more common, which can sometimes introduce complexity into the system.
For instance, an ARM system may appear standard until it is combined with a GPU whose firmware is x86-based, at which point the ARM cores and firmware have to work out how to initialize the device. The hidden layer of complexity is that this is done by quietly launching a VM within the substrate, invisible to the operating system. Essentially, another Linux system operates silently alongside the primary system, managing the card initialization.
Currently, there is an evolving trend toward specialized units such as data processing units (DPUs) or infrastructure processing units (IPUs). These systems are more versatile and encompassing than their predecessors. Think of the traditional setup: a network card on the PCIe bus, which is essentially a generic box built around an ASIC. Beyond this basic setup, modern servers can now integrate more complex components.
These sophisticated devices might include AI accelerator ASICs, FPGAs, and additional programmable networking ASICs. They are essentially small computers within the main computer, equipped with their own operating systems and applications housed within a management framework akin to that of the host server. The concept of computers within computers is burgeoning, reshaping how we think about server infrastructure.
Meanwhile, applications continue to run on the main host using GPUs and full-fledged operating systems. Devices such as DPUs or IPUs connect to the host and interface with its operating system through PCIe ports. However, the host’s operating system remains unaware of the activity within the cards themselves, as they operate autonomously, running distinct workloads. This architecture complicates management networking, generally requiring at least two management connections per host until standards for in-band access mature.
To elaborate further on practical applications, it’s worth mentioning that these devices are increasingly used for tasks such as load balancing and request routing. They aren’t limited to traditional web server load balancing but can also manage database connections and sharding, effectively directing traffic to ensure efficient data management and access.
In that case, the underlying host that the card is plugged into and draws power from never sees an interrupt from the transaction. It’s all transparent to it. Another popular use case is as a security isolation layer: run something like an AI-enabled firewall on the card while an untrusted workload runs on the machine. Then there is second-stage processing. You may have a card or port where you’re taking data in and dropping 90% of it, sending data to the main host only if it’s pertinent and makes sense to process.
This parallels the strategies employed by major data processing entities, which utilize an initial filter to discard 90% of incoming data that is not actionable or analytically beneficial. The small portion that does pass through undergoes further scrutiny, eventually leaving only about 1% of that data to be utilized, akin to methodologies observed in cellular networks.
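The two-stage pattern described above can be sketched as a small pipeline. This is a minimal illustration, not how any particular card implements it: the function names, the record shape, and the `severity` field are all hypothetical, and in a real system the first stage would run on the DPU or IPU rather than in the same process.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, int]


def card_stage(records: Iterable[Record],
               is_pertinent: Callable[[Record], bool]) -> Iterator[Record]:
    """First stage (stand-in for the card): forward only pertinent records."""
    for record in records:
        if is_pertinent(record):
            yield record


def host_stage(records: Iterable[Record]) -> List[Record]:
    """Second stage (stand-in for the host): process what was forwarded."""
    return [dict(record, processed=1) for record in records]


def run_pipeline(records: Iterable[Record],
                 is_pertinent: Callable[[Record], bool]) -> List[Record]:
    """Chain the two stages so the host only ever sees filtered data."""
    return host_stage(card_stage(records, is_pertinent))


if __name__ == "__main__":
    # Hypothetical input: 100 records, of which one in ten is pertinent.
    sample = [{"id": i, "severity": i % 10} for i in range(100)]
    kept = run_pipeline(sample, lambda r: r["severity"] == 0)
    print(f"received {len(sample)} records, host processed {len(kept)}")
```

The point of the split is that the host never spends cycles, or takes interrupts, on the records the first stage discards.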
It is possible to have a setup where a radio transmits data, yet the operating system only perceives it as Ethernet packets emitted from a network port, remaining oblivious to the actual process. With the emergence of concealed computing infrastructures, these types of cards, likely operational in servers across this city, deserve vigilant oversight and maintenance. Efforts to standardize these components’ interfaces and management models are being conducted through the OPI project. For those interested, more information can be found at https://opiproject.org.
Automation becomes significant when you consider a bug found in one of these devices. The device sits within a security layer that is inaccessible from the primary host, which raises the question of how updates can be applied. Reflecting back, one engineer tackled related challenges with early versions of such cards embedded in his hardware.
He found it extremely taxing to manually update the firmware by removing the card, attaching it to a special card, and inserting a USB drive. These tasks were infrequent, yet agonizing. Nowadays, there is a growing trend towards remote orchestration of these devices via their management ports and networks. This approach is often preferred as it reduces the risk associated with an untrusted workload that could endanger the entire system.
The necessity for automation becomes apparent when managing these devices at scale. Since these cards rely on power from the main host, shutting down the host also powers down the card, which disrupts the booting process of the operating system. Therefore, continuous operation of these cards is crucial.
Several tools are available for managing these issues, and I will discuss three of them. The first is the Ironic project, which is the most comprehensive and feature-rich among them. The other two are Bifrost and Metal3.
Initiated in 2012 within OpenStack, Ironic provides a scalable Bare Metal as a Service platform. It uses a state machine to model data center operations and their associated workflows, which is essential for efficiently deploying server racks in data centers. These workflows can be shaped by business processes or by the sequential steps of the deployment process itself. The aim has been to systematize much of this through a service over time.
This service is accessed via a REST API and is comprehensive in functionality. Ironic can also be deployed independently of OpenStack. It supports management protocols such as DMTF Redfish and IPMI, along with vendor interfaces including iLO, iRMC, and Dell iDRAC. It also provides a stable driver interface for extensions by vendors.
One of the things we see a lot of is use of virtual media to enable these deployments of these machines in edge use cases. Think cell tower on a pole, as a single machine, where the radio is on one card, and we connected into the BMC, and we have asserted a new operating system. One of the other things that we do as a service is we ensure the machine is in a clean state prior to redeployment of the machine, because the whole model of this is lifecycle management. It’s not just deployment. It’s, be able to enable reuse of the machine.
This is the Ironic State Machine diagram. This is all the state transitions that Ironic is aware of. Only those operating Ironic really need to have a concept of this. We do have documentation, but it’s quite a bit.
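For readers without the diagram, a heavily simplified sketch of the idea follows. It covers only four stable states and four of the verbs used to move between them; the real state machine has many more states, including intermediate ones such as cleaning and deploying that are only noted in comments here. The helper names `next_state` and `walk` are hypothetical and are not part of Ironic.

```python
# Simplified subset of a provisioning state machine, for illustration only.
# Keys are (current stable state, requested verb); values are the resulting
# stable state. Intermediate states are omitted.
TRANSITIONS = {
    ("enroll", "manage"): "manageable",      # management credentials verified
    ("manageable", "provide"): "available",  # passes through cleaning
    ("available", "active"): "active",       # passes through deploying
    ("active", "deleted"): "available",      # torn down, then cleaned again
}


class InvalidTransition(Exception):
    """Raised when a verb is not allowed from the current state."""


def next_state(current: str, verb: str) -> str:
    """Return the state reached by applying one verb, or raise."""
    try:
        return TRANSITIONS[(current, verb)]
    except KeyError:
        raise InvalidTransition(f"cannot '{verb}' from '{current}'") from None


def walk(start: str, verbs: list) -> str:
    """Apply a sequence of verbs and return the final state."""
    state = start
    for verb in verbs:
        state = next_state(state, verb)
    return state


if __name__ == "__main__":
    # One full lifecycle: enroll the machine, deploy it, then hand it back.
    print(walk("enroll", ["manage", "provide", "active", "deleted"]))
```

Modelling the lifecycle as explicit transitions is what lets the service refuse unsafe requests, such as deploying to a machine that has not been cleaned.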
Then there’s Bifrost, which happened to be the tool that that engineer that I sat next to in Rapid City had stumbled upon. The concept here was, I want to deploy a data center with a laptop. It leverages Ansible with an inventory module and playbooks to drive a deployment of Ironic, and drive Ironic through command sequences to perform deployments of machines. Because it’s written, basically, in Ansible, it’s highly customizable.
For example, I might have an inventory payload. This is YAML. As an example, the first node is named node0. We’re relying on some defaults here of the system. Basically, we’re saying, here’s where to find it. Here’s the MAC address so that we know the machine is the correct machine, and we don’t accidentally destroy someone else’s machine. We’re telling what driver to also use. Then we have this node0-subnode0 defined in this configuration with what’s called a host group label.
Bifrost has a useful feature for controlling where execution is directed. As the inventory is processed, additional group labels can be assigned per node as required. This is particularly useful when you need to deploy to subnodes or run specific actions on them, such as installing software on IPU or DPU devices. Work is ongoing in Ironic toward a more standardized approach to managing DPUs; it is still a work in progress, and the initial release was made recently. Because these IPUs and DPUs typically use ARM processors, our example uses a designated RAM disk and image for writing to the IPU’s block device storage.
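Since the slide with the inventory is not reproduced here, the following sketch shows roughly what such an inventory could look like, built as a Python dictionary and printed as JSON. Every key name and value below is an illustrative assumption rather than a verified Bifrost schema: the addresses, MAC addresses, image URLs, and the `subnodes` group label are placeholders, and the project's documentation remains the authority on the real format.

```python
import json


def build_inventory() -> dict:
    """Return a hypothetical two-entry inventory: a host and its subnode."""
    return {
        "node0": {
            "name": "node0",
            "driver": "ipmi",  # which management driver to use
            "driver_info": {"ipmi_address": "192.0.2.10"},  # where to find it
            "nics": [{"mac": "52:54:00:00:00:01"}],  # guards against the wrong machine
        },
        "node0-subnode0": {
            "name": "node0-subnode0",
            "driver": "redfish",
            "driver_info": {
                "redfish_address": "https://192.0.2.11",
                # Assumed: an ARM RAM disk, since the card is ARM-based.
                "deploy_ramdisk": "http://192.0.2.1/ramdisk-aarch64.initramfs",
            },
            "instance_info": {"image_source": "http://192.0.2.1/ipu-image.raw"},
            "nics": [{"mac": "52:54:00:00:00:02"}],
            "host_groups": ["subnodes"],  # hypothetical host group label
        },
    }


def nodes_in_group(inventory: dict, group: str) -> list:
    """List node names carrying the given host group label."""
    return [
        name
        for name, node in inventory.items()
        if group in node.get("host_groups", [])
    ]


if __name__ == "__main__":
    inventory = build_inventory()
    print(json.dumps(inventory, indent=2))
    print(nodes_in_group(inventory, "subnodes"))
```

The group label is what allows a playbook to target only the subnodes, separately from the hosts they are plugged into.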
A sample playbook can then be executed. It has two primary steps that run against the nodes defined as bare metal in the inventory. The first step creates configuration drives containing metadata that guides the machine’s boot-up process, including origins and destinations, and potentially SSH keys or other credentials. The second step performs the deployment, using the variables from the inventory to drive Ironic through its API. A subnode, distinguished by a specific host group label, can be targeted directly in the same way.
Metal3 is deployed within Kubernetes clusters, where it hosts a local Ironic instance and translates Cluster API and bare metal custom resource updates into the provisioning of new bare metal nodes. It can apply configuration such as BIOS and RAID settings and deploy an operating system. Customizing it beyond that requires changes to the code of Metal3’s Bare Metal Operator.
A typical custom resource update in this environment works as follows. A secret is created first, followed by the custom resource that describes the machine and its operational status, including the BMC address, MAC address, desired image, and that image’s checksum. For user data, metadata already present in the system is referenced to drive the deployment. The Bare Metal Operator then reads the custom resource, works out the necessary actions, and carries them out through the Ironic API hosted locally in a pod, delivering deployed bare metal servers to end users. This makes it possible to grow or shrink a local Kubernetes cluster with the operator, managing a fleet of bare metal machines as required.
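The secret and custom resource described above can be sketched as two manifests, built here as Python dictionaries and printed as JSON. The field names follow my understanding of the Metal3 `BareMetalHost` resource and should be checked against the project's API documentation; the host name, credentials, addresses, and URLs are placeholders, and `build_manifests` is a hypothetical helper.

```python
import base64
import json


def _b64(text: str) -> str:
    """Kubernetes Secret data values are base64-encoded strings."""
    return base64.b64encode(text.encode()).decode()


def build_manifests(name: str, bmc_address: str, mac: str,
                    image_url: str, checksum: str) -> list:
    """Return [secret, host]: the BMC credentials first, then the host."""
    secret_name = f"{name}-bmc-secret"
    secret = {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": secret_name},
        "type": "Opaque",
        # Placeholder credentials, for illustration only.
        "data": {"username": _b64("admin"), "password": _b64("change-me")},
    }
    host = {
        "apiVersion": "metal3.io/v1alpha1",
        "kind": "BareMetalHost",
        "metadata": {"name": name},
        "spec": {
            "online": True,
            "bootMACAddress": mac,
            "bmc": {"address": bmc_address, "credentialsName": secret_name},
            "image": {"url": image_url, "checksum": checksum},
            # Points at user data that already exists in the cluster.
            "userData": {"name": f"{name}-user-data", "namespace": "default"},
        },
    }
    return [secret, host]


if __name__ == "__main__":
    manifests = build_manifests(
        name="worker-0",
        bmc_address="redfish://192.0.2.20/redfish/v1/Systems/1",
        mac="52:54:00:00:00:10",
        image_url="http://192.0.2.1/images/os.qcow2",
        checksum="http://192.0.2.1/images/os.qcow2.sha256sum",
    )
    print(json.dumps(manifests, indent=2))
```

Creating the secret first matters because the host resource refers to it by name, so the operator can only reach the BMC once both exist.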
There’s a complex future looming for bare metal servers. These servers will still need to be managed, and as their prevalence increases, so will the need for effective bare metal management and orchestration.
Participant 1: In my role, I engage with numerous clients who deploy on-premise Kubernetes clusters. Typically, we invest heavily in VMware to simplify management, which seems excessive considering the elastic capabilities provided by Kubernetes. The pertinent question is whether this virtualization layer is truly necessary given the advancements in hardware management tools.
Kreger: The issue you’re raising, while significant, is somewhat off the point I’m trying to make concerning IPUs and DPUs. Kubernetes is primarily optimized for cloud environments rather than on-premise setups. From my perspective at Red Hat, we’ve heavily invested in adapting OpenShift, which is based on Kubernetes, to function efficiently on-premise without relying on a virtualization layer. This adaptation was far from straightforward, particularly because certain original code components assumed the constant availability of services like the Amazon metadata service, which isn’t the case on-premise.
Participant 2: From what I gathered, management protocols such as Redfish or IPMI might interact with DPUs or IPUs via an external management interface hosted by the server. Is there any consideration for devising a new protocol instead of relying on these traditional, possibly outdated protocols?
Kreger: The emerging consensus right now is to use Redfish. One of the things that does also exist and is helpful in this is there’s also consensus of maybe not having an onboard additional management controller, baseboard management controller style device in these cards. We’re seeing some consensus of maybe having part of it, and then having NC-SI support, so that the system BMC can connect to it and reach the device.
One of the things that’s happening in the ecosystem with 20-plus DPU vendors right now, is they are all working towards slightly different requirements. These requirements are being driven by market forces, what their customers are going to them and saying, we need this to have a minimum viable product or to do the needful. I think we’re always going to see some variation of that. The challenge will be providing an appropriate level of access for manageability. Redfish is actively being maintained and worked on and improved. I think that’s the path forward since the DMTF has really focused on that. Unfortunately, some folks still use IPMI and insist on using IPMI. Although word has it from some major vendors that they will no longer be doing anything with IPMI, including bug fixes.
Participant 2: How do you view the intersection of hardware-based security devices with these IPU, DPU platforms, because a lot of times they’re joined at the hip with the BMC. How is that all playing out?
Kreger: I don’t think it’s really coming up. Part of the problem is, again, it’s being driven by market forces. Some vendors are working in that direction, but they’re not talking about it in community. They’re seeing it as value add for their use case and model, which doesn’t really help open source developers or even other integrators trying to make complete solutions.
Aug 23, 2024