Joni Collinge discusses the design and execution strategies behind the Diagrid Cloud platform, covering design choices, compromises, and lessons learned, and highlighting the use of Kubernetes, Dapr, and other cloud-native services.
As a Founding Software Engineer at Diagrid, Joni Collinge develops multi-cloud solutions for managing extensive Dapr-based microservices in production. His previous experience includes over ten years at Microsoft designing, building, and operating robust cloud services. Joni focuses on addressing business problems with efficient, open-source, economical, and sustainable solutions.
QCon London is dedicated to enabling advancements in software development by promoting knowledge exchange and innovation among developers. It is a conference aimed at technical team leads, architects, engineering directors, and project managers who play a role in fostering innovation within their teams.
Collinge opens with a discussion of building SaaS from the ground up using cloud-native patterns, framed as a close examination of a cloud startup. He begins with a brief history starting in late 2018, featuring Mark Fussell and Yaron Schneider of Microsoft's R&D team. Their early brainstorming led first to the KEDA project (Kubernetes Event-Driven Autoscaler) and, as they explored how to make enterprise developers building distributed, cloud-native applications more productive, evolved into the Dapr project, the Distributed Application Runtime. The project aspires to embed best practices such as resilience, common abstraction patterns, security, and observability into a framework that is easy for developers to adopt while remaining cloud-neutral, language-independent, and framework-independent.
This is where I join the story. My name is Joni Collinge. At the time, I was also an engineer at Microsoft, where for eight years I had watched enterprise developers repeatedly tackle the same challenges on the Azure platform. I was captivated by the value proposition of Dapr, and became an open-source maintainer and contributor to the project almost immediately. As the project evolved, Microsoft adopted an open governance model and contributed Dapr to the CNCF, where it achieved incubation status. This marked a significant uptick in enterprise adoption of the project. Observing this, Mark and Yaron committed fully to empowering developers to create distributed systems and founded a new company named Diagrid. Unexpectedly, they invited me to join as a founding engineer to build cloud services in pursuit of this mission. After some persuasion, I accepted and joined Diagrid. Our initial vision involved two primary services: the first, Conductor, aimed at managing Dapr installations on users' Kubernetes clusters; the second, Catalyst, planned to provide fully serverless Dapr APIs, enhanced with infrastructure solutions and additional valuable features not available through the open-source project.
Today’s presentation begins here, as we delve into the specifics of cloud components. Often referred to as a mysterious “black box,” clouds are, in fact, collections of common patterns utilized across various services, contrary to the “secret sauce” perception maintained by major cloud providers. This session aims to demystify these components and inspire others to share their experiences solving similar challenges.
For a SaaS provider, a cloud platform is the foundational infrastructure through which services are delivered to end-users, enabling seamless adoption across your offering. It encompasses elements like self-service, multi-tenancy, and scalability; today's focus is on five aspects: self-service, multi-tenancy, scalability, extensibility, and reliability. Some may wonder whether this presentation borders on platform engineering. While cloud engineering and platform engineering overlap, the crucial difference is that the former serves external end-users, enabling them to seamlessly integrate your services, whereas the latter supports the internal developers who create those services.
When discussing cloud platforms, names like GCP, AWS, or Azure might come to mind. At Diagrid, however, our goal was to build a higher-level cloud targeting developers rather than merely providing infrastructure: we wanted to offer patterns and abstractions directly usable by developers, rather than handing them raw infrastructure such as Kafka to provision, as traditional infrastructure teams do. Given our startup status and limited resources, we planned to leverage existing cloud infrastructure. Our approach was cloud-agnostic, keeping our services as portable and adaptable as possible: Kubernetes for compute abstraction, MySQL for databases, and Redis for caching and streaming.
Consider the process of using cloud services: a common scenario is initiating an SSH or RDP session to a cloud-hosted VM. We understand that this VM operates within a hypervisor on a server, in a data center located within a region defined by the cloud provider. The VM originates from an interaction with the cloud provider's central service, accessible via a webpage, CLI, or SDK, where one opts for a specific region from several global options. This demonstrates the global service's ability to deploy resources wherever the user selects. At Diagrid, this process is governed by what we refer to as a cloud control plane, which manages regional data planes.
At Diagrid Cloud, for instance, our Catalyst service, which is a serverless API service, adheres closely to this model. Here, a centralized control plane allows deploying infrastructure across multiple regional data planes, accessible to users. Another service, Conductor, places infrastructure management directly in the hands of users who create what is known as a cluster connection, configuring and deploying Dapr installation in their own Kubernetes clusters, managed remotely via our control plane. These examples illustrate different operational models supported within Diagrid Cloud.
Broadly, cloud resource administration involves a centralized control plane orchestrating regional data planes that provide user-facing services. Notably, our primary compute platform is Kubernetes, which means multiple Kubernetes clusters: some within the control plane and others across the regional data planes. Multi-cloud strategies demand flexibility at the control plane and portability at the data plane to accommodate specific customer needs around data localization and compliance, which are crucial for enterprise-level service.
This control plane typically interfaces through an API gateway, facilitating essential functionalities like authentication, authorization, audit, and routing to various services. Despite the availability of multiple vendors offering API gateways, this infrastructure is largely standardized across cloud providers. The control plane’s complexity might lead one to visualize it as a monolith, yet as it scales, it evolves, potentially adopting a cellular architecture, partitioning services to cater to specific tenants or regions primarily for data sovereignty and latency concerns rather than scalability.
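To make the gateway's role concrete, here is a minimal Go sketch of the kind of middleware chain such a gateway applies: authentication, audit logging, and path-based routing to backend services. The handler names, paths, and the bearer-token check are illustrative assumptions, not Diagrid's actual gateway.

```go
package main

import (
	"log"
	"net/http"
	"strings"
)

// authn rejects requests without a bearer token; a real gateway would
// validate the token against an identity provider.
func authn(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.HasPrefix(r.Header.Get("Authorization"), "Bearer ") {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// audit records who called what before routing proceeds.
func audit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		log.Printf("audit: %s %s", r.Method, r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	// Route by path prefix to backend control-plane services.
	mux.HandleFunc("/v1/resources/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("routed to the resource service\n"))
	})
	log.Fatal(http.ListenAndServe(":8080", authn(audit(mux))))
}
```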
Delving into the internal functions of the control plane, our Catalyst service, for instance, includes configuration capabilities, visualizations, API logs, telemetry, and graphical representations of data flows. These elements represent core functionalities that any cloud provider might need to offer, categorized broadly into managing resources, providing visual data views, and handling telemetry which encompasses logs, metrics, and potentially traces. These services outline the basic yet vital operations necessary for effective cloud resource management.
How should we design our control plane resources API? Several established cloud providers like GCP, AWS, and Azure offer robust public APIs that can serve as reference points. GCP even has a design document detailing their approach to creating cloud APIs. Essentially, the design process involves three key elements: using declarative resources to abstract away the backend operations from the users, structuring these resources in a hierarchical relationship which may include nesting, and applying standard operations including list, get, create, update, and delete. This closely mirrors the principles of RESTful APIs, suggesting the need to develop a REST API for managing domain objects specific to your cloud service.
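A sketch of what that surface might look like in Go, using the standard operation set described above; the types and the parent-path encoding of the hierarchy are illustrative assumptions, not any provider's actual API.

```go
package api

import "context"

// Resource is a generic declarative object; Parent encodes the hierarchy
// (e.g. "orgs/acme/projects/demo"). Both fields are illustrative.
type Resource struct {
	Name    string
	Parent  string
	Version int64 // used later for optimistic concurrency
	Spec    map[string]any
	Status  map[string]any
}

// ResourceService mirrors the standard operations: list, get, create,
// update, and delete. A sketch of the surface, not a real provider's API.
type ResourceService interface {
	List(ctx context.Context, parent string) ([]Resource, error)
	Get(ctx context.Context, name string) (Resource, error)
	Create(ctx context.Context, r Resource) (Resource, error)
	Update(ctx context.Context, r Resource) (Resource, error)
	Delete(ctx context.Context, name string) error
}
```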
For more advanced API design insights, we might consider the cloud-native approach exemplified by Kubernetes. Its resource model is decoupled from the components that act on it, yet shapes the system's broader architecture. Kubernetes defines a consistent format for resources, including API version, kind, and metadata, along with spec and status sections. This design lets Kubernetes' API machinery handle generic objects without customization for each type, leaving the type-specific fields within spec and status to manage desired state and feedback. This paradigm underscores the importance of a comprehensive resource definition for effective system reconciliation and intuitive API endpoint mapping.
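A minimal Go rendering of that envelope might look like the following: the generic machinery reads only the common fields, while spec (desired state) and status (observed state) are interpreted per resource type. Field names follow Kubernetes conventions, but the structs are a sketch of the pattern, not Kubernetes' actual types.

```go
package api

// Object is a Kubernetes-style resource envelope.
type Object struct {
	APIVersion string         `json:"apiVersion"`
	Kind       string         `json:"kind"`
	Metadata   Metadata       `json:"metadata"`
	Spec       map[string]any `json:"spec,omitempty"`   // desired state, typed per resource
	Status     map[string]any `json:"status,omitempty"` // observed state, written by controllers
}

type Metadata struct {
	Name            string            `json:"name"`
	Namespace       string            `json:"namespace,omitempty"`
	ResourceVersion string            `json:"resourceVersion,omitempty"` // drives optimistic concurrency
	Labels          map[string]string `json:"labels,omitempty"`
}
```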
Looking at a Kubernetes pod example, we see a straightforward definition that specifies a container named ‘explorer’ using a designated image. Kubernetes employs familiar HTTP methods such as GET, PUT, POST, PATCH, and DELETE for resource management. This simplicity and uniformity in API design can be adapted to resource types well beyond pods, offering flexibility in how resources are defined and managed.
In comparing other platforms, Azure utilizes ARM templates and AWS uses CloudFormation for resource configurations, reflecting a structured approach to resource deployment. However, Kubernetes differs by not embedding templating within its resource management model, though it does support resource composition. Alternatives like Crossplane, which explicitly defines resource dependencies, or Terraform and OpenTofu, which place dependency and state management on the client, present other viable approaches. Exploring these methods might offer strategic insights into building a functional and user-friendly cloud API.
In summary, the control plane serves as the brain of a cloud architecture, managing various data planes and supported by functionalities like authentication and routing provided through an API gateway. Emulating the Kubernetes model might provide a practical blueprint for constructing APIs that manage a hierarchical composition of cloud resources without necessitating complex templating solutions.
Exposing the resource API raises a critical question, particularly when dealing with a Kubernetes environment. The straightforward assumption might be to allow users direct access to create objects in the Kubernetes API. However, this approach is flawed since the Kubernetes API lacks multi-tenancy, causing resource competition between the service provider and the users. Both will have traffic throttled by Kubernetes which doesn’t distinguish between the different parties. Therefore, alternatives for exposing a Kubernetes-like API server are necessary. Using the term “Kubernetes-like” is deliberate to emphasize looking beyond the conventional Kubernetes framework for solutions. Options include nested Kubernetes solutions or leveraging technologies like vCluster or Capsule to develop a multi-tenant framework on existing Kubernetes structures. Projects such as KCP and Crossplane also offer solutions tailored towards building robust multi-tenant control planes, though they come with their challenges, including managing the underlying API servers.
The foundational principles of Kubernetes need to be well understood to weigh the various alternatives effectively. They revolve around a REST API that registers routes, validates and processes requests, and stores data for asynchronous processing. This storage could use various systems, but databases are preferred because their optimistic concurrency controls preserve data integrity across concurrent operations. The stored state then supports asynchronous processing through event-driven mechanisms that trigger controllers, a framework that extends even to serverless architectures: on AWS, for example, Lambda and DynamoDB could play the API server and data storage roles respectively.
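The optimistic-concurrency piece is worth making concrete. A minimal sketch, assuming a hypothetical `resources` table with a `version` column: the update succeeds only if the version the caller read is still current; otherwise the caller must re-read and retry.

```go
package store

import (
	"context"
	"database/sql"
	"errors"
)

// UpdateSpec writes a new spec only if the stored version still matches
// the version the caller read, bumping the version on success.
func UpdateSpec(ctx context.Context, db *sql.DB, name string, readVersion int64, spec []byte) error {
	res, err := db.ExecContext(ctx,
		`UPDATE resources SET spec = ?, version = version + 1
		 WHERE name = ? AND version = ?`,
		spec, name, readVersion)
	if err != nil {
		return err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if n == 0 {
		// Someone else updated the row first: re-read and retry.
		return errors.New("version conflict")
	}
	return nil
}
```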
In standard Kubernetes operations, the API server itself handles hardcoded and dynamically registered API types, using etcd as its database with optimistic concurrency control through a resource versioning system. Namespaces offer a basic level of isolation, primarily by prefixing resource names, which is complemented by in-memory caches for efficient resource management. Controllers handle resource reconciliation, adapting to the API server’s caching and resource versioning mechanics. While Kubernetes is traditionally viewed as a workload scheduler, this is only a portion of its capabilities, suggesting a careful consideration of what features are truly necessary for specific use cases. This thought process guides whether a bespoke approach, potentially integrating choreography and orchestration, could offer a tailored solution aligning more closely with specific architectural and operational requirements.
At Diagrid, we went against common advice and built our own API server and resource storage machinery. Our strategy was straightforward: we compiled all API types directly into our API server, using mechanisms akin to those Kubernetes employs for managing resources. Instead of operating a difficult-to-maintain etcd cluster, we write data directly to a managed SQL database. Changes in the database trigger a Redis cache update and a stream event for activating controllers, a change-data-feed concept driving controllers that live within the API server itself. This consolidated architecture scales horizontally with ease because all state management is external. Additionally, we support remote controllers using Kubernetes-like ListWatch methods. It's worth noting that database scalability can be achieved through vertical partitioning by resource type, a practice Kubernetes itself uses for efficient scaling.
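The shape of that fan-out can be sketched as follows, assuming the go-redis client and illustrative key and stream names: after a resource change commits, the cache entry is refreshed and an event is appended to a stream that wakes controllers. This is a sketch of the pattern, not Diagrid's actual code.

```go
package feed

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// Publish refreshes the cached resource and appends a wake-up event to
// the controller stream.
func Publish(ctx context.Context, rdb *redis.Client, name string, body []byte) error {
	// Cache the latest materialized resource for fast reads and watches.
	if err := rdb.Set(ctx, "resource:"+name, body, 0).Err(); err != nil {
		return err
	}
	// Level-triggered event: controllers only need to know "name changed"
	// and will read the latest state from the cache.
	return rdb.XAdd(ctx, &redis.XAddArgs{
		Stream: "resource-events",
		Values: map[string]interface{}{"name": name},
	}).Err()
}
```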
Diving deeper into the internal workings of our API server, it encompasses all standard REST components, interfacing with what we designate as ‘resource storage’. At this layer, operations on specialized resource types are generalized, following validation and templating steps. Resource storage is built atop a transactional outbox pattern, writing to resource and event logs in one transaction, which enables a sophisticated watching mechanism on event changes. The watch uses peek-lock semantics, ensuring state updates are durable before changes are acknowledged. Our stream provides level-based semantics which, like Kubernetes, avoids redundant operations by focusing on the latest relevant event, allowing events to be collapsed at the controller level. Controllers are designed to be idempotent, retrying until execution succeeds or an error is isolated, signaling system issues where necessary.
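The consuming side could look like the following sketch, using Redis consumer groups to get the peek-lock behaviour described: a message is delivered, processed, and only acknowledged afterwards, so crashes and failures lead to redelivery. Group, consumer, and stream names are assumptions, and the group is presumed created beforehand (e.g. with XGroupCreateMkStream).

```go
package feed

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// Consume reads events and acknowledges them only after a successful
// reconcile, giving peek-lock semantics over the stream.
func Consume(ctx context.Context, rdb *redis.Client, reconcile func(name string) error) error {
	for {
		streams, err := rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
			Group:    "controllers",
			Consumer: "worker-1",
			Streams:  []string{"resource-events", ">"},
			Block:    0, // block until events arrive
		}).Result()
		if err != nil {
			return err
		}
		for _, s := range streams {
			for _, msg := range s.Messages {
				name, _ := msg.Values["name"].(string)
				// Level-based: reconcile against the latest cached state.
				if err := reconcile(name); err != nil {
					continue // unacked; redelivered from the pending list
				}
				rdb.XAck(ctx, "resource-events", "controllers", msg.ID)
			}
		}
	}
}
```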
Our approach to controllers is bespoke, deviating from typical Kubernetes frameworks. We employ a generic base controller that interfaces with both the cache and the stream and also manages drift-detection resync operations. The reconciliation logic is tailored per API, handling resource-specific create, update, or delete operations. Unlike traditional Kubernetes controllers, which often manage external resources, our lightweight controllers can drive direct database modifications and business logic, potentially building materialized views. Post-reconciliation, an update-status method concludes the process, signaling that the resource has been resolved. For advanced users interested in deletion orchestration, the Kubernetes documentation on finalizers provides deeper insights.
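In outline, such a base controller might look like this in Go; the Reconciler interface, channel-based event feed, and resync handling are illustrative of the pattern, not Diagrid's actual implementation.

```go
package controller

import (
	"context"
	"time"
)

// Reconciler holds the per-API logic plugged into the generic base.
type Reconciler interface {
	Reconcile(ctx context.Context, name string) error    // create/update/delete for one resource
	UpdateStatus(ctx context.Context, name string) error // report resolution back to the API server
}

type Base struct {
	Events <-chan string // resource names from the change stream
	Resync time.Duration // drift-detection interval
	R      Reconciler
}

func (b *Base) Run(ctx context.Context) {
	ticker := time.NewTicker(b.Resync)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case name := <-b.Events:
			// Idempotent: safe to retry until success.
			if err := b.R.Reconcile(ctx, name); err == nil {
				b.R.UpdateStatus(ctx, name)
			}
		case <-ticker.C:
			// Resync would list all resources and re-reconcile; omitted here.
		}
	}
}
```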
In summary, effective resource isolation, particularly through specific tenancy or Kubernetes clusters, is advisable. The choice to implement a Kubernetes-like API server should be carefully assessed against specific needs. Our system demonstrates the flexibility of supporting both choreography and orchestration, with each method presenting distinct benefits. Additionally, resource composition offers a viable solution for certain templating scenarios.
We’ve discussed the control plane extensively; however, the data plane is just as essential, since it implements the configurations that deliver services to end-users. I think of this in terms of several models. In a centralized model, all resources are managed via an API server at the control plane, and the compute or controllers reconcile those resources onto data planes across regions; this performs well at small scale but comes with limitations. In a decentralized control model, resources are stored at the control plane but synced to regional data planes, where local controllers handle reconciliation; the API servers sync only the resources each data plane needs. KCP is similar in spirit, with virtualized workspaces and API servers bound to the workload clusters where actual tasks are carried out. A federated control model instead employs a large router that directs tenants to suitable data planes for resource storage, with controllers operating inside those data planes. An extension of this is the mesh model, where API servers form a mesh enabling inter-regional resource sharing, albeit with more complexity.
At Diagrid, our approach mirrors the decentralized control model: a Kubernetes-style API server in the control plane houses the resources. Resources must be claimed or bound to a data plane, marking which ones should be synchronized. A syncer then propagates updates to the local API server, kicking off the control loop that provisions services and infrastructure for end-users. A significant advantage of this method is that controllers in the data plane can adapt to environmental variations across clouds like AWS, Azure, or GCP, using native integrations such as pod identity more effectively than a centralized controller could. However, this model can strain the API server and lead to throttling, especially with API servers managed by cloud providers. Consequently, we moved to a direct approach outside the Kubernetes API server, adopting a bespoke syncer that maintains state through ListWatch semantics and communicates with a data plane actor, as sketched below.
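A minimal sketch of that syncer, under the assumption of a simple control-plane client exposing list and watch over the gRPC stream; the Change type and client interface are invented for illustration.

```go
package syncer

import "context"

// Change carries the desired state of one resource.
type Change struct {
	Name string
	Body []byte
}

// ControlPlaneClient abstracts the gRPC stream to the control plane.
type ControlPlaneClient interface {
	List(ctx context.Context) ([]Change, error)
	Watch(ctx context.Context) (<-chan Change, error)
}

// Run implements ListWatch semantics: replay the full state once, then
// forward incremental updates into the data-plane actor's inbox.
func Run(ctx context.Context, cp ControlPlaneClient, inbox chan<- Change) error {
	initial, err := cp.List(ctx)
	if err != nil {
		return err
	}
	for _, c := range initial {
		inbox <- c
	}
	updates, err := cp.Watch(ctx)
	if err != nil {
		return err
	}
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case c := <-updates:
			inbox <- c
		}
	}
}
```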
The term ‘actor’ here doesn’t fit the traditional definition; it denotes an object in the data plane that processes modifications from its inbox and manages state updates through a reconciliation loop that interfaces with Kubernetes, Helm, or cloud providers. This process continuously relays status back to the control plane, offering visibility into the provisioning stages. The actor is in-memory, with no persistence and no inter-actor communication, and leans on Go's concurrency model: goroutines handle numerous actors concurrently at modest CPU and memory cost.
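In Go, the essence of such an actor is small: one goroutine per object draining an inbox channel and reporting status as it reconciles. This is a sketch of the pattern only; the Apply hook and status strings are placeholders.

```go
package actor

import "context"

// Actor is an in-memory object: no persistence, no actor-to-actor calls.
type Actor struct {
	Inbox  chan []byte                 // desired-state updates from the syncer
	Report func(status string)         // relays provisioning progress to the control plane
	Apply  func(desired []byte) error  // reconciles against Kubernetes, Helm, or a cloud API
}

// Run drains the inbox; launching one goroutine per actor scales cheaply.
func (a *Actor) Run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case desired := <-a.Inbox:
			a.Report("provisioning")
			if err := a.Apply(desired); err != nil {
				a.Report("error: " + err.Error())
				continue
			}
			a.Report("ready")
		}
	}
}
```

A syncer would start one of these per data-plane object, e.g. `go (&Actor{...}).Run(ctx)`.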
Lastly, on the ingress path, data planes must expose public load balancers and ingress points so users can reach their services. Typically, this involves a Kubernetes ingress controller coupled with a public load balancer, using wildcard DNS for routing. Users authenticate via credentials received during resource provisioning at the control plane: a connection string, an API token, or ideally an X.509 certificate. This setup must be versatile, offering varying levels of isolation and performance to meet modern cloud service expectations.
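On the routing side, recovering the tenant from a wildcard host and checking the provisioned credential can be sketched as follows; the domain, header name, and handler body are placeholders, not Diagrid's actual ingress.

```go
package ingress

import (
	"net/http"
	"strings"
)

// tenantFromHost extracts the first DNS label under the wildcard,
// e.g. "proj-123" from "proj-123.api.example.com".
func tenantFromHost(host string) string {
	if i := strings.IndexByte(host, '.'); i > 0 {
		return host[:i]
	}
	return ""
}

func Handler(w http.ResponseWriter, r *http.Request) {
	tenant := tenantFromHost(r.Host)
	// The token would have been issued when the resource was provisioned;
	// X.509 client certificates are the stronger alternative.
	if tenant == "" || r.Header.Get("x-api-token") == "" {
		http.Error(w, "unauthorized", http.StatusUnauthorized)
		return
	}
	// ...route to the tenant's backing services...
	w.WriteHeader(http.StatusOK)
}
```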
Clouds can implement various methodologies such as centralized, decentralized, federated, or mesh for data plane resource management. It’s important to cautiously manage your API server to avoid complications that can be challenging to rectify. Also, consider adapting your data management strategies in multi-cloud environments, ensuring there are appropriate levels of isolation and performance tailored to specific needs because one approach does not suit all cloud scenarios.
Dapr was open-sourced on GitHub in October 2019 and transitioned to CNCF stewardship in November 2021. I became involved with the project shortly after its initiation, around November 2019. Diagrid was founded in December 2021. Within approximately 7 to 8 months, our small team, comprising two backend engineers, one frontend engineer, and one infra engineer, built Conductor. The platform now successfully manages hundreds of Dapr clusters and thousands of Dapr applications, processes millions of metrics daily, and is operational and free. Later, we started Catalyst, which entered private preview in November 2023. Despite keeping the team lean and continuing our work on Conductor and Dapr, we built a service capable of managing immense internal load and processing millions of API requests each day.
When asked whether I would change any steps taken in the past 2-3 years, I acknowledged that several iterations were necessary. Initially, Diagrid's architecture relied heavily on the control plane replicating data across multiple databases, plus an additional gRPC API exposing essential components to the agents in the data plane. We realized how inefficient this repeated data replication was and eventually pivoted to a more streamlined remote-control approach over a continuous gRPC stream, cutting resource provisioning times from minutes down to seconds and significantly improving user experience and operational efficiency.
Participant 2: I saw that you’ve essentially replicated a lot of Kubernetes logic and functionality. Was it a deliberate choice to avoid using Kubernetes directly as a strategic decision to separate yourselves from the dependency on any specific scheduling system and ensure compatibility across any cloud, regardless of their Kubernetes support? Why didn’t you choose Kubernetes right from the start?
Collinge: Kubernetes certainly possesses several admirable traits we want to take advantage of. However, it wasn't originally created to handle business logic or to run lightweight controllers and CRDs. Its primary design was for the kubelet to manage pod provisioning, and its use has broadened since then because its API is so extensible. When we were first developing Conductor, our tasks amounted to generating YAML and writing it to storage like S3 or GCS. Considering the work involved in standing up a Kubernetes controller just to generate and store YAML, the unnecessary complexity Kubernetes introduces becomes apparent. As I mentioned before, limiting our approach solely to Kubernetes brings numerous restrictions and reduces flexibility. By stepping back to first principles, we found more room to explore alternative solutions. Building everything on serverless platforms could have been simpler, but it wasn't feasible for us as we needed to remain cloud-neutral.
Participant 3: From what I gather, you essentially developed a Kubernetes-style API to circumvent the constraints Kubernetes imposes across different clouds, such as the disparities between managed offerings like EKS and AKS. For a data platform team aiming to develop services within a single cloud provider, like AWS, and seeking to create a well-integrated service system, would your approach lean towards constructing a minimalistic API with an independently built backend controller, or, given the single-cloud environment, opt to build atop an established managed Kubernetes offering?
Collinge: This question brings up platform engineering, which can be a bit undefined and complex. We didn't have a designated platform team; our team consisted of three engineers, which made extensive platform engineering infeasible. You can build a cloud without strictly adhering to conventional cloud principles for your internal infrastructure provisioning. If you do pursue platform engineering on that front, my advice would be to avoid bespoke solutions: for managing services like those a data platform team needs, stick with Kubernetes and conventional integrations, using as many out-of-the-box tools and solutions as possible. Our motive in building this cloud infrastructure wasn't to build platforms on top of it but to serve our end users efficiently and enhance their experience, with a focus purely on resource provisioning.
Participant 3: I also think you are probably doing some platform engineering for it, but as a SaaS. It's fairly similar, but indeed the fact that you have a product and everything on top of that makes some kind of customization worthwhile.
Collinge: The closest thing is like Upbound, I think, to building a system like this as a SaaS, like a full SaaS cloud as a service thing, but they are still very infrastructure focused. I think that there probably is an opportunity to build cloud as a service thing which is a bit more flexible and supports more lightweight business logic, because you might just want to create an API key. Why do you need all this logic to create an API key?