GitHub serves as a pivotal hub for software developers globally, boasting over 100 million developers and 420 million repositories. To manage this vast network effectively, GitHub utilizes a comprehensive data collection system crafted in-house. Although the system was engineered for robustness and scalability, GitHub’s continuous expansion prompted an evaluation to ensure it could accommodate both present and future needs.
“We faced a scalability issue, now collecting approximately 700 terabytes of data daily. This data is crucial for identifying malicious activities against our system and for troubleshooting purposes. However, this internal system was becoming a bottleneck for our growth.”
—Stephan Miehe, GitHub Senior Director of Platform Security
In collaboration with its parent company, Microsoft, GitHub sought a solution that could handle vast event streams. The team built a function app running on Azure Functions Flex Consumption, a newly introduced plan designed for large-scale serverless computing. The plan offers rapid scaling, long execution times, private networking, a choice of instance sizes, and concurrency management.
Discover how to accelerate growth using the Azure Functions Flex Consumption Plan
In a recent demonstration, GitHub managed a rate of 1.6 million events per second with a single Flex Consumption app triggered by a network-restricted event hub.
“The crucial benefit for us is the app’s ability to automatically scale according to demand. The dynamic scaling of Azure Functions Flex Consumption based on the events queued in Azure Event Hubs is highly beneficial for our operations.”
—Stephan Miehe, GitHub Senior Director of Platform Security
GitHub’s challenge centered on an internal messaging application that managed the communication between telemetry data producers and receivers. Initially, the solution was implemented with Java and Azure Event Hubs. However, as the system scaled to process up to 460 gigabytes of data daily, it began to hit its operational limits, leading to decreased reliability.
The previous setup required an independent environment for each data consumer, which meant extensive manual configuration that was both time-consuming and costly. The Java codebase also ran into frequent issues, and maintenance grew harder as computational demands escalated.
“We couldn’t accept the risk and scalability challenges of the current solution,” stated Miehe, highlighting the need for a change. “We were already using Azure Event Hubs, which led us to consider other Azure solutions. Our requirements were simple—an HTTP POST request—prompting us to look for a serverless option that would be efficient.”
Already experienced with serverless architecture, Miehe’s team evaluated several Azure-native options and ultimately chose Azure Functions.
“Both platforms are well known for being good for simple data crunching at large scale, but we don’t want to migrate to another product in six months because we’ve reached a ceiling.”
—Stephan Miehe, GitHub Senior Director of Platform Security
A function app can scale automatically with the queue as logging traffic fluctuates. The question was how far it could scale. At the time GitHub began working with the Azure Functions team, the Flex Consumption plan had just entered private preview. Based on a new underlying architecture, Flex Consumption supports up to 1,000 partitions and provides a faster target-based scaling experience. The product team built a proof of concept that scaled to more than double the legacy platform’s largest topic at the time, showing that Flex Consumption could handle the pipeline.
“Azure Functions Flex Consumption gives us a serverless solution with 100% of the capacity we need now, plus all the headroom we need as we grow.”
—Stephan Miehe, GitHub Senior Director of Platform Security
GitHub joined the private preview and worked closely with the Azure Functions product team to see what else Flex Consumption could do. The new function app is written in Python to consume events from Event Hubs. It consolidates large batches of messages into one large message and sends it on to the consumers for processing.
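The article doesn’t show GitHub’s implementation, but a minimal sketch of an Event Hubs-triggered batch consumer in the Azure Functions Python v2 programming model might look like the following. The event hub name, connection setting name, and the forward_batch helper are hypothetical placeholders, and the batch cardinality setting is an assumption based on the consolidation behavior described above.

```python
import json
import logging

import azure.functions as func

app = func.FunctionApp()

# "telemetry-hub", "EVENTHUB_CONNECTION", and forward_batch() are hypothetical
# placeholders, not GitHub's actual names or endpoints.
@app.event_hub_message_trigger(
    arg_name="events",
    event_hub_name="telemetry-hub",
    connection="EVENTHUB_CONNECTION",
    cardinality="many",  # assumption: deliver a batch of events per invocation
)
def consolidate(events: list[func.EventHubEvent]):
    # Decode the incoming batch and consolidate it into one outgoing message.
    bodies = [e.get_body().decode("utf-8") for e in events]
    logging.info("Consolidating %d events into one message", len(bodies))
    forward_batch(json.dumps(bodies))


def forward_batch(message: str) -> None:
    """Hypothetical stand-in for the HTTP POST to the downstream consumer."""
    pass
```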
Finding the right number of messages per batch took some experimentation, because every function execution carries at least a small amount of overhead. At peak usage, the platform processes more than 1 million events per second, so the GitHub team needed to find the sweet spot for each execution. Too large a batch and there isn’t enough memory to process it. Too small a batch and it takes too many executions to work through the events, which slows performance.
The right number proved to be 5,000 messages per batch. “Our execution times are already incredibly low—in the 100–200 millisecond range,” Miehe reports.
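A rough back-of-the-envelope check of those figures is sketched below; the one-execution-at-a-time assumption is ours, not a reported constraint of GitHub’s deployment.

```python
# Back-of-the-envelope math using the figures quoted above.
peak_events_per_sec = 1_000_000   # peak load reported in the article
batch_size = 5_000                # messages consolidated per execution
exec_time_sec = 0.15              # midpoint of the 100-200 ms range

executions_per_sec = peak_events_per_sec / batch_size    # 200 executions/sec
serial_capacity_per_instance = 1 / exec_time_sec         # ~6.7 executions/sec
instances_needed = executions_per_sec / serial_capacity_per_instance

print(f"{executions_per_sec:.0f} executions/sec at peak")
print(f"~{instances_needed:.0f} instances if each runs one execution at a time")
# Roughly 30 instances under these simplified, serial-execution assumptions;
# real instance counts depend on per-instance concurrency and scaling behavior.
```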
This solution offers built-in versatility. The team can adjust the number of messages per batch for each application and rely on target-based scaling to add instances as needed. In this model, Azure Functions evaluates the count of unprocessed messages in the event hub and scales out based on the batch size and partition count. At high volume, the function app can scale up to one instance per event hub partition, which can mean as many as 1,000 instances for large deployments.
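A simplified sketch of that target-based scaling arithmetic follows; the function, names, and example numbers are illustrative, not values from GitHub’s deployment or the scaler’s exact implementation.

```python
import math

# Illustrative only: scale out toward one instance per batch-sized chunk of
# backlog, capped at one instance per event hub partition.
def target_instances(pending_events: int, batch_size: int, partitions: int) -> int:
    desired = math.ceil(pending_events / batch_size)
    return min(desired, partitions)

# Example backlog: 2 million pending events, 5,000-message batches, 1,000 partitions.
print(target_instances(pending_events=2_000_000, batch_size=5_000, partitions=1_000))
# -> 400 instances for this illustrative backlog
```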
“If any customers are looking to utilize a similar setup with an Event Hubs-triggered function app, they should carefully consider the number of partitions relative to their workload volume; insufficient partitions could limit throughput.”
—Stephan Miehe, GitHub Senior Director of Platform Security
Azure Functions not only supports Event Hubs but also other sources such as Apache Kafka, Azure Cosmos DB, Azure Service Bus queues and topics, and Azure Queue Storage.
The Function as a Service (FaaS) model allows developers to avoid the hassles associated with managing infrastructure. Nonetheless, serverless code remains subject to the constraints imposed by its network environment. The Flex Consumption model mitigates this by enhancing virtual network (VNet) integration. With this arrangement, function apps can be isolated within a VNet and also communicate with other services within VNets, without sacrificing performance.
As an early adopter, GitHub saw immediate benefits from the enhancements incorporated into the Azure Functions platform through Flex Consumption. This model operates on Legion, a newly developed internal Platform as a Service (PaaS) infrastructure that bolsters network capacity and performance under demanding conditions. Notably, Legion can swiftly augment a VNet with computing resources: as a function app scales, each additional compute instance is functional and connected to the outbound VNet within milliseconds. This rapid capability allows GitHub’s message processing applications to interact promptly with Event Hubs inside a VNet. Over the past 18 months, cold start times have dropped by about 53% across regions and platforms.
This initiative stretched both the GitHub and Azure Functions engineering teams, pushing them to optimize throughput.
According to Miehe, with enhanced capabilities, the team also had to embrace greater responsibilities, noting that Flex Consumption offered “a lot of knobs to turn.” He emphasized, “There’s a balance between flexibility and the effort required to configure it properly.”
To that end, he recommends testing early and often, a familiar part of the GitHub pull request culture; practices like these helped GitHub meet its milestones.
The GitHub team continues to run the new platform in parallel with the legacy solution while it monitors performance and determines a cutover date.
“We’ve been running them side by side deliberately to find where the ceiling is,” Miehe explains.
The team was delighted. As Miehe says, “We’re pleased with the results and will soon be sunsetting all the operational overhead of the old solution.”