James Munro explores the functionality and construction of ArcticDB, a performance-oriented time-series datastore, and explains why transactions, and specifically the Isolation in ACID, may be an unnecessary complication.
James Munro leads ArcticDB at Man Group. ArcticDB is a DataFrame database tailored for time-series data and designed for extensive scalability and concurrent usage, accommodating petabytes of data and numerous simultaneous users. Prior to this role, he served as CTO of Man AHL from 2018 to 2023.
Munro shares: “I’m James Munro. Today, I’ll discuss the reasons behind a hedge fund’s decision to develop its own database technology. I oversee ArcticDB at Man Group, which is primarily an asset management firm. My background began in physics, dealing extensively with plasma physics and electron-molecule scattering, among other topics. That experience proved somewhat pertinent when simulating plasmas for semiconductor processing, which has some connection to my current work.”
Ultimately, I transitioned into a role at Man AHL, a highly systematic, quantitative hedge fund manager. My role there involved a wide range of responsibilities across asset classes, including team strategies and portfolio management. I moved into management in 2016 and became CTO in 2018, which is when my focus areas for Man AHL really took shape. After a rewarding five-year stint, I moved across to head ArcticDB. The shift was an unusual one: I went from being ArcticDB's largest consumer to overseeing the product itself, a role that comes with demanding users and a long list of requested enhancements.
To provide some background, let me outline the essentials behind building proprietary database technology. Man Group manages over $160 billion in assets and specializes in alternative investments, which are designed to deliver uncorrelated returns. The firm has roughly 35 years of history in asset management and traces its founding to 1783, when James Man set it up, with an extensive past primarily in brokerage and merchanting. Over the last few decades, the focus has shifted predominantly to managing assets for a varied clientele, often sophisticated investors such as pension funds seeking diverse strategies for alpha generation.
The range of investment strategies at Man Group is vast, including macro funds, trend following, multi-strategy portfolios, discretionary strategies, credit, loans, and real assets among others. Given our breadth in alternative asset management, all operations run on a unified technological and operational framework. This infrastructure supports our extensive trading activity, which exceeds $6 trillion annually depending on market volatility and risk levels. That scale immediately presents a set of operational challenges and opportunities.
Getting from roughly $160 billion in assets to $6 trillion in annual trading is a significant multiple, primarily attributable to the leverage used in hedge fund strategies. The further jump from one trillion to six or seven trillion dollars is driven by active asset management: using market data and information to decide whether to take long or short positions in various markets.
Active trading involves entering and exiting positions multiple times throughout the year, which contributes to scaling up financially. It also involves exploring diverse strategies to generate alpha for various client needs across multiple liquid markets. Efficient and cost-effective trading practices are crucial, as high trading costs can severely impact profitability.
To provide some background on Arctic’s development, it began around 2011 with a focus on using Python and finding effective data solutions. The initial version of the Arctic database was written in Python on top of MongoDB, a fast document store. That version was open-sourced in 2015 and saw considerable adoption within the finance sector. However, scalability issues with MongoDB led to the decision to revamp Arctic: rewrite its core in C++ and eliminate the separate database layer, connecting directly to storage solutions like S3 for improved performance and scalability.
Moving to C++ from Python has significantly boosted our performance at Man Group. Now, C++ is leveraged extensively for processing critical data including market insights and risk assessments across the entire asset base. Its adoption isn’t just limited to us but is widespread across many financial institutions including banks, asset managers, and data providers.
The question arises, why create a new database when numerous ones already exist, constructed by established entities? This seems a bit extreme or outlandish, doesn’t it? It brings to mind a quote from Bruce Feirstein, an American writer known for his work on James Bond films, “The distance between insanity and genius is measured only by success.” This expression perfectly encapsulates our venture into database creation despite the abundance of available options. We are driven not by a lack of options, but by the need for a specialized solution capable of handling high-frequency data.
The consensus might lean towards insanity, but there’s a method to the madness. Specialized data types and high frequency handling are not adequately addressed by generic database solutions. Major banks and financial institutions often opt to develop bespoke databases for these very reasons. Our approach, while it might seem audacious, aligns with a common practice in the industry. Innovating within a niche often leads to creating tools that, while specialized, serve specific, critical needs effectively.
Exploring the reasons behind the strategy of Man Group to achieve alpha at large scale leads us into the realm of applying technologies such as ArcticDB. A compelling perspective is found in an economic study published by the American Economic Review in 2020, which examined research productivity. It might appear that the primary hurdle in operating a systematic quant hedge fund would be managing low latency trading systems, but that’s not necessarily the case.
Although we are not engaged in high-frequency trading, where low latency is absolutely critical, the quality and capability of technology for high-frequency data and effective execution remain vital. More often, the competition is in research productivity, focusing on developing trading ideas, managing risks, and optimizing portfolios, rather than just improving the speed of trades. Digging deeper, the foundation lies in the research conducted to come up with alpha-generating strategies based on data analysis. The productivity of the quantitative research team in generating innovative ideas and building efficient portfolios is the focus.
I experienced this firsthand while working on semiconductor processing in the mid-2000s, directly contributing to what is depicted as the green line on productivity graphs for Moore’s Law. In this industry, semiconductor densities are known to double approximately every two years, displaying an impressive compound growth rate of 35%, a figure I’ve always found astonishing. This exponential growth has driven the IT and technology revolutions, making research increasingly costly. The complexity of simulating plasma etching at an atomic level was extremely high at that time.
The growing number of researchers needed to achieve progressively smaller feature sizes in semiconductors mirrors the exponential payoff this growth provided. While it might have been feasible with one person in 1971, by 2014, it required about 18 people, and likely even more today. This highlights a broader challenge across various sectors as noted in the research paper — not all sectors reap exponential benefits, making research productivity a widespread challenge. In quantitative finance, this challenge is magnified by increasing data volumes, rising competition, and greater market efficiency making it tougher to find profitable opportunities. Thus, enhancing research productivity stands as the quintessential challenge.
Data’s importance in finance is growing, particularly in asset management where the use of high frequency and low latency market data is increasing. Asset managers typically process billions of rows of data daily, with the capacity to handle even trillions. Alongside this, there’s a rising trend in utilizing alternative data (alt data) to forecast market trends and assess risks. This includes consumer-generated data of various types, such as weather information, images, and environmental sustainability data, all of which are expanding in volume.
The growing diversity and volume of this data is depicted in charts by a data indexing company named Eagle Alpha. They create visual representations to help clients identify valuable data sets. The diversity of data types presents a particular challenge to asset managers, necessitating not only performance handling capacity but also agility in data processing.
Additionally, the timeline of these developments is crucial for context. Reflecting on 2011, when I started at Man Group, the company was just beginning to adopt Python, a choice now prevalent in data science. Back then, other programming languages like R, MATLAB, or sometimes C++ were preferred in quantitative finance environments.
Initial choices were standard; the collective skill set in our building guided these decisions. At that time, Python did not stand out as the principal language for data science or production usage. However, it gained significant traction subsequently, as evidenced by its rising share of Stack Overflow activity and its popularity worldwide. In those days, Python 2 was common, while Python 3 existed but was comparatively challenging to adopt.
Tools such as TensorFlow and PyTorch were not yet developed, and others like pandas were just beginning to gain recognition, having been open-sourced in 2008. We were pioneering, deciding to shift all data science operations into Python, supporting that decision by engaging with the PyData community in London and building out our own tools, including Arctic, alongside the existing ecosystem.
An additional aspect of our conceptual exploration involves managing data, applicable to both individual investors managing portfolios and sophisticated asset managers. The foundational steps are consistent: data acquisition, interpretation, and decision-making on trading strategies. For instance, managing a portfolio might not only involve owning American stocks but also addressing exposure to the dollar by applying hedges. Decisions also include choosing instruments and methods to optimize trading costs. This process is universal, irrespective of being an individual or a multinational corporation. The complexity of handling vast datasets, such as those indexed by Eagle Alpha, compounds the challenge.
There is a diverse range of data under analysis, from corporate documents and market tick data to consumer transactions and environmental metrics. The approach involves various statistical methods, not just trending machine learning algorithms, but encompassing basic science statistics and possibly deep learning techniques as well.
Tools such as ChatGPT might be employed to cover an extensive range of methods required for hands-on problem-solving, including portfolio construction, risk management, and high-frequency trading executions. Within organizations like Man Group, with its hundreds of quants, the diversity in problem-solving indicates a healthy research environment. Solving the same issue uniformly could suggest a lack of innovative thinking.
Diving into specifics, systematic traders who rely on algorithms for trading decisions represent a shift from manual stock picking to a highly automated process, from data management to execution. This system is often structured around the lambda architecture, integrating both streaming and batch data pipelines. In practice, this involves handling high-frequency tick data, storing it, and downsampling into various frequencies for analytical purposes. In addition, a batch processing workflow complements this by dealing with fundamental or alternative datasets, stored and organized effectively in a structured data lake known internally as codex.
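As a rough illustration of the streaming-plus-batch idea, the sketch below downsamples hypothetical tick data into one-minute bars with pandas and stores the result in an ArcticDB library. The ticker, library name, and lmdb:// URI are placeholders used to keep the sketch local and runnable, not the production setup described in the talk.

```python
import numpy as np
import pandas as pd
from arcticdb import Arctic

# Hypothetical tick data: irregular in practice, regular here for brevity.
ticks = pd.DataFrame(
    {
        "price": 100 + np.random.randn(10_000).cumsum() * 0.01,
        "size": np.random.randint(1, 500, 10_000),
    },
    index=pd.date_range("2024-01-02 08:00", periods=10_000, freq="250ms"),
)

# Downsample the ticks into 1-minute OHLCV bars for research use.
bars = ticks["price"].resample("1min").ohlc()
bars["volume"] = ticks["size"].resample("1min").sum()

# Store the downsampled frame; URI and names are placeholders.
ac = Arctic("lmdb://./tickstore")
lib = ac.get_library("bars_1min", create_if_missing=True)
lib.write("EXAMPLE_TICKER", bars)
```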
The objective is to get this data into your framework, where your programs process it and use it for backtesting, evaluating how a strategy would have performed over time. That includes portfolio risk assessment and trade optimization, the scientific components underpinning algorithm development. Once that work is done, instructions are relayed to the trading side, and execution can then unfold in various ways.
After that, you need thorough analytics, which means systematically arranging this information into tables. At Man Group, as at many alternative asset managers, the concept of a table has effectively become a DataFrame. DataFrames are handled as whole units, much like documents, with data flowing in and out continuously and being analyzed daily. The DataFrame is the primary operational entity.
From a technological perspective, going back to 2011 but still relevant today, we were looking for ways past the server bottlenecks created by data-heavy workloads. We experimented with both proprietary and open-source databases at the time, which often meant deploying large numbers of servers; this is where our experience with Mongo comes in. The surprising takeaway was that a single user's data demands could overwhelm an array of servers, a sharp contrast with web services that support vast user bases on minimal infrastructure. That implies significant cost and operational burden, including the mental and physical effort of maintaining such infrastructure, even with modern serverless technologies. At the same time, our shift to Python aimed to simplify the API, making data handling via DataFrames as easy as sharing files on something like OneDrive: a high-performance time-series database that eschews complex tooling in favor of a straightforward Python API. These considerations directed our strategic focus.
Could there have been an alternative path that would have spared us the effort of developing our own database? Before answering that, here are some reflections on what real-world data actually looks like. Imagine not having to fight with data shaped like this: DataFrames that are excessively wide, excessively long, or jagged. I wasn't satisfied with the images I had, so I experimented with cubism; I'm not certain what I created strictly qualifies as cubism, but it certainly contains cubes. Pictured here is an individual grappling with the complexities of real-world data, exemplified by bond data, an intriguing case because it is comparatively obscure.
Bonds significantly surpass equities in market size, being almost three times larger globally from the data I obtained, and even more predominant in the U.S., where most bond transactions transpire. Despite its vastness, the bond market exhibits less liquidity than that of equities. The trading of bonds, essentially credit, is notably slower and more challenging due to its over-the-counter nature. This sector provides a fertile landscape for quantitative analysis, especially because of its size and illiquidity, making it a challenging yet rewarding frontier.
The associated data also presents its challenges. At the outset, one might picture normalized data. Consider a pandas DataFrame of bond data, with dates and various identifiers such as CUSIPs and ISINs, alongside numerous measures such as prices and durations. This is a conventional way to arrange data, but it's not always the most practical for working with it: real work often entails reshaping these layouts.
A pivotal transformation is a pivot that narrows the data down to a single measure, such as price, to facilitate specific computations. The IDs now run horizontally with time extending vertically, optimizing the layout for time-series analytics. When this data is organized column-wise, processing speeds are remarkable. Presented this way, the data aligns ideally for cross-sectional analysis, treating each column like a component of a portfolio.
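A minimal sketch of that reshaping in pandas, using a tiny long-format frame with made-up identifiers and values:

```python
import pandas as pd

# Hypothetical long-format bond data: one row per (date, bond), many measures.
long_df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"]),
    "cusip": ["037833AK6", "594918BB9", "037833AK6", "594918BB9"],
    "price": [98.7, 101.2, 98.9, 101.0],
    "duration": [4.1, 6.3, 4.1, 6.3],
})

# Pivot to a wide frame: one column per bond, time running down the index.
# This is the shape that suits columnar, cross-sectional time-series work.
prices = long_df.pivot(index="date", columns="cusip", values="price")
print(prices)
```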
Handling a single asset rarely suffices, and this shape fundamentally determines how you manipulate the data. In this dataset there are roughly 400,000 historically tradable bonds, amounting to several gigabytes of data. The extract shown is just five rows, but the complete dataset encompasses thousands of daily entries. The volume is substantial; it wouldn't fit in the RAM of a standard laptop. Managing it also requires tools that can handle an enormous number of columns, which isn't typical of standard SQL schemas; it is better treated as a block of data.
The narrative here largely revolves around the trade-offs of data normalization, and reasonable people can disagree. Traditionally, people are trained to devise normalized schemas, the practice shown on the left-hand side. That is not inherently wrong, but it demands thought. Reasons to normalize include variation in the data's structure: not every timestamp is present, and the set of assets changes over time; Apple has not always existed, and plenty of companies are wound up.
The more dynamic your data schema, the stronger the case for normalizing. That foundation is typically taught in computer science courses, yet users may prioritize different things. To illustrate, consider an alternative arrangement where asset IDs are not merely column headers but natural partitions of the problem. This can be driven by applications where users primarily fetch individual assets: building and querying whole tables just to retrieve a single asset is inefficient, so storing each asset's data columnwise on its own may be preferable, as in the sketch below. Additionally, as operational parameters or calculations change the set of columns, alignment becomes crucial, which argues for user-centric design over extensive normalization. The result is DataFrames shaped the way users need them.
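A small sketch of the per-asset partitioning idea with ArcticDB: one symbol per bond, so a consumer of a single asset never touches the rest. The identifiers, library name, and local lmdb:// URI are placeholders.

```python
import pandas as pd
from arcticdb import Arctic

ac = Arctic("lmdb://./bond_store")                     # placeholder URI
lib = ac.get_library("us_bonds", create_if_missing=True)

dates = pd.date_range("2024-01-02", periods=3, freq="B")
# One symbol per asset: a user pulling a single bond reads only that bond's
# column-compressed data, instead of querying a wide table for one column.
for cusip in ["037833AK6", "594918BB9"]:               # hypothetical IDs
    asset_df = pd.DataFrame({"price": [99.0, 99.1, 99.3],
                             "duration": [4.1, 4.1, 4.2]}, index=dates)
    lib.write(f"bond_{cusip}", asset_df)

one_bond = lib.read("bond_037833AK6").data             # fetch just one asset
```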
Two key elements are essential here: firstly, managing large data sets effectively, whether they span billions or trillions of rows – this could range from tick data to decades of daily records. It’s also crucial to cater to extensive cross-sectional analyses by supporting hundreds of thousands, or even a million columns, as with some of our current users on ArcticDB.
Secondly, it is important to accommodate and handle missing data gracefully rather than relying solely on data normalization techniques. Allow the presence of irregular and incomplete data, optimizing performance even under those conditions. This approach must be adept at managing various scenarios, such as assets that no longer exist, or data gaps due to external issues like server failures. This flexibility is critical not only for quantitative analysts but extends to wider data science applications as well.
Understanding the rationale behind developing a new database opens the door to addressing these challenges. At this point it's worth another DALL·E illustration: a polar bear navigating a labyrinth, symbolizing our journey through database architecture. In database design it can be practical to compromise on certain traditional properties, notably Isolation from the ACID principles (Atomicity, Consistency, Isolation, Durability), to gain agility and performance. Atomicity ensures a transaction either completes or fails entirely, and consistency and durability guarantee reliable data states and backups. Isolation, which coordinates concurrent transactions, is the property that can be relaxed to boost performance; think of the analogy of an online shopping cart with fluctuating stock levels.
The store keeps a close inventory check to prevent overselling as each item can only be sold once. This control is crucial in trading, where maintaining the order of transactions and updates is necessary for accurate data representation. This kind of management, known as serializability, ensures that changes to data are handled in an orderly fashion without conflicts between users.
The discussion shifts when you consider data science rather than transaction systems: certain stringent consistency models, while necessary for transactions, can be relaxed for analytics. That relaxation removes the need for the complex coordination machinery transaction systems rely on, such as database servers, queues, and locks. It becomes possible to use simpler, even serverless solutions like AWS S3, with performance trade-offs from the added abstraction layer. If you can live with those inefficiencies, many of the patterns that would normally need queues and locks can still be implemented effectively without any dedicated server.
This serverless-database approach isn't novel; it was explored in a 2008 paper, not long after AWS launched S3 in 2006. That work showed how a database could run on S3 if you adjust your expectations of efficiency and performance, and that the database logic can live entirely on the client side. It suggested that atomicity and basic consistency are achievable, but that perfect isolation poses significant challenges. Essentially, the hard part is strict consistency, where every read observes the latest committed write, a classic database problem that requires a great deal of performance engineering to solve.
This isn't exactly rocket science once you decide to take this approach; it's familiar territory. Let's dig into how it's used in ArcticDB and what it means for the data structures. If the goal is to honor atomicity and consistency while delegating durability to the reliability of the storage layer, then the data organization needs to support that. An ideal choice is an immutable data structure, where additions are made by writing new versions instead of altering existing ones.
Imagine User 1 is reading version 5 while an update is being made at the same time. The correct approach is to produce a new version of the data: for a symbol named Apple, the new data is written out as version 6 and the symbol's reference is switched to it in a single atomic action. After that update, any subsequent reader sees version 6. That is the foundation on which database semantics can be layered.
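The following is a toy, in-memory sketch of that copy-on-write versioning idea, assuming nothing about ArcticDB's actual key layout: new versions are written first, then the symbol's reference is flipped atomically, and old versions remain readable.

```python
import threading

class VersionedSymbolStore:
    """Toy sketch of copy-on-write versioning; not ArcticDB's real key layout."""

    def __init__(self):
        self._versions = {}     # ("Apple", 5) -> immutable data for that version
        self._latest = {}       # "Apple" -> 5, the reference readers follow
        self._lock = threading.Lock()

    def read(self, symbol, as_of=None):
        version = self._latest[symbol] if as_of is None else as_of
        return self._versions[(symbol, version)]     # old versions stay readable

    def write(self, symbol, data):
        with self._lock:                             # stands in for the atomic swap
            new_version = self._latest.get(symbol, 0) + 1
            self._versions[(symbol, new_version)] = data  # write new objects first...
            self._latest[symbol] = new_version            # ...then flip the reference
        return new_version

store = VersionedSymbolStore()
store.write("Apple", "version 1 data")
store.write("Apple", "version 2 data")
print(store.read("Apple"), store.read("Apple", as_of=1))
```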
Beyond references and versioning, a database also maintains indexes and stores the data itself. I've repeatedly mentioned, though, that we eliminate database servers entirely; there are no job queues either. Everyone operates independently over a unified storage system, whether a shared file system or S3, and collaborative work becomes feasible. Given the multitude of DataFrames and the number of libraries involved, there is typically clear ownership, simply because of the volume.
This setup reshapes the usual database concerns. It's not solely about isolation; it's about where you allocate responsibilities. In traditional setups a light API suffices, because the heavy duties, such as running queries, ensuring security, handling transactions, and managing indexes, happen on the database server.
In a modern distributed database, responsibilities for performance and data volume are spread across servers and storage, which provides resilience and durability without single points of failure. Moving to a serverless architecture, the obvious sacrifice is transactional, driven by the isolation concerns above. The client becomes more complex, handling not just a simple API but also indexing and query execution, while security, capacity, and resilience are entrusted to a robust storage layer. Building on a top-tier storage system such as S3 provides extensive scalability, strong security features, and high resilience essentially for free.
In serverless environments, there’s no need to be concerned about managing database servers. An interesting benefit at an organizational level is the natural scalability in database workload that aligns with client capabilities. This model is advantageous in applications like data science where users with more robust systems handle heavier database tasks, facilitating a balance between data operations and the processing power available. With S3 handling overarching storage responsibilities, the focus shifts from server maintenance to data utilization.
The advantage of a serverless system extends beyond not managing servers. It allows for a very simple infrastructural setup: create an S3 bucket on AWS, configure credentials, and set up a client-side database such as ArcticDB. This configuration promises simplicity and scalability. Connecting is straightforward, the API takes only minor adjustment to get used to, and it supports standard database operations on DataFrames. A further advantage of such systems is their foundation on immutable data structures, which supports versioning even when data is deleted, so despite immutability, adaptability is retained.
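A minimal sketch of that setup, assuming a placeholder bucket, endpoint, library, and symbol name; a real deployment would point at your own S3 bucket with appropriate credentials:

```python
import pandas as pd
from arcticdb import Arctic

# Placeholder endpoint, bucket, and auth; swap in your own S3 configuration.
ac = Arctic("s3://s3.eu-west-2.amazonaws.com:my-research-bucket?aws_auth=true")

lib = ac.get_library("demo", create_if_missing=True)

df = pd.DataFrame({"price": [1.0, 2.0, 3.0]},
                  index=pd.date_range("2024-01-01", periods=3))
lib.write("my_symbol", df)
round_trip = lib.read("my_symbol").data      # comes back as a pandas DataFrame
```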
The capability to rewind time due to the existence of older versions is especially beneficial for data science. It’s crucial to be able to revert to various models to evaluate changes in data, outputs, and their relations to the model employed. This functionality is a vital benefit offered by the system’s architecture.
In light of these factors, we opted for a fully client-side database engine capable of data deduplication, compression, tiling, indexing, and integration with storage systems. This setup enables a comprehensive, shared database infrastructure. This system can operate on shared file systems, cloud storage, or high-performance flash drives in a local data center, the latter being our preferred method. Additional complexities in the data structures include building indexes and compressing data like tick data or alternative data, facilitating chunked, columnar storage in the data layer for optimized access.
Using VPN access to our corporate infrastructure, I'll run through some live data operations and the concepts behind them. There are 'namespaces', at the bucket level in S3 terms, where storage buckets are created with designated permissions. Data is organized into 'libraries' at the dataset level, for instance separating U.S. equity data from European equity data or meteorological data, with each library possibly containing millions of DataFrames tailored to specific informational needs.
Each unit of data is stored as a DataFrame. To illustrate, I'm connecting to our research cluster, which hosts 30,000 libraries; a separate production cluster is used for real trading. The example is simple: read from one library and write to another, navigating the available DataFrames, which are listed by symbol, to demonstrate how the system operates.
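Navigating that hierarchy looks roughly like the following sketch; the connection string and library name are placeholders rather than the actual research cluster configuration.

```python
from arcticdb import Arctic

# Placeholder connection string for the research storage.
ac = Arctic("s3://s3.eu-west-2.amazonaws.com:research-bucket?aws_auth=true")

print(ac.list_libraries())            # dataset-level libraries, e.g. "us_equities"

lib = ac.get_library("us_equities")   # hypothetical library name
print(lib.list_symbols())             # each symbol is a versioned DataFrame
```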
These are simple examples to illustrate what's going on. Essentially, you read and write DataFrames. I start by reading from Amazon 1, then write to Amazon, attaching some metadata to track various details. The system also has database capabilities such as appending: after reading a further segment, I append it to Amazon, resulting in an extended DataFrame.
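A rough sketch of that read/write/append flow, assuming `lib` is an ArcticDB library as above, the symbol names are illustrative, and the frames have a sorted DatetimeIndex:

```python
# `lib` is an ArcticDB library as above; symbol names are illustrative.
amzn = lib.read("amazon_1").data                    # read one DataFrame

lib.write("amazon", amzn.iloc[:500],                # write an initial chunk...
          metadata={"source": "demo_load"})         # ...with tracking metadata

lib.append("amazon", amzn.iloc[500:])               # append the rest as new rows

item = lib.read("amazon")
print(item.data.shape, item.metadata, item.version) # data, metadata, version
```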
Modifications can be made in the middle of the DataFrame, influencing both the indexing and versioning. For instance, the original write shows multiple iterations, having reached version 625. During an update to the center of the dataset, attentive observers might notice changes in the data. Moreover, the concept of time travel introduces new versions, allowing retrieval of the latest version or reverting to a specific version or timestamp, showcasing the initial version as a reference point. All these are illustrative examples of basic operations.
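An update in the middle of the symbol and the time-travel reads might look like the sketch below; the column name and dates are assumptions made to keep it self-contained, and the patch's columns must match the stored symbol.

```python
import pandas as pd

# Patch a slice in the middle: rows in the patch's index range replace the
# stored rows in that range and create a new version of the symbol.
patch = pd.DataFrame({"close": [123.4, 125.1]},       # column assumed to exist
                     index=pd.date_range("2023-06-01", periods=2))
lib.update("amazon", patch)

latest = lib.read("amazon")                           # newest version
v0     = lib.read("amazon", as_of=0)                  # time travel by version number
asof   = lib.read("amazon",
                  as_of=pd.Timestamp("2024-01-01"))   # or by timestamp
print(lib.list_versions("amazon"))                    # full version history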
The activities occur within a JupyterHub notebook powered by two virtual cores, operating on a rather limited virtual machine that possibly hampers performance. Yet, the demonstration handles extensive data involving 100,000 rows and columns, spotlighting a selection of three columns over several months, effectively managing the bulk of data. Similarly, it accommodates tick data, exemplified by Bloomberg’s Level 1 data—comprising Bid and Ask figures—spanning 66,000 equities. By sampling some columns from a day’s data, significant volume is processed within a fraction of a second, revealing over a million rows, thus confirming data retrieval efficacy.
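Selecting a narrow slice of a very wide symbol, as in the demo, can be expressed roughly as follows; the symbol and column names are hypothetical.

```python
import pandas as pd

# Pull three columns and a few months of rows from a very wide symbol. Only
# the matching column and row segments are fetched and decompressed.
subset = lib.read(
    "big_matrix",                                    # hypothetical 100k x 100k symbol
    columns=["col_1", "col_2", "col_3"],
    date_range=(pd.Timestamp("2024-01-01"), pd.Timestamp("2024-03-31")),
).data
print(subset.shape)
```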
Further illustrating functionality, the dataset from New York’s yellow taxi cabs is used to explore substantial tipping practices through enhanced query capabilities akin to those found in pandas. The queries attempt to identify instances where tips disproportionately exceed the total fare amount, such as a $100 tip on a $5 fare, uncovering remarkable tipping behavior. This similarity to pandas leverages the programming proficiency acquired in Python for effective data manipulation.
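A sketch of that kind of pandas-like query with ArcticDB's QueryBuilder; the symbol name is hypothetical, while the column names follow the public NYC yellow-taxi schema, and the "20 times the fare" threshold is just an illustrative cut-off.

```python
from arcticdb import QueryBuilder

# Pandas-like filtering pushed down into the read: find rides where the tip
# dwarfs the fare, e.g. a huge tip on a tiny fare.
q = QueryBuilder()
q = q[(q["fare_amount"] > 0) & (q["tip_amount"] > 20 * q["fare_amount"])]

generous = lib.read("nyc_yellow_taxi", query_builder=q).data
print(generous[["fare_amount", "tip_amount"]].head())
```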
Here's an insider look at how ArcticDB functions for end users, a tool employed notably in [inaudible]. It transfers data at 40 gigabytes per second directly from flash storage into Python, bypassing database servers altogether. Users generally run their computational work on clusters, and with robust networking, on the order of 40 GbE, this setup handles multiple days of continuous operation, processing billions of rows every second.
A special mention to D-Tale, an innovative tool for visualizing pandas dataframes and our most acclaimed open-source product. Our efforts in advancing this tool have been supported by other significant contributors like Bloomberg, who integrate D-Tale within their BQuant platform—a quantitative data science toolkit powered by Python, enriched with pre-loaded data. Additionally, QuantStack, a French firm specializing in Jupyter, conda-forge, and mamba among others, collaborates with us too. Their focus on open-source projects enhances our development capabilities significantly.
Participant 1: Pandas DataFrames traditionally struggle to scale to billions of rows. What specific optimizations did you apply, and were there any integrations with technologies like Spark?
Munro: Internally, the returned pandas DataFrames are memory-efficient and fine for querying within reasonable limits, but users often switch to alternative analytical tools such as Polars or DuckDB when pandas doesn't meet their scalability or speed requirements.
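That hand-off can be as simple as the sketch below, assuming the hypothetical taxi symbol from earlier and that Polars and DuckDB are installed; DuckDB can query a local pandas DataFrame by name in-process.

```python
import duckdb
import polars as pl

pdf = lib.read("nyc_yellow_taxi").data               # pandas frame from ArcticDB

pl_df = pl.from_pandas(pdf)                          # hand off to Polars...
avg_tip = duckdb.sql(
    "SELECT AVG(tip_amount) AS avg_tip FROM pdf"     # ...or query it with DuckDB
).df()
```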
Participant 1: Or they are stored in a single node of the cluster or they’re distributed across?
Munro: They’re distributed across the storage, which is your storage provider’s problem, and something they’re actually pretty good at solving. Generally, AWS will dynamically distribute your data to make sure it’s meeting demand.