In this post, I discuss how I would choose an analytics database as of early 2020.
Choosing a database technology is a huge decision because it constrains how I am going to build the data loading pipeline, query patterns, and database administration. In this post, I ignore the consideration of database administration.
When I am evaluating database technologies, at a minimum I will do my best to understand the architecture query engine and storage engine. To understand the query engine, I review the in-memory data mode and how worker nodes coordinate smaller computing tasks. To understand the storage engine, I review the file format and storage backends. Examples of file formats are ORC, parquet, and custom-built format; examples of storage backends are apache Kudu, apache Bookeeper, object store such as S3, and custom-built storage layers that are collocated with compute workers.
I distinguish between two types of requirements. The first type is for batch processing, and the other is to serve user-facing, realtime queries.
Batch¶
Queries for batch processing could take anywhere from seconds to hours to days. The query engines are used to generate reports and ad-hoc analysis. Some data workloads could bypass the query engines and access the data directly.
The majority of the hyped-up databases fall into this category. Most of these solutions work well enough. Query performances are generally acceptable. Using SQL query is likely to be the easiest way to slice and dice the data. Once I figure out that a particular solution is sufficient for my query patterns and performance requirements, the other questions that I would ask are:
- What are my choices for storage engines?
- Can I use other computing frameworks (Beam, Flink, Spark, MapReduce) on my data?
- How do I load data into the storage engines?
- How do I deploy and administer the database?
Some of the most obvious choices are:
- Dremio
- Presto
- Hive
- Spark SQL
- Snowflake
Snowflake stands out because it is a fully managed solution. It does not expose its storage backends for other tools to connect to. It usually means that the data platform needs to keep another copy of the raw data in an object store (e.g. S3) before the data is loaded into Snowflake.
Realtime¶
Realtime analytics queries need to be completed in seconds, if not less. All of the query engines listed in the previous section are not designed to serve user-facing queries. Even if we tune those engines to have reasonable realtime results, they are usually expensive and deliver horrendous tail latency, which will take years to learn and optimize.
Realtime analytics should take advantage of as many of the following features as possible:
- minimize network: Majority of the data are collocated at compute nodes.
- minimize disk: Compute nodes caches hot data in memory.
- memory mapping: The file format should be designed to be compatible with the in-memory data model. Memory mapping avoids data copying from kernel space into user space. The data model uses page markers to allow for partial scans.
- optimized in-memory layout: The compute engine performs computation without wasting cycles on converting data types.
In my experience, even if the most appropriate query engine is chosen, there is going to be a need to preaggregate data and remove unnecessary columns to get acceptable query performances. The options for realtime analytics are limited and flawed.
My top choices are:
Both Druid and Pinot are designed to be realtime analytics engines ( Lev18 ). One can read their documentation for details. I have to mention Elasticsearch as a valid option here even though it is designed for full-text document search. Elasticsearch has a simple design that horizontally scales out its search and aggregation to each of its shards. Data collocation, horizontal scaling, aggressive in-memory processing, and flexible aggregation semantics are sufficient to make it a passable realtime analytics engine for basic aggregation and bucketing that are typical of user-facing dashboards.
Data Storage¶
I would consider data storage backends after I identify the query engine most appropriate for the query patterns. Sometimes I do not have a choice. For example, Elasticsearch has built-in storage layer. Druid’s storage format in the form of Druid segments and metadata is fixed. However, if the engine of choice is Presto, we would have to choose the storage backend and file formats.
My personal preference is leverage on an object storage as much as I can. S3 is dead simple. I tend to prefer Parquet if I have a choice. For example, one of my favorites is to organize data as partition datasets on S3 as Parquet files.
table_name/partition_key1/partition_key2/partition_key3
These data could be used by Preso, Hive, Dremio, and Spark SQL directly. Besides using SQL, I can easily write Spark or Flink jobs to process them. I can write background batch jobs on the data without any frameworks. This style of storage engine is flexible and supports high throughput on a batch mode.
Realtime Data and Realtime Analytics¶
One of the most desirable features of an analytics system is have realtime query performances on realtime data. This is hard, and there should be an impossibility theorem out there to sound the alarms for the uninitialized. The perfect solution does not exist, it is a matter of tradeoffs.
Achieving realtime analytics is hard enough. That is, even if we can assume that the data format and storage are fixed and could be optimized for answering queries, fast realtime queries are still difficult. In the best case scenario when we have data collocating, data model compatible with memory-mapping, and optimized in-memory data representation, we might still have to rely on pre-aggregation to have reasonable query performances. When a query need to scan through large amount of data, there is a theoretical limit on time needed to complete based on the compute resources. All a perfect query engine could do is to get close to that limit.
Allowing for real time data loading invariably degrades the ability the query engine could optimize compute. For example, if we want data to show up more quickly, we could build more but smaller files. A large number of small file leads to query slow down. We could update the existing files, but the query engine have to swap out those files. We could stream the realtime data points into the query engine, but those data points would have to be limited otherwise the query engine could spent all of its compute cycle acting as a stream processor.
It does not mean that we cannot improve on the existing solutions to have both better realtime data and query. One promising data storage engine is Kudu ( kud20 ). It looks like that it has good integration with some query engines (e.g. Presto) and processing framework (e.g. Flink). However, none of the query engine that connects to Kudu are designed for realtime. In a way, Kudu could not be a storage engine for a best in-class realtime query engine because it is not designed to be memory-mapped and not designed to be collated with compute nodes.
Another way to achieve realtime query results is to think about how to optimize some of the more popular batch query engines. Take Dremio as an example. If it is computing on Parquet on S3, there is already too much time spent on network and converting data into Arrow in-memory representation. A more realtime “Dremio” can get around those two bottlenecks. It needs to introduce stickness between data and compute nodes. It would need to use a file format that could be memory map onto Arrow layout, e.g. Arrow IPC file format ( arr20 ).
Given the tradeoffs between storage engine and realtime query engines, the best strategy to deliver realtime queries is probably pre-computation, building aggregated tables that comes as close to your query patterns as possible.
Citations
- druid. Druid Architecture Overview. 2020. URL: https://druid.apache.org/docs/latest/design/architecture.html. 1
- pinot. Pinot Architecture Overview. 2020. URL: https://docs.pinot.apache.org/basics/architecture. 1
- Leventov, Roman. Comparison of the Open Source OLAP Systems for Big Data. 2018. URL: https://leventov.medium.com/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7. 1
- Kudu Overview. 2020. URL: https://kudu.apache.org/overview.html. 1
- Apache Arrow File Format. 2020. URL: https://arrow.apache.org/faq/. 1