Sep 26, 2019
Introducing Apache Druid 0.16.0
Earlier this week, the Apache Druid community released Druid 0.16.0-incubating.
Like every Druid release, this one has a huge amount and variety of fixes and new functionality. This particular one include over 350 new features, performance enhancements, bug fixes, and major documentation improvements from 50 contributors. As always, visit the Apache Druid download page for the software and full release notes detailing every change.
I’ll talk about a few of the major items below.
Shuffle for native parallel batch ingestion
As you may already know if you’re a long-time Druid user, it supports a variety of streaming and batch ingestion methods. For batch ingestion, there are two major methods: Hadoop-based (Druid submits MapReduce jobs) and native (Druid itself pulls data from the filesystem, cloud storage, etc).
This release introduces a native batch ingest shuffle mode as part of our ongoing work to reduce, and eventually eliminate, Druid’s dependency on Hadoop for massively scalable batch ingestion. In Druid’s history, the Hadoop-based method has been around the longest. Past releases have first introduced single-threaded native ingestion, then introduced a parallel mode, and now introduced a shuffle capability to the parallel mode.
The new shuffle capability allows for ‘perfect rollup’ and partitioning on dimensions, ultimately giving you better locality, compression, and query performance. For more details on how to enable it, check out Druid’s parallel native batch task documentation, and in particular the new forceGuaranteedRollup option.
Vectorized execution (operating on batches of rows at a time, instead of individual rows) is widely recognized as being crucial to maximizing performance of analytical databases like Druid. It allows queries to be sped up by reducing the number of method calls, allowing more cache-efficiency, and potentially enabling CPU SIMD instructions.
Starting with 0.16.0, Druid includes a vectorized processing engine that enables all of the following:
- Reading values from columns in batches instead of one value at a time. There was already some batching (e.g. decompressing happens in batches), but now Druid can retrieve multiple values in a single method call.
- Filtering rows in batches instead of one row at a time. This includes iterating filter bitmaps in batches when possible.
- Processing rows by query engines (timeseries, groupBy, etc) in batches.
In all cases, Druid benefits by cutting down on method calls and by improving locality. We’ve measured speedups of 1.3–3x for a wide variety of query types; check out https://static.imply.io/gianm/vb.html for details on the tests we’ve done.
Vectorization is off by default, and not all query features are supported in this release. In particular, the topN engine is not yet vectorized, nor are all aggregator and filter types. In future releases, we’ll build out support for all query features and enable vectorization by default. See Vectorizable queries in the Druid documentation for more details about how to enable this feature, and in which cases it applies.
An Indexer process
The new Indexer process is an alternative to the MiddleManager + Peon task execution system. Instead of forking a separate JVM process per-task, the Indexer runs tasks as separate threads within a single JVM process. The Indexer is designed to be easier to configure and deploy compared to the MiddleManager + Peon system and to better enable resource sharing across tasks.
The advantage of the Indexer is that it allows query processing resources, lookups, cached authentication/authorization information, and much more to be shared between all running indexing task threads, giving each individual task access to a larger pool of resources and far fewer redundant actions done than is possible with the Peon model of execution where each task is isolated in its own process.
The Indexer is experimental for now, but we believe its simplicity and resource pooling benefits will lead, in future releases, to Indexers replacing MiddleManagers in the default Druid architecture. Until then, I encourage you to check it out and provide feedback. Start from the docs at https://druid.apache.org/docs/latest/design/indexer.html and let us know what you think through community channels!
Web console enhancements
Two major enhancements stand out in this release: Data Loader Expansion and SQL Editor upgrade. The Data Loader introduced in 0.15.0, has been expanded in 0.16 to support Kafka, Kinesis, segment reindexing, and even “inline” data which can be pasted in directly.
The query view on the web console has received a major upgrade for 0.16, transitioning into an interactive point-and-click SQL editor. There is a new column picker sidebar and the result output table let you directly manipulate the SQL query without needing to actually type SQL.
For a full list of all new functionality in Druid 0.16.0, head over to the Apache Druid download page and check out the release notes!