Jun 22, 2022
Druid 0.23 – Features And Capabilities For Advanced Scenarios
By Abhishek Agarwal & Will Xu
Features and capabilities for advanced scenarios
This is part two of the Druid 0.23.0 release blog. Many of Druid’s improvements focus on building a solid foundation: making the system more stable, easier to use, faster to scale, and better integrated with the rest of the data ecosystem. This blog is intended for advanced users, as well as potential and existing contributors to the Druid project who might want to peek behind the scenes.
Streaming: Druid improves integration with Kinesis
Kinesis supports dynamic re-sharding to accommodate traffic growth. Re-sharding creates empty intermediate shards, and Druid ingestion could get stuck on them. This release adds an option to ignore those empty shards: set skipIgnorableShards = True as part of Druid common settings or as part of the ingestion context.
Druid also now supports the newer, faster Kinesis API for listing Kinesis shards. Enable it by setting useListShards = True.
We recommend both settings for anyone using Kinesis ingestion, and we plan to make them the default in a future release.
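As a rough illustration of the two flags together, here is a minimal Python sketch that builds a JSON fragment with both enabled. It assumes the flags are placed in the tuningConfig of a Kinesis supervisor spec; that placement, the helper name kinesis_tuning_flags, and the variable names are assumptions made for this example, so check the Kinesis ingestion documentation for your Druid version before relying on it.

```python
import json


def kinesis_tuning_flags(skip_ignorable_shards=True, use_list_shards=True):
    """Hypothetical helper: returns a tuningConfig-style dict with both flags.

    Placing the flags in "tuningConfig" is an assumption of this sketch.
    """
    return {
        "type": "kinesis",
        # Ignore the empty intermediate shards left behind by re-sharding.
        "skipIgnorableShards": skip_ignorable_shards,
        # Use the newer Kinesis API to query for shards.
        "useListShards": use_list_shards,
    }


# Fragment to merge into an existing supervisor spec (not a complete spec).
supervisor_spec_fragment = {"tuningConfig": kinesis_tuning_flags()}
print(json.dumps(supervisor_spec_fragment, indent=2))
```

The fragment is deliberately partial: it only shows the two settings discussed above and would need to be merged into a full supervisor spec.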
Task system: Task reports for parallel tasks
Druid now publishes task reports for parallel tasks. This makes parallel tasks easier to monitor and is a prerequisite for moving to native batch ingestion. The following image shows the parent task and the associated sub-tasks:
The auto-compaction system helps keep segment files optimal for good query performance. This release introduces two changes that make the system more useful.
The first change supports auto-compaction of mixed granularity overlapping intervals, which was previously not possible. This paves the way to support changing the granularity of data based on the age of data in the future.
The second change allows the resources used by the auto-compaction system to be adjusted independently of other tasks. This lets users run auto-compaction more frequently than other tasks such as segment balancing and data drops.
The web console’s segment view has also changed to help visualize segment fragmentation. Specifically, a significant variance in segment sizes is a good indication that your data is fragmented. This is where applying auto-compaction can help your overall performance.
Over the next few releases, we aim to make the auto-compaction system enabled by default so segment files can stay optimized.
Querying: Better JDBC
This release includes a number of JDBC connection improvements, such as handling of trailing slashes, better logging, and sanitization of exceptions. If you are using JDBC today, we recommend upgrading and giving the new version a try.
New things for Druid contributors
Overview for contributors
Below are the highlights from the release notes. For details, please check out the full release notes.
- Added the SQL query ID to the response header for failed SQL queries (#11756)
- Improved query IDs to make it easier to link queries and sub-queries for end-to-end query visibility (#11809)
Better internal typing systems
- Added ARRAY_CONCAT_AGG to aggregate array inputs together into a single array (#12226)
- Added a query context to use internally generated SegmentMetadata query (#11429)
- Added support for Druid complex types to the native expression processing system to make all Druid data usable within expressions (#12016)
- Added the ability to store null columns in segments (#12279)
- Druid now returns an empty result when a GROUP BY query gets optimized to a timeseries query (#12065)
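To make the ARRAY_CONCAT_AGG item above more concrete, the following Python sketch builds the JSON body for a Druid SQL request that uses it. The datasource "visits" and the columns "country" and "city" are hypothetical names invented for this example, and the helper build_sql_request is likewise illustrative only; the sketch serializes the request and does not contact a cluster.

```python
import json

# Hypothetical datasource and column names, used only for illustration.
SQL = """
SELECT ARRAY_CONCAT_AGG(ARRAY["country", "city"]) AS places
FROM "visits"
"""


def build_sql_request(query: str) -> str:
    """Serialize a query into a JSON body for Druid's SQL endpoint."""
    return json.dumps({"query": query.strip()})


body = build_sql_request(SQL)
# The body could then be POSTed to /druid/v2/sql with a JSON content type.
print(body)
```

Each row contributes a two-element array here, and the aggregator concatenates those arrays into a single array for the result.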
Better memory management to reduce OOM during ingestion
- Fixed OOM failures in the dimension distribution phase of parallel indexing (#12331)
- Druid no longer materializes a list of segment files, or loops over it, when finding a partition file during shuffle, which reduces OOM issues (#11903)
The Druid 0.23.0 release includes the following metrics and metric dimensions to help you better monitor and operate a Druid cluster:
- Auto-compaction duty group
- Whether a query is vectorized
- Shenandoah GC
- CPU and CPU sets for cgroups
- Jetty server thread pool
- Batch tasks finish waiting for segment
New metric dimensions
- Auto-compaction duty cycle
- Work category for tasks
Druid now also includes a Prometheus emitter by default and supports proxying data through an HTTP proxy.
Looking for contributors
We’re very thankful to all of the 81 contributors who have made Druid 0.23.0 possible – but we need more!
Are you a developer? A tech writer? Someone who is just interested in databases, analytics, streams, or anything else Druid? Join us! Take a look at the Druid Community to see what is needed and jump in.
Try this out today
For a full list of all new functionality in Druid 0.23.0, head over to the Apache Druid download page and check out the release notes!