Jul 28, 2021
The Open Source Modern Analytics Stack
Danny D. Leybzon
How combining Druid, Trino, and Superset can enable self-service analytics across your organization
Empowering all types of users to analyze data incredibly quickly from wherever it sits provides huge value to organizations. Citizen data scientists and decision scientists are able to make empirically-backed, data-driven decisions when they can interface directly with the data themselves, instead of being bottlenecked by dedicated data science and analytics teams. By enabling these employees to make better decisions, your organization can unlock tremendous value and outpace the competition. We propose an analytics stack built entirely around open source technologies that empowers your employees to make smarter decisions based on the data.
In the last decade, no field has seen as much innovation and development in open source software as data. There exist open source tools for data collection, data processing, data workflow orchestration, data science, and, of course, data analytics. The open source data analytics stack that we propose here aims to let users do one thing: query data incredibly quickly (with Druid), from anywhere (with Trino), without needing to know SQL or any other programming language (with Superset). This stack is used by cutting-edge technology companies such as Netflix and Airbnb and is seeing adoption across a wide range of industries and company sizes.
The full list of reasons why companies continue to choose open source software to solve data challenges is beyond the scope of this article, but a few prominent ones come to mind: open source is innovative, open source is auditable, and open source is fast to respond to the needs of users because users are able to contribute directly to the code base. While open source technologies can be a pain to manage out of the box, there exist trusted vendors (e.g., Imply for Apache Druid, Starburst for Trino, and Preset for Apache Superset) who allow you to reap the benefits of open source without having to operate the software yourself.
Apache Druid is a time series analytics database designed for fast slice-and-dice analytics (OLAP queries) on large data sets.
- Great for powering use cases where low-latency ingest, high performance queries, and high uptime are important
- Works best with event-oriented data (to optimize raw data structures for fast querying/efficient storage)
- Supports exactly-once streaming ingestion from Kafka and batch ingestion from object storage, with high-performance indexing/aggregation
- Moving towards full SQL support
As such, Druid is commonly used for powering GUIs for interactive analytics apps or as a backend for highly concurrent APIs that need fast aggregations (for use cases such as anomaly detection, complex alerts, or hyper-fast dashboards). This makes it a powerful tool for enabling the speed that self-service analytics often requires, whether it’s customer-facing or internal.
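To make the "fast aggregations behind an API" idea concrete, here is a minimal sketch of how an application might prepare a slice-and-dice query for Druid's SQL HTTP endpoint (`/druid/v2/sql`). The host and port, the `clickstream` datasource, and the `country` column are assumptions made up for illustration; the request is built but not sent.

```python
import json
from urllib.request import Request

# Assumption: a Druid Router reachable locally on port 8888.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"


def build_druid_sql_request(sql: str, url: str = DRUID_SQL_URL) -> Request:
    """Build (but do not send) a POST request carrying a Druid SQL query."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical datasource "clickstream" with a "country" dimension.
# The query counts events per hour and country over the last day.
HOURLY_EVENTS_SQL = """
SELECT TIME_FLOOR(__time, 'PT1H') AS hour_start, country, COUNT(*) AS events
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1, 2
ORDER BY events DESC
LIMIT 10
"""

request = build_druid_sql_request(HOURLY_EVENTS_SQL)
# Against a live cluster, urllib.request.urlopen(request) would return JSON rows.
```

A dashboard or alerting service would typically issue many small requests like this one concurrently, which is the workload Druid is designed for.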
Trino (formerly known as PrestoSQL) is a distributed SQL query engine created to run large-scale analytics and complex queries against large volumes of data.
- Great for bursty workloads (such as batch reports or large scale queries without tight SLAs)
- Built for handling large complex JOINs
- Shines when running federated queries across large-scale data repositories (Snowflake, object storage, MySQL, etc.)
- Use existing SQL to query data from anywhere
These characteristics make Trino a great option for building a data lakehouse. Its ability to leverage separation of compute and storage (a la a data lake query engine) means that it’s cheap to store lots of data. Its ability to query data from any number of different sources and formats means that it unlocks value from disparate data sources. And its SQL-compliant query language means that SQL-fluent users (or BI tools that issue SQL queries for accessing data) can easily connect to it.
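A federated query in Trino addresses each table by a three-part name, `catalog.schema.table`, where the catalog identifies the underlying data source. The sketch below assembles such a query as a string; the catalog names (`hive`, `mysql`), schemas, tables, and columns are all hypothetical and would depend on how your Trino cluster is configured.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a Trino three-part table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def federated_join_sql(events_table: str, users_table: str) -> str:
    """Join an event table in one catalog to a lookup table in another."""
    return (
        "SELECT u.plan, COUNT(*) AS events\n"
        f"FROM {events_table} AS e\n"
        f"JOIN {users_table} AS u ON e.user_id = u.id\n"
        "GROUP BY u.plan"
    )


# Hypothetical catalogs: "hive" over object storage, "mysql" over an
# operational database. Neither name comes from the article.
sql = federated_join_sql(
    qualified("hive", "web", "page_views"),
    qualified("mysql", "crm", "users"),
)
print(sql)
```

The resulting statement can be submitted through any Trino client; Trino plans the join itself, so neither source system needs to know about the other.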
Superset isn’t a database or a query engine; it’s an open-source data visualization tool meant to enable self-service analytics by empowering anybody to analyze and explore data visually.
- Great for enabling self-service analytics for users not fluent in SQL
- Originally built to visualize data from Druid; it can now visualize data from a number of sources, including Trino
- The best open source alternative to BI tools like Tableau, Looker, etc.
Because of these features, Superset is a powerful addition to the open-source analytics stack. While it doesn’t process the data itself, it enables all users across an organization to answer questions themselves, rather than being bottlenecked by a single data science or data analytics team.
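Superset connects to each backend through a SQLAlchemy URI entered when you add a database. As a rough sketch, the helpers below format URIs in the shapes used by the Druid and Trino SQLAlchemy dialects; the host names, ports, user, and catalog are placeholders, and the corresponding Python drivers are assumed to be installed in the Superset environment.

```python
def druid_uri(host: str, port: int = 8888) -> str:
    """SQLAlchemy URI pointing at Druid's SQL endpoint."""
    return f"druid://{host}:{port}/druid/v2/sql"


def trino_uri(user: str, host: str, catalog: str, port: int = 8080) -> str:
    """SQLAlchemy URI pointing at one catalog on a Trino coordinator."""
    return f"trino://{user}@{host}:{port}/{catalog}"


# Placeholder hosts and names, chosen only for illustration.
print(druid_uri("druid-router.internal"))
print(trino_uri("analyst", "trino.internal", "hive"))
```

Once both databases are registered, the same Superset charts and dashboards can draw on either engine.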
How they fit together
Hopefully, you can now see how Druid (for interactive queries), Trino (for analyzing data wherever it sits), and Superset (for visualizing data from either of these engines) each enable self-service analytics in their own way. However, in combination, these three tools are much more powerful than any one of them on its own. While Superset can be connected to a number of different databases, its interactivity complements Druid’s speed. While Trino can join data from many disparate sources in a matter of seconds or minutes, it isn’t fast enough to empower self-service analytics when users need interactive speeds. While Druid provides those interactive speeds, it falls short in its ability to join data from multiple sources.
By combining these three tools—Druid, Trino, and Superset—your team can reap the benefits of all three without having to deal with the tradeoffs or limitations of any individual tool. You can get started with this stack by following the quick start guides for Druid, Trino, and Superset as well as the Druid-Trino connector.
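For a sense of what wiring Trino to Druid involves, the sketch below generates the contents of a Trino catalog file for the Druid connector, which reaches Druid through the broker's Avatica JDBC endpoint. The broker host is a placeholder and the port is an assumption about a default deployment; check the connector documentation for the options your version supports.

```python
def druid_catalog_properties(broker_host: str, broker_port: int = 8082) -> str:
    """Return the text of a Trino catalog file for the Druid connector."""
    avatica_url = f"http://{broker_host}:{broker_port}/druid/v2/sql/avatica/"
    lines = [
        "connector.name=druid",
        f"connection-url=jdbc:avatica:remote:url={avatica_url}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder broker host; the text would be saved as a catalog file
# such as etc/catalog/druid.properties on the Trino cluster.
properties_text = druid_catalog_properties("druid-broker.internal")
print(properties_text)
```

With a catalog like this in place, Druid datasources appear in Trino as tables under the `druid` catalog and can be joined with data from other sources.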