Mar 17, 2020
Imply Releases A Reference Architecture for Apache Druid on Microsoft Azure
Simplify cloud-based real-time intelligence with Microsoft Azure and Imply
Tijo Thomas, a Solutions Architect at Imply, recently wrote a reference architecture for Apache Druid on Microsoft Azure that includes some best practices for running on services such as Azure VM, Azure Blob Storage, Azure Database Service and HDInsight. The reference architecture provides a brief explanation of Imply’s components and then goes on to describe example cluster architectures and their accompanying machine types and configurations. It will help anyone looking to deploy Imply components, including Apache Druid, on Microsoft Azure get a head start and avoid potential snafus.
Apache Druid is a real-time analytics database designed for ultrafast query response on large datasets. Druid can scale to ingest millions of events per second, store trillions of events (petabytes of data), and perform queries with sub-second response times at scale. Druid’s most common use cases are those where real-time (streaming) ingestion, fast query performance, and no downtime are critical. This makes Druid a good choice for operational analytics projects that provide real-time intelligence: the process of delivering information as events occur so that businesses can gain immediate insight. Some of the world’s largest companies and digital leaders rely on Druid for use cases such as clickstream analytics, application/device/network performance monitoring, and BI/OLAP. While Druid ingests data from a variety of sources, on Azure it is commonly paired with Apache Kafka or Azure Event Hubs for event monitoring, financial analysis, and IoT monitoring.
Druid is cloud-native and runs as server types that host groups of processes. Briefly, there’s the Master server to coordinate data ingestion and storage, the Data server to store and ingest data, and the Query server(s) that act as endpoints for users and client applications to interact with. Druid also relies on external metadata storage, deep storage, and Apache ZooKeeper to coordinate its processes.
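As a rough illustration of the layout just described, the sketch below models the three server types and the processes each one hosts. The process names (Coordinator, Overlord, Historical, MiddleManager, Broker, Router) follow the standard Druid documentation rather than this post, and the names `SERVER_TYPES`, `EXTERNAL_DEPENDENCIES`, and `server_for` are hypothetical, chosen only for this example.

```python
# Illustrative model of Druid's server types; not an official Druid API.
# Process names are taken from the general Druid docs (an assumption here).
SERVER_TYPES = {
    "master": ["coordinator", "overlord"],    # coordinate ingestion and storage
    "data": ["historical", "middleManager"],  # store and ingest data
    "query": ["broker", "router"],            # endpoints for users and clients
}

# External services Druid relies on, as listed in the text.
EXTERNAL_DEPENDENCIES = ["metadata storage", "deep storage", "ZooKeeper"]


def server_for(process: str) -> str:
    """Return the server type that hosts the given Druid process."""
    for server, processes in SERVER_TYPES.items():
        if process in processes:
            return server
    raise KeyError(f"unknown Druid process: {process}")


if __name__ == "__main__":
    for server, processes in SERVER_TYPES.items():
        print(f"{server}: {', '.join(processes)}")
    print("depends on:", ", ".join(EXTERNAL_DEPENDENCIES))
```

Calling `server_for("broker")` returns `"query"`, which mirrors the idea that client applications talk to Query servers and not to the Master or Data servers.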
There’s a lot of detail underlying this simple explanation, and you can learn all about it when you download the Azure reference architecture document.
Azure, Microsoft’s cloud computing platform, is a combination of services for creating, deploying and managing applications that run in Microsoft’s secure global data centers. The set of services can be used to build almost any software solution, and powers use cases such as global retail and ecommerce, CPU-intensive scientific data calculations, and even mundane yet critical IT activities such as data backup.
Microsoft has gone to great lengths to integrate Azure services with open source frameworks, giving customers an easy way to learn, develop for, and run them in production. With the rapid, widespread adoption of Druid, it’s no surprise that there are a number of very large Druid deployments on Azure.
There are a few Azure services that are of primary importance when it comes to running Druid: Azure VM, Azure Database Service, Azure Blob Storage and HDInsight. Let’s start at the bottom of the list and work our way up, because the sizing of the Azure VM compute environment is worth a discussion on its own, as that is where Druid and the rest of the Imply processes will run. HDInsight, Azure’s Apache Hadoop service, is a powerful yet easy-to-use solution for Druid’s deep storage, as is Azure Blob Storage. At the same time, Azure Database Service makes it easy to set up, maintain, manage and administer MySQL or PostgreSQL as Druid’s metadata storage component.
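To make the storage choices above more concrete, here is a minimal sketch that renders a fragment of Druid’s common runtime properties for Azure Blob Storage as deep storage and PostgreSQL as metadata storage. The property names are recalled from Druid’s Azure and PostgreSQL extension docs and should be checked against your Druid version. The function name `render_common_properties`, the port, and every account, host, and database value are placeholders invented for this example. Secrets such as the storage key and database password are deliberately left out.

```python
def render_common_properties(account, container, db_host, db_name, db_user):
    """Build a common.runtime.properties fragment as a single string.

    Assumption: property names match the Druid Azure and PostgreSQL
    extensions; verify them against the docs for your Druid version.
    """
    connect_uri = f"jdbc:postgresql://{db_host}:5432/{db_name}"
    lines = [
        # Extensions for Azure deep storage and PostgreSQL metadata storage.
        'druid.extensions.loadList='
        '["druid-azure-extensions", "postgresql-metadata-storage"]',
        # Deep storage on Azure Blob Storage.
        "druid.storage.type=azure",
        f"druid.azure.account={account}",
        f"druid.azure.container={container}",
        # Metadata storage on a managed PostgreSQL database.
        "druid.metadata.storage.type=postgresql",
        f"druid.metadata.storage.connector.connectURI={connect_uri}",
        f"druid.metadata.storage.connector.user={db_user}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    print(render_common_properties(
        account="examplestorageaccount",
        container="druid-segments",
        db_host="example-db.example.com",
        db_name="druid",
        db_user="druid",
    ))
```

Generating the fragment from a function keeps the Azure-specific values in one place, which can help when the same settings must be copied to the Master, Data, and Query servers.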
As a former Ops guy, I always focus on the server sizing recommendations in reference architectures because nothing brings a POC to a halt faster than out-of-memory (OOM) errors on inadequate hardware. Server and cluster sizing can be key decisions in your Druid rollout, and the reference architecture provides a deep explanation of Druid server types (Master, Data, Query) and the Azure VM instance types that are best suited for them. On Azure, the best options for instance types are the Linux-based D, E, F, and Lsv2 series VMs. A good place to start is with D Series VMs as Master servers, as they were designed to run in production. You can further specialize the deployment by running Lsv2 instances with local SSD as Data servers and F instances pumped up with RAM as Query servers. For your reference, the Druid docs contain a more general discussion of Druid server sizing and cluster tuning that goes beyond the Azure environment.
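The pairing of server types and VM series suggested above can be written down as a small lookup, sketched here. Only the role-to-series mapping comes from this post. The helper name `plan_cluster` and the node counts in the usage example are hypothetical, and specific VM sizes within each series are intentionally not filled in because the post does not name them.

```python
# Role-to-series mapping as suggested in the text above.
VM_SERIES_BY_SERVER = {
    "master": "D",    # general-purpose starting point for Master servers
    "data": "Lsv2",   # local SSD for Data servers
    "query": "F",     # Query servers
}


def plan_cluster(node_counts):
    """Return (server type, VM series, node count) tuples for a cluster.

    `node_counts` maps a server type to how many nodes of it you want.
    """
    plan = []
    for server, count in node_counts.items():
        if server not in VM_SERIES_BY_SERVER:
            raise ValueError(f"unknown server type: {server}")
        if count < 1:
            raise ValueError(f"need at least one {server} node")
        plan.append((server, VM_SERIES_BY_SERVER[server], count))
    return plan


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for server, series, count in plan_cluster({"master": 1, "data": 3, "query": 1}):
        print(f"{count} x {series}-series VM(s) for {server}")
```

Actual instance sizes and node counts depend on data volume and query load, so treat this as a checklist of roles to size, not as a sizing recommendation.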
After installing Imply on Azure, you can learn more about ingesting, querying and visualizing data with the Imply Quickstart.
Imply Azure Reference Architecture
Imply and Apache Druid can be installed on any flavor of Linux and in any public or private cloud to bring real-time intelligence to your business. Get your project off the ground quickly with the Apache Druid Reference Architecture for Azure Cloud.