Sep 18, 2020
Virtual Druid Summit Returns with Talks from Pinterest, Splunk, GameAnalytics, Nielsen, and Zeotap
Though you might have heard that Virtual Druid Summit was returning, did you guess that it would be so soon? We’re stoked, too! The third installment of Virtual Druid Summit will be taking place on October 7, 2020.
Gian Merlino (Apache Druid PMC Chair) will deliver the opening keynote, which will be followed by talks from Apache Druid adopters at GameAnalytics, Zeotap, Nielsen, Splunk, and Pinterest. You can look forward to hearing the details of these companies’ Druid use cases, and after each talk wraps, you’ll have the opportunity to ask the speakers questions.
Once again, Virtual Druid Summit is a free event and the virtual format enables us to bridge our geographical gaps. Sign up and select your talks today by visiting the Virtual Druid Summit III registration page!
Virtual Druid Summit III: October 7, 2020 8am – 1:45pm Pacific Time
Building Data Applications with Apache Druid
Gian Merlino, PMC Chair, Apache Druid
8:00am – 8:45am PT
One of the most popular use cases for Apache Druid is building data applications. Data applications exist to deliver data into the hands of everyone on a team in a business, and are used by these teams to make faster, better decisions. To fulfill this role, they need to support granular drill down, because the devil is in the details, but also be extremely fast, because otherwise people won’t use them!
In this talk, Gian Merlino will cover:
- The unique technical challenges of powering data-driven applications
- What attributes of Druid make it a good platform for data applications
- Some real-world data applications powered by Druid
Building a Real-Time Gaming Analytics Service with Apache Druid
Ramón Lastres Guerrero, Director of Engineering, GameAnalytics
9:00am – 9:45am PT
At GameAnalytics we receive and process real-time behavioural data from more than 100 million daily active users, helping thousands of game studios and developers understand user behaviour and improve their games. In this talk, you will learn how we migrated our legacy backend system from an in-house streaming analytics service to Apache Druid, and the lessons we learned along the way. By adopting Druid, we have been able to reduce development costs, increase the reliability of our systems and implement new features that would not have been possible with our old stack. We will provide an overview of our approach to schema design, segment optimisation, creation of our query layer, caching and datasource optimisation, which can help you better understand how to use Druid successfully as a key component in your data processing and reporting infrastructure.
Data Modelling in Druid for Non-temporal and Nested Data
Sathish K S, VP of Engineering & Chaitanya Bendre, Lead Software Engineer, Zeotap
10:00am – 10:45am PT
Druid has been the production workhorse at Zeotap for the past 2+ years, powering the core Audience planning across our Connect and Targeting products. Though Druid is best suited to data with a time dimension, since it partitions data by time first, we have also used it to serve ML-powered enhanced insights and estimates of potential dataset sizes, both of which support our core business case of Audience planning. These datasets have no timestamp (that is, they are non-temporal), are high in scale, and have nested dimensions. We handle them with nuanced data modelling that stores the datasets and retrieves them with millisecond latency. The core of the presentation covers the data modelling journey behind these use cases, detailing the query access patterns. We also delve into the architecture: ingestion into the Druid sink and processing, including ML. Finally, we go over the production setup and configuration and the performance tuning applied. The presentation is organized under the following heads:
- Business case in Ad-Tech and Mar-Tech vertical
- Audience Planner Use case 1 – Insights
  - Lambda Architecture and data flow
  - Deep dive on data model
- Audience Planner Use case 2 – Estimator
  - Architecture and data flow
  - Stratified sampling explained
  - Data model to solve nested data
- Audience Planner Use case 3 – Skew correction
  - Skew correction model
  - Data model in Druid to accommodate output from ML models
- Production setup, config and tunings
- Production Operation experience takeaways
Casting the Spell: Druid in Practice
Itai Yaffe, Principal Solutions Architect, Imply & Yakir Buskilla, VP R&D and GM Israel, Nielsen Identity
11:00am – 11:45am PT
At Nielsen Identity, we leverage Druid to provide our customers with real-time analytics tools for various use cases, including in-flight analytics, reporting and building target audiences. The common challenge of these use cases is counting distinct elements in real time at scale. We’ve been using Druid to solve these problems for the past 4 years, and have gained a lot of experience with it.
In this talk, we will share some of the best practices and tips we’ve gathered over the years, including:
- Data modeling
- Retention and deletion
- Query optimization
Druid on Kubernetes with Druid-operator
Himanshu Gupta, Principal Software Engineer, Splunk
12:00pm – 12:45pm PT
We went through the journey of deploying Apache Druid clusters on Kubernetes (K8s) and created a druid-operator (https://github.com/druid-io/druid-operator). This talk introduces the Druid Kubernetes operator, how to use it to deploy Druid clusters and how it works under the hood. We will share how we use this operator to deploy Druid clusters at Splunk.
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. Druid is a complex stateful distributed system, and a Druid cluster consists of multiple web services such as Broker, Historical, Coordinator, Overlord, MiddleManager, etc., each deployed with multiple replicas. Deploying a single web service on K8s requires creating a few K8s resources via YAML files, and that work multiplies across the services inside a Druid cluster. Doing it for multiple Druid clusters (dev, staging, production environments) makes it even more tedious and error-prone.
K8s enables the creation of application-specific extensions (for Druid, for example), called “Operators”, that combine Kubernetes and application-specific knowledge into a reusable K8s extension that makes deploying complex applications simple.
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Jian Wang, Tech Lead Software Engineer, Pinterest
1:00pm – 1:45pm PT
In this talk, we will cover:
1) the motivation for switching from an HBase-backed analytics system to Druid
2) the architecture of Druid as a platform at Pinterest (Archmage, Hadoop, Kafka). This includes the query interface, Archmage, a Thrift service in front of Druid that exposes a Thrift API to company-wide clients, handles discovery of Druid broker hosts, relays requests to the broker hosts to abstract away the async HTTP connection, and provides query optimizations transparent to clients, including directly translating fixed-pattern SQL to Druid native JSON queries to save planning time. In addition, we’ll cover the production Hadoop batch and Kafka real-time ingestion pipeline setup and the reason we picked a pull-based solution instead of a push-based solution for real-time ingestion.
3) the use cases currently running in production on this platform, including their data volume, QPS, and Druid cluster setup; the unique challenges we met while onboarding; how we addressed them with extensive tuning to meet SLAs; and the lessons learned. The use cases include: partner insights, which provides partners with stats on organic pins; real-time spam detection, which detects user-login-related anomaly events and pin-related spamming events such as pin creation and repin; and migrating the backend from Presto to Druid for Ads-related experiment data analysis.
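To give a rough idea of what translating a fixed-pattern SQL statement into a Druid native JSON query can look like, here is a minimal Python sketch. It is not Archmage's implementation: the SQL pattern, the `translate` function, and the `pin_stats` and `impressions` names are all hypothetical, and the sketch assumes the summed column is a long. Only the shape of the resulting timeseries query follows Druid's native query format.

```python
import json
import re
from typing import Optional

# Hypothetical fixed pattern: a single SUM over one column, bounded by __time.
PATTERN = re.compile(
    r"^SELECT\s+SUM\((?P<column>\w+)\)\s+AS\s+(?P<alias>\w+)\s+"
    r"FROM\s+(?P<datasource>\w+)\s+"
    r"WHERE\s+__time\s*>=\s*'(?P<start>[^']+)'\s+"
    r"AND\s+__time\s*<\s*'(?P<end>[^']+)'\s*$",
    re.IGNORECASE,
)


def translate(sql: str) -> Optional[dict]:
    """Return a Druid native timeseries query, or None if the SQL does not
    match the fixed pattern (a caller could then fall back to Druid SQL)."""
    match = PATTERN.match(sql.strip())
    if match is None:
        return None
    return {
        "queryType": "timeseries",
        "dataSource": match["datasource"],
        # Druid intervals are ISO 8601 "start/end" strings.
        "intervals": [f"{match['start']}/{match['end']}"],
        "granularity": "all",
        "aggregations": [
            # Assumption: the column holds long values, hence longSum.
            {"type": "longSum", "name": match["alias"], "fieldName": match["column"]}
        ],
    }


if __name__ == "__main__":
    sql = (
        "SELECT SUM(impressions) AS total FROM pin_stats "
        "WHERE __time >= '2020-10-01' AND __time < '2020-10-08'"
    )
    print(json.dumps(translate(sql), indent=2))
```

Because the pattern is fixed, the translation is a direct template fill with no SQL planning step, which is where the time saving mentioned in the abstract would come from.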
You can find the recordings from Virtual Druid Summit III here.