Aug 22, 2019
Apache Druid helps Zeotap Master Multi-Channel Attribution at Scale
Chaitanya Bendre, Zeotap
Below is a transcript of a short interview we conducted with Chaitanya Bendre, Lead Data Engineer at Zeotap, where we discussed their use of Druid to help address the difficult problems of identity resolution and multi-channel attribution.
Can you introduce yourself and your company?
My name is Chaitanya Bendre. I work as a Lead Engineer at Zeotap, a Global Identity and Data Platform that enables brands to better understand their customers and drive economic returns on analytics and marketing across acquisition, churn, upsell, and cross-sell.
We aggregate data from a variety of sources at a very large scale, which also includes enterprise data like telecom data, with a strict focus on security and privacy for the data we store. We have different products that can leverage this aggregated data for various use cases.
What is your use case with Druid today? How were you doing things before Druid?
The two products which leverage Druid are called Connect and Target. Target is pretty straightforward. An ad agency can come to Zeotap and, for example, say they want to target a group of people between 18 and 24 who have visited Facebook at least once. Based on this request, we generate the audience and then the customer can use that audience to show ads on any platform.
With Connect, we use Druid to show insights. A customer can bring their own data, say offline IDs like mobile numbers and email addresses, and map these to online identities to build an audience for advertising.
A customer can enrich their first-party data by attaching it to the third-party data that we aggregate, and then use that to make decisions and act on them. It works as a tool with which you can slice and dice your own data and see insights.
Customers upload their identifiers, and we allow them to slice and dice against the attributes we store so that they can understand the audience. This includes third-party data such as recently used mobile apps, average data usage on any telecom network, or frequently visited locations.
One of the challenges that we solve as we aggregate this data from a variety of sources is multi-channel attribution and identity resolution. Is it the same user or not? The more linkages we get from different sources, the higher the confidence with which we can attribute those linkages to the same user. Right now, it’s all deterministic, but in the future we can infer the linkages using machine learning.
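As a rough illustration of deterministic linkage, and not Zeotap's actual implementation, the idea can be sketched with a union-find structure: every observed link between two identifiers merges them into one user, and the number of times a pair has been observed serves as a simple confidence signal. The class name `IdentityGraph`, its methods, and the identifier formats are all hypothetical.

```python
class IdentityGraph:
    """Toy deterministic identity resolution using union-find.

    All names here are illustrative assumptions, not Zeotap's design.
    """

    def __init__(self):
        self.parent = {}       # identifier -> parent identifier
        self.link_count = {}   # frozenset({a, b}) -> times the link was observed

    def _find(self, x):
        # Register unseen identifiers as their own root.
        self.parent.setdefault(x, x)
        # Walk to the root, halving the path as we go.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def link(self, a, b):
        """Record one observed linkage between two identifiers."""
        pair = frozenset((a, b))
        self.link_count[pair] = self.link_count.get(pair, 0) + 1
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def same_user(self, a, b):
        """True if the two identifiers resolve to the same user."""
        return self._find(a) == self._find(b)

    def observations(self, a, b):
        """How many times this exact pair was linked (a crude confidence)."""
        return self.link_count.get(frozenset((a, b)), 0)


graph = IdentityGraph()
graph.link("email:a@example.com", "cookie:123")   # e.g. from a login event
graph.link("cookie:123", "maid:abc")              # e.g. from an app SDK
graph.link("email:a@example.com", "cookie:123")   # same link, second source
print(graph.same_user("email:a@example.com", "maid:abc"))
```

Linkage is transitive here, so an email address and a mobile advertising ID end up on the same user even though they were never observed together.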
Druid solves a very important problem for us. We have data coming in in different formats and in different modes. We have real-time data streaming from our own tracking pixels, which generate the user impressions, and data from the telecom network. We also have a batch mode where a customer can upload their set of identifiers to get insights.
We want to join this batch data with our own data, show various charts, and allow analysis. This kind of reporting product is what Druid powers in our stack, and the product is called Insights. It’s available for both of our main products, Connect and Target.
Did you evaluate any other technologies for your use case? Why did you select Druid over the other technologies?
Before Druid we had a traditional ETL pipeline, which used to get counts and publish them to MySQL. Compared to that, Druid has provided tremendous improvements in terms of both performance and stability. We have not seen any downtime in Druid; it has given us a very good experience.
We evaluated a lot of other analytics databases, such as AWS Redshift and Google BigQuery, but for both of them, as the data set grew, the latency was not up to the mark. Druid's pre-aggregation feature really matters since it operates very quickly and greatly reduces the segment sizes.
From our experience with Druid, the most useful features we found are pre-aggregation (roll-up) and HyperLogLog for COUNT DISTINCT operations. These have been quite useful, and they are something we are looking at for other projects.
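To show why HyperLogLog suits COUNT DISTINCT at scale, here is a minimal sketch of the algorithm in Python. It is not Druid's implementation; the class name, the choice of SHA-1 as the hash, and the default precision `p=12` are assumptions made for illustration. The key properties are that the sketch uses a fixed number of small registers regardless of how many users it has seen, and that two sketches can be merged, which is what lets distinct counts survive pre-aggregation.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative, not Druid's code)."""

    def __init__(self, p=12):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16")
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # Take a 64-bit hash of the value.
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # The first p bits pick a register.
        index = h >> (64 - self.p)
        # The remaining bits give the rank: leading zeros plus one.
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Union with another sketch of the same precision."""
        if other.p != self.p:
            raise ValueError("precision mismatch")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        """Estimated number of distinct values added."""
        estimate = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)
        return estimate


sketch = HyperLogLog()
for i in range(50_000):
    sketch.add(f"user-{i}")
    sketch.add(f"user-{i}")  # duplicates do not change the estimate
print(round(sketch.count()))
```

With 4,096 registers the typical relative error is on the order of 1 to 2 percent, in exchange for a few kilobytes of state per sketch.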
What does your overall architecture with Druid look like today? How do you ingest data? Where do you run Druid? What do you use as your front-end analytics?
We are currently hosting and self-managing Druid in AWS in three regions: the US, Europe, and India. The batch data is ingested from S3 right now. The streaming data is ingested via the Kafka indexing service provided natively in Druid.
The raw data in S3 is about 200 GB per day compressed, and once it gets aggregated in Druid it is about 30 to 40 GB per day. For the streaming data, we aggregate over 30-minute or 50-minute periods, and for our batch data we aggregate daily.
Our analytics UI is a homegrown system, though we plan to explore Imply Pivot.
Our monitoring system for this data is based on Grafana and Graphite, though we will consider Imply Clarity for this functionality.
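The size reduction described above comes from roll-up: events that share the same time bucket and dimension values collapse into a single row of aggregated metrics. The following sketch mimics that behaviour in plain Python. The event fields (`ts`, `country`, `app`, `bytes`) and the function names are hypothetical, and Druid performs this step internally at ingestion time rather than through code like this.

```python
from collections import defaultdict
from datetime import datetime, timezone


def truncate(ts, minutes):
    """Floor a timezone-aware timestamp to the start of its bucket."""
    epoch = int(ts.timestamp())
    bucket_start = epoch - epoch % (minutes * 60)
    return datetime.fromtimestamp(bucket_start, tz=timezone.utc)


def rollup(events, minutes=30):
    """Collapse events with the same bucket and dimensions into one row."""
    rows = defaultdict(lambda: {"count": 0, "bytes": 0})
    for event in events:
        key = (truncate(event["ts"], minutes), event["country"], event["app"])
        rows[key]["count"] += 1
        rows[key]["bytes"] += event["bytes"]
    return dict(rows)


utc = timezone.utc
events = [
    {"ts": datetime(2019, 8, 22, 10, 5, tzinfo=utc), "country": "DE", "app": "maps", "bytes": 100},
    {"ts": datetime(2019, 8, 22, 10, 25, tzinfo=utc), "country": "DE", "app": "maps", "bytes": 200},
    {"ts": datetime(2019, 8, 22, 10, 40, tzinfo=utc), "country": "DE", "app": "maps", "bytes": 50},
]
for key, metrics in rollup(events).items():
    print(key, metrics)
```

A coarser bucket merges more events per row, which shrinks storage further but also limits the finest time resolution a query can ask for.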
What are your plans for the future with Druid?
Druid is important for a new feature called Estimator, which provides real-time estimation of audience size. If a customer wants to target an age group and location, we will approximate the size of this audience before running that query on the main data warehouse. Estimating size is like a pre-aggregation counting problem, so we’ll use Druid for that.
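One way such an estimate could be requested is through Druid SQL, whose `APPROX_COUNT_DISTINCT` function returns an approximate distinct count. The sketch below only builds the JSON payload that would be posted to Druid's SQL HTTP endpoint; it does not contact a cluster. The datasource name `audience_profiles`, the column names, and the helper function are hypothetical, and this is not a description of how Estimator is actually built.

```python
import json
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _quoted(name):
    """Quote an identifier after checking it is a plain name."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return f'"{name}"'


def build_estimate_query(datasource, filters, id_column="user_id"):
    """Build a Druid SQL request body for an approximate audience size.

    Filter values are passed as dynamic parameters, not string-interpolated.
    """
    sql = (
        f"SELECT APPROX_COUNT_DISTINCT({_quoted(id_column)}) AS audience_size "
        f"FROM {_quoted(datasource)}"
    )
    if filters:
        conditions = " AND ".join(f"{_quoted(col)} = ?" for col in filters)
        sql += f" WHERE {conditions}"
    parameters = [{"type": "VARCHAR", "value": value} for value in filters.values()]
    return {"query": sql, "parameters": parameters}


payload = build_estimate_query(
    "audience_profiles",                       # hypothetical datasource
    {"age_group": "18-24", "country": "DE"},   # hypothetical dimensions
)
print(json.dumps(payload, indent=2))
```

Because the answer comes from pre-aggregated sketches, a query of this shape can return an approximate size quickly without touching the main data warehouse.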
The other place that we are looking to use pre-aggregation is revenue attribution to our data partners. We are in the middle of this whole ecosystem where we aggregate data from our data partners, and they are paid based on its use by our customers. We have to attribute those revenues to these data partners, and that is currently a really large query on our current data warehouse. We are trying to solve this problem using Druid. If we can pre-aggregate how many profiles each data partner has given us and store that in Druid, then it can give us an estimate very quickly.