May 11, 2020
Continuing the Virtual Druid Summit Conversation with Athena Health
Karthik Urs, Athena Health
When an engaged technical audience asks great questions, it’s easy to run out of time during Q&A. And that’s exactly what happened at Virtual Druid Summit! Because our speakers weren’t able to address all of your questions during the live sessions, we’re following up with the answers you deserve in a series of blog posts.
Karthik Urs, Lead Member of Technical Staff, takes on your remaining questions as a follow-up to “Automating CI/CD for Druid Clusters at Athena Health”.
1. I’m wondering if you can talk about how you actually “test” the Druid cluster with your production queries and data?
There are two aspects here. The first, testing and benchmarking the queries, falls under performance benchmarking and tuning. We have a test bed with multiple frameworks, the prominent ones being JMeter and Artillery. With these, we focus on concurrency testing and query response times at peak load. A small disclaimer here: we haven’t started playing with cache yet, so our testing on the benchmarking front is nascent.
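To make "concurrency testing and query response times" concrete, here is a minimal Python sketch of what such a test measures. It is not our JMeter or Artillery setup: `fake_query`, `measure_latencies`, and `summarize` are hypothetical names, and the stand-in query simply sleeps rather than calling a Druid broker.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def timed_call(run_query):
    """Run one query and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    run_query()
    return time.perf_counter() - start


def measure_latencies(run_query, concurrency, total_requests):
    """Issue total_requests calls with at most `concurrency` in flight at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed_call, run_query) for _ in range(total_requests)]
        return [f.result() for f in futures]


def summarize(latencies):
    """Return the median, 95th percentile, and maximum latency."""
    # n=100 yields 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = quantiles(latencies, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "max": max(latencies)}


def fake_query():
    """Hypothetical stand-in for a real query; replace with an HTTP call."""
    time.sleep(0.01)


if __name__ == "__main__":
    print(summarize(measure_latencies(fake_query, concurrency=8, total_requests=40)))
```

Swapping `fake_query` for a function that posts a production query to a broker would give a rough view of latency under a chosen level of concurrency.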
The second part, which I assume you also wanted to know about, is the validation of data. Our source is Snowflake. We currently validate the data by running aggregate queries on both Snowflake and Druid and comparing the results. We also plan to test the cardinality of columns, etc. Snowflake has a feature called Time Travel, which is pretty cool. With it, we can run queries on Snowflake against the state of the data at the time the ETL was running.
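The comparison step can be sketched roughly as follows. This is an illustration under assumptions, not our actual validation code: the table `events`, the columns `category` and `amount`, and the timestamp are placeholders, and `compare_aggregates` is a hypothetical helper that expects each side's results already loaded into a dict keyed by group.

```python
import math

# Hypothetical queries; table, column names, and timestamp are placeholders.
# The Snowflake query uses the Time Travel AT clause to read a past state.
SNOWFLAKE_SQL = """
SELECT category, SUM(amount) AS total
FROM events AT(TIMESTAMP => '2020-05-01 00:00:00'::timestamp_tz)
GROUP BY category
"""
DRUID_SQL = """
SELECT category, SUM(amount) AS total
FROM events
GROUP BY category
"""


def compare_aggregates(source, druid, rel_tol=1e-9):
    """Return (key, source_value, druid_value) for every group that differs.

    A group missing on one side is reported with None for that side.
    """
    mismatches = []
    for key in sorted(set(source) | set(druid)):
        s, d = source.get(key), druid.get(key)
        if s is None or d is None or not math.isclose(s, d, rel_tol=rel_tol):
            mismatches.append((key, s, d))
    return mismatches


if __name__ == "__main__":
    snowflake_totals = {"a": 10.0, "b": 20.0}
    druid_totals = {"a": 10.0, "b": 20.5, "c": 3.0}
    print(compare_aggregates(snowflake_totals, druid_totals))
```

A relative tolerance is used because floating-point sums computed by two different engines may not match bit for bit.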
2. Cool system! Have you thought about automating no-downtime rolling upgrades?
Not yet. Though we have the general idea, we haven’t started automating it.
3. Why such large ZooKeeper nodes? Do they have heavy mem/cpu usage in your cluster?
The ZooKeeper nodes are large instances with 2 vCPUs each. We are currently doing a complete backfill of data, and we have noticed ZooKeeper getting overwhelmed at peak ingestion/ETL time.
4. Do you do auto scaling of your broker or historical nodes based on queries?
That is the plan going forward. We recently built our LMM stack on ELK and enabled log emitters on each node. The next step is to scale brokers up based on CPU utilization, number of threads (mainly CLOSE_WAIT; we don’t yet understand how threads behave in the Druid environment), and requests per minute.
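A scale-up rule over those three signals might look something like the sketch below. Everything here is an assumption for illustration: the `BrokerMetrics` and `Thresholds` types, the field names, and the threshold values are hypothetical, and nothing like this is built yet.

```python
from dataclasses import dataclass


@dataclass
class BrokerMetrics:
    cpu_utilization: float      # fraction between 0.0 and 1.0
    close_wait_count: int       # count observed in CLOSE_WAIT
    requests_per_minute: float


@dataclass
class Thresholds:
    # Placeholder limits; real values would come from benchmarking.
    cpu_utilization: float = 0.75
    close_wait_count: int = 200
    requests_per_minute: float = 1000.0


def scale_up_reasons(metrics, thresholds):
    """Return the names of the signals that exceed their thresholds."""
    reasons = []
    if metrics.cpu_utilization > thresholds.cpu_utilization:
        reasons.append("cpu_utilization")
    if metrics.close_wait_count > thresholds.close_wait_count:
        reasons.append("close_wait_count")
    if metrics.requests_per_minute > thresholds.requests_per_minute:
        reasons.append("requests_per_minute")
    return reasons


if __name__ == "__main__":
    current = BrokerMetrics(cpu_utilization=0.82, close_wait_count=40,
                            requests_per_minute=1500.0)
    reasons = scale_up_reasons(current, Thresholds())
    print("scale up" if reasons else "hold", reasons)
```

Returning the list of triggering signals, rather than a bare boolean, makes it easier to log why a scaling action was taken.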
5. How is the state maintained for historical? Specifically, how do you handle if a historical node goes down in regards to state?
We are not explicitly handling it (yet). Our understanding is that, as long as the coordinator is alive and well, it manages the state of the historicals. We haven’t put any safeguards in place yet. We look forward to postmortems from other folks who may have encountered problems in this specific area.
6. How did you handle downtime in production when data transfers from deep storage to historicals take time? For example, when we want to change our instance type. Did you add new servers first and remove the existing ones?
We are not in production yet, but are getting close. We indeed want to have rolling updates on historicals.