May 25, 2021
The Future of Analytics
During my past four years as a Partner at Andreessen Horowitz, and then as VP of Product and Engineering at Imply, I have observed an interesting shift in the world of analytics. Analytics has permeated a rapidly growing number of job roles, industry verticals, and market segments, and it has ultimately changed how people work. The number of software applications that include some kind of analytics has exploded, and most SaaS applications that serve the enterprise now have some kind of analytics capability to help users make decisions. The reason for this is simple: traditional analytics tools, usually in the form of Business Intelligence tools (such as Tableau or Looker), are focused on professional analysts, and feature mainly reporting and dashboarding workflows. By contrast, the new “citizen analysts” need domain-specific analytics with more diagnostic, exploratory, iterative workflows. This post explores the differences between the workflows of professional and citizen analysts, and explains how Imply is enabling citizen analysts to answer their own questions without relying on professional analysts.
The traditional BI workflow starts with a strategic question. Such a question is not especially time-sensitive (an answer within days or weeks is acceptable), and it is fairly complex to answer. A typical question would be, “What market segments in Asia are less penetrated by our various products, and how is our competition doing there?” This question has to be translated from human-speak to analytics-speak (i.e., SQL) and then turned into a dashboard. This is the job of the professional analyst. Because a human is in the loop, getting an answer can take anywhere from days to months. Therefore, only questions that can withstand this delay can be answered in this way. We like to call this kind of analytics “cold analytics,” since such questions don’t need an immediate answer, and the data involved is usually hours, days, or even months old.
But the new world of citizen analysts is different. Here, individual marketers, salespeople, network engineers, customer support specialists, merchants, and many others need to answer many questions every day. These are questions such as “What does the cell network coverage look like right now in neighborhood A vs. B, and what are the differences in cell tower settings? If I change this one setting here, what happens? How does this compare to last month?” or “Of the millions of SKUs I’m selling in my stores, which ones are we out of in which stores, and which should I order from my supplier, and where and what should I replace it with if it’s not available?” or “Which segment of the population is responding better to advertising campaign A vs. campaign B, how is this different from the past three months, and what happens if I tweak my campaign in some way?” These questions demand an immediate answer, since they are mainly about the present and how the immediate present relates to the past. We like to call this kind of analytics “hot analytics,” because the queries need to be answered quickly and for a large number of users. The data is usually recent, but the questions often draw on longer-term historical data as well.
At Imply, we see hot analytics in our customers’ use cases: Twitter, Hawk, and Ippen Digital use Imply to power their advertising platforms, Lucidity uses Imply to detect and analyze potential fraud in cryptocurrency transactions, Charter uses Imply to provide a unified customer experience platform, Ibotta uses Imply to combat e-commerce fraud, and there are many more. We also see hot analytics among open-source Apache Druid users, for example performance monitoring at Netflix, advertising analytics at Pinterest, anti-money laundering at DBS, and network performance monitoring at BT.
What’s the difference between cold and hot analytics? Why do people come to Imply and Apache Druid instead of using BI and data warehouses? It’s about query latency, and how latency impacts the user experience. Ultimately, it’s about which workflows are possible and which workflows aren’t. In this 4-minute video, I show a simple example of what I mean, where I execute a hot analytics workflow using Tableau on top of Snowflake; then I compare it with Imply. The Tableau workflow is so much slower that rapid iteration is impossible. Hot analytics workflows are difficult to execute in these more traditional tools because they are not optimized for fast iterations. The data warehouses these BI tools rely on were designed to serve low volumes of very powerful queries at low cost, and to be used by professional analysts who know how to write SQL. But citizen analysts don’t know what they’re looking for before they get started! They are mostly exploring data and trying to see if there are interesting trends to drill into. They also might not know how to write SQL! The workflows of a citizen analyst involve repeated trial and error, comparisons, filters, and drill-downs. When you have hundreds or thousands of users executing these kinds of workflows concurrently, the traditional cold analytics tools become too expensive and too slow.
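To make the shape of such a workflow concrete, here is a minimal sketch in plain Python. It is only an illustration of the filter, compare, and drill-down loop described above, not how Imply or Druid executes queries; the `events` data, its field names, and the `drill_down` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical ad events; every field name and value here is made up.
events = [
    {"campaign": "A", "segment": "18-24", "region": "north", "clicks": 12},
    {"campaign": "A", "segment": "18-24", "region": "south", "clicks": 3},
    {"campaign": "A", "segment": "25-34", "region": "north", "clicks": 5},
    {"campaign": "B", "segment": "18-24", "region": "north", "clicks": 4},
    {"campaign": "B", "segment": "25-34", "region": "south", "clicks": 9},
]

def drill_down(rows, group_by, **filters):
    """Sum clicks per value of `group_by` among rows matching every filter."""
    totals = defaultdict(int)
    for row in rows:
        if all(row[field] == value for field, value in filters.items()):
            totals[row[group_by]] += row["clicks"]
    return dict(totals)

# Step 1: compare campaigns at the top level.
by_campaign = drill_down(events, "campaign")

# Step 2: campaign A looks interesting, so split it by segment.
by_segment = drill_down(events, "segment", campaign="A")

# Step 3: drill into one segment of campaign A by region.
by_region = drill_down(events, "region", campaign="A", segment="18-24")
```

Each step is a new query whose shape depends on the previous answer, which is why per-query latency, multiplied across many steps and many users, decides whether the workflow is practical.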
What we also see among our customers with the most advanced use cases, though, is a blend of hot and cold analytics, where citizen analysts want to start with hot analytics workflows and migrate to cold analytics workflows, or vice versa. For example, since the cost of making all data available for hot analytics is high, they want to move older, less frequently accessed data to colder storage systems (such as a data lake or data warehouse). Yet they still want to be able to query data seamlessly, regardless of which system the data resides in. Or they might use hot analytics to diagnose an issue and then use cold analytics to execute more powerful queries (such as fact-to-fact joins) to troubleshoot further. In other cases, users want the opposite: they want to move data from cold to hot on demand for an investigation, and then move the data back to cold storage afterward. This is a core area of Imply’s future roadmap, and we’re working on a number of features to enable such use cases for our customers. Some of the key capabilities we intend to roll out this year include a cold tier, which allows customers to separate storage from compute and to scale up compute on demand, leading to vastly lower compute costs for cold data; and SQL shuffle joins, which allow users to execute fact-to-fact table joins against large data sets. Some of these data sets might be in the cold tier, and users can then load them into the hot tier for investigation.
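For readers unfamiliar with the term, the general idea behind a shuffle join can be sketched in a few lines of Python: both tables are repartitioned by a hash of the join key so that matching rows land in the same partition, and each partition is then joined independently. This is a generic, single-process illustration of the technique, not Imply’s implementation; the function names and the `impressions` and `purchases` tables are hypothetical.

```python
from collections import defaultdict

def shuffle_partition(rows, key, num_partitions):
    """Route each row to a partition chosen by hashing its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

def hash_join(left, right, key):
    """Join two co-located partitions using an in-memory hash table."""
    index = defaultdict(list)
    for left_row in left:
        index[left_row[key]].append(left_row)
    joined = []
    for right_row in right:
        for left_row in index.get(right_row[key], []):
            joined.append({**left_row, **right_row})
    return joined

def shuffle_join(left, right, key, num_partitions=4):
    """Inner-join two tables by shuffling both sides on the join key."""
    left_parts = shuffle_partition(left, key, num_partitions)
    right_parts = shuffle_partition(right, key, num_partitions)
    result = []
    # In a distributed system, each pair of partitions would be joined
    # on a different worker; here they are processed one after another.
    for left_part, right_part in zip(left_parts, right_parts):
        result.extend(hash_join(left_part, right_part, key))
    return result

# Two hypothetical fact tables that share a user_id column.
impressions = [
    {"user_id": 1, "ad": "A"},
    {"user_id": 2, "ad": "B"},
    {"user_id": 1, "ad": "C"},
]
purchases = [
    {"user_id": 1, "amount": 30},
    {"user_id": 3, "amount": 12},
]

joined = shuffle_join(impressions, purchases, "user_id")
```

Because neither table has to fit on a single worker, this approach suits fact-to-fact joins where both sides are large, at the cost of moving data between workers during the shuffle.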
The new world of hot analytics is huge. The ratio of professional analysts to citizen analysts is on the order of 1:100 or even 1:1000, which means there are far more users who need hot analytics than users who need cold analytics! That is why Imply’s impact can be tremendous. I’m very excited to see our roadmap become reality, and to see Imply helping the next generation of decision makers.