My previous post covered my first foray into user-facing analytics products at LinkedIn. In this post, I want to talk about the realtime online analytical processing (OLAP) use cases I found after leaving LinkedIn in 2015 to join WePay. Each new use case expanded my understanding of the space.
WePay is (was?) a software as a service (SaaS) payments provider. Applications could integrate WePay into their software to process payments. When I joined the company, several customers were growing rapidly, which strained our monolithic database. We began exploring how to spread the load. The two primary query patterns were transferring money and detecting fraudulent transactions. Both of these had realtime OLAP characteristics, but fraud detection seemed easier to start with.
Our fraud system had two main parts: the model and the data. In an offline environment, the model was periodically retrained against data warehouse data. Afterward, the model was pushed to production. The model serving layer in production queried the data that the model was trained on, except it read from our OLTP database (MySQL) instead of our data warehouse.
Model serving needs a lot of data. Both the account and the specific transaction have metadata: logins, transactions, balance, and so on. Each of these data points is aggregated across different dimensions: time (year, month, day, hour, minute), geography, IP block, and so on. Thus, the queries looked like, “logins in the past hour”, “non-US payment volume in the last week”, “credit cards added to the account in the last month”, and so on.
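To make these queries concrete, here’s a minimal sketch of what one such lookup against the OLTP MySQL database might look like. The schema, connection details, and account ID are all hypothetical, not WePay’s actual setup:

```python
# Hypothetical example of one of the aggregation queries described above,
# run against the OLTP MySQL database. Table and column names are
# illustrative, not WePay's actual schema.
import mysql.connector

conn = mysql.connector.connect(
    host="mysql.internal", database="payments",
    user="fraud_reader", password="..."
)
cursor = conn.cursor()

account_id = 42  # the account being scored

# "Logins in the past hour" for this account.
cursor.execute(
    """
    SELECT COUNT(*)
    FROM logins
    WHERE account_id = %s
      AND created_at >= NOW() - INTERVAL 1 HOUR
    """,
    (account_id,),
)
(login_count,) = cursor.fetchone()
```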
The fraud data (referred to as “fraud signals”) were essentially SELECT COUNT(*)/SUM(*) FROM … GROUP BY. This felt familiar to me from my WVMP days. But the model serving layer had a number of important characteristics that were different from WVMP’s:
Users were not pivoting on the data in realtime. We knew each dimension and aggregation we needed beforehand.
Query responses needed to be very fast (single-digit milliseconds).
Data needed to be very fresh (less than a minute, ideally less than 10 seconds).
Unlike the WVMP use case, the query pattern from the model was predictable. It asked for the same aggregations for every transaction. But unlike a human waiting on a dashboard to refresh, the fraud models needed to be very fast. We had a tight bound on the number of milliseconds that a transaction would wait for a fraud response. If the window passed, we had to decide whether to reject the transaction outright or risk allowing fraudulent transactions through. The signal data also needed to be very fresh. Fraudsters act quickly and often use scripts. We needed activity data to be available to the model on the order of seconds so we could detect fraudulent behavior as it was occurring.
We did a lot of digging to try to avoid building our own system. Around this time, Fangjin Yang and his co-founders started Imply. We discussed our requirements with them. They told us Druid might work, but it would require a lot of machines to get the latency we wanted. Even then, it might not be fast enough. Druid was built for dynamic queries and pivots—more like the WVMP use case in part 1.
Ultimately, we went back to what I’d learned at LinkedIn. We built a batch and stream processing pipeline on top of Apache Kafka, Google Cloud BigQuery, and Google Cloud Dataflow. The pipeline aggregated data and pushed the results into a Google Cloud Bigtable instance. This largely mirrored the Hadoop/Voldemort pipeline we built at LinkedIn.
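A minimal sketch of what the streaming leg of such a pipeline looks like in the Apache Beam Python SDK (the programming model Dataflow executes) conveys the idea. This is illustrative only, not our production code; the topic, project, table, and key names are all hypothetical:

```python
# Illustrative Beam/Dataflow streaming job: count login events per
# account in one-minute windows and upsert the counts into Bigtable.
# Topic, project, table, and column-family names are hypothetical.
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka  # cross-language; needs Java
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window
from google.cloud.bigtable import row as bt_row


def to_bigtable_row(kv):
    account_id, count = kv
    # Row key encodes the entity and the signal: "<account>#logins#1m".
    r = bt_row.DirectRow(row_key=f"{account_id}#logins#1m".encode())
    r.set_cell("signals", b"count", str(count).encode())
    return r


opts = PipelineOptions(streaming=True)
with beam.Pipeline(options=opts) as p:
    (
        p
        | ReadFromKafka(
            consumer_config={"bootstrap.servers": "kafka:9092"},
            topics=["login-events"],
        )
        # Assume each Kafka record is keyed by account id.
        | beam.Map(lambda kv: (kv[0].decode("utf-8"), 1))
        | beam.WindowInto(window.FixedWindows(60))  # one-minute windows
        | beam.CombinePerKey(sum)
        | beam.Map(to_bigtable_row)
        | WriteToBigTable(
            project_id="example-project",
            instance_id="fraud-signals",
            table_id="signals",
        )
    )
```

The important property is that all the aggregation happens at write time, so the serving path never computes anything under latency pressure.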
My work with our fraud signals team helped me recognize that different realtime OLAP use cases have different requirements. Some need flexibility while others need low-latency responses. I started calling the former “human interaction” and the latter “machine interaction”. I associated human interaction OLAP with realtime slice-and-dice dashboards. Response times could take a few hundred milliseconds while the chart updated. Machine interaction, by contrast, had pre-defined queries but required much faster response times.
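On the machine-interaction side, a “query” collapses into a point read of a precomputed aggregate. A sketch of what such a lookup might look like with the Bigtable Python client, reusing the hypothetical row-key layout from the pipeline sketch above:

```python
# Illustrative point read of a precomputed fraud signal from Bigtable.
# Project, instance, table, and row-key layout are hypothetical.
from google.cloud import bigtable

client = bigtable.Client(project="example-project")
table = client.instance("fraud-signals").table("signals")

# Single-digit-millisecond lookup: no aggregation happens at query time.
row = table.read_row(b"42#logins#1m")  # account 42, logins, 1-minute window
login_count = int(row.cells["signals"][b"count"][0].value) if row else 0
```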
I started to see realtime OLAP everywhere. We were running the Elasticsearch, Logstash, and Kibana (ELK) stack at WePay. The Kibana dashboards looked an awful lot like business analytics chart’n’graph dashboards to me. The data was just machine instances, CPUs, and disks. It seemed odd to power these graphs using a search engine; systems like InfluxDB and Prometheus made more sense. At LinkedIn, we used rrdtool for metric dashboards. And the queries over the logs that were funneled into Elasticsearch looked a lot like faceted search—more GROUP BY and WHERE clauses.
Each of these use cases has a different but overlapping set of requirements: query latency, data freshness, data correctness, and query throughput all vary. StarTree’s﹩ blog post, How To Pick A Real-Time OLAP Platform, has a fairly comprehensive breakdown:
This is how I see the realtime OLAP space now. Pinot and Druid started with user-facing analytics use cases. InfluxDB and Prometheus are coming from metrics. Elasticsearch started with log analytics (at least, for its realtime OLAP use cases). Traditional data warehouses have dominated the visualization and dashboarding categories. Even stream processing systems like Materialize and Flink’s recent materialized tables satisfy many of the use cases in the image above. Indeed, one of Materialize’s marquee customers, Ramp, is using it for fraud detection.
These systems are converging for many use cases. Data warehouses like BigQuery now have realtime ingest, Pinot now supports JOINs like a data warehouse, and so on. Data warehouses, realtime OLAP systems, time series databases (TSDBs), and search engines might all converge.
This is the context I carry with me as I look at ClickHouse. In my next post, I’ll try to understand ClickHouse’s history, which use cases it’s good for, why it’s so popular, and more. Stay tuned.
Book
Support this newsletter by purchasing The Missing README: A Guide for the New Software Engineer for yourself or gifting it to someone.
Disclaimer
I occasionally invest in infrastructure startups. Companies that I’ve invested in are marked with a ﹩ in this newsletter. See my LinkedIn profile for a complete list.