Accelerate your data roadmap: from tables to code in seconds

November 6, 2023
Whelan
Developer

This post is for product people who like to ship product and do it fast. For the PMs who dig into engineering delivery estimates and seek out ways to ship faster. 

It’s also for PMs who understand the power of data. Now, I’m not talking about the cliché of making data-driven decisions, and I’m not talking about A/B testing (although, admittedly, I can be nerd-sniped by some good testing talk). I’m talking about product managers and product leaders who aspire to deliver world-class data and AI features in their products.

What are data features? Why do they matter?

Data features have the following characteristics:

  • Customer-facing – Explicitly not internal BI or notebooks
  • Low-latency – Sub-100ms page loads and interactive reloads, e.g. when the user changes filters or time grains
  • High concurrency – Since they are customer-facing, they must withstand high traffic (e.g. at least 20 qps) without reliability issues
  • Analytics queries – Include counts, sums, averages, and other aggregations over many rows of data, such as time series events
  • [Bonus] Workload diversity – We’ve found with existing customers that while analytics queries are what make these features unique, sub-10ms search and single-row lookup queries (e.g. give me all info on account 2873) over the same dataset are also required

Data features don’t necessarily need to fill an entire screen. In-product analytics often take the form of charts, but they can also be single, standalone numbers, like “total documents created,” infused throughout a product.
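To make those query shapes concrete, here is a minimal sketch of the three kinds of queries a single data feature might issue against the same dataset. The table and column names are invented for illustration.

```typescript
// Hypothetical table and column names, for illustration only.

// 1. Analytics query: aggregate over many rows (charts, rollups).
export const usageByWeek = `
  SELECT date_trunc('week', event_time) AS week, count(*) AS events
  FROM product_events
  WHERE account_id = $1 AND event_time > now() - interval '90 days'
  GROUP BY 1
  ORDER BY 1`;

// 2. Point lookup: a single row by key ("all info on account 2873").
export const accountById = `SELECT * FROM accounts WHERE account_id = $1`;

// 3. Search: string match over the same dataset.
export const accountSearch = `
  SELECT account_id, name FROM accounts WHERE name ILIKE $1 LIMIT 20`;

// A data feature typically needs all three shapes to return well under
// 100ms, at 20+ queries per second, against the same tables.
```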

Let’s take a look at a few examples and why they matter.

Tell your own ROI story: Loom & Monte Carlo

Loom provides simple rollups of user activity to tie their product to a specific business value, in this case time saved.

Similarly, Monte Carlo used Patch to leverage the sophisticated datasets in their Snowflake data warehouse to power their customer-facing Data Reliability Dashboard (a bit meta: the dashboard gives insight into their customers’ data stack reliability). The page is full of charts, tables, interactive filters and segmentation, and it loads in 100-200ms even after user interactions change the query.

Enterprise plan differentiation: Notion

Another familiar example is Notion. Users like me wanted more insights into document engagement. As Notion has moved up market, they’ve prioritized Admin reporting features that present product utilization, mapped to billing forecasts, to the central IT teams deploying Notion across the enterprise. This differentiates their Enterprise plan from lower tiers, which drives higher new-logo ACVs and expansion.

[Screenshot: Notion workspace analytics, Members tab]

Data features in the enterprise

It’s not just Bay Area startups and scaleups that are sitting on underutilized data. Many Enterprises are organizing their data strategy around cloud data warehouses like Snowflake, Databricks, and BigQuery. 

We’ve seen insurance companies underwriting claims risk with scoring algorithms in Databricks, fleet management software processing high volumes of IoT data in BigQuery, and name brand retailers organizing massive product catalogs and inventory data in Snowflake. Customer expectations are sky-high and the opportunity cost of not using this data to power personalized experiences is even higher.

The trouble is that justifying the priority of these features is tough because of the high level of effort and engineering expertise currently required to build and maintain them.

Data roadmap prioritization

Product management boils down to prioritizing customer problems and working with engineering and design toward a viable solution. The output of that often nonlinear process is a roadmap. 

Roadmaps are typically prioritized feature lists, probably (hopefully) with some tech debt pay down baked in. I’ve seen at least a dozen prioritization frameworks, but they generally resemble something like Intercom’s RICE, which scores ideas along four factors:

  1. Reach – how many customers will this affect per quarter
  2. Impact – how much will this impact the customer (e.g. how much more likely are they to do X)
  3. Confidence – how confident are you in this idea, how much evidence do you have that it will work
  4. Effort – how many person-months will this take to build

Run a typical data feature through this framework and it lands far down the list: it scores well on the first three factors, but the Effort is so much higher. We dive into this in more detail below, but data and AI features generally take a long time to build primarily because where the data lives matters a lot.
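To see how the math plays out, here is a small RICE scoring sketch. The backlog items and their numbers are invented purely for illustration.

```typescript
// RICE score = (Reach × Impact × Confidence) / Effort
interface Idea {
  name: string;
  reach: number;      // customers affected per quarter
  impact: number;     // 0.25 = minimal ... 3 = massive
  confidence: number; // 0..1
  effort: number;     // person-months
}

const rice = (i: Idea): number => (i.reach * i.impact * i.confidence) / i.effort;

// Invented backlog for illustration.
const backlog: Idea[] = [
  { name: "Onboarding checklist",      reach: 2000, impact: 1, confidence: 0.8, effort: 1 },
  { name: "Bulk export",               reach: 800,  impact: 2, confidence: 0.8, effort: 2 },
  { name: "Customer-facing analytics", reach: 1500, impact: 3, confidence: 0.8, effort: 6 },
];

backlog
  .map((i) => ({ name: i.name, score: rice(i) }))
  .sort((a, b) => b.score - a.score)
  .forEach((i) => console.log(i.name, i.score.toFixed(0)));
// Onboarding checklist 1600, Bulk export 640, Customer-facing analytics 600
// The data feature sinks to the bottom despite the best Reach × Impact,
// purely because its Effort is several times higher.
```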

Constraints of physics: quick database primer

As with most technology, data infrastructure has gotten less pricey per unit over time. Cloud providers offer cheap storage and compute. Open source orchestrators provide a way to move and manipulate data with cost controls in the engineer’s hands. Databases have gotten more efficient too, resulting in less compute or memory usage for similar tasks. 

These tools have generally gotten easier to use as well, with plenty of serverless databases and next-gen orchestrators abstracting away some of the complexity. So why do data features still take so long to build?

The issue is that teams still need to reach for at least one component from each of these categories to build their data features, and wiring them all together remains complex.

Let’s take a look at two examples to understand why: 

  1. Billing dashboard charting product usage with interactive filters, segmentation, and custom metrics.
  2. In-product banner calling out a new feature users can access by upgrading plans.

In the first example, product usage events, application database records, CRM data, and billing data are joined by a transformation pipeline in the data warehouse before a subsequent job executes the billing logic.

In the second scenario, Jira, customer support, and CRM data is once again combined in the data warehouse. Sentiment analysis over customer support tickets scores users on their likelihood to upgrade, while a keyword search on Jira and CS tickets determines which enterprise features to emphasize in the banner.
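As a rough sketch of what the warehouse side of the first example might look like, here is a simplified transformation that an orchestrator could run on a schedule. Every table and column name below is hypothetical.

```typescript
// Hypothetical warehouse transformation for the billing dashboard example.
// In practice this would live in a dbt model or a scheduled warehouse job.
export const billingUsageModel = `
  CREATE OR REPLACE TABLE analytics.billing_usage AS
  WITH daily_events AS (
    SELECT account_id, date_trunc('day', event_time) AS day, count(*) AS events
    FROM raw.product_events
    GROUP BY 1, 2
  ),
  daily_billing AS (
    SELECT account_id, date_trunc('day', billed_at) AS day,
           sum(metered_amount) AS billed_usage
    FROM raw.billing_events
    GROUP BY 1, 2
  )
  SELECT e.account_id, c.plan_tier, e.day, e.events,
         coalesce(b.billed_usage, 0) AS billed_usage
  FROM daily_events e
  JOIN raw.crm_accounts c ON c.account_id = e.account_id
  LEFT JOIN daily_billing b ON b.account_id = e.account_id AND b.day = e.day`;
```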

In both cases, the team has all the data they need in the data warehouse, e.g. Snowflake, Databricks, or BigQuery, but querying directly is too slow, expensive, and unstable for production traffic.

The data warehouse is purpose-built for ingesting and processing large volumes of data, including machine learning model execution. As we’ve explored in more technical depth previously, there are several options for getting around the cost and performance issues of building production features directly on a data warehouse, and none of them are appealing.

To summarize the most attractive of those options: the data must be piped from the data warehouse into some compute engine that can withstand a high volume of queries with a complex compute profile. Setting up a high-performance OLAP database can easily take weeks or months as the engineer creates indexes, sets up query caches, converts data types from the source, and handles a host of other ops work. Next, they must collaborate (read: tightly couple) with the product engineers to develop an API. Finally, the product engineers must write boilerplate so they can cleanly and safely code against that API.
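To give a feel for the moving parts, here is a heavily simplified sketch of that hand-rolled serving layer, using Express and a stubbed-out OLAP client. The client interface, endpoint shape, and table names are placeholders, not any particular vendor’s API.

```typescript
import express from "express";

// Stand-in for whatever OLAP database client the team adopted; the interface
// here is imaginary and stubbed out so the sketch is self-contained.
interface OlapClient {
  query(sql: string, params: unknown[]): Promise<Record<string, unknown>[]>;
}
const olap: OlapClient = {
  async query() { return []; }, // a real implementation wraps the OLAP driver
};

const app = express();

// The API the data engineer and product engineers negotiate and then maintain.
app.get("/api/accounts/:id/usage", async (req, res) => {
  const rows = await olap.query(
    `SELECT day, events FROM billing_usage WHERE account_id = ? ORDER BY day`,
    [req.params.id]
  );
  res.json(rows);
});

app.listen(3000);

// Not shown: the pipeline that keeps the OLAP copy in sync with the warehouse,
// index and cache tuning, schema and type conversions, and the typed client
// boilerplate the product engineers write to call this endpoint safely.
```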

But why couldn’t this data be in our application database in the first place? Typically an application’s primary data store is an OLTP database like Postgres, which is optimized for CRUD transactions. Ingesting and transforming data from other sources can cause write load and consistency issues, effectively overloading it and causing severe performance problems.

And even if all of this data were in the primary application database, OLTP databases store data in a row-oriented format, so aggregations and search queries over many rows will inevitably be slow. This would force the engineering team to set up an entirely different database optimized for the types of queries the app must run, very similar to the approach described above.

In summary, these systems are slow to implement and even slower to change as your feature requirements evolve. Each layer in the stack must be updated in unison to avoid downtime.

This is why data features are so often punted to next quarter, next year, or never.

Collapsing time

Benn Stancil recently distinguished between the effects of free content creation and fast content creation. He argues that making content creation, shipping, air travel, or analytics free might compel people to create more content, buy more products, take more trips, or ask more analytics questions. But if each of those were instantaneous, human behavior would change much more dramatically: we’d do those things the moment we decide we need to.

Revisiting our prioritization framework, let’s see what happens when build time falls by an order of magnitude: an XXL item becomes an S or XS, or better yet, that hard dependency on another team gets crossed out entirely. Not only do some of the higher-value, higher-complexity features move up the list; the sheer number of items that can be completed in a given quarter with the same headcount nearly doubles.

Even if we drop the confidence level to 50%, moving the Effort from 6 to 1 moves the data feature to the top of the list.
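A quick back-of-the-envelope check, reusing the same invented Reach and Impact numbers from the earlier scoring sketch:

```typescript
// Same invented Reach (1500 customers) and Impact (3) as the earlier backlog sketch.
const rice = (reach: number, impact: number, confidence: number, effort: number) =>
  (reach * impact * confidence) / effort;

console.log(rice(1500, 3, 0.8, 6)); // 600  -> bottom of the example backlog
console.log(rice(1500, 3, 0.5, 1)); // 2250 -> well above the previous top score of 1600
```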

And while “features shipped” doesn’t always equate to “value delivered”, it generally does for great teams. In fact, the relationship can be superlinear: shipping faster generates more data and delivers a greater cumulative total of value per unit time, which typically affords you more confidence and revenue to reinvest.

Focus on the what, not the where

We imagine a world where developers can reliably and performantly run complex analytical, search, and point read queries over any data, no matter where it lives. That means no waiting for streams or ETL pipelines to move data from one database to another. That means no specialized database to set up and maintain for each use case.

The moment you, a product leader, identify an opportunity to use your business’ valuable data to directly power customer experiences, the engineering team can immediately start building the app and skip the low-level data engineering.

With Patch, you or an engineer connect your data warehouse, select tables, and the platform automatically generates live APIs and SDKs you can use to query data instantly. The query interfaces support aggregations, single-row lookups, and string search at ultra-low latency without any infrastructure work.
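As a purely illustrative sketch (this is not Patch’s actual SDK; the package, client, and table names below are all invented), querying an auto-generated data package might feel something like this:

```typescript
// Purely illustrative: "data-package-sdk" and every method and table name
// below are invented, not Patch's actual generated SDK.
import { createClient } from "data-package-sdk";

const data = createClient({ apiKey: process.env.DATA_API_KEY! });

async function renderAccountDashboard(accountId: string) {
  // Aggregation for a chart: weekly event counts over the last 90 days
  const usage = await data.productEvents.aggregate({
    where: { accountId, eventTime: { gte: "-90d" } },
    groupBy: "week",
    metrics: { events: "count" },
  });

  // Single-row lookup: everything about one account
  const account = await data.accounts.get(accountId);

  // String search over the same dataset
  const matches = await data.accounts.search("acme", { limit: 20 });

  return { usage, account, matches };
}
```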

Our mission at Patch is to enable product engineers to build amazing data and AI features in the time it takes them (or a copilot) to write application code. We’ve built planet scale data infrastructure for product engineers, so developers can focus on business logic and we take care of the rest.

What would your data roadmap look like if killer data features took 2 weeks instead of 1-2 quarters? What will your next hack week project look like? (It’s so easy, even we PMs can do it). 

Grab your engineering peers and let’s find a time to talk about how Patch’s data packages can accelerate your data feature roadmap.

Start building with data packages.