Why you should not build apps directly on the data warehouse
The first time I asked an application developer how we could use the data from the data warehouse in our web app, I expected to get pushback like “it’s too slow to query from prod” or “the compute-based pricing will be too costly for constant traffic”. But the actual question I got?
What the heck is Snowflake?
Having spent countless hours doing analysis, training ML models, and querying data in our data stack, I was surprised that the real technical innovators on our team weren’t even aware the system existed, let alone dreamed about the potential that lurked inside.
My roadmap was full of features that needed this data:
- Rich, interactive billing dashboard using data joined from Segment (product usage), Recurly (billing), Salesforce, and our own product databases
- Personalized banners based on ML sentiment analysis of NPS surveys
- Notifications for feature launches and bug fixes determined by a keyword search of Zendesk tickets and CSM call logs
Unfortunately, the distance between my ideas and an implementation that met our production standards was quite far, both organizationally and technically. This frustration ultimately inspired a much deeper conversation with my now co-founder, Peter.
This post discusses how you could and should build features using data from the data warehouse.
Why Build With Warehouse Data
Patch is an acceleration layer for popular data stores like Snowflake, BigQuery, and Databricks. We built it because we needed a way to integrate data from these platforms into production features and services, performantly and reliably. But why did we want to do that?
Customer demand for data-powered features
As discussed in our previous post, data features allow you to tell your own ROI story, differentiate enterprise pricing tiers, and deliver highly personalized user experiences that drive adoption and expansion.
The benefits extend beyond SaaS. As far back as 2014, LinkedIn was at the forefront of releasing analytics-driven features to create the future revenue drivers of their business.
Who’s Viewed Your Profile.
Features like Who’s Viewed Your Profile support core pillars of LinkedIn’s modern revenue strategy, providing users with the critical signals to optimize their activity and spend on the LinkedIn platform.
Data Stack as a Cost Center
If you have a data warehouse, you’re probably losing money. At best, you’re leaving money on the table. Today, most companies use their data lakes and warehouses for archiving data that their data teams may use in the future for analysis. Therefore, the value of these platforms is tied to the value of the insights generated by these teams, which is notoriously uncertain and challenging to quantify.
In this macro environment, CFOs are inspecting every line item in the budget. Data warehouses, and the data stack more broadly, are one of the larger expenses and are currently seen as a cost center.
The most direct way to change this perception, and move your CFO’s labeling from cost center to revenue driver, is to leverage your modeled data in revenue-generating features or services.
What’s Unique About the Warehouse?
There’s no single technical approach to building data features today, so let’s look at the advantages of using data from the warehouse specifically.
The power of joins
Data warehouses exist to join and process data from disparate sources. Data from the application database, CRM, customer support, billing, product usage, and much more lands here to be transformed by basic SQL and advanced Python models into unique, derived datasets.
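To make that concrete, here’s a toy sketch of the kind of cross-source join a warehouse model performs, using SQLite in memory and hypothetical table and column names standing in for billing and product-usage sources:

```python
import sqlite3

# Toy cross-source join: hypothetical tables standing in for Segment-style
# product usage and Recurly-style billing data landed in the warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_usage (account_id TEXT, events INTEGER);
    CREATE TABLE billing (account_id TEXT, plan TEXT, mrr REAL);
    INSERT INTO product_usage VALUES ('a1', 420), ('a2', 7);
    INSERT INTO billing VALUES ('a1', 'enterprise', 999.0), ('a2', 'starter', 49.0);
""")

# Join usage with billing to derive a per-account view a feature could render.
rows = conn.execute("""
    SELECT b.account_id, b.plan, b.mrr, u.events
    FROM billing b JOIN product_usage u ON u.account_id = b.account_id
    ORDER BY b.account_id
""").fetchall()
```

In a real warehouse this derived view would be materialized as a model and refreshed on a schedule rather than computed on the fly.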
AI, ML, and data science
Databricks continues to invest in advanced ML/AI workloads, while Snowflake’s Cortex and BigQuery’s Vertex AI integration also bring these models right into the data team’s workbench. Many of these models are used to produce or enrich existing datasets: for example, sentiment analysis, forecasting, and general classifiers often augment tables. These added columns are ripe for powering personalized customer experiences.
Historical data
Warehouses retain deep history by design, while application databases typically hold only current state. This historical perspective is valuable in various contexts, such as showing customers their past purchase history, usage patterns, or financial trends.
Operational simplicity
Reusing the warehouse keeps the architecture lean: the fewer databases, pipelines, and other distributed systems to manage, the better.
Security and governance
You’ve already invested in making the data warehouse not only a source of truth for modeled data, but also the governance and security layers to ensure safe access.
Why It’s Hard to Build With Warehouse Data
Let’s face it — most application engineers have not worked directly with a data warehouse. They view it as a sink, not a source. Primarily, this is because the warehouse is owned by a separate team and there’s a lack of tooling to reliably integrate the data into the application stack.
Below, we confront both the organizational and technical challenges. We then review approaches we’ve seen in the real world, and reconsider whether these paths solve the problems or simply move the complexity elsewhere in the stack.
What are the people problems that make data warehouses hard to build on?
Multi-tenant, concurrent ownership environment
Data warehouses are typically owned by data engineers, with contributions from analytics engineers, analysts, data scientists, and ML engineers. The result is an environment with unexpected changes, unpredictable workloads, and unclear boundaries of ownership, which is too unstable for most application engineers’ tastes.
Data teams are very careful about controlling access to the warehouse. Security teams and data stewards are wary of an application connecting directly to the warehouse, fearing that a vulnerability in the application could lead an attacker upstream.
Building with warehouse data typically requires a pipeline out of the warehouse and often an entirely separate database to power the production workload. The application teams are therefore dependent on the data or infra teams to stand up infrastructure to unblock the product roadmap.
Data teams are accountable to business and executive stakeholders for things like “data quality” and “metric correctness”. Application teams are accountable to customers for “performance”, “uptime”, and “good user experience”. Latency means end-to-end pipeline time to data teams, while to app engineers it means page load time.
What are the technical problems that make data warehouses hard to build on?
Integrating data warehouse data into a product is challenging largely because of the demands of production experiences: performance, reliability, and concurrency.
The first obvious question is – why can’t you simply query the data warehouse directly?
In general, production applications and particularly their users have little tolerance for slowness. Consumers expect sub-second load times at worst with <50ms expected in most cases.
Data warehouse query engines are slow by application standards.
- Workload diversity — production applications often perform a range of queries from analytics and multi-million row aggregations to search and point reads.
- Limited caching — their caching architecture is best-effort with respect to analytics queries; by application standards, the cache hit rate is too low to be suitable for global end-user traffic.
- Data layout — because they store data almost exclusively in columnar format, searches and point reads, which are best served by key/value and row-oriented storage, perform particularly poorly.
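To illustrate that last point, here is a toy Python model of the two layouts, using plain lists and dicts as stand-ins rather than real storage engines: a columnar point read without an index must scan the key column and then reassemble the row from every other column, while a row store with a primary-key index is a single keyed lookup.

```python
# Toy model of why point reads differ between columnar and row-oriented layouts.
# Columnar layout: one Python list per column.
columnar = {
    "id":    [1, 2, 3, 4],
    "email": ["a@x.co", "b@x.co", "c@x.co", "d@x.co"],
}

def point_read_columnar(target_id):
    # Without an index, a columnar point read scans the key column...
    pos = columnar["id"].index(target_id)
    # ...then stitches the row back together from every column.
    return {col: values[pos] for col, values in columnar.items()}

# Row-oriented layout with a primary-key index: a single keyed lookup.
row_store = {
    1: {"id": 1, "email": "a@x.co"}, 2: {"id": 2, "email": "b@x.co"},
    3: {"id": 3, "email": "c@x.co"}, 4: {"id": 4, "email": "d@x.co"},
}

assert point_read_columnar(3) == row_store[3]  # same answer, very different work
```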
Given the major data warehouses’ compute-second pricing models, the cost of serving production volumes of traffic is effectively unbounded and scales with traffic, as virtual warehouses end up running essentially continuously.
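As a back-of-envelope sketch, with hypothetical credit and dollar rates rather than any vendor’s actual pricing: a warehouse that idles between analyst queries costs a fraction of one kept hot around the clock by end-user traffic.

```python
# Back-of-envelope cost model with hypothetical numbers: under steady end-user
# traffic the virtual warehouse never auto-suspends, so you pay for every hour.
CREDIT_PER_HOUR = 2        # hypothetical credits/hour for a small warehouse
DOLLARS_PER_CREDIT = 3.0   # hypothetical contract rate

def monthly_cost(active_hours_per_day):
    return active_hours_per_day * 30 * CREDIT_PER_HOUR * DOLLARS_PER_CREDIT

adhoc_analytics = monthly_cost(2)      # warehouse suspends between analyst queries
production_traffic = monthly_cost(24)  # constant traffic keeps it running 24/7
assert production_traffic == 12 * adhoc_analytics
```

The exact rates vary by vendor and contract, but the shape of the curve is the point: constant traffic converts a bursty, suspendable cost into a continuous one.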
Schema instability
The multiplayer environment described in the previous section impacts schema stability: an errant dropped or renamed column can unwittingly break application code.
Testability and CI/CD
Given the potential impact of an uncoordinated schema change, developers need a mechanism to test new schema versions and integrate the updates into their CI/CD flow. There is nothing natively available from the data warehouse vendors.
Developer experience
Warehouse query interfaces and SDKs are not tailored to application developers. The SDKs are typically low-level or support formats intended for data scientists. The query interface can usually be accessed over JDBC, but with a SQL dialect unfamiliar to application developers.
The Pros and Cons of Common Approaches
Approach #1: Build a “data service” in front of the data warehouse
In this approach, queries still run directly against the warehouse. The difference is that a service layer insulates the application somewhat from direct exposure to the warehouse.
We have seen some teams take this approach, but it suffers from largely the same set of problems. We do not recommend it, for both complexity and cost reasons.
Building a “data service” in front of the data warehouse
✔️ Insulates applications from direct warehouse exposure.
✔️ Better control over which queries are exposed to consumers (instead of free-for-all SQL access).
✔️ Theoretically better opportunity to enforce tighter security and network controls.
✔️ Application teams may be more comfortable programming against an API than directly against a data warehouse, given their generally limited familiarity with warehouses.
❌ High cost of construction (most teams surveyed invested 3-5 months before being production-ready).
❌ Queries still run against the warehouse, so the latency and cost problems remain.
Time to value: Slow
Total cost of ownership: High
Approach #2: Pipe the data from the warehouse to the application database
In this approach, data is piped from the warehouse to the application team’s OLTP database, such as Postgres. In practice, we’ve found teams underestimate the cost to build and maintain these systems.
Replication from the data warehouse is challenging
Point-in-time bulk exports are fairly straightforward, assuming you are comfortable picking the data up as Parquet files on blob storage, but production applications require a continuously updated read replica. Data warehouses do not support native replication protocols the way Postgres does, nor do they offer native change data capture. Even if they did, there is no clear answer for when a table is rebuilt or its schema is changed.
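In the absence of native CDC, teams often approximate it by polling with a high-water mark on an updated_at-style column. The sketch below assumes such a column exists and is reliably maintained, which is exactly the assumption that breaks on table rebuilds, deletes, and schema changes:

```python
from datetime import datetime

def incremental_sync(source_rows, watermark):
    """Return rows changed since `watermark` and the new watermark.

    A naive high-water-mark poll: misses deletes entirely, and a full table
    rebuild can silently rewrite history below the watermark.
    """
    changed = [r for r in source_rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark

rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 5)},
]
changed, wm = incremental_sync(rows, datetime(2024, 1, 2))  # picks up row 2 only
```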
Landing data in the application database is not straightforward
The data coming from warehouses is often very different from what the application database is optimized for handling.
- Data types: warehouse types do not map cleanly to application database column types, requiring conversion work.
- Data layout: warehouse data is typically column-oriented, whereas OLTP or application databases are typically row-oriented. An aggregation like COUNT() might run very efficiently in the warehouse but very inefficiently in the application database, causing performance issues.
- Data volume: tables in the warehouse often contain hundreds of millions of rows or more. Beyond a few million rows, most application databases start to suffer read and write performance bottlenecks during large inserts or upserts from the warehouse, threatening availability.
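To give a flavor of the type-conversion work, here is a deliberately incomplete, illustrative mapping from Snowflake-style types to Postgres column types; a real implementation also has to handle precision, time zones, and semi-structured data:

```python
# Illustrative (and deliberately incomplete) mapping from Snowflake-style types
# to Postgres column types. Each entry hides edge cases a real pipeline must solve.
TYPE_MAP = {
    "NUMBER(38,0)":  "BIGINT",   # lossy: NUMBER(38,0) can exceed BIGINT's 19 digits
    "FLOAT":         "DOUBLE PRECISION",
    "VARCHAR":       "TEXT",
    "TIMESTAMP_NTZ": "TIMESTAMP",
    "VARIANT":       "JSONB",
}

def to_postgres_ddl(table, columns):
    """Build a CREATE TABLE statement from (name, warehouse_type) pairs."""
    cols = ", ".join(f"{name} {TYPE_MAP[wh_type]}" for name, wh_type in columns)
    return f"CREATE TABLE {table} ({cols})"

ddl = to_postgres_ddl("usage_events", [("account_id", "VARCHAR"), ("ts", "TIMESTAMP_NTZ")])
```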
Piping the data from the warehouse to the application database
✔️ Data is extracted from the warehouse, enabling the application to access the dataset natively without having to know anything about the warehouse.
✔️ The pipeline is the only thing that accesses the warehouse, mitigating the security risk.
✔️ The application can continue to use the existing database it relies on, minimizing the disruption to developer experience.
❌ The replication pipeline requires skilled talent, build time and ongoing maintenance.
❌ Data type conversions are time-consuming to write and error-prone.
❌ The load applied to the application database may threaten its stability.
❌ Ambiguous ownership: who owns the replication pipeline?
Time to value: Slow
Total cost of ownership: High
Approach #3: Pipe from warehouse to a general purpose high performance OLAP engine
The main improvement of this approach is that high-performance OLAP engines such as ClickHouse, DuckDB, TiDB, Hazelcast, Pinot, and Rockset are more compatible out of the box with the data coming from the warehouse. Additionally, the data lands in an engine purpose-built for that workload.
The downside of this approach is the high total cost of ownership. It still requires solving the pipeline complexity and developing a bespoke API to integrate the engine into the application.
Piping the data from the warehouse to a general purpose high-performance OLAP engine
✔️ Isolated performance profile insulates application database from workload contention.
❌ Expensive to stand up and maintain
❌ Pipeline orchestration complexity remains, albeit better contained and with fewer translation problems.
Of the three approaches, this one strikes the best balance between reliability and performance, but the total ownership cost and complexity remain significant.
Time to value: Moderate
Total cost of ownership: High
Approach #4: Run application inside the data warehouse compute environment
This approach is theoretically appealing because it moves the application instead of the data.
In practice, porting a non-trivial production application into a data warehouse compute environment is not viable. Most production applications have a web of dependencies outside the warehouse, and warehouse compute environments offer fairly limited native APIs for reaching those supporting services.
For entirely new applications performing computations over a known schema, running them inside something like Snowflake’s Native Application Framework makes sense.
But for general purpose software backends and frontends, even building a greenfield app inside a data warehouse is a tall order.
Running application inside the data warehouse compute environment
✔️ Avoid moving the data entirely.
❌ Not feasible for most non-trivial applications, especially existing ones.
❌ Very high build cost if porting existing applications.
❌ Greenfield applications may work, but are limited by capabilities of the warehouse-native compute environments.
So, with all that said, how can we make this Very Hard Thing™ into a Very Easy Thing™? Below, we dive into how we took the best parts of Approach #3 to build a developer platform for building applications on top of the data warehouse.
The Core Tenets of Patch
Patch is designed to enable software engineers to easily build features using data from every corner of their organization, starting with data warehouses. It’s optimized for performance, stability, cost, and application developer experience.
There are three primary components: efficient automated data replication; high performance cache and multi-modal compute engines; and query interfaces.
Efficient, automated data replication
Patch’s architecture is most similar to Approach #3 above, with a high-performance compute engine serving traffic, fed by a replication pipeline from the warehouse. The pipeline runs a change data capture algorithm to detect and load new or modified rows in the underlying source tables, performing all necessary type conversions. Users can run this process on a fixed schedule or trigger it from their orchestrator via webhook.
Each time the system replicates new rows from a warehouse table, it verifies that the schema has not changed in a breaking manner and that the shape and characteristics of the underlying data have not varied significantly.
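As a minimal sketch (not Patch’s actual implementation), a schema verification step might compare the newly observed schema against the last known-good one and flag breaking changes:

```python
# Minimal sketch (not Patch's actual implementation) of checking a newly
# observed schema against the last known-good one before accepting a sync.
def breaking_changes(known_good, observed):
    issues = []
    for col, col_type in known_good.items():
        if col not in observed:
            issues.append(f"column dropped: {col}")
        elif observed[col] != col_type:
            issues.append(f"type changed: {col} {col_type} -> {observed[col]}")
    return issues  # new columns in `observed` are additive, hence non-breaking

known = {"account_id": "VARCHAR", "mrr": "NUMBER"}
# An added column is fine; a dropped column is not.
assert breaking_changes(known, {"account_id": "VARCHAR", "mrr": "NUMBER", "plan": "VARCHAR"}) == []
assert breaking_changes(known, {"account_id": "VARCHAR"}) == ["column dropped: mrr"]
```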
This enables some desirable properties on the application side:
- Schema evolution can be treated as a partially automated, opt-in upgrade by the application.
- In the event of a breaking schema change, or a complete destruction of the warehoused data source, Patch can continue to serve the most recent stable copy of the data, ensuring that the worst-case outcome for the end user is that data may appear stale until the issue is resolved. This is a notable improvement over the application crashing or loading garbage data.
- In the event of data corruption in the underlying warehouse tables, the replication system can detect a certain level of anomaly and alert the application engineers, who may want to pause, roll back, and investigate the sync state before that data reaches end users expecting consistency.
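One simple form such an anomaly check can take is a coarse row-count guard; the threshold below is an illustrative choice, not a tuned value:

```python
# Sketch of a coarse anomaly guard: refuse to publish a sync whose row count
# drops sharply versus the previous copy. The 50% threshold is illustrative.
def should_pause_sync(previous_rows, incoming_rows, max_drop_ratio=0.5):
    if previous_rows == 0:
        return False  # first sync: nothing to compare against
    return incoming_rows < previous_rows * (1 - max_drop_ratio)

assert not should_pause_sync(1_000_000, 1_020_000)  # normal growth, publish
assert should_pause_sync(1_000_000, 50_000)         # likely truncation upstream
```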
As a result, Patch is able to ensure no downtime stemming from an upstream breakage, such as an unannounced schema change or even a full source outage.
High performance cache & multi-modal compute engines
The replication system terminates in a multi-modal compute engine that supports OLAP queries, range scans, string searches, and point reads at extremely low latency. We will publish rigorous benchmarks in a separate post in the coming weeks, but below are some rough estimates from both internal testing and production customer workloads. Note: entire tables are replicated, so even never-before-seen queries are served with a 0% cache miss rate.
- Point reads: the ability to retrieve a single row by a key (or compound key): ~5-10ms
- Paginated or filtered scans: the ability to retrieve a list of records by some criteria and paginate quickly: 10-25ms per page for 1000 records
- High cardinality, single-column aggregation: one aggregation over a high-cardinality column: 150ms for 100M rows (cardinality expressed as a ratio of row count, with 1.0 meaning all rows are unique)
- High cardinality, multi-column aggregation: multiple aggregations over multiple high-cardinality columns (cardinality 1.0 per column): 250ms
- Simple string match: LIKE operator over 100M rows: 100ms
- Multi-level JOINs (3 or more): a 3-column JOIN on integer columns over 100M rows: 90ms
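Flat per-page latency for paginated scans is typically achieved with keyset (cursor) pagination, where each page is fetched relative to the last key seen rather than with a growing OFFSET. A minimal sketch of the pattern with SQLite:

```python
import sqlite3

# Keyset (cursor) pagination sketch: each page is fetched relative to the last
# key seen, so page N costs the same as page 1 (unlike OFFSET-based paging).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, account_id TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, f"a{i % 3}") for i in range(1, 31)])

def page(after_id, size=10):
    # WHERE id > ? uses the primary-key index; no rows are skipped over.
    return conn.execute(
        "SELECT id FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, size)).fetchall()

first = page(after_id=0)                 # ids 1..10
second = page(after_id=first[-1][0])     # ids 11..20
```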
The other benefit of such compute efficiency is cost reduction and predictability resulting from Patch’s fixed instance size pricing.
Data warehouse query engines are optimized for MPP computations on unbounded dataset sizes with no cap on query latency or cost. This is well-suited for ad-hoc analysis, but not for production applications whose queries are constrained to a more limited set, and need to be optimized for high concurrency at low latencies over a smaller dataset.
Query latency and compute efficiency are linked. Executing a query against a cheaper storage layer involves more overhead, resulting in longer latency. Faster disks and high-efficiency query engines are more expensive, but the lower latency means you pay for less work time. With smaller dataset sizes and bounded query complexity, both common in production applications, the lower cost per query more than makes up for the slightly higher storage costs. Patch is optimized for this tradeoff.
The automated replication keeps data fresh and reliable, while the compute engine is highly tuned for performance and cost-optimality. On the consumption side, developers are completely abstracted away from this sophisticated infrastructure.
Query interfaces
Developers have several options for querying data:
- ORM (coming soon) – Patch’s JDBC connector integrates with most popular ORMs.
- GraphQL – Patch generates GraphQL APIs, based on the specific dataset’s schema. Certain fields are derived based on the column type. For example, aggregation fields are automatically provided for numerical columns and timestamps are projected to more coarse-grained date parts like day of week.
- Data packages – Data packages are importable client libraries with query builders in the developer’s preferred runtime, e.g. Node.js, Python, C#. They are schema-versioned and contain all the same helper functions derived for the GraphQL APIs. No need to write database drivers or API boilerplate.
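To give a sense of what a generated query builder can look like, here is a hypothetical toy sketch; the class and method names are illustrative, not Patch’s actual data-package API:

```python
# Hypothetical toy query builder -- names are illustrative, not Patch's API.
# Quoting here is naive and for demonstration only; real builders parameterize.
class TableQuery:
    def __init__(self, table):
        self.table, self.filters, self.fields = table, [], ["*"]

    def select(self, *fields):
        self.fields = list(fields)
        return self

    def where(self, **kwargs):
        self.filters += list(kwargs.items())
        return self

    def to_sql(self):
        sql = f"SELECT {', '.join(self.fields)} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"{k} = '{v}'" for k, v in self.filters)
        return sql

q = TableQuery("billing").select("account_id", "mrr").where(plan="enterprise")
assert q.to_sql() == "SELECT account_id, mrr FROM billing WHERE plan = 'enterprise'"
```

The appeal of the pattern is that the builder is generated from the dataset’s schema version, so a breaking schema change surfaces at compile or import time rather than at runtime.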
Reliability
While not a standalone technical component, it’s worth explaining the steps we take to ensure reliability throughout the system. For production applications, any query engine serving in the critical request path needs to be up 24/7/365 so that customers are never impacted by slow page loads or outright downtime: an SLA of four nines (99.99%), or roughly 4 minutes of allowed downtime per month.
There are a number of architectural investments we’ve made at Patch to offer such an SLA.
- Cross-zone and cross-region failover – the cluster nodes serving the workload need to span multiple availability zones within a region to protect against rack failure, as well as a backup region that can take over if the primary region fails.
- Replication – the datasets being served need to be replicated across multiple nodes to ensure that a single node failure does not impact availability.
- Zero downtime deployments – rolling out the database engine, as well as any supporting network appliances such as load balancers and proxies, cannot impact live traffic.
- End-to-end monitoring and alerting – no matter the precautions you take, there is always something that can start to go wrong in production. Disk failures, zombie machines, DNS failures, network partitions, etc. Responding efficiently requires deeply integrated monitoring and alerting at all levels of the stack.
- Logs, backups and archives – in a worst-case scenario, say an entire region suffers an outage, the system needs to be restored to a consistent state as quickly as possible. Disaster recovery requires careful planning to ensure all system state can be restored from archived snapshots and reproducible configurations.
As more data lands in the data warehouse, the latent value of the modeled datasets continues to grow. Historically, it’s been very challenging to build customer-facing features that leverage this data, but users continue to demand data-rich, personalized, and snappy experiences. Our goal at Patch is to reduce the organizational and technical barriers to building these features.