Lessons learned from 100s of conversations about Data Packages | 2
In last week’s post, we explored the performance, reliability, and workflow requirements for a production application built using data from a data warehouse like Snowflake or BigQuery. Now, imagine building customer-facing features while simultaneously switching vendors.
“We’re migrating from another CDW to Snowflake and need to build a production app with the data while keeping the feature set online through the migration.”
- Data engineer, SaaS company in e-comm logistics space
In most scenarios, the migration would be treated as a blocker.
Migrations require engineers to stand up a new data system, replicate the dataset, and rewrite schema-specific and database-specific driver code. Only once everything has been tested in production can they retire the old data system. Refactoring the application code takes the longest, and it’s usually blocked by the data migration work itself.
Data Packages, on the other hand, decouple application code from specific data infrastructure so that a dataset can be moved to different storage systems by data teams without impacting application teams downstream. Here’s how it works:
Since the query interface is the same before and after the migration, the logical code does not need to change.
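To make the idea concrete, here is a minimal sketch in Python. It is not the Data Packages API: `ShipmentsPackage`, `LegacyWarehouse`, `NewWarehouse`, and the `shipments` table are hypothetical names invented for illustration, and an in-memory SQLite database stands in for the new warehouse. The point is only that `late_shipment_ids`, the application code, is written once against the package interface and never learns which system holds the data.

```python
from __future__ import annotations

import sqlite3
from typing import Protocol


class Backend(Protocol):
    """Anything that can return rows of a table as dictionaries."""

    def fetch(self, table: str, columns: list[str], limit: int) -> list[dict]: ...


class LegacyWarehouse:
    """Stand-in for the old system: rows held in plain Python lists."""

    def __init__(self, tables: dict[str, list[dict]]) -> None:
        self._tables = tables

    def fetch(self, table: str, columns: list[str], limit: int) -> list[dict]:
        return [{c: row[c] for c in columns} for row in self._tables[table][:limit]]


class NewWarehouse:
    """Stand-in for the new system: rows held in an in-memory SQLite database."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def fetch(self, table: str, columns: list[str], limit: int) -> list[dict]:
        # Identifiers are interpolated directly because this is a toy example
        # with trusted, hard-coded table and column names.
        sql = f"SELECT {', '.join(columns)} FROM {table} LIMIT ?"
        cursor = self._conn.execute(sql, (limit,))
        return [dict(zip(columns, row)) for row in cursor.fetchall()]


class ShipmentsPackage:
    """Hypothetical package: the only interface the application sees."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend

    def select(self, *columns: str, limit: int = 100) -> list[dict]:
        return self._backend.fetch("shipments", list(columns), limit)


def late_shipment_ids(package: ShipmentsPackage) -> list[int]:
    """Application logic: identical before and after the migration."""
    rows = package.select("id", "status")
    return [row["id"] for row in rows if row["status"] == "late"]


ROWS = [
    {"id": 1, "status": "late"},
    {"id": 2, "status": "on_time"},
    {"id": 3, "status": "late"},
]


def build_new_warehouse(rows: list[dict]) -> NewWarehouse:
    """Replicate the dataset into the new system."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE shipments (id INTEGER, status TEXT)")
    conn.executemany("INSERT INTO shipments VALUES (:id, :status)", rows)
    return NewWarehouse(conn)


before = late_shipment_ids(ShipmentsPackage(LegacyWarehouse({"shipments": ROWS})))
after = late_shipment_ids(ShipmentsPackage(build_new_warehouse(ROWS)))
```

In this sketch the only line that changes during the migration is the one that constructs the package with a different backend, which is work the data team can own. The feature code in `late_shipment_ids` stays untouched.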
It’s important to consider why migrations happen in the first place. Often, it’s because the scale or type of workload requires another type of database. Commonly, applications are initially built with a single OLTP database like Postgres. Over time, data scales. If the workload remains primarily transactional, no problem. Postgres and its rich ecosystem offer plenty of tools to keep up with the growth.
However, when the workloads begin to diversify to include, for example, analytics queries, search, or heavy transformations, database performance suffers and engineers face a choice. They can invest deeply in database operations and pay permanently high maintenance costs. Or they can introduce other storage engines, such as a data warehouse, to relieve the primary database of a subset of workloads. In the latter case, disentangling the application code from the location of the data is messy work.
By using Data Packages, developers reduce the cost of these inflection points. For example, a Data Package can be generated over the Postgres dataset. Then, read queries (or just the analytics and search subset) can be routed through the Data Package to alternate query engines, including the native package query engine, to relieve Postgres. Later, if the team decides to adopt a data warehouse for heavier data integration or transformation workloads and wants to integrate the processed data into their app, the same Data Package query interface can be used.
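The routing described here can be sketched roughly as follows. This is an illustrative Python toy, not how Data Packages are implemented: `RoutedPackage`, the `workload` labels, and the engine names are all assumptions, and each "engine" simply records which system would have received the query.

```python
from dataclasses import dataclass, field
from typing import Callable

Rows = list[dict]
Engine = Callable[[str], Rows]


@dataclass
class RoutedPackage:
    """Hypothetical package that picks a query engine per workload."""

    primary: Engine
    alternates: dict[str, Engine] = field(default_factory=dict)

    def query(self, sql: str, workload: str = "transactional") -> Rows:
        # Workloads without a dedicated engine fall back to the primary.
        engine = self.alternates.get(workload, self.primary)
        return engine(sql)


def make_engine(name: str, log: list[tuple[str, str]]) -> Engine:
    """Build a fake engine that records the queries it receives."""

    def run(sql: str) -> Rows:
        log.append((name, sql))
        return []

    return run


log: list[tuple[str, str]] = []
package = RoutedPackage(
    primary=make_engine("postgres", log),
    alternates={"analytics": make_engine("package_engine", log)},
)

# Transactional reads still hit Postgres.
package.query("SELECT * FROM orders WHERE id = 42")

# Analytics reads are routed away from Postgres.
report_sql = "SELECT status, count(*) FROM orders GROUP BY status"
package.query(report_sql, workload="analytics")

# Later, the team adopts a warehouse; the calling code is unchanged.
package.alternates["analytics"] = make_engine("warehouse", log)
package.query(report_sql, workload="analytics")

engines_used = [name for name, _ in log]
```

The calls to `package.query` never change across the three stages. Only the routing table does, which is the property that keeps the application insulated from each infrastructure decision.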
We’ve observed that over time, engineering teams introduce additional storage systems and other infrastructure components to their stack. In most cases, this complexity is justified. But that doesn’t mean it comes without cost. Our goal is for Data Packages to make the inevitable migrations and increased infrastructure footprint less costly. A company’s proprietary datasets are intrinsically valuable and the physical storage location shouldn’t impact what you can build with them.
If you’d like to get early access to Data Packages, please share your email at www.dpm.sh. If you’re interested in Patch’s managed infrastructure for versioning, access controls, and query performance, sign up at patch.tech/request-invite.