Decoupling data migrations from shipping production features

July 19, 2023
Data architecture diagram

We’ve previously explored the performance, reliability, and workflow requirements for a production application built using data from a data warehouse like Snowflake or BigQuery. Now, imagine building customer-facing data features while undergoing a cloud data migration.

“We’re migrating from another CDW to Snowflake and need to build a production app with the data while keeping the feature set online through the migration.”
– Data engineer, SaaS company in e-comm logistics space

In most scenarios, the migration would be treated as a blocker.

Data portability is critical to cloud data migrations

Migrations require engineers to stand up a new data system, replicate the dataset, and rewrite schema- and database-specific driver code; only once everything is tested in production can they retire the old system. Refactoring the application code takes the longest, and it's usually blocked by the data migration work itself.

Data Packages, on the other hand, decouple application code from specific data infrastructure so that a dataset can be moved to different storage systems by data teams without impacting application teams downstream. Here’s how it works:

  1. A data package is generated from a source schema, containing a client library with a type-safe query interface.
  2. Developers install the library with their package manager, import it into their code, and write queries.
  3. At some point, the data engineer replicates the dataset to the new database.
  4. Any change to an upstream source or schema bumps the package version. In this case, the schema is not changing, but the source is, so a new version of the data package is published with the exact same query interface.
  5. In a dev environment, the app developer bumps the version of the package in their dependency file. The dpm-agent routes queries to the source specified in the new package version.
  6. After testing, the change to the dependency file is promoted to production.
Queries are routed to a different source based on the package version. No code changes.

Since the query interface is the same before and after the migration, the logical code does not need to change.
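The workflow above can be sketched in code. This is a minimal simulation, not the real generated client: the names (`DataPackageClient`, `selectOrders`, the source identifiers) are assumptions for illustration. The point it demonstrates is that each published package version pins a different source while the query interface stays identical, so application code compiles and runs unchanged across the version bump.

```typescript
// Hypothetical stand-in for a generated data-package client.
// Real clients are generated from the source schema and execute
// queries via dpm-agent; here the result is simulated in-process.
type Source = "old-cdw" | "snowflake";

interface OrderRow {
  id: number;
  total: number;
}

class DataPackageClient {
  // Each package version pins the source its queries route to.
  constructor(readonly version: string, readonly source: Source) {}

  // Type-safe query method with the same signature in every version.
  selectOrders(limit: number): OrderRow[] {
    // Simulated rows; a real client would query `this.source`.
    return [{ id: 1, total: 42 }].slice(0, limit);
  }
}

// Before the migration: v0.1.x routes to the old warehouse...
const before = new DataPackageClient("0.1.0", "old-cdw");
// ...after the version bump: v0.2.x routes to Snowflake.
const after = new DataPackageClient("0.2.0", "snowflake");

// Application code is written once and works against either version.
function topOrderTotal(client: DataPackageClient): number {
  return client.selectOrders(1)[0].total;
}
```

Promoting the migration to production is then just a dependency-file change (e.g. bumping `0.1.0` to `0.2.0`), with no edits to the query code itself.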

Data migrations are inevitable

It’s important to consider why migrations happen in the first place. Often, it’s because the scale or type of workload requires another type of database. Commonly, applications are initially built with a single OLTP database like Postgres. Over time, data scales. If the workload remains primarily transactional, no problem. Postgres and its rich ecosystem offer plenty of tools to keep up with the growth.

However, when the workloads begin to diversify to include, say, analytics queries, search, or heavy transformations, database performance suffers and engineers face a choice. They can invest deeply in database operations and pay permanently high maintenance costs, or they can introduce other storage engines, such as a data warehouse, to relieve the primary database of a subset of workloads. In the latter case, disentangling the application code from the location of the data is messy work.

By using Data Packages, developers reduce the cost of these inflection points. For example, a Data Package can be generated over the Postgres dataset. Then, read queries (or just the analytics & search subset) can be routed through the Data Package to alternate query engines, including the native package query engine, to relieve Postgres. Later, if the team decides to adopt a data warehouse for heavier data integration or transformation workloads and wants to integrate the processed data into their app, the same Data Package query interface can be used.
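A sketch of that workload split, under stated assumptions: the `PostgresStore` and `OrdersPackage` classes below are illustrative stand-ins, not real APIs. Transactional writes go straight to the primary database, while analytics reads are expressed against the package interface, so they can later be re-routed to whatever engine a new package version pins without touching the application code.

```typescript
// Illustrative workload split: writes hit Postgres directly,
// analytics reads go through a hypothetical generated package client.
interface Order {
  id: number;
  total: number;
}

// Stand-in for the Postgres client used by transactional code paths.
class PostgresStore {
  private rows: Order[] = [];
  insert(order: Order): void {
    this.rows.push(order);
  }
  all(): Order[] {
    return [...this.rows];
  }
}

// Stand-in for the generated package client. In production, the
// package version would determine which engine runs this aggregate.
class OrdersPackage {
  constructor(private source: () => Order[]) {}
  revenue(): number {
    return this.source().reduce((sum, o) => sum + o.total, 0);
  }
}

// Transactional writes stay on Postgres.
const pg = new PostgresStore();
pg.insert({ id: 1, total: 10 });
pg.insert({ id: 2, total: 32 });

// The analytics read is expressed against the package interface,
// not against Postgres directly.
const analytics = new OrdersPackage(() => pg.all());
```

If the team later adopts a warehouse for the analytics subset, only the package's pinned source changes; `analytics.revenue()` and the rest of the read path stay as written.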

Try Data Packages

We’ve observed that over time, engineering teams introduce additional storage systems and other infrastructure components to their stack. In most cases, this complexity is justified. But that doesn’t mean it comes without cost. Our goal is for Data Packages to make the inevitable migrations and increased infrastructure footprint less costly. A company’s proprietary datasets are intrinsically valuable and the physical storage location shouldn’t impact what you can build with them.

If you’d like to try Data Packages, or if you’re interested in Patch’s managed infrastructure for versioning, access controls, and query performance, book a demo with us.

– Whelan

Start building with Data Packages.