Bringing performance, reliability, and CI/CD to data

July 12, 2023
Whelan
Developer

At Snowflake Summit, we demoed Data Packages to engineers and leaders from hundreds of companies preparing for increasingly demanding use cases for their data infrastructure. Some of the issues we heard:

  1. Line-of-business applications & services necessitate guarantees for performance, availability, and stable schema versioning
  2. Tight coupling of datasets to infrastructure makes data sharing, migrations, and schema evolution challenging
  3. Existing solutions for granular access policies, cost controls, and auditing are fragmented across data systems, with no single place to administer all data access

This is the first in a series of posts where we explore lessons learned over the past several months that were reinforced at Summit. First up, we address issue #1: line-of-business applications where low-latency queries and data uptime matter most.

“As we work with larger enterprise clients, we need to deliver rich analytics dashboards and notifications to help them manage their sales and operations. This data is spread across several databases, including Snowflake, but our customers expect lightning fast load times, even when interacting with chart filters or adjusting time grains.”
– Data engineer, B2B SaaS company

Today, using processed data from a data warehouse like Snowflake in production requires dedicated infrastructure. End customers expect ultra-low latency, multiple nines of availability, and low error rates. Although they often contain the richest data, OLAP databases are neither cost-effective nor fast enough to serve production traffic.

In turn, engineering teams must set up pipelines to move data into another database or cache. These caches must be built, managed and scaled according to query and traffic patterns. Then, they have to develop an API to serve the data and write boilerplate code to integrate the API into their application codebase.

Additionally, OLAP databases do not provide a robust development workflow for managing the change lifecycle of production data. The loose structures and multi-tenant usage make them great for analysis, but brittle for production application data. Upstream schema changes break data pipelines and subsequently break application code. Table recomputations or materialization bugs lead to incorrect data shown to a customer. Feature updates necessitate changes throughout the stack, forcing data and software engineering teams to coordinate rollouts across too many systems.

Performance and high availability with zero config

Data Packages address these issues by decoupling the client from the infrastructure. Developers import a versioned code library in their preferred runtime language that includes a powerful query interface generated from their specific dataset schema.

A Data Package installed via npm imported into a node.js project

Queries are routed through an agent to a supported source, such as Snowflake or Patch’s hybrid query engine. For high availability use cases, Patch has a built-in cache and dataset versioning system; stable data continues to be served even if upstream systems have broken. It also enables ultra low latency analytics queries, single row lookups, and text search.
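To make the "stable data continues to be served" idea concrete, here is a minimal sketch of a versioned cache that keeps serving the last good dataset version when a refresh fails. It is not Patch's implementation; the class name `VersionedCache`, its methods, and the `fetch_upstream` callable are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


class VersionedCache:
    """Hypothetical cache that holds the last successfully loaded dataset version."""

    def __init__(self, fetch_upstream: Callable[[], List[Row]]) -> None:
        # fetch_upstream stands in for a warehouse query; it is an assumption.
        self._fetch_upstream = fetch_upstream
        self._rows: Optional[List[Row]] = None
        self.version = 0

    def refresh(self) -> bool:
        """Try to load a new version; keep the old one if upstream fails."""
        try:
            rows = self._fetch_upstream()
        except Exception:
            return False  # upstream is broken, the stored version is untouched
        self._rows = rows
        self.version += 1
        return True

    def query(self, predicate: Callable[[Row], bool]) -> List[Row]:
        """Serve reads from the stored version only, never from upstream."""
        if self._rows is None:
            raise RuntimeError("no dataset version has been loaded yet")
        return [row for row in self._rows if predicate(row)]
```

In this sketch, reads never touch the upstream system, so an outage or a bad recomputation upstream only delays the next version rather than breaking the application.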

The agent provides users with the flexibility to decide which execution engine a Data Package query will be routed to depending on the use case requirements. Users simply update their Data Package sources without changing their application code or configuring any infrastructure.
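The routing idea can be sketched in a few lines. This is an illustration under stated assumptions, not the agent's actual design: `make_router`, the engine names, and the `sources` mapping are all hypothetical. The point is only that the calling code stays the same while the source configuration changes.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Engine = Callable[[str], List[Row]]


def make_router(
    engines: Dict[str, Engine], sources: Dict[str, str]
) -> Callable[[str, str], List[Row]]:
    """Return a function that runs a query on whichever engine a dataset maps to.

    `sources` maps a dataset name to an engine name (hypothetical config).
    It is read at call time, so updating it re-routes later queries.
    """

    def run(dataset: str, query: str) -> List[Row]:
        engine_name = sources[dataset]
        return engines[engine_name](query)

    return run


# Stand-in engines that just report where the query ran.
engines: Dict[str, Engine] = {
    "warehouse": lambda q: [{"engine": "warehouse", "query": q}],
    "cache": lambda q: [{"engine": "cache", "query": q}],
}
sources = {"claims": "warehouse"}
run = make_router(engines, sources)
```

Changing `sources["claims"]` to `"cache"` sends the next `run("claims", ...)` call to the other engine, with no change at the call site.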

Bringing CI/CD to production data

“Each time we receive a new claim, we must query a historical claims dataset in Snowflake from a Python enrichment service as part of risk assessment. The schema on this dataset evolves over time, and every now and then there are inadvertent changes upstream. We cannot allow this to break our underwriting process given the P&L impact.”
– Senior Architect, public insurance company

Like teams building customer-facing features, engineers integrating processed data into a backend enrichment or machine learning service need to avoid shipping code regressions. A proper CI/CD workflow requires data snapshots, such as “data from June 1-14”, because the goal is to verify that application behavior is correct, assuming the data is correct.

Software engineers typically use embedded fixture data to write unit and integration tests. They expect the same input to produce the same output every single time. If you write automated tests against dynamic data, the assertions become non-deterministic: tests can fail for reasons that have nothing to do with the code logic.

Data Packages provide an easy way to snapshot an arbitrary range of data and treat it as a fixture. Each test run is against a static version of the data. It’s also straightforward to move the snapshot forward to a new or rolling window, such as “last completed month”. Once the window is updated, the data is again static and frozen.
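As a rough sketch of what a frozen, rolling window means in practice, the snippet below computes the "last completed month" for a given date and copies the matching rows into an immutable snapshot. The function names and the `day` column are assumptions made for this example, not part of the Data Packages interface.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

Row = Dict[str, object]


def last_completed_month(today: date) -> Tuple[date, date]:
    """Return the first and last day of the month before `today`'s month."""
    first_of_this_month = today.replace(day=1)
    end = first_of_this_month - timedelta(days=1)  # last day of previous month
    start = end.replace(day=1)
    return start, end


def snapshot(rows: List[Row], start: date, end: date) -> Tuple[Row, ...]:
    """Copy rows whose `day` falls in [start, end] into a fixed tuple.

    Rows are copied so later changes to the live data do not leak into
    the fixture. The `day` column name is hypothetical.
    """
    return tuple(dict(r) for r in rows if start <= r["day"] <= end)
```

A test suite would pin `today` (or the window itself) when the snapshot is taken, so every run asserts against the same rows until the window is deliberately moved forward.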

Most importantly, tests can now assert against real (vs synthetic) data and avoid embedding the data into the source code. Access control policies can mask out or obfuscate sensitive columns. There is no boilerplate and no database connections to set up. Developers simply import the package into their test suite just like a regular code library.
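One way to picture column masking on test data is the sketch below. It is an assumption about how such a policy could behave, not a description of Patch's access controls: `mask_rows` and its parameters are hypothetical. Masked columns are replaced with a fixed placeholder, and obfuscated columns are replaced with a truncated SHA-256 digest so equal inputs still map to equal outputs.

```python
import hashlib
from typing import Dict, List, Set

Row = Dict[str, object]


def mask_rows(rows: List[Row], masked: Set[str], hashed: Set[str]) -> List[Row]:
    """Return copies of rows with sensitive columns hidden (hypothetical policy).

    Columns in `masked` become "***". Columns in `hashed` become the first
    12 hex characters of a SHA-256 digest of the value's string form.
    """
    result: List[Row] = []
    for row in rows:
        clean: Row = {}
        for column, value in row.items():
            if column in masked:
                clean[column] = "***"
            elif column in hashed:
                digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
                clean[column] = digest[:12]
            else:
                clean[column] = value
        result.append(clean)
    return result
```

Because the digest is deterministic, tests can still join or group on an obfuscated column without seeing the original value. An unsalted hash of low-entropy values can be guessed, so a real policy would likely need something stronger.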

Try Data Packages

The simplicity and breadth of use cases are why everyone we spoke to is so excited about Data Packages. This single abstraction ties together the solutions to some of the most pressing problems at the center of data architecture today, and gives engineers the ability to distribute datasets across team and technical lines with ease and confidence. In the next post, we explore safe migrations, data sharing, and further benefits of decoupling clients from infrastructure.

If you’d like to try Data Packages or if you’re interested in Patch’s managed infrastructure for versioning, access controls, and performance, book a demo with us.

-Whelan
