Monetize data products without engineering overhead
“We are distributing data-as-a-product to customers and partners. Some of them want to build apps on top of the data and others simply want to run ad hoc queries. Right now, we grant direct access to our data warehouse, but that’s brittle. We must support fast queries on fresh data, be robust to upstream schema changes, and give ourselves flexibility to change data storage locations in the future without causing huge headaches for our consumers or ourselves.”
– Head of Data, Healthcare tech company
Data teams that focus exclusively on helping their organization make “better, data-driven decisions” often struggle to demonstrate their value, because that outcome is very difficult to quantify. Data teams, particularly data engineers, should reframe their mission to include securely distributing data products to a diverse set of consumers.
We previously highlighted product engineers building online applications as a key data consumer. This post focuses on external customers and partners.
Not monetizing data products? You’re leaving money on the table
Every organization is sitting on proprietary data that is valuable not only internally but, in many cases, to its customers, partners, and surrounding ecosystem. Roaring investment in AI, specifically model fine-tuning, only propels this further: demand for complementary data to enrich first-party datasets is skyrocketing.
Some organizations have already begun monetizing their data products. These teams excel at ingesting, processing, and curating data that their customers and partners find valuable. Where they struggle is data sharing.
Sharing data products
The available options for data product sharing either sacrifice the user experience or leave engineering teams with burdensome maintenance.
For example, SFTP requires the data consumer to set up their own pipelines to move the data from the cloud storage location to a place where it can be useful. CSV exports are limited to small datasets and lack schema validation. Direct database access is inherently insecure and leaks infrastructure details.
Custom data APIs address some of these concerns, but they require a large engineering investment to ensure they remain reliable and backward-compatible as schemas evolve. Developers consuming the API also typically need a type-safe client SDK, adding to the provider’s level of effort.
An elegant solution to sharing and monetizing data products:
- Provides a reliable data contract that consumers can safely build against
- Makes it easy for consumers to discover and access data products
- Provides all the information and tools necessary to be immediately productive
- Insulates consumers from backward-incompatible changes and offers a seamless upgrade workflow
- Decouples the external interface from specific data storage and compute engines, letting the provider’s engineers make infrastructure changes at will without impacting customers
- Enforces access controls without manual effort or exposing credentials
- Provides type-safe clients to reduce boilerplate for consumers
- Measures usage and makes this data easily available, e.g. for billing purposes
- Supports multi-source datasets
Data Packages: a reliable approach to distributing and monetizing data products
“If you sell something through Amazon, you don’t ship the raw goods. You put it in a box, packing material, tape…it’s all wrapped up. That’s what Data Packages give you.”
– Principal Analyst, Pharmaceutical company
Data Packages are designed to provide external parties with a stable interface to build products, train models, or conduct analysis using your data assets. Like the Amazon analogy above, they come with everything built-in to make consumers immediately productive.
The consumer installs a versioned code library via a package manager like pip or npm. The library contains a type-safe query interface in their runtime language with a live connection to the data package’s configured source, such as Snowflake or AWS S3. Crucially, the client is decoupled from the source infrastructure, so engineering is free to make any changes necessary over time.
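To make “type-safe query interface” concrete, here is a minimal, self-contained sketch of the pattern such a generated client might follow. The `Patient` fields, `PatientsTable` class, and method names are illustrative assumptions, not Patch’s actual generated API; a real Data Package client would route `fetch()` to the configured source rather than filter in memory.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical row type. In a real package, this would be generated
# from the package's published schema, so field names and types are
# guaranteed to match the data contract.
@dataclass(frozen=True)
class Patient:
    patient_id: str
    region: str
    visit_count: int

class PatientsTable:
    """Illustrative query builder: methods mirror schema columns, so an
    invalid column reference is a type/attribute error at authoring time
    rather than a runtime failure against the warehouse."""

    def __init__(self, rows: List[Patient]):
        self._rows = rows

    def filter_region(self, region: str) -> "PatientsTable":
        # Returns a new filtered view; a real client would instead
        # build up a query to push down to Snowflake, S3, etc.
        return PatientsTable([r for r in self._rows if r.region == region])

    def fetch(self) -> List[Patient]:
        return list(self._rows)

# Example usage with in-memory stand-in data.
table = PatientsTable([
    Patient("p1", "us-east", 3),
    Patient("p2", "eu-west", 5),
])
east = table.filter_region("us-east").fetch()
```

Because the consumer codes against the generated types rather than a connection string, the provider can swap the backing storage without breaking this call site.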
When users run queries, a process called dpm-agent enforces any access policies that you’ve configured before routing the query to the source. With org, package, and row-level ACL rules, you can ensure that only the right individuals can access the data. Since they are not coupled to specific databases or other infrastructure components, you won’t have to redefine these policies if/when you migrate storage systems in the future.
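The row-level enforcement step can be sketched as a simple policy gate. This is an illustrative model only, not dpm-agent’s real implementation: the `Row`, `only_own_org`, and `enforce` names are assumptions, and a real agent would compile policies into the query it routes to the source rather than filter results in memory.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Row:
    org: str      # org the row belongs to
    value: int    # payload column

# A row-level policy: given the consumer's org and a row,
# decide whether that consumer may see the row.
RowPolicy = Callable[[str, Row], bool]

def only_own_org(consumer_org: str, row: Row) -> bool:
    # Example rule: consumers only see rows tagged with their own org.
    return row.org == consumer_org

def enforce(rows: List[Row], consumer_org: str, policy: RowPolicy) -> List[Row]:
    # Agent-style gate: the policy is applied before any results reach
    # the consumer. Because it is expressed against the package schema,
    # not a specific database, it survives a storage migration intact.
    return [r for r in rows if policy(consumer_org, r)]

rows = [Row("acme", 1), Row("acme", 2), Row("globex", 3)]
visible = enforce(rows, "acme", only_own_org)
```

The key design point mirrored here is that the policy is defined once, at the package layer, instead of being re-implemented as grants in each underlying database.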
Patch also provides additional package configurations depending on your consumers: query acceleration for ultra-low-latency use cases, reliable schema evolution on live datasets, and data snapshotting for reproducibility.
Introducing Data Packages and Data Package Manager
Embedded access policies, versioned query interfaces with type-safety, and package metadata are designed to make your data consumers immediately productive. Meanwhile, we’ve drawn inspiration from code package management to reduce the dependency cost between the provider and consumer.
If you’d like to get early access to Data Packages or Patch’s managed infrastructure for versioning, access controls, and query performance, let’s chat.