Data Packages provide external parties with a stable interface to build products, train models, or conduct analysis using your data assets.
“We are distributing data-as-a-product to customers and partners. Some of them want to build apps on top of the data and others simply want to run ad hoc queries. Right now, we grant direct access to our data warehouse, but that’s brittle. We must support fast queries on fresh data, be robust to upstream schema changes, and give ourselves flexibility to change data storage locations in the future without causing huge headaches for our consumers or ourselves.”
- Head of Data, Healthcare tech company
Data teams struggle to demonstrate their value largely because they exclusively focus on helping their organization make “better, data-driven decisions.” That’s a very difficult outcome to quantify. Data teams, particularly data engineers, must reframe their mission to include data products that can be securely distributed to a diverse set of consumers.
We previously highlighted product engineers building online applications as a key data consumer. This post focuses on external customers and partners.
Every organization is sitting on a proprietary dataset that has intrinsic value to itself and, in many cases, to its customers, partners, and surrounding ecosystem. Surging investment in AI, specifically model fine-tuning, only amplifies this. Demand for complementary data to enrich first-party datasets is skyrocketing.
Some organizations have already begun monetizing their data assets. These teams excel at ingesting, processing, and curating data that their customers and partners find valuable. Where they struggle is data sharing.
The available options for data sharing either sacrifice the user experience or leave engineering teams with burdensome maintenance.
For example, SFTP requires the data consumer to set up their own pipelines to move the data from the cloud storage location to a place where it can be useful. CSV exports are limited to small datasets and lack schema validation. Direct database access is inherently insecure and leaks infrastructure details.
Custom data APIs address some of these concerns, but require a large engineering investment to ensure they’re reliable and stay backward-compatible as schemas change. Developers consuming the API also typically need a type-safe client SDK, adding to the provider’s level of effort.
An elegant solution to sharing and monetizing data assets:
“If you sell something through Amazon, you don’t ship the raw goods. You put it in a box, packing material, tape…it’s all wrapped up. That’s what Data Packages give you.”
- Principal Analyst, Pharmaceutical company
Data Packages are designed to provide external parties with a stable interface to build products, train models, or conduct analysis using your data assets. Like the Amazon analogy above, they come with everything built-in to make consumers immediately productive.
The consumer installs a versioned code library via a package manager like pip or npm. The library contains a type-safe query interface in their runtime language with a live connection to the data package’s configured source, such as Snowflake or AWS S3. Crucially, the client is decoupled from the source infrastructure, so engineering is free to make any changes necessary over time.
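To make the decoupling concrete, here is a minimal sketch of what a type-safe query interface could look like. It is not the actual generated client: the `Patients` class, its fields, the `Backend` type, and the version string are all hypothetical names chosen for illustration. The point is that consumer code depends only on typed fields and a query object, while the backend that executes the query can be swapped without touching that code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


@dataclass(frozen=True)
class Filter:
    column: str
    value: Any


@dataclass(frozen=True)
class Field:
    name: str
    py_type: type

    def eq(self, value: Any) -> Filter:
        # Reject mistyped comparisons on the client, before any query is sent.
        if not isinstance(value, self.py_type):
            raise TypeError(f"{self.name} expects {self.py_type.__name__}")
        return Filter(self.name, value)


@dataclass(frozen=True)
class Query:
    table: str
    columns: Tuple[str, ...]
    filters: Tuple[Filter, ...] = ()


# A backend is anything that can execute a Query. The client never sees
# whether it is a warehouse, object storage, or (here) an in-memory list.
Backend = Callable[[Query], List[Row]]


class Patients:
    """Hypothetical generated table class, pinned to one package version."""

    VERSION = "1.2.0"  # illustrative only
    patient_id = Field("patient_id", int)
    state = Field("state", str)

    def __init__(self, backend: Backend) -> None:
        self._backend = backend

    def select(self, *fields: Field, where: Tuple[Filter, ...] = ()) -> List[Row]:
        query = Query("patients", tuple(f.name for f in fields), tuple(where))
        return self._backend(query)


def in_memory_backend(rows: List[Row]) -> Backend:
    """Stand-in for a real source so the sketch runs without a database."""

    def run(query: Query) -> List[Row]:
        matched = [
            r for r in rows if all(r[f.column] == f.value for f in query.filters)
        ]
        return [{c: r[c] for c in query.columns} for r in matched]

    return run


SAMPLE_ROWS = [
    {"patient_id": 1, "state": "NY"},
    {"patient_id": 2, "state": "CA"},
]
patients = Patients(in_memory_backend(SAMPLE_ROWS))
result = patients.select(Patients.patient_id, where=(Patients.state.eq("NY"),))
```

Replacing `in_memory_backend` with a backend that talks to a different storage system would leave the `select` call unchanged, which is the property the paragraph above describes.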
When users run queries, a process called dpm-agent enforces any access policies that you’ve configured before routing the query to the source. With org, package, and row-level ACL rules, you can ensure that only the right individuals can access the data. Since they are not coupled to specific databases or other infrastructure components, you won’t have to redefine these policies if/when you migrate storage systems in the future.
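The enforcement flow can be sketched roughly as follows. This is an illustration of the idea, not dpm-agent's implementation: `Principal`, `PackagePolicy`, `AccessDenied`, and `run_with_policy` are hypothetical names, and the source is modeled as a plain function returning rows. Because the policy refers only to principals and row contents, the same policy object works no matter which source sits behind it.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Set

Row = Dict[str, Any]


class AccessDenied(Exception):
    """Raised when an org-level or package-level check fails."""


@dataclass(frozen=True)
class Principal:
    user: str
    org: str


@dataclass
class PackagePolicy:
    allowed_orgs: Set[str]  # org-level rule
    allowed_users: Set[str]  # package-level rule
    row_filter: Callable[[Principal, Row], bool]  # row-level rule


def run_with_policy(
    principal: Principal,
    policy: PackagePolicy,
    source: Callable[[], List[Row]],
) -> List[Row]:
    # Check coarse-grained rules first, so the source is never queried
    # on behalf of a caller who has no access at all.
    if principal.org not in policy.allowed_orgs:
        raise AccessDenied(f"org {principal.org!r} has no access")
    if principal.user not in policy.allowed_users:
        raise AccessDenied(f"user {principal.user!r} has no access to this package")
    # Row-level rule: only return rows this principal may see.
    return [row for row in source() if policy.row_filter(principal, row)]


def warehouse_source() -> List[Row]:
    """Stand-in for the real source; swapping it does not touch the policy."""
    return [
        {"claim_id": 10, "org": "acme"},
        {"claim_id": 11, "org": "globex"},
    ]


policy = PackagePolicy(
    allowed_orgs={"acme", "globex"},
    allowed_users={"alice"},
    row_filter=lambda p, row: row["org"] == p.org,
)
alice_rows = run_with_policy(Principal("alice", "acme"), policy, warehouse_source)
```

In this sketch, migrating storage would mean replacing `warehouse_source` with another function; `policy` stays exactly as written.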
DPM Cloud also provides additional package configurations depending on your consumer: accelerated package queries for ultra-low-latency use cases, reliable schema evolution on live datasets, and data snapshotting for reproducibility.
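One way to think about schema evolution is to borrow the versioning rules of code packages. The sketch below is an assumption about how such a check could work, not a description of DPM Cloud's behavior: the `required_bump` function and the mapping of schema changes to major, minor, and no-op version bumps are hypothetical. Removing or retyping a column breaks existing consumers, so it calls for a major version; adding a column does not, so a minor version suffices.

```python
from typing import Dict

# Column name -> type name, e.g. {"patient_id": "int"}.
Schema = Dict[str, str]


def required_bump(old: Schema, new: Schema) -> str:
    """Classify a schema change by the version bump it would require."""
    # A column that disappeared or changed type breaks existing queries.
    broken = [col for col, col_type in old.items() if new.get(col) != col_type]
    if broken:
        return "major"
    # New columns are additive: old queries keep working unchanged.
    if set(new) - set(old):
        return "minor"
    return "none"


v1 = {"patient_id": "int", "state": "str"}
v2_additive = {"patient_id": "int", "state": "str", "zip_code": "str"}
v2_breaking = {"patient_id": "str", "state": "str"}

additive_bump = required_bump(v1, v2_additive)
breaking_bump = required_bump(v1, v2_breaking)
```

Consumers pinned to an older version of the package would keep seeing the schema they built against until they choose to upgrade.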
Embedded access policies, versioned query interfaces with type-safety, and package metadata are designed to make your data consumers immediately productive. Meanwhile, we’ve drawn inspiration from code package management to reduce the dependency cost between the provider and consumer.
If you’d like to get early access to Data Packages, please share your email at www.dpm.sh. If you’re interested in Patch’s managed infrastructure for versioning, access controls, and query performance, sign up at patch.tech/request-invite.