Data Package Manager: the missing bridge in enterprise data architecture
Distributing datasets is not conceptually different from distributing code, and the process deserves a tool designed specifically for the problem. Package managers have been the standard way to distribute source code for the past several decades, and that proven workflow can be applied to datasets.
With Data Packages, an engineer installs the dataset directly into their code as though it were a normal library dependency.
Introducing Data Packages
This abstraction solves several key problems in building datasets into applications. Today, upstream schema changes break data pipelines and, in turn, application code. Data Packages prevent this by decoupling the client from the infrastructure, so stable versions of a dataset continue to be served even when upstream systems break.
Next, Data Packages decouple application code from specific data infrastructure so that a dataset can be moved to different storage systems by data teams without impacting application teams.
Finally, Data Package versioning enables rollbacks and proper testing against newer versions, achieving a true CI/CD flow.
Data packages make engineers immediately productive with data by decoupling applications from data infrastructure.
How a Data Package works
There are some key technical problems that make packaging data quite hard to do at scale. At Patch, solving these problems is our mission. The interface to our solution is a fully open source package manager for data called `DPM`, which stands for Data Package Manager.
The package has two parts. First, a descriptor file formally describes the dataset's schema, version, storage engine, and other helpful metadata. Second, a code-generated, type-safe query builder is derived from the dataset's schema. The query interface makes it possible to express any valid query function for a given column type. The functions are generic and not coupled to specific data stores; performance optimization is pushed upstream to the infrastructure and is not a concern of the application.
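As a sketch of the idea (the exact descriptor format `DPM` uses may differ, and all names here are hypothetical), a descriptor for a sales dataset backed by a warehouse table might look like:

```json
{
  "name": "sales-analytics",
  "version": "1.2.0",
  "description": "Daily sales facts, partitioned by region",
  "source": {
    "type": "snowflake",
    "path": "ANALYTICS.PUBLIC.SALES"
  },
  "tables": [
    {
      "name": "sales",
      "fields": [
        { "name": "region", "type": "string" },
        { "name": "sale_date", "type": "date" },
        { "name": "amount", "type": "number" }
      ]
    }
  ]
}
```

Everything the code generator needs to emit a typed query builder — table names, column names, and column types — is present in the descriptor, so the generated client never has to introspect the live database.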
Once the package is generated, it can be published to any public or private package repository such as PyPI or npm. Then, in the application code, the package is installed and imported just like any other code library dependency. After importing, you can immediately run queries without any of the typical database, infrastructure, or API-client boilerplate.
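To make the code-generated, type-safe query builder concrete, here is a minimal, self-contained sketch of the pattern in Python. The `Table` and `Field` classes and their methods are illustrative assumptions, not `DPM`'s actual generated API; in a real package they would be emitted from the descriptor rather than written by hand:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Field:
    """A typed column reference that builds filter expressions."""
    name: str

    def eq(self, value):
        return f"{self.name} = {value!r}"

    def gt(self, value):
        return f"{self.name} > {value!r}"


@dataclass
class Table:
    """A minimal fluent query builder over a named dataset."""
    name: str
    _selected: list = field(default_factory=list)
    _filters: list = field(default_factory=list)

    def select(self, *fields: Field):
        self._selected.extend(f.name for f in fields)
        return self

    def where(self, condition: str):
        self._filters.append(condition)
        return self

    def to_sql(self) -> str:
        cols = ", ".join(self._selected) or "*"
        sql = f"SELECT {cols} FROM {self.name}"
        if self._filters:
            sql += " WHERE " + " AND ".join(self._filters)
        return sql


# These objects would normally be code-generated from the package descriptor.
sales = Table("sales")
region, amount = Field("region"), Field("amount")

query = sales.select(region, amount).where(region.eq("EMEA")).where(amount.gt(100))
print(query.to_sql())
# SELECT region, amount FROM sales WHERE region = 'EMEA' AND amount > 100
```

Because the columns are generated as typed objects, invalid queries (a string comparison on a numeric column, a typo in a column name) surface at build time in the editor rather than at runtime against the database.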
At runtime, Data Packages route queries through an agent that implements the plumbing for versioning, swappable datastores, credentials management, RBAC, and performance optimization. Data Packages never communicate directly with upstream data sources; the `DPM-agent` forms a key interface between the application and data planes.
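One way to picture this interface: instead of opening a database connection, the package describes the query and hands it to the agent, which holds the credentials and routing. The request shape below is a hypothetical sketch, not `DPM-agent`'s actual wire protocol:

```python
import json


def build_agent_request(package: str, version: str, query: dict) -> str:
    """Serialize a backend-agnostic query for a hypothetical agent endpoint."""
    payload = {
        "package": package,
        "version": version,  # pins the dataset version for stable results
        "query": query,      # generic query description; no datastore details
    }
    return json.dumps(payload, sort_keys=True)


request_body = build_agent_request(
    "sales-analytics",
    "1.2.0",
    {"select": ["region", "amount"],
     "where": [{"field": "region", "eq": "EMEA"}]},
)
print(request_body)
```

Note what the payload does not contain: hostnames, credentials, or SQL dialect. Those live behind the agent, which is what lets a data team move the dataset to a different storage system without breaking the application.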
Demand for data engineering within organizations has never been higher. This growth is straining both data producers and consumers, who cannot scale efficiently.
It is often impossible for data producers to make data consumers immediately productive. Producers have to set up new pipelines, databases, services or caches to satisfy the various production requirements. Meanwhile, consumers lose countless hours educating themselves on the dataset, installing libraries, and reconciling access policies.
Over time, changes to data schemas, computation, and infrastructure become harder to manage. As a result, the cost of building applications on top of our most valuable data keeps rising.
With Data Packages, data producers and consumers can share datasets with ease. An engineer consuming a Data Package simply installs a dataset like a code library and queries it immediately.
Our mission is to provide the fastest, most robust solution to building complex data applications at enterprise scale. The Data Package is the missing abstraction.
We are excited to announce the beta availability of Data Package Manager. Book a demo with us if you’d like to learn more!