Skip to main content

Subset Data

Introduction

Subsetting is useful to reduce the size of a large dataset so that it is usable in another environment with less resources. For example, if you have a large 100gb database, you'll likely want to filter that down to be able to use it locally. Additionally, for teams spinning up databases in their CI pipelines, they often pay by the minute if the CI pipeline is hosted. As a result, teams are often looking for ways to scale down their dataset size so that it is usable in different environments. This is where subsetting comes into play.

Subsetting

Neosync can help you subset your data by taking in a SQL statement of how you want to filter your data on a table-by-data. This gives you a flexible way of building your destination data set. Once you've connected Neosync to your source database and configured your schema and mappings, you can then decide to subset that data further by selecting a source table to start with.

subset

Neosync will automatically ensure relational integrity in the data, making sure that the resulting dataset, post-subset, still has all of the foreign key constraints you had in the original data set. Once you've subsetted the data, Neosync will push the result set to your destination(s).

Advanced Subsetting with Foreign Key Constraints

Neosync introduces an advanced subsetting feature that leverages foreign key constraints to automatically subset child tables based on the subset selections of their parent tables. This functionality is particularly useful when working with complex schemas where maintaining relational integrity across tables is crucial.

When you specify a subset condition on a parent table, Neosync automatically applies this subset condition to any child tables linked through foreign key constraints. This ensures that only relevant data is included in your subset, maintaining both the efficiency of the subset process and the integrity of your data.

Additionally, Neosync is adept at managing self-referencing tables and circular dependencies, provided there is at least one nullable column within the circular dependency cycle to serve as a viable entry point in your database schema. This advanced feature significantly simplifies the process of creating subsets from complex databases, ensuring that all related data is cohesively maintained.

Enabling this advanced subsetting option is straightforward and can be accessed in the new job flow and job details page, specifically under the Subset tab.

Conclusion

Neosync has powerful subsetting features which allow you to create smaller subsets of your data while maintaining relational integrity. This is useful for local and CI testing where you don't want or need the entire dataset but don't want to spend time querying, joining and filtering the data yourself.