One of the most important concepts in analytics, machine learning, and data science is obtaining a workable dataset to analyze, and being able to consistently repeat any training, testing, evaluation, or model creation. In the world of Big Data and large Data Warehouses, taking a smaller subset to make sense of can often be a large percentage of the overall work of producing a potentially valuable result.
For example, if a telco company has billions of records for a year’s worth of data, processing that volume with most tools could take an inordinate amount of time. And if that data has critical relationships spread across other entities, or, as in most data subsetting scenarios, referential integrity to other tables, as in a Data Warehouse, then just pulling a percentage of the transactional records is a daunting manual task. The reason, which may be obvious to you, is that the matrix of 1-to-n relationships, where one table is based on another table which might have a many-to-many join to yet another table, is more of a network graph than a linear relationship. Therefore some level of computation, or algorithmic parsing, is required. And that is just for finding the primary relationships from a single selected table, for example a fact table in a data warehouse: retrieving a percentage of the fact table records and automatically pulling in only the referential-integrity rows from other tables that relate to the records falling within that selected percentage.
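To make the "network graph, not a linear relationship" point concrete, here is a minimal sketch of walking a foreign-key graph breadth-first from a chosen fact table to find every table a subset must also pull rows from. The table names and the `fk_edges` mapping are hypothetical stand-ins for whatever a real tool would read from the database catalog.

```python
from collections import deque

# Hypothetical foreign-key graph: each table maps to the tables it references.
fk_edges = {
    "phone_calls": ["customer", "consumer_phone", "geography", "carrier"],
    "consumer_phone": ["customer"],
    "customer": ["geography"],
    "geography": [],
    "carrier": [],
}

def tables_to_subset(start, edges):
    """Breadth-first walk of the FK graph from the chosen fact table,
    returning every table whose rows must come along to keep the
    subset referentially intact."""
    seen, queue = {start}, deque([start])
    while queue:
        table = queue.popleft()
        for referenced in edges.get(table, []):
            if referenced not in seen:
                seen.add(referenced)
                queue.append(referenced)
    return seen

print(sorted(tables_to_subset("phone_calls", fk_edges)))
# → ['carrier', 'consumer_phone', 'customer', 'geography', 'phone_calls']
```

Note that even in this tiny example, "customer" is reached two ways (directly, and via "consumer_phone"), which is exactly why the traversal has to track what it has already seen rather than follow a simple chain.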
So using Data Subsetting, one could have a viable sample set of data from which to analyze and do meaningful work, using the actual relationships carried by the main table or tables in question, without gaps in the data. Not having gaps is very important. Imagine in the telco example that all telco transactions (usually something like tower pings, etc., but let’s just say phone call initiations from consumer cell phones) were stored in a table named “Phone Calls”, and to allow dimensional references, a few foreign key tables were used: Consumer Phone, Customer, Geography, and Carrier. If you wanted a random 3% of the 1 billion phone calls in the “Phone Calls” table, how would you ensure that the 3% subset captures 100% of the relevant Carrier, Customer, Consumer Phone, and Geography rows that have a direct relationship to those call transactions, without missing any relationships? If you miss a relationship, you would have sparse data, with gaps in your queries, testing, etc., and your data would not behave like real data; it would be in context, but almost less valuable than generated synthetic data, because it would be faulty.
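The 3% example above can be sketched in a few lines of pandas: sample the fact table first, then pull only the dimension rows the sample actually references. This is a toy illustration with made-up column names and randomly generated data, not how any particular subsetting product works, but it shows the gap-free property we care about.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)  # fixed seed so the subset is repeatable

# Toy stand-ins for the telco schema (hypothetical columns and sizes).
customers = pd.DataFrame({"customer_id": range(100)})
carriers = pd.DataFrame({"carrier_id": range(5)})
phone_calls = pd.DataFrame({
    "call_id": range(10_000),
    "customer_id": rng.integers(0, 100, 10_000),
    "carrier_id": rng.integers(0, 5, 10_000),
})

# Step 1: take a random 3% sample of the fact table.
call_sample = phone_calls.sample(frac=0.03, random_state=7)

# Step 2: pull ONLY the dimension rows the sampled facts reference,
# so every foreign key in the subset resolves — no gaps, no dead joins.
customer_subset = customers[customers["customer_id"].isin(call_sample["customer_id"])]
carrier_subset = carriers[carriers["carrier_id"].isin(call_sample["carrier_id"])]

# Every foreign key in the sample now resolves inside the subset.
assert call_sample["customer_id"].isin(customer_subset["customer_id"]).all()
assert call_sample["carrier_id"].isin(carrier_subset["carrier_id"]).all()
```

A real tool has to repeat step 2 transitively across every related table, handle many-to-many joins, and do it against billions of rows instead of an in-memory DataFrame, which is exactly why this becomes a product category rather than a one-off script.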
There are a few tools that we’ve used in the past to help us drive sample sets for demos, training, troubleshooting, and piloting our own software and solutions for customers. Here are a few that we’ve worked with:
We’ve also found a newcomer to the data subsetting stage, tonic.ai, with an open source solution and a SaaS offering that adds more horsepower.
At the end of the day, getting your hands on data, good data, is always going to have some challenges. It’s like reaching into the cookie jar for one cookie: usually you come out with crumbs on your fingers, or with more cookies than you went in for in the first place. The challenge is to have a clean, repeatable process to select just what you want from the data, then wash, rinse, and repeat as necessary. Without this ability, and especially without the ability to automate a surely painful and time-sucking manual process, data management becomes a job that detracts from time that should be spent bringing the value out of the data, rather than just getting data out of the data.