Machine learning models are only as good as the data that they’re fed, so dataset management is a crucial component of your pipeline.
This involves any actions taken to maintain a dataset over the course of that data’s lifecycle: acquiring the data, ensuring it’s properly documented, storing it in a way that complies with regulations, monitoring it for quality, and, eventually, disposing of it when it’s no longer useful.
Dataset management begins with ensuring that your organization has a firm policy in place for how its data will be managed. Here are some factors to consider when developing your organization’s dataset management policies.
By performing effective dataset management, you can enhance the value that your organization gains from its data. Collecting data can be extremely time-consuming, requiring engineers and analysts to find appropriate data for a given model and to perform any necessary data labeling and cleaning.
Well-maintained datasets can often be reused for different use cases, allowing organizations to move quickly through the data acquisition step without needing to start data collection from the beginning. For example, datasets containing lidar-labeled images of roads are frequently used in the research and development of autonomous vehicles. A similar dataset could also be used to examine the quality of roads to detect locations where infrastructure could be improved.
With dataset management, you can also monitor your data to ensure that quality stays consistent over time. Since high-quality data produces better model performance, data monitoring helps your organization generate the best possible models.
Since dataset management covers the entirety of a dataset’s lifecycle, there are many factors to keep in mind when creating a data policy.
When approaching a new ML problem, start by evaluating the data that you need for the project. Is the data already on hand, or does it need to be acquired? In some situations, you can acquire data from open-source datasets, but in others you may need to collect it manually. And once the data has been collected, you may need to label it.
Data acquisition is a deceptively simple aspect of dataset management; finding the right data can be a time-consuming process, so it can be beneficial to store datasets in a way that optimizes their reusability. For example, datasets should include identifying keys that allow related datasets to be easily joined and integrated.
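As a minimal sketch of why identifying keys matter, the hypothetical datasets below (the names `readings`, `locations`, and `sensor_id` are illustrative, not from any real project) can be integrated in a single join because they share a key:

```python
import pandas as pd

# Two hypothetical datasets that share an identifying key, "sensor_id".
readings = pd.DataFrame({
    "sensor_id": [101, 102, 103],
    "reading": [0.42, 0.57, 0.39],
})
locations = pd.DataFrame({
    "sensor_id": [101, 102, 103],
    "road_segment": ["A1", "A2", "B7"],
})

# Because both datasets carry the same key, integration is a one-line join.
combined = readings.merge(locations, on="sensor_id", how="inner")
print(combined)
```

Without a shared key, the same integration would require fuzzy matching or manual reconciliation, which is exactly the time sink that reusable datasets are meant to avoid.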
Once you’ve acquired data, there’s still the question of how you should store it. While some organizations choose to store their data on-site, cloud-based services provide secure infrastructure for storing large amounts of data, while reducing the need for on-site storage. Because the amount of data available to organizations is growing exponentially, the cloud is becoming an increasingly popular option for storage. When choosing a cloud-based service, organizations should ensure that its security complies with the regulations for the type of data being stored.
Most organizations currently use data warehouses or data lakes for storage, but a newer trend for cloud-based solutions is the use of data fabrics: distributed environments that connect and manage data sources across an organization. While data warehouses exist primarily to store and transfer data to end users, data fabrics provide an entire environment for performing data analytics in the cloud.
Metadata is essentially data about data. For datasets, this includes information such as tags and timestamps. Clear metadata is vital for ensuring data is interpreted correctly and incorporated properly into the analysis. Metadata is typically stored alongside the dataset, and it should be updated following changes to the data.
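One common pattern is a "sidecar" metadata file stored next to the dataset. The sketch below assumes a hypothetical dataset file and fields; the exact schema is up to your organization:

```python
import json
from datetime import datetime, timezone

# Hypothetical metadata record for a dataset file (names are illustrative).
metadata = {
    "dataset": "road_images_v2.csv",
    "description": "Labeled road images for quality assessment",
    "tags": ["roads", "infrastructure", "labeled"],
    "last_updated": datetime.now(timezone.utc).isoformat(),
    "row_count": 12_500,
    "schema_version": 2,
}

# Write the sidecar file; refresh it whenever the dataset changes.
with open("road_images_v2.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Keeping the metadata in a machine-readable format means downstream tooling, not just humans, can check tags, timestamps, and schema versions before using the data.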
When datasets are updated with new data, inconsistencies within the ingested data can lead to quality control issues. For example, the original dataset may contain a variable that is no longer collected, so that all recent entries are missing data. Additionally, the possible values for a categorical variable may change, causing the values of the new and old data to no longer correspond.
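Both failure modes above can be caught at ingestion time with simple consistency checks. The following sketch assumes a made-up schema (`surface_type`, `lane_count`) purely for illustration:

```python
# Minimal sketch of ingestion-time consistency checks: verify that a new
# batch has the expected columns and that categorical values are recognized.
EXPECTED_COLUMNS = {"id", "surface_type", "lane_count"}   # assumed schema
ALLOWED_SURFACES = {"asphalt", "concrete", "gravel"}      # assumed categories

def check_batch(rows):
    """Return a list of human-readable quality issues found in a batch."""
    issues = []
    for i, row in enumerate(rows):
        missing = EXPECTED_COLUMNS - row.keys()
        if missing:
            issues.append(f"row {i}: missing columns {sorted(missing)}")
        surface = row.get("surface_type")
        if surface is not None and surface not in ALLOWED_SURFACES:
            issues.append(f"row {i}: unknown surface_type {surface!r}")
    return issues

batch = [
    {"id": 1, "surface_type": "asphalt", "lane_count": 2},
    {"id": 2, "surface_type": "cobblestone", "lane_count": 1},  # new category value
    {"id": 3, "lane_count": 4},  # variable no longer collected
]
print(check_batch(batch))
```

Running checks like these on every update turns silent quality degradation into an explicit, reviewable report.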
Because distributions of data tend to change over time, old training data sometimes no longer reflects current, real-world data. In medical applications, for example, health outcomes tend to vary over time. This shift in data distribution is called “time drift,” and it can cause models to underperform on real-world data.
To mitigate time drift, datasets used to model current conditions should be updated regularly with new data. The best rate at which to update a dataset depends on access to updated data as well as the rate at which that data changes. Sensor data, for example, can often be incorporated continuously through streaming updates. Data derived from registries or publicly available databases, however, can be updated only when new data is released, whether that’s daily, monthly, or yearly.
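Drift can also be measured directly. One common heuristic is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time against recent data; a rule of thumb treats PSI above 0.2 as meaningful drift. The sketch below uses made-up sample values and bins:

```python
import math

def psi(expected, actual, bins):
    """Population Stability Index between two samples over shared bins.
    Rule of thumb: PSI > 0.2 signals meaningful distribution drift."""
    def fractions(values):
        counts = [0] * (len(bins) - 1)
        for v in values:
            for i in range(len(bins) - 1):
                if bins[i] <= v < bins[i + 1]:
                    counts[i] += 1
                    break
        total = sum(counts) or 1
        # Smooth empty bins so the log term stays defined.
        return [(c or 0.5) / total for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

bins = [0, 10, 20, 30, 40]
training = [5, 12, 18, 25, 33, 7, 14, 22, 28, 35]    # feature at training time
recent   = [31, 35, 38, 29, 33, 36, 39, 32, 34, 37]  # recent data, shifted upward

score = psi(training, recent, bins)
print(f"PSI = {score:.2f}")
```

A scheduled job that computes PSI (or a similar statistic) per feature gives an early warning before model performance visibly degrades.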
By monitoring changes to the data, you can quickly catch declines in data quality and address these issues as needed.
Version control systems allow users to create multiple snapshots of a dataset. These systems are helpful when a dataset changes over time or when you’re using different versions of a dataset for different models. If you record each version of the dataset, it’s easy to revisit what data you used for each model. This information is especially important in research applications, where the ability to replicate results is vital.
By creating dataset versions, you can also help to detect and eliminate occurrences of data leakage. Data leakage occurs when a model is trained with data that would not be accessible at the time of prediction. This often occurs when validation or testing examples slip into the training set.
Data leakage can also occur if the training set includes “giveaway” features that are strongly correlated with the result but would not be available at the time of prediction. For example, a model for predicting whether a customer will revisit a store might contain data leakage in the form of the date of the customer’s next visit.
By using version control, you can ensure that your train, validation, and test sets are kept firmly separate, and you can revisit the variables included in a model if the performance seems too good to be true.
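One concrete way to enforce that separation is to check for rows that appear in more than one split. This sketch hashes each row's contents (field names are illustrative) and reports any test rows already seen in training:

```python
import hashlib
import json

def row_key(row):
    """Stable hash of a row's contents, used to find duplicates across splits."""
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

def split_overlap(train, test):
    """Return test rows that also appear in the training set."""
    train_keys = {row_key(r) for r in train}
    return [r for r in test if row_key(r) in train_keys]

train = [{"id": 1, "x": 0.5}, {"id": 2, "x": 0.9}]
test  = [{"id": 2, "x": 0.9}, {"id": 3, "x": 0.1}]  # id 2 leaked into the test set

leaked = split_overlap(train, test)
print(leaked)
```

A check like this, run whenever splits are regenerated, catches the most common form of leakage before it inflates your evaluation metrics.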
Many ML models use datasets that contain confidential data. When working with sensitive data, the data owner must regulate who has access to the data and what they are permitted to do with it. Organizations must also comply with regulations related to how data must be stored. Data regulations depend on the region and the type of data being stored, so be aware of the policies that affect your organization.
When a dataset is no longer expected to be used for future analytics, organizations can dispose of their data. Data that may be of use again can be archived, while other data might be entirely destroyed. The data disposal policy must comply with regional security policies.
Building datasets can be a time-consuming process, so it’s important to get as much value out of your data as possible. By managing and documenting your datasets, you can optimize the amount of data that can be reused among projects. Monitoring data for quality will also help you to catch errors early in the development process, leading to higher-quality models.
By maintaining dataset versions, you can also easily roll back your dataset and retrain a model if any errors are identified. For example, if a later version of a dataset includes data leakage or corrupted data, it’s simple enough to roll back to the previous version. Dataset versions also allow analysts to store training, validation, and testing datasets separately and to maintain datasets containing different types of augmented data.
Setting up a firm dataset management policy helps to ensure that your data is managed consistently throughout your organization. By incorporating dataset management early into your ML pipeline, you can set up your projects for success.