by I. Harmon
Introduction
Machine learning has a growing influence on modern life. Machine learning models drive autonomous vehicles, make clinical diagnoses from radiological images, and generate financial predictions. As machine learning becomes more prevalent, the consequences of model errors become more severe. In autonomous vehicles and medical diagnoses, errors may be the difference between life and death; in financial predictions, they can be the difference between wealth and bankruptcy. As data scientists, we always want better-performing models.
Improving model performance can be approached from two sides: improving the model or improving the dataset. Many ML practitioners break machine learning systems into four pieces: the dataset, the model itself, the cost function, and the optimization algorithm. In this article I'm lumping the cost function and optimization algorithm in with the model and considering the dataset separately.
Dataset Importance
In classic machine learning approaches, feature engineering, such as dimensionality reduction or whitening, can significantly improve a model's performance without increasing the number of instances in its dataset. Many deep learning applications, however, have been shown to be more affected by the number of instances. Better models are usually more complex and therefore require more computing power and longer training times for a dataset of a fixed size.
Object detection research, an area of computer vision concerned with training models to recognize objects within images, is heavily dependent on large datasets. We'll use the research of Sun et al. for supporting examples. Figure 1, a graph from Sun's work, shows the progression of state-of-the-art CNNs between 2012 and 2016: as the number of layers grows, performance increases, but so does the number of gigaflops needed to train the model. On the other hand, model performance can also be improved by growing the dataset. In Figure 2, the graph on the left shows the improvement in mean average precision of object detection for the ResNet CNN as the number of layers is increased (increasing the number of layers is equivalent to growing model complexity), while the graph on the right shows the performance increase as the number of instances in the dataset grows. In short, increasing the size of your dataset can improve model performance, and this result has been shown to generalize to other fields and models (a quick way to probe the effect on your own data is sketched below, after the figures).

Figure 1. Sun, Chen, et al. “Revisiting unreasonable effectiveness of data in deep learning era.” Proceedings of the IEEE international conference on computer vision. 2017.

Figure 2. Sun, Chen, et al. “Revisiting unreasonable effectiveness of data in deep learning era.” Proceedings of the IEEE international conference on computer vision. 2017.
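As a practical aside, one way to estimate whether more data would help your own model is to plot a learning curve: train on growing subsets of the training data and track validation performance. The sketch below is a hypothetical, generic example using scikit-learn and synthetic data; it is not drawn from Sun et al.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in dataset; replace with your own X, y.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Train on 10%, 25%, 50%, and 100% of the data and record cross-validated accuracy.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1_000), X, y,
    train_sizes=[0.1, 0.25, 0.5, 1.0], cv=5,
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>5} training instances -> mean CV accuracy {score:.3f}")
# If accuracy is still climbing at the largest training size, more data is likely to help.
```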
The argument for the importance of datasets can be further bolstered by considering the average time between the invention of a new model, the publication of supporting datasets, and when the model exhibits breakthrough performance. Chart 1 shows that, on average, there is an 18-year gap between when a new model is released and when a breakthrough occurs, but only a 3-year gap between the release of the dataset used and the breakthrough.
So the story so far: datasets can improve model performance. But creating a large dataset is expensive and time consuming, especially if the data needs to be labeled. And though there are many datasets available online, these datasets may not meet our research requirements, forcing researchers to create their own. Similarly, in business environments, datasets may need to be frequently updated for models deployed to dynamic online environments. Unsurprisingly, for new models the slowest step in model implementation is dataset creation. This problem is called the data bottleneck. Weak supervision is one solution: in short, weak supervision is a quick way to generate large datasets that trades off accuracy for speed.
Supervision Spectrum
Supervision is about where we get the labels for our data. Most ML practitioners are familiar with supervised and unsupervised learning, but there is actually a spectrum of types of supervision. Several models of learning sit between supervised and unsupervised learning, and some can be considered to the left or right of unsupervised learning depending on who you ask. In Figure 3, the amount of human intervention, the amount of labeled data, and the label accuracy increase from right to left.
Starting from the left, supervised learning requires that all instances are labeled. Ideally, these labels are assigned by hand by a domain expert; this is an example of full supervision. Slightly less optimal is a fully labeled dataset whose labels are not provided by domain experts. Perhaps we use Amazon Mechanical Turk workers to get our labels; in that case some of the labels will probably be wrong, and the dataset will likely not be as accurate as one created by domain experts. This is called crowdsourced supervision. Less accurate still is weak supervision. Depending on whose taxonomy is used, weak supervision is data with noisy labels, usually generated by a computer applying heuristics to a signal within the unlabeled data. In the middle of the spectrum is semi-supervision; again, depending on who you ask, some experts consider semi-supervision a type of weak supervision. Semi-supervised algorithms are able to make use of both labeled and unlabeled data.
There are several flavors of weak supervision, some of which are shown in the blue band at the bottom of Figure 3. The list is not meant to be exhaustive, and while I chose an order that makes sense for many common applications, these methods cannot be ordered as cleanly as those above.
There are other types of supervision not shown on the spectrum, including self-supervision, transfer learning, and reinforcement learning. Each of these can arguably be considered a type of unsupervised learning and thus sits to the left or right of vanilla unsupervised learning, depending on who you ask.
In Chart 2 I break down some of the pros and cons of some of the common types of supervision.
Weak Supervision Applicability
Now we know what weak supervision is. We have seen how it relates to other types of supervision. But when can we use it? What are some of the problems that arise from using it?
To apply weak supervision, we usually have a small amount of labeled training data and a large amount of unlabeled data. Our goal is to somehow create labels for the unlabeled data so that it can be used to train our model.
The foremost requirement for making use of the unlabeled data is that it contains relevant information. Obviously, if the data isn't related to the problem we're trying to solve, it's not going to be helpful. Second, we need to generate enough correctly labeled data to overcome the noise introduced by weak supervision.
Within a dataset there are two types of noise: attribute noise and label noise. Weak supervision introduces label noise, which can be very detrimental to model performance. Several computational learning models have been used to study the effect of label noise on model performance, including the probably approximately correct (PAC) model. Under this model, a lower bound on $m$, the number of instances needed to train a model to within an arbitrary error rate $\epsilon$ with certainty $1-\delta$, given a label error rate $\eta$, is

$$m \geq \frac{2}{\epsilon^{2}\,(1-2\eta)^{2}} \ln\!\left(\frac{2N}{\delta}\right) \tag{1}$$

This equation holds under the assumption that the hypothesis space, of size $N$, is finite. The upshot is that the number of instances needed to train the model to within an arbitrary error rate, with some certainty, grows as the label error rate grows.
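To make the effect concrete, here is a quick back-of-the-envelope calculation of Equation (1). The specific values of $\epsilon$, $\delta$, $N$, and $\eta$ are hypothetical and chosen only for illustration.

```python
import math

def pac_lower_bound(epsilon, delta, N, eta):
    """Lower bound on training set size m from Equation (1):
    m >= 2 / (epsilon^2 * (1 - 2*eta)^2) * ln(2N / delta)."""
    return 2.0 / (epsilon**2 * (1.0 - 2.0 * eta)**2) * math.log(2.0 * N / delta)

# Hypothetical values: 10% target error rate, 95% certainty, 1,000 hypotheses.
epsilon, delta, N = 0.10, 0.05, 1_000

for eta in (0.0, 0.1, 0.2, 0.3):
    print(f"label error rate {eta:.1f} -> m >= {pac_lower_bound(epsilon, delta, N, eta):,.0f}")
# As eta approaches 0.5 the bound blows up: with 20% label noise we need
# roughly 2.8x as many instances as with clean labels.
```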
Creating classifiers that are robust to label noise is an entire area of machine learning research; one recent branch of this work is known as confident learning. More generally, bagging, boosting, and filtering noisy data are techniques that can help most models cope with label noise.
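As an illustration, here is a minimal sketch of one common filtering strategy: score each instance with out-of-fold predictions and drop instances whose given label strongly disagrees with the model. The dataset, model choice, and threshold are hypothetical; treat this as a sketch, not a definitive recipe.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def filter_noisy_labels(X, y, threshold=0.9):
    """Drop instances whose (possibly noisy) label strongly disagrees with
    the out-of-fold predicted class probabilities.

    Assumes X is a NumPy array and y is integer-encoded as 0..k-1."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    # Out-of-fold probabilities: each instance is scored by a model that
    # never saw it during training.
    probs = cross_val_predict(clf, X, y, cv=5, method="predict_proba")
    predicted = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    # Keep an instance if the model agrees with its label, or if the model
    # is not confident enough to overrule it.
    keep = (predicted == y) | (confidence < threshold)
    return X[keep], y[keep]
```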
Weak Supervision Frameworks
Tools are being developed that allow laypersons to quickly create large labeled datasets. These frameworks automate the process of applying weak supervision to unlabeled data. Their benefit is that they combine multiple weak signals into a single label, producing labels that are nearly on par with hand labels. These frameworks can be applied to text, images, and numeric data. Though intended for laypersons, they will inevitably prove useful to computer scientists who need to create their own datasets.
One such framework is Snorkel, an open source weak supervision framework developed at Stanford by Ratner et al. (https://github.com/snorkel-team/snorkel). Starting from a small amount of labeled data and a large amount of unlabeled data, Snorkel lets the user write labeling functions in Python that target multiple signals in the dataset. These labeling functions can generate a dataset that trains a model to within a few F1 points of the same model trained with hand-labeled data. In Snorkel, multiple weak signals, from hand-labeled data and from data labeled by the labeling functions, are used to train a generative model. The generative model then produces probabilistic labels that can be used to train the target model. Figure 4 summarizes the Snorkel workflow pipeline.
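To give a feel for the workflow, here is a minimal sketch of writing labeling functions and combining them with Snorkel's label model. The spam-detection task, the tiny `df_train` dataframe with its `text` column, and the keyword heuristics are hypothetical; consult the Snorkel documentation for the authoritative API.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1  # labeling functions may abstain on an instance

@labeling_function()
def lf_contains_free(x):
    # Crude keyword heuristic: "free" often signals spam.
    return SPAM if "free" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_message(x):
    # Very short messages are usually legitimate in this hypothetical corpus.
    return HAM if len(x.text.split()) < 5 else ABSTAIN

# Toy unlabeled data; in practice this would be thousands of instances.
df_train = pd.DataFrame({"text": [
    "Win a FREE phone today",
    "See you at lunch",
    "FREE entry, claim your prize now",
    "Meeting moved to 3pm",
]})

# Apply every labeling function to every instance, producing an
# (n_instances x n_labeling_functions) matrix of votes.
applier = PandasLFApplier(lfs=[lf_contains_free, lf_short_message])
L_train = applier.apply(df=df_train)

# The generative label model estimates the accuracies of the labeling
# functions and combines their votes into probabilistic labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=500, seed=123)
probabilistic_labels = label_model.predict_proba(L=L_train)
```

In practice you would write many such labeling functions over a large pool of unlabeled instances, and the resulting probabilistic labels would be used to train the final discriminative model.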
Newer research in this area includes the frameworks Snuba and Osprey, which add automated generation of labeling functions and support for imbalanced datasets, respectively. Using a framework like Snorkel, a dataset that would have taken months or years to create by hand can be built in a matter of days.
Conclusion
When you need a large dataset fast, weak supervision is an excellent option: it is a quick way to create a dataset, but accuracy is traded for speed. The label noise introduced by weakly labeled data will reduce performance, but it can be managed through techniques like filtering, boosting, and bagging. As frameworks such as Snorkel and Osprey become more widely used, we'll continue to see accelerated gains in model performance.
References
- Angluin, Dana, and Philip Laird. “Learning from noisy examples.” Machine Learning 2.4 (1988): 343-370.
- Bringer, Eran, et al. “Osprey: Weak supervision of imbalanced extraction problems without code.” Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning. 2019.
- Ratner, Alexander, et al. “Snorkel: Rapid training data creation with weak supervision.” Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases. Vol. 11. No. 3. NIH Public Access, 2017.
- Sun, Chen, et al. “Revisiting unreasonable effectiveness of data in deep learning era.” Proceedings of the IEEE international conference on computer vision. 2017.
- Varma, Paroma, and Christopher Ré. “Snuba: Automating weak supervision to label training data.” Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases. Vol. 12. No. 3. NIH Public Access, 2018.