In this article we introduce and broadly explore the topic of weak supervision in machine learning. Weak supervision is a learning paradigm that has gained notable attention in recent years. Simply put, full supervision requires a training set (x, y) where y is the correct label for x; weak supervision instead assumes the more general scenario (x, y') where y' does not have to be correct (i.e., it is potentially incorrect; a weak label). Moreover, in weak supervision we can have multiple weak supervisors, so each example can take the form (x, y'1, y'2, …, y'F), where each y'f comes from a different source and is potentially incorrect.
Table of Contents
∘ Problem Statement
∘ General framework
∘ General Architecture
∘ Snorkel
∘ Example of weak supervision
Problem Statement
In more practical terms, weak supervision helps solve what I like to call the supervised machine learning dilemma. If you're a company or an individual with a new idea in machine learning, you're going to need data. It's often not that hard to collect a lot of samples (x1, x2, …, xm), and sometimes this can even be done on a scheduled basis. However, the real dilemma is that you will need to hire human annotators to label this data and pay about $Z per label. The problem is not only that you don't know whether the project is worth the investment, but also that you may not be able to afford annotators in the first place, since this process can be quite expensive, especially in fields like law and medicine.
You may be thinking: but how does weak supervision fix all of this? In simple terms, instead of paying annotators to give you labels, you ask them to give you generic rules that can sometimes be inaccurate when labeling data (which requires much less time and money). In some cases, it may even be trivial for your development team to devise these rules themselves (for example, if the task does not require expert annotators).
Now let's think about an example use case. Suppose you are building an NLP system that masks words corresponding to sensitive information, such as phone numbers, names, and addresses. Instead of hiring people to tag words in a corpus of sentences you've collected, you write some functions that automatically tag all the data based on, say, whether the word is all digits (probably, but not certainly, a phone number), or whether the word starts with a capital letter while not being at the beginning of the sentence (probably, but not certainly, a name), and then train your system on the weakly labeled data. You might think that the trained model can't be better than those labeling sources, but that's wrong; weak supervision models are designed to generalize beyond their labeling sources, acknowledging that uncertainty exists and often taking it into account in one way or another.
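The two rule ideas above (all-digit tokens, capitalized non-initial tokens) can be sketched as labeling functions. This is a minimal illustration, not skweak or Snorkel code; the label names and the convention of returning None for abstention are our own:

```python
# Hypothetical labels; None means the function abstains.
PHONE, NAME, ABSTAIN = "PHONE", "NAME", None

def lf_all_digits(word, position):
    """Probably (but not certainly) a phone number if the token is all digits."""
    return PHONE if word.isdigit() else ABSTAIN

def lf_capitalized(word, position):
    """Probably a name if capitalized and not at the start of the sentence."""
    if position > 0 and word[0].isupper():
        return NAME
    return ABSTAIN

sentence = "Call Alice at 5551234".split()
weak_labels = [
    [lf(word, i) for lf in (lf_all_digits, lf_capitalized)]
    for i, word in enumerate(sentence)
]
# Each token now has one (possibly wrong, possibly abstaining) vote per function.
```

Note that both functions are noisy: "Alice" is correctly flagged as a name, but a year written as digits would be wrongly flagged as a phone number; the weak supervision machinery is what deals with this noise.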
General framework
Now let us formally look at the weak supervision framework as it is employed in natural language processing.
✦ Given
A set of F labeling functions {L1, L2, …, LF}, where Lf assigns a weak (i.e., potentially incorrect) label to a given input x. Any labeling function Lf can be:
- Crowdsourced annotator (sometimes not so accurate)
- A label obtained through distant supervision (i.e., extracted from an external knowledge base)
- Weak model (e.g., inherently weak or trained on another task)
- Heuristic function (e.g., assigning labels based on the presence of a keyword or pattern, or defined by a domain expert)
- Gazetteers (e.g., assigning labels based on appearance in a specific list)
- Prompting an LLM with a specific prompt P (recent work)
- Any function in general that (preferably) performs better than random guessing at the label of x.
It is generally assumed that Lf may abstain from giving a label (for example, a heuristic function such as "if the word contains digits, label it as a phone number; otherwise, do not label it").
Suppose the training set has N examples; then this data is equivalent to an (N, F) matrix of weak labels in the case of sequence classification. For token classification with sequences of length T, it is an (N, T, F) matrix of weak labels.
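As a concrete sketch of this matrix (assuming abstentions are encoded as -1, a common convention, and toy sizes chosen for illustration):

```python
import numpy as np

N, F = 4, 3  # toy sizes: 4 examples, 3 labeling functions
ABSTAIN = -1

# Each row holds the F (possibly conflicting) votes for one example.
weak_label_matrix = np.array([
    [1, 1, ABSTAIN],         # two functions agree on class 1
    [0, ABSTAIN, 0],
    [1, 0, ABSTAIN],         # the functions conflict here
    [ABSTAIN, ABSTAIN, 1],   # only one function fires
])
assert weak_label_matrix.shape == (N, F)
# For token classification over sequences of length T, the analogous
# structure would be an (N, T, F) array of votes.
```

The job of the label model described below is to collapse each row of this matrix into a single (usually soft) label.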
✦ Sought
Train a model M that effectively leverages the weakly labeled data, along with strongly labeled data if it exists.
✦ Common NLP Tasks
- Sequence classification (e.g., sentiment analysis) or token classification (e.g., named entity recognition), where the labeling functions are typically heuristic functions or gazetteers.
- Low-resource translation (x→y), where the labeling function is usually a weaker translation model in the reverse direction (y→x) used to add more (x, y) translation pairs.
General Architecture
For sequence or token classification tasks, the most common architecture in the literature takes this form:
The label model learns how to map the outputs of the labeling functions to probabilistic or deterministic labels that are used to train the final model. In other words, it takes the (N, F) or (N, T, F) matrix of labels discussed above and returns an (N) or (N, T) array of labels (which are often probabilistic, i.e., soft).
The final model is used separately after this step and is just an ordinary classifier operating on the soft labels produced by the label model (cross-entropy loss allows this). Some architectures use deep learning to fuse the label and final models.
Note that once we have trained the label model, we use it to generate the labels for the final model, and after that the label model is no longer used. In this sense, this is quite different from ensembling, even if the labeling functions are other machine learning models.
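Before looking at Snorkel's learned label model, it helps to see the simplest possible baseline: a majority vote over the non-abstaining functions. This is a minimal sketch (abstain encoded as -1; the function name is ours), not what Snorkel actually does:

```python
import numpy as np

ABSTAIN = -1

def majority_vote_soft_labels(votes, n_classes):
    """Map an (N, F) matrix of weak votes to (N, n_classes) soft labels
    by normalized vote counts, ignoring abstentions."""
    soft = np.zeros((len(votes), n_classes))
    for i, row in enumerate(votes):
        active = row[row != ABSTAIN]
        if len(active) == 0:
            soft[i] = np.full(n_classes, 1.0 / n_classes)  # no signal: uniform
        else:
            counts = np.bincount(active, minlength=n_classes)
            soft[i] = counts / counts.sum()
    return soft

votes = np.array([[1, 1, ABSTAIN],   # unanimous (among active votes)
                  [0, 1, 1]])        # 1 vote vs. 2 votes
soft = majority_vote_soft_labels(votes, n_classes=2)
```

Majority vote treats every labeling function as equally reliable and independent; Snorkel's contribution, discussed next, is to learn accuracies and correlations instead.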
Another architecture, which is the default in the case of translation (and less common for sequence/token classification), is to weight each weak example pair (src, trg) based on its quality (usually there is just one labeling function for translation, namely a weak model in the reverse direction, as explained above). That weight can then be used in the loss function so that the model learns more from higher-quality examples and less from lower-quality ones. Approaches in this setting try to devise methods to estimate the quality of a specific example. One approach, for example, uses a round-trip BLEU score (i.e., translating the sentence to the target language and then back to the source) to estimate such a weight.
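The round-trip idea can be sketched as follows. This is a toy illustration only: we use a crude token-overlap F1 as a stand-in for a real BLEU implementation, and the two translation models are hypothetical callables:

```python
def token_f1(a, b):
    """Crude stand-in for a round-trip BLEU score: token-set overlap F1."""
    a_set, b_set = set(a.split()), set(b.split())
    if not a_set or not b_set:
        return 0.0
    overlap = len(a_set & b_set)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(b_set), overlap / len(a_set)
    return 2 * p * r / (p + r)

def example_weight(src, translate_fwd, translate_back):
    """Translate src to the target language and back, then score how much
    of the original survives; use the score as a per-example loss weight."""
    round_trip = translate_back(translate_fwd(src))
    return token_f1(src, round_trip)

# Toy stand-ins for the forward and backward translation models:
weight = example_weight("the cat sat",
                        lambda s: "le chat assis",
                        lambda s: "the cat sat down")
```

A high weight means the pair survives the round trip largely intact, so the loss on that pair is trusted more; a noisy pair that degrades in the round trip is down-weighted.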
snorkel
To see an example of how the label model can work, we can look at Snorkel, which is possibly the most fundamental work in weak supervision for sequence classification.
In Snorkel, the authors were interested in finding P(yi | Λ(xi)), where Λ(xi) is the weak label vector of the i-th example. Clearly, once this probability is found, we can use it as a soft label for the final model (because, as we said, cross-entropy loss can handle soft labels). Also clearly, if we have the joint P(y, Λ(x)), then we can easily use it to find P(y | Λ(x)) by normalizing over y.
To model P(y, Λ(x)), they used the same hypothesis as logistic regression: P_w(Λ(xi), yi) = (1/Z_w) exp(w · φ(Λ(xi), yi)), where Z_w is for normalization as in sigmoid/softmax. The difference is that instead of w · x we have w · φ(Λ(xi), yi). In particular, φ(Λ(xi), yi) is a vector of dimensionality 2F + |C|. F is the number of labeling functions as mentioned above; meanwhile, C is the set of pairs of labeling functions that are correlated (hence, |C| is the number of correlated pairs). The authors refer to a method in another paper for automating the construction of C, which we will not go into here for brevity.
The vector φ(Λ(xi),yi) contains:
- F binary elements to specify whether each of the labeling functions abstained on the given example
- F binary elements to specify whether each of the labeling functions' votes equals the label y (here y is left as a variable; it is an input to the distribution) for the given example
- |C| binary elements to specify whether each correlated pair of labeling functions made the same vote on the given example
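A minimal sketch of how such a feature vector could be assembled (abstain encoded as -1; the helper name `phi` is ours, mirroring the φ above):

```python
ABSTAIN = -1

def phi(votes, y, correlated_pairs):
    """Build the 2F + |C| feature vector: F abstain indicators,
    F agreement-with-y indicators, and one pairwise-agreement
    indicator per correlated pair of labeling functions."""
    abstains = [1 if v == ABSTAIN else 0 for v in votes]
    agrees = [1 if v == y else 0 for v in votes]
    pair_agree = [1 if votes[j] == votes[k] else 0
                  for (j, k) in correlated_pairs]
    return abstains + agrees + pair_agree

votes = [1, ABSTAIN, 1]   # outputs of F = 3 labeling functions
C = [(0, 2)]              # one correlated pair, so |C| = 1
features = phi(votes, y=1, correlated_pairs=C)
assert len(features) == 2 * 3 + len(C)   # 2F + |C| = 7
```

The weight learned for each agreement indicator effectively encodes how accurate that labeling function is, while the pairwise indicators let the model discount functions that merely copy each other.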
They then train the label model (i.e., estimate the weight vector of length 2F + |C|) by solving the following objective, which minimizes the negative log marginal likelihood: ŵ = argmin_w −Σᵢ log Σ_y P_w(Λ(xᵢ), y).
Note that they do not need information about y, since this objective is solved independently of any specific value of y, as indicated by the sum over y. If you look closely (undo the negative and the log), you can see that this amounts to finding the weights that maximize the probability of the observed weak labels under any of the possible true labels.
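To make the objective concrete, here is a brute-force sketch that evaluates the negative log marginal likelihood for a toy problem. The partition function is computed by enumerating every possible (votes, y) configuration, which is tractable only at toy scale; real implementations estimate it instead. The helper names are ours:

```python
import itertools, math

ABSTAIN = -1

def phi(votes, y, pairs):
    """The 2F + |C| feature vector described above."""
    return ([1 if v == ABSTAIN else 0 for v in votes]
            + [1 if v == y else 0 for v in votes]
            + [1 if votes[j] == votes[k] else 0 for j, k in pairs])

def log_Z(w, F, classes, pairs):
    """Partition function: sum over every (votes, y) configuration."""
    total = 0.0
    for votes in itertools.product([ABSTAIN] + classes, repeat=F):
        for y in classes:
            total += math.exp(sum(wi * fi for wi, fi in zip(w, phi(votes, y, pairs))))
    return math.log(total)

def neg_log_marginal_likelihood(w, vote_matrix, classes, pairs):
    """-sum_i log sum_y P_w(votes_i, y): note no true labels are needed,
    because y is marginalized out."""
    F = len(vote_matrix[0])
    lZ = log_Z(w, F, classes, pairs)
    nll = 0.0
    for votes in vote_matrix:
        s = sum(math.exp(sum(wi * fi for wi, fi in zip(w, phi(votes, y, pairs))))
                for y in classes)
        nll -= math.log(s) - lZ
    return nll

votes = [[1, 1], [0, ABSTAIN]]
w0 = [0.0] * (2 * 2 + 0)  # 2F + |C| weights, initialized to zero
nll = neg_log_marginal_likelihood(w0, votes, classes=[0, 1], pairs=[])
```

Minimizing this quantity over w (e.g., with gradient descent) yields the trained label model.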
Once the label model is trained, they use it to produce N soft labels P(y1|Λ(x1)), P(y2|Λ(x2)), …, P(yN|Λ(xN)) and use them to train some discriminative model (i.e., a classifier) in the usual way.
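As a sketch of this last step, here is a plain logistic regression trained with cross-entropy against soft targets (the gradient of cross-entropy with respect to the logits is simply the predicted probabilities minus the soft targets). The data and function name are illustrative, not from Snorkel:

```python
import numpy as np

def train_on_soft_labels(X, soft_y, lr=0.5, steps=500):
    """Multiclass logistic regression with soft cross-entropy targets."""
    W = np.zeros((X.shape[1], soft_y.shape[1]))
    for _ in range(steps):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient of soft cross-entropy w.r.t. logits is (probs - soft_y).
        W -= lr * X.T @ (probs - soft_y) / len(X)
    return W

# Toy features, with soft labels as a label model might produce them.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
soft_y = np.array([[0.9, 0.1], [0.2, 0.8], [0.8, 0.2], [0.1, 0.9]])
W = train_on_soft_labels(X, soft_y)
logits = X @ W
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
```

The trained classifier ends up reflecting the label model's confidence rather than hard 0/1 votes, which is exactly the point of using soft labels.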
Example of weak supervision
Snorkel has an excellent tutorial on spam classification. Skweak is another essential package (and paper) for weak supervision on token classification. Here is a getting-started example as shown in its GitHub:
First define the labeling functions:
import spacy, re
from skweak import heuristics, gazetteers, generative, utils

### LF 1: heuristic to detect occurrences of MONEY entities
def money_detector(doc):
    for tok in doc[1:]:
        if tok.text[0].isdigit() and tok.nbor(-1).is_currency:
            yield tok.i-1, tok.i+1, "MONEY"
lf1 = heuristics.FunctionAnnotator("money", money_detector)

### LF 2: detection of years with a regex
lf2 = heuristics.TokenConstraintAnnotator("years", lambda tok: re.match("(19|20)\d{2}$", tok.text), "DATE")

### LF 3: a gazetteer with a few names
NAMES = [("Barack", "Obama"), ("Donald", "Trump"), ("Joe", "Biden")]
trie = gazetteers.Trie(NAMES)
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON": trie})
Apply them to the corpus.
# We create a corpus (here with a single text)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Donald Trump paid $750 in federal income taxes in 2016")

# apply the labelling functions
doc = lf3(lf2(lf1(doc)))
Create and fit the label model.
# create and fit the HMM aggregation model
hmm = generative.HMM("hmm", ["PERSON", "DATE", "MONEY"])
hmm.fit([doc]*10)

# once fitted, we simply apply the model to aggregate all functions
doc = hmm(doc)
# we can then visualise the final result (in Jupyter)
utils.display_entities(doc, "hmm")
Then of course you can train a classifier on top of this using the estimated soft labels.
In this article, we explore the problem that weak supervision addresses, provide a formal definition, and describe the general architecture typically employed in this context. We also delve into Snorkel, one of the fundamental models of weak supervision, and conclude with a practical example to illustrate how weak supervision can be applied.
I hope you found the article useful. Until next time, bye.
References
(1) Zhang, J. et al. (2021). WRENCH: A Comprehensive Benchmark for Weak Supervision. arXiv. Available at: https://arxiv.org/abs/2109.11377
(2) Ratner, A. et al. (2017). Snorkel: Rapid Training Data Creation with Weak Supervision. arXiv. Available at: https://arxiv.org/abs/1711.10160
(3) Norwegian Computing Center (2021). NorskRegnesentral/skweak: a software toolkit for weak supervision applied to NLP tasks. GitHub. Available at: https://github.com/NorskRegnesentral/skweak