Harnessing the Power of Weak and Strong Labels
How to differentiate between weak and strong labels when constructing a classifier
Most real-world classification projects suffer from the same problem: there is not enough labeled data to train and validate the classifier. One way to deal with this issue is to augment the training set with artificially labeled data (weak labels). The following article helps you differentiate between weak and strong labels and explains how to account for the difference when building a classifier.
Strong Labels
Strong labels represent highly reliable and accurate annotations assigned to the training data. In a supervised learning scenario, where the model is trained on labeled examples, strong labels indicate a high level of confidence in the correctness of the assigned class. Achieving strong labels typically involves meticulous human annotation, expert knowledge, or high-confidence automated systems.
Weak Labels
Conversely, weak labels are characterized by a lower level of certainty or reliability. These labels can be created using heuristic rules. Heuristic rules are mental shortcuts or rules of thumb that people use to make decisions quickly and efficiently. They are practical strategies that are not necessarily guaranteed to be optimal or rational but are often “good enough” in many situations.
Example
Let’s pretend we work for a bank, and we want to build a classifier that predicts whether a future customer’s loan request will be approved. This system could be used to recommend potential €50,000 loan offers to existing customers.
In total, we have 10,000 customers, but only for a fraction of them, say 200 data points, do we know whether they were approved for a €50,000 loan. To get more data points, we can apply a heuristic rule defined by a domain expert: every customer with an annual income above €100,000 will easily be approved for a loan.
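Applying such a heuristic rule might look like the following sketch with pandas (the toy data and the column names `annual_income`, `approved`, and `label_source` are assumptions for illustration, not from the bank's actual dataset):

```python
import pandas as pd

# Toy customer table: 'approved' holds strong labels where known, NA otherwise
df = pd.DataFrame({
    "annual_income": [120_000, 45_000, 150_000, 80_000],
    "approved": [pd.NA, 0, pd.NA, 1],
})

# Expert heuristic: annual income above 100,000 -> assume the loan is approved
weak_mask = df["approved"].isna() & (df["annual_income"] > 100_000)
df.loc[weak_mask, "approved"] = 1

# Remember which labels are ground truth and which came from the heuristic
df["label_source"] = weak_mask.map({True: "weak", False: "strong"})
print(df)
```

Keeping a `label_source` column is useful later, because it lets us treat weak labels differently from ground truth during training.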
We now get the following validation results from our logistic regression classifier:
- Accuracy using only strong labels: 71%
- Accuracy using strong and weak labels: 80%
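The pooling step behind the second number can be sketched with synthetic stand-in data (all arrays here are hypothetical; the accuracies they produce are not the article's figures):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical data: 200 strongly labeled rows, 800 weakly labeled rows
X_strong = rng.normal(size=(200, 3))
y_strong = (X_strong[:, 0] > 0).astype(int)  # clean ground-truth labels
X_weak = rng.normal(size=(800, 3))
y_weak = (X_weak[:, 0] + rng.normal(scale=0.5, size=800) > 0).astype(int)  # noisy heuristic labels
X_val = rng.normal(size=(300, 3))
y_val = (X_val[:, 0] > 0).astype(int)

# Baseline: train on strong labels only
acc_strong = accuracy_score(
    y_val, LogisticRegression().fit(X_strong, y_strong).predict(X_val))

# Pool strong and weak labels into one training set
X_all = np.vstack([X_strong, X_weak])
y_all = np.concatenate([y_strong, y_weak])
acc_all = accuracy_score(
    y_val, LogisticRegression().fit(X_all, y_all).predict(X_val))

print(f"strong only: {acc_strong:.2f}, strong + weak: {acc_all:.2f}")
```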
As we can see, accuracy increased by combining strong and weak labels. However, this treats weak labels the same as ground truth, which we know is not right. To account for the difference, we can add a new column called weight to the dataframe: each row with a ground-truth label gets the weight 1, and each row with a weak label gets the weight 0.5.
# Train the logistic regression model with per-sample weights:
# strong labels count fully (1.0), weak labels half (0.5)
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train,
          sample_weight=df_combined.loc[X_train.index, 'weight'])
We now get the following results:
- Accuracy using only strong labels: 71%
- Accuracy using strong and weak labels: 80%
- Accuracy using strong and weak labels and weights: 87%
Conclusion
By understanding the difference between strong and weak labels, you can use heuristic rules to gain more data for your classifier. To get the most out of this approach, add a weight to each data point to distinguish between strong and weak labels. Most classifiers in scikit-learn accept a sample_weight argument in their fit method for exactly this purpose.