I've been having a lot of fun at my day job recently experimenting with models from the Hugging Face catalog, and I thought this might be a good time to share what I've learned and give readers some tips on how to apply these models with a minimum of stress.
My specific task recently has been to look at blobs of unstructured text data (think notes, emails, free text comment fields, etc.) and classify them according to categories that are relevant to a business use case. There are lots of ways to do this, and I've been exploring as many as I reasonably can, from simple things like pattern matching and lexicon search to pre-built neural network models covering a number of different functionalities, and I've been moderately pleased with the results.
I think the best strategy is to combine multiple techniques, in some kind of ensemble, to get the best of all the options. I don't necessarily trust these models to get things right often enough (and definitely not consistently enough) to use them on their own, but when combined with more basic techniques they can add to the signal.
For me, as I mentioned, the task is simply to take chunks of text, usually written by a human, with no consistent format or schema, and try to figure out which categories apply to that text. Beyond the analysis methods mentioned above, I've taken a few different approaches to achieve this, and they range from very low effort to somewhat more work on my part. These are three of the strategies I've tried so far:
- Ask the model to choose the category (zero-shot classification – I'll use this as a worked example later in this article)
- Use a named entity recognition model to find key objects mentioned in the text, and classify based on those (see the sketch after this list).
- Ask the model to summarize the text, then apply other techniques to make a classification based on the summary.
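To give a flavor of the second strategy, here's a minimal sketch of what the NER route might look like. The model name and the entity-to-category mapping are placeholders I've picked purely for illustration, not recommendations:

```python
from transformers import pipeline

# Placeholder model and mapping, just to illustrate the NER-then-classify idea
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
entity_to_category = {"ORG": "Business", "PER": "People", "LOC": "Places"}

def categories_from_entities(text):
    # Collect the business categories implied by the entity types found in the text
    entities = ner(text)
    return {entity_to_category[e["entity_group"]]
            for e in entities if e["entity_group"] in entity_to_category}

print(categories_from_entities("Acme Corp emailed Maria about the Denver office."))
```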
This is one of the most fun parts: searching the Hugging Face catalog for models! At https://huggingface.co/models you can see a gigantic assortment of available models, which have been uploaded to the catalog by users. I have a few tips and pieces of advice for how to select wisely.
- Look at downloads and like numbers, and don't choose something that hasn't been tested by a decent number of other users. You can also check the Community tab on each model's page to see if users are discussing challenges or reporting bugs.
- Research who uploaded the model, if possible, and decide whether you find them trustworthy. The person who trained or fine-tuned the model may or may not have known what they were doing, and the quality of your results will depend on them!
- Read the documentation carefully, and skip models with little or no documentation. You'll have a hard time using them effectively otherwise.
- Use the filters at the side of the page to narrow down the models suitable for your task. The volume of options can be overwhelming, but they are well categorized to help you find what you need.
- Most model cards offer a quick test that you can run to see the behavior of the model, but keep in mind that this is just an example and is probably one that was chosen because the model is good at it and finds this case quite easy.
Once you've found a model you'd like to try, it's easy to get started: Click the “Use this model” button at the top right of the Model Card page and you'll see options for how to implement it. If you choose the Transformers option, you'll get some instructions similar to this.
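To give a sense of it, the generated snippet usually looks roughly like the following. I'm using facebook/bart-large-mnli here since that's the model I come back to later; the exact code depends on which model card you're on:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Or load the model and tokenizer directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")
```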
If a model you've selected is not supported by the Transformers library, there may be other techniques listed, such as TF-Keras, scikit-learn, or more, but they should all show instructions and sample code for easy use when you click that button.
In my experiments, all the models were compatible with Transformers, so I found it quite easy to get them up and running, simply by following these steps. If you have questions, you can also check out the more detailed documentation and see the full API details for the Transformers library and the different classes it offers. I've definitely spent some time looking at these docs for specific classes when optimizing, but to get the basics up and running it shouldn't be necessary.
Okay, so you've chosen a model you want to try. Do you already have data? If not, I've been using several publicly available datasets for this experimentation, mainly from Kaggle, and you can find lots of useful datasets there as well. Hugging Face also has a dataset catalog you can check out, but in my experience it's not as easy to search or to understand the data contents over there (there's just not as much documentation).
Once you've chosen an unstructured text dataset, loading it to use with these models isn't that difficult. Load your model and your tokenizer (following the docs provided on Hugging Face, as above) and pass both to the pipeline function from the transformers library. Then you'll loop over your blobs of text in a pandas list or Series and pass them to the model function. This is essentially the same for whatever kind of task you're doing, although for zero-shot classification you also need to provide a candidate label or list of labels, as I'll show below.
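As a rough sketch of that setup, assuming your dataset is a CSV you've pulled from Kaggle (the file and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical file and column names - swap in your own dataset
df = pd.read_csv("customer_comments.csv")
list_of_texts = df["comment_text"].dropna().tolist()
```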
So, let's take a closer look at zero-shot classification. As I noted above, this means using a pretrained model to classify text according to categories it has not been specifically trained on, in the hope that it can use its learned semantic embeddings to measure the similarities between the text and the label terms.
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import pipeline

# Load the NLI model and its tokenizer, then wrap them in a zero-shot pipeline
nli_model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli", model_max_length=512)
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
classifier = pipeline("zero-shot-classification", device="cpu", model=nli_model, tokenizer=tokenizer)

label_list = ['News', 'Science', 'Art']

all_results = []
for text in list_of_texts:
    prob = classifier(text, label_list, multi_label=True)
    results_dict = {x: y for x, y in zip(prob["labels"], prob["scores"])}
    all_results.append(results_dict)
This will return you a list of dicts, and each of those dicts will contain the possible labels as keys, with the probability of each label as the values. You don't have to use the pipeline as I've done here, but it makes multi-label zero-shot classification much easier than writing that code yourself, and it returns results that are easy to interpret and work with.
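Since each dict has the same keys, one handy follow-up (assuming pandas is available, and reusing the label_list defined above) is to drop the whole list into a DataFrame and pull out the top-scoring label per text:

```python
import pandas as pd

results_df = pd.DataFrame(all_results)                            # one column per candidate label
results_df["top_label"] = results_df[label_list].idxmax(axis=1)   # highest-scoring label per text
results_df["top_score"] = results_df[label_list].max(axis=1)
```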
If you prefer not to use the pipeline, you can do something like this instead, but you'll have to run it once for each label. Notice how the processing of the logits that come out of the model run needs to be specified so that you get human-interpretable output. Also, you still need to load the tokenizer and the model as described above.
def run_zero_shot_classifier(text, label):
    # Frame the label as an NLI hypothesis about the text
    hypothesis = f"This example is related to {label}."

    x = tokenizer.encode(
        text,
        hypothesis,
        return_tensors="pt",
        truncation="only_first"
    )

    logits = nli_model(x.to("cpu"))[0]

    # Keep only the entailment and contradiction logits, dropping "neutral",
    # then convert them to a probability that the label applies
    entail_contradiction_logits = logits[:, [0, 2]]
    probs = entail_contradiction_logits.softmax(dim=1)
    prob_label_is_true = probs[:, 1]

    return prob_label_is_true.item()

label_list = ['News', 'Science', 'Art']

all_results = []
for text in list_of_texts:
    for label in label_list:
        result = run_zero_shot_classifier(text, label)
        all_results.append(result)
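One wrinkle with this version is that all_results ends up as a flat list of scores, so it's easy to lose track of which score goes with which text and label. A minimal variation (just an illustration, not part of the original snippet) keeps one dict of label-to-score per text, mirroring the pipeline output:

```python
all_results = []
for text in list_of_texts:
    # One dict of label -> score per text, so results stay aligned with the inputs
    scores = {label: run_zero_shot_classifier(text, label) for label in label_list}
    all_results.append(scores)
```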
You've probably noticed that I haven't talked about fine-tuning the models myself for this project, and that's true. I may do this in the future, but I'm limited by the fact that I have minimal labeled training data to work with at the moment. I can use semi-supervised techniques or bootstrap a labeled training set, but this whole experiment has been to see how far I can get with off-the-shelf models. I do have a few small samples of labeled data to use in testing the models' performance, but that's nowhere near the volume of data I would need to fine-tune the models.
If you have good training data and would like to tune a base model, Hugging Face has some docs that can help. https://huggingface.co/docs/transformers/en/training
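Just to sketch what that might look like, a minimal fine-tuning run with the Trainer API could be along these lines. This assumes you have a pandas DataFrame df with a "text" column and an integer-coded "label" column, and it uses distilbert-base-uncased purely as an example base model:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

# Tokenize the text column and hold out a test split
dataset = Dataset.from_pandas(df).map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)
dataset = dataset.train_test_split(test_size=0.2)

args = TrainingArguments(output_dir="finetune-out", num_train_epochs=3)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"])
trainer.train()
```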
Performance has been an interesting question, as I've run all my experiments on my local laptop so far. Naturally, using these Hugging Face models will be much more compute-intensive and slower than basic strategies like regex and lexicon search, but it provides signal that can't really be achieved any other way, so finding ways to optimize can be worth it. All of these models are GPU-enabled, and it's very easy to push them to run on GPU. (If you want to try it on GPU quickly, just look back at the code I've shown above and swap in “cuda” wherever you see “cpu”, if you have a GPU available in your programming environment.) Keep in mind that GPU time from cloud providers isn't cheap, however, so prioritize accordingly and decide whether the extra speed is worth the price.
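One small convenience for that swap is detecting the device at runtime instead of hardcoding it (this assumes PyTorch is installed, and reuses the nli_model and tokenizer loaded earlier):

```python
import torch
from transformers import pipeline

# Use the GPU automatically if one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
classifier = pipeline("zero-shot-classification", device=device, model=nli_model, tokenizer=tokenizer)
```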
Most of the time, using a GPU is much more important for training (keep that in mind if you decide to fine-tune) but less vital for inference. I'm not going to get into more detail about optimization here, but if speed matters to you, you'll also want to consider parallelism: both data parallelism and actual training/compute parallelism.
So, you've run the model and you have results in hand. I have a few final tips for how to review the output and actually apply it to business questions.
- Don't blindly trust the model output; run rigorous tests and evaluate the performance. Just because a transformer model does well on some blob of text, or is able to regularly match text with a certain label correctly, doesn't mean this result is generalizable. Use lots of different examples and different kinds of text to prove that the performance is going to be sufficient (see the evaluation sketch after this list).
- If you are confident in the model and want to use it in a production environment, track and record the behavior of the model. This is just good practice for any model in production, but you should keep the results you've produced along with the inputs you've given it, so you can continually check it and make sure performance doesn't drop. This is more important for these types of deep learning models because we don't have as much interpretability of why and how the model generates its inferences. It is dangerous to make too many assumptions about the internal workings of the model.
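As a rough idea of what that testing could look like: if you have even a small hand-labeled sample, you can compare the pipeline's top-scoring label against your ground truth. The labeled_texts list here is hypothetical, and scikit-learn is assumed to be installed:

```python
from sklearn.metrics import classification_report

# Hypothetical: labeled_texts is a list of (text, true_label) pairs you've hand-checked
predictions = []
for text, _ in labeled_texts:
    prob = classifier(text, label_list, multi_label=True)
    predictions.append(prob["labels"][0])   # pipeline returns labels sorted by score

true_labels = [true_label for _, true_label in labeled_texts]
print(classification_report(true_labels, predictions))
```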
As I mentioned above, I like to use these kinds of model results as part of a larger pool of techniques, combining them in ensemble strategies; that way I'm not relying on only one approach, but I can still get the signal those inferences can provide.
I hope this overview is useful for those who are getting started with pre-trained models for text analysis (or otherwise). Good luck!