In the rapidly evolving field of natural language processing, Transformers have become the dominant models, demonstrating remarkable performance on a wide range of sequence labeling tasks, including part-of-speech tagging, named entity recognition, and chunking. Before the era of Transformers, conditional random fields (CRFs) were the go-to tool for sequence modeling — specifically linear-chain CRFs, which model a sequence as a chain-structured undirected graph, while CRFs in general can be defined over arbitrary graphs.
This article will be broken down as follows:
- Introduction
- Emissions and transition scores
- Loss function
- Efficient computation of the partition function with the forward algorithm
- Viterbi algorithm
- Complete LSTM-CRF code
- Disadvantages and conclusions
The CRF implementation in this article is based on this excellent tutorial. Note that it is by no means the most efficient implementation available, and it also lacks batch processing; however, it is relatively simple to read and understand, and since the goal of this tutorial is to understand the inner workings of CRFs, it suits us perfectly.
In sequence labeling problems, we deal with a sequence of input elements, such as the words of a sentence, where each element corresponds to a specific label or category. The main goal is to assign the correct tag to each individual element. Within the LSTM-CRF model we can identify two key components for doing this: emission and transition probabilities. Note that in practice we will work with scores in log space rather than with probabilities, for numerical stability:
- Emission scores relate to the probability of observing a particular label for a given data element. In the context of named entity recognition, for example, each word in a sequence is assigned one of three labels: the beginning of an entity (B), an inside word of an entity (I), or a word outside any entity (O). Emission probabilities quantify how likely a specific word is to be associated with a particular tag. Mathematically, this is expressed as P(y_i | x_i), where y_i denotes the label and x_i represents the…
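To make the emission side concrete, here is a minimal sketch of how a bidirectional LSTM can produce per-token emission scores over the tag set {B, I, O}. The module and dimension names here are illustrative, not taken from the tutorial this article follows:

```python
import torch
import torch.nn as nn

class EmissionScorer(nn.Module):
    """Maps a sequence of word embeddings to per-token tag scores."""
    def __init__(self, embed_dim=8, hidden_dim=16, num_tags=3):
        super().__init__()
        # A bidirectional LSTM encodes the left and right context of each word.
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            bidirectional=True, batch_first=True)
        # A linear layer projects each LSTM state to one score per tag (B, I, O).
        self.to_tags = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, embed_dim)
        out, _ = self.lstm(embeddings)
        # Result: (batch, seq_len, num_tags) — the emission scores.
        return self.to_tags(out)

# One "sentence" of 5 words with random embeddings, for illustration only.
scorer = EmissionScorer()
emissions = scorer(torch.randn(1, 5, 8))
print(emissions.shape)  # one score per (word, tag) pair
```

These raw scores live in log space; nothing forces them to be normalized probabilities, which is exactly what the CRF layer later exploits when it combines them with transition scores.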