Named Entity Recognition (NER) is vital in natural language processing, with applications spanning medical coding, financial analysis, and legal document review. Custom models are typically built on transformer encoders pre-trained with self-supervised objectives such as masked language modeling (MLM). In recent years, however, large language models (LLMs) such as GPT-3 and GPT-4 have emerged that can address NER tasks through well-designed prompts, but they bring high inference costs and potential privacy concerns.
The NuMind team presents an approach that uses an LLM to minimize human annotation when creating custom models. Instead of employing the LLM to annotate a single-domain dataset for one specific NER task, the idea is to have it annotate a diverse, multi-domain dataset covering many NER problems. A smaller base model such as BERT is then pre-trained on this annotated dataset, and the resulting model can be fine-tuned for any downstream NER task.
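To make the second stage concrete, here is a minimal sketch of pre-training a compact encoder on LLM-annotated data, assuming that corpus has already been reduced to word-level BIO tags. The toy sentence, tag set, and hyperparameters are illustrative placeholders, not NuMind's actual recipe.

```python
# Minimal sketch: one token-classification training step for a compact
# encoder (BERT) on LLM-annotated NER data. The toy example and tag set
# are placeholders standing in for the multi-domain annotated corpus.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tags = ["O", "B-ENT", "I-ENT"]  # placeholder tag set
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(tags))

# One toy sentence standing in for the LLM-annotated multi-domain data.
words = ["Apple", "hired", "Tim", "Cook"]
word_tags = [1, 0, 1, 2]  # B-ENT O B-ENT I-ENT

enc = tok(words, is_split_into_words=True, return_tensors="pt")
# Align word-level tags to subword tokens: special tokens (CLS/SEP) get
# -100 so the loss ignores them; subword pieces inherit their word's tag.
labels = [-100 if wid is None else word_tags[wid] for wid in enc.word_ids(0)]
enc["labels"] = torch.tensor([labels])

optim = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**enc).loss  # one gradient step
loss.backward()
optim.step()
```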
The team has released three NER models:
- NuNER Zero: A zero-shot NER model that adopts the GLiNER (Generalist Model for Named Entity Recognition via Bidirectional Transformer) architecture and takes as input a concatenation of entity types and text. Unlike GLiNER, NuNER Zero works as a token classifier, allowing it to detect arbitrarily long entities. Trained on the NuNER v2.0 dataset, which fuses subsets of Pile and C4 annotated via an LLM using the NuNER procedure, NuNER Zero is the leading compact zero-shot NER model, with a token-level F1 improvement of +3.1% over GLiNER-large-v2.1 on the GLiNER benchmark. A usage sketch follows this list.
- NuNER Zero 4k: The long-context (4k tokens) version of NuNER Zero. It generally performs slightly below NuNER Zero but can outperform it in applications where context size matters.
- NuNER Zero-span: The span-prediction version of NuNER Zero. It shows slightly better performance than NuNER Zero but cannot detect entities longer than 12 tokens.
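The model cards show NuNER Zero being loaded through the gliner Python package; the sketch below follows that pattern, with an example text and entity types of our own (the model card notes that entity types should be lower-cased).

```python
# Zero-shot inference with NuNER Zero via the gliner package
# (pip install gliner). The text and entity types are illustrative.
from gliner import GLiNER

model = GLiNER.from_pretrained("numind/NuNER_Zero")

text = "NuMind has released NuNER Zero, trained on subsets of Pile and C4."
labels = ["organization", "model", "dataset"]  # lower-cased, per the model card

for ent in model.predict_entities(text, labels):
    print(ent["start"], ent["end"], ent["label"], "->", ent["text"])
```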
The key features of these three models are:
- NuNER Zero: Originated from NuNER; a solid default for inputs of moderate length.
- NuNER Zero 4k: A variant of NuNER Zero that works best in scenarios where context size matters.
- NuNER Zero-span: The span-prediction version of NuNER Zero; not suitable for entities longer than 12 tokens (the merge sketch below shows how the token-classifier variants avoid this limit).
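Because the token-classifier variants predict per token, adjacent predictions of the same type can be merged into one entity after inference, which is how they sidestep the 12-token span limit. Below is a sketch of such a post-processing step, modeled on the merge helper shown on the NuNER Zero model card; it assumes gliner-style entity dicts with start, end, and label fields.

```python
# Sketch of merging adjacent same-label predictions into one entity,
# letting the token-classifier variants recover arbitrarily long spans.
# Assumes gliner-style dicts with "start", "end", and "label" keys.
def merge_entities(text, entities):
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        # Extend the previous entity when this one has the same label
        # and starts right at (or just after) where the previous ended.
        if prev and ent["label"] == prev["label"] and ent["start"] <= prev["end"] + 1:
            prev["end"] = ent["end"]
            prev["text"] = text[prev["start"]:prev["end"]]
        else:
            merged.append(dict(ent))
    return merged
```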
In conclusion, NER is crucial in natural language processing, but building custom models has typically relied on transformer encoders pre-trained through MLM, while the rise of LLMs such as GPT-3 and GPT-4 brings high inference costs. The NuMind team proposes an approach that uses an LLM to reduce human annotation by annotating a multi-domain dataset. They present three NER models: NuNER Zero, a compact zero-shot model; NuNER Zero 4k, which emphasizes a broader context; and NuNER Zero-span, which prioritizes span prediction with slightly better performance but is limited to entities of at most 12 tokens.
Sources
- https://huggingface.co/numind/NuNER_Zero-4k
- https://huggingface.co/numind/NuNER_Zero
- https://huggingface.co/numind/NuNER_Zero-span
- https://arxiv.org/pdf/2402.15343
- https://www.linkedin.com/posts/tomaarsen_numind-yc-s22-has-just-released-3-new-state-of-the-art-activity-7195863382783049729-kqko/
Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and artificial intelligence to address real-world challenges. With a strong interest in solving practical problems, she brings a fresh perspective to the intersection of AI and real-life solutions.