Having started at a time when wrappers were less common, I got into the habit of writing my own training loops, which I find easier to debug, and this is an approach that Accelerate supports well. It proved beneficial in this project: I was not completely sure of the required data and label formats or shapes, and my data did not match the well-organized examples often shown in tutorials, but having full access to the intermediate calculations in the training loop allowed me to iterate quickly.
Context length
Most tutorials suggest using each sentence as a single training example. In this case, however, I decided a longer context would be more appropriate, since documents typically contain references to multiple entities, many of which are irrelevant (e.g. attorneys, other creditors, case numbers). This broader context allows the model to better identify the relevant customer. I used 512 tokens from each document as a training example. This is a common maximum limit for models, and it comfortably fits all entities in most of my documents.
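One way to sanity-check a fixed window like this is to measure how many entity-tagged words survive truncation. The sketch below is illustrative only: entity_coverage() is a hypothetical helper that is not part of the original project, and it assumes word-level tag ids in which 0 stands for the "O" label.

```python
def entity_coverage(tag_sequences, max_len=512, outside_id=0):
    """Fraction of entity-tagged words that fall inside the first max_len words."""
    kept = 0
    total = 0
    for tags in tag_sequences:
        for position, tag in enumerate(tags):
            if tag == outside_id:
                continue  # "O" words do not count as entities
            total += 1
            if position < max_len:
                kept += 1
    # With no entities at all, nothing is lost by truncating
    return kept / total if total else 1.0


# Toy documents: tag ids per word (0 = "O", anything else = entity)
docs = [
    [0, 1, 2, 0, 0],       # both entity words near the start
    [0] * 600 + [3, 4],    # both entity words beyond a 512-word window
]
print(entity_coverage(docs, max_len=512))  # 2 of the 4 entity words survive
```

Note that this counts words, while the tokenizer truncates at 512 subword tokens, so the real coverage after tokenization can be somewhat lower.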
Subtoken tagging
In the token classification tutorial (1), the recommended approach is:
Label only the first token of a given word. Assign -100 to other subtokens of the same word.
However, I found that the following method, suggested in the token classification chapter of their NLP course (2), works much better:
Each token receives the same label as the token that started the word it belongs to, since they are part of the same entity. For tokens within a word but not at the beginning, we replace the B- with I-.
The label -100 is a special value that is ignored by the loss function. I therefore used the course's functions with minor changes:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

def tokenize_and_align_labels(examples):
    tokenizer = AutoTokenizer.from_pretrained("../model/xlm-roberta-large")
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True,
        padding="max_length", max_length=512)
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))
    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs
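To make the difference between the two labeling strategies concrete, here is a small sketch that applies both to the same word. It is not the course code: align() and its label_all_subtokens flag are hypothetical names used for illustration, and it assumes that label id 0 is "O", odd ids are B- labels, and the following even id is the matching I- label.

```python
def align(labels, word_ids, label_all_subtokens):
    """Spread word-level label ids over subword tokens.

    Assumes odd ids are B- labels and the next even id is the matching I- label.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)             # special or padding token
        elif word_id != previous_word:
            aligned.append(labels[word_id])  # first subtoken keeps the word label
        elif label_all_subtokens:
            label = labels[word_id]
            aligned.append(label + 1 if label % 2 == 1 else label)  # B- becomes I-
        else:
            aligned.append(-100)             # ignored by the loss
        previous_word = word_id
    return aligned


word_labels = [0, 3, 0]                  # e.g. O, B-rc, O
word_ids = [None, 0, 1, 1, 1, 2, None]   # the second word splits into 3 subtokens

print(align(word_labels, word_ids, label_all_subtokens=False))
# [-100, 0, 3, -100, -100, 0, -100]
print(align(word_labels, word_ids, label_all_subtokens=True))
# [-100, 0, 3, 4, 4, 0, -100]
```

With the second strategy every subtoken of an entity contributes to the loss, which matters for long identifiers that the tokenizer splits into many pieces.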
I also used their postprocess() function:
To simplify its evaluation part, we define this postprocess() function that takes predictions and labels and converts them into lists of strings.
def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()

    # Drop positions labeled -100 and map ids to label strings
    true_labels = [[id2label[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_predictions, true_labels
Class weights
Incorporating class weights into the loss function significantly improved model performance. Although this adjustment may seem straightforward (without it, the model overemphasized the majority “O” class), it is surprisingly absent from most tutorials. I implemented a custom compute_weights() function to address this imbalance:
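def compute_weights(trainset, num_labels):
    # Count how often each label id occurs in the training set
    c = Counter()
    for t in trainset:
        c += Counter(t['labels'].tolist())
    # Inverse-frequency weights; the +1 avoids division by zero for unseen labels
    weights = [sum(c.values())/(c[i]+1) for i in range(num_labels)]
    return weights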
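As a worked example of the same formula, the sketch below computes the weights for a toy dataset using plain Python lists instead of tensors. The function name inverse_frequency_weights() and the toy data are made up for illustration.

```python
from collections import Counter


def inverse_frequency_weights(label_sequences, num_labels):
    """Weight for class i = total label count / (count of class i + 1)."""
    counts = Counter()
    for labels in label_sequences:
        counts.update(labels)
    total = sum(counts.values())  # note: this also counts the ignored -100 positions
    return [total / (counts[i] + 1) for i in range(num_labels)]


# Toy data: 0 = "O", 1 and 2 = entity labels, -100 = ignored positions
toy = [
    [0, 0, 0, 1, 2, -100],
    [0, 0, 0, 0, -100, -100],
]
print(inverse_frequency_weights(toy, num_labels=3))  # [1.5, 6.0, 6.0]
```

The frequent "O" class receives the smallest weight. Because every weight shares the same numerator, including the -100 positions in the total only rescales the weights and leaves their ratios unchanged.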
Training loop
I defined two additional functions: create_dataloaders(), which wraps the PyTorch DataLoader to manage batching, and main(), which configures the distributed training objects and runs the training loop.
from accelerate import Accelerator, notebook_launcher
from collections import Counter
from datasets import Dataset
from datetime import datetime
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.nn import CrossEntropyLoss
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from transformers import AutoModelForTokenClassification
from transformers import XLMRobertaConfig, XLMRobertaForTokenClassification
from seqeval.metrics import classification_report, f1_score


def create_dataloaders(trainset, evalset, batch_size, num_workers):
    train_dataloader = DataLoader(trainset, shuffle=True,
                                  batch_size=batch_size, num_workers=num_workers)
    eval_dataloader = DataLoader(evalset, shuffle=False,
                                 batch_size=batch_size, num_workers=num_workers)
    return train_dataloader, eval_dataloader

def main(batch_size, num_workers, epochs, model_path, dataset_tr, dataset_ev, training_type, model_params, dt):
    accelerator = Accelerator(split_batches=True)
    num_labels = model_params['num_labels']

    # Prepare data #
    train_ds = Dataset.from_dict(
        {"tokens": [d[2][:512] for d in dataset_tr],
         "ner_tags": [d[1][:512] for d in dataset_tr]})
    eval_ds = Dataset.from_dict(
        {"tokens": [d[2][:512] for d in dataset_ev],
         "ner_tags": [d[1][:512] for d in dataset_ev]})
    trainset = train_ds.map(tokenize_and_align_labels, batched=True,
                            remove_columns=["tokens", "ner_tags"])
    evalset = eval_ds.map(tokenize_and_align_labels, batched=True,
                          remove_columns=["tokens", "ner_tags"])
    # Return tensors so that batches collate correctly and
    # compute_weights() can call .tolist() on the labels
    trainset.set_format("torch")
    evalset.set_format("torch")
    train_dataloader, eval_dataloader = create_dataloaders(trainset, evalset,
                                                           batch_size, num_workers)

    # Type of training #
    if training_type == 'from_scratch':
        # Architecture only, randomly initialized weights
        config = XLMRobertaConfig.from_pretrained(model_path, **model_params)
        model = XLMRobertaForTokenClassification(config)
    elif training_type == 'transfer_learning':
        model = AutoModelForTokenClassification.from_pretrained(model_path,
            ignore_mismatched_sizes=True, **model_params)
        # Freeze everything, then unfreeze only the classification head
        for param in model.parameters():
            param.requires_grad = False
        for param in model.classifier.parameters():
            param.requires_grad = True
    elif training_type == 'fine_tuning':
        model = AutoModelForTokenClassification.from_pretrained(model_path,
            ignore_mismatched_sizes=True, **model_params)
        # Update all weights
        for param in model.parameters():
            param.requires_grad = True
        for param in model.classifier.parameters():
            param.requires_grad = True

    # Instantiate the optimizer #
    optimizer = torch.optim.AdamW(params=model.parameters(), lr=2e-5)

    # Instantiate the learning rate scheduler #
    lr_scheduler = ReduceLROnPlateau(optimizer, patience=5)

    # Define loss function #
    weights = compute_weights(trainset, num_labels)
    loss_fct = CrossEntropyLoss(weight=torch.tensor(weights))

    # Prepare objects for distributed training #
    loss_fct, train_dataloader, model, optimizer, eval_dataloader, lr_scheduler = accelerator.prepare(
        loss_fct, train_dataloader, model, optimizer, eval_dataloader, lr_scheduler)

    # Training loop #
    max_f1 = 0      # for early stopping
    best_epoch = 0
    for t in range(epochs):
        # training
        accelerator.print(f"\n\nEpoch {t+1}\n-------------------------------")
        model.train()
        tr_loss = 0
        preds = list()
        labs = list()
        for batch in train_dataloader:
            outputs = model(input_ids=batch['input_ids'],
                            attention_mask=batch['attention_mask'])
            labels = batch["labels"]
            loss = loss_fct(outputs.logits.view(-1, num_labels), labels.view(-1))
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
            tr_loss += loss.item()

            predictions = outputs.logits.argmax(dim=-1)
            predictions_gathered = accelerator.gather(predictions)
            labels_gathered = accelerator.gather(labels)
            true_predictions, true_labels = postprocess(predictions_gathered, labels_gathered)
            preds.extend(true_predictions)
            labs.extend(true_labels)
        accelerator.print(f"Train loss: {tr_loss/len(train_dataloader):>8f} \n")
        accelerator.print(classification_report(labs, preds))

        # evaluation
        model.eval()
        ev_loss = 0
        preds = list()
        labs = list()
        for batch in eval_dataloader:
            with torch.no_grad():
                outputs = model(input_ids=batch['input_ids'],
                                attention_mask=batch['attention_mask'])
            labels = batch["labels"]
            loss = loss_fct(outputs.logits.view(-1, num_labels), labels.view(-1))
            ev_loss += loss.item()

            predictions = outputs.logits.argmax(dim=-1)
            predictions_gathered = accelerator.gather(predictions)
            labels_gathered = accelerator.gather(labels)
            true_predictions, true_labels = postprocess(predictions_gathered, labels_gathered)
            preds.extend(true_predictions)
            labs.extend(true_labels)
        accelerator.print(f"Eval loss: {ev_loss/len(eval_dataloader):>8f} \n")
        accelerator.print(classification_report(labs, preds))
        # Assumption: the scheduler monitors the evaluation loss
        # (ReduceLROnPlateau defaults to mode='min'); the original call was lost
        lr_scheduler.step(ev_loss/len(eval_dataloader))
        accelerator.print(f"Current Learning Rate: {optimizer.param_groups[0]['lr']}")

        # checkpoint best model
        if f1_score(labs, preds) > max_f1:
            unwrapped_model = accelerator.unwrap_model(model)
            # Assumption: the original save call and its output path were lost;
            # the path below is a placeholder
            unwrapped_model.save_pretrained(f"../model/checkpoint_{dt}",
                is_main_process=accelerator.is_main_process,
                save_function=accelerator.save)
            accelerator.print(f"Model saved during {t+1}. epoch.")
            max_f1 = f1_score(labs, preds)
            best_epoch = t

        # early stopping
        if (t - best_epoch) > 10:
            accelerator.print(f"Early stopping after {t+1}. epoch.")
            break
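The checkpointing and early-stopping rule at the end of the loop can be examined in isolation. The following sketch replays that rule over a list of per-epoch F1 scores. The function best_epoch_with_patience() and the scores are invented for illustration, and the patience is shortened to 3 so the effect shows on a short list.

```python
def best_epoch_with_patience(f1_per_epoch, patience=10):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch F1 scores.

    Training stops once more than `patience` epochs pass without a new best F1.
    """
    max_f1 = 0.0
    best_epoch = 0
    for epoch, f1 in enumerate(f1_per_epoch):
        if f1 > max_f1:
            max_f1 = f1          # this is where the model would be checkpointed
            best_epoch = epoch
        if epoch - best_epoch > patience:
            return best_epoch, epoch  # early stop
    return best_epoch, len(f1_per_epoch) - 1


scores = [0.50, 0.62, 0.70, 0.69, 0.68, 0.70, 0.66, 0.65]
print(best_epoch_with_patience(scores, patience=3))  # (2, 6)
```

Because the comparison is strict, an epoch that only ties the best F1 (epoch index 5 here) neither triggers a new checkpoint nor resets the patience counter.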
With everything prepared, the model is ready to train. I just need to start the process:
label_list = [
    "O",                         # restored: num_labels is 11 and B- labels must have odd ids
    "B-evcu", "I-evcu",          # variable symbol of creditor
    "B-rc", "I-rc",              # birth ID
    "B-prijmeni", "I-prijmeni",  # surname
    "B-jmeno", "I-jmeno",        # given name
    "B-datum", "I-datum",        # birth date
]
id2label = {a: b for a,b in enumerate(label_list)}
label2id = {b: a for a,b in enumerate(label_list)}

num_workers = 6      # number of GPUs
batch_size = num_workers*2
epochs = 100
model_path = "../model/xlm-roberta-large"
training_type = "fine_tuning"  # from_scratch / transfer_learning / fine_tuning
model_params = {"id2label": id2label, "label2id": label2id, "num_labels": 11}
dt = datetime.now().strftime("%Y%m%d_%H%M%S")

notebook_launcher(main, args=(batch_size, num_workers, epochs, model_path,
    dataset_tr, dataset_ev, training_type, model_params, dt),
    num_processes=num_workers, mixed_precision="fp16", use_port="29502")
I find using notebook_launcher() convenient as it allows me to run the training in the console and then easily work with the results.
Base XLM-RoBERTa vs Large vs Small-E-Czech
I experimented with fine-tuning three models. The base XLM-RoBERTa model (3) offered satisfactory performance, but the server capacity also allowed me to test the large model.
XLM-RoBERTa is a multilingual version of RoBERTa. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.
The large model showed a slight improvement in results, so I ultimately used it. I also tried Small-E-Czech (4), an Electra-small model pre-trained on Czech web data, but it performed poorly.
Fine-tuning, transfer learning and training from scratch
In addition to fine-tuning (updating all model weights), I tried transfer learning, as it is sometimes suggested that training only the final (classification) layer may be enough. However, the performance difference was significant, favoring full fine-tuning. I also tried training from scratch by importing only the model architecture, initializing the weights randomly and then training, but as expected this approach was not effective.
RoBERTa vs LLM (Claude 3.5 Sonnet)
I briefly explored zero-shot LLMs, albeit with minimal prompt engineering. The model had problems even with basic requests, such as the following (I used Czech in the actual prompt):
Find the creditor's variable symbol. This number has exactly 9 consecutive digits from 0 to 9 without letters or other special characters. It is usually preceded by one of the following abbreviations: 'ev.č.', 'zn. opr', 'VS. O', 'obvious. do. opr.'. On the other hand, I am not interested in a transaction number with the abbreviation 'č.j.'. This number does not appear frequently in the documents, it may happen that you cannot find it, so write 'can't find it'. If you are not sure, write “not sure.”
The model sometimes failed to generate the 9-digit format accurately. Post-processing would filter out shorter numbers, but there were many false positive 9-digit numbers.
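The format check mentioned above could look roughly like the sketch below. It is an illustrative reconstruction, not the filter actually used: extract_variable_symbols() and the sample strings are hypothetical, and the pattern simply requires nine digits that are not part of a longer run of digits.

```python
import re

# Exactly nine digits, not embedded in a longer run of digits
NINE_DIGITS = re.compile(r"(?<!\d)\d{9}(?!\d)")


def extract_variable_symbols(llm_answer):
    """Return all standalone 9-digit numbers found in the model's answer."""
    return NINE_DIGITS.findall(llm_answer)


print(extract_variable_symbols("ev.č. 123456789"))        # ['123456789']
print(extract_variable_symbols("VS 12345678"))            # []  (too short)
print(extract_variable_symbols("account 1234567890123"))  # []  (part of a longer number)
```

A filter like this enforces the format but cannot tell which 9-digit number is the creditor's variable symbol, which is why the false positives remain.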
Occasionally the model inferred an incorrect birth ID based solely on the birth date (even with the temperature set to 0). On the other hand, it excelled at extracting given names, surnames and dates of birth.
In general, as in my previous experiments, I found that LLMs (at the time of writing) perform better on general tasks but lack accuracy and reliability for specific or unconventional tasks. Customer identification performance was quite similar for both approaches. For internal reasons, the RoBERTa model was implemented.