NER with little data? Transformers to the rescue!

How do you solve deep learning problems with too little labelled data? The answer, of course, is transfer learning. In this post, we will apply this concept to named entity recognition (NER) and

  • fine-tune a pre-trained BERT to extract information from legal texts,
  • encounter a token misalignment problem due to BERT’s preference for sub-word tokens, and
  • observe tremendous improvements on difficult classes compared to the hand-made bi-LSTM model of our previous posts.

Let’s get started!

The NER dataset and task

We again use the dataset presented by E. Leitner, G. Rehm and J. Moreno-Schneider in Fine-grained Named Entity Recognition in Legal Documents. It consists of German court decisions with annotations of entities referring to legal norms, court decisions, legal literature and so on, of the following form:

‘Trotz der zweifelhaften Bewertung von MDMA als ” harte Droge “ ( vgl. BGH , Beschluss vom 3. Februar 1999 – 5 StR 705/98 , juris Rn. 2 [RS] ; zum Meinungsstand Patzak in Körner / Patzak / Volkmer , BtMG , 8. Aufl. , Vorbem. zu §§ 29 ff. Rn. 213 [LIT] mwN ; Weber , BtMG , 5. Aufl. , § 1 Rn. 364 [LIT] mwN ) hat der Strafausspruch Bestand , da die verhängte Rechtsfolge jedenfalls angemessen ist ( § 354 Abs. 1a Satz 1 StPO [GS] ) . ‘

Given a sample sentence, the task for our model is to annotate each word with a tag that indicates whether the word is part of a reference to a legal norm, a court decision and so on. For more details, see the first post of this series.

The transformer revolution

In case you haven’t read about transformers, here’s a summary. For details on the original transformer architecture, see the original paper or one of the many blog posts on the topic.

Transformers transformed natural language processing (NLP) with

  • a revolutionary attention mechanism that replaces convolutional or recurrent architectures,
  • a shift in transfer learning from pre-training (word vectors) for feature extraction to training generic language models plus fine-tuning on downstream tasks, and
  • an exponential growth of model size that brought us performance on par with humans on a number of NLP tasks, but also exploding resource consumption with diminishing returns.

[Figure: exponential growth of transformer sizes]

To leverage transformers for our custom NER task, we’ll use the Python library huggingface transformers.

Get your keyboard ready, or just follow along by reading!

Setting up the environment

Set up a virtual environment, install the required dependencies and download the dataset as in the preceding blog posts:

mkdir transformers_ner_project && cd transformers_ner_project
python3 -m venv .venv && source .venv/bin/activate
pip install numpy pandas tqdm scikit-learn "transformers[tf-cpu]"
mkdir -p data/01_raw
     -L -o data/01_raw/
unzip data/01_raw/ -d data/01_raw

Alternatively, follow along with Jupyter running inside a TensorFlow Docker container, or with a Google Colab notebook.

Step 1: Loading a pre-trained BERT

With huggingface transformers, it’s super-easy to get a state-of-the-art pre-trained transformer model nicely packaged for our NER task: we choose a pre-trained German BERT model from the model repository and request a wrapped variant with an additional token classification layer for NER with just a few lines:

from transformers import AutoConfig, TFAutoModelForTokenClassification

MODEL_NAME = 'bert-base-german-cased'

# `schema` is the list of tags that we build in Step 2 below
config = AutoConfig.from_pretrained(MODEL_NAME, num_labels=len(schema))
model = TFAutoModelForTokenClassification.from_pretrained(MODEL_NAME,
                                                          config=config)

The result is a TensorFlow model consisting of the pre-trained BERT transformer, followed by a drop-out and a dense classifier layer which predicts the tag of each token:

Model: "tf_bert_for_token_classification"
Layer (type)                 Output Shape              Param #   
bert (TFBERTMainLayer)       multiple                  109081344 
dropout_37 (Dropout)         multiple                  0         
classifier (Dense)           multiple                  16149     
Total params: 109,097,493
Trainable params: 109,097,493
Non-trainable params: 0
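
As a quick sanity check, the parameter counts in this summary can be reproduced by hand. The sketch below assumes a hidden size of 768, which is the standard value for BERT-base models but is not stated in the summary itself, and infers the number of labels from the classifier's parameter count.

```python
# A dense layer has hidden_size * num_labels weights plus num_labels biases,
# i.e. (hidden_size + 1) * num_labels parameters in total.
HIDDEN_SIZE = 768          # assumption: hidden size of a BERT-base model
CLASSIFIER_PARAMS = 16149  # taken from the model summary
BERT_PARAMS = 109081344    # taken from the model summary

num_labels, remainder = divmod(CLASSIFIER_PARAMS, HIDDEN_SIZE + 1)
total_params = BERT_PARAMS + CLASSIFIER_PARAMS
print(num_labels, remainder, total_params)
```

The division leaves no remainder and yields 21 labels, so len(schema) should be 21 here. That would fit the dataset's 19 fine-grained entity classes plus the tag 'O' and the padding tag '_'.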

Step 2: Preprocessing

The data files contain sample sentences separated by blank lines, with one token and annotation in BIO format per line:

an O
Kapitalgesellschaften O
( O
§ B-GS
17 I-GS
Abs. I-GS
1 I-GS
und I-GS
2 I-GS
) O
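
To illustrate how the BIO format encodes entities, here is a minimal sketch that groups such token-tag pairs back into entity spans. The helper bio_to_spans is hypothetical and not part of the pipeline in this post; the example sentence consists of the token-tag lines shown above.

```python
def bio_to_spans(tagged_tokens):
    """Group (token, BIO-tag) pairs into (entity_class, text) spans."""
    spans, current = [], None
    for token, tag in tagged_tokens:
        prefix, _, label = tag.partition('-')
        starts_new = prefix == 'B' or (
            prefix == 'I' and (current is None or current[0] != label))
        if starts_new:
            current = (label, [token])   # open a new entity span
            spans.append(current)
        elif prefix == 'I':
            current[1].append(token)     # continue the open span
        else:
            current = None               # 'O' closes any open span
    return [(label, ' '.join(tokens)) for label, tokens in spans]


sentence = [('an', 'O'), ('Kapitalgesellschaften', 'O'), ('(', 'O'),
            ('§', 'B-GS'), ('17', 'I-GS'), ('Abs.', 'I-GS'), ('1', 'I-GS'),
            ('und', 'I-GS'), ('2', 'I-GS'), (')', 'O')]
print(bio_to_spans(sentence))
```

The pipeline below does not need this grouping: it drops the B-/I- prefixes and classifies each word on its own.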

We read two data files line by line, store the sentences as lists of token-tag pairs, and determine the annotation schema, just as we did when training our bi-LSTM model:

def load_data(filename: str):
    """Read a CoNLL file into a list of sentences of (token, tag) pairs."""
    samples, sample = [], []
    with open(filename, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.split()
            if parts:
                token, tag = parts
                # drop the B-/I- prefix and keep only the entity class
                sample.append((token, tag.split('-')[-1]))
            elif sample:
                # a blank line ends the current sentence
                samples.append(sample)
                sample = []
    if sample:
        # last sentence, in case the file does not end with a blank line
        samples.append(sample)
    return samples

train_samples = load_data('data/01_raw/bag.conll')
val_samples = load_data('data/01_raw/bgh.conll')
samples = train_samples + val_samples
# '_' at index 0 is the tag used for padding
schema = ['_'] + sorted({tag for sentence in samples
                             for _, tag in sentence})

Gotcha! Sub-word tokenization?

But how do we feed the data into our transformer? The answer depends on the model that we chose because it has been pre-trained with a custom sub-word tokenizer. This tokenizer splits an input sentence into a sequence of subword tokens instead of words, using an algorithm like byte-pair encoding or unigram language models. Let’s get hold of the tokenizer that was used to pre-train our model,

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

and apply it to some samples. The results are dictionaries where we’re mainly interested in the component input_ids:

'Das ist'                       [3, 295, 127, 4]
'eine Frage'                    [3, 155, 1685, 4]
'eine hochinteressante Frage'   [3, 155, 2426, 21477, 5004, 1685, 4]

What do we see?

  1. The tokenizer marks the beginning and the end of a sample with the token IDs 3 and 4, respectively.
  2. Common words like 'Das', 'ist', 'eine', 'Frage' are treated as single tokens.
  3. Less frequent words like 'hochinteressante' are split up into a sequence of sub-word tokens.
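
To get a feeling for how such a split can come about, here is a toy sketch of greedy longest-match-first lookup, which is how WordPiece tokenizers of BERT-style models segment a word at inference time. The vocabulary TOY_VOCAB and the resulting pieces are made up for illustration; the real tokenizer's vocabulary and its actual split of 'hochinteressante' may differ.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of a word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = '##' + piece  # continuation pieces carry a '##' prefix
            if piece in vocab:
                break
            end -= 1
        if end == start:
            # no piece matched at this position: the whole word is unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# hypothetical mini vocabulary, not the one of bert-base-german-cased
TOY_VOCAB = {'eine', 'Frage', 'hoch', '##interess', '##ante'}
print(wordpiece('Frage', TOY_VOCAB))
print(wordpiece('hochinteressante', TOY_VOCAB))
```

With this toy vocabulary, the frequent word stays in one piece while the rare word falls apart into three pieces, mirroring the three IDs in the output above.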

So we need to

  • apply the sub-word tokenizer to every word in our input samples, and
  • whenever it does split up a word, tag each sub-word like the entire word.

This can be done as follows:

import numpy as np
from tqdm import tqdm

def tokenize_sample(sample):
    # tokenize word by word; every sub-word inherits the tag of its word
    seq = [
        (subtoken, tag)
        for token, tag in sample
        # [1:-1] drops the start and end markers added by each tokenizer call
        for subtoken in tokenizer(token)['input_ids'][1:-1]
    ]
    # re-add the start (3) and end (4) markers around the whole sentence
    return [(3, 'O')] + seq + [(4, 'O')]

def preprocess(samples):
    tag_index = {tag: i for i, tag in enumerate(schema)}
    tokenized_samples = list(tqdm(map(tokenize_sample, samples),
                                  total=len(samples)))
    max_len = max(map(len, tokenized_samples))
    # padded positions keep the token ID 0 and the tag index 0 ('_')
    X = np.zeros((len(samples), max_len), dtype=np.int32)
    y = np.zeros((len(samples), max_len), dtype=np.int32)
    for i, sentence in enumerate(tokenized_samples):
        for j, (subtoken_id, tag) in enumerate(sentence):
            X[i, j] = subtoken_id
            y[i, j] = tag_index[tag]
    return X, y

X_train, y_train = preprocess(train_samples)
X_val, y_val = preprocess(val_samples)
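
To see what the alignment and padding performed by tokenize_sample and preprocess produce, here is a dependency-free sketch on two toy sentences. The function toy_subword_ids is a hypothetical stand-in for the real tokenizer; it reuses the IDs from the example output above, and the tag 'X' and the three-tag schema are made up.

```python
def toy_subword_ids(word):
    """Hypothetical stand-in for tokenizer(word)['input_ids'][1:-1]."""
    table = {'eine': [155], 'Frage': [1685],
             'hochinteressante': [2426, 21477, 5004]}
    return table[word]


CLS_ID, SEP_ID, PAD_ID = 3, 4, 0
toy_schema = ['_', 'O', 'X']  # '_' at index 0 is the padding tag
tag_index = {tag: i for i, tag in enumerate(toy_schema)}


def align(sample):
    ids, tags = [CLS_ID], ['O']
    for word, tag in sample:
        sub_ids = toy_subword_ids(word)
        ids.extend(sub_ids)
        tags.extend([tag] * len(sub_ids))  # each sub-word inherits the word's tag
    return ids + [SEP_ID], tags + ['O']


def pad(rows, length, value):
    return [row + [value] * (length - len(row)) for row in rows]


toy_samples = [[('eine', 'O'), ('Frage', 'O')],
               [('eine', 'O'), ('hochinteressante', 'X'), ('Frage', 'O')]]
aligned = [align(sample) for sample in toy_samples]
max_len = max(len(ids) for ids, _ in aligned)
X = pad([ids for ids, _ in aligned], max_len, PAD_ID)
y = pad([[tag_index[t] for t in tags] for _, tags in aligned],
        max_len, tag_index['_'])
print(X)
print(y)
```

The single tag 'X' of the rare word is repeated three times in y, once per sub-word, and the shorter sentence is filled up with zeros in both arrays.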

Step 3: Fine-tuning BERT on our custom NER task

Training the model is now more or less the same as in the preceding post with our bi-LSTM model:

import pandas as pd
import tensorflow as tf

# EPOCHS and BATCH_SIZE are constants to choose according to your hardware
optimizer = tf.keras.optimizers.Adam(learning_rate=0.00001)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
history = model.fit(tf.constant(X_train), tf.constant(y_train),
                    validation_split=0.2, epochs=EPOCHS,
                    batch_size=BATCH_SIZE)

Well, except that the model now has quite a few more parameters, and training for just one epoch might take … some hours, depending on your hardware. Here’s the validation accuracy:

[Figure: validation accuracy history]

Note the restricted range of the accuracy axis, and that the x-axis measures the training time in seconds.
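
To make the loss used in training concrete: with from_logits=True, the sparse categorical cross-entropy of a single token is the negative log-softmax of the model's raw scores, evaluated at the true tag index. Here is a minimal plain-Python sketch; the logits are made up and stand for the scores of one token over a three-tag schema.

```python
import math


def sparse_categorical_crossentropy(logits, true_index):
    """Cross-entropy of one token: -log(softmax(logits)[true_index])."""
    shift = max(logits)  # subtract the maximum for numerical stability
    log_norm = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_norm - logits[true_index]


logits = [2.0, 0.5, -1.0]  # made-up scores for three tags
loss_right = sparse_categorical_crossentropy(logits, 0)  # true tag scores highest
loss_wrong = sparse_categorical_crossentropy(logits, 2)  # true tag scores lowest
print(loss_right, loss_wrong)
```

The loss is small when the true tag has the highest logit and grows by exactly the logit gap otherwise. Keras averages these per-token values over all positions in a batch.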

Step 4: Evaluation — gotcha again!

Now that we have trained our custom-NER-BERT, we want to apply it and … face another problem: the model predicts tag annotations on the sub-word level, not on the word level. To obtain word-level annotations, we need to aggregate the sub-word level predictions for each word. Two obvious solutions come to mind:

  1. for each sub-word, choose the tag with the highest probability, and then use a majority vote, or
  2. average the predicted probabilities over all sub-words of a word, and then take the tag with the highest average probability.

Given predictions pred of shape (len(seq), len(schema)) for the sequence seq of sub-words of a single word, this amounts to taking the tag indexed by

  1. scipy.stats.mode(np.argmax(pred, axis=-1)).mode, using the package SciPy, or
  2. np.argmax(np.mean(pred, axis=0)),

respectively, or, in the picture below, to going 1. first right, then down, or 2. first down, then right:

[Figure: sub-word prediction aggregation]
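
The two variants do not always agree. The following sketch uses plain Python lists instead of NumPy and SciPy so that it runs without dependencies; the probabilities are made up and describe one word that was split into three sub-words, with two possible tags.

```python
from collections import Counter


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def majority_vote(pred):
    """Variant 1: arg-max per sub-word, then the most frequent tag index."""
    votes = Counter(argmax(row) for row in pred)
    return votes.most_common(1)[0][0]


def mean_probability(pred):
    """Variant 2: average over the sub-words, then arg-max."""
    n_tags = len(pred[0])
    means = [sum(row[k] for row in pred) / len(pred) for k in range(n_tags)]
    return argmax(means)


# made-up probabilities: two sub-words lean slightly towards tag 0,
# the third is very confident about tag 1
pred = [[0.55, 0.45],
        [0.55, 0.45],
        [0.05, 0.95]]
print(majority_vote(pred), mean_probability(pred))
```

Here the majority vote picks tag 0, while averaging picks tag 1 because it takes the confidence of each sub-word prediction into account.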

We choose variant 2 and apply it to the model’s predictions as follows:

def aggregate(sample, predictions):
    results = []
    i = 1  # skip the start marker at position 0
    for token, y_true in sample:
        nr_subtoken = len(tokenizer(token)['input_ids']) - 2
        pred = predictions[i:i+nr_subtoken]
        i += nr_subtoken
        # summing yields the same arg-max as averaging over the sub-words
        y_pred = schema[np.argmax(np.sum(pred, axis=0))]
        results.append((token, y_true, y_pred))
    return results

# the model outputs unnormalised scores (logits), cf. from_logits=True above
y_probs = model.predict(X_val)[0]
predictions = [aggregate(sample, sample_probs)
               for sample, sample_probs in zip(val_samples, y_probs)]

Finally, we can evaluate the predictions on the level of tokens as a multi-class classification problem, again using scikit-learn as in the preceding blog post. Here is the scatterplot of the resulting F1 scores versus the support for each tag class:

[Figure: NER F1 score vs. support per tag class]
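
For readers who want to see what goes into such a plot, here is a minimal sketch that computes the F1 score and the support per tag class by hand. The post itself uses scikit-learn for this; the function per_class_f1 and the toy pairs below are made up for illustration.

```python
from collections import Counter


def per_class_f1(pairs):
    """Return {tag: (f1, support)} from (true_tag, predicted_tag) pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for y_true, y_pred in pairs:
        if y_true == y_pred:
            tp[y_true] += 1
        else:
            fn[y_true] += 1  # the true tag was missed ...
            fp[y_pred] += 1  # ... and another tag was predicted wrongly
    scores = {}
    for tag in set(tp) | set(fn):  # all tags that occur in the ground truth
        support = tp[tag] + fn[tag]
        f1 = 2 * tp[tag] / (2 * tp[tag] + fp[tag] + fn[tag])
        scores[tag] = (f1, support)
    return scores


pairs = [('O', 'O'), ('GS', 'GS'), ('GS', 'O'),
         ('RS', 'RS'), ('O', 'O'), ('O', 'GS')]
scores = per_class_f1(pairs)
print(scores)
```

With the aggregated predictions from above, the input pairs could be obtained as [(y_true, y_pred) for sample in predictions for _, y_true, y_pred in sample].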


Let’s see how our new results compare to those of the previous post. Note that I’ve let BERT train 50 times as long as the bi-LSTM:

[Figure: comparison of NER F1 scores per tag]

We see that BERT significantly outperforms the bi-LSTM on difficult classes in our task. Is this only because of the more powerful network architecture and the longer training time? No! The scatterplot above shows a clear correlation between the F1 score and the amount of training data per class, and points us to the key advantage of the present approach, namely the way it uses transfer learning:

  • Before (bi-LSTM), we used transfer learning only in the form of pre-trained word embeddings.
  • Now (BERT), we start from a fully trained language model that embodies much more knowledge.

The upshot is:

The less data we have, the more important transfer learning becomes.

Thomas Timmermann

Thomas did a PhD in Mathematics, gathered rich research experience, and joined the Münster team in the area of data science and machine learning. He is interested in everything related to AI and deep learning.

