NER with little data? Transformers to the rescue!

14.12.2020 | 8 minutes of reading time

How do you solve deep learning problems with too little labelled data? The answer, of course, is transfer learning. In this post, we will apply this concept to named entity recognition (NER) and

fine-tune a pre-trained BERT to extract information from legal texts,
encounter a token misalignment problem due to BERT’s preference for sub-word tokens, and
observe tremendous improvements on difficult classes compared to the hand-made bi-LSTM model of our previous posts.

Let’s get started!

The NER dataset and task

We again use the dataset presented by E. Leitner, G. Rehm and J. Moreno-Schneider in Fine-grained Named Entity Recognition in Legal Documents. It consists of German court decisions with annotations of entities referring to legal norms, court decisions, legal literature and so on, of the following form:

‘Trotz der zweifelhaften Bewertung von MDMA als ” harte Droge “ ( vgl. BGH , Beschluss vom 3. Februar 1999 – 5 StR 705/98 , juris Rn. 2 [RS] ; zum Meinungsstand Patzak in Körner / Patzak / Volkmer , BtMG , 8. Aufl. , Vorbem. zu §§ 29 ff. Rn. 213 [LIT] mwN ; Weber , BtMG , 5. Aufl. , § 1 Rn. 364 [LIT] mwN ) hat der Strafausspruch Bestand , da die verhängte Rechtsfolge jedenfalls angemessen ist ( § 354 Abs. 1a Satz 1 StPO [GS] ) . ‘

Here, the bracketed labels [RS], [LIT] and [GS] close annotated references to a court decision, to legal literature and to a legal norm, respectively.

The task for our model will be to annotate, given a sample sentence, each word with a tag that indicates whether this word is part of a reference to a legal norm, a court decision and so on. For more details, see the first post of this series.

The transformer revolution

In case you haven’t read about transformers, here’s a summary. For details on the original transformer architecture, see the original paper or one of the many blog posts on the topic.

Transformers transformed natural language processing (NLP) with

a revolutionary attention mechanism that replaces convolutional or recurrent architectures,
a shift in transfer learning from pre-training (word vectors) for feature extraction to training generic language models plus fine-tuning on downstream tasks, and
an exponential growth of model size that brought us performance on par with humans on a number of NLP tasks, but also exploding resource consumption with diminishing returns.

To leverage transformers for our custom NER task, we’ll use the Python library huggingface transformers, which provides

a model repository including BERT, GPT-2 and others, pre-trained in a variety of languages,
wrappers for downstream tasks like classification, named entity recognition, summarization, et cetera, and
convenient ways to fine-tune on downstream tasks, e.g. in end-to-end pipelines or via TensorFlow or PyTorch.

Get your keyboard ready, or just follow along by reading!

Setting up the environment

Set up a virtual environment, install the required dependencies and download the dataset as in the preceding blog posts:

mkdir transformers_ner_project && cd transformers_ner_project
python3 -m venv .venv && source .venv/bin/activate
pip install numpy pandas tqdm scikit-learn "transformers[tf-cpu]"
mkdir -p data/01_raw
curl https://github.com/elenanereiss/Legal-Entity-Recognition/raw/master/data/dataset_courts.zip \
     -L -o data/01_raw/raw.zip
unzip data/01_raw/raw.zip -d data/01_raw

Alternatively, follow along with Jupyter running inside a TensorFlow Docker container, or with a Google Colab notebook.

Step 1: Loading a pre-trained BERT

With huggingface transformers, it’s super-easy to get a state-of-the-art pre-trained transformer model nicely packaged for our NER task. We choose a pre-trained German BERT model from the model repository and request a wrapped variant with an additional token classification layer for NER with just a few lines:

from transformers import AutoConfig, TFAutoModelForTokenClassification

MODEL_NAME = 'bert-base-german-cased'

# `schema` is the list of tags built in Step 2 below; it must exist before this runs.
config = AutoConfig.from_pretrained(MODEL_NAME, num_labels=len(schema))
model = TFAutoModelForTokenClassification.from_pretrained(MODEL_NAME,
                                                          config=config)
model.summary()

The result is a TensorFlow model consisting of the pre-trained BERT transformer, followed by a dropout layer and a dense classifier layer that predicts the tag of each token:

Model: "tf_bert_for_token_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bert (TFBERTMainLayer)       multiple                  109081344 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  16149     
=================================================================
Total params: 109,097,493
Trainable params: 109,097,493
Non-trainable params: 0
_________________________________________________________________
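As a quick sanity check of the summary above, the parameter count of the classifier layer can be reproduced by hand. The sketch below assumes the hidden size of 768 that BERT-base models use; the number of labels is then inferred from the count in the summary rather than taken from the post.

```python
def dense_params(n_inputs: int, n_outputs: int) -> int:
    """Number of parameters of a dense layer: one weight per input-output pair plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

HIDDEN_SIZE = 768            # assumption: hidden size of a BERT-base model
CLASSIFIER_PARAMS = 16149    # taken from the model summary
BERT_PARAMS = 109081344      # taken from the model summary

# Each label contributes HIDDEN_SIZE weights and one bias.
num_labels = CLASSIFIER_PARAMS // (HIDDEN_SIZE + 1)

print(num_labels)
print(BERT_PARAMS + dense_params(HIDDEN_SIZE, num_labels))
```

The inferred number of labels is the length of the schema built in Step 2, and the sum matches the total in the summary. The dropout layer has no parameters of its own.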

Step 2: Preprocessing

The data files contain sample sentences separated by blank lines, with one token and annotation in BIO format per line:

an O
Kapitalgesellschaften O
( O
§ B-GS
17 I-GS
Abs. I-GS
1 I-GS
und I-GS
2 I-GS
EStG I-GS
) O
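To make the BIO format concrete, here is a minimal sketch that groups BIO-tagged tokens back into entity spans. The helper name bio_to_spans is made up for this illustration and is not used in the rest of the post.

```python
def bio_to_spans(pairs):
    """Group (token, BIO tag) pairs into (entity class, text) spans."""
    spans, tokens, label = [], [], None
    for token, tag in pairs:
        starts_new = tag.startswith('B-') or (tag.startswith('I-') and tag[2:] != label)
        if starts_new:
            # A new entity begins: close the one that is still open, if any.
            if tokens:
                spans.append((label, ' '.join(tokens)))
            tokens, label = [token], tag[2:]
        elif tag.startswith('I-'):
            # Continuation of the current entity.
            tokens.append(token)
        else:
            # 'O' means outside of any entity.
            if tokens:
                spans.append((label, ' '.join(tokens)))
            tokens, label = [], None
    if tokens:
        spans.append((label, ' '.join(tokens)))
    return spans

sentence = [('an', 'O'), ('Kapitalgesellschaften', 'O'), ('(', 'O'),
            ('§', 'B-GS'), ('17', 'I-GS'), ('Abs.', 'I-GS'), ('1', 'I-GS'),
            ('und', 'I-GS'), ('2', 'I-GS'), ('EStG', 'I-GS'), (')', 'O')]
print(bio_to_spans(sentence))
# → [('GS', '§ 17 Abs. 1 und 2 EStG')]
```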

We read two data files line by line, store the sentences as lists of token-tag pairs, and determine the annotation schema just as we did for training our bi-LSTM model:

def load_data(filename: str):
    """Read a CoNLL file into a list of sentences of (token, tag) pairs."""
    samples, sample = [], []
    with open(filename, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.split()
            if parts:
                token, tag = parts
                # Drop the B-/I- prefix and keep only the entity class.
                sample.append((token, tag.split('-')[-1]))
            elif sample:
                # A blank line ends the current sentence.
                samples.append(sample)
                sample = []
    if sample:
        # The last sentence may not be followed by a blank line.
        samples.append(sample)
    return samples

train_samples = load_data('data/01_raw/bag.conll')
val_samples = load_data('data/01_raw/bgh.conll')
samples = train_samples + val_samples
# '_' gets index 0 and later serves as the label of padding positions.
schema = ['_'] + sorted({tag for sentence in samples
                             for _, tag in sentence})
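The following sketch shows on two made-up toy sentences what the schema looks like and how often each tag occurs. The toy data and the helper names build_schema and tag_support are assumptions for illustration only; the real samples come from load_data.

```python
from collections import Counter

# Hypothetical toy samples in the same (token, tag) format that load_data returns.
toy_samples = [
    [('§', 'GS'), ('17', 'GS'), ('EStG', 'GS'), (')', 'O')],
    [('BGH', 'RS'), (',', 'RS'), ('Beschluss', 'RS'), ('hat', 'O')],
]

def build_schema(samples):
    """Sorted list of all tags, with the padding label '_' at index 0."""
    return ['_'] + sorted({tag for sentence in samples for _, tag in sentence})

def tag_support(samples):
    """Number of tokens per tag, i.e. the support of each class."""
    return Counter(tag for sentence in samples for _, tag in sentence)

toy_schema = build_schema(toy_samples)
toy_index = {tag: i for i, tag in enumerate(toy_schema)}
toy_support = tag_support(toy_samples)

print(toy_schema)         # → ['_', 'GS', 'O', 'RS']
print(toy_index['GS'])    # → 1
print(toy_support['O'])   # → 2
```

Counting the support per tag on the real data is a cheap way to see in advance which classes have few examples.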

Gotcha! Sub-word tokenization?

But how do we feed the data into our transformer? The answer depends on the model that we chose because it has been pre-trained with a custom sub-word tokenizer. This tokenizer splits an input sentence into a sequence of sub-word tokens instead of words, using an algorithm like WordPiece (in BERT’s case), byte-pair encoding or a unigram language model. Let’s get hold of the tokenizer that was used to pre-train our model,

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

and apply it to some samples. The results are dictionaries where we’re mainly interested in the component input_ids:

`sample`	`tokenizer(sample)['input_ids']`
`'Das ist'`	`[3, 295, 127, 4]`
`'eine Frage'`	`[3, 155, 1685, 4]`
`'eine hochinteressante Frage'`	`[3, 155, 2426, 21477, 5004, 1685, 4]`

What do we see?

The tokenizer marks the beginning and the end of a sample with the ids 3 and 4, which stand for the special tokens [CLS] and [SEP], respectively.
Common words like 'Das', 'ist', 'eine', 'Frage' are treated as single tokens.
Less frequent words like 'hochinteressante' are split up into a sequence of sub-word tokens.
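To illustrate how such a split can come about, here is a minimal sketch of greedy longest-match-first sub-word splitting in the style of WordPiece. The tiny vocabulary is made up for this example, so the pieces it produces are not necessarily the ones the real German BERT tokenizer finds.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Split a word greedily into the longest pieces found in vocab.

    Pieces that do not start the word carry the continuation prefix '##'.
    """
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first and shorten it step by step.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word is mapped to the unknown token.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary.
toy_vocab = {'eine', 'Frage', 'hoch', '##interessant', '##e'}

print(wordpiece('Frage', toy_vocab))             # → ['Frage']
print(wordpiece('hochinteressante', toy_vocab))  # → ['hoch', '##interessant', '##e']
```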

So we need to

apply the sub-word tokenizer to every word in our input samples, and
whenever it does split up a word, tag each sub-word like the entire word.

This can be done as follows:

import numpy as np
from tqdm import tqdm

def tokenize_sample(sample):
    # Tokenize word by word and give every sub-word the tag of its word.
    seq = [
        (subtoken, tag)
        for token, tag in sample
        for subtoken in tokenizer(token)['input_ids'][1:-1]  # drop [CLS] and [SEP]
    ]
    # Wrap the sentence in the special start and end tokens (ids 3 and 4 here).
    return [(tokenizer.cls_token_id, 'O')] + seq + [(tokenizer.sep_token_id, 'O')]

def preprocess(samples):
    tag_index = {tag: i for i, tag in enumerate(schema)}
    tokenized_samples = list(tqdm(map(tokenize_sample, samples), total=len(samples)))
    max_len = max(map(len, tokenized_samples))
    # Positions beyond the end of a sentence stay 0, i.e. they get the label '_'.
    X = np.zeros((len(samples), max_len), dtype=np.int32)
    y = np.zeros((len(samples), max_len), dtype=np.int32)
    for i, sentence in enumerate(tokenized_samples):
        for j, (subtoken_id, tag) in enumerate(sentence):
            X[i, j] = subtoken_id
            y[i, j] = tag_index[tag]
    return X, y

X_train, y_train = preprocess(train_samples)
X_val, y_val = preprocess(val_samples)
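To see what tokenize_sample and the padding in preprocess do without downloading a model, here is a small sketch with a stand-in splitter. The function toy_split simply cuts words into chunks of at most four characters; it is a hypothetical replacement for the real tokenizer, and the helper names are made up for this illustration.

```python
def toy_split(word):
    """Hypothetical stand-in for a sub-word tokenizer: chunks of at most 4 characters."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]

def propagate_tags(sample, split=toy_split):
    """Give every sub-word piece the tag of the word it belongs to."""
    return [(piece, tag) for word, tag in sample for piece in split(word)]

def pad(sequences, pad_value):
    """Right-pad all sequences to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

sample = [('§', 'GS'), ('17', 'GS'), ('EStG', 'GS'),
          ('gilt', 'O'), ('entsprechend', 'O')]
pairs = propagate_tags(sample)
print(pairs[-3:])   # → [('ents', 'O'), ('prec', 'O'), ('hend', 'O')]

# Shorter sentences are filled up with the padding label '_'.
print(pad([['GS', 'GS'], ['O']], '_'))   # → [['GS', 'GS'], ['O', '_']]
```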

Step 3: Fine-tuning BERT on our custom NER task

Training the model is now more or less the same as in the preceding post with our bi-LSTM model:

import tensorflow as tf

NR_EPOCHS = 10
BATCH_SIZE = 16

optimizer = tf.keras.optimizers.Adam(learning_rate=0.00001)
# The model outputs raw logits, hence from_logits=True.
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
history = model.fit(tf.constant(X_train), tf.constant(y_train),
                    validation_split=0.2, epochs=NR_EPOCHS,
                    batch_size=BATCH_SIZE)

Well, except that the model now has quite a few more parameters, and training for just one epoch might take … some hours, depending on your hardware.

[Figure: validation accuracy over training time. Note the restricted range of the accuracy axis and that the x-axis measures the training time in seconds.]
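For readers who wonder what the loss used above computes, here is a minimal sketch of sparse categorical cross-entropy on raw logits for a single position: the negative log of the softmax probability of the true label. It is written in plain Python for illustration and is not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy of one position, computed from raw logits.

    Equals -log(softmax(logits)[label]); the maximum is subtracted
    before exponentiating for numerical stability.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# With two equal logits, both labels have probability 1/2, so the loss is log(2).
print(sparse_categorical_crossentropy([0.0, 0.0], 0))

# A higher logit for the true label gives a lower loss.
print(sparse_categorical_crossentropy([2.0, 0.0, 0.0], 0)
      < sparse_categorical_crossentropy([2.0, 0.0, 0.0], 1))   # → True
```

Since the preprocessing above labels padded positions with '_' and the training call passes no mask, those positions presumably enter the loss and the accuracy as well, which is worth keeping in mind when reading the accuracy values.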

Step 4: Evaluation — gotcha again!

Now that we have trained our custom-NER-BERT, we want to apply it and … face another problem: the model predicts tag annotations on the sub-word level, not on the word level. To obtain word-level annotations, we need to aggregate the sub-word level predictions for each word. Two obvious solutions come to mind:

for each sub-word, choose the tag with highest probability, and then use a majority vote, or
average the predicted probabilities over all sub-words of a word, and then take the tag with highest average probability.

Given predictions pred for a sequence seq of sub-words of shape (len(seq), len(schema)), this would amount to taking the tag indexed by

scipy.stats.mode(np.argmax(pred, axis=-1)).mode, using the package SciPy, or
np.argmax(np.mean(pred, axis=0)),

respectively. Picturing pred as a table with one row per sub-word and one column per tag, this means going 1. first right, then down, or 2. first down, then right.
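The two variants can disagree, as the following sketch on a made-up prediction for a word with three sub-words and two tags shows. For the majority vote it uses np.bincount instead of scipy.stats.mode to avoid the extra dependency; the function names are chosen for this illustration only.

```python
import numpy as np

def majority_vote(pred):
    """Variant 1: best tag per sub-word, then the most frequent of these tags."""
    votes = np.argmax(pred, axis=-1)
    # Ties are resolved in favour of the lowest tag index.
    return int(np.bincount(votes, minlength=pred.shape[1]).argmax())

def mean_probability(pred):
    """Variant 2: average over the sub-words, then the best tag of the average."""
    return int(np.argmax(np.mean(pred, axis=0)))

# Made-up predictions: two sub-words narrowly prefer tag 0, one strongly prefers tag 1.
pred = np.array([[0.51, 0.49],
                 [0.51, 0.49],
                 [0.05, 0.95]])

print(majority_vote(pred))      # → 0
print(mean_probability(pred))   # → 1
```

Variant 2 takes the confidence of each sub-word prediction into account, whereas the majority vote treats a narrow and a clear decision alike.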

We choose variant 2 and apply it to the model’s predictions as follows:

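def aggregate(sample, predictions):
    results = []
    i = 1  # skip the prediction for the leading [CLS] token
    for token, y_true in sample:
        nr_subtoken = len(tokenizer(token)['input_ids']) - 2
        pred = predictions[i:i+nr_subtoken]
        i += nr_subtoken
        # Summing instead of averaging yields the same argmax.
        y_pred = schema[np.argmax(np.sum(pred, axis=0))]
        results.append((token, y_true, y_pred))
    return results

# The model returns raw logits (one score per sub-word and tag), not probabilities.
# [0] selects them from the model output; depending on the version of
# transformers, `.logits` may be needed instead.
y_scores = model.predict(X_val)[0]
predictions = [aggregate(sample, sample_scores)
               for sample, sample_scores in zip(val_samples, y_scores)]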

Finally, we can evaluate the predictions on the level of tokens as a multi-class classification problem, again using scikit-learn as in the preceding blog post.

[Figure: scatterplot of the resulting F1 scores versus the support for each tag class.]
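As a rough sketch of what such an evaluation computes per tag class, the following plain-Python function derives precision, recall, F1 score and support from pairs of true and predicted tags, such as the last two components of the triples returned by aggregate. The toy pairs are made up, and the function is an illustration rather than the scikit-learn code used for the post.

```python
from collections import Counter

def per_class_scores(pairs):
    """Precision, recall, F1 and support per tag from (y_true, y_pred) pairs."""
    tp, fp, support = Counter(), Counter(), Counter()
    for y_true, y_pred in pairs:
        support[y_true] += 1
        if y_true == y_pred:
            tp[y_true] += 1
        else:
            fp[y_pred] += 1
    scores = {}
    for tag in sorted(support):
        predicted = tp[tag] + fp[tag]
        precision = tp[tag] / predicted if predicted else 0.0
        recall = tp[tag] / support[tag]
        # F1 is the harmonic mean of precision and recall.
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[tag] = {'precision': precision, 'recall': recall,
                       'f1': f1, 'support': support[tag]}
    return scores

# Made-up (true tag, predicted tag) pairs.
toy_pairs = [('GS', 'GS'), ('GS', 'GS'), ('GS', 'O'), ('O', 'O'),
             ('O', 'O'), ('RS', 'O'), ('RS', 'RS')]
scores = per_class_scores(toy_pairs)
for tag, values in scores.items():
    print(tag, round(values['f1'], 2), values['support'])
```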

Conclusion

Let’s see how our new results compare to those of the previous post. Note that I’ve let BERT train 50 times as long as the bi-LSTM.

[Figure: results of BERT compared with those of the bi-LSTM from the previous post.]

We see that BERT significantly outperforms the bi-LSTM on difficult classes in our task. Is this only because of the more powerful network architecture and the longer training time? No! The scatterplot above shows a clear correlation between the F1 score and the amount of training data per class, which points us to the key advantage of the present approach, namely the way it uses transfer learning:

Before (bi-LSTM), we used it only in the form of pre-trained word embeddings.
Now (BERT), we start from a fully trained language model that embodies much more knowledge.

The upshot is:

The less data we have, the more important transfer learning becomes.

Blog author

Thomas Timmermann

Data Scientist
