NER @ CLI: Custom-named entity recognition with spaCy in four lines

6.11.2020 | 7 minutes of reading time

Named entity recognition is a technical term for a solution to a key automation problem: extraction of information from text. Applications include

automation of business processes involving documents
distillation of data from the web by scraping websites
indexing document collections for scientific, investigative, or economic purposes

Some cases can be treated by classical approaches, for example:

forms with a fixed structure can be handled by layout-based rules
entities with a fixed pattern, such as phone numbers, can be extracted using regular expressions
occurrences of known entities like invoice numbers or customer names can be detected by matching against a database
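As a small illustration of the regular-expression approach, the following sketch extracts phone-number-like strings from a sentence. The pattern and the function name find_phone_numbers are assumptions made for this example. Real phone formats vary by country and would need a stricter rule.

```python
import re

# Hypothetical pattern for illustration: an optional "+", then at least two
# groups of digits separated by a single space, slash or hyphen.
PHONE_PATTERN = re.compile(r"\+?\d+(?:[ /-]\d+)+")

def find_phone_numbers(text):
    """Return all substrings of text that match PHONE_PATTERN."""
    return PHONE_PATTERN.findall(text)

text = "Call us at +49 30 1234567-0 or send invoice 4711 by mail."
print(find_phone_numbers(text))  # prints ['+49 30 1234567-0']
```

The isolated number 4711 is not matched because the pattern requires at least one separator. This also shows the limits of the approach: a rule this simple cannot tell a phone number from any other group of digits with separators.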

But when more flexibility is needed, named entity recognition (NER) may be just the right tool for the task. In a series of blog posts, we will explain and compare three approaches to extract references to laws and verdicts from court decisions:

First, we use the popular NLP library spaCy and train a custom NER model on the command line with no fuss.
Next, we build a bidirectional word-level LSTM model by hand with TensorFlow & Keras.
Finally, we fine-tune a pre-trained BERT model using huggingface transformers for state-of-the-art performance on the task.

This post introduces the dataset and task and covers the command line approach using spaCy.

Our dataset and task

The dataset for our task was presented by E. Leitner, G. Rehm and J. Moreno-Schneider in “Fine-grained Named Entity Recognition in Legal Documents” and can be found on GitHub. It consists of decisions from several German federal courts with annotations of entities referring to legal norms, court decisions, legal literature, and others of the following form:

The entire dataset comprises 66,723 sentences. We pick

court decisions of the Federal Labour Court (BAG) for training and
court decisions of the Federal Court of Justice (BGH) for validation.

The following histograms show the distribution of sentence lengths and token annotations for this slice, where ‘O’ denotes the “empty” annotation:

The NER task we want to solve is, given sample sentences, to annotate each token of each sentence with a tag that indicates whether this token is part of a reference to a legal norm, court decision, legal literature, and so on. Put differently, this is a sequence-labeling task where we classify each token as belonging to exactly one annotation class or to none.
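To make the task format concrete, here is a minimal sketch in plain Python. It represents one sentence fragment from the dataset as parallel lists of tokens and tags, and counts sentence lengths and tag frequencies, which are the statistics behind histograms like the ones mentioned above. The helper name dataset_statistics and the variable names are assumptions for this illustration. GS is the tag the dataset uses for the statute reference in this fragment.

```python
from collections import Counter

# One example: a token sequence and exactly one tag per token.
# "GS" marks tokens inside the statute reference, "O" marks all other tokens.
tokens = ["an", "Kapitalgesellschaften", "(", "§", "17", "Abs.", "1",
          "und", "2", "EStG", ")"]
tags = ["O", "O", "O", "GS", "GS", "GS", "GS", "GS", "GS", "GS", "O"]

# A dataset is a list of such (tokens, tags) pairs.
dataset = [(tokens, tags)]

def dataset_statistics(dataset):
    """Count sentence lengths and tag frequencies over all sentences."""
    length_counts = Counter()
    tag_counts = Counter()
    for sent_tokens, sent_tags in dataset:
        assert len(sent_tokens) == len(sent_tags)  # one tag per token
        length_counts[len(sent_tokens)] += 1
        tag_counts.update(sent_tags)
    return length_counts, tag_counts

length_counts, tag_counts = dataset_statistics(dataset)
print(length_counts)  # how many sentences have which length
print(tag_counts)     # how many tokens carry which tag
```

A model for this task is then any function that maps a token list to a tag list of the same length.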

Enter the NLP library spaCy

The Python library spaCy provides “industrial-strength natural language processing” covering

15 languages with small-, medium- or large-scale language models
the full NLP pipeline, from tokenization through word embeddings to part-of-speech tagging and parsing
many NLP tasks like classification, similarity estimation or named entity recognition

We now show how to use it for our NER task with no knowledge of deep learning or NLP.

Get your keyboard ready!

Step 0: Setup

To experiment along, you need Python 3. Fire up a terminal to work on the command line, create a folder for this experiment, switch to this folder and create and activate a virtual environment with

python3 -m venv .venv
source .venv/bin/activate

If you are on Windows, switch to the Windows Subsystem for Linux or replace the last line with

.venv\Scripts\activate.bat

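Next, install spaCy and download the medium-sized German language model. The commands in this post use the command line interface of spaCy 2.x, which changed in version 3, so we restrict the installation to version 2:

pip install "spacy<3"
python -m spacy download de_core_news_md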

Step 1: Get the NER data ready

The dataset is hosted on GitHub and contained in one zip file which we download and unzip:

mkdir -p data/01_raw
curl https://github.com/elenanereiss/Legal-Entity-Recognition/raw/master/data/dataset_courts.zip \
     -L -o data/01_raw/raw.zip
unzip data/01_raw/raw.zip -d data/01_raw

Each of the unzipped files contains sample sentences from one court. The sentences come as paragraphs separated by blank lines, with one token and annotation in BIO format per line as follows:

an O
Kapitalgesellschaften O
( O
§ B-GS
17 I-GS
Abs. I-GS
1 I-GS
und I-GS
2 I-GS
EStG I-GS
) O
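To illustrate how this format encodes entities, the following sketch reads BIO-formatted text and groups the tagged tokens into entity spans. The function names read_bio_sentences and bio_to_spans are assumptions for this example and are not part of spaCy or of the dataset tooling.

```python
def read_bio_sentences(text):
    """Split BIO-formatted text into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:                     # a blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((parts[0], parts[-1]))  # first column token, last column tag
    if current:                           # last sentence without trailing blank line
        sentences.append(current)
    return sentences

def bio_to_spans(sentence):
    """Group B-/I- tagged tokens into (label, text) entity spans."""
    spans, label, words = [], None, []
    for token, tag in sentence:
        if tag.startswith("I-") and label == tag[2:]:
            words.append(token)           # continue the open entity
            continue
        if label is not None:             # close the currently open entity
            spans.append((label, " ".join(words)))
            label, words = None, []
        if tag != "O":                    # "B-X" (or a stray "I-X") opens a new one
            label, words = tag[2:], [token]
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans

sample = """an O
Kapitalgesellschaften O
( O
§ B-GS
17 I-GS
Abs. I-GS
1 I-GS
und I-GS
2 I-GS
EStG I-GS
) O
"""

for sentence in read_bio_sentences(sample):
    print(bio_to_spans(sentence))  # prints [('GS', '§ 17 Abs. 1 und 2 EStG')]
```

The B- prefix marks the first token of an entity and I- marks its continuation, so two adjacent entities of the same class can still be told apart.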

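To train with spaCy, we convert the files for the two courts to spaCy’s JSON format with the convert tool:

mkdir -p data/02_train data/03_val
python -m spacy convert --converter ner data/01_raw/bag.conll data/02_train
python -m spacy convert --converter ner data/01_raw/bgh.conll data/03_val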

Along the way, we obtain some status information:

To check for potential problems before training, we inspect the data with spaCy’s debug-data tool:

python -m spacy debug-data de data/02_train data/03_val -p ner -b de_core_news_md

which produces the following output:

As we have seen before, some tags occur extremely rarely, so we can’t expect the model to learn them very well. Moreover, we see that the language model knows almost all words occurring in the dataset, which may come as a surprise.

Step 2: Train the NER model

To obtain a custom model for our NER task, we use spaCy’s train tool as follows:

python -m spacy train de data/04_models/md data/02_train data/03_val \
    --base-model de_core_news_md --pipeline 'ner' -R -n 20

Once training has finished, we evaluate the best model on the validation data:

python -m spacy evaluate data/04_models/md/model-best data/03_val

This outputs the precision, recall and F1-score for the NER task again (NER P, NER R, NER F):

Time	Words	Words/s	TOK	POS	UAS	LAS	NER P	NER R	NER F	Textcat
4.37	177835	40663	100.00	0.00	0.00	0.00	70.15	60.09	64.73	0.00
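As a quick check on how the three NER columns relate, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes NER F from the reported NER P and NER R. The function name f1_score is an assumption for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# NER P and NER R as reported by the evaluate command (in percent).
print(round(f1_score(70.15, 60.09), 2))  # prints 64.73
```

Because the harmonic mean is pulled toward the smaller value, the lower recall weighs on the F1-score more than the higher precision lifts it.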

The overall performance looks moderate. For better results, one could use

the large language model de_core_news_lg
more training steps
more training data (we only used a subset of the dataset).

As an example, training the large model for 40 epochs yields the following scores:

Time	Words	Words/s	TOK	POS	UAS	LAS	NER P	NER R	NER F	Textcat
4.52	177835	39339	100.00	0.00	0.00	0.00	73.72	64.39	68.74	0.00

Apparently, the problem is not the model, but the data: some tag categories appear very rarely, so it’s hard for the model to learn them. For a more thorough evaluation, we need to see the scores for each tag category.

Step 3: Use the model for named entity recognition

To use our new model and to see how it performs on each annotation class, we need to use the Python API of spaCy. To experiment along, activate the virtual environment again, install Jupyter and start a notebook with

pip install jupyter
jupyter notebook spacy_ner.ipynb

If it did not open by itself, open a web browser pointing to the URL output by the last command, and enter the following Python code blocks in code cells to work along.

Let us load the best-trained model version:

import spacy
MODEL_PATH = 'data/04_models/md/model-best'
nlp = spacy.load(MODEL_PATH)

It can be applied to detect entities in new text as follows:

sample = """Trotz der zweifelhaften Bewertung von MDMA als "harte Droge"
( vgl. BGH , Beschluss vom 3. Februar 1999 - 5 StR 705/98 ,
juris Rn. 2 ; zum Meinungsstand Patzak in Körner / Patzak / Volkmer
, BtMG , 8. Aufl. , Vorbem. zu §§ 29 ff. Rn. 213 mwN ; Weber , BtMG ,
5. Aufl. , § 1 Rn. 364 mwN ) hat der Strafausspruch Bestand ,
da die verhängte Rechtsfolge jedenfalls angemessen ist 
(§ 354 Abs. 1a Satz 1 StPO) ."""

doc = nlp(sample)

for ent in doc.ents:
    print(ent.label_, ':', ent.text)

The output looks as follows:

RS : BGH , Beschluss vom 3. Februar 1999 - 5 StR 705/98 , juris Rn. 2
LIT : Patzak in Körner / Patzak / Volkmer , BtMG , 8.
GS : § 29 ff
GS : Rn
LIT : Weber , BtMG , 5.
GS : § 1 Rn
GS : § 354 Abs. 1a Satz 1 StPO

Step 4: Evaluate the model

To obtain scores for the model on the level of annotation classes, we continue to work in the Jupyter notebook and load the validation data:

from spacy.gold import GoldCorpus

VAL_FILENAME = 'data/03_val/bgh.json'

val_corpus = GoldCorpus(VAL_FILENAME, VAL_FILENAME)
docs_golds = list(val_corpus.train_docs(nlp))
docs, golds = zip(*docs_golds)

To apply our model to these documents, we need to use only the NER component of the model’s NLP pipeline:

# Fetch the NER component by name instead of relying on its pipeline position
ner = nlp.get_pipe('ner')
predictions = list(ner.pipe(docs))

Finally, we can evaluate the performance using the Scorer class. Along the way, we count how often each tag occurred:

from spacy.scorer import Scorer
from collections import Counter

tag_counts = Counter()
scorer = Scorer()
for y_p, y_t in zip(predictions, golds):
    scorer.score(y_p, y_t)
    for tag in y_t.ner:
        tag_counts[tag.split('-')[-1]] += 1
print(scorer.ents_p, scorer.ents_r, scorer.ents_f)

These are the same scores that we obtained by validating on the command line. Additionally, the ents_per_type attribute of scorer gives us access to the tag-level scores. With pandas installed (pip install pandas), we can put these scores in a table as follows:

import pandas as pd

scores = (pd.DataFrame.from_dict(scorer.ents_per_type, orient='index')
                      .join(pd.Series(tag_counts, name='support'))
                      .sort_values(by='support', ascending=False))
scores
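To show what such tag-level scores measure, here is a library-free sketch of entity-level scoring by exact span matching. It illustrates the idea only and is not spaCy’s implementation. The function name per_type_scores, the span representation and the toy data are assumptions for this example.

```python
from collections import Counter

def per_type_scores(gold_spans, predicted_spans):
    """Entity-level precision, recall and F1 (in percent) per label.

    Both arguments are sets of (sentence_id, start, end, label) tuples;
    a prediction only counts as correct if it matches a gold span exactly.
    """
    true_pos, n_pred, n_gold = Counter(), Counter(), Counter()
    for span in predicted_spans:
        n_pred[span[-1]] += 1
        if span in gold_spans:
            true_pos[span[-1]] += 1
    for span in gold_spans:
        n_gold[span[-1]] += 1

    scores = {}
    for label in sorted(set(n_pred) | set(n_gold)):
        p = 100 * true_pos[label] / n_pred[label] if n_pred[label] else 0.0
        r = 100 * true_pos[label] / n_gold[label] if n_gold[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"p": p, "r": r, "f": f}
    return scores

# Hypothetical toy data: (sentence_id, start_token, end_token, label)
gold = {(0, 3, 10, "GS"), (1, 0, 4, "RS"), (1, 7, 9, "GS")}
pred = {(0, 3, 10, "GS"), (1, 0, 3, "RS"), (1, 7, 9, "GS"), (1, 11, 12, "GS")}

for label, row in per_type_scores(gold, pred).items():
    print(label, {k: round(v, 2) for k, v in row.items()})
```

In the toy data, the RS prediction misses the gold span by one token and therefore counts as both a false positive and a false negative. This strictness is one reason why long references such as court decisions are hard to score well on.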

For the medium model trained over 20 epochs, we obtain the following result:

tag	p	r	f	support
RS	62.77	63.34	63.06	18615
GS	84.93	84.93	84.93	7640
LIT	73.70	83.82	78.44	4685
GRT	67.88	32.40	43.86	662
RR	94.37	81.03	87.19	560
EUN	14.28	7.81	10.10	540
PER	25.00	1.62	3.05	494
ORG	32.25	28.57	30.30	176
VT	4.86	29.16	8.33	150
INN	33.33	8.00	12.90	124
UN	47.61	16.39	24.39	122
LD	36.87	65.82	47.27	95
ST	28.12	11.25	16.07	85
VO	0.00	0.00	0.00	81
MRK	0.00	0.00	0.00	58
AN	50.00	1.92	3.70	57
STR	0.00	0.00	0.00	35
LDS	33.33	10.00	15.38	10
VS	0.00	0.00	0.00	10

This gives a much clearer picture. Plotting the F1-Score (f) versus the number of tokens with this tag shows a correlation between poor performance and shortage of training data:
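The relationship can also be quantified without a plot. The sketch below computes the Pearson correlation between the F1-score and the logarithm of the support, using the values from the table above. Taking the logarithm is a choice made for this example, because the support spans several orders of magnitude. The function name pearson is likewise an assumption.

```python
import math

# (tag, F1-score, support) copied from the table above.
rows = [
    ("RS", 63.06, 18615), ("GS", 84.93, 7640), ("LIT", 78.44, 4685),
    ("GRT", 43.86, 662), ("RR", 87.19, 560), ("EUN", 10.10, 540),
    ("PER", 3.05, 494), ("ORG", 30.30, 176), ("VT", 8.33, 150),
    ("INN", 12.90, 124), ("UN", 24.39, 122), ("LD", 47.27, 95),
    ("ST", 16.07, 85), ("VO", 0.00, 81), ("MRK", 0.00, 58),
    ("AN", 3.70, 57), ("STR", 0.00, 35), ("LDS", 15.38, 10),
    ("VS", 0.00, 10),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

log_support = [math.log10(support) for _, _, support in rows]
f1_scores = [f1 for _, f1, _ in rows]
print(round(pearson(log_support, f1_scores), 2))
```

A positive coefficient means that tags with more annotated tokens tend to reach higher F1-scores. With only 19 tags, this is an indication rather than a statistical proof.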

What next?

We’ve seen that spaCy allows us to train a model for extracting information from text with a few commands on the command line and with no knowledge of deep learning or NLP. The options to improve performance and to adjust the model to our needs are, however, limited. In the next two posts, we shall do better and

train a standard bi-directional LSTM model by hand, using TensorFlow & Keras
train state-of-the-art transformer models using huggingface transformers.

Stay tuned!

Blog author

Thomas Timmermann

Data Scientist
