Named entity recognition is a technical term for a solution to a key automation problem: extraction of information from text. Applications include
- automation of business processes involving documents
- distillation of data from the web by scraping websites
- indexing document collections for scientific, investigative, or economic purposes
Some cases can be treated by classical approaches, for example:
- forms with a fixed structure can be handled by layout-based rules
- entities with a fixed pattern, such as phone numbers, can be extracted using regular expressions
- occurrences of known entities like invoice numbers or customer names can be detected by matching against a database
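As a rough illustration of the pattern-based approach, the following sketch pulls phone-number-like strings out of free text with a regular expression. The pattern, the function name and the sample text are assumptions made up for this example; real phone formats vary far more than this pattern allows.

```python
import re

# Hypothetical pattern: optional "+", a short digit group, then one or more
# digit groups separated by a space, slash or hyphen. Not a full phone grammar.
PHONE_PATTERN = re.compile(r"\+?\d{2,4}(?:[ /-]\d{2,})+")

def extract_phone_numbers(text):
    """Return all substrings of `text` that match PHONE_PATTERN."""
    return PHONE_PATTERN.findall(text)

sample_text = "Call us at +49 30 1234567 or 030/7654321 before 18:00."
print(extract_phone_numbers(sample_text))
```

Such a rule is cheap and transparent, but it breaks as soon as the entities stop following one fixed shape, which is where a learned model becomes attractive.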
But when more flexibility is needed, named entity recognition (NER) may be just the right tool for the task. In a sequence of blog posts, we will explain and compare three approaches to extracting references to laws and verdicts from court decisions:
- First, we use the popular NLP library spaCy and train a custom NER model on the command line with no fuss.
- Next, we build a bidirectional word-level LSTM model by hand with TensorFlow & Keras.
- Finally, we fine-tune a pre-trained BERT model using Hugging Face Transformers for state-of-the-art performance on the task.
This post introduces the dataset and task and covers the command line approach using spaCy.
Our dataset and task
The dataset for our task was presented by E. Leitner, G. Rehm and J. Moreno-Schneider in "Fine-grained Named Entity Recognition in Legal Documents" and can be found on GitHub. It consists of decisions from several German federal courts with annotations of entities referring to legal norms, court decisions, legal literature, and others of the following form:
The entire dataset comprises 66,723 sentences. We pick
- court decisions of the Federal Labour Court (BAG) for training and
- court decisions of the Federal Court of Justice (BGH) for validation.
The following histograms show the distribution of sentence lengths and token annotations for this slice, where ‘O’ denotes the “empty” annotation:
The NER task we want to solve is, given sample sentences, to annotate each token of each sentence with a tag that indicates whether this token is part of a reference to a legal norm, court decision, legal literature, and so on. Put differently, this is a sequence-labeling task where we classify each token as belonging to exactly one annotation class or to none.
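To make the token-level view concrete, here is a minimal sketch that turns per-token tags back into entity spans. It assumes tags with B-/I-/O prefixes (the BIO scheme the raw files use) and borrows the label GS from the model output shown later in this post; the function name and the hand-written tagging of the sample tokens are assumptions for illustration only.

```python
def bio_to_spans(tokens, tags):
    """Collect (label, text) pairs from parallel token and BIO tag lists."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts (a stray I- tag is treated as a start).
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = label, [token]
        elif prefix == "I":
            # The current entity continues.
            current_tokens.append(token)
        else:
            # "O": the token belongs to no entity, so close any open one.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans

# Hand-tagged toy sentence fragment (an assumption, not taken from the dataset).
tokens = ["(", "§", "354", "Abs.", "1a", "Satz", "1", "StPO", ")", "."]
tags = ["O", "B-GS", "I-GS", "I-GS", "I-GS", "I-GS", "I-GS", "I-GS", "O", "O"]
print(bio_to_spans(tokens, tags))
```

A model for this task only has to predict the tag list; the grouping into entities is then a mechanical step like the one above.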
Enter the NLP library spaCy
The Python library spaCy provides “industrial-strength natural language processing” covering
- 15 languages with small-, medium- or large-scale language models
- the full NLP pipeline, from tokenization and word embeddings to part-of-speech tagging and parsing
- many NLP tasks like classification, similarity estimation or named entity recognition
We now show how to use it for our NER task with no knowledge of deep learning or NLP.
Get your keyboard ready!
Step 0: Setup
To follow along, you need Python 3. Open a terminal, create a folder for this experiment, switch to this folder, and create and activate a virtual environment with
python3 -m venv .venv
source .venv/bin/activate
If you are on Windows, switch to the Windows Subsystem for Linux or replace the last line by
.venv\Scripts\activate
Next, install spaCy and download the medium-sized German language model with
pip install spacy
python -m spacy download de_core_news_md
Step 1: Get the NER data ready
The dataset is hosted on GitHub and contained in one zip file which we download and unzip:
mkdir -p data/01_raw
curl https://github.com/elenanereiss/Legal-Entity-Recognition/raw/master/data/dataset_courts.zip \
-L -o data/01_raw/raw.zip
unzip data/01_raw/raw.zip -d data/01_raw
Each of the unzipped files contains sample sentences from one court. The sentences come as paragraphs separated by blank lines, with one token and annotation in BIO format per line as follows:
We simply use
- data/01_raw/bag.conll for training and
- data/01_raw/bgh.conll for validation,
and convert these files into the format required by spaCy:
python -m spacy convert --converter ner data/01_raw/bag.conll data/02_train
python -m spacy convert --converter ner data/01_raw/bgh.conll data/03_val
Along the way, we obtain some status information:
To catch potential problems before training, we inspect the data with spaCy’s debug-data tool:
python -m spacy debug-data de data/02_train data/03_val -p ner -b de_core_news_md
which produces the following output:
As we have seen before, some tags occur extremely rarely, so we can’t expect the model to learn them very well. Moreover, we see that the language model knows almost all words occurring in the dataset, which may come as a surprise.
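To look at the raw files independently of spaCy, a small reader for the token-per-line layout could look like the sketch below. It assumes each non-blank line holds a token and its tag separated by whitespace; the function names and the tiny sample are made up for illustration and are not part of the dataset or of spaCy.

```python
from collections import Counter

def read_conll(lines):
    """Split token-per-line input into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split()[:2]  # assumes "token tag" on each line
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences

def count_labels(sentences):
    """Count tokens per annotation class, ignoring the B-/I- prefix."""
    return Counter(tag.split("-")[-1] for sent in sentences for _, tag in sent)

# Tiny made-up sample in the same layout. To read a real file, pass
# open("data/01_raw/bag.conll", encoding="utf-8") instead of this list.
sample = ["§ B-GS", "1 I-GS", "StPO I-GS", "gilt O", "", "BGH B-RS", "urteilt O"]
sentences = read_conll(sample)
print(len(sentences), count_labels(sentences))
```

Counting labels this way gives a quick cross-check of which annotation classes are rare before any training starts.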
Step 2: Train the NER model
To obtain a custom model for our NER task, we use spaCy’s train tool as follows:
python -m spacy train de data/04_models/md data/02_train data/03_val \
--base-model de_core_news_md --pipeline 'ner' -R -n 20
which tells spaCy to train a new model
- for the German language, whose code is de
- saving the trained model in data/04_models/md
- using the training and validation data in data/02_train and data/03_val
- starting from the base model de_core_news_md
- where the task to be trained is ner, that is, named entity recognition
- replacing the standard named entity recognition component via -R
- using 20 epochs (-n 20), that is, 20 runs over the entire training data.
Depending on your system, training may take anywhere from several minutes to a few hours. If you have an NVIDIA GPU with CUDA set up, you can try to speed up the training; see spaCy’s installation and training instructions.
To track the progress, spaCy displays a table showing the loss (NER loss), precision (NER P), recall (NER R) and F1-score (NER F) reached after each epoch:
| Itn | NER Loss | NER P | NER R | NER F | Token % | CPU WPS |
At the end, spaCy tells you that it stored the last and the best model version in data/04_models/md/model-final and data/04_models/md/model-best, respectively. To check the performance of the model after training, we evaluate it on the validation data:
python -m spacy evaluate data/04_models/md/model-best data/03_val
This outputs the precision, recall and F1-score for the NER task again (NER P, NER R, NER F):
| Time | Words | Words/s | TOK | POS | UAS | LAS | NER P | NER R | NER F | Textcat |
The overall performance looks moderate. For better results, one could use
- the large language model
- more training steps
- more training data (we only used a subset of the dataset).
As an example, training the large model for 40 epochs yields the following scores:
| Time | Words | Words/s | TOK | POS | UAS | LAS | NER P | NER R | NER F | Textcat |
Apparently, the problem is not the model but the data: some tag categories appear very rarely, so it’s hard for the model to learn them. For a more thorough evaluation, we need to see the scores for each tag category.
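As a reminder of what the NER P, NER R and NER F columns measure, here is a minimal sketch of precision, recall and F1-score computed over entity spans. It assumes that an entity counts as correct only if its boundaries and label match exactly; the function name and the example spans are invented for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted entity spans against gold spans by exact match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of predicted entities that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold entities that were found.
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Spans are (start_token, end_token, label); the values are made up.
gold_spans = {(0, 3, "RS"), (5, 9, "GS"), (12, 14, "LIT")}
predicted_spans = {(0, 3, "RS"), (5, 8, "GS")}
print(precision_recall_f1(predicted_spans, gold_spans))
```

Note how the second prediction counts as an error although it overlaps a gold entity: under exact matching, a span that is off by one token is as wrong as a missing one.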
Step 3: Use the model for named entity recognition
To use our new model and to see how it performs on each annotation class, we need to use the Python API of spaCy. To follow along, activate the virtual environment again, install Jupyter and start a notebook with
pip install jupyter
jupyter notebook spacy_ner.ipynb
If it did not open by itself, open a web browser pointing to the URL output by the last command, and enter the following Python code blocks in code cells to work along.
Let us load the best-trained model version:
import spacy
MODEL_PATH = 'data/04_models/md/model-best'
nlp = spacy.load(MODEL_PATH)
It can be applied to detect entities in new text as follows:
sample = """Trotz der zweifelhaften Bewertung von MDMA als "harte Droge"
( vgl. BGH , Beschluss vom 3. Februar 1999 - 5 StR 705/98 ,
juris Rn. 2 ; zum Meinungsstand Patzak in Körner / Patzak / Volkmer
, BtMG , 8. Aufl. , Vorbem. zu §§ 29 ff. Rn. 213 mwN ; Weber , BtMG ,
5. Aufl. , § 1 Rn. 364 mwN ) hat der Strafausspruch Bestand ,
da die verhängte Rechtsfolge jedenfalls angemessen ist
(§ 354 Abs. 1a Satz 1 StPO) ."""
doc = nlp(sample)
for ent in doc.ents:
    print(ent.label_, ':', ent.text)
The output looks as follows:
RS : BGH , Beschluss vom 3. Februar 1999 - 5 StR 705/98 , juris Rn. 2
LIT : Patzak in Körner / Patzak / Volkmer , BtMG , 8.
GS : § 29 ff
GS : Rn
LIT : Weber , BtMG , 5.
GS : § 1 Rn
GS : § 354 Abs. 1a Satz 1 StPO
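For longer texts it can be handy to group the detected entities by label. The helper below is a small sketch of that idea; `group_entities` is a hypothetical name, and the namedtuple merely stands in for spaCy's entity spans (which expose `label_` and `text`, as used above) so that the example runs without the trained model.

```python
from collections import defaultdict, namedtuple

def group_entities(ents):
    """Map each label to the list of entity texts carrying that label."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

# Stand-in for spaCy spans; with the model loaded, pass doc.ents instead.
Ent = namedtuple("Ent", ["label_", "text"])
demo_ents = [
    Ent("GS", "§ 354 Abs. 1a Satz 1 StPO"),
    Ent("GS", "Rn"),
    Ent("LIT", "Weber , BtMG , 5."),
]
print(group_entities(demo_ents))
```

Grouped like this, questionable predictions such as the isolated "Rn" tagged as GS are easier to spot.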
Step 4: Evaluate the model
To obtain scores for the model on the level of annotation classes, we continue to work in the Jupyter notebook and load the validation data:
from spacy.gold import GoldCorpus
VAL_FILENAME = 'data/03_val/bgh.json'
val_corpus = GoldCorpus(VAL_FILENAME, VAL_FILENAME)
docs_golds = list(val_corpus.train_docs(nlp))
docs, golds = zip(*docs_golds)
To apply our model to these documents, we need to use only the NER component of the model’s NLP pipeline:
ner = nlp.get_pipe('ner')
predictions = list(ner.pipe(docs))
Finally, we can evaluate the performance using the Scorer class. Along the way, we count how often each tag occurred:
from spacy.scorer import Scorer
from collections import Counter
tag_counts = Counter()
scorer = Scorer()
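for y_p, y_t in zip(predictions, golds):
    # Accumulate the scores for this predicted document and its gold annotation.
    scorer.score(y_p, y_t)
    # Count tokens per annotation class, ignoring the B-/I- prefix.
    for tag in y_t.ner:
        tag_counts[tag.split('-')[-1]] += 1
print(scorer.ents_p, scorer.ents_r, scorer.ents_f)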
These are the same scores that we obtained by validating on the command line. Additionally, the ents_per_type attribute of scorer gives us access to the tag-level scores. With pandas installed (pip install pandas), we can put these scores in a table as follows:
import pandas as pd
scores = pd.DataFrame.from_dict(scorer.ents_per_type, orient='index')
For the medium model trained over 20 epochs, we obtain the following result:
This gives a much clearer picture. Plotting the F1-score (f) versus the number of tokens with this tag shows a correlation between poor performance and shortage of training data:
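One way to line up the per-tag F1-scores with the tag counts before plotting is sketched below in plain Python. It assumes inputs shaped like scorer.ents_per_type (a dict of dicts with an "f" entry) and tag_counts from above; `scores_with_counts` is a hypothetical helper, and the demo numbers are invented placeholders, not results from the trained model.

```python
def scores_with_counts(ents_per_type, tag_counts):
    """Return (tag, f1, token_count) rows, rarest tag first."""
    rows = [
        (tag, scores["f"], tag_counts.get(tag, 0))
        for tag, scores in ents_per_type.items()
    ]
    return sorted(rows, key=lambda row: row[2])

# Placeholder inputs with made-up tags and numbers, for illustration only.
demo_scores = {
    "AAA": {"p": 90.0, "r": 80.0, "f": 84.7},
    "BBB": {"p": 50.0, "r": 20.0, "f": 28.6},
}
demo_counts = {"AAA": 5000, "BBB": 40}
for tag, f1, count in scores_with_counts(demo_scores, demo_counts):
    print(f"{tag}: F1={f1}, tokens={count}")
```

With the real objects, calling scores_with_counts(scorer.ents_per_type, tag_counts) would yield rows that can be passed to any plotting library.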
We’ve seen that, with a few commands on the command line, spaCy allows us to train a model for extracting information from text without any knowledge of deep learning or NLP. The options to improve performance and to adjust the model to our needs are, however, limited. In the two following posts, we shall do better and