NER @ CLI: Custom named entity recognition with spaCy in four lines


Named entity recognition is a technical term for a solution to a key automation problem: extraction of information from text. Applications include

  • automation of business processes involving documents
  • distillation of data from the web by scraping websites
  • indexing document collections for scientific, investigative, or economic purposes

Some cases can be treated by classical approaches, for example:

  • forms with a fixed structure can be handled by layout-based rules
  • entities with fixed pattern like phone numbers can be extracted using regular expressions
  • occurrences of known entities like invoice numbers or customer names can be detected by matching against a database
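To make the regular-expression case concrete, here is a minimal sketch in Python. The pattern and the helper name find_phone_numbers are illustrative assumptions of ours; a production-grade phone-number grammar would need to be considerably more careful.

```python
import re

# Hypothetical, deliberately simple pattern: an optional "+" followed by
# at least six digits that may be separated by single spaces, slashes,
# or hyphens.
PHONE_PATTERN = re.compile(r"\+?\d(?:[ /-]?\d){5,}")


def find_phone_numbers(text):
    """Return all substrings of `text` that match PHONE_PATTERN."""
    return PHONE_PATTERN.findall(text)


print(find_phone_numbers("Call +49 251 1234567 or 0251/7654321 today."))
```

Such a rule works as long as the entity follows a rigid pattern; references to laws and verdicts vary far more, which is where NER comes in.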

But when more flexibility is needed, named entity recognition (NER) may be just the right tool for the task. In a sequence of blog posts, we will explain and compare three approaches to extract references to laws and verdicts from court decisions:

  1. First, we use the popular NLP library spaCy and train a custom NER model on the command line with no fuss.
  2. Next, we build a bidirectional word-level LSTM model by hand with TensorFlow & Keras.
  3. Finally, we fine-tune a pre-trained BERT model using huggingface transformers for state-of-the-art performance on the task.

This post introduces the dataset and task and covers the command line approach using spaCy.

Our dataset and task

The dataset for our task was presented by E. Leitner, G. Rehm and J. Moreno-Schneider in “Fine-grained Named Entity Recognition in Legal Documents” and can be found on GitHub. It consists of decisions from several German federal courts with annotations of entities referring to legal norms, court decisions, legal literature, and others of the following form:

[Figure: example of annotated named entities]

The entire dataset comprises 66,723 sentences. We pick

  • court decisions of the Federal Labour Court (BAG) for training and
  • court decisions of the Federal Court of Justice (BGH) for validation.

The following histograms show the distribution of sentence lengths and token annotations for this slice, where ‘O’ denotes the “empty” annotation:

[Figure: histograms of sentence lengths and token annotations]

The NER task we want to solve is, given sample sentences, to annotate each token of each sentence with a tag which indicates whether this token is part of a reference to a legal norm, court decision, legal literature, and so on. Put differently, this is a sequence-labeling task in which we classify each token as belonging to exactly one annotation class or to none.
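The following sketch illustrates what token-level tags look like and how they translate back into entities. It assumes the BIO tagging scheme in which the dataset files are distributed (B- opens an entity, I- continues it, O marks the “empty” annotation); the helper name bio_to_spans is our own and not part of any library.

```python
def bio_to_spans(tags):
    """Convert a list of BIO tags into (label, start, end) spans.

    `end` is exclusive, so tokens[start:end] are the tokens of the entity.
    """
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the current entity only if the label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the entity that is currently open, if any.
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        # A B- tag (or a stray I- tag) opens a new entity.
        if tag != "O":
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


tokens = ["an", "Kapitalgesellschaften", "(", "§", "17", "Abs.",
          "1", "und", "2", "EStG", ")"]
tags = ["O", "O", "O", "B-GS", "I-GS", "I-GS",
        "I-GS", "I-GS", "I-GS", "I-GS", "O"]

for label, start, end in bio_to_spans(tags):
    print(label, ":", " ".join(tokens[start:end]))
```

For this sentence fragment, the loop prints a single GS entity covering “§ 17 Abs. 1 und 2 EStG”.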

Enter the NLP library spaCy

The Python library spaCy provides “industrial-strength natural language processing” covering

  • 15 languages with small-, medium- or large-scale language models
  • the full NLP pipeline, from tokenization through word embeddings to part-of-speech tagging and parsing
  • many NLP tasks like classification, similarity estimation or named entity recognition

We now show how to use it for our NER task with no knowledge of deep learning or NLP.

Get your keyboard ready!

Step 0: Setup

To experiment along, you need Python 3. Fire up a terminal to work on the command line, create a folder for this experiment, switch to this folder and create and activate a virtual environment with

python3 -m venv .venv
source .venv/bin/activate

In case you are on Windows, switch to the Windows Subsystem for Linux or replace the last line by

.venv\Scripts\activate.bat

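Next, install spaCy and download the medium-sized German language model. The commands in this post use the command-line interface of spaCy 2, so we pin the version accordingly:

pip install "spacy<3"
python -m spacy download de_core_news_md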

Step 1: Get the NER data ready

The dataset is hosted on GitHub and contained in one zip file which we download and unzip:

mkdir -p data/01_raw
curl https://github.com/elenanereiss/Legal-Entity-Recognition/raw/master/data/dataset_courts.zip \
     -L -o data/01_raw/raw.zip
unzip data/01_raw/raw.zip -d data/01_raw

Each of the unzipped files contains sample sentences from one court. The sentences come as paragraphs separated by blank lines, with one token and annotation in BIO format per line as follows:

an O
Kapitalgesellschaften O
( O
§ B-GS
17 I-GS
Abs. I-GS
1 I-GS
und I-GS
2 I-GS
EStG I-GS
) O
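To get a feeling for this format, here is a minimal sketch of a reader written in plain Python. It assumes one whitespace-separated token/tag pair per line and blank lines between sentences, as in the excerpt above; the function name read_conll is our own choice and not part of spaCy.

```python
from collections import Counter


def read_conll(lines):
    """Yield sentences as lists of (token, tag) pairs.

    Assumes one "token tag" pair per line and blank lines between sentences.
    """
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        # The tag is the last whitespace-separated field of the line.
        token, tag = line.rsplit(None, 1)
        sentence.append((token, tag))
    if sentence:
        yield sentence


sample = """an O
Kapitalgesellschaften O
( O
§ B-GS
17 I-GS
Abs. I-GS
1 I-GS
und I-GS
2 I-GS
EStG I-GS
) O
"""

sentences = list(read_conll(sample.splitlines()))
lengths = [len(sentence) for sentence in sentences]
# Strip the B-/I- prefix so that both count towards the same class.
tag_counts = Counter(tag.split("-")[-1]
                     for sentence in sentences for _, tag in sentence)
print(lengths, tag_counts)
```

Passing an open file such as open('data/01_raw/bag.conll', encoding='utf-8') instead of sample.splitlines() should give the sentence lengths and tag counts for a whole court, provided the file follows the layout assumed here.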

We simply use

  • file data/01_raw/bag.conll for training
  • file data/01_raw/bgh.conll for validation,

and convert these files into the format required by spaCy:

python -m spacy convert --converter ner data/01_raw/bag.conll data/02_train
python -m spacy convert --converter ner data/01_raw/bgh.conll data/03_val

Along the way, we obtain some status information:

[Screenshot: spaCy preprocessing output]

Before training, we look for potential problems in the data with spaCy’s debug-data tool:

python -m spacy debug-data de data/02_train data/03_val -p ner -b de_core_news_md

which produces the following output:

[Screenshot: spaCy data validation output]

As we have seen before, some tags occur extremely rarely, so we can’t expect the model to learn them very well. Moreover, we see that the language model knows almost all words occurring in the dataset, which may come as a surprise.

Step 2: Train the NER model

To obtain a custom model for our NER task, we use spaCy’s train tool as follows:

python -m spacy train de data/04_models/md data/02_train data/03_val \
    --base-model de_core_news_md --pipeline 'ner' -R -n 20

which tells spaCy to train a new model

  • for the German language whose code is de
  • saving the trained model in data/04_models/md
  • using the training and validation data in data/02_train and data/03_val, respectively,
  • starting from the base model de_core_news_md
  • where the task to be trained is ner — named entity recognition
  • replacing the standard named entity recognition component via -R
  • using 20 epochs, that is, 20 runs over the entire training data.

Depending on your system, training may take from several minutes to a few hours. If you have an NVIDIA GPU with CUDA set up, you can try to speed up the training; see spaCy’s installation and training instructions.

To track the progress, spaCy displays a table showing the loss (NER loss), precision (NER P), recall (NER R) and F1-score (NER F) reached after each epoch:

Itn   NER Loss    NER P    NER R    NER F    Token %   CPU WPS
1     26507.803   64.209   51.197   56.970   100.000   34947
2     14681.514   67.480   57.931   62.342   100.000   39232
3     10907.758   68.239   59.384   63.504   100.000   42043

At the end, spaCy tells you that it stored the last and the best model version in data/04_models/md/model-final and data/04_models/md/model-best, respectively. To check the performance of the model after training, we evaluate it on the validation data:

python -m spacy evaluate data/04_models/md/model-best data/03_val

This outputs the precision, recall and F1-score for the NER task again (NER P, NER R, NER F):

Time   Words    Words/s   TOK      POS    UAS    LAS    NER P   NER R   NER F   Textcat
4.37   177835   40663     100.00   0.00   0.00   0.00   70.15   60.09   64.73   0.00
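As a reminder of how these three numbers relate: precision is the share of predicted entities that are correct, recall is the share of annotated entities that were found, and the F1-score is their harmonic mean. The sketch below uses the standard definitions; the function names are our own, and we do not claim that spaCy computes the scores with exactly this code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 (as fractions) from entity counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall, f1_score(precision, recall)


# NER P and NER R as reported above by `spacy evaluate` (in percent).
print(round(f1_score(70.15, 60.09), 2))  # 64.73, matching NER F
```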

The overall performance looks moderate. For better results, one could use

  1. the large language model de_core_news_lg
  2. more training steps
  3. more training data (we only used a subset of the dataset).

As an example, training the large model for 40 epochs yields the following scores:

Time   Words    Words/s   TOK      POS    UAS    LAS    NER P   NER R   NER F   Textcat
4.52   177835   39339     100.00   0.00   0.00   0.00   73.72   64.39   68.74   0.00

Apparently, the problem is not the model, but the data: some tag categories appear very rarely, so it’s hard for the model to learn them. For a more thorough evaluation, we need to see the scores for each tag category.

Step 3: Use the model for named entity recognition

To use our new model and to see how it performs on each annotation class, we need to use the Python API of spaCy. To experiment along, activate the virtual environment again, install Jupyter and start a notebook with

pip install jupyter
jupyter notebook spacy_ner.ipynb

If the notebook does not open by itself, point a web browser to the URL output by the last command. Then enter the following Python code blocks in code cells to work along.

Let us load the best-trained model version:

import spacy
MODEL_PATH = 'data/04_models/md/model-best'
nlp = spacy.load(MODEL_PATH)

It can be applied to detect entities in new text as follows:

sample = """Trotz der zweifelhaften Bewertung von MDMA als "harte Droge"
( vgl. BGH , Beschluss vom 3. Februar 1999 - 5 StR 705/98 ,
juris Rn. 2 ; zum Meinungsstand Patzak in Körner / Patzak / Volkmer
, BtMG , 8. Aufl. , Vorbem. zu §§ 29 ff. Rn. 213 mwN ; Weber , BtMG ,
5. Aufl. , § 1 Rn. 364 mwN ) hat der Strafausspruch Bestand ,
da die verhängte Rechtsfolge jedenfalls angemessen ist 
(§ 354 Abs. 1a Satz 1 StPO) ."""

doc = nlp(sample)

for ent in doc.ents:
    print(ent.label_, ':', ent.text)

The output looks as follows:

RS : BGH , Beschluss vom 3. Februar 1999 - 5 StR 705/98 , juris Rn. 2
LIT : Patzak in Körner / Patzak / Volkmer , BtMG , 8.
GS : § 29 ff
GS : Rn
LIT : Weber , BtMG , 5.
GS : § 1 Rn
GS : § 354 Abs. 1a Satz 1 StPO
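For downstream use, it is often handy to collect the detected entities per annotation class. Here is a small sketch that works on plain (label, text) pairs, so it runs without the trained model; the helper name group_by_label is our own.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group (label, text) pairs into a dict mapping label -> list of texts."""
    grouped = defaultdict(list)
    for label, text in entities:
        grouped[label].append(text)
    return dict(grouped)


# The pairs printed above; in the notebook one could instead pass
# [(ent.label_, ent.text) for ent in doc.ents].
entities = [
    ("RS", "BGH , Beschluss vom 3. Februar 1999 - 5 StR 705/98 , juris Rn. 2"),
    ("LIT", "Patzak in Körner / Patzak / Volkmer , BtMG , 8."),
    ("GS", "§ 29 ff"),
    ("GS", "Rn"),
    ("LIT", "Weber , BtMG , 5."),
    ("GS", "§ 1 Rn"),
    ("GS", "§ 354 Abs. 1a Satz 1 StPO"),
]

for label, texts in group_by_label(entities).items():
    print(label, len(texts), texts)
```

Grouping the output this way also makes the weak spots easy to see: several of the GS hits, such as “Rn”, are fragments rather than complete references.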

Step 4: Evaluate the model

To obtain scores for the model on the level of annotation classes, we continue to work in the Jupyter notebook and load the validation data:

from spacy.gold import GoldCorpus

VAL_FILENAME = 'data/03_val/bgh.json'

val_corpus = GoldCorpus(VAL_FILENAME, VAL_FILENAME)
docs_golds = list(val_corpus.train_docs(nlp))
docs, golds = zip(*docs_golds)

To apply our model to these documents, we need to use only the NER component of the model’s NLP pipeline:

# Look up the NER component by name rather than by its position in the pipeline
ner = nlp.get_pipe('ner')
predictions = list(ner.pipe(docs))

Finally, we can evaluate the performance using the Scorer class. Along the way, we count how often each tag occurred:

from spacy.scorer import Scorer
from collections import Counter

tag_counts = Counter()
scorer = Scorer()
for y_p, y_t in zip(predictions, golds):
    scorer.score(y_p, y_t)
    for tag in y_t.ner:
        tag_counts[tag.split('-')[-1]] += 1
print(scorer.ents_p, scorer.ents_r, scorer.ents_f)

These are the same scores that we obtained by validating on the command line. Additionally, the ents_per_type attribute of scorer gives us access to the tag-level scores. With pandas installed (pip install pandas), we can put these scores in a table as follows:

import pandas as pd

scores = (pd.DataFrame.from_dict(scorer.ents_per_type, orient='index')
                      .join(pd.Series(tag_counts, name='support'))
                      .sort_values(by='support', ascending=False))
scores

For the medium model trained over 20 epochs, we obtain the following result:

tag   p       r       f       support
RS    62.77   63.34   63.06   18615
GS    84.93   84.93   84.93   7640
LIT   73.70   83.82   78.44   4685
GRT   67.88   32.40   43.86   662
RR    94.37   81.03   87.19   560
EUN   14.28   7.81    10.10   540
PER   25.00   1.62    3.05    494
ORG   32.25   28.57   30.30   176
VT    4.86    29.16   8.33    150
INN   33.33   8.00    12.90   124
UN    47.61   16.39   24.39   122
LD    36.87   65.82   47.27   95
ST    28.12   11.25   16.07   85
VO    0.00    0.00    0.00    81
MRK   0.00    0.00    0.00    58
AN    50.00   1.92    3.70    57
STR   0.00    0.00    0.00    35
LDS   33.33   10.00   15.38   10
VS    0.00    0.00    0.00    10

This gives a much clearer picture. Plotting the F1-Score (f) versus the number of tokens with this tag shows a correlation between poor performance and shortage of training data:

[Figure: scatter plot of F1-score versus support per tag]
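The visual impression can be backed by a number. The sketch below computes the Pearson correlation between the F1-score and the logarithm of the support, using the values from the table above. Taking the logarithm and using Pearson's coefficient are assumptions of ours, chosen because the support spans several orders of magnitude; this is not necessarily how the plot was produced.

```python
import math

# (tag, F1-score in percent, support) as listed in the table above.
ROWS = [
    ("RS", 63.06, 18615), ("GS", 84.93, 7640), ("LIT", 78.44, 4685),
    ("GRT", 43.86, 662), ("RR", 87.19, 560), ("EUN", 10.10, 540),
    ("PER", 3.05, 494), ("ORG", 30.30, 176), ("VT", 8.33, 150),
    ("INN", 12.90, 124), ("UN", 24.39, 122), ("LD", 47.27, 95),
    ("ST", 16.07, 85), ("VO", 0.00, 81), ("MRK", 0.00, 58),
    ("AN", 3.70, 57), ("STR", 0.00, 35), ("LDS", 15.38, 10),
    ("VS", 0.00, 10),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Support spans several orders of magnitude, so compare on a log scale.
log_support = [math.log10(support) for _, _, support in ROWS]
f1_scores = [f1 for _, f1, _ in ROWS]
print(round(pearson(log_support, f1_scores), 2))
```

A clearly positive value supports the reading that rare tags are learned poorly, although outliers such as RR (high score at moderate support) show that support is not the only factor.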

What next?

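We’ve seen that spaCy allows us to train a model for extracting information from text with just a few commands on the command line and with no knowledge of deep learning or NLP. The options to improve performance and to adjust the model to our needs are, however, limited. In the two following posts, we shall do better: we build a bidirectional word-level LSTM model by hand with TensorFlow & Keras, and then fine-tune a pre-trained BERT model using huggingface transformers.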

Stay tuned!

Thomas Timmermann

Thomas did a PhD in Mathematics, gathered rich research experience, and joined the Münster team in the area of data science and machine learning. He is interested in everything related to AI and deep learning.
