Take control of named entity recognition with your own Keras model!

This post shows how to extract information from text documents with the high-level deep learning library Keras: we build, train and evaluate a bidirectional LSTM model by hand for a custom named entity recognition (NER) task on legal texts.

In a previous post, we solved the same NER task on the command line with the NLP library spaCy. The present approach requires some work and knowledge, but yields a much more flexible solution which we can tune, scale and modify to our needs.

The NER dataset and task

We again use the dataset presented by E. Leitner, G. Rehm and J. Moreno-Schneider in "Fine-grained Named Entity Recognition in Legal Documents". It consists of decisions from several German federal courts with annotations of named entities referring to legal norms, court decisions, legal literature and others, of the following form:

‘Trotz der zweifelhaften Bewertung von MDMA als ” harte Droge ” ( vgl. BGH , Beschluss vom 3. Februar 1999 – 5 StR 705/98 , juris Rn. 2 [RS] ; zum Meinungsstand Patzak in Körner / Patzak / Volkmer , BtMG , 8. Aufl. , Vorbem. zu §§ 29 ff. Rn. 213 [LIT] mwN ; Weber , BtMG , 5. Aufl. , § 1 Rn. 364 [LIT] mwN ) hat der Strafausspruch Bestand , da die verhängte Rechtsfolge jedenfalls angemessen ist ( § 354 Abs. 1a Satz 1 StPO [GS] ) . ‘

Here, the bracketed labels RS, LIT and GS mark the preceding reference as a court decision, legal literature or a legal norm, respectively.

The task will be to build, train and evaluate a model that, given sample sentences, annotates each token of each sentence with a tag that indicates whether this token is part of a reference to a legal norm, court decision, legal literature and so on.

NER with bi-LSTM for dummies

We implement a standard deep-learning architecture for NER — a bi-directional recurrent neural network — which works as follows:

[Figure: bi-LSTM network architecture]

  1. Each sentence is split into a sequence of tokens, and each token is represented by a word vector. These word vectors or embeddings are usually pre-trained on a huge corpus of documents so that they encode semantic information. We thus apply general language proficiency to our special task, a technique known as transfer learning. Common methods for pre-training are word2vec, GloVe or fastText; we use the word vectors provided by spaCy.
  2. The model processes the input sequence step by step and maintains an internal memory along the way,
    • reading the corresponding input vector,
    • combining this input with the internal memory,
    • producing an output vector and
    • updating the internal memory

    at each step. This magic is carried out by a long short-term memory (LSTM) cell. As a result, we obtain an output sequence of the same length as the input sequence, and an internal memory state.

  3. Going backwards, the model reads the input again and produces a second output sequence.
  4. At each position, the outputs of steps 2 and 3 are combined and fed into a classifier which outputs, for the input word at this position, the probability that it should be annotated with the first tag, the second tag, and so on.
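
To make the four steps above concrete, here is a minimal NumPy sketch of the forward pass of a bidirectional LSTM tagger with random, untrained weights. The function names (`lstm_pass`, `bi_lstm_tagger`), the weight layout and the toy dimensions are assumptions made for illustration; this is not the Keras model built later, it only shows how the data flows.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def lstm_pass(inputs, W, U, b):
    """Run one LSTM over inputs of shape (steps, emb_dim); return (steps, units)."""
    units = U.shape[0]
    h = np.zeros(units)  # output of the previous step
    c = np.zeros(units)  # internal memory (cell state)
    outputs = []
    for x in inputs:
        # One affine map yields the pre-activations of all four gates.
        z = x @ W + h @ U + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update the memory
        h = sigmoid(o) * np.tanh(c)                   # produce the output
        outputs.append(h)
    return np.stack(outputs)

def bi_lstm_tagger(inputs, fwd, bwd, W_out, b_out):
    forward = lstm_pass(inputs, *fwd)               # step 2: left to right
    backward = lstm_pass(inputs[::-1], *bwd)[::-1]  # step 3: right to left, re-aligned
    combined = np.concatenate([forward, backward], axis=-1)
    return softmax(combined @ W_out + b_out)        # step 4: tag probabilities

rng = np.random.default_rng(0)
steps, emb_dim, units, n_tags = 6, 8, 5, 4  # toy sizes, chosen arbitrarily

def random_lstm_weights():
    return (rng.normal(size=(emb_dim, 4 * units)),
            rng.normal(size=(units, 4 * units)),
            np.zeros(4 * units))

sentence = rng.normal(size=(steps, emb_dim))  # stand-in for six word vectors
probs = bi_lstm_tagger(sentence, random_lstm_weights(), random_lstm_weights(),
                       rng.normal(size=(2 * units, n_tags)), np.zeros(n_tags))
print(probs.shape)
```

The result has one row per token and one column per tag, and each row sums to one, so it can be read as a probability distribution over the tags.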

To improve performance, one can replace the last feed-forward layer by a conditional random field (CRF) model. The resulting architecture is called a bi-LSTM-CRF model.

Setting up the environment

First, set up a virtual environment as described in the preceding blog post, and install the required dependencies:

mkdir keras_ner_project
cd keras_ner_project
python3 -m venv .venv
source .venv/bin/activate
pip install spacy
python -m spacy download de_core_news_md
pip install tensorflow

Alternatively, follow along with Jupyter running inside a TensorFlow Docker container, or with a Google Colab notebook.

Next, download the data as in the preceding blog post (in case you are inside a Jupyter notebook, put an exclamation mark ! in front of each command to have it executed by the shell):

mkdir -p data/01_raw
curl https://github.com/elenanereiss/Legal-Entity-Recognition/raw/master/data/dataset_courts.zip \
     -L -o data/01_raw/raw.zip
unzip data/01_raw/raw.zip -d data/01_raw

Step 1: Preprocessing for NER

The data files contain sample sentences separated by blank lines, with one token and annotation in BIO format per line as follows:

an O
Kapitalgesellschaften O
( O
§ B-GS
17 I-GS
Abs. I-GS
1 I-GS
und I-GS
2 I-GS
EStG I-GS
) O
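
To illustrate how the BIO tags above encode entity spans, the following sketch groups consecutive B- and I-tagged tokens into entities. The helper `bio_to_spans` is a hypothetical name introduced here and is not used in the rest of the post.

```python
def bio_to_spans(tagged_tokens):
    """Group (token, BIO tag) pairs into (category, text) entity spans."""
    spans, current_tokens, current_cat = [], [], None
    for token, tag in tagged_tokens:
        prefix, _, category = tag.partition('-')
        # Close the running entity unless this token continues it.
        if current_cat is not None and not (prefix == 'I' and category == current_cat):
            spans.append((current_cat, ' '.join(current_tokens)))
            current_tokens, current_cat = [], None
        if prefix == 'B' or (prefix == 'I' and current_cat is None):
            # Start a new entity (a stray I- tag is treated like a B- tag).
            current_tokens, current_cat = [token], category
        elif prefix == 'I':
            current_tokens.append(token)
    if current_cat is not None:
        spans.append((current_cat, ' '.join(current_tokens)))
    return spans

sample = [('an', 'O'), ('Kapitalgesellschaften', 'O'), ('(', 'O'),
          ('§', 'B-GS'), ('17', 'I-GS'), ('Abs.', 'I-GS'), ('1', 'I-GS'),
          ('und', 'I-GS'), ('2', 'I-GS'), ('EStG', 'I-GS'), (')', 'O')]
print(bio_to_spans(sample))  # [('GS', '§ 17 Abs. 1 und 2 EStG')]
```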

We read such a data file line-by-line and store the sentences as lists of token-tag pairs:

def load_data(filename: str):
    """Read a CoNLL file into a list of sentences of (token, tag category) pairs."""
    with open(filename, 'r', encoding='utf-8') as file:
        lines = [line.split() for line in file]
    samples, start = [], 0
    # A trailing empty entry ensures that the last sentence is flushed, too.
    for end, parts in enumerate(lines + [[]]):
        if not parts:
            if start < end:
                # Keep only the tag category, dropping the B-/I- prefix.
                sample = [(token, tag.split('-')[-1]) for token, tag in lines[start:end]]
                samples.append(sample)
            start = end + 1
    return samples

train_samples = load_data('data/01_raw/bag.conll')
val_samples = load_data('data/01_raw/bgh.conll')
all_samples = train_samples + val_samples

For simplicity, we’ll truncate the sentences to a maximum length and pad shorter input sequences. But first, let us determine the set of all tags in the data and add an extra tag for the padding:

schema = ['_'] + sorted({tag for sentence in all_samples for _, tag in sentence})
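
Since the padding tag `_` comes first in the schema, it gets index 0. A small sketch of truncation and padding on tag indices alone, with a made-up three-tag schema and a hypothetical helper `pad_or_truncate`, could look like this:

```python
def pad_or_truncate(tag_indices, max_len, pad_index=0):
    """Cut a sequence to max_len items or fill it up with pad_index."""
    clipped = tag_indices[:max_len]
    return clipped + [pad_index] * (max_len - len(clipped))

toy_schema = ['_', 'GS', 'O']  # index 0 is reserved for the padding tag
toy_index = {tag: i for i, tag in enumerate(toy_schema)}
indices = [toy_index[tag] for tag in ['O', 'GS', 'GS', 'O']]

print(pad_or_truncate(indices, 6))  # [2, 1, 1, 2, 0, 0]
print(pad_or_truncate(indices, 3))  # [2, 1, 1]
```

With index 0 as the padding value, an array initialized with zeros is already padded, and only the actual tokens need to be written into it.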

Next, we represent each token by a word vector, using a pre-trained German language model of the NLP library spaCy:

import spacy
import numpy as np

nlp = spacy.load('de_core_news_md')
EMB_DIM = nlp.vocab.vectors_length
MAX_LEN = 50

def preprocess(samples):
    tag_index = {tag: index for index, tag in enumerate(schema)}
    # Zero-initialized arrays double as padding: zero vectors and tag index 0 ('_').
    X = np.zeros((len(samples), MAX_LEN, EMB_DIM), dtype=np.float32)
    y = np.zeros((len(samples), MAX_LEN), dtype=np.uint8)
    vocab = nlp.vocab
    for i, sentence in enumerate(samples):
        # Sentences longer than MAX_LEN are truncated.
        for j, (token, tag) in enumerate(sentence[:MAX_LEN]):
            X[i, j] = vocab.get_vector(token)
            y[i, j] = tag_index[tag]
    return X, y

X_train, y_train = preprocess(train_samples)
X_val, y_val = preprocess(val_samples)

Now we have the data ready for NER and can assemble our model!

Step 2: Build the bi-LSTM model

With the wide range of layers offered by Keras, we can construct a bi-directional LSTM model as a sequence of two compound layers:

[Figure: bi-LSTM network layers]

  1. The bidirectional LSTM layer encapsulates a forward- and a backward-pass of an LSTM layer, followed by the stacking of the sequences returned by both passes.
  2. The second layer applies a dense classification layer to every position of the stacked sequences. Here, the softmax activation function scales the output so that we obtain sequences of probability distributions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

def build_model(nr_filters=256):
    input_shape = (MAX_LEN, EMB_DIM)
    # return_sequences=True yields one output vector per token, not just the last one.
    lstm = LSTM(nr_filters, return_sequences=True)
    bi_lstm = Bidirectional(lstm, input_shape=input_shape)
    tag_classifier = Dense(len(schema), activation='softmax')
    sequence_labeller = TimeDistributed(tag_classifier)
    return Sequential([bi_lstm, sequence_labeller])

model = build_model()

For more complex architectures involving multiple inputs or outputs, residual connections or the like, Keras offers a more flexible functional API. With this, we can create directed acyclic graphs of tensors connected by applications of layers, and specify a model in terms of its input and output tensors.
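
As a rough sketch of what this looks like, the same two-layer architecture can be expressed with the functional API as follows. The function name `build_functional_model` and the default sizes are placeholders chosen for illustration; in the setting of this post one would pass `MAX_LEN`, `EMB_DIM` and `len(schema)` instead.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

def build_functional_model(max_len=50, emb_dim=300, n_tags=20, nr_filters=256):
    # The input tensor: one word vector per token position.
    inputs = Input(shape=(max_len, emb_dim))
    # Each layer is applied to a tensor and returns a new tensor.
    hidden = Bidirectional(LSTM(nr_filters, return_sequences=True))(inputs)
    outputs = TimeDistributed(Dense(n_tags, activation='softmax'))(hidden)
    # The model is specified by its input and output tensors.
    return Model(inputs=inputs, outputs=outputs)

functional_model = build_functional_model(nr_filters=8)
print(functional_model.output_shape)
```

Since the intermediate tensors are ordinary Python variables, branching into several outputs or adding a residual connection amounts to reusing such a variable in more than one place.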

Step 3: Train the model

To train a model means to optimize its weights or parameters on data so that the model’s predictions approximate the truth. For Keras to perform this optimization, we need to specify

  • how to measure the distance of the prediction to the truth, that is, a loss function,
  • the optimization strategy which is a variant of batch-wise gradient descent.
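
To give an idea of these two ingredients, here is a plain-Python sketch of the sparse categorical cross-entropy loss and of a single gradient descent step. Both functions are simplified illustrations written for this post, not the Keras implementations; in particular, the Adam optimizer used below adapts the step size per weight rather than using one fixed learning rate.

```python
import math

def sparse_categorical_crossentropy(true_indices, predicted_probs):
    """Mean negative log-probability assigned to the true tag of each token."""
    losses = [-math.log(probs[index])
              for index, probs in zip(true_indices, predicted_probs)]
    return sum(losses) / len(losses)

def sgd_step(weights, gradients, learning_rate=0.1):
    """Plain gradient descent: move each weight against its gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

# Three tokens, three possible tags; the numbers are made up.
y_true = [0, 2, 1]
y_prob = [[0.7, 0.2, 0.1],
          [0.1, 0.1, 0.8],
          [0.3, 0.4, 0.3]]
loss = sparse_categorical_crossentropy(y_true, y_prob)
print(round(loss, 4))
```

The loss is small when the model puts high probability on the true tags and grows without bound as that probability approaches zero.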

Additionally, we can specify metrics to monitor the training progress. Once this has been done using the compile method, we can call the fit method for training:

def train(model, epochs=10, batch_size=32):
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    # Hold out 20% of the training data to monitor the validation loss.
    history = model.fit(X_train, y_train,
                        validation_split=0.2,
                        epochs=epochs,
                        batch_size=batch_size)
    return history.history

history = train(model)

Keras provides implementations of all the standard optimizers, loss functions and metrics, and also allows us to supply our own.

The training history contains the losses and metrics achieved on the training and validation data after each epoch. Here, I got the following result:

[Figure: training history]

Note the scale on the y-axis, but don’t get excited by accuracies of 99%: almost all tokens are labelled with the trivial tag O, and hence accuracy does not tell much about the detection of the non-trivial tags.
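
A toy calculation shows why. The tag distribution below is made up for illustration: a "model" that labels every token with O reaches a high accuracy although it does not detect a single entity.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, tag):
    # Share of the tokens truly carrying `tag` that were predicted as `tag`.
    relevant = [p for t, p in zip(y_true, y_pred) if t == tag]
    return sum(p == tag for p in relevant) / len(relevant)

# Made-up distribution: 97 trivial tokens and 3 tokens of a legal norm.
y_true = ['O'] * 97 + ['GS'] * 3
y_pred = ['O'] * 100  # a "model" that never detects anything

print(accuracy(y_true, y_pred))      # 0.97
print(recall(y_true, y_pred, 'GS'))  # 0.0
```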

Step 4: Evaluate the model

To assess the performance of the model, we apply it to the preprocessed validation data and obtain a tensor of the shape (len(val_samples), MAX_LEN, len(schema)). This tensor contains, for each sample sentence and each token in this sentence, a predicted probability distribution over the tags. We choose the tag with highest probability and return, for each sentence and each token, the true and the predicted tag:

def predict(model):
    y_probs = model.predict(X_val)
    # Pick the index of the most probable tag at each position.
    y_pred = np.argmax(y_probs, axis=-1)
    # zip stops at the shorter sequence, so padding positions are dropped.
    return [
        [(token, tag, schema[index]) for (token, tag), index in zip(sentence, tag_pred)]
        for sentence, tag_pred in zip(val_samples, y_pred)
    ]

predictions = predict(model)

Finally, we compute precision, recall and f1-score on the level of tag categories using scikit-learn’s classification_report:

import pandas as pd
from sklearn.metrics import classification_report

def evaluate(predictions):
    # Flatten the sentences into one list of true and one list of predicted tags.
    y_t = [pos[1] for sentence in predictions for pos in sentence]
    y_p = [pos[2] for sentence in predictions for pos in sentence]
    # zero_division=0 reports a score of 0 for tags that are never predicted.
    report = classification_report(y_t, y_p, output_dict=True, zero_division=0)
    return pd.DataFrame.from_dict(report).transpose().reset_index()

evaluate(predictions)
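
For readers who want to see what these scores measure, the following plain-Python sketch computes per-tag precision, recall and f1-score on the token level. It is a simplified stand-in written for illustration, not scikit-learn's implementation, and the toy tag lists are made up.

```python
from collections import Counter

def per_tag_scores(y_true, y_pred):
    """Return {tag: (precision, recall, f1, support)} computed per token."""
    true_pos, predicted, support = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        support[t] += 1    # how often the tag occurs in the truth
        predicted[p] += 1  # how often the tag was predicted
        if t == p:
            true_pos[t] += 1
    scores = {}
    for tag in sorted(set(support) | set(predicted)):
        precision = true_pos[tag] / predicted[tag] if predicted[tag] else 0.0
        recall = true_pos[tag] / support[tag] if support[tag] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[tag] = (precision, recall, f1, support[tag])
    return scores

toy_true = ['O', 'GS', 'GS', 'RS', 'O', 'O']
toy_pred = ['O', 'GS', 'O', 'RS', 'RS', 'O']
toy_scores = per_tag_scores(toy_true, toy_pred)
for tag, values in toy_scores.items():
    print(tag, values)
```

In this toy example, GS has perfect precision but a recall of only one half, because one of its two tokens was labelled O.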

Training a model with 1024 filters for 10 epochs, we reach the following scores:

tag  f1-score (%)  precision (%)  recall (%)  support
EUN          56.9           67.0        49.5      398
GRT          65.9           91.0        51.6      643
GS           94.5           96.1        92.9     6774
INN          41.3           88.9        26.9      119
LD           74.0           67.0        82.6       86
LDS           0.0            0.0         0.0        9
LIT          79.5           74.3        85.4     1681
MRK           0.0            0.0         0.0       49
ORG          25.3           32.4        20.8      159
PER           0.0            0.0         0.0      473
RR           92.0           94.4        89.8      560
RS           90.7           97.1        85.0     8380
ST           71.9           93.9        58.2       79
STR           0.0            0.0         0.0       35
UN           32.7           64.9        21.8      110
VO            2.2            4.0         1.5       66
VS            0.0            0.0         0.0       10
VT           18.0           11.7        38.9      144

Let’s see how this compares to the results achieved with spaCy:

[Figure: f1-scores for spaCy and Keras model]

It seems that our hand-built NER model does very well! But beware that these experiments do not show a winner: neither of the two approaches has been optimized, and we compared neither the training time nor the compute resources used. The main differentiating factor is that

  • spaCy can be used out-of-the-box with no understanding of deep learning, while
  • the approach presented here is much more flexible and tuneable (see below).

What next?

With the deep learning library Keras, building and training our custom NER model took just a few lines, but setting up the data and the training required much more understanding than the command-line approach with spaCy.

To improve performance, we could try to tune the model and

  • increase the number of filters, that is, the size of the LSTM cell,
  • stack several bidirectional layers on top of each other,
  • replace the time-distributed classification layer with a conditional random field (CRF) model or
  • address the imbalance of the tag distribution with a focal loss instead of categorical cross-entropy.
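
To sketch the last idea: the focal loss multiplies the cross-entropy of each token by a factor (1 - p)^gamma, where p is the probability the model assigns to the true tag, so that tokens which are already classified confidently contribute very little. The snippet below is a minimal per-token illustration in plain Python; it omits the optional class weighting, and gamma=2.0 is merely a commonly used value, not one tuned for this task.

```python
import math

def cross_entropy(p_true):
    """Cross-entropy for one token, given the probability of its true tag."""
    return -math.log(p_true)

def focal_loss(p_true, gamma=2.0):
    """Cross-entropy scaled down by (1 - p_true) ** gamma."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

# Easy token (p = 0.95), uncertain token (p = 0.5), hard token (p = 0.1).
for p in (0.95, 0.5, 0.1):
    print(p, round(cross_entropy(p), 4), round(focal_loss(p), 4))
```

For gamma = 0 the focal loss reduces to the ordinary cross-entropy; larger values of gamma shift the weight further towards the hard tokens, which here are mostly those with non-trivial tags.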

But to achieve a significant boost, we need to provide our model with more input by

  1. labeling more task-specific training data or
  2. applying more task-independent language proficiency to our task.

In the next blog post, we shall fine-tune a pre-trained NLP transformer model to our NER task and get state-of-the-art performance.

Stay tuned!

Thomas Timmermann

Thomas did a PhD in Mathematics, gathered rich research experience, and joined the Münster team in the area of data science and machine learning. He is interested in everything related to AI and deep learning.
