# Move n-gram extraction into your Keras model!

07/19/19

# Move n-gram extraction into your Keras model!

In a project on large-scale text classification, a colleague of mine significantly raised the accuracy of our Keras model by feeding it with bigrams and trigrams instead of single characters. For his experiments he could just modify the preprocessing and the model as he wished, but for production, it was much preferable to just replace the model being served by tensorflow and leave all other code unchanged. And that is what we did — move the bigram and trigram extraction into our neural network. In this blog post, I’ll show you the basic idea, the implementation, an application and the limitations of our approach.

## The idea: n-gram extraction via convolution

Suppose we want to process the quote

*“I’d far rather be happy than right any day”*

of Douglas Adams. Instead of looking at the text as a sequence of characters

I‘d far rather …

a neural network may profit from looking at pairs of adjacent characters, that is, at the sequence of bigrams

I’‘dd ffaarr rraatthheer …

or even at the sequence of trigrams or n-grams for n larger 3. To feed the neural network, we need to convert characters into numbers, for example, using the ASCII or UTF-8 codes. Our bigrams then become sequences of pairs of numbers:

73, 3939, 100100, 3232, 102102, 97 …

If we encode these bigrams using the rule (a, b) ↦ N · a + b, where N is the size of our alphabet, we obtain a sequence of numbers again: in case N=256, this would be

73*256+39=1872739*256+100=10084100*256+32=2563232*256+102=8294 …

More generally, we can encode n-grams for arbitrary n using the rule

(a_{0}, …,a_{n-1}) ↦ N^{n-1} · a_{1} + N^{n-2} · a_{2} + … + N · a_{n-2} + a_{n-1}.

Here comes the key observation: with this encoding rule,

extracting n-grams becomes a convolution of the sequence of character codes with the kernel (1,N, …, N^{n-1}).

And this preprocessing step can easily be inserted as a first step into any character-level text-processing neural network.

## The implementation

As a warm-up, let us implement the n-gram extraction as a convolution with NumPy. Given a NumPy array of character codes, the n-gram length n and the size of the alphabet N, the following function returns the sequence of encoded n-grams as an array:

import numpy as np def ngrams_numpy(array, n, alphabet_size): kernel = np.power(alphabet_size, range(0, n)) return np.convolve(array, kernel, mode='valid')

Next, how about the deep learning library Keras? Suppose we already have a working text-processing model whose input are (batches of) sequences of character codes. Then we can add bigram or n-gram extraction as a first layer using a lambda layer in one line. Indeed, given a batch of samples in form of a tensor of shape (`batch_size`

, `sample_length`

), the following function returns a batch of encoded bigrams in form of a tensor of shape (`batch_size`

, `sample_length - 1`

):

from keras.layers import Lambda def bigrams_lambda_layer(alphabet_size): return Lambda(lambda x: x[:,:-1] + x[:,1:] * alphabet_size)

However, lambda layers in Keras may cause problems when saving, loading or checkpointing the model.

For further deployment of a model, for example with tensorflow serving, it might be better to avoid a lambda layer and to use a 1d-convolutional layer with fixed weights as follows:

import numpy as np from keras import layers, backend def ngram_block(n, alphabet_size): def wrapped(inputs): layer = layers.Conv1D(1, n, use_bias=False, trainable=False) x = layers.Reshape((-1, 1))(inputs) x = layer(x) kernel = np.power(alphabet_size, range(0, n), dtype=backend.floatx()) layer.set_weights([kernel.reshape(n, 1, 1)]) return layers.Reshape((-1,))(x) return wrapped

This function can be used like a layer:

bigrams_tensor = ngram_block(2, alphabet_size)(input_tensor)

See also the source code for the experiment below. What this function does is

*create a 1d-convolutional layer*`layer`

with one feature map, window size*n*, zero bias vector and frozen weights that are not changed during training,*reshape the input*`inputs`

, which is a tensor of shape (`batch_size`

,`sample_length`

), to a tensor`x`

with shape (`batch_size`

,`sample_length`

, 1) (necessary because convolutional layers operate on sequences of vectors and not on sequences of scalars),*apply the convolutional layer*to the reshaped input,*set the kernel*of the convolutional layer and*reshape the output*of the convolutional layer from (`batch_size`

,`sample_length`

, 1) to (`batch_size`

,`sample_length`

) again.

## An experiment

Let us finally see how this idea works out for a classical test case, the 20 newsgroups dataset, where the task is to guess the topic of a given post from its text. We shall use a simple character-level convolutional network and see how n-gram extraction inside the model affects the classification accuracy and and training time.

To load the data, we use the datasets module of scikit-learn:

from sklearn.datasets import fetch_20newsgroups as fetch data = fetch(subset="train", remove=("headers", "footers", "quotes")) posts, topics = data["data"], data["target"]

Now `posts`

is a list of newsgroup posts as strings, and `topics`

is a list of numbers representing the respective newsgroup topics. For each topic, we have 350 to 600 samples:

Note that this is *way too little data for a character-level model to perform well.* But let us try nevertheless.

We apply some minimal preprocessing and

- convert the characters to lower case,
- filter out all characters that are not contained in our chosen
`ALPHABET`

, - replace the remaining characters by their index in the
`ALPHABET`

, - trim the sequence of indices to a fixed length
`MAX_LEN`

, - stack all those sequences in one large NumPy array:

import numpy as np ALPHABET = "abcdefghijklmnopqrstuvwxyz1234567890 !$#()-=+:;,.?/" MAX_LEN = 1000 def encode_sample(sample, index): indices = [index[char] for char in sample if char in index] return np.resize(np.array(indices), MAX_LEN) index = {char: i + 1 for i, char in enumerate(ALPHABET)} X = np.stack([encode_sample(x.lower(), index) for x in posts]) y = np.eye(20)[topics]

Now `X`

is an array of shape (`len(posts)`

, `MAX_LEN`

), and `y`

is an array of shape (`len(posts)`

, 20) containing the one-hot encoded topics.

As a baseline, we train a simple convolutional model:

from keras import layers, models, optimizers LAYER_PARAMS = [[64, 3, 3], [128, 3, 3]] EMBEDDING_DIM = 16 def build_model(): inputs = layers.Input(shape=(MAX_LEN,)) x = layers.Embedding(len(ALPHABET), EMBEDDING_DIM)(inputs) for filters, kernel_size, pool_size in LAYER_PARAMS: x = layers.Conv1D(filters, kernel_size, activation="relu")(x) x = layers.BatchNormalization()(x) x = layers.SpatialDropout1D(0.15)(x) x = layers.MaxPooling1D(pool_size)(x) x = layers.GlobalAveragePooling1D()(x) x = layers.Dense(20, activation="softmax")(x) model = models.Model(inputs=inputs, outputs=x) model.compile(optimizer=optimizers.Adadelta(), loss="categorical_crossentropy", metrics=["acc"]) return model model = build_model() history = model.fit(X, y, epochs=60, batch_size=20, validation_split=0.2)

The results are quite poor — the validation accuracy reaches just 60 percent

By careful tuning of hyperparameters, things certainly could be improved a bit.

Now let us see how bigram and trigram extraction will affect performance of the model. Using the function `ngram_block`

, we only need to insert the line `x = ngram_block(n, size)(inputs)`

between the `Input`

and `Embedding`

layers in `build_model`

as follows:

def build_ngram_model(n): inputs = layers.Input(shape=(MAX_LEN,)) x = ngram_block(n, len(ALPHABET))(inputs) x = layers.Embedding(pow(len(ALPHABET), n), n * EMBEDDING_DIM)(x) for filters, kernel_size, pool_size in LAYER_PARAMS: x = layers.Conv1D(filters, kernel_size, activation="relu")(x) x = layers.BatchNormalization()(x) x = layers.SpatialDropout1D(0.05 + 0.1 * n)(x) x = layers.MaxPooling1D(pool_size)(x) x = layers.GlobalAveragePooling1D()(x) x = layers.Dense(20, activation="softmax")(x) model = models.Model(inputs=inputs, outputs=x) model.compile(optimizer=optimizers.Adadelta(), loss="categorical_crossentropy", metrics=["acc"]) return model

We also raised the embedding dimension (because now we want to embed bigrams and trigrams instead of single characters) and use an adaptive spatial dropout rate. Let us see how the n-gram model performs:

for n in range(1, 4): build_ngram_model(n).fit(X, y, epochs=40, batch_size=20, validation_split=0.2)

The training histories show that **n-gram extraction yields a significant improvement**:

Indeed, the mean validation accuracy of the last 5 training epochs increased by more than 10 percent:

n | 1 | 2 | 3 |
---|---|---|---|

mean validation accuracy | 0.5796 | 0.6401 | 0.7064 |

## One limitation of the technique

Why did we stop at trigrams in the experiment above? The reason is that we do not only encode the n-grams that occur in our samples, but reserve codings for all n-grams that could possibly occur. And that makes a *huge difference* when n is growing larger:

n | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|

#(occuring n-grams) | 52 | 2,596 | 47,203 | 214,362 | 551,904 |

#(potential n-grams) | 51 | 2,601 | 132,651 | 6,765,201 | 345,025,251 |

And therefore, *the embedding layer will need memory increasing exponentially with **n*. This is the reason why we stick to bigrams or trigrams. By the way, the numbers above where extracted as follows:

import pandas as pd def all_ngrams(n): length = MAX_LEN - n + 1 def ngrams(x): return set(zip(*[x[i:length + i] for i in range(0, n)])) return set().union(*[ngrams(x) for x in X]) ns = range(1,6) alphabet_size = len(ALPHABET) cts = {'#(occuring n-grams)': [len(all_ngrams(n)) for n in ns], '#(potential n-grams)': [pow(alphabet_size, n) for n in ns]} pd.DataFrame(cts, index = pd.Index(ns, name='n')).transpose()