Model distillation helps us train small NLP models that rival large competitors
Transfer learning is one of the most impactful recent breakthroughs in Natural Language Processing. Less than a year after its release, Google's BERT and its offspring (RoBERTa, XLNet, etc.) dominate most of the NLP leaderboards. While it can be a headache to put these enormous models into production, various solutions exist to reduce their size considerably. At NLP Town we successfully applied model distillation to train spaCy's text classifier to perform almost as well as BERT on sentiment analysis of product reviews.
Recently the standard approach to Natural Language Processing has changed drastically. Whereas until one year ago, almost all NLP models were trained entirely from scratch (usually with the exception of their pre-trained word embeddings), today the safest road to success is to download a pre-trained model such as BERT and finetune it for your particular NLP task. Because these transfer-learning models have already seen a large collection of unlabelled texts, they have acquired a lot of knowledge about language: they are aware of word and sentence meaning, co-reference, syntax, and so on. Exciting as this revolution may be, models like BERT have so many parameters they are fairly slow and resource-intensive. For some NLP tasks at least, finetuning BERT feels like using a sledgehammer to crack a nut.
Most transfer-learning models are huge. BERT's multilingual models have 12 layers, a hidden size of 768 and 12 self-attention heads, for no less than 110 million parameters in total. BERT-large sports a whopping 340M parameters. Still, BERT is dwarfed by even more recent models, such as one with 665M parameters and OpenAI's GPT-2 with 774M. It certainly looks like this evolution towards ever larger models is set to continue for a while.
Of course, language is a complex phenomenon. It's obvious that more traditional, smaller models with relatively few parameters will not be able to handle all NLP tasks you throw at them. For individual text classification or sequence labelling tasks, however, it's questionable whether all the expressive power of BERT and its peers is really needed. That's why researchers have begun investigating how we can bring down the size of these models. Three possible approaches have emerged: quantization reduces the precision of the weights in a model by encoding them in fewer bits, pruning completely removes certain parts of a model (connection weights, neurons or even full weight matrices), while in distillation the goal is to train a small model to mimic the behaviour of a larger one.
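To make the first two approaches concrete, here is a minimal sketch in plain Python. It works on a flat list of weights rather than a real network, and the function names, the uniform quantization scheme and the magnitude-based pruning criterion are illustrative assumptions, not what any particular library does.

```python
def quantize(weights, bits=8):
    """Uniformly quantize floats to at most 2**bits distinct values.

    Each weight is mapped to an integer code in [0, 2**bits - 1] and then
    mapped back to a float, which is how the precision loss shows up.
    """
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return [lo + code * scale for code in codes]


def prune(weights, fraction=0.5):
    """Set the given fraction of weights with the smallest magnitude to zero."""
    k = int(len(weights) * fraction)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    smallest = set(by_magnitude[:k])
    return [0.0 if i in smallest else w for i, w in enumerate(weights)]


weights = [0.9, -0.05, 0.4, -0.7, 0.01, 0.2]
quantized = quantize(weights, bits=2)   # at most four distinct values remain
pruned = prune(weights, fraction=0.5)   # the three smallest weights become 0.0
```

Distillation, the third approach, does not touch the large model's weights at all, which is why it needs a training procedure of its own.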
In one of our summer projects at NLP Town, together with our intern, we set out to investigate the effectiveness of model distillation for sentiment analysis. Like Pang, Lee and Vaithyanathan in their seminal paper, we aimed to build an NLP model that could distinguish between positive and negative reviews. We collected product reviews in six languages: English, Dutch, French, German, Italian and Spanish. We labelled the reviews with one or two stars as negative and those with four or five stars as positive. We used 1000 examples for training, 1000 for development (early stopping) and 1000 for testing.
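The labelling rule and the three-way split can be sketched as follows. This is an illustration only: the function names are hypothetical, skipping three-star reviews is our assumption about how the unlabelled middle rating was handled, and the sketch does not balance the classes.

```python
import random


def star_to_label(stars):
    """Map a star rating to a sentiment label; 3-star reviews get no label."""
    if stars in (1, 2):
        return "negative"
    if stars in (4, 5):
        return "positive"
    return None  # assumption: neutral reviews are left out


def label_reviews(reviews):
    """Turn (text, stars) pairs into (text, label) pairs, skipping unlabelled ones."""
    labelled = []
    for text, stars in reviews:
        label = star_to_label(stars)
        if label is not None:
            labelled.append((text, label))
    return labelled


def split(examples, sizes=(1000, 1000, 1000), seed=0):
    """Shuffle and cut the examples into train, development and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train, n_dev, n_test = sizes
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:n_train + n_dev + n_test]
    return train, dev, test
```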
The first step was to determine a baseline for our task. With an equal number of positive and negative examples in each of our data sets, a random baseline would obtain an accuracy of 50% on average. As a simple machine learning baseline, we trained a spaCy text classification model: a stacked ensemble of a bag-of-words model and a fairly simple convolutional neural network with mean pooling and attention. To this we added an output layer of one node and had the model predict positive when its output score was higher than 0.5 and negative otherwise. This baseline achieved an accuracy of between 79.5% (for Italian) and 83.4% (for French) on the test data: not bad, but not a great result either.
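The decision rule and the accuracy measure are simple enough to spell out. In this sketch the scores are made-up numbers standing in for the single output node of the classifier, and the function names are hypothetical.

```python
def predict_label(score, threshold=0.5):
    """Predict positive only when the score is strictly above the threshold."""
    return "positive" if score > threshold else "negative"


def accuracy(scores, gold_labels, threshold=0.5):
    """Share of examples whose thresholded prediction matches the gold label."""
    correct = sum(
        predict_label(score, threshold) == gold
        for score, gold in zip(scores, gold_labels)
    )
    return correct / len(gold_labels)


# Illustrative scores only, not outputs of a real model.
scores = [0.9, 0.2, 0.5, 0.7]
gold = ["positive", "negative", "negative", "negative"]
result = accuracy(scores, gold)
```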
Because of its small training set, our challenge is extremely suitable for transfer learning. Even if a test phrase such as great book is not present in the training data, BERT already knows it is similar to excellent novel, fantastic read, or another similar phrase that may very well occur in the training set. As a result, it should be able to predict the rating for an unseen review much more reliably than a simple model trained from scratch.
To finetune BERT, we adapted the BERTForSequenceClassification class for binary classification. For all six languages we finetuned BERT-multilingual-cased, the multilingual model Google currently recommends. The results confirm our expectations: with accuracies between 87.2% (for Dutch) and 91.9% (for Spanish), BERT outperforms our initial spaCy models by an impressive 8.4 percentage points on average. This means BERT nearly halves the number of errors on the test set.
Unfortunately, BERT is not without its drawbacks. Each of our six finetuned models takes up almost 700MB on disk and their inference times are much longer than spaCy's. That makes them hard to deploy on a device with limited resources or for many users in parallel. To address these challenges, we turn to model distillation: we have our finetuned BERT models serve as teachers and spaCy's simpler convolutional models as students that learn to mimic the teacher's behavior. We follow the model distillation approach described by Tang et al. (2019), who show it is possible to distill BERT to a simple BiLSTM and achieve results similar to an ELMo model with 100 times more parameters.
Before we can start training our small models, however, we need more data. In order to learn and mimic BERT's behavior, our students need to see more examples than the original training sets can offer. Tang et al. therefore apply three methods for data augmentation (the creation of synthetic training data on the basis of the original training data):
Since the product reviews in our data set can be fairly long, we add a fourth method to the three above:
These augmentation methods not only help us create a training set that is many times larger than the original one; by sampling and replacing various parts of the training data, they also inform the student model about what words or phrases have an impact on the output of its teacher. Moreover, in order to give it as much information as possible, we don't show the student the label its teacher predicted for an item, but its precise output values. In this way, the small model can learn how probable the best class was exactly, and how it compared to the other one(s). Tang et al. (2019) trained the small model with the logits of its teacher, but our experiments show using the probabilities can also give very good results.
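The difference between a hard label and the teacher's precise output values is easy to see in code. The sketch below shows two possible student losses: the mean squared error between logits, which is the objective Tang et al. describe, and a cross-entropy against the teacher's probabilities. The function names and the example logits are assumptions for illustration, and this is not necessarily the loss our spaCy student used.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities for a list of logits."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_total for z in logits]


def softmax(logits):
    """Turn logits into probabilities that sum to one."""
    return [math.exp(lp) for lp in log_softmax(logits)]


def logit_mse(student_logits, teacher_logits):
    """Mean squared error between student and teacher logits."""
    pairs = list(zip(student_logits, teacher_logits))
    return sum((s - t) ** 2 for s, t in pairs) / len(pairs)


def soft_cross_entropy(student_logits, teacher_probs):
    """Cross-entropy of the student's distribution against soft teacher targets."""
    student_log_probs = log_softmax(student_logits)
    return -sum(t * lp for t, lp in zip(teacher_probs, student_log_probs))


# Hypothetical teacher output for one review, as [positive, negative] logits.
teacher_logits = [2.0, -1.0]
teacher_probs = softmax(teacher_logits)  # soft target, not simply [1.0, 0.0]
```

A hard label would only tell the student that the review is positive. The soft target also tells it how confident the teacher was, so a student that agrees with the teacher's confidence gets a lower loss than one that is merely on the right side of the decision boundary.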
One of the great advantages of model distillation is that it is model agnostic: the teacher model can be a black box, and the student model can have any architecture we like. To keep our experiments simple, we chose as our student the same spaCy text classifier as we did for our baselines. The training procedure, too, remained the same: we used the same batch sizes, learning rate, dropout and loss function, and stopped training when the accuracy on the development data stopped going up. We used the augmentation methods above to put together a synthetic data set of around 60,000 examples for each language. We then collected the predictions of the finetuned BERT models for this data. Together with the original training data, this became the training data for our smaller spaCy models.
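The data side of this procedure can be sketched as follows. The sketch implements only two augmentation methods from Tang et al., masking and n-gram sampling, with made-up probabilities. `teacher_predict` is a hypothetical stand-in for a finetuned BERT model that returns the probability of the positive class, and representing the original gold labels as 1.0 and 0.0 is our assumption about how the two data sources could be combined.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, p_mask, rng):
    """Replace each token by the mask symbol with probability p_mask."""
    return [MASK if rng.random() < p_mask else tok for tok in tokens]


def sample_ngram(tokens, rng, max_n=5):
    """Keep a random contiguous n-gram, with n between 1 and max_n."""
    n = rng.randint(1, min(max_n, len(tokens)))
    start = rng.randint(0, len(tokens) - n)
    return tokens[start:start + n]


def augment(text, rng, n_copies=20, p_mask=0.1, p_ngram=0.25):
    """Create up to n_copies distinct synthetic variants of one text."""
    tokens = text.split()
    if not tokens:
        return []
    variants = set()
    for _ in range(n_copies):
        new_tokens = mask_tokens(tokens, p_mask, rng)
        if rng.random() < p_ngram:
            new_tokens = sample_ngram(new_tokens, rng)
        variants.add(" ".join(new_tokens))
    variants.discard(text)  # unchanged copies add nothing
    return sorted(variants)


def build_distillation_set(labelled, teacher_predict, rng):
    """Combine the gold-labelled originals with teacher-scored synthetic texts."""
    data = []
    for text, label in labelled:
        data.append((text, 1.0 if label == "positive" else 0.0))
        for variant in augment(text, rng):
            data.append((variant, teacher_predict(variant)))
    return data


# Usage with a dummy teacher that returns a constant score.
rng = random.Random(0)
train = [("great book and a fantastic read", "positive")]
distillation_set = build_distillation_set(train, lambda text: 0.5, rng)
```

Because the teacher is only ever called as a function from text to score, the same code would work with any teacher model, which is the model-agnostic property mentioned above.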
Despite this simple setup, the distilled spaCy models outperformed our initial spaCy baselines by a clear margin. On average, they improved accuracy by 7.3 percentage points (just 1 point below the BERT models) and reduced the error rate by 39%. Their performance demonstrates that for a particular task such as sentiment analysis, we don't need all the expressive power that BERT offers. It is perfectly possible to train a model that performs almost as well as BERT, but with many fewer parameters.
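The averages reported here and for BERT are consistent with each other, as a quick back-of-the-envelope check shows. Note the simplification: this treats the averaged figures as if they came from a single data set, whereas an average of per-language error reductions is not exactly the error reduction of the averages. The derived numbers are therefore rough estimates, not reported results.

```python
# Reported averages from the text.
distilled_gain = 7.3               # accuracy gain of distilled spaCy, in points
distilled_error_reduction = 0.39   # relative error reduction of distilled spaCy
bert_gain = 8.4                    # accuracy gain of BERT, in points

# error_reduction = gain / baseline_error, so:
baseline_error = distilled_gain / distilled_error_reduction  # roughly 18.7 points
baseline_accuracy = 100.0 - baseline_error                   # roughly 81.3%

# The same baseline error gives BERT's relative error reduction.
bert_error_reduction = bert_gain / baseline_error            # roughly 0.45
```

An implied baseline accuracy of about 81% sits inside the 79.5% to 83.4% range we measured, and an error reduction of about 45% matches the claim that BERT nearly halves the number of errors.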
With the growing popularity of large transfer-learning models, putting NLP solutions into production is becoming more challenging. Approaches like model distillation, however, show that for many tasks you don't need hundreds of millions of parameters to achieve high accuracies. Our experiments with sentiment analysis in six languages demonstrate it is possible to train spaCy's convolutional neural network to rival much more complex model architectures such as BERT's. In the future, we hope to investigate model distillation in more detail at NLP Town. For example, we aim to find out what data augmentation methods are most effective, or how much synthetic data we need to train a smaller model.
Yves discovered Natural Language Processing 15 years ago as an MSc student at the University of Edinburgh, and has never looked back. With a background as a researcher and developer in academia (University of Leuven, Stanford University) and industry (Textkernel, Wolters Kluwer), he founded NLP Town to further indulge and spread his love for NLP.