Meta-Learning for Keyphrase Extraction

Sinequa
Sep 6, 2021

A strategy to boost zero-shot predictions

KeyPhrase Extraction (KPE) is the process of extracting relevant chunks of words from a document to best capture and represent its content. Keyphrases help the reader get the gist of a document at a glance and browse quickly through many documents. They are therefore very valuable for the user experience of search engines, for example to enhance previews (showing the beginning of a document while staying on the main results page).

In this article we outline what needs to be done to build a keyphrase extractor that performs well not only on in-domain data, but also in a zero-shot scenario where keyphrases need to be extracted from data that have a different distribution (either a different domain or a different type of documents).

An ill-defined task

The definition of what is or isn’t a “good” keyphrase is very fuzzy for humans: many datasets report a very low matching rate of keyphrases between annotators [1]. Yet no annotator is really assigning wrong keyphrases, so how can humans disagree so much? The main reasons are:

  1. There is a vast number of potentially relevant keyphrases for a given document (especially if the document is several paragraphs long or more), yet only a few (~ 3–10 on usual datasets) of the most salient are selected at a time, and
  2. Humans do not really agree on the exact definition of what a “good” keyphrase should be.

When annotators discuss the matter, review each other’s choices and work from a common definition, their F1-score (agreement on the chosen keyphrases) can triple [2] on some easy-to-annotate datasets. A common definition can be something like:

  • Keyphrases are between one and five words long,
  • Keyphrases should represent the main topics and ideas of the document,
  • Keyphrases should each convey new information and not be synonyms of one another.

One of the great challenges of research is comparing models with one another, as different metrics can cause the same models to be evaluated and ranked differently. The main metric for this article is F1@5, the F1-score on exact matches of the top 5 extracted keyphrases: precision@5 is the fraction of the top 5 extracted keyphrases that exactly match an annotated keyphrase, recall@5 is the fraction of annotated keyphrases found among the top 5, and F1@5 is their harmonic mean.

Using the same metric, the scores of a model for different datasets can vary a lot, either because the number of annotated keyphrases is different from the number of keyphrases extracted by the model (i.e., if there are more annotated keyphrases than extracted keyphrases, recall will decrease and precision will increase), or simply because it’s harder to extract keyphrases from the documents in some datasets. Oftentimes, models tend to work well on a dataset for which they were tailored (through either feature engineering or model parameter tuning) and have poorer performance on a slightly differing dataset.
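
For clarity, here is a minimal sketch of how F1@5 on exact matches can be computed (the function name and the normalization choices are ours):

```python
def f1_at_k(predicted, gold, k=5):
    """F1 on exact (case-insensitive) matches between the top-k
    predicted keyphrases and the annotated gold keyphrases."""
    top_k = [p.lower().strip() for p in predicted[:k]]
    gold_set = {g.lower().strip() for g in gold}
    matches = sum(1 for p in top_k if p in gold_set)
    precision = matches / len(top_k) if top_k else 0.0
    recall = matches / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```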

Scarcity of data

When benchmarking the state-of-the-art models, we were struck by how little data is available for training. We found only 39 datasets created with proper human annotation (some curated dataset lists are available online [3]). Most datasets are extracted from scientific papers in the natural sciences, and some important domains such as law are not represented at all. Moreover, 25 of those datasets are in English, and most contain only a few hundred or thousand documents, which is limiting for supervised training. Luckily, some are quite big:

  • KPTimes [4]: 289,923 news articles, license terms not entirely clear [5],
  • OpenKP (MS MARCO) [6]: 148,124 web pages, limited to academic use, and
  • KP20K [7]: 547,090 science paper abstracts, limited to academic use.

As you will see, having a bigger dataset is both necessary and sufficient for proper training and boosting performance.

Creating a new keyphrase dataset

Due to the scarcity of data, especially in languages other than English, we turned to HAL [8], which is “an open archive where authors can deposit scholarly documents from all academic fields”. When uploading an article, the author can specify a set of keyphrases. HAL offers an API [9] to crawl the available articles, including authors, title, abstract and, most importantly, the keyphrases.

We decided to keep only the keyphrases present in the abstract, and we kept only the articles with at least three such keyphrases (a minimal sketch of this filtering follows the language breakdown below). Unsurprisingly, most of the gathered data is in English and French (HAL being a French initiative), but some other languages are also available. The number of documents per language is:

  • English: 206,365
  • French: 119,185
  • Spanish: 1,550
  • Portuguese: 379
  • Italian: 228
  • German: 96
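
To make the filtering concrete, here is a minimal sketch of the step described above; the record structure (an abstract plus the author-provided keyphrases) is a simplified assumption, not the exact HAL API schema:

```python
def filter_hal_records(records, min_present=3):
    """Keep only the keyphrases that appear in the abstract, and keep
    only the records with at least `min_present` such keyphrases."""
    kept = []
    for rec in records:  # each record: {"abstract": str, "keyphrases": [str, ...]}
        abstract = rec["abstract"].lower()
        present = [kp for kp in rec["keyphrases"] if kp.lower() in abstract]
        if len(present) >= min_present:
            kept.append({"abstract": rec["abstract"], "keyphrases": present})
    return kept
```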

Model choice

To solve the keyphrase extraction problem, we can choose from two fundamentally different approaches — supervised or unsupervised learning.

Unsupervised learning has the nice property of not depending on the training data distribution and, therefore, of being theoretically more robust to unseen data. However, it seriously lags behind supervised approaches in performance. Moreover, our extensive study of SIFRank [10] (a well-performing unsupervised model based on ELMo [11] embeddings) showed that its robustness was not that good and that it often produced unsatisfactory keyphrases (either not very salient or not self-contained), which is problematic for any practical use.

The supervised approach requires a dataset annotated with golden keyphrases. Using these labels, we can train a model to extract keyphrases by minimizing a loss function that serves as a proxy for maximizing the F1 score on exact matches. This method works well with transformer-based models such as BERT.

This is why we decided to go with a BERT-based, state-of-the-art method called BERT-JointKPE [12]. We will give a brief overview of how the method works, but you should check out their great paper if you want to dive deeper. At a high level, the model steps are as follows (a simplified code sketch follows the schema below):

  1. The raw text is passed into a BERT model to extract one embedding per token.
  2. The token embeddings are then fed to custom CNNs with various window sizes (from 1 up to the maximum length of a keyphrase), which create an n-gram representation for every possible n-gram.
  3. Those n-gram representations are then fed to the chunking network, which will select which n-grams are keyphrases (chunks) thanks to a linear binary classification layer.
  4. Those chunks are in turn fed into the ranking network which will rank them according to their saliency and assign a score to each of them using a linear layer.
BERT-JointKPE’s Schema
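
To make the architecture more concrete, here is a simplified PyTorch sketch of the n-gram CNNs and the two heads on top of BERT token embeddings. This is our paraphrase of the idea, not the authors’ implementation; dimensions and names are assumptions:

```python
import torch
import torch.nn as nn

class JointKPEHeads(nn.Module):
    """Simplified sketch of the chunking and ranking heads described above."""
    def __init__(self, hidden=768, max_ngram=5):
        super().__init__()
        # one CNN per n-gram size: window n turns n consecutive token
        # embeddings into a single n-gram representation
        self.ngram_cnns = nn.ModuleList(
            [nn.Conv1d(hidden, hidden, kernel_size=n) for n in range(1, max_ngram + 1)]
        )
        self.chunking = nn.Linear(hidden, 2)  # is this n-gram a keyphrase chunk?
        self.ranking = nn.Linear(hidden, 1)   # saliency score used for ranking

    def forward(self, token_embeddings):      # (batch, seq_len, hidden) from BERT
        x = token_embeddings.transpose(1, 2)  # Conv1d expects (batch, hidden, seq_len)
        ngrams = torch.cat(
            [cnn(x).transpose(1, 2) for cnn in self.ngram_cnns], dim=1
        )                                          # all n-grams of all sizes
        chunk_logits = self.chunking(ngrams)       # (batch, n_ngrams, 2)
        scores = self.ranking(ngrams).squeeze(-1)  # (batch, n_ngrams)
        return chunk_logits, scores
```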

The training is performed jointly on the combined loss of the chunking and ranking networks; the total loss is simply the sum of the chunking loss and the ranking loss. For the chunking loss, a cross-entropy is computed for every chunk, where the chunk label is 1 if the chunk is a keyphrase and 0 otherwise. For the ranking loss, a hinge loss (as in pairwise learning to rank) is computed against exact matches with the ground truth: the model is penalized whenever a non-keyphrase scores higher than (or too close to) a real keyphrase.
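
A minimal sketch of that combined loss, with shapes and variable names of our choosing:

```python
import torch
import torch.nn.functional as F

def joint_loss(chunk_logits, chunk_labels, scores, keyphrase_mask, margin=1.0):
    """Sum of the chunking and ranking losses described above.
    chunk_logits: (n_ngrams, 2); chunk_labels: (n_ngrams,) with 1 for n-grams
    that are keyphrases and 0 otherwise; scores: (n_ngrams,) saliency scores;
    keyphrase_mask: boolean mask of n-grams that exactly match the ground truth."""
    chunking_loss = F.cross_entropy(chunk_logits, chunk_labels)
    pos = scores[keyphrase_mask]    # scores of real keyphrases
    neg = scores[~keyphrase_mask]   # scores of non-keyphrases
    # pairwise hinge: penalize whenever a non-keyphrase scores within
    # `margin` of (or above) a real keyphrase
    ranking_loss = torch.clamp(margin - pos.unsqueeze(1) + neg.unsqueeze(0), min=0).mean()
    return chunking_loss + ranking_loss
```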

Meta-Learning

Our goal was not only to reach high scores on datasets that were used during training; we also wanted to reach good scores on previously unseen data from other domains. For that purpose, we performed multi-source domain adaptation, i.e., training the model on various datasets to reach good performance across several domains (science, law, etc.) and text structures (tweets, articles, emails, etc.).

When the model is trained on at least two datasets with different documents and different annotators, performance goes up significantly, both on a validation set from a known distribution and on zero-shot data. This phenomenon is called meta-learning [13].

As the reader will probably have noticed, no F1 scores are reported for our experiments yet. We are currently collecting them, rendering them in proper visualizations and running extended experiments. For now the reader will have to trust us, but no worries: a post with a lot more details is coming soon.

When faced with different types of distributions, the model has to learn beyond recognizing patterns in a single kind of document; it has to learn the general principle of “what is a good keyphrase”. Indeed, varying the type and source of training data reduces confounding factors. For example, if all documents are scientific papers, the model will learn that rather complicated words are usually significant and people’s names are insignificant; but in a news article, the name of a person may indeed be important. When learning on both types of data, the model has to acknowledge that not all documents are alike. Hence, when faced with a new type of document, it will not assume it is exactly like the training distribution, and the zero-shot performance is expected to rise. There are four types of parameters to vary in the training distributions:

  1. Annotator (prior on what is a good keyphrase),
  2. Structure of the documents (tweets, articles, emails, etc.),
  3. Topic of the document (science, law, politics, etc.),
  4. Language.

But what is more surprising is that the performance on any of the datasets used in training also rises, sometimes very sharply. How does mixing the distribution with another slightly different distribution help? This is because instead of learning to recognize a distribution, the algorithm learns the concept of keyphrases as a whole. The more diverse the added data is, the more this effect is noticeable. However, we noticed that we do not need many different datasets; using five datasets is barely more helpful than using only two. We also noticed that not all datasets are equally useful for meta-learning. Here are our main findings:

  1. The larger the dataset, the more useful it is for meta-learning (i.e., more data means better performance).
  2. Using a dataset from a certain domain improves zero-shot performance especially for another dataset from the same domain.
  3. Small datasets (even if they have a very different distribution) can make a real difference (e.g., 10,000 samples from various languages).
  4. As small datasets are easily overfitted, they should be combined with larger datasets. This way we can reap the benefits of diverse training data while avoiding the risk of overfitting.
  5. In order to avoid small datasets being outweighed by larger ones, repeat the data from the smaller datasets to match the size of the biggest dataset (a minimal sketch of this repetition follows the list). There are obviously limits to this approach: it should only be done if the datasets do not differ in size by more than an order of magnitude.
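
The repetition trick from point 5 can be as simple as the following sketch (each dataset is assumed to be a plain list of samples):

```python
import random

def balance_by_repetition(datasets):
    """Repeat the smaller datasets so that each one roughly matches the size
    of the largest dataset, then shuffle everything into one training set."""
    target = max(len(d) for d in datasets)
    mixed = []
    for d in datasets:
        repeats, remainder = divmod(target, len(d))
        mixed.extend(d * repeats + d[:remainder])
    random.shuffle(mixed)
    return mixed
```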

Combining all these findings (i.e., using a large dataset like KPTimes and some smaller datasets covering various topics and types of text, such as emails, science papers and medical papers) allows us to train a single model that performs very well in a wide array of domains and is competitive with unsupervised models in zero-shot scenarios. This is even more impressive given that most models we compare with only work well on a single dataset, namely their training dataset.

We also conducted experiments with datasets in different languages. First, we trained a multilingual BERT-Base model only on English datasets; then we trained a second model on the English datasets plus the HAL data in the other available languages. While the zero-shot performance of the two models on English data did not differ significantly, the second model performed, as expected, much better on non-English data.

Influence of model size

Additionally, we tested the effect of reducing the size of BERT on model performance (with the same training on the same data), since switching from BERT-Base to smaller models can reduce inference times substantially, which is highly desirable. It turns out that using BERT-Small [14] barely lowers the overall performance despite using only 22% of the weights of BERT-Base. Remarkably, the drop in performance on zero-shot datasets is even smaller, though still noticeable.
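
If you want to try the same trade-off, a smaller pre-trained BERT checkpoint can be dropped in as the encoder, for instance with Hugging Face Transformers. The checkpoint below is the publicly released BERT-Small and is shown as an illustration, not necessarily our exact setup:

```python
from transformers import AutoModel, AutoTokenizer

# BERT-Small: 4 Transformer layers, hidden size 512 (vs 12 layers, 768 for BERT-Base)
name = "google/bert_uncased_L-4_H-512_A-8"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

tokens = tokenizer("Meta-learning for keyphrase extraction", return_tensors="pt")
token_embeddings = encoder(**tokens).last_hidden_state  # would feed the KPE heads
```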

Post-processing tricks

In the course of our research, we also explored a number of techniques that make the keyphrases extracted by the model more useful to users. The goal is to narrow the result set down to only the best keyphrases and to ensure that each keyphrase brings new value and (hopefully) isn’t just a different way of saying the same thing as an earlier keyphrase. Here are some tips to maximize the value of a keyphrase system (a short sketch combining several of them follows the list):

  • Threshold: Only accept keyphrases with a high confidence score to avoid predicting keyphrases of low quality.
  • Syntactic diversity: If a keyphrase is included in another keyphrase, then the longer keyphrase should convey more information, so we delete the shorter one; this is repeated until all pairs have been checked. This biases the results towards longer keyphrases but is very effective for diversity.
  • Merge overlapping keyphrases: If two keyphrases overlap (e.g., the last word of a keyphrase matches the first word of another keyphrase), then it is probable that they represent two pieces of an important keyphrase, like a long name. It is desirable to merge them into a single keyphrase, however, an upper bound to limit the maximum length of a keyphrase created in such a manner should be applied (e.g., do not create keyphrases with more than eight words).
  • Remove stopwords at the beginning or end: The model sometimes predicts a keyphrase with a stopword at the beginning or the end; such stopwords are rarely helpful and should be removed. Note that stopwords inside the keyphrase should be kept, as they could very well be part of a valid keyphrase.
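
Putting a few of these tips together, here is a rough sketch; the threshold, the maximum length and the stopword list are illustrative, and the merging of overlapping keyphrases is left out for brevity:

```python
STOPWORDS = {"the", "a", "an", "of", "in", "on", "for", "to", "and"}  # illustrative subset

def postprocess(keyphrases, scores, threshold=0.5, max_len=8):
    """Apply the tips above: confidence threshold, strip boundary stopwords,
    drop keyphrases that are contained in a longer keyphrase."""
    # 1. keep only keyphrases the model is confident about
    kept = [kp for kp, s in zip(keyphrases, scores) if s >= threshold]
    # 2. remove stopwords at the beginning or the end (inner ones are kept)
    cleaned = []
    for kp in kept:
        words = kp.split()
        while words and words[0].lower() in STOPWORDS:
            words = words[1:]
        while words and words[-1].lower() in STOPWORDS:
            words = words[:-1]
        if 0 < len(words) <= max_len:
            cleaned.append(" ".join(words))
    # 3. syntactic diversity: drop a keyphrase if it appears inside another one
    return [kp for kp in cleaned
            if not any(kp != other and kp.lower() in other.lower() for other in cleaned)]
```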

Concluding remarks

To achieve automatic KPE with good enough performance to be of practical use, we need to take into account the behavior of models on unseen data, and a few tricks can be very useful for that. If you know what the target distribution will look like, fine-tuning on this distribution (or something close to it) to trigger transfer learning is often very effective. But if you wish to have a more universal model, then robustness is crucial. A great way to achieve a robust model is to train it on various datasets to enable meta-learning, which also enhances performance. Our tests showed great results on a wide variety of data, including noisy, out-of-domain and more complicated datasets.

Examples

Wondering what kind of keyphrases our model is capable of extracting? Here are some examples from different domains:

  • Wikipedia article: [image of extracted keyphrases]
  • Indeed job offer: [image of extracted keyphrases]
  • Enron email: [image of extracted keyphrases]


Sinequa — Intelligent Enterprise Search vendor — AI, NLP, Machine Learning and Deep Learning — For more information visit www.sinequa.com.