Self-training Improves Pre-training for Natural Language Understanding
Facebook AI & Stanford
Most semi-supervised NLP approaches require in-domain unlabeled data: for the best results, the unlabeled data we want to use for semi-supervised training must come from the same domain as the annotated dataset.
This paper proposes SentAugment - a method that constructs task-specific in-domain unannotated datasets on the fly from a large external bank of sentences. So for any new NLP task where we only have a small labeled dataset, we no longer need to collect a matching unannotated dataset in order to use semi-supervised training.
Now we can sort of cheat to improve the performance of an NLP model on almost any downstream task using Self-training (which is also called Teacher-Student training):
1. We retrieve the most relevant sentences (a few million of them) for the current downstream task from the external bank. For retrieval we use the embedding space of a sentence encoder - a Transformer pre-trained with masked language modeling and finetuned to maximize the cosine similarity between similar sentences.
2. We train the teacher - a RoBERTa-Large model finetuned on the downstream task.
3. Then we use the teacher model to annotate the retrieved unlabeled in-domain sentences and perform additional filtering by keeping only the sentences with the most confident predictions.
4. As the student model, we then finetune a new RoBERTa-Large on this synthetic data using a KL-divergence loss, treating the teacher's post-softmax class probabilities as labels (i.e., not only the most confident class but the entire class distribution serves as the label for every sentence) - see the sketch after this list.
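To make steps 3-4 concrete, here is a minimal PyTorch sketch of pseudo-labeling with confidence filtering and KL-divergence distillation. It assumes HuggingFace-style classification models (outputs with a .logits field) and a tokenize callable that returns model-ready inputs; the threshold value and helper names are illustrative, not the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def pseudo_label(teacher, sentences, tokenize, threshold=0.9):
    """Step 3: label retrieved sentences with the teacher, keep only confident ones."""
    teacher.eval()
    kept = []
    with torch.no_grad():
        for sent in sentences:
            inputs = tokenize(sent)  # hypothetical helper returning model-ready tensors
            probs = F.softmax(teacher(**inputs).logits, dim=-1).squeeze(0)
            if probs.max().item() >= threshold:  # confidence filtering
                kept.append((inputs, probs))
    return kept

def distillation_step(student, optimizer, inputs, teacher_probs):
    """Step 4: one student update, KL divergence against the teacher's distribution."""
    student.train()
    log_probs = F.log_softmax(student(**inputs).logits, dim=-1)
    loss = F.kl_div(log_probs, teacher_probs.unsqueeze(0), reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training on soft labels lets the student also learn from the teacher's uncertainty across classes, which is the point of using KL divergence instead of a hard cross-entropy target.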
Such a self-training procedure significantly boosts performance over the baseline, and the positive effect is larger when fewer ground-truth annotated sentences are available.
As a large-scale external bank of unannotated sentences, the authors use CommonCrawl. In particular, they use a corpus of 5 billion sentences (100B words). Because of its scale and diversity, the sentence bank contains data from various domains and in different styles, making it possible to retrieve relevant data for many downstream tasks. To retrieve the most relevant sentences for a specific downstream task, we need to obtain an embedding for the task. Several options exist: (1) average the embeddings of all sentences in the training set; (2) average the embeddings for every class; (3) keep the embeddings of the individual training sentences.
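As a rough illustration of the retrieval step, here is a small NumPy sketch of option (2): build one embedding per class by averaging that class's training-sentence embeddings, then rank the bank by cosine similarity. The function names and the assumption that embeddings are already computed as NumPy arrays are mine, not from the paper.

```python
import numpy as np

def class_embeddings(train_embeddings, train_labels):
    """Option (2): one task embedding per class = mean of that class's sentence embeddings."""
    labels = np.asarray(train_labels)
    return {c: train_embeddings[labels == c].mean(axis=0) for c in set(train_labels)}

def retrieve(bank_embeddings, query_embedding, k=1000):
    """Return indices of the k bank sentences closest to the query by cosine similarity."""
    bank = bank_embeddings / np.linalg.norm(bank_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = bank @ query  # cosine similarity against every bank sentence
    return np.argsort(-scores)[:k]  # the paper retrieves millions; k is kept small here
```

At the scale of billions of bank sentences, an approximate nearest-neighbor index (e.g., FAISS) would be used in practice instead of this brute-force dot product.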
📝 Paper
🛠 Code
#paper_explained #nlp
Controllable Neural Text Generation
Self-supervised pretraining of Language Models has become the de facto standard nowadays. When generating sentences from a Language Model by iteratively sampling the next token, we do not have much control over attributes of the output text, such as the topic, the style, the sentiment, etc. Many applications demand good control over the model output. For example, if we plan to use an LM to generate reading materials for kids, we would like to guide the output stories to be safe, educational, and easily understood by children.
How do we steer a powerful unconditioned language model? Note that model steerability is still an open research question. In this blogpost, Lilian Weng (OpenAI) discusses several approaches to controlled content generation with an unconditioned language model:
- Apply guided decoding strategies and select desired outputs at test time (a toy sketch follows this list).
- Optimize for the most desired outcomes via good prompt design.
- Fine-tune the base model or steerable layers to do conditioned content generation.
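As a toy illustration of the first bullet, here is a sketch of one simple guided-decoding trick: boost the logits of tokens from a topic word list before sampling the next token. This is my own minimal example of the general idea, not a method taken from the blogpost.

```python
import torch
import torch.nn.functional as F

def guided_sample(logits, topic_token_ids, boost=3.0, top_k=50):
    """Toy guided decoding: nudge next-token logits toward a bag of topic tokens, then sample."""
    steered = logits.clone()                   # logits: 1-D tensor over the vocabulary
    steered[topic_token_ids] += boost          # reward tokens related to the desired topic
    values, indices = torch.topk(steered, top_k)
    probs = F.softmax(values, dim=-1)
    return indices[torch.multinomial(probs, 1)].item()
```

More principled guided-decoding methods rescore or constrain candidates with a separate attribute model rather than a fixed bag of tokens.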
🌀 Blogpost link
--
P.S. Lilian Weng has a very informative blog with many interesting posts, mostly on Reinforcement Learning and Natural Language Processing.
#NLP