In this post, I focus on the theoretical knowledge you need to follow the latest trends in NLP. I built this reading list as I learned new concepts. The resources include papers, blogs, and videos.

You do not need to read everything in full. For each paper, your main goal should be to know what it introduced and to ask yourself: do I understand how it works, and how it compares with the state of the art?

Trend: Use bigger transformer-based models and tackle multi-task learning.

Warning: It is an increasing trend in NLP that if you get a new idea while reading any of these papers, you will need massive compute power to get any reasonable results. So you are limited by the open-source models.
  1. fastai:- I had already watched the videos, so I thought I should add it to the top of the list.

    • Lesson 4 Practical Deep Learning for Coders. It will get you up to speed on how to implement a language model in fastai.
    • Lesson 12 Deep Learning from the Foundations. Goes further into ULMFiT training.
  2. LSTM:- Although transformers are mainly used nowadays, in some cases you can still use an LSTM, and it was the first successful model to get good results. If you do want an LSTM today, use AWD_LSTM.

    • Long Short-Term Memory paper. A quick skim of the paper is sufficient.
    • Understanding LSTM Networks blog. It explains all the details of the LSTM network graphically.
  3. AWD_LSTM:- It was proposed to overcome the shortcomings of LSTM by introducing dropout on the hidden-to-hidden weights, embedding dropout, and weight tying. You should use AWD_LSTM instead of LSTM.

    • Regularizing and Optimizing LSTM Language Models paper. AWD_LSTM paper
    • Official code by Salesforce
    • fastai implementation
  4. Pointer Models:- Although not necessary, it is a good read. You can think of it as pre-attention theory.

    • Pointer Sentinel Mixture Models paper
    • Official video of above paper.
    • Improving Neural Language Models with a continuous cache paper
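
Two of the AWD_LSTM regularizers from item 3 above, embedding dropout and weight tying, are easy to picture in a few lines. The following is only a sketch in plain Python, not the Salesforce or fastai code. The function names `embedding_dropout` and `tied_logits` and the toy sizes are my own assumptions for illustration.

```python
import random

def embedding_dropout(embedding, p, rng):
    """Zero out whole word vectors (rows) with probability p, rescale the rest."""
    scale = 1.0 / (1.0 - p)
    dropped = []
    for row in embedding:
        keep = rng.random() >= p
        dropped.append([v * scale if keep else 0.0 for v in row])
    return dropped

def tied_logits(hidden, embedding):
    """Weight tying: reuse the input embedding matrix as the output projection."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]

rng = random.Random(0)
vocab_size, dim = 5, 3  # toy sizes, chosen only for illustration
embedding = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

# Embedding dropout removes entire words, not individual dimensions.
dropped = embedding_dropout(embedding, p=0.4, rng=rng)

# The same matrix maps a token id to a vector and a hidden state to vocabulary scores.
input_vector = embedding[2]
hidden = [0.1, -0.2, 0.3]  # stand-in for the LSTM output at one timestep
logits = tied_logits(hidden, embedding)
print(len(logits))
```

Because the embedding and the output layer share one matrix, the model has far fewer parameters to fit, which is the point of weight tying.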

Tip: What is the difference between weight decay and L2 regularization? In weight decay, you directly subtract a fraction of the weight in the update rule, while in L2 regularization the penalty is added to the loss function. Why bring this up? Most probably the DL libraries implement their weight_decay option as L2 regularization under the hood. The two are equivalent for plain SGD but not for adaptive optimizers like Adam.

Note: In some of the papers, you will see that the authors preferred SGD over Adam, citing that Adam does not perform well. The reason (maybe) is that PyTorch/TensorFlow make the above mistake. This is explained in detail in this post.
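
To make the tip above concrete, here is a minimal sketch on a single scalar weight. It is plain Python rather than any library's optimizer code, and the function names and numbers are assumptions chosen for illustration. The "adaptive" step divides the gradient by its magnitude, which is a simplified stand-in for what Adam does, not a full Adam implementation.

```python
def sgd_l2_step(w, grad, lr, wd):
    # L2 regularization: the penalty 0.5 * wd * w**2 is added to the loss,
    # so its gradient wd * w is added to the loss gradient.
    return w - lr * (grad + wd * w)

def sgd_weight_decay_step(w, grad, lr, wd):
    # Weight decay: the weight is shrunk directly in the update rule.
    return w - lr * grad - lr * wd * w

def adaptive_l2_step(w, grad, lr, wd, eps=1e-8):
    # The penalty gradient is rescaled together with the loss gradient.
    g = grad + wd * w
    return w - lr * g / (abs(g) + eps)

def adaptive_weight_decay_step(w, grad, lr, wd, eps=1e-8):
    # Decoupled weight decay: only the loss gradient is rescaled.
    return w - lr * grad / (abs(grad) + eps) - lr * wd * w

w, grad, lr, wd = 2.0, 0.5, 0.1, 0.01
sgd_l2 = sgd_l2_step(w, grad, lr, wd)
sgd_wd = sgd_weight_decay_step(w, grad, lr, wd)
ada_l2 = adaptive_l2_step(w, grad, lr, wd)
ada_wd = adaptive_weight_decay_step(w, grad, lr, wd)

print("SGD:     ", sgd_l2, sgd_wd)   # the two agree up to rounding
print("Adaptive:", ada_l2, ada_wd)   # the two differ
```

With plain SGD the two formulations give the same update, so the distinction only starts to matter once the gradient is rescaled per parameter.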
  1. Attention:- Remember Attention is not all you need.
    • CS224n video explaining attention. The attention part starts at 1:00:55.
    • Attention is all you need paper. This paper also introduces the Transformer which is nothing but a stack of encoder and decoder blocks. The magic is how these blocks are made and connected.
    • Read an annotated version of the above paper in PyTorch.
    • Official video explaining Attention
    • Google blog for Transformer
    • If you prefer video, you can check link1 and link2.
    • Transformer-XL: Attentive Language Models Beyond a Fixed Length Context paper. A better version of the Transformer, but BERT does not use it.
    • Google blog for Transformer-XL
    • Transformer-XL — Combining Transformers and RNNs Into a State-of-the-art Language Model blog
    • For video check this link.
    • The Illustrated Transformer blog
    • Attention and Memory in Deep Learning and NLP blog.
    • Attention and Augmented Recurrent Neural Networks blog.
    • Building the Mighty Transformer for Sequence Tagging in PyTorch: Part 1 blog.
    • Building the Mighty Transformer for Sequence Tagging in PyTorch: Part 2 blog.
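
The core operation inside those Transformer blocks is scaled dot-product attention, softmax(QKᵀ/√d_k)V from the Attention is all you need paper. Below is a rough single-head sketch in plain Python with no masking and no learned projections. The helper names and the toy vectors are my own assumptions, and the annotated PyTorch version listed above is the better reference for a real implementation.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = scaled_dot_product_attention(queries, keys, values)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output is a weighted mix of the value vectors, with more weight on the keys that match the query.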

There is a lot of research going on to make better transformers, and I may read more papers on this in the future. Other transformers include the Universal Transformer and the Evolved Transformer, which used AutoML to come up with a Transformer architecture.

  1. Random resources

    • Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) blog.
    • Character-Level Language Modeling with Deeper Self-Attention paper.
    • Using the Output Embedding to Improve Language Models paper.
    • Quasi-Recurrent Neural Networks paper. A very fast alternative to LSTM. It uses convolution layers so that most of the computation runs in parallel across timesteps. Code can be found in the fastai_library or official_code.
    • Deep Learning for NLP Best Practices blog by Sebastian Ruder. A collection of best practices to be used when training LSTM models.
    • Notes on the state of the art techniques for language modeling blog. A quick summary by Jeremy Howard of some of the tricks he uses in the fastai library.
    • Language Models and Contextualized Word Embeddings blog. Gives a quick overview of ELMo, BERT, and other models.
    • The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) blog.
  2. Multi-task Learning:- I am really excited about this. Here you train a single model for multiple tasks (more than 10 if you want). So your data looks like “translate to english some_text_in_german”. The model learns to use the initial information to choose the task that it should perform.

    • An overview of Multi-Task Learning in deep neural networks paper.
    • The Natural Language Decathlon: Multitask Learning as Question Answering paper.
    • Multi-Task Deep Neural Networks for Natural Language Understanding paper.
    • OpenAI GPT is an example of this.
  3. PyTorch:- PyTorch provides good tutorials that serve as references on how to code up most of the stuff in NLP.

  4. ELMo:- The first prominent research where we moved from pretrained word embeddings to using pretrained models to get the word embeddings. So you use the whole input sentence to get the embeddings for the tokens present in the sentence.

    • Deep Contextualized word representations paper, video
  5. ULMFiT:- Is this better than BERT? Maybe not, but ULMFiT still takes first place in Kaggle and other external competitions.

    • Universal Language Model Fine-tuning for Text Classification paper.
    • Jeremy Howard blog post announcing ULMFiT.
  6. OpenAI GPT:- I have not compared BERT with GPT2, but you can work on some kind of ensemble if you want. Do not use GPT1, as BERT was made to overcome the limitations of GPT1.

  7. BERT:- The most successful language model right now (as of May 2019).

    • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper.
    • Google blog on BERT
    • Dissecting BERT Part 1: The Encoder blog
    • Understanding BERT Part 2: BERT Specifics blog
    • Dissecting BERT Appendix: The Decoder blog
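
Going back to the multi-task learning item above, the “translate to english some_text_in_german” data format can be sketched in a few lines. This is only an illustration of the idea in plain Python. The helper `make_example`, the task prefixes other than the translation one, and the placeholder strings are assumptions of mine, not taken from any of the papers.

```python
def make_example(task_prefix, text, target):
    """Prepend the task description to the input so one model can serve many tasks."""
    return {"source": f"{task_prefix} {text}", "target": target}

# Placeholder strings stand in for real data; the task prefixes are illustrative.
examples = [
    make_example("translate to english", "some_text_in_german", "some_text_in_english"),
    make_example("summarize", "some_long_article", "some_short_summary"),
    make_example("answer the question", "some_question some_context", "some_answer"),
]

# All tasks share one input/output format, so they can be mixed in a single training set.
for ex in examples:
    print(ex["source"], "->", ex["target"])
```

Since every task is reduced to text in and text out, adding an eleventh task means adding new rows, not a new model head.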

To use all these models in PyTorch/TensorFlow you can use huggingface/transformers, which gives complete implementations along with pretrained models for BERT, GPT1, GPT2, and TransformerXL.

Congrats, you made it to the end. You now have most of the theoretical knowledge needed to practice NLP using the latest models and techniques.

What to do now? So far you have only learned the theory, so practice as much as you can.