This awesome article by @dalequark is almost 2 years old but still worth a read: "Transformers, Explained: Understand the Model Behind GPT-3, BERT, and T5"