I really enjoyed reading the transformer code here: [link] The key code is only ~70 lines, most of which is straightforward boilerplate. The core is maybe 20 lines - that's what runs GPT-2!
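For flavor, here's a minimal sketch of what such a ~20-line core can look like: a GPT-2-style forward pass in NumPy. This is not the linked code; the parameter names (`wte`, `wpe`, `w_qkv`, ...) are my own assumptions, and biases are omitted for brevity.

```python
# A minimal GPT-2-style forward pass (sketch, not the linked implementation).
# Weights are passed in as plain arrays/dicts; biases omitted to keep it short.
import numpy as np

def gelu(x):
    # GPT-2's tanh approximation of GELU.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, g, b, eps=1e-5):
    return g * (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps) + b

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, w_qkv, w_out, n_head):
    T, C = x.shape
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)           # project to queries/keys/values
    q, k, v = (a.reshape(T, n_head, C // n_head).transpose(1, 0, 2) for a in (q, k, v))
    mask = np.triu(np.full((T, T), -1e10), k=1)         # causal mask: no peeking ahead
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C // n_head) + mask)
    return (att @ v).transpose(1, 0, 2).reshape(T, C) @ w_out

def block(x, p, n_head):
    # Pre-norm transformer block: attention then MLP, each wrapped in a residual.
    x = x + attention(layer_norm(x, p["ln1_g"], p["ln1_b"]), p["w_qkv"], p["w_out"], n_head)
    return x + gelu(layer_norm(x, p["ln2_g"], p["ln2_b"]) @ p["w_fc"]) @ p["w_proj"]

def gpt2(tokens, wte, wpe, blocks, lnf_g, lnf_b, n_head):
    x = wte[tokens] + wpe[np.arange(len(tokens))]       # token + position embeddings
    for p in blocks:
        x = block(x, p, n_head)
    return layer_norm(x, lnf_g, lnf_b) @ wte.T          # logits over the vocabulary
```

With random weights this runs end to end; everything else a full implementation needs (checkpoint loading, BPE tokenization, a sampling loop) is the boilerplate the post alludes to.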