Reading
A curated list of papers worth reading, grouped by topic.
Alternative architectures
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Gu & Dao · 2023
The selective state space model that made an attention-free architecture competitive with Transformers.
Long context
- Lost in the Middle: How Language Models Use Long Contexts · Liu et al. (TACL 2024) · 2023
Models use information best at the start and end of context, and worst in the middle.
Reasoning
- Training Large Language Models to Reason in a Continuous Latent Space (Coconut) · Hao et al. (FAIR / UCSD) · 2024
Reasons in latent space by feeding the last hidden state back as the next input embedding.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning · DeepSeek-AI · 2025
Reasoning incentivized through RL with verifiable rewards; the R1-Zero variant uses pure RL with no supervised fine-tuning first.
Scaling
- Training Compute-Optimal Large Language Models (Chinchilla) · Hoffmann et al. · 2022
For a fixed compute budget, scale parameters and training tokens in equal proportion; most large models of the time were undertrained.