Reading

A curated list of papers worth reading, grouped by topic.

Alternative architectures

Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Gu & Dao · 2023

The selective state space model that made an attention-free architecture competitive with Transformers.

Long context

Lost in the Middle: How Language Models Use Long Contexts
Liu et al. (TACL 2024) · 2023

Models use information best when it sits at the start or end of the context, and worst when it sits in the middle.

Reasoning

Training Large Language Models to Reason in a Continuous Latent Space (Coconut)
Hao et al. (FAIR / UCSD) · 2024

Reasons in latent space by feeding the last hidden state back as the next input embedding.
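
A toy sketch of that feedback loop, to make the mechanism concrete. Everything here is an illustrative assumption rather than the paper's code: `last_hidden_state` is a hypothetical stand-in for a Transformer forward pass (mean pooling replaces attention), and `weights` is an arbitrary matrix.

```python
import math

def last_hidden_state(sequence, weights):
    """Hypothetical stand-in for a forward pass over all input embeddings."""
    dim = len(sequence[0])
    # Mean-pool the sequence as a crude substitute for attention over positions.
    pooled = [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]
    # One linear layer plus tanh yields the "last hidden state".
    return [math.tanh(sum(w * p for w, p in zip(row, pooled))) for row in weights]

def latent_steps(prompt_embeddings, weights, num_steps):
    """Append num_steps continuous thoughts without ever decoding a token."""
    sequence = [list(vec) for vec in prompt_embeddings]
    for _ in range(num_steps):
        hidden = last_hidden_state(sequence, weights)
        # The hidden state is fed back directly as the next input embedding.
        sequence.append(hidden)
    return sequence
```

Calling `latent_steps([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], 3)` returns the prompt embedding followed by three latent vectors. A standard decoder would instead sample a token at each step and re-embed it.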

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI · 2025

Reasoning incentivized through RL with verifiable, rule-based rewards; the R1-Zero variant uses pure RL with no supervised fine-tuning beforehand.

Scaling

Training Compute-Optimal Large Language Models (Chinchilla)
Hoffmann et al. · 2022

For a fixed compute budget, scale parameters and training tokens in equal proportion; most large models of the time were undertrained.
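
A back-of-the-envelope sketch of what this means for sizing a model. It rests on two commonly quoted approximations, not on the paper's fitted scaling laws: training compute C ≈ 6·N·D FLOPs, and roughly 20 training tokens per parameter. The function name and constants are illustrative.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumption: rule of thumb D ≈ 20 * N

def compute_optimal(compute_flops):
    """Return (params, tokens) that spend compute_flops under both assumptions."""
    # Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N**2.
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens
```

With a budget of 5.88e23 FLOPs, this gives about 70B parameters and 1.4T tokens, which is Chinchilla's own configuration.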