Question:
What’s the most powerful AI model you can train on a MacBook Pro in just five minutes?

Short answer:
The best I managed was a ~1.8M-parameter GPT-style transformer, trained on ~20M TinyStories tokens. It reached a perplexity of ~9.6 on a held-out split.

Example output (generated from a short prompt):

Once upon a time, there was a little boy named Tim. Tim had a small box that he liked to play with. He would push the box to open. One day, he found a big red ball in his yard. Tim was so happy. He picked it up and showed it to his friend, Jane. “Look at my bag! I need it!” she said. They played with the ball all day and had a great time.

Not exactly Shakespeare, but not bad for five minutes.


The Challenge

This was mostly a fun, curiosity-driven experiment — and maybe a little silly — for two reasons:

  1. If you can afford a MacBook Pro, you could just rent 30 minutes on an H100 GPU and train something vastly stronger.
  2. If you’re stuck with a laptop, there’s no real reason to limit training to five minutes.

That said, constraints breed creativity. The goal: train the best possible language model in just five minutes of compute time.


Key Limitation: Tokens per Second

Five minutes isn’t long enough to push many tokens through a model, so:

  • Large models are out — they’re too slow per token.
  • Tiny models train quickly, but can’t learn much.

It’s a balancing act: better to train a 1M-parameter model on millions of tokens than a billion-parameter model on a few thousand.


Performance Optimization

Initial transformer training on Apple’s MPS backend hit ~3,000 tokens/sec.
Surprisingly:

  • torch.compile, float16, and other math tweaks didn’t help.
  • Gradient accumulation made things slower (launch overhead was the real bottleneck).
  • Switching from PyTorch to MLX gave no meaningful boost.

Best practices for this scale:

  • Use MPS
  • Skip compilation/quantization
  • Avoid gradient accumulation
  • Keep the model small
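
In practice that boils down to a very plain PyTorch loop. Here is a minimal sketch of that setup; the model class and data iterator are hypothetical stand-ins, since the post doesn't include its training script:

    import torch

    # Use the Apple-silicon GPU via the MPS backend, falling back to CPU.
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

    model = TinyGPT().to(device)                  # hypothetical ~2M-parameter model
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3)

    # Plain fp32 loop: no torch.compile, no mixed precision, no gradient accumulation.
    for inputs, targets in batches:               # `batches` is a hypothetical iterator
        inputs, targets = inputs.to(device), targets.to(device)
        logits = model(inputs)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1)
        )
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()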

Choosing the Right Dataset

With only ~10M tokens (~50MB of text) to work with, dataset choice matters a lot.

  • Simple English Wikipedia was an okay start, but output was fact-heavy and noun-obsessed.

  • TinyStories — synthetic, short, 4-year-old-level stories — worked far better:

    • Coherent narratives
    • Cause-and-effect logic
    • Minimal proper nouns
    • Simple grammar

Perfect for small language models.
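
For anyone reproducing the setup, TinyStories is published on the Hugging Face Hub. The post doesn't say how the data was loaded; a straightforward option looks like this (dataset ID and "text" field as they appear on the Hub):

    from datasets import load_dataset

    # TinyStories: short synthetic stories written at roughly a 4-year-old's level.
    stories = load_dataset("roneneldan/TinyStories", split="train")

    # Keep roughly the first 50MB of raw text to match the five-minute token budget.
    texts, total_chars = [], 0
    for example in stories:
        texts.append(example["text"])
        total_chars += len(example["text"])
        if total_chars > 50_000_000:
            break
    corpus = "\n".join(texts)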


Tokenization

Tokenizer training wasn’t counted in the five-minute budget. At this scale:

  • Tokenization overhead is negligible.
  • Multi-byte tokens are easier for small models to learn than raw characters.
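
Concretely, a small byte-pair-encoding tokenizer can be trained up front with the Hugging Face tokenizers library. This is an illustrative sketch rather than the post's exact setup: the vocabulary size and the tinystories.txt path are assumptions.

    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    # Byte-level BPE with a small vocabulary; 4096 is an illustrative size
    # for a model in the ~2M-parameter range.
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
    trainer = trainers.BpeTrainer(vocab_size=4096, special_tokens=["[UNK]", "[EOS]"])

    # tinystories.txt is a placeholder for the raw training text dumped to disk.
    tokenizer.train(files=["tinystories.txt"], trainer=trainer)

    ids = tokenizer.encode("Once upon a time, there was a little boy named Tim.").ids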

Architecture Experiments

Transformers

  • GPT-2 style was the default choice.
  • SwiGLU activation gave a boost.
  • 2–3 layers worked best.
  • Learning rate: 0.001–0.002 was optimal for fast convergence.
  • Positional embeddings outperformed RoPE.
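
For reference, the SwiGLU feed-forward block that replaced the usual GELU MLP looks roughly like this; the dimensions are illustrative, not the post's exact configuration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SwiGLU(nn.Module):
        """Feed-forward block: silu(x @ W_gate) * (x @ W_up), projected back down."""
        def __init__(self, d_model: int, d_hidden: int):
            super().__init__()
            self.gate = nn.Linear(d_model, d_hidden, bias=False)
            self.up = nn.Linear(d_model, d_hidden, bias=False)
            self.down = nn.Linear(d_hidden, d_model, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.down(F.silu(self.gate(x)) * self.up(x))

    # Example block sized for a tiny 2-3 layer model.
    ffn = SwiGLU(d_model=128, d_hidden=384)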

LSTMs

  • Similar structure, but slightly worse perplexity than transformers.
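
A comparable LSTM baseline is just an embedding, a stacked LSTM, and a vocabulary projection trained on the same next-token objective; the sizes below are illustrative:

    import torch
    import torch.nn as nn

    class TinyLSTMLM(nn.Module):
        def __init__(self, vocab_size: int = 4096, d_model: int = 128, num_layers: int = 2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.lstm = nn.LSTM(d_model, d_model, num_layers=num_layers, batch_first=True)
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            hidden, _ = self.lstm(self.embed(token_ids))
            return self.head(hidden)  # next-token logits at each position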

Diffusion Models

  • Tried D3PM language diffusion — results were unusable, producing random tokens.
  • Transformers & LSTMs reached grammatical output within a minute; diffusion didn’t.

Finding the Sweet Spot in Model Size

Experimenting with sizes revealed:

  • ~2M parameters was the upper practical limit.
  • Any bigger: too slow to converge in five minutes.
  • Any smaller: plateaued too early.


This lined up with the Chinchilla scaling laws, which relate compute-optimal model size to the number of training tokens.
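
As a rough back-of-envelope check (using the commonly cited heuristic of ~20 training tokens per parameter, not the post's own numbers):

    params = 1.8e6                      # final model size
    tokens_chinchilla = 20 * params     # ~20 tokens per parameter heuristic
    print(f"{tokens_chinchilla / 1e6:.0f}M tokens")  # ~36M, the same order of magnitude
                                                     # as what fits through in five minutes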


Final Thoughts

This experiment won't change the future of AI training; most of the interesting behavior in language models only shows up well past the five-minute mark. But it was:

  • A great way to explore tiny-model training dynamics
  • A fun test of laptop GPU capabilities
  • Proof that you can get a coherent storytelling model in five minutes

With better architectures and faster consumer GPUs, we might eventually see surprisingly capable models trained in minutes — right from a laptop.
