Mastering Neural Theorem Proving: A Step-by-Step Guide to DeepSeek-Prover-V2's Recursive Proof Search

Overview

DeepSeek-Prover-V2 represents a significant leap forward in automated mathematical reasoning. Built on the Lean 4 proof assistant, this open-source large language model (LLM) introduces a recursive theorem-proving pipeline that combines informal reasoning with rigorous formal verification. At its core lies a cold-start training method that synthesizes training data from scratch, followed by reinforcement learning that refines the model's ability to bridge human-like mathematical intuition and machine-checkable proofs. The model achieves state-of-the-art results on benchmarks such as MiniF2F-test (an 88.9% pass rate) and PutnamBench (49 of 658 problems solved). This guide walks you through the key innovations, the prerequisites for understanding the approach, and a step-by-step breakdown of the training pipeline.

Prerequisites

Before diving into the details of DeepSeek-Prover-V2, ensure you have a basic understanding of:

  - The Lean 4 proof assistant and what a formal, machine-checkable proof is
  - Large language models and chain-of-thought (CoT) prompting
  - Reinforcement learning basics, in particular reward signals and fine-tuning

Familiarity with the original DeepSeek-Prover (V1) is helpful but not required; this guide focuses on V2's recursive proof search.

Step-by-Step Instructions: The Training Pipeline of DeepSeek-Prover-V2

1. Cold-Start Data Generation via Recursive Decomposition

The process begins without any existing formal proof data for complex theorems. Instead, it uses a powerful base model (DeepSeek-V3) to generate high-quality synthetic data.

  1. Prompt DeepSeek-V3 with a complex mathematical theorem (e.g., a lemma from number theory). Instruct it to decompose the theorem into a sequence of simpler subgoals and formalize each step in Lean 4 syntax.
  2. Generate subgoals: DeepSeek-V3 outputs a list of intermediate lemmas that, if proven, entail the original theorem.
  3. Search each subgoal: A smaller 7B-parameter prover model attempts to prove each subgoal independently using standard tactics. This search is computationally light because subgoals are simpler.
  4. Assemble the proof: When all subgoals are proven, combine them with the original decomposition to form a complete formal proof, pairing the informal chain-of-thought (CoT) reasoning from DeepSeek-V3 with the formal steps (see the sketch after this list).
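
Conceptually, the loop looks like the minimal Python sketch below. Every helper name here (deepseek_v3_decompose, prover_7b_search, splice, lean_check) is a hypothetical stand-in for the paper's actual components, not an API from its codebase:

```python
# Minimal sketch of the cold-start loop (steps 1-4 above). Every helper
# name is a hypothetical stand-in, not an API from the paper's codebase.

def generate_cold_start_example(theorem: str):
    # Steps 1-2: DeepSeek-V3 writes an informal chain of thought and a
    # Lean 4 sketch whose subgoals are left as `sorry` placeholders.
    cot, sketch, subgoals = deepseek_v3_decompose(theorem)

    # Step 3: the lightweight 7B prover attacks each subgoal on its own.
    subproofs = [prover_7b_search(goal) for goal in subgoals]
    if not all(subproofs):
        return None  # some subgoal is unproven; no training example

    # Step 4: splice the subgoal proofs back into the sketch, verify the
    # assembled proof with Lean 4, and keep the (CoT, proof) pair.
    full_proof = splice(sketch, subproofs)
    return (cot, full_proof) if lean_check(full_proof) else None
```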

Example (conceptual): For the theorem "A implies B", DeepSeek-V3 might break it into "A implies C" and "C implies B", then formalize each step. The 7B model solves those subgoals, and the final training example pairs the CoT with the Lean code.
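
The following is a toy Lean 4 illustration of that pattern; the propositions A, B, C and the two-stage presentation are illustrative, not taken from the paper's data:

```lean
-- Toy illustration (not from the paper): proving "A implies B" by
-- decomposing it into "A implies C" and "C implies B".

-- Stage 1: DeepSeek-V3 emits a sketch whose subgoals end in `sorry`.
example (A B C : Prop) (hAC : A → C) (hCB : C → B) : A → B := by
  intro hA
  have hC : C := by sorry  -- subgoal 1: A implies C
  have hB : B := by sorry  -- subgoal 2: C implies B
  exact hB

-- Stage 2: the 7B prover discharges each subgoal; the completed tactic
-- proofs replace the `sorry` placeholders in the final training example.
example (A B C : Prop) (hAC : A → C) (hCB : C → B) : A → B := by
  intro hA
  have hC : C := hAC hA
  have hB : B := hCB hC
  exact hB
```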

2. Reinforcement Learning from Subgoal-Proven Data

After the cold-start phase, the team curates a subset of challenging problems that the 7B prover could not solve end-to-end but for which all subgoals were proven successfully.

  1. Construct complete proofs: By concatenating the formal proofs of each subgoal, a full proof for the original problem is obtained.
  2. Create unified training examples: Each example pairs the informal CoT (outlining the decomposition) with the formal proof steps.
  3. Fine-tune the main prover model (DeepSeek-Prover-V2) on this synthetic dataset using standard supervised learning.
  4. Apply reinforcement learning: Use a binary reward signal (proof correct or incorrect) to further optimize the model. The reward is derived from Lean 4's verification result.

This phase teaches the model to generate both the high-level plan and the low-level tactics in a unified manner.
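
To illustrate the binary reward concretely, here is a minimal Python sketch. It assumes a `lean` executable on the PATH that type-checks a single file and returns a nonzero exit code on errors; the actual pipeline's verification setup and policy-optimization algorithm are not reproduced here:

```python
import subprocess
import tempfile
from pathlib import Path

def verify_with_lean(proof_source: str, timeout_s: int = 300) -> bool:
    """Return True iff Lean 4 accepts the proof source file.

    Assumes a `lean` binary on the PATH; a real setup would also need
    the project's dependencies (e.g. Mathlib) visible to that binary.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "Candidate.lean"
        src.write_text(proof_source)
        try:
            result = subprocess.run(
                ["lean", str(src)],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # treat timeouts as failed proofs
    # `sorry` only triggers a warning, not an error, so the exit code
    # alone is not enough; this string check is a simplification.
    return result.returncode == 0 and "sorry" not in proof_source

def reward(proof_source: str) -> float:
    """Binary RL reward: 1.0 for a machine-checked proof, else 0.0."""
    return 1.0 if verify_with_lean(proof_source) else 0.0
```

Because the signal comes directly from the proof checker, the model cannot be rewarded for plausible-looking but invalid proofs; the trade-off is that the reward is sparse until the policy starts producing verifiable output.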

3. The Resulting Model and Benchmarking

The final DeepSeek-Prover-V2-671B (671 billion parameters) is evaluated on:

  - MiniF2F-test, where it achieves an 88.9% pass rate
  - PutnamBench, where it solves 49 of the 658 problems

The model's proofs on MiniF2F are publicly available, allowing the community to verify and build upon them.

Common Mistakes and How to Avoid Them

Summary

DeepSeek-Prover-V2 introduces a recursive proof search framework that leverages a powerful LLM to decompose theorems, a smaller model to solve subgoals, and reinforcement learning to unify informal and formal reasoning. By understanding the cold-start data generation and RL fine-tuning steps, researchers can replicate or adapt this approach to advance automated theorem proving. Key takeaways: use DeepSeek-V3 for decomposition, the 7B prover for subgoal search, and binary reward signals for refinement. The model's state-of-the-art results on MiniF2F and PutnamBench demonstrate its effectiveness.
