Bar Chart · 柱状图#gradient#value-labels#serif

Average Scores Across Warmup Steps

不同预热步骤的平均分

This bar chart displays average scores across different warmup steps, with a trend line indicating score progression. The bars are colored using a gradient, and value labels are placed above each bar.

该柱状图显示了不同预热步骤的平均分，趋势线指示分数变化。柱子使用渐变色填充，数值标签位于每个柱子上方。

@paper · 来自论文

From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang, Shizhu He, Jun Zhao, Kang Liu · arXiv:cs.LG 2026

While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.

arXiv:2604.14142 ↗PDF ↗

// original from paper · 论文原图

// reproduced via average_scores_across_warmup_steps.py · 脚本复现download png

average_scores_across_warmup_steps.py

download .py

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LinearSegmentedColormap, to_hex

# ── Data ──────────────────────────────────────────────────────────────────────
warmup_steps = [5, 10, 15, 20, 25, 30]
avg_scores   = [55.69, 57.74, 57.11, 57.54, 57.27, 56.32]

# ── Style ─────────────────────────────────────────────────────────────────────
COL_START = "#9AC4F5"      # 柱子浅蓝
COL_END   = "#428AE5"      # 柱子深蓝
COL_LINE  = "#2E6FC7"      # 折线更深
GRID_COL  = "#d4d0e4"

plt.rcParams.update({
    "font.family":      "serif",
    "font.serif":       ["Times New Roman", "Times", "DejaVu Serif"],
    "font.size":         9,
    "axes.labelsize":    9,
    "xtick.labelsize":   8,
    "ytick.labelsize":   8,
})

# ── Figure ────────────────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(3.6, 2.3), facecolor="white")

x = np.arange(len(warmup_steps))
bar_width = 0.52

# Non-linear color mapping: compress high scores together, stretch low scores apart.
score_min, score_max = min(avg_scores), max(avg_scores)
if score_max > score_min:
    score_norm = [((s - score_min) / (score_max - score_min)) ** 0.4 for s in avg_scores]
else:
    score_norm = [0.5] * len(avg_scores)

bar_cmap = LinearSegmentedColormap.from_list("bar_grad", [COL_START, COL_END])
colors = [to_hex(bar_cmap(v)) for v in score_norm]

# Bars
bars = ax.bar(x, avg_scores, width=bar_width, color=colors,
              edgecolor="#3A78CC", linewidth=0.5, zorder=3)

# Trend line on top
ax.plot(x, avg_scores, color=COL_LINE, linewidth=0.9,
        marker="o", markersize=3, markerfacecolor="white",
        markeredgecolor=COL_LINE, markeredgewidth=0.8, zorder=4)

# Value labels (above the trend line)
for bar, val in zip(bars, avg_scores):
    ax.text(bar.get_x() + bar.get_width() / 2, val + 0.35,
            f"{val:.2f}", ha="center", va="bottom", fontsize=7,
            color="#2e2b40", zorder=5)

# GRPO baseline
ax.axhline(y=55.79, color="#D4686A", linewidth=0.8, linestyle=(0, (5, 3)), zorder=102, alpha=0.7)
ax.text(x[0] - 0.48, 55.79 + 0.05, "GRPO 55.79", va="bottom", ha="left",
        fontsize=6, color="#C53E43", zorder=10)

# Grid
ax.set_axisbelow(True)
ax.yaxis.grid(True, color=GRID_COL, linewidth=0.5, linestyle=(0, (3, 4)))

# Axes
ax.set_xticks(x)
ax.set_xticklabels([str(s) for s in warmup_steps])
ax.set_xlabel("Warmup Steps", labelpad=3, color="#2e2b40")
ax.set_ylabel("Avg Score", labelpad=3, color="#2e2b40")
ax.set_ylim(54.2, 58.2)
ax.set_yticks([55, 56, 57, 58])

# Spines
for spine in ax.spines.values():
    spine.set_linewidth(0.6)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

ax.tick_params(axis="both", length=2, color="#9490a8",
               labelcolor="#2e2b40", pad=2)

plt.tight_layout(pad=0.4)
plt.savefig("./fig/warmup_ablation.pdf", dpi=300, bbox_inches="tight", facecolor="white")
plt.savefig("./fig/warmup_ablation.png", dpi=300, bbox_inches="tight", facecolor="white")
print("Saved warmup_ablation.pdf / .png")
plt.show()