RL-05-01-Structure-Transition Tuple

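A transition is the atomic unit of both Replay and Rollout storage. This series uses the same field names throughout so that Buffer and Agent code interface without confusion.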

1. The minimal five-tuple

$$
(s_t,\, a_t,\, r_{t+1},\, s_{t+1},\, d_{t+1})
$$

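| Field | Type | Description |
| --- | --- | --- |
| `state` / `obs` | ndarray / int | Current observation |
| `action` | int / ndarray | Action taken |
| `reward` | float | Immediate reward |
| `next_state` / `next_obs` | same as `state` | Next observation |
| `done` | bool / float32 | Episode ended (terminated and truncated merged into one flag) |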

from typing import NamedTuple
import numpy as np

class Transition(NamedTuple):
    obs: np.ndarray
    action: int
    reward: float
    next_obs: np.ndarray
    done: float  # 0.0 or 1.0 for torch
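
Individual transitions usually have to be stacked into batch arrays before they reach the network. The sketch below shows one way to do that with the `Transition` class above; the `collate` helper and the 4-dimensional observations are hypothetical names and sizes chosen for illustration, not part of the series' code.

```python
from typing import List, NamedTuple

import numpy as np


class Transition(NamedTuple):
    obs: np.ndarray
    action: int
    reward: float
    next_obs: np.ndarray
    done: float  # 0.0 or 1.0


def collate(batch: List[Transition]) -> Transition:
    """Stack a list of transitions field by field into batch arrays.

    The returned Transition holds arrays in every field, so the scalar
    type hints on the class are only nominal here (assumption of this sketch).
    """
    obs, action, reward, next_obs, done = zip(*batch)
    return Transition(
        obs=np.stack(obs).astype(np.float32),
        action=np.asarray(action, dtype=np.int64),
        reward=np.asarray(reward, dtype=np.float32),
        next_obs=np.stack(next_obs).astype(np.float32),
        done=np.asarray(done, dtype=np.float32),
    )


# Two hypothetical transitions with a 4-dimensional observation.
batch = [
    Transition(np.zeros(4), 1, 0.5, np.ones(4), 0.0),
    Transition(np.ones(4), 0, 1.0, np.zeros(4), 1.0),
]
stacked = collate(batch)
print(stacked.obs.shape, stacked.action, stacked.done)
```

Keeping `done` as float32 means it can be multiplied into the target directly, without a cast.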

2. done in bootstrapping

target = reward + gamma * (1.0 - done) * q_next

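done = 1: the episode terminated, so do not bootstrap from $Q(s')$.
truncated: the episode was cut short but the task could have continued, so most DQN implementations still bootstrap. This only works if the flag used in the target reflects termination alone; with a merged done flag, truncation also suppresses the bootstrap term.

The tuple can therefore be extended to store terminated and truncated separately.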
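
As a minimal sketch of the separated variant, the target below masks the bootstrap term with `terminated` only, while `truncated` is kept purely for episode bookkeeping. The function name `td_target` and the sample numbers are assumptions made for illustration.

```python
import numpy as np


def td_target(reward, terminated, q_next, gamma=0.99):
    """One-step TD target; the bootstrap term is dropped only on true termination."""
    reward = np.asarray(reward, dtype=np.float32)
    terminated = np.asarray(terminated, dtype=np.float32)
    q_next = np.asarray(q_next, dtype=np.float32)
    return reward + gamma * (1.0 - terminated) * q_next


# Three hypothetical transitions: ongoing, terminated, truncated.
reward = [1.0, 1.0, 1.0]
terminated = np.array([0.0, 1.0, 0.0], dtype=np.float32)
truncated = np.array([0.0, 0.0, 1.0], dtype=np.float32)
q_next = [10.0, 10.0, 10.0]

targets = td_target(reward, terminated, q_next, gamma=0.9)

# The environment still has to be reset when either flag is set.
episode_over = np.maximum(terminated, truncated)
print(targets, episode_over)
```

Only the second transition loses its bootstrap term; the truncated one is treated like an ongoing step in the target but still triggers a reset.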

3. PPO / Actor-Critic extensions

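| Field | Purpose |
| --- | --- |
| `log_prob` | Log-probability of the action under the sampling policy, used in the probability ratio $r_t(\theta)$ |
| `value` | Critic estimate $V(s_t)$ |
| `advantage` | Written in after GAE is computed |
| `return` | $A_t + V(s_t)$ |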
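
The `advantage` and `return` fields are filled in after the rollout is collected. The following is a sketch of the standard GAE backward pass under the assumption that `dones[t]` marks an episode ending after step `t` and that `last_value` is the critic's estimate for the observation following the final step; the function name `compute_gae` and its defaults are illustrative.

```python
import numpy as np


def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout of length T.

    dones[t] = 1.0 blocks both the bootstrap from the next value and the
    propagation of the accumulated advantage across the episode boundary.
    """
    rewards = np.asarray(rewards, dtype=np.float32)
    values = np.asarray(values, dtype=np.float32)
    dones = np.asarray(dones, dtype=np.float32)
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    next_value = last_value
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # TD residual for step t
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values  # matches the `return` row: A_t + V(s_t)
    return advantages, returns


# Tiny hypothetical rollout: two steps, episode ends after the second.
adv, ret = compute_gae(
    rewards=[1.0, 1.0], values=[0.0, 0.0], dones=[0.0, 1.0],
    last_value=5.0, gamma=1.0, lam=1.0,
)
print(adv, ret)
```

With `gamma = lam = 1` and zero value estimates, the advantages reduce to the undiscounted reward-to-go, which makes the example easy to check by hand.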

4. Batch storage layout

# Preallocate the ring buffer; the observation array has shape (capacity, *obs_shape)
cap, obs_dim = 100_000, 4  # example sizes (assumed)
obs_buf = np.zeros((cap, obs_dim), dtype=np.float32)
act_buf = np.zeros(cap, dtype=np.int64)

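To sample, draw `idx = np.random.randint(0, size, batch_size)` and index every buffer with it in one vectorized step. Here `size` is the number of transitions currently stored, not the capacity.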
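
Putting the preallocated arrays and the vectorized sampling together, a minimal ring buffer might look like the sketch below. The class name `ReplayBuffer`, its attribute names, and the use of `np.random.default_rng` instead of `np.random.randint` are choices made for this example; like `randint`, `rng.integers` samples with replacement.

```python
import numpy as np


class ReplayBuffer:
    """Fixed-capacity ring buffer; the oldest transitions are overwritten first."""

    def __init__(self, capacity, obs_dim, seed=0):
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.act = np.zeros(capacity, dtype=np.int64)
        self.rew = np.zeros(capacity, dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.done = np.zeros(capacity, dtype=np.float32)
        self.capacity = capacity
        self.pos = 0   # next write index
        self.size = 0  # number of valid entries
        self.rng = np.random.default_rng(seed)

    def add(self, obs, action, reward, next_obs, done):
        i = self.pos
        self.obs[i], self.act[i], self.rew[i] = obs, action, reward
        self.next_obs[i], self.done[i] = next_obs, done
        self.pos = (self.pos + 1) % self.capacity  # wrap around when full
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        # Uniform indices over the filled part only, applied to all arrays at once.
        idx = self.rng.integers(0, self.size, size=batch_size)
        return (self.obs[idx], self.act[idx], self.rew[idx],
                self.next_obs[idx], self.done[idx])


# Five writes into a capacity-3 buffer: the first two entries get overwritten.
buf = ReplayBuffer(capacity=3, obs_dim=2)
for step in range(5):
    buf.add(np.full(2, step), step % 2, float(step), np.full(2, step + 1), 0.0)
batch = buf.sample(4)
```

Sampling from `[0, size)` rather than `[0, capacity)` avoids returning the zero-initialized rows before the buffer has filled.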

5. Serialization

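import pickle

# Pickle a single transition; "transition.pkl" is a placeholder file name
with open("transition.pkl", "wb") as f:
    pickle.dump(transition, f)
# For array buffers, np.savez_compressed is an alternative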

A checkpoint does not need to store every transition; the Replay buffer can be saved separately through its own save_buffer step.
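
One possible shape for such a step, assuming the array layout from section 4: only the filled part of each array is written with `np.savez_compressed`. The signatures of `save_buffer` and `load_buffer` below are guesses for illustration, and only two arrays are shown to keep the sketch short.

```python
import os
import tempfile

import numpy as np


def save_buffer(path, obs_buf, act_buf, size):
    """Write only the filled part of each array to a compressed .npz file."""
    np.savez_compressed(path, obs=obs_buf[:size], act=act_buf[:size])


def load_buffer(path):
    """Read the arrays back; returns (obs, act)."""
    with np.load(path) as data:
        return data["obs"], data["act"]


# Hypothetical buffer with capacity 4, of which 3 slots are filled.
obs_buf = np.arange(8, dtype=np.float32).reshape(4, 2)
act_buf = np.array([0, 1, 0, 1], dtype=np.int64)

path = os.path.join(tempfile.mkdtemp(), "replay.npz")
save_buffer(path, obs_buf, act_buf, size=3)
obs, act = load_buffer(path)
print(obs.shape, act)
```

For a ring buffer that has already wrapped around, the write position would need to be saved as well so that overwriting resumes at the right slot.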

6. Summary

Use (obs, action, reward, next_obs, done) as the single transition format everywhere.
Next: Q-Table