SayCanPay
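What problem does it solve
Earlier LLM-based planning methods can produce infeasible plans, for example planning to take an item out of the refrigerator while the kitchen door is still closed. The knowledge a pretrained LLM has acquired can also disagree with the current environment: the LLM knows that a switch can be pressed, but if the real switch is a rotary knob, the plan may fail.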
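How it works
Overview:
(1) Say: at each step t, the LLM generates m candidate actions together with their probabilities. The paper uses beam search over tokens for this.
(2) Can: a trained domain-specific model scores the feasibility of these candidate actions.
(3) Pay: a trained domain-specific estimator weighs the candidate actions by their estimated payoff.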
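To make the three roles concrete, here is a minimal sketch of scoring one step's candidates. It assumes the Say, Can and Pay scores are combined by multiplication, which these notes do not state. The `Candidate` class, the function names and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    action: str
    p_say: float  # probability the LLM assigns to the action (Say)
    p_can: float  # feasibility score from the Can model
    p_pay: float  # payoff estimate from the Pay model


def combined_score(c: Candidate) -> float:
    # Assumption: the three scores are multiplied together.
    return c.p_say * c.p_can * c.p_pay


def rank_candidates(candidates: list[Candidate]) -> list[Candidate]:
    """Return candidates sorted from highest to lowest combined score."""
    return sorted(candidates, key=combined_score, reverse=True)


# Illustrative numbers only: the door is still locked, so the action the
# LLM likes best is infeasible and its Can score removes it from contention.
candidates = [
    Candidate("pick up purple box", p_say=0.6, p_can=0.0, p_pay=0.9),
    Candidate("pick up yellow key", p_say=0.3, p_can=1.0, p_pay=0.5),
    Candidate("pick up green box", p_say=0.1, p_can=1.0, p_pay=0.0),
]
best = rank_candidates(candidates)[0]
print(best.action)
```

With a product, a zero from either Can or Pay vetoes an action no matter how likely the LLM considers it.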
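How planned actions are selected
- Greedy action selection
At each step, take the highest-scoring action.
- Beam action selection
At each step the LLM proposes m candidate actions, and the k highest-scoring ones are kept at each level. Each kept action is expanded in the same way, and this repeats until termination, after which the sequence with the highest final score is selected.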
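The two selection strategies can be sketched as one routine, since greedy selection is the special case k = 1. This is a sketch under stated assumptions, not the paper's implementation: the `propose` callback, the `DONE` marker and the toy score table are all hypothetical, and a sequence score is assumed to be the product of its per-step scores.

```python
from typing import Callable, List, Tuple

Plan = Tuple[str, ...]
# Hypothetical interface: given the plan so far, return m (action, score) pairs.
Proposer = Callable[[Plan], List[Tuple[str, float]]]

DONE = "done"  # hypothetical termination action


def beam_action_selection(propose: Proposer, k: int, max_steps: int) -> Tuple[Plan, float]:
    """Keep the k best partial plans per level; return the best finished plan."""
    beams: List[Tuple[Plan, float]] = [((), 1.0)]
    finished: List[Tuple[Plan, float]] = []
    for _ in range(max_steps):
        # Expand every surviving plan by each of its candidate actions.
        expanded = [
            (plan + (action,), score * s)  # assumption: scores multiply
            for plan, score in beams
            for action, s in propose(plan)
        ]
        if not expanded:
            break
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for plan, score in expanded[:k]:
            (finished if plan[-1] == DONE else beams).append((plan, score))
        if not beams:
            break
    # Fall back to unfinished plans if nothing terminated within max_steps.
    return max(finished or beams, key=lambda item: item[1])


def toy_propose(plan: Plan) -> List[Tuple[str, float]]:
    # Made-up scores: the locally best first action leads to a poor plan.
    table = {
        (): [("pick up yellow key", 0.4), ("go to door", 0.6)],
        ("go to door",): [(DONE, 0.1)],
        ("pick up yellow key",): [("toggle yellow door", 0.9)],
        ("pick up yellow key", "toggle yellow door"): [(DONE, 0.9)],
    }
    return table.get(plan, [])


greedy_plan, greedy_score = beam_action_selection(toy_propose, k=1, max_steps=5)
beam_plan, beam_score = beam_action_selection(toy_propose, k=2, max_steps=5)
print(greedy_plan)
print(beam_plan)
```

In this toy table the greedy run commits to the locally best first action and ends with a low-scoring plan, while k = 2 keeps the alternative alive long enough to find the better sequence.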
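Can model
Formulated as a classification problem: actions whose preconditions are already satisfied are assigned the highest probability.
The training samples have the following form: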
Input ⟨Goal⟩ pick up the purple box. ⟨Initial State⟩ Room 1 has yellow key, agent. Room 2 has purple box. The door connecting Room 1 and Room 2 is locked. ⟨Step 1⟩ pick up yellow key. ⟨NXT⟩ toggle yellow door.
Output 1.0 // feasible
Input ⟨Goal⟩ pick up the purple box. ⟨Initial State⟩ Room 1 has yellow key, agent. Room 2 has purple box. The door connecting Room 1 and Room 2 is locked. ⟨Step 1⟩ pick up yellow key. ⟨NXT⟩ pick up purple box.
Output 0.0 // infeasible
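As an illustration of how such samples could be assembled, the sketch below builds the tagged input string and pairs it with a feasibility label. The function names are hypothetical, and the way the tags are joined is inferred from the two samples above rather than taken from the paper's code.

```python
from typing import List, Tuple


def format_input(goal: str, initial_state: str, steps: List[str], next_action: str) -> str:
    """Join goal, initial state, executed steps and the candidate next action."""
    parts = [f"⟨Goal⟩ {goal}", f"⟨Initial State⟩ {initial_state}"]
    parts += [f"⟨Step {i}⟩ {action}" for i, action in enumerate(steps, start=1)]
    parts.append(f"⟨NXT⟩ {next_action}")
    return " ".join(parts)


def can_sample(goal: str, initial_state: str, steps: List[str],
               next_action: str, feasible: bool) -> Tuple[str, float]:
    # Label 1.0 when the preconditions of next_action hold, else 0.0.
    return format_input(goal, initial_state, steps, next_action), float(feasible)


goal = "pick up the purple box."
state = ("Room 1 has yellow key, agent. Room 2 has purple box. "
         "The door connecting Room 1 and Room 2 is locked.")
history = ["pick up yellow key."]

positive = can_sample(goal, state, history, "toggle yellow door.", feasible=True)
negative = can_sample(goal, state, history, "pick up purple box.", feasible=False)
print(positive)
print(negative)
```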
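Pay model
Formulated as a regression problem. The model is trained on positive and negative samples (negatives are plans for other goals or actions with zero reward) to regress a value between 0 and 1; the closer the value is to 1, the closer the action is to the end of the plan.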
Input ⟨Goal⟩ pick up the purple box. ⟨Initial State⟩ Room 1 has yellow key, agent. Room 2 has purple box. The door connecting Room 1 and Room 2 is locked. ⟨Step 1⟩ pick up yellow key. ⟨Step 2⟩ toggle yellow door. ⟨Step 3⟩ drop key in void. ⟨Step 4⟩ pick up blue box. ⟨NXT⟩ done picking up.
Output 1.0 // end of plan
Input ⟨Goal⟩ pick up the purple box. ⟨Initial State⟩ Room 1 has yellow key, agent. Room 2 has purple box. The door connecting Room 1 and Room 2 is locked. ⟨Step 1⟩ pick up yellow key. ⟨Step 2⟩ toggle yellow door. ⟨Step 3⟩ drop key in void. ⟨NXT⟩ pick up blue box.
Output 0.6 //δ · r
Input ⟨Goal⟩ pick up the purple box. ⟨Initial State⟩ Room 1 has yellow key, agent. Room 2 has purple box. The door connecting Room 1 and Room 2 is locked. ⟨Step 1⟩ pick up yellow key. ⟨Step 2⟩ toggle yellow door. ⟨Step 3⟩ drop key in void. ⟨NXT⟩ pick up green box.
Output 0 // very low payoff
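The three samples above suggest how regression targets could be derived from a successful plan. The sketch below is one reading of them and not the paper's procedure: it assumes δ = 0.6 and r = 1 (so that δ · r matches the 0.6 shown above), and it extends the pattern to earlier steps as δ raised to the number of remaining actions, which the samples do not confirm. `pay_targets` and `negative_sample` are hypothetical names.

```python
from typing import List, Tuple

Sample = Tuple[List[str], str, float]  # (history, next action, target)


def pay_targets(plan: List[str], delta: float = 0.6, reward: float = 1.0) -> List[Sample]:
    """Label each action of a successful plan with a discounted payoff."""
    last = len(plan) - 1
    samples = []
    for t, action in enumerate(plan):
        # Assumption: the target decays geometrically with distance from the end.
        target = (delta ** (last - t)) * reward
        samples.append((plan[:t], action, target))
    return samples


def negative_sample(history: List[str], off_goal_action: str) -> Sample:
    # Actions that belong to another goal or earn no reward get payoff 0.
    return (list(history), off_goal_action, 0.0)


plan = [
    "pick up yellow key.",
    "toggle yellow door.",
    "drop key in void.",
    "pick up blue box.",
    "done picking up.",
]
samples = pay_targets(plan)
neg = negative_sample(plan[:3], "pick up green box.")
for history, action, target in samples + [neg]:
    print(len(history), action, round(target, 3))
```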