SayCanPay

What problem does it solve

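Earlier LLM-based planning methods can produce infeasible plans, for example planning to take an item out of the fridge while the kitchen door is closed. The knowledge a pretrained LLM has acquired may also not match the current environment: the LLM knows that a switch can be pressed, but if the real switch turns out to be a knob, the plan may fail.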

How it works

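Overview:
(1) Say: at each step t, the LLM generates m candidate actions together with their probabilities. The paper uses beam search over tokens for this.
(2) Can: a trained domain-specific model scores the feasibility of these candidate actions.
(3) Pay: a trained domain-specific estimator weighs the candidate actions by their estimated payoff.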
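
To make the three roles concrete, here is a minimal sketch of ranking candidates for a single step. It assumes the Say, Can, and Pay scores are simply multiplied; these notes do not state the combination rule, so treat the product as an assumption. The `Candidate` class and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    action: str
    say: float  # probability the LLM assigns to generating this action
    can: float  # feasibility estimate in [0, 1]
    pay: float  # payoff estimate in [0, 1]


def combined_score(c: Candidate) -> float:
    # Assumption: the three scores are combined by multiplication.
    return c.say * c.can * c.pay


# Made-up scores for one planning step.
candidates = [
    Candidate("pick up purple box", say=0.5, can=0.0, pay=0.9),  # likely but infeasible
    Candidate("toggle yellow door", say=0.3, can=1.0, pay=0.6),
    Candidate("pick up green box", say=0.2, can=1.0, pay=0.0),   # feasible but useless
]

best = max(candidates, key=combined_score)
print(best.action)
```

Under a product, an action that is infeasible (Can near 0) or leads nowhere (Pay near 0) is suppressed even when the LLM finds it likely.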

How planned actions are selected

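  • Greedy action selection
    At each step, take the highest-scoring action.
  • Beam action selection
    At each step the LLM proposes m candidate actions; at each level the k highest-scoring actions are expanded, and each of them is expanded again in the same way until termination. The sequence with the highest final score is then selected.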
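
The two selection strategies can be sketched as follows. This is an illustration, not the paper's implementation: `propose` is a hypothetical callback that returns scored candidate actions for a partial plan, the sequence score is assumed to be the product of the step scores, and `toy_propose` is a made-up lookup table.

```python
from typing import Callable, List, Tuple

Plan = Tuple[str, ...]
Proposer = Callable[[Plan], List[Tuple[str, float]]]

DONE = "done"  # hypothetical terminating action


def greedy_plan(propose: Proposer, max_steps: int = 10) -> Plan:
    """Append the single highest-scoring action at every step."""
    plan: Plan = ()
    for _ in range(max_steps):
        action, _ = max(propose(plan), key=lambda pair: pair[1])
        plan += (action,)
        if action == DONE:
            break
    return plan


def beam_plan(propose: Proposer, k: int = 2, max_steps: int = 10) -> Plan:
    """Keep the k best partial plans per level and expand each of them."""
    beams: List[Tuple[Plan, float]] = [((), 1.0)]
    for _ in range(max_steps):
        expanded: List[Tuple[Plan, float]] = []
        for plan, score in beams:
            if plan and plan[-1] == DONE:
                expanded.append((plan, score))  # finished plans carry over
                continue
            for action, step_score in propose(plan):
                # Assumption: sequence score = product of step scores.
                expanded.append((plan + (action,), score * step_score))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:k]
        if all(plan and plan[-1] == DONE for plan, _ in beams):
            break
    return max(beams, key=lambda b: b[1])[0]


def toy_propose(plan: Plan) -> List[Tuple[str, float]]:
    # Made-up scores: "a" looks best at first but leads to a weak ending.
    table = {
        (): [("a", 0.6), ("b", 0.4)],
        ("a",): [(DONE, 0.3), ("c", 0.7)],
        ("a", "c"): [(DONE, 0.2)],
        ("b",): [(DONE, 0.9)],
    }
    return table.get(plan, [(DONE, 1.0)])


print(greedy_plan(toy_propose))
print(beam_plan(toy_propose, k=2))
```

On this toy table, greedy selection commits to "a" and ends with ("a", "c", "done"), while beam selection with k=2 keeps "b" alive and returns ("b", "done"), which has the higher overall score.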

can model
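Framed as a classification problem: actions whose preconditions are already satisfied are assigned the highest probability.
The training data has the following form, where ⟨NXT⟩ marks the action currently being evaluated.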
Input ⟨Goal⟩ pick up the purple box. ⟨Initial State⟩ Room 1 has yellow key, agent. Room 2 has purple box. The door connecting Room 1 and Room 2 is locked. ⟨Step 1⟩ pick up yellow key. ⟨NXT⟩ toggle yellow door.

Output 1.0 // feasible

Input ⟨Goal⟩ pick up the purple box. ⟨Initial State⟩ Room 1 has yellow key, agent. Room 2 has purple box. The door connecting Room 1 and Room 2 is locked. ⟨Step 1⟩ pick up yellow key. ⟨NXT⟩ pick up purple box.

Output 0.0 // infeasible
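
As a rough sketch, the two examples above could be assembled like this. The helper names `format_input` and `can_examples` are hypothetical; only the tag layout is taken from the examples.

```python
from typing import List, Sequence, Tuple


def format_input(goal: str, initial_state: str,
                 history: Sequence[str], candidate: str) -> str:
    """Serialize goal, initial state, past steps, and the candidate after <NXT>."""
    steps = " ".join(f"⟨Step {i}⟩ {a}." for i, a in enumerate(history, start=1))
    parts = [f"⟨Goal⟩ {goal}.", f"⟨Initial State⟩ {initial_state}",
             steps, f"⟨NXT⟩ {candidate}."]
    return " ".join(p for p in parts if p)  # drop the empty history at step 1


def can_examples(goal: str, initial_state: str, history: Sequence[str],
                 feasible: Sequence[str],
                 infeasible: Sequence[str]) -> List[Tuple[str, float]]:
    """Label feasible next actions 1.0 and infeasible ones 0.0."""
    pos = [(format_input(goal, initial_state, history, a), 1.0) for a in feasible]
    neg = [(format_input(goal, initial_state, history, a), 0.0) for a in infeasible]
    return pos + neg


state = ("Room 1 has yellow key, agent. Room 2 has purple box. "
         "The door connecting Room 1 and Room 2 is locked.")
examples = can_examples(
    goal="pick up the purple box",
    initial_state=state,
    history=["pick up yellow key"],
    feasible=["toggle yellow door"],
    infeasible=["pick up purple box"],
)
for text, label in examples:
    print(label, text)
```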

pay model
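Framed as a regression problem. The model is trained on positive and negative samples (negatives are plans for other goals, or actions with a reward of 0) and regresses a value between 0 and 1; the closer the value is to 1, the closer the action is to the end of the plan.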

Input ⟨Goal⟩ pick up the purple box. ⟨Initial State⟩ Room 1 has yellow key, agent. Room 2 has purple box. The door connecting Room 1 and Room 2 is locked. ⟨Step 1⟩ pick up yellow key. ⟨Step 2⟩ toggle yellow door. ⟨Step 3⟩ drop key in void. ⟨Step 4⟩ pick up blue box. ⟨NXT⟩ done picking up.

Output 1.0 // end of plan

Input ⟨Goal⟩ pick up the purple box. ⟨Initial State⟩ Room 1 has yellow key, agent. Room 2 has purple box. The door connecting Room 1 and Room 2 is locked. ⟨Step 1⟩ pick up yellow key. ⟨Step 2⟩ toggle yellow door. ⟨Step 3⟩ drop key in void. ⟨NXT⟩ pick up blue box.

Output 0.6 //δ · r

Input ⟨Goal⟩ pick up the purple box. ⟨Initial State⟩ Room 1 has yellow key, agent. Room 2 has purple box. The door connecting Room 1 and Room 2 is locked. ⟨Step 1⟩ pick up yellow key. ⟨Step 2⟩ toggle yellow door. ⟨Step 3⟩ drop key in void. ⟨NXT⟩ pick up green box.

Output 0 // very low payoff
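
One way to generate regression targets consistent with the three outputs above is sketched below. The examples only show 1.0 for the final action and δ · r = 0.6 for the one before it; extending this to δ^(T - t) · r for earlier steps, and reading δ = 0.6 with r = 1, are assumptions. `pay_targets` is a hypothetical helper.

```python
from typing import List


def pay_targets(plan_length: int, delta: float = 0.6, reward: float = 1.0) -> List[float]:
    """Target for step t (1-indexed) of a successful plan of length T.

    Assumption: targets decay geometrically with the distance to the end
    of the plan, i.e. delta ** (T - t) * reward.
    """
    return [delta ** (plan_length - t) * reward for t in range(1, plan_length + 1)]


# A 5-step plan whose last step is the terminating action.
targets = pay_targets(5)
print(targets)

# Negative samples (actions from plans for other goals, or zero-reward
# actions) would get a target of 0.0.
NEGATIVE_TARGET = 0.0
```

With these defaults the final step gets 1.0 and the step before it gets 0.6, matching the first two examples; the values for earlier steps follow only from the assumed decay.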