GPT4 With Reflexion Has a Superior Coding Score

A slightly improved Reflexion-based GPT-4 agent achieves a state-of-the-art pass@1 result of 88% on HumanEval, outperforming both base GPT-4 (67.0%) and CodeT: Code Generation with Generated Tests (65.8%), the previous state-of-the-art results.
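For context, pass@1 is the standard HumanEval metric: the probability that a single generated sample passes all of a problem's unit tests. Below is a minimal sketch of the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021); the function name is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    n = samples generated per problem, c = samples that pass, k = budget."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the fraction of single samples that are correct:
print(pass_at_k(n=100, c=88, k=1))  # 0.88
```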

Relaxing Success Evaluation
By using Reflexion to iteratively refine the current implementation, the researchers shift the “accuracy bottleneck” from syntactically and semantically correct code generation to syntactically and semantically correct test generation. In theory, test generation should be much easier than code generation. Following this assumption, they hypothesize that if an agent can design diverse and accurate tests, it can use those internal tests to iteratively refine its implementation, so that the agent’s accuracy is effectively redefined as its ability to generate accurate tests (a sketch of this loop follows below).
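As an illustration, here is a minimal sketch of that loop, assuming hypothetical `llm`, `generate_tests`, and `run_tests` helpers; this is not the authors' actual code.

```python
def reflexion_coding_loop(llm, problem: str, max_trials: int = 5) -> str:
    """Refine an implementation against self-generated tests (illustrative)."""
    tests = generate_tests(llm, problem)   # agent writes its own unit tests
    code = llm(f"Implement:\n{problem}")   # first attempt
    for _ in range(max_trials):
        failures = run_tests(code, tests)  # execute tests in a sandbox
        if not failures:
            return code                    # passes its own internal test suite
        reflection = llm(
            f"Implementation:\n{code}\nFailed tests:\n{failures}\n"
            "Reflect on what went wrong and how to fix it."
        )
        code = llm(f"Implement:\n{problem}\nPrior reflection:\n{reflection}")
    return code  # best effort after max_trials refinements
```

Note that success here is judged against the agent's own tests, which is exactly why test quality becomes the new bottleneck.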

Test generation

The method for test generation was inspired by CodeT: Code Generation with Generated Tests, available at https://github.com/microsoft/CodeT; a rough sketch follows.
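A hedged sketch of what CodeT-style test generation might look like: sample candidate assertions from the model, then keep only the syntactically valid, deduplicated ones. The `llm` callable is a hypothetical stand-in, not CodeT's actual API.

```python
import ast

def generate_tests(llm, problem: str, n_samples: int = 20) -> list[str]:
    """Sample assert-based unit tests from the model and filter them."""
    prompt = f"{problem}\n# Write assert-based unit tests for the function above:\n"
    tests = set()
    for _ in range(n_samples):
        for line in llm(prompt).splitlines():
            line = line.strip()
            if line.startswith("assert"):
                try:
                    ast.parse(line)   # drop malformed assertions
                    tests.add(line)
                except SyntaxError:
                    pass
    return sorted(tests)              # deduplicated, deterministic order
```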

Nanothoughts describes the application of Reflexion to GPT-4 in a Substack article.

There is also a 17-page research paper.

Hallucination vs. Inefficient Planning
They explored the reasons for failure in AlfWorld runs with and without Reflexion; hallucination was the most common cause. In the AlfWorld benchmark, they defined hallucination as two or more consecutive identical actions to which the environment responded with the same observation, and inefficient planning as a trajectory in which the agent executed more than 30 actions without reaching a successful state. Inefficient planning was counted as a failure mode to encourage the agent to solve tasks through strong, concise decision-making rather than by attempting every admissible action. An example of hallucination is shown in Fig. 4, in which the agent believes it has found the desklamp on desk 2 and therefore executes the same action twice, convinced that the next best action is to use the desklamp.
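As a rough illustration, those two definitions could be checked over a logged trajectory as follows; the function name and (action, observation) layout are assumptions, not the authors' code.

```python
def classify_failure(trajectory: list[tuple[str, str]],
                     max_actions: int = 30) -> str | None:
    """Classify an unsuccessful AlfWorld trajectory per the definitions above."""
    # Hallucination: two or more consecutive identical actions to which the
    # environment responded with the same observation.
    for (a1, o1), (a2, o2) in zip(trajectory, trajectory[1:]):
        if a1 == a2 and o1 == o2:
            return "hallucination"
    # Inefficient planning: more than max_actions actions without reaching
    # a successful state (the caller has already established failure).
    if len(trajectory) > max_actions:
        return "inefficient planning"
    return None  # neither heuristic fired
```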

Inefficient-planning errors occur in trajectories in which the agent reaches action 30 without showing any signs of improvement. While the agent may be constructing a mental map of the environment as it discovers and observes new items and locations, the authors penalize this behavior because it does not demonstrate strong decision-making ability. Fig. 2 shows that although the agent can solve additional tasks through trial and error, it still converges to roughly the same 3:1 ratio of hallucination to inefficient planning as in Trial 1. However, with reflection (Fig. 2), the agent corrects all of its inefficient-planning mistakes and all but four of its hallucination-related mistakes.

Reflexion limitations: the Reflexion agent can optimize its reasoning trace and action execution, but it does not have complete awareness of the quality of the tools it may be using.

Natural-language agents can learn from past mistakes and redirect future decisions in planning sequences, removing the need for a human trainer in a human-in-the-loop approach. The authors demonstrated learning curves on the AlfWorld and HotPotQA benchmarks that significantly outperform base ReAct agents. They also included an inconclusive attempt to improve performance on the WebShop benchmark, along with a discussion that highlights a few limitations of the approach. Reflexion is a highly applicable method for improving performance across trials on decision-making and knowledge-intensive tasks because it depends only on a binary reward model. In the AlfWorld and HotPotQA experiments, they constrained the reward model to imitate environments in which informative reward models may be difficult to design or compute. They encourage others to apply Reflexion to more complex tasks in which the agent must learn to develop new ideas, explore larger unseen state spaces, and form more accurate plans of action through its experiences in past environments.
