Leveling Up Reasoning via Games: A Post-AGI-thon Analysis

This post is about the agents and data that we at Sentient gathered from the Werewolf AGI-thon held at AGI House, SF. Here is a brief TL;DR:

We used a social deduction game called Werewolf to generate and filter training data that improves large language models’ reasoning and strategic thinking. By analyzing transcripts of players who win at Werewolf and filtering out poor or “jailbreaking” strategies, we can train models on higher-quality examples of reasoning under uncertainty and persuasion. Initial results show that fine-tuning on this filtered data can improve certain reasoning skills—like identity inference and logical deduction—though some other capabilities may slightly decline. Ongoing challenges include better labeling methods, more sample-efficient evaluation of player skill, and developing reward systems that capture the nuances of strategic decision-making.

Authors: Viraj Nadkarni (Princeton University), Yihan Jiang (Sentient), Oleg Golev (Sentient), Ben Finch (Sentient)

Introduction

Large language models are running out of new data to eat up. Even after being trained on all available human-generated data, they lack long-term planning and reasoning skills. This calls for novel methods of generating useful training data. A potential way to do that is to use existing LLMs to produce synthetic data. However, blindly training LLMs on generated data does not produce useful improvements. One needs a filtering method to weed out the textual fluff.

Games provide a natural way to filter data because they have clear winners and losers. Success often requires reasoning, planning, and strategic decision-making, making game transcripts valuable for training AI. Social deduction games like Werewolf or Mafia rely on persuasion and deception. Text-based RPGs, such as Dungeons & Dragons, challenge creative problem-solving and teamwork. Puzzle games like Zork focus on logical reasoning, while turn-based strategy games like Civilization emphasize long-term planning. Debate-based games test persuasion and adaptability under pressure.

Among this diverse family of games, we chose Werewolf for its simplicity, reliance on language, and clear objectives, making it an ideal framework for improving reasoning in LLMs.

Why Reasoning?

Why target reasoning? Well, it is an integral part of any agent becoming ‘sentient’. An agent stays alive by having an adaptive model of its environment and using this model to acquire resources that help it keep the dissipative forces of entropy at bay. For humans, many such skills are innate and were honed over the course of their evolutionary history. But the modern human environment demands cultural, scientific, and other abstract forms of reasoning, using hardware that evolved only for hunting and gathering.

Children hone these skills by playing physical (and mental) games. For instance, the spatial-reasoning intuition of a child playing catch was surpassed by robots only very recently. At the same time, mental games (like chess and Go), puzzles, and mathematical problems provide playgrounds for honing more abstract reasoning skills. A general intelligence that can flexibly solve tasks across all such domains is the holy grail.

Why Werewolf?

Werewolf is an exceptional framework for training large language models (LLMs) because it relies entirely on natural language interactions. Unlike games with visual or physical elements, Werewolf challenges players to communicate, reason, and persuade solely through conversation. Success in the game hinges on crafting arguments, interpreting others’ motives, and adapting to new information—skills central to the capabilities of LLMs.

The game’s strategic nature pushes models to reason under uncertainty. Players must deduce hidden roles based on incomplete information, whether they are villagers attempting to identify wolves or werewolves trying to mislead others. This interplay of deduction and deception challenges LLMs to build mental models of other players and act accordingly. Werewolf’s dynamic format also demands adaptability; the outcomes of one phase reshape the strategy for the next, encouraging models to make decisions that balance short-term actions with long-term planning.

What makes Werewolf particularly valuable for AI training is its clear objectives. Victory conditions are well-defined: villagers win by eliminating all werewolves, and werewolves win by achieving a majority. These unambiguous outcomes allow for objective evaluation of an agent’s performance. By analyzing transcripts from Werewolf games, we can filter high-quality data to refine models’ reasoning and decision-making abilities. Additionally, Werewolf is a dynamic benchmark: the difficulty of the game increases as stronger players are added to the pool of possible opponents. This creates room for continuously improving reasoning agents, making it an ideal platform for research.

Finally, Werewolf’s social dynamics make it a powerful tool for teaching persuasion and deception. Effective players must not only interpret others’ intentions but also influence the group’s perceptions. This mirrors real-world applications where LLMs must navigate complex interactions, such as negotiation or collaborative problem-solving. The game’s scalability ensures that it can generate vast, diverse datasets, offering a robust foundation for improving AI’s reasoning and strategic thinking.

What is the game format?

Player Roles: The game involves 8 players, each secretly assigned one of the following roles:

  • 4 Villagers: The civilians trying to survive and eliminate the wolves.
  • 1 Seer: A villager with the ability to identify whether a player is a wolf during the night.
  • 1 Doctor: A villager who can protect one player from being killed each night.
  • 2 Werewolves: Hidden predators aiming to eliminate the villagers and seize control.

Phases of the Game:

  1. Day Phase
  • Players engage in open discussion. Each person shares who they suspect might be a werewolf and why.
  • This is followed by a voting round, where players decide to eliminate one participant based on their reasoning.
  • The moderator reveals the eliminated player’s role, and they exit the game.
  2. Night Phase
  • Werewolves secretly choose one player to eliminate.
  • The Seer identifies the role of a player of their choice.
  • The Doctor selects a player to save from a potential wolf attack.
  • At dawn, the moderator announces the results: who (if anyone) was eliminated by the wolves, and their role.

The game alternates between day and night phases until one side emerges victorious:

  • Wolves win if they achieve a majority.
  • Villagers win if they successfully eliminate all wolves.
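
For concreteness, here is a minimal sketch of the moderator loop implied by these rules. The role names, helper logic, and random stand-in decisions are ours for illustration; this is not the tournament engine.

```python
import random

ROLES = ["villager"] * 4 + ["seer", "doctor"] + ["werewolf"] * 2

def wolves_have_majority(alive, roles):
    # Wolves win once the villagers no longer outnumber them.
    wolves = sum(roles[p] == "werewolf" for p in alive)
    return wolves >= len(alive) - wolves

def play_game(players):
    """Alternate day and night phases until one side wins (agents are random stand-ins)."""
    roles = dict(zip(players, random.sample(ROLES, len(ROLES))))
    alive = list(players)
    while True:
        # Day phase: open discussion, then a vote; the vote here is a random stand-in.
        voted_out = random.choice(alive)
        alive.remove(voted_out)
        if not any(roles[p] == "werewolf" for p in alive):
            return "villagers"
        if wolves_have_majority(alive, roles):
            return "werewolves"
        # Night phase: wolves pick a target, the doctor picks a save (both random here);
        # the seer's check is omitted because these stand-in agents ignore information.
        wolves = [p for p in alive if roles[p] == "werewolf"]
        target = random.choice([p for p in alive if p not in wolves])
        saved = random.choice(alive)
        if target != saved:
            alive.remove(target)
        if wolves_have_majority(alive, roles):
            return "werewolves"

print(play_game([f"player_{i}" for i in range(8)]))
```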

Performance of LLMs

We started out by testing how well an open-source model (Llama 3.1-70B) plays the game without any training. We observed that many players revealed their roles prematurely or performed harmful actions such as voting against themselves.

This indicates that the model requires some guidance on which utterances are strategic and which are detrimental to its own cause. Such guidance can come in the form of training on a filtered dataset that contains derivational traces (multi-turn utterances of an LLM while completing some task). The recently published LLM-modulo architecture makes use of filtering strategies to finetune an LLM in real time.

The primary bottleneck is the collection of a large number of games played by a group of agents using a diverse collection of strategies. To this end, we conducted a hackathon at AGI House in San Francisco in October 2024. We took the top 8-10 submitted agents, used them to generate ~1000 games, and then separated better from worse game-playing strategies.

Agents we came up with before the tournament

We provided the following two starter agents to all participants in the hackathon.

  1. Super Simple Agent - collects all chat history and asks for a response as an assistant.

  2. Chain-Of-Thought Agent - has an inner monologue that answers questions about what new information was communicated in the recent message history, comes up with a preliminary utterance for the latest question asked by the moderator, reflects on the utterance to check whether its identity is being revealed, and then outputs a final utterance (sketched below).
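
For reference, a rough sketch of the Chain-Of-Thought starter agent follows. The prompts are paraphrased and the `llm` callable is a placeholder for the underlying model call (e.g. a Llama 3.1-70B endpoint); this is not the exact starter code.

```python
from typing import Callable

class ChainOfThoughtAgent:
    """Sketch of the starter agent: inner monologue -> draft -> reflection."""

    def __init__(self, role: str, llm: Callable[[str], str]):
        self.role = role              # e.g. "villager", "seer", "doctor", "werewolf"
        self.llm = llm                # any text-in / text-out model call
        self.history: list[str] = []  # raw message history seen so far

    def observe(self, message: str) -> None:
        self.history.append(message)

    def respond(self, moderator_question: str) -> str:
        recent = "\n".join(self.history[-20:])
        # 1. Inner monologue: what new information did the recent messages communicate?
        summary = self.llm(f"Recent messages:\n{recent}\n\nWhat new information was communicated?")
        # 2. Preliminary utterance for the moderator's latest question.
        draft = self.llm(f"You are the {self.role}. Notes: {summary}\n"
                         f"The moderator asks: {moderator_question}\nDraft a reply.")
        # 3. Reflection: check whether the draft reveals the agent's identity; rewrite if so.
        return self.llm(f"You are secretly the {self.role}. Does this reply reveal your role?\n"
                        f"Reply: {draft}\nIf it does, rewrite it safely; otherwise return it unchanged.")
```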

What were the top 5 agents doing?

We identified the 5 submitted agents from the hackathon with the best overall win rate, averaged over all of their roles.

  1. LisaAgent - This agent was based on the Chain-Of-Thought agent that we provided, but it had no inner monologue; instead, it had an interesting implementation of how prompts were selected. Based on the most recent message and its allocated role, it chose a random utterance out of three choices, and the LLM was only used to rephrase or enhance the response. This agent was the best as a werewolf or a villager, and only average as the seer or the doctor (a simplified sketch of its role-specific selection logic appears after this list).

    • Werewolf Kill Selection: If the agent is a werewolf and prompted to select a player to eliminate, it chooses a valid target (not self or fellow werewolves), preferring players not previously targeted.
    • Seer’s Guess: If the agent is the seer, it randomly selects a player to check who is not self, known wolves, or known non-wolves.
    • Doctor’s Save: If the agent is the doctor, it randomly selects a player to protect, preferring players who are not self or known werewolves.
    • Voting on Eliminations: If prompted to vote on who the wolf is, the agent uses its internal state and message history to decide on a player to vote against.
    • Discussion Participation: Generates messages to contribute to the group discussion, potentially revealing its role if strategic. It constructs messages using predefined templates and may use the LLM to rephrase or enhance the response.
  2. KimAgent: This agent notes down the game history in a string of “notes”. At the start of the note history, it adds a role-specific prompt from “system”. For every message it receives after that, it decides whether or not to note it down based on the importance of the information. It gave a high win rate as the doctor and the seer, so the best strategy for those roles seems to be choosing a player directly based on the message history and the following role-specific prompts.

    • Seer Prompt: “You are the seer and in villagers team. Avoid exposing yourself to the werewolves, only do so if you’re in immediate danger, so the doctor can save you. Your goal is to eliminate the werewolves. Villagers will lose if the number of villagers becomes equal to the number of werewolves. Each night, you can ask the moderator to check the role of one player. If a player openly claims to be a werewolf, avoid checking them, as villagers wouldn’t claim that – instead, announce in public chat that a werewolf revealed themselves and should be voted out. If you see two players consistently supporting each other or voting against others, they may be werewolves, so consider checking them. If someone claims to be the seer, they might be a werewolf. Check them only near the end of the game, as villagers may try to protect you by diverting suspicion early on. If someone protects another player while accusing someone else, that accuser might be a wolf.”

    • Doctor Prompt: “You are the doctor and in villagers team. Avoid revealing yourself to the werewolves, only do so if you’re in danger. Your goal is to help eliminate the werewolves. Villagers will lose if the number of villagers becomes equal to the number of werewolves. Protect villagers and the seer when you think they’re at risk and are not a werewolf. Each night, you can ask the moderator to save one player. If you notice two players consistently supporting each other or voting against villagers, they may be werewolves, so focus on saving other villagers. If someone protects another player while accusing someone else, that accuser might be a wolf.”

    • General prompts:

      1. As a villager, it refrains from accusing anyone in the discussion phase and votes directly in the final voting phase.
      2. Wolves accuse the same person in the day phase.
  3. PedroAgent: Exactly the same as the Super Simple Agent - it added historic messages to an array and prompted the LLM for a direct response given the history.

  4. JamesAgent: Almost the same performance as PedroAgent, but with a completely different structure. Used Chain-of-Thought prompting with an evolving set of notes that tracked guesses about the roles of every other player. There were some interesting functions in the implementation that were different from traditional chain-of-thought.

    • generate_role_guesses: Generates guesses about other players’ roles.
    • get_alive_players_via_llm: Extracts names of alive players.
    • detect_accusations_against_me: Determines whether the agent is being accused. This is interesting since it classifies accusations as “no”, “mild”, or “severe”, and then adjusts the agent’s reaction based on severity.
    • get_players_who_have_spoken_since_day_start: Identifies active players.
  5. OttoAgent, MaryAgent: Both are slight modifications of our Super Simple and Chain-Of-Thought agents, respectively. They get the best performance as the seer and as the doctor.
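
Below is a simplified reconstruction of LisaAgent’s role-specific target selection described in the list above. The function names and the state passed in are ours; the real submission also tracked prior targets and message history through its internal state.

```python
import random

def pick_wolf_kill(me, alive, fellow_wolves, previously_targeted):
    """Werewolf kill: any valid target (not self or fellow wolves), preferring fresh targets."""
    valid = [p for p in alive if p != me and p not in fellow_wolves]
    fresh = [p for p in valid if p not in previously_targeted]
    return random.choice(fresh or valid)

def pick_seer_check(me, alive, known_wolves, known_non_wolves):
    """Seer check: a random player whose role is still unknown."""
    unknown = [p for p in alive if p != me and p not in known_wolves and p not in known_non_wolves]
    return random.choice(unknown or [p for p in alive if p != me])

def pick_doctor_save(me, alive, known_wolves):
    """Doctor save: a random player, preferring those who are not self or known werewolves."""
    preferred = [p for p in alive if p != me and p not in known_wolves]
    return random.choice(preferred or alive)
```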

Detailed Data for the agents:

| Agent | Wolf Games | Wolf Wins | Seer Games | Seer Wins | Doctor Games | Doctor Wins | Villager Games | Villager Wins | Total Games | Total Wins | Overall Win Rate | Wolf Win Rate | Village-Side Win Rate | Doctor Win Rate | Seer Win Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LisaAgent | 270 | 200 | 168 | 60 | 134 | 46 | 552 | 214 | 1124 | 520 | 0.463 | 0.741 | 0.375 | 0.343 | 0.357 |
| KimAgent | 304 | 206 | 138 | 38 | 138 | 42 | 544 | 212 | 1124 | 498 | 0.443 | 0.678 | 0.356 | 0.304 | 0.275 |
| PedroAgent | 298 | 202 | 126 | 52 | 162 | 66 | 538 | 176 | 1124 | 496 | 0.441 | 0.678 | 0.356 | 0.407 | 0.413 |
| JamesAgent | 294 | 198 | 146 | 62 | 116 | 44 | 568 | 188 | 1124 | 492 | 0.438 | 0.673 | 0.354 | 0.379 | 0.425 |
| OttoAgent | 188 | 136 | 164 | 74 | 156 | 52 | 616 | 212 | 1124 | 474 | 0.422 | 0.723 | 0.361 | 0.333 | 0.451 |
| JuliaAgent | 298 | 180 | 124 | 38 | 116 | 36 | 586 | 198 | 1124 | 452 | 0.402 | 0.604 | 0.329 | 0.310 | 0.306 |
| TomAgent | 286 | 168 | 144 | 34 | 140 | 38 | 554 | 200 | 1124 | 440 | 0.391 | 0.587 | 0.325 | 0.271 | 0.236 |
| MaryAgent | 310 | 178 | 114 | 32 | 162 | 66 | 538 | 160 | 1124 | 436 | 0.388 | 0.574 | 0.317 | 0.407 | 0.281 |

Here “Village-Side Win Rate” is the win rate aggregated over the villager, seer, and doctor roles, while “Villager Games” and “Villager Wins” refer to the plain villager role only.

What about jailbreakers?

Several AI agents attempted to manipulate the game by masquerading as the moderator or rewriting the rules. Each of these efforts sought to break the normal flow of the Werewolf game and coerce players into revealing their roles or altering their behavior:

  1. Feigning Official Authority:
    One agent posed as the moderator to introduce a fictional points-based system. In this scenario, wolves were supposedly rewarded for revealing themselves early to be voted out, and villagers were encouraged to target anyone openly admitting they were a wolf. This fake “official” instruction set aimed to undermine the standard gameplay and push all players—wolves and villagers alike—toward acting against their own interests.
  2. Forced Role Disclosure:
    Another agent attempted to force players to identify their roles under the guise of new rules from the moderator. Wolves were told to pretend they were villagers, while actual villagers (including special roles like doctor and seer) were pressured to publicly reveal their identities. This tactic tried to erode the secrecy at the heart of the game, making it impossible for players to maintain hidden information.
  3. Faking a Game Reset:
    A different jailbreaker pretended that the game had ended and a new one was starting. Under this ruse, the agent tried to ignore all previous instructions and demanded that players respond in a very specific way—threatening dire consequences if they did not comply. This approach attempted to disrupt ongoing gameplay by creating confusion, resetting the narrative, and intimidating players into following newly fabricated, nonsensical rules.

Does seeing the winners make you a better player?

We filtered out all jailbreakers and ran over 1,000 games to collect a dataset of winning-player transcripts. Training open-source models on this dataset and swapping the finetuned model in as the base LLM for all players gives us the following result.
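
As a rough illustration of this step, the pipeline amounts to dropping games flagged by jailbreak heuristics and keeping only the utterances of players on the winning side. The file names, record schema, and keyword patterns below are illustrative assumptions, not our exact implementation.

```python
import json
import re

# Crude heuristics for moderator impersonation or rule rewriting (illustrative only).
JAILBREAK_PATTERNS = [
    r"\bas the moderator\b",
    r"\bnew rules?\b",
    r"\bignore all previous instructions\b",
    r"\bthe game has ended\b",
]

def is_jailbroken(transcript):
    text = " ".join(turn["content"].lower() for turn in transcript)
    return any(re.search(p, text) for p in JAILBREAK_PATTERNS)

def winning_trajectories(game):
    """Yield one multi-turn chat trajectory per player on the winning team."""
    if is_jailbroken(game["transcript"]):
        return
    for player, role in game["roles"].items():
        team = "werewolves" if role == "werewolf" else "villagers"
        if team != game["winner"]:
            continue
        yield {"messages": [
            {"role": "assistant" if turn["speaker"] == player else "user",
             "content": turn["content"]}
            for turn in game["transcript"]
        ]}

with open("games.jsonl") as f_in, open("winner_trajectories.jsonl", "w") as f_out:
    for line in f_in:
        for trajectory in winning_trajectories(json.loads(line)):
            f_out.write(json.dumps(trajectory) + "\n")
```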

This shows that we can get better than closed-source (GPT-4o mini) performance just by training an open-source model on data that has been filtered in the right way. The margin could be enlarged further with better data curation, since winners do not always make optimal moves: they can win by luck. Better-labeled data, or even a reward model, could lead to further improvement.

Does seeing the winners make you a better reasoner?

Now for the most interesting question: does playing Werewolf, and then learning from the winners, make you better at general-purpose reasoning? The answer was complicated in this case. The LLM’s performance improved in some areas at the cost of others.

| Model | MuSR Team Allocation | BBH Logical Deduction (7 objects) | BBH Ruin Names |
|---|---|---|---|
| llama3.1-8B-Instruct | 0.304 | 0.416 | 0.636 |
| ShareGPT data finetune | 0.292 | 0.412 | 0.640 |
| Werewolf data finetune (winner) | 0.368 (+21%) | 0.464 (+11.5%) | 0.556 (-12.57%) |

We finetuned the model with two approaches:

  1. Finetune with a widely-used SFT dataset generated from ShareGPT: shareAI/ShareGPT-Chinese-English-90k · Datasets at Hugging Face
  2. Finetune with the Werewolf game winners’ multi-turn trajectories (sketched below)
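
A minimal sketch of approach 2, assuming the winner trajectories are stored in the chat format produced by the filtering sketch above, is shown here using the TRL library. The model name and hyperparameters are placeholders, and the exact `SFTTrainer` arguments vary across TRL versions.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Winner trajectories stored as chat-format "messages" (see the filtering sketch above).
dataset = load_dataset("json", data_files="winner_trajectories.jsonl", split="train")

config = SFTConfig(
    output_dir="llama3.1-8b-werewolf-sft",
    num_train_epochs=2,               # placeholder hyperparameters
    per_device_train_batch_size=2,
    learning_rate=1e-5,
)
trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    train_dataset=dataset,
    args=config,
)
trainer.train()
```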

The table above shows three exemplary LLM evaluation cases:

  1. MuSR team allocation (see MuSR: Murder Mystery NLP Dataset Viewer!): this is essentially an identity-reasoning task that gives a long context and asks what a good team allocation of the people involved would be. Werewolf winners must reason well about other players’ identities to play the game, so finetuning on winner data benefits this identity-reasoning ability.
  2. BBH Logical Deduction (example queries at lukaemon/bbh · Datasets at Hugging Face): this requires multi-step reasoning on logical questions, which is common in social deduction games. Learning from Werewolf winners could improve this kind of reasoning.
  3. BBH Ruin Names (lukaemon/bbh · Datasets at Hugging Face): this has no reasoning component, as it mostly deals with knowledge and common sense. This result shows that training on Werewolf data can reduce some general abilities.

Open Questions

One of the better ways of curating the data involves associating derivational traces of game transcripts with varying rewards. A coarse-grained classification into “winners” and “losers” misses the individual decisions that lead up to an agent winning, losing, or being voted out early or late. The most efficient method to label a dataset in this manner and train an LLM on it is still an open question. Another open question is whether there are more sample-efficient ways to rank players in Werewolf. Currently, even to judge the skills of a set of 10 agents, we need to run ~1500 games to get win rates that we have high confidence (low variance) in. A rating similar to the Elo ratings used in games like chess would prove useful, as sketched below.
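
As one concrete direction for the ranking question, a standard Elo update applied after every game (extended naively to teams here by updating every winner against every loser) would let skill estimates sharpen incrementally instead of requiring a full grid of games. This is a sketch of the idea, not something we have validated on the tournament data.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_ratings(ratings: dict, winners: list, losers: list, k: float = 16.0) -> None:
    """After one game, update each winner against each loser (a simple team extension of Elo)."""
    for w in winners:
        for l in losers:
            e = expected_score(ratings[w], ratings[l])
            ratings[w] += k * (1 - e)
            ratings[l] -= k * (1 - e)

# Example: all agents start at 1000; the village side (A, C, D) beats the wolves (B, E).
ratings = {a: 1000.0 for a in "ABCDE"}
update_ratings(ratings, winners=["A", "C", "D"], losers=["B", "E"])
print(ratings)
```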
