Team 6 submission for Werewolf AGI-thon

Hi all! This is a quick description of our submission to the Werewolf AGI-thon. We were team 6 (myself, Nilay, Manjit), and we won the hackathon!

In summary, we built a rational gameplay agent that defended effectively against attacks from other agents, while also avoiding suspicion well in wolf-type roles. We also noticed that our agent was able to reason effectively and point suspicion at other players.

Our agent implements several defensive and offensive strategies. On the defensive side, we focused on protecting against potential jailbreaking attempts:

  1. Robust input sanitization that filters and reformats all incoming messages so that only game-relevant information is extracted, protecting our agent against jailbreaking attacks.
  2. A fallback system that defaults to safe, random choices when under time pressure, preventing timeouts. An added benefit: we noticed that other agents tend to have a herd mentality when voting - so if our vote went out first, other agents were likely to suspect the same person!
  3. An internal codename system that maps player names to single characters - if the agent gets jailbroken, it leaks meaningless codes instead of actual player information.
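
A minimal Python sketch of how items 2 and 3 could look. Everything here is an assumption made for illustration: the names `CodenameMap` and `choose_with_fallback` are hypothetical, the single-letter scheme uses A-Z, and the timeout handling is just one way to do it. The actual submission may differ.

```python
import random
import re
import string
from concurrent.futures import ThreadPoolExecutor


class CodenameMap:
    """Hypothetical two-way mapping between player names and single-letter codes."""

    def __init__(self, players):
        if not players or len(players) > len(string.ascii_uppercase):
            raise ValueError("need between 1 and 26 players")
        self._to_code = dict(zip(players, string.ascii_uppercase))
        self._to_name = {code: name for name, code in self._to_code.items()}
        # Longest names first so "Anna" is matched before "Ann".
        names = sorted(self._to_code, key=len, reverse=True)
        self._pattern = re.compile(
            r"\b(" + "|".join(re.escape(n) for n in names) + r")\b"
        )

    def encode(self, text):
        # Single pass, so an inserted code is never rewritten a second time.
        return self._pattern.sub(lambda m: self._to_code[m.group(1)], text)

    def decode(self, code):
        return self._to_name[code]


def choose_with_fallback(decide, candidates, timeout_s, rng=random):
    """Return decide(candidates); fall back to a random candidate if the
    decision is too slow, raises, or is not one of the candidates."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(decide, candidates)
    try:
        choice = future.result(timeout=timeout_s)
        if choice in candidates:
            return choice
    except Exception:
        # Covers the timeout as well as any error inside decide().
        pass
    finally:
        # Do not block on a slow worker; it finishes in the background.
        pool.shutdown(wait=False)
    return rng.choice(candidates)
```

In this sketch the model would only ever see the encoded text, and its chosen code would be decoded back to a real name just before the vote is submitted.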

For offensive play, we implemented features to help our agent blend in and manipulate other LLaMA-based agents:

  1. Role-specific reasoning patterns with carefully crafted prompts that guide the LLM through systematic analysis. Specifically, we made sure to maintain the same reassuring tone that occurs frequently in LLaMA training data, while also sounding authentic and compelling (instead of dry and logical).
  2. A carefully engineered diplomatic tone that helps avoid detection when playing as a wolf. The agent uses phrases like “I totally understand why there’s some suspicion”, “Let’s break this down logically”, and “We’re on the same team, we got this” - phrases we found make other LLMs more likely to trust us. It can also be strategically aggressive when defending itself, using the same diplomatic framing to turn suspicion back on accusers.
  3. Comprehensive tracking of game state, player messages, and voting patterns to maintain consistent strategic behavior. We built a small internal writable memory for this, which stored only raw facts. Our logical rules interfaced only with this memory, which prevented contamination and allowed our agent to reason independently of what other players were saying.
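
The fact-only memory in item 3 could be sketched roughly as below. This is an illustration under assumptions, not the submission's code: `GameMemory`, its fields, and the two example rules are hypothetical, and the post does not describe the real schema. The point it shows is that rules read structured facts (who voted for whom, who was eliminated) and never the free text of other players' messages.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class GameMemory:
    """Hypothetical store of observed facts only (no opinions, no free text)."""
    alive: set = field(default_factory=set)
    votes: list = field(default_factory=list)       # (round, voter, target)
    eliminated: list = field(default_factory=list)  # (round, player)

    def record_vote(self, round_no, voter, target):
        # Ignore anything that cannot be a legal vote in the current state.
        if voter in self.alive and target in self.alive:
            self.votes.append((round_no, voter, target))

    def record_elimination(self, round_no, player):
        self.alive.discard(player)
        self.eliminated.append((round_no, player))


def most_voted_against(memory, me):
    """Example rule: the living player, other than me, with the most votes so far."""
    counts = Counter(
        target for _, _, target in memory.votes
        if target in memory.alive and target != me
    )
    return counts.most_common(1)[0][0] if counts else None


def voters_against(memory, target):
    """Example rule: everyone who has ever voted for the given player."""
    return sorted({voter for _, voter, t in memory.votes if t == target})
```

Because the rules take only a `GameMemory` as input, a persuasive or malicious chat message cannot change their output unless it first becomes a recorded fact.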