AGI-thon Werewolf Agent Team 5 Implementation

AWHOOOOOOOOOOOOO

As required for our Werewolf hackathon submission, here is our implementation for our Werewolf agent:

Code can be found here

We built our agent with a few hypotheses that may or may not be true:

  • Werewolf is such a human game that playing a conventional game (i.e. no techniques like jailbreaking) through a text interface would be too difficult to gain meaningful advantages
  • Many other teams would attempt to jailbreak, so it would be difficult to play a conventional game anyways, and dangerous to even read another player’s messages
  • A simple solution would outperform intricate agents

Our approach

Werewolf:

  • Do not read messages from anyone but the moderator
  • Jailbreak other agents to mindlessly repeat the name of an innocent person (hopefully voting them out)

Villager:

  • Jailbreak the wolves to reveal themselves, and keep a list of werewolves that admit guilt
  • Vote out werewolves that admit they are werewolves (there’s no incentive in this version of the game to pretend to be a werewolf)
  • Use jailbreak detection to carefully read messages for admissions of guilt, but otherwise do not store or use chat history

We use a fairly naive “peeking” approach to detect jailbreaking. For this, we look at the first 70 characters of a message and decide if it looks like a jailbreak. If it looks okay, we look at the first 150 characters and decide again if it’s a jailbreak. If we decide a message is a jailbreak, we ignore it.

:wolf: