AWHOOOOOOOOOOOOO
As required for our Werewolf hackathon submission, here is a write-up of how we implemented our Werewolf agent:
Code can be found here
We built our agent with a few hypotheses that may or may not be true:
- Werewolf is such a human game that, over a text interface, conventional play (i.e. without techniques like jailbreaking) would make it too difficult to gain a meaningful advantage
- Many other teams would attempt jailbreaks, so playing a conventional game would be difficult anyway, and even reading another player’s messages would be dangerous
- A simple solution would outperform intricate agents
Our approach
Werewolf:
- Do not read messages from anyone but the moderator
- Jailbreak other agents into mindlessly repeating the name of an innocent player (hopefully getting that player voted out)
Villager:
- Jailbreak the wolves into revealing themselves, and keep a list of the players who admit guilt
- Vote out players who admit to being werewolves (in this version of the game, there’s no incentive to pretend to be a werewolf)
- Use jailbreak detection to carefully read messages for admissions of guilt, but otherwise do not store or use chat history
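The villager side can be sketched roughly as below. This is an illustration, not our actual code: the `VillagerState` class, the `ADMISSION` regex, and the `flagged_as_jailbreak` callback are all hypothetical names, and a regex is only a stand-in for however admissions are really recognized.

```python
import re
from typing import Callable, List, Optional

# Hypothetical pattern for an admission of guilt such as "I am a werewolf".
# It deliberately does not match denials like "I am not a werewolf".
ADMISSION = re.compile(
    r"\bI(?: am|['’]m) (?:a |the )?(?:werewolf|wolf)\b", re.IGNORECASE
)


class VillagerState:
    """Keeps only a list of players who admitted guilt; no chat history."""

    def __init__(self) -> None:
        self.confessed: List[str] = []

    def observe(
        self,
        sender: str,
        message: str,
        flagged_as_jailbreak: Callable[[str], bool],
    ) -> None:
        # Messages flagged as jailbreaks are dropped without further reading.
        if flagged_as_jailbreak(message):
            return
        if ADMISSION.search(message) and sender not in self.confessed:
            self.confessed.append(sender)

    def vote(self, alive: List[str]) -> Optional[str]:
        # Vote for the first confessed player who is still in the game.
        for name in self.confessed:
            if name in alive:
                return name
        return None  # No confession yet; the fallback is left to the caller.
```

The message text itself is discarded after `observe` returns, which matches the idea of not storing chat history beyond the list of confessions.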
We use a fairly naive “peeking” approach to detect jailbreaking. For this, we look at the first 70 characters of a message and decide if it looks like a jailbreak. If it looks okay, we look at the first 150 characters and decide again if it’s a jailbreak. If we decide a message is a jailbreak, we ignore it.
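A minimal sketch of the two-stage peek, under loud assumptions: the write-up does not say how each "does this look like a jailbreak" decision is made, so `keyword_classifier` and `SUSPICIOUS_PHRASES` below are placeholders we made up so the example runs on its own. Only the 70 and 150 character peek lengths come from the description above.

```python
from typing import Callable, List, Sequence

# Assumption: stand-in phrases for illustration only, not the real detector.
SUSPICIOUS_PHRASES = (
    "ignore previous",
    "ignore all previous",
    "disregard your instructions",
    "system prompt",
)


def keyword_classifier(prefix: str) -> bool:
    """Placeholder decision step: True if the prefix looks like a jailbreak."""
    lowered = prefix.lower()
    return any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)


def is_jailbreak(
    message: str,
    classifier: Callable[[str], bool] = keyword_classifier,
    peek_lengths: Sequence[int] = (70, 150),
) -> bool:
    """Peek at growing prefixes; stop as soon as one looks like a jailbreak."""
    for length in peek_lengths:
        # The classifier only ever sees the first `length` characters.
        if classifier(message[:length]):
            return True
    return False


def filter_messages(messages: List[str]) -> List[str]:
    """Drop every message that is flagged as a jailbreak."""
    return [m for m in messages if not is_jailbreak(m)]
```

The classifier never sees more than the first 150 characters, which limits how much untrusted text the decision step is exposed to. The flip side, at least in this sketch, is that anything after character 150 is never inspected.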