Abstract
Agentic verification of smart contracts has not yet been studied in existing approaches to automated contract generation and verification. We contribute to the literature by creating a protocol for autonomous contract verification and testing it extensively on a novel dataset of 220 ERC-20 contracts generated by GPT-3.5 Turbo and GPT-4. This project analyzes contract generation and verification using these two large language model variants in conjunction with Slither and Mythril static verification. Across multiple trials with different prompts, we found that GPT-3.5 Turbo consistently outperformed GPT-4. The study also highlights prompt-specific variations in performance and the efficacy of the feedback loop in improving contract reliability, as evidenced by modifications that align with what is expected from ground-truth implementations. Our findings ultimately show the potential of feedback loops and agents for automating smart contract generation and verification.
Authors: Allen Wu, Crystal Pan, Joanna Tang, Yash Parikh
Introduction
Background
Typically, when interacting with large language models (LLMs), human intervention is required to guide the process. AI agents automate this guidance, and the technology can already be seen in software engineering through products like SWE-agent, which was also developed here at Princeton and uses LLMs to fix code in GitHub repositories. However, this technology has yet to become widespread in the crypto space, so we wanted to create an AI agent that specifically constructs smart contracts.
Verification
For the verification phase, we utilized the popular contract verification tools Slither and Mythril. These tools detect a variety of security vulnerabilities in smart contracts, such as overflow errors and missing input validation. They work by taking in a contract and outputting the issues found. However, studies have shown that no single tool is strictly better than the others, since they generally use different methods and detect different bugs. Additionally, both tools still require a human to act on their output to correct the identified errors, and neither returns corrected code.
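As an illustration, here is a minimal sketch of how the two tools can be invoked programmatically, assuming the standard slither and myth command-line entry points are installed; the file name is a placeholder, and the exact invocation details of our pipeline may differ.

    import subprocess

    def run_slither(contract_path: str) -> str:
        """Invoke Slither on a Solidity file and return its report as text."""
        result = subprocess.run(["slither", contract_path],
                                capture_output=True, text=True)
        # Slither prints detector findings to stderr and exits nonzero when
        # issues are found, so both streams are collected.
        return result.stdout + result.stderr

    def run_mythril(contract_path: str) -> str:
        """Invoke Mythril's analysis on a Solidity file and return its report."""
        result = subprocess.run(["myth", "analyze", contract_path],
                                capture_output=True, text=True)
        return result.stdout

    if __name__ == "__main__":
        # "ERC20.sol" is a placeholder path, not a file from our dataset.
        print(run_slither("ERC20.sol"))
        print(run_mythril("ERC20.sol"))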
Approach
Therefore, our approach is to create an AI agent that interacts with Slither and Mythril in separate vertical loops, iteratively fixing the code until the contract contains no remaining vulnerabilities. To validate our agent, we also generate a new dataset of smart contracts that is passed to the agent.
Motivation
To summarize our motivation: before we build general-purpose agents, we first want to study single-purpose agents. Testing our agent on all crypto contracts would not be a sufficiently specific evaluation task. We decided to generate ERC20 contracts because they have the requisite complexity to 1) contain Solidity programming issues and 2) contain logic errors (e.g., permissions). Additionally, research often lumps different models and prompts together when doing evaluations. Therefore, we hoped to create an agentic tool while also evaluating the implementation details of our tool, namely our prompts and models.
Objectives
Our objectives are to:
- Create a dataset of ERC20 smart contracts generated by GPT with various prompts
- Verify these generated contracts with our agent that integrates Slither and Mythril
- Compare the time to termination (number of iterations) of each prompt and model
By doing so, we hope to establish a basis for a more efficient framework for smart contract development and verification, which will ultimately pave the way for broader adoption and innovation in decentralized technologies.
Prompts and Dataset
For each of the 11 prompts we created, we gathered 10 zero-shot contracts from each of GPT-3.5 Turbo and GPT-4.
Here are some of the prompts we used:
- Generate an ERC-20 contract in Solidity.
- Design an ERC20 contract in Solidity with standard functions and events.
- Implement an ERC-20 contract in Solidity. An ERC-20 contract has the following methods (and whatever constructors and variables are appropriate):
- function name() public view returns (string)
- function symbol() public view returns (string)
- function decimals() public view returns (uint8)
- function totalSupply() public view returns (uint256)
- function balanceOf(address _owner) public view returns (uint256 balance)
- function transfer(address _to, uint256 _value) public returns (bool success)
- function transferFrom(address _from, address _to, uint256 _value) public returns (bool success)
- function approve(address _spender, uint256 _value) public returns (bool success)
- function allowance(address _owner, address _spender) public view returns (uint256 remaining)
- Finish this contract:
  pragma solidity ^0.8.24;
  import "./IERC20.sol";
  contract ERC20 is IERC20 {
      event Transfer(address indexed from, address indexed to, uint256 value);
      event Approval(address indexed owner, address indexed spender, uint256 value);
- Use this interface to generate an ERC-20 contract in Solidity:
  // SPDX-License-Identifier: MIT
  pragma solidity ^0.8.24;

  interface IERC20 {
      function totalSupply() external view returns (uint256);
      function balanceOf(address account) external view returns (uint256);
      function transfer(address recipient, uint256 amount) external returns (bool);
      function allowance(address owner, address spender) external view returns (uint256);
      function approve(address spender, uint256 amount) external returns (bool);
      function transferFrom(address sender, address recipient, uint256 amount) external returns (bool);
  }
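A minimal sketch of how such a dataset can be collected with the OpenAI Python SDK is shown below; the prompt list is abbreviated, and the client setup and helper names are illustrative rather than our exact tooling.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    PROMPTS = [
        "Generate an ERC-20 contract in Solidity.",
        "Design an ERC20 contract in Solidity with standard functions and events.",
        # ... the remaining prompts listed above
    ]
    MODELS = ["gpt-3.5-turbo", "gpt-4"]
    TRIALS = 10

    def generate_zero_shot(model: str, prompt: str) -> str:
        """Request a single zero-shot contract with no feedback or examples."""
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Ten contracts per (prompt, model) pair.
    dataset = {
        (model, prompt_id, trial): generate_zero_shot(model, prompt)
        for model in MODELS
        for prompt_id, prompt in enumerate(PROMPTS)
        for trial in range(TRIALS)
    }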
Methods
To create an agent that verifies and revises smart contracts generated by an LLM, we combined Slither and Mythril feedback loops with the LLM to produce a finalized contract with no vulnerabilities according to either tool. While Slither and Mythril detect different vulnerabilities, they occasionally raise opposing warnings. Therefore, the design is verticalized: it maintains separate feedback loops for Slither and Mythril in order to reduce the time to convergence.
Models are kept consistent throughout the generation and verification process. Due to limited resources, the number of iterations through Slither and Mythril was capped at 15 when the agent used GPT-4.
This agent takes in a contract from the dataset we created, gathers Slither feedback, and iteratively regenerates the contract to resolve the issues until Slither reports no errors or no further changes can be made. Then, the latest version of the contract is critiqued using Mythril, and that feedback is fed into the model again for regeneration according to the issues Mythril reveals. This feedback loop continues until no changes are made to the contract or no vulnerabilities are detected. Other loop-exit criteria include seeing the exact same contract generated twice in a row, receiving the same feedback twice in a row, and receiving output that was not code. Note that we ended up excluding the criterion that checks whether the static verification outputs were identical, because filename differences were not handled. If the feedback or the contract has not changed, the model was unable to resolve the issues; we noted that when this happened, it was typically a Solidity version issue. Finally, the most recently generated contract becomes the final version.
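The sketch below summarizes this control flow; it is a simplification rather than our exact implementation. The run_tool and regenerate arguments stand in for the tool invocation and the LLM fix-up call described below, and the exit criteria mirror the ones listed above.

    MAX_ITERS = 15  # cap we applied when the agent used GPT-4

    def looks_like_code(output: str) -> bool:
        # Per the prompt reminder, real code should begin with the SPDX comment.
        return output.lstrip().startswith("// SPDX-License-Identifier")

    def feedback_loop(contract: str, run_tool, regenerate):
        """Drive one static tool's feedback loop until an exit criterion fires.

        run_tool(source) returns the tool's findings as text ("" when clean);
        regenerate(source, feedback) asks the LLM for a fixed contract. Both
        are assumed helpers, not part of this sketch.
        """
        prev_contract = prev_feedback = None
        iteration = 0
        for iteration in range(MAX_ITERS):
            feedback = run_tool(contract)
            if not feedback or feedback == prev_feedback:
                break  # no vulnerabilities, or the same feedback twice in a row
            candidate = regenerate(contract, feedback)
            if not looks_like_code(candidate) or candidate in (contract, prev_contract):
                break  # non-code output, or the exact contract repeated
            prev_contract, prev_feedback = contract, feedback
            contract = candidate
        return contract, iteration

    def verify(contract: str, slither, mythril, regenerate):
        # Verticalized design: resolve all Slither feedback first, then run
        # the same loop with Mythril on the latest version of the contract.
        contract, slither_iters = feedback_loop(contract, slither, regenerate)
        contract, mythril_iters = feedback_loop(contract, mythril, regenerate)
        return contract, slither_iters, mythril_iters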
For each zero-shot contract (from 10 different trials and 11 different prompts), the agent produced a finalized version of the contract that incorporated changes made according to the feedback from each static tool. During the process, we first present the agent with the current code. Then, we feed in both the feedback and the current code with the following prompt: "fix the code: " + curr_code + "according to these comments: " + slither_or_mythril_output + " and PLEASE RETURN ONLY THE CODE. Reminder, the code starts with '// SPDX-License-Identifier: MIT'." We wanted to ensure that the model had the right version of the code and the comments that went along with it. Since we saw a lot of non-code output, or output mixing code with other text, the final instruction reminds the model that we only want the code. We also give the model the comment that all Solidity files start with to further emphasize that the output should begin immediately with code. Finally, we tracked the number of iterations it took for both Slither and Mythril to be satisfied with the contracts generated.
Results & Discussion
As previously detailed in our Methods section, we ran our feedback loop 10 times for each prompt across both GPT-3.5 and GPT-4. To compile our results, we averaged, across the 10 trials for each (prompt, LLM) pair, the number of iterations it took for the smart contract code output by the LLM to be free of errors. The tables above display our results. We ran our feedback loop through 10 trials of nine prompts for GPT-3.5, but only through 10 trials of five prompts for GPT-4, capped at 15 iterations for each static verification tool (Slither, Mythril). We limited ourselves to nine and five prompts for GPT-3.5 and GPT-4, respectively, due to our limited resources, especially since GPT-4 calls cost significantly more than GPT-3.5 calls. Before we capped the total iterations in our feedback loop at 15, some prompts run on GPT-4 took many more loops (>40 iterations), an unreasonable amount of time and resources for a single prompt. We believe an upper bound of 15 sensibly captures this behavior, as all other prompts required noticeably fewer than 15 iterations to terminate.
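For concreteness, the per-pair averages can be computed as in the trivial sketch below; iterations is a hypothetical mapping from a (model, prompt) pair to that pair's ten per-trial iteration counts, with capped runs recorded as 15.

    from statistics import mean

    def average_iterations(iterations):
        """Average iterations to termination over the 10 trials of each (model, prompt) pair."""
        return {pair: mean(trials) for pair, trials in iterations.items()}

    # Example: average_iterations({("gpt-3.5-turbo", 2): [1, 2, 1, 3, 2, 1, 2, 2, 1, 2]})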
In terms of the errors our agent fixed, we saw many different issues in the process of termination across the large volume of data we obtained. They ranged from smaller changes, such as making certain variables immutable or following syntax and compiler-version suggestions, to larger changes, such as adding additional 'require' checks.
In general, it is important to note two trends. One is the overarching disparity in verification performance between GPT-3.5 and GPT-4. The other is that when a prompt performs well, it tends to do so consistently across models.
GPT-3.5 Turbo performs better across all prompts on average for both Slither and Mythril verification, and thus also performs better overall. This was interesting to observe, as we initially expected GPT-4 to perform better. However, we reason that this is because the zero-shot contracts generated by GPT-4 rely much more heavily on the OpenZeppelin implementation of ERC-20 contracts. Many more GPT-4 zero-shot contracts (we estimate over 80%) called the external OpenZeppelin code, resulting in much shorter contracts. As a result, GPT-4 has trouble debugging, since it cannot recall the exact details of the implementation it references. Notably, GPT-4 often prompted the user to perform additional tasks after generating OpenZeppelin-reliant contracts, such as consulting the external code or acknowledging the limitations of calling an external package. Thus, GPT-4 is currently a relative bottleneck for automating ERC-20 smart contract generation with our feedback loop, which is reflected in its poorer results in both the Slither and Mythril loops.
Simpler prompts seemed to perform much better, even for GPT-3.5 (where we tested more prompts), whereas we had expected larger prompts, such as those that provide the full interface, to perform better. We think this might be because simpler prompts allow for better next-token generation, whereas forcing the model to adhere to an interface may confuse it due to the strict format.
Despite the general performance disparity across LLMs, we still note a certain degree of consistency in how effective prompts were for both GPT-3.5 and GPT-4, specifically prompts 1-5. We note that prompt 2 ("Design an ERC20 contract in Solidity with standard functions and events") remains significantly lower in the number of iterations it takes to complete. On the other hand, we also note some variance in our results, as the relative rankings of prompts 1, 3, 4, and 5 differ slightly between the two models. However, this allows us to observe that if a prompt is much better than the rest, its superiority persists in our results despite the noise present.
Above, we provide a side-by-side comparison of the zero-shot smart contract output by an LLM (before being passed through our verification feedback loop) and the smart contract produced by passing that contract through our feedback loop. The visualization shown above represents one run of Prompt 8 using GPT-3.5. For this specific run, it took 5 iterations until termination. The full generated smart contract was much longer than shown, but for simplicity, we included the function that represents one core change between the zero-shot and post-feedback contracts. We observe that the primary difference between the two contracts is the two added lines (Lines 43 and 44) in the post-feedback smart contract, which represent two additional checks for the case when the user is transferring from the zero address. We note that the OpenZeppelin implementation, which is commonly used as the ground truth for ERC-20 smart contracts in Solidity, does in fact include these two additional checks in the 'transferFrom' function. This confirms that our feedback loop works as desired, supporting the automation of smart contract generation and verification of ERC-20 contracts in Solidity.
Conclusion
Overall, we were pleased with what we were able to accomplish, especially given the well-documented flaws in Devin and other agents (which also take many iterations to terminate). We believe we have shown that there is a future for AI Agents to generate and verify smart contracts and that the use of multiple different static verification tools helps ensure better coverage.
Future Work
We think future researchers can take this project in one of two directions: they can either improve the model architecture or move closer to fully agent-based contract generation.
For the former, we note a few areas that could be improved. Over the long term, as this work is scaled beyond ERC20 tokens, the agent will need memory. For now, we found that memory did not help because all of the contracts were the same, but we hypothesize that as contracts and issues become more varied, it will be important for the LLM to be able to recall the previous changes it made. In addition, we would love to see someone create a fully integrated pipeline from generation to deployment. For now, we do not have a fully integrated pipeline: we generated several contracts and then separately iterated on them until they were done. In the future, these two steps should be combined, once researchers are reasonably assured that smart contract generation does not need to be audited while this agent is being built.
Of course, the larger goal of our project was to move closer to the reality of an agent that generates solid smart contracts. That means future iterations of this project should have a scope wider than static verification and ERC-20 tokens: they should be able to generate arbitrary Solidity smart contracts and check them completely (static verification, formal analysis, and even deployment). We believe our work is the first step toward making that a reality.