SCTest: LLM-Assisted Unit Test Generation for Smart Contracts

Project Overview

Authors: Mahsa Bastankhah, Zerui Cheng, Constantine Doumanidis, Jianzhu Yao (sorted alphabetically)

Abstract: Inspired by recent advances in artificial intelligence, we employ Large Language Models (LLMs) to take the place of a testing engineer in designing unit tests for smart contracts. Our automated end-to-end pipeline preprocesses smart contracts, generates appropriate inputs for the LLMs, creates unit tests, and evaluates the resulting code.

Project Code Link: GitHub - CedArctic/defi-llm-tests

Presentation Video Link: SCTest_presentation.mp4

1. Introduction

Smart contracts are a fundamental building block in today’s DeFi landscape. With over 637 million smart contracts deployed on Ethereum (currently representing more than 400B USD in volume), their safety and reliability are topics of great interest. Contract immutability brings trust and reliability, but it also makes any flaws or bugs irreversible.

Exploitation of such vulnerabilities can have serious financial and legal implications [1]. As a result, rigorously testing contracts before deployment is needed to ensure security and correct functionality. This underlines the significance of designing unit tests for smart contracts.

While hiring professional testing engineers to design unit tests is the industry’s preferred method, for the average blockchain user this detracts from the democratized nature of the ecosystem. Inspired by recent advances in artificial intelligence, and by the AI software engineer Devin [2], we employ Large Language Models (LLMs) to take the place of a testing engineer in designing unit tests for smart contracts. We set up a comprehensive benchmark for AI-generated unit tests for smart contracts and investigate the ability of LLMs to generate comprehensive unit tests using a dataset of real, deployed Ethereum smart contracts. We hope that our work will help inform blockchain users aspiring to deploy their own smart contracts.

Our contributions are as follows:

  • We create an automated end-to-end pipeline that preprocesses smart contracts, generates appropriate inputs for LLMs, creates unit tests, and evaluates the resulting code.
  • We perform an initial exploration of LLM input formulation for improving the quality of the generated tests.
  • We evaluate our pipeline using the Slither-audited Smart Contracts dataset [3] and compare the performance of various LLMs.
  • We analyze insights from our testing, examine the cost of large-scale test generation, and discuss avenues for future work.

2. Our Approach


Figure 2.1 Roadmap of SCTest

First, we obtain smart contracts from the ‘slither-audited-smart-contracts’ dataset on Hugging Face [3]. This dataset contains source code and deployed bytecode for Solidity smart contracts that have been verified on Etherscan.io, along with a classification of their vulnerabilities according to the Slither static analysis framework.
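
As a rough illustration of this step, the sketch below loads the dataset with the Hugging Face `datasets` library. The repository id, configuration name, and column names are assumptions taken from the public dataset card and should be verified before use.

```python
# Minimal sketch of the data-loading step using the Hugging Face `datasets`
# library. The repository id, configuration name, and field names are
# assumptions from the public dataset card.
from datasets import load_dataset

dataset = load_dataset(
    "mwritescode/slither-audited-smart-contracts",  # assumed repository id
    "all-plusplus",                                  # assumed configuration name
    split="train",
)

sample = dataset[0]
print(sample.keys())                # expected fields include source_code and bytecode
print(sample["source_code"][:300])  # flattened, Etherscan-verified Solidity source
```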

Second, we compile each contract with the Solidity compiler and generate standardized JSON output. We then parse the contract AST using the ‘py-solidity-ast’ package [4] to prepare the input for the large language models. In our experiments, we used GPT-3.5 and GPT-4 for test generation.
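
To make this step concrete, here is a hedged sketch that compiles a toy contract to solc standard JSON with py-solc-x and walks the AST. The AST traversal uses py-solc-ast (`solcast`) as a stand-in for py-solidity-ast [4], so those calls are an assumption about an analogous API, and the compiler version is illustrative.

```python
# Sketch of the compile-and-parse step. py-solc-x's compile_standard and the
# solc standard-JSON request are real interfaces; the AST traversal uses
# py-solc-ast (solcast) as a stand-in for py-solidity-ast [4], so those calls
# are assumptions about an analogous API.
import solcx
import solcast

SOURCE = """
pragma solidity ^0.8.19;
contract Token {
    mapping(address => uint256) public balanceOf;
    function transfer(address to, uint256 amount) external {
        balanceOf[msg.sender] -= amount;
        balanceOf[to] += amount;
    }
}
"""

solcx.install_solc("0.8.19")  # pick the version matching the contract's pragma

input_json = {
    "language": "Solidity",
    "sources": {"Token.sol": {"content": SOURCE}},
    "settings": {
        # Request the ABI, bytecode, and the source-level AST.
        "outputSelection": {"*": {"*": ["abi", "evm.bytecode"], "": ["ast"]}}
    },
}
output = solcx.compile_standard(input_json, solc_version="0.8.19")

source_units = solcast.from_standard_output(output)
functions = source_units[0].children(filters={"nodeType": "FunctionDefinition"})
print([fn.name for fn in functions])  # candidate functions to write tests for
```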

Finally, we perform LLM inference, obtain the generated tests, and use the Brownie framework [5] to run and evaluate them. (Brownie is a Python-based development and testing framework for smart contracts targeting the Ethereum Virtual Machine.)
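
A minimal sketch of this stage is shown below, assuming the OpenAI Python client and a pre-existing Brownie project. The prompt wording, model choice, and the `my_project` directory layout are illustrative placeholders rather than our exact setup.

```python
# Sketch of generation plus execution. The chat-completions call and the
# `brownie test` CLI are real interfaces; the prompt wording, model choice,
# and the `my_project` directory layout are illustrative assumptions.
import subprocess
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_test(contract_source: str, function_name: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,  # or "gpt-3.5-turbo"
        messages=[
            {"role": "system",
             "content": "You write Brownie (pytest-style) unit tests for Solidity contracts."},
            {"role": "user",
             "content": f"Write a unit test for `{function_name}` in the contract below:\n\n{contract_source}"},
        ],
    )
    # In practice the reply may be wrapped in markdown fences that need stripping.
    return response.choices[0].message.content

contract_source = Path("my_project/contracts/Token.sol").read_text()
test_code = generate_test(contract_source, "transfer")
Path("my_project/tests/test_transfer.py").write_text(test_code)

# Brownie discovers pytest-style tests under <project>/tests/.
subprocess.run(["brownie", "test"], cwd="my_project", check=False)
```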

3. Prompt Exploration


Figure 3.1 High level illustration of how we adapted our LLM prompt to improve the quality of generated unit tests

Getting the language model to generate appropriate, high-quality outputs required some iteration. We started by prompting the model with basic instructions and a copy of the full flattened contract source code. However, we quickly ran into the model’s context window limit, so we instead parsed the contract and presented it with just the base contract. This worked, but the model would get lazy and output only boilerplate examples of how to write tests with Brownie. To overcome this, we used heuristics to filter for interesting functions in the contract and requested a unit test for each of them individually. Finally, to avoid certain runtime logic issues, we added additional constraints to the prompt regarding transaction sizes and similar parameters.
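
The sketch below illustrates this per-function prompting strategy. The "interesting function" heuristics and the constraint wording are paraphrased assumptions rather than our exact filters or prompts.

```python
# Illustrative sketch of the per-function prompting strategy. The heuristics
# and constraint wording paraphrase the approach; they are not the exact
# filters or prompts used in the experiments.
INTERESTING_HINTS = ("transfer", "approve", "mint", "burn", "withdraw", "deposit")

def is_interesting(fn) -> bool:
    """Heuristic filter over AST function nodes: externally callable functions
    whose names suggest value-moving or state-changing behavior."""
    return (
        getattr(fn, "visibility", None) in ("public", "external")
        and fn.name is not None
        and any(hint in fn.name.lower() for hint in INTERESTING_HINTS)
    )

def build_prompt(base_contract_source: str, function_name: str) -> str:
    return (
        "You are writing Brownie (pytest-style) unit tests for the Solidity contract below.\n"
        f"Write ONE focused test for the `{function_name}` function.\n"
        "Constraints:\n"
        "- Use accounts from Brownie's `accounts` fixture; do not hard-code addresses.\n"
        "- Keep transferred amounts small relative to the deployed supply.\n"
        "- Return only runnable Python code, with no explanations.\n\n"
        f"Contract source:\n{base_contract_source}\n"
    )
```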

4. What do the tests look like?


Figure 4.1 Unit test for the transferFrom() function generated using GPT4


Figure 4.2 Unit test for the transferFrom() function generated using GPT3.5

In the two figures above we show simple tests written in Brownie by GPT-4 and GPT-3.5, respectively, which initialize test accounts and then approve and make a transfer between them using the transferFrom() function. This raises the question: which test is better, and how do we evaluate the tests that we get? One popular metric is code coverage, which represents the percentage of instructions in the contract that a specific test executes. Below we show coverage metrics for the contract as a whole, and for each function individually, for both the GPT-4 and GPT-3.5 generated tests. Based on these, both tests cover an equal percentage of the function’s instructions.


Figure 4.3 Code coverage using the GPT-4 generated test


Figure 4.4 Code coverage using the GPT-3.5 generated test
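
For readers viewing this without the figures, a hand-written test in the same style looks roughly like the sketch below; it assumes an ERC-20-like `Token` contract whose constructor takes an initial supply, and it is not the literal model output shown above. Coverage figures like those in Figures 4.3 and 4.4 can be collected by running `brownie test --coverage`.

```python
# Hand-written illustration of the style of test the models produce for
# transferFrom(). It assumes an ERC-20-like `Token` contract whose constructor
# takes an initial supply; it is NOT the literal GPT-4 or GPT-3.5 output.
from brownie import Token, accounts

def test_transfer_from():
    owner, spender, recipient = accounts[0], accounts[1], accounts[2]
    token = Token.deploy(1_000_000, {"from": owner})

    # The owner approves the spender, who then moves tokens to the recipient.
    token.approve(spender, 100, {"from": owner})
    token.transferFrom(owner, recipient, 100, {"from": spender})

    assert token.balanceOf(recipient) == 100
    assert token.allowance(owner, spender) == 0
```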

5. Evaluation

We use two metrics to assess the quality of the generated unit tests. The first metric evaluates whether the unit tests can execute without triggering runtime errors. We generated tests for 150 smart contracts and tracked how many ran successfully versus how many crashed or failed before completion. The crashes were typically due to poorly written code that raised errors prematurely; regrettably, about half of the unit tests fell into this category, suggesting that GPT models may not produce high-quality unit tests without further tuning. A third category consists of unit tests that ran to completion but failed due to assertion errors. These failures could stem either from issues within the smart contract itself or from the unit test applying incorrect logic that the smart contract was not designed to handle.


Figure 5.1 Overall Performance by Model
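
One hedged way to bucket run outcomes into the three categories above is to parse a JUnit XML report from the pytest layer underneath Brownie (e.g. `brownie test --junitxml=report.xml`; flag pass-through to pytest is an assumption here). In JUnit XML, `<error>` elements correspond to crashes and `<failure>` elements to assertion failures.

```python
# Sketch of bucketing test outcomes into the three categories above, assuming
# a JUnit XML report from the pytest layer under Brownie (pass-through of the
# --junitxml flag is an assumption). <error> maps to "crashed", <failure> to
# "assertion_failed", everything else to "passed".
import xml.etree.ElementTree as ET
from collections import Counter

def categorize(report_path: str) -> Counter:
    counts = Counter()
    root = ET.parse(report_path).getroot()
    for case in root.iter("testcase"):
        if case.find("error") is not None:
            counts["crashed"] += 1           # raised before running to completion
        elif case.find("failure") is not None:
            counts["assertion_failed"] += 1  # completed but an assert was violated
        else:
            counts["passed"] += 1
    return counts

print(categorize("report.xml"))
```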

The second metric we use to evaluate the quality of the unit tests is code coverage. This metric assesses what percentage of the smart contract code was executed by the tests. It gauges the relevance of the test to the smart contract and the extent to which the test can access different parts of the contract. Currently, the average coverage of the generated tests is around 25 percent. We anticipate that this number will increase with further fine-tuning or more refined prompt engineering.

Figure 5.2 Function Coverage Rate Comparison

We conducted a basic economic analysis to determine the average cost of querying GPT models for a typical smart contract, which is about 1 kilobyte of source, or approximately 200 lines of code. We found that the cost of using GPT-3.5 is negligible, whereas the cost for GPT-4 is around 2 cents per query. This indicates that using GPT models for this purpose is economically feasible. The following table shows the results of our economic analysis.

Figure 5.3 Economic Analysis of SCTest
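
The per-query figure can be reproduced with a back-of-the-envelope calculation like the sketch below. The token counts and per-1K-token prices are assumptions (approximate OpenAI list prices in early 2024), not values taken from our analysis.

```python
# Back-of-the-envelope reproduction of the per-query cost estimate. The token
# counts and per-1K-token prices are assumptions (approximate OpenAI list
# prices in early 2024); substitute current pricing before relying on them.
PRICES = {  # model: (input $/1K tokens, output $/1K tokens)
    "gpt-3.5-turbo": (0.0005, 0.0015),
    "gpt-4":         (0.03,   0.06),
}

PROMPT_TOKENS = 400      # ~1 KB of Solidity (~200 LOC) plus instructions
COMPLETION_TOKENS = 150  # one short Brownie test

for model, (price_in, price_out) in PRICES.items():
    cost = PROMPT_TOKENS / 1000 * price_in + COMPLETION_TOKENS / 1000 * price_out
    print(f"{model}: ~${cost:.4f} per query")
# gpt-3.5-turbo: ~$0.0004 per query (negligible)
# gpt-4:         ~$0.0210 per query (about 2 cents)
```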

Insights and Key Takeaways

From our design and experiments above, we summarize our main findings in the two bullet points below. We hope they provide useful insights and guidance for users who would like to leverage the power of generative AI to test their smart contracts.

  • In line with our initial intuition, GPT-4 is approximately as competent as GPT-3.5 at generating unit tests for smart contracts.
  • Both models are still worse than a professional testing engineer.

Future Directions for Improvement

Given the limited time and computational resources available for this course project, it was not realistic for us to carry out all of our more ambitious ideas for improving AI’s ability to generate unit tests for smart contracts. Nevertheless, we have identified the following promising directions for potential improvement, and we are open to and excited about discussing them further if you are interested.

  • Direction 1: Dataset Expansion 💡
    The first direction is to expand the current dataset with a wider variety of smart contracts, making it more comprehensive and reducing variance for more stable and persuasive results. With higher confidence in the evaluation results, users can rest assured about how their code is assessed by our product.

  • Direction 2: Bug Identification and Code Diagnosis 💡
    The second direction is that, beyond a single “yes or no” indicator of whether the code is correct, it would be helpful to make clear to users where and why their code fails, and to assist them in debugging it until it is right. To maximize the power of unit tests, we can add, as extra functionality, detection of where and why the given code is incorrect upon failure, and suggest possible fixes for the identified bugs.

  • Direction 3: Fine-tuning and In-context Learning over Llama 3 💡
    The third direction stems from the recently released open-source Llama 3 foundation model.
    Given that unit test generation for smart contracts is a concrete and well-defined task, using a very general model such as ChatGPT wastes much of the model’s expressive power and leaves untapped potential for the task. On top of prompt engineering, fine-tuning a well-established foundation model such as Llama 3, and leveraging more advanced techniques such as in-context learning, could yield a specialized agent with dedicated expertise in smart contract unit test generation. Such an agent could potentially outperform general models such as GPT-3.5 and GPT-4, and might even match or outperform human experts.
    That said, we acknowledge that this direction requires considerable effort and computational resources, and that it should build on the carefully curated and comprehensive dataset already mentioned in Direction 1 for the best training results. It can also be integrated with Direction 2 to produce a highly competent, considerate, and accurate AI tester of smart contracts.

In all, our project is just a first step on the long roadmap toward decentralized AGI. We hope that, with advances in AI and clever techniques applied to this specific task, smart contract testing will no longer be a luxurious and inaccessible service for everyday smart contract writers, and that more democracy, transparency, and fairness can be brought to the web3 world, in line with the original hope and motivation behind blockchains, while also paving the way for our ultimate pursuit of AGI that benefits every individual across the world.

References

[1] Schneier B., “Smart Contract Bug Results in $31 Million Loss”, Schneier on Security, 2021. Online; accessed Apr. 11, 2024.
[2] Cognition Labs, “Introducing Devin, the first AI software engineer”, Cognition Labs Blog, 2024. Online; accessed Apr. 11, 2024.
[3] Rossini M., “Slither Audited Smart Contracts”, Hugging Face, 2022. Online; accessed Apr. 10, 2024.
[4] 0xAfroSec, “Py-Solidity-AST”, Github.com, 2023. Online; accessed Apr. 10, 2024.
[5] Brownie Project, “A Python-based development and testing framework for smart contracts targeting the Ethereum Virtual Machine.”, Github.com, 2019. Online; accessed Apr. 10, 2024.
[6] Icons provided by Flaticon.com and Freepik.com.
