cain2024-replication.zip (7.33 MB)

Replication Package: Mutation-based Consistency Testing for Evaluating the Code Understanding Capability of LLMs

software

posted on 2024-02-01, 23:59 authored by Donghwan Shin, Ziyu Li

Replication package of "Mutation-based Consistency Testing for Evaluating the Code Understanding Capability of LLMs", Ziyu Li and Donghwan Shin, to appear in the Proceedings of the 3rd International Conference on AI Engineering - Software Engineering for AI (CAIN 2024).

In this paper, we propose a novel method to systematically assess the code understanding performance of LLMs, particularly focusing on subtle differences between code and its descriptions, by introducing code mutations to existing code generation datasets. Code mutations are small changes that alter the semantics of the original code, creating a mismatch with the natural language description. We apply different types of code mutations, such as operator replacement and statement deletion, to generate inconsistent code-description pairs. We then use these pairs to test the ability of LLMs to correctly detect the inconsistencies.
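To make the idea of an operator-replacement mutation concrete, here is a minimal sketch in Python. It is not the implementation shipped in the replication package: the `OperatorReplacer` class, the `mutate` helper, and the sample description/code pair are hypothetical names chosen for illustration, and the only mutation rule shown is swapping one `+` for `-`. It assumes Python 3.9 or later for `ast.unparse`.

```python
import ast


class OperatorReplacer(ast.NodeTransformer):
    """Hypothetical mutation rule: turn one `+` into `-`, then stop."""

    def __init__(self):
        self.mutated = False

    def visit_BinOp(self, node):
        # Visit children first so nested expressions are considered too.
        self.generic_visit(node)
        if not self.mutated and isinstance(node.op, ast.Add):
            node.op = ast.Sub()
            self.mutated = True
        return node


def mutate(source: str) -> str:
    """Return a mutant of `source` with a single operator replaced."""
    tree = ast.parse(source)
    mutator = OperatorReplacer()
    new_tree = mutator.visit(tree)
    if not mutator.mutated:
        raise ValueError("no applicable mutation point found")
    return ast.unparse(new_tree)


# Illustrative (made-up) description/code pair, not taken from HumanEval-X.
description = "Return the sum of a and b."
original = "def add(a, b):\n    return a + b\n"
mutant = mutate(original)

# The mutant no longer matches the description, giving an inconsistent pair
# that could be shown to an LLM with a question such as
# "Does this code implement the description?"
pair = {"description": description, "code": mutant, "consistent": False}
```

Other mutation types mentioned above, such as statement deletion, would follow the same pattern with a different transformer.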

We propose a new LLM testing method, called Mutation-based Consistency Testing (MCT), and conduct a case study on two popular LLMs, GPT-3.5 and GPT-4, using the state-of-the-art code generation benchmark HumanEval-X, which covers six programming languages (Python, C++, Java, Go, JavaScript, and Rust). We compare the performance of the LLMs across different types of code mutations and programming languages and analyze the results. We find that the LLMs show significant variation in their code understanding performance and that they have different strengths and weaknesses depending on the mutation type and language.
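A comparison across mutation types and languages can be thought of as a per-group detection rate. The sketch below is only an assumption about how such an aggregation might look; the `detection_rates` function, the record layout, and the sample records are hypothetical placeholders, not data or code from the replication package.

```python
from collections import defaultdict


def detection_rates(records):
    """Compute the fraction of detected inconsistencies per group.

    `records` is an iterable of (language, mutation_type, detected) tuples,
    where `detected` is True if the LLM flagged the mutated code as
    inconsistent with its description.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for language, mutation_type, detected in records:
        key = (language, mutation_type)
        totals[key] += 1
        hits[key] += int(detected)
    return {key: hits[key] / totals[key] for key in totals}


# Placeholder records for illustration only; these are not results
# reported in the paper.
sample = [
    ("Python", "operator replacement", True),
    ("Python", "operator replacement", False),
    ("Go", "statement deletion", True),
]
rates = detection_rates(sample)
```

Grouping by the `(language, mutation_type)` pair keeps the two dimensions separate, so either one can be compared while holding the other fixed.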

Ethics

There is no personal data or any data that requires ethical approval

Policy

The data complies with the institution and funders' policies on access and sharing

Sharing and access restrictions

The uploaded data can be shared openly

Data description

The file formats are open or commonly used

Methodology, headings and units

Headings and units are explained in the files

Keywords

Large Language Models (LLMs), Mutation Testing, software engineering tools

Licence

MIT
