The Impact of Instruction-Tuned Code Language Models on Software Engineering Tasks

Recent research has shed light on the impressive capabilities of Large Language Models (LLMs) trained on code for various software engineering tasks. These models fall into three main paradigms: base code LLMs specialized in code completion, task-specific code LLMs fine-tuned for individual tasks, and instruction-tuned code LLMs that follow human instructions and generalize to new tasks without additional fine-tuning.

To explore the potential of instruction-tuned code LLMs further, a group of researchers from Monash University and ServiceNow Research introduces ASTRAIOS, a collection of 28 instruction-tuned code LLMs. The models are fine-tuned with seven different methods on StarCoder base models ranging from 1B to 16B parameters, using the CommitPackFT dataset from OctoPack as the instruction-tuning data to strengthen their downstream capabilities.
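
To give a rough sense of what that training data looks like, the sketch below loads CommitPackFT from the Hugging Face Hub. The dataset identifier and language configuration are assumptions based on the OctoPack release, not details stated in the article, so check the hub for the exact names.

```python
from datasets import load_dataset

# Assumed identifiers: the OctoPack release publishes CommitPackFT on the
# Hugging Face Hub; confirm the exact dataset name and language configuration.
commitpackft = load_dataset("bigcode/commitpackft", "python", split="train")

# Each record pairs a commit message (the instruction) with the code before and
# after the change; these can be formatted into instruction/response pairs.
print(commitpackft[0].keys())
```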

The researchers follow the recommended configurations from Hugging Face's PEFT library and incorporate selected methods from recent parameter-efficient fine-tuning frameworks. Their primary scalability analysis tracks cross-entropy loss during instruction tuning as both model size and training time scale.
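
As a minimal sketch of what such a setup looks like, the snippet below wraps a StarCoder base model with one PEFT method (LoRA) using Hugging Face's peft library. The hyperparameters and the choice of target modules are illustrative defaults, not the exact ASTRAIOS configurations.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# StarCoderBase-1B stands in for the smallest base model; access to StarCoder
# checkpoints may require accepting the license on the Hugging Face Hub.
base = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase-1b")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices (illustrative)
    lora_alpha=16,              # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
    target_modules=["c_attn"],  # fused attention projection in StarCoder's GPTBigCode blocks
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # shows how small the trainable fraction is
```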

Furthermore, the researchers evaluate their instruction-tuned code LLMs on five representative code-related tasks: clone detection, defect detection, code synthesis, code repair, and code explanation. They also analyze robustness and code security by testing whether the models can still generate correct code from perturbed examples and by checking the generated code for potential vulnerabilities.
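
For illustration, a generation-style task can be probed as sketched below. The model identifier is a placeholder for a released ASTRAIOS checkpoint, and the prompt format is a hypothetical instruction template; see the project's repository for the actual checkpoints and prompting conventions.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "bigcode/starcoderbase-1b"  # placeholder: substitute a released ASTRAIOS checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical instruction-style prompt; the released models define their own template.
prompt = "Question: Write a Python function that checks whether a string is a palindrome.\n\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```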

Interestingly, the study reveals that while larger PEFT-tuned code LLMs excel at code generation tasks, they show no comparable advantage on code comprehension tasks such as clone detection and defect detection. Increasing model size improves generation performance but raises concerns about greater susceptibility to adversarial examples and a bias toward insecure code.

The relationship between the number of updated parameters, cross-entropy loss, and task performance is explored in depth. The researchers find that the final loss of smaller PEFT models can be used to predict that of larger models, and that final loss correlates strongly with overall downstream task performance.
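
The kind of correlation check this implies can be sketched as below. The loss and score values are made-up placeholders for illustration only, not results from the paper.

```python
from scipy.stats import pearsonr

# Placeholder numbers, not results from the paper: one (loss, score) pair per model.
final_loss = [1.42, 1.31, 1.18, 1.05]
task_score = [21.3, 25.7, 30.2, 34.8]

r, p_value = pearsonr(final_loss, task_score)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")  # strongly negative r: lower loss, higher score
```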

In addition, the study finds that the relative loss across tuning methods is consistent across model sizes: the improvement delivered by each tuning method is comparable regardless of the model's scale. Consequently, the loss observed in smaller models tuned with different methods serves as a useful indicator for predicting the performance of larger models.

The ASTRAIOS collection, along with the accompanying research paper and GitHub repository, provides valuable insight into the potential of instruction-tuned code language models for advancing software engineering tasks.
