
Mastering the Craft of Data Synthesis for CodeLLMs

Authors: Chen, Meng and Arthur, Philip and Feng, Qianyu and Hoang, Cong Duy Vu and Hong, Yu-Heng and Moghaddam, Mahdi Kazemi and Nezami, Omid and Nguyen, Duc Thien and Tangari, Gioacchino and Vu, Duy and Vu, Thanh and Johnson, Mark and Kenthapadi, Krishnaram and Dharmasiri, Don and Duong, Long and Li, Yuan-Fang

Abstract:

Large language models (LLMs) have shown impressive performance in code understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field.

Link: Read Paper

Labels: general coding task, empirical study