Talks

Core notes from some academic talks.

From Code Generation Towards Software Engineering

Motivation

Code generation is not software engineering. Software engineering needs more automation than code generation alone provides. Gaps include:
- weak program understanding
- hallucination
- security concerns

Background

The standard code generation pipeline is: data -> training -> inference -> benchmark.

graph TD 
Data --> Training --> Inference --> Benchmark

The problems are:
- irrelevant context, missing semantics
- textual similarity != code similarity
- outdated inference
- restricted benchmark
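
The "textual similarity != code similarity" point can be illustrated with a small sketch. It is only an illustration, not material from the talk: the three snippets, the helper names (`text_similarity`, `same_behavior`), and the test inputs are all made up for this example, and `difflib` stands in for whatever textual metric a real pipeline would use.

```python
import difflib

# Same behavior as SUM_BUILTIN, but very different text.
SUM_LOOP = """
def total(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc
"""

SUM_BUILTIN = """
def total(xs):
    return sum(xs)
"""

# One character away from SUM_LOOP, but it subtracts instead of adding.
SUB_LOOP = """
def total(xs):
    acc = 0
    for x in xs:
        acc -= x
    return acc
"""


def text_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] based on matching subsequences."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def load(src: str):
    """Execute a snippet and return the `total` function it defines."""
    namespace = {}
    exec(src, namespace)  # acceptable here: the snippets are trusted constants
    return namespace["total"]


def same_behavior(a: str, b: str, inputs) -> bool:
    """Compare two snippets by their outputs on a fixed set of inputs."""
    fa, fb = load(a), load(b)
    return all(fa(list(xs)) == fb(list(xs)) for xs in inputs)


INPUTS = [[], [1], [1, 2, 3], [-4, 10]]

if __name__ == "__main__":
    print("loop vs builtin:", text_similarity(SUM_LOOP, SUM_BUILTIN),
          same_behavior(SUM_LOOP, SUM_BUILTIN, INPUTS))
    print("loop vs subtract:", text_similarity(SUM_LOOP, SUB_LOOP),
          same_behavior(SUM_LOOP, SUB_LOOP, INPUTS))
```

The textually closest pair behaves differently, while the behaviorally equivalent pair looks dissimilar, so a purely textual signal ranks these pairs the wrong way around.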

Research 1: Code structures

Basic software engineering concepts.

Research 2: Reason about semantics

[SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning (NIPS'24)]
- training strategy: semantics alignment
- Approximate -> Structural -> Abstract -> Operational
- dataset: synthesized code, because open-source code (1) typically requires project-specific configuration and (2) often lacks unit tests

[CONCORD: Clone-Aware Contrastive Learning for Source Code (ISSTA'23)]
- learning objective: contrasting code properties
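
As a rough illustration of what a contrastive learning objective looks like, here is a generic InfoNCE-style loss in plain Python. This is not CONCORD's actual objective or code. The function names, the toy 2-D embeddings, and the temperature value are assumptions for this sketch. The idea is only that an anchor embedding should score higher against a positive (for example a clone) than against negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: low when the anchor is closer to the positive
    than to every negative, high otherwise."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


if __name__ == "__main__":
    anchor = [1.0, 0.0]     # toy embedding of a code snippet
    clone = [0.9, 0.1]      # toy embedding of a clone (positive)
    other = [0.0, 1.0]      # toy embedding of unrelated code (negative)
    print("well separated:", info_nce(anchor, clone, [other]))
    print("badly separated:", info_nce(anchor, other, [clone]))
```

Minimizing such a loss pulls positives together and pushes negatives apart in embedding space; which code properties define positives and negatives is the part specific to each paper.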

[TRACED: Execution-aware Pre-training for Source Code (ICSE'24)]
- model design: encode execution
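
To make "encode execution" concrete, the sketch below records the local variable values seen at each executed line of a function, using Python's standard `sys.settrace` hook. It only shows the kind of runtime signal that execution-aware training could consume. It is not TRACED's data pipeline, and `trace_locals` and the `gcd` example are hypothetical names chosen for this illustration.

```python
import sys


def trace_locals(func, *args):
    """Run func(*args) and record, for every executed line of func,
    the line offset within func and a snapshot of its local variables
    taken just before that line runs."""
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace frames of other functions
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, steps


def gcd(a, b):
    while b:
        a, b = b, a % b
    return a


if __name__ == "__main__":
    result, steps = trace_locals(gcd, 12, 18)
    print("result:", result)
    for offset, local_vars in steps:
        print(offset, local_vars)
```

Each recorded step pairs a source location with concrete program state, which is information a model trained on static text alone never sees.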

Research 3: Software Design (Modular programming)

(COLING'24)
(NIPS'24-DB)

Research 4: Security

(ICSE'25)
(TSE'22)
CYCLE: Learning to Self-Refine the Code Generation (OOPSLA'24)
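
The general idea of self-refinement can be sketched as a generate, test, and retry loop. The code below is a generic illustration under stated assumptions, not CYCLE's method: `self_refine`, `run_tests`, and `toy_generate` are hypothetical names, and `toy_generate` is a hard-coded stand-in for a code language model that "fixes" its output once it receives feedback.

```python
from typing import Callable, Optional, Tuple


def run_tests(source: str, tests: str) -> Optional[str]:
    """Execute the candidate and its tests; return None on success,
    otherwise a short error message to feed back to the generator."""
    namespace = {}
    try:
        exec(source, namespace)  # toy setting only: no sandboxing here
        exec(tests, namespace)
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def self_refine(
    generate: Callable[[str, str, Optional[str]], str],
    prompt: str,
    tests: str,
    max_rounds: int = 3,
) -> Tuple[str, bool, int]:
    """Ask the generator for code, run the tests, and pass any failure
    message back in until the tests pass or the round budget is spent.
    Returns (last candidate, whether it passed, rounds used)."""
    source, feedback = "", None
    for round_no in range(1, max_rounds + 1):
        source = generate(prompt, source, feedback)
        feedback = run_tests(source, tests)
        if feedback is None:
            return source, True, round_no
    return source, False, max_rounds


def toy_generate(prompt: str, previous: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for a model: wrong first, right after feedback."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


if __name__ == "__main__":
    tests = "assert add(2, 3) == 5, 'add(2, 3) should be 5'"
    print(self_refine(toy_generate, "Write add(a, b).", tests))
```

In a real setting the generated code would run in a sandbox, and the generator would be a model conditioned on the prompt, its previous attempt, and the execution feedback.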

Future work

Trustworthy deployment: long-term future work.

Questions

DSLs will lack training data.
Can code changes be treated as part of the training data?