Talks
Key notes from some academic talks.
From Code Generation Towards Software Engineering
Motivation
Code generation is not software engineering. Software engineering needs more automation than code generation alone provides. Gaps include:
- weak program understanding
- hallucination
- security concerns
Background
The standard code generation pipeline is: data -> training -> inference -> benchmark.
The problems are:
- irrelevant context, missing semantics
- textual similarity != code similarity
- outdated inference
- restricted benchmark
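The "textual similarity != code similarity" point can be made concrete with a small sketch. It is an illustration only, not something shown in the talk: the three snippets, the helper names (`textual_similarity`, `same_behavior`, `load`), and the use of `difflib` as a stand-in for a textual metric are all assumptions.

```python
import difflib

# Three hypothetical snippets: two sum a list, one differs by a single character.
SRC_SUM_LOOP = "def f(xs):\n    total = 0\n    for x in xs:\n        total += x\n    return total\n"
SRC_SUM_BUILTIN = "def f(xs):\n    return sum(xs)\n"
SRC_SUB_LOOP = "def f(xs):\n    total = 0\n    for x in xs:\n        total -= x\n    return total\n"


def textual_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], used here as a stand-in textual metric."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def load(src: str):
    """Execute a snippet and return the function `f` it defines."""
    namespace = {}
    exec(src, namespace)
    return namespace["f"]


def same_behavior(f, g, inputs) -> bool:
    """Crude behavioral check: same outputs on a handful of test inputs."""
    return all(f(x) == g(x) for x in inputs)


INPUTS = [[1, 2, 3], [], [5]]

# Nearly identical text, different behavior.
near_text = textual_similarity(SRC_SUM_LOOP, SRC_SUB_LOOP)
near_same = same_behavior(load(SRC_SUM_LOOP), load(SRC_SUB_LOOP), INPUTS)

# Very different text, identical behavior on these inputs.
far_text = textual_similarity(SRC_SUM_LOOP, SRC_SUM_BUILTIN)
far_same = same_behavior(load(SRC_SUM_LOOP), load(SRC_SUM_BUILTIN), INPUTS)
```

The one-character change (`+=` to `-=`) scores as more textually similar than the loop-versus-`sum` pair, yet only the latter pair agrees on the test inputs. Agreement on a few inputs is of course only weak evidence of semantic equivalence.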
Research 1: Code structures
Some basic software engineering material.
Research 2: Reason about semantics
[SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning (NIPS'24)]
- training strategy: semantics alignment
  - Approximate -> Structural -> Abstract -> Operational
- dataset: synthesized code, because open-source code (1) typically requires individual configuration and (2) lacks unit tests
[CONCORD: Clone-Aware Contrastive Learning for Source Code (ISSTA'23)]
- learning objective: contrasting code properties
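For orientation, a generic contrastive objective (InfoNCE) can be sketched as below. This is not CONCORD's actual loss; the embedding vectors, the `temperature` value, and the reading of "positive" as a clone and "negative" as a non-equivalent variant are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: low when the anchor is closer to the positive
    than to every negative, high otherwise."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Hypothetical 2-d embeddings: a program, a clone of it, and a non-equivalent variant.
anchor = [1.0, 0.0]
clone = [0.9, 0.1]
variant = [0.0, 1.0]

loss_well_separated = info_nce(anchor, clone, [variant])
loss_confused = info_nce(anchor, variant, [clone])
```

Minimizing such a loss pulls embeddings of equivalent programs together and pushes non-equivalent ones apart, which is the general idea behind contrasting code properties.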
[TRACED: Execution-aware Pre-training for Source Code (ICSE'24)]
- model design: encode execution
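To give a sense of what kind of signal an execution trace contains, the sketch below records local variable values line by line using Python's `sys.settrace`. It does not reproduce how TRACED encodes execution; `trace_locals` and `running_max` are hypothetical names introduced for this example.

```python
import sys


def trace_locals(func, *args):
    """Run func(*args) and record, for each executed line of func,
    (line offset from the def line, shallow snapshot of local variables)."""
    events = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only trace frames that belong to the target function.
        if frame.f_code is not code:
            return None
        if event == "line":
            # The snapshot is taken before the line executes.
            offset = frame.f_lineno - code.co_firstlineno
            events.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, events


def running_max(xs):
    best = xs[0]
    for x in xs:
        if x > best:
            best = x
    return best


result, events = trace_locals(running_max, [1, 3, 2])
```

Each entry in `events` pairs a source line with the program state observed there, which is the raw material an execution-aware model could learn from alongside the source text.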
Research 3: Software Design (Modular programming)
- (COLING'24)
- (NIPS'24-DB)
Research 4: Security
- (ICSE'25)
- (TSE'22)
- CYCLE: Learning to Self-Refine the Code Generation (OOPSLA'24)
Future work
trustworthy deployment -> long-term future work
Questions
- DSLs (domain-specific languages) lack training data.
- Can code changes be treated as part of the training data?