
Talks

Core notes of some academic talks.

From Code Generation Towards Software Engineering

Video Author

Motivation

Code generation is not software engineering. Software engineering needs more automation than code generation alone provides. Gaps include:

  • weak program understanding
  • hallucination
  • security concerns

Background

The standard code generation pipeline is: data -> training -> inference -> benchmark.

graph TD 
Data --> Training --> Inference --> Benchmark

The problems are:

  • irrelevant context, missing semantics
  • textual similarity != code similarity
  • outdated inference
  • restricted benchmark
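To make the "textual similarity != code similarity" point concrete, here is a minimal sketch that is not from the talk. It compares three hypothetical versions of a `total` function: a one-character bug is textually almost identical to the original but behaves differently, while a rewrite using `sum` looks different but behaves the same. The helper names (`text_similarity`, `behaves_the_same`, `load`) and the test inputs are assumptions made for illustration, and agreement on a handful of inputs is only a rough proxy for semantic equivalence.

```python
import difflib

# Three hypothetical implementations, kept as source strings so we can
# compare both their text and their behavior.
SRC_LOOP = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
SRC_BUGGY = "def total(xs):\n    s = 0\n    for x in xs:\n        s -= x\n    return s\n"
SRC_BUILTIN = "def total(xs):\n    return sum(xs)\n"


def text_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def load(src: str):
    """Compile a source string and return its `total` function."""
    namespace = {}
    exec(src, namespace)
    return namespace["total"]


def behaves_the_same(f, g, inputs) -> bool:
    """Rough behavioral check: do f and g agree on every sample input?"""
    return all(f(x) == g(x) for x in inputs)


inputs = [[], [1], [1, 2, 3], [-4, 4, 10]]
loop, buggy, builtin = load(SRC_LOOP), load(SRC_BUGGY), load(SRC_BUILTIN)

sim_buggy = text_similarity(SRC_LOOP, SRC_BUGGY)      # near-identical text
sim_builtin = text_similarity(SRC_LOOP, SRC_BUILTIN)  # noticeably different text
same_buggy = behaves_the_same(loop, buggy, inputs)      # different behavior
same_builtin = behaves_the_same(loop, builtin, inputs)  # same behavior

print(f"loop vs buggy:   text={sim_buggy:.2f} same_behavior={same_buggy}")
print(f"loop vs builtin: text={sim_builtin:.2f} same_behavior={same_builtin}")
```

A retriever or training objective driven only by text overlap would rank the buggy variant as the closest match, which is the gap the talk points at.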

Research 1: Code structures

Some basic software engineering material.

Research 2: Reason about semantics

[SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning (NIPS'24)]

  • Training strategy: semantics alignment
  • Approximate -> Structural -> Abstract -> Operational
  • Dataset: synthesized code, because (1) open-source code typically requires individual configuration and (2) it often lacks unit tests

[CONCORD: Clone-Aware Contrastive Learning for Source Code (ISSTA'23)]

  • Learning objective: contrasting code properties
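As a rough illustration of what a contrastive objective over code could look like, the sketch below computes a generic InfoNCE-style loss. This is not CONCORD's actual loss or model: the embeddings are hand-written 2-D vectors, and the names `cosine`, `info_nce`, `anchor`, `clone`, and `buggy` are hypothetical. The idea shown is only that the loss is low when a semantically equivalent clone sits close to the anchor and a deviant variant sits far away.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: -log(exp(s_pos) / (exp(s_pos) + sum exp(s_neg)))."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy embeddings (assumed, not produced by any real encoder).
anchor = [1.0, 0.0]  # original function
clone = [0.9, 0.1]   # semantically equivalent rewrite, embedded nearby
buggy = [0.1, 0.9]   # deviant variant, embedded far away

# Correct pairing: the clone is the positive, the buggy variant the negative.
good_loss = info_nce(anchor, clone, [buggy])
# Wrong pairing, shown for contrast: the loss becomes much larger.
bad_loss = info_nce(anchor, buggy, [clone])

print(f"clone as positive: {good_loss:.4f}")
print(f"buggy as positive: {bad_loss:.4f}")
```

Minimizing such a loss pushes an encoder to place clones together and deviants apart, regardless of how similar their text looks.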

[TRACED: Execution-aware Pre-training for Source Code (ICSE'24)]

  • Model design: encode execution
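To give a feel for the kind of signal "encoding execution" refers to, here is a minimal sketch that records the local variables of a function at each executed line using Python's `sys.settrace`. It is not TRACED's data pipeline or model. The helper `trace_locals` and the example function `running_max` are hypothetical, and a real system would need to turn such traces into model inputs in some way not shown here.

```python
import sys


def trace_locals(func, *args):
    """Run func(*args) and record (line offset, locals snapshot) before each line."""
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only follow the frame of the function under study.
        if frame.f_code is not code:
            return None
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        # Restore whatever tracer was active before (e.g. a debugger).
        sys.settrace(previous)
    return result, steps


def running_max(xs):
    best = xs[0]
    for x in xs[1:]:
        if x > best:
            best = x
    return best


result, steps = trace_locals(running_max, [3, 7, 5])
for offset, local_vars in steps:
    print(offset, local_vars)
```

Each recorded step pairs a source line with the program state just before it runs, which is information that the source text alone does not expose.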

Research 3: Software Design (Modular programming)

  • (COLING'24)
  • (NIPS'24-DB)

Research 4: Security

  • (ICSE'25)
  • (TSE'22)
  • CYCLE: Learning to Self-Refine the Code Generation (OOPSLA'24)

Future work

trustworthy deployment -> long-term future work

Questions

  • Domain-specific languages (DSLs) will lack training data
  • Can code changes be treated as part of the training data?