Semantic Apparatus – Language-Agnostic Representation Learning of Source Code from Structure and Context

Cited by Lee Sonogan


Abstract by Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, Stephan Günnemann

Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly either on Structure or Context. We propose a new model, which jointly learns on Context and Structure of source code. In contrast to previous approaches, our model uses only language-agnostic features, i.e., source code and features that can be computed directly from the AST. Besides obtaining state-of-the-art on monolingual code summarization on all five programming languages considered in this work, we propose the first multilingual code summarization model. We show that jointly training on non-parallel data from multiple programming languages improves results on all individual languages, where the strongest gains are on low-resource languages. Remarkably, multilingual training only from Context does not lead to the same improvements, highlighting the benefits of combining Structure and Context for representation learning on code.
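To make the Structure side of the abstract concrete, here is a minimal sketch of one feature that can be computed directly from an AST: the path length between two nodes. It is an illustration only, not the authors' model or their exact feature set. It uses Python's built-in `ast` module for convenience, so it only parses Python; a language-agnostic setup would need a parser for each language. The helper names `build_tree`, `ancestors` and `tree_distance` are hypothetical.

```python
import ast


def build_tree(source):
    """Parse Python source and flatten its AST into two parallel lists.

    labels[i] is the node type name and parent[i] is the index of the
    parent node (-1 for the root). Nodes are numbered in pre-order.
    """
    labels, parent = [], []

    def visit(node, parent_idx):
        idx = len(labels)
        labels.append(type(node).__name__)
        parent.append(parent_idx)
        for child in ast.iter_child_nodes(node):
            visit(child, idx)

    visit(ast.parse(source), -1)
    return labels, parent


def ancestors(parent, i):
    """Return the path from node i up to the root, both inclusive."""
    path = [i]
    while parent[path[-1]] != -1:
        path.append(parent[path[-1]])
    return path


def tree_distance(parent, u, v):
    """Number of AST edges on the path between nodes u and v."""
    steps_from_u = {n: d for d, n in enumerate(ancestors(parent, u))}
    for steps_from_v, n in enumerate(ancestors(parent, v)):
        if n in steps_from_u:  # first shared ancestor on v's way up
            return steps_from_u[n] + steps_from_v
    raise ValueError("nodes do not belong to the same tree")


# Context is the raw text; Structure is the tree derived from it.
labels, parent = build_tree("x = a + b")
names = [i for i, label in enumerate(labels) if label == "Name"]
for i in names:
    for j in names:
        if i < j:
            print(i, j, tree_distance(parent, i, j))
```

A pairwise distance like this depends only on the shape of the tree, not on any language-specific node semantics, which is the sense in which such features can be shared across programming languages.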

Publication: arXiv, Cornell University (preprint; the paper was presented at ICLR 2021)

Pub Date: 21 March 2021

Keywords: Computer Science, Software Engineering (Plenty more sections and references in this research article)
