Semantic Apparatus – Linguistic Approach to Segmenting Source Code

Cited by Lee Sonogan

OpenAI's Codex model translates regular languages into computer code - Illinois News Today

Abstract by Aviel J. Stein, Daniel Schwartz, Yiwen Shi, Spiros Mancoridis

Source code segmentation is the process of dividing the source code of a program into meaningful pieces, for example in preparation for source code analysis (SCA) tasks. Our goal is to segment code based on the semantics of its content, specifically so that the segments reflect logical locations that are good candidates for the insertion of manually composed or automatically generated comments. Instead of focusing on syntactic boundaries for segmentation, such as function and class declarations, we exploit the semantic content of the code. We use code snippets mined from GitHub as known semantic segments to train an LSTM neural network model. The model is able to infer locations in the code where a programmer would likely insert comments. It can operate on any text and performs well across multiple programming languages for detecting candidate segment boundaries within a program. This semantic code segmentation is especially useful for incomplete code repositories under development, which may also be written in more than one programming language. Additionally, our technique supports a detection threshold parameter so users can adjust the number of suggestions provided by our tool.
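To make the detection threshold idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names `suggest_boundaries` and `split_into_segments` are hypothetical, and the per-line scores are invented for illustration, standing in for whatever boundary likelihoods a trained model might output. The sketch only shows how raising or lowering a threshold changes the number of suggested comment locations.

```python
from typing import List, Sequence


def suggest_boundaries(scores: Sequence[float], threshold: float) -> List[int]:
    """Return the indices of lines whose boundary score meets the threshold.

    `scores` is assumed to hold one value in [0, 1] per line of code,
    e.g. produced by a trained sequence model (not included here).
    """
    return [i for i, score in enumerate(scores) if score >= threshold]


def split_into_segments(lines: Sequence[str], boundaries: Sequence[int]) -> List[List[str]]:
    """Split `lines` so that every boundary index starts a new segment."""
    starts = set(boundaries)
    segments: List[List[str]] = []
    current: List[str] = []
    for i, line in enumerate(lines):
        # Close the running segment when a boundary line is reached.
        if i in starts and current:
            segments.append(current)
            current = []
        current.append(line)
    if current:
        segments.append(current)
    return segments


# Illustrative input: six lines of code with made-up boundary scores.
code_lines = [
    "import csv",
    "rows = list(csv.reader(open('data.csv')))",
    "total = 0",
    "for row in rows:",
    "    total += float(row[1])",
    "print(total)",
]
scores = [0.10, 0.35, 0.80, 0.20, 0.05, 0.60]

# A high threshold yields few suggestions; a low one yields more.
strict = suggest_boundaries(scores, threshold=0.7)
lenient = suggest_boundaries(scores, threshold=0.3)

strict_segments = split_into_segments(code_lines, strict)
lenient_segments = split_into_segments(code_lines, lenient)
```

With these invented scores, the stricter threshold proposes a single comment location and the more lenient one proposes three, which mirrors the adjustable number of suggestions described in the abstract.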

Publication: College of Computing and Informatics Drexel University, Philadelphia, Pennsylvania (Peer-Reviewed Journal)

Pub Date: 2021 Doi:

Keywords: Deep Learning, Natural Language, Big Data, Source Code Analysis, Segmentation (Plenty more sections and references in this research article)
