Semantic Apparatus – Bag-of-Words Baselines for Semantic Code Search

Cited by Lee Sonogan

Towards Natural Language Semantic Code Search | The GitHub Blog

Abstract by Xinyu Zhang,1 Ji Xin,1 Andrew Yates,2 and Jimmy Lin1

The task of semantic code search is to retrieve code snippets from a source code corpus based on an information need expressed in natural language. The semantic gap between natural language and programming languages has for long been regarded as one of the most significant obstacles to the effectiveness of keyword-based information retrieval (IR) methods. It is a common assumption that “traditional” bag-of-words IR methods are poorly suited for semantic code search: our work empirically investigates this assumption. Specifically, we examine the effectiveness of two traditional IR methods, namely BM25 and RM3, on the CodeSearchNet Corpus, which consists of natural language queries paired with relevant code snippets. We find that the two keyword-based methods outperform several pre-BERT neural models. We also compare several code-specific data pre-processing strategies and find that specialized tokenization improves effectiveness. Code for reproducing our experiments is available at Net-baseline
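To make the abstract's two main ideas concrete, here is a minimal sketch of BM25 scoring combined with code-aware tokenization. It is not the authors' implementation. The identifier-splitting rule (breaking camelCase and snake_case names into words), the parameter values k1=1.2 and b=0.75, and the toy snippets are all assumptions chosen for illustration. RM3 query expansion is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased word tokens; snake_case and camelCase identifiers are split.

    Assumed rule for illustration, not necessarily the paper's tokenizer.
    """
    tokens = []
    # Underscores and punctuation act as separators, which splits snake_case.
    for chunk in re.findall(r"[A-Za-z0-9]+", text):
        # Split camelCase / PascalCase and keep acronyms together,
        # e.g. "HTTPServer" -> "HTTP", "Server".
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", chunk)
        tokens.extend(p.lower() for p in parts)
    return tokens


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document (k1 and b are assumed defaults)."""
    doc_tokens = [tokenize(d) for d in docs]
    n_docs = len(doc_tokens)
    # Fall back to 1.0 so an all-empty collection does not divide by zero.
    avg_len = (sum(len(t) for t in doc_tokens) / n_docs) or 1.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in doc_tokens:
        df.update(set(toks))

    query_terms = tokenize(query)
    scores = []
    for toks in doc_tokens:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical toy corpus of code snippets.
docs = [
    "def readFileLines(path): return open(path).read().splitlines()",
    "def http_get(url): return requests.get(url).text",
    "def sortByKey(items): return sorted(items, key=lambda x: x.key)",
]
scores = bm25_scores("read lines from a file", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
```

Without identifier splitting, the query words "read", "file", and "lines" would never match the single token `readFileLines`, which is one plausible reason specialized tokenization helps a bag-of-words method on code.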

Publication: 1 David R. Cheriton School of Computer Science, University of Waterloo
2 Max Planck Institute for Informatics, Saarland Informatics Campus (Peer-Reviewed Journal)

Pub Date: 2021 Doi:

Keywords: Semantic Baselines, Code Search, Bag-of-Words (Plenty more sections and references in this research article)
