Abstract
The rapid growth of scientific software development has produced large and complex codebases, making it difficult to search, find, and compare software repositories within the scientific research community. In this paper, we address this problem by using deep learning to learn embeddings that capture semantic similarities among repositories. Our approach focuses on identifying repositories with similar semantics even when their code fragments and documentation differ in syntax. We introduce two complementary open-source tools: RepoSim and RepoSnipy. RepoSim is a command-line toolbox that represents repositories at both the source code and documentation levels. It uses the UniXcoder pre-trained language model, which has demonstrated strong performance on code understanding tasks. RepoSnipy is a web-based neural semantic search engine built on RepoSim; its search interface lets researchers and practitioners query public repositories hosted on GitHub and discover semantically similar ones. Together, RepoSim and RepoSnipy help researchers, developers, and practitioners compare and analyze software repositories, supporting collaboration and code reuse and accelerating the development of scientific software.
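
To make the idea of comparing repositories through embeddings concrete, the sketch below ranks a few repositories against a query vector. It is only an illustration under stated assumptions: the abstract does not say which similarity measure the tools use, so cosine similarity is assumed here; the function names `cosine_similarity` and `rank_repositories`, the repository names, and the 3-dimensional vectors are all hypothetical and do not come from RepoSim or RepoSnipy.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat it as dissimilar to everything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_repositories(query_embedding, index):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in index.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up toy embeddings; a real model would produce far longer vectors.
toy_index = {
    "repo-a": [0.9, 0.1, 0.0],
    "repo-b": [0.0, 1.0, 0.0],
    "repo-c": [0.8, 0.2, 0.1],
}

ranking = rank_repositories([1.0, 0.0, 0.0], toy_index)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In a real pipeline the vectors would come from a pre-trained model applied to a repository's code and documentation rather than being written by hand; only the ranking step is shown here.
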
Original language | English |
---|---|
Title of host publication | Proceedings |
Subtitle of host publication | 2023 IEEE 19th international conference on e-science (e-science) |
Editors | George Angelos Papadopoulos, Rosa Filgueira, Rafael Ferreira Da Silva |
Place of Publication | Piscataway, NJ |
Publisher | IEEE |
Number of pages | 10 |
ISBN (Electronic) | 9798350322231 |
ISBN (Print) | 9798350322248 |
Publication status | Published - 25 Sept 2023 |
Event | 19th IEEE International Conference on eScience, Limassol, Cyprus. Duration: 9 Oct 2023 → 13 Oct 2023. Conference number: 19. https://www.escience-conference.org/2023/ |
Publication series
Name | IEEE international conference on e-science |
---|---|
ISSN (Print) | 2325-372X |
ISSN (Electronic) | 2325-3703 |
Conference
Conference | 19th IEEE International Conference on eScience |
---|---|
Abbreviated title | eScience |
Country/Territory | Cyprus |
City | Limassol |
Period | 9/10/23 → 13/10/23 |
Keywords
- Semantic similarity
- Code search
- Code understanding
- Embeddings
- Pre-trained language models
- GitHub