Mapping the repository landscape: harnessing similarity with RepoSim and RepoSnipy

Zihao Li, Rosa Filgueira*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The rapid growth of scientific software development has led to the emergence of large and complex codebases, making it challenging to search, find, and compare software repositories within the scientific research community. In this paper, we propose a solution by leveraging deep learning techniques to learn embeddings that capture semantic similarities among repositories. Our approach focuses on identifying repositories with similar semantics, even when their code fragments and documentation exhibit different syntax. To address this challenge, we introduce two complementary open-source tools: RepoSim and RepoSnipy. RepoSim is a command-line toolbox designed to represent repositories at both the source code and documentation levels. It utilizes the UniXcoder pre-trained language model, which has demonstrated remarkable performance in code-related understanding tasks. RepoSnipy is a web-based neural semantic search engine that utilizes the powerful capabilities of RepoSim and offers a user-friendly search interface, allowing researchers and practitioners to query public repositories hosted on GitHub and discover semantically similar repositories. RepoSim and RepoSnipy empower researchers, developers, and practitioners by facilitating the comparison and analysis of software repositories. They not only enable efficient collaboration and code reuse but also accelerate the development of scientific software.
Original language: English
Title of host publication: Proceedings
Subtitle of host publication: 2023 IEEE 19th international conference on e-science (e-science)
Editors: George Angelos Papadopoulos, Rosa Filgueira, Rafael Ferreira Da Silva
Place of Publication: Piscataway, NJ
Publisher: IEEE
Number of pages: 10
ISBN (Electronic): 9798350322231
ISBN (Print): 9798350322248
Publication status: Published - 25 Sept 2023
Event: 19th IEEE International Conference on eScience, Limassol, Cyprus
Duration: 9 Oct 2023 to 13 Oct 2023
Conference number: 19
https://www.escience-conference.org/2023/

Publication series

Name: IEEE international conference on e-science
ISSN (Print): 2325-372X
ISSN (Electronic): 2325-3703

Conference

Conference: 19th IEEE International Conference on eScience
Abbreviated title: eScience
Country/Territory: Cyprus
City: Limassol
Period: 9/10/23 to 13/10/23

Keywords

  • Semantic similarity
  • Code search
  • Code understanding
  • Embeddings, pre-trained language models
  • GitHub
