Orchestrating graph and semantic searches for code analysis

Prast, L.T.; Corvino, R.; Reynolds, J.J.; Yang, N.

report

2025

High-tech original equipment manufacturers rely on long-lived, software-intensive complex systems that must continuously evolve to meet fast-paced market demands. This evolution often leads to the accumulation of technical debt, making maintenance increasingly challenging. This legacy code problem highlights the need for efficient strategies to keep these systems adaptable and fit for purpose. Code analysis tools such as GitHub Copilot have integrated Large Language Models (LLMs) with vector databases to answer questions about codebases. These tools excel at answering questions about local code snippets or about the semantics of parts of the code (we call these functional questions), but they are limited in answering questions that require precise and complete knowledge of code dependencies, such as questions about the structure of the code (structural questions). The LLM4Legacy group has developed a tool that integrates LLMs with code graphs to address these structural questions. In this KIP project, both approaches (leveraging vector and graph databases) were combined to answer functional, structural, and hybrid questions (questions with both a functional and a structural component) about legacy software. To achieve this, an orchestration-based retrieval approach was developed that dynamically selects the most suitable retrieval method for each question while limiting the size of the retrieved context so that it does not exceed the context window of the orchestration LLM. Using LangGraph, a multi-source retrieval was implemented that leverages both vector and graph databases. Compared to GitHub Copilot, our results show improved accuracy on structural and hybrid questions while maintaining strong performance on functional questions. This research advances code analysis for legacy software and provides insights into multi-agent coordination.
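
The routing idea described in the abstract can be illustrated with a minimal sketch. It is not the report's implementation: the keyword-based `classify` function stands in for the orchestration LLM, `vector_search` and `graph_search` are hypothetical callables standing in for the vector and graph databases, and the character budget `max_chars` is an assumed proxy for the context window. LangGraph is deliberately not used here, so no library API is implied.

```python
from typing import Callable, List, Tuple

# Hypothetical cue words; the real system would ask an LLM to classify the question.
STRUCTURAL_CUES = ("call", "depend", "inherit", "import", "reference")
FUNCTIONAL_CUES = ("what does", "how does", "purpose", "explain")

Retriever = Callable[[str], List[str]]


def classify(question: str) -> str:
    """Label a question as functional, structural, or hybrid (keyword stand-in)."""
    q = question.lower()
    structural = any(cue in q for cue in STRUCTURAL_CUES)
    functional = any(cue in q for cue in FUNCTIONAL_CUES)
    if structural and functional:
        return "hybrid"
    if structural:
        return "structural"
    return "functional"


def retrieve(
    question: str,
    vector_search: Retriever,
    graph_search: Retriever,
    max_chars: int = 2000,
) -> Tuple[str, List[str]]:
    """Pick retrievers by question type and stop before the budget is exceeded."""
    kind = classify(question)
    sources = {
        "functional": [vector_search],
        "structural": [graph_search],
        "hybrid": [graph_search, vector_search],
    }[kind]

    context: List[str] = []
    used = 0
    for search in sources:
        for chunk in search(question):
            # Assumed budget rule: drop everything that would overflow the window.
            if used + len(chunk) > max_chars:
                return kind, context
            context.append(chunk)
            used += len(chunk)
    return kind, context
```

In this sketch a hybrid question queries the graph source first and then fills the remaining budget from the vector source; that ordering is an arbitrary choice made for illustration.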

Topics

Orchestration; Graph databases; Semantic searches; Code analysis; Legacy software; Large language models; LLMs; Vector databases; Functional questions; Structural questions; Hybrid questions

TNO Identifier

1013964

Repository link

https://resolver.tno.nl/uuid:297e80c3-df7e-468f-817e-86a6a8bed819

Publisher

TNO

Collation

40 p.

Place of publication

Eindhoven

Files

TNO-2025-R10992.pdf
