Orchestrating graph and semantic searches for code analysis

report
High-tech original equipment manufacturers rely on long-lived, software-intensive complex systems that must continuously evolve to meet fast-paced market demands. This evolution often leads to the accumulation of technical debt, making maintenance increasingly challenging. This legacy code problem highlights the need for efficient strategies to keep these systems adaptable and fit for purpose. Code analysis tools like GitHub Copilot have integrated Large Language Models (LLMs) with vector databases to answer questions about codebases. These tools excel at answering questions about local code snippets or about the semantics of parts of the code (we call these functional questions), but they are limited in answering questions that need precise and complete knowledge of code dependencies, such as questions about the structure of the code. The LLM4Legacy group has developed a tool that integrates LLMs with code graphs to address these structural questions. In this KIP project, both approaches (leveraging vector and graph databases) were combined to answer functional, structural, and hybrid questions (questions with both a functional and a structural component) about legacy software. To achieve this, an orchestration-based retrieval approach was developed that dynamically selects the most suitable retrieval method for each question while keeping the retrieved context small enough to fit within the context window of the orchestration LLM. Using LangGraph, a multi-source retrieval was implemented that leverages both vector and graph databases. Compared to GitHub Copilot, our results show improved accuracy on structural and hybrid questions while maintaining strong performance on functional questions. This research advances code analysis for legacy software and provides insights into multi-agent coordination.
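The orchestration idea summarized in the abstract can be illustrated with a minimal sketch. It is not the report's implementation: the report uses an LLM orchestrator built with LangGraph, whereas this sketch assumes a simple keyword heuristic for classifying questions and a crude word count as the token estimate. All names (`classify`, `fit_to_budget`, `retrieve`, the cue lists, and the stub search functions) are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical keyword cues. The report's orchestrator relies on an LLM for
# this decision; a keyword match is only a stand-in for illustration.
STRUCTURAL_CUES = ("call", "depend", "inherit", "import", "reference")
FUNCTIONAL_CUES = ("what does", "purpose", "explain", "how does", "why")


def classify(question: str) -> str:
    """Label a question as 'functional', 'structural', or 'hybrid'."""
    q = question.lower()
    structural = any(cue in q for cue in STRUCTURAL_CUES)
    functional = any(cue in q for cue in FUNCTIONAL_CUES)
    if structural and functional:
        return "hybrid"
    if structural:
        return "structural"
    return "functional"  # default when no structural cue is present


def fit_to_budget(chunks: List[str], budget_tokens: int) -> List[str]:
    """Keep chunks in order until the (assumed) token budget would be exceeded."""
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        cost = len(chunk.split())  # crude token estimate: whitespace-separated words
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


def retrieve(
    question: str,
    vector_search: Callable[[str], List[str]],
    graph_search: Callable[[str], List[str]],
    budget_tokens: int = 50,
) -> Tuple[str, List[str]]:
    """Route the question to the graph store, the vector store, or both."""
    kind = classify(question)
    chunks: List[str] = []
    if kind in ("structural", "hybrid"):
        chunks += graph_search(question)
    if kind in ("functional", "hybrid"):
        chunks += vector_search(question)
    return kind, fit_to_budget(chunks, budget_tokens)
```

In this sketch, a hybrid question draws on both sources and the combined context is then trimmed to the budget, which mirrors the constraint that the retrieved context must fit the context window of the orchestration LLM.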
TNO Identifier
1013964
Publisher
TNO
Collation
40 p.
Place of publication
Eindhoven