System Architecture document: Training Pipeline - GPTNL-DEL-4002

report
The GPT-NL project aims to develop a Dutch-English large language model (LLM) from the ground up to promote technological sovereignty and strengthen the Dutch and broader European LLM ecosystem. Achieving this objective requires a structured systems engineering approach encompassing requirements elicitation, design, implementation, and validation. Beyond the creation of the model itself, sovereignty and community growth depend on transparent dissemination of knowledge about how such a system is built. This document therefore presents the architectural blueprints, in both code and documentation, for the second part of this development phase: the System Architecture of the Training Pipeline. Documenting and systematically managing this technological blueprint is intended to stimulate new research directions and enable future improvements. The GPT-NL System Architecture effort serves as the foundation for these goals by providing a coherent, well-documented engineering framework for large-scale model development.
From a general point of view, system architecture activities provide a structured conceptual model that defines the organization, behavior, and interactions of system components. This model offers a high-level view of how hardware, software, data, and processes work together to achieve the intended system goals. Through clear specification of components, interfaces, and design principles, the architecture ensures that key system attributes, such as performance, scalability, security, and maintainability, are addressed systematically and in alignment with stakeholder requirements and operational constraints.
Within the GPT-NL team, system architecture plays a coordinating role by providing a shared technical framework that guides design, implementation, and verification across teams. This work, conducted under Work Package 13 (WP13), facilitates communication among engineers, researchers, and developers by defining clear interfaces and dependencies. The architecture team ensures design consistency, manages technical risks, and balances tradeoffs among quality attributes. As a result, this document and the associated work help align strategic objectives with technical execution, promoting system coherence, continuity, and effective integration throughout the development lifecycle. The processes, tasks, and artifacts related to the architectural work are depicted in Figure 1. The system architecture team collaborates with all other work packages, but most closely with WP12 (Data Curation), WP14 (Model Development), and WP18 (Data Acquisition and Quality). While WP12 and WP14 lead algorithmic development, such as the selection of filters, models, and training techniques, WP13 focuses on translating these designs into structured, maintainable, and scalable code. This includes defining clean interfaces between modules, ensuring continuous data processing flows suitable for HPC environments, and addressing non-functional aspects such as security, documentation, and energy efficiency. WP18 is responsible for contacting data providers and acquiring or creating datasets, and it works closely with the architecture team to assess data quality during and after the curation phase.
TNO Identifier
1023958
Publisher
TNO
Collation
109 p.