TNO-Auto occupation coder: A novel, large-language model based multilingual automatic coding tool for occupations: abstract

article
Objective Coding of job descriptions to standardised classification systems is an important prerequisite for applying occupational exposure assessment tools in occupational epidemiology studies. We aimed to create an automatic job coding tool using large language models (LLMs) that can handle job description input in different languages and provide output in ISCO-08.
Methods Our approach follows the retrieval-augmented generation technique, which modifies the responses of LLMs based on a specified set of supplementary information. Our supplementary information consisted of domain knowledge documents defining and exemplifying the ISCO-08 ontology and the job coding process, including the official ISCO-08 documentation plus approximately 34,000 ISCO-08 job titles in 28 languages. All domain knowledge text and all free-text input were converted to vector embeddings using OpenAI’s “text-embedding-3-large” embedding model. A retrieval model then compared the input embedding with the domain knowledge embeddings and retrieved the 10 ISCO-08 job codes with the highest embedding similarity. Finally, the original free-text job description, along with its 10 most similar ISCO-08 job codes, was presented in a prompt to OpenAI’s GPT-4o, which selected one ISCO-08 job code as output.
Results The TNO-Auto occupation coder was applied to approximately 26,000 job descriptions in various European languages for a Eurostat job classification competition. Preliminary results show that our model achieved the highest adjusted accuracy among competing teams, at 58%. Further validation is underway to investigate model performance across languages and profession groups.
Conclusion We created an automatic job coding tool that accepts multilingual job descriptions as input and provides ISCO-08 job codes as output. Our general approach may be applied to create automatic coding tools that convert multilingual free-text job or industry descriptions into occupation and industry codes under other classification systems.
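The retrieval and prompting steps described in the Methods can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the embeddings are already available as plain lists of numbers (toy three-dimensional vectors stand in for real "text-embedding-3-large" output), uses cosine similarity as the similarity measure (the abstract does not name the measure), and stops at building the prompt rather than calling an LLM. The names `retrieve_top_k`, `build_prompt`, and `KNOWLEDGE_BASE`, as well as the prompt wording, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, knowledge_base, k=10):
    """Return the k (code, title) pairs most similar to query_vec."""
    scored = [
        (cosine_similarity(query_vec, vec), code, title)
        for code, title, vec in knowledge_base
    ]
    # Highest similarity first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(code, title) for _, code, title in scored[:k]]


def build_prompt(job_description, candidates):
    """Assemble a prompt listing the retrieved candidate codes (wording is assumed)."""
    lines = [f"{code}: {title}" for code, title in candidates]
    return (
        f"Job description: {job_description}\n"
        "Candidate ISCO-08 codes:\n"
        + "\n".join(lines)
        + "\nSelect the single best-matching code."
    )


# Toy knowledge base: (ISCO-08 code, title, hypothetical embedding).
KNOWLEDGE_BASE = [
    ("2221", "Nursing professionals", [0.9, 0.1, 0.0]),
    ("7112", "Bricklayers and related workers", [0.0, 0.2, 0.9]),
    ("2341", "Primary school teachers", [0.1, 0.9, 0.1]),
]

# Hypothetical embedding of the Dutch input "verpleegkundige" (nurse).
query = [0.8, 0.2, 0.1]
top = retrieve_top_k(query, KNOWLEDGE_BASE, k=2)
prompt = build_prompt("verpleegkundige", top)
```

In a full pipeline, the vectors would come from the embedding model and `prompt` would be sent to the LLM, whose reply is the selected code. Restricting the LLM to a short list of retrieved candidates keeps its choice within valid ISCO-08 codes.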
Abstract from: 30th Epidemiology in Occupational Health Conference (EPICOH 2025), hosted by the Institute for Risk Assessment Sciences, Utrecht University, 6–9 October 2025, Utrecht, the Netherlands
TNO Identifier
1019023
Source
Occupational and Environmental Medicine, 82(suppl. 2), pp. A25.
Pages
A25