Investigating the effect of different fine-tuning configuration scenarios on agricultural term extraction using BERT

Article
This paper compares different transformer-based language models for automatic term extraction from agriculture-related texts. Agriculture is an important economic sector faced with severe environmental and societal challenges. The collection, annotation and sharing of agricultural scientific knowledge is key to enabling the agricultural sector to address these challenges. Automatic term extraction is a Natural Language Processing task that can support text tagging and annotation for better knowledge and information exchange. It is concerned with identifying, in text, terms pertaining to a domain or area of expertise, and is an important step in knowledge base creation and update pipelines. Transformer-based language modeling technologies like BERT have become popular for automatic term extraction, but limited work has so far applied these methods to agriculture. This paper systematically compares Agriculture-BERT to Sci-BERT, RoBERTa, and vanilla BERT, all fine-tuned for the automatic extraction of agricultural terms from English texts. The greatest challenge faced in our research was the scarcity of agriculture-related gold standard corpora for measuring automatic term extraction performance. Our results show that, with a few exceptions, Agriculture-BERT outperforms the other models considered in our research. The main contribution and novelty of the presented research is the investigation of the impact that different language model fine-tuning configuration scenarios have on the term extraction task. More specifically, we tested scenarios in which different model layers were kept frozen, or updated, during training, to measure their impact on Agriculture-BERT's performance in automatic term extraction. Our results show that the best performance was achieved by: (i) the “embedding layer updated + all encoder layers updated” scenario for identifying terms also seen during training; (ii) the “embedding layer frozen + all encoder layers updated” scenario for identifying terms that are synonyms of those seen during training; and (iii) the “embedding layer updated + top 4 encoder layers updated” scenario for identifying terms neither seen during training nor synonymous with seen terms (novel terms).
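As a reading aid, not part of the published record: a minimal sketch of how the three freezing scenarios named in the abstract could be configured for a HuggingFace-style BERT token-classification model. The checkpoint name, label count, and the `configure` helper are illustrative assumptions, not the authors' actual code.

    from typing import Optional
    from transformers import BertForTokenClassification

    # Term extraction framed as token classification, e.g. B-Term / I-Term / O tags.
    # "bert-base-uncased" stands in for whichever pretrained model is fine-tuned.
    model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=3)

    def configure(model, freeze_embeddings: bool, trainable_encoder_layers: Optional[int]):
        """Freeze or unfreeze parameter groups before fine-tuning.

        trainable_encoder_layers=None -> update all encoder layers
        trainable_encoder_layers=4    -> update only the top 4 encoder layers
        """
        # Embedding layer: frozen or updated depending on the scenario.
        for param in model.bert.embeddings.parameters():
            param.requires_grad = not freeze_embeddings

        # Encoder layers: unfreeze only those at or above the cutoff index.
        layers = model.bert.encoder.layer
        cutoff = 0 if trainable_encoder_layers is None else len(layers) - trainable_encoder_layers
        for i, layer in enumerate(layers):
            for param in layer.parameters():
                param.requires_grad = i >= cutoff

    # Scenario (i): embedding layer updated + all encoder layers updated
    configure(model, freeze_embeddings=False, trainable_encoder_layers=None)
    # Scenario (ii): embedding layer frozen + all encoder layers updated
    # configure(model, freeze_embeddings=True, trainable_encoder_layers=None)
    # Scenario (iii): embedding layer updated + top 4 encoder layers updated
    # configure(model, freeze_embeddings=False, trainable_encoder_layers=4)

After calling `configure`, only parameters with `requires_grad=True` receive gradient updates during fine-tuning, so passing `model.parameters()` to a standard optimizer realizes the chosen scenario.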
TNO Identifier
997818
ISSN
0168-1699
Source
Computers and Electronics in Agriculture, 225