TY - JOUR
T1 - Large language models overcome the challenges of unstructured text data in ecology
AU - Castro, Andry
AU - Pinto, João
AU - Reino, Luís
AU - Pipek, Pavel
AU - Capinha, César
N1 - Funding Information:
AC was supported by a grant (PRT/BD/152100/2021) financed by the Portuguese Foundation for Science and Technology (FCT) under the MIT Portugal Program. AC and CC acknowledge support from FCT through support to the CEG/IGOT Research Unit (UIDB/00295/2020 and UIDP/00295/2020). JP was funded by FCT through funds to GHTM (UID/04413/2020). LR was funded through the FCT contract ‘CEECIND/00445/2017’ under the ‘Stimulus of Scientific Employment—Individual Support’ and by the FCT ‘UNRAVEL’ project (PTDC/BIA-ECO/0207/2020; https://doi.org/10.54499/PTDC/BIA-ECO/0207/2020). PP acknowledges support from the Czech Science Foundation (project no. 23-07278S).
Publisher Copyright:
© 2024 The Authors
PY - 2024/9
Y1 - 2024/9
N2 - The vast volume of currently available unstructured text data, such as research papers, news, and technical report data, shows great potential for ecological research. However, manual processing of such data is labour-intensive, posing a significant challenge. In this study, we aimed to assess the application of three state-of-the-art prompt-based large language models (LLMs), GPT-3.5, GPT-4, and LLaMA-2-70B, to automate the identification, interpretation, extraction, and structuring of relevant ecological information from unstructured textual sources. We focused on species distribution data from two sources: news outlets and research papers. We assessed the LLMs for four key tasks: classification of documents with species distribution data, identification of regions where species are recorded, generation of geographical coordinates for these regions, and supply of results in a structured format. GPT-4 consistently outperformed the other models, demonstrating a high capacity to interpret textual data and extract relevant information, with the percentage of correct outputs often exceeding 90% (average accuracy across tasks: 87–100%). Its performance also depended on the data source type and task, with better results achieved with news reports, in the identification of regions with species reports and presentation of structured output. Its predecessor, GPT-3.5, exhibited slightly lower accuracy across all tasks and data sources (average accuracy across tasks: 81–97%), whereas LLaMA-2-70B showed the worst performance (37–73%). These results demonstrate the potential benefit of integrating prompt-based LLMs into ecological data assimilation workflows as essential tools to efficiently process large volumes of textual data.
KW - AI
KW - Automation
KW - Data integration
KW - GPT
KW - LLaMA
KW - Unstructured data
UR - http://www.scopus.com/inward/record.url?scp=85200389928&partnerID=8YFLogxK
U2 - 10.1016/j.ecoinf.2024.102742
DO - 10.1016/j.ecoinf.2024.102742
M3 - Article
AN - SCOPUS:85200389928
SN - 1574-9541
VL - 82
JO - Ecological Informatics
JF - Ecological Informatics
M1 - 102742
ER -