TY - GEN
T1 - Archives Metadata Text Information Extraction into CIDOC-CRM
AU - Varagnolo, Davide
AU - Melo, Dora
AU - Rodrigues, Irene Pimenta
AU - Rodrigues, Rui
AU - Couto, Paula
N1 - info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/UIDP%2F04516%2F2020/PT#
info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/UIDB%2F00297%2F2020/PT#
Publisher Copyright: © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
AB - This paper presents an information extraction approach for extracting events and entities from ISAD(G) elements that contain semi-structured text descriptions. Natural language processing is carried out with two methodologies: the ANNIE system, for which suitable gazetteers and JAPE rules are defined to process the text and extract the intended information; and a reduced Portuguese BERT language model fine-tuned for semantic role labelling. The information extraction processes are evaluated on a sample of 1000 records for each type of information considered, namely baptism events and passport requisitions, and a corresponding dataset is built manually for each type. The CIDOC-CRM knowledge base is automatically populated with the newly linked events and entities using several automatic information extraction processes. SPARQL queries over the information represented in CIDOC-CRM, which is obtained both from the migration of DigitArq records and from the text descriptions, allow new ways of visualising the archival records and retrieving information from different sources, including archives' digital repositories.
KW - Archives linked data semantic representation
KW - Knowledge discovery
KW - Knowledge representation
KW - Natural language processing
KW - Semantic web
UR - http://www.scopus.com/inward/record.url?scp=85174232773&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-43471-6_9
DO - 10.1007/978-3-031-43471-6_9
M3 - Conference contribution
AN - SCOPUS:85174232773
SN - 978-3-031-43470-9
T3 - Communications in Computer and Information Science
SP - 195
EP - 216
BT - Knowledge Discovery, Knowledge Engineering and Knowledge Management
A2 - Coenen, Frans
A2 - Fred, Ana
A2 - Aveiro, David
A2 - Dietz, Jan
A2 - Bernardino, Jorge
A2 - Masciari, Elio
A2 - Filipe, Joaquim
PB - Springer
CY - Cham
T2 - 14th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2022
Y2 - 24 October 2022 through 26 October 2022
ER -