Archives Metadata Text Information Extraction into CIDOC-CRM

Davide Varagnolo, Dora Melo, Irene Pimenta Rodrigues, Rui Rodrigues, Paula Couto

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

This paper presents an Information Extraction approach to extract events and entities from ISAD(G) elements with semi-structured text descriptions. Natural language processing is done using two methodologies: the ANNIE system, by defining proper Gazetteers and JAPE rules to process the text and extract the intended information; and a reduced Portuguese BERT language model that is fine-tuned for Semantic Role Labelling. The Information Extraction processes are evaluated on a sample of 1000 records for each type of information considered, baptism events and passport requisitions, with a corresponding dataset built manually for each type. The CIDOC-CRM knowledge base is automatically populated with newly linked events and entities, using several automatic information extraction processes. The use of SPARQL queries to explore the information represented in CIDOC-CRM, obtained from the migration of DigitArq records and extracted from text descriptions, allows new ways of visualising the archival records and retrieving information from different sources, including archives' digital repositories.
Original language: English
Title of host publication: Knowledge Discovery, Knowledge Engineering and Knowledge Management
Subtitle of host publication: 14th International Joint Conference, IC3K 2022, Valletta, Malta, October 24–26, 2022, Revised Selected Papers
Editors: Frans Coenen, Ana Fred, David Aveiro, Jan Dietz, Jorge Bernardino, Elio Masciari, Joaquim Filipe
Place of Publication: Cham
Publisher: Springer
Pages: 195-216
Number of pages: 22
ISBN (Electronic): 978-3-031-43471-6
ISBN (Print): 978-3-031-43470-9
Publication status: Published - 2023
Event: 14th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2022 - Valletta, Malta
Duration: 24 Oct 2022 to 26 Oct 2022

Publication series

Name: Communications in Computer and Information Science
Publisher: Springer
Volume: 1842 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 14th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2022
Country/Territory: Malta
City: Valletta
Period: 24/10/22 to 26/10/22

Keywords

  • Archives linked data semantic representation
  • Knowledge discovery
  • Knowledge representation
  • Natural language processing
  • Semantic web
