Abstract
This article presents the research and development of a large-scale image search system that was applied to launch a worldwide innovative service for searching billions of historical images archived from the web since the 1990s. The contributions of this work were applied to enhance the Arquivo.pt web archive with an image-search service in which users submit text queries, through a web user interface or an API, and immediately receive a list of historical web-archived images. However, supporting image search over web archives raised new challenges. The data to be processed was large and heterogeneous, totaling over 530 TB of historical web data published since the early days of the web. The main contributions of this work are a toolkit of algorithms that extract textual metadata to describe web-archived images, a system architecture and workflow to index large amounts of web-archived images considering their specific temporal features, and a ranking algorithm to order image-search results by relevance. This research was applied to launch an enhanced image-search service that has been publicly available since March 2021. All the developed software is fully available as free open-source software.
Original language | English |
---|---|
Title of host publication | 2023 IEEE 10th International Conference on Data Science and Advanced Analytics, DSAA 2023 - Proceedings |
Editors | Yannis Manolopoulos, Zhi-Hua Zhou |
Publisher | Institute of Electrical and Electronics Engineers (IEEE) |
Number of pages | 10 |
ISBN (Electronic) | 979-8-3503-4503-2 |
ISBN (Print) | 979-8-3503-4504-9 |
Publication status | Published - 2023 |
Event | 10th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2023 - Thessaloniki, Greece. Duration: 9 Oct 2023 → 12 Oct 2023 |
Conference
Conference | 10th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2023 |
---|---|
Country/Territory | Greece |
City | Thessaloniki |
Period | 9/10/23 → 12/10/23 |
Keywords
- Image search
- web archive
- web archive information retrieval