Perspectives for machine-assisted subject indexing at the Berlin State Library
Implications for the library
With publication numbers constantly increasing, the question arises of how the growing and extensive collections will be indexed in the future. How can an acceptable level of quality be ensured if the available human resources do not increase at the same rate? One way to meet this challenge is to develop systems for semi-automated subject indexing. This means using machines to support subject librarians in their day-to-day work, reducing time and effort so that any freed capacity can be spent on in-depth indexing of ambiguous and more complex cases. One possibility is to import automatically generated suggestions into the Digital Assistant DA-3 (Beckmann et al. 2019), similar to how the ZBW – Leibniz Information Center for Economics already integrates keyword suggestions with “zbwase” (Kasprzik 2023, p. 5). Human expertise should not be excluded in the process, but rather used for a fruitful interaction with the machine, in the sense of a human-machine system or “human in the loop”.
Subject indexing with Annif
Institutions such as the ZBW and the German National Library (DNB) have already accumulated experience in the field of automated indexing. Both institutions use Annif, an open-source toolkit developed at the National Library of Finland (Suominen et al. 2022). The great strength of Annif is its modularity, which allows it to be used independently of its original context and makes it widely adaptable. For example, the language and controlled vocabulary can be specified for each project and used to train a corresponding model.
To train a model with Annif, these three ingredients are needed:
- a vocabulary including all of the potential keywords that can be used during the process of indexing
- high-quality indexed works, consisting of full texts or metadata (e.g. titles, and possibly abstracts and tables of contents) with manually assigned keywords
- a working installation of Annif
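To make the second ingredient more concrete, the following minimal sketch writes already indexed titles into a tab-separated file of the kind Annif can read as a training corpus. It assumes the layout for short texts as we understand it from the Annif documentation: one document per line, the text in the first column, followed by the subject URIs in angle brackets. The function name `to_annif_tsv`, the sample titles and the `example.org` URIs are hypothetical placeholders.

```python
def to_annif_tsv(records):
    """Serialize (text, [subject URIs]) pairs as tab-separated lines.

    Assumed layout: text <TAB> <uri1> <TAB> <uri2> ...
    """
    lines = []
    for text, uris in records:
        # Collapse tabs and line breaks so each document stays on one line
        clean_text = " ".join(text.split())
        columns = [clean_text] + [f"<{uri}>" for uri in uris]
        lines.append("\t".join(columns))
    return "\n".join(lines) + "\n"


# Hypothetical sample records: a title plus manually assigned subject URIs
records = [
    (
        "Introduction to library\tclassification",
        ["http://example.org/vocab/c1", "http://example.org/vocab/c2"],
    ),
    ("A history of the printed book", ["http://example.org/vocab/c3"]),
]

tsv = to_annif_tsv(records)
print(tsv, end="")
```

In practice, the records would come from an export of the catalogue, and the URIs would have to match the vocabulary that was loaded into the Annif project.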
The lexical (e.g. MLLM, STWFSA) or statistical (e.g. fastText, Omikuji) algorithms already integrated in the tool can then be applied to the training data to train a model. Different models can also be combined into so-called ensembles. In addition to the command line, Annif offers a web user interface and a REST API (a programming interface) through which, for example, suggestions from the trained models can be provided.
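As an illustration of how suggestions could be fetched through the REST API, here is a minimal sketch using only the Python standard library. It assumes a locally running Annif server at `http://localhost:5000`, a project with the hypothetical identifier `my-project`, and the `suggest` endpoint with the form parameters `text`, `limit` and `threshold` as we understand them from the Annif API documentation. The sample response is hand-written for demonstration and is not real Annif output.

```python
import json
import urllib.parse
import urllib.request


def build_suggest_request(base_url, project_id, text, limit=5, threshold=0.0):
    """Build a POST request for the (assumed) suggest endpoint of a project."""
    project = urllib.parse.quote(project_id)
    url = f"{base_url.rstrip('/')}/v1/projects/{project}/suggest"
    form = {"text": text, "limit": limit, "threshold": threshold}
    body = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def parse_suggestions(payload):
    """Return (label, score) pairs from a JSON response, best score first."""
    results = json.loads(payload).get("results", [])
    pairs = [(item["label"], item["score"]) for item in results]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def suggest(base_url, project_id, text, **options):
    """Send the request to a running server and parse its answer."""
    request = build_suggest_request(base_url, project_id, text, **options)
    with urllib.request.urlopen(request) as response:
        return parse_suggestions(response.read().decode("utf-8"))


# Offline demonstration: no server is contacted here
request = build_suggest_request(
    "http://localhost:5000", "my-project", "A history of the printed book"
)
sample_response = json.dumps({
    "results": [
        {"uri": "http://example.org/vocab/c2", "label": "indexing", "score": 0.42},
        {"uri": "http://example.org/vocab/c3", "label": "book history", "score": 0.87},
    ]
})
suggestions = parse_suggestions(sample_response)
print(request.full_url)
print(suggestions)
```

With a server running, `suggest("http://localhost:5000", "my-project", "some title")` would perform the actual call. Keeping request building and response parsing separate makes it easier to show the suggestions to a subject librarian for review before anything is written to the catalogue.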
Next steps
Within the project “Human.Machine.Culture”, and more specifically within sub-project 3 “AI-supported content analysis and subject indexing”, we aim to follow the current developments in the field of automated subject indexing and are excited about the potential perspectives for subject indexing at the SBB. We will therefore enter into an intensive dialogue with our own colleagues in subject indexing as well as with the DNB and ZBW. First, we plan to collect use cases for semi-automatic indexing; we then want to extend our testing of Annif with the help of these concrete scenarios. Finally, any results deriving from this project (including data and models) will be published in a suitable, reusable format so that other institutions can benefit from them as well.