automated subject indexing – Mensch.Maschine.Kultur

The team of sub-project 3 “AI-supported content analysis and subject indexing” in the BKM-financed research project Human.Machine.Culture (“Mensch.Maschine.Kultur” / MMK) successfully organized an internal workshop on semi-automatic subject indexing at the Berlin State Library (SBB) on 19 July 2023 (cf. Eberhardt 2011 and Golub 2021 as introductory/additional reading on this topic).

How can automation effectively support subject indexing? Which hopes and ideas, but also which challenges, need to be considered? How can we approach this task together across disciplines? We were eager to discuss these and further questions in more detail with our colleagues who work on subject indexing. Many colleagues involved with subject indexing at the SBB attended the workshop and took an active part in it, and we subsequently received positive and useful feedback. All of this lays a good foundation for further cooperation between the project and colleagues from various departments on the tasks ahead.

Workshop Agenda

After a quick round of introductions, the project team presented the MMK project and sub-project 3 as well as the current state of the art in (semi-)automatic subject indexing, both in general and within our project.

Slides workshop semi-automatic subject indexing @ SBB [1.07 MB]

Afterwards, the open source tool Annif, which is used, for instance, in projects at the DNB, ZBW and TIB, was introduced and demonstrated. Following a quick Q&A session, the interactive part of the workshop began: the participants formed smaller groups to discuss specific requirements and wishes as well as the feasibility and usefulness of their ideas.

Results

The groups raised fundamental questions, such as the desired degree of automation and the required quality level of subject indexing, as well as more specific wishes and suggestions. Novel ideas were also mentioned, for example involving users in the indexing process itself and, if necessary, normalizing freely assigned keywords by automated methods. The following key points provide a broad overview of the topics and ideas discussed:

the wish to reduce workload and to gain time for relevant conceptual tasks such as vocabulary maintenance
the identification of gaps and problems in the target vocabularies: Which keywords or classes are missing, which are no longer needed (or are no longer up-to-date)?
the consideration of regional departments (“Regionale Sonderabteilungen”) and languages other than German and English, e.g. in the creation of training data
the re-use of additional metadata, e.g. from formal indexing or concordances
the wish for a clearer definition of the concept of quality and for a transparent assessment of quality that takes both qualitative and quantitative measures into account
the demand for assistance systems that facilitate capturing contents, for instance by delivering information on the occurring languages, the degree of abstraction or the temporal coverage of a publication
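
As one illustration of what a quantitative quality measure could look like, the following minimal sketch compares automatically suggested keywords with manually assigned ones using precision, recall and F1. This is an assumption made for illustration, not a measure agreed on in the workshop; the function name and the sample keywords are hypothetical.

```python
def precision_recall_f1(suggested, assigned):
    """Compare suggested keywords with manually assigned (reference) keywords.

    Both arguments are iterables of keyword labels or identifiers.
    Returns a (precision, recall, f1) tuple of floats.
    """
    suggested, assigned = set(suggested), set(assigned)
    # Keywords that the machine suggested and the librarian also assigned.
    true_positives = len(suggested & assigned)
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(assigned) if assigned else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four suggestions, three manually assigned keywords.
machine = ["Bibliothek", "Inhaltserschließung", "Automatisierung", "Berlin"]
human = ["Bibliothek", "Inhaltserschließung", "Schlagwort"]
precision, recall, f1 = precision_recall_f1(machine, human)
```

Figures like these only cover the quantitative side. A qualitative review by subject librarians would still be needed to judge whether a "wrong" suggestion is in fact an acceptable alternative.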

Outlook

Although the workshop ran slightly over time, most of the participants stayed until the end, which we were very pleased about. We are convinced that we will soon be able to specify the requirements and ideas developed in this workshop in more detail, and we are already looking forward to continuing our cooperation in this area.

Implications for the library

With the constant increase in publication numbers, the question arises of how the growing and extensive collections will be indexed in the future. How can an acceptable level of quality be ensured if the available human resources do not increase at the same rate? One way to meet this challenge is to develop systems for semi-automated subject indexing. This means using machines to support subject librarians in their day-to-day work, so that time and effort are reduced and the freed capacity can be spent on in-depth indexing of ambiguous and far more complex cases. One possibility is to import automatically generated suggestions into the Digital Assistant DA-3 (Beckmann et al. 2019), similar to how the ZBW – Leibniz Information Center for Economics already integrates keyword suggestions with “zbwase” (Kasprzik 2023, p. 5). Human expertise should not be excluded in this process, but rather used for a fruitful interaction with the machine, in the sense of a human-machine system or “human in the loop”.

Subject indexing with Annif

Institutions such as the ZBW and the German National Library (DNB) have already accumulated experience in the field of automated indexing. Both institutions use Annif, an open source toolkit developed at the National Library of Finland (Suominen et al. 2022). The great strength of Annif is its modularity, which allows it to be used independently of its original context and makes it widely adaptable. For example, the language and the controlled vocabulary can be specified for each project and used to train a corresponding model.

To train a model with Annif, these three ingredients are needed:

a vocabulary including all of the potential keywords that can be used during the process of indexing

high-quality indexed works, consisting of full texts or metadata (e.g. titles; abstracts and tables of contents could be used as well) with manually assigned keywords

a working installation of Annif
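
To make the second ingredient more concrete, here is a minimal sketch of how indexed records might be turned into a tab-separated training file. It assumes a layout of one document per line, with the text followed by a tab and the subject URIs in angle brackets; this layout should be checked against the Annif documentation before use. The function name, the sample titles and the example.org URIs are hypothetical.

```python
def to_tsv_line(text, subject_uris):
    """Format one indexed record as a single tab-separated line.

    Whitespace inside the text (including tabs and line breaks) is
    collapsed to single spaces so that the only tab on the line is the
    separator between the text and the subject URIs.
    """
    clean_text = " ".join(text.split())
    uris = " ".join(f"<{uri}>" for uri in subject_uris)
    return f"{clean_text}\t{uris}"


# Hypothetical records: a title plus manually assigned subject URIs.
records = [
    ("Bibliotheken im digitalen Wandel",
     ["http://example.org/subjects/1", "http://example.org/subjects/2"]),
    ("Introduction to subject indexing",
     ["http://example.org/subjects/3"]),
]

corpus_lines = [to_tsv_line(text, uris) for text, uris in records]
# The lines can then be written to a file, for example:
# with open("training.tsv", "w", encoding="utf-8") as f:
#     f.write("\n".join(corpus_lines) + "\n")
```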

The lexical (e.g. MLLM, STWFSA) or statistical (e.g. fastText, Omikuji) algorithms that are already integrated in the tool can then be applied to the training data to train a model. Different models can also be combined into so-called ensembles. In addition to command-line use, Annif offers a web user interface and a REST API (a programming interface) through which, for example, suggestions based on the pre-trained models can be provided.
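
The idea behind an ensemble can be sketched as a weighted average of the scores that the individual models assign to each keyword. The following is a simplified illustration of that principle, not Annif's own implementation; the model names, weights, keywords and scores are hypothetical.

```python
from collections import defaultdict


def combine_suggestions(model_outputs, weights, limit=5):
    """Merge keyword scores from several models by weighted averaging.

    model_outputs maps a model name to a dict {keyword: score in [0, 1]}.
    weights maps each model name to a non-negative weight.
    A keyword that a model did not suggest counts as score 0 for that model.
    Returns up to `limit` (keyword, score) pairs, best first.
    """
    total_weight = sum(weights[name] for name in model_outputs)
    summed = defaultdict(float)
    for name, suggestions in model_outputs.items():
        for keyword, score in suggestions.items():
            summed[keyword] += weights[name] * score
    averaged = [(kw, value / total_weight) for kw, value in summed.items()]
    # Sort by descending score; break ties alphabetically for stable output.
    averaged.sort(key=lambda pair: (-pair[1], pair[0]))
    return averaged[:limit]


# Hypothetical output of one lexical and one statistical model.
outputs = {
    "lexical": {"libraries": 0.8, "indexing": 0.4},
    "statistical": {"indexing": 0.6, "automation": 0.5},
}
ranking = combine_suggestions(outputs, {"lexical": 1.0, "statistical": 1.0})
```

In this example "indexing" ends up ranked first because both models suggest it, even though neither gives it its highest score. This is the typical benefit of combining a lexical with a statistical approach.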

Next steps

Within the project “Human.Machine.Culture”, and more specifically within sub-project 3 “AI-supported content analysis and subject indexing”, we aim to follow current developments in the field of automated subject indexing and are excited about the potential perspectives for subject indexing at the SBB. We will therefore enter into an intensive dialogue with our own colleagues in subject indexing as well as with the DNB and ZBW. First, we plan to collect use cases for semi-automatic indexing; we then want to extend our testing of Annif with the help of these concrete scenarios. Last but not least, any results of this project (including data and models) will be published in a suitable, reusable format so that other institutions can benefit from them as well.

Workshop Semi-Automatic Subject Indexing at the Berlin State Library