Workshop on Semi-Automatic Subject Indexing at the Berlin State Library
On 19 July 2023, the team of sub-project 3, “AI-supported content analysis and subject indexing”, in the BKM-funded research project Human.Machine.Culture (“Mensch.Maschine.Kultur” / MMK) held an internal workshop on semi-automatic subject indexing at the Berlin State Library (SBB). For introductory and additional reading on this topic, cf. Eberhardt 2011 and Golub 2021.
How can we employ automation to effectively support subject indexing? Which hopes and ideas, but also which challenges, should be considered? How can we approach this task together across different disciplines? We were eager to discuss these and further questions in more detail with our colleagues who actively work on subject indexing. Many of them attended and took an active part in the workshop, and the feedback we received afterwards was positive and useful. With a view to the tasks ahead, this provides a good foundation for further cooperation between the project and colleagues from various departments.
Workshop Agenda
After a quick round of introductions, the project team presented the MMK project and sub-project 3, followed by the current state of the art in (semi-)automatic subject indexing, both in general and specifically within our project.
Afterwards, the open source tool Annif, which is used for instance in projects at the DNB, ZBW and TIB, was introduced and demonstrated. A quick Q&A session was followed by the interactive part of the workshop, in which the participants formed smaller groups to discuss specific requirements and wishes as well as the feasibility and usefulness of their ideas.
Results
The participants raised very fundamental questions, such as the desired degree of automation or the quality level of subject indexing, as well as more specific wishes and suggestions. Novel ideas were also mentioned, for example involving users in the indexing process itself and, if necessary, normalizing freely assigned keywords by means of automated methods. The following key points provide a broad overview of the topics and ideas discussed:
- the wish to reduce workload and to gain time for relevant conceptual tasks such as vocabulary maintenance
- the identification of gaps and problems in the target vocabularies: Which keywords or classes are missing, and which are no longer needed or no longer up to date?
- the consideration of regional departments (“Regionale Sonderabteilungen”) and languages other than German and English, e.g. in the creation of training data
- the re-use of additional metadata, e.g. from formal indexing or concordances
- the wish for further clarification of the concept of quality and for a transparent assessment of quality that takes both qualitative and quantitative measures into account
- the demand for assistance systems that facilitate capturing the content of a publication, for instance by delivering information on the languages it contains, its degree of abstraction or its temporal coverage
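To make the point about quantitative quality measures more tangible: the workshop did not settle on specific metrics, but one common starting point is to compare automatically suggested subject terms with those assigned intellectually and to compute precision, recall and F1 per document. The following is only a minimal sketch under that assumption; the function names and the example terms are hypothetical and do not come from the project or from Annif.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    precision: float
    recall: float
    f1: float


def score_document(suggested: set[str], reference: set[str]) -> Scores:
    """Compare suggested subject terms with intellectually assigned ones."""
    hits = len(suggested & reference)
    # Share of suggested terms that are also in the reference indexing.
    precision = hits / len(suggested) if suggested else 0.0
    # Share of reference terms that the suggestions recovered.
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return Scores(precision, recall, f1)


def macro_average(scores: list[Scores]) -> Scores:
    """Average per-document scores so that every document counts equally."""
    if not scores:
        return Scores(0.0, 0.0, 0.0)
    n = len(scores)
    return Scores(
        sum(s.precision for s in scores) / n,
        sum(s.recall for s in scores) / n,
        sum(s.f1 for s in scores) / n,
    )


if __name__ == "__main__":
    # Hypothetical example terms, for illustration only.
    suggested = {"Subject indexing", "Automation"}
    reference = {"Subject indexing", "Automation", "Libraries", "Metadata"}
    print(score_document(suggested, reference))
```

Figures like these cover only the quantitative side; they say nothing about whether a suggested term that is missing from the reference would nevertheless have been acceptable, which is where the qualitative assessment comes in.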
Outlook
Although the workshop ran slightly over time, most of the participants stayed until the end, which we were very pleased about. We are convinced that we will soon be able to specify the requirements and ideas developed in this workshop even further, and we are already looking forward to continuing the cooperation in this area.


