We are delighted to announce that the Zooniverse teams at the Adler Planetarium and the University of Minnesota have been awarded a Digital Extension Grant from the American Council of Learned Societies.
Optimizing Crowdsourced Transcription using Handwritten Text Recognition will explore the use of machine learning within online crowdsourced text transcription projects. We will train a machine-learning model for handwritten text recognition (HTR) using tens of thousands of pages of text transcribed by Zooniverse volunteers for the Anti-Slavery Manuscripts project (ASM), and create a workflow prototype to combine machine-generated transcriptions with crowdsourced effort using the collaborative transcription tools created for ASM. We will then test the HTR model on similar datasets from UMN’s Archives & Special Collections. Ultimately, we hope to create a viable prototype for uploading machine transcription data into the Zooniverse platform, and to produce an evaluation of best practices for combining human and machine effort in the production of high-quality transcription data.
The project co-directors are Dr. Samantha Blickhan (Zooniverse Humanities Lead), Dr. Benjamin Wiggins (Director of the Digital Arts, Sciences, & Humanities (DASH) Program for University Libraries and Assistant Professor of History at UMN), and Dr. Darryl Wright (Research Associate in Physics and Astronomy at UMN).
Read the full announcement and view the list of awardees here.
Regarding HTR – have you thought of integrating Transkribus and Zooniverse?
https://transkribus.eu/Transkribus/ It seems a shame to reinvent the HTR wheel!
Great point — we’ve actually had initial conversations with the Transkribus team in the past, as we’re big fans of their work!
While we’re training our own model in this particular case, one of the major outputs for this prototype is actually the data pipeline: linking up offline automated processes and being able to send HTR results into the project builder for volunteer review. The idea is that project builders would ultimately be able to train their own model or use any kind of existing service (like Transkribus), rather than being restricted to our model.
So while the integration you mention isn’t part of this effort here, this work certainly isn’t closing any doors to this type of opportunity in the future!
I love transcribing documents and feel sad that the fruits of my labours are to underpin machine learning, which may be used for good or ill. I continue with the work because it is challenging and rewarding, but regret that future generations will be denied this pleasure by AI.
Hi P,
Thanks for sharing your thoughts. I imagine you’re not alone in your concerns. However, I’ll note there are also volunteers who regularly ask if their transcriptions are going toward machine training efforts and feel that not exploring automated options is a waste of effort. Luckily, this effort aims to expand the options of research teams, while acknowledging that machine learning approaches won’t be appropriate for all projects.
I can say with confidence that we are nowhere near a point where human transcription is made obsolete by machine transcription. Even to get a machine to the point where it can interpret a specific script, human transcription is necessary to train the model. We also know that trained models aren’t perfect, and still require human review.
Our goals for this project are fairly small-scale: we want to build a prototype for teams who want to upload machine-transcribed text into the project builder for volunteers to review and edit. This will still require close reading of primary source material from our volunteers, but our hope is that it will help to ensure that the resulting transcribed text is useful for archivists, researchers, historians, teachers, and members of the public.
Before we move past the prototyping stage, I know, at least from my perspective, that we will be soliciting feedback from our community about this type of transcription review project. My guess is that there will be people who like the review option, and those who prefer transcribing from scratch. Luckily, there are more than enough digital images of text to go around, and we have no intention of requiring projects to use machine learning if it isn’t appropriate for them.
Looking forward to more conversation around this topic as we start working on our project.