Efficiency and Integrity: The Case for Human-AI Partnership in Oral History Transcription

by Jesse R. Elam, Associate Professor, Meiji Gakuin University Center for Liberal Arts Education

My sabbatical at the Japanese Cultural Center of Hawaiʻi (JCCH), from April 2025 to February 2026, focused on building an automatic speech recognition (ASR) system that can transcribe oral history recordings as efficiently and accurately as possible. While some interviews in the JCCH archives feature speakers whose first language is Japanese, the broader goal extends beyond any one accent. It is about improving how AI systems handle non-native and accented English more generally to reduce linguistic bias.

I felt the importance of that goal the first time I sat alone in the Tokioka Heritage Resource Center. I put on headphones and pressed play. The speaker moved between English and Japanese, pausing occasionally as memories surfaced. There was a rhythm to the storytelling that felt rooted in another time, deliberate, reflective, layered. As I listened, I glanced at my laptop screen. For the first time, my AI transcription system was turning archival audio into text in real time. It was impressively fast, but it wasn’t perfect. A vowel shifted in a surname. A place name looked slightly unfamiliar. A pause was interpreted as something it wasn’t. A few sentences were simply wrong. I remember thinking: This is powerful, but it is also fragile.

That fragility is not random. Speech recognition systems learn from massive datasets, and when those datasets overrepresent standard American or British accents, performance on anything outside those norms can decline. Accented speech is not misheard because it is unclear. It is misheard because it is underrepresented. Over time, I began to recognize recurring patterns in the errors: vowel substitutions influenced by first-language phonology, segmentation problems caused by rhythm differences, and hallucinated insertions triggered by repetition or background noise. These patterns appeared across speakers with different linguistic backgrounds, and they pointed back to how large AI models are trained.
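
To make those patterns concrete, here is a minimal sketch of how such errors can be tallied by aligning an ASR draft against a verified reference transcript. It is an illustration, not the JCCH pipeline itself; the example sentence and the Waialua/Wailua confusion are hypothetical, and the alignment uses only Python's standard library.

```python
# A minimal sketch (not the JCCH system) of tallying error types by aligning
# a verified reference transcript against an ASR draft. Real evaluations
# typically use a dedicated WER tool, but the alignment idea is the same.
from difflib import SequenceMatcher

def tally_errors(reference: str, hypothesis: str) -> dict[str, int]:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    counts = {"substitutions": 0, "insertions": 0, "deletions": 0}
    for op, r1, r2, h1, h2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if op == "replace":
            # Overlapping span: paired words are substitutions; any surplus
            # on either side counts as an insertion or deletion.
            paired = min(r2 - r1, h2 - h1)
            counts["substitutions"] += paired
            counts["insertions"] += (h2 - h1) - paired
            counts["deletions"] += (r2 - r1) - paired
        elif op == "insert":   # words in the draft with no source in the audio
            counts["insertions"] += h2 - h1
        elif op == "delete":   # words the model dropped
            counts["deletions"] += r2 - r1
    return counts

# Hypothetical example: a shifted place name plus a hallucinated repetition.
ref = "we worked at the waialua plantation camp"
hyp = "we worked at the wailua plantation camp camp"
print(tally_errors(ref, hyp))  # {'substitutions': 1, 'insertions': 1, 'deletions': 0}
```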

Before coming to the JCCH, I spent several years studying how ASR systems handle accented speech. In my most recent benchmark study, I tested leading models under both ideal and challenging conditions. For native English speakers, accuracy often exceeded 97% in clean recordings and remained around 91% even in spontaneous or noisy speech. For Japanese speakers of English, accuracy was roughly 90% in clean recordings and frequently dropped into the low 80s in less controlled settings. On paper, those numbers may not seem dramatic. But 80% accuracy means roughly two out of every ten words are wrong. Across a long oral history interview, that margin becomes significant.
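
A quick back-of-the-envelope calculation shows what those percentages mean at interview scale. The 5,000-word interview length below is an assumed illustrative figure, not a measurement from the archive.

```python
# Back-of-the-envelope: what an accuracy figure means at interview scale.
interview_words = 5_000  # a typical long oral history; hypothetical value
for accuracy in (0.97, 0.91, 0.90, 0.80):
    wrong = round(interview_words * (1 - accuracy))
    print(f"{accuracy:.0%} accuracy -> roughly {wrong} errors to review")
# 97% -> 150, 91% -> 450, 90% -> 500, 80% -> 1000
```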

Working directly with the JCCH’s oral history recordings made the stakes clearer. A slightly altered surname is not just a typo, because it can affect search results and historical interpretation. A misrecognized plantation camp name can shift geographic understanding. If transcripts become the primary way researchers, students, and community members access the archives, then consistent inaccuracies in transcribing accented speech shape which stories are easy to find and which are obscured. In that sense, transcription accuracy is not only a technical issue. It influences visibility, and visibility shapes memory.

Those lessons shaped the design of the system I built at the JCCH. Rather than relying on a single AI model, I integrated multiple high-performing systems and compared their outputs. I treated preprocessing as a deliberate choice rather than an automatic step, because earlier experiments showed that excessive filtering can introduce new distortions. The goal was not to create a perfect system, but to reduce predictable weaknesses and make the workflow more accountable through comparison, testing, and repeatable decisions.
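
The comparison step can be sketched briefly. This is not the production code, and the two drafts below are hypothetical stand-ins for the outputs of whichever models are integrated; the point is the mechanism: where independent systems disagree, flag the span for a human listener.

```python
# A minimal sketch of cross-model comparison, assuming two ASR systems whose
# outputs are plain text. Spans where the drafts diverge are flagged so a
# reviewer knows exactly where to listen closely.
from difflib import SequenceMatcher

def disagreements(draft_a: str, draft_b: str) -> list[tuple[str, str]]:
    a, b = draft_a.split(), draft_b.split()
    flagged = []
    for op, a1, a2, b1, b2 in SequenceMatcher(None, a, b).get_opcodes():
        if op != "equal":  # replace, insert, or delete: the models disagree here
            flagged.append((" ".join(a[a1:a2]), " ".join(b[b1:b2])))
    return flagged

# Hypothetical drafts of the same sentence from two different models.
draft_a = "my family lived near the Waialua sugar mill"
draft_b = "my family lived near the Wailua sugar mill"
for span_a, span_b in disagreements(draft_a, draft_b):
    print(f"REVIEW: model A heard {span_a!r}, model B heard {span_b!r}")
```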

Even with these safeguards, the AI is only the first step. After the system produces a draft, transcriptionists review it carefully. They listen again to confirm names, locations, and phrasing. They catch subtle vowel shifts. They recognize when repetition reflects emphasis rather than error. They understand historical and cultural nuance that no model can infer. The workflow is designed to support their expertise. By automating the mechanical first pass, the software allows transcriptionists to focus on accuracy and meaning rather than raw typing. The result is a deliberate partnership: AI increases efficiency, and human listeners protect integrity.

At this stage, the custom JCCH oral history system transcribes with an average accuracy of around 92% on noisy legacy formats and about 97% on clean recordings. I plan to continue refining the system with the goal of increasing accuracy further. An important next phase of this project will be fine-tuning. After completing benchmarking across JCCH recordings and identifying consistent error patterns specific to these archives, I plan to explore adapting state-of-the-art models using verified transcripts from these oral histories. Fine-tuning would allow the models to become more sensitive to the pronunciation patterns, pacing, and vocabulary common in this collection. The intention is not speculative modification, but evidence-based improvement, building directly on the benchmark findings generated during this sabbatical.
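
Concretely, fine-tuning of this kind starts from paired data: audio segments matched with their verified transcripts. The sketch below shows one way that preparation step might look; the file layout and JSONL manifest format are assumptions for illustration, not the archive's actual structure.

```python
# A sketch of preparing fine-tuning data: pair reviewed transcript segments
# with their audio clips in a simple JSONL manifest, a format many
# fine-tuning toolkits can consume. Paths and fields are illustrative
# assumptions, not the JCCH archive's actual layout.
import json
from pathlib import Path

def build_manifest(segments_dir: Path, out_path: Path) -> int:
    """Each segment is a pair: clip_0001.wav and clip_0001.txt (verified text)."""
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for wav in sorted(segments_dir.glob("*.wav")):
            txt = wav.with_suffix(".txt")
            if not txt.exists():
                continue  # skip clips whose transcripts are not yet verified
            record = {"audio": str(wav), "text": txt.read_text(encoding="utf-8").strip()}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count

n = build_manifest(Path("segments"), Path("finetune_manifest.jsonl"))
print(f"wrote {n} training pairs")
```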

Looking back, that first listening session captured the heart of the project. I saw both the promise and the limitations of AI transcription. This sabbatical has been about narrowing that gap by using research to inform design, combining strong models with careful evaluation, and centering human review at every stage. The JCCH oral history transcription system is the result: research-driven, deliberately engineered, and focused on transcribing non-native and accented speech as accurately as possible. Listening well is not automatic for AI. It requires attention, intention, and iteration. And when the voices being preserved carry cultural memory, it is worth designing systems that listen with care.