Visual Speech Recognition

Improving Speech Perception in Noise through Artificial Intelligence

Raghavan, Arun M.
Lipschitz, Noga MD
Breen, Joseph T. MD
Samy, Ravi N. MD
Kohlberg, Gavriel D. MD

¹University of Cincinnati College of Medicine, Cincinnati, Ohio, USA
²Department of Otolaryngology–Head and Neck Surgery, University of Cincinnati/Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, USA
³Department of Otolaryngology–Head and Neck Surgery, University of Washington, Seattle, Washington, USA

Gavriel D. Kohlberg, MD, Division of Otology and Neurotology, Department of Otolaryngology–Head and Neck Surgery, University of Washington, Virginia Merrill Bloedel Hearing Research Center, 1701 NE Columbia Rd, CHDD Clinic Bldg, Rm CD176, Seattle, WA 98195, USA. Email: [email protected]

Received October 29, 2019

Accepted April 15, 2020

Otolaryngology - Head & Neck Surgery 163(4):p 771-777, October 2020. | DOI: 10.1177/0194599820924331

Objectives

To compare speech perception (SP) in noise for normal-hearing (NH) individuals and individuals with hearing loss (IWHL) and to demonstrate improvements in SP with use of a visual speech recognition program (VSRP).

Study Design

Single-institution prospective study.

Setting

Tertiary referral center.

Subjects and Methods

Eleven NH and 9 IWHL participants were tested in a sound-isolated booth, facing a speaker through a window. In non-VSRP conditions, SP was evaluated on 40 Bamford-Kowal-Bench speech-in-noise test (BKB-SIN) sentences presented by the speaker at 50 A-weighted decibels (dBA), with multiperson babble noise presented at levels from 50 to 75 dBA. SP was defined as the percentage of words correctly identified. In VSRP conditions, an infrared camera tracked 35 points around the speaker’s lips in real time during speech. Lip movement data were translated into text by a neural network–based VSRP developed in-house. SP was evaluated as in the non-VSRP condition on 42 BKB-SIN sentences, with the addition of the VSRP output presented to the listener on a screen.
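The SP definition above (percentage of words correctly identified) can be illustrated with a minimal sketch. The matching rule used here is an assumption, not the study's scoring protocol: words are compared case-insensitively, punctuation is ignored, word order is ignored, and every word in the target sentence counts equally. The function names and the example sentences are hypothetical.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", sentence.lower())


def words_correct(target, response):
    """Return (matched, total) word counts for one sentence.

    Assumption: a target word counts as correct if it appears anywhere in
    the response; repeated words are matched at most as often as they
    occur in the target.
    """
    target_counts = Counter(tokenize(target))
    response_counts = Counter(tokenize(response))
    matched = sum((target_counts & response_counts).values())
    return matched, sum(target_counts.values())


def speech_perception(pairs):
    """Percentage of target words correctly identified across all sentences."""
    correct = 0
    total = 0
    for target, response in pairs:
        matched, n_words = words_correct(target, response)
        correct += matched
        total += n_words
    if total == 0:
        raise ValueError("no target words to score")
    return 100.0 * correct / total


# Hypothetical example: one partially repeated sentence, one missed entirely.
example = [
    ("The clown had a funny face", "the clown had a face"),
    ("The car engine is running", ""),
]
print(f"SP = {speech_perception(example):.1f}%")
```

Pooling words across sentences, rather than averaging per-sentence percentages, is a design choice of this sketch; the two approaches differ when sentences vary in length.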

Results

In high-noise conditions (70-75 dBA) without VSRP, NH listeners achieved significantly higher SP than IWHL listeners (38.7% vs 25.0%, P = .02). NH listeners were significantly more accurate with VSRP than without VSRP (75.5% vs 38.7%, P < .0001), as were IWHL listeners (70.4% vs 25.0%, P < .0001). With VSRP, no significant difference in SP was observed between NH and IWHL listeners (75.5% vs 70.4%, P = .15).
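For readers who want the effect sizes implied by these percentages, the short sketch below derives the percentage-point differences from the reported group means only. It uses no participant-level data and does not reproduce the statistical tests; the variable and function names are illustrative.

```python
# Mean SP (%) in high-noise conditions, as reported in the Results.
sp = {
    "NH": {"without_vsrp": 38.7, "with_vsrp": 75.5},
    "IWHL": {"without_vsrp": 25.0, "with_vsrp": 70.4},
}


def vsrp_gain(group):
    """Percentage-point change in mean SP when VSRP output is added."""
    return round(sp[group]["with_vsrp"] - sp[group]["without_vsrp"], 1)


def group_gap(condition):
    """Percentage-point difference in mean SP, NH minus IWHL."""
    return round(sp["NH"][condition] - sp["IWHL"][condition], 1)


for group in sp:
    print(f"{group}: gain with VSRP = {vsrp_gain(group)} points")
for condition in ("without_vsrp", "with_vsrp"):
    print(f"NH - IWHL gap ({condition}) = {group_gap(condition)} points")
```

The between-group gap is a difference of means, so it says nothing on its own about significance; the P values above carry that information.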

Conclusions

The VSRP significantly increased SP in high-noise conditions for both NH and IWHL participants and eliminated the significant difference in SP accuracy between the two groups.