Automatic speech recognition

Speech recognition is a technology that recognises spoken language and converts it into text. As such, it is very valuable in hospitals, law courts and countless other settings. The applications of automatic speech recognition include voice dialling for phone calls, data entry (e.g. entry of credit card numbers) and conversion of speech to text. An example of the last is one of the first commercial applications of speech recognition, ‘Dragon Dictate’, which was launched in about 1990 and is currently available as ‘Dragon Professional’ (now part of Nuance). This enables people to generate texts even if they are no longer able to use their hands to write or type. In recent years, however, speech recognition has increasingly penetrated everyday life, in areas such as communication with household appliances and speech translation. Since 2010, neural networks have also proved useful in speech recognition (see ‘Neural machine translation’).

History of speech recognition

Development of automatic speech recognition started at Bell Labs in 1952, not long after initial attempts at computer-aided machine translation of texts. At this stage, speech recognition was limited to individual speakers' voices and a vocabulary of about ten words. Initial enthusiasm generated funding, at least at Bell Labs, but progress was restricted by limited computer capability and the lack of appropriate algorithms. Raj Reddy started to work on speech recognition as a graduate student at Stanford University in the 1960s, and continued this work at Carnegie Mellon University with his students James and Janet M. Baker in the 1970s. James Baker brought experience with hidden Markov models (statistical models often used in temporal pattern recognition) and Janet Baker had carried out research in neurophysiology and artificial intelligence. In 1982 the Bakers founded Dragon Systems, which released the first commercial speech dictation software (Dragon Dictate) in 1990 at a cost of USD 9,000. Dragon technology was ultimately taken over by Nuance, and is now incorporated in Apple’s Siri, IBM’s Watson and a wide range of applications incorporating speech recognition. In 1987, a speech recogniser was released by Kurzweil Applied Intelligence (also taken over by Nuance); Ray Kurzweil is an inventor and entrepreneur who now works full time at Google on machine learning and language processing, and has predicted that machines will reach human levels of translation quality by the year 2029. Raj Reddy’s former student Kai-Fu Lee joined Apple in 1990, where he helped to develop a speech interface prototype (1992), and in 1993 another of Raj Reddy’s students, Xuedong Huang, founded the speech recognition group at Microsoft.

Recent applications of speech recognition

Google, Apple, Microsoft (and more recently Amazon) have since invested heavily in speech recognition. Google Assistant is an ‘intelligent personal assistant’ that can control Google Home smart loudspeakers, which can give you the weather forecast, play music, read out news headlines, update shopping lists and support home automation. Apple’s Siri originally used speech recognition technology from Nuance. It first appeared in the iPhone 4S in 2011 and is now an integral part of Apple’s products, including the iPad, iPod Touch, Mac, Apple TV and HomePod. Siri is conceived as a ‘digital personal assistant’ and is able to find information when requested. Microsoft’s speech recognition products include Cortana, a digital assistant that is integrated into Microsoft’s ‘Edge’ browser, and Skype, which allows one to speak in one language and have it simultaneously translated (or ‘interpreted’) into another language (spoken and written as subtitles). This translation feature is currently available in ten languages: English, French, German, Chinese (Mandarin), Italian, Spanish, Portuguese, Arabic, Japanese and Russian. Microsoft claims that, with an error rate of 5.1%, its speech recognition system is able to transcribe conversational speech as well as (or even better than) humans. Amazon launched Alexa (named after the ancient library in Alexandria) in 2014 as a virtual assistant and speech recognition tool in combination with a device called Echo. According to Amazon, Alexa’s voice-control system “lets you speak your wishes to an Echo smart speaker and see them fulfilled—at least simple ones, like dimming your lights or playing music tracks”. Amazon is currently leading Google in the ‘battle to control your home’: it is estimated that Amazon has Echos in 8% of American households, about ten times as many as Google’s devices. Samsung is one of the most recent companies to introduce a digital assistant with speech recognition. Bixby was introduced in 2017, and is used in Samsung’s Galaxy mobile phones as well as its Family Hub 2.0 refrigerators.
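
Error rates such as the 5.1% quoted for Microsoft are usually reported as a word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the system's transcript into a human reference transcript, divided by the number of words in the reference. The following is only a minimal sketch of that calculation in Python, not the evaluation code used by Microsoft or any other vendor; the function name word_error_rate and the example sentences are assumptions made purely for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name), not part of any library.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match if cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one substituted word and one dropped word.
    reference = "please dim the lights in the living room"
    hypothesis = "please dim the light in living room"
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

In this made-up example the transcript has one wrong word (‘light’ for ‘lights’) and one missing word (‘the’), so two errors against eight reference words give a WER of 25%. Because inserted words also count as errors, the WER can exceed 100% for a very poor transcript.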

Most of these current applications make use of neural networks (see ‘Neural machine translation’).
