Engineering speech recognition with machine learning

Speech recognition processes human speech as input, allowing users to communicate with machines (e.g., computers, smartphones and home assistants) and machines to respond by voice.

To work correctly, such software must be able to “transcribe” the complexities inherent in human speech, such as rhythm, speaking rate and intonation.

Speech recognition is also known as “automatic speech recognition” (ASR), “computer speech recognition” or “speech-to-text” (STT). Voice recognition, a related technology, is used for the biometric identification of specific speakers.

What distinguishes humans from robots is emotion; the voice in a person’s speech therefore conveys both a semantic message and an emotional one. Speech emotion recognition (SER) is a type of speech recognition whose purpose is to establish a speaker’s underlying emotional state by analyzing their voice. Emotion detection has many applications, including web-based e-learning, audio surveillance, call centers, computer games and clinical studies. Popular products such as Amazon’s Alexa, Apple’s Siri and Google Maps employ speech recognition.
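As an illustrative sketch only (not a method described in this article), an SER system can be reduced to its simplest form: represent each utterance by a few acoustic features and assign it the emotion label of the nearest class centroid. The feature values and labels below are invented toy data, not drawn from any real emotion corpus.

```python
import math

# Toy (energy, pitch-proxy) feature vectors per emotion; the numbers are
# invented for illustration, not taken from any real SER dataset.
training = {
    "angry":   [(0.90, 0.80), (0.80, 0.70), (0.95, 0.75)],
    "neutral": [(0.40, 0.40), (0.50, 0.45), (0.45, 0.50)],
    "sad":     [(0.10, 0.20), (0.15, 0.25), (0.20, 0.15)],
}

def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

centroids = {label: centroid(pts) for label, pts in training.items()}

def classify(features):
    """Return the emotion label whose centroid is closest to `features`."""
    return min(centroids, key=lambda lab: math.dist(features, centroids[lab]))

print(classify((0.85, 0.72)))  # a loud, high-pitched utterance
```

Real SER systems use far richer features (e.g., spectral and prosodic descriptors) and trained classifiers, but the pipeline shape is the same: audio in, feature vector, predicted emotion out.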

Machine learning (ML) software measures spoken words as a set of numbers that represent the speech signal.
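To make that concrete, the sketch below (my own illustration, with a synthetic tone standing in for recorded speech) splits a signal into short overlapping frames and computes two classic per-frame measurements, short-time energy and zero-crossing rate. This is one simple way a speech signal becomes the set of numbers an ML model consumes.

```python
import math

def frame_signal(signal, frame_len, hop):
    """Split a 1-D list of samples into overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs that change sign."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

# A synthetic 101 Hz tone sampled at 8 kHz stands in for real speech.
sr = 8000
signal = [math.sin(2 * math.pi * 101 * n / sr) for n in range(sr)]  # 1 second

frames = frame_signal(signal, frame_len=200, hop=100)  # 25 ms frames, 12.5 ms hop
features = [(short_time_energy(f), zero_crossing_rate(f)) for f in frames]
print(len(features), features[0])
```

Production systems typically use richer representations such as spectrograms or MFCCs, but they follow the same framing idea.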

Key challenges in automating speech recognition
