
Meta Description: Human transcribers are at risk of becoming obsolete now that Microsoft's latest speech recognition system matches human accuracy.

Title: Robots Stole My Job: Will Human Transcribers Be Replaced by Microsoft's Newest Speech
Recognition System?

In 2004, the economists Frank Levy and Richard Murnane published The New Division of Labor, a book detailing which jobs would one day be rendered obsolete by automation.

According to their research, truck drivers could rest easy: driving requires a type of information processing - expert thinking and complex decision-making - that would most likely survive the computer takeover.

Fast forward to today, and companies like Daimler are running a series of tests to fine-tune the first autonomous truck in the US. Their results suggest that eliminating "driver variability" makes these trucks more fuel-efficient, which also shrinks their carbon footprint.

Now, with Microsoft's speech recognition system reaching human levels of accuracy, it looks like transcribers will soon be added to the robots-stole-my-job list.

According to Microsoft, the latest update brought the system to a 5.1 percent word error rate on the Switchboard speech recognition task, down from last year's 5.9 percent - a relative reduction the company puts at about 12 percent. Since 5.9 percent is also the error rate Microsoft measured for professional human transcribers on the same task, the system is now performing at the same level as trained humans.
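
In back-of-the-envelope terms, using the rounded figures quoted above, the improvement works out like this:

    # Relative word-error-rate reduction from the rounded published figures;
    # Microsoft quotes "about 12 percent", and rounding in the 5.9/5.1
    # numbers accounts for the small gap.
    old_wer, new_wer = 5.9, 5.1
    relative_reduction = (old_wer - new_wer) / old_wer * 100
    print(f"{relative_reduction:.1f}% relative reduction")   # 13.6%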

But what's even more interesting is that the technical report on the system shows the transcripts produced by human transcribers and those produced by the automatic speech recognition (ASR) system were strikingly similar.
Both made the most mistakes on the same short function words, both found the same speakers easy or hard to transcribe, and human judges could not reliably tell whether a given transcript came from a person or a computer.
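
Word error rate, the yardstick behind all of these comparisons, counts how many word substitutions, insertions, and deletions separate a transcript from the reference text. Here is a minimal sketch of the standard calculation (the function and example sentences are my own, not from the report):

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """WER = word edit distance / reference length, via the classic
        Levenshtein dynamic program over words."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[len(ref)][len(hyp)] / len(ref)

    # Short function words ("uh", "is", "the") are exactly where both the
    # humans and the system slipped most often.
    print(word_error_rate("so uh it is a deal", "so it is uh the deal"))  # 0.5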

Microsoft achieved this in part by increasing the vocabulary of its language models from 30,500 words to 164,000, which lowered the out-of-vocabulary (OOV) rate - the share of spoken words the system simply does not have in its lexicon.
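
To see why that matters, here is a toy illustration of the OOV rate (the word lists are invented for the example):

    # Toy illustration of the out-of-vocabulary rate; word lists invented.
    def oov_rate(transcript_words, vocabulary):
        unknown = [w for w in transcript_words if w not in vocabulary]
        return len(unknown) / len(transcript_words)

    words = "the quarterly forecast mentioned cryptocurrency twice".split()
    small_vocab = {"the", "quarterly", "forecast", "mentioned", "twice"}
    large_vocab = small_vocab | {"cryptocurrency"}   # expanded lexicon

    print(oov_rate(words, small_vocab))   # about 0.17: one word in six is unknown
    print(oov_rate(words, large_vocab))   # 0.0: every word is in the lexicon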

Accuracy was further improved by adding a CNN-BLSTM (Convolutional Neural Network - Bidirectional Long Short-Term Memory) model to the system's existing set of acoustic models.

The CNN-BLSTM pairs convolutional layers, which pick out local patterns in the audio, with a bidirectional LSTM that reads each utterance both forwards and backwards before scoring what was said.
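
As a rough sketch - not Microsoft's actual architecture or layer sizes - a CNN-BLSTM acoustic model in PyTorch might look like this:

    import torch
    import torch.nn as nn

    class CNNBLSTM(nn.Module):
        """A toy CNN-BLSTM: convolutions capture local spectral patterns,
        a bidirectional LSTM then reads the whole utterance in both
        directions. Sizes are illustrative, not Microsoft's configuration."""
        def __init__(self, n_mels=40, n_classes=9000):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1),   # (batch, 32, time, mels)
                nn.ReLU(),
                nn.Conv2d(32, 32, kernel_size=3, padding=1),
                nn.ReLU(),
            )
            self.blstm = nn.LSTM(input_size=32 * n_mels, hidden_size=512,
                                 num_layers=2, bidirectional=True, batch_first=True)
            self.classifier = nn.Linear(2 * 512, n_classes)   # forward + backward states

        def forward(self, spectrogram):                       # (batch, time, n_mels)
            x = self.conv(spectrogram.unsqueeze(1))           # add a channel dimension
            batch, channels, time, mels = x.shape
            x = x.permute(0, 2, 1, 3).reshape(batch, time, channels * mels)
            x, _ = self.blstm(x)                              # (batch, time, 1024)
            return self.classifier(x)                         # per-frame class scores

    scores = CNNBLSTM()(torch.randn(2, 100, 40))              # two 100-frame utterances
    print(scores.shape)                                       # torch.Size([2, 100, 9000])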

On the language-modeling side, the report pairs two complementary LSTM models: a character-based LSTM, which represents words as sequences of characters rather than whole-word tokens, and a dialogue session-based LSTM, which conditions on the entire conversation so far. This two-pronged approach lets the ASR handle rare words while drawing on a much wider conversational context.
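
Language models like these are typically used to rescore the recognizer's candidate transcripts. Here is a minimal sketch of that idea, where the scorer functions and weights are hypothetical stand-ins rather than Microsoft's actual components:

    # N-best rescoring with two language models; everything here is a
    # placeholder to show the shape of the technique.
    def rescore(nbest, history, char_lstm_score, session_lstm_score,
                char_weight=0.5, session_weight=0.5):
        """Return the candidate transcript the combined models like best.

        nbest:   (transcript, acoustic score) pairs from the recognizer
        history: earlier utterances in the conversation (session context)
        """
        best, best_score = None, float("-inf")
        for text, acoustic_score in nbest:
            score = (acoustic_score
                     + char_weight * char_lstm_score(text)                  # rare-word friendly
                     + session_weight * session_lstm_score(history, text))  # conversation context
            if score > best_score:
                best, best_score = text, score
        return best

    # Toy stand-in scorers, just to make the sketch runnable:
    pick = rescore(
        nbest=[("i scream", -1.2), ("ice cream", -1.3)],
        history=["what dessert do you want"],
        char_lstm_score=lambda text: -0.1 * len(text),
        session_lstm_score=lambda history, text: 2.0 if "ice" in text else 0.0,
    )
    print(pick)  # "ice cream": the conversational context tips the balance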

The team built and trained all of this on Microsoft's own stack: the open-source Cognitive Toolkit 2.1 (CNTK) to train the system's neural networks, and the Azure cloud platform to build, manage, and deploy the results.
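
For flavor, here is a minimal sketch of what defining and training a small network looks like in CNTK 2.1's Python API - the layer sizes and random data are placeholders, not Microsoft's actual setup:

    import numpy as np
    import cntk as C

    feature_dim, num_classes = 40, 10      # hypothetical sizes for this toy

    x = C.input_variable(feature_dim)
    y = C.input_variable(num_classes)

    model = C.layers.Sequential([
        C.layers.Dense(512, activation=C.relu),
        C.layers.Dense(num_classes)
    ])
    z = model(x)

    loss = C.cross_entropy_with_softmax(z, y)
    error = C.classification_error(z, y)

    lr = C.learning_rate_schedule(0.01, C.UnitType.minibatch)
    trainer = C.Trainer(z, (loss, error), [C.sgd(z.parameters, lr)])

    # One minibatch of random data, just to show the training call.
    features = np.random.randn(16, feature_dim).astype(np.float32)
    labels = np.eye(num_classes, dtype=np.float32)[np.random.randint(num_classes, size=16)]
    trainer.train_minibatch({x: features, y: labels})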

While there is still much to improve - accuracy in noisy environments, for instance, or with speakers who have heavy accents - the researchers remain motivated to keep refining their ASR system.

So while it took Microsoft 25 years to build a highly functional ASR system for conversational speech - long considered one of the hardest problems in speech recognition because it is affected by countless variables - more advanced technology could see such systems outperform humans within 5-10 years.
So unless you're part of Microsoft's ASR development team, you'd better start polishing your resume and checking out the job market.
