
Get Up to Speed With NLP
Natural Language Processing for Non-Technical Readers:
Techniques, Trends, and Business Use Cases
2019 has been a landmark year for the field of natural language processing,
more commonly referred to as NLP. In the last couple of years, we’ve seen
a fierce race among models and researchers for the top spot on the
leaderboards of a variety of NLP tasks, from reading comprehension to
sentiment analysis. From the rise of self-supervised learning and unstructured
data to major model breakthroughs such as the Transformer models and
BERT, the past year has been anything but boring for the realm of NLP.

All of these techniques, which were once mainly restricted to the research
area, are now becoming much more mainstream and translating into real-world
business applications. With the increasing availability of massive
neural network models pre-trained on publicly available unlabeled data,
companies can now leverage these NLP models on their own data within
their organization.

This white paper will go over the emerging trends and techniques in the field
of NLP; the recent landmark breakthroughs in NLP architecture, in particular
with regard to the Attention technique and Transformer models; and
finally, the emerging business applications of NLP that these technological
breakthroughs are unlocking, and that we’re likely to see implemented at a
large scale across organizations in the years to come.

1 ©2020 Dataiku, Inc. | | | @dataiku

NLP in a Nutshell
NLP is a branch of machine learning and AI which deals with human language, and more specifically with bridging the gap between
human communication and computer understanding. Its practical applications range from extracting topics from documents, to
analyzing the sentiment of customer reviews on social media, to gaining insight into the needs and struggles of people calling
customer support services, and even to building near-human conversational agents that take load off these call centers.

NLP sounds like a very niche thing, but it’s actually incredibly prevalent. You’ve probably encountered a natural language processing
system in your day-to-day life without realizing it. Some common subfields of NLP are:

• Question answering (search engines)

• Speech recognition (Siri, Alexa)
• Machine translation - translating from one language to another (Google Translate)
• Information extraction - pulling relevant details from unstructured and/or structured data (like important info
from health records, relevant news that could impact a trade for a trading algorithm, etc.)
• Sentiment analysis - detecting the attitude (positive, negative, neutral) of a piece of text (used by businesses on
their social media comments or for customer service, etc.)


The data must be cleaned and annotated (labeled) so that it can be processed by an algorithm. It’s also worth noting that newer
techniques can leverage unlabeled data to pre-train models, which are then trained or fine-tuned on labeled data (read
more about this in the next section).

Cleaning usually involves deconstructing the data into words or chunks of words (tokenization), removing words that carry little
meaning on their own (stop words such as a, the, an), making the data more uniform (like changing all words to lowercase), and
grouping words into predefined categories such as the names of persons (entity extraction). All of this can be done using the spaCy
library in Python.
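
As a rough illustration of the cleaning steps described above, here is a minimal sketch in plain Python. It is not spaCy's API: the `preprocess` function and the tiny `STOP_WORDS` set are hypothetical names chosen for this example, and entity extraction is left out because it requires a trained model.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch);
# real libraries ship much longer, language-specific lists.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    # Lowercasing first makes the data uniform before tokenization.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("The delivery was late and the box was damaged."))
# ['delivery', 'was', 'late', 'box', 'was', 'damaged']
```

In practice a library such as spaCy would handle these steps with far more linguistic care, for example around punctuation and contractions.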

Annotation boils down to examining surrounding words and using language rules or statistics to tag parts of speech (similar to how
we would use context clues to guess the meaning of a word).

After preprocessing, the text data is transformed into numerical data, since machine learning models can only handle numerical
input. Traditionally, the two main vectorization techniques that have been used most widely are Count Vectorization and Term
Frequency-Inverse Document Frequency (TF-IDF).


Count Vectorization involves counting the number of appearances of each word in a document or document section (i.e., a distinct
piece of text such as an article, a book, or a paragraph).
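
To make this concrete, the sketch below builds count vectors for two toy documents using only the Python standard library. The function name `count_vectorize` and the example sentences are assumptions made for illustration; whitespace splitting stands in for real tokenization.

```python
from collections import Counter

def count_vectorize(documents):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word seen in any document.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # One number per vocabulary word: how often it occurs in this document.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat", "the cat saw the dog"]
vocab, vectors = count_vectorize(docs)
print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```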

The TF-IDF approach also takes into account how many documents a word appears in. It takes the logarithm of the size of the set of
documents divided by the number of documents in which the word appears (the inverse document frequency). This is then multiplied by
the term frequency to get a score. A word with a high TF-IDF score is frequent in a given document but rare across the collection,
which means that it is good at discriminating between documents. This can be very useful, unlike Count Vectorization, which only
counts how many times a word occurs.
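
The scoring can be sketched as follows, assuming the plain textbook form: term frequency multiplied by log(N / document frequency). Libraries typically use smoothed variants of this formula, so their numbers will differ; the function name `tf_idf` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: score} dictionary per document (unsmoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    scores = []
    for doc in tokenized:
        counts = Counter(doc)
        scores.append({
            # term frequency * inverse document frequency
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

docs = ["the cat sat", "the dog barked", "the bird sang"]
scores = tf_idf(docs)
print(scores[0]["the"])  # 0.0, because "the" appears in every document
print(scores[0]["cat"])  # positive, because "cat" appears in only one document
```

A word that occurs everywhere gets a score of zero, which is the discriminating behavior that simple counts lack.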

Finally, a third technique called word embedding has nowadays become the dominant approach to vectorization. Embedding is a
type of word representation that allows words with similar meaning to have a similar representation by mapping them to vectors of
real numbers. Unlike older methods, word embeddings are able to represent implicit relationships between words that are useful
when training on data that can benefit from contextual information.
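
The idea that similar words get similar vectors can be illustrated with cosine similarity. The three-dimensional vectors below are made up for this sketch; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical, hand-written embeddings used only for illustration.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "banana": [0.1, 0.05, 0.95],
}

royal = cosine_similarity(embeddings["king"], embeddings["queen"])
fruit = cosine_similarity(embeddings["king"], embeddings["banana"])
print(royal > fruit)  # True: "king" is closer to "queen" than to "banana"
```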

Once a baseline has been created (the “rough draft” NLP model), its prediction accuracy is tested using a test subset. The model is
built using the training subset and then evaluated on the test subset to see whether it generalizes. We don’t want a model that
only gives accurate predictions for one specific dataset!
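
The split between training and test subsets can be sketched as below. The helper names `train_test_split` and `accuracy`, the 80/20 ratio, and the fixed seed are assumptions for this example rather than anything prescribed by the text.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction of them for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predictions, labels):
    """Share of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

data = list(range(10))
train, test = train_test_split(data)
print(len(train), len(test))  # 8 2
```

The model only ever sees `train` while it is being built; `accuracy` is then computed on its predictions for `test`.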


In the last year there have been significant empirical breakthroughs in NLP. One key research trend that stands out is the rise
of transfer learning in NLP, which refers to the use of massive pre-trained models that can then be fine-tuned for specific
language-related tasks. Transfer learning makes it possible to reuse knowledge from previously built models, which can give a
boost in performance while demanding much less labeled training data.

Pretraining models to learn high- and low-level features has already been transformative in computer vision, largely via
ImageNet. ImageNet is a dataset of annotated images that contains more than 20,000 categories. A typical category, such as
"balloon" or "strawberry," consists of several hundred annotated images. Researchers in image processing fields have used
this public data to pre-train huge convolutional neural network (CNN) models.

This method could be further scaled up to generate gains in NLP tasks and unlock many new commercial applications in the
same way that transfer learning from ImageNet has driven more industrial uses of computer vision.

Another piece of good news is that, unlike with ImageNet, NLP no longer requires labeled data for pretraining. Newer language
models are typically trained on very large amounts of publicly available data, i.e., unlabeled text from the web, for instance to
predict the next word in a sentence based on previous words or to predict masked parts of the sentence. This is called
self-supervised learning, and it is in itself a very interesting and promising concept in the research field of NLP.
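
The sketch below shows how training examples for these two objectives can be derived from raw text with no human labeling. The function names are hypothetical, and masking every position in turn is a simplification: real models typically mask a random subset of tokens.

```python
def next_word_examples(sentence):
    """Pair each prefix of the sentence with the word that follows it."""
    tokens = sentence.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_examples(sentence, mask_token="[MASK]"):
    """Hide one word at a time; the hidden word becomes the label."""
    tokens = sentence.split()
    examples = []
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        examples.append((masked, target))
    return examples

print(next_word_examples("the cat sat"))
# [(['the'], 'cat'), (['the', 'cat'], 'sat')]
print(masked_examples("the cat")[0])
# (['[MASK]', 'cat'], 'the')
```

In both cases the "label" comes from the text itself, which is why unlabeled web text is enough.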

As a consequence of the important advances made in transfer learning, self-supervised learning and the ability to pretrain NLP
models on unlabeled data, in 2019, the Enterprise AI space has seen a significant increase in interest in using unstructured
data, primarily in the form of text, but also images. We will continue to see use cases built on structured data; the good
old Excel and CSV files won’t go anywhere. But catering to this strong interest in leveraging unstructured data in the form of
text and images will become key for improving company value.


In machine learning, there is a key distinction between supervised and unsupervised learning. In supervised learning, the machine
learning algorithm is trained on labeled data, meaning data that is already tagged with the correct answer, so that it can predict
the correct answer on unseen data. In unsupervised learning, the model deals with unlabeled data and works on its own to
discover patterns and predict outcomes.

Self-supervised learning is a relatively recent learning technique in machine learning where the training data is autonomously (or
automatically) labelled. It is still supervised learning, but the datasets do not need to be manually labelled by a human. Instead,
they can be labelled by finding and exploiting the relations (or correlations) between different input signals (that is, input coming
from different sensor modalities).

In the field of NLP, this means that we can now leverage large amounts of existing text to pretrain a model’s parameters using self-
supervision, with no data annotation required. So, rather than needing to train a machine-learning model for natural language
processing from scratch, one can start from a model primed with knowledge of a language.

The main reason why we're seeing this shift in use cases is the rapid development of the tools and techniques needed to
address them, which has reached an inflection point over the past couple of years.

All of these techniques, which once were very restricted to the research area, are now becoming more and more mainstream.
As a consequence, companies won't have as much trouble leveraging this kind of data and techniques within their organization.

This has a lot to do with the important advances made in NLP architecture in the past few years. The NLP field has greatly benefited
from the resurgence of deep neural networks (DNNs), due to their high performance with less need for engineered features.


Recurrent Neural Networks (RNNs) are a type of neural network where the output from the previous step is fed as input to the current
step. In traditional neural networks, all the inputs and outputs are independent of each other. But in cases such as predicting
the next word of a sentence, the previous words are needed, so the network has to remember them.
Here is what a typical RNN looks like:

A recurrent neural network and the unfolding in time of the computation involved in its forward computation. Source: Nature

When dealing with NLP, what most people and organizations are still doing nowadays is using recurrent neural networks, or RNNs,
which seem like a more “natural” approach due to the inherent sequential structure of text, that is, the fact that each word comes
after another.

While still widely used in business applications, RNNs have been progressively falling out of vogue in the research field over the
past year or so. Because RNNs are inherently sequential, it is very hard to parallelize their training or their inference. This,
along with their high memory bandwidth usage (they are memory-bandwidth-bound rather than computation-bound), makes them hard to
scale.
This is where more recent breakthroughs in NLP architecture, such as the so-called Transformer models, step in. In contrast to
RNNs, the main advantage of the Transformer models is that they are not sequential, which means they can be parallelized and
scaled much more easily. But in order to understand Transformers, we will need to dive into their core technique: the novel paradigm
called Attention. It is precisely the Attention technique that makes it possible to get rid of the inherent sequential structure
of RNNs, which hinders the parallelization of such models.


To solve some of the problems related to dependencies, researchers created a technique for paying attention to specific
words. When translating a sentence or transcribing an audio recording, a human agent would pay special attention to the
word they are presently translating or transcribing. Neural networks can achieve this same behavior using attention, focusing
on a subset of the information they are given. When attention is used in RNNs, instead of only encoding the whole sentence in a
single hidden state, each word has a corresponding hidden state that is passed all the way to the decoding stage. Then, the hidden
states are used at each step of the RNN to decode.

The Attention Architecture
The attention paradigm made its grand entrance into the NLP landscape (specifically in translation systems) in 2014, well
before the deep learning hype, in an iconic paper by Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align
and Translate.” Before going any further, let’s recall the basic architecture of a machine translation system.


It follows a typical encoder-decoder architecture, where both the encoder and decoder are generally variants of RNNs (such
as LSTMs or GRUs). The encoder RNN reads the input sentence one token at a time. It helps to imagine an RNN as a succession
of cells, one for each timestep. At each timestep t, the RNN cell produces a hidden state h(t), based on the input word X(t) at
timestep t, and the previous hidden state h(t-1). This output will then be fed to the next RNN cell.
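
The recurrence described above can be sketched in a few lines. This assumes the simplest ("vanilla") RNN cell, h(t) = tanh(W_x · X(t) + W_h · h(t-1) + b), rather than the LSTM or GRU variants mentioned in the text; the weights and inputs are arbitrary toy values, not trained parameters.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, bias):
    """Compute h(t) from the input X(t) and the previous hidden state h(t-1)."""
    h_t = []
    for i in range(len(h_prev)):
        total = bias[i]
        total += sum(w_x[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(w_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

def encode(inputs, w_x, w_h, bias):
    """Run the cell over the sequence, one timestep after another."""
    h = [0.0] * len(bias)  # initial hidden state
    states = []
    for x_t in inputs:     # each step depends on the result of the previous one
        h = rnn_step(x_t, h, w_x, w_h, bias)
        states.append(h)
    return states

# Toy values chosen for illustration only.
W_X = [[0.5, -0.2], [0.1, 0.4]]
W_H = [[0.3, 0.0], [0.0, 0.3]]
BIAS = [0.0, 0.0]
states = encode([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], W_X, W_H, BIAS)
print(len(states))  # 3, one hidden state per input token
```

The loop in `encode` cannot be split across workers, because step t needs the output of step t-1.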

Eventually when the whole sentence has been processed, the last-generated hidden state will hopefully capture the gist of all
the information contained in every word of the input sentence. This vector, called the context vector, will then be the input to
the decoder RNN, which will produce the translated sentence one word at a time.

But is it reasonable to assume that the context vector can retain all the needed information of the input sentence? What
if the sentence is, say, 50 words long? It is not. This limitation was aptly dubbed the bottleneck problem.



So how can we avoid this bottleneck? Why not feed the decoder not only the last hidden state vector, but all the hidden state
vectors? Remember that the encoder RNN produces one such vector for each input word. We can then concatenate these
vectors, average them, or (even better!) weight them so as to give higher importance to the words of the input sentence that
are most relevant to decoding the next word of the output sentence. This is what attention is all about.
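
The weighting step can be sketched as follows. One assumption to flag: this example scores relevance with a plain dot product between the decoder state and each encoder hidden state, which is a common simplification, whereas the original 2014 paper used a small learned network for scoring. All vectors are toy values.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Weight every encoder hidden state by its relevance to the decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    size = len(encoder_states[0])
    # The context vector is the weighted average of all encoder hidden states.
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(size)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # one per input word
weights, context = attend([2.0, 0.0], encoder_states)
print(weights.index(max(weights)))  # 0: the first input word gets the most attention
```

A fresh set of weights is computed for every output word, so the decoder can focus on different input words at each step.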

As is often the case, this paradigm was in fact first leveraged on images before being replicated on text. The idea was to
shift the focus of the model to specific areas of the image (that is, specific pixels) to better help it in its task.

An Image Captioning application: In order to generate the next word in the caption, the model shifts its attention
to relevant parts of the image.


The same idea applies to translating text. In order for the decoder to generate the next word, it will first weigh the input words
(encoded by their hidden states) according to their relevance at the current phase of the decoding process.

In order to generate the word “took”, the decoder attends heavily to the equivalent French word “pris” as well as the word “a”,
which sets the tense of the verb.

The use case explained above was the very first time an attention mechanism was successfully applied to machine
translation, and it opened the door for different architectures that leverage this technique in one way or another. One of
these architectures drastically changed the NLP game and set it on a path into a new era: the Transformer.

As you now understand, attention was a revolutionary idea in sequence-to-sequence systems such as translation models.
Thus, in 2017, researchers at Google had the idea to push attention even further.

This boiled down to the following observation: in addition to using attention to compute representations (i.e., context vectors)
out of the encoder’s hidden state vectors, why not use attention to compute the encoder’s hidden state vectors themselves?
The immediate advantage of leveraging this idea was appealing: get rid of the inherent sequential structure of RNNs, which
hinders the parallelization of such models.
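
A stripped-down sketch of this idea, usually called self-attention, is shown below. It assumes scaled dot-product scoring and omits the learned query, key and value projections that a full Transformer layer applies, so it illustrates the principle rather than the published architecture.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Rebuild each token's representation as a weighted mix of all tokens."""
    dim = len(vectors[0])
    outputs = []
    # Each pass through this loop is independent of the others, so all tokens
    # could be processed at the same time instead of one after another.
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[k] for w, v in zip(weights, vectors))
                        for k in range(dim)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token vectors
outputs = self_attention(tokens)
print(len(outputs))  # 3, one new representation per token
```

Unlike the RNN recurrence, no output here depends on a previously computed output, which is what makes parallel training possible.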


Removing this sequential structure solves the parallelization problem and speeds up how quickly the model can translate from one
sequence to another. And so in 2017, in the now iconic paper "Attention Is All You Need", the world was introduced to this new
architecture: the Transformer.

As already mentioned, the main advantage of Transformer models is that they are not sequential, which means they can be
parallelized, and that bigger and bigger models can be trained by parallelizing the training. What’s more, Transformer models
have so far displayed better performance and speed than RNN models. Due to all these factors, a lot of the NLP research in
the past couple of years has been focused on Transformer models, and we can expect this to translate into new use cases in
organizations as well.


BERT (Bidirectional Encoder Representations from Transformers) is a model by researchers at Google AI Language, which
was introduced and open-sourced in late 2018, and has since caused a stir in the NLP community. BERT’s key innovation lies
in applying the bidirectional training of Transformer models to language modeling.

This contrasts with previous language modeling efforts, which either looked at a text sequence from left to right only, such as
OpenAI’s GPT model, or combined separate left-to-right and right-to-left training, such as the ELMo model. The results
demonstrated by BERT show that a language model which is bidirectionally trained can have a deeper sense of language context and
flow than single-direction language models.

While still largely restricted to the research area, variants of BERT are now beating all kinds of records across a wide array
of NLP tasks, such as document classification, textual entailment, sentiment analysis, question answering, and sentence
similarity.

Given the rate of developments in NLP architecture that we’ve seen over the last few years, we can expect these breakthroughs
to start moving from the research area into concrete business applications.




Since introducing their landmark NLP model BERT in 2018, the Google research team has applied it to improving the query
understanding capabilities of Google Search.

This breakthrough was the result of Google's research on transformers: models that process words in relation to all the other
words in a sentence, rather than one-by-one in order. BERT models can therefore consider the full context of a word by looking
at the words that come before and after it—particularly useful for understanding the intent behind search queries, and especially
for longer, more conversational queries, or searches where prepositions like “for” and “to” matter a lot to the meaning.

Applied to both ranking and featured snippets in Search, BERT models help Search better understand one in
10 searches in the U.S. in English. Another powerful characteristic of these systems is that they can take learnings from one
language and apply them to others, so the improvements in understanding search queries in English can subsequently be
applied to other languages as well.

The Google researchers’ decision to open-source their breakthrough model has spawned a wave of BERT-based innovations
from other leading companies. Microsoft announced it was using BERT to power its Bing search engine too. At LinkedIn,
search results are now categorized using a modified version of BERT called LiBERT that the company created and calibrated
on its own data. It has reportedly helped increase engagement metrics from search results (such as connecting to another
person’s profile or applying for a job) by 3% overall, and click-through rates on online help center query results by 11%.


Facebook also developed its own modified version of BERT by changing its training regimen and objective, and by training
on more data for a longer time. The result was a model that Facebook named RoBERTa, which tackles one of the social
network’s thorniest issues: content moderation.

Facebook took the algorithm and, instead of having it learn the statistical map of just one language, tried having it learn
multiple languages simultaneously. By doing this across many languages, the algorithm builds up a statistical image of what
hate speech or bullying looks like in any language. This means that Facebook can now use automatic content monitoring
tools for a number of languages. Thanks to RoBERTa, Facebook claims that in just six months, it was able to increase the amount
of harmful content that was automatically blocked from being posted by 70%.


While recent technological breakthroughs such as the Transformer models, BERT and its variants are already being implemented
in business by leading tech companies, and are surely going to see an even wider span of applications in the near future,
companies of various technical maturity could stand to benefit from an array of simpler, more traditional NLP use cases.


When it comes to adjusting sales and marketing strategy, sentiment analysis helps estimate how customers feel about your
brand. This technique, also known as opinion mining, stems from social media analysis and is capable of identifying and
extracting opinions within a given text across blogs, reviews, social media, forums, news, etc. Sentiment analysis can help turn
all this exponentially growing unstructured text into structured data using NLP and open source tools.
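
As a very rough illustration of turning free text into a structured label, the sketch below counts words from two small hand-written lists. This lexicon approach, the word lists, and the `sentiment` function name are assumptions made for the example; production systems generally rely on trained models instead.

```python
# Tiny hand-written word lists, used only for illustration.
POSITIVE = {"great", "love", "excellent", "fast", "helpful"}
NEGATIVE = {"bad", "slow", "broken", "terrible", "rude"}

def sentiment(text):
    """Label a text as positive, negative, or neutral by counting cue words."""
    tokens = [tok.strip(".,!?").lower() for tok in text.split()]
    score = sum(tok in POSITIVE for tok in tokens) - sum(tok in NEGATIVE for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("Great product, fast delivery!"))        # positive
print(sentiment("Terrible support and slow shipping."))  # negative
print(sentiment("It arrived on Tuesday."))               # neutral
```

Each review becomes one row of structured data (text plus label) that can be aggregated like any other column.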


Topic analysis is a Natural Language Processing (NLP) technique that makes it possible to automatically extract meaning from texts
by identifying recurrent themes or topics. Businesses generate and collect huge volumes of data every day. Analyzing and
processing this data using automated topic analysis can help businesses make better decisions, optimize internal processes,
identify trends, and become more efficient and productive overall.

The two most common approaches for topic analysis with machine learning are topic modeling and topic classification.
Topic modeling is an unsupervised machine learning technique. This means it can infer patterns and cluster similar
expressions without needing to define topic tags or train data beforehand.

Topic classification, on the other hand, needs to know the topics of a text before starting the analysis, because you need to
tag data in order to train a topic classifier. Although there’s an extra step involved, topic classifiers pay off in the long run
and they are generally more precise than clustering techniques.


Neural machine translation is the use of deep neural networks for the problem of machine translation, to predict the
likelihood of a sequence of words, typically modeling entire sentences in a single integrated model. In this setting, NLP
techniques are used to train the neural networks that perform the translation.

The encoder-decoder recurrent neural network architecture with attention (the same attention technique that Transformer models
build on) has achieved state-of-the-art results on some benchmark problems for machine translation. This architecture is at the
heart of the Google Neural Machine Translation system, or GNMT, used in the Google Translate service.

Businesses can leverage machine translation tools to translate low impact content like emails, regulatory texts, etc. and speed
up communication with partners as well as other business interactions.



Chatbots help meet customers’ demand for personalization: by collecting user-relevant data they can address customers
individually and offer fully personalized experiences devoid of the stress of human-to-human communication. Moreover,
chatbots increasingly find application in sales: they can target prospects, strike up a conversation, schedule appointments and
much more.

Chatbots have actually been around for quite a while, since 1966, but NLP has propelled them to an entirely new level.
Today, the language understanding capability of chatbots built with NLP is advanced enough that they can sometimes be mistaken
for humans, and they can also recognize human emotions. Unsurprisingly, chatbots are increasingly used
in business, and have been shown to deliver significant value to companies. For instance, Asos reported being able to increase
orders by 300% using a Facebook Messenger chatbot and enjoyed a 250% ROI while reaching almost 4 times more target users.
Likewise, thanks to its chatbots, Sephora was able to increase its makeover appointments by 11%.

In the years to come, as the rapid technological advancements unlock more and more NLP use cases, and as organizations
scale and improve the level of trust they are willing to put in AI-driven systems, we can expect to see more and more companies
leverage NLP models in their operations. This means more and more organizations investing in the right architecture to
retrieve data critical for NLP, the means to process it quickly, and apply models for the biggest impact and business value.

This does not mean that trust is inherent, or that once stakeholders trust one model the rest will naturally follow. Transparency
and model interpretability will always be critical to ensure the successful adoption and integration of NLP in the enterprise.


Your Path to
Dataiku is the platform democratizing access to data
and enabling enterprises to build their own path to AI.
To make this vision of Enterprise AI a reality, Dataiku is
the only platform on the market that provides one
simple UI for data wrangling, mining, visualization,
machine learning, and deployment based on
a collaborative and team-based user interface accessible
to anyone on a data team, from data scientist
to beginner analyst.

1. Clean & Wrangle
2. Build + Apply Machine Learning
3. Mining & Visualization
4. Deploy to production
5. Monitor & Adjust
