
Natural Language Processing with Python: Introduction

Natural language processing basics: tokenization, stemming, lemmatization, POS tagging

7 minute read

This article is Part 1 in a 4-part Natural Language Processing with Python series
(https://sanjayasubedi.com.np/series/nlp/).

Part 1 -> Natural Language Processing with Python: Introduction


Part 2 - NLP with Python: Text Feature Extraction

Part 3 - NLP with Python: Text Clustering

Part 4 - NLP with Python: Topic Modeling

Introduction
Simply put, Natural Language Processing (NLP) is a field concerned with making
computers understand human language. NLP techniques are applied heavily in information
retrieval (search engines), machine translation, document summarization, text classification,
natural language generation, etc. In this series of posts, we’ll go through the basics of NLP and
build some applications, including a search engine, a document classification system, a machine
translation system and a chatbot.

A typical flow of an NLP application looks like this:

Text -> Pre-process -> Feature Extraction -> Model (e.g. search engine, classification)

In this post, we’ll focus on pre-processing.


Pre-processing
In this section, I’ll introduce some of the common pre-processing steps. As an input, we have a
piece of text. It could be a news article, a search query, instructions for a chatbot, etc. We feed this
input to a pre-processing step where we extract the tokens, which could be words, phrases or
even sentences, and clean our input text, i.e. fix spelling mistakes, remove useless words (stop-
words), augment the words with part-of-speech information and so on. What we do in this step
depends on the problem we are trying to solve, but for many applications tokenization, stop-word
removal and stemming are fairly common.

Let’s take an example input:

text = "This warning shouldn't be taken lightly."

I’ll use this example to demonstrate the different pre-processing steps.

Tokenization
Tokenization is the process of splitting text into pieces called tokens. A token
could be a word, a phrase or even a sentence. In many applications, tokenization refers to splitting
the text into words, and I’ll only demonstrate word tokenization here. There are different
tokenization strategies. A simple strategy is to treat a space as the separator and discard
punctuation characters, so that we end up with the words. In Python, we can use the split
function with a space as the separator to get a list of words from a text.

print(text.split(sep=" "))
# we can also use text.split(), which splits on any whitespace by default

['This', 'warning', "shouldn't", 'be', 'taken', 'lightly.']

With just one function call, it seems we got the results. But look at it carefully - the tokens
shouldn't and lightly. contain punctuation characters. We need to remove them. For that we can
use regular expressions.

First we’ll install the regex library, since the built-in re module in Python does not support Unicode
categories. The API is the same as the re module’s, but it is more flexible. You can install it by
running the following:

pip install regex

Note that the code below does not work with the built-in re module because it can’t recognize
\p{P}, which means match any punctuation character.
import regex as re
clean_text = re.sub(r"\p{P}+", "", text)
print(clean_text.split())

['This', 'warning', 'shouldnt', 'be', 'taken', 'lightly']

Now it seems that the punctuation characters are gone. But there is a problem with the token
shouldnt . Should it be further divided into should and not, or into should and nt, or should we
leave the punctuation in and keep it as shouldn't ? There are many other scenarios where this
approach would not work. For example, in tweets it is common to use smileys like :) , :( and
hashtags like #python . Our tokenizer would completely remove such characters and we would
lose a lot of meaning from the text. Fortunately, there are libraries that implement advanced
tokenizers able to deal with such scenarios. I’ll be using the spaCy library for the demonstration,
but you can use others like NLTK .

Installation:

pip install spacy


python -m spacy download en

This installs spaCy and also downloads the English language model. spaCy provides a number of
models trained on different datasets and for different languages. Check them out at
https://spacy.io/usage/models

Now we’ll load the English model and apply it to our text. To see how spaCy handles the tweet-style tokens mentioned above, we’ll extend the example text with a smiley and a hashtag.

import spacy

nlp = spacy.load('en_core_web_sm')
# the example text now also includes a smiley and a hashtag
text = "This warning shouldn't be taken lightly :) #python."
doc = nlp(text)
print([token.text for token in doc])

['This', 'warning', 'should', "n't", 'be', 'taken', 'lightly', ':)', '#', 'python', '.']

Now the tokenization looks much better. The punctuation tokens are still present, but we can easily
remove them. Every token produced by spaCy is of type spacy.tokens.token.Token and has a
number of properties. Among them are a few that start with is_* , e.g. is_digit , is_punct ,
is_stop , which can be used to determine what kind of token it is.
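
As a minimal sketch, reusing the doc object from above, we can inspect a few of these flags for every token:

# print some of the is_* flags for each token in the doc from above
for token in doc:
    print(token.text, token.is_punct, token.is_stop, token.is_digit)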

Stopword Removal
Stop-words are words that occur frequently but don’t carry much meaning on their own. For
example, a , an , the occur very frequently and can be discarded without any loss of meaning for
most NLP tasks. Depending on the domain and language, there will be a different set of stop-
words. In the case of the above example, we can easily figure out whether a word is a stop-word
by checking the is_stop property of a spaCy Token .

print ([(token.text, token.is_stop) for token in doc])

[('This', False), ('warning', False), ('should', True), ("n't", False), ('be', True), ('taken',
False), ('lightly', False), (':)', False), ('#', False), ('python', False), ('.', False)]

There are a couple of stop-words in our sentence: should and be . Stop-words are removed to
reduce the size of the vocabulary (the unique words in our entire dataset) that we have to keep
track of. This helps with faster computation and lower memory requirements, and most
importantly it reduces noise.
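
As a minimal sketch, again reusing the doc from above, we can drop the stop-word (and punctuation) tokens like this:

# keep only tokens that are neither stop-words nor punctuation
filtered = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(filtered)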

Stemming
Stemming is the process of reducing words to their root form. For example, the stem of cats
would be cat , the stem of transportation would be transport , etc. Again, this is done to reduce
the size of the vocabulary, because for most applications the distinction between cats and cat is
not important. For example, if a user searches for documents containing the word cats but we
only have documents containing the word cat , then the user would get zero results. But if we
stem the user’s query, then we would be able to retrieve some results. A popular algorithm used
for stemming is the Porter algorithm. spaCy does not provide stemming, but libraries like NLTK
do. Stemming algorithms are mostly rule-based and the output is not always a valid word.
Consider the following examples.

word stem

meeting meet

technology technolog

In the first case, the word meeting is stemmed to meet . If meeting is used as a verb in a
sentence, e.g. “We are meeting tomorrow”, then this stemming is correct. But if meeting is used
as a noun, e.g. in “I’m in a meeting now”, then we don’t want it altered. Stemming algorithms like
Porter don’t care about how the word is being used and produce the same output regardless.
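
Although the demonstrations here use spaCy, a quick sketch of stemming with NLTK’s PorterStemmer looks like this (assuming NLTK is installed, e.g. via pip install nltk):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# stem a few of the words discussed above
for word in ["cats", "transportation", "meeting", "technology"]:
    print(word, "->", stemmer.stem(word))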

Lemmatisation
Lemmatisation is a more sophisticated version of stemming: the part of speech (POS) of each word
is determined first, and then different rules are applied for different POS. spaCy provides
lemmatisation instead of stemming, since it is much better, although a bit more computationally
expensive. Let’s look at how we can get the lemma of a word.

print ([(token.text, token.lemma_) for token in nlp("we are meeting tomorrow")])


print ([(token.text, token.lemma_) for token in nlp("i am going to a meeting")])

[('we', '-PRON-'), ('are', 'be'), ('meeting', 'meet'), ('tomorrow', 'tomorrow')]


[('i', 'i'), ('am', 'be'), ('going', 'go'), ('to', 'to'), ('a', 'a'), ('meeting', 'meeting')]

We can see that the words have been reduced to their lemma depending on their POS. In the
first sentence, meeting is transformed into meet since it is used as a verb, but in the second
sentence it is not altered since it is used as a noun. Similarly, are and am are both transformed
into the same lemma be .

POS Tagging
Part-of-speech tagging is the process of determining the POS of each word in a text. POS tagging
is a necessary step for many NLP applications like lemmatization, machine translation, sentiment
analysis, etc. The techniques vary from a simple word-to-POS lookup table to deep learning
based models. Check this article (http://www.stat.columbia.edu/~madigan/DM08/hmm.pdf) for an overview
of different algorithms for POS tagging and this one (https://explosion.ai/blog/how-spacy-works#part-of-speech-tagger)
for how spaCy works.

Using our example document, we can print the POS of a token using the pos_ property as follows:

print ([(token.text, token.pos_) for token in doc])

[('This', 'DET'), ('warning', 'NOUN'), ('should', 'VERB'), ("n't", 'ADV'), ('be', 'VERB'), ('taken',
'VERB'), ('lightly', 'ADV'), (':)', 'NOUN'), ('#', 'NOUN'), ('python', 'NOUN'), ('.', 'PUNCT')]
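
spaCy also exposes a fine-grained tag via the tag_ property, and spacy.explain can give a human-readable description of a tag. A minimal sketch, reusing the doc from above:

# print the coarse POS, the fine-grained tag and its description for each token
for token in doc:
    print(token.text, token.pos_, token.tag_, spacy.explain(token.tag_))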

Conclusion
There are many NLP libraries available in Python, including spaCy , NLTK , gensim , textblob etc.
Each of these focuses on different aspects of NLP, but they can be used together to build a powerful
NLP application. One thing to note is that these libraries use pre-trained models for many tasks, e.g.
tokenization, POS tagging, etc., and may not work as expected if the domain is different. E.g.
POS tagging might not work well for tweets, since the words in tweets are often shortened on
purpose and the model provided by the library might never have seen such words. We briefly
went through the methods commonly applied in the pre-processing step of an NLP application. In
the next post we’ll go through how we can convert the words into features so that we can feed
them to a model (chatbot, document classification) for training or inference.


Categories: NLP

Updated: December 21, 2018

