
1 December 2018

NLP for Indian Languages

wingifydevfest@nirantk.com
Why should you care?

Language is emotion
Why should you care?

Language is emotion expressed


Why should you care?
Very few people care about making software and tech for us!
Indians who speak in mixed languages, e.g. Hinglish, or in native languages.
Other equally good titles for this talk

1. Transfer Learning for Text
2. Making Deep Learning work for Small Text Datasets
Who am I
Nirant Kasliwal (@NirantK)
● Claim to 5 minutes of Internet fame ->
● Research Engineer / NLP Hacker - Maker of hindi2vec
● Work for Soroco (Bengaluru)
Outline
● Text Classification
○ How much tagged data do we really need?
○ How can we use untagged data?
● Transfer Learning for Text
○ Language Models
○ Language Models for Hindi
○ Language Models for 100+ languages
What I expect you
know already
What I expect you know already

Python
What I expect you know already

Some exposure to modern (deep) machine learning
What I expect you know already

Great to know: modern (neural) NLP*

Ideas like:
● Seq2seq
● Text Vectors: GloVe, word2vec
● Transformer
What you'll learn
today
What you'll learn today
NEW Idea: Transfer Learning for Text
What you'll learn today
How to do NLP with small datasets
What you'll learn today
There are too many NLP challenges in any language!

Automatic speech recognition, CCG supertagging, Chunking, Common sense, Constituency parsing, Coreference resolution, Dependency parsing, Dialogue, Domain adaptation, Entity linking, Grammatical error correction, Information extraction, Language modeling, Lexical normalization, Machine translation, Multi-task learning, Named entity recognition, Natural language inference, Part-of-speech tagging, Question answering, Relation prediction, Relationship extraction, Semantic parsing, Semantic role labeling, Semantic textual similarity, Sentiment analysis, Stance detection, Summarization, Taxonomy learning, Temporal processing, Text classification, Word sense disambiguation
What you'll learn today
Selecting topics which deal more with text semantics (meaning) than grammar (syntax).
(Same task list as above.)
What you'll learn today
And for today's discussion, two tasks from that list:
● Language Modeling
● Text Classification
What you’ll learn today
EXAMPLE
What you’ll NOT learn
today
What you’ll NOT learn today
No Math.
What you’ll NOT learn today
No peeking under the hood. No code. We will do that later!
Text Classification needs a lot of
data!
But exactly how much data is enough?
Let's get some estimates from English datasets.

Dataset   | Type                           | No. of Classes | Examples in Training Split
IMDb      | Sentiment - Movie Reviews      | 2              | 25k
Yelp-bi   | Sentiment - Restaurant Reviews | 2              | 560K
Yelp-full | Sentiment - Restaurant Reviews | 5              | 650K
DBPedia   | Topic                          | 14             | 560K


But exactly how much data is enough?
And what is the lowest error rate we get on these?

Dataset   | No. of Classes | Examples in Training Split | Test Error Rate (%)
IMDb      | 2              | 25k                        | 5.9
Yelp-bi   | 2              | 560K                       | 2.64
Yelp-full | 5              | 650K                       | 30.58
DBPedia   | 14             | 560K                       | 0.88


Text Classification
needs a lot of data!
How? Transfer Learning!

Image from https://machinelearningmastery.com/transfer-learning-for-deep-learning/


Data++
Dataset | No. of Classes | Use Untagged Samples | Data Efficiency
IMDb    | 2              | No                   | 10x
IMDb    | 2              | Yes, 50k untagged    | 50x (only 100 labeled samples needed)

Compared to identical accuracy when training from scratch.


Data--;

[Chart: results on IMDb and on TREC-6, comparing same-task transfer (different data) with multi-task transfer (different data, different task)]
How does this change
things for you?
Simpler code & ideas
Simpler code

BEFORE: DEVELOP and REUSE
1. Select Source Task & Model, e.g. Classification
2. Reuse Model, e.g. for classifying car types or screenshot segmentation
3. Tune Model to Your Dataset
   a. Downside: needs tagged samples, does not learn from untagged samples
   b. Upside: can give an initial performance boost
4. Repeat for every new challenge which you see. BORING!

NOW: DOWNLOAD AND ADAPT to your Task
1. Select Source Model, e.g. ULMFit or BERT
2. Reuse Model, e.g. for text classification or any other text task
3. Tune Model
   a. Can use both untagged and tagged samples
Can use the same source model across multiple tasks and languages.

[Diagram: Text Embedding → Backbone → Task-Specific Layer; arrow shows data flow direction]
Simpler code
(Same BEFORE vs. NOW comparison as above.)

[Diagram: GloVe → Language Models → Classifier; arrow shows data flow direction]
Simpler Code
We will download pre-trained language models instead of word vectors.
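A minimal sketch of what "download a backbone" looks like, using the Hugging Face transformers library (my addition, not something shown in the talk); the checkpoint name is Google's published multilingual BERT, everything else is illustrative:

```python
# A minimal sketch (not code from the talk): download a pre-trained language
# model backbone instead of static word vectors. Assumes the Hugging Face
# `transformers` library and PyTorch are installed, and a recent API version.
from transformers import AutoModel, AutoTokenizer

# Multilingual BERT: one checkpoint covering 100+ languages, Hindi included.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

# Unlike GloVe/word2vec, the vector for a token depends on its whole sentence.
inputs = tokenizer("NLP for Indian languages", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, number_of_tokens, hidden_size)
```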
Making the Backbone
or Source Model
Making the Backbone
Pre-training for Language Models
The BERT model was trained on two tasks simultaneously: Masked Word Prediction (Masked LM) and Next Sentence Prediction.
Making the Backbone

Task 1: Masked Language Models

Predict the masked word anywhere in the input. 15% of the input tokens were masked, but not all of them were masked in the same way.
Making the Language Models

Task 1: Masked Language Models

Existing ideas in word2vec and GloVe.
Making the Backbone

Task 1: Masked Language Models

Example: "My dog is hairy" (sketched in code below)

● 80% were replaced by the '[MASK]' token
   ○ Example: "My dog is [MASK]"
● 10% were replaced by a random token
   ○ Example: "My dog is apple"
● 10% were left intact
   ○ Example: "My dog is hairy"
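A rough sketch of this 80/10/10 recipe in plain Python; this is an illustration of the idea, not BERT's actual preprocessing code, and the toy vocabulary is made up:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, vocab=("apple", "store", "milk", "dog")):
    """Illustrative masking, roughly as in the BERT paper: pick ~15% of
    positions, then apply the 80/10/10 replacement rule to each."""
    masked = list(tokens)
    labels = [None] * len(tokens)        # prediction targets; None = not selected
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:  # position selected for prediction
            labels[i] = token            # model must recover the original token
            dice = random.random()
            if dice < 0.8:               # 80%: replace with the [MASK] token
                masked[i] = "[MASK]"
            elif dice < 0.9:             # 10%: replace with a random token
                masked[i] = random.choice(vocab)
            # remaining 10%: leave the original token in place
    return masked, labels

print(mask_tokens("my dog is hairy".split()))
```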
Making the Backbone

Task 2: Next Sentence Prediction

Input:
   sentence1: the man went to [MASK] store
   sentence2: he bought a gallon [MASK] milk [SEP]
Label = isNext
Making the Backbone

Task 2: Next Sentence Prediction

Input:
   sentence1: the man [MASK] to the store
   sentence2: penguin [MASK] are flight ##less birds
Label = NotNext
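The sentence pairs for this task can be generated from any ordered corpus; here is a toy sketch of the idea (an illustration, not BERT's actual data pipeline):

```python
import random

def make_nsp_pairs(sentences):
    """Build (sentence_a, sentence_b, label) examples for Next Sentence
    Prediction: half the time sentence_b really is the next sentence,
    otherwise it is a random sentence from elsewhere in the corpus."""
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "isNext"))
        else:
            # pick any sentence that is not the true next one
            candidates = [s for j, s in enumerate(sentences) if j != i + 1]
            pairs.append((sentences[i], random.choice(candidates), "notNext"))
    return pairs

corpus = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
]
for a, b, label in make_nsp_pairs(corpus):
    print(f"{a} [SEP] {b} -> {label}")
```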
Pause!
Any questions at this point?
Indian Languages
e.g. Hindi, Telugu, Tamil
First Challenge: Making a good backbone
Indian Languages
e.g. Hindi, Telugu, Tamil
[Diagram: Text Embedding → Backbone → Task-Specific Layer; arrow shows data flow direction]
Hindi2vec: Based on ULMFit
- Designed to work well on tiny datasets and small compute, e.g. I work off free K80 GPUs via Colab
- State of the Art classification results on several languages: Polish, German, Chinese, Thai
Hindi2vec: Download a ready-to-use Backbone
Disclaimer: I made this using fastai v0.7, and it is a little outdated!

https://github.com/NirantK/hindi2vec
Alternative: Use Google AI’s BERT

Indian Languages
e.g. Hindi, Tamil
[Diagram: Text Embedding Layer → BERT → Language-Specific Layer, e.g. हिंदी; arrow shows data flow direction]
BERT: Based on OpenAI's General Purpose Transformer
- Designed to work well on larger datasets and large compute, e.g. it needs a few GPU-days to fine-tune for a specific language
- State of the Art results on 11 NLP tasks
BERT: Based on OpenAI’s General Purpose
Transformer
BERT-Multilingual: works for 104 languages!
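As a quick sanity check, the multilingual checkpoint tokenizes Devanagari out of the box; a hedged sketch with the Hugging Face transformers library (the checkpoint name is Google's, the example sentence is mine):

```python
# Sketch: multilingual BERT handles Hindi text without any extra setup.
# Assumes the `transformers` library is installed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer.tokenize("मेरा कुत्ता प्यारा है"))  # "my dog is cute" in Hindi
# Prints a list of WordPiece tokens covering the Devanagari text.
```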
RELATED MYTH:
Not enough Indian
Language Resources!
Datasets Ready to Use
- Wikimedia Dumps with 100+ languages
- IIT Bombay English-Hindi Corpus
Just the 2 things above are about 100M+ words/tokens, with at least 100k unique words.

Sidenote: You can make your own!
- Online Newspapers and Regional TV
- Forums
- WhatsApp groups!
Indic NLP Library
- Link: http://anoopkunchukuttan.github.io/indic_nlp_library/
- GPL! Do not use at work
- Languages Supported:
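Below is a small usage sketch, assuming the library is installed (pip install indic-nlp-library); the module and function names are from its documentation as best I recall, so double-check them against the link above:

```python
# A hedged sketch of the Indic NLP Library; some features also need the
# separate indic_nlp_resources checkout, but basic normalization and
# tokenization work roughly as below (verify names against the library docs).
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from indicnlp.tokenize import indic_tokenize

text = "यह एक उदाहरण वाक्य है।"  # "This is an example sentence."

# Normalize Devanagari (canonical forms of nukta, matras, etc.)
normalizer = IndicNormalizerFactory().get_normalizer("hi")
normalized = normalizer.normalize(text)

# Word-level tokenization for Hindi
tokens = indic_tokenize.trivial_tokenize(normalized, lang="hi")
print(tokens)
```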
RELATED MYTH:
Non English is hard
in Python!
Related Myth: Non-English is Hard
Works out of the box in Python 3.5+!

Python is natively Unicode now, not ASCII.
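For example, nothing special is needed to work with Devanagari strings in Python 3:

```python
# Python 3 strings are Unicode by default, so Devanagari "just works".
text = "नमस्ते दुनिया"        # "Hello, world"
print(len(text))              # counts Unicode code points, not bytes
print(text.split())           # ['नमस्ते', 'दुनिया']
print(text.encode("utf-8"))   # explicit bytes only when you ask for them
```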


More!
This looks promising. What else can I do with this?

- Pretty crazy stuff, e.g. ask questions and learn inference!

Screenshot from SQuAD Explorer 1.1


Where does this fail?
Where does this fail?
1. Small sentences, e.g. chat, Tweets
2. Long-tail inference, e.g. stories
   ○ E.g. Who was on Arjuna's chariot in the Mahabharata? Cannot infer Hanuman
3. Hinglish - but, bbbut - aap fine-tune kar sakte ho! (you can fine-tune it!)
Takeaway
Takeaway
Transfer Learning for text is here

- It helps us work with really small compute and data

Key Idea: Language Models are great backbones

- BERT and ULMFit are reusable, proven LMs


What can I do from
this talk?
What can I try from this talk?

PyTorch:
- Download the Google BERT or ULMFit models

TensorFlow:
- Download the GoogleAI BERT models

Train your own "good-morning message or not" classifier from WhatsApp chats! (See the sketch below.)
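A hedged sketch of one way to do that: use multilingual BERT as a frozen feature extractor and train a tiny classifier on top. Everything here (messages, labels, the pooling choice) is an invented illustration, not a recipe from the talk:

```python
# Toy "good-morning message or not" classifier sketch.
# Assumes `transformers`, `torch` and `scikit-learn` are installed; the
# messages and labels below are invented, not a real WhatsApp export.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
backbone = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(texts):
    """Mean-pooled mBERT features; the backbone stays frozen."""
    with torch.no_grad():
        enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = backbone(**enc).last_hidden_state   # (batch, tokens, dim)
        return hidden.mean(dim=1).numpy()            # crude pooling, fine for a sketch

messages = ["Good morning ji! Have a blessed day", "Meeting shifted to 5pm",
            "सुप्रभात! आपका दिन शुभ हो", "Send me the report please"]
labels = [1, 0, 1, 0]   # 1 = good-morning forward, 0 = everything else

clf = LogisticRegression().fit(embed(messages), labels)
print(clf.predict(embed(["Good morning! Stay blessed"])))
```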
Thanks for Coming!
Questions?

@NirantK
Created by @rasagy (typo: 1st Dec 2018, not 2019)
Credits and Citations
- Slides and gifs from Writing Good Code for NLP Research by Joel
Grus at AllenAI
- ULMFit Paper and Blog by Jeremy Howard (fast.ai) and Sebastian
Ruder (@seb_ruder)
- Recommended Reading: Illustrated BERT
- BERT Dissections: Paper; Blogs: The Encoder, The Specific Mechanics, The Decoder
- Visualisations made from the Neural Nets Visualisation Cheatsheet
Appendix
Appendix: 1 Slide Summary of ULMFit Paper
Howard and Ruder suggest using pre-trained models for solving a wide range of NLP problems. With this
approach, you don’t need to train your model from scratch, but only fine-tune the original model. Their
method, called Universal Language Model Fine-Tuning (ULMFiT) outperforms state-of-the-art results,
reducing the error by 18-24%. Even more, with only 100 labeled examples, ULMFiT matches the
performance of models trained from scratch on 10K labeled examples.

However, to be successful, this fine-tuning should take into account several important considerations:
● Different layers should be fine-tuned to different extents, as they capture different kinds of information.
● Adapting the model's parameters to task-specific features is more efficient if the learning rate is first increased linearly and then decayed linearly.
● Fine-tuning all layers at once is likely to result in catastrophic forgetting; it is better to gradually unfreeze the model starting from the last layer (sketched in code below).

From TopBots: The Most Important AI Papers of 2018
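A sketch of these three ideas with the fastai v1 text API (not the fastai v0.7 code used for hindi2vec, and the API differs across fastai versions; df, valid_df and the column names are placeholders):

```python
# Sketch of ULMFiT-style fine-tuning with fastai v1 (API may differ across
# versions; `df`/`valid_df` with `text` and `label` columns are placeholders).
from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                         language_model_learner, text_classifier_learner)

# 1. Fine-tune the pre-trained language model on your (untagged) text.
data_lm = TextLMDataBunch.from_df(path=".", train_df=df, valid_df=valid_df,
                                  text_cols="text")
lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
lm.fit_one_cycle(1, 1e-2)            # slanted triangular LR schedule under the hood
lm.save_encoder("fine_tuned_encoder")

# 2. Build the classifier on top of the fine-tuned encoder.
data_clas = TextClasDataBunch.from_df(path=".", train_df=df, valid_df=valid_df,
                                      text_cols="text", label_cols="label",
                                      vocab=data_lm.vocab)
clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
clf.load_encoder("fine_tuned_encoder")

# 3. Gradual unfreezing with discriminative learning rates per layer group.
clf.fit_one_cycle(1, 2e-2)                            # train only the new head
clf.freeze_to(-2)                                     # unfreeze one more layer group
clf.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))  # lower LRs for earlier layers
clf.unfreeze()                                        # finally unfreeze everything
clf.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))
```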


Appendix: 1 Slide Summary of BERT

Training Tasks: Masked Language Model applied to 15% of tokens at random, and Next Sentence Prediction.

Results: SoTA on 11 NLP tasks, mostly around inference and QA. Showed that the model can be fine-tuned on both new datasets and new tasks.

Model: BERT-Base is inspired by the OpenAI Transformer and has roughly the same parameter count. BERT-Large has 340M parameters; both are based on Transformer networks.

Want to understand Transformer Network architecture? Here is an Illustrated Intro to Transformers
