Unsupervised Language Identification in the Wild

Anonymous submission

Abstract

Modern language identification modules are typically trained on large, clean data sets in a supervised learning setting. Our experiments reveal that such methods are brittle and perform poorly when used in the wild for lower-resource languages on noisy or code-switching text. This paper proposes an unsupervised algorithm that operates well in a variety of domains, such as well-written prose and noisier social media commentary on entertainment, current affairs, etc. We focus mainly on languages used in the Indian subcontinent, including Bengali and Oriya, two similar low-resource languages. Our solution re-purposes a popular language modeling algorithm to obtain paragraph vectors, from which it constructs high-precision and robust language clusters. Through an extensive set of experiments, we demonstrate that our unsupervised approach exhibits better or on-par performance compared with multiple high-performance supervised language identification methods.

1 Introduction

Popular solutions to detect the language of a document are modeled as supervised learning problems. In such settings, large corpora are collected and annotated. Several popular models with this approach achieve high accuracy for clean text.

A supervised solution, however, is not ideal in many situations. For instance, when the training corpora are limited to clean, well-written text, the performance of the model will not necessarily transfer to a noisy text environment typical of social media posts. In a low-resource setting, state-of-the-art models are unable to predict the desired language(s) and often misclassify noisy or code-switching text. Short of an exhaustive round of re-labeling and re-training, the supervised solutions offer no other options.

In this paper, we develop a simple unsupervised algorithm that shines in the aforementioned settings. The model uses no external labeled resources, works directly on the data set in question, and is able to categorize documents with a high level of accuracy. The unsupervised nature of the technique eliminates the need for extensive annotation.

We demonstrate the strengths of our model on several noisy corpora and evaluate against a popular open source model and a commercial model. Our unsupervised technique is either on par with these models or clearly superior. We also release our training corpora and language identification modules for Oriya, a low-resource language.

2 Related Work

We focus on prior work in three domains: language identification systems, word embeddings, and paragraph embeddings.

Language identification systems are modeled as classification tasks taking as input a document and returning the detected language as output. (Baldwin and Lui, 2010) contains a comprehensive survey of approaches. The models mainly differ in the features employed and the learning algorithm used. A few popular supervised models are presented in (Aslam and Frost, 2003; Martins and Silva, 2005; Joulin et al., 2017).

Recent challenges (Zampieri et al., 2015) have focused on settings where modern systems fail. Noisy text and short documents are a difficult setting for most systems. Also, when the languages in question are very similar, distinguishing between them is difficult.

Unsupervised approaches to this problem have been fairly successful, but only in settings with clean text (Zhang et al., 2016; Amine et al., 2010). Our model is tested against a more diverse set of corpora and in harder, noisier contexts.


Our model leverages recent advances in language modeling and in word and paragraph embeddings. We briefly discuss these topics here.

Dense word embeddings have been deployed successfully in many natural language tasks. For instance, word embeddings from (Mikolov et al., 2013; Pennington et al., 2014) have been used in tasks like named-entity recognition (Lample et al., 2016), parsing (Che et al., 2018; Ling et al., 2015), constructing lexicons for analyzing sentiment (Hamilton et al., 2016), and several other syntactic and semantic problems. Newer extensions to these models incorporate sub-word information as well (Bojanowski et al., 2017).

Language modeling and word vectors have been applied to the language identification problem previously (Veena et al., 2018). These approaches all model the problem as a classification task.

Word embeddings from different languages have been combined, with success, to improve the quality of the embeddings (Chen and Cardie, 2018; Lample et al., 2018). Paragraph embeddings have received a lot of attention recently (Conneau et al., 2017; Devlin et al., 2018; Arora et al., 2016; Peters et al., 2018).

Recent work to understand these models and representations involves (i) using the representations in diverse tasks and (ii) training these models in different contexts (Levy and Goldberg, 2014). Our work can be considered an exploration of what word embeddings (and the subsequent paragraph embeddings) are learned when a language model is trained on a multilingual corpus.

3 Approach

Our approach is a three-step process: (i) first, a popular language modeling technique is used to obtain word embeddings; next, (ii) a bag-of-words model is used to obtain embeddings for individual documents (a YouTube comment or a sentence in the DEuro corpus); and finally, (iii) a small sample of documents is obtained and their embeddings are clustered, yielding highly precise language clusters.

The language modeling step uses the continuous Skipgram model (Mikolov et al., 2013) and the variant proposed in (Joulin et al., 2017) where a word is represented by the sum of its constituent character n-grams. The resulting word representations capture sub-word information and are capable of producing reliable representations in noisy settings (with a high ratio of misspellings, for example).
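The sketch below shows how this step could look in practice. It is a minimal illustration rather than the paper's released code: it assumes the gensim implementation of the fastText-style subword Skipgram model, and the file name and hyperparameters are placeholders.

```python
# Learn subword-aware word vectors directly on the multilingual corpus.
from gensim.models import FastText

# One whitespace-tokenized document (comment or sentence) per line.
docs = [line.split() for line in open("corpus.txt", encoding="utf-8")]

model = FastText(
    sentences=docs,
    sg=1,             # continuous Skipgram (Mikolov et al., 2013)
    vector_size=100,  # illustrative; the paper does not state a dimension here
    min_n=3,          # character n-gram range (Joulin et al., 2017)
    max_n=6,
    epochs=5,
)

# Character n-grams yield a vector even for misspelled or unseen words.
vec = model.wv["video"]
```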
Once we have a set of word embeddings, a single embedding is constructed for each of the documents in question by (i) normalizing each constituent word's embedding and (ii) averaging these normalized embeddings. This is a default baseline used with the word embeddings from (Joulin et al., 2017).
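A minimal sketch of this averaging step, assuming the `model.wv` vectors from the previous snippet; the helper name `document_embedding` is introduced here only for illustration.

```python
import numpy as np

def document_embedding(tokens, wv):
    """(i) L2-normalize each word vector, (ii) average the normalized vectors."""
    vecs = np.stack([wv[t] for t in tokens])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs.mean(axis=0)

emb = document_embedding("dekhte darun laglo".split(), model.wv)
```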
Once a paragraph embedding is obtained, we sample a small set of documents (typically of size 20,000) and cluster their embeddings using the k-means algorithm. The number of clusters is set using a combination of the elbow method and manual inspection of the discovered clusters.
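The clustering step, again as an illustrative sketch under the same assumptions (a tokenized `documents` list and the `document_embedding` helper above):

```python
import random
import numpy as np
from sklearn.cluster import KMeans

sample = random.sample(documents, 20000)  # small sample, as described above
X = np.stack([document_embedding(d, model.wv) for d in sample])

# Elbow method: inspect inertia as a function of K and pick the K where the
# curve bends, then confirm by manually reading a few documents per cluster.
for k in range(2, 10):
    print(k, KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)  # e.g. K = 4
```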
After this step, a few sentences from each of the discovered clusters are inspected and the whole cluster is assigned a language, post facto (this is not used in training). Membership in a language cluster is the test used to assign a language to a document.
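Concretely, the membership test can be read as the sketch below; the cluster-to-language mapping is hypothetical and would come from the manual inspection just described.

```python
# Hypothetical labels assigned after inspecting each discovered cluster.
cluster_language = {0: "Oriya (E)", 1: "Oriya", 2: "English", 3: "Hindi (E)"}

def identify_language(tokens):
    emb = document_embedding(tokens, model.wv).reshape(1, -1)
    return cluster_language[int(kmeans.predict(emb)[0])]
```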
For our evaluation, we set aside a held-out corpus. This corpus is not used during the training phase (for either the language modeling or the clustering phases). When labels are not available, this held-out corpus is annotated by annotators proficient in the languages involved.
4 Data set

We evaluate the performance of the system on five corpora (one existing and four first presented in this paper):

• DEuro: The Europarl corpus (Koehn, 2005) contains 21 languages with well-written text. The processed version is obtained from (Tiedemann, 2012). 420,000 documents were reserved for training and 210,000 documents were used for test.

• DABP: The ABP Ananda news channel is a Bengali news organization. We crawled the comments on videos uploaded by their YouTube channel (https://www.youtube.com/channel/UCv3rFzn-GHGtqzXiaq3sWNg) and obtained 219,927 comments. Most of the comments are in Bengali, Hindi, and English. Note that internet users in the Indian subcontinent use the Latin script as well as their native script for writing. The use of the Latin script for writing in Hindi and Bengali is significant in this corpus.


• DIndPak: Comments crawled from videos posted by the top YouTube influencers from India and Pakistan. The vast majority of the comments are in Hindi/Urdu and English, with trace amounts of other Indian languages. As mentioned above, the Latin script is frequently employed, as are the Devanagari and Urdu scripts.

• DBenMovie: We retrieved video results from YouTube using the search queries bengali movie trailer and anupam roy (a popular Bengali entertainment personality). Subsequently, comments for these videos were crawled, yielding 129,068 comments. The vast majority of the comments are written in Bengali, Hindi, and English. Bengali and Hindi appeared in both the native and the Latin script.

• DOTV: OTV is an Oriya news network with a popular YouTube channel (https://www.youtube.com/channel/UCCgLMMp4lv7fSD2sBz1Ai6Q). We crawled videos from this network and subsequently crawled comments to obtain 153,435 comments, with most of the comments posted in Oriya, Hindi, and English. Latin script is heavily used alongside the native script for Oriya and Hindi.

All the data sets sourced from YouTube (DABP, DIndPak, DBenMovie, and DOTV) are from channels with subscriber counts in the millions. Each video is heavily commented on, and thus the corpus is a strong indicator of how Indian internet users express themselves. Other data set statistics, such as the sizes of the train and test sets, are presented in Table 1.

Data set     Train Size   Cluster Train Size   Eval Size
DOTV         90,225       20,000               200
DABP         134,858      20,000               200
DIndPak      615,258      20,000               200
DBenMovie    66,260       20,000               200
DEuro        420,000      20,000               210,000

Table 1: Data set sizes. The train size is the number of documents used in the language modeling phase. The cluster train size is the size of the sample used for the k-means clustering phase, and the eval size is the size of the evaluation corpus against which performance is reported.
Preprocessing: For each document in the corpus (either a sentence or a short post), we first lowercase the text. This step has no effect in some situations (the Devanagari script, for instance). Then punctuation is stripped, and finally emojis are removed using a popular module (https://github.com/carpedm20/emoji).
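A sketch of this pipeline is shown below. It strips ASCII punctuation only (handling the full range of Unicode punctuation takes more care) and assumes the `replace_emoji` helper available in recent releases of the emoji package.

```python
import string
import emoji  # https://github.com/carpedm20/emoji

def preprocess(text):
    text = text.lower()  # no-op for unicased scripts such as Devanagari
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = emoji.replace_emoji(text, replace="")
    return text.split()
```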
5 Results

5.1 Language Identification

For each of the corpora mentioned in Section 4, we report results on the respective held-out set. Each held-out set is annotated by annotators proficient in the language. From the intersection on which the annotators agree, we sample a test set and report precision and recall values for each language discovered by the annotators in the corpus.
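The per-language numbers reported in the tables below can be computed with standard tooling; a small sketch, with hypothetical label lists standing in for the annotated held-out set:

```python
from sklearn.metrics import accuracy_score, classification_report

y_true = ["Oriya (E)", "English", "Oriya (E)", "Hindi (E)"]  # annotator labels
y_pred = ["Oriya (E)", "English", "Oriya (E)", "English"]    # model output

print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=2))  # per-language P, R, F1
```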


We evaluate our model against two popular models: (i) fastText, a popular open source model that can identify 176 languages (Joulin et al., 2017; Facebook, 2016), and (ii) GoogleLangId (Google), a commercial solution that supports over 100 languages.
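A minimal sketch of querying the released 176-language fastText model; the model file name follows the public release, and the example comment is illustrative.

```python
import fasttext

lid = fasttext.load_model("lid.176.bin")  # released fastText LID model

# A Bengali comment written in the Latin script:
labels, probs = lid.predict("darun hoyeche video ta")
print(labels[0], probs[0])  # often '__label__en', the failure mode we observe
```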


Our results are presented in Tables 2, 3, 4, and 5. For each corpus, we present precision (P) and recall (R) values for the languages discovered, in addition to overall accuracy. If a language is written in multiple scripts in a corpus, our technique splits them into different clusters, and we report precision and recall independently.

We discuss and summarize the results below.

DABP: We set the number of clusters K to 4. The annotators discovered 5 languages. The fastText model mislabeled all documents containing Hindi and Bengali written in the Latin script as English, resulting in low recall for Hindi and low precision for English.

DIndPak: We set K to 3. The annotators discovered 3 languages. The fastText model mislabeled all documents containing Hindi written in the Latin script as English.

DBenMovie: We set K to 4. The annotators discovered 4 languages. The fastText model mislabeled Hindi and Bengali written in the Latin script as English.

DOTV: We set K to 4. The annotators discovered 4 languages. Neither the fastText model nor the GoogleLangId model was able to identify the Oriya corpus. fastText predicted a variety of languages for documents written in Oriya, and GoogleLangId assigned them the label unk (unknown). In a low-resource setting, both these labels are of little use: the former is completely inaccurate, and the latter would not be useful for disambiguation where there are multiple low-resource languages.

DEuro: We set K to 21 and observe that our model's performance is on par with fastText. We did not evaluate against GoogleLangId due to prohibitive costs, and it is reasonable to expect very high accuracy due to the clean nature of the corpus. Our method is near-perfect and on par with fastText: our model's accuracy is 99.9% versus 99.3% for fastText. We do not provide a language-wise breakdown due to the high accuracy values of both models.

DnewsMix: Additionally, we mixed the two news corpora (DABP and DOTV) to evaluate our model's ability to separate Oriya and Bengali, two similar languages. We obtained an accuracy of 96.22%, as compared to 29% by fastText and 59% by GoogleLangId. The detailed performance breakdown is presented in Table 6.

Method         Accuracy   Language            P      R      F1
Our Method     0.985      Oriya (E) (65.5%)   1.0    0.98   0.99
                          Oriya (6.5%)        1.0    1.0    1.0
                          English (18.5%)     1.0    1.0    1.0
                          Hindi (E) (9.5%)    0.86   1.0    0.93
fastText       0.185      Oriya (E) (65.5%)   0.0    0.0    0.0
                          Oriya (6.5%)        0.0    0.0    0.0
                          English (18.5%)     0.23   1.0    0.38
                          Hindi (E) (9.5%)    0.0    0.0    0.0
GoogleLangId   0.26       Oriya (E) (65.5%)   0.0    0.0    0.0
                          Oriya (6.5%)        0.0    0.0    0.0
                          English (18.5%)     0.92   0.97   0.95
                          Hindi (E) (9.5%)    0.38   0.84   0.52

Table 2: Performance evaluation on DOTV. Language written in Latin script is indicated with (E). The percentage of the ground truth assigned each label is indicated for each language. The best metric is highlighted in bold for each language. We follow this same convention in all other tables summarizing performance evaluation.

Method         Accuracy   Language            P      R      F1
Our Method     0.96       Bengali (E) (54%)   1.0    0.95   0.98
                          Bengali (22.5%)     1.0    1.0    1.0
                          English (18%)       1.0    0.92   0.96
                          Hindi (E) (5%)      0.53   1.0    0.69
                          Hindi (0.5%)        0.0    0.0    0.0
fastText       0.4        Bengali (E) (54%)   0.0    0.0    0.0
                          Bengali (22.5%)     1.0    1.0    1.0
                          English (18%)       0.34   0.94   0.50
                          Hindi (E) (5%)      0.0    0.0    0.0
                          Hindi (0.5%)        1.0    1.0    1.0
GoogleLangId   0.91       Bengali (E) (54%)   0.99   0.87   0.93
                          Bengali (22.5%)     0.98   1.0    0.99
                          English (18%)       0.97   0.97   0.97
                          Hindi (E) (5%)      0.5    0.7    0.58
                          Hindi (0.5%)        0.13   1.0    0.22

Table 3: DABP.

Method         Accuracy   Language            P      R      F1
Our Method     0.92       Bengali (E) (38%)   0.98   0.84   0.91
                          English (32%)       0.85   0.97   0.91
                          Bengali (24%)       1.0    0.96   0.98
                          Hindi (E) (6%)      0.69   0.92   0.79
fastText       0.56       Bengali (E) (38%)   1.0    0.01   0.02
                          English (32%)       0.51   1.0    0.67
                          Bengali (24%)       1.0    0.98   0.99
                          Hindi (E) (6%)      0.0    0.0    0.0
GoogleLangId   0.91       Bengali (E) (38%)   0.98   0.84   0.91
                          English (32%)       0.87   0.97   0.92
                          Bengali (24%)       0.98   1.0    0.99
                          Hindi (E) (6%)      0.89   0.67   0.76

Table 4: DBenMovie.

Method         Accuracy   Language            P      R      F1
Our Method     0.97       Hindi (E) (62%)     0.99   0.95   0.97
                          English (35.5%)     0.92   0.99   0.95
                          Hindi (2.5%)        1.0    1.0    1.0
fastText       0.375      Hindi (E) (62%)     1.0    0.01   0.02
                          English (35.5%)     0.41   0.97   0.57
                          Hindi (2.5%)        1.0    1.0    1.0
GoogleLangId   0.89       Hindi (E) (62%)     0.98   0.84   0.90
                          English (35.5%)     0.91   0.96   0.93
                          Hindi (2.5%)        0.71   1.0    0.83

Table 5: DIndPak.

Method         Accuracy   Language            P      R      F1
Our Method     0.96       Oriya (E) (32.4%)   0.98   0.98   0.98
                          Bengali (E) (28.4%) 1.0    0.95   0.98
                          English (17.6%)     0.97   0.92   0.95
                          Bengali (10.67%)    1.0    1.0    1.0
                          Hindi (E) (7%)      0.7    0.97   0.82
                          Oriya (3%)          1.0    1.0    1.0
                          Hindi (0.4%)        0.0    0.0    0.0
fastText       0.28       Oriya (E) (32.4%)   0.0    0.0    0.0
                          Bengali (E) (28.4%) 1.0    0.01   0.02
                          English (17.6%)     0.26   0.97   0.42
                          Bengali (10.67%)    1.0    1.0    1.0
                          Hindi (E) (7%)      0.0    0.0    0.0
                          Oriya (3%)          0.0    0.0    0.0
                          Hindi (0.4%)        1.0    1.0    1.0
GoogleLangId   0.59       Oriya (E) (32.4%)   0.0    0.0    0.0
                          Bengali (E) (28.4%) 0.69   0.88   0.78
                          English (17.6%)     0.87   0.96   0.91
                          Bengali (10.67%)    0.50   1.0    0.67
                          Hindi (E) (7%)      0.42   0.81   0.55
                          Oriya (3%)          0.0    0.0    0.0
                          Hindi (0.4%)        0.05   1.0    0.1

Table 6: DnewsMix.

5.2 Code-switching Detection

6 Conclusion

In this paper, we demonstrate an unsupervised solution for language identification in a corpus. The model utilizes paragraph embeddings from a language model trained on said corpus, and these embeddings are able to identify the language with high precision. We construct corpora that are challenging for existing language identification systems and demonstrate the ability of our solution to operate with high accuracy in these settings, outperforming a popular open source model and a popular commercial solution. Our model has significant applications in the annotation, analysis, and mining of low-resource languages. We demonstrate the accuracy on corpora comprising two low-resource languages, Bengali and Oriya.

When evaluating our model on a recent challenging task for language identification systems (Zampieri et al., 2015), we observed that the model is unable to disambiguate between very similar languages. For instance, Malay and Indonesian, Urdu and Hindi, and so on are different dialects of the same base languages. Our model was unable to disambiguate between these very similar language pairs (a task that is difficult for annotators as well). A good future direction is to explore whether language modeling alone can distinguish between such language pairs.


References

Abdelmalek Amine, Zakaria Elberrichi, and Michel Simonet. 2010. Automatic language identification: An alternative unsupervised approach using a new hybrid algorithm. IJCSA, 7(1):94–107.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2016. A simple but tough-to-beat baseline for sentence embeddings.

Javed A Aslam and Meredith Frost. 2003. An information-theoretic measure for document similarity. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 449–450. ACM.

Timothy Baldwin and Marco Lui. 2010. Language identification: The long and the short of the matter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 229–237. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. arXiv preprint arXiv:1807.03121.

Xilun Chen and Claire Cardie. 2018. Unsupervised multilingual word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 261–270, Brussels, Belgium. Association for Computational Linguistics.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Facebook. 2016. fastText: Language identification. [Online; accessed 12-May-2019].

Google. GoogleLangID. [Online; accessed 12-May-2019].

William L Hamilton, Kevin Clark, Jure Leskovec, and Dan Jurafsky. 2016. Inducing domain-specific sentiment lexicons from unlabeled corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2016, page 595. NIH Public Access.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pages 260–270.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 302–308.

Wang Ling, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015. Two/too simple adaptations of word2vec for syntax problems. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1299–1304.

Bruno Martins and Mário J. Silva. 2005. Language identification in web pages. In Proceedings of the 2005 ACM Symposium on Applied Computing, SAC '05, pages 764–768, New York, NY, USA. ACM.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.


Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC, volume 2012, pages 2214–2218.

PV Veena, M Anand Kumar, and KP Soman. 2018. Character embedding for language identification in Hindi-English code-mixed social media text. Computación y Sistemas, 22(1):65–74.

Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, and Preslav Nakov. 2015. Overview of the DSL shared task 2015. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, pages 1–9.

Wei Zhang, Robert AJ Clark, Yongyuan Wang, and Wen Li. 2016. Unsupervised language identification based on latent dirichlet allocation. Computer Speech & Language, 39:47–66.
