
All materials are here:

https://docs.google.com/spreadsheets/d/1sh2Lx0iAKRkjZasMbsFDvv4zFoAuIrSoLWyxaU9s51A/edit?usp=sharing

SeaMLS 2019
Lessons Learned

@philip.thomas
What I’ve learned about hiring from SeaMLS ’19
● Traveloka is perceived as a mysterious and secretive company by many students
● Good students are already taken (the ones with relevant experience and a good GPA)
○ My target hire (3.99 GPA, great personality) was hired by GDP AI Labs (GDP may not have the brand name, but they have the speed)
○ Companies are making their moves very early in the process; students are being contacted starting from year 3 and year 4
● If we want to focus on students, the approach needs to be more personal
○ Many senior executives go out to dinner with groups of students. For example, GDP AI Labs:
■ They invite top-performing students to dinner
■ The seats alternate between students and GDP’s AI engineers
■ Students can exchange contacts and ask career questions later on
● What students care about is brand and speed. We need to be fast, and build our tech brand:
○ Gojek builds its brand by inviting top-performing students to top tech conferences, e.g. SeaMLS (so some of the spots are reserved for students!)
○ GDP also does this for its hiring leads
What I learned from the content
The summer school gave us a glimpse of what they are doing at DeepMind, Facebook Research, and OpenAI
● SeaMLS material is at the level of a graduate deep learning/artificial intelligence course
● These guys focus on breaking boundaries while, at the same time, looking for opportunities to use these bleeding-edge techniques to solve real-world problems
● These guys are good, like really good. We got a sense of how world-class deep learning experts do their work.
○ We saw that they too keep track of matrix dimensions when they write their code
○ They know the expected results of the training before they even run it (see the sanity-check sketch after this slide)
■ They can calculate the expected value of the loss function off the top of their heads
■ They can calculate the lower and upper bounds of the possible loss given a certain neural network architecture
■ This is BEFORE training the model
● They go into the math. Real deep. They focus on fundamental math and use it to expand their knowledge and understanding
● We have a lot to learn if we want to set up a world-class ML research company
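One concrete instance of the “know the loss before training” point (my own illustration, not taken from the slides): a C-class softmax classifier initialized to near-uniform predictions should start with a cross-entropy loss of roughly ln(C). A minimal numpy sanity check, with an illustrative class count:

```python
import numpy as np

# Sanity check of the "know the loss before training" idea:
# a randomly initialized softmax classifier predicts roughly uniform
# probabilities, so its cross-entropy loss should start near ln(C).
num_classes = 10                               # illustrative class count
expected_initial_loss = np.log(num_classes)    # ~2.303 for 10 classes

# Simulate uniform predictions over random labels and compare.
n = 1000
probs = np.full((n, num_classes), 1.0 / num_classes)
labels = np.random.randint(0, num_classes, size=n)
empirical_loss = -np.mean(np.log(probs[np.arange(n), labels]))

print(f"expected ~ {expected_initial_loss:.3f}, simulated ~ {empirical_loss:.3f}")
```

If the first training batches report a loss far from this value, something is likely wrong (bad initialization, mislabeled data, a bug in the loss), which is the kind of back-of-the-envelope check the lecturers do in their heads.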
Day 1 - Lessons Learned
● Knowing math goes a long way
● All the math you need in this field is linear algebra, differential equations, statistics, and some optimization theory
● The deeper you go, the deeper your understanding of this math needs to be
○ Basic: what is a matrix
○ Intermediate: why regularization works
○ Advanced: how to use Jensen’s inequality and KL divergence to get around the unobserved posterior with the ELBO in a variational autoencoder (see the derivation after this list)
● Start with the basics!!!!
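To make the “advanced” item concrete, here is the standard derivation of the ELBO via Jensen’s inequality, summarized from textbook material rather than copied from the slides. For any approximate posterior q(z):

```latex
\log p(x) = \log \int p(x,z)\,dz
          = \log \mathbb{E}_{q(z)}\!\left[\frac{p(x,z)}{q(z)}\right]
          \;\ge\; \mathbb{E}_{q(z)}\!\left[\log \frac{p(x,z)}{q(z)}\right]
          = \underbrace{\mathbb{E}_{q(z)}[\log p(x \mid z)]}_{\text{reconstruction}}
            \;-\; \underbrace{\mathrm{KL}\!\left(q(z)\,\|\,p(z)\right)}_{\text{regularizer}}
          \;=:\; \mathrm{ELBO}(q)
```

The inequality is exactly Jensen’s inequality applied to the concave log; the gap between log p(x) and the ELBO is KL(q(z) || p(z | x)), which is why maximizing the ELBO pulls q toward the unobserved posterior.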
Day 1 - Lessons Learned (Cont’d)
● There is a template for machine learning problems. Basically you need to know 4 things:
○ Loss function
○ Maximum likelihood estimation of the loss function (maximizing likelihood is equivalent to minimizing expected risk)
○ Optimization
○ Regularization (it was also discussed how regularization links to Bayesian estimation)
● Model selection is still done through trial and error (unfortunately!!)
● Feature engineering tricks (sketched below):
○ Transformation (sigmoidal transform, log transform)
○ Normalization (centering, unit variance, clipping)
○ Dimension reduction
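A minimal numpy sketch of the feature engineering tricks listed above; the data, thresholds, and number of components are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 3))   # skewed, positive features

# Transformation: log transform compresses the heavy right tail
x_log = np.log1p(x)

# Normalization: centering and unit variance, per feature
x_norm = (x_log - x_log.mean(axis=0)) / x_log.std(axis=0)

# Clipping: limit the influence of extreme outliers
x_clip = np.clip(x_norm, -3.0, 3.0)

# Dimension reduction: project onto the top-2 principal components via SVD
x_centered = x_clip - x_clip.mean(axis=0)
_, _, vt = np.linalg.svd(x_centered, full_matrices=False)
x_2d = x_centered @ vt[:2].T
print(x_2d.shape)   # (1000, 2)
```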
Day 2 - Lessons Learned
● There are plenty of unsupervised learning models:
○ k-means, mixture models, matrix factorization, soft clustering, hierarchical clustering, latent Dirichlet allocation, tensor factorization... and more
● Two schools of thought (contrasted in the sketch after this list):
○ Generative models: build a model of the world and use that world model to solve the problem
○ Distance-based models: build a model that solves the problem but can’t tell you much about the world
● Unsupervised learning is not based on some random theory (there is actually well-studied theoretical research that explains why it works)
○ Based on the label-switching problem → clusters don’t come with labels
○ The label-switching problem disappears if we use partition theory
○ Based on the Pitman-Yor process
● Learned the basics of NNs and some practical modeling tricks (consult the slides)
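A small scikit-learn sketch contrasting the two schools of thought: a distance-based model (k-means, hard assignments) versus a generative model (a Gaussian mixture, which models the data distribution and gives soft cluster probabilities). The data and cluster count are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Three illustrative Gaussian blobs in 2D
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-3, 0, 3)])

# Distance-based: k-means gives hard cluster assignments, no model of p(x)
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Generative: a Gaussian mixture models p(x) and yields soft assignments
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft_labels = gmm.predict_proba(X)       # shape (300, 3), rows sum to 1
log_likelihood = gmm.score_samples(X)    # a generative model can score new points

print(hard_labels[:5], soft_labels[0].round(2), log_likelihood[:2].round(2))
```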
Day 3 - Lessons Learned
● Word representation is about trying to find the meaning of words
○ Which translates to finding word representations that capture the relationships between words
○ Knowledge-based models (e.g. WordNet) and corpus-based models (high/low-dimensional word vectors)
● The word embedding approach then supersedes the earlier word representation approaches
○ Goal: predict the surrounding words within a window of each word (the objective is written out after this list)
○ Everything is discussed here:
■ the probabilistic representation, the cost functions,
■ and the derivation of the cost functions to do SGD
○ Word2vec variants: skip-gram, CBOW, language models
○ However, word2vec doesn’t take into account global co-occurrences of words
● Enter GloVe: Global Vectors for Word Representation (which combines the best of both worlds)
● Then word representation evaluation was discussed: intrinsic (meaning) vs extrinsic (task-related)
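For reference, here is the skip-gram objective the slides refer to (predict the surrounding words within a window of size m); this is the standard formulation, summarized here rather than copied from the lecture:

```latex
J(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p\!\left(w_{t+j} \mid w_t\right),
\qquad
p(o \mid c) = \frac{\exp\!\left(u_o^{\top} v_c\right)}{\sum_{w \in V} \exp\!\left(u_w^{\top} v_c\right)}
```

where v_c is the center-word vector, u_o the context-word vector, and V the vocabulary; differentiating J(θ) with respect to these vectors gives the SGD updates mentioned above.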
Day 3 - Lessons Learned (Cont’d)
● Sequence models are all about autoregressive modelling, and the goal is to model the process of text generation
○ Iterative decoding vs parallel decoding
● Learned about some bleeding-edge research that is currently ongoing for seq2seq/NLP models
○ BERT: an undirected neural sequence model
○ Monotonic and non-monotonic sequential generation (a quicksort-inspired technique)
○ Training and inference should be treated differently; they are different algorithmic problems (even though they come from the same models)
○ Search is also a separate problem in sequential models
● Practical session
○ Going line by line and step by step through building a simplified version of GPT-2
○ Learned about attention models (to be honest I lost it here; a minimal sketch follows below)
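Since the attention part was easy to lose track of, here is a minimal numpy sketch of scaled dot-product attention, the core operation inside GPT-2-style models; the shapes are illustrative, and this is a simplification rather than the practical session’s code:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_queries, n_keys) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted sum of value vectors

# Toy example: 4 tokens, 8-dimensional embeddings (illustrative sizes)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)          # self-attention: Q = K = V = x
print(out.shape)  # (4, 8)
```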
Day 4 - Lessons Learned
● CNNs are all about adding structure to the model. CNNs work better for images because of these things:
○ Hierarchical representation → has to do with receptive fields
○ Locality of data → local filters
○ Translation invariance → weight sharing
● Goes deep into the different layers of a CNN:
○ Conv layer → exploits the locality of data
○ Pooling and unpooling layers → playing with receptive fields
○ Dilated conv, separable conv, grouped conv, etc.
● Then discussed the favorite CNN flavors out there: AlexNet, VGG, Inception, ResNets
○ Whenever possible, start from pretrained versions of these models! (see the sketch below)
● You can use CNNs beyond images (for video, sound, graph convnets, etc.)
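A minimal sketch of the “use pretrained models” advice using torchvision; the weights argument assumes a recent torchvision version, and the 5-class head is an illustrative assumption, not the speakers’ setup:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet pretrained on ImageNet and reuse it as a frozen feature extractor.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False          # freeze the pretrained backbone

# Replace the final classification layer for a new task (5 classes, illustrative).
model.fc = nn.Linear(model.fc.in_features, 5)

# Only the new head has trainable parameters now.
x = torch.randn(2, 3, 224, 224)          # dummy batch of images
logits = model(x)
print(logits.shape)                      # torch.Size([2, 5])
```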
Day 4 - Lessons Learned (Cont’d)
● Multimodal learning was the session that awakened my inner fear about AI advancement
○ Literally, at the conference, and I quote, Douwe Kiela, a researcher from Facebook, said this:
■ “There are two schools of thought regarding innovation in AI. One is to control and therefore limit the growth of AI. The other is to set AI free and push the limit as far as possible. I am in the second school, because it’s better for the universe although it’s not better for mankind.”
● Machine learning has made progress in the unimodal world. But the real world is multimodal.
● A lot of crazy math here describing the use of representations, attention mechanisms, and fusion
○ Tbh it’s over my head at this time (a toy fusion example is sketched below)
● But using that crazy math enables you to do awesome things:
○ Create inference machines that are able to combine text, image, speech, video, even smell
○ In the presentation he showed a walking robot that is given orders in the form of free-flow text and combines its interpretation of that text with its visual observations
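To make the fusion idea a bit more concrete, here is a toy numpy sketch of late fusion, the simplest way to combine modalities: embed each modality separately and concatenate before a joint classifier. Everything here (dimensions, class count, random weights) is illustrative and not the architecture that was presented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend embeddings coming out of separate unimodal encoders (illustrative sizes)
image_emb = rng.normal(size=(1, 512))    # e.g. from a CNN
text_emb = rng.normal(size=(1, 300))     # e.g. from word vectors

# Late fusion: concatenate modality embeddings into one joint representation
fused = np.concatenate([image_emb, text_emb], axis=-1)   # shape (1, 812)

# A joint linear classifier on top of the fused representation (random weights here)
W = rng.normal(size=(fused.shape[-1], 4))                # 4 illustrative classes
logits = fused @ W
print(logits.shape)  # (1, 4)
```

More sophisticated approaches replace the concatenation with attention between modalities, which is where the “crazy math” in the session comes in.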
Day 5 - Lessons Learned
● I only attended one session on day 5, as I had an outside meeting with external parties about the possibility of Traveloka contributing to developing AI talent in Indonesia (with Google, Gojek, Bukalapak, and Tokopedia)
● The deep learning community is giving new attention to deep probabilistic graphical models
○ These are called deep PGMs
● The goal of a deep PGM is to learn the whole representation space using representation learning techniques
○ The representation learning is done by minimizing the distance between two probability distributions (Kullback-Leibler, Wasserstein, etc.)
○ For the KL case, this is equivalent to maximizing the ELBO
● Once the model is correctly learned, you can do inference using only some parts of the graph, or the whole graph
○ This makes it an interesting fit for self-supervised learning (a toy ELBO objective is sketched below)
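Relating the last two bullets: for a VAE-style deep PGM with a diagonal-Gaussian encoder, the objective that gets maximized is the ELBO (equivalently, the KL to the true posterior gets minimized), which splits into a reconstruction term plus a closed-form KL regularizer. A hedged numpy sketch of that objective, with made-up toy numbers and a Gaussian reconstruction term:

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def negative_elbo(x, x_recon, mu, logvar):
    """Negative ELBO per example = reconstruction error + KL regularizer."""
    recon = np.sum((x - x_recon) ** 2, axis=-1)   # Gaussian-likelihood reconstruction term
    return recon + gaussian_kl(mu, logvar)

# Toy numbers (illustrative only)
x = np.ones((2, 4)); x_recon = 0.9 * x
mu = np.zeros((2, 3)); logvar = np.zeros((2, 3))
print(negative_elbo(x, x_recon, mu, logvar))      # KL term is 0 when q equals the prior
```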
