Natural Language Processing 2019
The 2019 edition of the course at the Faculty of Mathematics, Physics and Informatics of Comenius University.
The current edition can be found at NLP 2020.
- Labs (I-8, or any other room by prior arrangement)
- Tuesday, 12:20 - 14:00 (voluntary)
- Lectures (I-8)
- Tuesday, 14:00 - 15:40 (voluntary)
Lectures
22nd of February
- Discussed material:
- Course information (see below)
- Basic Text Processing (slides)
- Regular Expressions
- Tokenization (see the regex sketch below)
- Text normalization
- Basically the first part of SLP Chapter II
- Supplementary resources:
- Eliza Bot Demo: one of the most famous use cases of regular expressions. It is really worth trying out -- you may end up having some surprisingly good conversations.
- Unix for poets: a nice 25 pages' worth of examples of how to process text on the Unix command line. Here is a shorter version by the authors of the SLP book.
- Scriptio continua: the reason why English also nearly ended up without word and sentence separators.
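As a small taste of regex-based tokenization, here is a minimal tokenizer in Python. The pattern is our illustrative choice (keep hyphenated words and contractions together, split off punctuation), not the one from the lecture:

```python
import re

# Words, possibly with internal hyphens or apostrophes, stay together;
# any other non-space character becomes a token of its own.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(text):
    # Lowercasing doubles as a crude form of text normalization.
    return TOKEN_RE.findall(text.lower())

print(tokenize("Mr. O'Neill doesn't like low-budget films!"))
# ['mr', '.', "o'neill", "doesn't", 'like', 'low-budget', 'films', '!']
```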
26th of February
- Discussed material:
- Byte-pair encoding (see the code sketch below)
- A quick description can be found at its Wiki page
- A more comprehensive description can be found in section 2.6 of SLP Chapter II
- A full description along with results from empirical experiments can be found in (Sennrich et al., 2016)
- Edit Distance (slides)
- Edit Distance
- Weighted Edit Distance
- Alignment
- The last part of SLP Chapter II
- (we did not discuss the bio applications ...)
- Intro to Language Modeling (slides)
- The first part of SLP Chapter III
- Supplementary resources:
- subsync: a tool for automatically synchronizing subtitles with video (a nice use case of alignment in a not-so-ordinary context)
- Autocomplete using Markov chains: a nice example (along with code in Python) that shows how Language Models can be used to generate "text resembling language" and build a simple 'autocomplete' engine.
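Since byte-pair encoding is best understood by running it, here is a compact sketch of its merge loop, closely following the pseudo-code in (Sennrich et al., 2016); the toy vocabulary and the number of merges are made up:

```python
import re
from collections import Counter

def get_stats(vocab):
    # Count how often each adjacent symbol pair occurs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Replace every occurrence of the pair with its concatenation.
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # the most frequent pair gets merged
    vocab = merge_vocab(best, vocab)
    print(best)
```

The first merges learned on this vocabulary are ('e', 's'), ('es', 't') and ('est', '</w>') -- subword units shared by 'newest' and 'widest'.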
5th of March
- Discussed material:
- Language modeling
- Estimating N-gram probabilities (see the sketch below)
- Perplexity and Language Model Evaluation
- Dealing with zeros
- Smoothing, backoff and interpolation
- "Stupid backoff"
- Most of SLP Chapter III
- Supplementary resources:
- Google Ngram Viewer -- a nice way of visualizing the rate of use of n-grams in books written in various languages. Check out this quick example for instance.
- kenlm -- an open-source language modeling toolkit. Probably best in class when it comes to speed and memory efficiency.
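To tie estimation and evaluation together, here is a toy bigram model with add-one (Laplace) smoothing and a perplexity computation; the three-sentence corpus is made up:

```python
import math
from collections import Counter

corpus = [['<s>', 'i', 'like', 'nlp', '</s>'],
          ['<s>', 'i', 'like', 'pizza', '</s>'],
          ['<s>', 'nlp', 'is', 'fun', '</s>']]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(b for sent in corpus for b in zip(sent, sent[1:]))
V = len(unigrams)   # vocabulary size

def prob(w2, w1):
    # Add-one smoothed bigram probability P(w2 | w1): no more zeros.
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

def perplexity(sent):
    log_p = sum(math.log(prob(w2, w1)) for w1, w2 in zip(sent, sent[1:]))
    return math.exp(-log_p / (len(sent) - 1))   # normalize by the number of predictions

print(perplexity(['<s>', 'i', 'like', 'nlp', '</s>']))    # seen bigrams: lower
print(perplexity(['<s>', 'pizza', 'is', 'fun', '</s>']))  # unseen bigrams: higher
```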
12th of March
- Discussed material:
- From n-grams to Neural Networks
- n-gram language modeling recap
- big(ger) data usually helps
- Feed-forward neural language model(s)
- Word2Vec
- CBOW
- Skip-gram
- Negative sampling (see the sketch below)
- Language Modelling and RNNs Part 1
- Supplementary resources:
- Embedding Projector -- a nice way of visualizing distributed representations obtained using Word2Vec in 2D and 3D from the TensorFlow project
- Word2Vec example by ML5JS -- a simple example of the analogy task Word2Vec has become famous for, using a nice JavaScript library called ML5JS
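To make negative sampling concrete, here is a bare-bones single training step of skip-gram with negative sampling in NumPy. This is a sketch of the objective, not a faithful reimplementation of Word2Vec; the sizes, learning rate and sampled indices are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 100, 16                       # vocabulary size, embedding dimension
W_in = rng.normal(0, 0.1, (V, D))    # target ("input") word vectors
W_out = rng.normal(0, 0.1, (V, D))   # context ("output") word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, k=5, lr=0.05):
    # One positive (center, context) pair plus k uniformly sampled negatives
    # (real Word2Vec samples negatives from a smoothed unigram distribution).
    negatives = rng.integers(0, V, size=k)
    v = W_in[center].copy()
    for word, label in [(context, 1.0)] + [(int(n), 0.0) for n in negatives]:
        u = W_out[word].copy()
        g = sigmoid(v @ u) - label    # gradient of the logistic loss w.r.t. the score
        W_out[word] -= lr * g * v
        W_in[center] -= lr * g * u

sgns_step(center=3, context=7)
```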
19th of March
- Discussed material:
- Language Modelling and RNNs Part 2
- Simple RNNs
- Vanishing and Exploding Gradient
- Long Short Term Memory (LSTM) -- see the cell-step sketch below
- Gated Recurrent Unit (GRU)
- Dropout
- Supplementary resources:
- SCIgen - An Automatic CS Paper Generator -- exactly what it sounds like.
- The Unreasonable Effectiveness of Recurrent Neural Networks -- at this point a classical blog post in Deep Learning blogosphere by Andrej Karpathy on what RNNs are capable of. Still worth checking out.
- The unreasonable effectiveness of Character-level Language Models -- a quick "setting-the-record-straight" response to the previous blog post by Yoav Goldberg. Note that it is not entirely critical, as its subtitle (and why RNNs are still cool) suggests. Definitely worth a read.
- Unsupervised Sentiment Neuron -- a blogoduction (introduction-via-blog) by OpenAI of their work on training character-level language models on Amazon reviews, which happen to pick up the notion of "sentiment" on their own.
- Understanding LSTM Networks -- another classic which does a great job introducing LSTMs from the ground up with nicely done visualizations using the "circuit board" metaphor.
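To connect the LSTM gate equations with code, here is a single cell step written out in NumPy; the sizes and the random weights are placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 8, 16                            # input size, hidden size
W = rng.normal(0, 0.1, (4 * H, D + H))  # all four gates stacked into one matrix
b = np.zeros(4 * H)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    z = W @ np.concatenate([x, h]) + b
    i = sigmoid(z[0*H:1*H])   # input gate
    f = sigmoid(z[1*H:2*H])   # forget gate
    o = sigmoid(z[2*H:3*H])   # output gate
    g = np.tanh(z[3*H:4*H])   # candidate cell state
    c_new = f * c + i * g     # additive update -- this is what tames vanishing gradients
    h_new = o * np.tanh(c_new)
    return h_new, c_new

h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c)
```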
26th of March
- Discussed material:
- Introduction to PyTorch
- Tensors
- Operations with Tensors
- Computation Graphs and Automatic Differentiation
- Language Modeling with Recurrent Neural Networks
- Loading and preprocessing Penn Treebank
- RNN-based Language Model in PyTorch (see the skeleton below)
- Model Training and Testing
- Sampling from the model
- Supplementary resources:
- RNNLM Lecture Notes from Stanford's CS224N -- I also strongly encourage you to check out the lecture slides
- Visualizing memorization in RNNs -- a nice exploration of the memorization capability of Recurrent Neural Networks.
- A summary of (mostly recent) state-of-the-art perplexities for language models. This list at NLPProgress.com is probably even more recent.
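For reference, the skeleton of an RNN-based language model of the kind built in this session can be quite short in PyTorch. The hyperparameters and the random batch below are stand-ins for the actual Penn Treebank pipeline:

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        output, hidden = self.rnn(self.embed(x), hidden)
        return self.out(output), hidden   # logits over the vocabulary

model = RNNLM(vocab_size=10000)
x = torch.randint(0, 10000, (2, 35))   # fake batch of token ids
y = torch.randint(0, 10000, (2, 35))   # in practice: x shifted by one token
logits, _ = model(x)
loss = nn.functional.cross_entropy(logits.reshape(-1, 10000), y.reshape(-1))
```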
2nd of April
- Discussed material:
- Spelling Correction and the Noisy Channel
- Spelling Correction task
- Noisy Channel model (see the sketch below)
- Damerau-Levenshtein edit distance
- Text Classification and Naive Bayes
- Text Classification task
- Bag of Words representation
- Naive Bayes classifier
- Classification metrics: accuracy, precision, recall, F score
- Micro vs Macro averaging
- Supplementary resources:
- How to Write a Spelling Corrector: a classic article by Peter Norvig which describes what it takes to create a simple spelling corrector in practice.
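In the spirit of Norvig's article linked above, here is a miniature noisy channel corrector: candidates within one edit, a unigram language model, and a channel model simplified to uniform. The word counts are made up:

```python
from collections import Counter

WORDS = Counter('the quick brown fox likes the lazy dog the end'.split())
LETTERS = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings one insert, delete, replace or transposition away.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    # Prefer known words, then known words one edit away; break ties by frequency.
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=lambda w: WORDS[w])

print(correct('thw'))   # -> 'the'
```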
9th of April
- Discussed material:
- Sentiment Analysis
- Sentiment Analysis Task
- Sentiment Analysis using Naive Bayes classification (see the sketch below)
- Lexicon-based approaches
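A multinomial Naive Bayes sentiment classifier with add-one smoothing fits into a few lines; the "training set" below is made up:

```python
import math
from collections import Counter, defaultdict

train = [('great movie loved it', 'pos'), ('what a great cast', 'pos'),
         ('boring plot hated it', 'neg'), ('terrible boring movie', 'neg')]

class_words = defaultdict(list)
for text, label in train:
    class_words[label].extend(text.split())

vocab = {w for words in class_words.values() for w in words}
priors = {c: math.log(sum(1 for _, l in train if l == c) / len(train))
          for c in class_words}
counts = {c: Counter(words) for c, words in class_words.items()}

def predict(text):
    def score(c):
        total = sum(counts[c].values())
        # log P(c) + sum of add-one smoothed log P(w | c); unknown words skipped
        return priors[c] + sum(math.log((counts[c][w] + 1) / (total + len(vocab)))
                               for w in text.split() if w in vocab)
    return max(class_words, key=score)

print(predict('loved the cast'))   # -> 'pos'
```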
15th of April
- Discussed material:
- Text Classification with fastText
- Text Classification
- Introduction to fastText
- Simple fastText classification model (see below)
- Combining word embeddings with embedding of n-grams
- Metrics used in context of classification tasks (accuracy / precision / recall)
- Importance of preprocessing
- Supplementary resources:
- Bag of Tricks for Efficient Text Classification: the paper which introduced the fastText classifier as a simple yet competitive baseline (compared to deep learning models) with a much more efficient training procedure.
- Enriching Word Vectors with Subword Information: the paper which presents an interesting approach to dealing with Out-Of-Vocabulary words that is already present in fastText -- combining word vectors with embeddings of character n-grams.
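In practice, the fastText classifier discussed above takes only a few lines with the official Python bindings. The file names below are hypothetical; the expected format is one example per line, prefixed with __label__<class>:

```python
import fasttext

# train.txt (hypothetical), e.g.:
#   __label__positive loved every minute of it
#   __label__negative utterly boring
model = fasttext.train_supervised(
    input='train.txt',
    epoch=10,
    wordNgrams=2,   # include word bigrams, as in the "bag of tricks" paper
)

print(model.predict('a surprisingly good film'))
print(model.test('valid.txt'))   # (number of examples, precision@1, recall@1)
```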
30th of April
- Discussed material:
- Part of Speech tagging
- Parts of Speech
- Part of Speech tagging as an NLP problem (see the spaCy sketch below)
- Feature-based Part of Speech tagging
- Named Entity Recognition
- The task of extracting knowledge from text
- Finding and Classifying Named Entities
- Named Entity Recognition as a Sequence Modeling task
- Inference in Sequence Modeling
- Supplementary resources:
- Penn Treebank tag set: a set of tags used in the Penn Treebank (there are roughly 45 of them -- those who've been doing NLP long enough allegedly know them by heart)
- spaCy NER demo and AllenNLP NER demo: two quick demos of an industry-strength and a former-state-of-the-art NER system. Notice that the latter, powered by a pretty big neural network, is able to correctly pick up a location it almost certainly did not see in the training data.
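Both tasks from this lecture can be tried out in a few lines with spaCy (linked in the demos above); this sketch assumes the small English model has been installed with python -m spacy download en_core_web_sm:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Comenius University is located in Bratislava.')

for token in doc:
    print(token.text, token.pos_, token.tag_)   # coarse tag + Penn Treebank tag
for ent in doc.ents:
    print(ent.text, ent.label_)                 # e.g. 'Bratislava' tagged as GPE
```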
7th of May
- Discussed material:
- Sparse word representations
- Term Frequency
- Inverse Document Frequency
- TF-IDF (see the sketch below)
- Contextual representations: from word2vec to BERT
- Word vectors and issues with using them in context-free manner
- Representations with Language Models
- ELMo: Deep Contextual Word Embeddings
- Transformers and Self-Attention
- Masked Language Models
- NLP projects in the real world
- Maximizing an NLP project's risk of failure
- Machine Learning Hierarchy of Needs
- Building NLP projects by iterating on both the code and the data
- Supplementary resources:
- Understanding BERT Transformer: Attention is not all you need -- a very nice discussion of what Self-Attention-based models may actually be modeling.
- spaCy cheat sheet and spaCy course: a very nice list of features the spaCy library offers along with a quick interactive course where you can try it out without leaving the browser.
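The sparse TF-IDF representation can be computed by hand in a few lines. This sketch uses one common variant (raw-count term frequency and idf(t) = log(N / df(t))) on a made-up corpus:

```python
import math
from collections import Counter

docs = [d.split() for d in ('the cat sat on the mat',
                            'the dog sat on the log',
                            'cats and dogs')]
N = len(docs)
df = Counter(t for doc in docs for t in set(doc))   # document frequencies

def tfidf(term, doc):
    return doc.count(term) * math.log(N / df[term])

print(tfidf('cat', docs[0]))   # ~1.10: rare across documents, so highly weighted
print(tfidf('the', docs[0]))   # ~0.81: two occurrences, but discounted as common
```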
14th of May
- Discussed material:
- Lessons Learned in the industry by Ondrej Jariabka
- What to do with unbalanced datasets
- Dealing with sparsity (stratified sampling; see the sketch below)
- When in doubt, XGBoost
- Supplementary resources:
- So what else can NLP be useful for? Can it perhaps help cure cancer? Go out there, try it out and let us all know if it works!
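Regarding stratified sampling, a minimal sketch with scikit-learn's train_test_split (the data is made up) shows the point -- both splits keep the original 9:1 class ratio:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [0] * 90 + [1] * 10   # an unbalanced, 9:1 label distribution

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(sum(y_tr) / len(y_tr), sum(y_te) / len(y_te))   # both ~0.1
```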
Resources
- Speech and Language Processing, 3rd Edition -- Daniel Jurafsky, James H. Martin
- A Primer on Neural Network Models for Natural Language Processing -- Yoav Goldberg
- Neural Network Methods for Natural Language Processing -- Yoav Goldberg
Grading
Component | Weight |
---|---|
Assignments | 50% |
Project | 50% |
Assignments are available via the Moodle e-learning system, but they can also be found in the following repository on GitHub: https://github.com/NaiveNeuron/nlp-exercises
A list of project ideas can be found here.
Points | Grade |
---|---|
(90, inf] | A |
(80, 90] | B |
(70, 80] | C |
(60, 70] | D |
(50, 60] | E |
[0, 50] | FX |