The 2019 episode of the course at the Faculty of Mathematics, Physics and Informatics of Comenius University.
The current episode can be found at NLP 2020.
- Labs (I-8, or any other room by prior appointment)
Tuesday, 12:20 - 14:00 (voluntary)
- Lectures (I-8)
Tuesday, 14:00 - 15:40 (voluntary)
Lectures#
22nd of February#
- Discussed material:
- Course information (see below)
- Basic Text Processing (slides)
- Regular Expressions
- Tokenization (see the short sketch after this lecture's notes)
- Text normalization
- Basically the first part of SLP Chapter II
- Supplementary resources:
- Eliza Bot Demo: one of the most famous use cases of regular expressions. It is really worth trying out -- you may end up having some surprisingly good conversations.
- Unix for poets: a nice 25 pages' worth of examples of how to process text on the Unix command line. Here is a shorter version by the authors of the SLP book.
- Scriptio continua: the reason why English also nearly ended up without word and sentence separators.
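To make the tokenization material above a bit more concrete, here is a minimal sketch of a regex-based word tokenizer in Python. It is only illustrative -- the regular expression and the example sentence are assumptions, not taken from the slides.

```python
# A minimal sketch of regex-based tokenization with simple normalization
# (lowercasing); the pattern and example below are illustrative only.
import re

def tokenize(text):
    """Split text into lowercase word and punctuation tokens."""
    # \w+ matches runs of word characters, [^\w\s] matches single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(tokenize("Mr. O'Neill didn't say no!"))
# ['mr', '.', 'o', "'", 'neill', 'didn', "'", 't', 'say', 'no', '!']
```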
26th of February#
- Discussed material:
- Byte-pair encoding
- A quick description can be found at its Wiki page
- A more comprehensive description can be found in section 2.6 of SLP Chapter II
- A full description along with results from empirical experiments can be found in (Sennrich et al., 2016); a minimal sketch of the merge loop follows this lecture's notes
- Edit Distance (slides)
- Edit Distance
- Weighted Edit Distance
- Alignment
- The last part of SLP Chapter II
- (we did not discuss the bio applications …)
- Intro to Language Modeling (slides)
- The first part of SLP Chapter III
- Supplementary resources:
- subsync: a tool for automatically synchronizing subtitles with video (a nice use case of alignment in a not-so-ordinary context)
- Autocomplete using Markov chains: a nice example (along with code in Python) that shows how Language Models can be used to generate "text resembling language" and build a simple ‘autocomplete’ engine.
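As promised above, here is a minimal sketch of the byte-pair encoding merge loop, written in the spirit of the snippet from (Sennrich et al., 2016); the toy vocabulary and the number of merges are made up for illustration.

```python
# A minimal sketch of byte-pair encoding merges on a toy vocabulary.
import re
import collections

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs in the vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge the given symbol pair everywhere it occurs in the vocabulary."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# toy vocabulary: words are space-separated symbols ending with the marker </w>
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)  # most frequent pair becomes a new symbol
    vocab = merge_vocab(best, vocab)
    print(best)
```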
5th of March#
- Discussed material:
- Language modeling
- Estimating N-gram probabilities (see the sketch after this lecture's notes)
- Perplexity and Language Model Evaluation
- Dealing with zeros
- Smoothing, backoff and interpolation
- "Stupid backoff"
- Most of SLP Chapter III
- Supplementary resources:
- Google Ngram Viewer -- a nice way of visualizing the rate of use of n-grams in books written in various languages. Check out this quick example for instance.
- kenlm -- an open-source language modeling toolkit. Probably best in class when it comes to speed and memory efficiency.
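A minimal sketch of bigram probability estimation with add-one (Laplace) smoothing and perplexity, on a made-up toy corpus; the lecture covers other smoothing variants as well, this only shows the simplest one.

```python
# A minimal sketch of a bigram language model with add-one smoothing.
import math
from collections import Counter

corpus = [["<s>", "i", "like", "nlp", "</s>"],
          ["<s>", "i", "like", "deep", "learning", "</s>"],
          ["<s>", "nlp", "is", "fun", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1])
                  for sent in corpus for i in range(len(sent) - 1))
V = len(unigrams)  # vocabulary size used by add-one smoothing

def prob(prev, word):
    """P(word | prev) with add-one smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def perplexity(sentence):
    """Perplexity of a tokenized sentence under the bigram model."""
    log_prob = sum(math.log(prob(sentence[i], sentence[i + 1]))
                   for i in range(len(sentence) - 1))
    return math.exp(-log_prob / (len(sentence) - 1))

print(perplexity(["<s>", "i", "like", "nlp", "</s>"]))
```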
12th of March#
- Discussed material:
- From n-grams to Neural Networks
- n-gram language modeling recap
- big(ger) data usually helps
- Feed-forward neural language model(s)
- Word2Vec
- CBOW
- Skip-gram
- Negative sampling (see the sketch after this lecture's notes)
- Language Modelling and RNNs Part 1
- Supplementary resources:
- Embedding Projector -- a nice way of visualizing distributed representations obtained using Word2Vec in 2D and 3D from the TensorFlow project
- Word2Vec example by ML5JS -- a simple example of the analogy task Word2Vec has become famous for, using a nice JavaScript library called ML5JS
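A minimal PyTorch sketch of the skip-gram-with-negative-sampling objective discussed above; the vocabulary size, dimensionality, learning rate and the (center, context) pair are all made up for illustration, so this is not the training setup from the lecture.

```python
# A minimal sketch of skip-gram with negative sampling in PyTorch.
import torch
import torch.nn as nn

vocab_size, dim, k = 1000, 50, 5          # k negative samples per positive pair
in_emb = nn.Embedding(vocab_size, dim)    # "input" (center word) vectors
out_emb = nn.Embedding(vocab_size, dim)   # "output" (context word) vectors
optimizer = torch.optim.SGD(
    list(in_emb.parameters()) + list(out_emb.parameters()), lr=0.05)

def step(center, context):
    """One update for a single (center, context) pair with k random negatives."""
    center_v = in_emb(torch.tensor([center]))         # (1, dim)
    pos_v = out_emb(torch.tensor([context]))          # (1, dim)
    neg_v = out_emb(torch.randint(vocab_size, (k,)))  # (k, dim), random negatives
    pos_score = torch.sigmoid((center_v * pos_v).sum())
    neg_score = torch.sigmoid(-(neg_v @ center_v.squeeze(0)))
    loss = -(torch.log(pos_score) + torch.log(neg_score).sum())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(step(center=3, context=17))
```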
19th of March#
- Discussed material:
- Language Modelling and RNNs Part 2
- Simple RNNs
- Vanishing and Exploding Gradient
- Long Short Term Memory (LSTM) (see the PyTorch sketch after this lecture's notes)
- Gated Recurrent Unit (GRU)
- Dropout
- Supplementary resources:
- SCIgen - An Automatic CS Paper Generator -- exactly what it sounds like.
- The Unreasonable Effectiveness of Recurrent Neural Networks -- at this point a classic blog post in the Deep Learning blogosphere by Andrej Karpathy on what RNNs are capable of. Still worth checking out.
- The unreasonable effectiveness of Character-level Language Models -- a quick "setting-the-record-straight" response to the previous blog post by Yoav Goldberg. Note that it is not entirely critical, as its subtitle (and why RNNs are still cool) suggests. Definitely worth a read.
- Unsupervised Sentiment Neuron -- a blogoduction (introduction-via-blog) by OpenAI of their work on training character-level language models on Amazon reviews, which happen to pick up the notion of "sentiment" on their own.
- Understanding LSTM Networks -- another classic which does a great job introducing LSTMs from the ground up with nicely done visualizations using the "circuit board" metaphor.
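A minimal sketch of how the recurrent layers discussed above look in PyTorch; the shapes and hyperparameters are made up, and dropout is shown only in the form PyTorch applies between stacked recurrent layers.

```python
# A minimal sketch of simple RNN, LSTM and GRU layers in PyTorch.
import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size = 4, 10, 32, 64
x = torch.randn(batch, seq_len, input_size)

rnn = nn.RNN(input_size, hidden_size, batch_first=True)    # simple (Elman) RNN
lstm = nn.LSTM(input_size, hidden_size, num_layers=2,
               dropout=0.5, batch_first=True)               # dropout between the two layers
gru = nn.GRU(input_size, hidden_size, batch_first=True)     # GRU

out_rnn, h_rnn = rnn(x)               # output: (batch, seq_len, hidden_size)
out_lstm, (h_lstm, c_lstm) = lstm(x)  # LSTM also returns a cell state
out_gru, h_gru = gru(x)
print(out_rnn.shape, out_lstm.shape, out_gru.shape)
```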
26th of March#
- Discussed material:
- Introduction to PyTorch
- Tensors
- Operations with Tensors
- Computation Graphs and Automatic Differentiation
- Language Modeling with Recurrent Neural Networks
- Loading and preprocessing Penn Treebank
- RNN-based Language Model in PyTorch (see the sketch after this lecture's notes)
- Model Training and Testing
- Sampling from the model
- Supplementary resources:
- RNNLM Lecture Notes from Stanford’s CS224N -- I also strongly encourage you to check out the lecture slides
- Visualizing memorization in RNNs -- a nice exploration of the memorization capability of Recurrent Neural Networks.
- A summary of (mostly recent) state-of-the-art perplexities for language models. This list at NLPProgress.com is probably even more recent.
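A minimal sketch of the PyTorch building blocks from this lecture -- tensors, automatic differentiation and a tiny RNN-based language model. The vocabulary size and the random "data" are made up, so this is not the Penn Treebank setup from the lecture notebook.

```python
# A minimal sketch of autograd and a tiny RNN language model in PyTorch.
import torch
import torch.nn as nn

# tensors and automatic differentiation
w = torch.tensor([2.0, 3.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()
print(w.grad)  # d(loss)/dw = 2 * w -> tensor([4., 6.])

# a tiny RNN language model: embed -> LSTM -> project to vocabulary
class TinyRNNLM(nn.Module):
    def __init__(self, vocab_size=100, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        out, _ = self.lstm(self.embed(tokens))
        return self.proj(out)  # logits over the vocabulary at each position

model = TinyRNNLM()
tokens = torch.randint(100, (2, 8))   # batch of 2 sequences of 8 random token ids
logits = model(tokens)
# next-token prediction loss: predict token t+1 from the output at position t
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 100),
                                   tokens[:, 1:].reshape(-1))
print(logits.shape, loss.item())
```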
2nd of April#
- Discussed material:
- Spelling Correction and the Noisy Channel
- Spelling Correction task
- Noisy Channel model
- Damerau-Levenshtein edit distance
- Text Classification and Naive Bayes
- Text Classification task
- Bag of Words representation
- Naive Bayes classifier (see the sketch after this lecture's notes)
- Classification metrics: accuracy, precision, recall, F score
- Micro vs Macro averaging
- Supplementary resources:
- How to Write a Spelling Corrector: a classic article by Peter Norvig which describes what it takes to create a simple spelling corrector in practice.
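A minimal sketch of bag-of-words text classification with a Naive Bayes classifier, evaluated with the metrics listed above. It uses scikit-learn and a made-up toy dataset -- both are assumptions, not the tooling or data from the lecture.

```python
# A minimal sketch of bag-of-words Naive Bayes classification with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

train_texts = ["great movie, loved it", "terrible plot and awful acting",
               "what a wonderful film", "boring and bad"]
train_labels = ["pos", "neg", "pos", "neg"]

# bag-of-words counts followed by a multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

test_texts = ["loved the wonderful acting", "awful, boring movie"]
test_labels = ["pos", "neg"]
pred = model.predict(test_texts)
print(accuracy_score(test_labels, pred))
print(precision_recall_fscore_support(test_labels, pred, average="macro"))
```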
9th of April#
- Discussed material:
- Sentiment Analysis
- Sentiment Analysis Task
- Sentiment Analysis using Naive Bayes classification
- Lexicon-based approaches (see the sketch after this lecture's notes)
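A minimal sketch of a lexicon-based sentiment scorer; the tiny lexicon below is made up purely for illustration -- real systems rely on curated sentiment lexicons.

```python
# A minimal sketch of lexicon-based sentiment scoring with a toy lexicon.
POSITIVE = {"good", "great", "excellent", "love", "wonderful"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "boring"}

def sentiment(text):
    """Return 'pos', 'neg' or 'neutral' by counting lexicon hits."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "pos" if score > 0 else "neg" if score < 0 else "neutral"

print(sentiment("What a great and wonderful film"))                  # pos
print(sentiment("The plot was boring and the acting terrible"))      # neg
```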
15th of April#
- Discussed material:
- Text Classification with fastText
- Text Classification
- Introduction to fastText
- Simple fastText classification model (see below)
- Combining word embeddings with embedding of n-grams
- Metrics used in context of classification tasks (accuracy / precision / recall)
- Importance of preprocessing
- Supplementary resources:
- Bag of Tricks for Efficient Text Classification: the paper which introduced the fastText classifier as a simple yet competitive baseline (compared to deep learning models) with a much more efficient training procedure.
- Enriching Word Vectors with Subword Information: the paper which presents an interesting approach to dealing with out-of-vocabulary words that is already present in fastText -- combining word vectors with embeddings of character n-grams.
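The simple fastText classification model mentioned above, as a minimal sketch using the fastText Python bindings. The training file name and hyperparameters are hypothetical; the file is expected to contain one `__label__<class> <text>` example per line.

```python
# A minimal sketch of supervised text classification with fastText.
import fasttext

model = fasttext.train_supervised(
    input="train.txt",   # hypothetical training file in fastText format
    epoch=25,
    lr=0.5,
    wordNgrams=2,        # add bigram features on top of word embeddings
)
labels, probs = model.predict("this movie was surprisingly good")
print(labels, probs)
model.save_model("classifier.bin")
```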
30th of April#
- Discussed material:
- Part of Speech tagging
- Parts of Speech
- Part of Speech tagging as an NLP problem
- Feature-based Part of Speech tagging
- Named Entity Recognition
- The task of extracting knowledge from text
- Finding and Classifying Named Entities
- Named Entity Recognition as a Sequence Modeling task
- Inference in Sequence Modeling
- Supplementary resources:
- Penn Treebank tag set: a set of tags used in the Penn Treebank (there are roughly 45 of them -- those who’ve been doing NLP long enough allegedly know them by heart)
- spaCy NER demo and AllenNLP NER demo: two quick demos of industry-strength and former state-of-the-art NER systems. Notice that the latter, powered by a pretty big neural network, is able to correctly pick up a location it almost certainly did not see in the training data. A minimal spaCy usage sketch follows this list.
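A minimal sketch of part-of-speech tagging and named entity recognition with spaCy (the library behind the first demo above); it assumes the en_core_web_sm model has been installed via `python -m spacy download en_core_web_sm`, and the example sentence is made up.

```python
# A minimal sketch of POS tagging and NER with spaCy's small English model.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in Bratislava next year.")

print([(token.text, token.pos_, token.tag_) for token in doc])  # coarse and fine POS tags
print([(ent.text, ent.label_) for ent in doc.ents])             # named entities with types
```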
7th of May#
- Discussed material:
- Sparse word representations
- Term Frequency
- Inverse Document Frequency
- TF-IDF (see the sketch after this lecture's notes)
- Contextual representations: from word2vec to BERT
- Word vectors and issues with using them in context-free manner
- Representations with Language Models
- ELMo: Deep Contextual Word Embeddings
- Transformers and Self-Attention
- Masked Language Models
- NLP projects in the real world
- Maximizing an NLP project’s risk of failure
- Machine Learning Hierarchy of Needs
- Building NLP projects by iterating on both the code and the data
- Supplementary resources:
- Understanding BERT Transformer: Attention is not all you need -- a very nice discussion of what Self-Attention-based models may actually model.
- spaCy cheat sheet and spaCy course: a very nice list of features the spaCy library offers along with a quick interactive course where you can try it out without leaving the browser.
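A minimal sketch of TF-IDF computed from scratch on a made-up toy corpus; it uses the common log-scaled IDF variant, which may differ slightly from the formulation on the slides.

```python
# A minimal sketch of TF-IDF weights: term frequency times log-scaled
# inverse document frequency.
import math
from collections import Counter

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "ate", "my", "homework"],
        ["the", "cat", "ate", "the", "fish"]]

N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequency

def tf_idf(doc):
    """Return a {term: tf-idf weight} dictionary for one document."""
    tf = Counter(doc)
    return {term: count * math.log(N / df[term]) for term, count in tf.items()}

print(tf_idf(docs[0]))  # "the" gets weight 0, rarer terms get higher weights
```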
14th of May#
- Discussed material:
- Lessons Learned in the industry by Ondrej Jariabka
- What to do with unbalanced datasets
- Dealing with sparsity (stratified sampling; see the sketch after this lecture's notes)
- When in doubt, XGBoost
- Supplementary resources:
- So what else can NLP be useful for? Can it perhaps help cure cancer? Go out there, try it out and let us all know if it works!
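A minimal sketch of a stratified train/test split for an unbalanced dataset; scikit-learn and the toy data are assumptions here, as the talk did not prescribe specific tooling.

```python
# A minimal sketch of a stratified split that preserves the class ratio.
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [0] * 90 + [1] * 10          # heavily unbalanced labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(sum(y_train), sum(y_test))  # class ratio preserved: 8 and 2 positives
```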
Resources#
Speech and Language Processing, 3rd Edition -- Daniel Jurafsky, James H. Martin
A Primer on Neural Network Models for Natural Language Processing -- Yoav Goldberg
Neural Network Methods for Natural Language Processing -- Yoav Goldberg
Grading#
| Component | Weight |
|---|---|
| Assignments | 50% |
| Project | 50% |
Assignments are available via the Moodle e-learning system and also in the following GitHub repository: https://github.com/NaiveNeuron/nlp-exercises
A list of project ideas can be found here.
| Points | Grade |
|---|---|
| (90, inf] | A |
| (80, 90] | B |
| (70, 80] | C |
| (60, 70] | D |
| (50, 60] | E |
| [0, 50] | FX |