Marek Šuppa

Natural Language Processing 2024

Last modified at: 2024-02-23 07:15:00+01:00

The 2024 episode at the Faculty of Mathematics, Physics and Informatics of Comenius University

Lectures M-XI: Friday, 9:00 - 10:20 (voluntary)
Labs M-XI: Friday, 10:20 - whenever (voluntary)

Course description
Lectures
Resources
Similar Courses Elsewhere
Grading

Course description

This course tries to go deeper into how we can represent human language (say English or Slovak) in a way that can be processed by computational systems (a.k.a. computer programs), and how this representation can then be used to do interesting things, such as

translation
question answering
grammatical error correction
summarization
text (like poems or song lyrics) generation
and much more...

All of this combined is part of a field called Natural Language Processing, which ended up being the name of the course.

Feel free to check out the previous year's class webpage as well!

Lectures

Lecture I: Text Processing

Discussed material:

Basic Text Processing (slides)
- Regular Expressions
- Tokenization
- Text normalization
Basically the first part of SLP Chapter II
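
To make the regular expression and tokenization topics above a bit more concrete, here is a minimal sketch of a regex-based tokenizer with lowercasing as the only normalization step. The pattern and the names `TOKEN_PATTERN` and `tokenize` are illustrative assumptions, not something taken from the slides or from SLP.

```python
import re

# Hypothetical pattern, chosen for illustration only. It matches, in order:
#   1. numbers, optionally with internal "." or "," (e.g. 3.50)
#   2. words, optionally with internal apostrophes or hyphens (e.g. don't)
#   3. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:['-]\w+)*|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into tokens and lowercase them (a very simple normalization)."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


print(tokenize("Don't panic: it costs 3.50 EUR!"))
```

Real tokenizers need many more rules (abbreviations, URLs, emoji, language-specific clitics), which is exactly why this topic deserves a whole lecture.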

Supplementary resources:

Eliza Bot Demo: one of the most famous use cases of regular expressions. It is really worth trying out -- you may end up having some surprisingly good conversations.
Unix for poets: a nice 25 pages worth of examples on how to process text on the Unix command line. Here is a shorter version by the authors of the SLP book.
Scriptio continua: the reason why English also nearly ended up without word and sentence separators.

Lecture II: Edit Distance + Intro to Language Modeling

Discussed material:

Edit Distance (slides)
- Edit Distance
- Weighted Edit Distance
- Alignment
- The last part of SLP Chapter II
- (we did not discuss the bio applications ...)
Language modeling (slides)
- Estimating N-gram probabilities
- Perplexity and Language Model Evaluation
- Dealing with zeros
- Smoothing, backoff and interpolation
- Most of SLP Chapter III
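
As a companion to the edit distance topics above, the following is a minimal sketch of the standard dynamic-programming computation. The function name and the `sub_cost` parameter are assumptions made for this example; the slides may present the algorithm differently, and the sketch returns only the distance, not the alignment.

```python
def edit_distance(source: str, target: str, sub_cost: int = 1) -> int:
    """Minimum edit distance between two strings via dynamic programming.

    Insertions and deletions cost 1. Substitutions cost `sub_cost`:
    1 gives the classic Levenshtein distance, 2 is another common convention.
    """
    n, m = len(source), len(target)
    # dist[i][j] holds the cost of turning source[:i] into target[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(1, m + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,  # deletion
                dist[i][j - 1] + 1,  # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # match / substitution
            )
    return dist[n][m]


print(edit_distance("kitten", "sitting"))
print(edit_distance("intention", "execution", sub_cost=2))
```

Replacing the constant costs with a lookup table (for example one based on keyboard distance) would turn this into a weighted edit distance.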

Supplementary resources:

subsync: a tool for automatically synchronizing subtitles with video (a nice use-case of using alignment in a not-so-ordinary context)
Seam carving: although not necessarily NLP related, seam carving is a very nice real-world example of how dynamic programming can still be a useful tool.
Google Ngram Viewer -- a nice way of visualizing the rate of use of n-grams in books written in various languages. Check out this quick example for instance. Note that the dataset on which the Ngrams are computed plays a big role -- here is a similar example for coronavirus.
kenlm -- an open-source language modeling toolkit. Probably best in class when it comes to speed and memory efficiency.
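
Before reaching for a toolkit, it may help to see how little code the n-gram estimation, smoothing and perplexity topics from this lecture need. The following is a minimal sketch of a bigram model with add-k smoothing. All function names and the toy corpus are assumptions made for this example, and it has nothing to do with how kenlm works internally.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams, k=1.0):
    """Add-k estimate of P(word | prev); k=1 is add-one, k=0 is the raw MLE.

    Simplification: the vocabulary is every token seen in training,
    including the <s> and </s> markers.
    """
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)


def perplexity(sentence, unigrams, bigrams, k=1.0):
    """Perplexity of one sentence under the smoothed bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = sum(
        math.log(bigram_prob(prev, word, unigrams, bigrams, k))
        for prev, word in zip(tokens, tokens[1:])
    )
    return math.exp(-log_prob / (len(tokens) - 1))


# Made-up toy corpus, for illustration only.
unigrams, bigrams = train_bigram_counts(["the cat sat", "the dog sat", "the cat ran"])
print(bigram_prob("the", "cat", unigrams, bigrams, k=0.0))
print(perplexity("the cat sat", unigrams, bigrams))
```

With k=0 any unseen bigram gets probability zero and the perplexity is undefined, which is the "dealing with zeros" problem that smoothing, backoff and interpolation address.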

Lecture III: Language Modeling, Word Embeddings and RNNs

Discussed material:

"From N-grams to Word2Vec"

Language modeling and word embeddings

Shortcomings of n-gram language models

Neural language model

Word2Vec

CBOW + Skip-Gram

Visualization of (vector) word spaces

"From Word2Vec to Recurrent Networks"

Language Modeling and RNNs Part 1

Recurrent Neural Networks (RNNs)

Back Propagation Through Time (BPTT)

Language Modeling and RNNs Part 2

Simple RNNs

Vanishing and Exploding Gradient

Long Short Term Memory (LSTM)

Gated Recurrent Unit (GRU)

Supplementary resources:

Semantic Space Surfer -- a bit of a fun spin on word embeddings. While normally we'd like the word embeddings to work "intuitively" (i.e. king should be to man as queen is to woman), we know it's not always like that. In this quick game you're forced to "think like a word embedding" and pick the word that the pre-trained embeddings would have chosen. Not only is this quite fun, it'll also help you deepen your intuition around what's actually going on with word embeddings.
The Unreasonable Effectiveness of Recurrent Neural Networks -- by now a classic post in the Deep Learning blogosphere by Andrej Karpathy on what RNNs are capable of. Still worth checking out.
The unreasonable effectiveness of Character-level Language Models -- a quick "setting-the-record-straight" response to the previous blog post by Yoav Goldberg. Note that it is not entirely critical, as its subtitle is (and why RNNs are still cool). Definitely worth a read.
Unsupervised Sentiment Neuron -- a blogoduction (introduction-via-blog) by OpenAI to their work on training character-level language models on Amazon reviews, which happen to pick up the notion of "sentiment" on their own.
Understanding LSTM Networks -- another classic which does a great job introducing LSTMs from the ground up with nicely done visualizations using the "circuit board" metaphor.

Lecture IV: Attention, Transformers, all the way to GPT

Discussed material:

Visualizing attention + The Illustrated Transformer

The limits of RNNs

The concept of attention

The limits of vanilla attention

Values, Keys and Queries

Multi Head Attention

Tricks of the Transformer architecture (i.e. positional embeddings)

What's inside ChatGPT

From Transformers to GPT architectures

GPT, GPT-2, GPT-3

InstructGPT (Reinforcement Learning from Human Feedback)

ChatGPT? and its implications
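
To illustrate the "Values, Keys and Queries" item above, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It assumes the queries, keys and values are already given as lists of vectors, so the learned projection matrices, masking and multiple heads are deliberately left out. The function names are made up for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors.

    Each query is compared with every key; the softmax of the scaled dot
    products gives the weights used to average the value vectors.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[dim] for w, v in zip(weights, values))
            for dim in range(len(values[0]))
        ])
    return outputs


# Two identical keys receive equal weight, so the output is the mean of the values.
print(attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [4.0, 6.0]]))
```

A full Transformer layer wraps this in learned linear projections, runs several heads in parallel and adds positional information, which is what the resources below walk through.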

Supplementary resources:

Sequence to Sequence and Attention: one of the best written lecture notes on the topic. If any of the discussed concepts didn't make sense (or too much sense) during the lecture, they are almost certainly described much better on this website.
Although some might disagree, experience shows that the best way to learn how the Transformer architecture really works is to re-implement it. It really is not such a big deal -- just a couple hundred lines of code. Here are some resources to get you started:
GPT in 60 lines of NumPy makes the point above in an even stronger fashion, as it re-implements (part of) the GPT model in very simple NumPy code. Going through it (and doing it on your own) is strongly recommended if you'd like to really understand what is going on in there.
Let's build GPT: from scratch, in code, spelled out is perhaps the best way to spend 2 hours doing the re-implementing above while being guided by Andrej Karpathy, perhaps the best educator when it comes to Deep Learning and (modern) Neural Networks.
Intro to Large Language Models is currently very probably the best way of getting up to speed in the area of Large Language Models in one hour. It's by Andrej Karpathy again, which has become a guarantee of quality at this point.

Lecture V: Text Classification

Discussed material:

Text Classification and Naive Bayes
- Text Classification task
- Bag of Words representation
- Naive Bayes classifier
- Classification metrics: accuracy, precision, recall, F score
- Micro vs Macro averaging
Sentiment Analysis
- Sentiment Analysis Task
- Sentiment Analysis using Naive Bayes classification
- Lexicon-based approaches
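
The bag-of-words representation and the Naive Bayes classifier listed above fit into a few dozen lines. The following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. The function names and the tiny training set are made up for this example and are not the data used in the lecture.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """Train on a list of (text, label) pairs.

    Returns log-priors, smoothed log-likelihoods and the vocabulary.
    """
    label_counts = Counter(label for _, label in documents)
    word_counts = defaultdict(Counter)
    for text, label in documents:
        word_counts[label].update(text.lower().split())  # bag of words
    vocab = {word for counts in word_counts.values() for word in counts}
    log_prior = {c: math.log(n / len(documents)) for c, n in label_counts.items()}
    log_likelihood = {}
    for c in label_counts:
        total = sum(word_counts[c].values())
        # Add-one (Laplace) smoothing keeps unseen word/class pairs non-zero.
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return log_prior, log_likelihood, vocab


def classify(text, log_prior, log_likelihood, vocab):
    """Return the most probable class; words outside the vocabulary are ignored."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.lower().split() if w in vocab
        )
    return max(scores, key=scores.get)


# Hypothetical toy training data.
training = [
    ("just plain boring", "neg"),
    ("entirely predictable and lacks energy", "neg"),
    ("no surprises and very few laughs", "neg"),
    ("very powerful", "pos"),
    ("the most fun film of the summer", "pos"),
]
model = train_naive_bayes(training)
print(classify("predictable with no fun", *model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.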

Supplementary resources:

Trump Tweet Bot: being able to estimate sentiment of text can have real-world implications: by assessing what a leader of a large country thinks about a publicly traded company and buying/selling its shares as a result, you can end up with a fairly interesting investment portfolio!

Bag of Tricks for Efficient Text Classification: the paper that introduced the fastText classifier as a simple, competitive baseline (compared to deep learning models) with a much more efficient training procedure.

Enriching Word Vectors with Subword Information: the paper which presents an interesting approach for dealing with Out Of Vocabulary words that is already present in fastText -- combining word vectors with embeddings of character n-grams.

Text Classification in the NLP Course For You: full of pretty pictures, this page outlines "all you need to know" about text classification, going from zero to hero. I very much recommend taking a look!

Lecture VI: Transfer Learning and BERT

Discussed material:

Transfer Learning in the NLP Course For You

From words to words-in-context

Single multi-task pre-trained model(s)

Contextual representations via BERT

Word vectors and issues with using them in context-free manner

Representations with Language Models

ELMo: Deep Contextual Word Embeddings

Transformers and Self-Attention

Masked Language Models

Supplementary resources:

The Illustrated BERT:

Although the presentation we discussed is very informative, a visual presentation is usually even more impactful. This is one of the best ones that you can find online.

Byte Pair Encoding:

Dealing with out-of-vocabulary words out of the box was one of the big improvements that BERT-style pre-trained models made very popular. This article goes into greater detail on how that happens, what the limits of this method are and how one may go about fixing them.

(I actually recommend you read through the whole article -- it's a very nice introduction to the concept of Attention and sequence-to-sequence tasks in general)

The Dark Secrets of BERT:

It turns out BERT learns interesting things during training. Part of it may be due to its use of self-attention but as this article (and associated paper) shows, there may be some black magic going on.

Lecture VII: Llama 3 Chalk Talk

The release of Llama 3 has shaken things up quite a bit, so we had to adapt the class appropriately.

Discussed material:

Llama 3 Introduction

Llama 3 training data/process

Llama 3 benchmark results

Llama 3 and the future of open models

Natural Language Generation

NLG systems in the wild (GPT models and so on)

Decoding from NLG models (top-k, top-p sampling)

Training and evaluating NLG models
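
The two decoding strategies named above, top-k and top-p (nucleus) sampling, both boil down to truncating the next-token distribution and renormalizing before sampling. Here is a minimal sketch over a plain dictionary of token probabilities. The function names and the example distribution are assumptions made for illustration; real implementations operate on logits over the whole vocabulary.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose total mass reaches p."""
    kept, mass = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    return {token: prob / mass for token, prob in kept}


def sample(probs, rng=random):
    """Draw one token according to the (already filtered) distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical next-token distribution.
next_token = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}
print(top_k_filter(next_token, 2))
print(sample(top_p_filter(next_token, 0.9)))
```

Top-k always keeps a fixed number of candidates, whereas top-p adapts the number of candidates to how peaked the distribution is.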

Supplementary resources:

LMSYS Chatbot Arena Leaderboard:

As many would say, "the only benchmark worth taking a look at".

Commoditize Your Complement:

Perhaps the reason why Meta released Llama 3 in the open.

Ollama:

Arguably the easiest way of running Llama-style models on your own.

Lecture IX: PoS, NER and Question Answering

Discussed material:

Part of Speech tagging

Parts of Speech

Part of Speech tagging as an NLP problem

Feature-based Part of Speech tagging

Named Entity Recognition

The task of extracting knowledge from text

Finding and Classifying Named Entities

Named Entity Recognition as a Sequence Modeling task

Inference in Sequence Modeling

Question Answering: [video]

Reading comprehension

Open Domain (textual) question answering

Supplementary resources:

Penn Treebank tag set: a set of tags used in the Penn Treebank (there are roughly 45 of them -- those who've been doing NLP long enough allegedly know them by heart)
spaCy NER demo: a quick demo of an industry-strength NER system. Note that code-switching between English and Slovak is indeed still a problem.

Lecture X: Machine Translation

Discussed material:

Machine Translation

The History

The Data

The Metrics

Supplementary resources:

Electronic Brain Translates Russian, a historical article (from 1951) on the very first machine translation system put together by IBM.
Amazing Robot Brain Translates Russian, an article on how the very first machine translation systems came to be, 71 years later.

Lecture XI: Prompting and Retrieval Augmented Generation

Discussed material:

Prompting

Retrieval and RAG

Supplementary resources:

Musings on building a Generative AI product discusses what it takes to use these LLMs in an actual product. Perhaps the biggest highlight you can find there is that they ended up using YAML for structured output. This is perhaps the first actually "good" use of YAML... (check https://noyaml.com/ for the reasons why)

Lecture XII: Multimodal Language Models

Discussed material:

GPT-4o

Vision Language Models

Supplementary resources:

Breaking resolution curse of vision-language models discusses how vision-language models often cannot focus on the right parts of the image and what can be done about it.

PaliGemma, Google's new open-weights vision-language model, is a nice example of all the things that can be done with the currently available vision-language models out there.

Lecture XIII: (Parameter-efficient) Finetuning + Instruction Tuning

Discussed material:

Parameter-efficient Finetuning and Instruction Tuning

Multi-task Learning

Fine-tuning

Adapters, BitFit, LoRA

Instruction tuning

Supplementary resources:

MMLU Pro, a new benchmark of multi-task understanding. The "Pro" signifies that it's a new updated version of the MMLU dataset that has seemingly been saturated.
Practical Tips for Finetuning LLMs Using LoRA, a very nice article full of practical info on how to go about finetuning LLMs with LoRA. It's certainly worth checking out!

Resources

Introduction to Natural Language Processing by Jacob Eisenstein

Speech and Language Processing, 3rd Edition by Daniel Jurafsky, James H Martin

A Primer on Neural Network Models for Natural Language Processing by Yoav Goldberg

Neural Network Methods for Natural Language Processing by Yoav Goldberg

Similar Courses Elsewhere

There are more than a few similar (and oftentimes even better) courses out there. Here is a sample:

Grading

Assignments:	50%
Project:	50%

Assignments are available via Google Classroom (the class code is rehw7og -- feel free to use the following invite link) but they are also available in the following repository on GitHub: https://github.com/NaiveNeuron/nlp-exercises

Check out the Project Ideas for 2024!

Points	Grade
(90, inf]	A
(80, 90]	B
(70, 80]	C
(60, 70]	D
(50, 60]	E
[0, 50]	FX