Natural Language Processing 2024

Last modified at: 2024-02-23 07:15:00+01:00

The 2024 episode at the Faculty of Mathematics, Physics and Informatics of Comenius University

Lectures M-XI
Friday, 9:00 - 10:20 (voluntary)
Labs M-XI
Friday, 10:20 - whenever (voluntary)

Course description

This course tries to go deeper into how we can represent human language (say English or Slovak) in a way that can be processed by computational systems (a.k.a. computer programs), and how this representation can then be used to do interesting things, such as

  • translation
  • question answering
  • grammatical error correction
  • summarization
  • text (like poems or song lyrics) generation
  • and much more...

All of this combined is part of a field called Natural Language Processing, which ended up being the name of the course.

Feel free to check out the previous year's class webpage as well!


Lecture I: Text Processing

Discussed material:
Supplementary resources:
  • Eliza Bot Demo: one of the most famous use cases of regular expressions. It is really worth trying out -- you may end up having some surprisingly good conversations.
  • Unix for poets: a nice 25 pages' worth of examples on how to process text on the Unix command line. Here is a shorter version by the authors of the SLP book.
  • Scriptio continua: the reason why English also nearly ended up without word and sentence separators.

Lecture II: Edit Distance + Intro to Language Modeling

Discussed material:
  • Edit Distance (slides)
    • Edit Distance
    • Weighted Edit Distance
    • Alignment
    • The last part of SLP Chapter II
    • (we did not discuss the bio applications ...)
  • Language modeling (slides)
    • Estimating N-gram probabilities
    • Perplexity and Language Model Evaluation
    • Dealing with zeros
    • Smoothing, backoff and interpolation
    • Most of SLP Chapter III
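As a small illustration of the (weighted) edit distance topic listed above, here is a minimal dynamic-programming sketch. The function name `edit_distance` and the cost parameters are illustrative choices and are not taken from the slides; the default costs of 1 give the plain Levenshtein distance.

```python
def edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum cost of turning `source` into `target` (illustrative sketch)."""
    n, m = len(source), len(target)
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost  # delete everything
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost  # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                       # deletion
                dist[i][j - 1] + ins_cost,                       # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # match / substitution
            )
    return dist[n][m]


print(edit_distance("intention", "execution"))              # unit costs
print(edit_distance("intention", "execution", sub_cost=2))  # substitution costs 2
```

With unit costs the distance between "intention" and "execution" is 5; with a substitution cost of 2 it is 8. Keeping back-pointers in the same table would additionally recover the alignment.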
Supplementary resources:
  • subsync: a tool for automatically synchronizing subtitles with video (a nice use-case of using alignment in a not-so-ordinary context)
  • Seam carving: although not necessarily NLP related, seam carving is a very nice real-world example of how dynamic programming can still be a useful tool.
  • Google Ngram Viewer -- a nice way of visualizing the rate of use of n-grams in books written in various languages. Check out this quick example for instance. Note that the dataset on which the Ngrams are computed plays a big role -- here is a similar example for coronavirus.
  • kenlm -- an open-source language modeling toolkit. Probably best in class when it comes to speed and memory efficiency.
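The n-gram estimation, smoothing and perplexity topics from this lecture can be sketched in a few lines. This is a toy bigram model with add-k smoothing, assuming whitespace tokenization; the names `train_bigram`, `bigram_prob` and `perplexity` are hypothetical helpers for illustration, not part of kenlm or any other toolkit.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams, k=1.0):
    """Add-k estimate of P(word | prev); k=0 gives the unsmoothed estimate,
    which is undefined for a `prev` that never occurred in training."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)


def perplexity(sentence, unigrams, bigrams, k=1.0):
    """Per-token perplexity of one sentence under the smoothed bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    pairs = list(zip(tokens, tokens[1:]))
    log_prob = sum(math.log(bigram_prob(p, w, unigrams, bigrams, k)) for p, w in pairs)
    return math.exp(-log_prob / len(pairs))


corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
unigrams, bigrams = train_bigram(corpus)
print(bigram_prob("<s>", "I", unigrams, bigrams, k=0))  # unsmoothed: 2 of 3 sentences start with "I"
print(perplexity("I am Sam", unigrams, bigrams))
print(perplexity("ham eggs green", unigrams, bigrams))  # unseen bigrams, higher perplexity
```

Add-k smoothing is used here only because it is the simplest way of "dealing with zeros"; production toolkits rely on more refined schemes such as backoff and interpolation.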

Lecture III: Language Modeling, Word Embeddings and RNNs

Discussed material:

"From N-grams to Word2Vec"

"From Word2Vec to Recurrent Networks"

Supplementary resources:
  • Semantic Space Surfer -- a bit of a fun spin on word embeddings. While normally we'd like the word embeddings to work "intuitively" (i.e. king should be to man as queen is to woman), we know it's not always like that. In this quick game you're forced to "think like a word embedding" and pick the word that the pre-trained embeddings would have chosen. Not only is this quite fun, it'll also help you deepen your intuition around what's actually going on with word embeddings.
  • The Unreasonable Effectiveness of Recurrent Neural Networks -- at this point a classic blog post in the Deep Learning blogosphere by Andrej Karpathy on what RNNs are capable of. Still worth checking out.
  • The unreasonable effectiveness of Character-level Language Models -- a quick "setting-the-record-straight" response by Yoav Goldberg to the previous blog post. Note that it is not entirely critical, as its subtitle is (and why RNNs are still cool). Definitely worth a read.
  • Unsupervised Sentiment Neuron -- a blogoduction (introduction-via-blog) by OpenAI to their work on training character-level language models on Amazon reviews, which happen to pick up the notion of "sentiment" on their own.
  • Understanding LSTM Networks -- another classic which does a great job introducing LSTMs from the ground up with nicely done visualizations using the "circuit board" metaphor.
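The "king is to man as queen is to woman" intuition mentioned above boils down to vector arithmetic followed by a nearest-neighbour search. The sketch below uses hand-made three-dimensional toy vectors that are pure assumptions chosen to make the arithmetic visible; they are not real pre-trained embeddings, and `analogy` is a hypothetical helper name.

```python
import math

# Hand-made toy vectors (roughly: royalty, maleness, femaleness).
# These are illustrative assumptions, NOT real word embeddings.
toy_vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman", toy_vectors))
```

With real embeddings the nearest neighbour is not always the "intuitive" answer, which is exactly what the Semantic Space Surfer game above plays with.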

Lecture IV: Attention, Transformers, all the way to GPT

Discussed material:

Visualizing attention + The Illustrated Transformer
  • The limits of RNNs
  • The concept of attention
  • The limits of vanilla attention
  • Values, Keys and Queries
  • Multi Head Attention
  • Tricks of the Transformer architecture (i.e. positional embeddings)
What's inside ChatGPT
  • From Transformers to GPT architectures
  • GPT, GPT-2, GPT-3
  • InstructGPT (Reinforcement Learning from Human Feedback)
  • ChatGPT? and its implications
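The "Values, Keys and Queries" idea listed above can be written down as scaled dot-product attention. The following is a minimal pure-Python sketch for a single head, with no learned projections, masking or positional embeddings; the function names are illustrative and the nested lists stand in for the matrices a real implementation would use.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V, row by row."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Two identical keys receive equal weight, so the output is the mean of the values.
print(attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [4.0, 6.0]]))
```

Multi-head attention repeats this computation several times on differently projected copies of the inputs and concatenates the results.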
Supplementary resources:
  • Sequence to Sequence and Attention: one of the best written lecture notes on the topic. If any of the discussed concepts didn't make sense (or too much sense) during the lecture, they are almost certainly described much better on this website.

  • Although some might disagree, experience shows that the best way to learn how the Transformer architecture really works is to re-implement it. It really is not such a big deal -- just a couple hundred lines of code. Here are some resources to get you started:

  • GPT in 60 lines of NumPy makes the point above in an even stronger fashion, as it re-implements (part of) the GPT model in very simple NumPy code. Going through it (and doing it on your own) is strongly recommended if you'd like to really understand what is going on in there.

  • Let's build GPT: from scratch, in code, spelled out is perhaps the best way to spend 2 hours doing the re-implementing above while being guided by Andrej Karpathy, perhaps the best educator when it comes to Deep Learning and (modern) Neural Networks.

  • Intro to Large Language Models is probably the best current way of getting up to speed in the area of Large Language Models in one hour. It's by Andrej Karpathy again, which has become a guarantee of quality at this point.

Similar Courses Elsewhere

There are more than a few similar (and oftentimes even better) courses out there. Here is a sample:


Assignments: 50%
Project: 50%

Assignments are available via Google Classroom (the class code is rehw7og -- feel free to use the following invite link), but they are also available in the following repository on GitHub:

Check out the Project Ideas for 2024!

Points      Grade
(90, inf)   A
(80, 90]    B
(70, 80]    C
(60, 70]    D
(50, 60]    E
[0, 50]     FX