Natural Language Processing

Last modified at: 2019-03-27 10:00:00+01:00


The 2019 edition at the Faculty of Mathematics, Physics and Informatics of Comenius University

Labs (I-8 or any other room by prior appointment)
Tuesday, 12:20 - 14:00 (voluntary)
Lectures (I-8)
Tuesday, 14:00 - 15:40 (voluntary)

Lectures

22nd of February

Discussed material:
Supplementary resources:
  • Eliza Bot Demo: one of the most famous use cases of regular expressions. It is really worth trying out -- you may end up having some surprisingly good conversations.
  • Unix for poets: a nice 25 pages' worth of examples of how to process text on the Unix command line. Here is a shorter version by the authors of the SLP book.
  • Scriptio continua: the reason why English also nearly ended up without word and sentence separators.

26th of February

Discussed material:
Supplementary resources:
  • subsync: a tool for automatically synchronizing subtitles with video (a nice use-case of using alignment in a not-so-ordinary context)
  • Autocomplete using Markov chains: a nice example (along with code in Python) that shows how Language Models can be used to generate "text resembling language" and build a simple 'autocomplete' engine.

5th of March

Discussed material:
  • Language modeling
    • Estimating N-gram probabilities
    • Perplexity and Language Model Evaluation
    • Dealing with zeros
    • Smoothing, backoff and interpolation
    • "Stupid backoff"
    • Most of SLP Chapter III
Supplementary resources:
  • Google Ngram Viewer -- a nice way of visualizing the rate of use of n-grams in books written in various languages. Check out this quick example for instance.
  • kenlm -- an open-source language modeling toolkit. Probably best in class when it comes to speed and memory efficiency.
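The estimation, smoothing and perplexity ideas above can be sketched in a few lines of Python. This is a toy illustration, not how a production toolkit like kenlm works: it trains a bigram model with add-one (Laplace) smoothing on a tiny invented corpus and scores sentences by perplexity.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams, vocab

def bigram_prob(w1, w2, unigrams, bigrams, vocab):
    """Add-one smoothed bigram probability P(w2 | w1)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

def perplexity(tokens, unigrams, bigrams, vocab):
    """Perplexity of a single sentence under the smoothed bigram model."""
    padded = ["<s>"] + tokens + ["</s>"]
    log_prob = sum(math.log(bigram_prob(a, b, unigrams, bigrams, vocab))
                   for a, b in zip(padded, padded[1:]))
    return math.exp(-log_prob / (len(padded) - 1))

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
uni, bi, v = train_bigram_lm(corpus)
# A sentence made of frequent bigrams gets a lower perplexity than a
# shuffled version of itself.
print(perplexity(["the", "cat", "sat"], uni, bi, v))
print(perplexity(["sat", "the", "cat"], uni, bi, v))
```

Add-one smoothing is the crudest option discussed in SLP Chapter III; swapping in backoff or interpolation would only change `bigram_prob`.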

12th of March

Discussed material:
Supplementary resources:
  • Embedding Projector -- a tool from the TensorFlow project for visualizing, in 2D and 3D, distributed representations such as those obtained with Word2Vec
  • Word2Vec example by ML5JS -- a simple example of the analogy task Word2Vec has become famous for, using a nice JavaScript library called ML5JS
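The analogy task mentioned above boils down to simple vector arithmetic plus a nearest-neighbour search under cosine similarity. The sketch below uses tiny hand-invented 3-dimensional vectors purely for illustration -- real Word2Vec embeddings have hundreds of dimensions and are learned from large corpora.

```python
import math

# Toy "embeddings" invented for this example; not real Word2Vec output.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.5, 0.9, 0.0],
    "woman": [0.5, 0.1, 0.7],
    "apple": [0.1, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? via the nearest neighbour of vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("man", "woman", "king", vectors))  # → queen (with these toy vectors)
```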

19th of March

Discussed material:
Supplementary resources:

26th of March

Discussed material:
Supplementary resources:

2nd of April

Discussed material:
Supplementary resources:

9th of April

Discussed material:
  • Sentiment Analysis
    • Sentiment Analysis Task
    • Sentiment Analysis using Naive Bayes classification
    • Lexicon-based approaches
Supplementary resources:
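The Naive Bayes approach to sentiment analysis can be sketched from scratch: estimate class priors and add-one smoothed word likelihoods from labeled documents, then pick the class maximizing the log-posterior. The training snippets below are invented toy data.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns log-priors, per-class counts, vocab."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    return log_prior, word_counts, vocab

def classify_nb(tokens, log_prior, word_counts, vocab):
    """argmax_c log P(c) + sum_w log P(w | c), with add-one smoothing."""
    scores = {}
    for c in log_prior:
        total = sum(word_counts[c].values())
        scores[c] = log_prior[c] + sum(
            math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in tokens if w in vocab)  # words unseen in training are ignored
    return max(scores, key=scores.get)

train = [
    (["great", "fun", "loved", "it"], "pos"),
    (["really", "great", "movie"], "pos"),
    (["boring", "and", "dull"], "neg"),
    (["dull", "plot", "hated", "it"], "neg"),
]
lp, wc, vocab = train_nb(train)
print(classify_nb(["great", "movie"], lp, wc, vocab))   # → pos
print(classify_nb(["dull", "boring"], lp, wc, vocab))   # → neg
```

A lexicon-based approach, by contrast, would skip training entirely and score each word against a fixed list of positive and negative terms.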

15th of April

Discussed material:
  • Text Classification with fastText
    • Text Classification
    • Introduction to fastText
    • Simple fastText classification model (see below)
    • Combining word embeddings with embedding of n-grams
    • Metrics used in context of classification tasks (accuracy / precision / recall)
    • Importance of preprocessing
Supplementary resources:
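The accuracy / precision / recall metrics mentioned above are easy to compute by hand for a binary task; the sketch below does so from scratch so the definitions are explicit (the gold and predicted labels are made up for illustration).

```python
def classification_metrics(gold, predicted, positive="pos"):
    """Accuracy, precision and recall for a binary classification task."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    accuracy = correct / len(gold)
    # Precision: of everything predicted positive, how much really was?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything really positive, how much did we find?
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

gold = ["pos", "pos", "neg", "neg", "pos"]
pred = ["pos", "neg", "neg", "pos", "pos"]
acc, prec, rec = classification_metrics(gold, pred)
print(acc, prec, rec)  # → 0.6 0.666... 0.666...
```

On imbalanced data accuracy alone is misleading, which is why precision and recall are reported alongside it.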

30th of April

Discussed material:
  • Part of Speech tagging
    • Parts of Speech
    • Part of Speech tagging as an NLP problem
    • Feature-based Part of Speech tagging
  • Named Entity Recognition
    • The task of extracting knowledge from text
    • Finding and Classifying Named Entities
    • Named Entity Recognition as a Sequence Modeling task
    • Inference in Sequence Modeling
Supplementary resources:
  • Penn Treebank tag set: a set of tags used in the Penn Treebank (there are roughly 45 of them -- those who've been doing NLP long enough allegedly know them by heart)
  • spaCy NER demo and AllenNLP NER demo: two quick demos of industry-strength and former state-of-the-art NER systems. Notice that the latter, powered by a pretty big neural network, is able to correctly pick up a location it almost certainly did not see in the training data.
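Inference in sequence modeling -- finding the best tag sequence rather than tagging each token independently -- is classically done with the Viterbi algorithm. The sketch below runs it over a tiny hypothetical HMM with invented transition and emission probabilities, tagging tokens as O (outside) or LOC (location).

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state sequence under an HMM (max-product dynamic programming)."""
    # V[t][s] = probability of the best path ending in state s at step t
    V = [{s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda ps: V[t - 1][ps] * trans_p[ps][s])
            V[t][s] = (V[t - 1][best_prev] * trans_p[best_prev][s]
                       * emit_p[s].get(observations[t], 0.0))
            back[t][s] = best_prev
    # Follow back-pointers from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Hypothetical toy model; all numbers are invented for illustration.
states = ["O", "LOC"]
start_p = {"O": 0.8, "LOC": 0.2}
trans_p = {"O": {"O": 0.7, "LOC": 0.3}, "LOC": {"O": 0.8, "LOC": 0.2}}
emit_p = {"O": {"visit": 0.5, "we": 0.4, "Paris": 0.1},
          "LOC": {"Paris": 0.9, "visit": 0.05, "we": 0.05}}
print(viterbi(["we", "visit", "Paris"], states, start_p, trans_p, emit_p))
# → ['O', 'O', 'LOC']
```

Modern feature-based and neural taggers replace the hand-set probabilities with learned scores, but the dynamic-programming inference step is the same idea.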

7th of May

Discussed material:
  • Sparse word representations
    • Term Frequency
    • Inverse Document Frequency
    • TF-IDF
  • Contextual representations: from word2vec to BERT
    • Word vectors and issues with using them in context-free manner
    • Representations with Language Models
    • ELMo: Deep Contextual Word Embeddings
    • Transformers and Self-Attention
    • Masked Language Models
  • NLP projects in the real world
    • Maximizing an NLP project's risk of failure
    • Machine Learning Hierarchy of Needs
    • Building NLP projects by iterating on both the code and the data

Supplementary resources:

  • Understanding BERT Transformer: Attention is not all you need -- a very nice discussion of what Self-Attention-based models may actually be modeling.
  • spaCy cheat sheet and spaCy course: a very nice list of features the spaCy library offers along with a quick interactive course where you can try it out without leaving the browser.
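The TF-IDF weighting discussed above (term frequency times inverse document frequency) fits in a few lines; this sketch uses raw counts for tf and idf = log(N / df), one common variant among several, over a toy three-document corpus.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document TF-IDF weights: tf = raw count, idf = log(N / df)."""
    n = len(docs)
    df = Counter()                 # document frequency of each term
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

docs = [["the", "cat", "sat"],
        ["the", "dog", "barked"],
        ["the", "cat", "purred"]]
w = tf_idf(docs)
# "the" appears in every document, so its idf -- and hence its weight -- is 0;
# "dog" appears in only one of three documents and gets the highest weight.
print(w[1]["the"], w[1]["dog"])
```

This zeroing-out of ubiquitous words is exactly why TF-IDF vectors work better than raw counts as sparse document representations.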

Grading

Assignments: 50%
Project: 50%

Assignments are available via the Moodle e-learning system and also in the following repository on GitHub: https://github.com/NaiveNeuron/nlp-exercises

A list of project ideas can be found here.

Points Grade
(90, inf] A
(80, 90] B
(70, 80] C
(60, 70] D
(50, 60] E
[0, 50] FX