Natural Language Processing 2024: Project Ideas

Last modified at: 2024-02-21 08:00:00+01:00


The 2024 edition at the Faculty of Mathematics, Physics and Informatics of Comenius University

This page lists ideas that could be explored as projects in the NLP course. The list is far from exhaustive, but it reflects the current interests of the course instructor(s).

All of the ideas assume the final project will contain some "novelty bits", whether in the idea itself, its execution, its practical usability, or the underlying dataset.

Creating a new dataset for any of the NLP tasks listed below (or any other, really) is a huge plus.

Project MIMEDIS

The project's aim is to "study the impact of media discourse on attitudes towards migration, migrants and migration policy in Slovakia". As such, there are many classification tasks that can be explored in that regard.

You can find more about the project at https://cogsci.fmph.uniba.sk/MIMEDIS/index.html

Text Classification

Pick any interesting text dataset whose texts can be classified into categories, and try a couple of methods on it.

  • Language identification
  • Sentiment analysis of Anketa comments
  • Detection of inappropriate comments in Anketa
  • Adaptation of Label-wise attention to other non-Twitter data (or to new Twitter data)
  • Applications of Active Learning in the context of classification
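
As a starting point for the first item, language identification can be sketched with nothing more than character n-gram counts. The block below is a minimal baseline under stated assumptions, not a recommended method: the class name NgramNaiveBayes, the helper char_ngrams and the four toy training sentences are all made up for illustration, and a real project would need a proper dataset and evaluation.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()               # label -> number of training texts
        self.vocabulary = set()                   # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocabulary.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # log prior + sum of smoothed log likelihoods of each n-gram
            score = math.log(self.doc_counts[label] / total_docs)
            denominator = sum(counts.values()) + vocab_size
            for gram in char_ngrams(text, self.n):
                score += math.log((counts[gram] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (hypothetical, far too small for real use).
train_texts = [
    "the weather is nice today",
    "where is the nearest train station",
    "dnes je vonku pekné počasie",
    "kde je najbližšia železničná stanica",
]
train_labels = ["en", "en", "sk", "sk"]

model = NgramNaiveBayes(n=3).fit(train_texts, train_labels)
print(model.predict("the station is near"))
print(model.predict("kde je stanica"))
```

The same skeleton works for the other classification ideas in the list if the character n-grams are swapped for word-level features, which makes it a convenient baseline to compare stronger methods against.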

Key Phrase Extraction

Given a body of text (say a document or a news article), can we extract the most important phrases out of it (and hence help summarize it a bit)?

Various approaches do exist (many of the popular ones are implemented as part of the pke module) -- could you think of a dataset where something like this could be useful? Could we perhaps build one automatically (with the expected keywords attached)?

Summarization

Can we build a Slovak/Czech/Croatian/Serbian/[any other language] summarization dataset that would allow us to "compress" news articles into a smaller list of sentences?

Can we use the TextRank algorithm to perform extractive summarization, or generate a list of keywords for a given body of text? Can you think of some interesting text for which this method would be a good fit?
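
To make the question concrete, here is a minimal sketch of TextRank-style extractive summarization in plain Python. It is a simplified variant written for illustration: sentences are split with a naive regular expression, stop words are not removed, and the function names (split_sentences, similarity, textrank, summarize) and the example text are assumptions of this sketch rather than any library's API.

```python
import math
import re


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(tokens_a, tokens_b):
    """Shared-word count normalised by the log of the (unique) word counts."""
    words_a, words_b = set(tokens_a), set(tokens_b)
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    overlap = len(words_a & words_b)
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with weighted PageRank over the similarity graph."""
    tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(sentences)
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, k=2):
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


example = (
    "The cat sat on the mat. The cat chased the mouse on the mat. "
    "The mouse ran away from the cat. Bananas are yellow."
)
print(summarize(example, k=3))
```

In the example, the sentence about bananas shares no words with the rest, so it receives only the baseline score and is the one left out. Running the same graph ranking over words instead of sentences gives the keyword-extraction flavour of the algorithm.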

Portmanteau Generation

Given words A and B, can we create their portmanteau C?

A          B        Portmanteau
beef       buffalo  beefalo
sheep      people   sheeple
breakfast  lunch    brunch
frozen     yogurt   froyo
parachute  trooper  paratrooper
emotion    icon     emoticon
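
A very naive spelling-only baseline helps show where the difficulty lies. The sketch below (the function name overlap_blends is made up for illustration) joins a prefix of A to the remainder of B at every letter the two words share. It recovers beefalo, sheeple and emoticon among its candidates, but it produces nothing for breakfast + lunch, which share no letters, and it offers no way of ranking the candidates it does generate.

```python
def overlap_blends(a, b):
    """Blend a and b at every shared letter: prefix of a + remainder of b."""
    blends = set()
    for i, char_a in enumerate(a):
        for j, char_b in enumerate(b):
            if char_a == char_b:
                # keep a up to and including the shared letter, then continue with b
                blend = a[:i + 1] + b[j + 1:]
                if blend not in (a, b):
                    blends.add(blend)
    return blends


for word_a, word_b in [("beef", "buffalo"), ("sheep", "people"), ("breakfast", "lunch")]:
    print(word_a, word_b, sorted(overlap_blends(word_a, word_b)))
```

Choosing the best candidate, and handling pairs that blend by sound rather than by spelling, is where a pronunciation resource or a learned model would come in.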

Or better yet, given the words A and B, can we create a portmanteau C that does not directly feature A or B but is still related to them? For instance:

A      B       Portmanteau
angry  Mozart  scaria (scary / aria)

Here is a quick demo of such an approach. It requires a fairly complex setup and external tools, such as the CMU Pronouncing Dictionary, which makes it difficult to port to other languages.

Is there something we can do to work around that? Some neural approach perhaps?

Shared tasks

Shared tasks are essentially "academic Kaggle": you get a task and some data, and you produce a model that tries to do well on it. During the evaluation period, you normally submit predictions on the test set. It's a relatively straightforward way of going from a task to a solution without having to bother with the difficult part of finding an appropriate dataset.

A few examples:

  • Automatic Humor Analysis (https://www.joker-project.com/clef-2024/tasks)
  • Detecting hero, villain, and victim from memes: given a meme (the image + the text extracted from the meme) and a list of entities, the task is to predict the role of each entity: “hero”, “villain”, “victim”, or “other”. (https://checkthat.gitlab.io/clef2024/task4/)
  • Multilingual Text Detoxification: given a toxic piece of text, rewrite it in a non-toxic way while preserving as much of the main content as possible. (https://pan.webis.de/clef24/pan24-web/text-detoxification.html)
  • Persuasion Techniques: given a set of news articles and a list of 23 persuasion techniques, including logical fallacies (straw man, red herring, bandwagon, …) and emotional manipulation techniques (loaded language, appeal to fear, name calling, …) that might be used to support flawed argumentation, the task consists of identifying the spans of text in which each technique occurs. (https://checkthat.gitlab.io/clef2024/task3/)
  • SOTA? Given the full text of an AI paper, recognize whether it reports model scores on benchmark datasets and, if so, extract all pertinent (Task, Dataset, Metric, Score) tuples presented in the paper. (https://sites.google.com/view/simpletext-sota/home?authuser=0)

Your own idea!

Feel free to come up with an idea on your own -- if you are working on something NLP-related for your thesis, that would be a good candidate. But in general, I'd be happy to talk about any NLP-related idea you may have.

Alternatively, feel free to check out the sites below, find an NLP task you find interesting and see if you can make an interesting project out of it!