Linux CLI for Data Science 2023
The 2023 edition of the course, taught at the Faculty of Mathematics, Physics and Informatics of Comenius University.
- Lectures (F1-247)
- Thursday, 11:30 - 13:00
- Labs (H-6 / F2-295)
- Tuesday, 16:30 - 18:00 (voluntary)
There is also a Teams class.
- Goal
- Lab Lectures
- Lecture 1: Intro to Command Line
- Lecture 2: Files and Directories
- Lecture 3: Standard I/O, Pipes and Text Processing
- Lecture 4: Processes and Signals
- Lecture 5: Users, Groups and Regular Expressions
- Lecture 6: Vim
- Lecture 7: File and directory attributes
- Lecture 8: find and xargs
- Lecture 9: sed and awk
- Lecture 10: csvkit and jq
- Lecture 11: Git
- Lecture 12: Modern Unix Tools
- Resources
- Grading
Goal
The goal of this lab is to
- show you the cool things (your) computers are capable of
- get you acquainted with UNIX-like operating systems, the tradition which powers much of modern computing
- be a fun break from other classes
What you are studying is non-trivial already. It is not our job to punish you for choosing to do that but to give you some practical skills that will let you apply it straight away.
Lab Lectures
Lecture 1: Intro to Command Line
- Discussed material:
- History of UNIX-like operating systems
- Text console, Shell and Secure Shell (SSH)
- Shell Commands (short intro and some examples)
- ... and more in the first set of slides
- Supplementary resources:
- The TTY demystified: so what exactly is this teletype that has been mentioned a few times? This article starts with a caveat that it is not particularly elegant, but once you read through it, you'll get a much more thorough understanding of (modern) UNIX-like systems and of UNIX history as well.
- The History of Unix by Rob Pike: it is not every day that you get an important piece of (computing) history described by someone who helped make it. Well worth the watch!
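If you want to try things out right after the first lecture, here is a minimal sketch of logging in over SSH and running a few basic commands. The username and hostname below are placeholders, not the actual course server.

```bash
# Connect to a remote machine over SSH (replace both parts with the
# credentials you were given for the course).
ssh your_username@example-server.example.org

# A few basic commands to get a feel for the shell:
whoami          # print the name of the current user
hostname        # print the name of the machine you are logged into
date            # print the current date and time
echo "hello"    # print a string to the terminal
man echo        # open the manual page of a command (quit with q)
```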
Lecture 2: Files and Directories
- Discussed material:
- UNIX-style file system
- Directory tree and its important parts
- Navigating the filesystem
- Completion and autocompletion in Bash
- ... and more in the second set of slides
- Supplementary resources:
- How dotfiles came to be: A short story (by Rob Pike once again) about how dotfiles (you know, the hidden files that start with a dot) came to be and what it says about the unintended effects of cutting corners and just "hacking around" a problem.
- The history of the /usr split: A different story but a very similar moral. Read through it to find out how the /bin vs. /usr/bin split happened, how irrelevant it is these days and how one needs to fight against bad ideas in order not to let them propagate.
- Linux Filesystem Hierarchy: A deeper discussion on the various parts of the standard Linux filesystem, describing several of the directories in much higher detail than the slides ever could.
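As a quick, self-contained refresher of the navigation part of this lecture, the following sketch can be pasted into any Bash session (the paths are just common examples):

```bash
pwd                 # print the current (working) directory
ls -la              # list the directory contents, including hidden dotfiles

cd /usr/bin         # jump to an absolute path
cd ..               # go one level up in the directory tree
cd ~                # return to your home directory
cd -                # jump back to the previous directory

# Tab completion: type the beginning of a path and press Tab,
# e.g. typing `cd /usr/sh` and pressing Tab usually completes to /usr/share/.
```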
Lecture 3: Standard I/O, Pipes and Text Processing
- Discussed material:
- Standard Input/Output
- Pipes
- Introduction to Text Processing
- ... and more in the third set of slides
- Supplementary resources:
- AT&T Advertisement for UNIX: Watch Brian Kernighan describe (in a very down-to-earth fashion) what's great about UNIX, especially how pipes play an important role in that.
- How are UNIX pipes implemented: a very thorough and deep overview of how UNIX pipes came to be, how they were originally implemented and what that implementation looks like now.
- Bash Oneliners Explained -- All about redirections: a more in-depth discussion of how all sorts of redirections work in Bash and how you can make use of them in your work.
- Introduction to text manipulation on UNIX-based systems: A very extensive, in-depth guide to what's possible with just the standard tools when it comes to text processing on UNIX-like systems. (Spoiler alert: a lot!)
- The UNIX Command Language: This paper from 1976 (!), written by none other than Ken Thompson, is the first paper ever published on the Unix shell. If for nothing else, it's almost certainly worth reading for its amazing clarity of presentation and concise treatment.
- Fun with Redirection: A very nice retelling of the content of the lecture in a very approachable manner. Do check it out, especially if you like cats!
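To tie the resources above back to the lecture itself, here is a small sketch of redirection and pipes. The file names are placeholders; everything else uses only standard tools.

```bash
ls /etc > listing.txt           # redirect stdout into a file (overwrites it)
ls /etc >> listing.txt          # append instead of overwriting
ls /nonexistent 2> errors.txt   # redirect stderr separately
sort < listing.txt              # feed a file to a command via stdin

# Pipes: connect the stdout of one command to the stdin of the next.
# For example, count which login shells appear in /etc/passwd:
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn
```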
Lecture 4: Processes and Signals
- Discussed material:
- Processes
- Signals
- ... and more in the fourth set of slides
- Supplementary resources:
- An introduction to UNIX processes: This piece gives you "yet another" rundown of what UNIX processes are about. What's interesting about it is the part about fork and exec, which we only quickly went over in the lecture. I would very much recommend taking a look at it.
- Two great signals: SIGSTOP and SIGCONT: What do you do when you've got a long-running script that you cannot afford to (or just don't want to) stop but would very much like to at least pause? This article will tell you a bit about that.
- Should you be scared of Unix signals?: A short attempt at making the Unix signals look a bit less scary. It's a bit technical but if you'd like to go a bit deeper, still very worth reading.
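Following up on the SIGSTOP/SIGCONT article above, here is a small sketch of pausing and resuming a process; `sleep` stands in for a long-running script.

```bash
sleep 600 &             # start a long-running command in the background
ps                      # list processes started from this shell
jobs -l                 # list background jobs together with their PIDs

kill -STOP %1           # pause job 1 (SIGSTOP cannot be caught or ignored)
kill -CONT %1           # let it continue again (SIGCONT)

kill %1                 # polite termination (the default signal is SIGTERM)
kill -9 %1              # SIGKILL, only as a last resort if it refuses to die
```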
Lecture 5: Users, Groups and Regular Expressions
- Discussed material:
- Users
- Groups
- Regular Expressions
- ... and more in the fifth set of slides
- Supplementary resources:
- Ken Thompson's Unix password: A story of how the password of one of the old-timers was cracked nearly 40 years later and why "shadowing" is generally not a bad idea.
- The origins of grep: Brian Kernighan, one of the forefathers of UNIX, discusses how grep came to be, and it makes for a rather interesting story! (If you are in a hurry, here is a 10 minute video.)
When it comes to regular expressions, it helps a lot to visualize what they match and how. There are two tools we recommend in this regard:
- Regex101 which is basically an integrated development environment (IDE) for regular expressions
- Regexper which nicely visualizes regular expressions as "proto programs". Here is a sample visualization.
If you'd like to play a bit with regular expressions and improve your skills at the same time, there is
We recommend them all!
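Alongside the visualization tools above, it also helps to simply try a few patterns with grep. The patterns below are purely illustrative; /etc/passwd is used because it exists on virtually every UNIX-like system.

```bash
grep 'bash$' /etc/passwd                 # lines ending with "bash"
grep -E '^(root|daemon):' /etc/passwd    # lines starting with "root:" or "daemon:"
grep -E -o '[0-9]+' /etc/passwd | head   # print only the matched numbers
grep -c ':/home/' /etc/passwd            # count the lines containing ":/home/"
```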
Lecture 6: Vim
- Discussed material:
- Vim's philosophy
- NORMAL, INSERT, VISUAL and COMMAND modes
- Editing text in Vim using regular expressions and Unix commands
- ... and more in the sixth set of slides
- Supplementary resources:
- Why does Vim use hjkl: a nice historical explanation of the rather strange phenomenon that is hjkl.
- Why should you learn Vim in 2020: a nice reflection on the question many of you are asking.
- How to exit vim: a fairly crazy list of (unconventional) ideas that end up closing Vim after all...
- Your problem with Vim is that you don't grok vi: probably one of the most famous StackOverflow answers of all time, on how Vim relates to Vi and how much you can learn from well-designed technologies that haven't changed much over the years.
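As a tiny illustration of the "editing text in Vim using regular expressions and Unix commands" part, the commands you would type in Vim's COMMAND mode can also be driven from the shell. The file name and its contents below are made up for the example.

```bash
# Inside Vim, after pressing ':', you would type commands such as:
#   :%s/foo/bar/g      substitute foo with bar on every line
#   :g/^$/d            delete all empty lines
#   :%!sort            filter the whole buffer through an external command
#   :wq                write the file and quit

# A self-contained demonstration, driven non-interactively from the shell:
printf 'foo two\n\nfoo one\n' > notes.txt
vim -c '%s/foo/bar/g' -c 'g/^$/d' -c '%!sort' -c 'wq' notes.txt
cat notes.txt          # prints: bar one / bar two
```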
Lecture 7: File and directory attributes
- Discussed material:
- The concept of inode
- File metadata (permissions, timestamps, owner and group)
- Hardlinks and Symlinks
- ... and more in the seventh set of slides
- Supplementary resources:
- Symlinks, Hardlinks, Reflinks and ML projects: This article goes deeper into how these concepts of links can be used for various Machine Learning (ML) projects where you work with a ton of data.
- Symlinks in Windows 10: Yes, they are such a good idea that even Windows (at least as of Windows 10) has them now. The reason why is interesting: many of the commonly used development tools these days basically require them.
- unix-permissions: Swiss Army knife for Unix permissions: A simple utility that allows you to (programmatically) convert between various ways of describing permissions of UNIX files.
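Here is a small sketch that ties the inode, permission, and link concepts together; all file names are placeholders.

```bash
touch report.txt          # create an empty file
ls -li report.txt         # -i shows the inode number, -l the metadata
stat report.txt           # full metadata: permissions, owner, group, timestamps

chmod 640 report.txt      # rw- for the owner, r-- for the group, --- for others
chmod g+w report.txt      # the symbolic way of adding group write permission

ln report.txt hard.txt    # hardlink: same inode, the link count goes up to 2
ln -s report.txt soft.txt # symlink: a separate file that merely points at the name
ls -li report.txt hard.txt soft.txt
```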
Lecture 8: find and xargs
- Discussed material:
- What searching for stuff on the filesystem would entail (grep)
- find
- xargs
- ... and more in the eighth set of slides
- Supplementary resources:
- The history of find: Despite what we make it out to be, find does have a bit of a negative connotation to it, mostly because it does not embody the UNIX philosophy to the extent other tools do. Check out this link for a fun story that provides a bit of a backstory (and one heck of a punchline!).
- Why doesn't grep work: A short article on the difference between Basic and Extended Regular Expressions (and how that relates to the situation in "real" programming languages).
- Things you don't know about xargs: Some more advanced capabilities of xargs described in a friendly way with more than a few examples.
- xargs considered harmful?: A nice discussion of the shortcomings of xargs, especially on the design and/or UX level. It is actually a discussion associated with a blogpost on a forum called lobste.rs, which we can only recommend. Turns out the discussion is perhaps more valuable than the blogpost itself.
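To make the find/xargs combination a bit more concrete, here are a few hedged examples. The name patterns and paths are placeholders; with destructive commands it is a good habit to prepend echo first to see what would actually run.

```bash
find . -name '*.log'             # *.log files anywhere under the current directory
find /tmp -type d -mtime +7      # directories not modified in the last 7 days
find . -size +100M               # files larger than 100 MB

# find + xargs: run a command on everything that was found.
# -print0 and -0 keep file names with spaces or newlines safe.
find . -name '*.log' -print0 | xargs -0 gzip
find . -name '*.py' -print0 | xargs -0 grep -l 'TODO'

# The same effect without xargs, using find's own -exec:
find . -name '*.log' -exec gzip {} +
```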
Lecture 9: sed and awk
- Discussed material:
- sed
- awk
- wget / curl
- ... and more in the ninth set of slides
- Supplementary resources:
- A conundrum for a sed wizard: A real-life story of the sort of craziness people solve with sed (and a reminder that not being able to figure something out is more than OK).
- Removing duplicate lines from files preserving their order: Despite what it sounds like, the task is actually not that simple, and yet with awk you can pull it off with a simple oneliner. [1] I especially recommend this article for its second part, where the author explains what actually happens when that oneliner gets executed and why the alternatives would not work. (A version of the oneliner appears in the sketch after this list.)
- Expense Calculator in awk: One of the most beautiful examples of what awk is capable of. The best part: you already know enough to read through the awk code yourself!
- Awk: The Power and Promise of a 40-Year-Old Language is a very nice article worth reading through if you would like to understand why people still use this language to this day. Oh, and it certainly is worth reading through this comment on lobste.rs -- it's one of the reasons why people read forums like this. You will have a hard time finding it anywhere else.
- As we discussed during the lecture, one of the biggest advantages of the tools we are learning about is their versatility and ubiquity. Here is a quick example of how wget was used to demonstrate a data leak of patients tested for Covid-19.
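Here is a short sketch of the kind of sed/awk/wget/curl usage discussed in the lecture. The file names and the URL are placeholders.

```bash
sed 's/colour/color/g' input.txt       # substitute on every line
sed -n '5,10p' input.txt               # print only lines 5 to 10
sed '/^#/d' config.txt                 # drop comment lines

awk -F: '{ print $1 }' /etc/passwd     # first field of a colon-separated file
awk '{ sum += $2 } END { print sum }' data.txt   # sum up the second column
awk '!seen[$0]++' input.txt            # the deduplication oneliner (golfed form) [1]

wget https://example.org/data.csv      # download and save as data.csv
curl -s https://example.org/data.csv | head   # print the first lines, save nothing
```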
Lecture 10: csvkit and jq
- Discussed material:
- CSV as a format
- csvkit
- JSON and jq
- ... and more in the tenth set of slides
- Supplementary resources:
- So You Want To Write Your Own CSV code?: A passive-aggressive discussion of what parsing CSV actually entails. If you have never done it yourself, take a look and I am pretty sure you will think twice before doing so at any point in the future.
- Falsehoods Programmers Believe About CSVs: An article from a similar genre, which sums up one programmer's experience about all the oddities one can run into when dealing with CSVs in the real world.
- Illustrated jq tutorial: If you liked the few jq examples we've shown, check out this tutorial as well. It goes very nicely over more than a few examples which are rather close to real life. Note that you can click on any "piped" command and see the interim results (i.e. what it looks like after that part of the pipe gets evaluated).
- Console Spreadsheets: You already learned that people tend to be crazy when using the command line. This link shows that spreadsheets are no exception -- it can help you grasp what the world looked like before Microsoft Excel and Google Sheets became the status quo.
- Other CSV processing CLI tools
- xsv: A Rust implementation of a suite very similar to csvkit. A bit quirky but really fast.
- miller: An alternative to csvkit which tries to be more multi-purpose (its tag line says "Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON").
- textql: It often happens that you wish you could just run an SQL query against your .csv file. That's exactly what textql allows you to do. If you'd like something that's even more versatile, check out q.
- qsv: a fork of xsv that is still actively developed (https://github.com/jqnatividad/qsv).
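A small sketch of the csvkit and jq workflows from the lecture; the file name, URL, and column names are made up for illustration.

```bash
csvlook data.csv                        # pretty-print the CSV as a table
csvcut -n data.csv                      # list the column names
csvcut -c name,age data.csv | csvlook   # keep only two columns
csvstat data.csv                        # per-column summary statistics
csvsql --query 'SELECT name, age FROM data WHERE age > 30' data.csv

curl -s https://example.org/users.json | jq '.'            # pretty-print JSON
curl -s https://example.org/users.json | jq '.[].name'     # one field per element
curl -s https://example.org/users.json | jq '[.[] | {name, age}]'   # reshape
```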
Lecture 11: Git
- Discussed material:
- Version Control
- Git fundamentals (repository, staging area, commit, branch)
- Introduction to Git remotes (cloning, pushing/pulling)
- ... and more in the eleventh set of slides
- Supplementary resources:
- A Grip On Git: Since the lecture used slides, it was a bit boring and it was not difficult to lose the conceptual train of thought. The tutorial at this link could help in that case: it walks you through similar content while providing a visual representation of what is going on in the repository as you execute various git commands. Worth checking out, even if just for the artistic experience.
- Visual Git Reference: We've only gone through a handful of commands and a few straightforward use cases in the lecture. This visual reference covers most of the commands you will encounter if you end up using Git. If you like pretty pictures, I certainly recommend you check it out.
- fh: file history: It turns out the diff command we talked about some time ago is capable of printing the difference between two files not just in a "human readable" format, but also as ed commands (ed being the line-oriented predecessor of Vi and Vim). What this allows one to do is put together a simple "version control system" using just the commands we have already discussed in class, that is ed, diff, awk, sed, and sh. (Yes, this is considered nerdy, even for people who already control their computers from the command line...)
- Git Purr: Git Commands Explained With Cats: The title says it all.
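To recap the Git fundamentals from the slides, a minimal local-plus-remote workflow might look as follows. The repository URL is a placeholder, and the name of the default branch (main vs. master) depends on your Git configuration.

```bash
git init myproject && cd myproject
echo "Hello" > README.md
git status                       # see what is untracked or staged
git add README.md                # move the change into the staging area
git commit -m "Add README"       # record a commit

git switch -c experiment         # create and switch to a new branch
git switch -                     # ...and switch back to the previous one

git clone https://example.org/someone/somerepo.git   # get a copy of a remote repo
git pull                         # fetch and merge changes from the remote
git push                         # publish your local commits
```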
Lecture 12: Modern Unix Tools
- Discussed material:
- tmux
- new shells (fish, xonsh and nushell)
- grep alternatives (especially ripgrep)
- other miscellaneous tools (replacements for ls, cat, df, ps and so on)
- ... and more in the twelfth set of slides
- Supplementary resources:
- tmux - a very simple beginner's guide: An extremely oversimplified guide to tmux which you can go through in about 5 minutes. We strongly encourage you to try it out -- feel free to use davos, where tmux ought to be set up already, as we've used it throughout the whole semester. For a list of things you can do with tmux, feel free to check out the tmux cheat sheet. (A short sketch of the basic workflow follows after this list.)
- Modern Alternatives of Command-Line Tools: This article has inspired much of the content discussed in the slides. It is accompanied with graphical demos of various commands. Worth taking a look!
- Become shell literate: Our final attempt at persuading you that this whole thing made sense. Full disclosure: the author of the article is a well-known free software advocate, so he is far from impartial. That said, he is certainly not alone in suggesting it; here is another example from Letters To A New Developer.
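Finally, a very short sketch of the basic tmux workflow mentioned above; the session name is arbitrary and the key binding assumes the default Ctrl-b prefix.

```bash
tmux new -s work        # start a new named session
# ...run something long inside it, then press Ctrl-b followed by d to detach;
# the session keeps running on the server even if your SSH connection drops.
tmux ls                 # list the sessions that are still alive
tmux attach -t work     # re-attach to it, e.g. after reconnecting over SSH

# One of the "modern" replacements from the slides:
rg TODO                 # ripgrep: like a recursive grep, but faster and gitignore-aware
```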
Resources
Slides
The LISA conference (part of USENIX, the old UNIX organization) has had a workshop called Linux Productivity Tools. It's basically "zero to hero" in 89 slides and well worth checking out, especially if you are in a hurry.
Historical Books
If you like books, here are two worth reading:
UNIX: A History and a Memoir by Brian W Kernighan
A historical account of how UNIX came to be by someone who was there when it happened. It will help you paint the proper picture of what is meant when people say stuff like "UNIX legacy" or "the UNIX era".
The Cuckoo's Egg: Tracking a Spy Through the Maze of Computer Espionage by Cliff Stoll
Strangely enough, this one reads like a novel, yet it is a true story of a physicist who tracked one of the first documented "hackers" (cracker would really be a better term here, but I digress) whom he found snooping around his systems. The best part is that it's all real, down to the (obviously UNIX) commands that were used. Well worth a read!
Grading
Component | Weight |
---|---|
Assignments | 50% |
Exam | 50% |
There will be one assignment per week. Each of them is (normally) worth 5% (plus some bonuses). You have up to a week to finish them, but most people manage to do it during the lab.
The exam will cover the content discussed at the lectures.
Points | Grade |
---|---|
(92, inf] | A |
(84, 92] | B |
(76, 84] | C |
(68, 76] | D |
(60, 68] | E |
[0, 60] | FX |
[1] To be fair, that oneliner is a bit "golfed" (i.e. not that straightforward to read and interpret). Here is another, hopefully clearer, version: awk '! visited[$0] { print $0; visited[$0] = 1 }'