Introduction to Computing and the Command Line (optional)

Overview

This module introduces you to some computing concepts and provides a brief tutorial on working on the command line (also known as the shell or the terminal).

Preparation

Our primary computing context will be the UC Berkeley DataHub. This provides us with a common platform on which we can all expect the same behavior, allowing us to use the command line/shell/terminal and Python, including Jupyter notebooks. In particular, getting a common command line environment would be hard otherwise.

If you haven’t already used the shell/terminal and Python on your laptop, we recommend you use the DataHub so that we don’t have to troubleshoot problems on multiple laptops.

If you do use your laptop, you’ll need Python installed, for which we recommend using Miniforge. Another option is the Anaconda distribution of Python (https://www.anaconda.com/distribution).

For the shell/terminal parts of this material, you can use the Terminal application on a Mac, but there is no equivalent built-in option on Windows.

If you do use your own laptop, you’ll need to clone this GitHub repository. If you don’t know how to do that, just download this zip file and unzip it into your home directory on your computer.

Parts of a computer

We won’t spend a lot of time thinking about computer hardware, but we do need to know some of the basics that relate to using a computer and programming effectively. There are many layers of technical detail one could get into, but the key things for now are to have a basic understanding of these components of a computer:

  • CPU
    • cache (limited fast memory easily accessible from the CPU)
  • main memory (RAM)
  • bus
  • disk

We’ll discuss these components via this overview from a class at CMU.

Depending on what your code is doing, the slow step in a computation might be:

  • doing the actual calculations on the CPU,
  • moving data back and forth from (main) memory (RAM) to the CPU, or
  • reading/writing (I/O) data to/from disk.

Concepts of parallelization

Computers now come with multiple processors for doing computation. Basically, physical constraints have made it harder to keep increasing the speed of individual processors, so the chip industry is now putting multiple processing units in a given computer and trying/hoping to rely on implementing computations in a way that takes advantage of the multiple processors.

Everyday personal computers usually have a single processor chip, and that processor usually has more than one core (multi-core); some machines, such as many servers, have more than one chip. A multi-core processor has multiple processing units (cores) on a single computer chip. On personal computers, all the processors and cores share the same memory.

Here are some key terms relating to hardware:

  • cores: We’ll use this term to mean the different processing units available on a single machine or node of a cluster. A given CPU will have multiple cores. (E.g., the AMD EPYC 7763 has 64 cores per CPU.)
    • hardware threads: these are very different from the (software) threads discussed below: hardware hyperthreading involves separate hyperthreads (“virtual cores”) on a single (physical) core that behave like separate cores
  • nodes: We’ll use this term to mean the different computers, each with their own distinct memory, that make up a cluster or supercomputer.
  • GPUs: a GPU has a very large number of simple processing units that carry out many tasks in parallel.

Here are some key terms related to computation:

  • processes: instances of a program executing on a machine; multiple processes may be executing at once. A given program may start up multiple processes at once. Ideally we have no more processes than cores on a node.
    • threads: multiple paths of execution within a single process; the operating system sees the threads as a single process, but one can think of them as ‘lightweight’ processes. Ideally when considering the processes and their threads, we would have the same number of cores available to our code as we have processes and threads combined.
  • workers: the individual processes that are carrying out the (parallelized) computation. We’ll use worker and process interchangeably.
  • tasks: individual units of computation; one or more tasks will be executed by a given process on a given core.
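To make the distinction between cores, processes, and threads more concrete, here is a minimal Python sketch (an illustration added here, not part of the workshop code). It asks the operating system how many logical cores it sees (hyperthreads count as cores in this number) and then starts three threads, each of which reports the ID of the process it belongs to. The names n_logical_cores, pids_seen, and record_pid are made up for this example.

```python
import os
import threading

# Number of logical cores the operating system reports
# (hardware hyperthreads are counted as cores here; may be None if unknown).
n_logical_cores = os.cpu_count()

pids_seen = []

def record_pid():
    # Each thread records the ID of the process it is running in.
    pids_seen.append(os.getpid())

# Start three threads within this single Python process.
threads = [threading.Thread(target=record_pid) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All three threads should report the same process ID as the main program.
all_in_one_process = set(pids_seen) == {os.getpid()}

print("logical cores:", n_logical_cores)
print("threads share one process:", all_in_one_process)
```

Because the threads live inside one process, they all see the same process ID, which is why the operating system treats them as a single process.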

Basic parallelization example

As an example, suppose we were carrying out 20-fold cross-validation and we were using a laptop with 4 cores.

We would generally use 4 workers (4 processes) to carry out the 20 tasks (each task being to fit the model on the non-held out data and make predictions on the held-out data in the fold). So 4 of the tasks could be executing at the same time. When a process finishes a task assigned to it, it can start on the next task that is still left to be done.
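The scheduling pattern described above can be sketched in Python as follows. This is only an illustration under some assumptions: run_fold is a hypothetical placeholder for the real model fitting, and we use a pool of 4 threads as a stand-in for 4 worker processes so that the example stays small (concurrent.futures.ProcessPoolExecutor offers the same interface for separate processes).

```python
from concurrent.futures import ThreadPoolExecutor
import threading

N_FOLDS = 20    # number of tasks (one per cross-validation fold)
N_WORKERS = 4   # number of workers, matching the 4 cores on the laptop

def run_fold(fold):
    """Hypothetical placeholder: fit on the other folds, predict on this one."""
    worker = threading.current_thread().name
    return fold, worker

# At most N_WORKERS tasks run at once; when a worker finishes a task,
# it picks up the next task that is still waiting.
with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    results = list(pool.map(run_fold, range(N_FOLDS)))

folds_done = [fold for fold, _ in results]
workers_used = {worker for _, worker in results}

print(len(folds_done), "tasks handled by", len(workers_used), "worker(s)")
```

Since the placeholder tasks finish almost instantly, a single worker may end up handling many of them; with real model fits the work would be spread across all 4 workers.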

The filesystem

The filesystem is the organization of data/files on a storage device, such as the local disk on your machine.

The filesystem on a Linux or MacOS machine can be thought of as an upside-down tree, with / as the root (top) of the tree.

For example, a basic tree on a Linux machine might look like this:

/
├── accounts
│   ├── biost
│   │   ├── wang
│   │   ├── jones
│   │   └── mcfadden
│   └── grad
│       ├── smith
│       ├── sanchez
│       └── zhang
├── bin
│   ├── echo
│   ├── grep
│   └── tmux
└── lib
    ├── gcc
    ├── grep
    └── tmux

/accounts, /bin, and /lib are the directories at the root in the example above.
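One way to see the nesting is to take a single location in the example tree and split it into its pieces. The following Python sketch uses the standard pathlib module purely to manipulate the path as text; the path /accounts/grad/smith comes from the example tree and does not need to exist on your machine.

```python
from pathlib import PurePosixPath

# A location in the example tree, written as a path starting at the root (/).
path = PurePosixPath("/accounts/grad/smith")

parts = path.parts     # the root followed by each nested directory
parent = path.parent   # one level up the tree
root = path.anchor     # the top of the tree

print(parts)    # ('/', 'accounts', 'grad', 'smith')
print(parent)   # /accounts/grad
print(root)     # /
```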

We’ll see more examples when we start using the shell.

Things are similar on Windows, though there the root is a lettered drive, such as C:.

For scientific computing, you’ll often be connecting remotely to other machines/clusters and navigating the filesystem on those other systems.

The shell and UNIX commands

Operating on the UNIX command line is also known as using the terminal and using the shell.

The shell is the UNIX program that you interact with when in a terminal window interacting with a UNIX-style operating system (e.g., Linux or MacOS). The shell sits between you and the operating system and provides useful commands and functionality. Basically, the shell is a program that serves to run other commands for you and show you the results. There are actually different shells that you can use, of which bash is very common and is the default on many systems. In recent versions of MacOS, zsh is the default shell. zsh is a separate shell, not an extension of bash, but the two share much of their syntax and most of what we do here works in both.

We’ll start a shell on our DataHub (Jupyter) server, which is (basically) a Linux virtual machine running in the cloud.

On DataHub, the prompt in the shell looks like this:

jovyan@jupyter-paciorek:~$

but below we’ll indicate that simply as:

$

Files and directories

Moving around and listing information

We’ll start by thinking about the filesystem, which organizes our information/data into files on the computer’s disk.

Anytime you are at the UNIX command line, you have a working directory, which is your current location in the file system.

Here’s how you can see where you are using the pwd (“print working directory”) command:

$ pwd
/home/jovyan/compute-skills-2024

and here’s how you use ls to list the files (and subdirectories) in the working directory…

$ ls
assets       license.qmd  README.md     units           _variables.yml
buttons.yml  prep.qmd     schedule.yml  Untitled.ipynb
index.qmd    _quarto.yml  syllabus.qmd  untitled.py

Now suppose I want to be in a different directory so I can see what is there or do things to the files in that directory.

The command you need is cd (“change directory”) and an important concept you need to become familiar with is the notion of ‘relative’ versus ‘absolute’ path. A path is the set of nested directories that specify a location of interest on the filesystem.

Now let’s go into a subdirectory. We can use cd with the name of the subdirectory. The subdirectory is found ‘relative’ to our working directory, i.e., found from where we currently are.

$ cd units
$ pwd
/home/jovyan/compute-skills-2024/units

We could also navigate through nested subdirectories. For example, after going back to our home directory (using cd alone will do this), let’s go to the units subdirectory of the compute-skills-2024 subdirectory. The / is a separator character that distinguishes the nested subdirectories.

$ cd
$ cd compute-skills-2024/units

You can access the parent directory of any directory using ..:

$ cd ..
$ pwd
/home/jovyan/compute-skills-2024

We can get more complicated in our use of .. with relative paths. Here we’ll go up a directory and then down to a different subdirectory.

$ cd units
$ cd ../assets
$ pwd
/home/jovyan/compute-skills-2024/assets

And here we’ll go up two directories and then down to another subdirectory (but note that /home/jovyan/tmp may not exist for your server).

$ cd ../../tmp  # Go up two directories and down.
$ pwd
/home/jovyan/tmp

Absolute versus relative paths

All of the above examples used relative paths to navigate based on your working directory at the moment you ran the command.

We can instead use absolute paths so that it doesn’t matter where we are when we run the command. Specifying an absolute path is done by having your path start with /, such as /home/jovyan. If the path doesn’t start with / then it is interpreted as being a relative path, relative to your working directory. Here we’ll go to the units subdirectory again, but this time using an absolute path.

$ cd /home/jovyan/compute-skills-2024/units 
$ pwd
/home/jovyan/compute-skills-2024/units

Absolute paths can be dangerous

It’s best to generally use relative paths, relative to the main directory of a project.

Using absolute paths is generally a bad idea for reproducibility and automation because the file system will be different on different machines, so your code wouldn’t work correctly anywhere other than your current machine.
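To check your understanding of how relative and absolute paths are interpreted, here is a small Python sketch that mimics the rule using only text manipulation (it does not touch the real filesystem and ignores details such as symbolic links). The helper name resolve is made up for this example; the directories are the ones used in the cd examples above.

```python
import posixpath

def resolve(working_dir, path):
    """Return the absolute location that `path` refers to from `working_dir`."""
    # join() keeps `path` as-is if it starts with /, otherwise appends it to
    # working_dir; normpath() then collapses any ".." components.
    return posixpath.normpath(posixpath.join(working_dir, path))

units = "/home/jovyan/compute-skills-2024/units"
assets = "/home/jovyan/compute-skills-2024/assets"

up_and_down = resolve(units, "../assets")     # like: cd ../assets
up_two = resolve(assets, "../../tmp")         # like: cd ../../tmp
absolute = resolve("/tmp", units)             # working directory is irrelevant

print(up_and_down)   # /home/jovyan/compute-skills-2024/assets
print(up_two)        # /home/jovyan/tmp
print(absolute)      # /home/jovyan/compute-skills-2024/units
```

The last line illustrates why an absolute path gives the same result no matter where you are, and also why it stops working on a machine where that directory does not exist.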

The filesystem

The filesystem is basically an upside-down tree.

For example, if we just consider the compute-skills-2024 directory, we can see the tree structure using tree (not available on DataHub):

$ cd /home/jovyan/compute-skills-2024
$ tree

Often (as is the case here), that would print out a lot of files and directories, so I’ll just print out the first few lines of the result here. One doesn’t usually use tree that much – our purpose here is to emphasize the hierarchical, nested structure of the filesystem.

.
├── assets
│   ├── buttons-alt.ejs
│   ├── buttons.ejs
│   ├── schedule-alt.ejs
│   ├── schedule.ejs
│   ├── stat_bear.png
│   ├── styles-alt.scss
│   └── styles.css
├── buttons.yml
├── _freeze
│   ├── site_libs
│   │   ├── clipboard
│   │   │   └── clipboard.min.js
│   │   └── quarto-listing
│   │       ├── list.min.js
│   │       └── quarto-listing.js

The dot (.) means “this directory”, so the top of the tree here is the compute-skills-2024 directory itself, within which there are various subdirectories. Then within each of these are files and further subdirectories.

If we consider the entire filesystem, the top, or root of the tree, is the / directory. Within / there are subdirectories, such as /home (which contains users’ home directories where all of the files owned by a user are stored) and /bin (containing UNIX programs, aka ‘binaries’). We’ll use ls again, this time telling it the directory to operate on:

$ ls /
bin   dev  home  lib32  libx32  mnt  proc  run   srv  tmp  var
boot  etc  lib   lib64  media   opt  root  sbin  sys  usr

If there is a user named jovyan, everything specific to that user would be stored in /home/jovyan. The shortcut ~jovyan refers to /home/jovyan. If you are the jovyan user, you can also refer to /home/jovyan by the shortcut ~.

$ ls /home
jovyan shiny
$ cd /home/jovyan

Go to the home directory of the current user (which happens to be the jovyan user):

$ cd ~
$ pwd
/home/jovyan

Go to the home directory of the jovyan user explicitly:

$ cd ~jovyan
$ pwd
/home/jovyan

Another useful directory is /tmp, which is a good place to put temporary files that you only need briefly and don’t need to save. These will disappear when a machine is rebooted.

$ cd /tmp
$ ls
jupyter-runtime

We can return to the most recent directory we were in like this:

$ cd -
$ pwd
/home/jovyan

Using commands

Overview

Let’s look more at various ways to use commands. We just saw the ls command. Here’s one way we can modify the behavior of the command by passing a command option. Here the -F option (also called a flag) shows directories by appending / to anything that is a directory (rather than a file) and a * to anything that is an executable (i.e., a program).

$ cd compute-skills-2024
$ ls -F
$ ls -F /usr/bin

Next we’ll use multiple options to the ls command. -l shows extended information about files/directories. -t shows files/directories in order of the time at which they were last modified and -r shows in reverse order. Before I run ls, I’ll create an empty file using the touch command.

$ cd
$ touch example.txt
$ ls -lrt
total 14496
-rw-r--r-- 1 jovyan jovyan    36365 Aug 29  2022 'Screen Shot 2021-10-26 at 10.54.07 AM.png'
-rw-r--r-- 1 jovyan jovyan 14757718 Oct 25  2022  data_c80_regsample_3.dta
-rw-r--r-- 1 jovyan jovyan     2436 Oct 25  2022  Untitled1.ipynb
-rw-r--r-- 1 jovyan jovyan      470 Oct 25  2022  Untitled2.ipynb
-rw------- 1 jovyan jovyan      438 Jan  3  2023  WHERE-ARE-MY-FILES.txt
drwxr-xr-x 5 jovyan jovyan     4096 May 30 16:12  python-workshop-2023
-rw------- 1 jovyan jovyan       75 May 31 16:18  tmphn20xgkx
-rw------- 1 jovyan jovyan       91 Jun  3 14:56  tmp4d_d65l_
drwxr-xr-x 3 jovyan jovyan     4096 Jun  4 15:58  test-private
drwxr-xr-x 4 jovyan jovyan     4096 Jul 19 14:55  test-jh
-rw-r--r-- 1 jovyan jovyan      918 Jul 19 14:59  Untitled.ipynb
drwxr-xr-x 2 jovyan jovyan     4096 Jul 19 15:52  tmp
-rw-r--r-- 1 jovyan jovyan        0 Jul 31 14:10  untitled.py
-rw-r--r-- 1 jovyan jovyan      617 Jul 31 14:12  Untitled3.ipynb
drwxr-xr-x 7 jovyan jovyan     4096 Aug  1 15:45  compute-skills-2024
-rw-r--r-- 1 jovyan jovyan        0 Aug  1 16:03  example.txt

While each command has its own syntax, there are some rules usually followed. Generally, executing a command consists of four things:

  • the command
  • command option(s)
  • argument(s)
  • line acceptance (i.e., hitting <Return>)

Here’s an example:

$ echo "hello there" >> example.txt
$ echo "and goodbye" >> example.txt
$ echo "really now" >> example.txt
$ wc -l example.txt
3 example.txt

In the above example, wc is the command, -l is a command option specifying to count the number of lines, example.txt is the argument, and the line acceptance is indicated by hitting the <Return> key at the end of the line.

So that invocation counts the number of lines in the file named example.txt.

The spaces are required and distinguish the different parts of the invocation. For this reason, it’s generally a bad idea to have spaces within file names. But if you do, you can use quotation marks to distinguish the file name, e.g.,

$ ls -l "name of my file with spaces.txt"

Also, capitalization matters. For example -l and -L are different options.

Note that options, arguments, or both might not be included in some cases. Recall that we’ve used ls without either options or arguments.

Arguments are usually one or more files or directories.
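To see why spaces and quotation marks matter, we can imitate how a shell breaks a line into pieces. The sketch below uses Python’s standard shlex module, which follows shell-like splitting rules; it is an approximation for illustration, not the shell’s actual implementation.

```python
import shlex

# Command, option, and argument are separated by spaces.
simple = shlex.split("wc -l example.txt")

# Without quotes, each word of the file name becomes a separate argument.
unquoted = shlex.split("ls -l name of my file with spaces.txt")

# With quotes, the file name stays together as a single argument.
quoted = shlex.split('ls -l "name of my file with spaces.txt"')

print(simple)          # ['wc', '-l', 'example.txt']
print(len(unquoted))   # 8
print(quoted)          # ['ls', '-l', 'name of my file with spaces.txt']
```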

Options

Often we can specify an option either in short form (as with -l here) or long form (--lines here), as seen in the following equivalent invocations:

$ wc -l example.txt
3 example.txt
$ wc --lines example.txt
3 example.txt

We can also ask for the number of characters with the -m option, which can be combined with the -l option equivalently in two ways:

$ wc -lm example.txt
  3 35 example.txt
$ wc -l -m example.txt
  3 35 example.txt
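As a sanity check on those numbers, we can reproduce the counts in Python. This sketch assumes example.txt contains exactly the three lines written by the echo commands earlier, each ending in a newline character (which echo adds).

```python
# Contents of example.txt after the three echo commands.
text = "hello there\nand goodbye\nreally now\n"

n_lines = text.count("\n")   # wc -l counts newline characters
n_chars = len(text)          # wc -m counts characters, newlines included

print(n_lines, n_chars)   # 3 35
```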

Options will often take values, e.g., if we want to get the first two lines of the file, the following invocations are equivalent:

$ head -n 2 example.txt
$ head --lines=2 example.txt
$ head --lines 2 example.txt
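Many command-line programs written in Python follow the same conventions, and seeing how one declares its options may help clarify them. The sketch below uses the standard argparse module to define a made-up program (called head-sketch here) with a -n / --lines option that takes a value; it only parses the arguments and does not read any file.

```python
import argparse

# A hypothetical program that accepts the same option forms as head.
parser = argparse.ArgumentParser(prog="head-sketch")
parser.add_argument("-n", "--lines", type=int, default=10)  # short and long form
parser.add_argument("file")                                 # the argument

invocations = [
    ["-n", "2", "example.txt"],
    ["--lines=2", "example.txt"],
    ["--lines", "2", "example.txt"],
]

# All three invocations should be parsed identically.
parsed = []
for argv in invocations:
    args = parser.parse_args(argv)
    parsed.append((args.lines, args.file))
    print(argv, "->", args.lines, args.file)
```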

Comments

Anything that follows # is a comment and is ignored.

$ # This is ignored
$ ls  # Everything after the # is ignored
 compute-skills-2024                          test-jh           Untitled2.ipynb
 data_c80_regsample_3.dta                     test-private      Untitled3.ipynb

Getting help with UNIX commands

Essentially all UNIX commands have help information (called a man page), accessed using man.

$ man ls

Once you are in the man page, you can navigate by hitting <space> (to scroll down) and the <up> and <down> arrows. You can search by typing /, typing the string you want to search for and hitting <Return>. You can use n and N to move to the next and previous search hits, and q to quit the man page.

Unfortunately man pages are often quite long, hard to understand, and without examples. But the information you need is usually there if you take the time to look for it.

Also, UNIX commands as well as other programs run from the command line often provide help information via the --help option, e.g., for help on ls:

$ ls --help

Tab completion

If you hit Tab the shell tries to figure out what command or filename you want based on the initial letters you typed.

Try it with:

cd com
ech

The first should allow you to get compute-skills-2024 and the second echo.

Tab completion everywhere

Lots of interactive programs have adopted the idea of tab completion, including Python/IPython and R.

Command history

Hit the up key. You should see the previous command you typed. Hit it again. You should see the 2nd most recent command.

Ctrl-a and Ctrl-e navigate to the beginning and end of a line. You can edit your previous command and then hit Enter to run the modified command.

Seeing if a command or program is available

You can see if a command or program is installed (and where it is installed) using type.

$ type grep
grep is /usr/bin/grep
$ type R
R is /usr/bin/R
$ type python
python is /srv/conda/bin/python
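A similar check can be done from within Python using shutil.which from the standard library, which searches the directories on your PATH. In this sketch, no-such-command-xyz is a deliberately made-up name used to show what happens when a program is not found; the locations printed for the other commands will depend on your system.

```python
import shutil

locations = {}
for name in ("grep", "python", "no-such-command-xyz"):
    # which() returns the full path of the executable, or None if not found.
    locations[name] = shutil.which(name)
    print(name, "->", locations[name])
```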

Working with files

Copying and removing files

You’ll often want to make a copy (cp) of a file, move it (mv) between directories, or remove it (rm).

$ cd ~/compute-skills-2024/units
$ cp calc.py calc-new.py
$ mv calc-new.py /tmp/.
$ cd /tmp
$ ls -lrt
total 8
drwx-----T 2 jovyan jovyan 4096 Aug  1 16:15 jupyter-runtime
-rw-r--r-- 1 jovyan jovyan  413 Aug  1 16:16 calc-new.py

When we moved the file, the use of . in /tmp/. indicates we want to use the same name as the original file.

$ rm calc-new.py
$ ls -lrt
total 4
drwx-----T 2 jovyan jovyan 4096 Aug  1 16:15 jupyter-runtime

rm cannot be undone

I used rm above to remove the file. Be very careful about removing files - there is no Trash folder in UNIX - once a file is removed, it’s gone for good.

The mv command is also used if you want to rename a file.

$ cd ~/compute-skills-2024/units
$ mv session3.qmd _session3.qmd
$ ls
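The same copy, move, rename, and remove operations are available from Python, which can be handy in scripts. Here is a minimal sketch that works entirely inside a temporary scratch directory so that it cannot affect your real files; the file and directory names (calc.py, elsewhere, calc-renamed.py) are chosen only for illustration.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as scratch_name:
    scratch = Path(scratch_name)

    # Create a small file to work with.
    original = scratch / "calc.py"
    original.write_text("print('hello')\n")

    # Like: cp calc.py calc-new.py
    copy = scratch / "calc-new.py"
    shutil.copy(original, copy)

    # Like: mv calc-new.py elsewhere/.  (the file keeps its name)
    subdir = scratch / "elsewhere"
    subdir.mkdir()
    moved = Path(shutil.move(str(copy), str(subdir)))

    # Like: mv calc-new.py calc-renamed.py  (renaming is just a move)
    renamed = subdir / "calc-renamed.py"
    moved.rename(renamed)
    after_rename = sorted(p.name for p in subdir.iterdir())

    # Like: rm calc-renamed.py  (as with rm, there is no undo)
    renamed.unlink()
    after_remove = sorted(p.name for p in subdir.iterdir())
    original_still_there = original.exists()

print(after_rename)          # ['calc-renamed.py']
print(after_remove)          # []
print(original_still_there)  # True
```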

Exercises: shell commands

Exercise 1

Where is gzip installed on the system? What are some other commands/executables that are installed in the same directory?

Exercise 2

Try to run the following command mkdir ~/projects/drought. It will fail. Look in the help information on mkdir to figure out how to make it work without first creating the projects directory.

Exercise 3

Figure out how to list the files in a directory in order of decreasing file size, as a way to easily see which big files are taking up the most space. Then modify the command to list the files in ascending order of size.

Exercise 4

Figure out how to copy an entire directory and have the timestamps of the files retained rather than having the timestamps be the time that you copied the files.

See if you can combine the short form of an option with the long form of a different option.

What happens if you use a single dash with the long form of an option? Are you able to interpret the error message? (Note that, confusingly, there are some situations in which one can use the long form with a single dash.)

Running processes

We’ll run a simple linear algebra operation in Python to illustrate how we can monitor what is happening on a machine.

DataHub memory limits

Be careful about increasing n in the example below when on DataHub. Basic DataHub virtual machines have only 1 GB of memory by default, so a larger n may exhaust the available memory and cause problems in your DataHub session. If you’re doing this as part of the workshop, we’ve increased those limits somewhat, so somewhat larger values of n should be fine.

import numpy as np
import time

def run_linalg(n):
    z = np.random.normal(0, 1, size=(n, n))
    print(time.time())
    x = z.T.dot(z)   # x = z'z
    print(time.time())
    L = np.linalg.cholesky(x)  # factorize as x = LL' (NumPy returns the lower-triangular factor)
    print(time.time())

## This allows us to run the code from the command line
## without running it when we import the file as a module.
if __name__ == '__main__':
    run_linalg(4000)

$ python calc.py

We might need to run it in a loop to be able to monitor the running process (since it finishes so quickly):

for ((i=0;i<10;i++)); do
  python calc.py
done

Monitoring CPU and memory use

We can see CPU and memory usage via top.

If we see more than 100% CPU usage, that indicates that the process is running a computation in parallel using multiple threads.

What about if we see less than 100% CPU usage?

  • Our process might be busy doing I/O (reading/writing to/from disk).
  • Our process might be waiting for data from the internet or from another machine on the network.
  • The machine might be running low on physical memory, which in some cases will involve using disk as additional memory (slow!).
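We can get a rough feel for these percentages from within Python by comparing the CPU time a piece of code uses with the elapsed (wall-clock) time. This is only a sketch: the helper cpu_fraction and the two toy workloads are made up for illustration, and time.sleep stands in for waiting on disk or the network.

```python
import time

def cpu_fraction(work):
    """Run work() and return CPU time used divided by elapsed time."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used

def wait():
    # Stand-in for waiting on I/O or the network: the process is idle.
    time.sleep(0.2)

def compute():
    # A simple calculation that keeps one core busy.
    sum(i * i for i in range(2_000_000))

waiting = cpu_fraction(wait)
computing = cpu_fraction(compute)

print(f"waiting:   {100 * waiting:.0f}% CPU")
print(f"computing: {100 * computing:.0f}% CPU")
```

The waiting case should come out far below 100%, while the computing case should be much closer to 100% on a machine that is not otherwise busy. Since process_time adds up CPU time across all threads of the process, multi-threaded code can exceed 100%, matching what top reports.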