This module introduces you to some computing concepts and provides a brief tutorial on working on the command line (also known as the shell or the terminal).
Preparation
Our primary computing context will be the UC Berkeley DataHub. This provides us with a common platform on which we are assured to get the same behavior as each other, allowing us to use the command line/shell/terminal and Python, including Jupyter notebooks. In particular, getting a common command line environment would be hard otherwise.
Using your own laptop (optional)
If you haven’t already used the shell/terminal and Python on your laptop, we recommend you use the DataHub so that we don’t have to troubleshoot problems on multiple laptops.
If you do use your laptop, you’ll need Python installed, for which we recommend using Miniforge. Another option is the Anaconda distribution of Python (https://www.anaconda.com/distribution).
For the shell/terminal parts of this material, you can use the Terminal on your Mac, but not on Windows.
If you do use your own laptop, you’ll need to clone this GitHub repository. If you don’t know how to do that, just download this zip file and unzip it into your home directory on your computer.
Parts of a computer
We won’t spend a lot of time thinking about computer hardware, but we do need to know some of the basics that relate to using a computer and programming effectively. There are many layers of technical detail one could get into, but the key things for now are to have a basic understanding of these components of a computer:
CPU
cache (limited fast memory easily accessible from the CPU)
main memory (RAM)
bus
disk
We’ll discuss these components via this overview (https://36-750.github.io/tools/computer-architecture/) from a class at CMU.
Depending on what your code is doing, the slow step in a computation might be:
doing the actual calculations on the CPU,
moving data back and forth from (main) memory (RAM) to the CPU, or
reading/writing (I/O) data to/from disk.
Concepts of parallelization
Computers now come with multiple processors for doing computation. Basically, physical constraints have made it harder to keep increasing the speed of individual processors, so the chip industry now puts multiple processing units in a given computer and relies on software implementing computations in a way that takes advantage of the multiple processors.
Everyday personal computers usually have a single processor (one chip) with more than one core (multi-core), while larger machines may have more than one processor. A multi-core processor has multiple processing units (cores) on a single computer chip. On personal computers, all the processors and cores share the same memory.
Here are some key terms relating to hardware:
cores: We’ll use this term to mean the different processing units available on a single machine or node of a cluster. A given CPU will have multiple cores. (E.g., the AMD EPYC 7763 has 64 cores per CPU.)
hardware threads: these are very different from the (software) threads discussed below. Hardware hyperthreading involves separate hyperthreads (“virtual cores”) on a single (physical) core that behave like separate cores.
nodes: We’ll use this term to mean the different computers, each with their own distinct memory, that make up a cluster or supercomputer.
GPUs: a GPU has a very large number of simple processing units that carry out many tasks in parallel.
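To connect these hardware terms to a real machine, here is a minimal sketch that asks the operating system how many processing units it can see. It assumes a Linux or macOS system where `getconf` accepts the `_NPROCESSORS_ONLN` name (widely supported, but not guaranteed everywhere). The number reported may count hardware threads rather than physical cores.

```shell
# Number of processing units the operating system can currently schedule work on.
# Assumption: getconf supports _NPROCESSORS_ONLN (true on typical Linux and macOS).
# This may count hardware threads ("virtual cores"), not just physical cores.
ncores=$(getconf _NPROCESSORS_ONLN)
echo "Processing units available: $ncores"
```

On a cluster, this reports the units on the node where the command runs, not the total across nodes.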
Here are some key terms related to computation:
processes: instances of a program executing on a machine; multiple processes may be executing at once. A given program may start up multiple processes at once. Ideally we have no more processes than cores on a node.
threads: multiple paths of execution within a single process; the operating system sees the threads as a single process, but one can think of them as ‘lightweight’ processes. Ideally when considering the processes and their threads, we would have the same number of cores available to our code as we have processes and threads combined.
workers: the individual processes that are carrying out the (parallelized) computation. We’ll use worker and process interchangeably.
tasks: individual units of computation; one or more tasks will be executed by a given process on a given core.
Basic parallelization example
As an example, suppose we were carrying out 20-fold cross-validation and we were using a laptop with 4 cores.
We would generally use 4 workers (4 processes) to carry out the 20 tasks (each task being to fit the model on the non-held out data and make predictions on the held-out data in the fold). So 4 of the tasks could be executing at the same time. When a process finishes a task assigned to it, it can start on the next task that is still left to be done.
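The scheduling pattern described above can be sketched in the shell with `xargs`, which keeps at most a fixed number of processes running and starts the next task as soon as one finishes. This is only an illustration under stated assumptions: the "task" here is a hypothetical stand-in that writes a small file rather than fitting a model, and the `-P` option is assumed to be available (it is in GNU and BSD `xargs`).

```shell
# Directory to collect one (hypothetical) result file per task.
outdir=$(mktemp -d)

# 20 tasks (fold numbers 1..20), at most 4 worker processes at any time (-P 4).
# Each worker handles one fold (-n 1); inside sh -c, $0 is the output directory
# and $1 is the fold number. A real task would fit and evaluate a model here.
seq 1 20 | xargs -P 4 -n 1 sh -c 'echo "fold $1 done" > "$0/fold_$1.txt"' "$outdir"

# Count the result files: one per task.
ls "$outdir" | wc -l
```

The tasks may finish in any order, which is why each one writes to its own file instead of sharing a single output.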
The filesystem
The filesystem is the organization of data/files on a storage device, such as the local disk on your machine.
The filesystem on a Linux or macOS machine can be thought of as an upside-down tree, with / as the root (top) of the tree.
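For example, a basic tree on a Linux machine might look like this:
.
├── accounts
│   ├── biost
│   │   ├── wang
│   │   ├── jones
│   │   └── mcfadden
│   ├── grad
│       ├── smith
│       ├── sanchez
│       ├── zhang
├── bin
│   ├── echo
│   ├── grep
│   ├── tmux
├── lib
│   ├── gcc
│   ├── grep
│   ├── tmux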
/accounts, /bin, and /lib are the directories at the root in the example above.
We’ll see more examples when we start using the shell.
Things are similar on Windows, though there the root is a lettered drive, such as C:.
For scientific computing, you’ll often be connecting remotely to other machines/clusters and navigating the filesystem on those other systems.
The shell and UNIX commands
Operating on the UNIX command line is also known as using the terminal and using the shell.
The shell is the UNIX program that you interact with when in a terminal window interacting with a UNIX-style operating system (e.g., Linux or macOS). The shell sits between you and the operating system and provides useful commands and functionality. Basically, the shell is a program that serves to run other commands for you and show you the results. There are actually different shells that you can use, of which bash is very common and is the default on many systems. In recent versions of macOS, zsh is the default shell. zsh is a separate shell rather than an extension of bash, but the two are similar enough that the basic commands shown here work in both.
We’ll start a shell on our DataHub (Jupyter) server, which is (basically) a Linux virtual machine running in the cloud.
On DataHub, the prompt in the shell looks like this:
jovyan@jupyter-paciorek:~$
but below we’ll indicate that simply as:
$
Files and directories
Moving around and listing information
We’ll start by thinking about the filesystem, which organizes our information/data into files on the computer’s disk.
Anytime you are at the UNIX command line, you have a working directory, which is your current location in the file system.
Here’s how you can see where you are using the pwd (“print working directory”) command:
$ pwd
/home/jovyan/compute-skills-2024
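and here’s how you use ls to list the files (and subdirectories) in the working directory:
$ ls
assets license.qmd README.md units _variables.yml
buttons.yml prep.qmd schedule.yml Untitled.ipynb
index.qmd _quarto.yml syllabus.qmd untitled.py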
Now suppose I want to be in a different directory so I can see what is there or do things to the files in that directory.
The command you need is cd (“change directory”), and an important concept you need to become familiar with is the notion of a ‘relative’ versus an ‘absolute’ path. A path is the sequence of nested directories that specifies a location of interest on the filesystem.
Now let’s go into a subdirectory. We can use cd with the name of the subdirectory. The subdirectory is found ‘relative’ to our working directory, i.e., found from where we currently are.
$ cd units
$ pwd
/home/jovyan/compute-skills-2024/units
We could also navigate through nested subdirectories. For example, after going back to our home directory (using cd alone will do this), let’s go to the units subdirectory of the compute-skills-2024 subdirectory. The / is the separator character that delimits the nested subdirectories.
$ cd
$ cd compute-skills-2024/units
You can access the parent directory of any directory using ..:
$ cd ..
$ pwd
/home/jovyan/compute-skills-2024
We can get more complicated in our use of .. with relative paths. Here we’ll go up a directory and then down to a different subdirectory.
$ cd units
$ cd ../assets
$ pwd
/home/jovyan/compute-skills-2024/assets
And here we’ll go up two directories and then down to another subdirectory (but note that /home/jovyan/tmp may not exist for your server).
$ cd ../../tmp # Go up two directories and down.
$ pwd
/home/jovyan/tmp
Absolute versus relative paths
All of the above examples used relative paths to navigate based on your working directory at the moment you ran the command.
We can instead use absolute paths so that it doesn’t matter where we are when we run the command. Specifying an absolute path is done by having your path start with /, such as /home/jovyan. If the path doesn’t start with / then it is interpreted as being a relative path, relative to your working directory. Here we’ll go to the units subdirectory again, but this time using an absolute path.
$ cd /home/jovyan/compute-skills-2024/units
$ pwd
/home/jovyan/compute-skills-2024/units
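If you want to experiment with relative and absolute paths without touching your real files, the following sketch builds a small throwaway directory tree and moves around in it. The directory names (`project`, `units`, `assets`) are hypothetical, and `mktemp -d` is assumed to create a fresh temporary directory, as it does on typical Linux and macOS systems.

```shell
# Build a small throwaway tree; "$top" holds an absolute path (it starts with /).
top=$(mktemp -d)
mkdir -p "$top/project/units" "$top/project/assets"

# Absolute path: works no matter what the working directory is.
cd "$top/project/units"

# Relative path: interpreted from the working directory (up one level, then down).
cd ../assets

# Shows where we ended up: a path ending in /project/assets.
pwd
```

The relative `cd ../assets` only works because the previous command put us in `units`; the absolute `cd` has no such dependency.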
Absolute paths can be dangerous
It’s best to generally use relative paths, relative to the main directory of a project.
Using absolute paths is generally a bad idea for reproducibility and automation because the file system will be different on different machines, so your code wouldn’t work correctly anywhere other than your current machine.
The filesystem
The filesystem is basically an upside-down tree.
For example, if we just consider the compute-skills-2024 directory, we can see the tree structure using tree (not available on DataHub):
$ cd /home/jovyan/compute-skills-2024
$ tree
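Often (as is the case here), that would print out a lot of files and directories, so I’ll just print out the first few lines of the result here. One doesn’t usually use tree that much; our purpose here is to emphasize the hierarchical, nested structure of the filesystem.
.
├── assets
│   ├── buttons-alt.ejs
│   ├── buttons.ejs
│   ├── schedule-alt.ejs
│   ├── schedule.ejs
│   ├── stat_bear.png
│   ├── styles-alt.scss
│   └── styles.css
├── buttons.yml
├── _freeze
│   ├── site_libs
│   │   ├── clipboard
│   │   │   └── clipboard.min.js
│   │   └── quarto-listing
│   │       ├── list.min.js
│   │       └── quarto-listing.js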
The dot (.) means “this directory”, so the top of the tree here is the compute-skills-2024 directory itself, within which there are various subdirectories. Then within each of these are files and further subdirectories.
If we consider the entire filesystem, the top, or root of the tree, is the / directory. Within / there are subdirectories, such as /home (which contains users’ home directories where all of the files owned by a user are stored) and /bin (containing UNIX programs, aka ‘binaries’). We’ll use ls again, this time telling it the directory to operate on:
$ ls /
bin dev home lib32 libx32 mnt proc run srv tmp var
boot etc lib lib64 media opt root sbin sys usr
If there is a user named jovyan, everything specific to that user would be stored in /home/jovyan. The shortcut ~jovyan refers to /home/jovyan. If you are the jovyan user, you can also refer to /home/jovyan by the shortcut ~.
$ ls /home
jovyan shiny
$ cd /home/jovyan
Go to the home directory of the current user (which happens to be the jovyan user):
$ cd ~
$ pwd
/home/jovyan
Go to the home directory of the jovyan user explicitly:
$ cd ~jovyan
$ pwd
/home/jovyan
Another useful directory is /tmp, which is a good place to put temporary files that you only need briefly and don’t need to save. These will disappear when a machine is rebooted.
$ cd /tmp
$ ls
jupyter-runtime
We can return to the most recent directory we were in like this:
$ cd -
$ pwd
/home/jovyan
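The shortcuts above can be checked with a short sketch like the following. It assumes only that `/tmp` exists, which is the case on typical Linux and macOS systems.

```shell
# Remember where we start.
start=$(pwd)

# Go somewhere else, then return to the previous directory with "cd -".
# "cd -" also prints the directory it moves to; we discard that output here.
cd /tmp
cd - > /dev/null

# Confirm that we are back in the starting directory.
[ "$(pwd)" = "$start" ] && echo "back where we started"

# The shell expands ~ to the current user's home directory before echo runs.
echo ~
```

Note that `cd -` only remembers one previous directory, so repeating it toggles between two locations.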
Using commands
Overview
Let’s look more at various ways to use commands. We just saw the ls command. Here’s one way we can modify the behavior of the command by passing a command option. Here the -F option (also called a flag) shows directories by appending / to anything that is a directory (rather than a file) and a * to anything that is an executable (i.e., a program).
$ cd compute-skills-2024
$ ls -F
$ ls -F /usr/bin
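Next we’ll use multiple options to the ls command. -l shows extended information about files/directories. -t shows files/directories in order of the time at which they were last modified and -r shows in reverse order. Before I run ls, I’ll create an empty file using the touch command.
$ cd
$ touch example.txt
$ ls -lrt
total 14496
-rw-r--r-- 1 jovyan jovyan 36365 Aug 29 2022 'Screen Shot 2021-10-26 at 10.54.07 AM.png'
-rw-r--r-- 1 jovyan jovyan 14757718 Oct 25 2022 data_c80_regsample_3.dta
-rw-r--r-- 1 jovyan jovyan 2436 Oct 25 2022 Untitled1.ipynb
-rw-r--r-- 1 jovyan jovyan 470 Oct 25 2022 Untitled2.ipynb
-rw------- 1 jovyan jovyan 438 Jan 3 2023 WHERE-ARE-MY-FILES.txt
drwxr-xr-x 5 jovyan jovyan 4096 May 30 16:12 python-workshop-2023
-rw------- 1 jovyan jovyan 75 May 31 16:18 tmphn20xgkx
-rw------- 1 jovyan jovyan 91 Jun 3 14:56 tmp4d_d65l_
drwxr-xr-x 3 jovyan jovyan 4096 Jun 4 15:58 test-private
drwxr-xr-x 4 jovyan jovyan 4096 Jul 19 14:55 test-jh
-rw-r--r-- 1 jovyan jovyan 918 Jul 19 14:59 Untitled.ipynb
drwxr-xr-x 2 jovyan jovyan 4096 Jul 19 15:52 tmp
-rw-r--r-- 1 jovyan jovyan 0 Jul 31 14:10 untitled.py
-rw-r--r-- 1 jovyan jovyan 617 Jul 31 14:12 Untitled3.ipynb
drwxr-xr-x 7 jovyan jovyan 4096 Aug 1 15:45 compute-skills-2024
-rw-r--r-- 1 jovyan jovyan 0 Aug 1 16:03 example.txt
While each command has its own syntax, there are some rules usually followed. Generally, executing a command consists of four things:
the command
command option(s)
argument(s)
line acceptance (i.e., hitting <Return>)
Here’s an example:
$ echo "hello there" >> example.txt
$ echo "and goodbye" >> example.txt
$ echo "really now" >> example.txt
$ wc -l example.txt
3 example.txt
In the above example, wc is the command, -l is a command option specifying to count the number of lines, example.txt is the argument, and the line acceptance is indicated by hitting the <Return> key at the end of the line.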
So that invocation counts the number of lines in the file named example.txt.
The spaces are required and distinguish the different parts of the invocation. For this reason, it’s generally a bad idea to have spaces within file names. But if you do, you can use quotation marks to distinguish the file name, e.g.,
$ ls -l "name of my file with spaces.txt"
Also, capitalization matters. For example -l and -L are different options.
Note that options, arguments, or both might not be included in some cases. Recall that we’ve used ls without either options or arguments.
Arguments are usually one or more files or directories.
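To see why quoting matters for file names with spaces, here is a minimal sketch that creates such a file in a throwaway directory and passes it to a command. The file content is made up for illustration, and `mktemp -d` is assumed to behave as on typical Linux and macOS systems.

```shell
# Work in a throwaway directory so no real files are affected.
dir=$(mktemp -d)
cd "$dir"

# Create a two-line file whose name contains spaces.
printf 'one\ntwo\n' > "name of my file with spaces.txt"

# Quoted: wc receives a single argument, the complete file name.
wc -l "name of my file with spaces.txt"

# Unquoted, the shell would split the name at the spaces and pass six separate
# arguments (name, of, my, file, with, spaces.txt), none of which exists.
```

The same quoting is needed when the file name is stored in a variable, which is why the variables in these sketches are written as `"$dir"` rather than `$dir`.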
Options
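Often we can specify an option either in short form (as with -l here) or long form (--lines here), as seen in the following equivalent invocations:
$ wc -l example.txt
3 example.txt
$ wc --lines example.txt
3 example.txt
We can also ask for the number of characters with the -m option, which can be combined with the -l option equivalently in two ways:
$ wc -lm example.txt
3 35 example.txt
$ wc -l -m example.txt
3 35 example.txt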
Options will often take values, e.g., if we want to get the first two lines of the file, the following invocations are equivalent:
$ head -n 2 example.txt
$ head --lines=2 example.txt
$ head --lines 2 example.txt
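Here is a small self-contained sketch of an option that takes a value. It recreates the three-line example file in a temporary location (the variable name `f` is just for illustration) and uses the short form `-n`, since the long form `--lines` is a GNU convention and may not be available on every system. The `tail` command, not used elsewhere in this material, is the counterpart of `head` for the end of a file.

```shell
# Recreate the three-line example file in a temporary file.
f=$(mktemp)
printf 'hello there\nand goodbye\nreally now\n' > "$f"

# -n takes a value: the number of lines to show.
head -n 2 "$f"    # first two lines: "hello there" and "and goodbye"
tail -n 1 "$f"    # last line: "really now"
```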
Comments
Anything that follows # is a comment and is ignored.
$ # This is ignored
$ ls # Everything after the # is ignored
compute-skills-2024 test-jh Untitled2.ipynb
data_c80_regsample_3.dta test-private Untitled3.ipynb
Getting help with UNIX commands
Essentially all UNIX commands have help information (called a man page), accessed using man.
$ man ls
Once you are in the man page, you can navigate by hitting <space> (to scroll down) and the <up> and <down> arrows. You can search by typing /, typing the string you want to search for and hitting <Return>. In the usual pager (less), you can use n and N for the next and previous search hits and q to quit out of the man page.
Unfortunately man pages are often quite long, hard to understand, and without examples. But the information you need is usually there if you take the time to look for it.
Also, UNIX commands as well as other programs run from the command line often provide help information via the --help option, e.g., for help on ls:
$ ls --help
Tab completion
If you hit Tab the shell tries to figure out what command or filename you want based on the initial letters you typed.
Try it with:
cd com
ech
The first should allow you to get compute-skills-2024 and the second echo.
Tab completion everywhere
Lots of interactive programs have adopted the idea of tab completion, including Python/IPython and R.
Command history
Hit the up key. You should see the previous command you typed. Hit it again. You should see the 2nd most recent command.
Ctrl-a and Ctrl-e navigate to the beginning and end of a line. You can edit your previous command and then hit Enter to run the modified code.
Seeing if a command or program is available
You can see if a command or program is installed (and where it is installed) using type.
$ type grep
grep is /usr/bin/grep
$ type R
R is /usr/bin/R
$ type python
python is /srv/conda/bin/python
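When the same check is needed inside a script, a common alternative to `type` is `command -v`, which prints the location of a command and signals through its exit status whether the command was found. The sketch below loops over a few names; `no-such-command-xyz` is a made-up name included only to show the "not found" branch, and whether `python` is found depends on your system.

```shell
# Report where each command lives, or that it is missing.
# "no-such-command-xyz" is a hypothetical name that should not exist.
for cmd in grep python no-such-command-xyz; do
    if command -v "$cmd" > /dev/null 2>&1; then
        echo "$cmd: $(command -v "$cmd")"
    else
        echo "$cmd: not found"
    fi
done
```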
Working with files
Copying and removing files
You’ll often want to make a copy (cp) of a file, move it (mv) between directories, or remove it (rm).
$ cd ~/compute-skills-2024/units
$ cp calc.py calc-new.py
$ mv calc-new.py /tmp/.
$ cd /tmp
$ ls -lrt
total 8
drwx-----T 2 jovyan jovyan 4096 Aug 1 16:15 jupyter-runtime
-rw-r--r-- 1 jovyan jovyan 413 Aug 1 16:16 calc-new.py
When we moved the file, the use of . in /tmp/. indicates we want to use the same name as the original file.
$ rm calc-new.py
$ ls -lrt
total 4
drwx-----T 2 jovyan jovyan 4096 Aug 1 16:15 jupyter-runtime
rm cannot be undone
I used rm above to remove the file. Be very careful about removing files - there is no Trash folder in UNIX - once a file is removed, it’s gone for good.
The mv command is also used if you want to rename a file.
$ cd ~/compute-skills-2024/units
$ mv session3.qmd _session3.qmd
$ ls
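Because removing files is permanent, it can help to practice these commands in a throwaway directory first. The sketch below does that; the file and directory names are hypothetical stand-ins, and `mktemp -d` is assumed to behave as on typical Linux and macOS systems.

```shell
# Practice in a fresh temporary directory so nothing important is at risk.
sandbox=$(mktemp -d)
cd "$sandbox"
echo "print('hi')" > calc.py     # stand-in file to work with

cp calc.py calc-new.py           # copy: both files now exist
mkdir archive
mv calc-new.py archive/          # move into a directory, keeping the name
mv calc.py calc-old.py           # rename within the same directory
rm archive/calc-new.py           # remove: permanent, no Trash folder

ls -R                            # list what is left, including subdirectories
```

After these steps only `calc-old.py` and the empty `archive` directory remain.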
Exercises: shell commands
Exercise 1
Where is gzip installed on the system? What are some other commands/executables that are installed in the same directory?
Exercise 2
Try to run the following command mkdir ~/projects/drought. It will fail. Look in the help information on mkdir to figure out how to make it work without first creating the projects directory.
Exercise 3
Figure out how to list out the files in a directory in order of decreasing file size, as a way to see easily what the big files are that are taking up the most space. Then modify this command to get the result in ascending order.
Exercise 4
Figure out how to copy an entire directory and have the timestamps of the files retained rather than having the timestamps be the time that you copied the files.
See if you can combine the short form of an option with the long form of a different option.
What happens if you use a single dash with the long form of an option? Are you able to interpret the error message? (Note that, confusingly, there are some situations in which one can use the long form with a single dash.)
Running processes
We’ll run a simple linear algebra operation in Python to illustrate how we can monitor what is happening on a machine.
DataHub memory limits
Be careful about increasing n in the example below when on DataHub. Basic DataHub virtual machines have only 1 GB of memory by default, so a larger n may use up the available memory and cause problems in your DataHub session. If you’re doing this as part of the workshop, we’ve increased those limits somewhat, so somewhat larger values of n should be fine.
import numpy as np
import time

def run_linalg(n):
    z = np.random.normal(0, 1, size=(n, n))
    print(time.time())
    x = z.T.dot(z)   # x = z'z
    print(time.time())
    U = np.linalg.cholesky(x)  # factorize as x = U U' (numpy returns the lower-triangular factor)
    print(time.time())

## This allows us to run the code from the command line
## without running it when we import the file as a module.
if __name__ == '__main__':
    run_linalg(4000)
python calc.py
We might need to run it in a loop to be able to monitor the running process (since it finishes so quickly):
for ((i=0; i<10; i++)); do
    python calc.py
done
Monitoring CPU and memory use
We can see CPU and memory usage via top.
If we see more than 100% CPU usage, that indicates that the process is running a computation in parallel using multiple threads.
What about if we see less than 100% CPU usage?
Our process might be busy doing I/O (reading/writing to/from disk).
Our process might be waiting for data from the internet or from another machine on the network.
The machine might be running low on physical memory, which in some cases will involve using disk as additional memory (slow!).
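Besides the interactive display in `top`, a one-off snapshot of a single process can be taken with `ps`. The sketch below is illustrative only: it uses `sleep` as a stand-in for a longer-running job such as `python calc.py`, and it assumes a `ps` that accepts the `pcpu` and `pmem` output fields, as the versions on typical Linux and macOS systems do.

```shell
# Start a background process to inspect; sleep stands in for a real computation.
sleep 5 &
pid=$!    # $! holds the process ID of the most recent background job

# Snapshot of that one process: ID, % CPU, % memory, and command name.
ps -o pid,pcpu,pmem,comm -p "$pid"

# Clean up the stand-in process.
kill "$pid"
```

A process that is mostly waiting, like `sleep`, shows close to 0% CPU, in contrast to the linear algebra example.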
Source Code
---title: "Introduction to Computing and the Command Line (optional)"format: html: theme: cosmo css: ../styles.css toc: true code-copy: true code-block-background: true code-fold: show code-tools: trueexecute: freeze: auto---::: {.callout-note}## OverviewThis module introduces you to some computing concepts and provides a brief tutorial on working on the *command line* (also known as, the *shell*, or the *terminal*).:::{{< include point-to-prep.qmd >}}# Parts of a computerWe won't spend a lot of time thinking about computer hardware, but we doneed to know some of the basics that relate to using a computer andprogramming effectively. There are many layers of technical detail onecould get into, but the key things for now are to have a basicunderstanding of these components of a computer:- CPU - cache (limited fast memory easily accessible from the CPU)- main memory (RAM)- bus- diskWe'll discuss these components via [this overview](https://36-750.github.io/tools/computer-architecture/) from a class at CMU.Depending on what your code is doing, the slow step in a computation might be: - doing the actual calculations on the CPU, - moving data back and forth from (main) memory (RAM) to the CPU, or - reading/writing (I/O) data to/from disk.# Concepts of parallelizationComputers now come with multiple processors for doing computation. Basically, physical constraints have made it harder to keep increasing the speed of individual processors, so the chip industry is now putting multiple processing units in a given computer and trying/hoping to rely on implementing computations in a way that takes advantage of the multiple processors.Everyday personal computers usually have more than one processor (more than one chip) and on a given processor, often have more than one core (multi-core). A multi-core processor has multiple processors on a single computer chip. 
On personal computers, all the processors and cores share the same memory.Here are some key terms relating to hardware:- **cores**: We'll use this term to mean the different processing units available on a single machine or node of a cluster. A given CPU will have multiple cores. (E.g, the AMD EPYC 7763 has 64 cores per CPU.) - **hardware threads**: these are very different than the (software) threads discussed below: hardware hyperthreading involves separate hyperthreads ("virtual cores") on a single (physical) core that behave like separate cores- **nodes**: We'll use this term to mean the different computers, each with their own distinct memory, that make up a cluster or supercomputer.- **GPUs**: a GPU has a very large number of simple processing units that operate many tasks in parallel.Here are some key terms related to computation:- **processes**: instances of a program(s) executing on a machine; multiple processes may be executing at once. A given program may start up multiple processes at once. Ideally we have no more processes than cores on a node. - **threads**: multiple paths of execution within a single process; the operating system sees the threads as a single process, but one can think of them as ‘lightweight’ processes. Ideally when considering the processes and their threads, we would have the same number of cores available to our code as we have processes and threads combined.- **workers**: the individual processes that are carrying out the (parallelized) computation. We’ll use worker and process interchangeably.- **tasks**: individual units of computation; one or more tasks will be executed by a given process on a given core.## Basic parallelization exampleAs an example, suppose we were carrying out 20-fold cross-validation and we were using a laptop with 4 cores.We would generally use 4 workers (4 processes) to carry out the 20 tasks (each task being to fit the model on the non-held out data and make predictions on the held-out data in the fold). 
So 4 of the tasks could be executing at the same time. When a process finishes a task assigned to it, it can start on the next task that is still left to be done.# The filesystemThe filesystem is the organization of data/files on a storage device, such as the local disk on your machine.The filesystem on a Linux or MacOS machine can be thought of as an upside-down tree, with `/` as the root (top) of the tree.For example, a basic tree on a Linux machine might look like this:```.├── accounts│ ├── biost│ │ ├── wang│ │ ├── jones│ │ └── mcfadden│ ├── grad│ ├── smith│ ├── sanchez│ ├── zhang├── bin│ ├── echo│ ├── grep│ ├── tmux├── lib│ ├── gcc│ ├── grep│ ├── tmux````/accounts`, `/bin`, and `/lib` are the directories at the root in the example above.We'll see more examples when we start using the shell.Things are similar on Windows, though there the root is a lettered drive, such as `C:`.For scientific computing, you'll often be connecting remotely to other machines/clusters and navigating the filesystem on those other systems. # The shell and UNIX commandsOperating on the UNIX command line is also known as *using the terminal* and *using the shell*.The shell is the UNIX program that you interact with when in a terminal window interacting with a UNIX-style operating system (e.g., Linux or MacOS). The shell sits between you and the operating system and provides useful commands and functionality. Basically, the shell is a program that serves to run other commands for you and show you the results. There are actually different shells that you can use, of which bash is very common and is the default on many systems. In recent versions of MacOS, zsh is the default shell. 
zsh is an extension of bash.We'll start a shell on our DataHub (Jupyter) server, which is (basically) a Linux virtual machine running in the cloud.On DataHub, the prompt in the shell looks like this:```jovyan@jupyter-paciorek:~$```but below we'll indicate that simply as:```$```## Files and directories### Moving around and listing informationWe'll start by thinking about the filesystem, which organizes our information/data into files on the computer's disk.Anytime you are at the UNIX command line, you have a **working directory**, which is your current location in the file system. Here's how you can see where you are using the `pwd` ("print working directory") command:```bash$ pwd/home/jovyan/compute-skills-2024```and here's how you use `ls` to list the files (and subdirectories) in the working directory...```bash$ lsassets license.qmd README.md units _variables.ymlbuttons.yml prep.qmd schedule.yml Untitled.ipynbindex.qmd _quarto.yml syllabus.qmd untitled.py```Now suppose I want to be in a different directory so I can see what is there or do things to the files in that directory.The command you need is `cd` ("change directory") and an important concept you need to become familiar with is the notion of 'relative' versus 'absolute' *path*. A path is the set of nested directories that specify a location of interest on the filesystem.Now let's go into a subdirectory. We can use `cd` with the name of the subdirectory. The subdirectory is found 'relative' to our working directory, i.e., found from where we currently are.```bash$ cd units$ pwd/home/jovyan/compute-skills-2024/units```We could also navigate through nested subdirectories. For example, after going back to our home directory (using `cd` alone will do this), let's go to the `units` subdirectory of the `compute-skills-2024` subdirectory. 
The `/` is a separate character that distinguishes the nested subdirectories.```bash$ cd$ cd compute-skills-2024/units```You can access the parent directory of any directory using `..`:```bash$ cd ..$ pwd/home/jovyan/compute-skills-2024```We can get more complicated in our use of `..` with relative paths. Here we'll go up a directory and then down to a different subdirectory.```bash$ cd units$ cd ../assets$ pwd/home/jovyan/compute-skills-2024/assets```And here we'll go up two directories and then down to another subdirectory (but note that `/home/jovyan/tmp` may not exist for your server).```bash$ cd ../../tmp # Go up two directories and down.$ pwd/home/jovyan/tmp```### Absolute versus relative pathsAll of the above examples used **relative** paths to navigate based on your working directory at the moment you ran the command.We can instead use **absolute** paths so that it doesn't matter where we are when we run the command. Specifying an absolute path is done by having your path start with `/`, such as `/home/jovyan`. If the path doesn't start with `/` then it is interpreted as being a relative path, relative to your working directory. Here we'll go to the `units` subdirectory again, but this time using an absolute path. ```bash$ cd /home/jovyan/compute-skills-2024/units $ pwd/home/jovyan/compute-skills-2024/units```::: {.callout-warning}## Absolute paths can be dangerousIt's best to generally use relative paths, relative to the main directory of a project.Using absolute paths is generally a bad idea for reproducibility and automation because the file system will be different on different machines, so your code wouldn't work correctly anywhere other than your current machine. 
:::### The filesystemThe filesystem is basically a upside-down tree.For example, if we just consider the `compute-skills-2024` directory, we can see the tree structure using `tree` (not available on DataHub):```bash$ cd /home/jovyan/compute-skills-2024$ tree```Often (as is the case here), that would print out a lot of files and directories, so I'll just print out the first few lines of the result here. One doesn't usually use `tree` that much -- our purpose here is to emphasize the hierarchical, nested structure of the filesystem.```.├── assets│ ├── buttons-alt.ejs│ ├── buttons.ejs│ ├── schedule-alt.ejs│ ├── schedule.ejs│ ├── stat_bear.png│ ├── styles-alt.scss│ └── styles.css├── buttons.yml├── _freeze│ ├── site_libs│ │ ├── clipboard│ │ │ └── clipboard.min.js│ │ └── quarto-listing│ │ ├── list.min.js│ │ └── quarto-listing.js```The dot (`.`) means "this directory", so the top of the tree here is the `compute-skills-2024` directory itself, within which there are various subdirectories. Then within each of these are files and further subdirectories.If we consider the entire filesystem, the top, or root of the tree, is the `/` directory. Within `/` there are subdirectories, such as `/home` (which contains users' home directories where all of the files owned by a user are stored) and `/bin` (containing UNIX programs, aka 'binaries'). We'll use `ls` again, this time telling it the directory to operate on:```bash$ ls /bin dev home lib32 libx32 mnt proc run srv tmp varboot etc lib lib64 media opt root sbin sys usr```If there is a user named *jovyan*, everything specific to that user would be stored in `/home/jovyan`. The shortcut `~jovyan` refers to `/home/jovyan`. 
If you are the `jovyan` user, you can also refer to `/home/jovyan` by the shortcut `~`.```bash$ ls /homejovyan shiny``````bash$ cd /home/jovyan```Go to the home directory of the current user (which happens to be the `jovyan` user):```bash$ cd ~$ pwd/home/jovyan```Go to the home directory of the jovyan user explicitly:```bash$ cd ~jovyan$ pwd/home/jovyan```Another useful directory is `/tmp`, which is a good place to put temporary files that you only need briefly and don't need to save. These will disappear when a machine is rebooted. ```bash$ cd /tmp$ lsjupyter-runtime```We can return to the most recent directory we were in like this:```bash$ cd -$ pwd/home/jovyan```## Using commands### OverviewLet's look more at various ways to use commands. We just saw the `ls` command. Here's one way we can modify the behavior of the command by passing a command option. Here the `-F` option (also called a *flag*) shows directories by appending `/` to anything that is a directory (rather than a file) and a `*` to anything that is an executable (i.e., a program).```bash$ cd compute-skills-2024$ ls -F$ ls -F /usr/bin```Next we'll use multiple options to the `ls` command. `-l` shows extended information about files/directories. `-t` shows files/directories in order of the time at which they were last modified and `-r` shows in reverse order. Before I run `ls`, I'll create an empty file using the `touch` command. 
```bash$ cd$ touch example.txt$ ls -lrttotal 14496-rw-r--r-- 1 jovyan jovyan 36365 Aug 29 2022 'Screen Shot 2021-10-26 at 10.54.07 AM.png'-rw-r--r-- 1 jovyan jovyan 14757718 Oct 25 2022 data_c80_regsample_3.dta-rw-r--r-- 1 jovyan jovyan 2436 Oct 25 2022 Untitled1.ipynb-rw-r--r-- 1 jovyan jovyan 470 Oct 25 2022 Untitled2.ipynb-rw------- 1 jovyan jovyan 438 Jan 3 2023 WHERE-ARE-MY-FILES.txtdrwxr-xr-x 5 jovyan jovyan 4096 May 30 16:12 python-workshop-2023-rw------- 1 jovyan jovyan 75 May 31 16:18 tmphn20xgkx-rw------- 1 jovyan jovyan 91 Jun 3 14:56 tmp4d_d65l_drwxr-xr-x 3 jovyan jovyan 4096 Jun 4 15:58 test-privatedrwxr-xr-x 4 jovyan jovyan 4096 Jul 19 14:55 test-jh-rw-r--r-- 1 jovyan jovyan 918 Jul 19 14:59 Untitled.ipynbdrwxr-xr-x 2 jovyan jovyan 4096 Jul 19 15:52 tmp-rw-r--r-- 1 jovyan jovyan 0 Jul 31 14:10 untitled.py-rw-r--r-- 1 jovyan jovyan 617 Jul 31 14:12 Untitled3.ipynbdrwxr-xr-x 7 jovyan jovyan 4096 Aug 1 15:45 compute-skills-2024-rw-r--r-- 1 jovyan jovyan 0 Aug 1 16:03 example.txt```While each command has its own syntax, there are some rules usuallyfollowed. Generally, executing a command consists of four things: - the command - command option(s) - argument(s) - line acceptance (i.e., hitting <Return>)Here's an example:```bash$ echo "hello there">> example.txt$ echo "and goodbye">> example.txt$ echo "really now">> example.txt$ wc -l example.txt3 example.txt```In the above example, `wc` is the command, `-l` is a command optionspecifying to count the number of lines, `example.txt` is the argument, and theline acceptance is indicated by hitting the `<Return>` key at the end ofthe line.So that invocation counts the number of lines in the file named `example.txt`.The spaces are required and distinguish the different parts of the invocation. For this reason,it's generally a bad idea to have spaces within file names. 
But if you do, you can use quotation marks to distinguish the file name, e.g.,

```bash
$ ls -l "name of my file with spaces.txt"
```

Also, capitalization matters. For example `-l` and `-L` are different options.

Note that options, arguments, or both might not be included in some cases. Recall that we've used `ls` without either options or arguments.

Arguments are usually one or more files or directories.

### Options

Often we can specify an option either in short form (as with `-l` here) or long form (`--lines` here), as seen in the following equivalent invocations:

```bash
$ wc -l example.txt
3 example.txt
$ wc --lines example.txt
3 example.txt
```

We can also ask for the number of characters with the `-m` option, which can be combined with the `-l` option equivalently in two ways:

```bash
$ wc -lm example.txt
3 35 example.txt
$ wc -l -m example.txt
3 35 example.txt
```

Options will often take values, e.g., if we want to get the first two lines of the file, the following invocations are equivalent:

```bash
$ head -n 2 example.txt
$ head --lines=2 example.txt
$ head --lines 2 example.txt
```

### Comments

Anything that follows `#` is a comment and is ignored.

```bash
$ # This is ignored
$ ls # Everything after the # is ignored
compute-skills-2024       test-jh       Untitled2.ipynb
data_c80_regsample_3.dta  test-private  Untitled3.ipynb
```

### Getting help with UNIX commands

Essentially all UNIX commands have help information (called a man page), accessed using `man`.

```bash
$ man ls
```

Once you are in the *man* page, you can navigate by hitting `<space>` (to scroll down) and the `<up>` and `<down>` arrows. You can search by typing `/`, typing the string you want to search for and hitting `<Return>`. You can use `n` and `N` (i.e., `Shift-n`) for the next and previous search hits and `q` to quit the *man* page.

Unfortunately *man* pages are often quite long, hard to understand, and without examples.
But the information you need is usually there if you take the time to look for it.

Also, UNIX commands as well as other programs run from the command line often provide help information via the `--help` option, e.g., for help on `ls`:

```bash
$ ls --help
```

### Tab completion

If you hit `Tab` the shell tries to figure out what command or filename you want based on the initial letters you typed.

Try it with:

```bash
cd com
ech
```

The first should allow you to get `compute-skills-2024` and the second `echo`.

::: {.callout-tip}
## Tab completion everywhere

Lots of interactive programs have adopted the idea of tab completion, including Python/IPython and R.
:::

### Command history

Hit the `up` key. You should see the previous command you typed. Hit it again. You should see the 2nd most recent command.

`Ctrl-a` and `Ctrl-e` navigate to the beginning and end of a line. You can edit your previous command and then hit `Enter` to run the modified command.

### Seeing if a command or program is available

You can see if a command or program is installed (and where it is installed) using `type`.

```bash
$ type grep
grep is /usr/bin/grep
$ type R
R is /usr/bin/R
$ type python
python is /srv/conda/bin/python
```

## Working with files

### Copying and removing files

You'll often want to make a copy (`cp`) of a file, move it (`mv`) between directories, or remove it (`rm`).

```bash
$ cd ~/compute-skills-2024/units
$ cp calc.py calc-new.py
$ mv calc-new.py /tmp/.
$ cd /tmp
$ ls -lrt
total 8
drwx-----T 2 jovyan jovyan 4096 Aug  1 16:15 jupyter-runtime
-rw-r--r-- 1 jovyan jovyan  413 Aug  1 16:16 calc-new.py
```

When we moved the file, the destination `/tmp/.` is a directory (the `.` refers to `/tmp` itself), so the file keeps the same name as the original.

```bash
$ rm calc-new.py
$ ls -lrt
total 4
drwx-----T 2 jovyan jovyan 4096 Aug  1 16:15 jupyter-runtime
```

::: {.callout-important}
## `rm` cannot be undone

I used `rm` above to remove the file.
Be very careful about removing files. There is no Trash folder in UNIX; once a file is removed, it's gone for good.
:::

The `mv` command is also used if you want to rename a file.

```bash
$ cd ~/compute-skills-2024/units
$ mv session3.qmd _session3.qmd
$ ls
```

## Exercises: shell commands

::: {.callout-tip}
## Exercise 1

Where is `gzip` installed on the system? What are some other commands/executables that are installed in the same directory?
:::

::: {.callout-tip}
## Exercise 2

Try to run the following command: `mkdir ~/projects/drought`. It will fail. Look in the help information on `mkdir` to figure out how to make it work without first creating the `projects` directory.
:::

::: {.callout-tip}
## Exercise 3

Figure out how to list the files in a directory in order of decreasing file size, as a way to see easily which big files are taking up the most space. Modify this command to get the result in ascending order.
:::

::: {.callout-tip}
## Exercise 4

Figure out how to copy an entire directory and have the timestamps of the files retained rather than having the timestamps be the time that you copied the files.

See if you can combine the short form of an option with the long form of a different option.

What happens if you use a single dash with the long form of an option? Are you able to interpret the error message? (Note that, confusingly, there are some situations in which one can use the long form with a single dash.)
:::

## Running processes

We'll run a simple linear algebra operation in Python to illustrate how we can monitor what is happening on a machine.

::: {.callout-warning}
## DataHub memory limits

Be careful about increasing `n` in the example below when on DataHub. Basic DataHub virtual machines have only 1 GB of memory by default, so increasing `n` may cause you to run out of memory, which can cause problems in your DataHub session.
If you're doing this as part of the workshop, we've increased those limits somewhat, so somewhat larger values of `n` should be fine.
:::

```{python}
#| eval: false
import numpy as np
import time

def run_linalg(n):
    z = np.random.normal(0, 1, size=(n, n))
    print(time.time())
    x = z.T.dot(z)   # x = z'z
    print(time.time())
    U = np.linalg.cholesky(x)  # factorize as x = UU' (NumPy returns the lower-triangular factor)
    print(time.time())

## This allows us to run the code from the command line
## without running it when we import the file as a module.
if __name__ == '__main__':
    run_linalg(4000)
```

```bash
python calc.py
```

We might need to run it in a loop to be able to monitor the running process (since it finishes so quickly):

```bash
for ((i=0; i<10; i++)); do
    python calc.py
done
```

### Monitoring CPU and memory use

We can see CPU and memory usage via `top`.

If we see more than 100% CPU usage, that indicates that the process is running a computation in parallel using multiple threads.

What about if we see less than 100% CPU usage?

- Our process might be busy doing I/O (reading/writing to/from disk).
- Our process might be waiting for data from the internet or from another machine on the network.
- The machine might be running low on physical memory, which in some cases will involve using disk as additional memory (slow!).
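To make the "less than 100% CPU" cases more concrete, here is a minimal sketch (not part of the workshop materials) that compares the CPU time a process uses to the wall-clock time that elapses. The helper `cpu_fraction` and the functions `busy` and `waiting` are hypothetical names chosen for illustration, and `time.sleep` is only a stand-in for waiting on disk or the network.

```python
import time

def cpu_fraction(work):
    """Run work() and return CPU time used divided by wall-clock time elapsed."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used

def busy():
    # Pure computation: the process keeps a core busy the whole time.
    total = 0
    for i in range(2_000_000):
        total += i * i

def waiting():
    # Assumption: sleeping stands in for waiting on I/O or the network.
    # The process is idle, so it accumulates almost no CPU time.
    time.sleep(0.5)

if __name__ == '__main__':
    print(f"busy:    {cpu_fraction(busy):.0%} of one core")
    print(f"waiting: {cpu_fraction(waiting):.0%} of one core")
```

The ratio for `busy` should be close to what `top` would report for a process doing single-threaded computation, while the ratio for `waiting` should be near zero, much like a process that is stalled on I/O.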