ssh username@hostname
CPDG Genetics and Genomics 2025 - Winter School
Introduction to computation
This document aims to teach students of the CPDG Genetics and Genomics 2025 Winter School the following:
1. Connection to the winter school cluster
2. Basic BASH commands
3. Basic R commands
1. Connection to the winter school cluster
Internet Access
There is WiFi coverage across the campus, including the computer rooms where the winter school takes place (Building 69 - Room 305, 315 or 316).
If you are a student or staff member from a university or institution, you should be able to use eduroam by following these instructions.
If you are not able to use eduroam, you will be provided with a visitor account - ask a tutor!
The provided username and PIN can be used to activate your UQ account at the following link. After completing your account activation, you can use your account to connect to the UQ network.
Anyone can connect to the “UQ guest” network to get internet access; however, it blocks server access and therefore cannot be used for the practicals.
Cluster Access
You should all have been provided with your login details (username, hostname and password) to access a computer server needed for the practical exercises.
The operating system used is Rocky Linux; the second part of this document goes over how to interact with it using BASH.
For now we will go over how to connect to the cluster.
To log into the server, type the command below in Terminal (macOS/Unix) or PowerShell (Windows), replacing username and hostname with your own login details:
ssh username@hostname
You will then be asked to provide the password you were given.
If you cannot connect to the cluster for any reason, ask for help!
Data
Here is a list of all the different paths you will be using during this practical.
/scratch/username # Home directory
/data/module*/ # Data for each module (replace the * by the module number)
If you do not know what paths are, the second part of the guide goes over them.
Lecture slides and Practical notes
Lecture slides and practical notes can be found at the following link:
https://cnsgenomics.com/data/teaching/GNGWS25/module*/
Replace the * with the number of the module you want to access.
Data download
You can download files from the cluster using the following command:
scp username@hostname:[path_to_file] [local_path_to_save]
You can replace [local_path_to_save] with a simple . to download the data into your current directory.
Please do not download any genetic data; download only plots or summary data.
You will need to replace username and hostname with your own login details. If you do not know how to use this command, the third part of the guide teaches you how.
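As a hedged illustration of the download command, the sketch below builds the scp command from shell variables and prints it instead of running it, so that you can check it before contacting the cluster. The values s1234567, cluster.example.org and results/plot.jpeg are placeholders invented for this example; they are not real login details or files.

```shell
# Placeholder values (assumptions): replace them with the details you were given.
user="s1234567"
host="cluster.example.org"
remote_file="/scratch/${user}/results/plot.jpeg"  # hypothetical file on the cluster
local_dir="."                                     # "." means the current directory

# Build the command as a string and print it (a dry run: nothing is copied).
cmd="scp ${user}@${host}:${remote_file} ${local_dir}"
echo "$cmd"
```

Once the printed line looks right, run it from a terminal on your own laptop, not from within your session on the cluster.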
If you already know how to use both BASH (cd, ls, mkdir, rm, scp) and R, you can stop at this point and grab a coffee.
If, on the other hand, you are new to BASH, R or the command line in general, the rest of this document should get you up to speed.
2. Basic BASH commands
BASH is a command-line shell and programming language used to interact with Unix-like operating systems such as:
Ubuntu
Arch Linux
Rocky Linux (system used for the winter school cluster)
To this day, BASH holds a prominent place in the computing world due to the predominance of Unix-like systems on servers. macOS can run BASH directly in its Terminal application. Windows can run it through the Windows Subsystem for Linux (WSL); PowerShell is a different shell, but on recent versions of Windows it provides the ssh and scp commands needed for this guide.
Now that you are connected to the Winter School cluster, let’s explore the use of BASH.
Passing arguments to commands
While we have now seen how to use commands, we are missing one of the most important parts of Unix systems: passing arguments to them.
Let’s take a simple command and see how arguments can be used to expand its functions:
ls
ls -l
We have seen the ls command before; it lists all files within the current folder.
However, different arguments can be given to the command to change its behaviour: ls -l will create a list containing:
file permissions
user
date of last modification
file size
The character - introduces a flag (also called an option), which lets you give information to a command and modify its behaviour. The command line is split on spaces, so ls receives -l as a separate argument; because it starts with -, ls understands that the character l is an option that changes its behaviour rather than the name of a file to list.
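To see the effect of a flag without touching your own files, the following sketch runs ls with and without -l inside a temporary directory. The directory and the file names a.txt and b.txt are made up for this illustration.

```shell
# Work in a throwaway directory so nothing of yours is touched.
workdir="$(mktemp -d)"
touch "$workdir/a.txt" "$workdir/b.txt"

# Without a flag: only the file names are printed.
plain="$(ls "$workdir")"
echo "$plain"

# With -l: one detailed line per file, preceded by a "total" line.
long="$(ls -l "$workdir")"
echo "$long"

# Count the lines of the long listing (1 "total" line + 2 files).
long_lines=$(( $(printf '%s\n' "$long" | wc -l) ))

# Clean up the throwaway directory.
rm -r "$workdir"
```

The detailed lines will differ from one machine to another, since they contain your own user name and the current date.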
There are two types of flags: the short form, using the symbol -, and the long form, using the -- format. Some commands offer both forms, which can be used interchangeably, while others require a specific one.
ls -d
ls --directory
ls --help
ls -h # Not the short form of --help, which has no short form
Here, for example, ls -d and ls --directory are two forms of the same flag: both make ls list directories themselves rather than their contents.
On the other hand, ls --help and ls -h do not give the same output. The first one prints a list of the possible arguments of the ls command, while the second one does not seem to modify the behaviour of ls.
Can you figure out what ls -h does? (Hint: use the ls --help command)
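A small, hedged illustration of the -d flag: used by itself, ls -d prints only . (the current directory), so listing the directories of the current path is usually done by combining it with the */ pattern. The folder and file names below are invented for the example.

```shell
# Build a throwaway directory containing two folders and one file.
workdir="$(mktemp -d)"
mkdir "$workdir/module1" "$workdir/module2"
touch "$workdir/notes.txt"
cd "$workdir"

# -d lists a directory itself instead of its contents; with no path it shows "."
dot_only="$(ls -d)"
echo "$dot_only"

# The pattern */ expands to every directory in the current path,
# and -d stops ls from listing what is inside them.
dirs="$(ls -d */)"
echo "$dirs"

# Go back to where we started and clean up.
cd - > /dev/null
rm -r "$workdir"
```

The file notes.txt does not appear in the second listing because the */ pattern only matches directories.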
Flags can also take their own input, just as a command does. The next example is a plink command from module 1.
plink --bfile /data/module1/5_gwasPrac/gwasQC --assoc --pheno /data/module1/5_gwasPrac/BMI_QC.phen
The plink command is called and three flags are given to it: --bfile, --assoc and --pheno.
While we will not go into details about plink, we can see that both --bfile and --pheno take arguments while --assoc does not, highlighting the flexibility of arguments in extending a command’s behaviour.
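The same distinction exists in everyday commands. As a minimal sketch (using head and sort rather than plink, with a made-up file samples.txt), -n takes a value while -r does not:

```shell
# Create a small example file in a throwaway directory.
workdir="$(mktemp -d)"
printf 'c\na\nb\n' > "$workdir/samples.txt"

# -n needs a value: the number of lines to print.
first_two="$(head -n 2 "$workdir/samples.txt")"
echo "$first_two"

# -r takes no value: its presence alone reverses the sort order.
reversed="$(sort -r "$workdir/samples.txt")"
echo "$reversed"

# Clean up the throwaway directory.
rm -r "$workdir"
```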
As a rule of thumb, the flag --help will give you more information about a command and how to modify its behaviour. A longer alternative is to use man to get the manual of a command: man ls, for example, will give you all the information you need to know (and a lot more) about the ls command.
Removing files
You can remove files using the following command:
rm
You can also pass arguments to rm, such as the extremely dangerous -r, which stands for recursive. With this flag, rm removes the path given as well as everything found within it.
This effectively allows the rm command to remove directories - empty or not - so use it carefully.
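The difference between rm and rm -r can be tried safely in a temporary directory. The sketch below is only an illustration; the folder project and the file plot.jpeg are invented for the example.

```shell
# Build a throwaway directory tree: project/results/plot.jpeg
workdir="$(mktemp -d)"
mkdir -p "$workdir/project/results"
touch "$workdir/project/results/plot.jpeg"

# Plain rm refuses to remove a directory and returns a non-zero status.
if rm "$workdir/project" 2>/dev/null; then
  plain_rm="removed"
else
  plain_rm="refused"
fi
echo "rm without -r: $plain_rm"

# rm -r removes the directory and everything inside it.
rm -r "$workdir/project"
if [ -e "$workdir/project" ]; then
  after="still there"
else
  after="gone"
fi
echo "after rm -r: $after"

# Remove the now empty throwaway directory.
rmdir "$workdir"
```

If you want a safety net, adding the -i flag (rm -ri) makes rm ask for confirmation before each removal.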
Miscellaneous commands
Here is a list of commands that might be useful to you during the week:
cp file copiedFile # Copy a file (can include a path)
mv fileToMove placeToMove # Move (or rename) a file
echo $PATH # Print the value of a variable
head file # Print the first 10 lines of a file (-n changes the number of lines)
tail file # Print the last 10 lines of a file (-n changes the number of lines)
wc file # Give the number of lines, words and bytes in a file
cd ~/ # Move to your home directory
cd # Move to your home directory
cd - # Move back to the previous directory
open . # Open the current folder in a file explorer (macOS). Does not work on the cluster.
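Several of these commands can be combined in a short session. The following is a sketch under stated assumptions: it works in a temporary directory, and the file names notes.txt, backup.txt and renamed.txt are made up for the example.

```shell
# Move into a throwaway directory and create a five-line file.
workdir="$(mktemp -d)"
cd "$workdir"
printf 'line %s\n' 1 2 3 4 5 > notes.txt

# Copy the file, then rename the copy.
cp notes.txt backup.txt
mv backup.txt renamed.txt

# Look at the beginning and the end of the file, and count its lines.
first="$(head -n 2 renamed.txt)"
last="$(tail -n 1 renamed.txt)"
n_lines=$(( $(wc -l < renamed.txt) ))
echo "$first"
echo "$last"
echo "number of lines: $n_lines"

# Return to the previous directory and clean up.
cd - > /dev/null
rm -r "$workdir"
```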
3. Basic R commands
Now that we know how to use BASH, we will learn how to use R within a terminal window.
R is a programming language developed for statistical computing and data visualization. While it was initially created as a teaching tool at the University of Auckland, it became and remains to this day one of the most widely used programming languages for research in genetics and genomics.
If R is installed on your system, you can just type the command R within a terminal window and an R interpreter will open.
You can observe a difference between the prompt in BASH (starting with a $) and in R (starting with a >). This sign allows you to know whether your terminal is expecting R or BASH code.
You can leave the R interpreter by typing q(); you will be asked if you want to save your workspace.
As a rule of thumb, we advise you not to save your workspace. While it might save you a bit of time, previous code might sneak into your analysis and render your code less reproducible or worse.
We recommend instead that you keep track of the code used to run your analysis and save specific files using R functions.
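One way to follow this advice is to keep the analysis in a script file and run it from BASH with Rscript, which executes the file without opening an interactive session. The sketch below is only an illustration: the file names analysis.R and result.csv and the toy calculation are invented for this example, and the script is only run if Rscript is available on your machine.

```shell
# Work in a throwaway directory.
workdir="$(mktemp -d)"
cd "$workdir"

# Write a tiny, hypothetical analysis script; every step lives in this file.
cat > analysis.R <<'EOF'
x <- c(5, 6, 8, 4)
result <- data.frame(x = x, x_plus_10 = x + 10)
write.csv(result, file = 'result.csv', row.names = FALSE)
EOF
script_lines=$(( $(wc -l < analysis.R) ))
echo "analysis.R has $script_lines lines"

# Run the script only if R is installed on this machine.
if command -v Rscript > /dev/null 2>&1; then
  Rscript analysis.R && cat result.csv
else
  echo "Rscript not found: run analysis.R on a machine where R is installed"
fi

# Return to the previous directory and clean up.
cd - > /dev/null
rm -r "$workdir"
```

Because the whole analysis is written in the script, rerunning it always starts from a clean state.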
Now that we know how to open and close R, we will see the basics of using it.
Using R
Functions
R, just like BASH, has functions that can be used to perform different tasks (these are called commands in BASH). While the principles are the same, functions in R have a slightly different syntax from BASH commands.
list.files()
Here the list.files function is called with () after its name and fills the same role as the ls command in BASH. All arguments for the function are given within the parentheses; no flags are used in R.
list.files(path = '/data/module1')
The value is matched to the argument name using an = sign, much as a value follows a flag in BASH.
Variables and different data types
To understand how to work with R, we will first see how to generate variables:
var1 <- 'Hello, World!'
print(var1)
var2 = 'Hello, World!'
print(var2)
The code above uses two different ways (<- and =) to assign the variables var1 and var2.
While those expressions are different, they produce the same result.
The recommended R syntax is to use <-, but a large part of the R community uses = as well. In this case, the only important thing is consistency: pick one and stick to it.
The most common type of object in R is a vector. Vectors are a collection of similar items grouped together, allowing for quick computations.
vector1 <- c(5, 6, 8, 4, 2, 21, 5, 5)
print(vector1)
vector2 <- c(5, 6, 8, 4, 2, 21, 5, 5.7)
print(vector2)
vector3 <- c(5, 't', 6, 7, 5)
print(vector3)
vector1 contains only whole numbers. Note that R stores numbers typed this way as the numeric (double) type; a true integer vector requires the L suffix, as in 5L.
vector2, on the other hand, contains both whole numbers and a number with a decimal point, so every element is printed with a decimal point.
Finally, vector3 contains both a string (text) and numbers; given that text cannot be converted to a number, the whole vector is converted to text (as observable with the " around the numbers).
If you are unfamiliar with the different types of data, you can read more here. Of interest: Boolean type, numeric types, string and text types.
Vectors are extremely powerful since operations can be applied to the entire vector simultaneously.
vector1 + 10 # Add 10 to every element of the vector
mean(vector2) # Get the mean of vector 2
vector1 + mean(vector2) # Add the mean of vector 2 to vector 1
vector1 %*% vector2 # Perform the dot product between vector 1 and 2
The previous code chunk shows the versatility of the operations possible using vectors of numbers, from simply adding a constant to calculating the mean or taking the dot product of two vectors.
Beyond vectors are data frames and matrices. Those are collections of vectors organised in a single structure. Here are the different ways you can interact with a data frame or matrix:
data(iris) # Load a data frame
head(iris) # Print the first 6 rows of the data frame
dim(iris) # Get the dimensions of the object
nrow(iris) # Get the number of rows
ncol(iris) # Get the number of columns
colnames(iris) # Show the column names of the data frame
# Use column names to subset the data frame:
# One column:
iris['Sepal.Width']
# More than one column:
iris[c('Sepal.Width', 'Petal.Length')]
# Select the same columns using column numbers:
iris[, c(2, 3)]
# Extract one column as a vector:
iris$Sepal.Width
# Subset a number of rows:
iris[c(1, 10, 130), ]
The $ sign allows you to extract a named column from a data frame as a vector.
Subsetting an object in R can be done using the following expression: [row,columns]. The square brackets allow you to get the subset of the object you need. Be careful with the placement of the , since it decides whether the object is subsetted by rows or by columns; misplacing it is one of the most common mistakes made while coding in R.
Read data
Now that we have seen how to handle objects in R, we need to be able to import data from outside R. For this, we use simple functions such as read.table().
cases <- read.table('/data/module0/measles.csv', header = TRUE, sep = ',')
head(cases)
The data we use here corresponds to the number of measles cases around the world and comes from the World Health Organization. The data curation was performed in the context of TidyTuesday, a weekly R data visualisation challenge.
Write data
Some of the measles data was reported without the country of origin. We will use this to learn how to filter unwanted data and write the clean version to the disk:
dim(cases)
cases.f <- cases[!is.na(cases$country),]
dim(cases.f)
# The data frame to write is the first argument; write.csv always uses ',' as the separator
write.csv(cases.f, file = '/scratch/USERNAME/measeCasesFiltered.csv', quote = FALSE)
Change the USERNAME value to your username and write the filtered data to the disk.
The ! sign in the expression above inverts the selection performed. In this case, the is.na function identifies the rows of the data frame where the country value is missing; we then invert the selection to keep all the rows where the country value is present.
R packages
A package is a bundle of code that a generous person has written, tested, and then given away. Putting it another way, it’s code already written for you and ready to be used! The cluster comes with a large number of packages installed. We will use those packages to generate a plot from our measles data, save it to the disk and download it to our own computer.
One of the most useful R packages is called tidyverse; it is a collection of packages developed with data science in mind. It facilitates data wrangling while making R code more readable. You will surely encounter it during the practicals this week.
A possibly confusing part of it is the %>% operator, called a pipe. It passes the output of one function as the first argument of the next; for example, head(df) %>% write.csv(file = 'file.csv') will take the first 6 rows of the data frame and write them to the disk as a csv file.
library(tidyverse)
# Sample 10 countries to plot at random
# Different students will have different plots!
countryToPlot <- sample(unique(cases.f$country), 10)
cases.f <- cases.f %>% filter(country %in% countryToPlot)
# Note: the y column below is the rubella incidence rate although the axis label
# mentions measles; check colnames(cases.f) and pick the column you intend to plot.
plot <- ggplot(cases.f, aes(x = year,
                            y = rubella_incidence_rate_per_1000000_total_population,
                            col = country)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  xlab('Year') +
  ylab('Number of Measles Cases per 1,000,000 individuals')
# Open a jpeg file, draw the plot into it, then close the file
jpeg('/scratch/USERNAME/measleplot.jpeg', width = 7,
     height = 7, res = 300, units = 'in', type = 'cairo')
print(plot)
dev.off()
Change the USERNAME in the previous command to save your plot to the disk.
Now, we only have to download the plot. On your local laptop, use the scp command given earlier in the Data download section of this document.
Congratulations, you should now understand the basics of R and BASH!
Have a good week of learning at the CPDG Genetics and Genomics Winter School!