{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Winter school - IMB\n", "\n", "## scRNA-seq\n", "\n", "\n", "In this practical module, we will get familiar with single cell sequencing (scRNA-seq) gene expression data. We will go through several main analysis types, including preprocessing, clustering, and cell type annotation. Some parts in this section were inspired from the scanpy workflow by Fabian Theis's Group (see [Wolf *et al*. SCANPY: large-scale single-cell gene expression data analysis. *Genome Biol* 19, 15 (2018)](https://doi.org/10.1186/s13059-017-1382-0)). \n", "\n", "### Resources\n", "\n", "* The UQ Bioinformatics Python Guide (on Blackboard)\n", "* The [Python 3 documentation]. For those unfamiliar with Python the [official tutorial] is recommended\n", "* The Software Carpentry [novice Python lessons]\n", "* [IPython's own notebook tutorial](http://nbviewer.jupyter.org/github/ipython/ipython/blob/3.x/examples/Notebook/Index.ipynb)\n", "* [Markdown cheatsheet] (Markdown is the syntax you use to write formatted text into cells in a notebook.)\n", "\n", "[Python 3 documentation]: https://docs.python.org/3/\n", "[official tutorial]: https://docs.python.org/3/tutorial/index.html\n", "[novice python lessons]: http://swcarpentry.github.io/python-novice-inflammation/\n", "[Markdown cheatsheet]: https://github.com/adam-p/markdown-here/wiki/Markdown-Here-Cheatsheet\n", "\n", "Often, there is a need to have a common analysis framework that allows the broad research community to use as a starting point for the analysis to establish baseline results that can be comparable across research groups, projects and technologies. Scanpy (Python) and Seurat (R) are such frameworks for single cell analysis. Imporantly, it is necessary to understand the background theory and the low-level codes behind convenient wrapper functions.\n", "\n", "For scRNA-seq data, over 30 analysis types and more than 700 software programs are available. These programs are written in different languages, most commonly in R and Python. For updated information about the scRNA-seq analysis software collection, refer to scRNAtool website, https://www.scrna-tools.org/. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 1: Understanding single cell data, preprocessing and clustering\n", "\n", "## 1. Setting up \n", "### 1.1 Annotated data object \n", "\n", "\n", "Single cell data has multiple layers of information, not just the large gene expression matrix. Often, the expression matrix is ~1000 times larger than a traditional RNA-seq dataset. Additional layers, for example, include information about the cell type of each cell, the sequencing experiment condition (batch), genes and gene groups, and various dimensionality reduction matrices, or even images of the tissues where the single cells were extracted. To handle all information in one Python object, the Theis's Lab has developed Anndata object structure, with the aim to provide a common framework for various Python-based software to work with single cell data.\n", "\n", "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The diagram above describes the structure of an AnnData object. AnnData stores four main slots (layers): 1) a gene expression matrix .X, 2) annotations of cells .obs, 3) annotation of genes .var, and 4) unstructured annotations .uns. Each component has one or several datatypes. You may notice the gene expression dataset has sparse matrix format, which is especially suitable for single cell gene expression data, because there are many 0 values in the dataset. Additional/customised layers can be added to the AnnData object." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. The PBMC dataset \n", "\n", "To get a good feel of scRNAseq data and key analysis types, we will use a 'small' dataset with 2,700 Peripheral Blood Mononuclear Cells (PBMC). The PBMC dataset has well-characterised immune cell types. This dataset has been used by researchers around the world as one of the gold-standard dataset for the development and comparison of new analysis methods. \n", "\n", "We have provided this data already, in the directory where this ipython notebook is located in the 'data/' folder. \n", "How we downloaded this data is provided below.\n", "\n", "If you are interested in exploring a similar, but larger dataset, you can try downloading the PBMC 68,000 cells at https://cf.10xgenomics.com/samples/cell-exp/1.1.0/fresh_68k_pbmc_donor_a/fresh_68k_pbmc_donor_a_filtered_gene_bc_matrices.tar.gz. With this larger dataset, you will need a large RAM and some steps (e.g. PCA) will take longer to run. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "#manually download the PBMC dataset (7.3 MB)\n", "#!mkdir data\n", "#!wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz\n", "#!cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz\n", "#!mkdir write" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Setting up python environment #\n", "import numpy as np\n", "import pandas as pd\n", "import scanpy as sc\n", "from scipy import sparse, stats\n", "import anndata as anndata\n", "from sklearn.utils import sparsefuncs, issparse\n", "import seaborn as sbn\n", "import matplotlib.pyplot as plt\n", "from sklearn.cluster import KMeans\n", "\n", "from sklearn.cluster import AgglomerativeClustering\n", "from pathlib import Path\n", "#! pip install scvi-tools # optional for variational autoencoder exercise \n", "# import scvi # optional for variational autoencoder exercise " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Suppress warnings for tidy representation of the notebook \n", "import warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "class bcolors:\n", " HEADER = '\\033[95m'\n", " OKBLUE = '\\033[94m'\n", " OKGREEN = '\\033[92m'\n", " WARNING = '\\033[93m'\n", " FAIL = '\\033[91m'\n", " ENDC = '\\033[0m'\n", " BOLD = '\\033[1m'\n", " UNDERLINE = '\\033[4m'\n", " \n", "def print_msg(msg_lst: list, sep='\\t', color='\\033[91m'):\n", " msg = \"\"\n", " for i in msg_lst:\n", " msg += str(i) + sep\n", "\n", " print(color + \"-\" * 80 + bcolors.ENDC)\n", " print(color + msg.center(80, ' ') + bcolors.ENDC)\n", " print(color + '-' * 80 + bcolors.ENDC)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Load the data \n", "- There are multiple ways to load single cell data. Typically, a single cell dataset after mapping reads to the genome will contain three main components: barcodes.tsv (cell IDs), genes.tsv, matrix.mtx (expression matrix).\n", "\n", "- We will load the data in the typical way for scRNA-seq data from the 10X platform, as is usually outputted by the 'CellRanger' software. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 2700 × 32738\n", " var: 'gene_ids'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#############################\n", "# Loading the Data # \n", "#############################\n", "\n", "pbmc = sc.read_10x_mtx(\n", " '/data/module5/Datasets/scRNAseq_10xPBMC/filtered_gene_bc_matrices/hg19/', # the directory with the `.mtx` file\n", " var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)\n", " cache=True) # write a cache file for faster subsequent reading\n", "pbmc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The Nature of the Data\n", "\n", "- Single cell RNA-seq is plagued with a large source of technical variation - an extremely low sampling of the total number of mRNAs captured by the cell. The result of this is that there are few genes captured per cell; and typically few reads per gene. This results in what is referred to as a 'sparse matrix'; whereby the majority of values within the matrix are '0'. \n", "\n", "- Below, we illustrate this by visualising this matrix; rows are genes, columns are cells." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example: plot a graph to visualise the sparsity of the pbmc single cell gene expression matrix**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[94m--------------------------------------------------------------------------------\u001b[0m\n", "\u001b[94m This is an example answer to the question\t \u001b[0m\n", "\u001b[94m--------------------------------------------------------------------------------\u001b[0m\n" ] }, { "data": { "text/plain": [ "