Tutorial of BrainCellR

`BrainCellR` provides researchers with a powerful and user-friendly package for efficient cell type classfication and nomination of single-cell transcriptomic data in the brain.

BrainCellR process comprises the following steps:

Preparation of input data
Fine-scale clustering of single-cell data
Provision of clusters to a supervised cell type classifier capable of outputting major cell classes and subclasses
Identification of differentially expressed genes and select marker genes from the identified differentially expressed genes
- Find differentially expressed genes
- Selection of genes that exhibit a high fold change value and are expressed in the majority of cells within a given cell type
- Ranking of genes based on specific score
Acquisition of the final cell type annotation
Identification of markers that cannot be detected using the ROC method

In this tutorial, we will be using data from the mouse M1 cortex, as presented in Bakken TE et al.’s 2021 study, ‘Comparative Cellular Analysis of Motor Cortex in Human, Marmoset and Mouse’, published in Nature.
To expedite the tutorial, we have extracted only the first 1000 cells.

Preparation of input data

The input data should consist of single-cell transcriptome data, arranged with genes in rows and cells in columns. The clustering method recommends that the input data be log-transformed. Using a sparse matrix can help speed up the process.

library(scrattch.hicat)
library(BrainCellR)

##

library(Matrix)
library(matrixStats)

data = read.csv("G:/PFC/package_example/data.csv",row.names=1)
data = data[,1:1000]
print(data[1:5,1:5])

##               mouse_pBICCNsMMrMOpRAiF007d190314_CAGGTATAGGTAGGCT
## 0610009D07Rik                                                  1
## 0610009O20Rik                                                  0
## 0610010F05Rik                                                  2
## 0610010K14Rik                                                  0
## 0610030E20Rik                                                  0
##               mouse_pBICCNsMMrMOpRMiM008d190320_CCGCAAGAGAACAGGA
## 0610009D07Rik                                                  4
## 0610009O20Rik                                                  0
## 0610010F05Rik                                                  5
## 0610010K14Rik                                                  0
## 0610030E20Rik                                                  0
##               mouse_pBICCNsMMrMOpRMiM008d190320_AACCAACAGTCGTTAC
## 0610009D07Rik                                                  3
## 0610009O20Rik                                                  0
## 0610010F05Rik                                                  2
## 0610010K14Rik                                                  0
## 0610030E20Rik                                                  0
##               mouse_pBICCNsMMrMOpRMiM008d190320_ATGGATCGTTTCCATT
## 0610009D07Rik                                                  0
## 0610009O20Rik                                                  0
## 0610010F05Rik                                                  1
## 0610010K14Rik                                                  0
## 0610030E20Rik                                                  0
##               mouse_pBICCNsMMrMOpRPiF008d190314_GTTCCGTGTCCAATCA
## 0610009D07Rik                                                  1
## 0610009O20Rik                                                  1
## 0610010F05Rik                                                  0
## 0610010K14Rik                                                  0
## 0610030E20Rik                                                  0

print(dim(data))

## [1] 13888  1000

data = as.matrix(data)
norm.dat <- cpm(data)
norm.dat <- log2(norm.dat+1)
norm.dat <- Matrix(norm.dat, sparse = TRUE)

Fine-scale clustering of single-cell data

The parameters of de_param() function ban be seen from scrattch.hicat package(https://github.com/AllenInstitute/scrattch.hicat/tree/master).

padj.th: P-value after fdr adjustment .
q1.th: Proportion threshold of cells in current cell type.
q2.th: Proportion threshold of cells in other cell type.
de.score: Threshold for gene selection. min.cells: The minimum number of cells in each cluster.

set.seed(3456)
de.param <- de_param(padj.th=0.01,
                     q1.th       = 0.4,
                     q.diff.th   = 0.7,
                     de.score.th = 100,
                     min.cells=20)
file_path_root = "G:/PFC/package_example/"
dir.create(paste0(file_path_root,"run_cluster_m"))
result_1 <- RunConsensus1(norm.dat,
                          de.param = de.param,
                          output_dir = paste0(file_path_root,"run_cluster_m"),
                          mc.cores = 1)
group_meta = data.frame(names(result_1$cl),result_1$cl)
colnames(group_meta) = c("cell","group")

Print clustering result

print(head(group_meta))

##                                                                                                  cell
## mouse_pBICCNsMMrMOpRAiF007d190314_CAGGTATAGGTAGGCT mouse_pBICCNsMMrMOpRAiF007d190314_CAGGTATAGGTAGGCT
## mouse_pBICCNsMMrMOpRMiM008d190320_CCGCAAGAGAACAGGA mouse_pBICCNsMMrMOpRMiM008d190320_CCGCAAGAGAACAGGA
## mouse_pBICCNsMMrMOpRMiM008d190320_AACCAACAGTCGTTAC mouse_pBICCNsMMrMOpRMiM008d190320_AACCAACAGTCGTTAC
## mouse_pBICCNsMMrMOpRMiM008d190320_ATGGATCGTTTCCATT mouse_pBICCNsMMrMOpRMiM008d190320_ATGGATCGTTTCCATT
## mouse_pBICCNsMMrMOpRPiF008d190314_GTTCCGTGTCCAATCA mouse_pBICCNsMMrMOpRPiF008d190314_GTTCCGTGTCCAATCA
## mouse_pBICCNsMMrMOpRPiM008d190320_CATTCTAAGCTGACAG mouse_pBICCNsMMrMOpRPiM008d190320_CATTCTAAGCTGACAG
##                                                    group
## mouse_pBICCNsMMrMOpRAiF007d190314_CAGGTATAGGTAGGCT     1
## mouse_pBICCNsMMrMOpRMiM008d190320_CCGCAAGAGAACAGGA     1
## mouse_pBICCNsMMrMOpRMiM008d190320_AACCAACAGTCGTTAC    10
## mouse_pBICCNsMMrMOpRMiM008d190320_ATGGATCGTTTCCATT     1
## mouse_pBICCNsMMrMOpRPiF008d190314_GTTCCGTGTCCAATCA     1
## mouse_pBICCNsMMrMOpRPiM008d190320_CATTCTAAGCTGACAG    10

print(dim(group_meta))

## [1] 1000    2

Provision of clusters to a supervised cell type classifier capable of outputting major cell classes and subclasses

There should be reference data available for classification, which includes a gene expression matrix and metadata containing information on the major classes and subclasses of cell types.

method: Method to run classification, can be chosen from Seurat or SingleR.

ref_data = read.csv("G:/PFC/package_example/data.csv",row.names=1)
ref_meta = read.csv("G:/PFC/package_example/mouse_meta.csv",row.names=1)
new_data = data.frame(data)
anno_meta = RunSubclassClassify(new_data, group_meta, ref_data, ref_meta, method="Seurat")

Print classification result

print(head(anno_meta))

##                                                                                                  cell
## mouse_pBICCNsMMrMOpRAiF007d190314_CAGGTATAGGTAGGCT mouse_pBICCNsMMrMOpRAiF007d190314_CAGGTATAGGTAGGCT
## mouse_pBICCNsMMrMOpRMiM008d190320_CCGCAAGAGAACAGGA mouse_pBICCNsMMrMOpRMiM008d190320_CCGCAAGAGAACAGGA
## mouse_pBICCNsMMrMOpRMiM008d190320_AACCAACAGTCGTTAC mouse_pBICCNsMMrMOpRMiM008d190320_AACCAACAGTCGTTAC
## mouse_pBICCNsMMrMOpRMiM008d190320_ATGGATCGTTTCCATT mouse_pBICCNsMMrMOpRMiM008d190320_ATGGATCGTTTCCATT
## mouse_pBICCNsMMrMOpRPiF008d190314_GTTCCGTGTCCAATCA mouse_pBICCNsMMrMOpRPiF008d190314_GTTCCGTGTCCAATCA
## mouse_pBICCNsMMrMOpRPiM008d190320_CATTCTAAGCTGACAG mouse_pBICCNsMMrMOpRPiM008d190320_CATTCTAAGCTGACAG
##                                                    class_label subclass_label
## mouse_pBICCNsMMrMOpRAiF007d190314_CAGGTATAGGTAGGCT   GABAergic      Sst Chodl
## mouse_pBICCNsMMrMOpRMiM008d190320_CCGCAAGAGAACAGGA   GABAergic      Sst Chodl
## mouse_pBICCNsMMrMOpRMiM008d190320_AACCAACAGTCGTTAC   GABAergic      Sst Chodl
## mouse_pBICCNsMMrMOpRMiM008d190320_ATGGATCGTTTCCATT   GABAergic      Sst Chodl
## mouse_pBICCNsMMrMOpRPiF008d190314_GTTCCGTGTCCAATCA   GABAergic      Sst Chodl
## mouse_pBICCNsMMrMOpRPiM008d190320_CATTCTAAGCTGACAG   GABAergic      Sst Chodl
##                                                    cluster_label
## mouse_pBICCNsMMrMOpRAiF007d190314_CAGGTATAGGTAGGCT             1
## mouse_pBICCNsMMrMOpRMiM008d190320_CCGCAAGAGAACAGGA             1
## mouse_pBICCNsMMrMOpRMiM008d190320_AACCAACAGTCGTTAC            10
## mouse_pBICCNsMMrMOpRMiM008d190320_ATGGATCGTTTCCATT             1
## mouse_pBICCNsMMrMOpRPiF008d190314_GTTCCGTGTCCAATCA             1
## mouse_pBICCNsMMrMOpRPiM008d190320_CATTCTAAGCTGACAG            10

Identification of differentially expressed genes and select marker genes from the identified differentially expressed genes

The data will be processed into pseudo cells and the ROC method will be run to find differentially expressed (DE) genes. These DE genes are calculated between clusters that originate from the same subclass.

cell_number: Number of cells to be intergated within pseudo cells.

new_data = new_data[,rownames(anno_meta)]
result = RunFindDEGene(new_data, anno_meta,cell_number=10)
all_marker_list = result[[1]]
pse_meta = result[[2]]
pse_data = result[[3]]

Print marker result

print(head(pse_meta))

##   cell cluster_label subclass_label     class
## 1    1             1      Sst Chodl GABAergic
## 2    2             1      Sst Chodl GABAergic
## 3    3             1      Sst Chodl GABAergic
## 4    4             1      Sst Chodl GABAergic
## 5    5             1      Sst Chodl GABAergic
## 6    6             1      Sst Chodl GABAergic

Acquisition of the final cell type annotation

Select the top three marker genes ranked by specific score.
The annotation table contains seven columns, which include class, subclass, and cluster information from previous steps. From this, select the top three markers for this cluster. The ‘cluster_new’ column contains the name obtained from our package.

cell_type_anno <- GetCellTypeName(pse_data, pse_meta, all_marker_list)

## [1] 8
## [1] 1
## [1] 2
## [1] 3

print(cell_type_anno)

##        class  subclass cluster   gene1   gene2    gene3
## 2  GABAergic Sst Chodl       1   Trpm3  Slc4a4 Adamtsl1
## 3  GABAergic Sst Chodl      10    Hhip   Nxph2    Prkg1
## 4  GABAergic       Vip      13  Mybpc1 Col15a1    Pde3a
## 5  GABAergic       Vip      15   Bcar3   Ptprk    Bmper
## 6  GABAergic       Vip      16    Car4    Cdh6    Asic2
## 7  GABAergic       Vip      17    <NA>    <NA>     <NA>
## 8  GABAergic       Vip      14 Col15a1 Kirrel3     <NA>
## 9  GABAergic       Vip      12    Ror1 Col14a1     Chn2
## 10 GABAergic     Pvalb      11    <NA>    <NA>     <NA>
##                  cluster_new
## 2  GABAergic_Sst Chodl_Trpm3
## 3   GABAergic_Sst Chodl_Hhip
## 4       GABAergic_Vip_Mybpc1
## 5        GABAergic_Vip_Bcar3
## 6         GABAergic_Vip_Car4
## 7           GABAergic_Vip_NA
## 8      GABAergic_Vip_Col15a1
## 9         GABAergic_Vip_Ror1
## 10        GABAergic_Pvalb_NA

Identification of markers that cannot be detected using the ROC method

We found that two clusters could not be annotated by our pipeline. One is Cluster 11, which belongs to the Pvalb subclass. Since it only contains one cluster in the Pvalb subclass, it cannot be calculated by our pipeline, as our pipeline calculates differentially expressed (DE) genes between clusters that belong to the same subclass. The other, Cluster 17, could not find markers using ROC methods. Therefore, we used other DE gene detection methods to find markers, and we will annotate the method used to find these markers.

anno_new <- GetCellTYpeNameExtend(cell_type_anno, pse_data, pse_meta)

## Calculating cluster 13

## Calculating cluster 15

## Calculating cluster 16

## Calculating cluster 17

## Calculating cluster 14

## Calculating cluster 12

## [1] 6
## [1] 1

## Calculating cluster 11

## [1] 0
## [1] 1

## Calculating cluster 11

## [1] 0
## [1] 1

## Calculating cluster 11

## [1] 0
## [1] 1

print(anno_new)

##        class  subclass cluster   gene1   gene2    gene3
## 2  GABAergic Sst Chodl       1   Trpm3  Slc4a4 Adamtsl1
## 3  GABAergic Sst Chodl      10    Hhip   Nxph2    Prkg1
## 4  GABAergic       Vip      13  Mybpc1 Col15a1    Pde3a
## 5  GABAergic       Vip      15   Bcar3   Ptprk    Bmper
## 6  GABAergic       Vip      16    Car4    Cdh6    Asic2
## 7  GABAergic       Vip      17   Scml4  Frmd4b   Anks1b
## 8  GABAergic       Vip      14 Col15a1 Kirrel3     <NA>
## 9  GABAergic       Vip      12    Ror1 Col14a1     Chn2
## 10 GABAergic     Pvalb      11    <NA>    <NA>     <NA>
##                  cluster_new quality
## 2  GABAergic_Sst Chodl_Trpm3     ROC
## 3   GABAergic_Sst Chodl_Hhip     ROC
## 4       GABAergic_Vip_Mybpc1     ROC
## 5        GABAergic_Vip_Bcar3     ROC
## 6         GABAergic_Vip_Car4     ROC
## 7        GABAergic_Vip_Scml4  wilcox
## 8      GABAergic_Vip_Col15a1     ROC
## 9         GABAergic_Vip_Ror1     ROC
## 10        GABAergic_Pvalb_NA     ROC