Creating MAMS from Seurat
2024-05-15
rmams_tutorial.Rmd
Introduction
Matrix and Analysis Metadata Standards (MAMS) can be used to capture the relevant information about the data matrices and annotations that are produced during common and complex analysis workflows for single-cell data. This use case demonstrates how MAMS can be used to annotate matrices and annotations that are produced during the analysis of multi-modal single-cell data containing both RNA and Antibody Derived Tags (ADTs). We will analyze the PBMC 1K dataset with the Seurat package according to the Multi-modal and Weighted Nearst Neighbor vignettes. The analytical steps include:
- Importing the raw/droplet and filtered/cell matrices
- Performing quality control and further subsetting the filtered matrix to only include high-quality cells
- Clustering all cells with RNA only
- Clustering all cells with ADTs only
- Clustering all cells with both RNA and ADT
- Clustering T-cells with RNA only
- Clustering T-cells with ADT only
- Clustering T-cells with both RNA and ADT
This workflow demonstrates how different subsets of cells/observations are produced due to QC/filtering or can be made based on different biological subsets. It also produces several matrices and annotations from different modalities. Note that the goal of this workflow is less about producing the best clustering results and cell type characterization. Rather it is focused on producing data/results that is commonly generated during real-world analyses which can be annotated with MAMS. For more information about MAMS, see the GitHub repository
Step 1: Creating the Seurat object
Importing
First, we will import raw/droplet and filtered/cell matrices:
raw.matrices <- Read10X(data.dir = "../data/raw_feature_bc_matrix/")
#> 10X data contains more than one type and is being returned as a list containing matrices of each type.
data.raw <- CreateSeuratObject(counts = raw.matrices[[1]], assay = "RNA", project = "pbmc1k")
adt <- CreateAssayObject(counts = raw.matrices[[2]])
data.raw[["ADT"]] <- adt
nonempty.matrices <- Read10X(data.dir = "../data/filtered_feature_bc_matrix")
#> 10X data contains more than one type and is being returned as a list containing matrices of each type.
data.nonempty <- CreateSeuratObject(counts = nonempty.matrices[[1]], assay = "RNA", project = "pbmc1k")
adt <- CreateAssayObject(counts = nonempty.matrices[[2]])
data.nonempty[["ADT"]] <- adt
Quality control
The filtered matrix will be further subsetted based on mitochondrial percentage and total number of features detected:
data.nonempty[["percent.mt"]] <- PercentageFeatureSet(data.nonempty, pattern = "^MT-")
data <- subset(data.nonempty, subset = nFeature_RNA > 200 & percent.mt < 5)
Clustering
The goal of many single cell workflows are to perform clustering to identify biologically distrinct groups of cells. We will perform clustering on all QC’ed cells using RNA, ADT, or both and then also re-cluster the T-cells by themselves.
Analyzing all cells
RNA only clustering of all cells
DefaultAssay(data) <- 'RNA'
data <- NormalizeData(data) %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA()
data <- FindNeighbors(data, dims = 1:10)
data <- FindClusters(data, resolution = 2, verbose = FALSE)
data <- RunUMAP(data, reduction = 'pca', dims = 1:10, assay = 'RNA',
reduction.name = 'rna.umap', reduction.key = 'rnaUMAP_')
DimPlot(data, reduction = 'rna.umap')
ADT only clustering of all cells
DefaultAssay(data) <- 'ADT'
VariableFeatures(data) <- rownames(data[["ADT"]])
data <- NormalizeData(data, normalization.method = 'CLR', margin = 2) %>%
ScaleData() %>% RunPCA(reduction.name = 'apca', dims = 1:10)
data <- FindNeighbors(data, reduction = "apca", dims = 1:10)
data <- FindClusters(data, graph.name = "ADT_snn", resolution = 2, verbose = FALSE)
data <- RunUMAP(data, reduction = 'apca', dims = 1:10,
reduction.name = 'adt.umap', reduction.key = 'adtUMAP_')
DimPlot(data, reduction = 'adt.umap')
Multimodal clustering of all cells
data <- FindMultiModalNeighbors(
data, reduction.list = list("pca", "apca"),
dims.list = list(1:20, 1:10), modality.weight.name = "RNA.weight"
)
data <- RunUMAP(data, nn.name = "weighted.nn", reduction.name = "wnn.umap", reduction.key = "wnnUMAP_")
data <- FindClusters(data, graph.name = "wsnn", algorithm = 3, resolution = 2, verbose = FALSE)
DimPlot(data, reduction = 'wnn.umap', label = TRUE, repel = TRUE)
FeaturePlot(data, features = c("adt_CD3","rna_CD3E"), reduction = 'wnn.umap')
Analysis of T-cell subset
RNA only clustering of T-cells
DefaultAssay(data.tcell) <- 'RNA'
data.tcell <- NormalizeData(data.tcell) %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA(reduction.name = "pca.tcell")
data.tcell <- FindNeighbors(data.tcell, reduction = "pca.tcell", dims = 1:10)
data.tcell <- FindClusters(data.tcell, resolution = 2, verbose = FALSE)
data.tcell <- RunUMAP(data.tcell, reduction = 'pca.tcell', dims = 1:10, assay = 'RNA',
reduction.name = 'rna.tcell.umap', reduction.key = 'rnaTcellUMAP_')
DimPlot(data.tcell, reduction = 'rna.tcell.umap')
ADT only clustering of T-cells
DefaultAssay(data.tcell) <- 'ADT'
VariableFeatures(data.tcell) <- rownames(data.tcell[["ADT"]])
data.tcell <- NormalizeData(data.tcell, normalization.method = 'CLR', margin = 2) %>%
ScaleData() %>% RunPCA(reduction.name = 'apca.tcell', dims = 1:10)
data.tcell <- RunUMAP(data.tcell, reduction = 'apca.tcell', dims = 1:10, assay = 'ADT',
reduction.name = 'adt.tcell.umap', reduction.key = 'adtTcellUMAP_')
data.tcell <- FindNeighbors(data.tcell, reduction = "apca.tcell", dims = 1:10)
data.tcell <- FindClusters(data.tcell, graph.name = "ADT_snn", resolution = 2, verbose = FALSE)
data.tcell <- FindClusters(data.tcell, algorithm = 3, resolution = 2, verbose = FALSE)
DimPlot(data.tcell, reduction = 'adt.tcell.umap')
Multimodal clustering of T-cells
data.tcell <- FindMultiModalNeighbors(
data.tcell, reduction.list = list("pca.tcell", "apca.tcell"),
dims.list = list(1:10, 1:10), modality.weight.name = "RNA.weight"
)
data.tcell <- RunUMAP(data.tcell, nn.name = "weighted.nn", reduction.name = "wnn.tcell.umap", reduction.key = "wnnTcellUMAP_")
data.tcell <- FindClusters(data.tcell, graph.name = "wsnn", algorithm = 3, resolution = 2, verbose = FALSE)
DimPlot(data.tcell, reduction = 'wnn.tcell.umap', label = TRUE, repel = TRUE)
The data matrices stored within a Seurat object will be saved to an “rds” file. Note that we could extract and save each matrix in a basic txt or mtx file as well.
Step 2: Creating the MAMS object from seurat object
Loading data
library(rmams)
# get the file paths of seurat objects
file_paths <- c("../inst/extdata/pbmc1k_seurat_filtered.rds", "../inst/extdata/pbmc1k_seurat_nonempty.rds", "../inst/extdata/pbmc1k_seurat_tcell.rds")
# get the names of these objects
file_names <- gsub(pattern = "\\.rds$", replacement = "", x = basename(file_paths))
# save the seurat objects onto the object_list by reading the object from file path.
object_list <- lapply(file_paths, readRDS)
# name the list objects using previously generated file_names.
names(object_list) <- file_names
# define the type of each subset
observational_subsets<-c("filtered","nonempty","subset")
datasetid = "PBMC1K"
Create MAMS object
result<- convert_seurat_to_MAMS(object_list,observational_subsets,datasetid)
Accessing data
# Get the dataset_id from the first fom
fom(result,"fom1","dataset_id")
#> [1] "PBMC1K"
Step 3: Validating MAMS object
check_MAMS(mams_object = result)
#> Warning: Please use the setter functions to add the missing field information for the following slots:
#> (Ex. If missing 'filepath' for fom1, then 'fom(mams = mams_object, fom_id = 'fom1', key = 'filepath') <- ...')
#> :
#> Warning in check_MAMS(mams_object = result): fom1 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom2 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom3 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom4 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom5 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom6 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom7 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom8 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom9 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom10 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom11 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom12 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom13 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom14 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom15 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom16 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom17 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom18 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom19 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom20 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom21 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom22 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom23 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> fom24 is missing the field(s): c("representation", "representation_description", "obs_unit", "processing_description", "analyte_description", "obs_subset_description", "feature_subset", "feature_subset_description", "record_id", "parent_id", "parent_relationship", "parent_relationship_description", "fid", "obs", "fea", "ong", "fng")
#> All clear!
#> Warning: Please use the setter functions to add the missing field information for the following slots:
#> (Ex. If missing 'filepath' for fom1, then 'fom(mams = mams_object, fom_id = 'fom1', key = 'filepath') <- ...')
#> :
#> Warning in check_MAMS(mams_object = result): ogr1 is missing the field(s): c("parent_id", "edge_metric")
#> ogr2 is missing the field(s): c("parent_id", "edge_metric")
#> ogr3 is missing the field(s): c("parent_id", "edge_metric")
#> ogr4 is missing the field(s): c("parent_id", "edge_metric")
#> ogr5 is missing the field(s): c("parent_id", "edge_metric")
#> ogr6 is missing the field(s): c("parent_id", "edge_metric")
#> ogr7 is missing the field(s): c("parent_id", "edge_metric")
#> ogr8 is missing the field(s): c("parent_id", "edge_metric")
#> ogr9 is missing the field(s): c("parent_id", "edge_metric")
#> ogr10 is missing the field(s): c("parent_id", "edge_metric")
#> ogr11 is missing the field(s): c("parent_id", "edge_metric")
#> ogr12 is missing the field(s): c("parent_id", "edge_metric")
#> ogr13 is missing the field(s): c("parent_id", "edge_metric")
#> ogr14 is missing the field(s): c("parent_id", "edge_metric")
#> All clear!
#> Warning in check_MAMS(mams_object = result): The following slot is empty: FEA
#> Warning in check_MAMS(mams_object = result): The following slot is empty: OBS
#> Warning in check_MAMS(mams_object = result): The following slot is empty: FID
#> Warning in check_MAMS(mams_object = result): The following slot is empty: OID
#> Warning in check_MAMS(mams_object = result): The following slot is empty: REC
#> Warning in check_MAMS(mams_object = result): The following slot is empty: FNG
Step 4: Filling the missing field
fom(result,"fom1","representation") <-"example"
Step 4: Saving MAMS object to JSON
Here we show how you can convert the MAMS object to the JSON format for export. MAMS and JSON formats are interchangeable.
write_MAMS_to_output(MAMS=result, filepath = "../inst/extdata/pbmc1k_rna_adt_seurat_mams.JSON",format = "JSON")
Step 5: Saving MAMS as YAML
write_MAMS_to_output(MAMS=result, filepath = "../inst/extdata/pbmc1k_rna_adt_seurat_mams.yml",format = "YAML")
Support for Seurat V5 objects
In this example we show how we can extract counts from Seurat V5 object and make a V3 object that can be used with rmams pacakge
options(Seurat.object.assay.version = "v5")
pbmc.counts <- Read10X(data.dir = "../data/filtered_feature_bc_matrix/")
pbmc <- CreateSeuratObject(counts = pbmc.counts)
class(pbmc[["RNA"]])
#> [1] "Assay5"
#> attr(,"package")
#> [1] "SeuratObject"
# Extract counts from v5
v3_counts = LayerData(pbmc, assay="RNA", layer='counts')
# make v3 object
options(Seurat.object.assay.version = "v3")
pbmc_v3 <- CreateSeuratObject(counts = v3_counts)
class(pbmc_v3[["RNA"]])
#> [1] "Assay"
#> attr(,"package")
#> [1] "SeuratObject"
# This pbmc_v3 can now be used with rmams pacakge