空间组deg genemodule heatmap go分析
空间转录组通过手动划区注释之后,得到空间组的区域注释信息,这个时候我们便是想要得到其差异表达基因的功能注释信息,这里并没有能够通用的脚本,本人仅是提供一种思路,供大家参考。
由于我们的空间组的数据大都是 anndata 的 h5ad 的文件格式,所以我们需要将其转换为能够被 seurat 读取的 rds 文件格式,这里采用我们之前博客讲到的一个便捷工具 rds与h5ad的相互转换,大概代码如下所述:
sceasy::convertFormat('/data/*.h5ad',
from="anndata",
to="seurat",
main_layer = 'count',
outFile='/data/*.rds')
于是便是得到了相应的 rds 文件,之后我们借助 seurat 中的 findallmarkers 函数,来进行差异表达基因的筛选,并且去除一些我们不需要的基因,例如线粒体基因,核糖体蛋白基因,血红蛋白基因,细胞质基因,胶原蛋白基因等等,这里我们可以通过正则表达式来进行筛选,以及对 avglog2fc 进行排序,筛选出每个类群的前 50 个差异表达基因,之后对其进行 genemodule 的计算。这里需要注意的是,我在进行 deg 的计算的时候,并没有对空间组的数据进行归一化,因为我是bin50的数据,当然也不是不能进行归一化,这个大家可以自行决定是否要进行归一化;此外在计算 genemodule 的地方,我则是选取了选取了 raw 矩阵和均一化之后的矩阵都进行了 genemodule 的计算,大家可以出现的结果决定是否要进行归一化,虽然 addgenemodule 的文档上写了输入的矩阵是归一化的矩阵。
脚本里面还有一些看着比较奇怪的代码,输入计算 deg 的矩阵和计算
genemodule
的矩阵可以是不一样的,原因在于我们可能空间组芯片上某些位置的捕获与其他地方不一样,需要我们去除掉,这个时候我们可以选取
去除了某些区域的矩阵
去计算 deg,之后再用正常的芯片去计算
genemodule。
# Rscript your_script.R path_to_rds_file.rds refined_pred_column_name path_to_output_csv_file.csv
# 读取命令行参数
args <- commandArgs(trailingOnly = TRUE)
# 检查是否提供了正确的参数数量
if (length(args) != 4) {
stop("需要提供三个参数:rds_deg_path, rds_module_path, refined_pred, def_save_path")
}
# 解析命令行参数
rds_deg_path <- args[1]
rds_module_path <- args[2]
cluster <- args[3]
save_path <- args[4]
# 导入所需的库
library(Seurat)
library(tidyverse)
# 读取单细胞数据
rds <- readRDS(rds_deg_path)
# 设置 rds 对象的细胞类型标签
Idents(rds) <- rds@meta.data[[cluster]]
# 查找差异表达基因
df <- FindAllMarkers(rds, only.pos = TRUE)
# 去除以 MT|HB|RP|AC|COL 开头的基因
# MT 线粒体
# RP 核糖体蛋白
# HB 血红蛋白
# AC 细胞质
# COL 胶原蛋白
deg_save_path = paste(save_path,'deg/deg.csv', sep = '')
write_csv(df, deg_save_path)
filtered_df <- df[!grepl("^(MT|HB|RP|AC|COL)", df$gene), ]
# 创建一个以 cluster 为索引的列表,每个 cluster 对应一个子数据框
cluster_list <- split(filtered_df, filtered_df$cluster)
sort_and_select <- function(df) {
sorted_df <- df[order(df$avg_log2FC, decreasing = TRUE), ]
return(sorted_df)
}
# 对每个子数据框应用排序和筛选函数
sorted_dfs <- lapply(cluster_list, sort_and_select)
# 合并所有排序后的子数据框为一个新的数据框
filtered_df <- do.call(rbind, sorted_dfs)
deg_save_path = paste(save_path,'deg/filtered_deg.csv', sep = '')
write_csv(filtered_df, deg_save_path)
# 创建一个以 cluster 为索引的列表,每个 cluster 对应一个子数据框
cluster_list <- split(filtered_df, filtered_df$cluster)
# 定义一个函数,用于对子数据框按照 avglog2fc 列进行排序,并筛选前 50 行
sort_and_select <- function(df) {
sorted_df <- df[order(df$avg_log2FC, decreasing = TRUE), ]
if (nrow(sorted_df) >= 50) {
return(sorted_df[1:50, ])
} else {
return(sorted_df)
}
}
# 对每个子数据框应用排序和筛选函数
sorted_dfs <- lapply(cluster_list, sort_and_select)
# 合并所有排序后的子数据框为一个新的数据框
new_dataframe <- do.call(rbind, sorted_dfs)
# 输出最终结果
# print(new_dataframe)
deg_save_path = paste(save_path,'deg/filtered_top50_deg.csv', sep = '')
write_csv(new_dataframe, deg_save_path)
# 将 cluster 列中的唯一值保存到 unique_cluster 中
unique_cluster <- unique(new_dataframe$cluster)
# 对每个唯一的 cluster 值进行循环操作
for (cluster_value in unique_cluster) {
# 这里执行你的操作,例如:
# 通过 subset 函数筛选出特定 cluster 值的行
print(cluster_value)
subset_df <- subset(filtered_df, cluster == cluster_value)
gene_list <- subset_df$gene
gene_list = list(gene_list)
gene_modules = paste(cluster_value ,'_module', sep = '')
rds <- readRDS(rds_module_path)
rds = AddModuleScore(rds, features=gene_list,name='gene_modules', nbin = 12)
gene_module_path = paste(save_path, 'genemodules/csv/',cluster_value, '_raw_genemodule.csv', sep = "")
write.csv(rds@meta.data, gene_module_path)
rds = NormalizeData(rds)
rds = AddModuleScore(rds, features=gene_list,name='gene_modules', nbin = 12)
gene_module_path = paste(save_path, 'genemodules/csv/',cluster_value, '_lognor_genemodule.csv', sep = "")
write.csv(rds@meta.data, gene_module_path)
# 打印子集数据框的摘要信息
}
计算了 genemodule 的信息之后,便是需要对其在空间组上进行绘制,看看这些计算出的 deg 是否在我们划分的区域富集,这里采用了 scanpy ,采用scanpy的原因在于 scanpy 的 spatial 绘图函数较为通用以及便捷。
# python your_script.py --adata_file_path path_to_adata_file --csv_file_path path_to_csv_file --gene_modules module_name
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
import os
import argparse
import scanpy as sc
# 创建参数解析器
parser = argparse.ArgumentParser()
# 添加命令行选项
parser.add_argument('--adata_file_path', type=str, help='Path to the AnnData file')
parser.add_argument('--save_path', type=str, help='Name of the gene module column')
# 解析命令行参数
args = parser.parse_args()
# 将参数值赋给对应的变量
adata_file_path = args.adata_file_path
save_path = args.save_path
adata = sc.read(adata_file_path)
csv_path_dir = os.listdir(save_path+'genemodules/csv')
print(csv_path_dir)
for i in csv_path_dir:
if 'genemodule' in i:
print(i)
name = i.replace('.csv', '')
gene_modules = 'gene_modules1'
csv = pd.read_csv(save_path +'genemodules/csv/'+ i)
gene_modules = csv[gene_modules].tolist()
adata.obs[name] = gene_modules
sc.pl.spatial(adata, color = name, spot_size = 100, show = False, frameon = False)
plt.savefig(save_path + 'genemodules/fig/'+i.replace('.csv', '.pdf'))
else:
continue
print('genemodule plot over!')
到此为止,deg计算完毕,genemodule 计算完毕,绘制完毕。接下来进行 heatmap 的绘制,这里是用的 python 进行计算,R 语言进行绘图,我们将每个区域视为一个 psudo bulk,对于每个基因,计算其在每个 psudo bulk 中的平均表达量,然后保存为 csv,之后采用 R 语言进行绘制。
import pandas as pd
import anndata as ad
import os
import argparse
import scanpy as sc
# 创建参数解析器
parser = argparse.ArgumentParser()
# 添加命令行选项
parser.add_argument('--adata_file_path', type=str, help='Path to the AnnData file')
parser.add_argument('--save_path', type=str, help='Name of the gene module column')
parser.add_argument('--cluster', type=str, help='Name of the cluster column')
# 解析命令行参数
args = parser.parse_args()
# 将参数值赋给对应的变量
adata_file_path = args.adata_file_path
save_path = args.save_path
cluster = args.cluster
listdir = os.listdir(save_path + 'deg/')
print(listdir)
for i in listdir:
if 'filtered' in i:
adata = ad.read(adata_file_path)
name = i.replace('.csv', '')
print(name)
csv = pd.read_csv(save_path + 'deg/' + i)
deg_df_cluster = list(set(csv['cluster']))
all_genes_ = []
for j in deg_df_cluster:
temp = csv[csv['cluster'] ==j].sort_values('avg_log2FC', ascending = False)[:50].gene.tolist()
all_genes_ += temp
all_genes = []
for j in all_genes_:
if j in all_genes:
continue
else:
all_genes.append(j)
heatmap = pd.DataFrame(index = all_genes)
for j in deg_df_cluster:
adata_temp = adata[adata.obs[cluster] == j]
adata_temp = adata_temp[:,all_genes].X.todense()
adata_temp = pd.DataFrame(adata_temp,columns = all_genes)
heatmap[j] = adata_temp.mean()
heatmap.to_csv(save_path + 'heatmap/' + name +'_raw_heatmap.csv')
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata[:,all_genes]
sc.pp.scale(adata, max_value=10)
heatmap = pd.DataFrame(index = all_genes)
for j in deg_df_cluster:
adata_temp = adata[adata.obs[cluster] == j]
# print(adata_temp.X)
adata_temp = adata_temp.X
adata_temp = pd.DataFrame(adata_temp,columns = all_genes)
heatmap[j] = adata_temp.mean()
heatmap.to_csv(save_path + 'heatmap/' + name +'_lognor_heatmap.csv')
else:
continue
绘制函数如下所示:
# Rscript 1.R gene_list_value rds_path_value gene_modules_value gene_module_path_value gene_module_normalized_path_value
# 读取命令行参数
args <- commandArgs(trailingOnly = TRUE)
# 检查是否提供了正确的参数数量
if (length(args) != 1) {
stop("需要提供一个参数:save_path")
}
# 解析命令行参数
save_path = args[1]
library(pheatmap)
library(ggplot2)
library(viridis)
library(tidyverse)
raw = paste(save_path, 'heatmap/filtered_deg_raw_heatmap.csv', sep = '')
data <- read.csv(raw, encoding="UTF-8")
rownames(data) = data$X
data$X =NULL
mycolors = viridis(11)
options(repr.plot.width=6, repr.plot.height=20)
p = pheatmap(data ,cluster_col = FALSE,cluster_row = FALSE,color = colorRampPalette(mycolors)(100),)
raw_save_path = paste(save_path, 'heatmap/filtered_deg_raw_heatmap.pdf', sep = '')
ggsave(raw_save_path, plot = p, width=6,height=20,limitsize = FALSE)
options(repr.plot.width=6, repr.plot.height=20)
p = pheatmap(data ,cluster_col = FALSE,cluster_row = FALSE,scale = "row",color = colorRampPalette(mycolors)(100),)
raw_save_path = paste(save_path, 'heatmap/filtered_deg_raw_scale_heatmap.pdf', sep = '')
ggsave(raw_save_path, plot = p, width=6,height=20,limitsize = FALSE)
raw = paste(save_path, 'heatmap/filtered_deg_lognor_heatmap.csv', sep = '')
data <- read.csv(raw, encoding="UTF-8")
rownames(data) = data$X
data$X =NULL
mycolors = viridis(11)
options(repr.plot.width=6, repr.plot.height=20)
p = pheatmap(data ,cluster_col = FALSE,cluster_row = FALSE,color = colorRampPalette(mycolors)(100),)
raw_save_path = paste(save_path, 'heatmap/filtered_deg_lognor_heatmap.pdf', sep = '')
ggsave(raw_save_path, plot = p, width=6,height=20,limitsize = FALSE)
options(repr.plot.width=6, repr.plot.height=20)
p = pheatmap(data ,cluster_col = FALSE,cluster_row = FALSE,scale = "row",color = colorRampPalette(mycolors)(100),)
raw_save_path = paste(save_path, 'heatmap/filtered_deg_lognor_scale_heatmap.pdf', sep = '')
ggsave(raw_save_path, plot = p, width=6,height=20,limitsize = FALSE)
raw = paste(save_path, 'heatmap/filtered_top50_deg_raw_heatmap.csv', sep = '')
data <- read.csv(raw, encoding="UTF-8")
rownames(data) = data$X
data$X =NULL
mycolors = viridis(11)
options(repr.plot.width=6, repr.plot.height=20)
p = pheatmap(data ,cluster_col = FALSE,cluster_row = FALSE,color = colorRampPalette(mycolors)(100),)
raw_save_path = paste(save_path, 'heatmap/filtered_top50_deg_raw_heatmap.pdf', sep = '')
ggsave(raw_save_path, plot = p, width=6,height=20,limitsize = FALSE)
options(repr.plot.width=6, repr.plot.height=20)
p = pheatmap(data ,cluster_col = FALSE,cluster_row = FALSE,scale = "row",color = colorRampPalette(mycolors)(100),)
raw_save_path = paste(save_path, 'heatmap/filtered_top50_deg_raw_scale_heatmap.pdf', sep = '')
ggsave(raw_save_path, plot = p, width=6,height=20,limitsize = FALSE)
raw = paste(save_path, 'heatmap/filtered_top50_deg_lognor_heatmap.csv', sep = '')
data <- read.csv(raw, encoding="UTF-8")
rownames(data) = data$X
data$X =NULL
mycolors = viridis(11)
options(repr.plot.width=6, repr.plot.height=20)
p = pheatmap(data ,cluster_col = FALSE,cluster_row = FALSE,color = colorRampPalette(mycolors)(100),)
raw_save_path = paste(save_path, 'heatmap/filtered_top50_deg_lognor_heatmap.pdf', sep = '')
ggsave(raw_save_path, plot = p, width=6,height=20,limitsize = FALSE)
options(repr.plot.width=6, repr.plot.height=20)
p = pheatmap(data ,cluster_col = FALSE,cluster_row = FALSE,scale = "row",color = colorRampPalette(mycolors)(100),)
raw_save_path = paste(save_path, 'heatmap/filtered_top50_deg_lognor_scale_heatmap.pdf', sep = '')
ggsave(raw_save_path, plot = p, width=6,height=20,limitsize = FALSE)
R 函数的heatmap绘制了不止一个图,而且多个,大家可以自行选择需要的。最后是进行 GO 富集分析,这里我采用了 clusterProfiler 进行 GO 富集分析,这里的代码也是比较简单的,大家可以自行参考。
# Rscript your_script.R path_to_rds_file.rds refined_pred_column_name path_to_output_csv_file.csv
# 读取命令行参数
args <- commandArgs(trailingOnly = TRUE)
# 检查是否提供了正确的参数数量
if (length(args) != 1) {
stop("需要提供三个参数:save_path")
}
# 解析命令行参数
save_path <- args[1]
###---------------------go分析
library("clusterProfiler")
#install.packages("BiocManager")
#BiocManager::install("clusterProfiler")
#BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)
library(DOSE)
library(enrichplot)
library("tidyverse")
#install.packages("tidyverse")
library(RColorBrewer)
library(tidyverse)
#### go
cal_go = function(genelist, fig_name, csv_name) {
go = enrichGO(genelist,
OrgDb = org.Hs.eg.db,
ont = 'ALL',
pAdjustMethod = "BH",
pvalueCutoff = 1,
qvalueCutoff = 1,
keyType = "SYMBOL")
go.count = data.frame(go)
go.count_sort = go.count[order(go.count$pvalue, decreasing = FALSE),]
top10 = function(i) {
k = head(i[order(i$pvalue), ], 10);
return(k)
}
dp = rbind(top10(go.count[go.count['ONTOLOGY'] == "BP", ]),
top10(go.count[go.count['ONTOLOGY'] == "CC", ]),
top10(go.count[go.count['ONTOLOGY'] == "MF", ]))
d1 = dp[which(dp[, 1] == "BP"), ]
d2 = dp[which(dp[, 1] == "CC"), ]
d3 = dp[which(dp[, 1] == "MF"), ]
d_l = rbind(d1, d2, d3)
mt = d_l[, c(1, 3, 10)]
colnames(mt)<-c("Type","name","number")
mt$name <- factor(mt$name, levels = rev(mt$name))
ggplot(data=mt, aes(x=name,y=number,fill=Type))+ theme_bw()+
geom_bar(stat = "identity", position=position_dodge())+coord_flip()+xlab("GO term") + ylab("Number of genes")+
scale_fill_brewer(palette="Set2")+
theme(panel.border = element_blank(),
plot.title = element_text(face = "bold",size = 12,vjust =1),
axis.title = element_text(face = "bold",size = 12, vjust = 0.3),
axis.text = element_text(size = 8, vjust = 0.8),
axis.line = element_line(),
panel.grid = element_blank(),
legend.position=c(0.8,0.75),
legend.key.size = unit(4, "mm"))
ggsave(fig_name, height = 5, width = 10)
#write.csv(go.count_sort, csv_name, row.names = F, quote = F)
go.count_sort %>% write_csv(csv_name)
}
# cluster = paste('cluster', '', )
dataframe_path = paste(save_path, 'deg/', 'filtered_deg.csv',sep = '')
dataframe = read_csv(dataframe_path, col_names = FALSE)
# colnames(dataframe) <- as.character(dataframe[1, ])
dataframe <- dataframe[-1, ]
# rownames(dataframe) <- NULL
# 将 cluster 列中的唯一值保存到 unique_cluster 中
unique_clusters <- unique(dataframe$X6)
# 对每个唯一的 cluster 值进行循环操作
for (cluster_value in unique_clusters ) {
# 这里执行你的操作,例如:
# 通过 subset 函数筛选出特定 cluster 值的行
print(cluster_value)
subset_df <- dataframe[dataframe$X6 == cluster_value, ]
genelist = subset_df$X7
genelist = genelist[!is.na(genelist)]
genelist = genelist[-1]
csv_result = paste(save_path, 'go/raw_results/', cluster_value,'.csv', sep = '')
pdf_result = paste(save_path, 'go/raw_results/', cluster_value,'.pdf', sep = '')
cal_go(genelist, pdf_result, csv_result)
}
dataframe_path = paste(save_path, 'deg/', 'filtered_top50_deg.csv',sep = '')
dataframe = read_csv(dataframe_path, col_names = FALSE)
dataframe <- dataframe[-1, ]
# rownames(dataframe) <- NULL
# 将 cluster 列中的唯一值保存到 unique_cluster 中
unique_cluster <- unique(dataframe$X6)
# 对每个唯一的 cluster 值进行循环操作
for (cluster_value in unique_cluster) {
# 这里执行你的操作,例如:
# 通过 subset 函数筛选出特定 cluster 值的行
subset_df <- dataframe[dataframe$X6 == cluster_value, ]
genelist = subset_df$X7
genelist = genelist[!is.na(genelist)]
genelist = genelist[-1]
csv_result = paste(save_path, 'go/raw_results/', cluster_value,'_top50.csv', sep = '')
pdf_result = paste(save_path, 'go/raw_results/', cluster_value,'_top50.pdf', sep = '')
cal_go(genelist, pdf_result, csv_result)
}
这段代码能够给一个大概的参考,有时候我们会只想要 BP 相关的数据,我们还可以读入上述生成的表格,重新进行绘制,代码如下:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import os
import argparse
parser = argparse.ArgumentParser()
# 添加命令行选项
parser.add_argument('--save_path', type=str, help='Name of the gene module column')
# 解析命令行参数
args = parser.parse_args()
# 将参数值赋给对应的变量
save_path = args.save_path
csv_file_dir_path = save_path + 'go/raw_results/'
save_dir_path = save_path + 'go/filtered_results/'
csv_file_dir = os.listdir(csv_file_dir_path)
csv_file = []
for i in csv_file_dir:
if '.csv' in i:
csv_file.append(i)
def go_bp_plot(csv_file_path, save_path):
csv = pd.read_csv(csv_file_path)
csv = csv[csv['ONTOLOGY']=='BP']
# 根据 geneID 列进行去重
# csv = csv.drop_duplicates(subset=['geneID'])
csv.index = range(len(csv))
# 设置字体样式和大小
plt.rcParams['font.size'] = 8 # 坐标轴字号为16
ax = csv[:10].plot.barh(y = 'Count', x = 'Description', width=0.92, color='#66c2a5')
ax.set_xlabel('Number of genes', fontsize='12')
ax.set_ylabel('GO term', fontsize='12')
ax.set_yticklabels([]) # Remove y-axis tick labels
# Add Description labels on top of each bar
max_v = max(csv[:10]['Count'])
for i, v in enumerate(csv[:10]['Count']):
Description = csv['Description'][i]
ax.text(0.025*max_v, i, Description, color='black', va='center')
# csv['Description'][index]
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Set x-axis tick locator
ax.xaxis.set_major_locator(ticker.MaxNLocator(integer=True))
# Set x-axis tick formatter
ax.xaxis.set_major_formatter(ticker.FormatStrFormatter('%d'))
ax.legend().remove()
plt.savefig(save_path, format='pdf')
plt.close()
for i in csv_file:
name = i.split('.')[0]
go_bp_plot(csv_file_dir_path + name + '.csv', save_dir_path + name + '.pdf')
总体上,我们实现了空间组数据的 deg 的计算,genemodule 的计算与绘制,每个基因在各个空间注释区域的表达量的计算与 heatmap 的绘制,以及各个区域的 GO 分析。
最后,还有一个简易的 sh 脚本将这 6 个 python 或 R 脚本进行统一的运行,如下所示:
# 所有文件的保存地址
save_path="/data/gw21_scp_0103/"
# 绘制 genemodule heatmap 的 h5ad path
h5ad_path="/data/*_0.01.h5ad"
# 求取 deg 的 rds path
rds_deg_path="/data/*_0103.rds"
# 求取 module 的 rds path
rds_module_path="/data/*_0102.rds"
# 脑区的注释信息
region='Ribbon'
folder=${save_path}""
if [ -d "$folder" ]; then
echo "folder already exists!"
else
mkdir "$folder"
fi
deg_folder=${save_path}"deg/"
if [ -d "$deg_folder" ]; then
echo "deg_folder already exists!"
else
mkdir "$deg_folder"
fi
genemodules_folder=${save_path}"genemodules/"
if [ -d "$genemodules_folder" ]; then
echo "genemodules_folder already exists!"
else
mkdir "$genemodules_folder"
fi
genemodules_folder=${save_path}"genemodules/csv/"
if [ -d "$genemodules_folder" ]; then
echo "genemodules_folder already exists!"
else
mkdir "$genemodules_folder"
fi
genemodules_folder=${save_path}"genemodules/fig/"
if [ -d "$genemodules_folder" ]; then
echo "genemodules_folder already exists!"
else
mkdir "$genemodules_folder"
fi
heatmap_folder=${save_path}"heatmap/"
if [ -d "$heatmap_folder" ]; then
echo "genemodule_folder already exists!"
else
mkdir "$heatmap_folder"
fi
go_folder=${save_path}"go/"
if [ -d "$go_folder" ]; then
echo "go_folder already exists!"
else
mkdir "$go_folder"
fi
go_folder=${save_path}"go/raw_results/"
if [ -d "$go_folder" ]; then
echo "go_raw_results_folder already exists!"
else
mkdir "$go_folder"
fi
go_folder=${save_path}"go/filtered_results/"
if [ -d "$go_folder" ]; then
echo "go_filtered_results_folder already exists!"
else
mkdir "$go_folder"
fi
# deg 和 genemodule 计算
Rscript /data/work/17.Script/addgenemodule/1.R ${rds_deg_path} ${rds_module_path} ${region} ${save_path}
# genemodule 绘制
python /data/work/17.Script/addgenemodule/2.py --adata_file_path ${h5ad_path} --save_path ${save_path}
# heatmap 计算
python /data/work/17.Script/addgenemodule/3.py --adata_file_path ${h5ad_path} --save_path ${save_path} --cluster ${region}
# heatmap 绘制
Rscript /data/work/17.Script/addgenemodule/4.R ${save_path}
# go 求取
Rscript /data/work/17.Script/addgenemodule/5.R ${save_path}
# go 绘制
python /data/work/17.Script/addgenemodule/6.py --save_path ${save_path}
以上。