Differential Expression Analysis
sjanpy.tl provides fast vectorized differential expression computation
and helper functions for multi-cluster analyses.
Fast two-group DEG
fast_two_group_deg() runs Welch’s t-test directly on the
expression matrix, which is much faster than Scanpy’s rank_genes_groups for
simple two-group comparisons:
import scanpy as sc
from sjanpy.tl import fast_two_group_deg
adata = sc.datasets.pbmc3k_processed()
results = fast_two_group_deg(
    adata,
    label_col='louvain',
    lst1=['B cells'],
    lst2=['CD4 T cells'],
)
print(results.head(10))
The result DataFrame contains:
- gene: gene name
- log2FC: log2 fold change (group1 vs group2)
- pct.1, pct.2: detection rates in each group
- pval, padj: raw and FDR-adjusted p-values
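
To make the statistics behind these columns concrete, here is a minimal NumPy sketch of a vectorized Welch t-statistic and per-gene detection rate. It is an illustration only, not sjanpy's implementation: the helper names `welch_t_sketch` and `detection_rate` are hypothetical, the input is assumed to be a dense cells-by-genes array, and the detection rate is assumed to mean the fraction of cells with nonzero expression. Converting the statistic and degrees of freedom into p-values needs the t-distribution (for example from SciPy) and is left out here.

```python
import numpy as np

def welch_t_sketch(x1, x2):
    """Per-gene Welch t-statistic and degrees of freedom.

    x1, x2: dense (cells x genes) arrays for group 1 and group 2.
    Hypothetical helper for illustration; not part of sjanpy.
    Genes with zero variance in both groups would produce NaN.
    """
    n1, n2 = x1.shape[0], x2.shape[0]
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    # Unbiased per-gene variances (ddof=1), divided by group size
    v1 = x1.var(axis=0, ddof=1) / n1
    v2 = x2.var(axis=0, ddof=1) / n2
    t = (m1 - m2) / np.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

def detection_rate(x):
    """Fraction of cells with nonzero expression, per gene (assumed definition)."""
    return (x > 0).mean(axis=0)

# Synthetic data: 50 and 80 cells, 4 genes
rng = np.random.default_rng(0)
group1 = rng.poisson(2.0, size=(50, 4)).astype(float)
group2 = rng.poisson(1.0, size=(80, 4)).astype(float)

t_stat, dof = welch_t_sketch(group1, group2)
pct1, pct2 = detection_rate(group1), detection_rate(group2)
```

All genes are processed in a single pass of array operations, which is the general reason a direct matrix computation can outpace a per-gene loop.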
Within-cluster DEG
compute_nested_deg_df() computes DEGs between two
conditions within each cluster, using Scanpy’s rank_genes_groups:
from sjanpy.tl import compute_nested_deg_df
# Requires a condition column in adata.obs
nested_deg = compute_nested_deg_df(
    adata,
    cluster_key='louvain',
    condition_key='condition',
    target_condition='Disease',
    reference_condition='Control',
    method='wilcoxon',
    min_cells=10,
    compute_pct=True,
)
Key parameters:
- min_cells: skip clusters that have fewer than this many cells in either condition
- compute_pct: add detection rate columns (pct_target, pct_reference)
- expr_layer: use a specific layer for the detection rate (e.g. 'counts')
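
The min_cells rule can be pictured with a short standard-library sketch. This is an assumption about the behavior, not sjanpy's code: the helper `clusters_passing_min_cells` is hypothetical, and a cluster is assumed to be kept when both conditions have at least min_cells cells.

```python
from collections import Counter

def clusters_passing_min_cells(clusters, conditions, target, reference, min_cells):
    """Return clusters with at least min_cells cells in both conditions.

    clusters, conditions: per-cell labels of equal length.
    Hypothetical helper for illustration; not part of sjanpy.
    """
    # Count cells for every (cluster, condition) pair
    counts = Counter(zip(clusters, conditions))
    kept = []
    for cluster in sorted(set(clusters)):
        if (counts[(cluster, target)] >= min_cells
                and counts[(cluster, reference)] >= min_cells):
            kept.append(cluster)
    return kept

# "B" has 12 Disease and 13 Control cells; "NK" has 4 Disease and 10 Control cells
clusters = ["B"] * 25 + ["NK"] * 14
conditions = (["Disease"] * 12 + ["Control"] * 13
              + ["Disease"] * 4 + ["Control"] * 10)

kept = clusters_passing_min_cells(clusters, conditions, "Disease", "Control", min_cells=10)
print(kept)  # ['B']
```

Here "NK" is skipped because only 4 of its cells belong to the target condition, even though the reference condition meets the threshold.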
Clipping extreme logFC
clip_logfc_in_nested_deg_df() clips outlier logFC values
per cluster to prevent extreme values from dominating visualizations:
from sjanpy.tl import clip_logfc_in_nested_deg_df
clipped = clip_logfc_in_nested_deg_df(
    nested_deg,
    logfc_col='logfc',
    cluster_col='cluster',
    quantile=0.95,
)
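
As a rough picture of per-cluster quantile clipping, the NumPy sketch below caps each cluster's logFC values at a quantile of their absolute values. Whether clip_logfc_in_nested_deg_df() clips symmetrically on the absolute value or uses separate lower and upper quantiles is not stated here, so the symmetric rule and the helper name `clip_by_abs_quantile` are assumptions made for illustration.

```python
import numpy as np

def clip_by_abs_quantile(logfc, clusters, quantile=0.95):
    """Clip logFC per cluster to +/- the given quantile of |logFC|.

    Hypothetical helper for illustration; not part of sjanpy.
    """
    logfc = np.asarray(logfc, dtype=float)
    clusters = np.asarray(clusters)
    clipped = logfc.copy()
    for c in np.unique(clusters):
        mask = clusters == c
        # Cluster-specific cap taken from the absolute values
        limit = np.quantile(np.abs(logfc[mask]), quantile)
        clipped[mask] = np.clip(logfc[mask], -limit, limit)
    return clipped

logfc = [0.1, -0.2, 0.3, 25.0, -0.5, 0.4, -30.0, 0.2]
clusters = ["A", "A", "A", "A", "B", "B", "B", "B"]

clipped = clip_by_abs_quantile(logfc, clusters, quantile=0.75)
# The outliers 25.0 and -30.0 are pulled in to about 6.475 and -7.875;
# all other values are unchanged.
```

Because the cap is computed separately for each cluster, one cluster with very large effects does not determine the limit applied to the others.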
Selecting genes to highlight
generate_highlight_dict() selects important genes per
cluster for labeling in plots:
from sjanpy.tl import generate_highlight_dict
highlights = generate_highlight_dict(
    nested_deg,
    strategies=['topn', 'ktimes'],
    cluster_key='cluster',
    top_n=5,
    k=3,
    exclude_regex=[r'^MT-', r'^RP[SL]'],
)
# Returns: {'Cluster_0': ['GENE1', ...], 'Cluster_1': [...], ...}
for cluster, genes in highlights.items():
    print(f"{cluster}: {genes}")
Three strategies can be combined:
- 'topn': select top N genes by absolute logFC per cluster
- 'ktimes': genes that exceed logFC cutoffs in at least k clusters
- 'manual': user-specified gene list (filtered to those present in the data)
exclude_regex removes unwanted genes (mitochondrial, ribosomal, etc.)
after selection.
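
The way 'topn', 'ktimes', and exclude_regex interact can be sketched in plain Python. This is one plausible reading of the description above, not sjanpy's implementation: the helper `select_highlights`, the `logfc_cutoff` parameter, the use of absolute logFC for the 'ktimes' cutoff, and the tuple-based input are all assumptions made for illustration.

```python
import re
from collections import defaultdict

def select_highlights(rows, top_n=2, k=2, logfc_cutoff=1.0, exclude_regex=()):
    """Pick genes to label per cluster.

    rows: iterable of (cluster, gene, logfc) tuples.
    Hypothetical helper for illustration; not part of sjanpy.
    """
    rows = list(rows)
    by_cluster = defaultdict(list)
    hits = defaultdict(set)  # gene -> clusters where |logFC| passes the cutoff
    for cluster, gene, logfc in rows:
        by_cluster[cluster].append((gene, logfc))
        if abs(logfc) >= logfc_cutoff:
            hits[gene].add(cluster)

    # 'ktimes': genes passing the cutoff in at least k clusters
    recurrent = {g for g, cs in hits.items() if len(cs) >= k}

    patterns = [re.compile(p) for p in exclude_regex]
    result = {}
    for cluster, genes in by_cluster.items():
        # 'topn': largest |logFC| within this cluster
        ranked = sorted(genes, key=lambda gl: abs(gl[1]), reverse=True)
        chosen = [g for g, _ in ranked[:top_n]]
        # Add recurrent genes that pass the cutoff in this cluster
        chosen += [g for g, _ in ranked
                   if g in recurrent and cluster in hits[g] and g not in chosen]
        # Exclusion runs after selection, so removed genes are not replaced
        result[cluster] = [g for g in chosen
                           if not any(p.search(g) for p in patterns)]
    return result

rows = [
    ("c0", "MT-CO1", 3.0), ("c0", "CD79A", 2.5), ("c0", "IL7R", -1.2), ("c0", "ACTB", 0.1),
    ("c1", "RPS12", 2.0), ("c1", "IL7R", 1.5), ("c1", "NKG7", 0.4),
]
picked = select_highlights(rows, top_n=2, k=2, exclude_regex=[r"^MT-", r"^RP[SL]"])
print(picked)  # {'c0': ['CD79A', 'IL7R'], 'c1': ['IL7R']}
```

In this sketch, filtering after selection means an excluded gene still uses up a top-N slot: MT-CO1 ranks first in c0 and is then dropped without being replaced by the next gene.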