Tools (`sjanpy.tl`)

Differential Expression

sjanpy.tl.deg.fast_two_group_deg(adata, label_col, lst1, lst2)[source]: High-speed DEG calculation using vectorized operations. Focuses on raw matrix extraction and batch statistics.

sjanpy.tl.deg.compute_nested_deg_df(adata, cluster_key, condition_key, target_condition, reference_condition, method='wilcoxon', min_cells=10, compute_pct=True, expr_layer=None, expr_threshold=0.0)[source]

Compute within-cluster differential expression between two conditions, and optionally report per-gene detection rate (fraction of cells expressing) in each condition.

For each cluster defined by cluster_key, the function subsets adata to that cluster and runs scanpy.tl.rank_genes_groups comparing target_condition vs reference_condition within condition_key. Clusters are skipped if either group has fewer than min_cells.
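The skipping rule can be previewed before running any test. The following is a minimal sketch using only pandas; the helper name `eligible_clusters` and the toy `obs` table are hypothetical and not part of sjanpy, and the real function may count cells differently.

```python
import pandas as pd

def eligible_clusters(obs, cluster_key, condition_key,
                      target_condition, reference_condition, min_cells=10):
    """Return clusters in which both conditions have at least min_cells cells."""
    # Keep only cells that belong to one of the two compared conditions.
    sub = obs[obs[condition_key].isin([target_condition, reference_condition])]
    # Rows: clusters, columns: conditions, values: cell counts.
    counts = pd.crosstab(sub[cluster_key], sub[condition_key])
    # Guarantee both condition columns exist even if one has no cells at all.
    counts = counts.reindex(columns=[target_condition, reference_condition],
                            fill_value=0)
    keep = (counts >= min_cells).all(axis=1)
    return sorted(counts.index[keep].tolist())

# Toy metadata: cluster A has 3 + 3 cells, cluster B has 3 + 1 cells.
obs = pd.DataFrame({
    "cluster": ["A"] * 6 + ["B"] * 4,
    "condition": ["Disease"] * 3 + ["Normal"] * 3
                 + ["Disease"] * 3 + ["Normal"] * 1,
})
kept = eligible_clusters(obs, "cluster", "condition",
                         "Disease", "Normal", min_cells=2)
print(kept)
```

With `min_cells=2`, cluster B is dropped because it has only one "Normal" cell.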

Parameters:

adata (anndata.AnnData) – Annotated data matrix. Uses adata.X by default for detection-rate computation; DEG testing is performed by Scanpy on the subset AnnData.
cluster_key (str) – Column name in adata.obs defining clusters to iterate over.
condition_key (str) – Column name in adata.obs defining the two conditions to compare.
target_condition (str) – Name of the condition treated as the numerator / “case” group in the comparison (e.g., “Disease”, “Adult”).
reference_condition (str) – Name of the condition treated as the reference / “control” group in the comparison (e.g., “Normal”, “Fetal”).
method (str, default "wilcoxon") – Method passed to scanpy.tl.rank_genes_groups (e.g., “wilcoxon”, “t-test”, “logreg”).
min_cells (int, default 10) – Minimum number of cells required in each condition within a cluster. Clusters not meeting this threshold are skipped.
compute_pct (bool, default True) – If True, add detection-rate columns:
- pct_target: fraction of target-condition cells with expression > expr_threshold
- pct_reference: fraction of reference-condition cells with expression > expr_threshold
expr_layer (str or None, default None) – Which matrix to use for detection-rate computation:
- None: use adata_c.X
- str: use adata_c.layers[expr_layer]
This does not change the DEG test itself (handled by Scanpy).
expr_threshold (float, default 0.0) – Expression threshold for defining a gene as “expressed” when computing detection rates. A gene is counted as expressed in a cell if its value is strictly greater than this threshold.

Returns:

Concatenated results across clusters with one row per ranked gene per cluster. Always includes:
- gene: str
- logfc: float
- pvals_adj: float
- cluster: str
If compute_pct=True, also includes:
- pct_target: float in [0, 1]
- pct_reference: float in [0, 1]

Return type:

pandas.DataFrame

Notes

Detection rates are computed over all genes in the subset cluster matrix and then indexed to the ranked genes returned by Scanpy.
If expr_layer is provided, it must exist in adata.layers.
For sparse matrices, detection rates are computed efficiently without densifying.
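One way to compute detection rates without densifying is sketched below. It assumes a SciPy sparse (or dense NumPy) matrix with cells as rows and genes as columns, and a non-negative threshold; the helper name `detection_rate` is hypothetical, and this is not necessarily how sjanpy implements it.

```python
import numpy as np
from scipy import sparse

def detection_rate(X, expr_threshold=0.0):
    """Fraction of cells (rows) in which each gene (column) exceeds the threshold."""
    n_cells, n_genes = X.shape
    if sparse.issparse(X):
        if expr_threshold < 0:
            # Implicit zeros would also count as expressed; not handled here.
            raise ValueError("sparse shortcut assumes expr_threshold >= 0")
        X = X.tocsr()
        # Only explicitly stored values can exceed a non-negative threshold,
        # so count qualifying entries per column index directly.
        hits = X.indices[X.data > expr_threshold]
        counts = np.bincount(hits, minlength=n_genes)
    else:
        counts = (np.asarray(X) > expr_threshold).sum(axis=0)
    return counts / n_cells

# Toy matrix: 4 cells x 3 genes.
dense = np.array([[0, 2, 0],
                  [1, 0, 0],
                  [3, 4, 0],
                  [0, 1, 0]], dtype=float)
pct_sparse = detection_rate(sparse.csr_matrix(dense))
pct_dense = detection_rate(dense)
print(pct_sparse)
```

The sparse branch touches only the stored non-zero entries, so memory use stays proportional to the number of non-zeros.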

sjanpy.tl.deg.clip_logfc_in_nested_deg_df(df, logfc_col='logfc', cluster_col='cluster', quantile=0.95)[source]: Clip logfc values at a quantile threshold, computed separately for each cluster.
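The exact clipping rule is not documented here. As an illustration only, the sketch below assumes symmetric clipping to plus or minus the given quantile of |logfc| within each cluster; the helper name `clip_logfc_by_cluster` is hypothetical and the library's behavior may differ (for example, it might clip the upper and lower tails separately).

```python
import numpy as np
import pandas as pd

def clip_logfc_by_cluster(df, logfc_col="logfc", cluster_col="cluster",
                          quantile=0.95):
    """Clip logfc per cluster to +/- the quantile of |logfc| (assumed rule)."""
    out = df.copy()
    # Per-row bound: the quantile of absolute logfc within that row's cluster.
    bound = (out[logfc_col].abs()
             .groupby(out[cluster_col])
             .transform(lambda s: s.quantile(quantile)))
    out[logfc_col] = np.clip(out[logfc_col].to_numpy(),
                             -bound.to_numpy(), bound.to_numpy())
    return out

deg = pd.DataFrame({
    "gene": ["g1", "g2", "g3", "g4", "g5", "g6"],
    "cluster": ["A", "A", "A", "A", "B", "B"],
    "logfc": [0.1, -0.2, 0.3, 50.0, -10.0, 1.0],
})
clipped = clip_logfc_by_cluster(deg, quantile=0.5)
print(clipped)
```

Clipping per cluster keeps one cluster with extreme fold changes from dictating the color or axis range used for all the others.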

sjanpy.tl.deg.generate_highlight_dict(deg_df, strategies=['topn'], cluster_key='cluster', top_n=5, k=3, ktimes_poscut=1.0, ktimes_negcut=-1.0, manual_genes=None, exclude_genes=None, exclude_regex=None)[source]

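Build a dictionary of the genes to highlight in each cluster, using one or more selection strategies. Genes can be excluded with regular expressions.

Regular expression examples:

Exclude mitochondrial genes (starting with MT-): r'^MT-'
Exclude ribosomal genes (starting with RPS or RPL): r'^RP[SL]'
Exclude genes matching a specific pattern (e.g., AC followed by digits and ending in .1): r'^AC\d+\.1$'
Exclude all genes starting with Gm: r'^Gm'
Exclude several patterns at once (separated by |): r'^MT-|^RP[SL]|^Gm'

Parameters:

deg_df (pd.DataFrame) – Differential expression results, containing columns such as gene, logfc, and cluster.
exclude_regex (list of str) – List of regular expressions. Any gene matching at least one of them is excluded.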
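The exclusion patterns listed above can be checked against a gene list with the standard `re` module before passing them as exclude_regex. This is a minimal sketch: the helper `filter_genes` is hypothetical, and it assumes patterns are applied with `re.search`, which may not match how sjanpy applies them internally.

```python
import re

def filter_genes(genes, exclude_regex=None):
    """Drop any gene that matches at least one pattern in exclude_regex."""
    patterns = [re.compile(p) for p in (exclude_regex or [])]
    return [g for g in genes if not any(p.search(g) for p in patterns)]

genes = ["MT-CO1", "RPS6", "RPL13", "AC012345.1",
         "Gm1234", "GAPDH", "ACTB", "MRPS7"]

# A list of separate patterns ...
kept = filter_genes(
    genes, exclude_regex=[r"^MT-", r"^RP[SL]", r"^AC\d+\.1$", r"^Gm"]
)
# ... or a single combined pattern using | as the separator.
kept_combined = filter_genes(genes, exclude_regex=[r"^MT-|^RP[SL]|^Gm"])
print(kept)
print(kept_combined)
```

Because the patterns are anchored with ^, MRPS7 is kept even though it contains "RPS", and ACTB is kept because "AC" is not followed by digits.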

Pearson Residuals

class sjanpy.tl.pres.PearsonResidualsScaler(theta=100, clip=None, feature_names=None)[source]

Bases: object

Implements Analytic Pearson Residuals for scRNA-seq normalization.

This method computes residuals based on a Negative Binomial null model. It is used to identify highly variable genes and to provide a variance-stabilized representation of count data without the need for pseudo-counts or log-transformation.

__init__(theta=100, clip=None, feature_names=None)[source]

Args:

theta (float): Overdispersion parameter. As theta -> infinity, the model converges to a Poisson distribution.
clip (float, optional): Maximum absolute value for residuals. Defaults to sqrt(number of observations).
feature_names (list-like, optional): Names of genes/features for reporting.

diagnose(X)[source]

Performs data integrity checks to identify potential numerical issues.

Args:

X: Input matrix (raw counts).

Returns:

dict: Summary of diagnostic results.

fit(X)[source]: Fits the Pearson Residual model by calculating gene probabilities and diagnostic statistics.

transform(X)[source]

Transforms raw counts into clipped Pearson residuals.

Math: z_ij = (x_ij - mu_ij) / sqrt(mu_ij + mu_ij^2 / theta)
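The formula can be reproduced in a few lines of NumPy. The sketch below is not the class's implementation: it assumes the common analytic definition mu_ij = n_i * p_j, where n_i is the total count of cell i and p_j is gene j's share of all counts, which is not spelled out in this reference. The function name `pearson_residuals` is hypothetical.

```python
import numpy as np

def pearson_residuals(X, theta=100.0, clip=None):
    """Clipped Pearson residuals under a negative binomial null model."""
    X = np.asarray(X, dtype=float)
    cell_depth = X.sum(axis=1, keepdims=True)            # n_i, shape (cells, 1)
    gene_prob = X.sum(axis=0, keepdims=True) / X.sum()   # p_j, shape (1, genes)
    mu = cell_depth * gene_prob                          # assumed mu_ij = n_i * p_j
    if clip is None:
        clip = np.sqrt(X.shape[0])                       # sqrt(number of observations)
    denom = np.sqrt(mu + mu ** 2 / theta)
    # Genes or cells with zero total counts give mu = 0; set their residual to 0.
    z = np.divide(X - mu, denom, out=np.zeros_like(X), where=denom > 0)
    return np.clip(z, -clip, clip)

# Toy count matrix: 4 cells x 3 genes.
counts = np.array([[1, 0, 3],
                   [2, 1, 0],
                   [0, 4, 5],
                   [3, 2, 1]])
Z = pearson_residuals(counts, theta=100.0)
print(Z.round(3))
```

With four cells the default clip is sqrt(4) = 2, so every residual lies in [-2, 2].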

get_statistics()[source]: Returns a DataFrame containing gene-wise fitting parameters and diagnostic statistics.

fit_transform(X)[source]: Fit the model and return the transformed residuals.
