Tools (sjanpy.tl)

Differential Expression

sjanpy.tl.deg.fast_two_group_deg(adata, label_col, lst1, lst2)[source]

Computes differentially expressed genes (DEGs) between two groups at high speed using vectorized operations, focusing on raw matrix extraction and batch statistics.

sjanpy.tl.deg.compute_nested_deg_df(adata, cluster_key, condition_key, target_condition, reference_condition, method='wilcoxon', min_cells=10, compute_pct=True, expr_layer=None, expr_threshold=0.0)[source]

Compute within-cluster differential expression between two conditions, and optionally report per-gene detection rate (fraction of cells expressing) in each condition.

For each cluster defined by cluster_key, the function subsets adata to that cluster and runs scanpy.tl.rank_genes_groups comparing target_condition vs reference_condition within condition_key. Clusters are skipped if either group has fewer than min_cells.

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix. Uses adata.X by default for detection-rate computation; DEG testing is performed by Scanpy on the subset AnnData.

  • cluster_key (str) – Column name in adata.obs defining clusters to iterate over.

  • condition_key (str) – Column name in adata.obs defining the two conditions to compare.

  • target_condition (str) – Name of the condition treated as the numerator / “case” group in the comparison (e.g., “Disease”, “Adult”).

  • reference_condition (str) – Name of the condition treated as the reference / “control” group in the comparison (e.g., “Normal”, “Fetal”).

  • method (str, default "wilcoxon") – Method passed to scanpy.tl.rank_genes_groups (e.g., “wilcoxon”, “t-test”, “logreg”).

  • min_cells (int, default 10) – Minimum number of cells required in each condition within a cluster. Clusters not meeting this threshold are skipped.

  • compute_pct (bool, default True) – If True, add detection-rate columns:

    • pct_target: fraction of target-condition cells with expression > expr_threshold

    • pct_reference: fraction of reference-condition cells with expression > expr_threshold

  • expr_layer (str or None, default None) – Which matrix to use for detection-rate computation:

    • None: use adata_c.X

    • str: use adata_c.layers[expr_layer]

    This does not change the DEG test itself (handled by Scanpy).

  • expr_threshold (float, default 0.0) – Expression threshold for defining a gene as “expressed” when computing detection rates. A gene is counted as expressed in a cell if its value is strictly greater than this threshold.

Returns:

Concatenated results across clusters with one row per ranked gene per cluster. Always includes:

  • gene : str

  • logfc : float

  • pvals_adj : float

  • cluster : str

If compute_pct=True, also includes:

  • pct_target : float in [0, 1]

  • pct_reference : float in [0, 1]

Return type:

pandas.DataFrame

Notes

  • Detection rates are computed over all genes in the subset cluster matrix and then indexed to the ranked genes returned by Scanpy.

  • If expr_layer is provided, it must exist in adata.layers.

  • For sparse matrices, detection rates are computed efficiently without densifying.
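The detection-rate computation described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name detection_rate is hypothetical, and the sparse path assumes a non-negative expr_threshold so that implicit zeros never count as expressed.

```python
import numpy as np
from scipy import sparse


def detection_rate(X, expr_threshold=0.0):
    """Fraction of cells (rows) in which each gene (column) is > expr_threshold.

    Hypothetical helper for illustration; not part of sjanpy.
    """
    n_cells, n_genes = X.shape
    if sparse.issparse(X):
        if expr_threshold < 0:
            # Assumption of this sketch: implicit zeros are never "expressed".
            raise ValueError("sparse path assumes expr_threshold >= 0")
        X = X.tocsr()
        # Only stored values can exceed a non-negative threshold, so the
        # matrix is never densified: count passing entries per column index.
        expressed = X.data > expr_threshold
        counts = np.bincount(X.indices[expressed], minlength=n_genes)
    else:
        counts = (np.asarray(X) > expr_threshold).sum(axis=0)
    return counts / n_cells


# Toy matrix: 4 cells x 3 genes
X_sparse = sparse.csr_matrix(
    np.array([[0, 2, 1],
              [0, 0, 3],
              [1, 0, 4],
              [0, 0, 0]], dtype=float)
)
rates = detection_rate(X_sparse)
```

Applying the same helper separately to the target-condition and reference-condition rows of a cluster would yield values analogous to pct_target and pct_reference.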

sjanpy.tl.deg.clip_logfc_in_nested_deg_df(df, logfc_col='logfc', cluster_col='cluster', quantile=0.95)[source]

Clip logfc values by quantile within each cluster.
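Per-cluster quantile clipping could look roughly like the sketch below. The documentation does not state how the bounds are derived, so this example assumes symmetric clipping at the given quantile of |logfc| within each cluster; the function name clip_logfc_by_cluster is hypothetical and the actual implementation may differ.

```python
import numpy as np
import pandas as pd


def clip_logfc_by_cluster(df, logfc_col="logfc", cluster_col="cluster", quantile=0.95):
    """Clip logfc to [-q, q], where q is a per-cluster quantile of |logfc|.

    Illustrative sketch only; symmetric clipping is an assumption.
    """
    out = df.copy()
    # Per-row bound: the chosen quantile of |logfc| within the row's cluster.
    bound = (
        out[logfc_col]
        .abs()
        .groupby(out[cluster_col])
        .transform(lambda s: s.quantile(quantile))
    )
    # Positional clipping avoids any index-alignment surprises.
    out[logfc_col] = np.clip(
        out[logfc_col].to_numpy(), -bound.to_numpy(), bound.to_numpy()
    )
    return out


deg_df = pd.DataFrame(
    {
        "gene": ["g1", "g2", "g3", "g4", "g5", "g6", "g7"],
        "logfc": [-10.0, -1.0, 0.0, 1.0, 2.0, 0.5, 3.0],
        "cluster": ["A", "A", "A", "A", "A", "B", "B"],
    }
)
clipped = clip_logfc_by_cluster(deg_df, quantile=0.5)
```

Clipping each cluster separately keeps one cluster with extreme fold changes from dominating a shared color or axis scale when clusters are plotted side by side.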

sjanpy.tl.deg.generate_highlight_dict(deg_df, strategies=['topn'], cluster_key='cluster', top_n=5, k=3, ktimes_poscut=1.0, ktimes_negcut=-1.0, manual_genes=None, exclude_genes=None, exclude_regex=None)[source]

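Generate a dictionary of genes to highlight for each cluster using one or more strategies, with support for regex-based exclusion.

Regular expression examples:

  1. Exclude mitochondrial genes (starting with MT-): r'^MT-'

  2. Exclude ribosomal genes (starting with RPS or RPL): r'^RP[SL]'

  3. Exclude genes matching a specific pattern (e.g., AC followed by digits and ending in .1): r'^AC\d+\.1$'

  4. Exclude all genes starting with Gm: r'^Gm'

  5. Exclude several patterns at once (separated by |): r'^MT-|^RP[SL]|^Gm'

Parameters:

  • deg_df (pd.DataFrame) – Differential expression results containing columns such as gene, logfc, and cluster.

  • exclude_regex (list of str) – List of regular expressions. Any gene matching at least one of them is excluded.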
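The effect of exclusion patterns like those listed above can be checked with the standard re module before passing them as exclude_regex. This is a small sketch: the helper filter_genes is hypothetical, and whether the library matches with re.search or re.match is an assumption (the two behave identically for patterns anchored with ^).

```python
import re


def filter_genes(genes, exclude_regex=None):
    """Drop every gene that matches at least one pattern (hypothetical helper)."""
    patterns = [re.compile(p) for p in (exclude_regex or [])]
    return [g for g in genes if not any(p.search(g) for p in patterns)]


genes = ["MT-CO1", "RPS6", "RPL13", "AC012345.1", "Gm12345", "ACTB", "GAPDH"]
kept = filter_genes(genes, [r"^MT-", r"^RP[SL]", r"^AC\d+\.1$", r"^Gm"])
```

Note that the \d and \. escapes matter: without them the third pattern would match a literal letter "d" and any character in place of the dot.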

Pearson Residuals

class sjanpy.tl.pres.PearsonResidualsScaler(theta=100, clip=None, feature_names=None)[source]

Bases: object

Implements Analytic Pearson Residuals for scRNA-seq normalization.

This method computes residuals based on a Negative Binomial null model. It is used to identify highly variable genes and to provide a variance-stabilized representation of count data without the need for pseudo-counts or log-transformation.

__init__(theta=100, clip=None, feature_names=None)[source]

Args:

theta (float): Overdispersion parameter. As theta -> infinity, the model converges to a Poisson distribution.

clip (float, optional): Maximum absolute value for residuals. Defaults to sqrt(number of observations).

feature_names (list-like, optional): Names of genes/features for reporting.

diagnose(X)[source]

Performs data integrity checks to identify potential numerical issues.

Args:

X: Input matrix (raw counts).

Returns:

dict: Summary of diagnostic results.

fit(X)[source]

Fits the Pearson Residual model by calculating gene probabilities and diagnostic statistics.

transform(X)[source]

Transforms raw counts into clipped Pearson residuals.

Math: z_ij = (x_ij - mu_ij) / sqrt(mu_ij + mu_ij^2 / theta)
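The formula above can be illustrated with a few lines of NumPy. This is a sketch under stated assumptions rather than the class's actual code: mu_ij is taken to be (total counts of cell i) x (fraction of all counts belonging to gene j), "observations" is read as the number of cells (rows), and every cell and gene is assumed to have at least one count so that mu_ij > 0. The function name pearson_residuals is hypothetical.

```python
import numpy as np


def pearson_residuals(X, theta=100.0, clip=None):
    """Clipped Pearson residuals for a dense cells x genes count matrix.

    Illustrative sketch; assumes every row and column has a nonzero sum.
    """
    X = np.asarray(X, dtype=float)
    n_cells = X.shape[0]
    cell_depth = X.sum(axis=1, keepdims=True)            # n_i, shape (cells, 1)
    gene_prob = X.sum(axis=0, keepdims=True) / X.sum()   # p_j, shape (1, genes)
    mu = cell_depth * gene_prob                          # assumed mu_ij = n_i * p_j
    z = (X - mu) / np.sqrt(mu + mu**2 / theta)           # NB variance: mu + mu^2/theta
    if clip is None:
        clip = np.sqrt(n_cells)                          # default from the docs above
    return np.clip(z, -clip, clip)


counts = np.array([[1, 0, 3],
                   [2, 2, 0],
                   [0, 4, 4]])
Z = pearson_residuals(counts, theta=100.0)
```

As theta grows, the mu^2/theta term vanishes and the denominator approaches sqrt(mu), which is the Poisson limit mentioned for the theta parameter.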

get_statistics()[source]

Returns a DataFrame containing gene-wise fitting parameters and diagnostic statistics.

fit_transform(X)[source]

Fit the model and return the transformed residuals.