Tools (sjanpy.tl)
Differential Expression
- sjanpy.tl.deg.fast_two_group_deg(adata, label_col, lst1, lst2)[source]
High-speed DEG calculation using vectorized operations. Focuses on raw matrix extraction and batch statistics.
- sjanpy.tl.deg.compute_nested_deg_df(adata, cluster_key, condition_key, target_condition, reference_condition, method='wilcoxon', min_cells=10, compute_pct=True, expr_layer=None, expr_threshold=0.0)[source]
Compute within-cluster differential expression between two conditions, and optionally report per-gene detection rate (fraction of cells expressing) in each condition.
For each cluster defined by cluster_key, the function subsets adata to that cluster and runs scanpy.tl.rank_genes_groups comparing target_condition vs reference_condition within condition_key. Clusters are skipped if either group has fewer than min_cells.
- Parameters:
adata (anndata.AnnData) – Annotated data matrix. Uses adata.X by default for detection-rate computation; DEG testing is performed by Scanpy on the subset AnnData.
cluster_key (str) – Column name in adata.obs defining clusters to iterate over.
condition_key (str) – Column name in adata.obs defining the two conditions to compare.
target_condition (str) – Name of the condition treated as the numerator / “case” group in the comparison (e.g., “Disease”, “Adult”).
reference_condition (str) – Name of the condition treated as the reference / “control” group in the comparison (e.g., “Normal”, “Fetal”).
method (str, default "wilcoxon") – Method passed to scanpy.tl.rank_genes_groups (e.g., “wilcoxon”, “t-test”, “logreg”).
min_cells (int, default 10) – Minimum number of cells required in each condition within a cluster. Clusters not meeting this threshold are skipped.
compute_pct (bool, default True) – If True, add detection-rate columns:
- pct_target: fraction of target-condition cells with expression > expr_threshold
- pct_reference: fraction of reference-condition cells with expression > expr_threshold
expr_layer (str or None, default None) – Which matrix to use for detection-rate computation, where adata_c is the per-cluster subset of adata:
- None: use adata_c.X
- str: use adata_c.layers[expr_layer]
This does not change the DEG test itself (handled by Scanpy).
expr_threshold (float, default 0.0) – Expression threshold for defining a gene as “expressed” when computing detection rates. A gene is counted as expressed in a cell if its value is strictly greater than this threshold.
- Returns:
Concatenated results across clusters with one row per ranked gene per cluster. Always includes:
- gene : str
- logfc : float
- pvals_adj : float
- cluster : str
If compute_pct=True, also includes:
- pct_target : float in [0, 1]
- pct_reference : float in [0, 1]
- Return type:
pandas.DataFrame
Notes
Detection rates are computed over all genes in the subset cluster matrix and then indexed to the ranked genes returned by Scanpy.
If expr_layer is provided, it must exist in adata.layers.
For sparse matrices, detection rates are computed efficiently without densifying.
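The detection-rate computation described in these notes can be sketched as follows. This is a minimal illustration, not the sjanpy implementation: the helper name detection_rate is hypothetical, and it assumes cells are rows, genes are columns, and expr_threshold is non-negative when the input is sparse.

```python
import numpy as np
from scipy import sparse


def detection_rate(X, expr_threshold=0.0):
    """Fraction of cells (rows) in which each gene (column) exceeds expr_threshold.

    Hypothetical helper for illustration; not part of sjanpy.
    """
    n_cells = X.shape[0]
    if sparse.issparse(X):
        # With a non-negative threshold, implicit zeros can never satisfy
        # `> threshold`, so the comparison result stays sparse.
        if expr_threshold < 0:
            raise ValueError("this sketch assumes expr_threshold >= 0 for sparse input")
        counts = np.asarray((X > expr_threshold).sum(axis=0)).ravel()
    else:
        counts = (np.asarray(X) > expr_threshold).sum(axis=0)
    return counts / n_cells


# Toy matrix: 4 cells x 3 genes
X = sparse.csr_matrix(np.array([[0, 2, 0],
                                [1, 0, 0],
                                [3, 4, 0],
                                [0, 1, 0]], dtype=float))
pct = detection_rate(X)
```

Because the rate is computed for every gene column at once, the resulting vector can then be indexed by the ranked gene names that Scanpy returns.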
- sjanpy.tl.deg.clip_logfc_in_nested_deg_df(df, logfc_col='logfc', cluster_col='cluster', quantile=0.95)[source]
Clip logfc values by quantile within each cluster.
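One plausible reading of per-cluster quantile clipping is sketched below with pandas. The exact rule sjanpy applies is not documented here, so this is an assumption: the cap for each cluster is taken as the given quantile of |logfc| within that cluster, and values are clipped symmetrically to [-cap, cap]. The function name clip_logfc_by_cluster is hypothetical.

```python
import pandas as pd


def clip_logfc_by_cluster(df, logfc_col="logfc", cluster_col="cluster", quantile=0.95):
    """Clip logfc symmetrically at a per-cluster quantile of |logfc|.

    Hypothetical sketch; the symmetric-cap rule is an assumption.
    """
    out = df.copy()
    # One cap per row: the requested quantile of |logfc| in that row's cluster.
    cap = (
        out[logfc_col]
        .abs()
        .groupby(out[cluster_col])
        .transform(lambda s: s.quantile(quantile))
    )
    out[logfc_col] = out[logfc_col].clip(lower=-cap, upper=cap)
    return out


deg = pd.DataFrame({
    "gene": list("abcdefgh"),
    "cluster": ["A"] * 4 + ["B"] * 4,
    "logfc": [1.0, 2.0, 3.0, 100.0, -10.0, -1.0, 1.0, 2.0],
})
clipped = clip_logfc_by_cluster(deg, quantile=0.5)
```

Clipping per cluster rather than globally keeps a single extreme cluster from setting the scale for every other cluster, which is usually the point when the values feed a shared plot axis.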
- sjanpy.tl.deg.generate_highlight_dict(deg_df, strategies=['topn'], cluster_key='cluster', top_n=5, k=3, ktimes_poscut=1.0, ktimes_negcut=-1.0, manual_genes=None, exclude_genes=None, exclude_regex=None)[source]
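Generate a dictionary of genes to highlight for each cluster using one or more strategies, with support for regex-based exclusion.
Regular expression examples:
Exclude mitochondrial genes (starting with MT-): r'^MT-'
Exclude ribosomal genes (starting with RPS or RPL): r'^RP[SL]'
Exclude genes matching a specific pattern (e.g., AC followed by digits and ending in .1): r'^AC\d+\.1$'
Exclude all genes starting with Gm: r'^Gm'
Exclude several patterns at once (separated by |): r'^MT-|^RP[SL]|^Gm'
- Parameters:
deg_df (pd.DataFrame) – Differential expression results containing columns such as gene, logfc, and cluster.
exclude_regex (list of str) – List of regular expressions. Any gene matching at least one of them is excluded.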
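The exclusion rule for exclude_regex can be illustrated with the standard library alone. This is a sketch under assumptions: the helper filter_genes is hypothetical, and it uses re.search, which may or may not match how sjanpy applies the patterns internally. Since every example pattern is anchored with ^, search and match behave the same for them.

```python
import re


def filter_genes(genes, exclude_regex=None):
    """Drop any gene name that matches at least one pattern in exclude_regex.

    Hypothetical helper for illustration; not part of sjanpy.
    """
    patterns = [re.compile(p) for p in (exclude_regex or [])]
    return [g for g in genes if not any(p.search(g) for p in patterns)]


genes = ["MT-CO1", "RPS6", "RPL13", "AC012345.1", "Gm1234", "ACTB", "GAPDH"]
kept = filter_genes(genes, [r"^MT-", r"^RP[SL]", r"^AC\d+\.1$", r"^Gm"])
```

Note that ACTB survives the r'^AC\d+\.1$' pattern because the pattern requires digits directly after AC and a literal .1 at the end.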
Pearson Residuals
- class sjanpy.tl.pres.PearsonResidualsScaler(theta=100, clip=None, feature_names=None)[source]
Bases: object
Implements analytic Pearson residuals for scRNA-seq normalization.
This method computes residuals based on a Negative Binomial null model. It is used to identify highly variable genes and to provide a variance-stabilized representation of count data without the need for pseudo-counts or log-transformation.
- __init__(theta=100, clip=None, feature_names=None)[source]
- Args:
theta (float): Overdispersion parameter. As theta -> infinity, the model converges to a Poisson distribution.
clip (float, optional): Maximum absolute value for residuals. Defaults to sqrt(number of observations).
feature_names (list-like, optional): Names of genes/features for reporting.
- diagnose(X)[source]
Performs data integrity checks to identify potential numerical issues.
- Args:
X: Input matrix (raw counts).
- Returns:
dict: Summary of diagnostic results.
- fit(X)[source]
Fits the Pearson Residual model by calculating gene probabilities and diagnostic statistics.
- transform(X)[source]
Transforms raw counts into clipped Pearson residuals.
Math: z_ij = (x_ij - mu_ij) / sqrt(mu_ij + mu_ij^2 / theta)
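The formula above can be sketched in NumPy as follows. This is not the class's actual code: the function name pearson_residuals is hypothetical, and it assumes the usual analytic form mu_ij = n_i * p_j, where n_i is the total count of cell i and p_j is gene j's share of all counts. It also assumes that "number of observations" in the clip default means the number of cells (rows).

```python
import numpy as np


def pearson_residuals(X, theta=100.0, clip=None):
    """Clipped Pearson residuals under a negative binomial null model.

    Hypothetical sketch; assumes mu_ij = n_i * p_j and rows = cells.
    Genes with zero total counts give mu = 0 and hence NaN residuals,
    so they should be filtered out beforehand.
    """
    X = np.asarray(X, dtype=float)
    n_cells = X.shape[0]
    cell_totals = X.sum(axis=1, keepdims=True)           # n_i, shape (cells, 1)
    gene_probs = X.sum(axis=0, keepdims=True) / X.sum()  # p_j, shape (1, genes)
    mu = cell_totals * gene_probs                        # expected counts mu_ij
    z = (X - mu) / np.sqrt(mu + mu**2 / theta)
    if clip is None:
        clip = np.sqrt(n_cells)                          # assumed default
    return np.clip(z, -clip, clip)


counts = np.array([[1, 2],
                   [3, 4]])
Z = pearson_residuals(counts, theta=100.0)
```

With a very large theta the mu^2 / theta term vanishes and the residuals reduce to the Poisson form (x - mu) / sqrt(mu), consistent with the description of theta in the constructor arguments.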