Pearson Residuals Normalization ================================ :class:`~sjanpy.tl.pres.PearsonResidualsScaler` provides analytic Pearson residuals normalization — a variance-stabilizing alternative to the standard ``log1p`` workflow. It models gene expression as a Negative Binomial distribution and computes clipped residuals. Creating synthetic count data ----------------------------- .. code-block:: python import numpy as np import scipy.sparse as sp np.random.seed(42) n_cells, n_genes = 500, 200 # Simulate sparse count matrix X = np.random.negative_binomial(n=5, p=0.3, size=(n_cells, n_genes)) X = sp.csr_matrix(X) gene_names = [f"Gene_{i}" for i in range(n_genes)] Diagnosing the input --------------------- Before fitting, run diagnostics to check for data issues: .. code-block:: python from sjanpy.tl import PearsonResidualsScaler scaler = PearsonResidualsScaler(theta=100, feature_names=gene_names) report = scaler.diagnose(X) print(report) This reports NaN/Inf values, negative values, and zero-count genes. Fitting and transforming ------------------------ .. code-block:: python residuals = scaler.fit_transform(X) print(residuals.shape) # (500, 200) print(residuals.min(), residuals.max()) The ``theta`` parameter controls overdispersion (higher = closer to Poisson). Residuals are clipped to ``[-sqrt(n_cells), sqrt(n_cells)]`` by default. Custom clipping: .. code-block:: python scaler = PearsonResidualsScaler(theta=100, clip=30, feature_names=gene_names) residuals = scaler.fit_transform(X) Inspecting gene statistics --------------------------- After fitting, retrieve per-gene statistics: .. code-block:: python stats = scaler.get_statistics() print(stats.head()) The DataFrame includes: - ``mean_counts``: observed mean expression - ``residual_variance``: variance of the Pearson residuals (higher = more variable) - ``gene_probability``: estimated relative abundance - ``is_zero_count``: flag for unexpressed genes You can use ``residual_variance`` to select highly variable genes: .. code-block:: python hvg = stats.nlargest(50, 'residual_variance') print(hvg.index.tolist()) Using with AnnData ------------------- .. code-block:: python import anndata as ad adata = ad.AnnData(X=X) adata.var_names = gene_names scaler = PearsonResidualsScaler( theta=100, feature_names=adata.var_names, ) adata.layers['pearson_residuals'] = scaler.fit_transform(adata.X)