Pearson Residuals Normalization
PearsonResidualsScaler provides analytic Pearson
residuals normalization — a variance-stabilizing alternative to the standard
log1p workflow. It models gene expression as a Negative Binomial
distribution and computes clipped residuals.
Creating synthetic count data
import numpy as np
import scipy.sparse as sp
np.random.seed(42)
n_cells, n_genes = 500, 200
# Simulate sparse count matrix
X = np.random.negative_binomial(n=5, p=0.3, size=(n_cells, n_genes))
X = sp.csr_matrix(X)
gene_names = [f"Gene_{i}" for i in range(n_genes)]
Diagnosing the input
Before fitting, run diagnostics to check for data issues:
from sjanpy.tl import PearsonResidualsScaler
scaler = PearsonResidualsScaler(theta=100, feature_names=gene_names)
report = scaler.diagnose(X)
print(report)
This reports NaN/Inf values, negative values, and zero-count genes.
Fitting and transforming
residuals = scaler.fit_transform(X)
print(residuals.shape) # (500, 200)
print(residuals.min(), residuals.max())
The theta parameter controls overdispersion (higher = closer to Poisson).
Residuals are clipped to [-sqrt(n_cells), sqrt(n_cells)] by default.
Custom clipping:
scaler = PearsonResidualsScaler(theta=100, clip=30, feature_names=gene_names)
residuals = scaler.fit_transform(X)
Inspecting gene statistics
After fitting, retrieve per-gene statistics:
stats = scaler.get_statistics()
print(stats.head())
The DataFrame includes:
mean_counts: observed mean expressionresidual_variance: variance of the Pearson residuals (higher = more variable)gene_probability: estimated relative abundanceis_zero_count: flag for unexpressed genes
You can use residual_variance to select highly variable genes:
hvg = stats.nlargest(50, 'residual_variance')
print(hvg.index.tolist())
Using with AnnData
import anndata as ad
adata = ad.AnnData(X=X)
adata.var_names = gene_names
scaler = PearsonResidualsScaler(
theta=100,
feature_names=adata.var_names,
)
adata.layers['pearson_residuals'] = scaler.fit_transform(adata.X)