Pearson Residuals Normalization

PearsonResidualsScaler provides analytic Pearson residuals normalization — a variance-stabilizing alternative to the standard log1p workflow. It models gene expression as a Negative Binomial distribution and computes clipped residuals.

Creating synthetic count data

import numpy as np
import scipy.sparse as sp

np.random.seed(42)
n_cells, n_genes = 500, 200

# Simulate sparse count matrix
X = np.random.negative_binomial(n=5, p=0.3, size=(n_cells, n_genes))
X = sp.csr_matrix(X)
gene_names = [f"Gene_{i}" for i in range(n_genes)]

Diagnosing the input

Before fitting, run diagnostics to check for data issues:

from sjanpy.tl import PearsonResidualsScaler

scaler = PearsonResidualsScaler(theta=100, feature_names=gene_names)
report = scaler.diagnose(X)
print(report)

This reports NaN/Inf values, negative values, and zero-count genes.

Fitting and transforming

residuals = scaler.fit_transform(X)
print(residuals.shape)  # (500, 200)
print(residuals.min(), residuals.max())

The theta parameter controls overdispersion (higher = closer to Poisson). Residuals are clipped to [-sqrt(n_cells), sqrt(n_cells)] by default.

Custom clipping:

scaler = PearsonResidualsScaler(theta=100, clip=30, feature_names=gene_names)
residuals = scaler.fit_transform(X)

Inspecting gene statistics

After fitting, retrieve per-gene statistics:

stats = scaler.get_statistics()
print(stats.head())

The DataFrame includes:

mean_counts: observed mean expression
residual_variance: variance of the Pearson residuals (higher = more variable)
gene_probability: estimated relative abundance
is_zero_count: flag for unexpressed genes

You can use residual_variance to select highly variable genes:

hvg = stats.nlargest(50, 'residual_variance')
print(hvg.index.tolist())

Using with AnnData

import anndata as ad

adata = ad.AnnData(X=X)
adata.var_names = gene_names

scaler = PearsonResidualsScaler(
    theta=100,
    feature_names=adata.var_names,
)
adata.layers['pearson_residuals'] = scaler.fit_transform(adata.X)