Preprocessing (`sjanpy.pp`)

Gene Filtering

sjanpy.pp.genecraft.filter_human_sc_genes(adata, remove_predicted=True, remove_non_coding=True, remove_antisense=True, remove_ig_var=True, remove_hb=True, remove_metallothionein=True, remove_histone=False, remove_mt_encoded=False, remove_ribo=False, mask_hvg_only=True)

Comprehensive filtering of uninformative genes for human scRNA-seq data.

If mask_hvg_only is True, sc.pp.highly_variable_genes must already have been run on the adata object.

sjanpy.pp.genecraft.filter_mouse_sc_genes(adata, remove_predicted=True, remove_non_coding=True, remove_antisense=True, remove_ig_var=True, remove_hb=True, remove_metallothionein=True, remove_histone=False, remove_mt_encoded=False, remove_ribo=False, mask_hvg_only=True)

Filtering for mouse (Mus musculus) scRNA-seq data.

sjanpy.pp.genecraft.filter_rat_sc_genes(adata, remove_predicted=True, remove_non_coding=True, remove_antisense=True, remove_ig_var=True, remove_hb=True, remove_metallothionein=True, remove_histone=False, remove_mt_encoded=False, remove_ribo=False, mask_hvg_only=True)

Filtering for rat (Rattus norvegicus) scRNA-seq data.

sjanpy.pp.genecraft.get_background_gene_dict(adata)

Returns an exhaustive dictionary of ‘background’ gene categories for human datasets. The categories are defined by gene-nomenclature patterns and known biological artifacts.
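To illustrate what nomenclature-based categories can look like, here is a minimal sketch using only the standard library. The function name, category keys, and regular expressions are assumptions based on common human gene-naming conventions. They are deliberately incomplete and are not the categories or patterns that sjanpy itself uses.

```python
import re

# Hypothetical patterns based on common human gene-symbol conventions.
# They are illustrative only and do not reproduce sjanpy's own definitions.
BACKGROUND_PATTERNS = {
    "mt_encoded": re.compile(r"^MT-"),                  # mitochondrially encoded, e.g. MT-CO1
    "ribo": re.compile(r"^RP[SL]"),                     # ribosomal proteins, e.g. RPS6, RPL13A
    "hb": re.compile(r"^HB[ABDEGMQZ]\d?$"),             # hemoglobin subunits, e.g. HBB, HBA1
    "metallothionein": re.compile(r"^MT[1-4][A-Z]?$"),  # e.g. MT1X, MT2A, MT3
}


def classify_background_genes(gene_names):
    """Group gene symbols by the first-pass category patterns above."""
    hits = {category: [] for category in BACKGROUND_PATTERNS}
    for gene in gene_names:
        for category, pattern in BACKGROUND_PATTERNS.items():
            if pattern.match(gene):
                hits[category].append(gene)
    return hits


genes = ["MT-CO1", "RPS6", "RPL13A", "HBB", "HBA1", "GAPDH", "MT1X", "HBP1"]
background = classify_background_genes(genes)
```

In an AnnData object the gene symbols would typically come from adata.var_names. Note that HBP1 and GAPDH match none of the patterns, which is why the hemoglobin pattern is anchored at both ends rather than matching every symbol that starts with HB.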

HVG Selection

Stratified Splitting

Stratified train / val / test splitting for single-cell obs DataFrames.

sjanpy.pp.split.stratified_split(obs: DataFrame, stratify_col: str, val_ratio: float = 0.05, test_ratio: float = 0.05, seed: int = 42) → DataFrame

Two-stage stratified split into train / val / test.

Parameters:

obs – Cell-level metadata (one row per cell).
stratify_col – Column in obs used for stratification (e.g. "cell_type").
val_ratio – Fraction of total cells assigned to the validation set.
test_ratio – Fraction of total cells assigned to the test set.
seed – Random seed for reproducibility.

Returns:

A DataFrame with two columns: cell_index (integer position) and split (one of "train", "val", "test").

Return type:

pd.DataFrame
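The two-stage idea can be sketched with pandas alone. This is a hypothetical re-implementation for illustration, not sjanpy's actual code: the function name two_stage_stratified_split and the order of the stages (test first, then validation from the remainder) are assumptions. Because both ratios are fractions of the total cell count, the second stage rescales val_ratio by the share of cells left after the first stage.

```python
import pandas as pd


def two_stage_stratified_split(obs, stratify_col, val_ratio=0.05, test_ratio=0.05, seed=42):
    """Illustrative two-stage stratified split (not sjanpy's implementation)."""
    # Work on integer positions so the result does not depend on obs.index.
    frame = pd.DataFrame({
        "cell_index": range(len(obs)),
        "label": obs[stratify_col].to_numpy(),
    })

    # Stage 1: draw the test cells from every stratum.
    test = frame.groupby("label", group_keys=False).sample(
        frac=test_ratio, random_state=seed
    )
    rest = frame.drop(test.index)

    # Stage 2: draw the validation cells from what is left. The fraction is
    # rescaled so that val_ratio still refers to the total number of cells.
    val_frac = val_ratio / (1.0 - test_ratio)
    val = rest.groupby("label", group_keys=False).sample(
        frac=val_frac, random_state=seed
    )

    # Everything not drawn in either stage is training data.
    split = pd.Series("train", index=frame.index)
    split.loc[test.index] = "test"
    split.loc[val.index] = "val"
    return pd.DataFrame({"cell_index": frame["cell_index"], "split": split})


# Toy metadata: 200 cells of type "A" and 100 of type "B".
obs = pd.DataFrame({"cell_type": ["A"] * 200 + ["B"] * 100})
result = two_stage_stratified_split(obs, "cell_type", val_ratio=0.1, test_ratio=0.1)
```

With these toy inputs each cell type contributes 10% of its cells to the test set and 10% to the validation set, so rare cell types stay represented in every split.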
