Preprocessing (sjanpy.pp)

Gene Filtering

sjanpy.pp.genecraft.filter_human_sc_genes(adata, remove_predicted=True, remove_non_coding=True, remove_antisense=True, remove_ig_var=True, remove_hb=True, remove_metallothionein=True, remove_histone=False, remove_mt_encoded=False, remove_ribo=False, mask_hvg_only=True)

Comprehensive filtering of uninformative genes for human scRNA-seq data.

If mask_hvg_only is True, sc.pp.highly_variable_genes must already have been run on the adata object.
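The precondition above can be checked before filtering. The following is a minimal sketch, not part of sjanpy: has_hvg_flags is a hypothetical helper, and it assumes the requirement amounts to the "highly_variable" column that sc.pp.highly_variable_genes writes to adata.var being present.

```python
def has_hvg_flags(adata) -> bool:
    """Return True if adata.var carries the 'highly_variable' column.

    Hypothetical helper (not a sjanpy function). scanpy's
    sc.pp.highly_variable_genes adds this boolean column to adata.var
    when it is run in place.
    """
    return "highly_variable" in adata.var.columns


# Intended use (requires scanpy and sjanpy, so it is left as a comment):
# if not has_hvg_flags(adata):
#     sc.pp.highly_variable_genes(adata)
# sjanpy.pp.genecraft.filter_human_sc_genes(adata, mask_hvg_only=True)
```

Checking up front gives a clear decision point instead of relying on whatever error the filter raises when the column is missing.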

sjanpy.pp.genecraft.filter_mouse_sc_genes(adata, remove_predicted=True, remove_non_coding=True, remove_antisense=True, remove_ig_var=True, remove_hb=True, remove_metallothionein=True, remove_histone=False, remove_mt_encoded=False, remove_ribo=False, mask_hvg_only=True)

Filtering for Mouse (Mus musculus) scRNA-seq data.

sjanpy.pp.genecraft.filter_rat_sc_genes(adata, remove_predicted=True, remove_non_coding=True, remove_antisense=True, remove_ig_var=True, remove_hb=True, remove_metallothionein=True, remove_histone=False, remove_mt_encoded=False, remove_ribo=False, mask_hvg_only=True)

Filtering for Rat (Rattus norvegicus) scRNA-seq data.

sjanpy.pp.genecraft.get_background_gene_dict(adata)

Returns an exhaustive dictionary of ‘background’ gene categories for human datasets, derived from nomenclature-based patterns and biological artifacts.
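To illustrate what nomenclature-based categorization can look like, here is a minimal sketch. It is not the sjanpy implementation: the category names, the regular expressions, and the categorize_genes helper are all assumptions chosen for illustration, based on common human gene symbol conventions, and they are far from exhaustive.

```python
import re

# Hypothetical categories and patterns, for illustration only.
BACKGROUND_PATTERNS = {
    "mt_encoded": re.compile(r"^MT-"),                      # e.g. MT-CO1
    "ribosomal": re.compile(r"^RP[SL]\d+[A-Z]?\d*$"),       # e.g. RPS6, RPL13A
    "hemoglobin": re.compile(r"^HB[ABDEGMQZ]\d*$"),         # e.g. HBB, HBA1
    "metallothionein": re.compile(r"^MT[1-4][A-Z]*\d*$"),   # e.g. MT2A, MT1X
    "ig_variable": re.compile(r"^IG[HKL]V"),                # e.g. IGHV3-23
    "antisense": re.compile(r"-AS\d+$"),                    # e.g. ZEB1-AS1
    "non_coding": re.compile(r"^(LINC|MIR)\d"),             # e.g. LINC00467
    "predicted": re.compile(r"^LOC\d+$"),                   # e.g. LOC105371267
}


def categorize_genes(gene_names):
    """Map each category name to the gene symbols matching its pattern."""
    categories = {name: [] for name in BACKGROUND_PATTERNS}
    for gene in gene_names:
        for name, pattern in BACKGROUND_PATTERNS.items():
            if pattern.search(gene):
                categories[name].append(gene)
    return categories
```

The anchored patterns are deliberately strict: a bare prefix test such as "starts with RPS" would also catch the kinase RPS6KA1, and "starts with HB" would catch HBEGF, neither of which belongs in these categories.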

HVG Selection

Stratified Splitting

Stratified train / val / test splitting for single-cell obs DataFrames.

sjanpy.pp.split.stratified_split(obs: DataFrame, stratify_col: str, val_ratio: float = 0.05, test_ratio: float = 0.05, seed: int = 42) → DataFrame

Two-stage stratified split into train / val / test.

Parameters:
  • obs – Cell-level metadata (one row per cell).

  • stratify_col – Column in obs used for stratification (e.g. "cell_type").

  • val_ratio – Fraction of total cells assigned to the validation set.

  • test_ratio – Fraction of total cells assigned to the test set.

  • seed – Random seed for reproducibility.

Returns:

A DataFrame with two columns: cell_index (integer position of the cell in obs) and split (one of "train", "val", "test").

Return type:

pd.DataFrame
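The two-stage idea can be sketched in plain Python. This is an illustration under stated assumptions, not the sjanpy implementation: two_stage_stratified_split and _holdout_per_label are hypothetical names, the input is a list of labels rather than an obs DataFrame, the test set is assumed to be drawn first, and the result is a list of dicts mirroring the cell_index and split columns.

```python
import random
from collections import defaultdict


def _holdout_per_label(groups, fraction, rng):
    """Move round(fraction * n) randomly chosen indices out of each group."""
    kept, held = {}, []
    for label, indices in groups.items():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_held = round(len(shuffled) * fraction)
        held.extend(shuffled[:n_held])
        kept[label] = shuffled[n_held:]
    return kept, held


def two_stage_stratified_split(labels, val_ratio=0.05, test_ratio=0.05, seed=42):
    """Assign every position in `labels` to 'train', 'val', or 'test'."""
    if val_ratio < 0 or test_ratio < 0 or val_ratio + test_ratio >= 1:
        raise ValueError("ratios must be non-negative and sum to less than 1")

    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    # Stage 1: hold out the test fraction of each label group.
    remaining, test_idx = _holdout_per_label(groups, test_ratio, rng)

    # Stage 2: hold out validation cells from what is left. The ratio is
    # rescaled so that val_ratio still refers to the total cell count.
    rescaled = val_ratio / (1.0 - test_ratio)
    remaining, val_idx = _holdout_per_label(remaining, rescaled, rng)

    split = ["train"] * len(labels)
    for idx in test_idx:
        split[idx] = "test"
    for idx in val_idx:
        split[idx] = "val"
    return [{"cell_index": i, "split": s} for i, s in enumerate(split)]
```

The rescaling in stage 2 is the point of the sketch: without it, the validation set would be a fraction of the post-test remainder and come out slightly smaller than requested. Very small label groups can round to zero held-out cells, so rare cell types may appear only in the training set.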