Gene Filtering

sjanpy.pp.genecraft provides organism-specific functions to remove or mask uninformative genes from scRNA-seq data — predicted genes, non-coding RNAs, hemoglobin, metallothioneins, and more.

Listing background genes

Use get_background_gene_dict() to see which genes in your dataset fall into artifact categories:

import scanpy as sc
from sjanpy.pp import get_background_gene_dict

adata = sc.datasets.pbmc3k_processed()

bg = get_background_gene_dict(adata)
for category, genes in bg.items():
    print(f"{category}: {len(genes)} genes — {genes[:5]}")

Categories include Mito_Encoded, Ribosomal, Hemoglobin, HSP, IEG, Cell_Cycle, Histone, Genomic_Clone, Predicted_LOC, and more.

Masking genes from HVG selection

The recommended approach: keep the genes in the matrix but prevent them from driving PCA/clustering by setting highly_variable = False:

from sjanpy.pp import filter_human_sc_genes

# Requires sc.pp.highly_variable_genes to have been run
sc.pp.highly_variable_genes(adata)
print(f"HVGs before: {adata.var['highly_variable'].sum()}")

adata = filter_human_sc_genes(
    adata,
    mask_hvg_only=True,        # default: mask, don't remove
    remove_predicted=True,
    remove_non_coding=True,
    remove_antisense=True,
    remove_ig_var=True,
    remove_hb=True,
    remove_metallothionein=True,
    remove_mt_encoded=False,    # keep MT- for QC
    remove_ribo=False,          # keep ribosomal for QC
)
print(f"HVGs after: {adata.var['highly_variable'].sum()}")

Physically removing genes

Set mask_hvg_only=False to remove genes from the AnnData entirely:

n_before = adata.n_vars
adata = filter_human_sc_genes(adata, mask_hvg_only=False)
print(f"Genes: {n_before} -> {adata.n_vars}")

Mouse and rat data

Separate functions handle organism-specific naming conventions:

from sjanpy.pp import filter_mouse_sc_genes, filter_rat_sc_genes

# Mouse: Gm... predicted genes, mt- mito, Rp[sl] ribosomal
adata_mouse = filter_mouse_sc_genes(adata_mouse, mask_hvg_only=True)

# Rat: LOC/RGD predicted genes, Mt- mito
adata_rat = filter_rat_sc_genes(adata_rat, mask_hvg_only=True)

Choosing what to remove

Each gene category can be toggled independently. Typical choices:

Parameter

Default

Rationale

remove_predicted

True

LOC/AC/AL clones add noise

remove_non_coding

True

LINC/MIR/SNOR rarely informative

remove_antisense

True

-AS transcripts confound analyses

remove_ig_var

True

IG variable regions dominate B cell PCA

remove_hb

True

Hemoglobin contamination

remove_metallothionein

True

Stress response artifact

remove_mt_encoded

False

Keep for QC (% mitochondrial)

remove_ribo

False

Keep for QC (% ribosomal)

remove_histone

False

Usually fine unless studying cell cycle