Input Formats¶

stangene supports two input formats. In both cases, only feature metadata is extracted — the expression matrix is never loaded into memory.

h5ad (AnnData)¶

The primary format. stangene reads adata.var and adata.var_names to extract feature metadata.

Recognized adata.var columns:

Column	Maps to
`gene_ids`	`original_feature_id` (Ensembl ID)
`feature_types`	`original_feature_type` (e.g., Gene Expression, Antibody Capture)

When writing results, harmonization columns are added to adata.var in a new *_harmonized.h5ad file. The original var_names are never overwritten.

TSV / CSV¶

stangene auto-detects common column names:

Detected column name	Maps to
`gene`, `gene_name`, `feature_name`, `gene_symbol`, `symbol`	`original_feature_name`
`gene_id`, `gene_ids`, `ensembl_id`, `ensembl_gene_id`, `feature_id`	`original_feature_id`
`feature_types`, `feature_type`	`original_feature_type`

If your columns have different names, pass an explicit column_map:

ft = stangene.load_features(
    "features.tsv",
    species="human",
    column_map={
        "my_gene_col": "original_feature_name",
        "my_id_col": "original_feature_id",
    },
)

File extension determines the delimiter:

.tsv, .txt → tab-separated
.csv → comma-separated