Quick Start¶
Installation¶
pip install -e .
# With development dependencies (pytest, scanpy):
pip install -e ".[dev]"
Dependencies: pandas, anndata, pyarrow. Downloads use stdlib urllib (no requests).
Step 1: Build reference data¶
Before harmonizing, build the reference annotation databases. This is a one-time step per species.
stangene build-refs --species human # downloads HGNC (~15 MB)
stangene build-refs --species mouse # downloads MGI + BioMart (~10 MB)
Or from Python:
from stangene import build_reference
build_reference("human")
build_reference("mouse")
References are stored in a local references/ directory (gitignored by default). Re-run with --force to update from the latest upstream sources.
Step 2: Harmonize a dataset¶
Python API¶
import stangene
result = stangene.run(
path="my_data.h5ad", # or .tsv / .csv
species="human", # or "mouse"
output_dir="results/", # where to write reports
dataset_name="pbmc_10k", # optional label
)
# Inspect results programmatically
print(result.stats)
print(result.mapping_table.head())
CLI¶
stangene harmonize --input my_data.h5ad --species human --output-dir results/
Step 3: Review outputs¶
After running, the output directory contains:
File |
Contents |
|---|---|
|
Full mapping table, one row per original feature |
|
Dataset-level statistics as JSON |
|
Human-readable markdown report |
|
Many-to-one collisions, ambiguities, outdated names |
|
Unmapped features for manual review |
|
Enriched h5ad with harmonization columns in |
The markdown report (report.md) is the best starting point for understanding your results. It includes summary tables, tier breakdowns, conflict details, and warnings about potential issues like Excel-corrupted gene names.
Example output¶
Running on the 10x pbmc3k dataset (32,738 human genes):
Stats: {
'exact_id': 24260, # 74.1% - matched by Ensembl ID
'exact_symbol': 411, # 1.3% - matched by approved symbol
'previous_symbol': 172, # 0.5% - matched by old gene name
'alias_symbol': 33, # 0.1% - matched by alias
'unmapped': 7859, # 24.0% - GENCODE novel loci not in HGNC
'ambiguous': 3
}