Reference Data
stangene harmonizes gene names against official annotation databases. References must be built once per species before harmonization.
Building references
# CLI
stangene build-refs --species human
stangene build-refs --species mouse
stangene build-refs --species human --force # re-download and rebuild
# Python
from stangene import build_reference
build_reference("human")
build_reference("mouse", force=True)
Human (HGNC)
Source: HGNC complete gene set (~15 MB)
Provides:
Approved gene symbols
Alias (alternative) symbols
Previous (old) symbols
Ensembl gene IDs
HGNC IDs
Gene types (protein-coding, lncRNA, pseudogene, etc.)
Approval status (Approved, Entry Withdrawn)
Mouse (MGI + Ensembl BioMart)
Sources:
MGI marker list (~7 MB) — approved symbols, synonyms, feature types
MGI-to-Ensembl mapping — MGI ID to Ensembl ID links
Ensembl BioMart (supplementary, non-fatal if unavailable) — fills Ensembl ID gaps for mouse genes not covered by MGI mapping
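The "supplementary, non-fatal" behaviour described above can be pictured as a gap-filling step wrapped in a try/except. The following is a minimal sketch under stated assumptions, not stangene's actual implementation: `fill_ensembl_gaps` and `fetch_supplementary` are hypothetical names, and the mapping is assumed to be keyed by MGI ID.

```python
from typing import Callable, Dict, Optional


def fill_ensembl_gaps(
    mgi_to_ensembl: Dict[str, Optional[str]],
    fetch_supplementary: Callable[[], Dict[str, str]],
) -> Dict[str, Optional[str]]:
    """Fill missing Ensembl IDs from a supplementary source, if it is reachable.

    Hypothetical helper: keys are MGI IDs, values are Ensembl gene IDs or None.
    """
    try:
        supplementary = fetch_supplementary()
    except Exception as exc:
        # Non-fatal: report the problem and keep the primary mapping unchanged.
        print(f"supplementary source unavailable, skipping: {exc}")
        return dict(mgi_to_ensembl)

    filled = dict(mgi_to_ensembl)
    for mgi_id, ensembl_id in list(filled.items()):
        # Only fill gaps; IDs from the primary mapping always win.
        if ensembl_id is None and mgi_id in supplementary:
            filled[mgi_id] = supplementary[mgi_id]
    return filled
```

In this sketch the primary mapping always takes precedence, so the supplementary source can only add IDs and never overwrite them. Whether stangene resolves conflicts the same way is not stated here.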
Internal format
Built references are stored as parquet files:
references/<species>/
├── gene_table.parquet # one row per gene
├── symbol_lookup.parquet # flattened symbol → gene index
└── build_metadata.json # source URLs, timestamps, checksums
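A quick way to confirm that a reference has been built is to check for the three files in the layout above. This is a small sketch using only the standard library; `missing_reference_files` is a hypothetical helper, not part of stangene.

```python
from pathlib import Path

# File names taken from the directory layout above.
EXPECTED_FILES = ("gene_table.parquet", "symbol_lookup.parquet", "build_metadata.json")


def missing_reference_files(reference_dir, species):
    """Return the expected files that are absent from <reference_dir>/<species>/."""
    species_dir = Path(reference_dir) / species
    return [name for name in EXPECTED_FILES if not (species_dir / name).is_file()]
```

An empty list means all three files are present; it says nothing about whether their contents are valid.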
gene_table columns
| Column | Description |
|---|---|
| | Ensembl gene ID (nullable for some mouse genes) |
| | Approved gene symbol |
| | Pipe-delimited alias symbols |
| | Pipe-delimited previous symbols |
| | Gene biotype (protein-coding, lncRNA, etc.) |
| | Approval status (Approved / Entry Withdrawn) |
| | Reference authority (HGNC / MGI) |
| | Authority-specific ID (HGNC:12345 / MGI:12345) |
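Because alias and previous symbols are stored as pipe-delimited strings, they need to be split before use. A minimal sketch follows, assuming each field arrives as a Python string or `None`; the helper names are hypothetical and the symbols in the usage line are placeholders.

```python
from typing import List, Optional


def split_pipe_field(value: Optional[str]) -> List[str]:
    """Split a pipe-delimited field into a list, tolerating None and empty strings."""
    if not value:
        return []
    return [part.strip() for part in value.split("|") if part.strip()]


def join_pipe_field(symbols: List[str]) -> str:
    """Inverse of split_pipe_field: join symbols back into one pipe-delimited string."""
    return "|".join(symbols)
```

For example, `split_pipe_field("A1|B2|C3")` returns `["A1", "B2", "C3"]`, and a missing field yields an empty list rather than raising.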
symbol_lookup columns
| Column | Description |
|---|---|
| | The symbol/alias/prev string (original case) |
| | Uppercased for case-insensitive matching |
| | Target Ensembl gene ID (nullable) |
| | Target authority ID (always present) |
| | |
| | Reference authority |
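The uppercased column exists so that a query can be matched regardless of case. The sketch below illustrates the idea with plain Python structures; it is an illustration under assumptions, not stangene's lookup code. The entries and IDs are made up, and `build_index` and `resolve` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each entry: (symbol in original case, target authority ID). Values are made up.
ENTRIES: List[Tuple[str, str]] = [
    ("GeneA", "AUTH:1"),
    ("oldA", "AUTH:1"),
    ("OLDA", "AUTH:2"),
]


def build_index(entries: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group authority IDs by uppercased symbol."""
    index: Dict[str, List[str]] = defaultdict(list)
    for symbol, authority_id in entries:
        key = symbol.upper()
        if authority_id not in index[key]:
            index[key].append(authority_id)
    return index


def resolve(index: Dict[str, List[str]], query: str) -> List[str]:
    """Return every authority ID whose symbol matches the query, ignoring case."""
    return index.get(query.strip().upper(), [])
```

Note that uppercasing can make two distinct symbols collide, which is why `resolve` returns a list: here `resolve(build_index(ENTRIES), "olda")` yields both `AUTH:1` and `AUTH:2`. How stangene itself handles such ambiguous matches is not covered in this section.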
build_metadata.json
Records exactly what was downloaded and when, for reproducibility:
Source URLs
SHA-256 checksums of downloaded files
Download timestamps
Row counts
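To make the checksum and timestamp fields concrete, here is a sketch of how such a record could be produced with the standard library. The key names (`source_url`, `sha256`, `downloaded_at`, `row_count`) and both function names are illustrative assumptions; the actual schema of build_metadata.json may differ.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_build_metadata(out_path, source_url, downloaded_file, row_count):
    """Write a metadata record as JSON; the key names are illustrative only."""
    record = {
        "source_url": source_url,
        "sha256": sha256_of_file(downloaded_file),
        "downloaded_at": datetime.now(timezone.utc).isoformat(),
        "row_count": row_count,
    }
    Path(out_path).write_text(json.dumps(record, indent=2))
    return record
```

Comparing the stored checksum against a fresh `sha256_of_file` result is one way to tell whether a source file has changed between two builds.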
Custom reference directory
By default, references are stored in a references/ directory relative to the package. To use a custom location:
import stangene
from stangene import build_reference
build_reference("human", reference_dir="/path/to/my/refs")
result = stangene.run("data.h5ad", species="human", reference_dir="/path/to/my/refs")
Versioning references
The references/ directory is gitignored by default. If you want to version-control your references (recommended for reproducibility), you can:
Commit the parquet files to a separate git repo or GitHub release
Or remove references/ from .gitignore and commit them directly