Adding a New Species
stangene is designed to be extensible. Adding support for a new model organism requires three steps.
Step 1: Add a SpeciesConfig
In src/stangene/species.py, add an entry to _SPECIES_CONFIGS:
_SPECIES_CONFIGS["rat"] = SpeciesConfig(
    name="rat",
    ensembl_prefix="ENSRNOG",
    transcript_prefix="ENSRNOT",
    naming_convention="capitalized",  # or "uppercase"
    reference_sources={
        "rgd": {
            "url": "https://download.rgd.mcw.edu/data_release/GENES_RAT.txt",
            "description": "RGD rat gene annotations",
        },
    },
)
If the new species has unique feature classification patterns (e.g., a new Ensembl prefix), add them to CLASSIFICATION_PATTERNS:
CLASSIFICATION_PATTERNS.insert(0, (re.compile(r"^ENSRNOG\d+"), "gene"))
CLASSIFICATION_PATTERNS.insert(0, (re.compile(r"^ENSRNOT\d+"), "transcript"))
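Inserting at index 0 only matters if the patterns are tried in order and the first match wins. That ordering is an assumption here, and the `classify` helper and the starting pattern list below are hypothetical stand-ins rather than stangene's own code. Under those assumptions, a minimal sketch of the effect:

```python
import re

# Hypothetical pattern list: (compiled regex, label) pairs, tried in order.
CLASSIFICATION_PATTERNS = [
    (re.compile(r"^ENSG\d+"), "gene"),
    (re.compile(r"^ENST\d+"), "transcript"),
    (re.compile(r"^[A-Za-z][\w.-]*$"), "symbol"),  # broad fallback
]


def classify(feature_id: str) -> str:
    """Return the label of the first matching pattern (assumed first-match-wins)."""
    for pattern, label in CLASSIFICATION_PATTERNS:
        if pattern.match(feature_id):
            return label
    return "unknown"


# Without rat-specific patterns, the rat gene ID falls through to the fallback.
before = classify("ENSRNOG00000012345")

# Rat-specific patterns go in front so they are tried before the fallback.
CLASSIFICATION_PATTERNS.insert(0, (re.compile(r"^ENSRNOG\d+"), "gene"))
CLASSIFICATION_PATTERNS.insert(0, (re.compile(r"^ENSRNOT\d+"), "transcript"))

after_gene = classify("ENSRNOG00000012345")
after_transcript = classify("ENSRNOT00000054321")
```

If the real list is matched in a different order, the insertion position would need to change accordingly.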
Step 2: Implement a reference builder
In src/stangene/references.py, add a _build_rat_reference() function:
# Needed at the top of references.py if not already imported
import hashlib
from datetime import datetime, timezone


def _build_rat_reference(config, ref_dir: str) -> None:
    """Build rat reference from RGD gene data."""
    url = config.reference_sources["rgd"]["url"]
    raw_data = _download_file(url)
    checksum = hashlib.sha256(raw_data).hexdigest()

    # Parse the source file into a gene_table DataFrame with columns:
    # ensembl_id, symbol, alias_symbols, prev_symbols, gene_type, status, source, source_id
    gene_table = ...  # your parsing logic here

    # Build the flattened symbol lookup index
    symbol_lookup = _build_symbol_lookup(gene_table, source="RGD")

    # Save metadata
    metadata = {
        "species": "rat",
        "download_timestamp": datetime.now(timezone.utc).isoformat(),
        "sources": {"rgd": {"url": url, "sha256": checksum, "rows": len(gene_table)}},
        "gene_count": len(gene_table),
        "lookup_count": len(symbol_lookup),
    }
    _save_reference(ref_dir, gene_table, symbol_lookup, metadata)
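The parsing step is left to the implementer. As a rough sketch of what it might look like for a tab-delimited download, the example below uses only the standard library and mock data. The column names (`GENE_RGD_ID`, `SYMBOL`, `OLD_SYMBOL`, `ENSEMBL_ID`, `GENE_TYPE`), the `#` comment lines, and the `;` separator are assumptions about the source file, and `parse_rgd_rows` is a hypothetical helper. Check the real file header before relying on any of them. A real builder would produce a DataFrame rather than a list of dicts.

```python
import csv

# Mock excerpt; the layout is an assumption, not a verified copy of the RGD file.
MOCK_RGD_TEXT = (
    "# comment line\n"
    "GENE_RGD_ID\tSYMBOL\tOLD_SYMBOL\tENSEMBL_ID\tGENE_TYPE\n"
    "1001\tGenea\tGa1;Ga2\tENSRNOG00000000001\tprotein-coding\n"
    "1002\tGeneb\t\t\tncrna\n"
)


def parse_rgd_rows(text: str) -> list[dict]:
    """Map mock RGD rows onto the gene_table column names (hypothetical helper)."""
    # Drop blank lines and comment lines before handing the rest to csv.
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    rows = []
    for rec in reader:
        old = [s.strip() for s in rec["OLD_SYMBOL"].split(";") if s.strip()]
        rows.append({
            "ensembl_id": rec["ENSEMBL_ID"] or None,  # null when not annotated
            "symbol": rec["SYMBOL"],
            "alias_symbols": "",                      # not present in this mock
            "prev_symbols": "|".join(old),            # pipe-delimited string
            "gene_type": rec["GENE_TYPE"],
            "status": "",                             # not present in this mock
            "source": "RGD",
            "source_id": rec["GENE_RGD_ID"],
        })
    return rows


rows = parse_rgd_rows(MOCK_RGD_TEXT)
```

The second mock row has no Ensembl ID, so its `ensembl_id` is `None` and `source_id` is the only identifier it carries.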
Step 3: Register the builder
In the build_reference() function, add the dispatch:
elif config.name == "rat":
    _build_rat_reference(config, ref_dir)
Key requirements
- The gene_table must have columns: ensembl_id, symbol, alias_symbols, prev_symbols, gene_type, status, source, source_id
- ensembl_id can be null for genes without Ensembl annotation; source_id serves as fallback
- alias_symbols and prev_symbols are pipe-delimited strings
- _build_symbol_lookup() handles the flattening automatically
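To make the pipe-delimited format and the identifier fallback concrete, here is a small sketch of what flattening could look like. It is not the implementation of `_build_symbol_lookup()`: `primary_id` and `flatten_symbols` are hypothetical helpers, plain dicts stand in for DataFrame rows, and the sketch assumes the fallback means `source_id` is used as the identifier whenever `ensembl_id` is null.

```python
def primary_id(row: dict) -> str:
    """Use ensembl_id when present, otherwise fall back to source_id (assumed)."""
    return row["ensembl_id"] or row["source_id"]


def flatten_symbols(rows: list[dict]) -> list[tuple[str, str, str]]:
    """Expand each row into (symbol, kind, gene_id) entries (hypothetical helper)."""
    entries = []
    for row in rows:
        gene_id = primary_id(row)
        entries.append((row["symbol"], "symbol", gene_id))
        for column, kind in (("alias_symbols", "alias"), ("prev_symbols", "previous")):
            # Split the pipe-delimited string; skip empty pieces.
            for name in (row[column] or "").split("|"):
                if name:
                    entries.append((name, kind, gene_id))
    return entries


# Mock rows; all symbols and identifiers are made up.
rows = [
    {"ensembl_id": "ENSRNOG00000000001", "symbol": "Genea",
     "alias_symbols": "Ga1|Ga2", "prev_symbols": "Oldga", "source_id": "RGD:1001"},
    {"ensembl_id": None, "symbol": "Geneb",
     "alias_symbols": "", "prev_symbols": "", "source_id": "RGD:1002"},
]
lookup = flatten_symbols(rows)
```

Each alias and previous symbol becomes its own entry pointing back to the same gene, which is why the lookup count in the metadata can exceed the gene count.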
Testing
Add test fixtures with mock data for your species in tests/test_references.py, following the pattern of the existing human (HGNC) and mouse (MGI) tests.
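As a starting point, a mock-data test might look like the sketch below. It does not reproduce the existing human or mouse tests: `MOCK_RAT_ROWS`, `find_problems`, and the test names are hypothetical, and plain dicts stand in for the DataFrame so the example runs without stangene installed. A real test would call the builder helpers on the mock data instead.

```python
REQUIRED_COLUMNS = {
    "ensembl_id", "symbol", "alias_symbols", "prev_symbols",
    "gene_type", "status", "source", "source_id",
}

# Mock rat rows; all identifiers and symbols are made up for the test.
MOCK_RAT_ROWS = [
    {"ensembl_id": "ENSRNOG00000000001", "symbol": "Genea",
     "alias_symbols": "Ga1|Ga2", "prev_symbols": "", "gene_type": "protein-coding",
     "status": "active", "source": "RGD", "source_id": "1001"},
    {"ensembl_id": None, "symbol": "Geneb",
     "alias_symbols": "", "prev_symbols": "Oldgb", "gene_type": "ncrna",
     "status": "active", "source": "RGD", "source_id": "1002"},
]


def find_problems(rows: list[dict]) -> list[str]:
    """Report rows that break the key requirements (hypothetical checker)."""
    problems = []
    for i, row in enumerate(rows):
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            problems.append(f"row {i}: missing {sorted(missing)}")
        # A null ensembl_id is allowed only if source_id can serve as fallback.
        if not row.get("ensembl_id") and not row.get("source_id"):
            problems.append(f"row {i}: no ensembl_id and no source_id")
    return problems


def test_mock_rat_rows_meet_requirements():
    assert find_problems(MOCK_RAT_ROWS) == []


def test_missing_identifier_is_reported():
    broken = [dict(MOCK_RAT_ROWS[0], ensembl_id=None, source_id="")]
    assert find_problems(broken) == ["row 0: no ensembl_id and no source_id"]
```

Both functions follow the `test_` naming that pytest collects by default, so they can be dropped into a test module and adapted.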