Quick Start

Installation

pip install -e .

# With development dependencies (pytest, scanpy):
pip install -e ".[dev]"

Dependencies: pandas, anndata, pyarrow. Downloads use stdlib urllib (no requests).

Step 1: Build reference data

Before harmonizing, build the reference annotation databases. This is a one-time step per species.

stangene build-refs --species human   # downloads HGNC (~15 MB)
stangene build-refs --species mouse   # downloads MGI + BioMart (~10 MB)

Or from Python:

from stangene import build_reference

build_reference("human")
build_reference("mouse")

References are stored in a local references/ directory (gitignored by default). Re-run with --force to update from the latest upstream sources.

Step 2: Harmonize a dataset

Python API

import stangene

result = stangene.run(
    path="my_data.h5ad",       # or .tsv / .csv
    species="human",            # or "mouse"
    output_dir="results/",      # where to write reports
    dataset_name="pbmc_10k",    # optional label
)

# Inspect results programmatically
print(result.stats)
print(result.mapping_table.head())

CLI

stangene harmonize --input my_data.h5ad --species human --output-dir results/

Step 3: Review outputs

After running, the output directory contains:

File

Contents

harmonization_table.tsv

Full mapping table, one row per original feature

summary.json

Dataset-level statistics as JSON

report.md

Human-readable markdown report

conflicts.tsv

Many-to-one collisions, ambiguities, outdated names

unmapped.tsv

Unmapped features for manual review

*_harmonized.h5ad

Enriched h5ad with harmonization columns in adata.var

The markdown report (report.md) is the best starting point for understanding your results. It includes summary tables, tier breakdowns, conflict details, and warnings about potential issues like Excel-corrupted gene names.

Example output

Running on the 10x pbmc3k dataset (32,738 human genes):

Stats: {
    'exact_id': 24260,        # 74.1% - matched by Ensembl ID
    'exact_symbol': 411,      #  1.3% - matched by approved symbol
    'previous_symbol': 172,   #  0.5% - matched by old gene name
    'alias_symbol': 33,       #  0.1% - matched by alias
    'unmapped': 7859,          # 24.0% - GENCODE novel loci not in HGNC
    'ambiguous': 3
}