Differential Expression Analysis
=================================

``sjanpy.tl`` provides fast vectorized differential expression computation
and helper functions for multi-cluster analyses.

Fast two-group DEG
------------------

:func:`~sjanpy.tl.deg.fast_two_group_deg` runs Welch's t-test directly on the
expression matrix. For simple two-group comparisons this is much faster than
Scanpy's ``rank_genes_groups``:

.. code-block:: python

   import scanpy as sc
   from sjanpy.tl import fast_two_group_deg

   adata = sc.datasets.pbmc3k_processed()

   results = fast_two_group_deg(
       adata,
       label_col='louvain',
       lst1=['B cells'],
       lst2=['CD4 T cells'],
   )
   print(results.head(10))

The result DataFrame contains:

- ``gene``: gene name
- ``log2FC``: log2 fold change of group 1 (``lst1``) relative to group 2 (``lst2``)
- ``pct.1``, ``pct.2``: detection rates in each group
- ``pval``, ``padj``: raw and FDR-adjusted p-values
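
To give a feel for what a vectorized two-group test involves, the sketch below
runs Welch's t-test on every gene at once with SciPy and builds a table with
the same column names. It is not sjanpy's implementation: the helper name
``welch_deg_sketch``, the dense-array input, the ``1e-9`` pseudocount, the
fold-change definition (difference of log2 mean expression), and the choice of
Benjamini-Hochberg for the FDR adjustment are all assumptions made for
illustration.

.. code-block:: python

   import numpy as np
   import pandas as pd
   from scipy import stats


   def welch_deg_sketch(x1, x2, genes):
       """Illustrative two-group test; x1 and x2 are dense (cells x genes) arrays."""
       # Welch's t-test for all genes at once (axis=0 runs over cells).
       _, pval = stats.ttest_ind(x1, x2, axis=0, equal_var=False)

       # Benjamini-Hochberg adjustment, written out explicitly (assumed method).
       n = pval.size
       order = np.argsort(pval)
       scaled = pval[order] * n / np.arange(1, n + 1)
       # Enforce monotonicity, starting from the largest p-value.
       scaled = np.minimum.accumulate(scaled[::-1])[::-1]
       padj = np.empty(n)
       padj[order] = np.clip(scaled, 0.0, 1.0)

       eps = 1e-9  # assumed pseudocount to avoid log2(0)
       log2fc = np.log2(x1.mean(axis=0) + eps) - np.log2(x2.mean(axis=0) + eps)

       table = pd.DataFrame({
           "gene": genes,
           "log2FC": log2fc,
           "pct.1": (x1 > 0).mean(axis=0),  # fraction of group-1 cells expressing
           "pct.2": (x2 > 0).mean(axis=0),  # fraction of group-2 cells expressing
           "pval": pval,
           "padj": padj,
       })
       return table.sort_values("pval").reset_index(drop=True)


   # Synthetic data: 4 genes, the first one shifted upwards in group 1.
   rng = np.random.default_rng(0)
   group1 = rng.poisson(1.0, size=(50, 4)).astype(float)
   group2 = rng.poisson(1.0, size=(60, 4)).astype(float)
   group1[:, 0] += 5.0
   table = welch_deg_sketch(group1, group2, ["g0", "g1", "g2", "g3"])
   print(table)

Because the test is a handful of array operations over the whole matrix, there
is no per-gene Python loop, which is where the speed of this kind of approach
comes from.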

Within-cluster DEG
------------------

:func:`~sjanpy.tl.deg.compute_nested_deg_df` computes differentially expressed
genes (DEGs) between two conditions within each cluster, using Scanpy's
``rank_genes_groups``:

.. code-block:: python

   from sjanpy.tl import compute_nested_deg_df

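   # Requires a condition column in adata.obs containing both condition labels
   # (pbmc3k_processed does not ship with one, so add your own first)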
   nested_deg = compute_nested_deg_df(
       adata,
       cluster_key='louvain',
       condition_key='condition',
       target_condition='Disease',
       reference_condition='Control',
       method='wilcoxon',
       min_cells=10,
       compute_pct=True,
   )

Key parameters:

- ``min_cells``: skip clusters with fewer than this many cells in either condition
- ``compute_pct``: add detection rate columns (``pct_target``, ``pct_reference``)
- ``expr_layer``: layer used to compute detection rates (e.g. ``'counts'``)
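
The effect of ``min_cells`` can be pictured as a count check per cluster and
condition. The sketch below works on a toy ``obs`` table with pandas only; the
data, the threshold of 2, and the exact counting logic are illustrative
assumptions, not sjanpy internals.

.. code-block:: python

   import pandas as pd

   # Toy stand-in for adata.obs (hypothetical values).
   obs = pd.DataFrame({
       "louvain": ["B"] * 5 + ["T"] * 4,
       "condition": ["Disease", "Disease", "Disease", "Control", "Control",
                     "Disease", "Disease", "Disease", "Control"],
   })
   min_cells = 2

   # Cells per (cluster, condition) pair; absent combinations become 0.
   counts = pd.crosstab(obs["louvain"], obs["condition"])

   # Keep clusters with at least min_cells cells in both conditions.
   enough = (counts[["Disease", "Control"]] >= min_cells).all(axis=1)
   kept = counts.index[enough].tolist()
   print(counts)
   print(kept)

Running a check like this before the analysis shows in advance which clusters
would be dropped for a given threshold.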

Clipping extreme logFC
-----------------------

:func:`~sjanpy.tl.deg.clip_logfc_in_nested_deg_df` clips outlier logFC values
per cluster to prevent extreme values from dominating visualizations:

.. code-block:: python

   from sjanpy.tl import clip_logfc_in_nested_deg_df

   clipped = clip_logfc_in_nested_deg_df(
       nested_deg,
       logfc_col='logfc',
       cluster_col='cluster',
       quantile=0.95,
   )
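
One way to picture per-cluster quantile clipping is shown below with plain
pandas. The rule used here, capping the absolute logFC at the per-cluster
quantile of the absolute values, is an assumption for illustration; the actual
function may define its bounds differently. The toy data and the
``logfc_clipped`` column name are likewise hypothetical.

.. code-block:: python

   import pandas as pd

   deg = pd.DataFrame({
       "cluster": ["c0"] * 5 + ["c1"] * 5,
       "logfc": [-0.5, 0.2, 0.4, 1.0, 25.0, -30.0, -1.0, 0.1, 0.3, 0.8],
   })
   quantile = 0.95


   def clip_group(values):
       # Assumed rule: cap |logFC| at the per-cluster quantile of |logFC|.
       bound = values.abs().quantile(quantile)
       return values.clip(lower=-bound, upper=bound)


   # transform() applies the clipping separately within each cluster.
   deg["logfc_clipped"] = deg.groupby("cluster")["logfc"].transform(clip_group)
   print(deg)

Computing the bound within each cluster means that a cluster with a few extreme
genes does not change how the other clusters are clipped.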

Selecting genes to highlight
-----------------------------

:func:`~sjanpy.tl.deg.generate_highlight_dict` selects important genes per
cluster for labeling in plots:

.. code-block:: python

   from sjanpy.tl import generate_highlight_dict

   highlights = generate_highlight_dict(
       nested_deg,
       strategies=['topn', 'ktimes'],
       cluster_key='cluster',
       top_n=5,
       k=3,
       exclude_regex=[r'^MT-', r'^RP[SL]'],
   )

   # Returns: {'Cluster_0': ['GENE1', ...], 'Cluster_1': [...], ...}
   for cluster, genes in highlights.items():
       print(f"{cluster}: {genes}")

Three strategies can be combined:

- ``'topn'``: the ``top_n`` genes by absolute logFC in each cluster
- ``'ktimes'``: genes that exceed logFC cutoffs in at least ``k`` clusters
- ``'manual'``: user-specified gene list (filtered to those present in the data)

``exclude_regex`` removes unwanted genes (mitochondrial, ribosomal, etc.)
after selection.
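
The ``'topn'`` strategy followed by regex exclusion can be sketched with pandas
and the standard ``re`` module. This is a simplified illustration on invented
data, not the library's code; it covers only the ``'topn'`` strategy, and the
column names are assumptions.

.. code-block:: python

   import re

   import pandas as pd

   deg = pd.DataFrame({
       "cluster": ["c0", "c0", "c0", "c0", "c1", "c1", "c1"],
       "gene": ["MT-CO1", "CD79A", "RPS12", "MS4A1", "IL7R", "RPL3", "CD3E"],
       "logfc": [4.0, 3.1, -2.9, 2.0, -3.5, 3.0, 1.2],
   })
   top_n = 2
   exclude = re.compile(r"^MT-|^RP[SL]")

   highlights = {}
   for cluster, sub in deg.groupby("cluster"):
       # Rank genes by absolute logFC, largest first, and take the top N.
       order = sub["logfc"].abs().sort_values(ascending=False).index
       top = sub.loc[order, "gene"].head(top_n)
       # Apply the exclusion patterns after selection.
       highlights[cluster] = [g for g in top if not exclude.search(g)]

   print(highlights)

Because the filter runs after selection in this sketch, a cluster can end up
with fewer than ``top_n`` genes when some of its top hits match an excluded
pattern.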