Benchmark

Task: align a 2D Archimedean spiral to a 3D Swiss roll with the same angular parameterization. Quality metric: Spearman rank correlation between matched angular positions (1.0 = perfect alignment).
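The metric can be illustrated with a minimal, standard-library-only sketch. It is not the benchmark script: the angle range, the helper names (`sample_angles`, `spearman_of_plan`, and so on) and the argmax matching rule are all assumptions made for illustration, and the transport plan here is built by hand rather than produced by a solver.

```python
import math
import random


def sample_angles(n, seed):
    """Draw n sorted angles between 1.5*pi and 4.5*pi (range is an assumption)."""
    rng = random.Random(seed)
    return sorted(rng.uniform(1.5 * math.pi, 4.5 * math.pi) for _ in range(n))


def spiral_2d(t):
    """2D Archimedean spiral with radius r = t."""
    return [(a * math.cos(a), a * math.sin(a)) for a in t]


def swiss_roll_3d(t, seed, height=10.0):
    """3D Swiss roll: the same spiral in x-z plus a random offset along y."""
    rng = random.Random(seed)
    return [(a * math.cos(a), rng.uniform(0.0, height), a * math.sin(a)) for a in t]


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def spearman_of_plan(plan, t_src, t_tgt):
    """Match each source point to its highest-mass target, then rank-correlate angles."""
    matched = [t_tgt[max(range(len(row)), key=row.__getitem__)] for row in plan]
    return spearman(t_src, matched)


# Toy check with a hand-built, order-preserving plan (no solver involved).
n, k = 40, 50
t_src, t_tgt = sample_angles(n, seed=0), sample_angles(k, seed=1)
X, Y = spiral_2d(t_src), swiss_roll_3d(t_tgt, seed=2)  # inputs a GW solver would receive
plan = [[0.0] * k for _ in range(n)]
for i in range(n):
    plan[i][i * k // n] = 1.0 / n
rho = spearman_of_plan(plan, t_src, t_tgt)
print(f"Spearman rho = {rho:.3f}")
```

Because the hand-built plan preserves the angular order, ρ comes out at essentially 1; a plan returned by an actual solver would be scored the same way.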

Large-Scale Results

End-to-end wall-clock time using distance_mode="landmark" and mixed_precision=True.

NVIDIA H100 80 GB HBM3:

| Scale | Time | Spearman ρ | GPU Memory |
|---|---|---|---|
| 4,000 × 5,000 | 0.8 s | 0.999 | 0.7 GB |
| 10,000 × 12,000 | 4.1 s | 0.999 | 3.9 GB |
| 20,000 × 25,000 | 4.6 s | 0.999 | 16 GB |
| 30,000 × 35,000 | 9.3 s | 0.999 | 34 GB |
| 40,000 × 50,000 | 17 s | 0.999 | 64 GB |
| 45,000 × 45,000 | 18 s | 0.999 | 65 GB |

NVIDIA L40S 48 GB:

| Scale | Time | Spearman ρ | GPU Memory |
|---|---|---|---|
| 4,000 × 5,000 | 2.4 s | 0.999 | 1.1 GB |
| 10,000 × 12,000 | 3.0 s | 0.999 | 6.7 GB |
| 20,000 × 25,000 | 12 s | 0.999 | 18 GB |
| 30,000 × 35,000 | 25 s | 0.999 | 34 GB |
| 35,000 × 40,000 | 34 s | 0.999 | 45 GB |

Alignment quality (Spearman ≥ 0.999) is maintained across all scales. Maximum scale is bounded by GPU memory for the dense N×K transport plan; stable operation requires ≤ 80% VRAM utilization.
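The memory bound can be sanity-checked with simple arithmetic. The sketch below assumes a float32 (4-byte) dense N×K buffer and that "GB" means 10^9 bytes; `plan_gb` is a hypothetical helper, not part of TorchGW, and the reported figures are copied from the H100 table above.

```python
def plan_gb(n, k, bytes_per_element=4):
    """Size in GB (10^9 bytes) of one dense n x k buffer; float32 is assumed."""
    return n * k * bytes_per_element / 1e9


# (N, K, reported GPU memory in GB), copied from the H100 table above.
h100_rows = [
    (4_000, 5_000, 0.7),
    (10_000, 12_000, 3.9),
    (20_000, 25_000, 16),
    (30_000, 35_000, 34),
    (40_000, 50_000, 64),
    (45_000, 45_000, 65),
]
VRAM_GB = 80

for n, k, reported in h100_rows:
    one_plan = plan_gb(n, k)
    print(f"{n:>6} x {k:<6} plan={one_plan:5.2f} GB  "
          f"reported/plan={reported / one_plan:4.1f}  "
          f"VRAM used={reported / VRAM_GB:4.0%}")
```

Under these assumptions the reported figures come to roughly eight to nine times the size of a single float32 plan, which suggests that several plan-sized working buffers are held at once. That ratio is read off the table, not a documented property of TorchGW.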

TorchGW vs POT

| Scale | Method | Time | Spearman ρ |
|---|---|---|---|
| 400 × 500 | POT entropic_gromov_wasserstein | 1.6 s | 0.999 |
| 400 × 500 | TorchGW sampled_gw | 0.9 s | 0.998 |
| 4,000 × 5,000 | POT entropic_gromov_wasserstein | 183 s | 0.999 |
| 4,000 × 5,000 | TorchGW sampled_gw | 1.0 s | 0.999 |

At 4,000 × 5,000, TorchGW is ~175× faster than POT at equal alignment quality (ρ = 0.999). At larger scales POT runs out of memory, while TorchGW scales to 45,000 × 45,000 on a single GPU.

Visualization

Figure: 400 vs 500 spiral-to-Swiss-roll alignment

Figure: 4000 vs 5000 spiral-to-Swiss-roll alignment

Reproducing

# Large-scale benchmark (TorchGW only)
python examples/benchmark_scale.py

# POT comparison (requires: pip install pot)
python examples/demo_spiral_to_swissroll.py