HSG-12M: A Large-Scale Spatial Multigraph Dataset

Published in ArXiv, 2025

Abstract: Existing graph benchmarks assume non-spatial, simple edges, collapsing physically distinct paths into a single link. We introduce HSG-12M, the first large-scale dataset of \(\textbf{spatial multigraphs}\) — graphs embedded in a metric space where multiple geometrically distinct trajectories between two nodes are retained as separate edges. HSG-12M contains 11.6 million static and 5.1 million dynamic \(\textit{Hamiltonian spectral graphs}\) across 1401 characteristic-polynomial classes, derived from 177 TB of spectral potential data. Each graph encodes the full geometry of a 1-D crystal’s energy spectrum on the complex plane, producing diverse, physics-grounded topologies that transcend conventional node-coordinate datasets. To enable future extensions, we release \(\href{https://github.com/sarinstein-yan/Poly2Graph}{\texttt{Poly2Graph}}\): a high-performance, open-source pipeline that maps arbitrary 1-D crystal Hamiltonians to spectral graphs. Benchmarks with popular GNNs expose new challenges in learning from multi-edge geometry at scale. Beyond its practical utility, we show that spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices, forging a new algebra-to-graph link. HSG-12M lays the groundwork for geometry-aware graph learning and new opportunities of data-driven scientific discovery in condensed matter physics and beyond.

Recommended citation: Yan, Xianquan, Hakan Akgün, Kenji Kawaguchi, N. Duane Loh, and Ching Hua Lee. “HSG-12M: A Large-Scale Spatial Multigraph Dataset.” arXiv, 2025. https://doi.org/10.48550/ARXIV.2506.08618.
Download Paper | Download Bibtex