NNiT: Width-Agnostic Neural Network Generation with Structurally Aligned Weight Spaces

Abstract

Generative modeling of neural network parameters is often tied to architectures because standard parameter representations rely on known weight-matrix dimensions. Generation is further complicated by permutation symmetries that allow networks to model similar input-output functions while having widely different, unaligned parameterizations. In this work, we introduce Neural Network Diffusion Transformers (NNiTs), which generate weights in a width-agnostic manner by tokenizing weight matrices into patches and modeling them as locally structured fields. We establish that Graph HyperNetworks (GHNs) with a convolutional neural network (CNN) decoder structurally align the weight space, creating the local correlation necessary for patch-based processing. Focusing on Multilayer Perceptrons (MLPs), where permutation symmetry is especially apparent, NNiTs generate fully functional networks across a range of architectures. Our approach jointly models discrete architecture tokens and continuous weight patches within a single sequence model. On ManiSkill3 robotics tasks, NNiT achieves >85% success on architecture topologies unseen during training, while baseline approaches fail to generalize; the same pipeline also generalizes to MNIST classification beyond the robotic control setting.

Contributions

Structurally aligned weight spaces. We show that Graph HyperNetworks with a CNN decoder collapse permutation symmetry into a structurally aligned weight space — consistent and spatially correlated across networks — so raw weights can be tokenized into patches.
Patch tokenization for weights. Representing weights as patches makes generation width-agnostic: synthesizing a wider layer just adds patches, enabling zero-shot synthesis for unseen architecture topologies.
NNiT. A multimodal diffusion transformer that jointly models discrete architecture tokens and continuous weight patches — supporting both conditional synthesis p(w | a) and joint generation p(a, w).
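The width-agnostic claim in the contributions above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the patch size of 4, the zero-padding at the edges, and the helper names `patchify` and `unpatchify` are all assumptions made for illustration. It only shows that a wider layer yields more tokens of the same dimension, so a sequence model never needs to know the layer width in advance.

```python
import numpy as np

def patchify(weight, patch=4):
    """Split a 2-D weight matrix into non-overlapping patch tokens.

    Hypothetical helper: patch size and zero-padding are illustrative choices.
    """
    rows, cols = weight.shape
    pad_r = (-rows) % patch  # rows needed to reach a multiple of `patch`
    pad_c = (-cols) % patch
    padded = np.pad(weight, ((0, pad_r), (0, pad_c)))
    gr, gc = padded.shape[0] // patch, padded.shape[1] // patch
    # (gr, patch, gc, patch) -> (gr, gc, patch, patch) -> (gr * gc, patch * patch)
    tokens = (
        padded.reshape(gr, patch, gc, patch)
        .transpose(0, 2, 1, 3)
        .reshape(gr * gc, patch * patch)
    )
    return tokens, (rows, cols)

def unpatchify(tokens, shape, patch=4):
    """Invert `patchify`, cropping away any zero-padding."""
    rows, cols = shape
    gr, gc = -(-rows // patch), -(-cols // patch)  # ceiling division
    padded = (
        tokens.reshape(gr, gc, patch, patch)
        .transpose(0, 2, 1, 3)
        .reshape(gr * patch, gc * patch)
    )
    return padded[:rows, :cols]

rng = np.random.default_rng(0)
narrow = rng.normal(size=(8, 16))  # a layer with 16 output units
wide = rng.normal(size=(8, 32))    # the same layer, twice as wide

narrow_tokens, narrow_shape = patchify(narrow)
wide_tokens, wide_shape = patchify(wide)

# Same token dimension, different sequence length.
print(narrow_tokens.shape, wide_tokens.shape)
```

Doubling the width doubles the number of tokens while leaving each token's dimension unchanged, which is the property that lets a transformer treat width as sequence length.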

Why it works: structurally aligned weight spaces

GHN vs SGD weight correlation heatmaps. **Weight-magnitude maps for 35 independent networks.** **Top (GHN):** our generator produces a structurally aligned weight space, consistent and spatially banded across seeds, which resolves the permutation ambiguity and gives weights the local structure that makes patch tokenization possible. **Bottom (SGD):** ordinary training scrambles this structure, so adjacent weights share no consistent meaning across networks.
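The permutation ambiguity mentioned in the caption can be made concrete with a small, self-contained NumPy example. This is a generic illustration of MLP permutation symmetry, not code from the paper; the layer sizes and the particular permutation are arbitrary choices. Reordering the hidden units, with the matching reordering of the next layer's rows, leaves the function unchanged while changing where each weight sits in the matrix.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two-layer ReLU MLP."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(4, 6)), rng.normal(size=6)
w2, b2 = rng.normal(size=(6, 3)), rng.normal(size=3)

# Reorder the 6 hidden units: permute the columns of w1 and the entries of b1,
# and apply the same permutation to the rows of w2.
perm = np.array([3, 0, 5, 1, 4, 2])
w1_p, b1_p, w2_p = w1[:, perm], b1[perm], w2[perm, :]

x = rng.normal(size=(5, 4))
same_function = np.allclose(mlp(x, w1, b1, w2, b2), mlp(x, w1_p, b1_p, w2_p, b2))
same_weights = np.allclose(w1, w1_p)

print(same_function, same_weights)
```

Two independently trained networks can differ by such a permutation, so entry (i, j) of one weight matrix need not correspond to entry (i, j) of another. That is why patch tokens are only meaningful once the weight space is aligned.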

Does it generalize to unseen architectures?

Conditional weight generation on seen vs unseen architectures. **Conditional weight synthesis, p(w | a), top-10 of 100 samples.** Given a target architecture, NNiT generates its weights. On topologies held out from training, vectorized baselines collapse off the training grid while NNiT holds its success rate.
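The caption does not spell out how "top-10 of 100 samples" is computed. One plausible reading, stated here purely as an assumption, is that 100 networks are sampled per architecture, each is evaluated, and the mean success rate of the best 10 is reported. The sketch below shows that arithmetic on synthetic, evenly spaced placeholder values (not results from the paper); `top_k_mean` is a hypothetical helper name.

```python
import numpy as np

def top_k_mean(success_rates, k=10):
    """Mean of the k highest success rates (assumed reading of 'top-k of N')."""
    rates = np.asarray(success_rates, dtype=float)
    if not 0 < k <= rates.size:
        raise ValueError("k must be between 1 and the number of samples")
    best = np.sort(rates)[-k:]  # ascending sort, keep the last k
    return float(best.mean())

# Placeholder per-sample success rates: 0.00, 0.01, ..., 0.99 (synthetic).
placeholder_rates = np.arange(100) / 100
score = top_k_mean(placeholder_rates, k=10)
print(score)
```

Under this reading the metric rewards a generator whose best samples are strong, which is a different quantity from the mean success rate over all 100 samples.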

Multimodal joint synthesis of architecture and weights. **Joint synthesis, p(a, w).** NNiT also generates the architecture *and* its weights together from scratch, reaching 99–100% success on PickCube/PushCube and 90% on StackCubeEasy. The same pipeline transfers to MNIST (96.1% on unseen architectures).

Generated Policy Rollouts

PickCube

PushCube

StackCubeEasy

BibTeX

@article{kim2026nnit,
  author    = {Kim, Jiwoo and Mehta, Swarajh and Hsu, Hao-Lun and Ryu, Hyunwoo and Liu, Yudong and Pajic, Miroslav},
  title     = {NNiT: Width-Agnostic Neural Network Generation with Structurally Aligned Weight Spaces},
  journal   = {arXiv preprint arXiv:2603.00180},
  year      = {2026},
}