REOrdering Patches Improves Vision Models

University of Pittsburgh; Berkeley AI Research, UC Berkeley

Images can be represented as sequences of patches, and the order of these patches can significantly affect the performance of long-sequence vision transformers.

Abstract

Transformers require inputs to be represented as one-dimensional sequences, and in vision this typically means flattening images into patches in a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives such as column-major or Hilbert-curve orderings yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we optimize a Plackett-Luce policy over permutations using REINFORCE, which enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering by up to 3.01% on ImageNet-1K and by up to 13.35% on Functional Map of the World.

TL;DR - So What?

  • Long-sequence models are order-sensitive.
  • Simply changing the patch ordering can swing accuracy by more than 6%!
  • REOrder optimizes a patch ordering for a given model and dataset pair.

Utilizing a Plackett-Luce policy with reinforcement learning, we learn task-optimal patch sequences for long-sequence vision transformers.

Can We Do Better Than Row Major?

Our experiments show that patch ordering has a major impact on long-sequence models. Transformer-XL improved by nearly 2% with column-major scans but dropped by over 6% with spiral scans. Longformer gained up to 1.83% with column-major, Hilbert, or snake patterns. Orderings that help on ImageNet-1K often underperform on FMoW. REOrder consistently outperforms fixed schemes: with learned orderings, Mamba achieves average gains of 2.20% on ImageNet-1K and 9.32% on FMoW, with some orderings exceeding 13%, and even Transformer-XL improves by up to 1.50%. Learning the optimal patch order for each model and dataset unlocks reliable accuracy gains.

Because long-sequence models approximate full self-attention, the order of image patches greatly affects task performance.
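For concreteness, here is a small sketch (not the paper's code) of how permutation indices for some of these fixed scan orders can be generated and applied to a patch sequence. The 14x14 grid and 768-dimensional patches below are illustrative; a Hilbert order would additionally require a space-filling-curve utility and is omitted.

# Sketch: permutation indices for a few scan orders over a g x g patch grid.
import numpy as np

def scan_order(grid: int, pattern: str) -> np.ndarray:
    """Return a permutation of [0, grid*grid) giving the visit order of patches."""
    idx = np.arange(grid * grid).reshape(grid, grid)  # row-major layout
    if pattern == "row_major":
        return idx.reshape(-1)
    if pattern == "column_major":
        return idx.T.reshape(-1)
    if pattern == "snake":  # reverse every other row (boustrophedon)
        snake = idx.copy()
        snake[1::2] = idx[1::2, ::-1]
        return snake.reshape(-1)
    if pattern == "spiral":  # outside-in, clockwise
        order, rest = [], idx
        while rest.size:
            order.extend(rest[0])        # take the top row
            rest = np.rot90(rest[1:])    # rotate the remainder and repeat
        return np.array(order)
    raise ValueError(pattern)

# Apply to a (num_patches, dim) patch sequence from a standard patchifier,
# e.g. a 224x224 image with 16x16 patches -> 14x14 = 196 patches.
patches = np.random.randn(14 * 14, 768)
col_major_seq = patches[scan_order(14, "column_major")]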

Method

  1. Prior: Measure sequence compressibility under six scan patterns.
  2. Policy: Parameterize a Plackett-Luce model over patches and train it with REINFORCE (see the sketch after this list).
  3. Curriculum:
    • Warm-up: Train for a few epochs with standard row-major ordering to stabilize the classifier.
    • Policy Learning: Enable REOrder with high Gumbel noise. Sample patch sequences via the Plackett-Luce policy for several iterations while jointly updating model weights and patch scores.
    • Freeze & Fine-tune: Sort patches by their learned scores, freeze the ordering policy, then fine-tune the model to convergence.
REOrder learns optimal patch orderings for long-sequence vision models, improving accuracy across different image modalities.

Results

Model Sensitivity Ranking:

  • Transformer-XL: Most sensitive (±6.4% accuracy swing)
  • Mamba: Highly sensitive (±4% swing)
  • Longformer: Moderately sensitive (±2% swing)
  • ViT: Invariant (as expected)

Data-Specific Insights:

Data structure drives sequence compressibility and model bias. Row-major and Hilbert scans preserve local continuity and yield high LZMA compression, which can reinforce trivial correlations. Column-major and spiral scans lower compressibility and push models to learn global features. Because compressibility alone does not predict accuracy, REOrder learns the optimal ordering for each task.
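As a rough illustration of the compressibility measurement (again a sketch, not the paper's exact pipeline), one can LZMA-compress the raw patch bytes visited in each candidate order and compare compressed sizes. The shapes and the use of random pixels below are purely illustrative.

# Sketch: LZMA-compressed size of a patch sequence under a given ordering.
# Smaller output = more redundancy between neighboring patches in that order.
import lzma
import numpy as np

def compressed_size(patches: np.ndarray, order: np.ndarray) -> int:
    """LZMA-compressed size (bytes) of the patch sequence visited in `order`."""
    reordered = patches[order]          # (num_patches, patch_bytes), uint8
    return len(lzma.compress(reordered.tobytes()))

# Illustrative shapes only: 196 patches of 16x16x3 uint8 pixels. Random pixels
# compress identically under every order; real image patches do not.
patches = np.random.randint(0, 256, size=(196, 16 * 16 * 3), dtype=np.uint8)
row_major = np.arange(196)
column_major = np.arange(196).reshape(14, 14).T.reshape(-1)
print(compressed_size(patches, row_major), compressed_size(patches, column_major))

On real images, locality-preserving orders such as row-major and Hilbert yield the smallest compressed sizes, which is exactly the bias described above.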

The compressibility of the patch sequence provides a useful prior on ordering quality: the more compressible the sequence, the less new information each patch contributes. Compressibility alone, however, does not fully predict accuracy.

BibTeX

@misc{kutscher2025REOrder,
  title={REOrdering Patches Improves Vision Models},
  author={Declan Kutscher and David M. Chan and Yutong Bai and Trevor Darrell and Ritwik Gupta},
  year={2025},
  eprint={2505.23751},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2505.23751}
}