REOrdering Patches Improves Vision Models

University of Pittsburgh; Berkeley AI Research, UC Berkeley

Images can be represented as sequences of patches, and the order of these patches can significantly affect the performance of long-sequence vision transformers.

Abstract

Transformers require inputs to be represented as one-dimensional sequences, and in vision this typically means flattening images into patches in a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives such as column-major or Hilbert-curve orderings yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we optimize a Plackett-Luce policy over permutations using REINFORCE, which enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering by up to 3.01% on ImageNet-1K and by up to 13.35% on Functional Map of the World.

TL;DR - So What?

  • Long-sequence models are order-sensitive.
  • Simply changing the patch ordering can swing accuracy by more than 6%!
  • REOrder optimizes a patch ordering for a given model and dataset pair.

Utilizing a Plackett-Luce policy with reinforcement learning, we learn task-optimal patch sequences for long-sequence vision transformers.

Can We Do Better Than Row Major?

Our experiments show that patch ordering has a major impact on long-sequence models. Transformer-XL improved by nearly 2% with column-major scans but dropped by over 6% with spiral scans. Longformer gained up to 1.83% with column-major, Hilbert, or snake patterns. Orderings that help on ImageNet-1K often underperform on FMoW. REOrder consistently outperforms fixed schemes: with learned orderings, Mamba achieves average gains of 2.20% on ImageNet-1K and 9.32% on FMoW, with some orderings exceeding 13%, and even Transformer-XL improves by up to 1.50%. Learning the optimal patch order for each model and dataset unlocks reliable accuracy gains.

Because long-sequence models approximate full self-attention, the order of image patches greatly affects task performance.
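For concreteness, here is a small sketch (not the paper's code) of how permutation indices for some of these fixed scan orders can be generated and applied to a patch sequence. The 14x14 grid and 768-dimensional patches below are illustrative; a Hilbert order would additionally require a space-filling-curve utility and is omitted.

# Sketch: permutation indices for a few scan orders over a g x g patch grid.
import numpy as np

def scan_order(grid: int, pattern: str) -> np.ndarray:
    """Return a permutation of [0, grid*grid) giving the visit order of patches."""
    idx = np.arange(grid * grid).reshape(grid, grid)  # row-major layout
    if pattern == "row_major":
        return idx.reshape(-1)
    if pattern == "column_major":
        return idx.T.reshape(-1)
    if pattern == "snake":  # reverse every other row (boustrophedon)
        snake = idx.copy()
        snake[1::2] = idx[1::2, ::-1]
        return snake.reshape(-1)
    if pattern == "spiral":  # outside-in, clockwise
        order, rest = [], idx
        while rest.size:
            order.extend(rest[0])        # take the top row
            rest = np.rot90(rest[1:])    # rotate the remainder and repeat
        return np.array(order)
    raise ValueError(pattern)

# Apply to a (num_patches, dim) patch sequence from a standard patchifier,
# e.g. a 224x224 image with 16x16 patches -> 14x14 = 196 patches.
patches = np.random.randn(14 * 14, 768)
col_major_seq = patches[scan_order(14, "column_major")]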

Method

  1. Prior: Measure sequence compressibility under six scan patterns.
  2. Policy: Parameterize a Plackett-Luce model over patches and train it with REINFORCE (see the sketch after this list).
  3. Curriculum:
    • Warm-up: Train for a few epochs with standard row-major ordering to stabilize the classifier.
    • Policy Learning: Enable REOrder with high Gumbel noise. Sample patch sequences via the Plackett-Luce policy for several iterations while jointly updating model weights and patch scores.
    • Freeze & Fine-tune: Sort patches by their learned scores, freeze the ordering policy, then fine-tune the model to convergence.
REOrder learns optimal patch orderings for long-sequence vision models, improving accuracy across different image modalities.

Results

Model Sensitivity Ranking:

  • Transformer-XL: Most sensitive (±6.4% accuracy swing)
  • Mamba: Highly sensitive (±4% swing)
  • Longformer: Moderately sensitive (±2% swing)
  • ViT: Invariant (as expected)

Data-Specific Insights:

Data structure drives sequence compressibility and model bias. Row-major and Hilbert scans preserve local continuity and yield high LZMA compression, which can reinforce trivial correlations. Column-major and spiral scans lower compressibility and push models to learn global features. Because compressibility alone does not predict accuracy, REOrder learns the optimal ordering for each task.
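As a rough illustration of the compressibility measurement (again a sketch, not the paper's exact pipeline), one can LZMA-compress the raw patch bytes visited in each candidate order and compare compressed sizes. The shapes and the use of random pixels below are purely illustrative.

# Sketch: LZMA-compressed size of a patch sequence under a given ordering.
# Smaller output = more redundancy between neighboring patches in that order.
import lzma
import numpy as np

def compressed_size(patches: np.ndarray, order: np.ndarray) -> int:
    """LZMA-compressed size (bytes) of the patch sequence visited in `order`."""
    reordered = patches[order]          # (num_patches, patch_bytes), uint8
    return len(lzma.compress(reordered.tobytes()))

# Illustrative shapes only: 196 patches of 16x16x3 uint8 pixels. Random pixels
# compress identically under every order; real image patches do not.
patches = np.random.randint(0, 256, size=(196, 16 * 16 * 3), dtype=np.uint8)
row_major = np.arange(196)
column_major = np.arange(196).reshape(14, 14).T.reshape(-1)
print(compressed_size(patches, row_major), compressed_size(patches, column_major))

On real images, locality-preserving orders such as row-major and Hilbert yield the smallest compressed sizes, which is exactly the bias described above.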

The compressibility of the patch sequence provides a useful prior on ordering quality: the more compressible the sequence, the less new information each patch contributes. Compressibility alone, however, does not fully predict accuracy.

BibTeX

@misc{kutscher2025REOrder,
  title={REOrdering Patches Improves Vision Models},
  author={Declan Kutscher and David M. Chan and Yutong Bai and Trevor Darrell and Ritwik Gupta},
  year={2025},
  eprint={2505.23751},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2505.23751}
}