LAD: LoRA-Adapted Denoiser

Authors: Ruurd Kuiper¹, Maarten van Smeden¹, Lars de Groot¹, Ayoub Bagheri²

¹UMC Utrecht    ²Utrecht University


Demonstration

Interact with the LAD model above to see it in action. The demo is running the LAD 8B Instruct model, which has been fine-tuned to follow user instructions. Try turning off intermediate noising to experience diffusion without remasking. You can choose to add a short pause between iterations to visualize masked (red) and generated (green) tokens.

Core Innovations (TL;DR)

  1. Efficient Adaptation: Adapted for diffusion with only a few hours of training on a single GPU.
  2. Rapid Inference: Often requires fewer steps than the number of output tokens, especially for simpler queries.
  3. Scalable Test-Time Compute: Answer quality can be improved by increasing the number of iterations.
  4. Structural Denoising: Combines masking and structural noising for full-sequence refinement (see the sketch after this list).
  5. Noiseless Refinement: Enables iterative improvement of complete sequences without remasking.
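
To make item 4 concrete, here is a minimal sketch of what a combined corruption step might look like. It is an illustrative assumption on our part (masking a random fraction of tokens and, as "structural" noise, replacing some others with random vocabulary entries); the exact noising procedure used by LAD will be detailed in the forthcoming paper.

import random

MASK_ID = 0          # hypothetical id of the [MASK] token
VOCAB_SIZE = 128256  # assumed LLaMA 3 vocabulary size

def corrupt(tokens, mask_rate=0.5, replace_rate=0.1):
    """Toy combined corruption: masking plus structural (replacement) noise."""
    noisy = list(tokens)
    for i in range(len(noisy)):
        r = random.random()
        if r < mask_rate:
            noisy[i] = MASK_ID                       # masking noise
        elif r < mask_rate + replace_rate:
            noisy[i] = random.randrange(VOCAB_SIZE)  # structural noise (assumed form)
    return noisy

# The denoiser is then trained to reconstruct the original tokens from corrupt(tokens).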

Motivation and Approach

Autoregressive models, while powerful, are constrained by a strict left-to-right generation process that limits efficiency and bidirectional reasoning. Diffusion-based models offer a promising, flexible alternative but have faced challenges in adapting to discrete text data. LAD (LoRA-Adapted Denoiser) was created to bridge this gap, offering a framework for non-autoregressive generation that is both flexible and efficient.

We introduce a novel approach that adapts pretrained LLaMA [1] models for iterative, bidirectional sequence refinement. By repurposing these autoregressive models, LAD benefits from their extensive world knowledge while breaking free from their sequential decoding limitations. This is achieved without costly full-model retraining, making it a practical solution for developing advanced generative capabilities.

Visualizing the Denoising Process

With Remasking (left): Denoising with remasking after each step. Tokens are re-masked with decreasing frequency to allow iterative correction.
Without Remasking (right): Denoising without remasking. Tokens are refined only once, reducing flexibility but speeding up inference.

Legend: Token shading reflects model confidence, ranging from red (low certainty) to green (high certainty). "MASK" denotes a masked token. Although not apparent from the right image, the denoising without remasking is also initialized with only "MASK" tokens.

Unlike autoregressive models, which require one step per token, LAD often converges in fewer iterations than its output length in tokens. Both examples reach coherent outputs in fewer steps than the sequence length: the version without remasking converges in 16 steps, while the remasked version refines for up to 32. Remasking enables better final quality and scalable test-time compute, as exemplified by the slight error in the final sentence of the version that does not use remasking.
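
For readers who prefer pseudocode, the decoding loop behind both animations can be summarized as follows. This is an illustrative sketch, not the released implementation: the model call, the confidence measure, and the linear remasking schedule are our assumptions.

import torch

@torch.no_grad()
def denoise(model, prompt_ids, answer_len, steps=32, mask_id=128255, remask=True):
    """Iterative mask-predict decoding: start from an all-MASK answer and
    refine every position in parallel at each step (illustrative sketch)."""
    answer = torch.full((1, answer_len), mask_id, device=prompt_ids.device)

    for step in range(steps):
        # Predict the full sequence at once; keep only the answer positions.
        logits = model(torch.cat([prompt_ids, answer], dim=1)).logits[:, -answer_len:]
        probs = logits.softmax(-1)
        confidence, prediction = probs.max(-1)   # per-token certainty and argmax token
        answer = prediction

        if remask and step < steps - 1:
            # Re-mask the least confident tokens; the fraction shrinks over time,
            # so later iterations only revisit the few remaining uncertain positions.
            n_remask = int(answer_len * (1 - (step + 1) / steps))
            if n_remask > 0:
                idx = confidence.topk(n_remask, largest=False).indices
                answer.scatter_(1, idx, mask_id)
    return answer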

How LAD Works: Core Innovations

The core of LAD is its combined structural and masked training objective, which enables flexible inference. We modify the standard causal attention mechanism of the frozen LLaMA transformer backbone to be fully bidirectional, allowing the model to consider the entire context of the sequence at once. On top of this frozen base, we train lightweight Low-Rank Adaptation (LoRA) adapters. This parameter-efficient approach preserves the original model's knowledge while drastically reducing the computational resources required for adaptation. To our knowledge, this is the first demonstration that an autoregressive model can be transformed for diffusion-based generation using LoRA finetuning alone.
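
As a rough sketch of this setup using the Hugging Face transformers and peft libraries (the checkpoint name, LoRA rank, and target modules below are our assumptions, not the authors' published configuration):

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Frozen pretrained backbone (assumed checkpoint name).
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16
)
for p in base.parameters():
    p.requires_grad = False

# Lightweight LoRA adapters on the attention projections; only these are trained.
lora = LoraConfig(
    r=16,                    # assumed rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()   # a small fraction of the 8B parameters

# Note: LAD additionally makes attention bidirectional. In practice this means
# disabling the causal mask inside the LLaMA attention layers, which is
# implementation-specific and omitted here.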

Training and Efficiency

The LAD 8B Instruct model was trained with an emphasis on extreme efficiency, which was accomplished by fine-tuning instead of training from scratch. We ran only 100,000 training iterations with a context length of 256 and a batch size of 8, which amounts to roughly 200 million training tokens. For comparison, Meta's LLaMA 3 8B was trained on 15 trillion tokens [1], and comparable diffusion models trained from scratch, such as LLaDA [2], used 2.3 trillion tokens. This means LAD used roughly 0.001% and 0.009% of their respective token counts.
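
A quick back-of-the-envelope check of those numbers:

# Token budget used to adapt LAD, compared with full pretraining runs.
iterations, context_len, batch_size = 100_000, 256, 8
lad_tokens = iterations * context_len * batch_size   # 204,800,000 (~200M)

llama3_tokens = 15e12    # LLaMA 3 pretraining budget [1]
llada_tokens = 2.3e12    # LLaDA pretraining budget [2]

print(f"{lad_tokens:,} tokens")                               # 204,800,000 tokens
print(f"{100 * lad_tokens / llama3_tokens:.4f}% of LLaMA 3")  # ~0.0014%
print(f"{100 * lad_tokens / llada_tokens:.4f}% of LLaDA")     # ~0.0089%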

Training was completed in approximately 10 hours on a single NVIDIA A100 (40 GB) GPU. By employing a frozen LLaMA backbone with LoRA adapters, LAD significantly lowers the barrier to entry for training powerful non-autoregressive text models, avoiding the high computational cost of training from scratch or of full-parameter finetuning.

Preliminary Results

Benchmark        Score (% correct)
ARC-Easy         88.5
ARC-Challenge    81.0
MMLU             60.5
HellaSwag        70.0

Note: Scores were computed on a random 200-example subset per dataset. Full benchmark results will follow.

Citation

A full paper is forthcoming and will be submitted for peer review. In the meantime, if you use LAD in your work, please cite this preliminary version:

@misc{kuiper2025lad,
      author       = {Ruurd Kuiper and Maarten van Smeden and Lars de Groot and Ayoub Bagheri},
      title        = {LAD: LoRA-Adapted Denoiser},
      year         = {2025},
      howpublished = {\url{https://ruurdkuiper.github.io/tini-lad/}},
      note         = {Work in progress}
}

We welcome feedback and collaboration as we continue to develop and evaluate LAD.

References

  1. Grattafiori, A., et al. (2024). The Llama 3 Herd of Models. arXiv:2407.21783.
  2. Nie, S., et al. (2025). Large Language Diffusion Models. arXiv:2502.09992.