LAD: LoRA-Adapted Denoiser

Authors: Ruurd Kuiper¹, Maarten van Smeden¹, Lars de Groot¹, Ayoub Bagheri²

¹UMC Utrecht    ²Utrecht University


Demonstration

Interact with the LAD model above to see it in action. The demo is running the LAD 8B Instruct model, which has been fine-tuned to follow user instructions. Try turning off intermediate noising to experience diffusion without remasking. You can choose to add a short pause between iterations to visualize masked (red) and generated (green) tokens.

Core Innovations (TL;DR)

  1. Efficient Adaptation: Adapted for diffusion with only a few hours of training on a single GPU.
  2. Rapid Inference: Often requires fewer steps than the number of output tokens, especially for simpler queries.
  3. Scalable Test-Time Compute: Answer quality can be improved by increasing the number of iterations.
  4. Structural Denoising: Combines masking and structural noising for full-sequence refinement (see the sketch after this list).
  5. Noiseless Refinement: Enables iterative improvement of complete sequences without remasking.
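
To make item 4 concrete, here is a minimal sketch of what a combined corruption step might look like. It is an illustrative assumption on our part (masking a random fraction of tokens and, as "structural" noise, replacing some others with random vocabulary entries); the exact noising procedure used by LAD will be detailed in the forthcoming paper.

import random

MASK_ID = 0          # hypothetical id of the [MASK] token
VOCAB_SIZE = 128256  # assumed LLaMA 3 vocabulary size

def corrupt(tokens, mask_rate=0.5, replace_rate=0.1):
    """Toy combined corruption: masking plus structural (replacement) noise."""
    noisy = list(tokens)
    for i in range(len(noisy)):
        r = random.random()
        if r < mask_rate:
            noisy[i] = MASK_ID                       # masking noise
        elif r < mask_rate + replace_rate:
            noisy[i] = random.randrange(VOCAB_SIZE)  # structural noise (assumed form)
    return noisy

# The denoiser is then trained to reconstruct the original tokens from corrupt(tokens).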

Motivation and Approach

Autoregressive models, while powerful, are constrained by a strict left-to-right generation process that limits efficiency and bidirectional reasoning. Diffusion-based models offer a promising, flexible alternative but have faced challenges in adapting to discrete text data. LAD (LoRA-Adapted Denoiser) was created to bridge this gap, offering a framework for non-autoregressive generation that is both flexible and efficient.

We introduce a novel approach that adapts pretrained LLaMA [1] models for iterative, bidirectional sequence refinement. By repurposing these autoregressive models, LAD benefits from their extensive world knowledge while breaking free from their sequential decoding limitations. This is achieved without costly full-model retraining, making it a practical solution for developing advanced generative capabilities.

Visualizing the Denoising Process

With Remasking (left): Denoising with remasking after each step. Tokens are re-masked with decreasing frequency to allow iterative correction.
Without Remasking (right): Denoising without remasking. Tokens are refined only once, reducing flexibility but speeding up inference.

Legend: Token shading reflects model confidence, ranging from red (low certainty) to green (high certainty). "MASK" denotes a masked token. Although not apparent from the right image, the denoising without remasking is also initialized with only "MASK" tokens.

Unlike autoregressive models, which require one step per token, LAD often converges in fewer iterations than its output length in tokens. Both examples reach coherent outputs in fewer steps than the sequence length: the version without remasking converges in 16 steps, while the remasked version refines for up to 32. Remasking enables better final quality and scalable test-time compute, as exemplified by the slight error in the final sentence of the version that does not use remasking.
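
For readers who prefer pseudocode, the decoding loop behind both animations can be summarized as follows. This is an illustrative sketch, not the released implementation: the model call, the confidence measure, and the linear remasking schedule are our assumptions.

import torch

@torch.no_grad()
def denoise(model, prompt_ids, answer_len, steps=32, mask_id=128255, remask=True):
    """Iterative mask-predict decoding: start from an all-MASK answer and
    refine every position in parallel at each step (illustrative sketch)."""
    answer = torch.full((1, answer_len), mask_id, device=prompt_ids.device)

    for step in range(steps):
        # Predict the full sequence at once; keep only the answer positions.
        logits = model(torch.cat([prompt_ids, answer], dim=1)).logits[:, -answer_len:]
        probs = logits.softmax(-1)
        confidence, prediction = probs.max(-1)   # per-token certainty and argmax token
        answer = prediction

        if remask and step < steps - 1:
            # Re-mask the least confident tokens; the fraction shrinks over time,
            # so later iterations only revisit the few remaining uncertain positions.
            n_remask = int(answer_len * (1 - (step + 1) / steps))
            if n_remask > 0:
                idx = confidence.topk(n_remask, largest=False).indices
                answer.scatter_(1, idx, mask_id)
    return answer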

How LAD Works: Core Innovations

The core of LAD is its combined structural and masked training objective, which enables flexible inference. We modify the standard causal attention mechanism of the frozen LLaMA transformer backbone to be fully bidirectional, allowing the model to consider the entire context of the sequence at once. On top of this frozen base, we train lightweight Low-Rank Adaptation (LoRA) adapters. This parameter-efficient approach preserves the original model's knowledge while drastically reducing the computational resources required for adaptation. To our knowledge, this is the first demonstration that an autoregressive model can be transformed for diffusion-based generation using LoRA finetuning alone.
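
As a rough sketch of this setup using the Hugging Face transformers and peft libraries (the checkpoint name, LoRA rank, and target modules below are our assumptions, not the authors' published configuration):

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Frozen pretrained backbone (assumed checkpoint name).
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16
)
for p in base.parameters():
    p.requires_grad = False

# Lightweight LoRA adapters on the attention projections; only these are trained.
lora = LoraConfig(
    r=16,                    # assumed rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()   # a small fraction of the 8B parameters

# Note: LAD additionally makes attention bidirectional. In practice this means
# disabling the causal mask inside the LLaMA attention layers, which is
# implementation-specific and omitted here.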

Training and Efficiency

The LAD 8B Instruct model was trained with an emphasis on extreme efficiency, which was accomplished by fine-tuning instead of training from scratch. We ran only 100,000 training iterations with a context length of 256 and a batch size of 8, which amounts to roughly 200 million training tokens. For comparison, Meta's LLaMA 3 8B was trained on 15 trillion tokens [1], and comparable diffusion models trained from scratch, such as LLaDA [2], used 2.3 trillion tokens. This means LAD used roughly 0.001% and 0.009% of their respective token counts.
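
A quick back-of-the-envelope check of those numbers:

# Token budget used to adapt LAD, compared with full pretraining runs.
iterations, context_len, batch_size = 100_000, 256, 8
lad_tokens = iterations * context_len * batch_size   # 204,800,000 (~200M)

llama3_tokens = 15e12    # LLaMA 3 pretraining budget [1]
llada_tokens = 2.3e12    # LLaDA pretraining budget [2]

print(f"{lad_tokens:,} tokens")                               # 204,800,000 tokens
print(f"{100 * lad_tokens / llama3_tokens:.4f}% of LLaMA 3")  # ~0.0014%
print(f"{100 * lad_tokens / llada_tokens:.4f}% of LLaDA")     # ~0.0089%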

Training was completed in approximately 10 hours on a single NVIDIA A100 (40 GB) GPU. By employing a frozen LLaMA backbone with LoRA adapters, LAD significantly lowers the barrier to entry for training powerful non-autoregressive text models, avoiding the high computational cost of training from scratch or of full-parameter finetuning.

Preliminary Results

Benchmark        Score (% correct)
ARC-Easy         88.5
ARC-Challenge    81.0
MMLU             60.5
HellaSwag        70.0

Note: Scores were computed on a random 200-example subset per dataset. Full benchmark results will follow.

Citation

A full paper is forthcoming and will be submitted for peer review. In the meantime, if you use LAD in your work, please cite this preliminary version:

@misc{kuiper2025lad,
      author       = {Ruurd Kuiper and Maarten van Smeden and Lars de Groot and Ayoub Bagheri},
      title        = {LAD: LoRA-Adapted Denoiser},
      year         = {2025},
      howpublished = {\url{https://ruurdkuiper.github.io/tini-lad/}},
      note         = {Work in progress}
}

We welcome feedback and collaboration as we continue to develop and evaluate LAD.

References

  1. Grattafiori, A., et al. (2024). The Llama 3 Herd of Models. arXiv:2407.21783.
  2. Nie, S., et al. (2025). Large Language Diffusion Models. arXiv:2502.09992.