Underwater vision systems need large, varied, and accurately labeled data, but real underwater collection is expensive and hard to annotate. WaterGen addresses this by splitting underwater image generation into two independently controlled factors: the scene content and the surrounding water medium.
The model first generates clean underwater scene latents using a LoRA-adapted latent diffusion backbone. A medium-conditioned decoder then applies physically meaningful attenuation and backscattering according to specified water parameters. This decoupling lets a single generated scene be rendered under many water conditions without changing the underlying geometry.
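To make the attenuation and backscattering step concrete, here is a minimal sketch using the widely used simplified underwater image formation model, I = J · exp(−β_D z) + B_∞ · (1 − exp(−β_B z)). It is an illustration under stated assumptions, not WaterGen's decoder: it works directly on pixels rather than latents, it assumes a per-pixel range map is available, and the function name `apply_water_medium`, the `WATER_PRESETS` dictionary, and all coefficient values are hypothetical.

```python
import numpy as np


def apply_water_medium(clean, depth, beta_d, beta_b, b_inf):
    """Render a clean scene through a water medium (simplified formation model).

    clean:  (H, W, 3) float array in [0, 1], the medium-free scene J.
    depth:  (H, W) float array, camera-to-scene range z in meters (assumed input).
    beta_d: per-channel attenuation coefficients for the direct signal (1/m).
    beta_b: per-channel backscatter coefficients (1/m).
    b_inf:  per-channel veiling light, i.e. backscatter color at infinite range.
    """
    clean = np.asarray(clean, dtype=np.float64)
    z = np.asarray(depth, dtype=np.float64)[..., None]  # (H, W, 1) for broadcasting
    beta_d = np.asarray(beta_d, dtype=np.float64)
    beta_b = np.asarray(beta_b, dtype=np.float64)
    b_inf = np.asarray(b_inf, dtype=np.float64)

    direct = clean * np.exp(-beta_d * z)                  # attenuated scene signal
    backscatter = b_inf * (1.0 - np.exp(-beta_b * z))     # light scattered by the water
    return np.clip(direct + backscatter, 0.0, 1.0)


# Hypothetical water conditions; the numbers are illustrative, not measured.
WATER_PRESETS = {
    "clear_blue": dict(beta_d=[0.40, 0.10, 0.05],
                       beta_b=[0.30, 0.10, 0.05],
                       b_inf=[0.05, 0.30, 0.50]),
    "turbid_green": dict(beta_d=[0.60, 0.20, 0.35],
                         beta_b=[0.50, 0.25, 0.40],
                         b_inf=[0.10, 0.45, 0.30]),
}

# One fixed scene rendered under several media: the geometry never changes,
# only the water parameters do, which yields paired clean/degraded images.
rng = np.random.default_rng(0)
scene = rng.uniform(0.0, 1.0, size=(4, 4, 3))   # stand-in for a generated clean scene
depth = np.full((4, 4), 5.0)                    # constant 5 m range for simplicity

pairs = {
    name: (scene, apply_water_medium(scene, depth, **params))
    for name, params in WATER_PRESETS.items()
}
```

Two sanity checks follow from the formula: at zero range the degraded image equals the clean one, and at very large range every pixel converges to the veiling light `b_inf`. Keeping the scene array fixed while swapping presets mirrors the decoupling described above.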
The resulting paired clean/degraded images become scalable synthetic training data for underwater restoration and semantic segmentation, improving downstream performance across multiple models and datasets.
On clear underwater scene generation, WaterGen scores highest on UIQM, MUSIQ, and CLIP Score among the compared methods, indicating better visual quality and text-image alignment, and it is the only method that offers both text and medium control.
| Method | UIQM ↑ | MUSIQ ↑ | CLIP Score ↑ | Controllability |
|---|---|---|---|---|
| Atlantis | 2.8338 ± 0.1927 | 67.5437 ± 1.7373 | 0.2457 ± 0.0274 | Text-only |
| TIDE | 2.3725 ± 0.3816 | 66.4304 ± 2.1780 | 0.2305 ± 0.0118 | Text-only |
| WaterGen | 3.0239 ± 0.1317 | 69.2638 ± 0.8813 | 0.2614 ± 0.0073 | Text + Medium |
@inproceedings{wu2026watergen,
title={WaterGen: Decoupling Scene and Medium in Underwater Image Generation},
author={Wu, Jiayi and Wang, Tianfu and Xiong, Tianyi and Yuan, Dehao and Lin, Xiaomin and Islam, Md Jahidul and Fermuller, Cornelia and Metzler, Christopher and Aloimonos, Yiannis},
booktitle={European Conference on Computer Vision (ECCV)},
year={2026}
}