Abstract
We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset, comprising over one million (1.24M) aligned image-text pairs and designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution images of real-world defects spanning 63 material categories and 421 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes.
This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning.
Using less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance. This highlights the potential of data-efficient foundation-model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.
Key Contributions
Large-Scale Dataset
First million-scale industrial defect dataset with 1.24M image-text pairs spanning 421 defect types across 63 domains, surpassing existing benchmarks by two orders of magnitude.
Foundation Model
Diffusion-based multimodal model (905M parameters in total: an 860M diffusion U-Net plus a 45M mask generator) trained from scratch, unifying generative and discriminative capabilities for industrial defect understanding.
Data Efficiency
Achieves 96.1% pixel AUROC with only 200 samples per class (less than 5% of fully supervised requirements), demonstrating strong transfer learning capabilities.
Unified Framework
Single architecture supporting classification, detection, segmentation, and text-guided generation without task-specific modifications (a readout sketch follows below).
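To make the unified-interface claim concrete, the sketch below shows one plausible readout in which a single set of predicted masks and query embeddings serves both segmentation and image-level classification. The aggregation rules are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def unified_readout(masks: torch.Tensor, embs: torch.Tensor, class_embs: torch.Tensor):
    """Derive several task outputs from one set of mask/embedding predictions.

    masks:      (Q, H, W) per-query mask logits
    embs:       (Q, D)    per-query embeddings
    class_embs: (C, D)    text embeddings of class names (open vocabulary)
    The aggregation rules here are illustrative assumptions, not the paper's readout.
    """
    sims = embs @ class_embs.T                   # (Q, C) query-to-class similarity
    query_cls = sims.argmax(dim=1)               # best class per query
    seg_map = query_cls[masks.argmax(dim=0)]     # (H, W): pixel takes its top query's class
    image_cls = sims.max(dim=0).values.argmax()  # image label: best class over all queries
    return seg_map, image_cls

# e.g. 16 queries, 4 open-vocabulary classes, 64x64 masks
seg, cls = unified_readout(torch.randn(16, 64, 64), torch.randn(16, 512), torch.randn(4, 512))
```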
Dataset Statistics
The dataset spans 63 industrial domains, including semiconductor, electronics assembly, steel processing, and food processing, and contains 285,451 normal samples and 954,928 anomaly samples, all with expert-verified annotations.
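As an illustration of working with metadata at this scale, the tally below reproduces such counts from a manifest file. The JSON-lines layout and the field names (`domain`, `is_anomaly`) are assumptions; the actual release format is not specified here.

```python
import json
from collections import Counter

def domain_stats(metadata_path: str):
    """Tally per-domain sample counts plus normal/anomaly totals.

    Assumes one JSON object per line with hypothetical fields, e.g.
    {"image": "steel/0001.png", "domain": "steel_processing", "is_anomaly": true, ...}
    """
    per_domain, normal, anomaly = Counter(), 0, 0
    with open(metadata_path) as f:
        for line in f:
            record = json.loads(line)
            per_domain[record["domain"]] += 1
            if record["is_anomaly"]:
                anomaly += 1
            else:
                normal += 1
    return per_domain, normal, anomaly

# For IMDD-1M the expected totals are 285,451 normal and 954,928 anomaly
# samples spread over 63 domains.
```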
Method Overview
Our framework comprises three key components: (1) an implicit captioner that encodes defect images into text embeddings when captions are unavailable; (2) a diffusion U-Net (860M parameters), trained from scratch on IMDD-1M, that extracts multi-scale features; and (3) a Mask2Former generator (45M parameters) that predicts binary masks and per-query embeddings for open-vocabulary classification. Training proceeds in two stages: Stage 1 pre-trains the diffusion model on IMDD-1M for 100 epochs; Stage 2 freezes the diffusion model and fine-tunes the mask generator on downstream tasks.
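A minimal sketch of how the three components compose in Stage 2, with toy stand-ins for every module. The real 860M U-Net and 45M Mask2Former head are far larger, and all interfaces below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ImplicitCaptioner(nn.Module):
    """Encodes a defect image into a pseudo text embedding (used when no caption exists)."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))

    def forward(self, images):
        return self.net(images)  # (B, dim)

class ToyUNet(nn.Module):
    """Stand-in for the 860M diffusion U-Net; returns multi-scale feature maps."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.stem = nn.Conv2d(3, 32, 3, stride=2, padding=1)
        self.down = nn.Conv2d(32, 64, 3, stride=2, padding=1)
        self.cond = nn.Linear(dim, 64)

    def forward(self, images, text_emb):
        f1 = torch.relu(self.stem(images))               # 1/2 resolution
        f2 = torch.relu(self.down(f1))                   # 1/4 resolution
        f2 = f2 + self.cond(text_emb)[:, :, None, None]  # text conditioning
        return [f1, f2]

class MaskHead(nn.Module):
    """Stand-in for the 45M Mask2Former head: per-query masks plus embeddings."""
    def __init__(self, in_ch: int = 64, n_queries: int = 16, dim: int = 768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, in_ch))
        self.to_emb = nn.Linear(in_ch, dim)

    def forward(self, feats):
        f = feats[-1]                                            # coarsest map (B, C, H, W)
        masks = torch.einsum('qc,bchw->bqhw', self.queries, f)   # binary mask logits
        embs = self.to_emb(self.queries)                         # (Q, dim) for open-vocab matching
        return masks, embs

captioner, unet, head = ImplicitCaptioner(), ToyUNet(), MaskHead()
for p in unet.parameters():
    p.requires_grad_(False)  # Stage 2: diffusion backbone frozen, only the head is tuned

images = torch.randn(2, 3, 256, 256)
masks, embs = head(unet(images, captioner(images)))
```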
Qualitative Results
Classification: 96.7% average accuracy across MVTec AD, VisA, Magnetic Tile, and Steel Surface datasets.
Segmentation: 52.9% average IoU with fine-grained defect localization across diverse materials.
Detection: 74.6% mAP@0.5 using mask-based detection, approaching dedicated YOLO models while using only 5% of their training data (see the box-conversion sketch after this list).
Generation: Realistic defect synthesis from text descriptions (FID: 5.5–13.6, IS: 100.29).
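The mask-based detection numbers imply that boxes are derived from predicted masks before mAP@0.5 scoring. The exact conversion is not specified here, but a tight-extent reading would look like the sketch below (`torchvision.ops.masks_to_boxes` provides the same conversion for boolean masks).

```python
import torch

def masks_to_boxes(masks: torch.Tensor, thresh: float = 0.5) -> torch.Tensor:
    """Turn (N, H, W) mask probabilities into (N, 4) xyxy boxes.

    Each box is the tight axis-aligned extent of its thresholded mask, so a
    standard mAP@0.5 evaluator can score mask predictions as detections.
    An empty mask keeps a degenerate all-zero box.
    """
    binary = masks > thresh
    boxes = torch.zeros(masks.shape[0], 4)
    for i, mask in enumerate(binary):
        ys, xs = torch.where(mask)
        if xs.numel() == 0:
            continue
        boxes[i] = torch.stack([xs.min(), ys.min(), xs.max() + 1, ys.max() + 1]).float()
    return boxes

boxes = masks_to_boxes(torch.rand(16, 64, 64))  # (16, 4)
```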
Quantitative Results
Comparison with State-of-the-Art
| Method | Pixel AUROC (%) | AUPRO (%) | Training Samples |
|---|---|---|---|
| MuSc | 97.3 | 93.8 | Full |
| FAIR | 98.2 | 94.0 | Full |
| DMAD | 97.9 | 93.3 | Full |
| Ours | 96.1 | 90.2 | 200/class (~5%) |
Our method achieves competitive performance while using less than 5% of the training data required by fully supervised methods.
Data Efficiency Analysis
Performance improves rapidly up to 200 samples per class and plateaus beyond that point, demonstrating that our foundation model learns generalizable defect representations that require only minimal task-specific fine-tuning.
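The sweep itself reduces to drawing class-balanced subsets of increasing size before fine-tuning. A minimal sketch, assuming samples arrive as (image_path, label) pairs; this field layout is an assumption, not the paper's release format.

```python
import random
from collections import defaultdict

def k_shot_subset(samples, k: int = 200, seed: int = 0):
    """Draw at most k samples per class for a data-efficiency sweep."""
    rng = random.Random(seed)          # fixed seed for a reproducible sweep
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))
    subset = []
    for items in by_class.values():
        rng.shuffle(items)             # random class-balanced draw
        subset.extend(items[:k])
    return subset
```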
Ablation Study
| Model Configuration | Accuracy (%) | F1 (%) | IoU (%) |
|---|---|---|---|
| Full Model | 91.0 | 58.6 | 52.9 |
| w/o Implicit Text Embedding | 86.2 (-4.8) | 54.1 (-4.5) | 49.2 (-3.7) |
| w/o Grounding Loss | 88.3 (-2.7) | 56.4 (-2.2) | 49.8 (-3.1) |
| w/o Diffusion Conditioning | 84.0 (-7.0) | 52.3 (-6.3) | 46.7 (-6.2) |
Each component contributes significantly to overall performance. Removing diffusion conditioning causes the largest drop (-7.0% accuracy), confirming that large-scale pre-training on IMDD-1M is essential for learning generalizable defect representations.
BibTeX
@inproceedings{anonymous2026imdd1m,
  title     = {IMDD-1M: Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset},
  author    = {Anonymous},
  booktitle = {****},
  year      = {2026}
}