Occlusion robustness of CLIP for military vehicle classification
conference paper
Vision-language models (VLMs) such as CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications where labeled data is scarce. However, CLIP is trained primarily on high-quality internet imagery, and its robustness in challenging military operational environments, characterized by partial occlusion and degraded signal-to-noise ratio (SNR) caused by obscurants or adverse weather, remains underexplored. We investigate the robustness of several CLIP variants to occlusion, using a custom dataset of 18 military vehicle classes. We simulate both contiguous occlusions (slide blackout, bar occlusion) and dispersed occlusions (random rain, snow, grid dropout) to reflect real-world environmental challenges. Robustness is evaluated using the Normalized Area Under the Curve (NAUC) across occlusion percentages. Four key insights emerge: (1) fine-grained, dispersed occlusions (e.g., snow, rain) degrade performance more than larger contiguous occlusions - for PE-Core-ViT-L/14-336, NAUC is 61.3% under dispersed occlusion vs. 78.9% under contiguous occlusion; (2) transformer-based CLIP models consistently outperform CNN-based CLIP models, with ViT-B/16 achieving an NAUC up to 22 percentage points higher than ResNet50; (3) pre-training methodology significantly affects robustness - PE-Core models consistently outperform CLIPA counterparts at similar scales (e.g., +6.7 pp NAUC at 320M parameters), showing that improved pre-training enhances robustness beyond scaling alone; (4) fine-tuning introduces a trade-off - linear probing boosts clean-image accuracy (55.6% → 88.0%) but reduces robustness under dispersed occlusions (snow NAUC 54.0% → 36.0%), while full fine-tuning mitigates this effect (snow NAUC 44.5%) yet still falls short of zero-shot consistency.
These results underscore the importance of occlusion-specific augmentations during training and the need for further exploration into patch-level sensitivity and architectural resilience for real-world deployment of CLIP.
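The evaluation protocol described above - classify images at increasing occlusion levels and summarize the resulting accuracy curve - can be sketched as follows. This is a minimal illustration, not the paper's implementation: `grid_dropout` is a hypothetical stand-in for one dispersed occlusion type, and NAUC is assumed here to be the trapezoidal area under the accuracy-vs-occlusion-fraction curve, normalized by the occlusion range; the paper's exact definition may differ.

```python
import numpy as np

def grid_dropout(img, frac, cell=8, rng=None):
    """Dispersed occlusion: zero out a random fraction of fixed-size grid cells."""
    rng = rng or np.random.default_rng(0)
    h, w = img.shape[:2]
    # Top-left corners of all cells covering the image.
    ys, xs = np.meshgrid(np.arange(0, h, cell), np.arange(0, w, cell), indexing="ij")
    cells = np.stack([ys.ravel(), xs.ravel()], axis=1)
    k = int(round(frac * len(cells)))
    out = img.copy()
    for y, x in cells[rng.choice(len(cells), size=k, replace=False)]:
        out[y:y + cell, x:x + cell] = 0
    return out

def nauc(occlusion_fracs, accuracies):
    """Trapezoidal area under the accuracy-vs-occlusion curve, normalized by the
    occlusion range, so a model whose accuracy never degrades scores 1.0."""
    x = np.asarray(occlusion_fracs, dtype=float)
    y = np.asarray(accuracies, dtype=float)
    area = np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x))
    return float(area / (x[-1] - x[0]))
```

For example, a classifier whose top-1 accuracy falls linearly from 1.0 (clean) to 0.0 (fully occluded) scores `nauc([0.0, 1.0], [1.0, 0.0]) == 0.5`; in a real pipeline, `accuracies` would come from running a CLIP model on images perturbed by `grid_dropout` (or rain/snow simulation) at each occlusion fraction.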
Topics
Contrastive Language-Image Pre-training; EDF FaRADAI; EDF STORE; Military Vehicle Recognition; Occlusion Robustness; Signal-to-Noise Ratio; Vision-Language Models; Zero-Shot Classification; Computer vision; Image classification; Image enhancement; Labeled data; Military photography; Military vehicles; Rain; Snow; Tuning; Visual languages; Language model; Noise ratio; Pre-training; Shot classification; Signal to noise; Vehicle recognition; Vision-language model
TNO Identifier
1024187
ISSN
0277-786X
ISBN
978-1-5106-9297-8
Publisher
The International Society for Optical Engineering
Article nr.
1367913
Source title
Proceedings Artificial Intelligence for Security and Defence Applications III, Madrid, Spain, 16-18 September 2025
Collation
11 p.
Files
To receive the publication files, please send an e-mail request to TNO Repository.