Occlusion robustness of CLIP for military vehicle classification
conference paper
Vision-language models (VLMs) such as CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications where labeled data is scarce. However, CLIP is trained primarily on high-quality internet imagery, and its robustness in challenging military operational environments, characterized by partial occlusion and degraded signal-to-noise ratio (SNR) caused by obscurants or adverse weather, remains underexplored. We investigate the robustness of several CLIP variants to occlusion, using a custom dataset of 18 military vehicle classes. We simulate both contiguous occlusions (slide blackout, bar occlusion) and dispersed occlusions (random rain, snow, grid dropout) to reflect real-world environmental challenges. Robustness is evaluated using the Normalized Area Under the Curve (NAUC) across occlusion percentages. Four key insights emerge: (1) fine-grained, dispersed occlusions (e.g., snow, rain) degrade performance more than larger contiguous occlusions - for PE-Core-ViT-L/14-336, NAUC is 61.3% under dispersed occlusion vs. 78.9% under contiguous occlusion; (2) transformer-based CLIP models consistently outperform CNN-based CLIP models, with ViT-B/16 achieving an NAUC up to 22 percentage points higher than ResNet50; (3) pre-training methodology significantly affects robustness - PE-Core models consistently outperform CLIPA counterparts at similar scales (e.g., +6.7 pp NAUC at 320M parameters), showing that improved pre-training enhances robustness beyond scaling alone; (4) fine-tuning introduces a trade-off - linear probing boosts clean-image accuracy (55.6% → 88.0%) but reduces robustness under dispersed occlusions (snow NAUC 54.0% → 36.0%), while full fine-tuning mitigates this effect (snow NAUC 44.5%) yet still falls short of zero-shot consistency.
These results underscore the importance of occlusion-specific augmentations during training and the need for further exploration into patch-level sensitivity and architectural resilience for real-world deployment of CLIP.
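The evaluation protocol described above - classify images at increasing occlusion levels and summarize the resulting accuracy curve - can be sketched as follows. This is a minimal illustration, not the paper's implementation: `grid_dropout` is a hypothetical stand-in for one dispersed occlusion type, and NAUC is assumed here to be the trapezoidal area under the accuracy-vs-occlusion-fraction curve, normalized by the occlusion range; the paper's exact definition may differ.

```python
import numpy as np

def grid_dropout(img, frac, cell=8, rng=None):
    """Dispersed occlusion: zero out a random fraction of fixed-size grid cells."""
    rng = rng or np.random.default_rng(0)
    h, w = img.shape[:2]
    # Top-left corners of all cells covering the image.
    ys, xs = np.meshgrid(np.arange(0, h, cell), np.arange(0, w, cell), indexing="ij")
    cells = np.stack([ys.ravel(), xs.ravel()], axis=1)
    k = int(round(frac * len(cells)))
    out = img.copy()
    for y, x in cells[rng.choice(len(cells), size=k, replace=False)]:
        out[y:y + cell, x:x + cell] = 0
    return out

def nauc(occlusion_fracs, accuracies):
    """Trapezoidal area under the accuracy-vs-occlusion curve, normalized by the
    occlusion range, so a model whose accuracy never degrades scores 1.0."""
    x = np.asarray(occlusion_fracs, dtype=float)
    y = np.asarray(accuracies, dtype=float)
    area = np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x))
    return float(area / (x[-1] - x[0]))
```

For example, a classifier whose top-1 accuracy falls linearly from 1.0 (clean) to 0.0 (fully occluded) scores `nauc([0.0, 1.0], [1.0, 0.0]) == 0.5`; in a real pipeline, `accuracies` would come from running a CLIP model on images perturbed by `grid_dropout` (or rain/snow simulation) at each occlusion fraction.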
Topics
Contrastive Language-Image Pre-training; EDF FaRADAI; EDF STORE; Military Vehicle Recognition; Occlusion Robustness; Signal-to-Noise Ratio; Vision-Language Models; Zero-Shot Classification; Computer vision; Image classification; Image enhancement; Labeled data; Military photography; Military vehicles; Rain; Snow; Tuning; Visual languages; Language model; Noise ratio; Pre-training; Shot classification; Signal to noise; Vehicle recognition; Vision-language model
TNO Identifier
1024187
ISSN
0277-786X
ISBN
978-1-5106-9297-8
Publisher
The International Society for Optical Engineering
Article nr.
1367913
Source title
Proceedings Artificial Intelligence for Security and Defence Applications III, Madrid, Spain, 16-18 September 2025
Collation
11 p.
Files
To receive the publication files, please send an e-mail request to TNO Repository.