• Competition End time: 2022-10-17
  • Submission Format: notebook <= 9h

Key Features

  1. No training data provided.
  2. Ensembles cannot easily be made to work (thread)


Evaluation

The evaluation metric is mean Precision @ 5 (mP@5), with a small modification to avoid penalizing queries with fewer than 5 expected index images:

\[mP@5 = \frac{1}{Q} \sum_{q=1}^Q \frac{1}{\min(n_q, 5)} \sum_{j=1}^{\min(n_q, 5)} rel_q(j)\]
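The formula above can be sketched directly in code; the function name and the input format (ranked prediction lists plus ground-truth sets) are illustrative assumptions, not the official scoring script:

```python
import numpy as np

def mean_precision_at_5(predictions, relevant):
    """mP@5 sketch: average per-query precision over the top min(n_q, 5) predictions.

    predictions: list of ranked lists of predicted index-image ids per query.
    relevant:    list of sets of ground-truth index-image ids per query (size n_q).
    """
    scores = []
    for preds, rel in zip(predictions, relevant):
        k = min(len(rel), 5)                      # the min(n_q, 5) modification
        hits = sum(1 for p in preds[:k] if p in rel)
        scores.append(hits / k)
    return float(np.mean(scores))
```

Note that a query with only one relevant index image can still score 1.0 if its top prediction is correct, which is the point of the min(n_q, 5) denominator.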
  • embedding dimension must be <= 64
  • the model must be compatible with TensorFlow 2.6.4 or PyTorch 1.11.0

The host will run a k-NN (k=5) lookup for each test sample, using the Euclidean distance between test and index embeddings.
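A minimal NumPy sketch of that lookup (brute-force distances; the host's actual retrieval implementation is not published):

```python
import numpy as np

def knn_lookup(test_emb, index_emb, k=5):
    """Return, for each test embedding, the indices of the k nearest
    index embeddings by Euclidean distance."""
    # squared Euclidean distance via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    d2 = (
        (test_emb ** 2).sum(axis=1, keepdims=True)
        - 2 * test_emb @ index_emb.T
        + (index_emb ** 2).sum(axis=1)
    )
    return np.argsort(d2, axis=1)[:, :k]
```

Squared distances give the same ranking as true Euclidean distances, so the square root is skipped.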


No training data is provided. Here is the distribution of the test data.

  1. External data thread: https://www.kaggle.com/competitions/google-universal-image-embedding/discussion/337384

Great Notebooks

  1. CLIP-TF-Train-Example
    1. CLIP + Arcface + TPU training.
  2. GCVIT
    1. Global Context Vision Transformer
  3. Understand Comp Domain and ImageNet 21k Labels
    1. Analysis of the competition domain and the ImageNet-21k labels


Top Solutions

  1. 1st place solution
    1. Start from pre-trained weights, without any training or fine-tuning first.
    2. CLIP Github
    3. ArcFace
    4. Add datasets to training list iteratively to save time and maintain good performance
    5. Unfreeze the backbone only after the linear head is well trained, so the randomly initialized head does not disturb the backbone weights.
      1. Use a 10x lower initial learning rate when unfreezing.
      2. Otherwise the model overfits easily and the linear projection weights jump sharply.
      3. So freeze the linear head while training the backbone, and add dropout to the fully connected layer.
    6. Clever ensemble to overcome different F(C, X) issues
      1. resolution 224 + resolution 280
    7. LAION-5B CLIP Model blog
  2. 2nd place solution
    1. dynamic margin
    2. stratified learning rates when training the non-backbone parts.
  3. 4th place solution
  4. 5th place solution
  5. Remaining solutions: thread