TBNet V1 - Cross-domain fashion image retrieval CNN model

Completed
Deep Learning

A CNN model designed and trained for cross-domain image retrieval tasks within the fashion domain, developed for the Trendbook app (Shazam for clothing).

Technologies Used

PyTorch · Data Processing · TorchVision · DDP · CNN · TensorBoard · Python

Note: The model architecture is confidential, but I will share the thinking behind the design along with non-sensitive details.

Objective

The core feature of the Trendbook app is image-based semantic search for fashion: a user snaps a photo of an outfit in the real world and finds the same or similar items in a product catalog. Existing open-source models (ViT-based and CNN-based) did not generalize well to our domain, so we designed a custom architecture optimized for small to medium datasets and modest compute.


The Model Architecture

TBNet is a convolutional neural network with a ResNet50 backbone for initial feature extraction. On top of the backbone, we use a Feature Pyramid Network (FPN) together with a multi-branch architecture to better capture the different levels of information a piece of clothing presents - from texture to shape and shadows.

The model outputs a 1024-dimensional feature vector that contains clothing-specific semantics.
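
Since the actual architecture is confidential, the sketch below is purely illustrative: it wires a ResNet50 backbone into torchvision's FeaturePyramidNetwork and fuses one pooled branch per pyramid level into a 1024-dimensional, L2-normalised embedding. The module layout, fusion strategy, and all names are assumptions, not the real TBNet design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50
from torchvision.ops import FeaturePyramidNetwork


class ToyRetrievalNet(nn.Module):
    """Illustrative sketch only -- NOT the actual (confidential) TBNet design."""

    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        backbone = resnet50(weights=None)  # in practice: ImageNet-pretrained weights
        # Stem plus the four residual stages; the stage outputs feed the FPN.
        self.stem = nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool
        )
        self.stages = nn.ModuleList(
            [backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4]
        )
        self.fpn = FeaturePyramidNetwork(
            in_channels_list=[256, 512, 1024, 2048], out_channels=256
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        # One pooled "branch" per pyramid level, fused into a single vector.
        self.head = nn.Linear(4 * 256, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)
        feats = {}
        for i, stage in enumerate(self.stages):
            x = stage(x)
            feats[f"c{i + 2}"] = x
        pyramid = self.fpn(feats)  # dict of per-level feature maps
        pooled = [self.pool(p).flatten(1) for p in pyramid.values()]
        emb = self.head(torch.cat(pooled, dim=1))
        # L2-normalise so embeddings sit on the unit sphere for retrieval.
        return F.normalize(emb, dim=1)


emb = ToyRetrievalNet()(torch.randn(2, 3, 224, 224))  # shape: (2, 1024)
```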


Training Strategy

The model is trained to perform well in a cross-domain setting: it should focus on clothing features regardless of whether an item is photographed in a studio or on the street. Because our dataset is relatively small at approximately 200k images, we developed a multi-loss strategy to extract more supervision from each sample. After several experiments, we settled on a combination of triplet loss, center loss, and attribute classification, with carefully tuned weighting.
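
As a rough illustration of how such a weighted multi-loss objective can be wired up in PyTorch, here is a minimal sketch; the margin, class count, attribute count, and loss weights are placeholders, not the values we actually tuned.

```python
import torch
import torch.nn as nn


class CenterLoss(nn.Module):
    """Pulls each embedding toward a learned per-class center."""

    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        return (emb - self.centers[labels]).pow(2).sum(dim=1).mean()


triplet = nn.TripletMarginLoss(margin=0.3)       # placeholder margin
center = CenterLoss(num_classes=1000, dim=1024)  # placeholder class count
attr_head = nn.Linear(1024, 50)                  # placeholder: 50 binary attributes
bce = nn.BCEWithLogitsLoss()

# Placeholder weights -- the real values were tuned empirically.
W_TRIPLET, W_CENTER, W_ATTR = 1.0, 0.005, 0.5


def total_loss(anchor, positive, negative, labels, attr_targets):
    """anchor/positive/negative: (B, 1024) embeddings from the model."""
    return (
        W_TRIPLET * triplet(anchor, positive, negative)
        + W_CENTER * center(anchor, labels)
        + W_ATTR * bce(attr_head(anchor), attr_targets.float())
    )
```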

The model initially tended to overfit, which we mitigated by introducing a 15-epoch warmup phase. Training ran for 120 epochs, with performance plateauing around epoch 102.
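
One common way to implement such a warmup is a linear learning-rate ramp followed by a decay schedule. The sketch below assumes that interpretation; the optimizer, base LR, and post-warmup cosine decay are illustrative choices, not the documented training recipe.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

params = [torch.nn.Parameter(torch.randn(2, 2))]  # stand-in for model.parameters()
opt = torch.optim.AdamW(params, lr=3e-4)          # placeholder optimizer and LR

EPOCHS, WARMUP_EPOCHS = 120, 15
scheduler = SequentialLR(
    opt,
    schedulers=[
        # Ramp the LR from 1% to 100% over the first 15 epochs ...
        LinearLR(opt, start_factor=0.01, total_iters=WARMUP_EPOCHS),
        # ... then decay for the remaining 105 (cosine is an illustrative choice).
        CosineAnnealingLR(opt, T_max=EPOCHS - WARMUP_EPOCHS),
    ],
    milestones=[WARMUP_EPOCHS],
)

for epoch in range(EPOCHS):
    ...  # one full training pass over the data
    scheduler.step()  # stepping once per epoch
```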


Model Performance

The model achieves an Acc@1 of 49.7% on our internal evaluation set, outperforming the other models we have tested so far.
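
For context, Acc@1 in retrieval counts a query as a hit when its single nearest catalog embedding belongs to the same item. A minimal sketch of that metric, assuming L2-normalised embeddings:

```python
import torch
import torch.nn.functional as F


def acc_at_1(query_emb, query_ids, gallery_emb, gallery_ids):
    """Fraction of queries whose nearest gallery item shares the query's ID.

    Assumes both embedding sets are L2-normalised, so a dot product
    equals cosine similarity.
    """
    sims = query_emb @ gallery_emb.T  # (num_queries, num_gallery)
    nearest = sims.argmax(dim=1)      # best-matching gallery index per query
    return (gallery_ids[nearest] == query_ids).float().mean().item()


# Toy usage with random data; the real evaluation matches street-photo
# queries against product-catalog embeddings.
q = F.normalize(torch.randn(100, 1024), dim=1)
g = F.normalize(torch.randn(500, 1024), dim=1)
print(acc_at_1(q, torch.randint(0, 50, (100,)), g, torch.randint(0, 50, (500,))))
```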


Next Steps

The model architecture is solid, but we see significant potential in training on an even larger dataset of 1–2 million images. We also plan to introduce a self-supervised approach, potentially using BYOL or a custom method of our own: combining setting-labeled positives (street/product) with an image abstraction technique to generate more diverse training data for edge cases and to extend the dataset further.
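
For reference, the core of BYOL is a regression loss between an online network's prediction and a momentum-averaged target network's projection of a second view. The condensed sketch below shows just that loss and the EMA update; it is not our planned custom method.

```python
import torch
import torch.nn.functional as F


def byol_loss(online_pred: torch.Tensor, target_proj: torch.Tensor) -> torch.Tensor:
    """BYOL regression loss: 2 - 2 * cos(p, z), with a stop-gradient target."""
    p = F.normalize(online_pred, dim=1)
    z = F.normalize(target_proj.detach(), dim=1)  # no gradient into the target
    return (2 - 2 * (p * z).sum(dim=1)).mean()


@torch.no_grad()
def ema_update(target: torch.nn.Module, online: torch.nn.Module, tau: float = 0.996):
    """Momentum (EMA) update of the target network from the online network."""
    for pt, po in zip(target.parameters(), online.parameters()):
        pt.mul_(tau).add_(po, alpha=1 - tau)
```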

Acknowledgement:

The model is heavily inspired by public scientific papers, and all credit goes to the authors and researchers behind them!