  • MrConfusion@lemmy.world to Microblog Memes@lemmy.world · “Or they go to adtech” · 9 months ago

    Well, this is simply incorrect. And confidently incorrect at that.

    Vision transformers (ViT) are an important family of computer vision models that apply transformers to image analysis and detection tasks, and they perform very well. The main idea is the same as in NLP: by splitting the input image into smaller patches and treating each patch as a token, you can apply the same attention mechanism as in NLP transformer models.
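    To make the tokenization idea concrete, here is a minimal sketch (in NumPy, with illustrative names of my own choosing) of how an image can be cut into non-overlapping patches and flattened into token vectors; a real ViT would then linearly project each vector to an embedding before applying attention.

    ```python
    import numpy as np

    def image_to_patch_tokens(image, patch=16):
        """Split an (H, W, C) image into non-overlapping patch x patch pieces
        and flatten each into a vector -- the 'words' a ViT attends over."""
        H, W, C = image.shape
        assert H % patch == 0 and W % patch == 0, "image must tile evenly"
        rows, cols = H // patch, W // patch
        # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
        patches = image.reshape(rows, patch, cols, patch, C).transpose(0, 2, 1, 3, 4)
        # Flatten each patch into one token vector of length patch*patch*C
        return patches.reshape(rows * cols, patch * patch * C)

    tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
    print(tokens.shape)  # (196, 768): a 224x224 RGB image becomes 14x14 tokens
    ```

    With a 16x16 patch size, a standard 224x224 input yields a sequence of 196 tokens, which is exactly the "16x16 words" framing in the paper's title.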

    ViT models were introduced in 2020 by Dosovitskiy et al. in the landmark paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (https://arxiv.org/abs/2010.11929), a work that has received almost 30000 academic citations since its publication.

    So claiming that transformers only improve natural language output is straight up wrong. They are also widely used in visual analysis, including classification and detection.