The ad is from the Israeli real estate company Harey Zahav. They posted this on their Instagram account. After criticism and backlash they claimed it was posted as a joke.
The Norwegian collaborative journalistic fact-checking service Faktisk.no has done a detailed deep dive on this ad. If you are interested, you can read it using Google Translate (or similar tools).
https://www.faktisk.no/artikler/jdplr/kraftige-reaksjoner-etter-spok-om-boligprosjekt-i-gaza
Some interesting facts from the article: the ad text says they are working to prepare for a return to Gush Katif, a former Israeli settlement in Gaza. The company is responsible for the development of settlements in the West Bank. The owner of the company lives in Moscow and seems to be an oligarch.
So based on the info in the linked article, the ad is very real. And while the company behind it claims it was all a joke, that does seem a lot like damage control.
Well, this is simply incorrect. And confidently incorrect at that.
Vision transformers (ViT) are an important branch of computer vision models that apply transformers to image analysis and detection tasks. They perform very well. The main idea is the same as in NLP: by tokenizing the input image into smaller chunks (patches), you can apply the same attention mechanism as in NLP transformer models.
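To make the tokenization idea concrete, here is a rough sketch in plain Python of how an image can be cut into patch "tokens". The function name `patchify` and the dummy 224x224 RGB image are my own assumptions for illustration, not code from the paper. In a real ViT each flattened patch is then linearly projected and given a position embedding before attention is applied, and that part is left out here.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened,
    non-overlapping square patches, scanned row by row.

    `patchify` is a hypothetical helper for illustration only.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            # Flatten the patch: rows first, then pixels, then channels.
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    patch.extend(pixel)
            tokens.append(patch)
    return tokens


# Dummy all-black 224x224 RGB image (assumed size, chosen for illustration).
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, 16)

# (224 / 16) ** 2 = 196 patches, each with 16 * 16 * 3 = 768 values.
print(len(tokens), len(tokens[0]))  # 196 768
```

So the image ends up as a sequence of 196 "words" of 768 numbers each, which is the kind of input a standard transformer already knows how to handle.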
ViT models were introduced in 2020 by Dosovitskiy et al. in the landmark paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (https://arxiv.org/abs/2010.11929), a work that has received almost 30000 academic citations since its publication.
So claiming transformers only improve natural language and vision output is straight up wrong. They are also widely used in visual analysis, including classification and detection.