My life is finite, but knowledge is infinite.

【Motivation / Thesis】

Popular visual encoders such as ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency.

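Vision encoders such as ViT are slow on high-resolution images for two reasons: (1) there are too many tokens to process, and (2) the encoding step itself has high latency.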

The vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM.

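So there are two ways to speed up a VLM: (1) reduce the number of tokens passed from the vision encoder to the LLM, and (2) reduce the vision encoder's encoding latency.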
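A toy model makes the two axes concrete. The sketch below assumes that time-to-first-token is roughly the vision-encoding latency plus the LLM prefill time, and that prefill cost grows linearly with the number of input tokens. Both are simplifications, and the function name `ttft_ms` and every number in it are hypothetical, not measurements from the paper.

```python
def ttft_ms(encoder_latency_ms: float,
            num_visual_tokens: int,
            num_text_tokens: int,
            prefill_ms_per_token: float) -> float:
    """Toy estimate: TTFT = vision encoding time + LLM prefill time.

    Assumes prefill cost is linear in the token count, which ignores
    the quadratic attention term and any batching effects.
    """
    prefill_ms = (num_visual_tokens + num_text_tokens) * prefill_ms_per_token
    return encoder_latency_ms + prefill_ms


# Hypothetical numbers, chosen only to illustrate the two axes.
baseline = ttft_ms(100.0, 576, 32, 0.5)        # 404.0
fewer_tokens = ttft_ms(100.0, 144, 32, 0.5)    # 188.0 (axis 1: fewer visual tokens)
faster_encoder = ttft_ms(50.0, 576, 32, 0.5)   # 354.0 (axis 2: lower encoding latency)

print(baseline, fewer_tokens, faster_encoder)
```

In this toy setting either axis lowers the estimate on its own, and the two effects add up, which is why an encoder that is both faster and emits fewer tokens is attractive.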

【Method】

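A VLM is typically assembled as vision encoder + projection layer + LLM. A vision encoder trained on low-resolution images does not support high-resolution inputs. Ways to improve quality include retraining the vision encoder, or splitting the high-resolution image into sub-images and encoding each one. In both cases, inference on high-resolution images means the vision encoder has to process more tokens, and encoding takes too long.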
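To see how quickly the token count grows, here is a minimal sketch that counts visual tokens for a square image. It assumes the encoder emits one token per `stride × stride` pixel patch and that tiling encodes every sub-image independently. The helper names and the stride values (14 and 64) are illustrative assumptions, not figures taken from these notes.

```python
def num_visual_tokens(image_size: int, stride: int) -> int:
    """Tokens emitted for a square image, one token per stride x stride patch."""
    if image_size % stride != 0:
        raise ValueError("image_size must be a multiple of stride")
    side = image_size // stride
    return side * side


def tiled_token_count(image_size: int, tile_size: int, stride: int) -> int:
    """Tokens when the image is cut into square tiles that are encoded separately."""
    if image_size % tile_size != 0:
        raise ValueError("image_size must be a multiple of tile_size")
    tiles_per_side = image_size // tile_size
    return tiles_per_side ** 2 * num_visual_tokens(tile_size, stride)


# Token count grows quadratically with resolution at a fixed stride.
for size in (336, 672, 1008):
    print(size, num_visual_tokens(size, 14))   # 576, 2304, 5184

# Tiling a 1008 px image into 336 px tiles does not reduce the total.
print(tiled_token_count(1008, 336, 14))        # 5184

# A hypothetical coarser stride cuts the count sharply at high resolution.
print(num_visual_tokens(1024, 64))             # 256
```

Tripling the side length multiplies the token count by nine, and tiling only redistributes those tokens across sub-images. Reducing the count at a given resolution needs an encoder with a larger effective stride.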

Building on FastViT, a hybrid convolutional-transformer architecture, the authors design a new hybrid vision encoder, FastViTHD, to handle high-resolution images.

The speedup comes mainly from using MobileCLIP, which has fewer parameters, together with the newly designed network architecture.

【Results】

In the LLaVA 1.5 setup, FastVLM achieves 3.2× improvement in time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. In short: the VLM's time to first token is 3.2× faster.
Compared to LLaVA-OneVision at the highest resolution (1152×1152), FastVLM achieves better performance on key benchmarks like SeedBench, MMMU and DocVQA, using the same 0.5B LLM, but with 85× faster TTFT and a vision encoder that is 3.4× smaller. In short: at the highest resolution, TTFT is 85× faster and the vision encoder is 3.4× smaller.

【Takeaways】

FastVLM reading notes
