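Paper walkthrough: BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. The paper introduces BLIP-2, a multimodal model that uses a Q-Former to extract image features and, in combination with BERT, aligns those image features to the text space.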
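To make the idea of a Q-Former pulling image features toward the text space more concrete, here is a minimal sketch in plain Python. It is a toy illustration under stated assumptions, not the BLIP-2 implementation: a small set of query vectors cross-attends over image patch features, and the result is linearly projected to a text-side embedding width. All names (`cross_attend`, `project`, `visual_tokens`) and all sizes are hypothetical, the attention is single-head with no learned key/value projections, and random numbers stand in for both the frozen image encoder output and the trained weights.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, image_feats):
    """Each query attends over all image patch features.

    Simplification: single head, scaled dot-product scores, and the patch
    features are used directly as both keys and values.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, feat)) / math.sqrt(dim)
            for feat in image_feats
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * feat[j] for w, feat in zip(weights, image_feats))
             for j in range(dim)]
        )
    return outputs


def project(vectors, weight):
    """Linear map from the query width to a (hypothetical) text embedding width."""
    out_dim = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(out_dim)]
        for v in vectors
    ]


rng = random.Random(0)
# Toy sizes chosen for readability; they are assumptions, not the paper's values.
NUM_PATCHES, NUM_QUERIES, DIM, TEXT_DIM = 16, 4, 8, 12

# Stand-in for the output of a frozen image encoder.
image_feats = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
# Query vectors; these would be learned in a real model.
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
# Projection into the text-side embedding width; also learned in a real model.
proj_weight = [[rng.gauss(0, 0.1) for _ in range(TEXT_DIM)] for _ in range(DIM)]

visual_tokens = project(cross_attend(queries, image_feats), proj_weight)
print(len(visual_tokens), len(visual_tokens[0]))  # 4 12
```

The point of the sketch is the shape change: however many patches the image encoder emits, the output is a fixed number of vectors (one per query) at the text-side width, which is what lets a language model consume them alongside ordinary token embeddings.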