Securing Multimodal Language Models

We believe that the end-state for most AI systems will be multimodal. Security is key, especially in high-exposure sectors like government and financial services. Securing a multimodal AI system is, however, challenging given the complexity of its inputs, the breadth of potential risk scenarios and the latency constraints of production environments. Clear guidelines support security by design when AI systems are built and should facilitate an evolution in vendor security assessment. Organizations putting AI guardrails in place should consider input handling, output sanitization and data pipeline security so as to have a comprehensive defence against known vulnerabilities.

Download our full guide on securing multimodal AI systems


  • What are Vision-language (“VL”) models?
  • Common enterprise use cases for VL models 
  • Known vulnerabilities
  • Security considerations
  • Security implementation

What are Vision-language (“VL”) models?

Multimodal vision-language models, such as OpenAI’s GPT-4V, Google’s Gemini, Alibaba’s QWEN-VL and LLaVA, represent a significant leap in the realm of artificial intelligence. These models are designed to process and comprehend both visual and textual information simultaneously.

Common enterprise use cases for VL models 

Governments and enterprises across various industries are increasingly integrating multimodal vision-language models into their operations, unlocking new possibilities for automation and intelligent decision-making. From automating content categorization to enhancing user interactions through virtual assistants, these models are becoming indispensable for organizations seeking efficiency gains and improved user experiences.

In the government sector, VL models find applications in video surveillance and threat detection. For instance, these models can analyze both live and recorded video feeds, automatically identifying and categorizing objects, activities, or anomalies. This capability enhances the efficiency of security operations, allowing agencies to respond swiftly to potential threats.

In the financial sector, multimodal vision-language models are valuable beyond traditional fraud detection. One prominent application is streamlining document processing in consumer and retail banking. Consider the task of handling a large volume of loan applications, each containing textual information and accompanying images. Here, vision-language models function as highly efficient clerks: they analyze applications, extract crucial information, and even flag potential issues. The effect is akin to having an AI-powered assistant that can work through mountains of paperwork, drastically reducing document processing time compared with manual handling.

Moreover, these models possess the ability to process diverse forms of financial data, including charts, images, and tables. Such capabilities open the door to more advanced AI systems for investment analyses. Beyond simple text-based reports, vision-language models can interpret complex financial charts, analyze trends depicted in images, and extract valuable insights from tabular data. This multifaceted understanding of financial information positions these models as powerful tools for enhancing investment decision-making processes in the financial services sector.

In sectors like government and financial services where the stakes are high, the security of multimodal vision-language models takes on heightened importance. The potential impact of a security breach in these use cases extends beyond data integrity to national security or financial stability. Securing these applications therefore involves addressing not only traditional concerns related to data privacy and model robustness but also considering the broader implications of compromised multimodal AI systems.

Known Vulnerabilities

Attack Vector: Image-based visual prompt and code injection

Visual prompt injection involves manipulating the input data provided to LLMs by introducing carefully crafted visual stimuli. These stimuli could include videos or images such as screenshots of text or even handwritten code. Attackers may exploit vulnerabilities in the LLM’s multimodal capabilities to deceive the model into generating unintended or malicious output, or into executing small snippets of code.

Let’s use an example of a multimodal VL chat agent that can also write and execute code, like ChatGPT’s Code Interpreter. A bad actor can upload a screenshot of a snippet of code while using the text prompt to ask the agent to execute that code. A simpler example would involve a bad actor passing instructions through a screenshot to bypass text-based system prompts serving as input filters.

In short, just as multimodal models provide different modes of communication with AI systems, they also open additional modes for delivering prompt or code injection. This renders text-only defences such as simple system prompts, prepared statements, and keyword or regex-based prompt filters ineffective.
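The gap can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the deny-list patterns, the `screen_request` function and the `extract_text` hook (standing in for whatever OCR step a deployment might use) are hypothetical and not taken from any specific product.

```python
import re
from typing import Callable, Optional

# Hypothetical deny-list; a real deployment would use a far richer policy.
BLOCKED_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"\b(exec|eval|os\.system|subprocess)\b", re.IGNORECASE),
]


def is_blocked(text: str) -> bool:
    """Return True if any deny-list pattern appears in the text."""
    return any(p.search(text) for p in BLOCKED_PATTERNS)


def screen_request(
    prompt: str,
    image: Optional[bytes] = None,
    extract_text: Optional[Callable[[bytes], str]] = None,
) -> bool:
    """Return True if the request should be rejected.

    extract_text is an assumed OCR hook supplied by the caller. When it is
    missing, the image content is never inspected, so instructions hidden
    in a screenshot pass straight through the text filter.
    """
    if is_blocked(prompt):
        return True
    if image is not None and extract_text is not None:
        return is_blocked(extract_text(image))
    return False
```

With no OCR hook, a harmless-looking prompt paired with a malicious screenshot is accepted. Passing extracted image text through the same filter narrows the gap, but it is unlikely to catch instructions that are not rendered as legible text, so it should be read as one layer rather than a complete defence.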

Attack Vector: Visual backdoors via data poisoning 

Unlike image-based prompt injection, which occurs at inference time, visual backdoors are typically associated with model training, model refinement or vector database ingestion pipelines. Such backdoors involve the use of a specific image pattern or token that is typically not visible or obvious to the naked human eye. Once embedded within the AI system, the backdoor can be triggered by a specific stimulus or token delivered at inference time. Perhaps the most popular example used in academic papers is that of an image marker (i.e. the backdoor trigger) that tricks a backdoored facial recognition system into always unlocking a security gate.

The introduction of visual backdoors can be broadly classified as follows:

  • Latent – For example, the backdoor was already introduced into a VL model or LoRA adapter before it was uploaded to a public repository for open-source usage by the developer community
  • Passive – For example, an ingestion pipeline picks up the backdoor pattern when ingesting training data from an email inbox or website
  • Active – For example, bad actors bypass application security and directly introduce additional data into a vector database
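As one illustrative control against the latent case, a pipeline could refuse to load any model or adapter file whose digest does not match a value recorded when the file was reviewed. The sketch below assumes a hypothetical allow-list (`pinned`) maintained by the organization; the function names are illustrative only.

```python
import hashlib
from pathlib import Path
from typing import Mapping


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file without loading it all at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_trusted_artifact(path: Path, pinned: Mapping[str, str]) -> bool:
    """Accept a file only if its name is pinned and its digest matches.

    pinned is an assumed allow-list mapping file names to the SHA-256
    digests recorded when each artifact was reviewed.
    """
    expected = pinned.get(path.name)
    return expected is not None and sha256_of(path) == expected
```

Digest pinning only shows that a file has not changed since it was reviewed. It does not reveal a backdoor that was already present at review time, so it complements rather than replaces vetting of the source and testing of the model's behaviour.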

System risk: System performance degradation or data leakage 

Consider an infinite loop or chained LLM prompt introduced through image-based instructions, or a highly complex image-based vector search with an extensive query duration. Left unchecked, visual prompt injection or the malicious triggering of a VL model backdoor can lead to extensive resource consumption that either impacts user experience or brings down a system.
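A common way to bound this kind of runaway behaviour is to give every chained request a hard step and time budget. The following Python sketch assumes a hypothetical `step` callable that wraps a single model or tool call; the default limits are placeholders, not recommendations.

```python
import time
from typing import Callable, Optional


class BudgetExceeded(RuntimeError):
    """Raised when a chained request exhausts its step or time budget."""


def run_chain(
    step: Callable[[str], Optional[str]],
    first_input: str,
    max_steps: int = 8,
    max_seconds: float = 30.0,
) -> str:
    """Run step repeatedly until it returns None, within hard limits.

    step is a hypothetical callable wrapping one model or tool call. It
    returns the next input to process, or None when the chain is finished.
    """
    deadline = time.monotonic() + max_seconds
    current = first_input
    for _ in range(max_steps):
        if time.monotonic() > deadline:
            raise BudgetExceeded("time budget exhausted")
        next_input = step(current)
        if next_input is None:
            return current
        current = next_input
    raise BudgetExceeded("step budget exhausted")
```

The deadline is only checked between steps, so this sketch does not interrupt a single slow call. A per-call timeout on the underlying model or search client would be needed for that.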

Yet perhaps the most widely discussed form of system risk is that of secret or confidential data leakage. Often, prompt injection attacks rely on carefully bypassing system prompts or “forcing” an agent to dig into its foundation model. Many companies and developers are now accustomed to the idea of using system prompts, text filters or text-based guardrail models to sanitize both the user input and the model output. This is, however, insufficient in applications running multimodal models, where inputs can take the form of an image file or audio clip.

Reputational risk: Biases and reputational impact via data poisoning 

Outside of system performance and information security, there is now increasing awareness of the potential reputational impact on companies and brands when models misbehave. The recent “poem poem poem poem” experiment against OpenAI, in which researchers prompted ChatGPT to repeat a single word until it began emitting memorized training data, was just one example of how AI system hacks with little to no real-world consequences can still deal a blow to the perception of a firm’s security standing.

Given the importance of security perception for enterprises, especially those in sensitive sectors like finance, a runaway AI customer service chat agent can do much more than harm the experience of an individual consumer. In cases where biases or factually inaccurate information have been introduced into systems, one can imagine direct business implications. Consider an AI chat agent for a bank that has been tampered with to always quote a 0.01% borrowing rate: all it takes is an inaccurate output and a viral screenshot.
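One simple form of output sanitization for the scenario above is to check any percentage quoted in a reply against a range the business has approved before the reply is shown to a customer. The sketch below is illustrative only: the function names and the example bounds are assumptions, and a production system would compare against authoritative product data rather than a fixed range.

```python
import re
from typing import List

# Matches percentages such as "0.01%" or "6.5 %".
PERCENT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def quoted_rates(text: str) -> List[float]:
    """Extract every percentage figure quoted in the text."""
    return [float(m) for m in PERCENT_PATTERN.findall(text)]


def reply_is_publishable(text: str, min_rate: float, max_rate: float) -> bool:
    """Return False if any quoted percentage falls outside [min_rate, max_rate]."""
    return all(min_rate <= r <= max_rate for r in quoted_rates(text))
```

For example, with hypothetical approved bounds of 3.0 to 12.0, `reply_is_publishable("Borrow at 0.01%!", 3.0, 12.0)` returns False, and the reply could be blocked or escalated to a human agent instead of being sent.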

Legal risk: Copyright violations linked to user inputs 

Although less common, copyright violations continue to be a form of risk for AI systems. OpenAI currently faces the risk of a lawsuit from The New York Times for copyright infringement over web-scraped content.

Legal professionals would be familiar with the idea of the “substantial part” defence in copyright law. While it is arguably harder for end-users to upload an entire book into an AI agent, it is much easier to copy and paste or upload an entire image that is subject to copyright protection. This is what makes copyright considerations more critical for multimodal VL models than for text-based LLMs.
