
Unlocking Deeper Video Insights with NVIDIA AI Blueprints

2025-11-04 · Ilyas Bankole-Hameed · 8 minute read
Video Analytics
Generative AI
NVIDIA

Supercharge Your Video Analytics with Context

Organizations are constantly searching for better ways to pull meaningful insights from complex data like video and audio. While Retrieval-Augmented Generation (RAG) helps generative AI use private company data, applying it to video brings unique challenges like efficient data ingestion and indexing.

This post explores a powerful solution: integrating the NVIDIA AI Blueprint for video search and summarization (VSS) with the NVIDIA AI Blueprint for retrieval-augmented generation (RAG). By combining these powerful workflows, developers can enhance video analysis with trusted, context-rich enterprise data, paving the way for smarter, business-critical applications.

Here’s what you'll discover:

  • How to combine VSS and RAG Blueprints for advanced multimodal search.
  • Methods for enriching video analytics with your enterprise knowledge.
  • Techniques for architecting scalable workflows for real-time video Q&A.
  • Real-world applications of these solutions across various industries.

Building on our previous introduction to the VSS Blueprint, we now demonstrate how merging it with RAG elevates video analysis to deliver more accurate and context-aware insights for enterprise AI.

What Are NVIDIA AI Blueprints?

NVIDIA AI Blueprints are reference workflows designed to help developers build custom generative AI pipelines. The RAG Blueprint utilizes NVIDIA NeMo Retriever models to index multimodal documents for rapid and precise semantic search at an enterprise scale. The VSS Blueprint is designed to ingest massive volumes of video, whether streaming or archived, to enable search, summarization, interactive Q&A, and event-triggered actions like alerts.

Real-World Application: AI-Powered Health Insights

Let's compare the output of the VSS Blueprint with and without the RAG Blueprint's contextual enrichment. We'll use an input video of someone preparing breakfast to show how AI can analyze eating habits.

First, the AI generates a video summary using only the standard VSS Blueprint. The result is a factual but basic summary, categorizing actions like ingredient selection and cooking techniques. While descriptive, it lacks deeper nutritional context.

Figure 1. Default VSS Blueprint summary of a breakfast preparation video, listing observed actions under categories for ingredient selection, cooking techniques, nutritional insights, hygiene practices, and presentation tips

Next, we enrich the analysis by providing the RAG Blueprint with data from the Wikipedia page for a healthy diet. With this added context, the VSS Blueprint can now draw on nutritional guidelines. The enriched summary not only describes the actions but also explains the benefits of whole grains, the importance of fiber, the value of dairy, and the role of hygiene in food safety.

Figure 2. VSS summary enriched with RAG, connecting observed actions to nutritional value and healthy habits

By linking video analysis to external knowledge, the enriched summary transforms simple observations into actionable health advice, making nutrition information more accessible.

Deployment Steps

To deploy this integrated solution, follow these steps:

  1. Download and deploy the RAG Blueprint from the official repository.

  2. Clone the video-search-and-summarization repository:

```bash
git clone https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization.git
```

  3. Edit the src/vss-engine/docker/Dockerfile to apply the necessary integration patches:

```diff
diff --git a/src/vss-engine/docker/Dockerfile b/src/vss-engine/docker/Dockerfile
index 58b25e3..e1df783 100644
--- a/src/vss-engine/docker/Dockerfile
+++ b/src/vss-engine/docker/Dockerfile
@@ -17,7 +17,7 @@ RUN --mount=type=bind,source=binaries/gradio_videotimeline-1.0.2-py3-none-any.wh
     pip install --no-deps /tmp/gradio_videotimeline-1.0.2-py3-none-any.whl
 
-RUN git clone https://github.com/NVIDIA/context-aware-rag.git -b v1.0.0 /tmp/vss-ctx-rag
+RUN git clone https://github.com/NVIDIA/context-aware-rag.git -b dev/vss-external-rag-support-v2 /tmp/vss-ctx-rag
 ARG TARGETARCH
 RUN pip install /tmp/vss-ctx-rag --no-deps && \
     if [ "$TARGETARCH" = "amd64" ]; then \
```
  4. Follow the VSS deployment instructions in the src/vss-engine/README.md file to deploy the patched VSS Blueprint.

Test the Integration

Use the following Python script with kubectl exec to test the VSS pod. This example analyzes a meal preparation video, enriching it with nutritional guidelines. Any text inside <e>...</e> tags is sent to the RAG Blueprint for contextual enrichment.

```python
import subprocess
import textwrap

deployment_id = "vss-vss-deployment-595d5b4ccb-8678v"
vid_id = "6482b573-3aa6-4231-b981-a3e75806826b"

def run_in_vss(pod, cmd):
    # Run a shell command inside the vss container of the given pod.
    subprocess.run(
        ["kubectl", "exec", pod, "-c", "vss", "--", "/bin/bash", "-c", cmd],
        check=True,
        text=True,
    )

# The <e>...</e> portion of the prompt is routed to the RAG Blueprint.
prompt = textwrap.dedent("""
    Summarize key events only.
    <e>Breakfast nutritional guidelines?</e>
""")

cmd = f"""python3 via_client_cli.py summarize \\
    --id {vid_id} --model vila-1.5 --enable-chat \\
    --chunk-duration 10 \\
    --caption-summarization-prompt "{prompt}" """

run_in_vss(deployment_id, cmd)
```

The context returned from RAG is inserted into a tunable enrichment prompt before the final LLM generation. Here is the template used in the nutrition example:

Here is the summary generated about the meal preparation video:
{original_response}

Here is additional nutritional and food safety information:
{external_context}

Please enrich the summary by naturally incorporating relevant nutritional facts, food safety guidelines, and practical advice from the external context. Connect observed actions in the video to their health benefits, such as highlighting the value of specific ingredients, cooking methods, or hygiene practices. Ensure the enrichment is contextual, informative, and supports everyday healthy choices.

Do not include any introductory phrases, notes, explanations, or comments about how the inputs were combined. Do not reference the original summary or external context. Only provide the enriched summary itself, organized as bullet points under the categories: Ingredient Selection, Cooking Techniques, Nutritional Insights, Hygiene Practices, and Presentation Tips.
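As a minimal sketch of how the template above might be filled before the final LLM call (the function name and abbreviated template are illustrative, not the actual VSS implementation), plain string formatting is all that is needed:

```python
# Illustrative enrichment-prompt assembly; the real VSS template is the
# full version shown above, abbreviated here for brevity.
ENRICHMENT_TEMPLATE = """Here is the summary generated about the meal preparation video:
{original_response}

Here is additional nutritional and food safety information:
{external_context}

Please enrich the summary by naturally incorporating relevant facts."""

def build_enrichment_prompt(original_response: str, external_context: str) -> str:
    # Substitute the VSS summary and the RAG-retrieved context into the template.
    return ENRICHMENT_TEMPLATE.format(
        original_response=original_response,
        external_context=external_context,
    )

prompt = build_enrichment_prompt(
    "The video shows oats being cooked with milk.",
    "Whole grains such as oats are a good source of fiber.",
)
```

Because the template is an ordinary format string, it can be tuned per use case without touching the retrieval or video pipelines.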

How It Works

  1. Ingestion: VSS processes video streams and indexes visual metadata, while RAG ingests enterprise documents into a GPU-accelerated vector store.
  2. Query Flow: A user asks a question like, “Am I eating healthy today?” VSS identifies relevant video segments of the user’s meal and simultaneously queries the RAG server for related health guidelines.
  3. Knowledge Fusion: The RAG Blueprint retrieves relevant enterprise knowledge and provides it to the VSS LLM to create a grounded, context-aware answer.
  4. Response: The final answer is anchored in video data, enriched with external knowledge, and delivered to the user in real time with proper citations.
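The four steps above can be sketched as a single orchestration function. Every name below (query_vss, query_rag, fuse_with_llm) is a hypothetical stub standing in for the corresponding Blueprint call, not a real API:

```python
# Hypothetical sketch of the VSS + RAG query flow; the three helpers are
# stubs standing in for the actual Blueprint services.

def query_vss(video_id, question):
    # Stub: the real VSS service returns indexed video-segment captions.
    return ["08:01 user eats oatmeal with berries"]

def query_rag(question):
    # Stub: the real call hits the standalone RAG Blueprint server.
    return ["Whole grains and fruit are part of a healthy breakfast."]

def fuse_with_llm(question, segments, context):
    # Stub: the real system prompts an LLM with both sources.
    return f"Q: {question}\nVideo: {segments[0]}\nKnowledge: {context[0]}"

def answer_video_question(question, video_id):
    segments = query_vss(video_id, question)           # steps 1-2: video evidence
    context = query_rag(question)                      # step 3: enterprise knowledge
    return fuse_with_llm(question, segments, context)  # step 4: grounded answer

answer = answer_video_question("Am I eating healthy today?", "vid-123")
```

The point of the sketch is the shape of the flow: video retrieval and knowledge retrieval are independent calls, fused only at the final generation step.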

VSS and RAG Integration Architecture

The system's modular architecture is key to its effectiveness.

  1. VSS ingests and analyzes video content.
  2. The RAG Blueprint operates as a standalone microservice, handling knowledge retrieval from enterprise data.
  3. VSS communicates with RAG via an API. When a prompt contains <e>...</e> tags, VSS sends that sub-prompt to the RAG server.
  4. The RAG Blueprint returns the relevant context.
  5. VSS fuses this context into its final response using a customizable prompt.

This API-based integration allows each blueprint to be used and scaled independently based on demand.
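As a rough illustration of step 3 above, the `<e>...</e>` routing can be reproduced with a regular expression. Only the tag convention comes from the source; the splitting function itself is an assumption about how such parsing could work, not the Blueprint's actual code:

```python
import re

# Matches the <e>...</e> sub-prompts that VSS forwards to the RAG server.
E_TAG = re.compile(r"<e>(.*?)</e>", re.DOTALL)

def split_prompt(prompt: str):
    """Separate RAG-bound sub-prompts (inside <e>...</e>) from the VSS prompt."""
    rag_queries = E_TAG.findall(prompt)
    vss_prompt = E_TAG.sub("", prompt).strip()
    return vss_prompt, rag_queries

vss_prompt, rag_queries = split_prompt(
    "Summarize key events only. <e>Breakfast nutritional guidelines?</e>"
)
```

Each extracted query would then be sent to the RAG server's endpoint, while the remaining prompt drives the video pipeline.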

Figure 3. Architecture diagram of the VSS and RAG Blueprint solution

Connecting Workflows with Composable AI

By composing multiple NVIDIA AI Blueprints, developers can solve complex, cross-functional challenges. This modularity offers:

  • Flexible Integration: Combine specialized blueprints to build tailored solutions.
  • Cross-Functional Collaboration: Enable video engineers and data scientists to work together, enriching video with domain-specific knowledge.
  • Context-Aware Results: Supplement video summaries with relevant information from organizational documents for more precise insights.

The Case for Dedicated RAG

Keeping the RAG Blueprint as a separate server was a deliberate architectural choice that provides several benefits:

  • Multi-Workstream Support: A single RAG instance can serve as a unified knowledge layer for multiple applications.
  • Decoupled Scaling: Video and document workloads can be scaled and optimized independently.
  • Rapid Innovation and Security: Centralized RAG management simplifies updates and security without disrupting VSS deployments.
  • Minimal Integration Overhead: Integrating with new use cases only requires the RAG server endpoint, with no need to re-index video data.

Latency Impact

We also evaluated the performance impact of this integration. The total latency is a sum of the time spent in VSS, RAG, and the final LLM fusion.

Total latency = VSS main task + RAG retrieval + LLM fusion

Our tests show that adding RAG accounts for only 10% of the overall latency in a chat Q&A scenario and just 1% in a video summarization task, making the enrichment highly efficient.

Figure 4. VSS and RAG Blueprint runtime percentage by component

| Pipeline Stage | VSS Summarization Latency (seconds) | VSS Chat Q&A Latency (seconds) |
| --- | --- | --- |
| RAG retrieval | 1.69 | 1.81 |
| LLM fusion | 1.24 | 1.35 |
| VSS Summarization / Chat Q&A (main task) | 247.07 | 26.61 |
| End-to-end | 250 | 29.77 |

Table 1. VSS and Enterprise RAG composable Blueprint expected system runtimes per pipeline
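The quoted overhead figures can be checked directly against Table 1: the RAG contribution is retrieval plus LLM fusion, divided by the end-to-end time.

```python
# Verify the RAG overhead shares quoted above using the Table 1 numbers.
summarization_overhead = (1.69 + 1.24) / 250.0   # RAG retrieval + LLM fusion
chat_overhead = (1.81 + 1.35) / 29.77

print(f"Summarization: {summarization_overhead:.1%}")  # roughly 1%
print(f"Chat Q&A:      {chat_overhead:.1%}")           # roughly 10%
```

The summarization share is small simply because the main task dominates the 250-second pipeline, while the shorter chat pipeline makes the same few seconds of enrichment proportionally larger.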

How Industries Are Making Smarter Decisions

The integration of VSS and RAG Blueprints is already converting raw video into valuable, context-rich insights across various sectors with minimal latency.

  • Construction: Shimizu uses the technology on job sites to monitor progress and improve safety and compliance.
  • Forestry Management: Cloudian’s HyperScale AIDP demo uses the blueprints to detect overgrowth, retrieve relevant policy documents, and generate reports for fire insurance.
  • Media: Monks generates personalized sports highlights from large content libraries for social media and broadcast.

Figure 5. Cloudian VSS + RAG Blueprints forestry evaluation based on Bureau of Land Management criteria

To start building your own complex, accelerated pipelines, visit NVIDIA AI Blueprints.
