Unlocking Deeper Video Insights with NVIDIA AI Blueprints
Supercharge Your Video Analytics with Context
Organizations are constantly searching for better ways to pull meaningful insights from complex data like video and audio. While Retrieval-Augmented Generation (RAG) helps generative AI use private company data, applying it to video brings unique challenges like efficient data ingestion and indexing.
This post explores one solution: integrating the NVIDIA AI Blueprint for video search and summarization (VSS) with the NVIDIA AI Blueprint for retrieval-augmented generation (RAG). By combining these workflows, developers can enhance video analysis with trusted, context-rich enterprise data, paving the way for smarter, business-critical applications.
Here’s what you'll discover:
- How to combine VSS and RAG Blueprints for advanced multimodal search.
- Methods for enriching video analytics with your enterprise knowledge.
- Techniques for architecting scalable workflows for real-time video Q&A.
- Real-world applications of these solutions across various industries.
Building on our previous introduction to the VSS Blueprint, we now demonstrate how merging it with RAG elevates video analysis to deliver more accurate and context-aware insights for enterprise AI.
What Are NVIDIA AI Blueprints?
NVIDIA AI Blueprints are reference workflows designed to help developers build custom generative AI pipelines. The RAG Blueprint utilizes NVIDIA NeMo Retriever models to index multimodal documents for rapid and precise semantic search at an enterprise scale. The VSS Blueprint is designed to ingest massive volumes of video, whether streaming or archived, to enable search, summarization, interactive Q&A, and event-triggered actions like alerts.
Real-World Application: AI-Powered Health Insights
Let's compare the output of the VSS Blueprint with and without the RAG Blueprint's contextual enrichment. We'll use an input video of someone preparing breakfast to show how AI can analyze eating habits.
First, the AI generates a video summary using only the standard VSS Blueprint. The result is a factual but basic summary, categorizing actions like ingredient selection and cooking techniques. While descriptive, it lacks deeper nutritional context.
Figure 1. Default VSS Blueprint summary of a breakfast preparation video, listing observed actions and basic categories
Next, we enrich the analysis by providing the RAG Blueprint with content from the Wikipedia article on healthy diet. With this added context, the VSS Blueprint can now draw on nutritional guidelines. The enriched summary not only describes the actions but also explains the benefits of whole grains, the importance of fiber, the value of dairy, and the role of hygiene in food safety.
Figure 2. VSS summary enriched with RAG, connecting observed actions to nutritional value and healthy habits
By linking video analysis to external knowledge, the enriched summary transforms simple observations into actionable health advice, making nutrition information more accessible.
Deployment Steps
To deploy this integrated solution, follow these steps. If the RAG Blueprint is already deployed in your environment, skip the first step.
- Download and deploy the RAG Blueprint from the official repository.
- Clone the video-search-and-summarization repository:

```bash
$ git clone https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization.git
```

- Edit `src/vss-engine/docker/Dockerfile` to apply the necessary integration patches:

```diff
diff --git a/src/vss-engine/docker/Dockerfile b/src/vss-engine/docker/Dockerfile
index 58b25e3..e1df783 100644
--- a/src/vss-engine/docker/Dockerfile
+++ b/src/vss-engine/docker/Dockerfile
@@ -17,7 +17,7 @@ RUN --mount=type=bind,source=binaries/gradio_videotimeline-1.0.2-py3-none-any.wh
     pip install --no-deps /tmp/gradio_videotimeline-1.0.2-py3-none-any.whl
 
-RUN git clone https://github.com/NVIDIA/context-aware-rag.git -b v1.0.0 /tmp/vss-ctx-rag
+RUN git clone https://github.com/NVIDIA/context-aware-rag.git -b dev/vss-external-rag-support-v2 /tmp/vss-ctx-rag
 ARG TARGETARCH
 RUN pip install /tmp/vss-ctx-rag --no-deps && \
 if [ "$TARGETARCH" = "amd64" ]; then \
```

- Follow the VSS deployment instructions in the `src/vss-engine/README.md` file to deploy the patched VSS Blueprint.
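The Dockerfile change in the third step is a single branch swap, so it can also be scripted. The following is a minimal sketch, assuming the clone line appears exactly once and in the form shown in the diff; the function names `patch_dockerfile_text` and `patch_dockerfile` are illustrative and not part of either blueprint.

```python
import re
from pathlib import Path

# Values taken from the diff above.
REPO_URL = "https://github.com/NVIDIA/context-aware-rag.git"
OLD_BRANCH = "v1.0.0"
NEW_BRANCH = "dev/vss-external-rag-support-v2"

# Matches "git clone <repo> -b v1.0.0" followed by whitespace.
_CLONE_LINE = re.compile(
    r"(git clone " + re.escape(REPO_URL) + r" -b )" + re.escape(OLD_BRANCH) + r"(\s)"
)


def patch_dockerfile_text(text: str) -> str:
    """Return Dockerfile text with the clone line pointed at the integration branch."""
    patched, count = _CLONE_LINE.subn(
        lambda m: m.group(1) + NEW_BRANCH + m.group(2), text
    )
    if count != 1:
        # Refuse to guess if the file does not look like the expected version.
        raise ValueError(f"expected exactly one clone line to patch, found {count}")
    return patched


def patch_dockerfile(path: Path) -> None:
    """Hypothetical helper: patch the Dockerfile at `path` in place."""
    path.write_text(patch_dockerfile_text(path.read_text()))
```

Calling `patch_dockerfile(Path("src/vss-engine/docker/Dockerfile"))` from the repository root would apply the change; the function raises instead of silently doing nothing if the expected line is missing or already patched.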
Test the Integration
Use the following Python script with kubectl exec to test the VSS pod. This example analyzes a meal preparation video, enriching it with nutritional guidelines. Any text inside <e>...</e> tags is sent to the RAG Blueprint for contextual enrichment.
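```python
import subprocess, textwrap

# Name of the running VSS pod and the ID of the video to summarize
deployment_id = "vss-vss-deployment-595d5b4ccb-8678v"
vid_id = "6482b573-3aa6-4231-b981-a3e75806826b"

def run_in_vss(pod, cmd):
    # Run a shell command inside the "vss" container of the given pod
    subprocess.run(
        ["kubectl", "exec", pod, "-c", "vss", "--", "/bin/bash", "-c", cmd],
        check=True, text=True)

# Text inside <e>...</e> is forwarded to the RAG Blueprint
prompt = textwrap.dedent("""
    Summarize key events only.
    <e>Breakfast nutritional guidelines?</e>
""")

# Trailing backslashes join the lines into a single shell command
cmd = f"""python3 via_client_cli.py summarize \
  --id {vid_id} --model vila-1.5 --enable-chat \
  --chunk-duration 10 \
  --caption-summarization-prompt "{prompt}"
"""
run_in_vss(deployment_id, cmd)
```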
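Because the prompt is interpolated into a shell command, a prompt containing quotes or other shell metacharacters could break the call. One way to guard against that is to build the command from an argument list and let `shlex` do the quoting. This is a sketch only: the helper name `build_summarize_cmd` is hypothetical, and the flags are the ones used in the script above.

```python
import shlex


def build_summarize_cmd(vid_id: str, prompt: str,
                        model: str = "vila-1.5",
                        chunk_duration: int = 10) -> str:
    """Build a shell-safe command string for via_client_cli.py summarize."""
    args = [
        "python3", "via_client_cli.py", "summarize",
        "--id", vid_id,
        "--model", model,
        "--enable-chat",
        "--chunk-duration", str(chunk_duration),
        "--caption-summarization-prompt", prompt,
    ]
    # shlex.join quotes every argument, so newlines, quotes, and the
    # angle brackets of <e>...</e> reach the CLI unchanged.
    return shlex.join(args)


cmd = build_summarize_cmd(
    "6482b573-3aa6-4231-b981-a3e75806826b",
    "Summarize key events only.\n<e>Breakfast nutritional guidelines?</e>",
)
print(cmd)
```

The resulting string can be passed to a `kubectl exec ... /bin/bash -c` call in the same way as the hand-written command.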
The context returned from RAG is inserted into a tunable enrichment prompt before the final LLM generation. Here is the template used in the nutrition example:
Here is the summary generated about the meal preparation video:
{original_response}
Here is additional nutritional and food safety information:
{external_context}
Please enrich the summary by naturally incorporating relevant nutritional facts, food safety guidelines, and practical advice from the external context. Connect observed actions in the video to their health benefits, such as highlighting the value of specific ingredients, cooking methods, or hygiene practices. Ensure the enrichment is contextual, informative, and supports everyday healthy choices.
Do not include any introductory phrases, notes, explanations, or comments about how the inputs were combined. Do not reference the original summary or external context. Only provide the enriched summary itself, organized as bullet points under the categories: Ingredient Selection, Cooking Techniques, Nutritional Insights, Hygiene Practices, and Presentation Tips.
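To make the role of the two placeholders concrete, here is a minimal sketch of how such a template could be filled before the final LLM call. The placeholder names `{original_response}` and `{external_context}` come from the template above; the function name and the shortened instruction text are assumptions for illustration, and the way VSS performs this substitution internally may differ.

```python
ENRICHMENT_TEMPLATE = (
    "Here is the summary generated about the meal preparation video:\n"
    "{original_response}\n"
    "Here is additional nutritional and food safety information:\n"
    "{external_context}\n"
    # The full instruction paragraphs from the template above would go here;
    # they are shortened in this sketch.
    "Please enrich the summary by naturally incorporating relevant facts "
    "from the external context.\n"
)


def build_enrichment_prompt(original_response: str, external_context: str) -> str:
    """Hypothetical helper: substitute the VSS summary and the RAG context."""
    return ENRICHMENT_TEMPLATE.format(
        original_response=original_response.strip(),
        external_context=external_context.strip(),
    )


prompt = build_enrichment_prompt(
    "[summary produced by VSS]",
    "[passages returned by the RAG server]",
)
print(prompt)
```

Keeping the template as data rather than code is what makes the enrichment prompt tunable: the wording and output categories can change without touching the retrieval or fusion logic.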
How It Works
- Ingestion: VSS processes video streams and indexes visual metadata, while RAG ingests enterprise documents into a GPU-accelerated vector store.
- Query Flow: A user asks a question like, “Am I eating healthy today?” VSS identifies relevant video segments of the user’s meal and simultaneously queries the RAG server for related health guidelines.
- Knowledge Fusion: The RAG Blueprint retrieves relevant enterprise knowledge and provides it to the VSS LLM to create a grounded, context-aware answer.
- Response: The final answer is anchored in video data, enriched with external knowledge, and delivered to the user in real time with proper citations.
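The query flow above can be sketched as two retrievals running side by side, followed by a fusion step. Everything in this example is a stand-in: `find_video_segments`, `query_rag`, and `fuse` are hypothetical stubs that return placeholder strings, not the blueprints' actual APIs. The sketch only illustrates the control flow.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def find_video_segments(question: str) -> List[str]:
    """Hypothetical stand-in for VSS segment retrieval."""
    return [f"[video segment relevant to: {question}]"]


def query_rag(question: str) -> List[str]:
    """Hypothetical stand-in for a call to the RAG server."""
    return [f"[document passage relevant to: {question}]"]


def fuse(question: str, segments: List[str], passages: List[str]) -> Dict[str, object]:
    """Hypothetical stand-in for the final LLM call; it only packages its inputs."""
    return {
        "question": question,
        "answer": "[LLM answer grounded in the segments and passages]",
        "citations": segments + passages,
    }


def answer(question: str) -> Dict[str, object]:
    # Run video retrieval and knowledge retrieval concurrently, then fuse.
    with ThreadPoolExecutor(max_workers=2) as pool:
        segments_future = pool.submit(find_video_segments, question)
        passages_future = pool.submit(query_rag, question)
        segments = segments_future.result()
        passages = passages_future.result()
    return fuse(question, segments, passages)


result = answer("Am I eating healthy today?")
print(result["citations"])
```

Running the two lookups concurrently means the slower of the two, not their sum, bounds the retrieval time before fusion.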
VSS and RAG Integration Architecture
The system's modular architecture is key to its effectiveness.
- VSS ingests and analyzes video content.
- The RAG Blueprint operates as a standalone microservice, handling knowledge retrieval from enterprise data.
- VSS communicates with RAG via an API. When a prompt contains <e>...</e> tags, VSS sends that sub-prompt to the RAG server.
- The RAG Blueprint returns the relevant context.
- VSS fuses this context into its final response using a customizable prompt.
This API-based integration allows each blueprint to be used and scaled independently based on demand.
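As an illustration of the tag-based routing described above, the following sketch separates the `<e>...</e>` sub-prompts from the rest of a prompt and hands each one to a retrieval callable. How VSS parses the tags internally is not documented here, so the regular expression, the function names, and the `rag_query` callable are all assumptions.

```python
import re
from typing import Callable, List, Tuple

# Non-greedy match so that several tagged spans in one prompt stay separate.
E_TAG = re.compile(r"<e>(.*?)</e>", re.DOTALL)


def split_prompt(prompt: str) -> Tuple[str, List[str]]:
    """Return (prompt without tagged spans, list of tagged sub-prompts)."""
    sub_prompts = [s.strip() for s in E_TAG.findall(prompt)]
    base = " ".join(E_TAG.sub("", prompt).split())  # collapse leftover whitespace
    return base, sub_prompts


def gather_context(prompt: str,
                   rag_query: Callable[[str], str]) -> Tuple[str, List[str]]:
    """Send each tagged sub-prompt to `rag_query` (a hypothetical RAG client)."""
    base, sub_prompts = split_prompt(prompt)
    return base, [rag_query(s) for s in sub_prompts]


base, subs = split_prompt(
    "Summarize key events only. <e>Breakfast nutritional guidelines?</e>"
)
print(base)  # Summarize key events only.
print(subs)  # ['Breakfast nutritional guidelines?']
```

Passing the retrieval step in as a callable mirrors the decoupling in the architecture: the caller only needs something that maps a sub-prompt to context, whichever RAG server sits behind it.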
Figure 3. Architecture diagram of the VSS and RAG Blueprint solution
Connecting Workflows with Composable AI
By composing multiple NVIDIA AI Blueprints, developers can solve complex, cross-functional challenges. This modularity offers:
- Flexible Integration: Combine specialized blueprints to build tailored solutions.
- Cross-Functional Collaboration: Enable video engineers and data scientists to work together, enriching video with domain-specific knowledge.
- Context-Aware Results: Supplement video summaries with relevant information from organizational documents for more precise insights.
The Case for Dedicated RAG
Keeping the RAG Blueprint as a separate server was a deliberate architectural choice that provides several benefits:
- Multi-Workstream Support: A single RAG instance can serve as a unified knowledge layer for multiple applications.
- Decoupled Scaling: Video and document workloads can be scaled and optimized independently.
- Rapid Innovation and Security: Centralized RAG management simplifies updates and security without disrupting VSS deployments.
- Minimal Integration Overhead: Integrating with new use cases only requires the RAG server endpoint, with no need to re-index video data.
Latency Impact
We also evaluated the performance impact of this integration. The total latency is a sum of the time spent in VSS, RAG, and the final LLM fusion.
Our tests show that adding RAG (retrieval plus LLM fusion) accounts for roughly 10% of the end-to-end latency in a chat Q&A scenario and about 1% in a video summarization task, so the enrichment adds little overhead.
Figure 4. VSS and RAG Blueprint runtime percent by component
| Pipeline Stage | VSS Summarization Latency (seconds) | VSS Chat Q&A Latency (seconds) |
|---|---|---|
| VSS Summarization / Chat Q&A (Main Task) | 247.07 | 26.61 |
| RAG retrieval | 1.69 | 1.81 |
| LLM fusion | 1.24 | 1.35 |
| End-to-End | 250.00 | 29.77 |
Table 1. VSS and Enterprise RAG composable Blueprint expected system runtimes per pipeline
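The percentages quoted above can be reproduced from Table 1 by summing the stage latencies and dividing the added stages (RAG retrieval plus LLM fusion) by the total. The short script below does that arithmetic; the dictionary layout and function name are illustrative, and the numbers are copied from the table.

```python
# Stage latencies in seconds, copied from Table 1.
LATENCY_S = {
    "summarization": {"main": 247.07, "rag": 1.69, "fusion": 1.24},
    "chat_qa": {"main": 26.61, "rag": 1.81, "fusion": 1.35},
}


def overhead_share(stages):
    """Return (end-to-end latency, fraction of it added by RAG and fusion)."""
    total = sum(stages.values())
    added = stages["rag"] + stages["fusion"]
    return total, added / total


for task, stages in LATENCY_S.items():
    total, share = overhead_share(stages)
    print(f"{task}: end-to-end {total:.2f} s, RAG + fusion share {share:.1%}")
# summarization: end-to-end 250.00 s, RAG + fusion share 1.2%
# chat_qa: end-to-end 29.77 s, RAG + fusion share 10.6%
```

The stage values sum to the end-to-end figures in the table, and the shares round to the approximately 1% and 10% cited in the text.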
How Industries Are Making Smarter Decisions
The integration of VSS and RAG Blueprints is already converting raw video into valuable, context-rich insights across various sectors with minimal latency.
- Construction: Shimizu uses the technology on job sites to monitor progress and improve safety and compliance.
- Forestry Management: Cloudian’s HyperScale AIDP demo uses the blueprints to detect overgrowth, retrieve relevant policy documents, and generate reports for fire insurance.
- Media: Monks generates personalized sports highlights from large content libraries for social media and broadcast.
Figure 5. Cloudian VSS + RAG Blueprints Bureau of Land Management based forestry evaluation
To start building your own complex, accelerated pipelines, visit NVIDIA AI Blueprints.

