Use Case: Wrapping SAM2 with Neo for Anomaly Detection

This notebook demonstrates how to wrap the SAM2 segmentation model using the Neo wrapper to enable uncertainty-aware anomaly detection. By extracting vacuity scores from the model output, we can identify anomalous regions in the input image — higher vacuity values typically indicate greater uncertainty, which is useful for detecting out-of-distribution inputs or novel objects.

We visualize the vacuity scores across frames to better understand how the model responds to anomalies in a video sequence.

Setup: Forked SAM2 and Compilation with Neo wrapper

To enable tracing and wrapping of the SAM2 model with Neo wrapper, we use a forked version of SAM2 available here:

https://github.com/chohk88/sam2/tree/torch-trt

This fork includes modifications that make TorchDynamo-based compilation and integration with tools like Neo possible.

Repository Structure

After cloning the repository, place this Jupyter notebook in the parent directory of the cloned sam2 folder:

/path/to/
├── sam2/         ← cloned fork
└── this_notebook.ipynb

This ensures that relative imports and file paths work correctly during execution and compilation. We also need to download the SAM2 model weights by running the following script:

checkpoints/download_ckpts.sh

For more background on the forked SAM2, see the official guide to compiling SAM2 with the Dynamo backend in the PyTorch TensorRT tutorials:

Compiling SAM2 using TorchDynamo
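For orientation, here is a minimal, hypothetical sanity check (not part of this notebook's pipeline) that the forked image encoder is traceable by TorchDynamo, which is the machinery both torch.compile and the Neo wrapper rely on. It uses the default Inductor backend rather than the TensorRT backend covered in the tutorial:

import torch
from sam2.build_sam import build_sam2

# Build SAM2 and grab its image encoder (paths match the ones used later in this notebook).
sam2_model = build_sam2(
    "configs/sam2.1/sam2.1_hiera_l.yaml",
    "checkpoints/sam2.1_hiera_large.pt",
    device="cuda",
)

# Tracing the encoder with torch.compile exercises the same Dynamo tracing the Neo wrapper needs.
compiled_encoder = torch.compile(sam2_model.image_encoder)
with torch.no_grad():
    _ = compiled_encoder(torch.randn(1, 3, 1024, 1024, device="cuda"))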

import sys
import os
import torch
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
from sam2.build_sam import build_sam2_video_predictor
#from dataset_utils import Cityscapes_MUAD
from argparse import ArgumentParser
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from tqdm import tqdm
import cv2
from capsa_torch import neo

# use bfloat16 for the entire notebook
torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()

if torch.cuda.get_device_properties(0).major >= 8:
    # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

Step 1: Initialize and Wrap the SAM2 Image Encoder with Neo

We begin by loading the SAM2 model and extracting its image encoder. To integrate uncertainty estimation, we wrap the encoder using the Neo wrapper.

Key steps:

  • Load the SAM2 model and its predictor using the specified config and checkpoint.

  • Freeze the encoder weights to avoid modifying the pretrained backbone.

  • Use the neo.Wrapper to wrap the encoder with custom modules at specific integration sites (conv2d_2).

  • Run the wrapped encoder once on a dummy input — this is necessary to complete the wrapping process.

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

sam2_checkpoint = "checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"

sam2_model = build_sam2(model_cfg, sam2_checkpoint, device=device)

predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint, device=device)

encoder = sam2_model.image_encoder

for param in encoder.parameters():
    param.requires_grad = False

wrapper = neo.Wrapper(
    integration_sites=3,
    layer_out_dims=(128, 64),
    node_name_filter=["conv2d_2"],
    kernel_size=3,
    padding=1,
    stride=1,
    add_batch_norm=True,
    pixel_wise=True,
)
wrapped_encoder = wrapper(encoder).to(device)
dummy_input = torch.randn(1, 3, 1024, 1024).to(device)
outputs = wrapped_encoder(dummy_input)

Step 2: Load Training Data and Train the Wrapped Encoder

We train the wrapped SAM2 encoder on an anomaly-free dataset to learn a baseline distribution of normal samples. The training data combines frames from Cityscapes and MUAD, simulating normal urban driving conditions.

from pathlib import Path
from typing import Callable, Literal, NamedTuple
from torchvision.datasets import Cityscapes, VisionDataset
from torchvision.datasets.utils import download_and_extract_archive
from torchvision import tv_tensors
from PIL import Image
from collections import namedtuple
from transformers import SegformerFeatureExtractor
from torchvision.transforms import ColorJitter, functional

feature_extractor = SegformerFeatureExtractor(size={"height": 1024, "width": 1024})
jitter = ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.1)

def train_transform_muad(image, target):
    data = feature_extractor(image, target)   
    return data["pixel_values"][0], data["labels"][0]

class MUADClass(NamedTuple):
    name: str
    id: int
    color: tuple[int, int, int]
    cityscapes_id: int

class MUAD(VisionDataset):
    base_url = "https://zenodo.org/records/10619959/files/"
    zip_md5 = {
        "train": "cea6a672225b10dda1add8b2974a5982",
        "val": "957af9c1c36f0a85c33279e06b6cf8d8",
    }

    classes = [
        MUADClass("road", 0, (128, 64, 128), 7),
        MUADClass("sidewalk", 1, (244, 35, 232), 8),
        MUADClass("building", 2, (70, 70, 70), 11),
        MUADClass("wall", 3, (102, 102, 156), 12),
        MUADClass("fence", 4, (190, 153, 153), 13),
        MUADClass("pole", 5, (153, 153, 153), 17),
        MUADClass("traffic_light", 6, (250, 170, 30), 19),
        MUADClass("traffic_sign", 7, (220, 220, 0), 20),
        MUADClass("vegetation", 8, (107, 142, 35), 21),
        MUADClass("terrain", 9, (152, 251, 152), 22),
        MUADClass("sky", 10, (70, 130, 180), 23),
        MUADClass("person", 11, (220, 20, 60), 24),
        MUADClass("rider", 12, (255, 0, 0), 25),
        MUADClass("car", 13, (0, 0, 142), 26),
        MUADClass("truck", 14, (0, 0, 70), 27),
        MUADClass("bus", 15, (0, 60, 100), 28),
        MUADClass("train", 16, (0, 80, 100), 31),
        MUADClass("motorcycle", 17, (0, 0, 230), 32),
        MUADClass("bicycle", 18, (119, 11, 32), 33),
        MUADClass("bear deer cow", 19, (255, 228, 196), 0),
        MUADClass("garbage_bag stand_food trash_can", 20, (128, 128, 0), 0),
        MUADClass("unlabeled", 21, (0, 0, 0), 0)
    ]

    def __init__(
        self,
        root: str | Path,
        split: Literal["train", "ood"],
        target_type: Literal["semantic"] = "semantic",
        transforms: Callable | None = None,
        download: bool = True
    ) -> None:
        if split not in ["train", "ood"]:
            raise ValueError("Only 'train' and 'ood' splits are supported.")
        if target_type != "semantic":
            raise ValueError("Only 'semantic' target_type is supported.")

        # Handle full dataset root logic like original class
        if split != "ood":
            dataset_root = Path(root) / "MUAD"
        else:
            dataset_root = Path(root)

        super().__init__(dataset_root, transforms=transforms)
        self.root = dataset_root
        self.split = split
        self.target_type = target_type

        if not self._check_exists():
            if not download:
                raise FileNotFoundError(f"MUAD {split} not found at {self.root}.")
            if split == "ood":
                raise FileNotFoundError("No download available for 'ood' split. Place it manually.")
            self._download()

        self.samples = sorted((self.root / split / "leftImg8bit").glob("**/*"))
        self.targets = sorted((self.root / split / "leftLabel").glob("**/*"))
        self.len = len(self.samples)

    def __getitem__(self, index: int) -> tuple[tv_tensors.Image, tv_tensors.Mask]:
        image = tv_tensors.Image(Image.open(self.samples[index]).convert("RGB"))
        target = tv_tensors.Mask(Image.open(self.targets[index]))
        if self.transforms is not None:
            image, target = self.transforms(image, target)
        return image, target

    def __len__(self) -> int:
        return self.len

    def _check_exists(self) -> bool:
        img_dir = self.root / self.split / "leftImg8bit"
        lbl_dir = self.root / self.split / "leftLabel"
        return (
            img_dir.is_dir()
            and lbl_dir.is_dir()
            and len(list((img_dir).glob("**/*"))) > 1
            and len(list((lbl_dir).glob("**/*"))) > 1
        )

    def _download(self) -> None:
        filename = f"{self.split}.zip"
        url = self.base_url + filename
        md5 = self.zip_md5[self.split]
        print(f"[MUAD] Downloading {self.split} split from {url} ...")
        download_and_extract_archive(url, download_root=str(self.root.parent), md5=md5)

    @property
    def color_palette(self) -> list[list[int]]:
        return [list(c.color) for c in self.classes]


def train_transform_cityscapes(image, target):
    # Convert target (semantic mask) to a tensor of class indices;
    # the image itself is converted by the feature extractor below.
    target_tensor = functional.to_tensor(target).squeeze(0)
    
    # If target has values in [0,1], scale to class indices
    if target_tensor.max() <= 1.0:
        target_tensor = (target_tensor * 255).long()
    else:
        target_tensor = target_tensor.long()
    
    # Apply color jitter for data augmentation
    image = jitter(image)
    
    # Use feature extractor to prepare the image
    data = feature_extractor(image, target_tensor)
    
    return data["pixel_values"][0], data["labels"][0]

class Cityscapes_MUAD:
    """A combined dataset that mixes Cityscapes and MUAD datasets together."""

    def __init__(
        self,
        cityscapes_root,
        muad_root,
        split,
        muad_version = "small",
        download: bool = False,
    ) -> None:

        # super().__init__(root, transforms=transforms)
        
        self.cityscapes_root = Path(cityscapes_root)
        self.muad_root = Path(muad_root)
        self.split = split
        self.muad_version = muad_version
        self.target_type = "semantic"
        
        # Initialize Cityscapes dataset
        self.cityscapes_dataset = Cityscapes(
            root=self.cityscapes_root,
            split=split,
            mode="fine",
            target_type=self.target_type,
            transforms=train_transform_cityscapes
        )
        
        # Initialize MUAD dataset
        self.muad_dataset = MUAD(
            root=str(self.muad_root),    
            split=split,
            target_type=self.target_type,
            transforms=train_transform_muad,
        )

        self.num_classes = len(self.cityscapes_dataset.classes)
        
        # Total length is the sum of both datasets
        self.length = len(self.cityscapes_dataset) + len(self.muad_dataset)
        

    def __getitem__(self, index: int) -> tuple[tv_tensors.Image, tv_tensors.Mask]:

        # Determine which dataset to use based on the index
        cityscapes_len = len(self.cityscapes_dataset)
        
        if index < cityscapes_len:
            # Get from Cityscapes
            image, target = self.cityscapes_dataset[index]
        else:
            # Get from MUAD
            image, target = self.muad_dataset[index - cityscapes_len]

            target[target == 255] = 21
            
            # Convert MUAD class IDs to Cityscapes class IDs
            # Create a mapping tensor and use it to replace values in target
            mapping = np.array(self.muad_to_cityscapes_id)
            target = mapping[target]
        
        return image, target

    def __len__(self) -> int:
        """The number of samples in the combined dataset."""
        return self.length
 
    @property
    def muad_to_cityscapes_id(self):
        muad_classes = [muad_class.cityscapes_id for muad_class in self.muad_dataset.classes]

        return muad_classes


        
def create_combined_dataset(
    cityscapes_root: str | Path,
    muad_root: str | Path,
    split: Literal["train", "val"],
    muad_version: Literal["small", "full"] = "small",
    download: bool = False
) -> Cityscapes_MUAD:
    """
    Creates and returns a combined Cityscapes and MUAD dataset.

    Args:
        cityscapes_root (str | Path): Root directory of the Cityscapes dataset
        muad_root (str | Path): Root directory of the MUAD dataset
        split (str): The image split to use, 'train' or 'val'
        muad_version (str, optional): The version of MUAD to use. Defaults to 'small'.
        download (bool, optional): Whether to download MUAD if not found. Defaults to False.

    Returns:
        Cityscapes_MUAD: The combined dataset
    """
    return Cityscapes_MUAD(
        cityscapes_root=cityscapes_root,
        muad_root=muad_root,
        split=split,
        muad_version=muad_version,
        download=download
    )

Download the MUAD and Cityscapes datasets and set their root directories below:

CITYSCAPES_ROOT = "datasets/cityscapes"
MUAD_ROOT = "datasets/muad"

train_dataset = Cityscapes_MUAD(cityscapes_root=CITYSCAPES_ROOT, 
                                   muad_root=MUAD_ROOT, 
                                   split="train", 
                                   muad_version="full")
print(len(train_dataset))
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True, num_workers=4)
6395
optimizer = optim.Adam(wrapped_encoder.parameters(), lr=1e-3, weight_decay=1e-5)
min_running_loss = np.inf
model_name = "sam2_neo.pt"
print("Training Neo on anomaly-free dataset ...")

for epoch in range(20):
    running_loss = 0.0

    for i, batch_images in tqdm(enumerate(train_dataloader)):
        batch_images = batch_images[0].to(device)
        optimizer.zero_grad()
        output, vacuity_scores = wrapped_encoder(batch_images, return_risk=True)
        loss = vacuity_scores.mean()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 10 == 9:
            print(f"Epoch {epoch+1}, iter {i+1} \t loss: {running_loss}")
            if running_loss < min_running_loss:
                torch.save(wrapped_encoder.state_dict(), model_name)
                print(f"Loss decreased: {min_running_loss} -> {running_loss}.")
                print(f"Model saved to {model_name}.")

            min_running_loss = min(min_running_loss, running_loss)
            running_loss = 0.0

We can also load the weights directly to skip the training.

import gdown

file_id = "1Ffft8eY4HovpRzxswdyVngga8kUjGG7_"
output_path = "sam2_neo.pt"

gdown.download(f"https://drive.google.com/uc?id={file_id}", output_path, quiet=False)

wrapped_encoder.load_state_dict(torch.load(output_path))
<All keys matched successfully>

Step 3: Initialize Video and Add Prompts for Segmentation

In this step, we prepare the input video frames and user prompts for SAM2 to perform video segmentation with guidance.

We first need to download part of the A2D2 dataset:

  • A2D2 Dataset. We use a short sequence of 151 frames (01800.jpg to 01950.jpg) extracted from the A2D2 “Gaimersheim Camera - Side Right” recording to demonstrate the results; a quick sanity check of the local frame folder is sketched below.
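As a quick, hypothetical sanity check (assuming the frames have been copied into a local a2d2/ folder, the directory used as video_dir later on):

import os

frames = sorted(f for f in os.listdir("a2d2") if f.endswith(".jpg"))
print(len(frames), frames[0], frames[-1])  # expected: 151 01800.jpg 01950.jpg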

Then we give SAM2 prompts to guide it in segmenting specific objects over time. The prompts are given by the string prompt_points; each line defines one segmentation prompt for the SAM2 model, in the following format:

<obj_id> <frame_idx> <x1,y1;x2,y2;...> [<label1,label2,...>]
  • <obj_id>: The unique index identifying the object being segmented.

  • <frame_idx>: The index of the frame at which the prompt is applied.

  • <x1,y1;x2,y2;...>: A semicolon-separated list of 2D coordinates used as prompts. These represent points on the object to guide SAM2’s segmentation. If multiple points are provided, SAM2 will generate a single segmentation mask that covers all specified points at the given time frame.

  • [<label1,label2,...>] (optional): A comma-separated list of binary labels (0 or 1) corresponding to each point:

    • 1: The point should be included in the segmentation mask.

    • 0: The point should be excluded from the segmentation mask.

If the label list is not provided, all points are assumed to have a label of 1.
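To make the format concrete, here is a small standalone helper (hypothetical, mirroring the parsing that load_points_and_frames performs later) applied to one line of prompt_points:

import numpy as np

def parse_prompt_line(line: str):
    """Parse one '<obj_id> <frame_idx> <points> [<labels>]' prompt line."""
    parts = line.strip().split(maxsplit=3)
    obj_id, frame_idx = int(parts[0]), int(parts[1])
    points = np.array(
        [[float(v) for v in pair.split(",")] for pair in parts[2].split(";")],
        dtype=np.float32,
    )
    labels = (
        np.array([int(v) for v in parts[3].split(",")], dtype=np.int32)
        if len(parts) > 3
        else np.ones(len(points), dtype=np.int32)
    )
    return obj_id, frame_idx, points, labels

# Object 10 at frame 30: include the point (750, 750), exclude the point (500, 800).
print(parse_prompt_line("10 30 750,750;500,800 1,0"))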

This labeling mechanism allows for fine control over the segmentation output:

from io import StringIO
prompt_points = """
3 0 1750,650;1850,400
4 0 1100,1100
5 0 1000,750
7 30 1500,800;1750,800
8 30 50,500
10 30 750,750;500,800 1,0
11 30 500,750;750,750 1,0
12 30 500,700
13 60 1000,800
14 60 750,600
16 90 1000,1000
17 90 1000,800;750,800
18 90 250,800;375,900
19 90 400,600;450,700
20 100 1100,800
21 100 600,700
22 100 1100,700
23 120 1400,900
24 120 500,900
25 100 250,800
26 150 500,600;250,400
"""

Below, we load the video frames into SAM2 and add the point prompts.

  • initial_video(...) loads a sorted list of frame filenames from the specified video directory (e.g., A2D2 video frames).

  • load_points_and_frames(...) reads user-defined prompts (points, frame indices, object IDs, and labels) from the prompt_points string. These serve as input hints to guide SAM2 in segmenting specific objects over time.

  • Each set of prompts is associated with a specific frame and object ID.

  • The SAM2 video predictor (predictor) uses these prompts to initialize its internal state and begin tracking objects frame-by-frame.

def initial_video(video_dir):
    frame_names = [
        p for p in os.listdir(video_dir)
        if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]
    ]
    frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))
    return frame_names

def load_points_and_frames(prompt_points):
    f = StringIO(prompt_points.strip())
    data = [line.strip().split(maxsplit=3) for line in f.readlines()]
    ann_obj_ids = [int(line[0]) for line in data]
    ann_frame_idx = [int(line[1]) for line in data]
    points = [
        np.array([[float(x) for x in pair.split(',')] for pair in line[2].split(';')], dtype=np.float32)
        for line in data
    ]

    labels = [
        np.array([int(x) for x in line[3].split(',')], dtype=np.int32) if len(line) > 3 else np.ones(len(point), dtype=np.int32)
        for point, line in zip(points, data)
    ]

    return ann_frame_idx, points, ann_obj_ids, labels

def load_image_by_index(x, folder, start_idx=60):
    filename = f"{start_idx + x:05d}.jpg"
    image_path = os.path.join(folder, filename)
    image = Image.open(image_path)
    image = np.array(image.convert("RGB"))
    return image

video_dir = "a2d2"
frame_names = initial_video(video_dir)
inference_state = predictor.init_state(video_path=video_dir)
ann_frame_idx, points, ann_obj_ids, labels = load_points_and_frames(prompt_points)

for frame_idx, point_set, obj_id, label in zip(ann_frame_idx, points, ann_obj_ids, labels):
    # Add new points for each frame index and object id
    _, out_obj_ids, out_mask_logits = predictor.add_new_points_or_box(
            inference_state=inference_state,
            frame_idx=frame_idx,
            obj_id=obj_id,
            points=point_set,
            labels=label,
        )
frame loading (JPEG): 100%|██████████| 151/151 [00:05<00:00, 27.02it/s]

Step 4: Propagate Segmentation and Extract Vacuity-Based Anomalies

Once prompts are added and SAM2 has initialized its video state, we begin propagating segmentation across frames. During this process, we also compute pixel-wise vacuity scores using the Neo-wrapped image encoder.

What Happens During Propagation:

  • For each frame, SAM2 predicts segmentation masks for previously prompted objects.

  • The input frame is passed through the Neo-wrapped encoder, producing a downsampled vacuity map indicating pixel-level uncertainty.

  • Each segmentation mask is upsampled to match the vacuity map resolution.

  • We compute the average vacuity score within each mask region, giving us an object-wise uncertainty score for every frame.

Visual Output:

  • Each video frame is saved with overlaid segmentation.

  • Red indicates anomalous regions (high vacuity).

  • Blue/white indicates normal segmentation (low vacuity).

This step bridges segmentation with uncertainty estimation, enabling localized anomaly detection based purely on the model’s confidence.

Note: We also track the maximum and minimum vacuity scores observed across the sequence; these observed bounds inform the fixed normalization range (vmin=2.0, vmax=6.0) passed to show_vacuity when colorizing the masks.
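Before walking through the full loop, here is a minimal sketch of the mask-weighted vacuity aggregation in isolation, using random tensors as stand-ins for the real vacuity map and mask logits (the actual shapes depend on the wrapper configuration):

import torch
import torch.nn.functional as F

# Stand-ins: a (1, C, h, w) vacuity map from the wrapped encoder and one (1, H, W) mask logit from SAM2.
vacuity_map = torch.rand(1, 64, 64, 64)
mask_logits = torch.randn(1, 256, 256)

# Upsample both to a common resolution and turn the logits into soft mask weights.
vacuity_up = F.interpolate(vacuity_map, size=(256, 256), mode="bilinear", align_corners=False)
mask = torch.sigmoid(F.interpolate(mask_logits.unsqueeze(0), size=(256, 256), mode="bilinear", align_corners=False))

# Mask-weighted mean vacuity: sum(vacuity * mask) / sum(mask), averaged over channels.
object_vacuity = (vacuity_up * mask).sum(dim=(2, 3)).mean(dim=1) / mask.squeeze(1).sum(dim=(1, 2))
print(object_vacuity.item())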

def show_vacuity(mask, ax, value=0.5, vmin=0.0, vmax=1.0):
    colormap = plt.cm.coolwarm

    normalized_value = (value - vmin) / (vmax - vmin)  # Normalize value to [0, 1]
    normalized_value = np.clip(normalized_value, 0, 1)  # Ensure value is within [0, 1]
    color_solid = colormap(normalized_value)
    color = np.concatenate([color_solid[0:3], np.array([0.6])], axis=0)
    h, w = mask.shape[-2:]
    mask = mask.astype(np.uint8)
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)

video_segments = {}
vacuity_points = {}
max_vacuity = 0.0
min_vacuity = float("inf")
output_dir = "results_a2d2"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
for out_frame_idx, out_obj_ids, out_mask_logits in predictor.propagate_in_video(inference_state):
    video_segments[out_frame_idx] = {
            out_obj_id: (out_mask_logits[i] > 0.0).cpu().numpy()
            for i, out_obj_id in enumerate(out_obj_ids)
    }
    vacuity_points_per_obj = []
    image = load_image_by_index(out_frame_idx, video_dir, 1800)
    inputs = feature_extractor(images=image, return_tensors="pt")
    image_tensor = inputs["pixel_values"].to(device)
    output, vacuity_scores = wrapped_encoder(image_tensor, return_risk=True)
    vacuity_scores_avg_upsampled = F.interpolate(vacuity_scores, size=(256, 256), mode='bilinear', align_corners=False).to(device)
    for i, mask in enumerate(out_mask_logits):
        mask = F.interpolate(mask.unsqueeze(0), size=(256, 256), mode='bilinear', align_corners=False)
        mask = F.sigmoid(mask).to(device)
        if torch.isinf(mask).any():
            print(f"Warning: object {i} mask contains infinite values")
        weighted_sum = (vacuity_scores_avg_upsampled * mask).sum(dim=(2, 3)).mean(dim=1)
        normalization_factor = mask.squeeze(1).sum(dim=(1, 2))
        vacuity_scores_points = (weighted_sum / normalization_factor).detach().cpu().numpy()
        max_vacuity = max(max_vacuity, vacuity_scores_points[0])
        min_vacuity = min(min_vacuity, vacuity_scores_points[0])
        #print(f"Object {i}, Vacuity score: {vacuity_scores_points[0]:.8f}, Max vacuity: {max_vacuity:.8f}, Min vacuity: {min_vacuity:.8f}")
        vacuity_points_per_obj.append(vacuity_scores_points[0])
    vacuity_points[out_frame_idx] = {
            out_obj_id: vacuity_points_per_obj[i]
            for i, out_obj_id in enumerate(out_obj_ids)
    }
    plt.figure(figsize=(6, 4))
    plt.imshow(Image.open(os.path.join(video_dir, frame_names[out_frame_idx])))
    for out_mask, vacuity_point in zip(video_segments[out_frame_idx].values(), vacuity_points[out_frame_idx].values()):
        # Plot the mask with color-coded vacuity
        show_vacuity(out_mask, plt.gca(), vacuity_point, 2.0, 6.0)
    plt.axis('off')
    plt.savefig(f"{output_dir}/vacuity_{out_frame_idx:05d}.png", dpi=300, bbox_inches='tight', pad_inches=0)
    plt.close()
print(f"max_vacuity: {max_vacuity}")
print(f"min_vacuity: {min_vacuity}")
propagate in video: 100%|█████████████████████████████████████████████████████████████████████████| 151/151 [24:40<00:00,  9.80s/it]
max_vacuity: 6.421767711639404
min_vacuity: 2.1555912494659424

Step 5: Visualize Vacuity-Based Anomaly Detection as a Video

To observe how the model’s uncertainty evolves over time, we compile the saved output frames into an animated video. To reduce the file size of the notebook, we resize the frames to a smaller resolution before displaying them via HTML.

import matplotlib.pyplot as plt
import matplotlib.animation as animation
from PIL import Image
from IPython.display import HTML
import os

# Set the path to your image folder
frame_dir = "results_a2d2"
frame_files = sorted([f for f in os.listdir(frame_dir) if f.startswith("vacuity_") and f.endswith(".png")])

# Define target resolution (e.g., 480p)
target_size = (480, 270)

# Load and resize all frames as PIL images
frames = [Image.open(os.path.join(frame_dir, f)).resize(target_size, Image.Resampling.LANCZOS) for f in frame_files]

# Create figure and axis
fig, ax = plt.subplots()
im = ax.imshow(frames[0])
ax.axis('off')

# Animation update function
def update(i):
    im.set_data(frames[i])
    return [im]

# Create animation
ani = animation.FuncAnimation(
    fig, update, frames=len(frames), interval=50, blit=True
)

# Limit embedded animation size in HTML to avoid large notebook files
plt.rcParams['animation.embed_limit'] = 50

# Display the animation inline
HTML(ani.to_jshtml())
[Animation output: vacuity overlays rendered across the A2D2 sequence]

Results and Discussion

The animation above shows a segment from an A2D2 driving scene, specifically in a construction zone. The region includes:

  • Construction structures obstructing part of the road

  • A construction worker standing near or within the driving area

In the visualization, these regions are consistently highlighted in red, indicating high vacuity scores from the Neo-wrapped SAM2 encoder. This suggests that the model correctly identifies these areas as out-of-distribution or anomalous, relative to the normal driving conditions it was trained on.
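As a closing sketch (with a purely illustrative threshold, not taken from the notebook), the per-object scores collected in vacuity_points can be turned into a simple per-frame anomaly report:

# Flag objects whose mask-averaged vacuity exceeds a chosen threshold.
VACUITY_THRESHOLD = 4.0  # illustrative value between the observed min (~2.16) and max (~6.42)

for frame_idx, scores in sorted(vacuity_points.items()):
    anomalous = {obj_id: float(s) for obj_id, s in scores.items() if s > VACUITY_THRESHOLD}
    if anomalous:
        print(f"frame {frame_idx}: anomalous objects {anomalous}")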