Meta’s SAM‑3 runs real‑time GPU inference on Modal via an open‑source GitHub deployment
An open‑source deployment published on GitHub now runs Meta’s SAM‑3 model on Modal’s GPU platform in real time, enabling text‑prompted image and video segmentation via a new API, according to a recent report.
Key Facts
- Key company: Meta
- Also mentioned: Modal
The SAM‑3 inference service runs on Modal’s GPU‑backed compute nodes, delivering real‑time, text‑prompted segmentation for both images and video streams. The open‑source repository posted on GitHub (TheFloatingString/sam3‑on‑modal) shows that the deployment is built with Modal’s Python‑centric CLI, using a `modal_app.py` entry point that exposes two REST endpoints: `/infer_image` and `/infer_video`. Each endpoint accepts a base‑64‑encoded payload (for images) or a session‑based workflow (for video) and returns masks, bounding boxes, and confidence scores in JSON format, as documented in the repo’s README. A health‑check endpoint (`/health_check`) confirms service availability with a simple `{ "status": "healthy", "service": "sam3-inference" }` response, enabling automated monitoring in production environments.
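A monitoring probe against the health‑check endpoint can be sketched with only the Python standard library. The base URL below is a placeholder, since Modal assigns each deployment its own URL; only the response shape comes from the repo’s README.

```python
import json
from urllib import request

# Placeholder URL; Modal generates the real one at deploy time.
BASE_URL = "https://example--sam3-inference.modal.run"

def is_healthy(body: dict) -> bool:
    # Matches the documented {"status": "healthy", "service": "sam3-inference"} shape.
    return body.get("status") == "healthy" and body.get("service") == "sam3-inference"

def check_health(base_url: str = BASE_URL) -> bool:
    # GET /health_check and verify the JSON body.
    with request.urlopen(f"{base_url}/health_check", timeout=10) as resp:
        return is_healthy(json.load(resp))
```

Separating the pure `is_healthy` check from the network call keeps the probe easy to test and to reuse in any monitoring loop.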
The image‑segmentation API follows a straightforward request pattern: a POST to `/infer_image` with an `image_base64` field and a textual `prompt`. Internally, the service loads the SAM‑3 model—Meta’s latest iteration of the Segment Anything Model—via the `sam3.model_builder.build_sam3_image_model` helper, then runs the `Sam3Processor` to bind the image to the model state and apply the text prompt. The response payload includes three parallel arrays (`masks`, `boxes`, `scores`), mirroring the output format of the original SAM architecture. Sample client code in the repository demonstrates both Python (`requests.post`) and cURL usage, confirming that the API can be consumed from any environment that can generate base‑64 image data.
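The request pattern above can be illustrated with a minimal standard‑library client. The payload fields (`image_base64`, `prompt`) and the parallel `masks`/`boxes`/`scores` arrays are documented in the repo; the URL handling and helper names here are illustrative assumptions.

```python
import base64
import json
from urllib import request

def build_image_payload(image_bytes: bytes, prompt: str) -> dict:
    # The endpoint expects a base-64-encoded image plus a free-text prompt.
    return {
        "image_base64": base64.b64encode(image_bytes).decode("ascii"),
        "prompt": prompt,
    }

def infer_image(base_url: str, image_bytes: bytes, prompt: str) -> dict:
    req = request.Request(
        f"{base_url}/infer_image",
        data=json.dumps(build_image_payload(image_bytes, prompt)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The JSON response carries parallel arrays: masks, boxes, scores.
    with request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

Because the payload is plain JSON over HTTP, the same request works identically from cURL or any language that can base‑64‑encode image bytes.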
Video segmentation is handled through a two‑step session protocol. A client first initiates a session by posting `{ "action": "start_session", "video_path": "/path/to/video.mp4" }` to `/infer_video`. The service returns a `session_id`, which the client then uses to add prompts for individual frames via `{ "action": "add_prompt", "session_id": "...", "frame_index": 0, "prompt": "person walking" }`. This design allows asynchronous processing of long video sequences, with each frame’s segmentation results returned in the `outputs` field of the JSON response. The underlying implementation reuses the same `Sam3Processor` logic as the image path, but wraps it in a per‑session state manager that caches intermediate frame embeddings, reducing redundant GPU work across prompts.
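The two‑step session protocol can be sketched as a pair of message builders plus one POST helper. The `action`, `video_path`, `session_id`, `frame_index`, and `prompt` fields mirror the documented protocol; the function names and flow comments are illustrative assumptions.

```python
import json
from urllib import request

def start_session_message(video_path: str) -> dict:
    # Step 1: ask the service to open a segmentation session for a video.
    return {"action": "start_session", "video_path": video_path}

def add_prompt_message(session_id: str, frame_index: int, prompt: str) -> dict:
    # Step 2: attach a text prompt to a specific frame of that session.
    return {
        "action": "add_prompt",
        "session_id": session_id,
        "frame_index": frame_index,
        "prompt": prompt,
    }

def post_infer_video(base_url: str, payload: dict) -> dict:
    req = request.Request(
        f"{base_url}/infer_video",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=120) as resp:
        return json.load(resp)

# Typical flow against a live deployment (not executed here):
#   session = post_infer_video(url, start_session_message("/path/to/video.mp4"))
#   result = post_infer_video(
#       url, add_prompt_message(session["session_id"], 0, "person walking"))
#   # per-frame segmentation results arrive in result["outputs"]
```

Keeping the session ID client‑side lets prompts for different frames be submitted asynchronously while the server reuses cached frame embeddings.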
From an infrastructure perspective, Modal’s platform abstracts away the provisioning of GPU resources, letting developers spin up a fully managed inference service with a single command (`uv run modal deploy modal_app.py`). The repository’s `pyproject.toml` lists dependencies such as `torch` and `transformers`; pulling the SAM‑3 weights from the model hub additionally requires a Hugging Face token. By storing the token as a Modal secret (`modal secret create huggingface HF_TOKEN=...`), the deployment keeps credentials out of source code while still granting the runtime access to the model assets. Using `uv sync` from the uv package manager ensures reproducible environments across local development and cloud execution, a pattern that aligns with Meta’s broader push to democratize high‑performance AI workloads.
The release arrives amid a wave of Meta AI tooling, including the Code Llama launch reported by TechCrunch. By exposing SAM‑3 via a low‑latency API, the deployment lets developers integrate sophisticated segmentation into downstream products, ranging from content moderation pipelines to interactive media editors, without managing their own GPU clusters. The open‑source nature of the implementation also invites community contributions, potentially extending the service to support additional modalities such as point‑cloud or multi‑view inputs. As the codebase matures, the combination of Modal’s serverless GPU execution model and Meta’s cutting‑edge segmentation research could lower the barrier to real‑time visual understanding for a broad swath of enterprise and consumer applications.
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.