Common questions about the COSIMO video format, the verification
chain, use cases for physical AI, and what the v1.0 benchmark does
and does not prove. Honest, short answers. If yours is missing,
open an issue at
github.com/COSIMOAI/validation.
What does this validation NOT prove?
Layer 1 verifies what happened on a specific benchmark, on specific
hardware. It does not verify what happens elsewhere. Specifically, this
validation does not prove:
The kernel works on novel data. The benchmark uses
UCF-101. Performance on your data is a Layer 2 audit question.
The kernel is fast on production input. Latency on
your pipeline depends on your pipeline. That is a Layer 3 black-box
pilot question.
The technique generalizes beyond UCF-101.
Generalization requires testing on different datasets. Layer 2 or
Layer 3 covers this.
The kernel itself is correct internally. Reading
the source is Layer 4 territory.
Higher layers close these gaps. Pick the layer that matches what you
need to be convinced of. Tier-up paths are linked from the
validation overview.
How do I know the test verification record hasn't been tampered with?
Three independent trust roots cover this. Defeating one is hard.
Defeating all three at once is the threat model Layer 1 cannot defend
against.
Sigstore Rekor. Every signed bundle gets an entry
in a public, append-only Merkle-tree transparency log. Tamper after
the fact and the entry's inclusion proof no longer matches.
OpenTimestamps Bitcoin anchor. The bundle hash is
committed to the Bitcoin blockchain. Tampering after the timestamp
would require rewriting Bitcoin's history.
SHA-256 chain. Every metric file is hashed into
the per-result verification record. The record is hashed into the
manifest. The manifest is signed. Change any byte anywhere and one
of these hashes no longer matches.
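The SHA-256 chain can be checked with nothing but a standard hashing library. A minimal sketch, using a toy manifest and hypothetical file names (the real bundle layout may differ):

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 fingerprint of a byte string."""
    return hashlib.sha256(data).hexdigest()

def verify_manifest(manifest: dict, files: dict) -> list:
    """Compare each file's recomputed hash against its published hash.

    `manifest` maps filename -> expected hex digest; `files` maps
    filename -> raw bytes. Returns the filenames that fail to match.
    """
    return [name for name, expected in manifest.items()
            if sha256_hex(files[name]) != expected]

# Toy bundle: one metric file, hashed into a manifest.
files = {"metrics.json": json.dumps({"accuracy": 0.912}).encode()}
manifest = {"metrics.json": sha256_hex(files["metrics.json"])}

assert verify_manifest(manifest, files) == []    # intact bundle passes

files["metrics.json"] += b" "                    # change a single byte
assert verify_manifest(manifest, files) == ["metrics.json"]
```

The same walk, applied file by file, is what the browser verifier does against the published manifest.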
If the verifier fails on your machine, that is the
system working as intended. A failed step is exactly the signal Layer 1
is designed to surface. A real reviewer running the same checks in a
compromised environment would see the same failure. Re-download the
bundle from a different network path, or use the
cosign verify-blob command
directly. If both fail, write us at
validation@cosimo.ai with
the full output and we will investigate publicly.
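For the direct cosign path, a sketch of the keyless verification call (flags from cosign 2.x; the file names are placeholders, so substitute the paths from your downloaded bundle):

```shell
# Keyless verification of the bundle signature against Sigstore.
# File names are placeholders; use the paths shipped in the bundle.
cosign verify-blob receipts.tar.gz \
  --bundle receipts.tar.gz.sigstore.json \
  --certificate-identity validation@cosimo.ai \
  --certificate-oidc-issuer https://accounts.google.com
```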
How can I trust that the numbers in the whitepaper aren't faked?
The verification system is built so that no single piece, including
any piece run by COSIMO, is trusted on its own. Four independent
trust roots witness the numbers (three active today, a fourth
scheduled for v1.1.0), and each would have to be compromised
separately to fake them.
Inside the bundle. File fingerprints any reader can recompute.
Every file in the verification record is hashed using SHA-256, the
same fingerprinting algorithm Bitcoin uses. The browser verifier
walks the manifest, fetches each file, recomputes the hash, and
compares it to the published hash. Change a single byte in any file
and the comparison fails. The page shows expected hash and actual
hash side by side so a reviewer sees exactly where the tampering
happened.
Sigstore. A signed timestamp, logged publicly, by an
independent foundation. Once finalized, the bundle is signed
using a short-lived ten-minute certificate issued by Sigstore, an
open-source signing service run by the Linux Foundation. The
certificate is bound to a Google Workspace identity at
validation@cosimo.ai. Every signature is logged to
Rekor, a public, append-only transparency log. Inserting a fake
record into the past would require rebuilding the chain forward from
that point, which is visible to every party watching the log.
OpenTimestamps and Bitcoin. An independent witness to the moment.
Independently of Sigstore, the bundle hash is published to
OpenTimestamps and embedded into the Bitcoin blockchain. Once a
Bitcoin block contains the hash, rewriting it would require redoing
every block of Bitcoin proof-of-work from that point forward. That
costs billions of dollars in compute and is visible to every Bitcoin
node within minutes.
NVIDIA hardware attestation. Scheduled for v1.1.0.
The current canonical run was on standard NVIDIA L4 hardware in
Google Cloud. The v1.1.0 release re-runs the canonical 5x40
protocol inside an H100 Confidential Compute enclave, where
NVIDIA's attestation service signs a statement: this binary, in
this container, on this specific H100, with this CUDA version
and driver, produced these outputs. That statement is signed by
an NVIDIA hardware key that only NVIDIA controls. Until v1.1.0
ships, the three other trust roots above carry the verification
chain.
To fake the numbers and bypass the three currently active trust
roots simultaneously, COSIMO would have to compromise Google's
identity layer, the Sigstore Foundation's transparency log, and
the Bitcoin network's proof-of-work history, in coordination.
The cost of that attack is many orders of magnitude greater than
the cost of running the benchmark honestly. The v1.1.0 release
adds NVIDIA's hardware attestation as a fourth trust root,
further raising the cost of fabrication. The verification system
is built around that asymmetry. The
methodology page walks the cryptographic
chain in detail.
What the system proves: the runs happened, on the hardware
described, producing the numbers reported, with bundle integrity
intact across time. What it does not prove: that the technique
generalizes beyond this benchmark, or that the kernel is internally
correct. Those questions are answered by Layers 2, 3, and 4.
How do I know the runs actually used the kernel?
The TEE attestation chain (from v1.1.0 onward). Each run's
attestation.json is signed by NVIDIA's NRAS (NVIDIA Remote
Attestation Service) against a
specific H100 in confidential-compute mode. It binds to the exact
container image digest, CUDA toolkit version, driver version, and
command-line invocation that produced the metrics. You can verify
this with NVIDIA's own nv-attestation-cli against
NVIDIA's public attestation roots. No COSIMO software in the
verification path.
What this does not prove, however, is that the
kernel inside the container is the COSIMO DST kernel rather
than some other kernel claiming to be it. Closing that gap is what
Layer 2 (independent audit)
exists for. An auditor with NDA-supervised access to the encoder
verifies that the binary running in the TEE is the canonical kernel.
Can the metrics be reproduced on different hardware?
Reproduction across hardware is what Layer 2 audits. Layer 1 proves
these specific runs happened on this specific hardware and
the numbers match the whitepaper. To get "the technique works on
different hardware" or "the technique works on different data,"
commission a Layer 2 audit. The tier-up paths are linked from the
verification page.
What is Geometric Video?
Geometric Video is a new kind of video primitive, built for
physical AI. Where legacy video formats (H.264, HEVC, MP4) encode
dense grids of pixels designed for human eyes, Geometric Video
encodes the geometry of motion directly: a sparse coordinate list
of (x, y, t, Δ) tuples representing only the physically
active voxels in a scene. Geometry, not pixels. Motion, not
texture. Roughly 98% of the legacy pixel matrix gets dropped
during encoding because it does not carry signal that perception
models can use.
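The shape of that representation can be illustrated with a toy frame-differencing sketch. This is not the DST kernel, whose internals are proprietary; the differencing scheme and threshold below are illustrative assumptions only:

```python
import numpy as np

def toy_sparse_encode(frames: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Toy illustration of a sparse geometric encoding (NOT the DST kernel).

    frames: (T, H, W) grayscale array in [0, 1].
    Returns an (N, 4) array of (x, y, t, delta) rows, keeping only
    voxels whose frame-to-frame change exceeds `threshold`.
    """
    deltas = np.abs(np.diff(frames, axis=0))   # (T-1, H, W) motion magnitude
    t, y, x = np.nonzero(deltas > threshold)   # surviving "active" voxels
    return np.stack([x, y, t, deltas[t, y, x]], axis=1).astype(np.float32)

# A mostly static 8-frame clip with a single diagonal "mover".
clip = np.zeros((8, 32, 32))
for t in range(1, 8):
    clip[t, t, t] = 1.0

sgm = toy_sparse_encode(clip)
print(sgm.shape)   # (13, 4): 13 active voxels out of 8*32*32 = 8192 dense
```

The dense tensor has 8,192 voxels; the sparse list keeps only the handful that moved, which is the "drop everything that carries no motion signal" idea in miniature.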
The output of the COSIMO Deterministic Structural Transform (DST)
kernel is the Sparse Geometric Matrix (SGM), the actual data
format the model receives. Geometric Video is the category. SGM
is the file format. DST is the kernel that produces it. The
whitepaper documents the math, the v1.0 benchmark proves the
format works, and the Layer 1 verification record at
cosimo.ai/validation lets any reviewer
confirm the numbers without trusting COSIMO.
Most AI infrastructure work tries to make the neural network
smarter. Bigger networks, more pretrained weights, more compute.
COSIMO operates at a different layer. Before the model sees a
single frame, the COSIMO Physics Engine has stripped the noise
from the video and isolated the geometry of motion. The brain
reasons better when the eyes work. Geometric Video fixes the
eyes.
How is COSIMO different from NVIDIA Cosmos? They sound similar.
The names are similar; the technologies operate at different layers
of the AI stack. COSIMO and NVIDIA Cosmos are complementary, not
competitive.
NVIDIA Cosmos is a platform of world foundation models. It generates
synthetic environments, simulates physical interactions, and
provides pretrained models for training autonomous systems. It
operates at the model layer. Cosmos is, in the analogy, the brain.
It reasons about a virtual world.
COSIMO is a video format and the deterministic kernel that produces
it. The Sparse Geometric Matrix is a structured signal that encodes
the geometry of motion directly into the video stream, before any
model sees it. It operates at the data representation layer,
upstream of any perception model. COSIMO is, in the analogy, the
eye and the optic nerve. It makes the world clear to whatever model
consumes it next.
A perception stack can use both. A world foundation model trained
on dense legacy video has to reconstruct geometric structure from
texture every time it ingests video. The same model trained on
COSIMO SGM receives the geometry directly, with most of the texture
removed before training begins. The compute saved upstream is
compute the model can spend on reasoning.
The two answer different questions. Cosmos answers "what does the
world look like, and how should an agent behave in it?" COSIMO
answers "how should the world be represented to a model that has
to learn from it?"
What goes wrong when AI tries to understand video today, and how does COSIMO address it?
Four well-documented failure modes of legacy video as a substrate
for machine perception. COSIMO addresses each at the format level,
before any model is trained.
Compression noise designed for human eyes, not models.
H.264 and HEVC compress visual texture for streaming to human
viewers. The compression artifacts (macroblocking, color banding,
motion-estimation errors) are tolerable to human visual perception
but introduce noise into the input that perception models have to
learn to ignore. COSIMO strips frequency-domain texture and emits
geometric structure directly. The artifacts that bias dense models
are not present in the SGM.
Capacity wasted on memorizing static backgrounds.
A dense video frame is roughly 98% static or near-static across
consecutive frames in motion-dominated content. A neural network
trained on dense pixel matrices spends most of its representational
capacity learning the static content, regardless of whether the
task requires it. The 4.64× parameter reduction in this
benchmark is a direct measurement of how much capacity was
previously absorbed by background memorization. COSIMO's Zero-Motion
Gating removes static voxels at the kernel level, so the model
never has to learn to ignore them.
Floating-point drift across deployment hardware.
Legacy video preprocessing chains run in floating point downstream
of integer-deterministic decode. Bilinear resampling, color-space
conversion, augmentation, and normalization all behave slightly
differently across CPUs, GPUs, CUDA versions, and driver releases.
The model receives subtly different inputs depending on where the
pipeline runs, which complicates cross-cluster reproducibility and
turns model debugging into a multi-week disambiguation exercise.
COSIMO's kernel is deterministic by construction. The same source
video produces a bit-exact SGM regardless of where the pipeline
runs.
Edge deployment infeasibility for high-dimensional video models.
Dense 3D CNNs require server-grade GPUs to hold the working set in
memory. The 27× collapse in inference VRAM (2.18 GB to 77.6
MiB) is what allows the same perception task to run on a $249
Jetson Orin Nano edge chip rather than a $2,500 server-grade GPU.
Edge AV and robotics deployments that were previously gated on
datacenter-class compute become feasible.
These are not aspirational. Each failure mode is empirically
anchored in the v1.0 benchmark. Section 3 of the whitepaper reports
the corresponding measurements; the verification record at
cosimo.ai/validation lets a reviewer
verify each one independently.
What are the use cases for COSIMO?
COSIMO is built for physical AI. Not a single application, but the
substrate underneath every application in the category. Cars.
Humanoid robots. Drones. Industrial robots. Surveillance fleets.
Hyperscaler video pipelines. Any system that ingests video, trains
models on it, and acts in real time on real hardware.
COSIMO improves all three stages of that loop. Encode. Train.
Perform. The per-stage numbers are in the
next FAQ entry. The verticals below are
where those numbers compound into deployable products.
Autonomous vehicles. Perception stacks running on
in-vehicle silicon trade off accuracy, latency, and
bill-of-materials cost continuously. Dense video forces a choice.
Trunk-mounted server-grade GPUs (expensive, heat-constrained,
weight-constrained). Or aggressive model compression (accuracy
loss). SGM collapses the inference working set to 77.6 MiB at
15.86 ms p50 latency, batch-invariant. A 100,000-vehicle fleet
replaces trunk GPUs with $249 ARM-class edge chips. The fleet
saves roughly $2,700 per vehicle in upfront hardware and $159.7M
per year in cellular uplink. Training stability pulls
time-to-market forward by six months.
Humanoid and industrial robotics. Robotic
perception lives inside a tighter envelope. A humanoid robot's
onboard compute is closer to a phone's than a car's. The
perception stack shares that envelope with control, planning, and
audio. SGM's sub-1W encode and 77.6 MiB inference working set fit
a thermal envelope a robot dissipates passively. The same numbers
apply to industrial inspection, manipulation, and any robotic task
where real-time vision runs inside a tight power and weight
budget.
Hyperscaler video ingestion. Cloud providers
ingest video at scale and store it as dense pixel matrices today.
SGM's 3.12× compression and 27× inference VRAM
reduction cut ingestion costs across the fleet. The savings model
is the headline dollar figure. See the savings
model entry and cosimo.ai/savings
for the math.
Adjacent applications. Drones. Surveillance.
Sports analytics. Industrial inspection. Any application where the
cost of perception is a binding constraint on what the system can
do. The thermal envelope is sub-1W. The inference working set is
77.6 MiB. The latency is 15.86 ms. None of those numbers are
vertical-specific.
The common thread is physical AI. Systems that perceive and act in
the real world, on real hardware, in real time. COSIMO is not a
point solution for any one of those verticals. It is the substrate
they all run on. Static-scene tasks (text recognition,
fine-grained appearance classification, document analysis) are
outside the scope of v1.0 and should continue to use dense video.
How does COSIMO improve every phase of the physical-AI pipeline?
In physical AI, video is the input. Cars, robots, drones,
hyperscaler training corpora, surveillance, industrial inspection.
They all start with video and end with a decision. The format of
that input matters at every stage between the camera and the
decision. COSIMO is built to be the superior substrate at all
three stages: encode, train, perform.
Encode. The DST kernel runs in 1.17 ms per frame
at less than one watt on a five-year-old MacBook Pro. Fixed-point
integer arithmetic, stateless, no learned weights. There is
nothing to retrain, no model drift, no GPU dependency for
encoding. The output (Sparse Geometric Matrix) is 3.12×
smaller on disk than the dense tensor it replaces. Encoding is
deterministic across hardware: same input, same SGM, bit-exact.
The encoding step that legacy pipelines pay in floating-point
variability and CPU-to-GPU transfer overhead, COSIMO pays once,
on commodity silicon, and forwards a compact deterministic
representation downstream.
Train. Models trained on SGM use 78.5% fewer
parameters than the dense baseline (33.15M to 7.14M) and reach
+12.4 percentage points higher median accuracy. Peak training
VRAM drops 2.40× (5.23 GB to 2.18 GB), which means a single
GPU can hold larger batches or larger models within the same
budget. Cross-seed variance collapses 3× (σ = 0.017
versus 0.052). Training is stable enough to debug like source
code, which compresses the multi-week "is this a code regression
or a bad seed" cycle that production ML teams currently absorb.
Smaller, faster, more stable training, on the same hardware.
Perform. Inference VRAM collapses 27×
(2.18 GB to 77.6 MiB). The same perception task that previously
required a $2,500 server-grade GPU can now run on a $249 Jetson
Orin Nano edge chip. Per-clip latency is 15.86 ms p50,
batch-invariant. A single autonomous vehicle gets the same speed
as a batched datacenter workload, so the same model deploys
identically at the edge and in the cloud. The thermal envelope
(sub-1W encode, edge-class inference) fits inside a phone, a
drone, or a robot. Deployment moves from "datacenter required"
to "wherever the camera is."
COSIMO is not a point optimization for a single stage. It is a
substrate that compounds across the pipeline. The encoder is
leaner. The trained model is smaller and more stable. Inference
is faster, smaller, and edge-deployable. Each stage's improvement
makes the next stage's job easier. That is what makes the input
representation the highest-leverage place to optimize a
physical-AI stack.
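The headline ratios above can be rechecked directly from the reported raw numbers. The one unit assumption here is decimal GB and binary MiB for inference VRAM, which is the convention that reproduces the published 27× figure:

```python
# Reported raw measurements from the benchmark write-up.
dense_params, sgm_params = 33.15e6, 7.14e6
train_vram_dense_gb, train_vram_sgm_gb = 5.23, 2.18
infer_vram_dense_bytes = 2.18e9             # 2.18 GB (decimal)
infer_vram_sgm_bytes = 77.6 * 1024**2       # 77.6 MiB (binary)

print(round(dense_params / sgm_params, 2))                  # 4.64x fewer parameters
print(round((1 - sgm_params / dense_params) * 100, 1))      # 78.5% reduction
print(round(train_vram_dense_gb / train_vram_sgm_gb, 2))    # 2.4x training VRAM
print(round(infer_vram_dense_bytes / infer_vram_sgm_bytes)) # 27x inference VRAM
```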
Why UCF-101? And why a 5-class subset?
UCF-101 is a standard reference benchmark for action recognition,
with established baselines and a publicly available dataset.
Picking a recognized benchmark constrains the ability to
cherry-pick favorable evaluation conditions. Picking a
kinematic-dominant 5-class subset (Fencing, Punch, PullUps, TaiChi,
JugglingBalls) isolates motion as the discriminating signal, which
is the strength SGM is built around.
Section 3.6 of the whitepaper makes the scope explicit. The results
do not extrapolate to static-scene classification, fine-grained
appearance recognition, or crowd-density tasks. A broader
cross-dataset evaluation is v2.0 work.
Choosing a benchmark where SGM was expected to perform poorly would
have been bad-faith methodology. Choosing a benchmark where it was
expected to perform well, and being explicit about the scope of
the result, is the right tradeoff for a v1.0 release. A reviewer
concerned about dataset selection should ask the same question of
any benchmark and look at whether the scope is honestly drawn.
When does SGM perform worse than dense video?
Static-scene tasks. The Zero-Motion Gating step in the kernel
removes voxels that do not register motion above a per-frame
threshold. When motion is the signal (action recognition, AV
perception, robotics, surveillance), this is exactly the work the
kernel is supposed to do. When motion is not the signal (text
recognition, static product classification, fine-grained appearance
recognition like distinguishing dog breeds), gating destroys the
discriminating input and the model has nothing to learn from. SGM
is not a universal video format. It is a video format for tasks
where the geometry of motion carries the signal.
Other conditions where SGM is expected to perform worse than dense
video, or no better, are flagged in section 3.6 of the whitepaper.
Very low-light video, where the structural-differential signal
becomes noise-dominated. Heavily compressed source video, where
decode artifacts feed into the kernel. Crowd-density scenarios,
where individual motion geometry is dominated by aggregate flow.
The kernel is not magic. It removes pixels that do not contribute
to motion, and that is exactly the wrong move when those pixels are
what the task needs.
Why does the kernel need to be deterministic?
A trained network does tolerate small numerical perturbations. The
determinism contract is not for the model. It is for the
verification chain, the engineering pipeline, and the contrast
against legacy video.
The verification record binds a published number to a specific run.
Two runs of the same input must produce the same SGM artifact for
the manifest hash chain to verify. Without determinism, every
published number becomes "this number, on that GPU, at that moment,
take our word for it."
Legacy video preprocessing is necessarily nondeterministic across
hardware. H.264 and HEVC decode is largely integer-deterministic,
but everything downstream is not. Bilinear resampling, color-space
conversion, tensor normalization, augmentation, and mixed-precision
casting all run in floating point. Different CPUs, GPUs, CUDA
versions, and driver releases produce subtly different inputs to the
model. Per-pixel the differences are small. Across billions of
pixels and trillions of operations, they shift training trajectories
and complicate cross-device deployment. Every ML team running a
legacy video pipeline pays this tax somewhere, usually as the cost
of debugging a model that converges on the training cluster and
underperforms on the inference fleet.
COSIMO's kernel is deterministic by construction. Same input video,
same SGM, bit-exact across compatible hardware. The model sees
identical input regardless of where the pipeline runs. Legacy video
cannot offer that, even if every team using it tried, because the
floating-point conversion machinery is a property of the format
rather than an oversight on the implementation side.
Two further consequences. ML teams gain a clean attribution boundary
between data-pipeline drift and code regression, which compresses
debug cycles. AV and humanoid-robotics programs that need to certify
a perception stack benefit from input-side reproducibility
regardless of model robustness, because the certification target is
the pipeline.
What is Δ in the Sparse Geometric Matrix?
Δ is a single float32 scalar per surviving voxel. The model
ingests an (N, 4) tensor where N caps at 8192 and the four columns
are (x, y, t, Δ). At the kernel's output, the (x, y, t)
coordinates are normalized to [0, 1] by the spatial dimensions, and
Δ is L1-normalized across the active set so the per-clip total
is N. Auditors can confirm this directly from the open-source
dataloader at src/data/dataset_sgm.py, or from the
public LMDB encodings, which ship the (N, 4) tensor pre-sorted by
Δ descending. What the kernel does to produce Δ in the
first place is the proprietary part. The format is not. The
methodology page documents what is
externally measurable in the SGM and what is not.
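A sketch of the normalization described above, as it might appear in a dataloader. The canonical implementation is src/data/dataset_sgm.py; the exact scaling of t and the sign handling of Δ are assumptions here, not quotations of that file:

```python
import numpy as np

def normalize_sgm(raw: np.ndarray, height: int, width: int, n_frames: int) -> np.ndarray:
    """Normalize an (N, 4) tensor of (x, y, t, delta) rows.

    Coordinates are scaled to [0, 1] by the clip dimensions; delta is
    L1-normalized across the active set so the per-clip total equals N.
    """
    out = raw.astype(np.float32).copy()
    out[:, 0] /= width
    out[:, 1] /= height
    out[:, 2] /= n_frames           # t scaling is an assumption
    n = out.shape[0]
    out[:, 3] *= n / np.abs(out[:, 3]).sum()   # L1 total becomes N
    return out

raw = np.array([[10, 20, 3, 0.5],
                [31, 5, 7, 1.5]])
sgm = normalize_sgm(raw, height=32, width=32, n_frames=8)
print(sgm[:, 3].sum())   # 2.0: per-clip delta total equals N = 2
```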
Doesn't a panning camera defeat Zero-Motion Gating?
Yes, on raw input. The canonical benchmark filtered for fixed-camera
clips precisely to isolate the motion-recognition signal from
camera ego-motion. The fixed-camera filter (mean global optical
flow at or below 1.5 pixels per frame) is documented in the
protocol and applied during data preprocessing. The drop count from
this filter is reported in the canonical results so a reviewer can
see how aggressive the filter is.
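The filter criterion reduces to a small predicate on the estimated flow field. The optical-flow estimator itself (Farnebäck, RAFT, or otherwise) is not specified by the protocol text quoted here and is left out of the sketch:

```python
import numpy as np

def is_fixed_camera(flow: np.ndarray, max_mean_flow: float = 1.5) -> bool:
    """flow: (T-1, H, W, 2) per-pixel optical flow between consecutive frames.

    A clip passes the fixed-camera filter when mean global flow
    magnitude is at or below `max_mean_flow` pixels per frame.
    """
    magnitudes = np.linalg.norm(flow, axis=-1)   # (T-1, H, W)
    return float(magnitudes.mean()) <= max_mean_flow

static = np.full((7, 16, 16, 2), 0.3)    # gentle jitter, ~0.42 px mean flow
panning = np.full((7, 16, 16, 2), 3.0)   # global pan, ~4.24 px mean flow
print(is_fixed_camera(static), is_fixed_camera(panning))   # True False
```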
For real-world AV and robotics deployments where cameras are
constantly moving, ego-motion compensation is a separate
engineering step. The standard solutions (visual odometry,
homography estimation, IMU-driven stabilization) sit upstream of
the SGM kernel and feed it stabilized input. Integrating ego-motion
compensation directly into the production kernel is on the v2.0
roadmap.
For the current benchmark, the question of "does the SGM
representation help motion-recognition tasks" is isolated cleanly
by holding camera motion fixed. The question of "does SGM help when
the camera is also moving" is a separate experiment, gated on the
ego-motion compensation work. Both are real questions; v1.0 answers
the first one explicitly and bounds the second one as future work.
How much of the gain comes from a smaller model versus the preprocessing?
Right question, and the public paper does not yet cleanly isolate
the two effects. Track A (dense baseline) is a from-scratch
ResNet3D-18 with 33.15M parameters. Track B is a SparseResNet3D with
7.14M parameters built on SpConv. Track B is materially smaller.
Section 3.6 of the whitepaper labels this as parameter asymmetry and
acknowledges that the comparison should not be characterized as a
parameter-matched architecture study.
The control that would isolate representation from architecture is a
re-densified ablation: render the SGM back to a dense (1, T, H, W)
tensor by scattering Δ values onto a zero grid, then feed that
input into the same dense ResNet3D-18 used as Track A. The current
paper does not include that result. We treat it as the
highest-priority follow-up control for v1.1.
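The re-densification control itself is a straightforward scatter. A numpy sketch, under the assumption of integer (x, y, t) coordinates in the stored SGM:

```python
import numpy as np

def redensify(sgm: np.ndarray, n_frames: int, height: int, width: int) -> np.ndarray:
    """Scatter (N, 4) rows of (x, y, t, delta) onto a zero (1, T, H, W) grid.

    The result has the dense input shape, so it can be fed to the same
    dense architecture as Track A, isolating representation from model size.
    """
    dense = np.zeros((1, n_frames, height, width), dtype=np.float32)
    x, y, t = sgm[:, 0].astype(int), sgm[:, 1].astype(int), sgm[:, 2].astype(int)
    dense[0, t, y, x] = sgm[:, 3]
    return dense

sgm = np.array([[3, 4, 0, 0.7],
                [10, 2, 5, 1.3]])
grid = redensify(sgm, n_frames=8, height=16, width=16)
print(grid.shape, grid.sum())   # (1, 8, 16, 16) 2.0
```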
Two pieces of indirect evidence point at the preprocessing carrying
real weight on its own. Cross-seed variance collapses 3× under
SGM (σ = 0.017 versus 0.052). Smaller networks do not
generally exhibit tighter cross-seed clustering, so capacity
reduction is unlikely to be the main cause. The lowest-performing
SGM seed (78.7%) outperforms four of five dense baseline seeds,
which is hard to attribute to overfitting differences alone. Neither
is a substitute for the direct ablation, which is why we have
committed to running it.
Can I run training and evaluation pipelines on my own changes?
For the network architectures, yes. The PyTorch training pipeline is
fully open-sourced. The repository contains the SparseResNet3D, the
dense 3D-CNN baseline, and the canonical evaluation loops.
Pre-computed SGM LMDB encodings are public. A reviewer can swap the
dense or sparse network architecture, run the full 5-seed ×
40-epoch protocol against the canonical SGM artifacts, and reproduce
the reported numbers on their own NVIDIA L4 hardware. There is a
dvc.yaml and a run_all.sh if a one-click
flow is preferred.
For the kernel itself, no public path. Changes to the production C
kernel are gated behind the proprietary boundary and ship under
sealed black-box terms. Evaluating kernel changes is the Layer 3
path described below.
Can SGM run as a sidecar alongside an existing legacy video pipeline?
Yes. The encoder cost is small enough that running SGM in parallel
with a legacy dense pipeline has effectively no impact on the
legacy pipeline's compute budget. The 1.17 ms per-frame encode at
less than one watt, measured on a five-year-old MacBook Pro, sits
below the noise floor of any production hyperscaler ingestion path
or AV/robotics perception stack. The encoder is also stateless and
has no learned weights. There is nothing to retrain, nothing to
monitor for drift, and no encoder-side state to maintain.
What sidecar deployment looks like in practice. The same source
video arrives at the ingest pipeline. The pipeline branches into
two paths: the legacy decode-and-store path (H.264 or HEVC into
dense tensor), and the SGM path (decode plus DST kernel into
sparse coordinate list). Both derivatives are stored. Existing
dense models continue to consume the dense tensor with no change.
New SGM-based models consume the sparse coordinate list. Both
pipelines run on the same source, in parallel, on the same
hardware.
This removes the "rip and replace" risk that usually blocks new
perception infrastructure. A hyperscaler does not have to commit
to switching off the dense pipeline before validating SGM at
production scale. An AV operator does not have to bet a vehicle
program on an unfamiliar representation. Both can run SGM as a
shadow path against their existing workload and continue shipping
the dense-pipeline output to downstream consumers until SGM has
been validated on real data, in real conditions, at real scale.
Migration becomes a switching decision rather than a cutover
decision.
Why aren't the SGM artifacts public?
The Sparse Geometric Matrix (SGM) format and the encoder that produces
it are the proprietary core of COSIMO. Releasing SGM files publicly
would expose the input contract of the kernel and is incompatible with
the commercial licensing model.
Layer 2 auditors get supervised access to the encoder under NDA. That
is how the "novel data" gap gets closed without making the IP public.
Layer 4 customers receive full kernel integration under commercial
terms.
Why no peer review?
A pre-print is on the timeline. It is not a precondition for Layer 1:
cryptographic verification of metrics is a stronger and more
reproducible signal than peer review of methodology. Peer review and
verifiable receipts answer different questions, and we want both.
The arXiv submission ships alongside the v1.0.0 receipts bundle.
What if NVIDIA's attestation service is compromised?
Two answers.
First. It is mathematically possible but
operationally extreme. NVIDIA's NRAS roots are widely-deployed
infrastructure with independent monitoring and a vendor whose
business depends on the integrity of confidential-compute
attestation.
Second. Even if NRAS were compromised, you have two
independent trust roots in the bundle (Sigstore Rekor and
OpenTimestamps), plus a fallback verification path via Azure MAA or
AWS Nitro for runs published on those platforms. Compromising all
three at once is the threat model Layer 1 cannot defend against.
Nothing short of an in-person Layer 2 audit can.
Why no open-source ingestion pipeline?
The consumer-side ingestion pipeline that prepares data for the
kernel encodes the input contract. Release the pipeline and you've
effectively published part of the kernel's interface. We keep it
proprietary for the same reason we keep the kernel proprietary.
The Layer 2 audit path runs the kernel under supervision, which closes
the "did the pipeline behave as claimed" gap without exposing the
contract.
What does it mean that the kernel is closed-source?
The COSIMO DST kernel and SGM generator ship under a commercial
license at Layer 4. Layers 1, 2, and 3 each give different forms of
access without releasing source:
Layer 1. Read-only signed test verification record.
Layer 2. Auditor runs the sealed kernel under NDA on customer data.
Layer 3. Black-box deployment in your environment for broader testing.
Layer 4. Full source under commercial license.
Pick the layer that matches what you need to be convinced of. Tier-up
paths are linked from the verification page's
footer.
How was the $8.25 to $10.5 billion per hyperscaler savings figure derived? Can I check the math?
The savings model is a multiplication of three published anchors.
The empirical 3.12× compression ratio measured in this
benchmark, applied to ingestion volume estimates for a Tier-1
hyperscaler. Public NVIDIA H100 capital expenditure rates and
current storage and bandwidth pricing. The $400M power-and-cooling
figure, derived from public hyperscaler PUE data and current
power-purchase costs in major US datacenter regions.
The interactive savings calculator at
cosimo.ai/savings exposes the inputs
as adjustable parameters. A reviewer can substitute their own
ingestion volume estimate, their own GPU pricing, or their own
power costs, and see how the model responds. The $8.25 to $10.5B
per year range is bounded by reasonable choices across those input
parameters. A reviewer pushing the model into worst-case territory
should adjust the parameters and report what they find. The number
is a model output, not a quotation. The model is open.
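An illustrative skeleton of that three-anchor multiplication. Every number and parameter name below is a placeholder assumption for demonstration; the real model and its inputs live in the calculator at cosimo.ai/savings:

```python
def annual_savings(ingest_pb_per_year: float,
                   storage_cost_per_pb: float,
                   compression_ratio: float,
                   gpu_capex: float,
                   vram_reduction: float,
                   power_cooling: float) -> float:
    """Toy three-anchor savings model (placeholder structure, not the
    real calculator): storage avoided by compression, GPU capex avoided
    by the smaller inference working set, plus a power-and-cooling term."""
    storage_saved = ingest_pb_per_year * storage_cost_per_pb * (1 - 1 / compression_ratio)
    gpu_saved = gpu_capex * (1 - 1 / vram_reduction)
    return storage_saved + gpu_saved + power_cooling

# Placeholder inputs; substitute your own estimates.
total = annual_savings(ingest_pb_per_year=500_000,
                       storage_cost_per_pb=12_000,
                       compression_ratio=3.12,
                       gpu_capex=5.0e9,
                       vram_reduction=27.0,
                       power_cooling=400e6)
print(total)   # ~ $9.3B/year with these placeholder inputs
```

Pushing any parameter toward a worst case, as the entry above suggests, is a matter of changing the arguments and re-running.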
What does a Layer 3 pilot actually look like in practice?
The customer provisions an H100 Confidential Compute instance in
their own cloud account. All three hyperscalers offer H100 CC on
demand: GCP A3 Mega, Azure NCCads H100 v5, AWS P5 in select regions.
On-demand rates are roughly $4-12 per hour. A typical pilot
evaluation runs $1.5K-15K of compute total, billed to the customer's
own cloud account. Specialty providers like CoreWeave, Lambda, and
Crusoe Energy offer lower prices on standard H100 capacity, with
CC-mode support that varies by vendor.
COSIMO ships a deployment artifact: a container or signed binary,
plus a Python wrapper and a setup runbook. The customer installs it,
runs their own workload through it, and verifies outputs against
NVIDIA's public attestation roots before consuming any of them.
Customer data and the binary both stay inside the enclave. COSIMO
never sees the customer's data.
The standard structure is a Letter of Intent with a refundable pilot
deposit, scoped so the deposit converts to first-month pilot fee on
delivery within 60 days, or refunds in full if COSIMO does not
deliver. That gives the customer a real commitment without infinite
open-ended risk and gives COSIMO the runway to do the binary
packaging properly between pilot signature and delivery.
What the pilot answers: does the kernel deliver the published
numbers on the customer's actual data and workload, at production
scale, in their environment, on their hardware. What it does not
require: shipping the customer any data, or trusting any COSIMO
software in the verification path. The "we never receive your data"
commitment is enforced by the architecture, not by policy. To start
the conversation, see Request a Pilot.
What if I find a flaw in the methodology?
If you find a material methodological flaw in the canonical
benchmark protocol that invalidates a performance metric in the
whitepaper Through the Eyes of AI: From Pixels to Perception,
COSIMO will feature you in a blog post celebrating your genius.
The point is to get the right people to look hard now, before Layer 1
is treated as settled. We would rather have a researcher find a flaw
today than have one discovered after the fact.
Submit findings to validation@cosimo.ai.
We acknowledge submissions within three business days and respond
within ten.