Common questions about the COSIMO video format, the verification
chain, use cases for physical AI, and what the v1.0 benchmark does
and does not prove. Honest, short answers. If yours is missing,
open an issue at
github.com/COSIMOAI/validation.
What does this validation NOT prove?
Layer 1 verifies what happened on a specific benchmark, on specific
hardware. It does not verify what happens elsewhere. Specifically, this
validation does not prove:
The kernel works on novel data. The benchmark uses
UCF-101. Performance on your data is a Layer 2 audit question.
The kernel is fast on production input. Latency on
your pipeline depends on your pipeline. That is a Layer 3 black-box
pilot question.
The technique generalizes beyond UCF-101.
Generalization requires testing on different datasets. Layer 2 or
Layer 3 covers this.
The kernel itself is correct internally. Reading
the source is Layer 4 territory.
Higher layers close these gaps. Pick the layer that matches what you
need to be convinced of. Tier-up paths are linked from the
validation overview.
How do I know the test verification record hasn't been tampered with?
Three independent trust roots cover this. Defeating one is hard.
Defeating all three at once is the threat model Layer 1 cannot defend
against.
Sigstore Rekor. Every signed bundle gets an entry
in a public, append-only Merkle-tree transparency log. Tamper after
the fact and the entry's inclusion proof no longer matches.
OpenTimestamps Bitcoin anchor. The bundle hash is
committed to the Bitcoin blockchain. Tampering after the timestamp
would require rewriting Bitcoin's history.
SHA-256 chain. Every metric file is hashed into
the per-result verification record. The record is hashed into the
manifest. The manifest is signed. Change any byte anywhere and one
of these hashes no longer matches.
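The SHA-256 chain can be checked with nothing but a standard hashing library. A minimal sketch, using a toy manifest and hypothetical file names (the real bundle layout may differ):

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 fingerprint of a byte string."""
    return hashlib.sha256(data).hexdigest()

def verify_manifest(manifest: dict, files: dict) -> list:
    """Compare each file's recomputed hash against its published hash.

    `manifest` maps filename -> expected hex digest; `files` maps
    filename -> raw bytes. Returns the filenames that fail to match.
    """
    return [name for name, expected in manifest.items()
            if sha256_hex(files[name]) != expected]

# Toy bundle: one metric file, hashed into a manifest.
files = {"metrics.json": json.dumps({"accuracy": 0.912}).encode()}
manifest = {"metrics.json": sha256_hex(files["metrics.json"])}

assert verify_manifest(manifest, files) == []    # intact bundle passes

files["metrics.json"] += b" "                    # change a single byte
assert verify_manifest(manifest, files) == ["metrics.json"]
```

The same walk, applied file by file, is what the browser verifier does against the published manifest.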
If the verifier fails on your machine, that is the
system working as intended. A failed step is exactly the signal Layer 1
is designed to surface. A real reviewer running the same checks in a
compromised environment would see the same failure. Re-download the
bundle from a different network path, or use the
cosign verify-blob command
directly. If both fail, write us at
validation@cosimo.ai with
the full output and we will investigate publicly.
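For the direct cosign path, a sketch of the keyless verification call (flags from cosign 2.x; the file names are placeholders, so substitute the paths from your downloaded bundle):

```shell
# Keyless verification of the bundle signature against Sigstore.
# File names are placeholders; use the paths shipped in the bundle.
cosign verify-blob receipts.tar.gz \
  --bundle receipts.tar.gz.sigstore.json \
  --certificate-identity validation@cosimo.ai \
  --certificate-oidc-issuer https://accounts.google.com
```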
How can I trust that the numbers in the whitepaper aren't faked?
The verification system is built so that no single piece, including
any piece run by COSIMO, is trusted on its own. Four independent
trust roots witness the numbers (three active today, a fourth
scheduled for v1.1.0), and each would have to be compromised
separately to fake them.
Inside the bundle. File fingerprints any reader can recompute.
Every file in the verification record is hashed using SHA-256, the
same fingerprinting algorithm Bitcoin uses. The browser verifier
walks the manifest, fetches each file, recomputes the hash, and
compares it to the published hash. Change a single byte in any file
and the comparison fails. The page shows expected hash and actual
hash side by side so a reviewer sees exactly where the tampering
happened.
Sigstore. A signed timestamp, logged publicly, by an
independent foundation. Once finalized, the bundle is signed
using a short-lived ten-minute certificate issued by Sigstore, an
open-source signing service run by the Linux Foundation. The
certificate is bound to a Google Workspace identity at
validation@cosimo.ai. Every signature is logged to
Rekor, a public, append-only transparency log. Inserting a fake
record into the past would require rebuilding the chain forward from
that point, which is visible to every party watching the log.
OpenTimestamps and Bitcoin. An independent witness to the moment.
Independently of Sigstore, the bundle hash is published to
OpenTimestamps and embedded into the Bitcoin blockchain. Once a
Bitcoin block contains the hash, rewriting it would require redoing
every block of Bitcoin proof-of-work from that point forward. That
costs billions of dollars in compute and is visible to every Bitcoin
node within minutes.
NVIDIA hardware attestation. Scheduled for v1.1.0.
The current canonical run was on standard NVIDIA L4 hardware in
Google Cloud. The v1.1.0 release re-runs the canonical 5x40
protocol inside an H100 Confidential Compute enclave, where
NVIDIA's attestation service signs a statement: this binary, in
this container, on this specific H100, with this CUDA version
and driver, produced these outputs. That statement is signed by
an NVIDIA hardware key that only NVIDIA controls. Until v1.1.0
ships, the three other trust roots above carry the verification
chain.
To fake the numbers and bypass the three currently active trust
roots simultaneously, COSIMO would have to compromise Google's
identity layer, the Sigstore Foundation's transparency log, and
the Bitcoin network's proof-of-work history, in coordination.
The cost of that attack is many orders of magnitude greater than
the cost of running the benchmark honestly. The v1.1.0 release
adds NVIDIA's hardware attestation as a fourth trust root,
further raising the cost of fabrication. The verification system
is built around that asymmetry. The
methodology page walks the cryptographic
chain in detail.
What the system proves: the runs happened, on the hardware
described, producing the numbers reported, with bundle integrity
intact across time. What it does not prove: that the technique
generalizes beyond this benchmark, or that the kernel is internally
correct. Those questions are answered by Layers 2, 3, and 4.
How do I know the runs actually used the kernel?
The TEE attestation chain (from v1.1.0 onward). Each run's
attestation.json is signed by NVIDIA's NRAS (NVIDIA Remote
Attestation Service) against a
specific H100 in confidential-compute mode. It binds to the exact
container image digest, CUDA toolkit version, driver version, and
command-line invocation that produced the metrics. You can verify
this with NVIDIA's own nv-attestation-cli against
NVIDIA's public attestation roots. No COSIMO software in the
verification path.
What this does not prove, however, is that the
kernel inside the container is the COSIMO DST kernel rather
than some other kernel claiming to be it. Closing that gap is what
Layer 2 (independent audit)
exists for. An auditor with NDA-supervised access to the encoder
verifies that the binary running in the TEE is the canonical kernel.
Can the metrics be reproduced on different hardware?
Reproduction across hardware is what Layer 2 audits. Layer 1 proves
these specific runs happened on this specific hardware and
the numbers match the whitepaper. To get "the technique works on
different hardware" or "the technique works on different data,"
commission a Layer 2 audit. The tier-up paths are linked from the
verification page.
What is Geometric Video?
Geometric Video is a new kind of video primitive, built for
physical AI. Where legacy video formats (H.264, HEVC, MP4) encode
dense grids of pixels designed for human eyes, Geometric Video
encodes the geometry of motion directly: a sparse coordinate list
of (x, y, t, Δ) tuples representing only the physically
active voxels in a scene. Geometry, not pixels. Motion, not
texture. Roughly 98% of the legacy pixel matrix gets dropped
during encoding because it does not carry signal that perception
models can use.
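The shape of that representation can be illustrated with a toy frame-differencing sketch. This is not the DST kernel, whose internals are proprietary; the differencing scheme and threshold below are illustrative assumptions only:

```python
import numpy as np

def toy_sparse_encode(frames: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Toy illustration of a sparse geometric encoding (NOT the DST kernel).

    frames: (T, H, W) grayscale array in [0, 1].
    Returns an (N, 4) array of (x, y, t, delta) rows, keeping only
    voxels whose frame-to-frame change exceeds `threshold`.
    """
    deltas = np.abs(np.diff(frames, axis=0))   # (T-1, H, W) motion magnitude
    t, y, x = np.nonzero(deltas > threshold)   # surviving "active" voxels
    return np.stack([x, y, t, deltas[t, y, x]], axis=1).astype(np.float32)

# A mostly static 8-frame clip with a single diagonal "mover".
clip = np.zeros((8, 32, 32))
for t in range(1, 8):
    clip[t, t, t] = 1.0

sgm = toy_sparse_encode(clip)
print(sgm.shape)   # (13, 4): 13 active voxels out of 8*32*32 = 8192 dense
```

The dense tensor has 8,192 voxels; the sparse list keeps only the handful that moved, which is the "drop everything that carries no motion signal" idea in miniature.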
The output of the COSIMO Deterministic Structural Transform (DST)
kernel is the Sparse Geometric Matrix (SGM), the actual data
format the model receives. Geometric Video is the category. SGM
is the file format. DST is the kernel that produces it. The
whitepaper documents the math, the v1.0 benchmark proves the
format works, and the Layer 1 verification record at
cosimo.ai/validation lets any reviewer
confirm the numbers without trusting COSIMO.
Most AI infrastructure work tries to make the neural network
smarter. Bigger networks, more pretrained weights, more compute.
COSIMO operates at a different layer. Before the model sees a
single frame, the COSIMO Physics Engine has stripped the noise
from the video and isolated the geometry of motion. The brain
reasons better when the eyes work. Geometric Video fixes the
eyes.
How is COSIMO different from NVIDIA Cosmos? They sound similar.
The names are similar; the technologies operate at different layers
of the AI stack. COSIMO and NVIDIA Cosmos are complementary, not
competitive.
NVIDIA Cosmos is a platform of world foundation models. It generates
synthetic environments, simulates physical interactions, and
provides pretrained models for training autonomous systems. It
operates at the model layer. Cosmos is, in the analogy, the brain.
It reasons about a virtual world.
COSIMO is a video format and the deterministic kernel that produces
it. The Sparse Geometric Matrix is a structured signal that encodes
the geometry of motion directly into the video stream, before any
model sees it. It operates at the data representation layer,
upstream of any perception model. COSIMO is, in the analogy, the
eye and the optic nerve. It makes the world clear to whatever model
consumes it next.
A perception stack can use both. A world foundation model trained
on dense legacy video has to reconstruct geometric structure from
texture every time it ingests video. The same model trained on
COSIMO SGM receives the geometry directly, with most of the texture
removed before training begins. The compute saved upstream is
compute the model can spend on reasoning.
The two answer different questions. Cosmos answers "what does the
world look like, and how should an agent behave in it?" COSIMO
answers "how should the world be represented to a model that has
to learn from it?"
What goes wrong when AI tries to understand video today, and how does COSIMO address it?
Four well-documented failure modes of legacy video as a substrate
for machine perception. COSIMO addresses each at the format level,
before any model is trained.
Compression noise designed for human eyes, not models.
H.264 and HEVC compress visual texture for streaming to human
viewers. The compression artifacts (macroblocking, color banding,
motion-estimation errors) are tolerable to human visual perception
but introduce noise into the input that perception models have to
learn to ignore. COSIMO strips frequency-domain texture and emits
geometric structure directly. The artifacts that bias dense models
are not present in the SGM.
Capacity wasted on memorizing static backgrounds.
A dense video frame is roughly 98% static or near-static across
consecutive frames in motion-dominated content. A neural network
trained on dense pixel matrices spends most of its representational
capacity learning the static content, regardless of whether the
task requires it. The 4.64× parameter reduction in this
benchmark is a direct measurement of how much capacity was
previously absorbed by background memorization. COSIMO's Zero-Motion
Gating removes static voxels at the kernel level, so the model
never has to learn to ignore them.
Floating-point drift across deployment hardware.
Legacy video preprocessing chains run in floating point downstream
of integer-deterministic decode. Bilinear resampling, color-space
conversion, augmentation, and normalization all behave slightly
differently across CPUs, GPUs, CUDA versions, and driver releases.
The model receives subtly different inputs depending on where the
pipeline runs, which complicates cross-cluster reproducibility and
turns model debugging into a multi-week disambiguation exercise.
COSIMO's kernel is deterministic by construction. The same source
video produces a bit-exact SGM regardless of where the pipeline
runs.
Edge deployment infeasibility for high-dimensional video models.
Dense 3D CNNs require server-grade GPUs to hold the working set in
memory. The 27× collapse in inference VRAM (2.18 GB to 77.6
MiB) is what allows the same perception task to run on a $249
Jetson Orin Nano edge chip rather than a $2,500 server-grade GPU.
Edge AV and robotics deployments that were previously gated on
datacenter-class compute become feasible.
These are not aspirational. Each failure mode is empirically
anchored in the v1.0 benchmark. Section 3 of the whitepaper reports
the corresponding measurements; the verification record at
cosimo.ai/validation lets a reviewer
verify each one independently.
What are the use cases for COSIMO?
COSIMO is built for physical AI. Not a single application, but the
substrate underneath every application in the category. Cars.
Humanoid robots. Drones. Industrial robots. Surveillance fleets.
Hyperscaler video pipelines. Any system that ingests video, trains
models on it, and acts in real time on real hardware.
COSIMO improves all three stages of that loop. Encode. Train.
Perform. The per-stage numbers are in the
next FAQ entry. The verticals below are
where those numbers compound into deployable products.
Autonomous vehicles. Perception stacks running on
in-vehicle silicon trade off accuracy, latency, and
bill-of-materials cost continuously. Dense video forces a choice.
Trunk-mounted server-grade GPUs (expensive, heat-constrained,
weight-constrained). Or aggressive model compression (accuracy
loss). SGM collapses the inference working set to 77.6 MiB at
15.86 ms p50 latency, batch-invariant. A 100,000-vehicle fleet
replaces trunk GPUs with $249 ARM-class edge chips. The fleet
saves roughly $2,700 per vehicle in upfront hardware and $159.7M
per year in cellular uplink. Training stability pulls
time-to-market forward by six months.
Humanoid and industrial robotics. Robotic
perception lives inside a tighter envelope. A humanoid robot's
onboard compute is closer to a phone's than a car's. The
perception stack shares that envelope with control, planning, and
audio. SGM's sub-1W encode and 77.6 MiB inference working set fit
a thermal envelope a robot dissipates passively. The same numbers
apply to industrial inspection, manipulation, and any robotic task
where real-time vision runs inside a tight power and weight
budget.
Hyperscaler video ingestion. Cloud providers
ingest video at scale and store it as dense pixel matrices today.
SGM's 3.12× compression and 27× inference VRAM
reduction cut ingestion costs across the fleet. The savings model
is the headline dollar figure. See the savings
model entry and cosimo.ai/savings
for the math.
Adjacent applications. Drones. Surveillance.
Sports analytics. Industrial inspection. Any application where the
cost of perception is a binding constraint on what the system can
do. The thermal envelope is sub-1W. The inference working set is
77.6 MiB. The latency is 15.86 ms. None of those numbers are
vertical-specific.
The common thread is physical AI. Systems that perceive and act in
the real world, on real hardware, in real time. COSIMO is not a
point solution for any one of those verticals. It is the substrate
they all run on. Static-scene tasks (text recognition,
fine-grained appearance classification, document analysis) are
outside the scope of v1.0 and should continue to use dense video.
How does COSIMO improve every phase of the physical-AI pipeline?
In physical AI, video is the input. Cars, robots, drones,
hyperscaler training corpora, surveillance, industrial inspection.
They all start with video and end with a decision. The format of
that input matters at every stage between the camera and the
decision. COSIMO is built to be the superior substrate at all
three stages: encode, train, perform.
Encode. The DST kernel runs in 1.17 ms per frame
at less than one watt on a five-year-old MacBook Pro. Fixed-point
integer arithmetic, stateless, no learned weights. There is
nothing to retrain, no model drift, no GPU dependency for
encoding. The output (Sparse Geometric Matrix) is 3.12×
smaller on disk than the dense tensor it replaces. Encoding is
deterministic across hardware: same input, same SGM, bit-exact.
The encoding step that legacy pipelines pay in floating-point
variability and CPU-to-GPU transfer overhead, COSIMO pays once,
on commodity silicon, and forwards a compact deterministic
representation downstream.
Train. Models trained on SGM use 78.5% fewer
parameters than the dense baseline (33.15M to 7.14M) and reach
+12.4 percentage points higher median accuracy. Peak training
VRAM drops 2.40× (5.23 GB to 2.18 GB), which means a single
GPU can hold larger batches or larger models within the same
budget. Cross-seed variance collapses 3× (σ = 0.017
versus 0.052). Training is stable enough to debug like source
code, which compresses the multi-week "is this a code regression
or a bad seed" cycle that production ML teams currently absorb.
Smaller, faster, more stable training, on the same hardware.
Perform. Inference VRAM collapses 27×
(2.18 GB to 77.6 MiB). The same perception task that previously
required a $2,500 server-grade GPU can now run on a $249 Jetson
Orin Nano edge chip. Per-clip latency is 15.86 ms p50,
batch-invariant. A single autonomous vehicle gets the same speed
as a batched datacenter workload, so the same model deploys
identically at the edge and in the cloud. The thermal envelope
(sub-1W encode, edge-class inference) fits inside a phone, a
drone, or a robot. Deployment moves from "datacenter required"
to "wherever the camera is."
COSIMO is not a point optimization for a single stage. It is a
substrate that compounds across the pipeline. The encoder is
leaner. The trained model is smaller and more stable. Inference
is faster, smaller, and edge-deployable. Each stage's improvement
makes the next stage's job easier. That is what makes the input
representation the highest-leverage place to optimize a
physical-AI stack.
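The headline ratios above can be rechecked directly from the reported raw numbers. The one unit assumption here is decimal GB and binary MiB for inference VRAM, which is the convention that reproduces the published 27× figure:

```python
# Reported raw measurements from the benchmark write-up.
dense_params, sgm_params = 33.15e6, 7.14e6
train_vram_dense_gb, train_vram_sgm_gb = 5.23, 2.18
infer_vram_dense_bytes = 2.18e9             # 2.18 GB (decimal)
infer_vram_sgm_bytes = 77.6 * 1024**2       # 77.6 MiB (binary)

print(round(dense_params / sgm_params, 2))                  # 4.64x fewer parameters
print(round((1 - sgm_params / dense_params) * 100, 1))      # 78.5% reduction
print(round(train_vram_dense_gb / train_vram_sgm_gb, 2))    # 2.4x training VRAM
print(round(infer_vram_dense_bytes / infer_vram_sgm_bytes)) # 27x inference VRAM
```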
Why UCF-101? And why a 5-class subset?
UCF-101 is a standard reference benchmark for action recognition,
with established baselines and a publicly available dataset.
Picking a recognized benchmark constrains the ability to
cherry-pick favorable evaluation conditions. Picking a
kinematic-dominant 5-class subset (Fencing, Punch, PullUps, TaiChi,
JugglingBalls) isolates motion as the discriminating signal, which
is the strength SGM is built around.
Section 3.6 of the whitepaper makes the scope explicit. The results
do not extrapolate to static-scene classification, fine-grained
appearance recognition, or crowd-density tasks. A broader
cross-dataset evaluation is v2.0 work.
Choosing a benchmark where SGM was expected to perform poorly would
have been bad-faith methodology. Choosing a benchmark where it was
expected to perform well, and being explicit about the scope of
the result, is the right tradeoff for a v1.0 release. A reviewer
concerned about dataset selection should ask the same question of
any benchmark and look at whether the scope is honestly drawn.
When does SGM perform worse than dense video?
Static-scene tasks. The Zero-Motion Gating step in the kernel
removes voxels that do not register motion above a per-frame
threshold. When motion is the signal (action recognition, AV
perception, robotics, surveillance), this is exactly the work the
kernel is supposed to do. When motion is not the signal (text
recognition, static product classification, fine-grained appearance
recognition like distinguishing dog breeds), gating destroys the
discriminating input and the model has nothing to learn from. SGM
is not a universal video format. It is a video format for tasks
where the geometry of motion carries the signal.
Other conditions where SGM is expected to perform worse than dense
video, or no better, are flagged in section 3.6 of the whitepaper.
Very low-light video, where the structural-differential signal
becomes noise-dominated. Heavily compressed source video, where
decode artifacts feed into the kernel. Crowd-density scenarios,
where individual motion geometry is dominated by aggregate flow.
The kernel is not magic. It removes pixels that do not contribute
to motion, and that is exactly the wrong move when those pixels are
what the task needs.
Why does the kernel need to be deterministic?
A trained network does tolerate small numerical perturbations. The
determinism contract is not for the model. It is for the
verification chain, the engineering pipeline, and the contrast
against legacy video.
The verification record binds a published number to a specific run.
Two runs of the same input must produce the same SGM artifact for
the manifest hash chain to verify. Without determinism, every
published number becomes "this number, on that GPU, at that moment,
take our word for it."
Legacy video preprocessing is necessarily nondeterministic across
hardware. H.264 and HEVC decode is largely integer-deterministic,
but everything downstream is not. Bilinear resampling, color-space
conversion, tensor normalization, augmentation, and mixed-precision
casting all run in floating point. Different CPUs, GPUs, CUDA
versions, and driver releases produce subtly different inputs to the
model. Per-pixel the differences are small. Across billions of
pixels and trillions of operations, they shift training trajectories
and complicate cross-device deployment. Every ML team running a
legacy video pipeline pays this tax somewhere, usually as the cost
of debugging a model that converges on the training cluster and
underperforms on the inference fleet.
COSIMO's kernel is deterministic by construction. Same input video,
same SGM, bit-exact across compatible hardware. The model sees
identical input regardless of where the pipeline runs. Legacy video
cannot offer that, even if every team using it tried, because the
floating-point conversion machinery is a property of the format
rather than an oversight on the implementation side.
Two further consequences. ML teams gain a clean attribution boundary
between data-pipeline drift and code regression, which compresses
debug cycles. AV and humanoid-robotics programs that need to certify
a perception stack benefit from input-side reproducibility
regardless of model robustness, because the certification target is
the pipeline.
What is Δ in the Sparse Geometric Matrix?
Δ is a single float32 scalar per surviving voxel. The model
ingests an (N, 4) tensor where N caps at 8192 and the four columns
are (x, y, t, Δ). At the kernel's output, the (x, y, t)
coordinates are normalized to [0, 1] by the spatial dimensions, and
Δ is L1-normalized across the active set so the per-clip total
is N. Auditors can confirm this directly from the open-source
dataloader at src/data/dataset_sgm.py, or from the
public LMDB encodings, which ship the (N, 4) tensor pre-sorted by
Δ descending. What the kernel does to produce Δ in the
first place is the proprietary part. The format is not. The
methodology page documents what is
externally measurable in the SGM and what is not.
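A sketch of the normalization described above, as it might appear in a dataloader. The canonical implementation is src/data/dataset_sgm.py; the exact scaling of t and the sign handling of Δ are assumptions here, not quotations of that file:

```python
import numpy as np

def normalize_sgm(raw: np.ndarray, height: int, width: int, n_frames: int) -> np.ndarray:
    """Normalize an (N, 4) tensor of (x, y, t, delta) rows.

    Coordinates are scaled to [0, 1] by the clip dimensions; delta is
    L1-normalized across the active set so the per-clip total equals N.
    """
    out = raw.astype(np.float32).copy()
    out[:, 0] /= width
    out[:, 1] /= height
    out[:, 2] /= n_frames           # t scaling is an assumption
    n = out.shape[0]
    out[:, 3] *= n / np.abs(out[:, 3]).sum()   # L1 total becomes N
    return out

raw = np.array([[10, 20, 3, 0.5],
                [31, 5, 7, 1.5]])
sgm = normalize_sgm(raw, height=32, width=32, n_frames=8)
print(sgm[:, 3].sum())   # 2.0: per-clip delta total equals N = 2
```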
Doesn't a panning camera defeat Zero-Motion Gating?
Yes, on raw input. The canonical benchmark filtered for fixed-camera
clips precisely to isolate the motion-recognition signal from
camera ego-motion. The fixed-camera filter (mean global optical
flow at or below 1.5 pixels per frame) is documented in the
protocol and applied during data preprocessing. The drop count from
this filter is reported in the canonical results so a reviewer can
see how aggressive the filter is.
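The filter criterion reduces to a small predicate on the estimated flow field. The optical-flow estimator itself (Farnebäck, RAFT, or otherwise) is not specified by the protocol text quoted here and is left out of the sketch:

```python
import numpy as np

def is_fixed_camera(flow: np.ndarray, max_mean_flow: float = 1.5) -> bool:
    """flow: (T-1, H, W, 2) per-pixel optical flow between consecutive frames.

    A clip passes the fixed-camera filter when mean global flow
    magnitude is at or below `max_mean_flow` pixels per frame.
    """
    magnitudes = np.linalg.norm(flow, axis=-1)   # (T-1, H, W)
    return float(magnitudes.mean()) <= max_mean_flow

static = np.full((7, 16, 16, 2), 0.3)    # gentle jitter, ~0.42 px mean flow
panning = np.full((7, 16, 16, 2), 3.0)   # global pan, ~4.24 px mean flow
print(is_fixed_camera(static), is_fixed_camera(panning))   # True False
```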
For real-world AV and robotics deployments where cameras are
constantly moving, ego-motion compensation is a separate
engineering step. The standard solutions (visual odometry,
homography estimation, IMU-driven stabilization) sit upstream of
the SGM kernel and feed it stabilized input. Integrating ego-motion
compensation directly into the production kernel is on the v2.0
roadmap.
For the current benchmark, the question of "does the SGM
representation help motion-recognition tasks" is isolated cleanly
by holding camera motion fixed. The question of "does SGM help when
the camera is also moving" is a separate experiment, gated on the
ego-motion compensation work. Both are real questions; v1.0 answers
the first one explicitly and bounds the second one as future work.
How much of the gain comes from a smaller model versus the preprocessing?
Right question, and the public paper does not yet cleanly isolate
the two effects. Track A (dense baseline) is a from-scratch
ResNet3D-18 with 33.15M parameters. Track B is a SparseResNet3D with
7.14M parameters built on SpConv. Track B is materially smaller.
Section 3.6 of the whitepaper labels this as parameter asymmetry and
acknowledges that the comparison should not be characterized as a
parameter-matched architecture study.
The control that would isolate representation from architecture is a
re-densified ablation: render the SGM back to a dense (1, T, H, W)
tensor by scattering Δ values onto a zero grid, then feed that
input into the same dense ResNet3D-18 used as Track A. The current
paper does not include that result. We treat it as the
highest-priority follow-up control for v1.1.
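The re-densification control itself is a straightforward scatter. A numpy sketch, under the assumption of integer (x, y, t) coordinates in the stored SGM:

```python
import numpy as np

def redensify(sgm: np.ndarray, n_frames: int, height: int, width: int) -> np.ndarray:
    """Scatter (N, 4) rows of (x, y, t, delta) onto a zero (1, T, H, W) grid.

    The result has the dense input shape, so it can be fed to the same
    dense architecture as Track A, isolating representation from model size.
    """
    dense = np.zeros((1, n_frames, height, width), dtype=np.float32)
    x, y, t = sgm[:, 0].astype(int), sgm[:, 1].astype(int), sgm[:, 2].astype(int)
    dense[0, t, y, x] = sgm[:, 3]
    return dense

sgm = np.array([[3, 4, 0, 0.7],
                [10, 2, 5, 1.3]])
grid = redensify(sgm, n_frames=8, height=16, width=16)
print(grid.shape, grid.sum())   # (1, 8, 16, 16) 2.0
```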
Two pieces of indirect evidence point at the preprocessing carrying
real weight on its own. Cross-seed variance collapses 3× under
SGM (σ = 0.017 versus 0.052). Smaller networks do not
generally exhibit tighter cross-seed clustering, so capacity
reduction is unlikely to be the main cause. The lowest-performing
SGM seed (78.7%) outperforms four of five dense baseline seeds,
which is hard to attribute to overfitting differences alone. Neither
is a substitute for the direct ablation, which is why we have
committed to running it.
Can I run training and evaluation pipelines on my own changes?
For the network architectures, yes. The PyTorch training pipeline is
fully open-sourced. The repository contains the SparseResNet3D, the
dense 3D-CNN baseline, and the canonical evaluation loops.
Pre-computed SGM LMDB encodings are public. A reviewer can swap the
dense or sparse network architecture, run the full 5-seed ×
40-epoch protocol against the canonical SGM artifacts, and reproduce
the reported numbers on their own NVIDIA L4 hardware. There is a
dvc.yaml and a run_all.sh if a one-click
flow is preferred.
For the kernel itself, no public path. Changes to the production C
kernel are gated behind the proprietary boundary and ship under
sealed black-box terms. Evaluating kernel changes is the Layer 3
path described below.
Can SGM run as a sidecar alongside an existing legacy video pipeline?
Yes. The encoder cost is small enough that running SGM in parallel
with a legacy dense pipeline has effectively no impact on the
legacy pipeline's compute budget. The 1.17 ms per-frame encode at
less than one watt, measured on a five-year-old MacBook Pro, sits
below the noise floor of any production hyperscaler ingestion path
or AV/robotics perception stack. The encoder is also stateless and
has no learned weights. There is nothing to retrain, nothing to
monitor for drift, and no encoder-side state to maintain.
What sidecar deployment looks like in practice. The same source
video arrives at the ingest pipeline. The pipeline branches into
two paths: the legacy decode-and-store path (H.264 or HEVC into
dense tensor), and the SGM path (decode plus DST kernel into
sparse coordinate list). Both derivatives are stored. Existing
dense models continue to consume the dense tensor with no change.
New SGM-based models consume the sparse coordinate list. Both
pipelines run on the same source, in parallel, on the same
hardware.
This removes the "rip and replace" risk that usually blocks new
perception infrastructure. A hyperscaler does not have to commit
to switching off the dense pipeline before validating SGM at
production scale. An AV operator does not have to bet a vehicle
program on an unfamiliar representation. Both can run SGM as a
shadow path against their existing workload and continue shipping
the dense-pipeline output to downstream consumers until SGM has
been validated on real data, in real conditions, at real scale.
Migration becomes a switching decision rather than a cutover
decision.
Why aren't the SGM artifacts public?
The Sparse Geometric Matrix (SGM) format and the encoder that produces
it are the proprietary core of COSIMO. Releasing SGM files publicly
would expose the input contract of the kernel and is incompatible with
the commercial licensing model.
Layer 2 auditors get supervised access to the encoder under NDA. That
is how the "novel data" gap gets closed without making the IP public.
Layer 4 customers receive full kernel integration under commercial
terms.
Why no peer review?
A pre-print is on the timeline. It is not a precondition for Layer 1:
cryptographic verification of metrics is a stronger and more
reproducible signal than peer review of methodology. Peer review and
verifiable receipts answer different questions, and we want both.
The arXiv submission ships alongside the v1.0.0 receipts bundle.
What if NVIDIA's attestation service is compromised?
Two answers.
First. It is mathematically possible but
operationally extreme. NVIDIA's NRAS roots are widely-deployed
infrastructure with independent monitoring and a vendor whose
business depends on the integrity of confidential-compute
attestation.
Second. Even if NRAS were compromised, you have two
independent trust roots in the bundle (Sigstore Rekor and
OpenTimestamps), plus a fallback verification path via Azure MAA or
AWS Nitro for runs published on those platforms. Compromising all
three at once is the threat model Layer 1 cannot defend against.
Nothing short of an in-person Layer 2 audit can.
Why no open-source ingestion pipeline?
The consumer-side ingestion pipeline that prepares data for the
kernel encodes the input contract. Release the pipeline and you've
effectively published part of the kernel's interface. We keep it
proprietary for the same reason we keep the kernel proprietary.
The Layer 2 audit path runs the kernel under supervision, which closes
the "did the pipeline behave as claimed" gap without exposing the
contract.
What does it mean that the kernel is closed-source?
The COSIMO DST kernel and SGM generator ship under a commercial
license at Layer 4. Layers 1, 2, and 3 each give different forms of
access without releasing source:
Layer 1. Read-only signed test verification record.
Layer 2. Auditor runs the sealed kernel under NDA on customer data.
Layer 3. Black-box deployment in your environment for broader testing.
Layer 4. Full source under commercial license.
Pick the layer that matches what you need to be convinced of. Tier-up
paths are linked from the verification page's
footer.
How was the $8.25 to $10.5 billion per hyperscaler savings figure derived? Can I check the math?
The savings model is a multiplication of three published anchors.
The empirical 3.12× compression ratio measured in this
benchmark, applied to ingestion volume estimates for a Tier-1
hyperscaler. Public NVIDIA H100 capital expenditure rates and
current storage and bandwidth pricing. The $400M power-and-cooling
figure, derived from public hyperscaler PUE data and current
power-purchase costs in major US datacenter regions.
The interactive savings calculator at
cosimo.ai/savings exposes the inputs
as adjustable parameters. A reviewer can substitute their own
ingestion volume estimate, their own GPU pricing, or their own
power costs, and see how the model responds. The $8.25 to $10.5B
per year range is bounded by reasonable choices across those input
parameters. A reviewer pushing the model into worst-case territory
should adjust the parameters and report what they find. The number
is a model output, not a quotation. The model is open.
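An illustrative skeleton of that three-anchor multiplication. Every number and parameter name below is a placeholder assumption for demonstration; the real model and its inputs live in the calculator at cosimo.ai/savings:

```python
def annual_savings(ingest_pb_per_year: float,
                   storage_cost_per_pb: float,
                   compression_ratio: float,
                   gpu_capex: float,
                   vram_reduction: float,
                   power_cooling: float) -> float:
    """Toy three-anchor savings model (placeholder structure, not the
    real calculator): storage avoided by compression, GPU capex avoided
    by the smaller inference working set, plus a power-and-cooling term."""
    storage_saved = ingest_pb_per_year * storage_cost_per_pb * (1 - 1 / compression_ratio)
    gpu_saved = gpu_capex * (1 - 1 / vram_reduction)
    return storage_saved + gpu_saved + power_cooling

# Placeholder inputs; substitute your own estimates.
total = annual_savings(ingest_pb_per_year=500_000,
                       storage_cost_per_pb=12_000,
                       compression_ratio=3.12,
                       gpu_capex=5.0e9,
                       vram_reduction=27.0,
                       power_cooling=400e6)
print(total)   # ~ $9.3B/year with these placeholder inputs
```

Pushing any parameter toward a worst case, as the entry above suggests, is a matter of changing the arguments and re-running.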
What does a Layer 3 pilot actually look like in practice?
The customer provisions an H100 Confidential Compute instance in
their own cloud account. All three hyperscalers offer H100 CC on
demand: GCP A3 Mega, Azure NCCads H100 v5, AWS P5 in select regions.
On-demand rates are roughly $4-12 per hour. A typical pilot
evaluation runs $1.5K-15K of compute total, billed to the customer's
own cloud account. Specialty providers like CoreWeave, Lambda, and
Crusoe Energy offer lower prices on standard H100 capacity, with
CC-mode support that varies by vendor.
COSIMO ships a deployment artifact: a container or signed binary,
plus a Python wrapper and a setup runbook. The customer installs it,
runs their own workload through it, and verifies outputs against
NVIDIA's public attestation roots before consuming any of them.
Customer data and the binary both stay inside the enclave. COSIMO
never sees the customer's data.
The standard structure is a Letter of Intent with a refundable pilot
deposit, scoped so the deposit converts to first-month pilot fee on
delivery within 60 days, or refunds in full if COSIMO does not
deliver. That gives the customer a real commitment without infinite
open-ended risk and gives COSIMO the runway to do the binary
packaging properly between pilot signature and delivery.
What the pilot answers: does the kernel deliver the published
numbers on the customer's actual data and workload, at production
scale, in their environment, on their hardware. What it does not
require: shipping the customer any data, or trusting any COSIMO
software in the verification path. The "we never receive your data"
commitment is enforced by the architecture, not by policy. To start
the conversation, see Request a Pilot.
What if I find a flaw in the methodology?
If you find a material methodological flaw in the canonical
benchmark protocol that invalidates a performance metric in the
whitepaper Through the Eyes of AI: From Pixels to Perception,
COSIMO will feature you in a blog post celebrating your genius.
The point is to get the right people to look hard now, before Layer 1
is treated as settled. We would rather have a researcher find a flaw
today than have one discovered after the fact.
Submit findings to validation@cosimo.ai.
We acknowledge submissions within three business days and respond
within ten.