local cuda proving "unknown error" #309

Open
JossDuff opened this issue Jan 2, 2025 · 4 comments
Labels
bug Something isn't working

Comments

@JossDuff

JossDuff commented Jan 2, 2025

Hi!

We are attempting to generate a CUDA proof locally and running into an issue that appears to come from inside the CUDA proof acceleration Docker image.

error message

Full error message: output.txt
Relevant parts of error message output, in order:

INFO prove_core:execute: tried to write to unknown file descriptor
thread '<unnamed>' panicked at /root/.cargo/git/checkouts/sp1-20c98843a1ffc860/e3f12c1/crates/co
:14:13:
failed reading stdin due to insufficient input data: input_stream_ptr=2, input_stream_len=2
INFO prove_core: fixing shape
thread 'tokio-runtime-worker' panicked at /root/.cargo/git/checkouts/sp1-20c98843a1ffc860/e3f12c
prove.rs:418:71:
called `Result::unwrap()` on an `Err` value: Any { .. }
INFO prove_core: close time.busy=3.33s time.idle=32.4s
panic: called `Result::unwrap()` on an `Err` value: Any { .. }
2025-01-02T15:17:28.353060Z  INFO Response { url: "http://localhost:3000/twirp/api.ProverService
s: {"content-type": "called `Result::unwrap()` on an `Err` value: Any { .. }", "content-length":
 15:16:49 GMT"} }
thread 'main' panicked at /root/.cargo/git/checkouts/sp1-9091391fc1cd5ab7/f0b61cf/crates/cuda/sr
called `Result::unwrap()` on an `Err` value: HttpError { status: 500, msg: "unknown error", path
eCore", content_type: "called `Result::unwrap()` on an `Err` value: Any { .. }" }
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: sp1_cuda::SP1CudaProver::prove_core
   4: <sp1_sdk::provers::cuda::CudaProver as sp1_sdk::provers::Prover<sp1_prover::components::De
   5: sp1_sdk::action::Prove::run
   6: server::main::{{closure}}
   7: tokio::runtime::park::CachedParkThread::block_on
   8: server::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace

diagnostics

hostnamectl output:

 Static hostname: testnet-prover
 Icon name: computer-vm
 Chassis: vm
 Machine ID: 5fa6be304a3c604a25565d74674f33b4
 Boot ID: f0cee139f2fa451f9d48e08cb2132b17
 Virtualization: kvm
 Operating System: Ubuntu 22.04.4 LTS
 Kernel: Linux 5.15.0-125-generic
 Architecture: x86-64
 Hardware Vendor: DigitalOcean
 Hardware Model: Droplet

replication

The above error log can be reproduced by cloning op-succinct version op-succinct-v1.0.0-rc3 on a machine with a GPU capable of generating a proof (IIRC >20GB of VRAM) and replacing op-succinct/proposer/succinct/bin/server.rs with this modified server.rs. Additionally, add the cuda feature to the workspace-level sp1-sdk dependency.

The modified server.rs strips out the server functionality and makes a single span proof request over a block range. We saw the same error logs with the full server.rs, but slimmed it down to a minimal example to reduce the number of variables. The environment variables L1_RPC, L1_BEACON_RPC, L2_RPC, and L2_NODE_RPC must be set.

Run the server binary from inside op-succinct/proposer/succinct with:
RUST_LOG=info RUST_BACKTRACE=1 cargo run --release
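Because the reproduction depends on all four RPC environment variables being set, a quick pre-flight check can rule out a misconfigured environment before digging into prover logs. The sketch below is a hypothetical helper (`missing_env_vars` is not part of op-succinct); it assumes only the four variable names listed above and uses nothing beyond the Rust standard library.

```rust
use std::env;

/// Variables the minimal server.rs needs, as listed in the issue.
const REQUIRED_VARS: [&str; 4] = ["L1_RPC", "L1_BEACON_RPC", "L2_RPC", "L2_NODE_RPC"];

/// Hypothetical helper: returns every required variable that is unset,
/// not valid Unicode, or set to an empty (or whitespace-only) string.
fn missing_env_vars(required: &[&str]) -> Vec<String> {
    required
        .iter()
        .copied()
        .filter(|name| env::var(name).map_or(true, |value| value.trim().is_empty()))
        .map(String::from)
        .collect()
}

fn main() {
    let missing = missing_env_vars(&REQUIRED_VARS);
    if missing.is_empty() {
        println!("all required environment variables are set");
    } else {
        // Fail fast with a readable message instead of surfacing later.
        eprintln!("missing environment variables: {}", missing.join(", "));
        std::process::exit(1);
    }
}
```

Calling such a check at the top of `main` keeps a missing RPC URL from being confused with a prover failure.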

@ratankaliani
Member

OP Succinct currently relies on SP1 version 3.4.0. This version has a bug: it incorrectly depends on SP1 GPU version 3.0.0, which causes issues for programs that depend on patches using SP1 versions >= 3.4.0.

In the next day or two, I'm going to merge #300, which bumps the SP1 version to 4.0.0-rc3; after that, you should be able to run your local tests with the CudaProver.
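The mismatch described in this comment comes down to an ordered version comparison. As a minimal sketch (the helper names are hypothetical, and pre-release suffixes such as `-rc3` are deliberately not parsed), Rust tuples compare lexicographically, so a (major, minor, patch) triple is enough to flag a GPU image that is older than what the patches expect:

```rust
/// (major, minor, patch); Rust tuples compare lexicographically.
type Version = (u32, u32, u32);

/// Hypothetical helper: parses "MAJOR.MINOR.PATCH". Pre-release suffixes such
/// as "-rc3" are not handled and yield `None`.
fn parse_version(s: &str) -> Option<Version> {
    let mut parts = s.split('.').map(|p| p.parse::<u32>().ok());
    let version = (parts.next()??, parts.next()??, parts.next()??);
    // Reject anything with more than three components.
    if parts.next().is_some() {
        return None;
    }
    Some(version)
}

/// Hypothetical check: true when the GPU image is older than the SP1 version
/// the program's patches were built against.
fn gpu_image_too_old(gpu_image: Version, required_by_patches: Version) -> bool {
    gpu_image < required_by_patches
}

fn main() {
    // Versions taken from the maintainer's comment: SP1 GPU 3.0.0 vs. patches on 3.4.0.
    let gpu_image = parse_version("3.0.0").expect("valid version");
    let patches = parse_version("3.4.0").expect("valid version");
    if gpu_image_too_old(gpu_image, patches) {
        println!("mismatch: GPU image {:?} is older than {:?}", gpu_image, patches);
    }
}
```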

@ratankaliani ratankaliani added the bug Something isn't working label Jan 7, 2025
@JossDuff
Author

JossDuff commented Jan 9, 2025

Thank you! Testing on the #300 branch did fix the original errors, but now we're hitting a different one.

We used the same minimal server.rs and the setup listed above to send a single proof request. Relevant logs, in order:

stderr: thread '<unnamed>' panicked at index.crates.io-6f17d22bba15001f/sp1-lib-4.0.0-rc.8/src/io.rs:79:32:
stderr: deserialization failed: Custom("Invalid size 16490003400825876010: sizes must fit in a usize (0 to 4294967295)")

... 

thread '<unnamed>' panicked at /root/.cargo/git/checkouts/sp1-wip-7e8893292eb2977e/9e1df2a/crates/core/machine/src/utils/prove.rs:409:80:
called `Result::unwrap()` on an `Err` value: ExecutionError(HaltWithNonZeroExitCode(1))

...

thread 'tokio-runtime-worker' panicked at /root/.cargo/git/checkouts/sp1-wip-7e8893292eb2977e/9e1df2a/crates/prover/src/lib.rs:371:64:
called `Result::unwrap()` on an `Err` value: Any { .. }

...

panic: called `Result::unwrap()` on an `Err` value: Any { .. }
thread 'main' panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/sp1-cuda-4.0.0-rc.8/src/lib.rs:265:82:
called `Result::unwrap()` on an `Err` value: HttpError { status: 500, msg: "unknown error", path: "/twirp/api.ProverService/ProveCore", content_type: "called `Result::unwrap()` on an `Err` value: Any { .. }" }
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: sp1_cuda::SP1CudaProver::prove_core
   4: <sp1_sdk::cuda::CudaProver as sp1_sdk::prover::Prover<sp1_prover::components::CpuProverComponents>>::prove
   5: sp1_sdk::env::prove::EnvProveBuilder::run
   6: server::main::{{closure}}
   7: tokio::runtime::park::CachedParkThread::block_on
   8: server::main

All logs (run with RUST_LOG=info RUST_BACKTRACE=1)

@JossDuff
Author

JossDuff commented Jan 9, 2025

Just an update:
I saw #300 was merged into main, so I pulled and am still seeing the same issue as above. If it's any help, the invalid size value that the program tries to deserialize into a usize is deterministic for a given start and end block of the proof.
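One plausible reading of the `Invalid size` panic, offered as a hypothesis rather than a confirmed diagnosis: the `Custom(...)` wrapper and message text match bincode 1.x, whose default encoding stores collection lengths as an 8-byte little-endian u64, and the `(0 to 4294967295)` bound shows the guest's `usize` is 32 bits. If the guest reads bytes as a length prefix that the host did not write as one (for example, because host and guest disagree on the type being serialized), it gets an enormous, input-dependent number, which would also fit the value being deterministic for a given block range. The simplified model below (`read_len_prefix_32bit` is hypothetical and models the guest's `usize` as `u32`) shows the failure mode:

```rust
/// Hypothetical, simplified model of decoding a bincode-1.x-style length prefix
/// (8 bytes, little-endian u64) on a 32-bit guest, where `usize` is modeled as `u32`.
fn read_len_prefix_32bit(bytes: &[u8]) -> Result<u32, String> {
    if bytes.len() < 8 {
        return Err("insufficient input data for length prefix".to_string());
    }
    let mut prefix = [0u8; 8];
    prefix.copy_from_slice(&bytes[..8]);
    let n = u64::from_le_bytes(prefix);
    // A 32-bit usize can only hold 0..=u32::MAX; anything larger is rejected.
    if n > u64::from(u32::MAX) {
        return Err(format!(
            "Invalid size {}: sizes must fit in a usize (0 to {})",
            n,
            u32::MAX
        ));
    }
    Ok(n as u32)
}

fn main() {
    // A prefix the host really wrote: a 3-element sequence.
    let well_formed = 3u64.to_le_bytes();
    println!("{:?}", read_len_prefix_32bit(&well_formed));

    // Eight unrelated bytes (e.g. the start of a hash) misread as a length.
    let misread = [0xAAu8; 8];
    println!("{:?}", read_len_prefix_32bit(&misread));
}
```

If this reading is right, the same panic should appear with any prover backend, since it happens during execution rather than proving.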

@ratankaliani
Member

Have you tried running that same proof with the MockProver and confirming it executes correctly? The error you're seeing seems to be related to the execution of the program, rather than an error in CUDA proof generation.

@ratankaliani ratankaliani assigned fakedev9999 and unassigned leruaa Jan 16, 2025