bug: async fn get_record_from_network fails to handle RecordNotFound #2596

Open
jadkins-me opened this issue Jan 5, 2025 · 0 comments

Autonomi Client v0.3.1
Network version: ant/0.3/1
Package version: 2024.12.1.6
Git info: stable / f55680c / 2024-12-21

Seeing repeated failures to download records from the network more than 24 hours after upload, from multiple clients in multiple locations (DE / FR / SP), on multiple architectures (x64 / AArch64).

One such test file is below, but this happens on multiple files (only tested on files between 1 MB and 8 MB in size).

Test file with ./ant file download 653d9c5a97b589193609782db1543a782f47a40a6127de8c5369682acfd15508 .

Occasionally, get_record_from_network sends back RecordNotFound even though peers were returned for the query. The code ignores this error and continues trying to download the other records before eventually failing.

It's also interesting to note that, over a period of time, the peers returned for the record change, which may indicate an issue with the peer lookup for an XOR address on Kad. It is hard to imagine that more than 6 peers had all lost the record on multiple occasions.
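As background for the peer-lookup question: Kademlia ranks peers by the XOR distance between the record key and each peer's position in the keyspace. The following is a minimal sketch of that ranking, assuming 32-byte keys that are already in the keyspace (the hashing step that maps peer IDs and record keys into the keyspace is omitted). `KadKey`, `xor_distance` and `closest_peers` are hypothetical names for illustration, not part of ant_networking or libp2p.

```rust
/// A 32-byte position in the Kademlia keyspace (assumed already hashed).
type KadKey = [u8; 32];

/// XOR distance between two keys. Comparing the resulting byte arrays
/// lexicographically orders the distances as big-endian integers.
fn xor_distance(a: &KadKey, b: &KadKey) -> [u8; 32] {
    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = a[i] ^ b[i];
    }
    out
}

/// Return the `k` peers closest to `target`, nearest first.
fn closest_peers(target: &KadKey, peers: &[KadKey], k: usize) -> Vec<KadKey> {
    let mut sorted: Vec<KadKey> = peers.to_vec();
    sorted.sort_by_key(|p| xor_distance(target, p));
    sorted.truncate(k);
    sorted
}
```

In this model the result depends only on the target key and on which peers the client has discovered, so two clients with different views of the network can end up with different "closest" sets for the same record.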

Also seeing some strange behavior from the Tokio thread pool: I expected more than one worker thread, but only one is being spawned. This single worker runs at 98% CPU on a single core, and I would have expected the pool to spawn more workers to use more cores.


Every 1.0s: ./gdb_get_threads 98663

PID: 98663
TID: 98663, Thread Name: ant
TID: 98664, Thread Name: tokio-runtime-worker
TID: 98665, Thread Name: tracing-appender
TID: 98666, Thread Name: ant
TID: 98667, Thread Name: futures-timer
TID: 98668, Thread Name: ant
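The thread listing above can be cross-checked without gdb. Below is a Linux-only sketch that reads thread names from /proc and counts the ones that look like Tokio runtime threads, alongside the number of cores visible to the process (a multi-threaded Tokio runtime defaults to one worker per available core unless configured otherwise). The function names are hypothetical and this is not how `gdb_get_threads` works.

```rust
use std::fs;
use std::io;
use std::thread;

/// Count names that look like Tokio runtime threads. Linux truncates
/// thread names in `comm` to 15 bytes, so match on a prefix.
fn count_tokio_workers(names: &[String]) -> usize {
    names
        .iter()
        .filter(|n| n.starts_with("tokio-runtime-w"))
        .count()
}

/// Read the name of every thread of `pid` from /proc (Linux only).
fn thread_names(pid: u32) -> io::Result<Vec<String>> {
    let mut names = Vec::new();
    for entry in fs::read_dir(format!("/proc/{pid}/task"))? {
        let comm = entry?.path().join("comm");
        names.push(fs::read_to_string(comm)?.trim_end().to_string());
    }
    Ok(names)
}

/// Number of cores visible to this process, falling back to 1 if unknown.
fn visible_cores() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}
```

Calling `thread_names(98663)` and passing the result to `count_tokio_workers` gives a worker count to compare against `visible_cores()`; a count of 1 on a multi-core machine would suggest the runtime was configured with fewer workers than the default.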

Expected Behavior.

  1. The network should return the record when requested.
  2. If the peers are unable to locate a record and successfully fetched records are not cached locally, the client download should terminate immediately. There is no benefit in waiting for the other records to download once one has failed.
  3. If local caching of records is implemented, we can store the records we have successfully fetched, and on the next retry fetch only the records that failed before.
  4. When processing RecordNotFound, the code doesn't attempt to refresh peers or try again on a separate thread. It would be interesting to see whether another attempt to fetch the record would succeed within the same Swarm session.
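Points 2 and 3 above can be sketched together: stop at the first failed record, but keep what was already fetched so a retry only asks for what is missing. This is an illustration under assumptions, not the client's actual code; `FetchError`, `download_records` and the `fetch` closure are hypothetical stand-ins, and the real client is async rather than synchronous.

```rust
use std::collections::HashMap;

/// Hypothetical error type standing in for the client's RecordNotFound.
#[derive(Debug, PartialEq)]
enum FetchError {
    RecordNotFound(String),
}

/// Fetch every key not already in `cache`, stopping at the first failure.
/// Records fetched before the failure stay in `cache`, so a later call
/// only asks the network for what is still missing.
fn download_records<F>(
    keys: &[&str],
    cache: &mut HashMap<String, Vec<u8>>,
    mut fetch: F,
) -> Result<(), FetchError>
where
    F: FnMut(&str) -> Result<Vec<u8>, FetchError>,
{
    for &key in keys {
        if cache.contains_key(key) {
            continue; // fetched on an earlier attempt
        }
        let record = fetch(key)?; // fail fast: abort the whole download
        cache.insert(key.to_string(), record);
    }
    Ok(())
}
```

Calling `download_records` a second time with the same `cache` skips every record that succeeded the first time, which is the retry behavior described in point 3.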

Snip of retrieval where the request worked for record : 5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7

[2025-01-05T11:50:29.510434Z INFO ant_networking 537] Getting record from network of ad1cc7(5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7). with cfg GetRecordCfg { get_quorum: One, retry_strategy: None, target_record: "None", expected_holders: {} }
[2025-01-05T11:50:29.539689Z INFO ant_networking::cmd 415] We now have 1 pending get record attempts and cached 0 fetched copies
[2025-01-05T11:50:37.057868Z INFO ant_networking 557] Record returned: ad1cc7(5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7).

Snip of failure for record : 5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7

[2025-01-05T11:47:06.637871Z INFO ant_networking 537] Getting record from network of ad1cc7(5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7). with cfg GetRecordCfg { get_quorum: One, retry_strategy: None, target_record: "None", expected_holders: {} }
[2025-01-05T11:47:06.639278Z INFO ant_networking::cmd 415] We now have 1 pending get record attempts and cached 0 fetched copies
[2025-01-05T11:47:51.282880Z INFO ant_networking::event::kad 154] Query task QueryId(25) NotFound record ad1cc7(5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7) among peers [
	PeerId("12D3KooWFLiEqwVHpFqp8oedRBSBH95GUjzgpcC6yoB1Smm7GLJC"), 
	PeerId("12D3KooWQU15Jnu8pp4FGhwcLV7wRdMVtJjWYT7mBUc4GNMbgTxT"), 
	PeerId("12D3KooWNmaoTNEkK9maTXnAkwEVTJm9sWxrBwdRvYGuwMBsgUMB"), 
	PeerId("12D3KooWAg8dMV9cCeozf8SFj3D99FEDoESQxpUTfTYqpiS9UgmL"), 
	PeerId("12D3KooWNh9ysSsNbwEp78nqeoADKCwh52xEzW5kYPJrqFpB5mVA"), 
	PeerId("12D3KooWCwqjfq7E2WYB53Eh329vvnWMsecrAWcUwsaKBFthXTzX"), 
	PeerId("12D3KooWQK2ehb6FDfpoMHwUPKjakgrZ4nFTABVJDzsqRx1BPbKJ"), 
	PeerId("12D3KooWA8fg4xmbDFNfFip73djd9FyyNNbixv8jsY5f9BBDuBto"), 
	PeerId("12D3KooWBmhf5EDLRSedMfCSHKggAYpzWdg1oHFX8q7c1wUZXNCT"), 
	PeerId("12D3KooWP13mdmHJKBjszw72Ktm9CBtvyonsqu6vxVGJ1SjLcuJH")
], QueryStats { requests: 44, success: 14, failure: 19, start: Some(Instant { tv_sec: 1015450, tv_nsec: 140282755 }), end: Some(Instant { tv_sec: 1015494, tv_nsec: 775418630 }) } - ProgressStep { count: 1, last: true }
[2025-01-05T11:47:51.283336Z INFO ant_networking::event::kad 562] Get record task QueryId(25) failed with error NotFound { 
	key: Key(b"\xad\x1c\xc7\x0f\xb8^PJ\xa3z\x0b\x1f\xa7\xe9\x08@\x833\x84\xad\x8f\x0f\xa6H\x189\xb3\x14\xfc\x1cJ\xea"), 
	closest_peers: [
	PeerId("12D3KooWFLiEqwVHpFqp8oedRBSBH95GUjzgpcC6yoB1Smm7GLJC"), 
	PeerId("12D3KooWQU15Jnu8pp4FGhwcLV7wRdMVtJjWYT7mBUc4GNMbgTxT"), 
	PeerId("12D3KooWNmaoTNEkK9maTXnAkwEVTJm9sWxrBwdRvYGuwMBsgUMB"), 
	PeerId("12D3KooWAg8dMV9cCeozf8SFj3D99FEDoESQxpUTfTYqpiS9UgmL"), 
	PeerId("12D3KooWNh9ysSsNbwEp78nqeoADKCwh52xEzW5kYPJrqFpB5mVA"), 
	PeerId("12D3KooWCwqjfq7E2WYB53Eh329vvnWMsecrAWcUwsaKBFthXTzX"), 
	PeerId("12D3KooWQK2ehb6FDfpoMHwUPKjakgrZ4nFTABVJDzsqRx1BPbKJ"), 
	PeerId("12D3KooWA8fg4xmbDFNfFip73djd9FyyNNbixv8jsY5f9BBDuBto"), 
	PeerId("12D3KooWBmhf5EDLRSedMfCSHKggAYpzWdg1oHFX8q7c1wUZXNCT"), 
	PeerId("12D3KooWP13mdmHJKBjszw72Ktm9CBtvyonsqu6vxVGJ1SjLcuJH")] }
[2025-01-05T11:47:51.284417Z WARN ant_networking 575] No holder of record 'ad1cc7(5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7)' found.
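
The log above shows the request being made with `retry_strategy: None`, so a single NotFound is final. As a rough sketch of what a bounded retry could look like, the helper below wraps any fetch operation with exponential backoff. `get_with_retry` and its parameters are hypothetical and do not reflect the retry API in ant_networking; it is synchronous only to stay self-contained.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Call `fetch` up to `max_attempts` times, doubling the delay after
/// each failure. Returns the first success or the last error.
/// `fetch` receives the 1-based attempt number.
fn get_with_retry<T, E, F>(
    max_attempts: u32,
    initial_delay: Duration,
    mut fetch: F,
) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    assert!(max_attempts > 0, "need at least one attempt");
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match fetch(attempt) {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay); // wait before asking the network again
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}
```

Whether a retry inside the same Swarm session would actually find the record is the open question raised in this issue; the sketch only shows where such a retry would sit.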

Another failure for the same record, from a different location: 5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7


[2025-01-05T13:10:00.600554Z INFO ant_networking 537] Getting record from network of ad1cc7(5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7). with cfg GetRecordCfg { get_quorum: One, retry_strategy: None, target_record: "None", expected_holders: {} }
[2025-01-05T13:10:00.601715Z INFO ant_networking::cmd 415] We now have 1 pending get record attempts and cached 0 fetched copies
[2025-01-05T13:10:49.436573Z INFO ant_networking::event::kad 154] Query task QueryId(25) NotFound record ad1cc7(5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7) among peers [
	PeerId("12D3KooWSMyYfGvdzoqSjBC3ZBL2Mir2WnhiDSUhNGSqjg7A8vkB"), 
	PeerId("12D3KooWLVA5Sq5RuRwHfBgHBFJRG3GnV3Q1DaTiboNgEiBD9xnE"), 
	PeerId("12D3KooWPwgARdgVMLPbhi4aVzNMz4Q5VM2yxxzpqk9PSWzLqobT"), 
	PeerId("12D3KooWNmaoTNEkK9maTXnAkwEVTJm9sWxrBwdRvYGuwMBsgUMB"), 
	PeerId("12D3KooWAg8dMV9cCeozf8SFj3D99FEDoESQxpUTfTYqpiS9UgmL"), 
	PeerId("12D3KooWMD7cStTButaPMbaG2dPEDNcHC9Th5WFc6BNfE7aHnxyu"), 
	PeerId("12D3KooWAwMaVzfJkYFsGdZtrFwigqXQXEoFNeijsSbHMpX5Cq9C"), 
	PeerId("12D3KooWNr23srrTDQvyu8kRJjZnBTw5KGuniBTBX8U2CqTVHWdg"), 
	PeerId("12D3KooWCwqjfq7E2WYB53Eh329vvnWMsecrAWcUwsaKBFthXTzX"), 
	PeerId("12D3KooWCRkgUoVNGAZTvWgHz8Q6qQhztSviShKXvKbRDJAssKvW"), 
	PeerId("12D3KooWDsviXzPC62pV6QBz6qAp2Qj3EY8ErLSv3sUZ51UHJ5yJ"), 
	PeerId("12D3KooWRJcPSHZEY3Ngkgy3cW7eqjBdwKs5mGt3zzWoM5y8W5nV"), 
	PeerId("12D3KooWD6NGurJn1wVVz1eA7NG3LUUPAabyGdYorBYRKPZmqXob"), 
	PeerId("12D3KooWNMP9CmmLsNBksbiwCQry87AfJ8B6HCfnueLzyujRmdTx")
], QueryStats { requests: 108, success: 47, failure: 46, start: Some(Instant { tv_sec: 1020424, tv_nsec: 148424572 }), end: Some(Instant { tv_sec: 1020472, tv_nsec: 929126825 }) } - ProgressStep { count: 1, last: true }
[2025-01-05T13:10:49.436651Z INFO ant_networking::event::kad 562] Get record task QueryId(25) failed with error NotFound { 
	key: Key(b"\xad\x1c\xc7\x0f\xb8^PJ\xa3z\x0b\x1f\xa7\xe9\x08@\x833\x84\xad\x8f\x0f\xa6H\x189\xb3\x14\xfc\x1cJ\xea"), 
	closest_peers: [
	PeerId("12D3KooWSMyYfGvdzoqSjBC3ZBL2Mir2WnhiDSUhNGSqjg7A8vkB"), 
	PeerId("12D3KooWLVA5Sq5RuRwHfBgHBFJRG3GnV3Q1DaTiboNgEiBD9xnE"), 
	PeerId("12D3KooWPwgARdgVMLPbhi4aVzNMz4Q5VM2yxxzpqk9PSWzLqobT"), 
	PeerId("12D3KooWNmaoTNEkK9maTXnAkwEVTJm9sWxrBwdRvYGuwMBsgUMB"), 
	PeerId("12D3KooWAg8dMV9cCeozf8SFj3D99FEDoESQxpUTfTYqpiS9UgmL"), 
	PeerId("12D3KooWMD7cStTButaPMbaG2dPEDNcHC9Th5WFc6BNfE7aHnxyu"), 
	PeerId("12D3KooWAwMaVzfJkYFsGdZtrFwigqXQXEoFNeijsSbHMpX5Cq9C"), 
	PeerId("12D3KooWNr23srrTDQvyu8kRJjZnBTw5KGuniBTBX8U2CqTVHWdg"), 
	PeerId("12D3KooWCwqjfq7E2WYB53Eh329vvnWMsecrAWcUwsaKBFthXTzX"), 
	PeerId("12D3KooWCRkgUoVNGAZTvWgHz8Q6qQhztSviShKXvKbRDJAssKvW"), 
	PeerId("12D3KooWDsviXzPC62pV6QBz6qAp2Qj3EY8ErLSv3sUZ51UHJ5yJ"), 
	PeerId("12D3KooWRJcPSHZEY3Ngkgy3cW7eqjBdwKs5mGt3zzWoM5y8W5nV"), 
	PeerId("12D3KooWD6NGurJn1wVVz1eA7NG3LUUPAabyGdYorBYRKPZmqXob"), 
	PeerId("12D3KooWNMP9CmmLsNBksbiwCQry87AfJ8B6HCfnueLzyujRmdTx")] }
[2025-01-05T13:10:49.438280Z WARN ant_networking 575] No holder of record 'ad1cc7(5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7)' found.
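
The two failed lookups above returned different closest-peer sets (10 peers in the first, 14 in the second). A small helper like the following makes the overlap easy to measure when comparing such logs; `shared_peers` is a hypothetical name and the peer IDs are treated as plain strings.

```rust
use std::collections::BTreeSet;

/// Peer IDs present in both lookups, in sorted order.
fn shared_peers<'a>(first: &[&'a str], second: &[&'a str]) -> Vec<&'a str> {
    let a: BTreeSet<&str> = first.iter().copied().collect();
    let b: BTreeSet<&str> = second.iter().copied().collect();
    a.intersection(&b).copied().collect()
}
```

Applied to the two peer lists in the logs above, only three peers appear in both lookups: the IDs beginning 12D3KooWAg8dMV9c, 12D3KooWCwqjfq7E and 12D3KooWNmaoTNEk.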