Seeing repeated failures to download records from the network more than 24 hours after upload, from multiple clients in multiple locations (DE / FR / SP) and on multiple architectures (x64 / AArch64).
One such test file is below, but this happens with multiple files (only tested on files between 1 MB and 8 MB in size).
Test file downloaded with: ./ant file download 653d9c5a97b589193609782db1543a782f47a40a6127de8c5369682acfd15508 .
On occasion there is an issue with get_record_from_network where, despite peers being returned, the error sent back is RecordNotFound. The code ignores this and continues trying to download the other records before ultimately failing.
It's also interesting to note that, over time, the peers returned for the record change, which may indicate an issue with the peer lookup for a XOR address on Kad. It is hard to imagine that more than six peers have all lost the record on multiple occasions.
Also seeing some strange behaviour from the Tokio thread pool: I expected more than one worker thread, but only one is being spawned. That single worker runs at 98% CPU on a single core, and I would have expected the pool to spawn more workers to use more cores.
Every 1.0s: ./gdb_get_threads 98663
PID: 98663
TID: 98663, Thread Name: ant
TID: 98664, Thread Name: tokio-runtime-worker
TID: 98665, Thread Name: tracing-appender
TID: 98666, Thread Name: ant
TID: 98667, Thread Name: futures-timer
TID: 98668, Thread Name: ant
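Regarding the single tokio-runtime-worker above: with a default multi-threaded Tokio runtime the worker count normally matches the number of cores, so seeing only one worker suggests the runtime may be built with the current_thread flavor or an explicit worker count of one. A minimal sketch (not the actual ant client startup code, just an assumption about how the runtime could be pinned explicitly):

```rust
// Hypothetical sketch: explicitly building a multi-threaded Tokio runtime so the
// worker count is not left to defaults. Not the real ant client entry point.
use tokio::runtime::Builder;

fn main() {
    let rt = Builder::new_multi_thread()
        .worker_threads(4) // expect four "tokio-runtime-worker" threads in gdb
        .thread_name("tokio-runtime-worker")
        .enable_all()
        .build()
        .expect("failed to build runtime");

    rt.block_on(async {
        // client download logic would run here
    });
}
```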
Expected Behavior
The network should return the record when requested.
If the peers are unable to locate a record, and there is no local caching of successfully fetched records, the client download should terminate immediately. There is no benefit in waiting for the other records to download once one has already failed.
If local caching of records is implemented, we can store the records we have successfully fetched and, on the next retry, only fetch the records that previously failed (see the sketch after this list).
When processing RecordNotFound, the code doesn't attempt to refresh peers or retry on a separate task; it would be interesting to see whether another attempt to fetch the record would succeed within the same Swarm session.
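As an illustration of the caching-and-retry behaviour described above, a minimal sketch; Addr, Record, and fetch_record here are hypothetical stand-ins, not the real ant client API:

```rust
// Hypothetical sketch: cache successfully fetched records so a retry only
// re-requests the addresses that previously failed.
use std::collections::HashMap;

type Addr = String;
type Record = Vec<u8>;

// Placeholder for the real network get; always fails in this sketch.
async fn fetch_record(_addr: &Addr) -> Result<Record, ()> {
    Err(())
}

// Fetch every address, skipping anything already in the cache, and return the
// list of addresses that still failed so only those are retried next time.
async fn download_all(
    addresses: &[Addr],
    cache: &mut HashMap<Addr, Record>,
) -> Result<(), Vec<Addr>> {
    let mut failed = Vec::new();
    for addr in addresses {
        if cache.contains_key(addr) {
            continue; // fetched successfully on an earlier attempt
        }
        match fetch_record(addr).await {
            Ok(record) => {
                cache.insert(addr.clone(), record);
            }
            Err(()) => failed.push(addr.clone()),
        }
    }
    if failed.is_empty() {
        Ok(())
    } else {
        Err(failed) // caller retries only these on the next pass
    }
}
```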
Snippet of a retrieval where the request worked, for record 5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7:
[2025-01-05T11:50:29.510434Z INFO ant_networking 537] Getting record from network of ad1cc7(5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7). with cfg GetRecordCfg { get_quorum: One, retry_strategy: None, target_record: "None", expected_holders: {} }
[2025-01-05T11:50:29.539689Z INFO ant_networking::cmd 415] We now have 1 pending get record attempts and cached 0 fetched copies
[2025-01-05T11:50:37.057868Z INFO ant_networking 557] Record returned: ad1cc7(5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7).
Snippet of a failure for the same record, 5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7:
[2025-01-05T11:47:06.637871Z INFO ant_networking 537] Getting record from network of ad1cc7(5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7). with cfg GetRecordCfg { get_quorum: One, retry_strategy: None, target_record: "None", expected_holders: {} }
[2025-01-05T11:47:06.639278Z INFO ant_networking::cmd 415] We now have 1 pending get record attempts and cached 0 fetched copies
[2025-01-05T11:47:51.282880Z INFO ant_networking::event::kad 154] Query task QueryId(25) NotFound record ad1cc7(5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7) among peers [
PeerId("12D3KooWFLiEqwVHpFqp8oedRBSBH95GUjzgpcC6yoB1Smm7GLJC"),
PeerId("12D3KooWQU15Jnu8pp4FGhwcLV7wRdMVtJjWYT7mBUc4GNMbgTxT"),
PeerId("12D3KooWNmaoTNEkK9maTXnAkwEVTJm9sWxrBwdRvYGuwMBsgUMB"),
PeerId("12D3KooWAg8dMV9cCeozf8SFj3D99FEDoESQxpUTfTYqpiS9UgmL"),
PeerId("12D3KooWNh9ysSsNbwEp78nqeoADKCwh52xEzW5kYPJrqFpB5mVA"),
PeerId("12D3KooWCwqjfq7E2WYB53Eh329vvnWMsecrAWcUwsaKBFthXTzX"),
PeerId("12D3KooWQK2ehb6FDfpoMHwUPKjakgrZ4nFTABVJDzsqRx1BPbKJ"),
PeerId("12D3KooWA8fg4xmbDFNfFip73djd9FyyNNbixv8jsY5f9BBDuBto"),
PeerId("12D3KooWBmhf5EDLRSedMfCSHKggAYpzWdg1oHFX8q7c1wUZXNCT"),
PeerId("12D3KooWP13mdmHJKBjszw72Ktm9CBtvyonsqu6vxVGJ1SjLcuJH")
], QueryStats { requests: 44, success: 14, failure: 19, start: Some(Instant { tv_sec: 1015450, tv_nsec: 140282755 }), end: Some(Instant { tv_sec: 1015494, tv_nsec: 775418630 }) } - ProgressStep { count: 1, last: true }
[2025-01-05T11:47:51.283336Z INFO ant_networking::event::kad 562] Get record task QueryId(25) failed with error NotFound {
key: Key(b"\xad\x1c\xc7\x0f\xb8^PJ\xa3z\x0b\x1f\xa7\xe9\x08@\x833\x84\xad\x8f\x0f\xa6H\x189\xb3\x14\xfc\x1cJ\xea"),
closest_peers: [
PeerId("12D3KooWFLiEqwVHpFqp8oedRBSBH95GUjzgpcC6yoB1Smm7GLJC"),
PeerId("12D3KooWQU15Jnu8pp4FGhwcLV7wRdMVtJjWYT7mBUc4GNMbgTxT"),
PeerId("12D3KooWNmaoTNEkK9maTXnAkwEVTJm9sWxrBwdRvYGuwMBsgUMB"),
PeerId("12D3KooWAg8dMV9cCeozf8SFj3D99FEDoESQxpUTfTYqpiS9UgmL"),
PeerId("12D3KooWNh9ysSsNbwEp78nqeoADKCwh52xEzW5kYPJrqFpB5mVA"),
PeerId("12D3KooWCwqjfq7E2WYB53Eh329vvnWMsecrAWcUwsaKBFthXTzX"),
PeerId("12D3KooWQK2ehb6FDfpoMHwUPKjakgrZ4nFTABVJDzsqRx1BPbKJ"),
PeerId("12D3KooWA8fg4xmbDFNfFip73djd9FyyNNbixv8jsY5f9BBDuBto"),
PeerId("12D3KooWBmhf5EDLRSedMfCSHKggAYpzWdg1oHFX8q7c1wUZXNCT"),
PeerId("12D3KooWP13mdmHJKBjszw72Ktm9CBtvyonsqu6vxVGJ1SjLcuJH")] }
[2025-01-05T11:47:51.284417Z WARN ant_networking 575] No holder of record 'ad1cc7(5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7)' found.
Another failure for the same record from a different location, 5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7:
[2025-01-05T13:10:00.600554Z INFO ant_networking 537] Getting record from network of ad1cc7(5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7). with cfg GetRecordCfg { get_quorum: One, retry_strategy: None, target_record: "None", expected_holders: {} }
[2025-01-05T13:10:00.601715Z INFO ant_networking::cmd 415] We now have 1 pending get record attempts and cached 0 fetched copies
[2025-01-05T13:10:49.436573Z INFO ant_networking::event::kad 154] Query task QueryId(25) NotFound record ad1cc7(5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7) among peers [
PeerId("12D3KooWSMyYfGvdzoqSjBC3ZBL2Mir2WnhiDSUhNGSqjg7A8vkB"),
PeerId("12D3KooWLVA5Sq5RuRwHfBgHBFJRG3GnV3Q1DaTiboNgEiBD9xnE"),
PeerId("12D3KooWPwgARdgVMLPbhi4aVzNMz4Q5VM2yxxzpqk9PSWzLqobT"),
PeerId("12D3KooWNmaoTNEkK9maTXnAkwEVTJm9sWxrBwdRvYGuwMBsgUMB"),
PeerId("12D3KooWAg8dMV9cCeozf8SFj3D99FEDoESQxpUTfTYqpiS9UgmL"),
PeerId("12D3KooWMD7cStTButaPMbaG2dPEDNcHC9Th5WFc6BNfE7aHnxyu"),
PeerId("12D3KooWAwMaVzfJkYFsGdZtrFwigqXQXEoFNeijsSbHMpX5Cq9C"),
PeerId("12D3KooWNr23srrTDQvyu8kRJjZnBTw5KGuniBTBX8U2CqTVHWdg"),
PeerId("12D3KooWCwqjfq7E2WYB53Eh329vvnWMsecrAWcUwsaKBFthXTzX"),
PeerId("12D3KooWCRkgUoVNGAZTvWgHz8Q6qQhztSviShKXvKbRDJAssKvW"),
PeerId("12D3KooWDsviXzPC62pV6QBz6qAp2Qj3EY8ErLSv3sUZ51UHJ5yJ"),
PeerId("12D3KooWRJcPSHZEY3Ngkgy3cW7eqjBdwKs5mGt3zzWoM5y8W5nV"),
PeerId("12D3KooWD6NGurJn1wVVz1eA7NG3LUUPAabyGdYorBYRKPZmqXob"),
PeerId("12D3KooWNMP9CmmLsNBksbiwCQry87AfJ8B6HCfnueLzyujRmdTx")
], QueryStats { requests: 108, success: 47, failure: 46, start: Some(Instant { tv_sec: 1020424, tv_nsec: 148424572 }), end: Some(Instant { tv_sec: 1020472, tv_nsec: 929126825 }) } - ProgressStep { count: 1, last: true }
[2025-01-05T13:10:49.436651Z INFO ant_networking::event::kad 562] Get record task QueryId(25) failed with error NotFound {
key: Key(b"\xad\x1c\xc7\x0f\xb8^PJ\xa3z\x0b\x1f\xa7\xe9\x08@\x833\x84\xad\x8f\x0f\xa6H\x189\xb3\x14\xfc\x1cJ\xea"),
closest_peers: [
PeerId("12D3KooWSMyYfGvdzoqSjBC3ZBL2Mir2WnhiDSUhNGSqjg7A8vkB"),
PeerId("12D3KooWLVA5Sq5RuRwHfBgHBFJRG3GnV3Q1DaTiboNgEiBD9xnE"),
PeerId("12D3KooWPwgARdgVMLPbhi4aVzNMz4Q5VM2yxxzpqk9PSWzLqobT"),
PeerId("12D3KooWNmaoTNEkK9maTXnAkwEVTJm9sWxrBwdRvYGuwMBsgUMB"),
PeerId("12D3KooWAg8dMV9cCeozf8SFj3D99FEDoESQxpUTfTYqpiS9UgmL"),
PeerId("12D3KooWMD7cStTButaPMbaG2dPEDNcHC9Th5WFc6BNfE7aHnxyu"),
PeerId("12D3KooWAwMaVzfJkYFsGdZtrFwigqXQXEoFNeijsSbHMpX5Cq9C"),
PeerId("12D3KooWNr23srrTDQvyu8kRJjZnBTw5KGuniBTBX8U2CqTVHWdg"),
PeerId("12D3KooWCwqjfq7E2WYB53Eh329vvnWMsecrAWcUwsaKBFthXTzX"),
PeerId("12D3KooWCRkgUoVNGAZTvWgHz8Q6qQhztSviShKXvKbRDJAssKvW"),
PeerId("12D3KooWDsviXzPC62pV6QBz6qAp2Qj3EY8ErLSv3sUZ51UHJ5yJ"),
PeerId("12D3KooWRJcPSHZEY3Ngkgy3cW7eqjBdwKs5mGt3zzWoM5y8W5nV"),
PeerId("12D3KooWD6NGurJn1wVVz1eA7NG3LUUPAabyGdYorBYRKPZmqXob"),
PeerId("12D3KooWNMP9CmmLsNBksbiwCQry87AfJ8B6HCfnueLzyujRmdTx")] }
[2025-01-05T13:10:49.438280Z WARN ant_networking 575] No holder of record 'ad1cc7(5d25fd36ae965e75de530644fb9265472fe350588443d010ae9ed94c9644e2a7)' found.