Hey, so… you won’t believe this, but I just spent hours chasing one of those bugs that make you question your life choices.
Everything looked fine at first. You know that feeling, right? Pods running, services up, nothing obviously broken. But then… random failures. Logs screaming “connection refused,” traces looking like total nonsense.
So I’m sitting there like, “Okay… what is even happening?”
Well, after digging way too deep, I finally found it. Turns out it was a race condition. Yeah, one of those. It was happening between federation hooks and Redis cache invalidation. Basically, things were happening slightly out of order… just enough to break stuff randomly.
And the worst part? It didn’t fail every time. Only sometimes. Can you imagine that?
So yeah, I kept going back and forth, thinking I fixed it… then boom, same issue again.
Here’s what I ended up doing. Nothing fancy at all. Just added exponential backoff to the retry logic:
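
    use std::time::Duration;

    use anyhow::{anyhow, Result};

    // Retries `send_to_relays` up to `max_retries` times, doubling the wait
    // between attempts (100 ms, 200 ms, 400 ms, ...).
    async fn retry_federation(activity: Activity, max_retries: u32) -> Result<()> {
        let mut delay = Duration::from_millis(100);
        for attempt in 0..max_retries {
            match send_to_relays(&activity).await {
                Ok(_) => return Ok(()),
                // Not the last attempt yet: back off, then try again.
                Err(_) if attempt + 1 < max_retries => {
                    tokio::time::sleep(delay).await;
                    delay *= 2;
                }
                // Last attempt failed: surface the underlying error.
                Err(e) => return Err(e),
            }
        }
        // Only reached when max_retries is 0.
        Err(anyhow!("Federation failed after {} retries", max_retries))
    }
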
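If you're wondering how long that loop can stall before giving up, here's a rough sketch that just lists the sleeps, assuming every single attempt fails. It isn't part of the real code: `backoff_schedule` and `worst_case_wait` are names I made up for illustration, and the only things taken from the function above are the 100 ms starting delay, the doubling, and the fact that it doesn't sleep after the last attempt.

```rust
use std::time::Duration;

/// Hypothetical helper: the sleeps a retry loop performs when every attempt
/// fails. It sleeps after each failed attempt except the last one, and
/// doubles the delay each time.
fn backoff_schedule(base: Duration, max_retries: u32) -> Vec<Duration> {
    let mut delays = Vec::new();
    let mut delay = base;
    // saturating_sub avoids underflow when max_retries is 0.
    for _ in 0..max_retries.saturating_sub(1) {
        delays.push(delay);
        // saturating_mul keeps a huge retry count from panicking on overflow.
        delay = delay.saturating_mul(2);
    }
    delays
}

/// Hypothetical helper: total time spent sleeping in the worst case.
fn worst_case_wait(base: Duration, max_retries: u32) -> Duration {
    backoff_schedule(base, max_retries).into_iter().sum()
}
```

With a 100 ms base and `max_retries = 5`, that works out to sleeps of 100, 200, 400, and 800 ms, so roughly 1.5 s of waiting in the worst case, not counting the time the requests themselves take.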
And yeah… that actually fixed it.
Not some big architectural change. Just… “wait a bit and try again” 😄
So yeah, lesson learned: timing issues in distributed systems are sneaky. Especially with federation stuff. Cold starts, retries, cache timing… all of it can mess with you.
Anyway, that was my day. What do you think? Ever had a bug like this where everything looked fine but totally wasn’t?

> 0…max_retries

Good thing Rust replaced `...` with `..=` for inclusive range syntax. Otherwise, the webshit markdown implementation used by Lemmy UI replacing `..` with the `…` ligature would have been confusingly problematic 😉.

And this seems to be yet another case showing that federation was poorly designed, and should have been designed as pull-based (and batchable/packable), instead of endlessly spam-pushing individual messages and hoping for the best.


