Spawned my own instance of lemmy: now I've got a lot of questions about federation

gabriele97@lemmy.g97.top · 2 years ago

chiisana@lemmy.chiisana.net · 2 years ago

The "Header is expired" issue is a big part of the current federation problem. And whether you know it or not, and whether you like it or not, you’ve just made the matter worse. You’re not to blame though. I’ve done it too, along with many other people self-hosting their own instances.

The way federation currently works is each write action must be federated outwards to each federated instance. A comment reply, such as this one, must be federated outwards by the hosting instance. An instance receiving a federation event must also discard messages that are older than 10 seconds.
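A minimal sketch of the receiving-side age check described above. The function name, the constant, and the timestamps are illustrative assumptions, not Lemmy's actual code; only the 10-second window comes from the comment.

```python
from datetime import datetime, timedelta, timezone

# Window taken from the comment above; everything else is hypothetical.
MAX_AGE = timedelta(seconds=10)

def is_expired(sent_at, received_at, max_age=MAX_AGE):
    """Return True if the message is older than the allowed window."""
    return received_at - sent_at > max_age

# Arbitrary example timestamp, chosen only for illustration.
sent = datetime(2023, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(is_expired(sent, sent + timedelta(seconds=3)))   # False
print(is_expired(sent, sent + timedelta(seconds=12)))  # True
```

Anything that sits in the sender's queue for longer than the window gets rejected on arrival, even if the receiver itself is healthy.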

Here lies the problem… popular instances like lemmy.world and lemmy.ml have thousands of users and thousands of federated servers. Yesterday, when I checked, lemmy.world had 3600 users per day and 2200+ federated servers. If there’s a really popular post on a very popular community, and 10% of the users comment on it? The lemmy.world server must send 360 x 2200 = 792K outbound federation event messages. Each one of these is sent over HTTPS via TCP, so they can’t all be sent at the same time, and the messages are put into a queue where the federation workers will send them out. Each worker sends a message and, because HTTPS runs over TCP, it is not fire and forget: the worker must wait for the packets to be acknowledged. If an instance owner gets bored because they’re not getting all the messages and shuts down? Now the worker needs to wait for that connection to error out, thereby delaying messages further down the queue. If it had to wait more than 10 seconds? Everyone down the queue will just get expired messages because the event is already outdated.
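To make the arithmetic and the queueing effect concrete, here is a rough simulation. It is a deliberately simplified model (all messages enqueued at time zero, a message counts as expired if it waited longer than the window before a worker picked it up), and the function names, latencies, and worker counts are made-up assumptions, not measurements of any real instance.

```python
import heapq

def fanout(commenters, instances):
    """Outbound messages if every comment goes to every federated instance."""
    return commenters * instances

def count_expired(delivery_seconds, workers, max_age=10.0):
    """Count messages that waited longer than max_age before being sent.

    delivery_seconds: how long each delivery blocks a worker, in queue order.
    """
    free_at = [0.0] * workers          # time at which each worker is next idle
    heapq.heapify(free_at)
    expired = 0
    for duration in delivery_seconds:
        start = heapq.heappop(free_at)  # earliest available worker
        if start > max_age:
            expired += 1
        heapq.heappush(free_at, start + duration)
    return expired

print(fanout(360, 2200))  # 792000

# One dead instance (30 s until the connection errors out) ahead of 99 fast ones.
queue = [30.0] + [0.05] * 99
print(count_expired(queue, workers=1))  # 99
print(count_expired(queue, workers=2))  # 0
```

With a single worker, one unreachable server at the head of the queue is enough to expire everything behind it. More workers help, but only until enough dead servers pile up to occupy all of them.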

So now that you’ve already created an instance and are adding to the load of the network, just like me, what can you do? Keep your server online in a fast data center. Use Cloudflare to reduce latency. That way at least your server isn’t going to introduce too much latency for other servers down the queue. Hopefully the devs figure out something to make the process better. I’ve already put in a proposal on GitHub for a more scalable notification fleet architecture. Let’s see if they can implement that or change other requirements on the system.

gabriele97@lemmy.g97.top · 2 years ago

Can you link your proposed change? I am interested

TitanLaGrange@lemmy.world · edit-2 2 years ago

Each one of these are sent over HTTPS via TCP

Do you happen to know how the server-to-server connections are managed? I’m not too familiar with it, but it seems like HTTP/3 might provide some benefits for server-to-server communication.

Also, regarding queuing federation messages, I’m curious whether packages like Kafka or Pulsar have been considered. They aren’t typically used over HTTP, but it doesn’t seem like it would be too hard to adapt. The stream retention policy could be set to allow consumers to pick up older records as they have capacity, which would avoid the issue of servers getting out of sync: the consumer would know its offset for each stream it was consuming and could catch up at its own pace, provided it doesn’t fall so far behind that the records expire. Publishers could provide separate topics for different message types to allow consumers to prioritize activity types (for example, prioritizing receiving replies over up/down votes). Servers could also potentially use cluster replication (MirrorMaker) to handle moving activity records from one server to another (again, HTTP-only would be an issue here), and each server could then consume the federation activity messages locally from its own queue.
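The offset-and-retention idea can be sketched with a tiny in-memory log. This is not the Kafka or Pulsar API; the class, the `poll` helper, the topic names, and the payloads are all hypothetical, and it only illustrates consumers pulling at their own pace with replies prioritized over votes.

```python
class RetainedLog:
    """Append-only log that serves records until they pass the retention window."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self.records = []      # (offset, timestamp, payload)
        self.next_offset = 0

    def append(self, payload, now):
        self.records.append((self.next_offset, now, payload))
        self.next_offset += 1

    def read(self, offset, limit, now):
        """Return (batch, next_offset) of unexpired records starting at offset."""
        live = [r for r in self.records
                if r[0] >= offset and now - r[1] <= self.retention]
        batch = live[:limit]
        return batch, (batch[-1][0] + 1 if batch else self.next_offset)

def poll(topics, offsets, priority, capacity, now):
    """Consume up to `capacity` payloads, draining higher-priority topics first."""
    out = []
    for name in priority:
        remaining = capacity - len(out)
        if remaining <= 0:
            break
        batch, offsets[name] = topics[name].read(offsets[name], remaining, now)
        out.extend(payload for _, _, payload in batch)
    return out

topics = {"replies": RetainedLog(86400), "votes": RetainedLog(86400)}
topics["votes"].append("upvote on post 1", now=0)
topics["replies"].append("reply to comment 7", now=1)
topics["votes"].append("downvote on post 2", now=2)

offsets = {"replies": 0, "votes": 0}
first = poll(topics, offsets, ["replies", "votes"], capacity=2, now=3)
second = poll(topics, offsets, ["replies", "votes"], capacity=2, now=4)
print(first)   # ['reply to comment 7', 'upvote on post 1']
print(second)  # ['downvote on post 2']
```

The key difference from the push model is that a slow consumer only delays itself: the publisher never blocks waiting on it.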

Kafka and Pulsar both have strong scaling support, so adding capacity for federation messages should be fairly straightforward.

I’ve only used Kafka once, and I’m completely unqualified to operate an instance of any complexity, but in general my experience with it was pretty good.

Quacksalber@sh.itjust.works · 2 years ago

As someone who has just enough knowledge to know how big the task of creating a performant way to propagate updates through the federation is, I really hope there are some smart people working on a solution. That is the biggest advantage Reddit has over Lemmy: known and centralized hardware standards. Lemmy needs to find a way to make propagation work when half of all instances are hosted at home on consumer-grade hardware.

z3bra · 2 years ago

Isn’t there a mechanism to remove servers that keep timing out? Or a way to unregister your instance? Otherwise the model could never scale properly, as servers get retired every now and then, even within the same instance.

chiisana@lemmy.chiisana.net · 2 years ago

If there is an option to adjust/disable it, I wasn’t able to find it.

kopper [they/them]@lemmy.blahaj.zone · 2 years ago

This commit changes the timeout to 1 day. I assume 0.18 will ship it, though I haven’t checked.

z3bra · 2 years ago

It changes the timeout of the HTTP request though. It means servers are more likely to accept a request that has been delayed. But if a server gets removed, the sender will still try to send requests to it over and over, waiting for the TCP socket to time out before going on to the next server.

My initial question was: if a server tries to send, say, 100 requests to another server, and they all time out, will the sending server eventually remove it from the queue?
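For illustration, the kind of mechanism being asked about could look like the sketch below: count consecutive failures per host and stop sending once a threshold is reached. This is a hypothetical design, not a description of what Lemmy does; the class name, method names, threshold, and host names are all assumptions.

```python
class DeliveryTracker:
    """Stop sending to a host after too many consecutive failed deliveries."""

    def __init__(self, max_failures=100):
        self.max_failures = max_failures
        self.failures = {}   # host -> consecutive failure count
        self.dead = set()    # hosts currently skipped

    def record(self, host, ok):
        if ok:
            # Any success resets the counter and revives the host.
            self.failures[host] = 0
            self.dead.discard(host)
            return
        self.failures[host] = self.failures.get(host, 0) + 1
        if self.failures[host] >= self.max_failures:
            self.dead.add(host)

    def should_send(self, host):
        return host not in self.dead

tracker = DeliveryTracker(max_failures=3)
for _ in range(3):
    tracker.record("gone.example", ok=False)
tracker.record("alive.example", ok=True)

print(tracker.should_send("gone.example"))   # False
print(tracker.should_send("alive.example"))  # True
```

A real implementation would probably also retry dead hosts occasionally so that an instance coming back from an outage is not dropped forever.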

HTTP_404_NotFound@lemmyonline.com · 2 years ago

And whether you know it like it or not, you’ve just made the matter worse.

What is making the matter worse is everyone clustering together on lemmy.ml and lemmy.world. This causes those particular servers to be vastly overloaded.

If, say, people created communities ELSEWHERE, the load could be spread around.

I’m not saying the architecture of federation isn’t a problem; it is indeed a huge problem. But in the interim, this can be helped by people spreading out.