[HN Gopher] Migrating Millions of Concurrent WebSockets to Envoy
___________________________________________________________________
Migrating Millions of Concurrent WebSockets to Envoy

Author : jbredeche
Score  : 93 points
Date   : 2021-03-16 14:12 UTC (1 days ago)

(HTM) web link (slack.engineering)
(TXT) w3m dump (slack.engineering)

| endisneigh wrote:
| I wonder if Slack has considered using webrtc to do peer to peer
| chats on the client side and then gathering up the chat metadata
| and having each client periodically send their version of the
| history and reconciling it server side.
|
| This would also have the effect of allowing slack to peer more or
| less normally even if Slack was down (of course bots, search, etc
| wouldn't work).

| ryanianian wrote:
| I suspect there may be regulatory restrictions about allowing
| text-based communications that aren't available during an
| audit.

| ssss11 wrote:
| What regulation do you think would apply? And how/why would
| this regulation differ for e2e encrypted chat products like
| Signal, Telegram, WhatsApp etc that can't access text based
| chat messages?

| zonotope wrote:
| IANAL, but the enterprise companies that make up Slack's
| customer base are often under regulations to preserve their
| employees' official communications in case they are needed
| for future investigations. Those same regulations prevent
| them from using the products you listed as official
| communication channels.

| [deleted]

| toomuchtodo wrote:
| FINRA recordkeeping and retention requirements, as well as
| SEC statute around records and reporting requirements
| (finance industry specific).

| detaro wrote:
| Companies that have such audit requirements do not use
| Signal et al either for internal comms.

| lovedswain wrote:
| It's possible to implement all of this without inheriting the
| additional infrastructure and networking complexity WebRTC
| brings along with it, not forgetting WebRTC still relies on
| centralized components to coordinate. Don't use WebRTC unless
| you really need the features it offers; routers in many
| scenarios hate it and even where they allow it, the
| combinatorial explosion in possible configurations to support
| and diagnose between peers is a problem nobody should willingly
| invite unless they can't achieve a solution any other way.
|
| With WebRTC you give up the nice ultra-low-common-denominator
| "outbound port 443/TCP needs to work" requirement and replace
| it with "UDP networking generally healthy, possible to
| establish port mappings, possible to maintain stable port
| mappings over time, possible to not have mappings go away due
| to lack of traffic" etc etc

| SahAssar wrote:
| It sounds like the only thing you did was signaling, not STUN
| and TURN.
|
| If you do both STUN and TURN it works on most networks. I've
| worked at really restricted work sites, and while STUN fails
| at those, if you have a TURN server then it almost always
| works.
|
| These sorts of comments are why people think webRTC is
| unstable while the same people use slack calls which
| literally use webRTC.
|
| I might be wrong, but please don't talk about network
| reliability in webRTC without specifying if you have a
| working STUN and/or TURN setup.

| meheleventyone wrote:
| Hah, this is so true. Am building a little hobby project to
| try out WebRTC for game development. On my ISP-provided
| router a Mac and Windows computer can't see each other over
| WiFi due to some mDNS issue, likely the router's support for
| multicast. Using Chrome flags to turn off mDNS, they can
| connect fine but obviously expose internal IPs. Wire one of
| the machines and mDNS works.
| TURN is essentially a necessity
| but then why not use a server (particularly for a chat app).

| SahAssar wrote:
| Sounds like you mean STUN, not TURN.

| meheleventyone wrote:
| No, I'm using a STUN server. This issue is unrelated and
| due to the local IPs being masked by mDNS addresses so
| that local network topology isn't leaked to the world at
| large, and my router's handling of mDNS. Which is why
| everything works over the local network if I disable mDNS
| use in Chrome. TURN is the ultimate fallback to being
| unable to NAT punch.
|
| Ironically getting machines connected across the internet
| with WebRTC has so far been relatively smooth sailing.

| littlestymaar wrote:
| There's some truth in what you said, but also a few
| exaggerations.
|
| First of all, while WebRTC has its share of complexity when
| using it for videoconferencing, here we are talking about
| using the DataChannel, which is really straightforward to use
| and doesn't need additional infrastructure.
|
| > not forgetting WebRTC still relies on centralized
| components to coordinate
|
| It needs a centralized component to _set up_ the connection
| (signaling), if it fails later, your communication channel is
| still up. And the good thing if you have a websocket-based
| chat service, is that you can directly use it for the
| signaling purpose with zero modifications on the back-end
| side.
|
| > routers in many scenarios hate it and even where they allow
| it, the combinatorial explosion in possible configurations to
| support and diagnose between peers is a problem nobody should
| willingly invite unless they can't achieve a solution any
| other way
|
| When using the DataChannel, your failure mode is _can't
| establish a connection_, not some hard to understand
| Heisenbug. All you need is to provide a centralized fallback
| for clients who cannot establish a connection.
| This fallback
| will depend on the centralized service being up, but in case
| of failure you'll keep most of your users without disturbance
| (at least in the first world, the network is not as WebRTC
| friendly in other places of the world). And because the
| DataChannel's API is close to the WebSocket's one,
| implementing the fallback is straightforward.
|
| Though, in Slack's situation there is a good reason not to
| use WebRTC: they can have several thousands of people in the
| same channel (IIRC IBM uses Slack and has most of their
| employees in a shared channel for official announcements).
| You won't be able to do that with WebRTC[1] if a user needs
| to establish a connection with every other user in the
| channel (there's just not enough ports available). And even
| worse, back in 2016, Chrome's implementation of the
| DataChannel was so poor, you could not establish more than a
| handful of PeerConnections before feeling the browser
| becoming sluggish (this wasn't the case in Firefox so maybe
| Google fixed that since then).
|
| Also, Slack's users are likely to be in some enterprise
| network, which makes WebRTC more likely to fail than when your
| customers are at home, which reduces the opportunity.
|
| Main takeaway: WebRTC-based chat is probably not a great fit
| for Slack, but don't be afraid of using it: it's not hard, it
| combines well with your already existing centralized
| infrastructure, and can massively reduce the load on it.
|
| [1] unless you want to build some fancy sparse mesh network,
| but _this_ is likely overengineering.

| ex3ndr wrote:
| I am curious about the backend part - is ws still ws on the
| services? Why? For example, why have thousands of connections
| instead of a single one (or a bunch) that simply forwards
| websocket packets with some "connection id" attached?
|
| This way you could restart a service without killing the ws
| connections, and move all the overhead of handling millions of
| connections to the lb.
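
The "connection id" idea in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: the frame layout (8-byte connection id, 4-byte length, payload) and the names `encode_frame` and `decode_frames` are hypothetical and are not taken from the article, from Slack, or from Envoy.

```python
import struct
from typing import List, Tuple

# Hypothetical frame layout (an assumption for illustration only):
#   8-byte connection id | 4-byte payload length | payload bytes
_HEADER = struct.Struct("!QI")


def encode_frame(conn_id: int, payload: bytes) -> bytes:
    """Tag one client message with its connection id so that many client
    connections can share a single load-balancer-to-service stream."""
    return _HEADER.pack(conn_id, len(payload)) + payload


def decode_frames(buffer: bytes) -> Tuple[List[Tuple[int, bytes]], bytes]:
    """Split a byte buffer into complete (conn_id, payload) frames.

    Returns the decoded frames plus any trailing bytes that belong to a
    frame which has not fully arrived yet.
    """
    frames: List[Tuple[int, bytes]] = []
    offset = 0
    while len(buffer) - offset >= _HEADER.size:
        conn_id, length = _HEADER.unpack_from(buffer, offset)
        end = offset + _HEADER.size + length
        if end > len(buffer):
            break  # payload is incomplete; wait for more bytes
        frames.append((conn_id, buffer[offset + _HEADER.size:end]))
        offset = end
    return frames, buffer[offset:]
```

With something like this, the load balancer would call `encode_frame` for every message it receives from a client socket, and the service would call `decode_frames` on the shared stream, keeping the leftover bytes until the next read. A service restart then only drops the shared stream, not the client connections, which is the property the comment is asking about.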

| jeffbee wrote:
| When you control both client and server it seems like hot restart
| is just a complicated stunt you don't need. Isn't it fine to just
| stop accepting connections, tell all your clients to reconnect,
| and do a normal restart? The frontend load balancer that stands
| between you and Gmail doesn't know how to restart hot but you
| probably never noticed.

| hermanradtke wrote:
| > stop accepting connections, tell all your clients to
| reconnect
|
| This "drain" pattern is great for maintenance, upgrades, etc
| too.
|
| The only caveat is that the clients need to be given time to
| migrate. How long that is depends on how well the clients
| behave. A hot restart may be much faster.

| jsiepkes wrote:
| > Isn't it fine to just stop accepting connections, tell all
| your clients to reconnect, and do a normal restart?
|
| Depends on how many config changes you need (per day).
|
| Besides, Envoy supports it and I would call it a bonus if you
| can reload your configuration without client interruption. As
| for complicating things, the implementation for hot reload isn't
| terribly complicated in Envoy.

| jeffbee wrote:
| I'm mentally separating the hot restart part from the
| reloadable configs part, even though they are together in the
| article. To me, not having reloadable configs is too crazy to
| even imagine.

| mbyio wrote:
| I think Slack is different than Gmail because people are
| actively having conversations, so if you disconnect, it is much
| more likely to be noticeable and annoying.
|
| Reading between the lines, I think what they would need is a
| way to tell clients to move to a new websocket connection at
| the _proxy_ layer. I don't think there is an easy built-in way
| to do this in the websocket protocol, so they would have to
| implement something custom in their application layer.
| This
| would also require triggering custom code in the client to make
| a new websocket connection, start using it, and then close the
| old connection.
|
| I feel like it would have been simpler to just have the client
| do a graceful reconnect every 5 minutes. But they probably
| decided to use envoy so they could have the other advantages
| too.

| zemo wrote:
| maybe it could work, but in practice it's often not as easy as
| you'd like it to be. disconnecting everyone at or around the
| same time can easily create a thundering herd or
| a TCP global synchronization problem, so "just ask everyone to
| reconnect" has its own set of complications.

| jayd16 wrote:
| The Gmail load balancer has to do a cold restart to add or
| remove an instance? That's the requirement they placed on
| themselves because they do not trust the runtime HAProxy api.

| theflyinghorse wrote:
| If they do not trust HAProxy runtime API then why are they
| using HAProxy at all?

| vad_ wrote:
| HAProxy itself is a solid piece of software. The runtime
| API is something they added on top of it because of
| competitors (envoy).

| [deleted]

| forgotmypw17 wrote:
| Is there a plan to migrate to an open protocol or non-crappy
| client?
___________________________________________________________________
(page generated 2021-03-17 23:00 UTC)