[HN Gopher] Migrating Millions of Concurrent WebSockets to Envoy
       ___________________________________________________________________
        
       Migrating Millions of Concurrent WebSockets to Envoy
        
       Author : jbredeche
       Score  : 93 points
       Date   : 2021-03-16 14:12 UTC (1 days ago)
        
 (HTM) web link (slack.engineering)
 (TXT) w3m dump (slack.engineering)
        
       | endisneigh wrote:
       | I wonder if Slack has considered using WebRTC to do peer to peer
       | chats on the client side and then gathering up the chat metadata
       | and having each client periodically send their version of the
       | history and reconciling it server side.
       | 
       | This would also have the effect of allowing Slack clients to peer
       | more or less normally even if Slack were down (of course bots,
       | search, etc. wouldn't work).
        
         | ryanianian wrote:
         | I suspect there may be regulatory restrictions about allowing
         | text-based communications that aren't available during an
         | audit.
        
           | ssss11 wrote:
           | What regulation do you think would apply? And how/why would
           | this regulation differ for e2e encrypted chat products like
           | Signal, Telegram, WhatsApp etc that can't access text based
           | chat messages?
        
             | zonotope wrote:
             | IANAL, but the enterprise companies that make up Slack's
             | customer base are often under regulations to preserve their
             | employees' official communications in case they are needed
             | for future investigations. Those same regulations prevent
             | them from using the products you listed as official
             | communication channels.
        
             | [deleted]
        
             | toomuchtodo wrote:
             | FINRA recordkeeping and retention requirements, as well as
             | SEC statute around records and reporting requirements
             | (finance industry specific).
        
             | detaro wrote:
             | Companies that have such audit requirements do not use
             | Signal et al either for internal comms.
        
         | lovedswain wrote:
         | It's possible to implement all of this without inheriting the
         | additional infrastructure and networking complexity WebRTC
         | brings along with it, not forgetting WebRTC still relies on
         | centralized components to coordinate. Don't use WebRTC unless
         | you really need the features it offers: routers in many
         | scenarios hate it, and even where they allow it, the
         | combinatorial explosion in possible configurations to support
         | and diagnose between peers is a problem nobody should willingly
         | invite unless they can't achieve a solution any other way.
         | 
         | With WebRTC you give up the nice ultra-low-common-denominator
         | "outbound port 443/TCP needs to work" requirement and replace
         | it with "UDP networking generally healthy, possible to
         | establish port mappings, possible to maintain stable port
         | mappings over time, possible to not have mappings go away due
         | to lack of traffic" etc etc
        
           | SahAssar wrote:
           | It sounds like the only thing you did was signaling, not STUN
           | and TURN.
           | 
           | If you do both STUN and TURN it works on most networks. I've
           | worked at really restricted work sites, and while STUN fails
           | at those if you have a TURN server then it almost always
           | works.
           | 
           | These sorts of comments are why people think WebRTC is
           | unstable while the same people use Slack calls, which
           | literally use WebRTC.
           | 
           | I might be wrong, but please don't talk about network
           | reliability in WebRTC without specifying if you have a
           | working STUN and/or TURN setup.
        
           | meheleventyone wrote:
           | Hah, this is so true. I'm building a little hobby project to
           | try out WebRTC for game development. On my ISP-provided
           | router, a Mac and a Windows computer can't see each other
           | over WiFi due to some mDNS issue, likely the router's support
           | for multicast. With Chrome flags set to turn off mDNS they
           | connect fine, but obviously expose internal IPs. Wire one of
           | the machines and mDNS works. TURN is essentially a necessity,
           | but then why not use a server (particularly for a chat app)?
        
             | SahAssar wrote:
             | Sounds like you mean STUN, not TURN.
        
               | meheleventyone wrote:
               | No, I'm using a STUN server. This issue is unrelated: it is
               | due to the local IPs being masked by mDNS addresses (so
               | that local network topology isn't leaked to the world at
               | large) and my router's handling of mDNS. Which is why
               | everything works over the local network if I disable mDNS
               | use in Chrome. TURN is the ultimate fallback when you are
               | unable to NAT punch.
               | 
               | Ironically getting machines connected across the internet
               | with WebRTC has so far been relatively smooth sailing.
        
           | littlestymaar wrote:
           | There's some truth in what you said, but also a few
           | exaggerations.
           | 
           | First of all, while WebRTC has its share of complexity when
           | using it for videoconferencing, here we are talking about
           | using the DataChannel, which is really straightforward to use
           | and doesn't need additional infrastructure.
           | 
           | > not forgetting WebRTC still relies on centralized
           | components to coordinate
           | 
           | It needs a centralized component to _set up_ the connection
           | (signaling); if it fails later, your communication channel is
           | still up. And the good thing if you have a websocket-based
           | chat service, is that you can directly use it for the
           | signaling purpose with zero modifications on the back-end
           | side.
           | 
           | > routers in many scenarios hate it and even where they allow
           | it, the combinatorial explosion in possible configurations to
           | support and diagnose between peers is a problem nobody should
           | willingly invite unless they can't achieve a solution any
           | other way
           | 
           | When using the DataChannel, your failure mode is _can't
           | establish a connection_, not some hard-to-understand
           | Heisenbug. All you need is to provide a centralized fallback
           | for clients who cannot establish a connection. This fallback
           | will depend on the centralized service being up, but in case
           | of failure you'll keep most of your users without disturbance
           | (at least in the first world, the network is not as WebRTC
           | friendly in other places of the world). And because the
           | DataChannel's API is close to the WebSocket's one,
           | implementing the fallback is straightforward.
           | 
           | Though, in Slack's situation there is a good reason not to
           | use WebRTC: they can have several thousands of people in the
           | same channel (IIRC IBM uses Slack and has most of its
           | employees in a shared channel for official announcements).
           | You won't be able to do that with WebRTC[1] if a user needs
           | to establish a connection with every other user in the
           | channel (there are just not enough ports available). And even
           | worse, back in 2016, Chrome's implementation of the
           | DataChannel was so poor that you could not establish more than
           | a handful of PeerConnections before the browser became
           | sluggish (this wasn't the case in Firefox, so maybe Google has
           | fixed that since then).
           | 
           | Also, Slack's users are likely to be on some enterprise
           | network, which makes WebRTC more likely to fail than when your
           | customers are at home, which reduces the opportunity.
           | 
           | Main takeaway: WebRTC-based chat is probably not a great fit
           | for Slack, but don't be afraid of using it: it's not hard, it
           | combines well with your already existing centralized
           | infrastructure, and can massively reduce the load on it.
           | 
           | [1] unless you want to build some fancy sparse mesh network,
           | but _this_ is likely overengineering.
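
The scaling concern in the comment above can be illustrated with a
little arithmetic. This is only a sketch: the channel sizes are made-up
examples, and `mesh_connections` is a hypothetical helper. In a full
mesh of n users, each user holds n - 1 peer connections and the mesh as
a whole holds n(n - 1)/2.

```python
def mesh_connections(n: int) -> tuple[int, int]:
    """Return (connections per user, total connections) for a full mesh of n users."""
    if n < 1:
        raise ValueError("a mesh needs at least one user")
    per_user = n - 1            # every user connects to every other user
    total = n * (n - 1) // 2    # each connection is shared by two users
    return per_user, total


if __name__ == "__main__":
    # Illustrative channel sizes only; not figures from the thread.
    for n in (10, 1_000, 10_000):
        per_user, total = mesh_connections(n)
        print(f"{n:>6} users: {per_user:>5} per user, {total:>10} total")
```

The per-user count grows linearly and the total grows quadratically,
which is why a channel with thousands of members is a poor fit for a
naive full mesh.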
        
       | ex3ndr wrote:
       | I am curious about the backend part: is ws still ws on the
       | services? Why? For example, why have thousands of connections
       | instead of a single one (or a handful) that simply forwards
       | websocket packets with some "connection id" attached to them?
       | 
       | This way you could restart a service without killing ws
       | connections, and move all the overhead of handling millions of
       | connections to the lb.
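
One way to picture the "connection id" idea from the comment above is a
small framing layer between the load balancer and a backend service.
This is a minimal sketch under stated assumptions, not how Slack or
Envoy does it: the 12-byte header layout and the names `encode_frame`
and `decode_frames` are invented for illustration.

```python
import struct

# Hypothetical header: 8-byte connection id, then 4-byte payload length,
# both unsigned and in network byte order.
HEADER = struct.Struct("!QI")


def encode_frame(conn_id: int, payload: bytes) -> bytes:
    """Prefix a client payload with the id of the client connection it came from."""
    return HEADER.pack(conn_id, len(payload)) + payload


def decode_frames(buffer: bytes) -> tuple[list[tuple[int, bytes]], bytes]:
    """Split a byte stream into (conn_id, payload) frames.

    Returns the complete frames plus any trailing bytes that do not yet
    form a whole frame, so the caller can prepend them to the next read.
    """
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        conn_id, length = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # payload not fully received yet
        frames.append((conn_id, buffer[offset + HEADER.size:end]))
        offset = end
    return frames, buffer[offset:]
```

With something like this, the backend sees one multiplexed stream and
can be restarted without the client-facing sockets on the lb closing,
at the cost of the lb having to track which id maps to which client.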
        
       | jeffbee wrote:
       | When you control both client and server it seems like hot restart
       | is just a complicated stunt you don't need. Isn't it fine to just
       | stop accepting connections, tell all your clients to reconnect,
       | and do a normal restart? The frontend load balancer that stands
       | between you and Gmail doesn't know how to restart hot but you
       | probably never noticed.
        
         | hermanradtke wrote:
         | > stop accepting connections, tell all your clients to
         | reconnect
         | 
         | This "drain" pattern is great for maintenance, upgrades, etc
         | too.
         | 
         | The only caveat is that the clients need to be given time to
         | migrate. How long that is depends on how well the clients
         | behave. A hot restart may be much faster.
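
The drain pattern described above can be sketched as a small state
holder. Everything here is an assumption made for illustration: the
`DrainingServer` class, the "reconnect" message, and the grace period
are hypothetical, and real sockets are replaced by in-memory outboxes.

```python
import time


class DrainingServer:
    """Toy model of a drain: stop accepting, ask clients to move, wait, restart."""

    def __init__(self, clock=time.monotonic):
        self.accepting = True
        self.clients = {}      # client id -> list of messages sent to that client
        self.clock = clock     # injectable so the grace period can be tested
        self.deadline = None

    def accept(self, client_id) -> bool:
        """Admit a new client unless a drain is in progress."""
        if not self.accepting:
            return False
        self.clients[client_id] = []
        return True

    def start_drain(self, grace_seconds: float) -> None:
        """Stop accepting and tell every connected client to reconnect elsewhere."""
        self.accepting = False
        self.deadline = self.clock() + grace_seconds
        for outbox in self.clients.values():
            outbox.append("reconnect")  # hypothetical application-level message

    def disconnect(self, client_id) -> None:
        """Record that a client has gone away (ideally to another instance)."""
        self.clients.pop(client_id, None)

    def ready_to_restart(self) -> bool:
        """True once all clients have left or the grace period has expired."""
        if self.deadline is None:
            return False
        return not self.clients or self.clock() >= self.deadline
```

The grace period is the "time to migrate" caveat: well-behaved clients
empty the server early, and slow ones are cut off at the deadline.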
        
         | jsiepkes wrote:
         | > Isn't it fine to just stop accepting connections, tell all
         | your clients to reconnect, and do a normal restart?
         | 
         | Depends on how many config changes you need (per day).
         | 
         | Besides, Envoy supports it, and I would call it a bonus if you
         | can reload your configuration without client interruption. As
         | for complicating things, the implementation of hot reload isn't
         | terribly complicated in Envoy.
        
           | jeffbee wrote:
           | I'm mentally separating the hot restart part from the
           | reloadable configs part, even though they are together in the
           | article. To me, not having reloadable configs is too crazy to
           | even imagine.
        
         | mbyio wrote:
         | I think Slack is different than Gmail because people are
         | actively having conversations, so if you disconnect, it is much
         | more likely to be noticeable and annoying.
         | 
         | Reading between the lines, I think what they would need is a
         | way to tell clients to move to a new websocket connection at
         | the _proxy_ layer. I don't think there is an easy built-in way
         | to do this in the websocket protocol, so they would have to
         | implement something custom in their application layer. This
         | would also require triggering custom code in the client to make
         | a new websocket connection, start using it, and then close the
         | old connection.
         | 
         | I feel like it would have been simpler to just have the client
         | do a graceful reconnect every 5 minutes. But they probably
         | decided to use envoy so they could have the other advantages
         | too.
        
         | zemo wrote:
         | maybe it could work, but in practice it's often not as easy as
         | you'd like it to be. disconnecting everyone at or around the
         | same time can easily create a thundering herd or
         | a TCP global synchronization problem, so "just ask everyone to
         | reconnect" has its own set of complications.
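
A common mitigation for the thundering herd mentioned above is to have
each client wait a randomized delay before reconnecting. The sketch
below uses exponential backoff with full jitter; the function name
`reconnect_delay` and the `base` and `cap` defaults are assumptions for
illustration, not values taken from Slack or Envoy.

```python
import random


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Return a delay in seconds before reconnect attempt number `attempt` (0-based).

    Full jitter: pick uniformly between 0 and an exponentially growing
    ceiling, so clients that were dropped together do not return together.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Delays for a few clients on their first three attempts.
    for client in range(3):
        delays = [round(reconnect_delay(a), 2) for a in range(3)]
        print(f"client {client}: {delays}")
```

Passing a seeded `random.Random` instance as `rng` makes the delays
reproducible in tests while production clients keep the default.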
        
         | jayd16 wrote:
         | The Gmail load balancer has to do a cold restart to add or
         | remove an instance? That's the requirement they placed on
         | themselves because they do not trust the runtime HAProxy api.
        
           | theflyinghorse wrote:
           | If they do not trust HAProxy runtime API then why are they
           | using HAProxy at all?
        
             | vad_ wrote:
             | Haproxy itself is a solid piece of software. The runtime
             | API is something they added on top of it because of
             | competitors (envoy).
        
             | [deleted]
        
       | forgotmypw17 wrote:
       | Is there a plan to migrate to an open protocol or non-crappy
       | client?
        
       ___________________________________________________________________
       (page generated 2021-03-17 23:00 UTC)