[HN Gopher] FreeBSD optimizations used by Netflix to serve video...
       ___________________________________________________________________
        
       FreeBSD optimizations used by Netflix to serve video at 800Gb/s
       [pdf]
        
       Author : _trackno5
       Score  : 301 points
       Date   : 2022-11-03 10:58 UTC (12 hours ago)
        
 (HTM) web link (people.freebsd.org)
 (TXT) w3m dump (people.freebsd.org)
        
       | krylon wrote:
       | And here I sit like a chump with my home server connected to a
       | 100MBit switch. (I paid for that switch, and I'm not replacing it
       | until it gives up the ghost.) (And before you ask, the server
       | also runs FreeBSD, and I'm very happy with the result.)
        
         | nix23 wrote:
         | Bring it to the max with multipath ;) Since you already have
         | FreeBSD, no need to throw those beautiful, reliable things
         | away; maybe just buy a second... third? dirt-cheap 100MBit card:
         | 
         | https://en.wikipedia.org/wiki/Multipath_TCP#Implementation
        
           | krylon wrote:
           | The server has a second NIC, but the switch has no more free
           | ports. I briefly thought of bonding, but stopped when I read
           | that the switch would need to support it (which it almost
           | certainly does not).
           | 
           | But my point was that for my requirements, 100MBit are
           | actually sufficient and FreeBSD still is a good choice for
           | me, I was just being snarky about it. (I do find it
           | aesthetically displeasing, though, that my wifi is now faster
           | than my wired network, but I can live with that.)
        
         | toast0 wrote:
         | I understand the motivation, but $20 gets you an 8-port gigE
         | switch, so it seems like the wrong hill to die on. :)
        
           | krylon wrote:
           | I know, but so far 100MBit is sufficient, actually, I rarely
           | move Gigabytes of data around. When it becomes annoying, I'll
           | get a new switch, but so far the pressure is really low.
        
       | nix23 wrote:
       | I really think Netflix could make some good money being a
       | multimedia-cdn (even for "competitors")
        
         | jedberg wrote:
         | I thought the same thing 10 years ago when I worked there. At
         | the time management was not interested in losing focus on doing
         | anything other than streaming movies to customers.
         | 
         | But it should be noted that the FreeBSD Openconnect boxes are
         | highly optimized to Netflix's use case. Which is serving a
         | predefined set of content that has been pre-rendered. Youtube
         | and its ilk are a completely different use case.
         | 
         | The Netflix cache is so optimized for serving Netflix movies
         | that for many years we still used Akamai for all of our other
         | CDN needs, but it looks like they may have finally moved that
         | to Netflix's own CDN now.
        
         | virtuallynathan wrote:
         | It works with such high efficiency because we know how to place
         | content in advance, and the catalog is relatively small. Trying
         | to serve 800Gbps of YouTube content would be a nightmare.
        
           | ilyt wrote:
           | I wonder how much of YT traffic is the "big" (say >200k
           | viewers in a month) vs the small guys.
           | 
           | But yeah, once your hot data size exceeds cache byebye
           | efficiency
        
             | adgjlsfhk1 wrote:
             | One hard part is that on the youtube side, most views occur
             | within the first 48 hours or so and a good fraction occur
             | within the first 6. With netflix, they have a catalogue of
             | ~5000 videos and get <200 new per month. Youtube has
             | around 30k channels with more than 500k subscriptions, so
             | that's somewhere around 30k videos per week.
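
A back-of-the-envelope check of the figures quoted in the comment above, as a rough sketch. The rate of one upload per large channel per week is an assumption implied by the comment's "30k videos per week", not a measured number, and 52/12 weeks per month is an approximation.

```python
# Rough comparison of new-title volume, using the figures quoted above.
NETFLIX_NEW_PER_MONTH = 200        # "<200 new per month" (upper bound)
YT_BIG_CHANNELS = 30_000           # channels with more than 500k subscriptions
UPLOADS_PER_CHANNEL_PER_WEEK = 1   # assumption implied by "30k videos per week"
WEEKS_PER_MONTH = 52 / 12          # approximation

yt_new_per_week = YT_BIG_CHANNELS * UPLOADS_PER_CHANNEL_PER_WEEK
yt_new_per_month = yt_new_per_week * WEEKS_PER_MONTH
ratio = yt_new_per_month / NETFLIX_NEW_PER_MONTH

print(f"YouTube (large channels only): ~{yt_new_per_month:,.0f} new videos/month")
print(f"Netflix: <{NETFLIX_NEW_PER_MONTH} new titles/month")
print(f"Ratio: at least ~{ratio:,.0f}x")
```

Under these assumptions the large channels alone add new content hundreds of times faster than the Netflix catalog grows, which is the point about pre-placement being much harder for YouTube.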
        
           | nix23 wrote:
           | I'm not talking specifically about YouTube, but about serving
           | Disney, Hulu, and ESPECIALLY national/continental portals like
           | Arte.tv, Play-SRF, ARD-Mediathek etc.
        
           | drewg123 wrote:
           | Indeed. YT has a much different problem, which is to
           | determine which video is going to go viral, and then
           | transcode it into popular formats when it does.
           | 
           | In comparison, we pre-transcode everything to exacting
           | standards, so all our CDN has to do is serve static files.
        
         | coldpie wrote:
         | Wow, that is actually a really interesting idea in the context
         | of developing a YouTube competitor. Delivery & bandwidth are a
         | really high barrier to entry, and piggy-backing off of
         | Netflix's existing network could really lower those costs. I
         | agree the "providing services to your direct competition" is
         | probably a stumbling block, and likely Netflix has other irons
         | in the fire. But anyway it's a cool idea to think about.
        
       | ilyt wrote:
       | It's nice to see someone actually still does proper engineering
       | instead of farting something about cloud and webscale and just
       | throwing money at a problem.
        
       | kleiba wrote:
       | Yet, in order to _watch_ Netflix on FreeBSD, you have to jump
       | through such hoops as  "downloading either google chrome,
       | vivaldi, or brave, and [using] a small shell script which
       | basically creates a small jail for some ubuntu binaries that
       | actually install widevine which is essential for viewing some DRM
       | content such as Netflix" [1]
       | 
       | [1] https://www.youtube.com/watch?v=mBYor4wL62Q
        
         | __MatrixMan__ wrote:
         | BitTorrent is probably easier. I just wish there was a good way
         | to send money to the artists without also funding DRM
         | enhancements.
        
           | andsoitis wrote:
           | So you want to send money to all the people who worked on the
           | TV Show or the Movie you just downloaded?
           | 
           | I don't think you realize how impractical that is. Take a
           | look at the credits at the end of a movie some time. Or look
           | up the list of people who worked on a particular episode of a
           | show (yes, it can vary throughout a season).
        
             | andrewxdiamond wrote:
             | Certainly impractical for big budget shows, but Patreon has
             | proved the model works
        
             | __MatrixMan__ wrote:
             | It wouldn't be impractical if the studio planned ahead for
             | it.
             | 
             | There could be the address of a smart contract at the end
             | of the credits. Every time more than, say $1000, piles up
             | in that address, whatever is there gets dispensed to the
             | contributors at the end of that month.
             | 
             | Plex could aggregate those addresses and tell you how to
             | allocate your payment based on how you allocated your
             | attention. Yes I know that's what Netflix does, but I
             | control my Plex server. Nobody is then going to find
             | additional ways to monetize that data.
             | 
             | I know it's unconventional, but I really don't think it's
             | crazy to want to reward the creators of content that you
             | consume while simultaneously not wanting to contribute
             | towards the development of ecosystems that prevent people
             | from being in control of their tech.
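
The payout rule described above (hold funds until more than a threshold accumulates, then split among contributors) can be sketched as follows. This is only an illustration in plain Python, not a real smart contract. The function name `distribute`, the share weights, and the use of integer cents are all assumptions, and the "end of that month" timing is left out.

```python
def distribute(balance_cents, shares, threshold_cents=100_000):
    """Split the balance pro rata by share weight once it exceeds the threshold.

    Returns (payouts, remainder). At or below the threshold nothing is paid
    out and the whole balance stays in place. Amounts are integer cents, so
    any rounding leftover is returned as the remainder.
    """
    if balance_cents <= threshold_cents:
        return {}, balance_cents
    total_weight = sum(shares.values())
    payouts = {
        name: balance_cents * weight // total_weight
        for name, weight in shares.items()
    }
    remainder = balance_cents - sum(payouts.values())
    return payouts, remainder


# Hypothetical contributors and weights, for illustration only.
shares = {"director": 5, "editor": 3, "grip": 2}
print(distribute(150_000, shares))  # above the $1000 threshold: paid out
print(distribute(50_000, shares))   # below the threshold: nothing paid yet
```

Integer cents are used so the split never creates or loses money; whatever cannot be divided evenly simply stays in the balance for the next round.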
        
               | reaperducer wrote:
               | _It wouldn't be impractical if the studio planned ahead
               | for it._
               | 
               | Studios already plan for this.
               | 
               | For a short time in the 80's, one of my mother's job
               | responsibilities was making sure every single person
               | involved in the production of a movie in the 1940's got
               | their revenue check each quarter, whether it was for
               | $50.00, or 12¢. Hundreds of people. Hundreds of checks.
        
               | __MatrixMan__ wrote:
               | Ok, so I've torrented a movie and I want to send the
               | equivalent of your mom a check so that next quarter it's
               | $0.13 instead of $0.12, where do I look in the credits to
               | get her address?
               | 
               | Perhaps in the 80's it would've been impractical to pay
               | her to multiplex hundreds of $1 input checks into the
               | appropriate set of $50 or $0.12 output checks, but that's
               | now a job that's easily done by a computer.
        
             | [deleted]
        
         | [deleted]
        
         | IntelMiner wrote:
         | Devil's advocate: The people who work on the server engineering
         | at Netflix don't exactly have much control over copyright
         | holders being lawyer brained man children
        
         | jbirer wrote:
         | That is the problem with the BSD license, it says "use my work
         | and don't give anything back". Of course, the GPL gets violated
         | too, but that would be very difficult for an American company
         | like Netflix.
        
         | pjmlp wrote:
         | UNIX's strength was never in the desktop experience, rather
         | server room.
        
           | akreal wrote:
           | Not true for macOS.
        
             | pjmlp wrote:
             | You mean NeXTSTEP, all that makes it unique isn't part of
             | POSIX, and Steve Jobs had a quite clear position on UNIX
             | value for desktop computing.
        
         | SpaceInvader wrote:
         | FreeBSD is not a "desktop first" system and has strengths
         | elsewhere. I have used it constantly for 20+ years. Sadly my
         | experiments with FreeBSD desktop ended years ago as there
         | always was something "not working".
        
           | asveikau wrote:
           | Typing this on a FreeBSD laptop.
           | 
           | Haven't tried using netflix on it though.
        
             | Ar-Curunir wrote:
             | I don't think it needs to be said that while FreeBSD can
             | serve as a daily driver for some people, it is insufficient
             | for the vast majority of computer users in the world
        
         | alberth wrote:
         | > which is essential for viewing some DRM content such as
         | Netflix
         | 
         | Are you complaining that Netflix doesn't want people to pirate
         | content, content they might have licensed from 3rd parties
         | who contractually bind them to not let it be pirated?
         | 
         | Plus, is the development/resources/cost of serving so few
         | people on FreeBSD even worth it?
         | 
         | Note: I'm a huge FreeBSD fan. But I consider this totally
         | understandable on Netflix's part.
        
           | somehnguy wrote:
           | But it doesn't prevent it from being pirated at all. You can
           | get any Netflix release you want within minutes of release on
           | any torrent site. Sometimes _before_ the official release
           | even.
           | 
           | It just makes normal people jump through hoops to watch the
           | things they are trying to pay for. That's a DRM issue in
           | general though, I acknowledge this isn't just a Netflix
           | thing.
        
             | seanw444 wrote:
             | And I will stick to getting it that way for as long as DRM
             | exists on the given platform. I'll still pay for the
             | subscription, but I'm handling the data my way.
        
               | KronisLV wrote:
               | > I'll still pay for the subscription, but I'm handling
               | the data my way.
               | 
               | Huh, that's an interesting take. I feel like something
               | similar might end up being what you need to do with
               | certain video games as well.
               | 
               | For example, I bought Grand Theft Auto IV as a boxed copy
               | back when it came out (though most of my games are
               | digital now). The problem is that the game expects Games
               | For Windows Live to be present, which is now deprecated
               | and some folks out there can't even launch the game
               | anymore. It's pretty obvious what one of the solutions
               | here is.
        
               | webmobdev wrote:
               | Me too. Especially because these same DRM will soon be
               | used to uniquely identify and profile you when these
               | streamers also become an ad platform.
        
               | judge2020 wrote:
               | What does DRM have to do with this? They'll connect what
               | you watch on Peacock with what you watch on Netflix on
               | your computer? Do you have a reference?
        
           | Thaxll wrote:
           | DRM makes 0 sense since you can get any content using
           | torrents in 2min. It's not protecting anything, as a matter
           | of fact it's just making people download more since it's a
           | painful experience.
           | 
           | For example on Windows with Chrome you only get 720p playback
           | for Netflix, complete nonsense.
        
             | mschuster91 wrote:
             | Yeah, but tell that to braindead content license owners.
        
             | googlryas wrote:
             | Sure, but if there was no drm, there would probably just be
             | a chrome extension you could install and rip/share content
             | more readily than via BitTorrent.
             | 
             | I don't like it, but there is some logic to it. For
             | business types, it isn't merely the existence of ripped
             | copies, but the ease of creating and spreading them.
        
       | _trackno5 wrote:
       | Recording of the presentation can be found here:
       | https://www.youtube.com/watch?v=36qZYL5RlgY
       | 
       | Pretty cool stuff
        
         | [deleted]
        
       | eatonphil wrote:
       | From what I can see in a quick search (and from this
       | presentation), Netflix only uses FreeBSD for serving video and
       | they run these servers themselves in their own datacenters I
       | guess. In contrast their apps on EC2 use Linux [0]. Sounds like
       | the time has not yet come when AWS is paying anyone full time to
       | support FreeBSD on EC2.
       | 
       | [0] https://twitter.com/brendangregg/status/1412201241472471048
        
         | erk__ wrote:
         | cperciva, whom you link, has worked quite a bit on EC2 support
         | for FreeBSD, a lot of it documented on their blog [0] and
         | supported by patrons at [1].
         | 
         | But yeah it would be nice if there was someone who could work
         | on it full time
         | 
         | [0]: https://www.daemonology.net/blog/2022-03-29-FreeBSD-EC2-repo...
         | 
         | [1]: https://www.patreon.com/cperciva
        
           | eatonphil wrote:
           | Yep! In the thread he describes how he is not enough.
        
         | vbezhenar wrote:
         | What does it mean to support FreeBSD on EC2? Surely it's just a
         | KVM so you can run whatever you want?
        
           | [deleted]
        
           | sanxiyn wrote:
           | It means, for example, writing a FreeBSD kernel driver for
           | Elastic Network Adapter (ENA). Both the Linux kernel driver
           | and the FreeBSD kernel driver are available at
           | https://github.com/amzn/amzn-drivers
        
         | cotillion wrote:
         | Netflix works because they move content close to the users.
         | This is done by either having the ISP establish a peering
         | connection directly to Netflix hosted servers or by having the
         | ISPs host "Open Connect Appliances" which cache the most
         | requested content. These appliances are based on FreeBSD.
         | 
         | The AWS egress savings from this setup must be immense.
         | 
         | https://openconnect.netflix.com/
        
           | ilyt wrote:
           | Yup, cloud bandwidth is insanely expensive compared to what
           | you _actually_ pay to get a link to your datacenter.
           | 
           | And you pay either by 95th percentile (basically "peak
           | usage") or by whole link, not per megabyte sent
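
A minimal sketch of how 95th-percentile ("burstable") billing is commonly computed: take periodic usage samples over the billing period, sort them, discard the top 5%, and bill at the highest remaining sample. The function name and the sample values are hypothetical, and providers differ in details such as the sampling interval and whether inbound and outbound traffic are combined.

```python
def billable_rate_95th(samples_mbps):
    """Return the 95th-percentile sample of a list of usage samples.

    Sorts the samples, ignores the top 5%, and returns the largest
    remaining value. Integer arithmetic is used for the index so the
    result does not depend on floating-point rounding.
    """
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    n = len(ordered)
    index = (95 * n + 99) // 100 - 1  # ceil(0.95 * n) - 1
    return ordered[index]


# Hypothetical month: mostly 100 Mbit/s, with short bursts to 900 Mbit/s.
short_bursts = [100] * 95 + [900] * 5
longer_bursts = [100] * 94 + [900] * 6
print(billable_rate_95th(short_bursts))
print(billable_rate_95th(longer_bursts))
```

Bursts that occupy no more than 5% of the samples do not affect the bill, while slightly longer bursts set the billed rate, which is why the comment calls it "basically peak usage".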
        
         | [deleted]
        
       | paravz wrote:
       | What is Gb/s per watt of power between 2x400Gb/s servers and a
       | single 800Gb/s ?
       | 
       | Following these reports since 2015, when I compared estimated
       | cost of your 9Gb/s server to F5 load balancer :)
        
       | pyuser583 wrote:
       | I think it's weird and cool how Netflix used FreeBSD/Dlang.
       | 
       | Linux is just the automatic go to. It's great the big tech
       | companies are rethinking these basics.
        
         | loeg wrote:
         | Where are you seeing any mention of Dlang?
        
       | throw0101a wrote:
       | And to think not that long ago I remember being excited when the
       | V.92 standard was released and I could get 56 kb/s on my dial-up
       | connection:
       | 
       | * https://en.wikipedia.org/wiki/V.92
        
         | rwl4 wrote:
         | How about the marvel that was Walnut Creek's cdrom.com that
         | served 10,000 simultaneous FTP connections back in 1999? [1]
         | 
         | I was always blown away by how much more efficient FreeBSD's
         | network stack was compared to Linux at the time. It convinced
         | me to go FreeBSD-only for a few years.
         | 
         | [1] http://www.kegel.com/c10k.html
        
           | alberth wrote:
           | > compared to Linux at the time
           | 
           | Do you consider that not to still be the case?
        
             | adrian_b wrote:
             | Before 2003, FreeBSD was definitely both faster and more
             | reliable than Linux, especially for networking or storage
             | applications.
             | 
             | After that, Intel and AMD introduced cheap multi-threaded
             | and multi-core CPUs. Linux was adapted very quickly to work
             | well on such CPUs, but FreeBSD struggled for many years
             | before reaching acceptable performance on multi-threaded or
             | multi-core CPUs, so it became much slower than Linux.
             | 
             | Later, the performance gap between Linux and FreeBSD has
             | diminished continuously, so now there is no longer any
             | large difference between them.
             | 
             | Depending on the hardware and on the application, either
             | Linux or FreeBSD can be faster, but in the majority of the
             | cases the winner is Linux.
             | 
             | Despite that, for certain applications there may be good
             | reasons to choose FreeBSD, even where it happens to be
             | slower than Linux.
        
               | lukego wrote:
               | FreeBSD was held back by limited TCP options around when
               | packet mobile internet (GPRS) came along. That was around
               | 2003 too.
               | 
               | I remember noticing Yahoo properties being almost
               | unusable on GPRS because they did packet loss detection
               | and recovery in such basic ways, e.g. no SACK.
        
               | anthk wrote:
               | Any setting for today's connection on capped mobile data?
               | 2.7 KB/S max.
        
               | LeonenTheDK wrote:
               | > Depending on the hardware and on the application,
               | either Linux or FreeBSD can be faster, but in the
               | majority of the cases the winner is Linux.
               | 
               | I'm not denying this, but do you have a source? I've been
               | trying to find modern "Linux vs FreeBSD" performance
               | tests but haven't been super successful. Mostly I find
               | things from the early 2000s when FreeBSD had a clear
               | lead.
        
               | yakubin wrote:
               | https://www.phoronix.com/review/bsd-linux-eo2021
        
               | jedberg wrote:
               | > Depending on the hardware and on the application,
               | either Linux or FreeBSD can be faster, but in the
               | majority of the cases the winner is Linux.
               | 
               | Do you have any data to back that up? Everything I've
               | seen recently and my own experience tells me this isn't
               | the case but I also don't have any data to back up my
               | position. Would love to find some good data on this
               | either way.
        
               | adrian_b wrote:
               | I have been using continuously both FreeBSD and Linux
               | since around 1995, since FreeBSD 2.0 and some Slackware
               | Linux distribution.
               | 
               | In the early years, I have run many benchmarks between
               | them, in order to choose the one that was the best suited
               | for certain applications.
               | 
               | However, during the last decade, I did not bother to
               | compare them any more, because now the main reasons why I
               | choose one or the other do not include the speed.
               | 
               | Even if I have right now, besides me, several computers
               | with FreeBSD and several with Linux, it would not be easy
               | for me to run any benchmark, because they have very
               | different hardware, which would influence the results
               | much more than the OS.
               | 
               | For all the applications where I use FreeBSD (for various
               | networking and storage services), its performance is
               | adequate, and I use it instead of Linux for other
               | reasons, not depending on whether it might be faster or
               | slower.
               | 
               | In the applications where computational performance is
               | important, I use Linux, but that is not due to some
               | benchmark results, but because some commercial software
               | is available only for Linux, e.g. CUDA libraries or FPGA
               | design programs.
               | 
               | Many benchmark results comparing FreeBSD and Linux may be
               | influenced more by the file systems used than by the OS
               | kernel.
               | 
               | I have seen recently some benchmark comparing FreeBSD and
               | Linux for a database application dominated by SSD I/O,
               | but I cannot remember a link to it.
               | 
               | The only file system shared by Linux and FreeBSD is ZFS.
               | With ZFS, the benchmark results were similar for Linux
               | and FreeBSD. However, FreeBSD was faster when using UFS
               | and Linux was much faster, when using either XFS or EXT4
               | (BTRFS was much slower than ZFS). Such a benchmark was
               | much more influenced by the file system than by the
               | operating system.
               | 
               | In conclusion, it is very hard to make a good comparison
               | between FreeBSD and Linux, because you need identical
               | hardware, which must be restricted to the shorter list
               | that is well supported by FreeBSD, and you need to run
               | some micro-benchmark testing some kernel system calls.
               | 
               | Otherwise, the result may depend more on the supported
               | software, hardware or file systems, than on the OS
               | kernel.
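
As a rough illustration of the kind of system-call micro-benchmark mentioned above, the sketch below times a cheap system call from user space. The helper names are hypothetical, and the numbers include Python interpreter overhead, so they are only meaningful when the same script is run on identical hardware under each OS. A serious comparison would more likely use C and many repeated runs.

```python
import os
import time


def time_call(fn, iterations=100_000):
    """Return the mean wall-clock nanoseconds per call of fn."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations


def noop():
    """Empty function, used to estimate loop and call overhead."""


ns_noop = time_call(noop)
ns_getppid = time_call(os.getppid)  # a cheap call that enters the kernel

print(f"empty call : {ns_noop:8.1f} ns")
print(f"getppid()  : {ns_getppid:8.1f} ns")
```

Keeping the workload this small is deliberate: it avoids the file system and most of the hardware, which are the factors the comment warns can dominate the result.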
        
               | jedberg wrote:
               | Right exactly which is why it's hard to find data. But
               | I'd love to see someone who has tried to limit variables
               | to just the network stack to figure out if one network
               | stack is better than the other.
               | 
               | But you're right, in the end you just have to set up both
               | for your particular use case with the best optimizations
               | each has to offer and see which performs better.
        
               | Thaxll wrote:
               | The web runs on Linux, like most FANG servers do, so it
               | makes sense with the $$$ / people / R&D that this OS is
               | faster. A conservative number would be that 99.9% of the
               | web runs on Linux and it's probably much higher.
               | 
               | At the scale of Google / MS / Amazon / Apple, if servers
               | ran faster on BSD* they would use it. We're talking
               | about tens of millions of servers here.
               | 
               | https://www.phoronix.com/review/bsd-linux-eo2021/7
               | 
               | It gives you a pretty clear picture.
        
               | jedberg wrote:
               | Based on that logic, Windows is the superior operating
               | system and always has been, because it's always been used
               | by more people on their desktop than anything else.
               | 
               | There are a lot more factors involved in OS choice that
               | could drive popularity other than the speed of the
               | network stack. And BTW, Hotmail runs on BSD. MacOS is a
               | fork of BSD. And Yahoo ran on BSD (and may still).
        
       | drewg123 wrote:
       | Author here, happy to answer questions
        
         | waynesonfire wrote:
         | How did you generate those flamegraphs and what other tools did
         | you use to measure performance?
         | 
         | My motivation for asking comes from these findings in the pdf,
         | 
         | Did the graph show the bottleneck contention on aio queue? Did
         | the graph show that "a lot of time was spent accessing memory"?
         | 
         | What made freebsd a better platform compared to Linux to begin
         | tackling this problem?
         | 
         | Thanks! Super interesting. Both a freebsd fan and I have
         | workloads that I'd love to explore benchmarking to squeeze more
         | performance.
        
           | drewg123 wrote:
           | > How did you generate those flamegraphs and what other tools
           | did you use to measure performance?
           | 
           | We have an internal shell script that takes hwpmc output and
           | generates flamegraphs from the stacks. It also works with
           | dtrace. I'm a huge fan of dtrace. I also make heavy use of
           | lockstat, AMD uProf, and Intel Vtune.
           | 
           | > Did the graph show the bottleneck contention on aio queue?
           | Did the graph show that "a lot of time was spent accessing
           | memory"?
           | 
           | See the graph on page 32 or so of the presentation. It shows
           | huge plateaus in lock_delay called out of the aio code. It's
           | also obvious from lockstat stacks (run as lockstat -x
           | aggsize=4m -s 10 sleep 10 > results.txt)
           | 
           | See the graph on page 38 or so. The plateaus are mostly
           | memory copy functions (memcpy, copyin, copyout).
           | 
           | We already use FreeBSD on our CDN, so it just made sense to
           | do the work in FreeBSD.
           | 
           | The talk is on Youtube https://youtu.be/36qZYL5RlgY
        
           | smokel wrote:
           | The flame graphs might be generated using Brendan Gregg's
           | utility, see https://www.brendangregg.com/flamegraphs.html
        
             | drewg123 wrote:
             | They are generated by a local shell script that uses the
             | same helpers (stackcollapse*.pl, difffolded.pl). Our
             | revision control says the script was committed by somebody
             | else though. It existed before I joined Netflix.
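
The stackcollapse helpers mentioned above turn profiler output into "folded" lines: one line per unique call stack, frames joined by semicolons from root to leaf, followed by a sample count. The sketch below shows only that folding step on already-parsed stacks. It does not parse hwpmc or dtrace output, the function name is hypothetical, and the sample stacks are made up for illustration (only `lock_delay` and `memcpy` are names taken from the thread).

```python
from collections import Counter


def fold_stacks(stacks):
    """Collapse sampled call stacks into 'frame;frame;frame count' lines.

    Each stack is a list of frame names ordered from root to leaf.
    Identical stacks are merged and their sample counts summed.
    """
    counts = Counter(";".join(frames) for frames in stacks)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


# Made-up samples for illustration.
samples = [
    ["syscall", "aio_queue", "lock_delay"],
    ["syscall", "aio_queue", "lock_delay"],
    ["syscall", "read", "memcpy"],
]
for line in fold_stacks(samples):
    print(line)
```

Lines in this shape are what flamegraph.pl takes as input; the width of each plateau in the resulting graph corresponds to the summed count for that stack.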
        
         | monotux wrote:
         | How long will you be able to keep up with this near yearly
         | doubling of bandwidth used for serving video? :)
        
           | drewg123 wrote:
           | It depends on when we get PCIe Gen5 NICs and servers with
           | DDR5 :)
        
             | alberth wrote:
             | Any current estimates on timing?
        
               | toast0 wrote:
               | Not the OP, but PCIe5 NICs are already available in the
               | market; I've seen people requesting help getting them to
               | work on desktop platforms which have PCIe5 as of the most
               | recent chips. AFAIK, currently, both AMD and Intel
               | release desktop before server; I don't think there's a
               | public release date for Zen4 server chips, but probably
               | this quarter or next? Intel's release process is too hard
               | for me to follow, but they've got desktop chips with
               | PCIe5, so whenever those get to the server, then that
               | might be an option too.
        
               | Rafuino wrote:
               | Public release date for Zen4 server has been disclosed
               | for November 10, FYI.
               | https://www.servethehome.com/amd-epyc-genoa-launches-10-nove....
               | 
               | Looks like Intel's release is coming January 10.
               | https://www.tomshardware.com/news/intel-sapphire-rapids-laun...
        
         | dist1ll wrote:
         | How involved was Netflix in the design of the Mellanox NIC? How
         | many stakeholders does this type of networking hardware have,
         | relatively speaking?
         | 
         | Also, what percentage of CDN traffic that reaches the user is
         | served directly from your co-located appliances?
        
         | _-david-_ wrote:
         | There are a lot of slides and I am on my phone, so sorry if it
         | was addressed in the slides.
         | 
         | How does Linux compare currently? I know in the past FreeBSD
         | was faster, but are there any current comparisons?
        
         | tame3902 wrote:
         | 1. I got excited when I saw arm64 mentioned. How competitive is
         | it? Do you think it will be a viable alternative for Netflix in
         | the future?
         | 
         | 2. On amd, did you play around with BIOS settings? Like turbo,
         | sub-numa clustering or cTDP?
        
           | drewg123 wrote:
           | Arm64 is very competitive. As you can see from the slides,
           | the Ampere Q80-30 is pretty much on-par with our production
           | AMD systems.
           | 
           | Yes, I've spent lots of time in the AMD BIOS over the years,
           | and lots of time with our AMD FAE (who is _fantastic_ , BTW)
           | poking at things.
        
         | crest wrote:
         | Which NIC and driver combinations support kTLS offloading to
         | the NIC?
         | 
         | How did you deal with the hardware/firmware limitations on the
         | number of offloadable TLS sessions?
        
           | drewg123 wrote:
           | We use Mellanox ConnectX6-DX NICs, with the Mellanox drivers
           | built into FreeBSD 14-current (which are also present in
           | FreeBSD 13).
        
             | throw0101a wrote:
             | > _We use Mellanox ConnectX6-DX NICs_
             | 
             | Is there a plan to move to the ConnectX-7 eventually?
             | 
             | Depending on the bandwidth available, that'd be either 2x
             | to get the same 800Gb/s as here (or perhaps eventually with
             | 4x to get 1600Gb/s).
        
               | drewg123 wrote:
               | Yes, I'm looking forward to CX7. And to other PCIe Gen5
               | NICs!
        
         | eddyg wrote:
         | Wondering if there's a video presentation to go along with the
         | slides?
        
           | notaplumber1 wrote:
           | This talk was given at this year's EuroBSDcon in Vienna; the
           | recording is up on YouTube.
           | 
           | https://2022.eurobsdcon.org/
           | 
           | https://www.youtube.com/watch?v=36qZYL5RlgY
           | 
           | Some really great talks this year from all the *BSDs, highly
           | recommend taking a look: https://www.youtube.com/playlist?l
           | ist=PLskKNopggjc6_N7kpccFZ...
        
           | coredog64 wrote:
           | And is the video presentation on Netflix?
        
         | alberth wrote:
         | A. Just curious, are these servers performing any work besides
         | purely serving content? Eg user auth, album art, show
         | description, etc?
         | 
         | B. What's the current biggest bottleneck preventing higher
         | throughput?
         | 
         | C. Has everything been upstreamed? Meaning, if I were to
         | theoretically purchase the exact same hardware - would I be
         | able to achieve similar throughput?
         | 
         | (Amazing work by the way in these continued accomplishments.
         | These posts over the years are always my favorite HN stories.)
        
           | drewg123 wrote:
           | a) These are CDN servers, so they serve CDN stuff. Some do
           | serve cover art, etc sorts of things.
           | 
           | b) Memory bandwidth and PCIe bandwidth. I'm eagerly awaiting
           | Gen5 PCIe NICs and Gen5 PCIe / DDR5 based servers :)
           | 
           | c) Yes, everything in the kernel has been upstreamed. I think
           | there may be some patches to nginx that we have not
           | upstreamed (SO_REUSEPORT_LB patches, TCP_REUSPORT_LB_NUMA
           | patches).
        
         | fsckin wrote:
         | What tools do you use for load testing / benchmarking?
        
           | drewg123 wrote:
           | At a very basic microbenchmark level, I use stream, netperf, a
           | few private VM stress tests, etc. But the majority of my
           | testing is done using real production traffic.
        
         | alberth wrote:
         | If a "typical" NIC was used, what do you think the throughput
         | would be?
         | 
         | I have to imagine considerably less (e.g. 100 Gb/s instead of
         | 800).
        
           | drewg123 wrote:
           | Back of the envelope guess is ~400Gb/s. Each node has enough
           | memory BW for about 240Gb/s, then factor in some efficiency
           | loss for NUMA.
        
           | toast0 wrote:
           | Not the OP, but that's basically in the slides. When it's
           | kTLS, but not NIC kTLS. Maybe you could optimize that a bit
           | more around the edges if NIC kTLS wasn't an option.
        
         | amelius wrote:
         | At what point will it make more sense to use specialized
         | hardware, e.g. network card that can do encryption?
        
           | drewg123 wrote:
           | We already do: the Mellanox ConnectX6-Dx with crypto
           | support. It does inline crypto on TLS records as they are
           | transmitted. This saves memory bandwidth, as compared to a
           | traditional lookaside card.
        
             | MichaelZuo wrote:
             | What's the error rate, or uptime ratio, of those cards?
        
               | drewg123 wrote:
               | Were you assuming they were giant FPGA-based NICs? They
               | are production server NICs, using ASICs with a reasonable
               | power budget. I don't recall any failures.
        
               | MichaelZuo wrote:
               | Well I wasn't, though I was expecting some non-zero
               | amount of failures.
               | 
               | That's pretty impressive if it's literally zero.
               | 
               | How many machines are deployed with NICs?
        
               | drewg123 wrote:
               | I don't have any visibility into how many DOA NICs we
               | have, so I can't say that Mellanox is better or worse at
               | that point. But I do see most NIC related tickets for NIC
               | failures once machines are in production. In general,
               | we've found Mellanox NICs to be very reliable.
        
         | PYTHONDJANGO wrote:
         | * How is the DRM applied?
         | * Is the software that does DRM open source, too?
        
         | alberth wrote:
         | How many "U"s of space do ISPs typically give you (e.g. 4U, 8U,
         | etc)?
        
           | nixgeek wrote:
           | This is going to be a "How long is a piece of string?". Each
           | ASN will be unique, and even within any large ISP, there may
           | be many OCA deployment sites (there won't just be one for
           | Virgin Media in UK) and each site will likely have subtly
           | different traffic patterns and content consumption patterns,
           | meaning the OCA deployment may be customized to suit, and the
           | content pushed out (particularly to these NVME-based nodes)
           | will be tailored accordingly.
           | 
           | Since the alternative for an ISP is to be carrying the bits
           | for Netflix further, the likelihood is they'll devote
           | whatever space is required because that's much cheaper than
           | backhauling the traffic and ingressing over either a
           | settlement-free PNI or IXP link to a Netflix-operated cache
           | site, or worse, ingressing the traffic over a paid transit
           | link.
           | 
           | Meanwhile, on the flipside, since Netflix funds the OCA
           | deployments they have a strong interest in not "oversizing"
           | the sites. That said I'm sure there is an element of growth
           | forecasting involved once a site has been operational for a
           | period of time.
        
         | vkaku wrote:
         | Read the presentation. Had super noobie level questions.
         | 
         | Is the RAM mostly used by page content read by the NICs due to
         | kTLS?
         | 
         | If there was better DMA/Offload could this be done with a
         | fraction of the RAM? (NVME->NIC)
         | 
         | If there was no need for TLS, would the RAM usage drop
         | dramatically?
        
           | drewg123 wrote:
           | These are actually fantastic questions.
           | 
           | Yes, the RAM is mostly used by content sitting in the VM page
           | cache.
           | 
           | Yes, you could go NVME->NIC with P2P DMA. The problem is that
           | NICs want to read data one TCP MSS (~1448 bytes) at a time,
           | and NVME really wants to speak in 4K sized chunks. So there
           | needs to be some buffers somewhere. It might eventually be
           | CXL based memory, but for now it is host memory.
           | 
           | EDIT: missed the last question. No, with NIC kTLS, the host
           | RAM usage is about the same as it would be without TLS at
           | all. Eg, connection data sitting in the socket buffers refers
           | to pages in the host vm page cache which can be shared among
           | multiple connections. With software kTLS, data in the socket
           | buffers must refer to private, per-connection encrypted data
           | which increases RAM requirements.
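           | 
           | As a rough illustration (not taken from the slides) of the
           | size mismatch mentioned above, the sketch below splits a run
           | of 4096-byte blocks into ~1448-byte TCP payloads and counts
           | how many payloads span two blocks. The constants come from
           | the comment; the helper names are hypothetical, and real
           | NIC/NVMe DMA is of course far more involved.

```python
# Illustrative only: 4 KiB storage blocks versus ~1448-byte TCP payloads.
NVME_BLOCK = 4096  # bytes, the "4K sized chunks" from the comment above
TCP_MSS = 1448     # bytes, the approximate MSS from the comment above


def segments_for(total_bytes, mss=TCP_MSS):
    """Return (offset, length) pairs that cover total_bytes in MSS-sized pieces."""
    return [(off, min(mss, total_bytes - off)) for off in range(0, total_bytes, mss)]


def straddles_block(offset, length, block=NVME_BLOCK):
    """True if the first and last byte of the segment lie in different blocks."""
    return offset // block != (offset + length - 1) // block


segs = segments_for(4 * NVME_BLOCK)
straddling = [s for s in segs if straddles_block(*s)]
print(len(segs), "segments,", len(straddling), "cross a 4K boundary")
```

           | Because 4096 is not a multiple of 1448, some payloads always
           | need bytes from two adjacent blocks, which is one way to see
           | why a staging buffer is needed somewhere.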
        
         | hzhou321 wrote:
         | What prevents Linux from achieving the same bandwidth?
        
           | _trackno5 wrote:
           | Not sure about all other optimisations, but Linux doesn't
           | have support for async sendfile.
        
           | [deleted]
        
         | erk__ wrote:
         | Do you know if there is any documentation regarding interfacing
         | with the KTLS, eg to implement support for a new library?
        
           | sanxiyn wrote:
           | For Linux, there is documentation at kernel.org:
           | https://docs.kernel.org/networking/tls.html
        
           | drewg123 wrote:
           | The ktls(4) man page is a start. The reference implementation
           | is OpenSSL right now. I did support for an internal Netflix
           | library a while ago, I probably should have documented it at
           | the time. For now feel free to contact me via email with
           | questions (the username in the URL, but @netflix.com)
        
         | kloch wrote:
         | What filesystem(s) are you using for root and content?
         | 
         | And If ZFS, what options are you using?
        
           | drewg123 wrote:
           | We use ZFS for root, but not content. For content we use UFS.
           | This is because ZFS is not compatible with "zero-copy"
           | sendfile, since it uses its own ARC cache rather than the
           | kernel page cache, meaning sending data stored on ZFS
           | requires an extra data copy out of the ARC. It's also not
           | compatible with async sendfile, as it does not have the
           | methods required to call the sendfile completion handler
           | after data is read from disk into memory.
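           | 
           | For readers who have not used sendfile, here is a minimal
           | sketch of the general idea using Python's standard
           | socket.sendfile(), which hands the file to the kernel via
           | os.sendfile() where available and otherwise falls back to
           | plain send(). It is only an illustration: it does not show
           | FreeBSD's async sendfile, kTLS, or anything Netflix-specific,
           | and the helper name and socketpair demo are assumptions.

```python
import os
import socket
import tempfile


def send_with_sendfile(sock, path):
    """Send a whole file over a connected stream socket; returns bytes sent."""
    with open(path, "rb") as f:  # socket.sendfile() requires binary mode
        return sock.sendfile(f)


# Small self-contained demo: a temp file sent across a local socket pair.
payload = b"x" * 8192
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

tx, rx = socket.socketpair()
try:
    sent = send_with_sendfile(tx, demo_path)
    tx.close()  # signal end-of-stream so the reader loop terminates
    received = b""
    while True:
        chunk = rx.recv(65536)
        if not chunk:
            break
        received += chunk
finally:
    tx.close()
    rx.close()
    os.unlink(demo_path)

print(sent, "bytes sent,", len(received), "bytes received")
```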
        
             | deltarholamda wrote:
             | >For content we use UFS
             | 
             | I found this extremely interesting. ZFS is almost a cure-
             | all for what ails you WRT storage, but there is always
             | something that even Superman can't do. Sometimes old-school
             | is best-school.
             | 
             | Thanks for the presentation and QA!
        
       | [deleted]
        
       | nicholasjarnold wrote:
       | I come from the time when the first internet connection my house
       | had was a 56k modem...just before cable modems/DOCSIS started
       | rolling out in the midwest. These speeds are somewhat mind
       | boggling to me. (Yeah, yeah, datacenter vs home, but it's still
       | somewhat hard to imagine saturating pipes like those.)
       | 
       | While standing in a state of mild awe at 800Gb/s I read reviews
       | and consider upgrading my house to 2.5Gb/s equipment... Should I
       | just wait for 10Gbit to get a bit cheaper? Should I ditch copper
       | and go fiber like that guy who was on the front page here
       | recently (probably not, but that was cool)? Maybe raw single core
       | CPU performance is starting to level off a bit, but it seems that
       | networking technologies are still advancing at a rapid clip!
        
         | seized wrote:
         | Fiber 10Gb is very cheap. NICs and SFPs from Ebay, fibre from
         | FS.com in whatever length you want. I got a plenum rated 100 ft
         | 4 pair cable from FS.com for $100 or so, and it was only that
         | expensive for the plenum rating as it runs through my cold air
         | returns.
        
       | ksec wrote:
       | Just some napkin maths. ( Correct me if I am wrong )
       | 
       | Looking at the 800Gbps config: Dell R7525 with dual 64C / 128T
       | and 4x ConnectX-6 Dx, 800Gbps in 2U.
       | 
       | With Zen 4C, 128C and PCI-E 5.0, ConnectX-7, two nodes could fit
       | into 2U, i.e. doubling to 1.6Tbps per 2U.
       | 
       | That is going from 16Tbps to 32Tbps per Rack. ( Using 40U only )
       | 
       | To put things in perspective, if every user were to use a 20Mbps
       | stream at the same time ( not going to happen due to time zone
       | differences ), the 250M Netflix subscribers worldwide would need
       | 5000M Mbps or 5000 Tbps. That is less than 200 Racks to serve
       | every single one of their customers on planet earth. ( Ignoring
       | Storage. ) You could ship a Rack to every Region, State, Nation,
       | Jurisdiction or Local ISP and Exchange and be done with it.
       | 
       | I hope Lisa Su sends drewg123 and his team at Netflix some Zen 4C
       | ASAP to play with, _cough_ , I mean to help test it.
       | 
       | Note: We have PCI-E 6.0 ( and 7.0 ), DDR6 on Roadmap. The 200
       | Racks could be down to 50 Racks by the end of this decade.
       | Assuming Netflix is still streaming at the same bitrate.
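       | 
       | The napkin maths above can be reproduced with a short sketch. All
       | inputs are the figures assumed in this comment (800Gbps per 2U
       | box, 40 usable U per rack, 250M subscribers at 20Mbps each), not
       | Netflix data, and the function name is made up for illustration.

```python
import math

BOX_HEIGHT_U = 2           # rack units per server, per the comment
RACK_USABLE_U = 40         # usable rack units per rack, per the comment
SUBSCRIBERS = 250_000_000  # subscriber count assumed in the comment
STREAM_MBPS = 20           # per-stream bitrate assumed in the comment


def rack_tbps(box_gbps, box_u=BOX_HEIGHT_U, rack_u=RACK_USABLE_U):
    """Aggregate throughput of one rack filled with identical boxes, in Tbps."""
    return (rack_u // box_u) * box_gbps / 1000


total_tbps = SUBSCRIBERS * STREAM_MBPS / 1_000_000  # Mbps -> Tbps
current = rack_tbps(800)    # today's 800Gbps per 2U
doubled = rack_tbps(1600)   # hypothetical 1.6Tbps per 2U

print(f"demand: {total_tbps:.0f} Tbps")
print(f"racks at {current:.0f} Tbps/rack: {math.ceil(total_tbps / current)}")
print(f"racks at {doubled:.0f} Tbps/rack: {math.ceil(total_tbps / doubled)}")
```

       | The "less than 200 Racks" figure only holds for the doubled
       | 32Tbps-per-rack case; at today's 16Tbps per rack the same demand
       | needs a bit over 300.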
        
         | loeg wrote:
         | Netflix is more likely to use a single box of this kind of
         | throughput at any given POP than a rack of them. For bigger
         | installations they can use cheaper, less throughput-dense
         | hardware (although I don't know if they do).
        
           | carlhjerpe wrote:
           | Take a look at the hardware, it isn't particularly expensive
           | stuff.
        
             | loeg wrote:
             | Aside from GPUs, I'm not sure how you would increase the
             | cost density much. Those NICs doing hundreds of Gbps and
             | TLS aren't cheap, nor are the fast SSDs needed to sustain
             | the load, nor is RAM or top end AMD server CPUs. Of course,
             | the cost is absolutely worth it to Netflix!
        
               | carlhjerpe wrote:
               | Yes, but it's still just one box, if you're building a
               | cluster of cheaper characteristics you need more of
               | everything. A high-end server VS a cluster of 10
               | machines, 10 machines wouldn't be cheaper to get to the
               | same throughput, it's not alien specialized supertech,
               | it's just top of the line commodity hardware. (10 is just
               | an example number here).
        
               | loeg wrote:
               | I mean, I guess I disagree with your stipulation that you
               | couldn't lower total costs somewhat using slightly more
               | slightly lower end hardware, if rack space was cheap.
               | 
               | > top of the line commodity hardware
               | 
               | Yeah -- cost in commodity hardware scales super-linearly
               | with performance.
        
         | alberth wrote:
         | > That is going from 16Tbps to 32Tbps per Rack ... only need
         | 200 racks
         | 
         | I doubt ISPs give an entire rack to Netflix. I wouldn't be
         | surprised if they only get like 4U total (hence why throughput
         | per server is so important to Netflix).
        
           | BonoboIO wrote:
           | The minimum requirements
           | 
           | https://openconnect.zendesk.com/hc/en-
           | us/articles/3600345383...
           | 
           | I think it depends on the size of the ISP; a rack would
           | probably be too much even for the biggest ISPs, but a single
           | 4U too little.
        
             | TFortunato wrote:
             | Looking at the banner pic on their main page, they seem to
             | have at least one ISP install of multiple racks in the
             | wild. Also, doing a little reading on how "fill" of the
             | devices works, they talk about doing peer-to-peer filling
             | of appliances located at the same site, which leads me to
             | believe, even if not deploying a full rack, deploying
             | multiple appliances to an ISP site is a relatively normal
             | occurrence.
             | 
             | https://openconnect.netflix.com/en/peering/
        
           | meltedcapacitor wrote:
           | Why not? It's the top bandwidth consumer for a retail ISP and
           | surely any reasonable amount of rack space is worth the
           | savings in interconnect bandwidth.
        
             | loeg wrote:
             | They are frequently rack space constrained, hence this
             | super dense hardware.
        
           | jedberg wrote:
           | Some ISPs give a full rack, some don't. It depends on how
           | much traffic they have and how willing they are.
           | 
           | But a lot of the racks sit at internet exchange points, where
           | Netflix rents one or more racks at a time.
        
         | recuter wrote:
         | There is the rather intriguing prospect of NVM Express over
         | Fabrics (NVMe-oF):
         | https://en.wikipedia.org/wiki/NVM_Express#NVMe-oF
         | 
         | Marvell Octeon 10 DPU (with an integrated 1 Terabit switch):
         | https://www.marvell.com/content/dam/marvell/en/company/media...
         | 
         | Probably pretty soon you'll be able to chuck in a few hot
         | swappable 100 TB Nimbus exadrives
         | (https://nimbusdata.com/products/exadrive/) in there and call
         | it a day. 1T in 1U. :)
        
           | Melatonic wrote:
           | Interesting to see that Infiniband is still kicking
        
             | coherentpony wrote:
             | Not really. Ethernet and Infiniband are both perfectly
             | capable from a bandwidth perspective. Streaming video isn't
             | remotely close to latency-bound, which is where Infiniband
             | would be better suited.
        
         | Melatonic wrote:
         | The people doing this might also be doing infra as code for the
         | virtualization layer on the hardware itself - which this might
         | not be able to satisfy. At minimum they surely have a ton of
         | this stuff deployed already so changing hardware specs big time
         | might not be worth the cost.
         | 
         | Also are you taking into account encryption for those specs?
        
         | reaperducer wrote:
         | _That is less than 200 Racks to serve every single one of their
         | customers on planet earth. ( Ignoring Storage. )_
         | 
         | If you're going to ignore storage, Netflix could just ship a
         | low-end video server to every one of its customers and be done
         | with it.
         | 
         | Every problem is an easy problem if you pretend the hard parts
         | don't exist.
        
           | OliverGuy wrote:
           | How much storage does Netflix actually need for its whole
           | library?
           | 
           | It's got about 17,000 titles globally [1]. If they have
           | copies in SD, 720p, HD and 4k that would be 68,000 versions
           | (plus some extra audio tracks for stuff dubbed in multiple
           | languages, but I suspect this is fairly minimal in terms of
           | storage though)
           | 
           | Let's assume that the resolutions have bitrates of 5, 10,
           | 15 and 20 Mbps.
           | 
           | The average length of a Netflix original movie is ~90mins [2]
           | 
           | So that would require about 575TB in storage if I have done
           | my maths correctly.
           | 
           | You would need about 20x30TB Kioxia CD6 SSDs for all that.
           | Very expensive but definitely technically possible.
           | 
           | I could totally see it being possible to fit those drives in
           | a single node to push the 800Gbps required, not increasing
           | the overall rack requirement at all. (Not sure if the
           | bandwidth from that many drives is enough; might have to
           | cache some of the most watched stuff in RAM.)
           | 
           | Not gonna see any in home boxes with all the titles pre
           | loaded any time soon though. As a hard drive array that's
           | still 30x20TB drives.
           | 
           | [1] https://www.comparitech.com/blog/vpn-privacy/netflix-
           | statist...
           | 
           | [2] https://stephenfollows.com/netflix-original-movies-
           | shows/#:~...)
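           | 
           | The storage estimate above can be checked with a few lines.
           | This is just the comment's own assumptions turned into code
           | (17,000 titles, four encodes at 5/10/15/20 Mbps, 90 minutes
           | each, decimal terabytes); none of it is real Netflix catalog
           | data, and the function name is hypothetical.

```python
import math

TITLES = 17_000                  # title count assumed in the comment
BITRATES_MBPS = (5, 10, 15, 20)  # one encode per resolution, per the comment
RUNTIME_MIN = 90                 # average runtime assumed in the comment


def library_size_tb(titles, bitrates_mbps, runtime_min):
    """Total size in decimal TB if every title is stored at every bitrate."""
    seconds = runtime_min * 60
    megabits_per_title = sum(bitrates_mbps) * seconds
    megabytes = titles * megabits_per_title / 8  # 8 bits per byte
    return megabytes / 1_000_000                 # MB -> TB


size_tb = library_size_tb(TITLES, BITRATES_MBPS, RUNTIME_MIN)
drives_30tb = math.ceil(size_tb / 30)

print(f"library: {size_tb:.2f} TB, needs {drives_30tb} x 30TB drives")
```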
        
             | AdrianB1 wrote:
             | Do they keep the global library on every server? I guess
             | they partition it geographically.
        
               | tecleandor wrote:
               | In their OpenConnect network they keep the most demanded
               | titles and the latest releases. And IIRC that refreshes
               | nightly (with new releases and whatever is hot that day)
               | 
               | https://openconnect.zendesk.com/hc/en-
               | us/articles/3600356180...
        
         | virtuallynathan wrote:
         | Back of the napkin: Zen4 / Genoa gets you to ~500GB/s PCIe and
         | ~500GB/s of DRAM bandwidth -- nearly 4Tbps! Zen3/Rome is
         | ~300GB/s PCIe and ~300GB/s DRAM -- about 2.4Tbps. A single 2U
         | box with Genoa might scale to 1.25Tbps+ of useful Netflix
         | traffic. We'll have to see what magic Drew can pull :)
        
         | aeyes wrote:
         | You are probably overestimating Netflix traffic by a lot.
         | 
         | IX.br peak traffic is 20Tb/s, DE-CIX peak traffic is 14Tb/s,
         | AMS-IX is around 11Tb/s.
         | 
         | The 800Gbps machine is probably enough for a country.
         | 
         | Netflix traffic stats at PIT Chile, this is their only peering
         | connection in Chile: https://www.pitchile.cl/wp/graficos-con-
         | indicadores/streamin...
        
           | srmn wrote:
           | This assumption misses out on all the private interconnect
           | links and deployed OpenConnect appliances within ISP networks
           | - a majority of Netflix's traffic today. IXes are only a
           | small portion of overall internet traffic.
        
           | lostlogin wrote:
           | I notice people streaming in very low resolution without
           | realising it, and sometimes intervene when the pain gets too
           | great.
           | 
           | I'd be very surprised if the average bitrate was anywhere
           | near the approximation.
           | 
           | However that wasn't the point of the calculation, it was
           | looking for a maximum.
        
             | orangepurple wrote:
             | Agree, and 20 Mbps is a reasonable rate for modern codecs
             | for resolutions up to 4K for 99% of viewers.
        
       | ocbyc wrote:
       | "In networking units"
        
       | BonoboIO wrote:
       | It amazes me that Netflix is capable of such top of the line
       | engineering (really mindblowing stuff, one machine that streams
       | nearly 1 Terabit per second), but is for the love of god
       | unable to stream HD Content to my iPhone (newest firmware, all
       | up to date). Tried everything: gigabit wifi, cellular, multiple
       | ISPs ...
       | 
       | It is better for me to pirate their content, play it with Plex
       | and be happy. I pay for Netflix, but still have to download it,
       | to see it in an acceptable quality. Absurd. The support couldn't
       | help. It doesn't affect me, because I have my Torrent/Plex setup,
       | but for 99.9% of people it is a subpar experience.
       | 
       | I think the best years are over for Netflix. The hard awakening
       | is here: they have to make content that users want, and they are
       | a movie/TV content company, not primarily a "tech company".
        
         | staringback wrote:
        
         | leetharris wrote:
         | You live in a bubble. The vast majority of the world likely
         | cannot even tell the difference between HD and 4K. Netflix
         | continues to grow its content and retain subscribers.
        
           | BonoboIO wrote:
           | Netflix is a media company as I said.
           | 
           | Well 4K vs HD, you are right, but 480p on a Retina display
           | right in front of me. Really obvious.
        
         | selfhoster69 wrote:
         | > unable to stream HD Content to my iPhone
         | 
         | Yeah this has been the case since forever. It prioritizes
         | instant playback vs forcing 1080p or similar.
         | 
         | Can't speak for iPhone, but on iPad, I've moved to using the
         | website, which goes to 1080p immediately.
         | 
         | > still have to download it, to see it in an acceptable quality
         | 
         | Downloaded content does contain a whole lot more compression
         | than streaming at the max phone-supported quality, so just a
         | tiny FYI.
        
       | this15testing wrote:
       | related to slide 4...
       | 
       | how much does netflix donate to the FreeBSD foundation relative
       | to their profits?
        
         | hnarn wrote:
         | "Netflix does contribute financially to the FreeBSD Foundation
         | and has done so since 2012. Last year they engaged at the
         | "platinum" level with contributing more than $50,000+ USD to
         | the foundation." (2019)
         | 
         | Took about five seconds to Google, it's the first result for
         | "netflix donations to freebsd".
         | 
         | NFLX Q3 2019 earnings were about $5.2B.
         | 
         | So about 0.001%, I guess.
        
           | this15testing wrote:
           | haha
        
       ___________________________________________________________________
       (page generated 2022-11-03 23:01 UTC)