[HN Gopher] Teleforking a Process onto a Different Computer
       ___________________________________________________________________
        
       Teleforking a Process onto a Different Computer
        
       Author : trishume
       Score  : 365 points
       Date   : 2020-04-26 15:19 UTC (7 hours ago)
        
 (HTM) web link (thume.ca)
 (TXT) w3m dump (thume.ca)
        
       | crashdelta wrote:
       | THIS IS REVOLUTIONARY!
        
       | anthk wrote:
        | What I'd love is easily binding remote directories as local.
        | Not NFS, but a braindead 9p. If I don't have a tool, I'd love
        | to have a bind-mount of a directory from a stranger, and run a
        | binary from within it (or pipe to it) without them being able
        | to trace the I/O.
        | 
        | If the remote FS is on a different arch, I should be able to
        | run the same binary remotely as a fallback option, seamlessly.
        
       | crashdelta wrote:
       | This is one of the best side projects I've ever seen, hands down.
        
         | trishume wrote:
         | <3
        
       | zozbot234 wrote:
       | The page states that CRIU requires kernel patches, but other
       | sources say that the kernel code for CRIU is already in the
       | mainline kernel. What's up with that?
        
         | trishume wrote:
         | I didn't say it requires kernel patches, just that it requires
         | a build flag and new enough versions. It does happen that most
         | distros do enable that build flag though.
         | 
         | There is a patch set for write protection support in
         | userfaultfd that was just merged recently and I'm not sure if
         | it's made it into any actual releases yet.
        
         | cyphar wrote:
         | CRIU does work on mainline kernels and has for several years.
         | However, it is still being actively developed (userfaultfd is
         | the example the article uses -- a mainline feature which is
         | still seeing lots of development).
         | 
         | It doesn't usually require additional patches to work, though
         | they do have a fairly big job (every new kernel feature needs
         | to be added to CRIU so it can both checkpoint and restore it --
         | and not all features are trivial to checkpoint). It's entirely
         | possible that programs using very new kernel features may
         | struggle to get CRIU to work.
        
       | hawski wrote:
        | Seriously cool. That also reminds me of DragonFlyBSD's process
        | checkpointing feature, which offers suspend-to-disk. In the
        | Linux world there were many attempts, but AFAIK nothing simple
        | and complete enough. To be fair, I don't know if DF's
        | implementation is that either.
       | 
       | https://www.dragonflybsd.org/cgi/web-man?command=sys_checkpo...
       | 
       | https://www.dragonflybsd.org/cgi/web-man?command=checkpoint&...
        
         | jabedude wrote:
         | This article[0] on LWN seems to suggest that Linux has no
         | kernel support for checkpoint/restore because it's hard to
         | implement (no arguments there). But hypervisors support
         | checkpoint/restore for virtual machines, e.g. ESXi VMotion and
         | KVM live migration, so it seems like these technical problems
         | are solvable. Indeed all the benefits of VM migration seem to
         | also apply to process migration (load balancing, service
         | uptime, etc).
         | 
         | 0. https://lwn.net/Articles/293575/
        
           | dirtydroog wrote:
           | They seem to be different problems though, as that LWN
           | article suggests. It's probably easier to checkpoint/restore
           | an entire image's state than an individual process' state.
           | 
            | But even calls like gettimeofday() work differently when
            | running on hypervisors than they do when running on bare
            | metal.
        
           | colinchartier wrote:
           | The hypervisor migration features you mention are really
           | powerful and stable. I founded a startup that does fast CI
           | using them.
           | 
           | This is what it looks like:
           | https://layerci.com/jobs/github/layerdemo/layer-gogs-demo/18
           | 
           | (in this run it restored a previous VM snapshot to skip the
           | setup, and took a snapshot after the webserver started to
           | give a lazy-loaded staging server)
        
           | colinchartier wrote:
           | There's also a project called CRIU that is used
           | experimentally by Docker for container save/load:
           | https://www.criu.org/Main_Page
        
             | jraph wrote:
             | You were faster than me. Yes, CRIU does checkpointing in
             | userspace on Linux.
             | 
              | They contributed a lot of patches to the Linux kernel to
              | support this feature. So there isn't any single
              | checkpointing facility in the mainline kernel. Rather,
              | there are many kernel features that allow building such a
              | feature in userspace. For instance, you can set the PID
              | of the next process you spawn on Linux thanks to CRIU,
              | because CRIU needs to be able to restore processes with
              | the same PIDs as when they were checkpointed [1].
              | 
              | CRIU is used by OpenVZ, which, if I remember correctly,
              | is moving (or has moved) away from a kernel-based
              | approach.
             | 
              | [1] http://efiop-notes.blogspot.com/2014/06/how-to-set-pid-using...
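              | 
              | Concretely, the trick from [1] looks something like this
              | (a rough, untested sketch: "want" is a made-up target
              | PID, writing ns_last_pid needs privileges, and it's racy
              | unless you serialize forks the way CRIU does):
              |         #include <stdio.h>
              |         #include <sys/types.h>
              |         #include <unistd.h>
              |         
              |         int main(void) {
              |             pid_t want = 12345;  /* hypothetical target */
              |         
              |             /* The kernel hands out last_pid + 1, so
              |                write want - 1 just before forking. */
              |             FILE *f = fopen("/proc/sys/kernel/ns_last_pid", "w");
              |             if (!f) { perror("ns_last_pid"); return 1; }
              |             fprintf(f, "%d", (int)(want - 1));
              |             fclose(f);
              |         
              |             pid_t child = fork();
              |             if (child == 0) {
              |                 /* Ideally prints the PID we asked for. */
              |                 printf("child pid: %d\n", (int)getpid());
              |                 _exit(0);
              |             }
              |             return 0;
              |         }
              | Newer kernels expose this directly: clone3() grew a
              | set_tid field (added for CRIU) that requests a specific
              | PID without the ns_last_pid dance.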
        
               | jabedude wrote:
               | This was immensely informative, thank you for that link.
               | CRIU looks like an awesome project. I can't believe I
               | haven't heard of them prior to today.
        
             | emmelaich wrote:
             | CRIU and DMTCP are mentioned in the article.
        
           | [deleted]
        
           | justapassenger wrote:
            | IMO, a big reason why Linux doesn't support it is that the
            | big tech companies, who drive a lot of development and
            | funding, found that it's just not worth it.
            | 
            | Tech moved away from the mainframe model of keeping single
            | processes up for as long as possible and isolating software
            | from failures. Instead they embrace failure at the software
            | layer, which (when executed correctly) gives you the same
            | high level of availability, plus layers of protection
            | against other issues (a node going down in an expected way
            | looks the same no matter what the root cause is), and makes
            | systems more maintainable and upgradable overall.
            | 
            | Basically, why deal with complicated checkpoints at the
            | kernel level, which doesn't understand your software, when
            | you can instead deal with that at the software layer, with
            | full control over it.
        
             | spullara wrote:
             | Swap performance has been similarly underinvested.
        
               | justapassenger wrote:
               | Yeah, and for exactly the same reasons.
               | 
                | In the old days swap was a necessity to keep the system
                | up, but nowadays you only need it if your dataset is
                | really big (other than some weirdness in the kernel,
                | which actually likes to have a small amount of swap to
                | do cleanup and maintenance).
                | 
                | And if you have that problem, you have custom solutions
                | that implement a memory hierarchy for you, which give
                | you much better results, as they can work across all
                | the layers (cache/hdd/ssd/net/etc) in a much more
                | effective way than generic, software-agnostic solutions
                | focused on just one dimension.
        
               | snazz wrote:
               | How commonly do people need to use swap in the first
               | place? On my current computer, I don't even have a swap
               | partition because I've never gotten close to filling my
               | 16GB, even with a big virtual machine and hundreds of
               | browser tabs. In my experience on desktop systems, it
               | just makes your computer painfully slow when you write a
               | program that allocates far too much memory. Without swap,
               | the OOM killer gets involved much earlier and you don't
               | have to do a hard power cut to stop the disk from
               | thrashing.
        
               | cesarb wrote:
               | I just opened the site of the biggest retailer in my
               | country, and looked at one entry-level laptop: it has 2GB
               | of RAM. Not everybody has a high-end computer.
        
               | saalweachter wrote:
               | What's hilarious to me, growing up in the era of swap, is
               | that now you can have so much RAM that you can start
               | caching your disk to RAM for a little extra performance
               | (well, we've kind of moved past that, too, since SSDs are
               | so much faster than spinny disks).
        
               | haneefmubarak wrote:
                | When done well, swap can be pretty useful. I have a Fall
                | 2019 MacBook Pro maxed out on CPU (8 cores) and RAM (64
                | GB) - macOS still regularly compresses (highest I've seen
                | so far: 12 GB) and/or swaps (highest I've seen so far: 16
                | GB) relatively unused memory to make more space for other
                | applications and hot caches (which are still useful even
                | with a PCIe SSD).
               | 
               | Keep in mind that everything feels fully snappy and
               | responsive while this is all happening.
        
               | barrkel wrote:
                | If you don't have enough memory for a single app, then
                | your life is completely miserable - in that case,
                | swapping is almost indistinguishable from a crash. But
                | swap can be the difference between being able to run
                | one app at a time and running three or four, as long as
                | you only have one in the foreground at a time.
               | 
               | Swap doesn't play well with GC, or pointerful programming
               | languages generally. Swap is a latency hit every time you
               | touch an evicted page, and the more pointers your
               | structures have, the more frequently that occurs.
               | 
               | In older times, when memory needed to be conserved, GC
               | was less frequently used, and as a consequence of the
               | accounting overheads of manual memory allocation,
               | pointerful data structures weren't quite as common.
               | Instead, you'd have pre-allocated arrays which supported
               | up to N items of a particular type (and tough luck if you
               | wanted more), or in more sophisticated applications,
               | arena allocators. In such systems, the working set of
               | memory is more likely to be contiguous; paging works
               | better because every page load is more likely to bring in
               | relevant data.
               | 
               | Switching between two open applications was the moment
               | where you'd typically hit swap; that, and every now and
               | then when you touched an area of the app you hadn't in a
               | while, or jumped to a distant area of the document you
               | were working on. The HDD light would come on, and you'd
               | have to wait a few seconds while the computer ground
               | through (HDDs being audible, some much louder than
               | others) before you'd switch.
               | 
               | At some point in the mid 2000s, GC languages tipped the
               | balance, especially JS. The browser became the main
               | application, and wouldn't tolerate swap well. It became a
               | mini OS unto itself, and had to balance the competing
               | needs of different tabs. Then mobile exploded, and didn't
               | use swap at all in order to compete with the iPhone,
               | remarkable for its smooth animations. If there was to be
               | latency, the application had to hide it; OS-level
               | introduced latency on a page fault wasn't good enough,
               | because if it happened on the UI thread, you dropped
               | frames.
               | 
               | And of course swap never worked well on servers with
               | frequent requests. Swapping on a server is a great way to
               | back up all your queues.
        
       | saagarjha wrote:
        | It's touched on at the very end, but this kind of work is
        | somewhat similar to what the kernel needs to do on a fork or
        | context switch, so you can really figure out what state you
        | need to keep track of from there. Once you have that,
        | scheduling one of these network processes isn't really all
        | that different from scheduling a normal process, except that
        | of course syscalls on the remote machine will possibly go to a
        | kernel that doesn't know what to do with them.
        
       | new_realist wrote:
       | See https://criu.org/Live_migration
        
       | abotsis wrote:
       | Also of interest might be Sprite- a Berkeley research os
       | developed "back in the day" by Ken Shirriff And others. It
       | boasted a lot of innovations like a logging filesystem (not just
       | metadata) and a distributed process model and filesystem allowing
       | for live migrations between nodes.
       | https://www2.eecs.berkeley.edu/Research/Projects/CS/sprite/s...
        
       | jka wrote:
       | This reminds me a little bit of the idea of 'Single System
       | Image'[1] computing.
       | 
       | The idea, in abstract, is that you login to an environment where
       | you can list running processes, perform filesystem I/O, list and
       | create network connections, etc -- and any and all of these are
       | in fact running across a cluster of distributed machines.
       | 
       | (in a trivial case that cluster might be a single machine, in
       | which case it's essentially no different to logging in to a
       | standalone server)
       | 
       | The wikipedia page referenced has a good description and a list
       | of implementations; sadly the set of {has-recent-release && is-
       | open-source && supports-process-migration} seems empty.
       | 
       | [1] - https://en.wikipedia.org/wiki/Single_system_image
        
         | macintux wrote:
         | That was the original concept that led Ian Murdock and John
         | Hartman to found Progeny. The idea was that overnight, while no
         | one was working at their desks, companies could reboot their
         | Windows boxes into a SSI network of Linux nodes to run parallel
         | compute tasks.
         | 
         | Roughly, anyway, I got the sales pitch 20 years ago so my
         | memories are fuzzy. I wasn't remotely sold on it but was so
         | anxious to work for a Linux R&D company in Indianapolis of all
         | places that I accepted the job anyway.
         | 
         | Sadly we didn't get far on the concept before the dot-com
         | crash. Absent more venture capital we pivoted to focus on
         | something we could sell, Progeny Linux, and tried to turn that
         | into a managed platform for companies who wanted to run Linux
         | on their appliances.
        
       | ISL wrote:
       | What's old is new again -- I'm pretty sure QNX could do this in
       | the 1990s.
       | 
       | QNX had a really cool way of doing inter-process communication
       | over the LAN that worked as if it were local. Used it in my first
       | lab job in 2001. You might not find it on the web, though. The
       | API references were all (thick!) dead trees.
       | 
       | Edit: Looks like QNX4 couldn't fork over the LAN. It had a
       | separate "spawn()" call that could operate across nodes.
       | 
       | https://www.qnx.com/developers/docs/qnx_4.25_docs/qnx4/sysar...
        
         | Proven wrote:
          | +1. I was going to mention this (less eloquently).
        
         | checker659 wrote:
         | > I'm pretty sure QNX could do this in the 1990s.
         | 
         | Plan 9, SmallTalk
        
           | imglorp wrote:
           | Erlang of course. All spawns and messages can be local or
           | remote.
        
             | masklinn wrote:
             | Indeed, weird to only find it that low as a sub-comment:
              |         spawn(Node, Module, Function, Args) -> pid()
             | Returns the process identifier (pid) of a new process
             | started by the application of Module:Function to Args on
             | Node. If Node does not exist, a useless pid is returned.
             | Otherwise works like spawn/3.
        
             | russellbeattie wrote:
             | Any time I see something about remote processes, I
             | immediately think, "Erlang could probably do this."
             | 
              | I think Erlang would have been the programming language
              | of the 21st century... if only the syntax didn't read
              | like what you'd get if line noise and a printer error
              | code had a baby and raised it to think like Lisp.
        
         | femto113 wrote:
          | Even further back, this echoes some of the goals of MIT's
          | Project Athena, started in the 1980s.
        
       | lachlan-sneff wrote:
       | Wow, this is really interesting. I bet that there's a way of
       | doing this robustly by streaming wasm modules instead of full
       | executables to every server in the cluster.
        
       | peterkelly wrote:
       | There's been a bunch of interesting work done on this over the
       | years. Here's a literature survey on the topic:
       | https://dl.acm.org/doi/abs/10.1145/367701.367728
        
       | pcr910303 wrote:
        | This makes me think about Urbit[0] OS -- since Urbit OS
        | represents the entire OS state in a simple tree, this would be
        | very simple to implement.
       | 
       | [0]: https://urbit.org/
        
       | [deleted]
        
       | userbinator wrote:
       | _This can let you stream in new pages of memory only as they are
       | accessed by the program, allowing you to teleport processes with
       | lower latency since they can start running basically right away._
       | 
       | That's what "live migration" does; it can be done with an entire
       | VM: https://en.wikipedia.org/wiki/Live_migration
        
         | dmitrygr wrote:
         | Hey what's the best way to contact you? Have a question. If
         | possible my email is in my profile.
        
           | sitkack wrote:
           | https://www.usenix.org/node/170864
        
         | trishume wrote:
         | It's what a certain kind of live migration strategy does.
         | There's a discussion thread about this further down where I
         | link to that same page.
        
         | sitkack wrote:
          | https://cloud.google.com/compute/docs/instances/live-migrati...
          | gives a great rundown on the mechanics as they pertain to GCP.
          | It's probably somewhat valid for other systems too.
        
       | YesThatTom2 wrote:
       | Condor did this in the early 90s.
        
       | synack wrote:
       | This reminds me of OpenMOSIX, which implemented a good chunk of
       | POSIX in a distributed fashion.
       | 
       | MPI also comes to mind, but it's more focused on the IPC
       | mechanisms.
       | 
       | I always liked Plan 9's approach, where every CPU is just a file
       | and you execute code by writing to that file, even if it's on a
       | remote filesystem.
        
         | anfractuosity wrote:
          | I recall OpenMosix too; I remember playing with it on a
          | couple of old machines a while ago. I thought it seemed quite
          | a cool idea. I think I was trying to do mp3 encoding from CDs
          | with it in a distributed fashion.
        
         | adrianpike wrote:
         | Whoa, I was just talking about OpenMOSIX the other day - at my
         | college job, when we retired workstations, they would just sit
         | in the back closet for years until facilities would get around
         | to getting them up for auction. We set up a TFTP boot setup and
          | had a few dozen nodes at any given time. It wasn't high
          | performance in any fashion, but it worked pretty transparently
          | and was always fun to throw heavy synthetic workloads at it and
          | watch the cluster rebalance and hear the fans spin up on the
          | clunky old Pentiums.
        
         | dghughes wrote:
         | Funny almost exactly one year ago I was desperately trying to
         | learn about and demo a Linux cluster.
         | 
          | I happened upon ClusterKnoppix, a LiveCD that uses OpenMOSIX,
          | and used it as my demo. MPI as well, but I was having a lot
          | of trouble getting it all in my head while also having to
          | explain it during a presentation.
          | 
          | I'm glad that's over with! But it was still fun.
        
           | AstroJetson wrote:
            | Came here to write about ClusterKnoppix! It was amazing; I
            | had 7 small form factor PCs running a mini cluster. It was
            | great for moving things around. The problem was it fell
            | behind in releases and it became hard to get applications
            | to run on it. Knoppix was how I got into Linux, and it
            | still ranks as my favorite distro. Thanks Klaus Knopper for
            | your work in setting that up!
            | 
            | It would be nice if it worked on Raspberry Pi, or if there
            | were a simple way to set OpenMOSIX up on Pis.
        
         | AstroJetson wrote:
          | For people that want to try OpenMOSIX out, take a look at
          | this site: http://dirk.eddelbuettel.com/quantian.html He has
          | a distro called Quantian, with a big collection of science
          | tools added. Shame it's Sunday; I'll need to wait a week to
          | pull it down and see how well it flies.
        
         | synack wrote:
         | Ah, the Plan 9 bits I was remembering are actually in
         | Inferno... http://www.vitanuova.com/inferno/man/4/grid-cpu.html
        
         | notacoward wrote:
         | Same reaction here. MOSIX was basically this idea developed for
         | real. Turns out there are a bunch of secondary problems to do
         | with namespaces and security, scheduling and resource use,
         | topology, performance, and so on. In the end it never turned
         | out to be all that practical, and I know of several efforts
         | (e.g. LOCUS) that went quite far to reach the same conclusion.
         | There's a reason that non-transparent "shared nothing"
         | distributed computing came to dominate the computing landscape.
        
       | carapace wrote:
       | "Somebody else has had this problem."
       | 
       | Don't get me wrong, this is great hacking and great fun. And this
       | is a good point:
       | 
       | > I think this stuff is really cool because it's an instance of
       | one of my favourite techniques, which is diving in to find a
       | lesser-known layer of abstraction that makes something that seems
       | nigh-impossible actually not that much work. Teleporting a
       | computation may seem impossible, or like it would require
       | techniques like serializing all your state, copying a binary
       | executable to the remote machine, and running it there with
       | special command line flags to reload the state.
        
       | Animats wrote:
       | That goes back to the 1980s, with UCLA Locus. This was a
       | distributed UNIX-like system. You could launch a process on
       | another machine and keep I/O and pipes connected. Even on a
       | machine with a different CPU architecture. They even shared file
       | position between tasks across the network. Locus was eventually
       | part of an IBM product.
       | 
       | A big part of the problem is "fork", which is a primitive
       | designed to work on a PDP-11 with very limited memory. The way
       | "fork" originally worked was to swap out the process, and instead
       | of discarding the in-memory copy, duplicate the process table
       | entry for it, making the swapped-out version and the in-memory
       | version separate processes. This copied code, data, and the
       | process header with the file info. This is a strange way to
       | launch a new process, but it was really easy to implement in
       | early Unix.
       | 
       | Most other systems had some variant on "run" - launch and run the
       | indicated image. That distributes much better.
        
         | wrs wrote:
         | No need to use the past tense -- CreateProcess is the primitive
         | in Windows NT. (It's been 20 years...can we just call that
         | Windows now?)
        
           | zozbot234 wrote:
           | posix_spawn is a thing as well.
        
         | barrkel wrote:
          | There's also an ergonomic angle to process creation APIs -
          | rather than needing separate APIs for manipulating your child
          | process vs manipulating your own process, fork() lets you use
          | one to implement the other: fork(), configure the resulting
          | process, then exec().
         | 
         | CreateProcess* on Windows is a relative monstrosity of
         | complexity compared to fork/exec.
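          | 
          | A quick sketch of that idiom (nothing exotic, just the
          | classic Unix pattern):
          |         #include <fcntl.h>
          |         #include <stdio.h>
          |         #include <sys/wait.h>
          |         #include <unistd.h>
          |         
          |         int main(void) {
          |             pid_t pid = fork();
          |             if (pid == 0) {
          |                 /* "Configure the resulting process" with the
          |                    same calls you'd use on yourself: here,
          |                    point stdout at a file, then exec. */
          |                 int fd = open("out.txt",
          |                               O_WRONLY | O_CREAT | O_TRUNC,
          |                               0644);
          |                 dup2(fd, STDOUT_FILENO);
          |                 close(fd);
          |                 execlp("echo", "echo", "hello from the child",
          |                        (char *)NULL);
          |                 _exit(127);  /* reached only if exec fails */
          |             }
          |             waitpid(pid, NULL, 0);
          |             return 0;
          |         }
          | With CreateProcess (or posix_spawn), every one of those
          | child-side tweaks instead has to be a parameter or attribute
          | the API designers thought of in advance.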
        
       | peterwwillis wrote:
       | It's nice to see people re-discover old school tech. In cluster
       | computing this was generally called "application
       | checkpointing"[1] and it's still in use in many different systems
       | today. If you want to build this into your app for parallel
       | computing you'd typically use PVM[2]/MPI[3]. SSI[4] clusters
       | tried to simplify all this by making any process "telefork" and
       | run on any node (based on a load balancing algorithm), but the
       | most persistent and difficult challenge was getting shared memory
       | and threading to work reliably.
       | 
       | It looks like CRIU support is bundled in kernels since 3.11[5],
       | and works for me in Ubuntu 18.04, so you can basically do this
       | now without custom apps.
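        | 
        | For instance, a checkpoint via libcriu, CRIU's C client
        | library, looks roughly like this (an untested sketch: it
        | assumes criu is installed, an ./images directory exists, and
        | you link with -lcriu):
        |         #include <criu/criu.h>
        |         #include <fcntl.h>
        |         #include <stdbool.h>
        |         #include <stdio.h>
        |         #include <stdlib.h>
        |         
        |         /* usage: ./dump <pid> */
        |         int main(int argc, char **argv) {
        |             int pid = atoi(argv[1]);  /* process tree to dump */
        |             int dir = open("./images", O_DIRECTORY);
        |         
        |             criu_init_opts();
        |             criu_set_pid(pid);
        |             criu_set_images_dir_fd(dir);
        |             criu_set_shell_job(true);  /* started from a shell */
        |             criu_set_log_file("dump.log");
        |         
        |             if (criu_dump() < 0) {  /* negative means failure */
        |                 fprintf(stderr, "dump failed, see dump.log\n");
        |                 return 1;
        |             }
        |             /* Copy ./images to another machine, then restore
        |                there with criu_restore() or `criu restore`. */
        |             return 0;
        |         }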
       | 
       | [1] https://en.wikipedia.org/wiki/Application_checkpointing [2]
       | https://en.wikipedia.org/wiki/Parallel_Virtual_Machine [3]
       | https://en.wikipedia.org/wiki/Message_Passing_Interface [4]
       | https://en.wikipedia.org/wiki/Single_system_image [5]
       | https://en.wikipedia.org/wiki/CRIU#Use
        
       | dekhn wrote:
       | Condor, a distributed computing environment, has done IO remoting
       | (where all calls to IO on the target machine get sent back to the
       | source) for several decades. The origin of Linux Containers was
       | process migration.
       | 
          | I believe people have found other ways to do this; personally
          | I think the ECS model (like k8s, but the cloud provider hosts
          | the k8s environment), where the user packages up all the
          | dependencies and clearly specifies the IO mechanisms through
          | late binding, makes a lot more sense for distributed computing.
        
         | gmfawcett wrote:
         | For those like me who haven't heard of Condor before, it's
         | called HTCondor now:
         | 
         | https://research.cs.wisc.edu/htcondor/description.html
        
         | vidarh wrote:
         | I clicked through to mention Condor too... I first came across
         | it in the 90's, and it seems like one of those obvious hacks
         | that keeps being reinvented.
        
           | dekhn wrote:
           | I was actually channeling the creator of Condor, Miron Livny,
           | who has a history of going to talks about distributed
           | computing and pointing out that "Condor already does that"
           | for nearly everything that people try to tout as new and
           | cool.
           | 
           | He's not wrong, but few people use Condor.
        
             | mempko wrote:
              | We use Condor. Oh, and did you know it can run docker
              | containers too? And it's constantly being updated and
              | improved. What it's lacking, in my opinion, is a cool GUI
              | to monitor and spawn things.
        
               | iaresee wrote:
               | Pre-MS acquisition you used to be able to use Cycle
               | Cloud's software for ~free without support. Unfortunately
               | it went all-Azure-only post-acquisition.
        
             | iaresee wrote:
             | Few people outside academia, maybe? But inside it still
             | seems to dominate in areas like physics computation. CERN
             | uses a world wide Condor grid. LIGO too. It's excellent for
             | sharing cycles for those slow-burn, highly parallel,
             | massive data scale problems.
             | 
             | I spent more than a decade bringing Condor to semi-
             | conductor, financial and biomedical institutions. It was
             | always a fight to show them there was a better way to
             | utilize their massive server farms that didn't require
             | paying the LSF Tax. Without a shiny sales or marketing
             | department, Condor was hard to pitch to IT departments.
             | 
             | Still, to this day, I see people doing things with "modern"
             | platforms like Kubernetes and such and I chuckle. Had that
             | in Condor 15 years ago in many cases. :)
        
               | eternauta3k wrote:
               | Why "LSF tax"?
        
               | gautamcgoel wrote:
                | I think he's referring to IBM LSF:
                | https://www.ibm.com/support/knowledgecenter/en/SSWRJV/produc...
        
               | iaresee wrote:
               | Yes, this exactly.
               | 
               | If you want to dig deep into computer science history,
               | you'll note that LSF was once a slightly modified Condor.
               | There's a wild history between the two and U.Wisc and U
               | of T.
        
               | 0xdeadbeefbabe wrote:
               | > Still, to this day, I see people doing things with
               | "modern" platforms like Kubernetes and such and I
               | chuckle. Had that in Condor 15 years ago in many cases.
               | :)
               | 
                | I'm reading the docs and it seems to be used mostly for
                | solving long-running math problems like protein folding
                | or SETI@home?
               | 
               | Can it be used for scaling a website too? I think that's
               | k8s "killer" feature heh.
        
               | iaresee wrote:
               | Yes, it can scale based on metrics. And metrics can be
               | anything.
               | 
               | What it's missing is all the discovery and network
               | plumbing to tie running instances together with load
               | balancing and inter-service comms.
               | 
                | Google's old Borg paper mentions Condor as a thing they
                | considered and cribbed features from.
               | 
               | Honestly, serving a website is not as different from
               | batch processing problems as you'd think. There are
               | differences but they're subtle, not mountainous.
        
               | yummypaint wrote:
               | I've used condor for ~5 years now, mostly for running
                | simulations and processing data. Everything I've done
               | with it has been trivially parallelizable (divide data
               | into chunks based on time, etc), and in those
               | applications it has been a superb tool that just works.
               | 
               | It should be possible to run a scalable website with it,
               | but then you don't get "infinite" scalability like cloud
               | services offer, since you're limited by the size of the
               | compute pool. It would probably have its pitfalls.
               | 
                | That being said, coming in without knowledge of either, I
               | found it much easier to learn and get started doing
               | things with condor than kubernetes. I had all kinds of
               | issues just getting simple things like LaTeX compilation
               | as part of gitlab CI to work reliably. Clearly the
                | experts know how to make things go, but condor has a
                | lower barrier to entry. For use cases where condor CAN
                | work, especially data processing, I always recommend it.
        
               | dekhn wrote:
               | Most systems like Condor have a concept that a job or
               | task is something that "comes up, runs for a while,
               | writes to log files, and then exits on its own, or after
               | a deadline". I've talked to the various batch queue
               | providers and I don't think they really consider
               | "services" (like a webserver, app server, rpc server, or
               | whatever) in their purview.
               | 
               | In fact, that was what struck me the most when I started
               | at Google (a looong time ago): at first I thought of borg
               | like a batch queue system, but really it's more of a
               | "Service-keeper-upper" that does resource management and
               | batch jobs are just sort one type of job (webserving, etc
               | are examples of "Service jobs") laid on top of the RM
               | system.
               | 
                | Over time I've really come to prefer the Google approach:
                | for example, when a batch job is running, it still listens
                | on a web port and serves up a status page and numerous
                | other introspection pages that are great for debugging.
               | 
               | TBH I haven't read the Condor, PBS, LSF manuals in a
               | while so it's very well possible they handle service jobs
               | and the associated problems like dynamic port management,
               | task discovery, RPC balancing, etc.
        
               | iaresee wrote:
               | But in a world where you're continuously deploying on a
               | cadence that's incredibly quick, how do things differ? I
               | contend the batch and online worlds start to get pretty
               | blurry at this stage. We're not in a world where bragging
               | about uptime on our webserver in years is a thing any
               | more.
               | 
               | I was routinely using Condor in semi-conductor R&D and
               | running batches of jobs where each job was running for
               | many days -- that's probably far longer than any single
               | instance of a service exists at Google in this day and
               | age, right?
               | 
               | None of the batch stuff does the networking management
               | though. No port mapping, no service discovery
               | registration, no load balancer integration, etc. That's
               | Kubernetes sugar they lack. But...has never struck me as
               | overly hard to add, especially if you use Condor's Docker
               | runner facilities.
               | 
               | Edit: I should say that I don't _really_ think you could
               | swap out Kubernetes for Condor. Not easily. But it's
               | always been in my long list of weekend projects to see
                | what running a cluster of online services would be like
               | on Condor. I don't think it'd be awful or all that hard.
               | 
               | The other killer Condor tech is their database
                | technology. The double-evaluation approach of ClassAds
                | is so fantastic for non-homogenous environments, where
                | loads have needs and computational nodes have needs and
                | everyone can end up mostly happy.
        
       | concernedctzn wrote:
       | side note: take a look at this guy's other blog posts, they're
       | all very good
        
       | cecilpl2 wrote:
       | This is similar to what Incredibuild does. It distributes compile
       | and compute jobs across a network, effectively sandboxing the
       | remote process and forwarding all filesystem calls back to the
       | initiating agent.
        
       | touisteur wrote:
        | I wonder whether the effort by the syzkaller people (@dvyukov)
        | could help with the actual description of all the syscalls
        | (which the author says people gave up on for now, because it's
        | too complex), since they need those descriptions to be able to
        | fuzz efficiently...
        
       | vladbb wrote:
       | I implemented something similar ten years ago for a class
       | project: https://youtu.be/0am-5noTrWk
        
       | fitzn wrote:
       | Really cool idea! Thanks for providing so much detail in the
       | post. I enjoyed it.
       | 
        | A somewhat related project is the PIOS operating system,
        | written 10 years ago but still used today to teach the
        | operating systems class at Yale. The OS has different goals
        | than your project, but it does support forking processes to
        | different machines and then deterministically merging their
        | results back into the parent process. Your post reminds me of
        | it. There's a handful of papers that talk about the different
        | things they did with the OS, as well as their best paper award
        | at OSDI 2010.
       | 
       | https://dedis.cs.yale.edu/2010/det/
        
       | [deleted]
        
       | cjbprime wrote:
       | Amazing work.
        
         | ignoramous wrote:
         | Yep. Reminds me of u/xal praising u/trishume:
         | https://news.ycombinator.com/item?id=20490964
        
       | londons_explore wrote:
        | Bonus points if you can effectively implement the "copy on
        | write" ability of the Linux kernel, sending over only those
        | pages that are changed in either the local or remote fork, or
        | read in the remote fork.
       | 
        | An rsync-like diff algorithm might also substantially reduce
       | copied pages if the same or a similar process is teleforked
       | multiple times.
       | 
       | Many processes have a lot of memory which is never read or
       | written, and there's no reason that should be moved, or at least
       | no reason it should be moved quickly.
       | 
       | Using that, you ought to be able to resume the remote fork in
       | milliseconds rather than seconds.
       | 
       | userfaultfd() or mapping everything to files on a FUSE filesystem
       | both look like promising implementation options.
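        | 
        | For the userfaultfd() route, the skeleton is roughly the one
        | from the userfaultfd(2) man page (condensed and untested; it
        | needs Linux >= 4.3, -pthread, and possibly the
        | vm.unprivileged_userfaultfd sysctl on newer kernels; a real
        | telefork's handler would fetch each page from the origin
        | machine instead of filling in a marker byte):
        |         #include <fcntl.h>
        |         #include <linux/userfaultfd.h>
        |         #include <poll.h>
        |         #include <pthread.h>
        |         #include <stdio.h>
        |         #include <string.h>
        |         #include <sys/ioctl.h>
        |         #include <sys/mman.h>
        |         #include <sys/syscall.h>
        |         #include <unistd.h>
        |         
        |         static int uffd;
        |         static long pg;
        |         
        |         static void *pager(void *arg) {
        |             for (;;) {
        |                 struct pollfd pfd = { .fd = uffd, .events = POLLIN };
        |                 struct uffd_msg msg;
        |                 poll(&pfd, 1, -1);
        |                 if (read(uffd, &msg, sizeof msg) != sizeof msg)
        |                     continue;
        |                 if (msg.event != UFFD_EVENT_PAGEFAULT)
        |                     continue;
        |                 /* Stand-in for "fetch this page over the
        |                    network"; assumes pg <= 64 KiB. */
        |                 static char page[65536];
        |                 memset(page, 0x42, pg);
        |                 struct uffdio_copy cp = {
        |                     .dst = msg.arg.pagefault.address & ~(pg - 1),
        |                     .src = (unsigned long)page,
        |                     .len = pg,
        |                 };
        |                 ioctl(uffd, UFFDIO_COPY, &cp);
        |             }
        |             return NULL;
        |         }
        |         
        |         int main(void) {
        |             pg = sysconf(_SC_PAGESIZE);
        |             uffd = syscall(SYS_userfaultfd,
        |                            O_CLOEXEC | O_NONBLOCK);
        |             struct uffdio_api api = { .api = UFFD_API };
        |             ioctl(uffd, UFFDIO_API, &api);
        |         
        |             /* Region whose pages are served on demand. */
        |             char *mem = mmap(NULL, 16 * pg,
        |                              PROT_READ | PROT_WRITE,
        |                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        |             struct uffdio_register reg = {
        |                 .range = { .start = (unsigned long)mem,
        |                            .len = 16 * pg },
        |                 .mode = UFFDIO_REGISTER_MODE_MISSING,
        |             };
        |             ioctl(uffd, UFFDIO_REGISTER, &reg);
        |         
        |             pthread_t t;
        |             pthread_create(&t, NULL, pager, NULL);
        |             /* First touch faults; pager fills the page in. */
        |             printf("byte 0 = %#x\n", mem[0]);
        |             return 0;
        |         }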
        
         | trishume wrote:
          | I do in fact mention this idea in the article. Indeed,
          | userfaultfd was added to the kernel so that CRIU and KVM live
          | migration could implement exactly this.
         | 
         | Another cool project that does something like this is
         | https://github.com/gamozolabs/chocolate_milk which is a fuzzing
         | hypervisor kernel which can back a VM snapshot memory mapping
         | over the network to only pull down the pages that the VM
         | actually reads during the fuzz case.
        
         | mlyle wrote:
         | If you just pull things on demand, you're going to get a lot of
         | round-trip-time penalties to page things in.
         | 
         | I think you should still be pushing the memory as fast as you
         | can, but maybe you start the child while this is still in
         | progress, and prioritize sending stuff the child asks for
         | (reorder to send that stuff "next"), if you've not already sent
         | it.
        
           | lunixbochs wrote:
           | That's a great idea. One of my thoughts was to "pre-heat" the
           | process by executing a bit locally with side effects disabled
           | to see what would get immediately accessed and send that
           | first.
           | 
           | If your systems strictly match somehow (machine image with
           | auto update disabled? or regularly hash and timestamp files
           | on both systems) you can also cheat by mapping some of the
           | files locally on the other side.
        
           | trishume wrote:
            | Yah, that is indeed a super important optimization for
            | avoiding round trips. CRIU does this and calls it "pre-
            | paging"; their wiki also mentions that they adapt their
            | page streaming to try to pre-stream pages around pages that
            | have been faulted:
            | https://en.wikipedia.org/wiki/Live_migration#Post-copy_memor...
            | 
            | edit: lol, I didn't realize that isn't CRIU's wiki, since
            | they just linked to a Wikipedia page and both use MediaWiki
            | software. This is the actual CRIU wiki page, and it's way
            | harder to tell if they do this, although I suspect they do
            | and it's in the "copy images" step of the diagram:
            | https://criu.org/Userfaultfd
        
         | [deleted]
        
         | touisteur wrote:
          | If you're at the hypervisor level you can also use Intel PML
          | (Page Modification Logging); it seems it's made for this:
          | https://arxiv.org/abs/2001.09991
         | 
          | I'm guessing @gamozolabs (twitter) is going to use it some day
          | soon to fuzz billions of cases per second, with snapshot
          | fuzzing...
        
       | anticensor wrote:
       | hoard() would be a better name.
        
         | jabedude wrote:
         | I don't understand, could you explain? `tfork(2)` seems more in
         | line with Linux naming conventions.
        
           | anticensor wrote:
            | Telefork as a verb does not map to a real-world concept.
            | Hoard fits better, as in hoarding another computer for your
            | process.
        
             | jabedude wrote:
             | That is funny. hoard() and steal() could work.
        
               | anticensor wrote:
               | Agree, steal() instead of telepad() would be a good idea.
        
             | carapace wrote:
             | Ha! I thought you had misspelled _horde!_ ;-)
        
               | buckminster wrote:
               | I think you mishurd.
        
       | dreamcompiler wrote:
       | Telescript [0] is based on this idea, although at a higher level.
       | I wish we could just build Actor-based operating systems and then
       | we wouldn't need to keep reinventing flexible distributed
       | computation, but alas...[1]
       | 
       | [0]
       | https://en.wikipedia.org/wiki/Telescript_(programming_langua...
       | 
       | [1] Yes I know Erlang exists. I wish more people would use it.
        
       ___________________________________________________________________
       (page generated 2020-04-26 23:00 UTC)