dataswamp.org

       Title: Managing a fleet of NixOS Part 1 - Design choices
       Author: Solène
       Date: 02 September 2022
       Tags: bento nixos nix
       Description: In this series of articles, I'll explain my steps toward
       designing an infrastructure to centrally manage a fleet of NixOS
       systems.
       
       # Introduction
       
       I have a grand project in mind, and I need to think about it before
       starting any implementation.  The blog is the right place for me to
       explain what I want to do and the different solutions.
       
       It's related to NixOS.  I would like to ease the management of a fleet
       of NixOS workstations that could be anywhere.
       
       This could be useful for companies using NixOS for their employees, to
       manage all the workstations remotely, but also for people who may
       manage NixOS systems in various places (cloud, datacenter, house,
       family computers).
       
       With central management, it makes sense not to give your users root
       access.  They would have to call their technical support to ask for
       a change, and their system could be updated quickly to reflect the
       request.  This can be super useful for remote family computers when
       they need an extra program that isn't currently installed, and you
       took the responsibility of handling their system...
       
       With NixOS, this setup totally makes sense: you can potentially
       reproduce users' bugs as you have their configuration, stage new
       changes for testing, and users can roll back to a previous working
       state in case of a big regression.
       
       The Cachix company made this possible before I figured out a
       solution.  It's still not too late to propose an open source
       alternative.
       
 (HTM) Cachix Deploy
       
       # Defining the project
       
       The purpose of this project is to have a central management system on
       which you keep the configuration files for all the NixOS systems, and
       which allows the administrator to make the remote systems pick up a
       new configuration as soon as possible when required.
       
       We can imagine three different implementations at the highest level:
       
       * a scheduled job on each machine looking for changes in the source. 
       The source could be a git repository, a tarball or anything that could
       be used to carry the configuration.
       * NixOS systems could connect to something like a pub/sub and wait for
       an event from the central management to trigger a rebuild, the event
       may or may not contain information / sources.
       * the central management system could connect to the remote NixOS to
       trigger the build / push the build
       
       These designs all have pros and cons.  Let's look at them in more
       detail.
       
       ## Solution 1 - Scheduled job
       
       In this scenario, the NixOS system would use a cron job or a systemd
       timer to periodically check for changes and trigger the update.
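
       The check performed at each scheduled run could be sketched as
       below.  This is only an illustration under my own assumptions:
       `check_for_update`, `run_once` and the callbacks are hypothetical
       names, and the "revision" could be a git commit id, a tarball
       checksum or anything else identifying the configuration.

```python
from typing import Callable, Optional


def check_for_update(
    fetch_remote_revision: Callable[[], Optional[str]],
    current_revision: str,
) -> Optional[str]:
    """Return the revision to apply, or None if there is nothing to do."""
    remote = fetch_remote_revision()
    if remote is None:
        # Source unreachable: do nothing, the next scheduled run retries.
        return None
    if remote == current_revision:
        # Already up to date.
        return None
    return remote


def run_once(
    fetch_remote_revision: Callable[[], Optional[str]],
    current_revision: str,
    apply_update: Callable[[str], None],
) -> str:
    """One scheduled run: check, apply if needed, return the active revision."""
    new_revision = check_for_update(fetch_remote_revision, current_revision)
    if new_revision is None:
        return current_revision
    apply_update(new_revision)  # hypothetical hook, e.g. a rebuild
    return new_revision
```

       The timer would simply call `run_once` with a function that reads
       the revision from the source; an unreachable source is not an
       error, the run is just skipped.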
       
       ### Pros
       
       * low maintenance
       * could interactively ask the user when they want to upgrade if not now
       
       ### Cons
       
       * may not run at all if the system is not up at the correct time, or
       could run late depending on the situation
       * can't force an update as soon as possible
       * not really bandwidth-efficient if you poll often
       * no feedback to the central management about which systems received
       or applied the update (except by adding a call to the server?)
       
       ## Solution 2 - Remote systems are listening for changes (publisher / subscriber)
       
       In this scenario, the NixOS system would always be connected to the
       central management, using some kind of protocol like MQTT, BOSH or
       similar.
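
       To make the idea more concrete, here is a minimal in-memory sketch
       of the publisher / subscriber flow with acknowledgments.  It does
       not use MQTT or any real protocol: the `Broker` class and all its
       methods are hypothetical, and a callback returning True stands for
       a system acknowledging the event.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Toy stand-in for the central management side of a pub/sub."""

    subscribers: dict = field(default_factory=dict)   # host -> callback
    acknowledged: dict = field(default_factory=dict)  # event id -> set of hosts

    def subscribe(self, host, callback):
        # A connected system registers a callback(event_id, payload) -> bool.
        self.subscribers[host] = callback

    def unsubscribe(self, host):
        self.subscribers.pop(host, None)

    def online_hosts(self):
        # The server knows which systems are currently connected.
        return sorted(self.subscribers)

    def publish(self, event_id, payload):
        # Deliver the event to every connected host and record the acks.
        acked = self.acknowledged.setdefault(event_id, set())
        for host, callback in list(self.subscribers.items()):
            if callback(event_id, payload):
                acked.add(host)
        return sorted(acked)

    def pending(self, event_id, expected_hosts):
        # Hosts that have not acknowledged the event (offline or failed).
        return sorted(set(expected_hosts) - self.acknowledged.get(event_id, set()))
```

       With something like this, the central management can list the
       connected systems with `online_hosts()` and use `pending()` to see
       which ones still need the update.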
       
       ### Pros
       
       * you know which systems are up
       * events from the central management are delivered instantly and
       should be acknowledged by each system
       * updates should propagate very quickly
       * could interactively ask the user when they want to upgrade if not now
       
       ### Cons
       
       * this can lead to a privacy issue as you know when each host is
       connected
       * this adds complexity to the server
       * this adds complexity on each client
       * firewalls usually don't like long-lived connections, an HTTPS-based
       solution would help to get through them
       
       ## Solution 3 - The central management pushes the updates to the remote systems
       
       In this scenario, the NixOS system would be reachable over a protocol
       that allows running commands, like SSH.  The central management
       system would run a remote upgrade on it, or push the changes using
       tools like deploy-rs, colmena, morph or similar...
       
 (HTM) Awesome-nix list: deployment-tools
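
       The simplest form of push, running the upgrade over SSH, could look
       like the sketch below.  `nixos-rebuild switch` is the real command,
       everything else is an assumption of mine: the function names, the
       `root` user, the `run` callback, and the fact that the new
       configuration is already available on the remote system.  The
       deployment tools listed above have their own interfaces, which are
       not shown here.

```python
def rebuild_command(host, user="root"):
    """Build the SSH command asking a remote system to rebuild itself."""
    return ["ssh", f"{user}@{host}", "nixos-rebuild", "switch"]


def push_update(hosts, run):
    """Run the rebuild on every host and report success per host.

    `run` takes a command (list of strings) and returns its exit code,
    for instance: lambda cmd: subprocess.run(cmd).returncode
    """
    results = {}
    for host in hosts:
        results[host] = run(rebuild_command(host)) == 0
    return results
```

       Passing `run` as a parameter keeps the sketch testable without a
       network, and the returned dictionary gives the central management
       immediate feedback about which systems were updated.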
       
       ### Pros
       
       * update is immediate
       * SSH could be exposed over Tor or I2P for maximum firewall bypassing
       capability
       
       ### Cons
       
       * offline systems may be complicated to update, you would need to try
       to connect to them often until they are reachable
       * you can connect to the remote machine and potentially spy on the
       user.  In the alternatives above, you can potentially achieve the
       same by reconfiguring the computer to allow this, but it would have
       to be done on purpose
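
       The "try often until they are reachable" part could be sketched as
       a simple retry loop.  This is a hypothetical illustration:
       `retry_until_reachable`, `try_update` and `wait` are names I made
       up, and `try_update(host)` is assumed to return True when the
       system was reached and updated.

```python
def retry_until_reachable(hosts, try_update, max_rounds, wait=lambda round_number: None):
    """Try to update each host, retrying the unreachable ones each round.

    Returns (updated, pending): the hosts updated, in the order they
    succeeded, and the hosts still unreachable after max_rounds.
    """
    pending = list(hosts)
    updated = []
    for round_number in range(max_rounds):
        still_pending = []
        for host in pending:
            if try_update(host):
                updated.append(host)
            else:
                still_pending.append(host)
        pending = still_pending
        if not pending:
            break
        if round_number < max_rounds - 1:
            # e.g. sleep a few minutes before trying the offline hosts again
            wait(round_number)
    return updated, pending
```

       In practice `wait` would sleep between rounds, and the hosts left
       in `pending` would be kept for a later attempt.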
       
       # Making a choice
       
       I tried to state the pros and cons of each setup, but I can't see a
       clear winner.  However, I'm not convinced by Solution 1 as you don't
       have any feedback from or direct control over the systems, so I
       prefer to abandon it.
       
       Solutions 2 and 3 are still in the competition, we basically ended up
       with a choice between a PUSH and a PULL workflow.
       
       # Conclusion
       
       In order to choose between 2 and 3, I will need to experiment with the
       Solution 2 technologies as I have never used them (MQTT, RabbitMQ,
       BOSH etc…).