[HN Gopher] We used Elixir's Observer to hunt down bottlenecks
       ___________________________________________________________________
        
       We used Elixir's Observer to hunt down bottlenecks
        
       Author : todsacerdoti
       Score  : 107 points
       Date   : 2022-08-23 16:00 UTC (7 hours ago)
        
 (HTM) web link (blog.sequin.io)
 (TXT) w3m dump (blog.sequin.io)
        
       | dminor wrote:
       | Sounds like there are some very nice observability features built
       | into BEAM. I wish NodeJS had something similar!
        
         | lliamander wrote:
         | The BEAM is really cool, and was actually originally intended
         | to be a bare-metal operating system. That's why it has so many
         | features that are useful for operations: they couldn't assume
         | you'd have any other tooling available, and often didn't even
         | have physical access to the machines that were running it.
        
       | davidw wrote:
       | > Second, we passed one particularly large data structure from a
       | manager to a pool of dedicated worker processes. This meant we
       | were reincurring the memory cost of this data structure for each
       | worker process. We couldn't eliminate the repetition, but
       | reducing the data to its bare essentials before passing it down
       | to the workers minimizes that cost.
       | 
       | Hard to say without knowing much about the data in question, but
       | my recollection is that large Erlang/Elixir/BEAM "binaries" are
       | actually not copied around. That might be a strategy for sharing
       | larger things in some cases.
       | 
       | Marshalling data is pretty easy in Erlang:
       | 
       |     2> Bin = erlang:term_to_binary([1, 2, 3]).
       |     <<131,107,0,3,1,2,3>>
       |     3> erlang:binary_to_term(Bin).
       |     [1,2,3]
        
         | realcorvus wrote:
         | If the data does not change, persistent_term is useful as well
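To make the `persistent_term` suggestion concrete, here is a minimal Elixir sketch. The module name, key, and sample data are hypothetical; only `:persistent_term.put/2` and `:persistent_term.get/2` are real OTP calls. Reads do not copy the stored term onto the caller's heap, which is what makes this attractive for a large structure shared by many workers.

```elixir
# Hypothetical module, key, and data; only :persistent_term is a real OTP API.
defmodule SharedConfig do
  @key {__MODULE__, :lookup_table}

  # Store the term once, for example at application start.
  def store(table) when is_map(table) do
    :persistent_term.put(@key, table)
  end

  # Any process can read the term without copying it onto its own heap.
  def fetch(id) do
    @key
    |> :persistent_term.get(%{})
    |> Map.fetch(id)
  end
end

SharedConfig.store(%{1 => "alpha", 2 => "beta"})
{:ok, "alpha"} = SharedConfig.fetch(1)
:error = SharedConfig.fetch(3)
```

Replacing or erasing a key triggers a global garbage-collection pass over all processes, so this approach fits write-once, read-many data rather than anything that changes often.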
        
       | conradfr wrote:
       | A related anecdote: some months ago I had a memory leak inside a
       | (heavily duplicated) GenServer that repeatedly called a lib[0]
       | function, which would basically crash the server after a while.
       | 
       | I never understood what in that lib was causing the leak, but I
       | fixed it (or more accurately mitigated it) by wrapping the call
       | in Task.async/1.
       | 
       | Maybe that will help someone else one day.
       | 
       | [0] https://hexdocs.pm/shoutcast/Shoutcast.html#read_meta/1
        
         | filmor wrote:
         | It was probably leaking refc binaries, see for example
         | https://ferd.github.io/recon/recon.html#bin_leak-1.
         | 
         | Running the function (which probably parses large binaries) in
         | a separate process ensures that it's properly garbage collected
         | in time.
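A rough sketch of that mitigation, assuming the leak really is reference-counted binaries held by a long-lived process. `MetaReader`, `parse_meta/1`, and the `fetch_fun` argument are hypothetical stand-ins, not the Shoutcast library's API. The large binary is both produced and parsed inside a short-lived task, so the long-lived caller never holds a reference to it.

```elixir
defmodule MetaReader do
  # Hypothetical stand-in for a library call that reads a large binary
  # and returns a small slice of it.
  def parse_meta(<<title::binary-size(8), _rest::binary>>), do: title

  # Fetch and parse inside a short-lived process. When the task exits, its
  # heap and its references to large binaries are released immediately.
  def read_isolated(fetch_fun) when is_function(fetch_fun, 0) do
    task =
      Task.async(fn ->
        fetch_fun.()
        |> parse_meta()
        # Copy the slice so the result does not keep the large binary alive.
        |> :binary.copy()
      end)

    Task.await(task, 5_000)
  end
end

# The payload is built inside the task, not in the calling process.
"My Title" =
  MetaReader.read_isolated(fn -> "My Title" <> :binary.copy(<<0>>, 1_000_000) end)
```

The `:binary.copy/1` step matters: a sub-binary returned as-is still references the original large binary, and would keep it alive in the caller.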
        
           | conradfr wrote:
           | Interesting thanks.
           | 
           | Yes that could be it.
        
       | austinjp wrote:
       | So, the graphic at the top of the article (on mobile) is AI-
       | generated, right? The character's fingers are smooshed.
       | 
       | Interesting to see this approach to article graphics after I
       | first read about it on HN recently.
        
         | _acco wrote:
         | It is. DALL-E did the heavy lifting, I tweaked with Photoshop.
         | 
         | > Painting of a detective from the 1800s, portrait, looking at
         | a magnifying glass at a computer monitor, digital art
        
       | cpursley wrote:
       | Sequin is really cool! Are y'all listening to the Postgres WAL?
        
         | _acco wrote:
         | Thanks! We considered using Postgres' WAL but decided not to
         | for the time being.
         | 
         | Our solution now uses trigger functions. These trigger
         | functions fire whenever a create/update/delete happens on a
         | Sequin table. They insert a row into a log table. That log
         | table is processed by our workers to send changes to the
         | upstream API.
         | 
         | The advantages of using trigger functions + a log table are all
         | about ease of use and compatibility: our customers don't have
         | to do anything fancy to set up Sequin, we just need a role with
         | `create` privileges in the database. The log table also makes
         | it easy for both them and us to debug issues, as the stream of
         | changes that we captured is right there in the database.
        
           | cpursley wrote:
           | Very cool.
           | 
           | I'm using Elixir to listen to change events via
           | https://github.com/cpursley/walex (which I basically ripped
           | off from Supabase).
        
       | losvedir wrote:
       | This is really cool. We use Elixir at work, but we mostly use it
       | in a "traditional web app" (i.e. non-Elixir) way, of Docker
       | containers deployed to independent AWS instances.
       | 
       | So I'm always intrigued by some of the more BEAM-specific things
       | that folks do, like using `observer` on a remote (production??)
       | node here, or distributed Elixir where the nodes communicate with
       | each other, or "hot" code updates.
       | 
       | How do companies deploy Elixir so they can take advantage of all
       | those things? Does Sequin talk anywhere about their deploy
       | process and what their infrastructure looks like?
        
         | mattbaker wrote:
         | For us we have our app deployed to $N containers with a load
         | balancer in front (pretty standard stuff I think?)
         | 
         | In Erlang/Elixir you can actually override how instances of the
         | BEAM find each other (instead of relying on standard EPMD), so
         | we have a module that does some DNS queries, finds the IPs of
         | the other containers and says "hi, here's your cluster,
         | discovery done." (Your setup may preclude all that, I know this
         | all depends on how a system's architected.)
         | 
         | After doing that we were free to use all of Erlang's cool
         | cluster stuff! In our case we have in-memory caches for a few
         | things. If a given instance does a lookup because of a cache
         | miss, it broadcasts a message to all the other nodes saying "I
         | just looked up $expensive_thing, here's its value" so they
         | don't have to do the lookup themselves; they just cache that
         | value. You end up with a little distributed cache in a few
         | lines of code. In our case, btw, these cache entries are
         | short-lived, and a little inconsistency does us no harm if one
         | of our instances misses the message (networks are networks).
         | It's been great!
         | 
         | Anyway, I think it's super cool and I'd encourage you to play
         | around if you get the chance.
         | 
         | Also the observer is just amazing. We've debugged some pretty
         | weird memory and CPU usage issues with it. I have some internal
         | blog posts; maybe I should see if I could make them public.
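The broadcast-on-miss cache described above could look roughly like the sketch below. It is an illustration under stated assumptions, not mattbaker's implementation: the `BroadcastCache` module, the 30-second TTL, and the message shapes are all hypothetical. The one real building block is `GenServer.abcast/3`, which casts to a locally registered name on each listed node without waiting for a reply.

```elixir
defmodule BroadcastCache do
  use GenServer

  # Hypothetical TTL: entries expire quickly, so a missed broadcast only
  # causes short-lived inconsistency.
  @ttl_ms 30_000

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, Keyword.put_new(opts, :name, __MODULE__))
  end

  # Return the cached value, or compute it and tell every connected node.
  def fetch(key, compute_fun) when is_function(compute_fun, 0) do
    case GenServer.call(__MODULE__, {:get, key}) do
      {:ok, value} ->
        value

      :miss ->
        value = compute_fun.()
        # Fire-and-forget cast to the process registered as BroadcastCache
        # on this node and on all connected nodes; delivery is not guaranteed.
        GenServer.abcast([node() | Node.list()], __MODULE__, {:put, key, value})
        value
    end
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:get, key}, _from, state) do
    now = System.monotonic_time(:millisecond)

    case state do
      %{^key => {value, expires_at}} when expires_at > now -> {:reply, {:ok, value}, state}
      _ -> {:reply, :miss, state}
    end
  end

  @impl true
  def handle_cast({:put, key, value}, state) do
    expires_at = System.monotonic_time(:millisecond) + @ttl_ms
    {:noreply, Map.put(state, key, {value, expires_at})}
  end
end

{:ok, _pid} = BroadcastCache.start_link()
BroadcastCache.fetch(:expensive_thing, fn -> 42 end)
```

Nothing here retries or acknowledges delivery, which matches the tolerance for inconsistency described in the comment; a cache that needs stronger guarantees would need a different design.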
        
           | JohnCurran wrote:
           | Can you speak more to how you bypass EPMD and send the IPs of
           | the containers to each other? That would be great for a
           | problem we're seeing where I work
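One common way to do the discovery half is to resolve a DNS name to the peers' IPs and connect to each one. The sketch below is an assumption about the general shape, not mattbaker's module: `DnsDiscovery`, the query name, and the `app` basename are hypothetical. It also does not replace EPMD. `Node.connect/1` still asks EPMD on the peer for the distribution port, and swapping EPMD out entirely needs a custom module passed via the `-epmd_module` emulator flag, which is not shown here.

```elixir
defmodule DnsDiscovery do
  # Hypothetical helper. `query` is a DNS name whose A records list the peer
  # containers; `basename` is the part of the node name before the "@".

  # Resolve the A records for `query` into IPv4 tuples such as {10, 0, 0, 1}.
  def resolve(query) when is_binary(query) do
    :inet_res.lookup(String.to_charlist(query), :in, :a)
  end

  # Turn IP tuples into node names such as :"app@10.0.0.1".
  def node_names(ips, basename) do
    for ip <- ips do
      :"#{basename}@#{:inet.ntoa(ip)}"
    end
  end

  # Try to connect to every peer except ourselves; returns {node, result}.
  def connect_all(query, basename) do
    query
    |> resolve()
    |> node_names(basename)
    |> Enum.reject(&(&1 == node()))
    |> Enum.map(&{&1, Node.connect(&1)})
  end
end

DnsDiscovery.node_names([{10, 0, 0, 1}, {10, 0, 0, 2}], "app")
```

For this to work, each node has to be started with a matching long name (for example `app@10.0.0.1`) and the same cookie, and `connect_all/2` would typically be called periodically so that new containers are picked up.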
        
         | cpursley wrote:
         | Distributed Elixir on Render is crazy easy. Fly.io also looks
         | neat.
        
         | lycos wrote:
         | Distributed Elixir can be done with Docker containers too, see
         | https://github.com/bitwalker/libcluster which by default has
         | some Kubernetes support but you can also have third party (or
         | custom) clustering strategies. I've not done this myself but
         | I've seen a lot of articles about it over the past few years.
         | 
         | Hot code updates for most applications aren't really worth it
         | in my opinion, assuming you do something like blue/green
         | rollover deployments. It's cool that it's possible though. But
         | it requires appup files and afaik Distillery is one of the
         | release tools that has support for it built-in.
        
         | ranyefet wrote:
         | If you deploy to fly.io it should be very easy to create a
         | cluster of Elixir nodes.
        
       | conradfr wrote:
       | I think the screenshot under the "Memory" section is not the
       | correct one.
        
         | _acco wrote:
         | Fixed, thanks!
        
       | ananthakumaran wrote:
       | recon and observer_cli are the tools I reach for first to debug
       | any issues in production. In any other language, I usually think
       | about how to reproduce the issue locally. With Elixir, I just get
       | into a remote shell on the affected machine and live debug the
       | issue, and there are cases where we applied a hotfix by using
       | eval right there from the shell. The idea of the remote shell
       | itself is alien to most languages.
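As a taste of what that kind of live inspection looks like, here is a small standard-library sketch that lists the processes using the most memory, roughly the first thing Observer's process view or recon's `proc_count` would surface. `TopProcs` is a hypothetical helper and deliberately uses neither recon nor observer_cli; in a remote shell you would more likely call those tools directly.

```elixir
defmodule TopProcs do
  # Return the `n` processes using the most memory as {pid, bytes, name}.
  def by_memory(n \\ 5) do
    Process.list()
    |> Enum.flat_map(fn pid ->
      # Process.info/2 returns nil if the process exited in the meantime.
      case Process.info(pid, [:memory, :registered_name]) do
        nil -> []
        info -> [{pid, info[:memory], info[:registered_name]}]
      end
    end)
    |> Enum.sort_by(fn {_pid, bytes, _name} -> bytes end, :desc)
    |> Enum.take(n)
  end
end

TopProcs.by_memory(3)
```

Unregistered processes report `[]` as their name, so the third tuple element is only useful for named processes.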
        
         | busterarm wrote:
         | And unfortunately the kind of thing that compliance flags as a
         | big no-no once you've got any kind of filing or privacy
         | requirements.
        
           | jon-wood wrote:
           | This sort of thing doesn't have to be a compliance breach,
           | but you will likely need some way of ensuring there's a
           | second person in the loop. Typically that would take the form
           | of having someone in a separate production infrastructure
           | team actually driving while you talk them through what
           | needs to happen.
        
             | busterarm wrote:
             | Yes and with the added benefit of having to explain that
             | control to your rotating bunch of compliance people every
             | single year.
             | 
             | I'm not criticizing the methodology as much as the useless
             | performative nature of compliance work.
        
               | d4mi3n wrote:
               | Compliance is performative until it isn't. If you've ever
               | been party to a breach, the role of compliance and an
               | audit trail to the security narrative becomes _very_
               | important. Consider:
               | 
               | 1. We had a breach. A factor in this was insufficient
               | oversight on a process that granted privileged access to
               | customer data. We fixed the problem, promise that your
               | data is safe, and don't believe this will happen again.
               | 
               | 2. We had a breach. A factor in this was a gap in an
               | existing control around customer data that had a
               | problem we had not anticipated. These were the people
               | involved. This is exactly how this problem occurred. This
               | is the data that was exposed. This is documentation of
               | our response to this incident. This is our existing
               | policy around how we handle data and how we respond to
               | breaches.
               | 
               | Customers, partners, regulators, and law enforcement
               | respond a lot better when you can demonstrate good intent
               | and at least imply that you have some kind of process. Of
               | the two scenarios I outlined, the latter provides those
               | assurances.
               | 
               | Compliance isn't the only way to do this, but it's often
               | the easiest.
        
           | mattbaker wrote:
           | Still wildly useful debugging things locally too!
        
       ___________________________________________________________________
       (page generated 2022-08-23 23:00 UTC)