[HN Gopher] We used Elixir's Observer to hunt down bottlenecks ___________________________________________________________________ We used Elixir's Observer to hunt down bottlenecks Author : todsacerdoti Score : 107 points Date : 2022-08-23 16:00 UTC (7 hours ago) (HTM) web link (blog.sequin.io) (TXT) w3m dump (blog.sequin.io) | dminor wrote: | Sounds like there are some very nice observability features built | into BEAM. I wish NodeJS had something similar! | lliamander wrote: | The BEAM is really cool, and was actually originally intended | to be a bare-metal operating system. That's why it has so many | features that are useful for operations: they couldn't assume | you'd have any other tooling available, and often didn't even | have physical access to the machines that were running it. | davidw wrote: | > Second, we passed one particularly large data structure from a | manager to a pool of dedicated worker processes. This meant we | were reincurring the memory cost of this data structure for each | worker process. We couldn't eliminate the repetition, but | reducing the data to its bare essentials before passing it down | to the workers minimizes that cost. | | Hard to say without knowing much about the data in question, but | my recollection is that large Erlang/Elixir/BEAM "binaries" are | actually not copied around. That might be a strategy for sharing | larger things in some cases. | | Marshalling data is pretty easy in Erlang: 2> | Bin = erlang:term_to_binary([1, 2, 3]). | <<131,107,0,3,1,2,3>> 3> erlang:binary_to_term(Bin). | [1,2,3] | realcorvus wrote: | If the data does not change, persistent_term is useful as well | conradfr wrote: | A related anecdote: some months ago I had a memory leak inside a | (greatly duplicated) genserver while repeatedly calling a lib[0] | function inside it, that would result in the server basically | crashing after a while. | | I never understood what in that lib was causing the leak but I | fixed it (or more accurately mitigated it) by wrapping the call | in a Task.async/1 | | Maybe that will help someone else one day. | | [0] https://hexdocs.pm/shoutcast/Shoutcast.html#read_meta/1 | filmor wrote: | It was probably leaking refc binaries, see for example | https://ferd.github.io/recon/recon.html#bin_leak-1. | | Running the function (which probably parses large binaries) in | a separate process ensures that it's properly garbage collected | in time. | conradfr wrote: | Interesting thanks. | | Yes that could be it. | austinjp wrote: | So, the graphic at the top of the article (on mobile) is AI- | generated, right? The character's fingers are smooshed. | | Interesting to see this approach to article graphics after I | first read about it on HN recently. | _acco wrote: | It is. Dall-e did the heavy lifting, I tweaked with Photoshop | | > Painting of a detective from the 1800s, portrait, looking at | a magnifying glass at a computer monitor, digital art | [deleted] | cpursley wrote: | Sequin is really cool! Are y'all listening postgres WAL? | _acco wrote: | Thanks! We considered using Postgres' WAL but decided not to | for the time being. | | Our solution now uses trigger functions. These trigger | functions fire whenever a create/update/delete happens on a | Sequin table. They insert a row into a log table. That log | table is processed by our workers to send changes to the | upstream API. | | The advantage of using trigger functions + a log table are all | about ease of use and compatibility: our customers don't have | to do anything fancy to setup Sequin, we just need a role with | `create` privileges in the database. The log table also makes | it easy for both them and us to debug issues, as the stream of | changes that we captured is right there in the database. | cpursley wrote: | Very cool. | | I'm using Elixir to listen to change events via | https://github.com/cpursley/walex (which I basically ripped | off from Supabase). | losvedir wrote: | This is really cool. We use Elixir at work, but we mostly use it | in a "traditional web app" (i.e. non-Elixir) way, of Docker | containers deployed to independent AWS instances. | | So I'm always intrigued by some of the more BEAM-specific things | that folks do, like using `observer` on a remote (production??) | node here, or distributed Elixir where the nodes communicate with | each other, or "hot" code updates. | | How do companies deploy Elixir in such a way to take advantage of | all those things? Does Sequin talk anywhere about their deploy | process and how their infrastructure looks? | mattbaker wrote: | For us we have our app deployed to $N containers with a load | balancer in front (pretty standard stuff I think?) | | In Erlang/Elixir you can actually override how instances of the | BEAM find each other (instead of the standard EPMD daemon), so | we have a module that does some DNS queries, finds the IPs of | the other containers and says "hi, here's your cluster, | discovery done." (Your setup may preclude all that, I know this | all depends on how a system's architected.) | | After doing that we were free to use all of Erlang's cool | cluster stuff! In our case we have in-memory caches for a few | things, and if a given instance does a lookup because of a | cache miss it broadcasts a message to all the other nodes | saying "I just looked up $expensive_thing, here's its value" so | they don't have to do the lookup themselves, they just cache | that value, so you end up with a little distributed cache with | a few lines of code. In our case, btw, these cache entries are | short lived and a little inconsistency does us no harm if one | of our instances misses the message, networks are networks, but | it's been great! | | Anyway, I think it's super cool and I'd encourage you to play | around if you get the chance. | | Also the observer is just amazing. We've debugged some pretty | weird memory and cpu usage issues with it, I have some internal | blog posts, maybe I should see if I could make them public. | JohnCurran wrote: | Can you speak more to how you bypass EPMD and send the IPs of | the containers to each other? That would be great for a | problem we're seeing where I work | cpursley wrote: | Distributed Elixir on Render is crazy easy. Fly.io also looks | neat. | lycos wrote: | Distributed Elixir can be done with Docker containers too, see | https://github.com/bitwalker/libcluster which by default has | some Kubernetes support but you can also have third party (or | custom) clustering strategies. I've not done this myself but | I've seen articles about this a lot during the past years. | | Hot code updates for most applications aren't really worth it | in my opinion, assuming you do something like blue/green | rollover deployments. It's cool that it's possible though. But | it requires appup files and afaik Distillery is one of the | release tools that has support for it built-in. | ranyefet wrote: | If you deploy to fly.io it should be very easy to create a | cluster of elixir nodes. | conradfr wrote: | I think the screenshot under the "Memory" section is not the | correct one. | _acco wrote: | Fixed, thanks! | ananthakumaran wrote: | recon and observer_cli are the tools I reach out first to debug | any issues in production. In any other language, I usually think | about how to reproduce the issue locally. With Elixir, I just get | into a remote shell in the affected machine and live debug the | issue, and there are cases where we applied hotfix by using eval | right there from the shell. The idea of the remote shell itself | is alien to most languages. | busterarm wrote: | And unfortunately the kind of thing that compliance flags as a | big no-no once you've got any kind of filing or privacy | requirements. | jon-wood wrote: | This sort of thing doesn't have to be a compliance breach, | but you will likely need some way of ensuring there's a | second person in the loop, typically that would take the form | of having someone in a separate production infrastructure | team actually driving a while you talk them through what | needs to happen. | busterarm wrote: | Yes and with the added benefit of having to explain that | control to your rotating bunch of compliance people every | single year. | | I'm not criticizing the methodology as much as the useless | performative nature of compliance work. | d4mi3n wrote: | Compliance is performative until it isn't. If you've ever | been party to a breach, the role of compliance and an | audit trail to the security narrative becomes _very_ | important. Consider: | | 1. We had a breach. A factor in this was insufficient | oversight on a process that granted privileged access to | customer data. We fixed the problem, promise that your | data is safe, and don't believe this will happen again. | | 2. We had a breach. A factor in this was due to a gap in | an existing control around customer data that had a | problem we had not anticipated. These were the people | involved. This is exactly how this problem occurred. This | is the data that was exposed. This is documentation of | our response to this incident. This is our existing | policy around how we handle data and how we respond to | breaches. | | Customers, partners, regulators, and law enforcement | respond a lot better when you can demonstrate good intent | and at least imply that you have some kind of process. Of | the two scenarios I outlined, the latter provides those | assurances. | | Compliance isn't the only way to do this, but it's often | the easiest. | mattbaker wrote: | Still wildly useful debugging things locally too! ___________________________________________________________________ (page generated 2022-08-23 23:00 UTC)