• jhgg 2 hours ago

    When I worked at Discord, we used BEAM hot code loading pretty extensively, built a bunch of tooling around it to apply and track hot-patches to nodes (which in turn could update the code on >100M processes in the system.) It allowed us to deploy hot-fixes in minutes (full tilt deploy could complete in a matter of seconds) to our stateful real-time system, rather than the usual ~hour long deploy cycle. We generally only used it for "emergency" updates though.

    The tooling would let us patch multiple modules at a time, which basically wrapped `:rpc.call/4` and `Code.eval_string/1` to propagate the update across the cluster, which is to say, the hot-patch was entirely deployed over erlang's built-in distribution.

    • davisp 2 hours ago

      This matches my experience. I spent a decade operating Erlang clusters and using hot code upgrades is a superpower for debugging a whole class of hard to track bugs. Although, without the tracking for cluster state it can be its own footgun when a hotpatch gets unpatched during a code deploy.

      As for relups, I once tried starting a project to make them easier but eventually decided that the number of bazookas pointed at each and every toe made them basically a non-starter for anything that isn’t trivial. And if its trivial it was already covered by the nl (network load, send a local module to all nodes in the cluster and hot load it) style tooling.

    • edude03 38 minutes ago

      Nerves and hot code reloading got me into erlang after I watched a demo of patching code on a flying drone ~8 years ago.

      While I can't imagine hot reloading is super practicle in production, it does highlight that erlang/beam/otp has great primitives for building reliable production systems.

      • rozap 3 hours ago

        I used to work on a pretty big elixir project that had many clients with long lived connections that ran jobs that weren't easily resumable. Our company had a language agnostic deployment strategy based on docker, etc which meant we couldn't do hot code updates even though they would have saved our customers some headache.

        Honestly I wish we had had the ability to do both. Sometimes a change is so tricky that the argument that "hot code updates are complicated and it'll cause more issues than it will solve" is very true, and maybe a deploy that forces everyone to reconnect is best for that sort of change. But often times we'd deploy some mundane thing where you don't have to worry about upgrading state in a running gen server or whatever, and it'd be nice to have minimal impact.

        Obviously that's even more complexity piled onto the system, but every time I pushed some minor change and caused a retry that (in a perfect world at least...) didn't need to retry, I winced a bit.

        • ElevenLathe 3 hours ago

          I work in gaming and have experienced the opposite side of this: many of our services have more than one "kind" of update, each with its own caveats and gotchas, so that it takes an expert in the whole system (meaning really almost ALL of our systems) to determine which would be the least impactful possible one if nothing goes wrong. Not only is there a lot of complexity and lost productivity in managing this process ("Are we sure this change is zero downtime-able?" "Does it need a schema reload?" etc) but we often get it wrong. The result is that, in practice, anything even remotely questionable gets done during a full downtime where we kick players out.

          It's sometimes helpful to have the option to just restart one little corner of the full system, to minimize impact, but it is helpful to customer experience (if we don't screw it up) and very much the opposite for developer experience (it's crippling to velocity to need to discuss each change with multiple experts and determine the appropriate type of release).

          • rozap 2 hours ago

            No doubt that traditional deployments are much better for dev experience at (sometimes) the cost of customer experience.

            • ElevenLathe 2 hours ago

              One thing that would help both is deployment automation that could examine the desired changes and work out the best way to deploy them without human input. For distributed systems, this would require rock-solid contracts between individual services for all relevant scenarios, and would also require each update to be specified completely in code (or at least something machine readable), ideally in one commit. This is a level of maturity that seems elusive in gaming.

              • toast0 2 hours ago

                I disagree. Hot loading means I can have a very short cycle on an issue, and move onto something else. Having to think about the implications of hot loading is worth it for the rapid cycle time and not having to hold as many changes in my mind at once.

          • hauxir 3 hours ago

            We use hot code upgrades on kosmi.io with great success.

            It's absolute magic and allows for very rapid development and ease of deploying fixes and updates.

            We do use have to use distillery though and have had to resort to a bunch of custom glue bash scripts which I wish was more standardized because it's such a killer feature.

            Due to Elixirs efficiency, everything is running on a single node despite thousands of concurrents so haven't really experienced how it handles multiple nodes.

            • throwaway81523 2 hours ago

              You have to be very very very careful when preparing relups. The alternative on Linux is to launch an entire new server on the same machine, then transfer the session data and the open sockets to it through IPC. I once asked Joe Armstrong whether this was as good as relups and why Erlang went the relup route. I don't remember the exact words and don't want to misquote him, but he basically said it was fine, and Erlang went with relups and hot patching because transferring connections (I guess they would have been hardware interfaces rather than sockets) wasn't possible when they designed the hot patch system.

              Hot patching is a bit unsatisfying because you are still running the same VM afterwards. WIth socket migration you can launch a new VM if you want to upgrade your Erlang version. I don't know of a way to do it with existing software, but in principle using something like HAProxy with suitable extensions, it should be possible to even migrate connections across machines.

              • toast0 2 hours ago

                State migration is possible, and yeah, if you want to upgrade BEAM, state migration would be effective, whereas hot loading is not. If your VM gets pretty big, you might need to be careful about memory usage though, the donor VM is likely not going to shrink as fast as the heir VM grows. If you were so inclined, C does allow for hot loading too, but I think it'd be pretty hard to bend BEAM into something that you could hot load to upgrade.

                Migrating socket state across machines is possible too, but I don't think it's anywhere close to mainstream. HAProxy is a lovely tool, but I'm pretty sure I saw something in its documentation that explicitly states that sort of thing is out of scope; they want to deal with user level sockets.

                Linux has a TCP Repair feature which can be used as part of socket migration; but you'll also need to do something to forward packets to the new destination. Could be arping for the address from a new machine, or something fancier that can switch proportionally or ??? there's lots of options, depending on your network.

                As much as I'd love to have a use case for TCP migration, it's a little bit too esoteric for me ... reconnecting is best avoided when possible, but I'm counting TCP migration as non-possible for purposes of the rule of thumb.

                • throwaway81523 an hour ago

                  TCP migration on the same machine is real and it's not that big a deal, if that's what you meant by TCP migration. Doing it across machines is at best a theoretical possibility, I would agree. I have been wanting to look into CRIU more carefully, but I believe it uses TCP Repair that you mentioned. I'm unfamiliar with it though.

                  The saying in the Erlang crowd is that a non-distributed system can't be really reliable, since the power cord is a single point of failure. So a non-painful way to migrate across machines would be great. It just hasn't been important enough (I guess) for make anyone willing to deal with the technical obstacles.

                  I wonder whether other OS's have supported anything like that.

                  I worked on a phone switch (programmed in C) a long time ago that let you do both software and hardware upgrades (swap CPU boards etc.) while keeping connections intact, but the hardware was specially designed for that.

                  • toast0 an hour ago

                    > I wonder whether other OS's have supported anything like that.

                    I don't think I've seen it, but I don't see everything, and it'd be pretty esoteric. From my memory of working with the FreeBSD tcp stack, I suspect it wouldn't be too hard to make something like this work there, too; other than the security aspects, but could probably do something like ok to 'repair' a connection that matches a listen socket you also pass or something. But you'd really need the use case to make the hassle worth it, and I don't think most regular server applications are enough to warrant it.

              • whorleater 4 hours ago

                WhatsApp very long ago used to hot reload across all nodes with a ssh script to incrementally deploy during the day

                • robocat 3 hours ago

                  Background to the article: https://underjord.io/unpacking-elixir-iot-embedded-nerves.ht...

                  Seems like they deploy Elixir on embedded Linux. The embedded Linux distro is Nerves which replaces systemd and boots to the BEAM VM instead as process 1, putting Elixir as close to the metal as they can.

                  I know nothing about any of the above (assumption is I'm fool enough to try and simplify) plus I know I've misused the concepts I wrote but that's my point so read the article. All simplifications are salads

                  • aeturnum 2 hours ago

                    IMO Hot Code Updates are a tantalizing tool that can be useful at times but are extremely easy to foot-gun and have little support. I suspect that the reason why no one has built a nice, formal framework for organizing and fanning out hot code changes to erlang nodes is that it's very hard to do well, involves making some educated guesses about the halting problem, and generally doesn't help you much unless you're already in a real bind.

                    Most of the benefits of hot code updates (with better understanding of the boundaries of changes) can be found through judicious rolling restarts that things like k8s make easier these days. Any time you have the capacity to hot patch code on a node, you probably have the capacity to hot patch the node's setup as well.

                    That said I think that someone could use the code reloading abilities of erlang to make a genuinely unparalleled production problem diagnostic toolkit - where you can take apart a problem as it is happening in real time. The same kinds of people who are excited about time traveling debugging should be excited about this imo.

                    • toast0 4 hours ago

                      > Both have described hot code updates as something that people should learn and use. I imagine Whatsapp’s initial engineering crew would agree. They did pretty well.

                      Yeah. Hot loading is clearly better than anything else when you've got a million clients connected and you want to make a code change. Of course, we didn't have any of these fancy 'release' tools, we just used GNU Make to rsync the code to prod and run erlc. Then you can grab a debug shell and l(module). (we did write utilities to see what code was modified, and to provide the right incantations so we wouldn't load if it would kill processes)

                      • rybosome 3 hours ago

                        > Hot loading is clearly better than anything else when you've got a million clients connected and you want to make a code change.

                        In the contexts in which I’ve worked, this was solved by issuing a command to the server to enter a lame-duck mode and stop accepting new connections, then restarting the process with updated code after all existing connections ended.

                        This worked in our case because connections had a TTL with a “reasonable” time, couldn’t have been more than an hour. We could always wait it out.

                        I suppose hot reloading is more necessary when you have connections without a set TTL.

                        • toast0 2 hours ago

                          That way works, but it means you're spending that much more time on a deploy.

                          For a small change, you can hot load your change and be done in minutes. This means you can push several small changes in an hour. Which helps get things done rapidly.

                          It's also really nice to be able to see that the change works as expected (or not) at full load right away. If you've got to wait for connections to accumulate on the new server, that takes longer without hot load too.

                          Some changes can't be effectively hot loaded[1], and for those you do need to do something to kick out users and let them reconnect elsewhere, and you could do all your updates that way, but it means a lot more client time spent reconnecting.

                          On the one hour TTL. Sometimes that's reasonable, but sometimes it's really not. Someone downloading a large file on a slow connection is better served by letting the download continue to trickle for hours than forcing them to reconnect and resume. A real time call is better served by letting it run until the participants are done. For someone on a low end phone, staying connected for as long as they can is probably better than forcing a reconnect where they'll need to generate new ephemeral keys and do a key exchange exercise.

                          [1] At the very least, BEAM updates and kernel changes are much more easily done by restarting. But not all userspace Erlang changes are easy to make hot loadable, either.

                      • arnon 4 hours ago

                        A few years ago, the biggest problem with Erlang's hot code updates was getting the files updated on all of the nodes. Has this been solved or improved in any way?

                        • comboy 3 hours ago

                          I don't think updating files is the problem. The biggest issue with hot code updates seems to be that they can create states that cannot be replicated in either release on its own.

                          • ketralnis 3 hours ago

                            This is my experience. About 25% of the time I'd encounter a bug that's impossible to reproduce without both versions of the code in memory, and end up restarting the node anyway dropping requests in the process. Whereas if I'd have architected around not having hot code updates I could built it in a way that never has to drop requests

                            • faizshah 3 hours ago

                              In general, you can save your team a lot of ops trouble just by periodically restarting your long running services from scratch instead of trying to keep alive a process or container for a long time.

                              I’m still new to the erlang/elixir community and I haven’t run it in prod yet but this is my experience coming from Java, Node, and Python.

                            • toast0 2 hours ago

                              There's about a thousand different ways to update files on servers?

                              You can build os packages, and push those however you like.

                              You can use rsync.

                              You could push the files over dist, if you want.

                              You could probably do something cool with bittorrent (maybe that trend is over?)

                              If you write Makefiles to push, you can use make -j X to get low effort parallelization, which works ok if your node count isn't too big, and you don't need as instant as possible updates.

                              Erlang source and beam files don't tend to get very large. And most people's dist clusters aren't very large either; I don't think I've seen anyone posting large cluster numbers lately, but I'd be surprised if anyone was pushing to 10,000 nodes at once. Assuming they're well connected, pushing to 10,000 nodes takes some prep, but not that much; if you're driving it from your laptop, you probably want an intermediate pusher node in your datacenter, so you can push once from home/office internet to the pusher node, and then fork a bunch of pushers in the datacenter to push to the other hosts. If you've got multiple locations and you're feeling fancy, have a pusher node at each location, push to the pusher node nearest you; that pushes to the node at each location and from there to individual nodes.

                              Other issues are more pressing; like making sure you write your code so it's hotload friendly, and maybe trying to test that to confirm you won't use the immense power of hotloading to very rapidly crash all your server processes.

                              • samgranieri an hour ago

                                I think Twitter once cobbled together a BitTorrent based deployment strategy for Capistrano called murder, that was a cool read from their eng blog back in the day.

                                I wish I had used a pusher node to deploy things when a colleague was using almost all the upstream bandwidth in the office making a video call when my bosses were giving demo and the fix I coded for an issue discovered during the demo could not deploy via Capistrano

                            • samgranieri 2 hours ago

                              I tried to do this back in 2017 as an elixir newbie with distillery, but for some reason just went with standard deploys with distillery. Now it’s just using mix release to build elixir apps in a docker image deployed to k8s.

                              • Thaxll 3 hours ago

                                Hot code update is one of those thing I don't understand, just use a rolling deployment, problem solved. You have a new version of the code without loosing any connection.

                                It's one of those thing that sound nice on paper but a actually couple your runtime with ci/cd, if you have anything else beside Erlang what do you do? You now need a second solution to deploy code.

                                • AlphaWeaver 2 hours ago

                                  I'm not sure that rolling deployments guarantee you won't lose connections, depending on the type of connection. Imagine your customer is downloading a large file over a single TCP connection, and you want to upgrade the application mid-download.

                                  With rolling deployments, your only choice is to wait until that connection drains by completing or failing the download. If that doesn't fit your use case, you're out of options.

                                  If your application is an Erlang app, you could hot code reload an unaffected part of the application while the download finishes. Or, if the part of the application handling the download is an OTP pattern that supports hot code reloading (like a gen_server) you could even make changes to that module and release e.g. speed improvements mid download stream. This is why Erlang shines in applications like telephony, which it was originally designed for.

                                  • fiddlerwoaroof 2 hours ago

                                    The other perspective on this is that, at some level of your system you are always doing a hot code reload: terraform, kubernetes, etc. are taking a new deployment description in and reconciling it with the existing state of the world. Wouldn’t it be nice if this process was just more code in your preferred programming language rather than YAML soup?

                                    BEAM encourages you to structure your program as isolated interacting processes and so it’s not that far from a container runtime in itself.

                                    • rozap 2 hours ago

                                      But what if you have long lived stateful connections? And you don't want a deploy to take forever?

                                      Ofc you can say "don't do that" but sometimes it's just the way it is...

                                      But I agree, 99% of the time a rolling update is easier and works fine.

                                      • nthh 2 hours ago

                                        It could be useful if you have an embedded device that you don't want to miss data from, but for most deployments I would agree.