<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Speedshop - Ruby on Rails performance consulting</title>
    <description>Speedshop is a one-man Ruby on Rails performance consultancy that optimizes the full stack - frontend, backend and environment - to generate revenue and cut scaling costs for businesses on Rails. Fast sites are profitable sites. Speed is a feature.
</description>
    <link>https://www.speedshop.co/</link>
    <atom:link href="https://www.speedshop.co/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Mon, 30 Nov 2020 19:10:21 +0000</pubDate>
    <lastBuildDate>Mon, 30 Nov 2020 19:10:21 +0000</lastBuildDate>
    <generator>Jekyll v4.1.1</generator>
    
      <item>
        <title>We Made Puma Faster With Sleep Sort</title>
        <description>&lt;p&gt;Puma 5 (codename Spoony Bard&lt;sup class=&quot;sidenote-number&quot;&gt;1&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(When Puma gets a new ‘supercontributor’ that submits lots of important work to the project, we let them name the next release. This release features a lot of code from Will Jordan, who named this release ‘Spoony Bard’. Will said: ‘Final Fantasy IV is especially nostalgic for me, the first big open-source project I ever worked on was a fan re-translation of the game back in the late 90s.’)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;1&lt;/sup&gt; When Puma gets a new ‘supercontributor’ that submits lots of important work to the project, we let them name the next release. This release features a lot of code from Will Jordan, who named this release ‘Spoony Bard’. Will said: ‘Final Fantasy IV is especially nostalgic for me, the first big open-source project I ever worked on was a fan re-translation of the game back in the late 90s.’&lt;/span&gt;) was released today (my birthday!). There’s a lot going on in this release, so I wanted to talk about the different features and changes to give Puma users confidence in upgrading.&lt;/p&gt;

&lt;h2 id=&quot;experimental-performance-features-for-cluster-mode-on-mri&quot;&gt;Experimental Performance Features For Cluster Mode on MRI&lt;/h2&gt;

&lt;p&gt;This is probably the headline of the release - two features for reducing memory usage, and one for reducing latency.&lt;/p&gt;

&lt;p&gt;Puma 5 contains 3 new experimental performance features:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wait_for_less_busy_worker&lt;/code&gt; config. This may reduce latency on MRI through inserting a small delay (sleep sort!) before re-listening on the socket if worker is busy. Intended result: If enabled, should reduce latency in high-load (&amp;gt;50% utilization) Puma clusters.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fork_worker&lt;/code&gt; option and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;refork&lt;/code&gt; command for reduced memory usage by forking from a worker process instead of the master process. Intended result: If enabled, should reduce memory usage.&lt;/li&gt;
  &lt;li&gt;Added &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nakayoshi_fork&lt;/code&gt; config option. Reduce memory usage in preloaded cluster-mode apps by GCing before fork and compacting, where available. Intended result: If enabled, should reduce memory usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these experiments are only for &lt;strong&gt;cluster mode&lt;/strong&gt; Puma configs running on &lt;strong&gt;MRI&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We’re calling them &lt;em&gt;experimental&lt;/em&gt; because we’re not sure if they’ll actually have any benefit. We’re pretty sure they’re stable and won’t break anything, but we’re not sure they’re actually going to have big benefits in the real world. People’s workloads are often not what we anticipate, and synthetic benchmarks are usually not of any help in figuring out if a change will be beneficial or not.&lt;/p&gt;

&lt;p&gt;We do not believe any of the new features will have a negative effect or impact the stability of your application. This is either a “it works” or “it does nothing” experiment.&lt;/p&gt;

&lt;p&gt;If any of the features turn out to be particularly beneficial, we may make them defaults in future versions of Puma.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you upgrade and try any of the 3 new features, please post before and after results or screenshots to &lt;a href=&quot;https://github.com/puma/puma/issues/2258&quot;&gt;this Github issue&lt;/a&gt;.&lt;/strong&gt; “It didn’t do anything” is still a useful report in this case. Posting ~24 hours of “before” and ~24 hours of “after” data would be most helpful.&lt;/p&gt;

&lt;h3 id=&quot;wait_for_less_busy_worker-sleep-sort-for-faster-apps&quot;&gt;wait_for_less_busy_worker: sleep sort for faster apps?!&lt;/h3&gt;

&lt;p&gt;This feature was contributed to Puma by Gitlab. Turn it on by adding &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wait_for_less_busy_worker&lt;/code&gt; to your Puma config.&lt;/p&gt;
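A minimal cluster-mode config sketch (the worker and thread counts here are just illustrative, not recommendations):

```ruby
# puma.rb - hypothetical cluster-mode config enabling the experiment
workers 3
threads 1, 5
wait_for_less_busy_worker
```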

&lt;p&gt;When a request comes in to a Puma cluster, the operating system randomly selects a listening, free Puma worker process to pick up the request. “Listening” and “free” being the key words - a Puma process will only listen to the socket (and pick up more requests) if it has nothing else to do. However, when running Puma with multiple threads, Puma will also listen on the socket when all of its busy threads are waiting on I/O or have otherwise released &lt;a href=&quot;2020/05/11/the-ruby-gvl-and-scaling.html&quot;&gt;the Global VM Lock&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When Gitlab investigated switching from Unicorn to Puma, they encountered an issue with this behavior. Under high load with moderate thread settings (a max pool size of 5 in their case), average request latency increased. Why?&lt;/p&gt;

&lt;p&gt;Remember, I said that the operating system &lt;em&gt;randomly&lt;/em&gt; assigns a request to a &lt;em&gt;listening&lt;/em&gt; worker process. So, it will never send a request to a worker process that’s busy doing other things, but what about a worker process that’s got 4 threads that are processing other requests, but all 4 of those threads happen to be waiting on I/O right now?&lt;/p&gt;

&lt;p&gt;Imagine a Puma cluster with 3 workers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Worker 1: 0/5 threads busy.&lt;/li&gt;
  &lt;li&gt;Worker 2: 1/5 threads busy.&lt;/li&gt;
  &lt;li&gt;Worker 3: 4/5 threads busy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If Worker 3’s 4 active threads happen to all have released the GVL, allowing that worker to listen to the socket, and a new request comes in - which worker process should we assign the request to, ideally? Worker 1, right? Unfortunately, most operating systems will assign the request to Worker 3 33% of the time.&lt;/p&gt;
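You can convince yourself of this with a toy simulation (assuming the OS picks uniformly at random among listening workers):

```ruby
# Toy model: the OS picks uniformly at random among all listening
# workers, ignoring how many busy threads each one has.
listening = ["worker_1", "worker_2", "worker_3"]
picks = Hash.new(0)
9_000.times { picks[listening.sample] += 1 }
# Each worker receives roughly a third of the requests, so the
# heavily loaded worker_3 gets just as many as the idle worker_1.
```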

&lt;p&gt;So, what do we do? We want the operating system to prefer less-loaded workers. It would be really cool if we could sort the list of workers listening on the socket so that the operating system would give requests to the least-loaded worker. Well, we can’t really do that easily, but we can do something else.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wait_for_less_busy_worker&lt;/code&gt; causes a worker to &lt;em&gt;wait&lt;/em&gt; to re-listen on the socket if its thread pool isn’t completely empty. This means that in high-load scenarios, the operating system will assign requests to less-loaded workers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is basically sleep-sorting our workers&lt;/strong&gt;. We’re kind of doing this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[].tap { |a| workers.map { |e| Thread.new{ sleep e.busyness.to_f/1000; a &amp;lt;&amp;lt; e} }.each{|t| t.join} }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;… and hiding “more loaded” workers from the operating system by letting less-loaded workers listen first!&lt;/p&gt;
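Here’s a runnable toy version of that idea (the busyness numbers are made up):

```ruby
# Toy sleep sort: each worker sleeps in proportion to its busy-thread
# count before announcing itself, so the least-busy worker "listens" first.
busyness = { "worker_1" => 0, "worker_2" => 1, "worker_3" => 4 }
order = Queue.new
threads = busyness.map do |name, busy|
  Thread.new { sleep(busy / 100.0); order.push(name) }
end
threads.each { |t| t.join }
result = Array.new(3) { order.pop }
```

`result` comes out `["worker_1", "worker_2", "worker_3"]` - least busy first.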

&lt;p&gt;Originally the proposal was for a more complicated sort - processes slept longer if they had more busy threads - but that was removed when it was found that a simpler on/off sleep was just as effective.&lt;/p&gt;

&lt;p&gt;The net effect is that in high-load scenarios, request latency decreases. This is because workers with more busy threads are slower than workers with no busy threads. We’re ensuring that requests get assigned to the faster workers. Prior to this patch, Gitlab saw an increase in latency using Puma compared to Unicorn - after this patch, latency was the same (they also were able to reduce their fleet size by almost 30% thanks to Puma’s memory-saving multithreaded design).&lt;/p&gt;

&lt;p&gt;There may be even more efficient ways for us to implement this behavior in the future. There’s some magic you can do with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;libev&lt;/code&gt;, I’m pretty sure, or we can just implement a different sleep/wait strategy.&lt;/p&gt;

&lt;h3 id=&quot;fork_worker&quot;&gt;fork_worker&lt;/h3&gt;

&lt;p&gt;Adding &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fork_worker&lt;/code&gt; to your puma.rb config file (or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--fork-worker&lt;/code&gt; from the CLI) turns on this feature. This mode causes Puma to fork additional workers from worker 0, instead of directly from the master process:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;10000   \_ puma 5.0.0 (tcp://0.0.0.0:9292) [puma]
10001       \_ puma: cluster worker 0: 10000 [puma]
10002           \_ puma: cluster worker 1: 10000 [puma]
10003           \_ puma: cluster worker 2: 10000 [puma]
10004           \_ puma: cluster worker 3: 10000 [puma]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Similar to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preload_app!&lt;/code&gt; option, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fork_worker&lt;/code&gt; option allows your application to be initialized only once for copy-on-write memory savings, and it has two additional advantages:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Compatible with phased restart.&lt;/strong&gt; Because the master process itself doesn’t preload the application, this mode works with phased restart (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SIGUSR1&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pumactl phased-restart&lt;/code&gt;), unlike &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preload_app!&lt;/code&gt;. When worker 0 reloads as part of a phased restart, it initializes a new copy of your application first, then the other workers reload by forking from this new worker already containing the new preloaded application.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This allows a phased restart to complete as quickly as a hot restart (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SIGUSR2&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pumactl restart&lt;/code&gt;), while still minimizing downtime by staggering the restart across cluster workers.&lt;/p&gt;

&lt;ol start=&quot;2&quot;&gt;
  &lt;li&gt;&lt;strong&gt;‘Refork’ for additional copy-on-write improvements in running applications.&lt;/strong&gt; Fork-worker mode introduces a new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;refork&lt;/code&gt; command that re-loads all nonzero workers by re-forking them from worker 0.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This command can potentially improve memory utilization in large or complex applications that don’t fully pre-initialize on startup, because the re-forked workers can share copy-on-write memory with a worker that has been running for a while and serving requests.&lt;/p&gt;

&lt;p&gt;You can trigger a refork by sending the cluster the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SIGURG&lt;/code&gt; signal or running the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pumactl refork&lt;/code&gt; command at any time. A refork will also automatically trigger once, after a certain number of requests have been processed by worker 0 (default 1000). To configure the number of requests before the auto-refork, pass a positive integer argument to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fork_worker&lt;/code&gt; (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fork_worker 1000&lt;/code&gt;), or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0&lt;/code&gt; to disable.&lt;/p&gt;
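As a sketch, a puma.rb using this mode might look like (values are illustrative only):

```ruby
# puma.rb - hypothetical fork-worker configuration
workers 4
fork_worker 1000   # auto-refork once after worker 0 serves 1000 requests
```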

&lt;h3 id=&quot;nakayoshi_fork&quot;&gt;nakayoshi_fork&lt;/h3&gt;

&lt;p&gt;Add &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nakayoshi_fork&lt;/code&gt; to your puma.rb config to try this option.&lt;/p&gt;

&lt;p&gt;Nakayoshi means “friendly”, so this is a “friendly fork”. The concept was &lt;a href=&quot;https://github.com/ko1/nakayoshi_fork&quot;&gt;originally implemented by MRI supercontributor Koichi Sasada&lt;/a&gt; in a gem, but we wanted to see if we could bring a simpler version into Puma.&lt;/p&gt;

&lt;p&gt;Basically, we just do the following before forking a worker:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;times&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;start&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;no&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;compact&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# if available&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The concept here is that we’re trying to get as clean of a Ruby heap as possible before forking to maximize &lt;a href=&quot;https://en.wikipedia.org/wiki/Copy-on-write&quot;&gt;copy-on-write&lt;/a&gt; benefits. That should, in turn, lead to reduced memory usage.&lt;/p&gt;
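The same idea as a self-contained sketch (simplified - this isn’t Puma’s actual implementation):

```ruby
# Simplified "friendly fork": run the GC a few times and compact the
# heap (where available) so forked children share more CoW pages.
def friendly_fork
  4.times { GC.start }
  GC.compact if GC.respond_to?(:compact)
  fork { yield }
end
```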

&lt;h2 id=&quot;other-new-features&quot;&gt;Other New Features&lt;/h2&gt;

&lt;p&gt;A few more things in the grab-bag:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;You can now compile Puma on machines where OpenSSL is not installed.&lt;/li&gt;
  &lt;li&gt;There is now a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;thread-backtraces&lt;/code&gt; command in pumactl to print backtraces for all active threads. This has been available via SIGINFO on Darwin; it now works on Linux via this new command.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Puma.stats&lt;/code&gt; now has a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;requests_count&lt;/code&gt; counter.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lowlevel_error_handler&lt;/code&gt; got some enhancements - we also pass the status code to it now.&lt;/li&gt;
  &lt;li&gt;Phased restarts and worker timeouts should be faster.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Puma.stats_hash&lt;/code&gt; provides Puma statistics as a hash, rather than as JSON.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;loads-of-bugfixes&quot;&gt;Loads of Bugfixes&lt;/h2&gt;

&lt;p&gt;The number of bugfixes in this release is pretty huge. Here’s the most important ones:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Shutdowns should be more reliable.&lt;/li&gt;
  &lt;li&gt;Issues surrounding socket closing on shutdown have been resolved.&lt;/li&gt;
  &lt;li&gt;Fixed some concurrency bugs in the Reactor.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;out_of_band&lt;/code&gt; should be much more reliable now.&lt;/li&gt;
  &lt;li&gt;Fixed an issue users were seeing with ActionCable and not being able to start a server.&lt;/li&gt;
  &lt;li&gt;Many stability improvements to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prune_bundler&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;nicer-internals-and-tests&quot;&gt;Nicer Internals and Tests&lt;/h2&gt;

&lt;p&gt;This release has seen a massive improvement to our test coverage. We’ve pretty much doubled the size of the test suite since 4.0, and it’s way more stable and reproducible now too.&lt;/p&gt;

&lt;p&gt;A number of breaking changes come with this major release. &lt;a href=&quot;https://github.com/puma/puma/blob/master/History.md&quot;&gt;For the complete list, see the HISTORY file.&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;thanks-to-our-contributors&quot;&gt;Thanks to Our Contributors!&lt;/h2&gt;

&lt;p&gt;This release is our first major or minor release with new maintainer MSP-Greg on the team. Greg has been doing tons of work on the test suite to make it more reliable, as well as a lot of work on our SSL features to bring them up-to-date and more extendable. Greg is also our main Windows expert.&lt;/p&gt;

&lt;p&gt;The following people contributed more than 10 commits to this release:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/seven1m&quot;&gt;Tim Morgan&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/alexeevit&quot;&gt;Vyacheslav Alexeev&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/wjordan&quot;&gt;Will Jordan&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/jalevin&quot;&gt;Jeff Levin&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/dentarg&quot;&gt;Patrik Ragnarsson&lt;/a&gt;, who’s also been very helpful in our Issues tracker.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’d like to make a contribution to Puma, please see our &lt;a href=&quot;https://github.com/puma/puma/blob/master/CONTRIBUTING.md&quot;&gt;Contributors Guide&lt;/a&gt;. We’re always looking for more help and try to make it as easy as possible to contribute.&lt;/p&gt;

&lt;p&gt;Enjoy Puma 5!&lt;/p&gt;
</description>
        <pubDate>Thu, 17 Sep 2020 00:00:00 +0000</pubDate>
        <link>https://www.speedshop.co/2020/09/17/we-made-puma-faster-with-sleep-sort.html</link>
        <guid isPermaLink="true">https://www.speedshop.co/2020/09/17/we-made-puma-faster-with-sleep-sort.html</guid>
        
        
      </item>
    
      <item>
        <title>The Practical Effects of the GVL on Scaling in Ruby</title>
        <description>&lt;p&gt;The Global Virtual Machine Lock confuses many Rubyists. Most Rubyists I’ve met have a vague sense that the GVL is somehow bad, and has something to do concurrency or parallelism.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(‘CRuby’ refers to the mainline Ruby implementation, written in C. Sometimes people call this ‘MRI’.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;‘CRuby’ refers to the mainline Ruby implementation, written in C. Sometimes people call this ‘MRI’.&lt;/span&gt;
The GVL (formerly known as the GIL, as you’re about to learn) is a feature unique to CRuby, and doesn’t exist in JRuby or TruffleRuby.&lt;/p&gt;

&lt;p&gt;JavaScript’s popular V8 virtual machine also has a VM lock. CPython also has a &lt;em&gt;global&lt;/em&gt; VM lock. That’s three of the most popular dynamic languages in the world! VM locks in dynamic languages are very common.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(Instead of removing the GVL, Ruby core has signaled that it will take an approach similar to V8 Isolates with inspiration from the Actor concurrency model (discussed at the end).)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;Instead of removing the GVL, Ruby core has signaled that it will take an approach similar to V8 Isolates with inspiration from the Actor concurrency model (discussed at the end).&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Understanding CRuby’s Global VM Lock is important when thinking about scaling Ruby applications. It will probably never be removed from CRuby completely, and its behavior changes how we scale Ruby apps efficiently.&lt;/p&gt;

&lt;p&gt;Understanding what the GVL is and why the current GVL is “global” will help you to answer questions like these:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;What should I set my Sidekiq concurrency to?&lt;/li&gt;
  &lt;li&gt;How many threads should I use with Puma?&lt;/li&gt;
  &lt;li&gt;Should I switch to Puma or Sidekiq from Unicorn, Resque, or DelayedJob?&lt;/li&gt;
  &lt;li&gt;What are the advantages of event-driven concurrency models, like Node?&lt;/li&gt;
  &lt;li&gt;What are the advantages of a global-lock-less language VM, like Erlang’s BEAM or Java’s JVM?&lt;/li&gt;
  &lt;li&gt;How will Ruby’s concurrency story change in Ruby 3?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll deal with these questions and more in this article.&lt;/p&gt;

&lt;h2 id=&quot;what-were-locking-the-language-virtual-machine&quot;&gt;What we’re locking: the language virtual machine&lt;/h2&gt;

&lt;p&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(Most descriptions of the GVL immediately dive into concepts like atomicity and thread-safety. This description will start from a more basic premise and work up to that.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;Most descriptions of the GVL immediately dive into concepts like atomicity and thread-safety. This description will start from a more basic premise and work up to that.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(YARV was essentially &lt;a href=&quot;https://dl.acm.org/doi/pdf/10.1145/1094855.1094912&quot;&gt;Koichi Sasada’s graduate thesis.&lt;/a&gt;)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;YARV was essentially &lt;a href=&quot;https://dl.acm.org/doi/pdf/10.1145/1094855.1094912&quot;&gt;Koichi Sasada’s graduate thesis.&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Wait: isn’t it the GIL? What’s the GVL?&lt;/p&gt;

&lt;p&gt;GIL stands for Global Interpreter Lock, and it’s something that was removed from Ruby (or just mutated, depending on how you look at it) in Ruby 1.9, when Koichi Sasada introduced YARV (Yet Another Ruby VM) to Ruby. YARV changed CRuby’s internal structure so that the lock existed around the Ruby virtual machine, not an interpreter. The correct terminology for over a decade now has been GVL, not GIL.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(You can interact with instruction sequences &lt;a href=&quot;https://ruby-doc.org/core-2.5.1/RubyVM/InstructionSequence.html&quot;&gt;via the InstructionSequence class&lt;/a&gt;. Everything is an object in Ruby!)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;You can interact with instruction sequences &lt;a href=&quot;https://ruby-doc.org/core-2.5.1/RubyVM/InstructionSequence.html&quot;&gt;via the InstructionSequence class&lt;/a&gt;. Everything is an object in Ruby!&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;How does an interpreter differ from a virtual machine?&lt;/p&gt;

&lt;p&gt;A virtual machine is a little like a CPU-within-a-CPU. Virtual machines are computer programs that usually take simple instructions, and those instructions manipulate some internal state. A &lt;a href=&quot;https://en.wikipedia.org/wiki/Turing_machine&quot;&gt;Turing machine&lt;/a&gt;, if it was implemented in software, would be a kind of virtual machine. We call them virtual machines and not machines because they’re implemented in software, rather than in hardware, like a CPU is.&lt;/p&gt;

&lt;p&gt;Before Ruby 1.9, Ruby didn’t really have a separate virtual machine step - it just had an interpreter. As your Ruby program ran, it actually interpreted each line of Ruby as it went. Now, we just interpret the code once, turn it into a series of VM instructions, and then execute those instructions. This is much faster than interpreting Ruby constantly.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/turingmachine.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;A Turing machine, implemented in software, would be a kind of virtual machine. &lt;a href=&quot;https://commons.wikimedia.org/wiki/File:TuringBeispielDiskretAnimatedGIF_uk.gif&quot;&gt;Wikimedia Commons by RosarioVanTuple&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The Ruby Virtual Machine understands a simple instruction set. Those instructions are generated from the Ruby code you write by the interpreter, and then the virtual machine instructions are fed into the Ruby VM.&lt;/p&gt;

&lt;p&gt;Let’s watch this in action. First, in case you didn’t know, you can execute Ruby from the command line using the -e option:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ ruby -e &quot;puts 1 + 1&quot;
2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/escanor_stack_meme_opt.jpeg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;
Now, you can dump the instructions for this simple program by passing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--dump=insns&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ ruby --dump=insns -e &quot;puts 1 + 1&quot;
== disasm: #&amp;lt;ISeq:&amp;lt;main&amp;gt;@-e:1 (1,0)-(1,10)&amp;gt; (catch: FALSE)
0000 putself                                                          (   1)[Li]
0001 putobject_INT2FIX_1_
0002 putobject_INT2FIX_1_
0003 opt_plus                     &amp;lt;callinfo!mid:+, argc:1, ARGS_SIMPLE&amp;gt;, &amp;lt;callcache&amp;gt;
0006 opt_send_without_block       &amp;lt;callinfo!mid:puts, argc:1, FCALL|ARGS_SIMPLE&amp;gt;, &amp;lt;callcache&amp;gt;
0009 leave
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Ruby is a “stack-based” VM. You can see how this works by looking at the generated instructions here - we push the integer 1 onto the stack twice, then call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plus&lt;/code&gt;. When &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plus&lt;/code&gt; is called, there are two integers on the stack. Those two integers are replaced by the result, 2, which is then on the stack.&lt;/p&gt;
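Here’s a tiny re-enactment of that instruction sequence in plain Ruby (an illustration of the stack discipline, not how YARV is actually implemented):

```ruby
# Re-enacting the disassembly above with an explicit stack.
stack = []
stack.push(1)        # 0001 putobject_INT2FIX_1_
stack.push(1)        # 0002 putobject_INT2FIX_1_
b = stack.pop
a = stack.pop
stack.push(a + b)    # 0003 opt_plus: the two operands become one result
puts stack.pop       # 0006 opt_send_without_block: prints 2
```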

&lt;p&gt;So, what does the Ruby VM have to do with threading, concurrency and parallelism?&lt;/p&gt;

&lt;h2 id=&quot;concurrency-and-paralellism&quot;&gt;Concurrency and Parallelism&lt;/h2&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/checkout_counter.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;You may be aware that there’s a difference between concurrency and parallelism. Imagine a grocery store. At this grocery store, we have a line and some checkout clerks working to pull customers from the line and get their groceries checked out.&lt;/p&gt;

&lt;p&gt;Each of our grocery store checkout clerks works in parallel. They don’t need to talk to each other to do their job, and what one clerk is doing doesn’t affect the other in any way. They’re working 100% in parallel.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/concurrent_checkout.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Now, a clerk &lt;em&gt;can&lt;/em&gt; work on multiple customers &lt;em&gt;concurrently&lt;/em&gt;. This would look like a clerk grabbing multiple customers from the line, working on one customer’s groceries for a moment, then switching to another customer’s groceries, and so on. This would be working concurrently.&lt;/p&gt;

&lt;p&gt;Let’s take a more concrete example. Compare three grocery store clerks working in parallel with a single one working concurrently. To check out a customer, we must perform two operations: scanning their groceries, and then bagging them. Imagine each customer’s groceries take the exact same amount of time to scan and bag.&lt;/p&gt;

&lt;p&gt;Three customers arrive. Let’s say scanning takes time &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; and bagging takes time &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt;. Our three parallel clerks will process these three customers in time &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(A + B)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/parallel.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;The parallel case.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;What about our concurrent clerk? All three of her customers arrive at the same time. The clerk scans each customer’s groceries, then bags each customer’s groceries. Each customer is worked on concurrently, but never in parallel.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/concurrency.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;The concurrent case.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;In the concurrent case, our first customer experiences a total service time of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(3A + B)&lt;/code&gt;. They had to wait for everyone else’s groceries to be checked out for their own groceries to get bagged. The second customer will experience a total service time of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(3A + 2B)&lt;/code&gt;, and the final customer will experience a service time of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(3A + 3B)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Notice how the customers who got the concurrent checkout clerk experienced a longer total service time than the customers who used our three parallel clerks.&lt;/p&gt;
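Plugging in made-up numbers makes the gap concrete - say scanning takes 2 minutes and bagging takes 1:

```ruby
# Hypothetical timings for the checkout example.
a = 2.0   # scan time per customer (time A)
b = 1.0   # bag time per customer (time B)
parallel_each_customer   = a + b           # 3.0 with three parallel clerks
concurrent_last_customer = 3 * a + 3 * b   # 9.0 with one concurrent clerk
```

Every parallel customer is done in 3 minutes; the concurrent clerk’s last customer waits 9.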

&lt;p&gt;In short: &lt;strong&gt;concurrency is interesting, but parallelism is what speeds up systems and allows them to handle increased load&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Performing two operations concurrently means that the start and end times of those operations overlapped at some point. For example, you and I sit down to a sign a contract. However, there is only one pen. I sign where I’m supposed to, hand the pen to you, and then you sign. Then, you hand the pen back to me and I initial a few lines. You might say that we signed the contract concurrently, but never in parallel - there was only one pen, so we couldn’t sign the contract at the exact same time.&lt;/p&gt;

&lt;p&gt;Performing operations in parallel means that we are doing those operations &lt;em&gt;at the exact same instant&lt;/em&gt;. In my contract example, a parallel contract signing would involve two pens (and probably two copies of the contract, otherwise it would get a little crowded).&lt;/p&gt;

&lt;h2 id=&quot;concurrency-and-paralellism-on-a-computer&quot;&gt;Concurrency and Parallelism on a Computer&lt;/h2&gt;

&lt;p&gt;On a modern operating system, programs are run with a combination of processes and threads. Processes have at least one thread, and can have up to thousands.&lt;/p&gt;

&lt;p&gt;To extend the grocery store analogy, processes are like the checkout counters that our clerks use. They contain tools and common resources, like the point-of-sale terminal and the barcode scanner, but they don’t actually &lt;em&gt;do&lt;/em&gt; anything. A process usually contains a memory allocation (the heap), file descriptors (sockets, files, etc), and other such computer resources.&lt;/p&gt;

&lt;p&gt;Threads actually run our code. Each process has at least one thread. In our analogy, they’re like the store clerks. They also hold a small amount of information. For example, if we’re adding two local variables in a Rails application, our thread contains information about those two variables (&lt;em&gt;thread-local storage&lt;/em&gt;) and also what line of code we’re currently running (the &lt;em&gt;stack&lt;/em&gt;).&lt;/p&gt;
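&lt;p&gt;A minimal Ruby sketch of that relationship (the &lt;code&gt;clerk_id&lt;/code&gt; name is made up for illustration): every thread shares the process’s heap, while each thread keeps its own thread-local storage.&lt;/p&gt;

```ruby
# One process, several threads: the heap is shared, and each thread
# gets its own stack plus thread-local storage.
shared = []  # lives on the process heap, visible to every thread
threads = 3.times.map do |i|
  Thread.new do
    Thread.current[:clerk_id] = i  # thread-local storage
    shared.push(Thread.current[:clerk_id])
  end
end
threads.each { |t| t.join }
shared.sort  # [0, 1, 2]
```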

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/pentium.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Threads run the code when they are scheduled to by the operating system’s kernel. The Ruby runtime itself doesn’t actually manage when threads are executed - the operating system decides that.&lt;/p&gt;

&lt;p&gt;When Ruby was written in the 90s, all processes had just one thread. This started to change in the early 2000s, necessitating the rewrite of the language VM in Ruby 1.9 (YARV), which is what gave us the GVL as we know it today.&lt;/p&gt;

&lt;h2 id=&quot;what-the-gvl-actually-does&quot;&gt;What the GVL actually does&lt;/h2&gt;

&lt;p&gt;As mentioned earlier, the Ruby Virtual Machine is what actually turns Ruby virtual machine instructions (generated by the interpreter) into CPU instructions.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/vm_lock_bernie.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The Ruby Virtual Machine is not internally thread-safe. If two threads tried to access the Ruby VM at the same time, really Bad Things would happen. This is a bit like the point-of-sale terminal at our grocery store checkout counters. If two checkout clerks tried to use the same POS terminal, they would interrupt each other and probably keep losing or corrupting each other’s work. You would end up paying for someone else’s groceries!&lt;/p&gt;

&lt;p&gt;So, because it isn’t safe for multiple threads to access the Ruby Virtual Machine at the same moment, we put a global lock around it so that only one thread can access it at a time.&lt;/p&gt;
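&lt;p&gt;You can approximate the idea in plain Ruby with a &lt;code&gt;Mutex&lt;/code&gt; - a sketch of the concept, not how the VM literally implements its lock:&lt;/p&gt;

```ruby
# The GVL behaves much like a Mutex wrapped around "execute Ruby
# code": only one thread can be inside the critical section at once.
gvl = Mutex.new
order = []
threads = 4.times.map do |i|
  Thread.new do
    gvl.synchronize { order.push(i) }  # one thread at a time in here
  end
end
threads.each { |t| t.join }
order.sort  # [0, 1, 2, 3]
```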

&lt;p&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(One caveat of the Javascript GVL is that it isn’t actually global: you can create additional Isolates. Koichi Sasada’s proposal for Ractors (formerly Guilds) would be similar.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;One caveat of the Javascript GVL is that it isn’t actually global: you can create additional Isolates. Koichi Sasada’s proposal for Ractors (formerly Guilds) would be similar.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;It is extremely common for dynamic language VMs to not be thread-safe. As mentioned, CPython and V8 are the most prominent examples. Java is probably the best example of a semi-dynamic language that &lt;em&gt;does&lt;/em&gt; have a thread-safe VM. It’s also why so many languages are written on top of the JVM: writing your own thread-safe VM is really hard.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/realize.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;TFW you realize that there’s always going to be locks, the only difference is what level they’re implemented at and who implements them&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;There’s a few good reasons that having a GVL is so popular:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;It’s faster. Single-threaded performance improves because you don’t have to constantly lock and unlock internals.&lt;/li&gt;
  &lt;li&gt;Integrating with extensions, such as C extensions, is easier.&lt;/li&gt;
  &lt;li&gt;It’s easier to write a lockless VM than one with a lot of locks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each Ruby process has its own Global VM Lock, so it might be more accurate to say that it’s a “process-wide VM lock”. It’s “global” in the same sense that a “global variable” is global.&lt;/p&gt;

&lt;p&gt;Only one thread in any Ruby process can hold the global VM lock at any given time. Since a thread needs access to the Ruby Virtual Machine to actually run any Ruby code, effectively only one thread can run Ruby code at any given time.&lt;/p&gt;
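&lt;p&gt;A quick way to see this for yourself: time some CPU-bound, pure-Ruby work serially and then across two threads. This is a rough benchmark sketch, so exact numbers will vary by machine:&lt;/p&gt;

```ruby
require "benchmark"

# CPU-bound work written in pure Ruby, so the GVL is held throughout.
work = proc { 200_000.times { Math.sqrt(rand) } }

serial   = Benchmark.realtime { 2.times { work.call } }
threaded = Benchmark.realtime do
  2.times.map { Thread.new { work.call } }.each { |t| t.join }
end
# On MRI, threaded comes out about the same as serial: the two
# threads take turns holding the GVL rather than running at once.
```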

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/songofmyppl.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Let me play you the song of my people: “&lt;em&gt;GGVVVVLLLLLLLLLLLLLLLL&lt;/em&gt;”&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Think of the GVL like the conch shell in the Lord of the Flies - if you have it, you get to speak (or execute Ruby code, in this case). If the GVL is already held by a different thread, the other threads must wait for it to be released before they can acquire it.&lt;/p&gt;

&lt;h2 id=&quot;amdahls-law-why-1-sidekiq-process-can-be-2x-as-efficient-as-delayedjob-or-resque&quot;&gt;Amdahl’s Law: Why 1 Sidekiq Process Can Be 2x as Efficient as DelayedJob or Resque&lt;/h2&gt;

&lt;p&gt;Your programs actually do many things that don’t need to access the Ruby Virtual Machine. The most important is waiting on I/O, such as database and network calls. These actions are executed in C, and the GVL is explicitly released by the thread waiting on that I/O to return. When the I/O returns, the thread attempts to reacquire the GVL and continue to do whatever the program says.&lt;/p&gt;
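&lt;p&gt;Here’s a small demonstration of that release-the-GVL-on-I/O behavior, using &lt;code&gt;sleep&lt;/code&gt; as a stand-in for a database or network call:&lt;/p&gt;

```ruby
require "benchmark"

# Ten "jobs" that only wait on I/O (simulated with sleep, which
# releases the GVL just like a blocked database or network call).
elapsed = Benchmark.realtime do
  10.times.map { Thread.new { sleep 0.1 } }.each { |t| t.join }
end
# elapsed is about 0.1s, not 1.0s: sleeping threads do not hold the
# GVL, so all ten waits overlap.
```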

&lt;p&gt;This has enormous real-world performance impacts.&lt;/p&gt;

&lt;p&gt;Imagine you have a stack of satellite image data you have to process (with Ruby). You have written a Sidekiq job, called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SatelliteDataProcessorJob&lt;/code&gt;, and each job works on a small fraction of all of the satellite data.&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SatelliteDataProcessorJob&lt;/span&gt;
  &lt;span class=&quot;kp&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;Sidekiq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;Worker&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;perform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;some_satellite_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;some_satellite_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;touch_external_service&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;some_satellite_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;add_data_to_database&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;some_satellite_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Let’s imagine that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;process&lt;/code&gt; is a 100% Ruby method, which does not call C extensions or external services. Further, let’s imagine that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;touch_external_service&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;add_data_to_database&lt;/code&gt; are effectively 100% I/O methods that spend all of their time waiting on the network.&lt;/p&gt;

&lt;p&gt;First, an easy question: if each run of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SatelliteDataProcessorJob&lt;/code&gt; takes 1 second, and you have 100 enqueued jobs and just 1 Sidekiq process with 1 thread, how long will it take to process all the jobs? Assume infinite CPU and memory resources.&lt;/p&gt;

&lt;p&gt;100 seconds.&lt;/p&gt;

&lt;p&gt;How about if you have two processes? 50 seconds. And 25 seconds for 4 processes, and so on. That’s parallelism.&lt;/p&gt;

&lt;p&gt;Now, let’s say you have 1 Sidekiq process with 10 threads. How long will it take to process all of those jobs?&lt;/p&gt;

&lt;p&gt;The answer is &lt;em&gt;it depends&lt;/em&gt;. If you’re on JRuby or TruffleRuby, it will take about 10 seconds, because each thread is fully parallel with all the other threads.&lt;/p&gt;

&lt;p&gt;But on MRI, we have the GVL. Does adding threads help at all?&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/AmdahlsLaw.svg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;From &lt;a href=&quot;https://commons.wikimedia.org/wiki/File:AmdahlsLaw.svg&quot;&gt;Daniels 220 @ Wikipedia&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;It turns out, this exact problem interested computer scientist Gene Amdahl back in 1967. He proposed something called Amdahl’s Law, which gives the theoretical speedup in latency for the execution of tasks with fixed workloads when resources increase.&lt;/p&gt;

&lt;p&gt;Amdahl figured out that the speedup you got from additional parallelism was related to the proportion of execution time that could be done in parallel. Sound familiar?&lt;/p&gt;

&lt;p&gt;Amdahl’s Law is simply &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1 / (1 - p + p/s)&lt;/code&gt;, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&lt;/code&gt; is the percentage of the task that could be done in parallel, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s&lt;/code&gt; is the speedup factor from the part of the task that gained improved resources (the parallel part).&lt;/p&gt;

&lt;p&gt;So, in our example, let’s say that half of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SatelliteDataProcessorJob&lt;/code&gt; is GVL-bound and half is IO-bound. In this case, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&lt;/code&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0.5&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;s&lt;/code&gt; is 10, because we can wait for IO in parallel and there are 10 threads. &lt;strong&gt;In this case, Amdahl’s Law shows that a Sidekiq process would go through our jobs up to 1.81x faster than a single-threaded Resque or DelayedJob process.&lt;/strong&gt;&lt;/p&gt;
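&lt;p&gt;Amdahl’s Law is easy to sanity-check in a few lines of Ruby (the helper name here is mine, not a library method):&lt;/p&gt;

```ruby
# Amdahl's Law: p is the fraction of the work that can run in
# parallel, s is the speedup applied to that fraction (here, the
# thread count, since threads can wait on I/O in parallel).
def amdahl_speedup(p, s)
  1.0 / ((1 - p) + p.to_f / s)
end

amdahl_speedup(0.5, 10)   # 50% parallel, 10 threads: about 1.8x
amdahl_speedup(0.75, 16)  # 75% parallel, 16 threads: about 3.4x
```

&lt;p&gt;Note how quickly the curve flattens: past a certain thread count, the serial (GVL-bound) portion dominates and extra threads buy almost nothing.&lt;/p&gt;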

&lt;p&gt;Many background jobs in Ruby spend at least 50% of their time waiting on IO. For those jobs, Sidekiq can roughly halve resource usage, because 1 Sidekiq process can do the work of what used to take 2 single-threaded processes.&lt;/p&gt;

&lt;p&gt;So, even with a GVL, adding threads to applications increases throughput-per-process, which in turn lowers memory consumption.&lt;/p&gt;

&lt;h2 id=&quot;threads-puma-and-gvl-caused-latency&quot;&gt;Threads, Puma and GVL-caused Latency&lt;/h2&gt;

&lt;p&gt;This also means that “how many threads does my Sidekiq or Puma process need” is a question answered by “how much time does that thread spend in non-GVL execution?” or “how much time does my program spend waiting on I/O?” Workloads with high percentages of time spent in I/O (75% or more) often benefit from 16 threads or even more, but more typical workloads see benefit from just 3 to 5 threads.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/paralellizable.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;It’s possible to configure your thread pools to be &lt;em&gt;too large&lt;/em&gt;. Setting Puma or Sidekiq to thread settings higher than 5 can lead to contention for the GVL if the work is not parallelizable enough. This increases service latency.&lt;/p&gt;

&lt;p&gt;While total time to process all of the units of work remains the same, the latency experienced by each individual unit of work increases.&lt;/p&gt;

&lt;p&gt;Imagine a grocery store where a checkout clerk grabbed 16 people off the checkout queue and checked those 16 people’s groceries concurrently, scanning one item per person before scanning one item from the next person’s cart. Rather than experiencing a checkout time of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(A + B)&lt;/code&gt;, each customer experiences a checkout time of roughly &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;16(A+B)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(This effect is generally present in a concurrent-but-not-100%-parallel system where overall utilization is not extremely high. &lt;a href=&quot;https://github.com/puma/puma/pull/2079&quot;&gt;We’re mitigating this effect slightly in Puma 5&lt;/a&gt; by having Puma workers with more than one thread delay listening to the socket, so less-loaded workers pick up requests first.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;This effect is generally present in a concurrent-but-not-100%-parallel system where overall utilization is not extremely high. &lt;a href=&quot;https://github.com/puma/puma/pull/2079&quot;&gt;We’re mitigating this effect slightly in Puma 5&lt;/a&gt; by having Puma workers with more than one thread delay listening to the socket, so less-loaded workers pick up requests first.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Some people misidentify this additional latency as “context switching” costs. However, the latency experienced by each individual unit of work increases even &lt;em&gt;without any additional switching cost&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In any case, context switching on modern machines and operating systems is pretty cheap relative to the time it takes to service a typical web app request or background job. It does not add hundreds of milliseconds to response times - but oversaturating the GVL can.&lt;/p&gt;

&lt;p&gt;If adding threads to a CRuby process can increase latency, why is it still useful?&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(Shouldn’t adding an additional thread only increase memory usage by 8MB, which is the size of the thread’s stack allocation? Ah, if only memory usage was that simple. &lt;a href=&quot;/2017/12/04/malloc-doubles-ruby-memory.html&quot;&gt;Learn more about the complexities of RSS and thread-induced fragmentation here.&lt;/a&gt;)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;Shouldn’t adding an additional thread only increase memory usage by 8MB, which is the size of the thread’s stack allocation? Ah, if only memory usage was that simple. &lt;a href=&quot;/2017/12/04/malloc-doubles-ruby-memory.html&quot;&gt;Learn more about the complexities of RSS and thread-induced fragmentation here.&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding more threads to a Ruby process helps us to improve CPU utilization at less memory cost than an entire additional process.&lt;/strong&gt; Adding 1 process might use 512MB of memory, but adding 1 thread will probably cause less than 64MB of additional memory usage. With 2 threads instead of 1, when the first thread releases the GVL and listens on I/O, our 2nd thread can pick up new work to do, increasing throughput and utilization of our server.&lt;/p&gt;

&lt;p&gt;GitLab switched from Unicorn (single-thread model) to Puma (multi-thread model) and &lt;a href=&quot;https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/7455#note_239070865&quot;&gt;saw a 30% decrease in memory usage across their fleet.&lt;/a&gt; If you’re memory-constrained on your host, this allows you to run 30% more throughput for the same money. That’s awesome.&lt;/p&gt;

&lt;h2 id=&quot;the-future&quot;&gt;The Future&lt;/h2&gt;

&lt;p&gt;For a decade now, bystanders have declared that Ruby is dead because it “doesn’t have a proper concurrency story”.&lt;/p&gt;

&lt;p&gt;I think we’ve shown that there is a concurrency story in Ruby. First, we have process-based concurrency. We multiply GVLs by multiplying processes. This works perfectly fine, if you have enough memory.&lt;/p&gt;

&lt;p&gt;If you’re out of memory, you can use Sidekiq or Puma, which provide a threaded container for your apps, and then let pre-emptive threading do its thing.&lt;/p&gt;

&lt;p&gt;Ruby has proven that process-based concurrency (which is really what the GVL forces us to do) scales well. It’s not much more expensive than other models, especially these days when memory is so cheap on cloud providers. Think critically about what an Actor-style approach or an Erlang Process-style approach would &lt;em&gt;actually change&lt;/em&gt; about your deployment at the end of the day: you would use less memory per CPU. But on large deployments, most web applications are already CPU-bottlenecked, not memory-bottlenecked!&lt;/p&gt;

&lt;h4 id=&quot;ractor&quot;&gt;Ractor&lt;/h4&gt;

&lt;p&gt;Koichi Sasada, author of YARV, is proposing a new concurrency abstraction for Ruby 3 called Ractors. It’s a proposal based on the Actor concurrency model (hence Ruby Actor -&amp;gt; Ractor). Basically, Actors are boxes for objects to go into, and each actor can only touch its own objects, but can send and receive objects to/from other Actors. Here’s an example written by Koichi Sasada:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;Ractor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;current&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;rs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;..&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;Ractor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;send&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;Ractor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;recv&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;r&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;send&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;r0&quot;&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;Ractor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;recv&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;#=&amp;gt; &quot;r0r10r9r8r7r6r5r4r3r2r1&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Eventually (not yet in the current implementation), each Ractor will get its own VM lock. That means the example code above will execute in parallel.&lt;/p&gt;

&lt;p&gt;This is made possible because Ractors don’t share mutable state. Instead, they only share immutable objects, and can send mutable objects between each other. This should mean that we don’t need a VM lock that is shared across Ractors.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ruby/ruby/compare/master...ko1:ractor&quot;&gt;Koichi Sasada’s Ractor proposal is now public&lt;/a&gt;, though as of this writing the docs are mostly in Japanese, and “each Ractor gets its own VM lock” has not yet been implemented. Ractors will essentially allow us to “multiply” GVLs in a process, which would make the GVL no longer “global”, although the lock will still exist in each Ractor. The Global VM Lock will become a Ractor VM Lock.&lt;/p&gt;

&lt;h2 id=&quot;tldr&quot;&gt;TL;DR&lt;/h2&gt;

&lt;p&gt;Thanks for listening to me whinge. Here’s what you need to remember:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;If you are memory-bottlenecked on Ruby, you need to &lt;strong&gt;saturate the GVL&lt;/strong&gt; by adding more threads, which will allow you to get &lt;em&gt;more CPU work done with less memory use&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;The GVL means that parallelism is limited to I/O in Ruby, so &lt;strong&gt;switch to a multithreaded background job processor before you switch to a multithreaded web server&lt;/strong&gt;. Also, you’ll probably use much higher threadpool sizes with your background jobs than with your web server.&lt;/li&gt;
  &lt;li&gt;Ruby 3 &lt;strong&gt;might make the GVL no longer global&lt;/strong&gt; by allowing you to multiply VMs using Ractors. Application servers and background job processors will probably change their backends to take advantage of this; you won’t really have to change much of your code at all, and you will no longer have to worry about thread safety (yay).&lt;/li&gt;
  &lt;li&gt;Process based concurrency scales very well, and while it might lose a few microseconds to other concurrency models, these &lt;strong&gt;concurrency switching costs generally don’t matter for the typical Rails application&lt;/strong&gt;. Instead, the important thing is saturating CPU, which is the most scarce resource in today’s computing environments.&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Mon, 11 May 2020 00:00:00 +0000</pubDate>
        <link>https://www.speedshop.co/2020/05/11/the-ruby-gvl-and-scaling.html</link>
        <guid isPermaLink="true">https://www.speedshop.co/2020/05/11/the-ruby-gvl-and-scaling.html</guid>
        
        
      </item>
    
      <item>
        <title>The World Follows Power Laws: Why Premature Optimization is Bad</title>
        <description>&lt;p&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(This post is a sample of the content available in the &lt;a href=&quot;https://railsspeed.com&quot;&gt;Complete Guide to Rails Performance&lt;/a&gt;. It’s actually the first lesson - there are 30+ more lessons and 18 hours of video in the course itself.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;This post is a sample of the content available in the &lt;a href=&quot;https://railsspeed.com&quot;&gt;Complete Guide to Rails Performance&lt;/a&gt;. It’s actually the first lesson - there are 30+ more lessons and 18 hours of video in the course itself.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;I want to tell you about a physicist from Schenectady, a Harvard linguist,
and an Italian economist.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/pareto.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Pareto’s political views are a bit suspect, unfortunately, because he chose to see the way things &lt;em&gt;are&lt;/em&gt; (unequal) as the way they &lt;em&gt;ought to be&lt;/em&gt;.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The Italian economist you may already have heard of - Vilfredo Pareto. He became
famous for something called &lt;strong&gt;The Pareto Principle&lt;/strong&gt;, the idea that for most
things, 80% of the effect comes from just 20% of the causes. The Pareto
Principle is fundamental to performance work because it reminds us why premature optimization is so inefficient and useless.
While you’ve probably &lt;em&gt;heard&lt;/em&gt;
of the Pareto Principle, I want you to &lt;em&gt;understand why&lt;/em&gt; it actually works. And
to do that, we’re going to have to talk about probability distributions.&lt;/p&gt;

&lt;h2 id=&quot;benford---the-physicist&quot;&gt;Benford - the physicist&lt;/h2&gt;

&lt;p&gt;Frank Benford was an American electrical engineer and physicist who worked for
General Electric. It was the early 20th century, when you had a job for life
rather than a startup gig for 18 months, so he worked there from the day he
graduated from the University of Michigan until his death 38 years later in 1948.
&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/logtables.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;A page from Henry Briggs’ first table of common logarithms, Logarithmorum Chilias Prima, from 1617. &lt;a href=&quot;https://commons.wikimedia.org/wiki/File:Logarithmorum_Chilias_Prima_page_0-67.jpg&quot;&gt;Wikipedia&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Back in that time, before calculators, if you wanted to know the logarithm
of a number - say, 12 - you looked it up in a book. The books were usually
organized by the leading digit. If you wanted to know the logarithm of 330, you
first went to the section for 3, then looked for 330. Benford noticed that the
first pages of the book were far more worn out than the last pages. Benford
realized this meant that the numbers looked up in the table began more often
with 1 than with 9.&lt;/p&gt;

&lt;p&gt;Most people would have noticed that and thought nothing of it. But Benford
pooled 20,000 numbers from widely divergent sources (he used the numbers in
newspaper stories) and found that the leading digit of all those numbers
followed a power law too.&lt;/p&gt;

&lt;p&gt;This became known as &lt;a href=&quot;https://en.wikipedia.org/wiki/Benford%27s_law&quot;&gt;Benford’s Law&lt;/a&gt;. Here are some other sets of numbers that
conform to this power law:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote &quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/physicalconstants.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Physical constants of the universe (pi, the molar constant, etc.)&lt;/li&gt;
  &lt;li&gt;Surface areas of rivers&lt;/li&gt;
  &lt;li&gt;Fibonacci numbers&lt;/li&gt;
  &lt;li&gt;Powers of 2&lt;/li&gt;
  &lt;li&gt;Death rates&lt;/li&gt;
  &lt;li&gt;Population censuses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That the &lt;em&gt;physical constants of the universe&lt;/em&gt; follow this
distribution is probably the most mind-blowing revelation of Benford’s Law, for
me, anyway.&lt;/p&gt;

&lt;p&gt;Benford’s Law is so airtight that it’s been admitted in US courts as evidence of
accounting fraud (someone used RAND in their Excel sheet!). It’s been used to
identify other types of fraud too - elections, scientific and even macroeconomic data.&lt;/p&gt;

&lt;p&gt;What would cause numbers that have (seemingly) little relationship with each
other to conform so perfectly to this non-random distribution?&lt;/p&gt;

&lt;h2 id=&quot;zipf---the-linguist&quot;&gt;Zipf - the linguist&lt;/h2&gt;

&lt;p&gt;&lt;span class=&quot;marginnote &quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/zipf_wiki.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;A plot of
the rank versus frequency for the first 10 million words in 30 different
languages of Wikipedia. Note the logarithmic scales. &lt;a href=&quot;https://commons.wikimedia.org/wiki/File:Zipf_30wiki_en_labels.png&quot;&gt;Licensed CC-BY-SA by SergioJimenez.&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;At almost exactly the same time as Benford was looking at first leading digits,
George Kingsley Zipf was studying languages at
Harvard. Uniquely, George was applying the techniques of a new and interesting
field - statistics - to the study of language. This landed him an astonishing
insight: in nearly every language, some words are used a lot, but most (nearly
all) words are used hardly at all.&lt;/p&gt;

&lt;p&gt;Only a few words account for most of our use of language. The Brown Corpus is a collection of literature used by linguistics researchers.
It consists of 500 samples of English-language text comprising 1 million words.
Just 135 unique words are needed to account for 50% of those million words.
That’s insane.&lt;/p&gt;
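&lt;p&gt;You can reproduce a miniature version of Zipf’s observation on any scrap of English text - the sample sentence below is made up, not the Brown Corpus:&lt;/p&gt;

```ruby
# A toy version of Zipf's observation: rank the words of a text by
# frequency, and a handful of them dominate the total.
text   = "the quick brown fox jumps over the lazy dog and the dog " \
         "barks at the fox while the fox runs over the hill and the dog sleeps"
counts = text.split.tally.sort_by { |_, n| -n }
total  = counts.sum { |_, n| n }
top_share = counts.first(3).sum { |_, n| n }.fdiv(total)
# "the", "fox" and "dog" together cover about half of all 27 words.
```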

&lt;p&gt;Zipf’s probability distribution is &lt;em&gt;discrete&lt;/em&gt;. Discrete
distributions take on only whole-integer values. Continuous distributions can take
on any value. If you take Zipf’s distribution and make it continuous instead of
discrete, you get the Pareto distribution.&lt;/p&gt;

&lt;h2 id=&quot;pareto---the-economist&quot;&gt;Pareto - the economist&lt;/h2&gt;

&lt;p&gt;Pareto initially noticed a curious distribution when he was thinking about
wealth in society - he noticed that 80% of the wealth and income came from 20%
of the people in it.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/pareto.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The Pareto distribution, pictured, has been found to hold for a scary number of
completely different and unrelated fields in the sciences. For example, here are
some natural phenomena that exhibit a Pareto (power law) distribution:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Wealth inequality&lt;/li&gt;
  &lt;li&gt;Sizes of rocks on a beach&lt;/li&gt;
  &lt;li&gt;Hard disk drive error rates (!)&lt;/li&gt;
  &lt;li&gt;File size distribution of Internet traffic (!!!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We tend to think of the natural world as random or chaotic. In schools, we’re
taught the bell curve/normal distribution. &lt;strong&gt;But reality isn’t normally
distributed.&lt;/strong&gt; It’s log-normal. Many probability distributions, in the wild,
support the Pareto Principle:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;80% of the output will come from 20% of the input&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/Normal_Distribution_PDF.svg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Normal distributions are taught in schools because they’re quite easy to talk about mathematically, not because they’re particularly good descriptions of the natural world.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;While you may have heard this before, what I’m trying to get across to you is
that it isn’t made up. The Pareto distribution is used in hundreds of otherwise
completely unrelated scientific fields - and we can use its ubiquity to our
advantage.&lt;/p&gt;

&lt;p&gt;It doesn’t matter what area you’re working in - if you’re applying equal effort
to all areas, you &lt;em&gt;are wasting your time&lt;/em&gt;. What the Pareto distribution shows us
is that most of the time, our efforts would be better spent &lt;em&gt;finding&lt;/em&gt; and
&lt;em&gt;identifying&lt;/em&gt; the crucial 20% that accounts for 80% of the output.&lt;/p&gt;

&lt;p&gt;Allow me to reformulate and apply this to web application performance:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;80% of an application’s work occurs in 20% of its code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are other applications in our performance realm too:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;80% of an application’s traffic will come from 20% of its features.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;80% of an application’s memory usage will come from 20% of its allocated
objects.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The ratio isn’t always 80/20. Actually, usually it’s way more severe - 90/10,
95/5, 99/1. Sometimes it’s less severe. So long as it isn’t 50/50 we’re talking
about a non-normal distribution.&lt;/p&gt;
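&lt;p&gt;To see that this isn’t hand-waving, here’s a toy Ruby sketch (the parameters are made up for illustration - an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;alpha&lt;/code&gt; of about 1.16 is the shape value that yields roughly an 80/20 split) that samples a Pareto distribution and checks what share of the total the top 20% of samples holds:&lt;/p&gt;

```ruby
# Toy demo: sample a Pareto distribution via inverse-transform sampling
# and measure what share of the total the top 20% of samples accounts for.
# alpha = 1.16 is the shape parameter giving roughly an 80/20 split.
rng   = Random.new(42)  # fixed seed for reproducibility
alpha = 1.16
x_min = 1.0

# Inverse CDF: x = x_min / u^(1/alpha), with u uniform in (0, 1]
samples = Array.new(10_000) { x_min / (1.0 - rng.rand)**(1.0 / alpha) }

sorted = samples.sort.reverse
top_20_share = sorted.first(samples.size / 5).sum / sorted.sum

puts format("Top 20%% of samples hold %.1f%% of the total", top_20_share * 100)
```

&lt;p&gt;With a different seed the exact number wobbles (heavy-tailed distributions have enormous variance), but it lands far from the 20% an even split would give.&lt;/p&gt;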

&lt;p&gt;This is why premature optimization is so bad and why performance monitoring,
profiling and benchmarking are so important. The world is full of power-law distributions, not normal distributions. Spreading your effort evenly across a power-law distribution is a massive waste of effort.&lt;/p&gt;

&lt;p&gt;What the Pareto Principle reveals to us is that optimizing any random line of code in our application is in fact
unlikely to speed up our application at all! 80% of the “slowness” in any given
app will be hidden away in a minority of the code.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/haskell.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;So instead of optimizing
blindly, applying principles at random that we read from blog posts, or engaging
in Hacker-News-Driven-Development by using the latest and “most performant” web
technologies, we need to measure where the bottlenecks and problem areas are in
our application.&lt;/p&gt;

&lt;h2 id=&quot;an-optimization-story---measurement-profiling-and-benchmarking&quot;&gt;An Optimization Story - Measurement, Profiling and Benchmarking&lt;/h2&gt;

&lt;p&gt;There’s only one skill in performance work that you need to understand
completely and deeply - how to &lt;em&gt;measure&lt;/em&gt; your application’s performance. Once
you have that skill mastered, knowing every possible thing about performance might
be a waste of time. Your problems are not others’ problems. There are going to
be lessons out there that solve problems you don’t have (or that don’t comprise
that crucial 20% of the causes of slowness in your application).&lt;/p&gt;

&lt;p&gt;On the flip side, you should realize that the Pareto Principle is extremely
liberating. You &lt;em&gt;don’t&lt;/em&gt; need to fix every performance issue in your application.
You don’t need to go line-by-line to look for problems under every rock. You
need to &lt;em&gt;measure&lt;/em&gt; the actual performance of your application, and focus on the
20% of your code that is the worst performance offender.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/minitest_knows.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;My first conference talk ever, actually.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;I once gave &lt;a href=&quot;https://www.youtube.com/watch?v=ojd1G4gOMdk&quot;&gt;a conference talk that was a guided read-through of Minitest&lt;/a&gt;, the
Ruby testing framework. Minitest is a great read if you’ve got a spare hour or two -
it’s fairly short at just 1,500 lines. As I was reading Minitest’s code, I
came across this funny line:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;runnable_methods&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;methods&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;methods_matching&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sr&quot;&gt;/^test_/&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;test_order&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;when&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;:random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;:parallel&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;methods&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;size&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;methods&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sort_by&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;rand&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;when&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;:alpha&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;:sorted&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;methods&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sort&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;raise&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Unknown test_order: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;#{&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;test_order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;inspect&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This code makes it extremely clear what’s going on: we determine which
methods on a class are runnable with a regex (“starts with test_”), and then
sort them depending upon this test class’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test_order&lt;/code&gt;. Minitest uses the
return value to execute all of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;runnable_methods&lt;/code&gt; on all the test classes
you give it. Usually this is a randomized array of method names, because the default test order is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:random&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What I was homing in on was this line, which runs when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test_order&lt;/code&gt; is
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:random&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:parallel&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;methods&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;size&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;methods&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sort_by&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;rand&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This seemed like a really roundabout way to do &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;methods.shuffle&lt;/code&gt; to me. Maybe
Ryan (Minitest’s author) was doing something unusual to ensure deterministic
execution given a seed - Minitest runs your tests in the same order given the
same seed to the random number generator. But it turns out &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;methods.shuffle&lt;/code&gt; is
just as deterministic under a fixed seed as the code as written. So, I decided to benchmark
it, mostly out of curiosity.&lt;/p&gt;
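&lt;p&gt;That determinism is easy to check for yourself - both versions produce the same order every time once Ruby’s random number generator is seeded:&lt;/p&gt;

```ruby
# Both shuffling approaches are deterministic under a fixed seed.
methods = ("a".."z").to_a

srand(1234)
a = methods.shuffle
srand(1234)
b = methods.shuffle
puts a == b  # shuffle is reproducible given the same seed

srand(1234)
c = methods.sort.sort_by { rand methods.size }
srand(1234)
d = methods.sort.sort_by { rand methods.size }
puts c == d  # and so is the sort.sort_by { rand } version
```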

&lt;p&gt;Whenever I need to write a micro benchmark of Ruby code, I reach for
&lt;a href=&quot;https://github.com/evanphx/benchmark-ips&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;benchmark/ips&lt;/code&gt;&lt;/a&gt;.&lt;sup class=&quot;sidenote-number&quot;&gt;1&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(The reason I use benchmark/ips rather than the stdlib benchmark is because
the stdlib version requires you to run a certain line of code X number of times
and tells you how long that took. The problem with that is that I don’t usually
know how fast the code is to begin with, so I have no idea how to set X. Usually
I run the code a few times, guess at a number of X that will make the benchmark
take 10 seconds to run, and then move on. benchmark/ips does that work for me by
running my benchmark for 10 seconds and calculating iterations-per-second.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;1&lt;/sup&gt; The reason I use benchmark/ips rather than the stdlib benchmark is because
the stdlib version requires you to run a certain line of code X number of times
and tells you how long that took. The problem with that is that I don’t usually
know how fast the code is to begin with, so I have no idea how to set X. Usually
I run the code a few times, guess at a number of X that will make the benchmark
take 10 seconds to run, and then move on. benchmark/ips does that work for me by
running my benchmark for 10 seconds and calculating iterations-per-second.&lt;/span&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ips&lt;/code&gt; stands
for iterations-per-second. The gem is an extension of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Benchmark&lt;/code&gt; module,
something we get in the Ruby stdlib.&lt;/p&gt;

&lt;p&gt;Here’s that benchmark:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;benchmark/ips&quot;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;TestBench&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;methods&lt;/span&gt;
    &lt;span class=&quot;vi&quot;&gt;@methods&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;a&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;..&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;z&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;to_a&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fast&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;methods&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;shuffle&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;slow&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;methods&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;size&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;methods&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sort_by&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;rand&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;test&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;TestBench&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;new&lt;/span&gt;

&lt;span class=&quot;no&quot;&gt;Benchmark&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;ips&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;report&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;faster alternative&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;fast&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;report&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;current minitest code&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;slow&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;compare!&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This benchmark suggested that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shuffle&lt;/code&gt; was 12x faster than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sort.sort_by {
rand methods.size }&lt;/code&gt;. This makes sense - &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shuffle&lt;/code&gt; randomizes the array with C,
which will always be faster than randomizing it with pure Ruby. In addition,
Ryan was actually sorting the array twice - once in alphabetical order, followed
by a random shuffle based on the output of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rand&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/ryans_talk.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=5KVcsV_jseQ&quot;&gt;Ryan’s conference talks&lt;/a&gt; are pretty good, too.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;I asked Ryan Davis, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;minitest&lt;/code&gt; author, what was up with this. He gave me a great
reply: “you benchmarked it, but did you profile it?”&lt;/p&gt;

&lt;p&gt;What did he mean by this? Well, first, you have to know the difference between
&lt;strong&gt;benchmarking and profiling - the two fundamental performance measurement tools.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are a lot of different ways to define this difference. Here’s my attempt:&lt;/p&gt;

&lt;h3 id=&quot;benchmarking&quot;&gt;Benchmarking&lt;/h3&gt;

&lt;p&gt;A benchmark is a test of one or many different pieces of code that measures
how fast they execute or how many resources they consume.&lt;/p&gt;

&lt;p&gt;When we benchmark, we take two competing pieces of code and compare them. It could be as simple as a one-liner, like in my story, or as
complex as an entire web framework. Then, we put them up against each other
(usually comparing them in terms of iterations/second) using a simple,
contrived task. At the end of the task, we come up with a single metric - a
score. We use the score to compare the two competing options.&lt;/p&gt;

&lt;p&gt;In my example
above, it was just how fast each line could shuffle an array. If you were
benchmarking web frameworks, you might test how fast a framework can return a
simple “Hello World” response. Benchmarks put the competing alternatives on
exactly equal footing by coming up with a contrived, simple, non-real-world
example.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/rails-sucks.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;a href=&quot;https://www.speedshop.co/2017/07/11/is-ruby-too-slow-for-web-scale.html&quot;&gt;I wrote a v v long post once about why this benchmark doesn’t mean much for Rails&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;It’s usually too difficult to benchmark real-world code because the
alternatives aren’t doing &lt;em&gt;exactly&lt;/em&gt; the same thing. For example, comparing
Rails against Sinatra isn’t entirely fair because Rails has many features that
Sinatra does not - even for a simple Hello World response, the Rails
application is, for example, performing many security checks that the Sinatra
app doesn’t. Comparing these frameworks in a 1-to-1 benchmark will always be
slightly misleading for that reason.&lt;/p&gt;

&lt;h3 id=&quot;profiling&quot;&gt;Profiling&lt;/h3&gt;

&lt;p&gt;Profiles are an accounting of all the sub-steps required to run a
given piece of code. When we profile, we’re
usually examining the performance characteristics of an entire, real-world
application. For example, this might be a web application or a test suite.
Because profiling works with real-world code, we can’t really use it to
compare competing alternatives, because the alternative usually doesn’t
exactly match what we’re profiling. Profiling doesn’t usually produce a
comparable “score” at the end with which to measure these alternatives,
either. But that’s not to say profiling is useless - it can tell us a lot of
valuable things, like what percentage of CPU time was used where, where memory
was allocated, and what lines of code are important and which ones aren’t.&lt;/p&gt;
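&lt;p&gt;As a toy illustration of the kind of breakdown a profile produces (a real profiler - stackprof, ruby-prof, and so on - does this automatically by sampling the call stack; the labels here are made up), here’s each region’s share of total runtime:&lt;/p&gt;

```ruby
# Toy sketch of what a profile reports: each labeled region's share of
# total runtime. A real tool samples the call stack for you instead of
# requiring manual labels like these.
require "benchmark"

timings = Hash.new(0.0)
profile = ->(label, &blk) { timings[label] += Benchmark.realtime(&blk) }

profile.call("sorting")     { 30.times { (1..5_000).to_a.shuffle.sort } }
profile.call("string work") { 30.times { "x" * 100_000 } }

total = timings.values.sum
timings.each do |label, secs|
  puts format("%-12s %5.1f%%", label, secs / total * 100)
end
```

&lt;p&gt;The percentages, not the absolute times, are what tell you where the crucial 20% lives.&lt;/p&gt;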

&lt;p&gt;What Ryan was asking me was - “Yeah, that way is faster on this one line, but
does it really matter in the grand scheme of Minitest?” How much time does a
Minitest test run actually spend shuffling the methods? 1%? 10%? 0.001%? Profiling
can tell us that.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/thatwasalie.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;You said that this one-line change would speed up minitest. A higher-level benchmark determined &lt;em&gt;that&lt;/em&gt; was a lie.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Is this one line really part of Pareto’s “20%”? We can assume, based on the
Principle, that 80% of Minitest’s execution time will come from just 20% of its
code. Was this line part of that 20%?&lt;/p&gt;

&lt;p&gt;I’ve already shown you how to benchmark on the micro scale. But before we get to
profiling, I’m going to do a quick macro-benchmark to test my assumption that
using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shuffle&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sort.sort_by&lt;/code&gt; will speed up Minitest.&lt;/p&gt;

&lt;p&gt;Minitest is used to run tests, so we’re going to benchmark a whole test suite.
&lt;a href=&quot;https://github.com/rubygems/rubygems.org/&quot;&gt;Rubygems.org&lt;/a&gt;, an open-source Rails application with a Minitest suite, will make a good example test suite.&lt;/p&gt;

&lt;p&gt;When micro-benchmarking, I reach for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;benchmark-ips&lt;/code&gt;. When macro-benchmarking
(and especially in this case, with a test suite), I usually reach first for the
simplest tool available: the unix utility &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;time&lt;/code&gt;! We’re going to run the tests
10 times, and then divide the total time by 10.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ time for i in {1..10}; do bundle exec rake; done

...

real	15m59.384s
user	11m39.100s
sys	1m15.767s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;time&lt;/code&gt;, we’re usually only going to pay attention to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user&lt;/code&gt;
statistic. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;real&lt;/code&gt; gives the actual total time (as if you had used a stopwatch),
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sys&lt;/code&gt; gives the time spent in the kernel (in a test run, this would be things
like shelling out to I/O), and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user&lt;/code&gt; will be the closest approximation to time
actually spent running Ruby. You’ll notice that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sys&lt;/code&gt; don’t add up to
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;real&lt;/code&gt; - the difference is time spent waiting on the CPU while other operations
(like running my web browser, etc) block.&lt;/p&gt;
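&lt;p&gt;Incidentally, Ruby’s stdlib &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Benchmark&lt;/code&gt; module reports the same three-way split for a block of Ruby code, if you’d rather stay inside the language:&lt;/p&gt;

```ruby
# Benchmark.measure reports the same user/system/real split as the unix
# `time` utility, but for a single block of Ruby code.
require "benchmark"

t = Benchmark.measure do
  100_000.times { "abc".upcase }  # pure Ruby work, so mostly user time
end

puts format("user: %.3fs  sys: %.3fs  real: %.3fs", t.utime, t.stime, t.real)
```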

&lt;p&gt;With stock &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;minitest&lt;/code&gt;, the whole thing takes 11 minutes and 39 seconds, for an
average of 69.9 seconds per run. Now, let’s alter the Gemfile to point to a
modified version (with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shuffle&lt;/code&gt; on the line in question) of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;minitest&lt;/code&gt; on my
local machine:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;gem&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'minitest'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;require: &lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;path: &lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'../minitest'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To make sure the test is 100% fair, I only make the change to my local version
after I check out &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;minitest&lt;/code&gt; to the same version that Rubygems.org is running
(5.8.1).&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/computers.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Even so-called performance experts mess this shit up sometimes.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The result? 11 minutes 56 seconds. Longer than the original test! We know my
code is faster in micro, but the macro benchmark told me that it actually takes
longer. A lot of things can cause this (the most likely being other stuff
running on my machine), but what’s clear is this - my little patch doesn’t seem
to be making a big difference to the big picture of someone’s test suite. While
making this change &lt;em&gt;would&lt;/em&gt;, in &lt;em&gt;theory&lt;/em&gt;, speed up someone’s suite, in reality,
the impact is so minuscule that it didn’t really matter.&lt;/p&gt;

&lt;p&gt;So, while a benchmark told me one thing - X is 12x faster than Y! - a higher-level
benchmark told me another: in the context of a whole test suite, the change didn’t really matter.
Not only does this show the value of profiling (which would have told me up front that
the sorting didn’t take much of the total time) but also how microbenchmarks and relative
comparisons can mislead.&lt;/p&gt;

&lt;p&gt;Performance measurement is a critical skill. Anywhere along the way, I could have been misled by a single number or a rogue measurement. By applying a scientific, empirical approach, I was able to put my benchmark in the context of a larger program.&lt;/p&gt;

&lt;p&gt;Premature optimization is ignoring these lessons - optimizing “when we feel like it”, or optimizing everything, all the time. Hopefully I’ve convinced you: it’s a guaranteed waste of time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repeat after me: I will not optimize anything in my application until my
metrics tell me so.&lt;/strong&gt;&lt;/p&gt;
</description>
        <pubDate>Sun, 22 Dec 2019 07:00:00 +0000</pubDate>
        <link>https://www.speedshop.co/2019/12/22/why-premature-optimization-is-bad.html</link>
        <guid isPermaLink="true">https://www.speedshop.co/2019/12/22/why-premature-optimization-is-bad.html</guid>
        
        
      </item>
    
      <item>
        <title>Why Your Rails App is Slow: Lessons Learned from 3000+ Hours of Teaching</title>
        <description>&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/setofskills.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;“What I do have is a particular set of skills, a set of skills which makes me a nightmare for slow Rails applications like you.”&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;For the last 4 years, I’ve been working on making Rails applications faster and more scalable. I &lt;a href=&quot;https://speedshop.co/workshops.html&quot;&gt;teach workshops&lt;/a&gt;, I &lt;a href=&quot;https://www.railsspeed.com/&quot;&gt;sell a course&lt;/a&gt;, and I &lt;a href=&quot;https://speedshop.co/tune.html&quot;&gt;consult&lt;/a&gt;. If you do anything for a long period of time, you start to see patterns. I’ve noticed four different factors that prevent software organizations from improving the performance of their Rails applications, and I’d like to share them here.&lt;/p&gt;

&lt;h2 id=&quot;performance-becomes-a-luxury-good-especially-when-no-one-is-watching&quot;&gt;Performance becomes a luxury good, especially when no one is watching&lt;/h2&gt;

&lt;p&gt;Oftentimes at my &lt;a href=&quot;https://speedshop.co/workshops.html&quot;&gt;Rails Performance Workshop&lt;/a&gt;, I discover that an attendee simply has no visibility into what their application is doing in production - they either don’t understand their dashboards, they don’t have them, or they’re not allowed to even access them (“DevOps team only”).&lt;/p&gt;

&lt;p&gt;Performance metrics are often just not tracked. No one is aware if the app is over or underscaled, no one knows if the app is “slow” or “fast”. Is it any wonder, then, that no one spends any time working on it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance is rarely the first priority of any organization, and often gets “trickled down” hours and resources&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/workcleanfast.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Some of this is actually a good thing. There’s a reason that the classic programming mantra of “make it work, make it clean, make it fast” is in that order and not the opposite way around. People pay for software that does stuff. If it does that stuff quickly and in a pleasantly performant way, then that’s great, but it’s not always required (especially if the organization is first to market in their space and customers have no other options).&lt;/p&gt;

&lt;p&gt;Often, my consulting clients are at a point in their organization where they’re no longer scraping by on ramen and cheeto dust, but have a solid business that’s expanding (slowly or quickly). They’ve finally gotten their heads above water and they’re ready to start thriving, not just surviving. People don’t come to me when they’re still trying to achieve product market fit unless things have become untenable, and then that’s more of a rescue job.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/travoltawallet.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;When the time comes to set the budget on performance instead of feature velocity.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;This is a natural and correct progression. It also means organizations accumulate performance debt during that initial period of building and obtaining product-market fit. I think there may be some that believe that all kinds of technical debt are some kind of sinful black stain on any organization, and that if you just Coded The Right Way or were a Software Craftsperson™ this would not have happened. I think that’s probably wrong. Before achieving ramen profitability, businesses must take out technical debt as a kind of financing of their own product development runway. This will happen regardless of one’s coding techniques or knowledge level.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/pmburn.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;However, performance is not a luxury good. It isn’t something that can simply be ignored until one’s organization has a spare four or five figures in the couch cushions. Like technical debt, there is a point when feature work grinds to a halt because the organization is too busy servicing the performance debt that has accrued. Requests are timing out. Customers are complaining about how slow the app feels and switching to competitors. You’re scared to check the AWS bill.&lt;/p&gt;

&lt;p&gt;Ideally, organizations monitor and sensibly take out performance debt when required, and understand the full extent of the work that must be done in the future.&lt;/p&gt;

&lt;p&gt;To do this sort of “sensible debt accrual”, &lt;strong&gt;you need performance monitoring/metrics and you need to understand how to present numbers to management&lt;/strong&gt;. I find that while most shops subscribe to a performance monitoring service, such as New Relic, Skylight or Scout, they often have no idea how to read it and extract useful insights from it, making it a very expensive average-latency monitor. Being able to actually use your APM is a critical performance skill that I cover in great detail in my workshops and course.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/scoutexample.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;If you can’t draw insights from this, you’re just throwing cash out the door.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Monitoring these metrics allows you to assess where you’re at and to figure out what parts of the application have accrued performance debt. It also helps you to make decisions on the “cost/benefit” of future work.&lt;/p&gt;

&lt;p&gt;It also means you need to be able to “speak manager” or “speak business”. The business case for adding more features is obvious to the non-technical side of your organization. Fortunately, there is a great business case for performance too, both &lt;a href=&quot;https://wpostats.com/&quot;&gt;from the side of the customer&lt;/a&gt; and from the cost side - reducing average latency by 50% means you can spend 50% less on your application’s servers, thanks to queueing theory and something called &lt;a href=&quot;https://en.wikipedia.org/wiki/Little's_law&quot;&gt;Little’s Law&lt;/a&gt;.&lt;/p&gt;
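&lt;p&gt;To make that concrete: Little’s Law says the average number of requests in flight equals throughput multiplied by average latency, and a single-threaded app server process can only work on one request at a time. Here’s a back-of-the-envelope sketch - the traffic numbers are invented for illustration:&lt;/p&gt;

```ruby
# Little's Law: requests_in_flight = throughput * average_latency.
# A single-threaded app server process handles one request at a time,
# so requests_in_flight approximates how many processes you must run.
def processes_needed(throughput_rps, avg_latency_seconds)
  (throughput_rps * avg_latency_seconds).ceil
end

puts processes_needed(100, 0.250) # 100 req/sec at 250ms -> 25 processes
puts processes_needed(100, 0.125) # same traffic at 125ms -> 13 processes
```

&lt;p&gt;Halve the average latency and, at the same throughput, you need roughly half the processes - which is your server bill shrinking.&lt;/p&gt;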

&lt;p&gt;At my workshops, I spend a lot of time simply discussing terminology, like request queueing, latency, throughput, tracing and profiling. Giving people the vocabulary they need to understand the tools out there seems to be half the battle of getting everyone comfortable reading their own metrics.&lt;/p&gt;

&lt;h2 id=&quot;complex-apps-and-complex-problems-with-little-training&quot;&gt;Complex apps and complex problems, with little training&lt;/h2&gt;

&lt;p&gt;This leads me to the second cause of performance problems in software - a simple lack of knowledge. We can’t optimize what we don’t understand and we can’t fix what we can’t see.&lt;/p&gt;

&lt;p&gt;I wrote the &lt;a href=&quot;https://railsspeed.com&quot;&gt;Complete Guide to Rails Performance&lt;/a&gt; simply because there was so much information about this topic that had never been compiled into one place before.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/confusedscaleman.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;“What’s request queueing?”&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;This shows itself most when scaling for throughput. Most organizations simply aren’t tracking critical scaling metrics, or don’t even know what those metrics are, often because they believe the platform-as-a-service that they’re using should “take care of this” for them. By the time I’ve been called in, they’re spending thousands of dollars a month more than they need to, and could have fixed this months or even years ago with some simple autoscaling policies and a bit of organizational knowledge around scaling. Or the flipside is happening and they’re massively under-scaled, with 25-50% of their total request latency being just time spent queueing for resources.&lt;/p&gt;
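&lt;p&gt;A “simple autoscaling policy” can be as small as proportional scaling toward a utilization target. This is a generic sketch for intuition, not any particular platform’s API - the 70% target and the helper name are invented:&lt;/p&gt;

```ruby
# Proportional autoscaling: size the fleet so observed utilization
# moves toward the target. desired = current * observed / target.
def desired_count(current_count, observed_utilization, target_utilization = 0.7)
  scaled = (current_count * observed_utilization / target_utilization).ceil
  [scaled, 1].max # never scale the web tier to zero
end

puts desired_count(8, 0.9)  # running hot -> 11, scale out
puts desired_count(10, 0.3) # mostly idle -> 5, scale in
```

&lt;p&gt;Run that arithmetic in reverse and you get the under-scaling story: a fleet held too small keeps requests queueing instead of being served.&lt;/p&gt;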

&lt;p&gt;Performance work is not rocket science. However, unlike a lot of other areas in software&lt;sup class=&quot;sidenote-number&quot;&gt;1&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(The only other area in software that requires an even wider base of knowledge is security. Consider &lt;a href=&quot;https://en.wikipedia.org/wiki/Row_hammer&quot;&gt;Rowhammer&lt;/a&gt; - basically an electrical engineering exploit in very particular configurations of DRAM.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;1&lt;/sup&gt; The only other area in software that requires an even wider base of knowledge is security. Consider &lt;a href=&quot;https://en.wikipedia.org/wiki/Row_hammer&quot;&gt;Rowhammer&lt;/a&gt; - basically an electrical engineering exploit in very particular configurations of DRAM.&lt;/span&gt;, it can require an extremely broad base of knowledge. When your customer says the site “feels slow”, the problem can quite reasonably be almost anywhere between the pixels on the user’s screen (say, an issue with the customer’s client machine) and the electrons running through the silicon on your cloud service provider (for example, a mitigation for a recent Intel security issue puts your servers above capacity). Feature work, and even to a large extent refactoring work, generally only requires knowledge of the language and frameworks in use. Performance work often needs esoteric knowledge from other fields (such as queueing theory) in addition to highly in-depth knowledge of your frameworks and language.&lt;sup class=&quot;sidenote-number&quot;&gt;2&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(I wrote a &lt;a href=&quot;https://www.speedshop.co/2019/01/10/three-activerecord-mistakes.html&quot;&gt;3000+ word blog&lt;/a&gt; about the critical performance differences between the English-language synonyms .present? and .exists? in Rails, for example, but my &lt;a href=&quot;https://railsspeed.com&quot;&gt;Rails performance course&lt;/a&gt; spends the majority of the time talking about things which are not Ruby-specific.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;2&lt;/sup&gt; I wrote a &lt;a href=&quot;https://www.speedshop.co/2019/01/10/three-activerecord-mistakes.html&quot;&gt;3000+ word blog&lt;/a&gt; about the critical performance differences between the English-language synonyms .present? and .exists? in Rails, for example, but my &lt;a href=&quot;https://railsspeed.com&quot;&gt;Rails performance course&lt;/a&gt; spends the majority of the time talking about things which are not Ruby-specific.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/debuggingrails.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Looking at a flamegraph of a Rails app for the first time often leads to this reaction.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;This depth of knowledge simply isn’t present in many organizations, especially those who place sprint velocity before the development of engineering capacity and skills in the organization.&lt;/p&gt;

&lt;p&gt;The workshops I’ve been doing have really allowed me to go in deep on complex problems and help people deal with the “wrinkles” introduced by their application. Getting to look over people’s shoulders while they experience an error or something I hadn’t anticipated has been very rewarding, both for them and for me as an educator.&lt;/p&gt;

&lt;p&gt;Also, during those workshops, I don’t emphasize “pre-baked” problem/solutions, but instead have the attendees bring their real world applications, and we immediately try to apply what we’ve learned on their actual apps right then and there. I don’t want anyone to go home and run into a problem caused by the complexity of their app - rather, I’d like that to happen while we’re both in the same room!&lt;/p&gt;

&lt;h2 id=&quot;boiling-frogs---even-when-tracked-performance-slips-without-fix&quot;&gt;Boiling frogs - even when tracked, performance slips without fix&lt;/h2&gt;

&lt;p&gt;In the Slack channel for the Complete Guide to Rails Performance, we’ve had a few conversations about managing performance work in the software organization.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/elmoflames.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Walking into the office on Monday like&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;An organizational culture that always places completeness over quality inevitably runs into issues. Often when I get new clients, they’re experiencing not just performance issues but have problems with all the various dimensions of software quality: low correctness (an excess of bugs and lack of test coverage), high complexity (“technical debt”, spaghetti organization), and a poor deployment pipeline (broken builds, janky deploys). These aspects of software quality tend to either all be good or all be bad. Project management can (and often should) sacrifice quality for a period of time to prioritize completeness and features, but when it’s done pathologically, it inevitably leads to ruin.&lt;/p&gt;

&lt;p&gt;I find that the lack of software quality culture often arises because no one is measuring it&lt;sup class=&quot;sidenote-number&quot;&gt;3&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(I actually really don’t vibe with the ‘software craftsperson’ aesthetic that people like Uncle Bob try to push. Quality is great but it isn’t everything. It’s possible to turn this into navel-gazing and ivory-tower building.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;3&lt;/sup&gt; I actually really don’t vibe with the ‘software craftsperson’ aesthetic that people like Uncle Bob try to push. Quality is great but it isn’t everything. It’s possible to turn this into navel-gazing and ivory-tower building.&lt;/span&gt;. Feature velocity is measured, or at least vaguely tracked, with things like pull request counts, sprint points, or user stories. We shipped 5 stories last week, so management expects us to ship 5 this week.&lt;/p&gt;

&lt;p&gt;Fortunately, many software quality measures are actually very easy to track. How many bugs were reported or experienced by customers last week? How much downtime did we have? How many deploys were there? Are these numbers rising or falling?&lt;/p&gt;

&lt;p&gt;In terms of performance, most organizations would benefit from setting simple thresholds that, if exceeded, move performance work into the “bug fixing” pipeline that the organization employs. For example, an organization can commit to a maximum 95th percentile latency of 1 second. If a transaction&lt;sup class=&quot;sidenote-number&quot;&gt;4&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(In New Relic parlance - a single controller action is a ‘transaction’.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;4&lt;/sup&gt; In New Relic parlance - a single controller action is a ‘transaction’.&lt;/span&gt; exceeds that threshold, a new bug is recorded.&lt;/p&gt;
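&lt;p&gt;Mechanically, such a policy is just a comparison between a measured percentile and the agreed budget. Here’s a sketch with invented timing data - the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;percentile&lt;/code&gt; helper is a simple nearest-rank approximation, and the actual ticket filing is left as a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;puts&lt;/code&gt;:&lt;/p&gt;

```ruby
LATENCY_BUDGET_P95 = 1.0 # seconds, agreed upon with the business beforehand

# Nearest-rank percentile: good enough for a budget check.
def percentile(samples, pct)
  sorted = samples.sort
  sorted[((pct / 100.0) * (sorted.length - 1)).round]
end

# Returns the transactions whose p95 latency exceeds the budget.
def violations(timings_by_transaction, budget = LATENCY_BUDGET_P95)
  timings_by_transaction.select { |_name, samples| percentile(samples, 95) > budget }.keys
end

# Invented sample data: per-transaction response times in seconds.
timings = {
  "PostsController#index" => [0.2, 0.3, 0.4, 0.5, 2.1], # one terrible outlier
  "PostsController#show"  => [0.1, 0.1, 0.2, 0.2, 0.3],
}

violations(timings).each { |name| puts "File a bug: #{name} exceeded the p95 budget" }
```

&lt;p&gt;Run something like this against your APM’s data on a schedule and the “performance is a bug” policy enforces itself.&lt;/p&gt;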

&lt;p&gt;For organizations that want to improve the customer’s experience and perceived performance of the application, other budgets may be necessary. For example, a first-page-load time of 5 seconds. This page load target has implications that flow down throughout the stack, as one simply cannot ship 10 megabytes of JavaScript and also have a page load in 5 seconds&lt;sup class=&quot;sidenote-number&quot;&gt;5&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(In fact, I would estimate that to keep page load times below 5 seconds on the average connection and hardware, you can probably ship only a few hundred KB)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;5&lt;/sup&gt; In fact, I would estimate that to keep page load times below 5 seconds on the average connection and hardware, you can probably ship only a few hundred KB&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;Software engineers are often poor communicators, and they very often fail to communicate to other parts of the organization that prioritizing feature velocity at all costs is not sustainable.&lt;/p&gt;

&lt;p&gt;Think of it this way: how do you think the project managers in your organization would answer the following questions?&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/bezos.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Bezos showering in your AWS bill&lt;/span&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Is your tolerance for the slowness of our application infinite? (i.e. can the app just beachball for all customers all the time?)&lt;/li&gt;
  &lt;li&gt;Do you have infinite money to spend on our EC2 instance bills?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the answer to either of those questions is “no”, then it is &lt;strong&gt;your job as a software developer&lt;/strong&gt; to find and make explicit those tolerances. They will be different for every organization. These performance requirements can be easily translated into automated alerts and thresholds.&lt;/p&gt;

&lt;p&gt;You just have to have the conversation beforehand. The difference between “Hey boss - we’ve been shipping 12 points a week for the last 8 weeks and now we can’t ship anything for 6 weeks because we need to write tests and make the homepage load time somewhat bearable” and “we exceeded the limit for page load time that we all agreed upon 6 months ago, and we’ll need to reduce velocity for a while to compensate” is miles apart.&lt;/p&gt;

&lt;p&gt;As a result of seeing this pattern often enough, I’ve changed how I phrase my consulting deliverables, as I now realize I need to provide ammunition for the engineers when bringing back my recommendations to the “business side”.&lt;/p&gt;

&lt;h2 id=&quot;its-not-ruby-and-it-isnt-really-rails&quot;&gt;It’s not Ruby, and it isn’t (really) Rails&lt;/h2&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/leavematzalone.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;LEAVE MATZ ALONE&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;And, finally, here’s what isn’t the reason why your web application is slow: your framework or language choice. Once 90th percentile latency is lower than 500 milliseconds and median latency is below 100 milliseconds, most web application backends are no longer the bottleneck in their customer’s experience (if they ever were to begin with, which, in the age of 10 megabyte JavaScript bundles, they are usually not).&lt;/p&gt;

&lt;p&gt;It’s 2019 and web applications don’t return flat HTML files anymore&lt;sup class=&quot;sidenote-number&quot;&gt;6&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(CNN.com took 5MB of resources and 112 requests to render for me, today. R.I.P. the old light web.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;6&lt;/sup&gt; CNN.com took 5MB of resources and 112 requests to render for me, today. R.I.P. the old light web.&lt;/span&gt;. Websites are gargantuan, with JavaScript bundles stretching into the megabytes and stylesheets that couldn’t fit in ten Apollo Guidance Computers. So how much of a difference does a web application which responds in 1 millisecond or less make in this environment?&lt;/p&gt;

&lt;p&gt;Vanishingly little. Nowadays, the average webpage takes 5 seconds to render. Some JavaScript single-page-applications can take 12 seconds or more on initial render.&lt;/p&gt;

&lt;p&gt;Server response times simply make up a minority of the actual user experience of loading and interacting with a webpage - cutting 99 milliseconds off the server response time just doesn’t make a difference.&lt;/p&gt;

&lt;p&gt;Not to mention: if Ruby on Rails, frequently maligned as “too slow” or unable to scale, can run several of the top 1000 websites in the world by traffic, including that little fly-by-night outfit called GitHub, then it’s a fine choice for whatever your application is. Rails is just an example here - there are many comparable frameworks in comparable languages that you could substitute, like Django and Python. There are some web applications for which 100 milliseconds of latency is an unacceptable eon (advertising is the most common case), but for the vast majority of us delivering HTML or JSON to a client, that’s zippy quick.&lt;/p&gt;

&lt;h2 id=&quot;whither-rails-today&quot;&gt;Whither Rails Today?&lt;/h2&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/course.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;I’m doing a &lt;a href=&quot;https://speedshop.co/workshops.html&quot;&gt;Rails Perf workshop tour&lt;/a&gt; this summer in the US of A.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;One of the questions I ask in my post-workshop survey is “How do you feel writing Ruby on Rails? Would you like to keep doing it?”.&lt;/p&gt;

&lt;p&gt;The answers I get back are always astoundingly positive. For all the FUD on the web at large, people writing Ruby are incredibly happy doing it. And that’s what keeps me writing and teaching: as long as people feel that performance concerns are keeping them from enjoying Ruby or choosing it as their tech stack, I’ll keep doing what I do.&lt;/p&gt;
</description>
        <pubDate>Mon, 17 Jun 2019 07:00:00 +0000</pubDate>
        <link>https://www.speedshop.co/2019/06/17/what-i-learned-teaching-rails-performance.html</link>
        <guid isPermaLink="true">https://www.speedshop.co/2019/06/17/what-i-learned-teaching-rails-performance.html</guid>
        
        
      </item>
    
      <item>
        <title>3 ActiveRecord Mistakes That Slow Down Rails Apps: Count, Where and Present</title>
        <description>&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/weirdriddles.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;“When does ActiveRecord execute queries? No one knows!”&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;ActiveRecord is great. Really, it is. But it’s an abstraction, intended to insulate you from the actual SQL queries being run on your database. And, if you don’t understand how ActiveRecord works, you may be causing SQL queries to run that you didn’t intend to.&lt;/p&gt;

&lt;p&gt;Unfortunately, the performance costs of many ActiveRecord features mean we can’t afford to ignore unnecessary usage or treat our ORM as just an implementation detail. We need to understand exactly what queries are being run on our performance-sensitive endpoints. Freedom isn’t free, and neither is ActiveRecord.&lt;/p&gt;

&lt;p&gt;One particular case of ActiveRecord misuse that I find common amongst my clients is ActiveRecord executing SQL queries that aren’t really necessary. Most of my clients are completely unaware that this is even happening.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/dirtythree.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Unnecessary SQL is a frequent cause of overly slow controller actions, especially when the unnecessary query appears in a partial which is rendered for every element in a collection, as often happens in search or index actions. This is one of the most common problems I encounter in my performance consulting. It’s a problem in nearly every app I’ve ever worked on.&lt;/p&gt;

&lt;p&gt;One way to eliminate unnecessary queries is to poke our heads into ActiveRecord and understand its internals, and know exactly how certain methods are implemented. &lt;strong&gt;Today, we’re going to look at the implementation and usage of three methods which cause lots of unnecessary queries in Rails applications: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;where&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h2 id=&quot;how-do-i-know-if-a-query-is-unnecessary&quot;&gt;How Do I Know if a Query is Unnecessary?&lt;/h2&gt;

&lt;p&gt;I have a rule of thumb to judge whether or not any particular SQL query is unnecessary. Ideally, a Rails controller action should execute &lt;strong&gt;one SQL query per table&lt;/strong&gt;. If you’re seeing more than one SQL query per table, you can usually find a way to reduce that to one or two queries. If you’ve got more than a half-dozen or so queries on a single table, you almost definitely have unnecessary queries. &lt;sup class=&quot;sidenote-number&quot;&gt;1&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(Please don’t email or tweet me with ‘Well ackshually…’ on this one. It’s a guideline, not a rule, and I understand there are circumstances where more than one query per table is a good idea.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;1&lt;/sup&gt; Please don’t email or tweet me with ‘Well ackshually…’ on this one. It’s a guideline, not a rule, and I understand there are circumstances where more than one query per table is a good idea.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The number of SQL queries per table can be easily seen in New Relic, for example, if you have it installed.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/post/img/nplusoneposts.png&quot; loading=&quot;lazy&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/washeyes.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;I keep an eyewash station next to my desk for really bad N+1s&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Another rule of thumb is that &lt;strong&gt;most queries should
execute during the first half of a controller action’s response, and almost never during partials&lt;/strong&gt;. Queries executed during partials are usually unintentional, and are often N+1s. These are easy to spot during a controller’s execution if you just read the logs in development mode. For example, if you see this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;User Load (0.6ms)  SELECT  &quot;users&quot;.* FROM &quot;users&quot; WHERE &quot;users&quot;.&quot;id&quot; = $1 LIMIT 1  [[&quot;id&quot;, 2]]
Rendered posts/_post.html.erb (23.2ms)
User Load (0.3ms)  SELECT  &quot;users&quot;.* FROM &quot;users&quot; WHERE &quot;users&quot;.&quot;id&quot; = $1 LIMIT 1  [[&quot;id&quot;, 3]]
Rendered posts/_post.html.erb (15.1ms)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;… you have an N+1 in this partial.&lt;/p&gt;

&lt;p&gt;Usually, when a query is executed halfway through a controller action (somewhere deep in a partial, for example) it means that you haven’t &lt;a href=&quot;https://api.rubyonrails.org/classes/ActiveRecord/QueryMethods.html#method-i-preload&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preload&lt;/code&gt;ed&lt;/a&gt; the data that you needed.&lt;/p&gt;

&lt;p&gt;So, let’s look specifically at the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;where&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt; methods, and why they cause unnecessary SQL queries.&lt;/p&gt;

&lt;h2 id=&quot;count-executes-a-count-every-time&quot;&gt;.count executes a COUNT every time&lt;/h2&gt;

&lt;p&gt;I see this one at almost every company I contract for. It seems to be little-known that calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; on an ActiveRecord relation will &lt;em&gt;always&lt;/em&gt; try to execute a SQL query, every time. This is inappropriate in most scenarios, but, in general, &lt;strong&gt;only use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; if you want to always execute a SQL COUNT &lt;em&gt;right now&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/count.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;“How many queries do we want per table?”&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The most common cause of unnecessary &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; queries is when you &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; an association you will use later in the view (or have already used):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# _messages.html.erb
# Assume @messages = user.messages.unread, or something like that

&amp;lt;h2&amp;gt;Unread Messages: &amp;lt;%= @messages.count %&amp;gt;&amp;lt;/h2&amp;gt;

&amp;lt;% @messages.each do |message| %&amp;gt;
blah blah blah
&amp;lt;% end %&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This executes 2 queries, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;COUNT&lt;/code&gt; and a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT&lt;/code&gt;. The COUNT is executed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@messages.count&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@messages.each&lt;/code&gt; executes a SELECT to load all the messages. Changing the order of the code in the partial and changing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt; eliminates the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;COUNT&lt;/code&gt; query completely and keeps the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;% @messages.each do |message| %&amp;gt;
blah blah blah
&amp;lt;% end %&amp;gt;

&amp;lt;h2&amp;gt;Unread Messages: &amp;lt;%= @messages.size %&amp;gt;&amp;lt;/h2&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Why is this the case? We need not look any further than &lt;a href=&quot;https://github.com/rails/rails/blob/94b5cd3a20edadd6f6b8cf0bdf1a4d4919df86cb/activerecord/lib/active_record/relation.rb#L210&quot;&gt;the actual method definition of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt; on ActiveRecord::Relation:&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# File activerecord/lib/active_record/relation.rb, line 210&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;size&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;loaded?&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;vi&quot;&gt;@records&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;length&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;ss&quot;&gt;:all&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/triggeredcount.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;If the relation is loaded (that is, the query that the relation describes has been executed and we have stored the result), we call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;length&lt;/code&gt; on the already loaded record array. &lt;a href=&quot;https://ruby-doc.org/core-2.5.0/Array.html#method-i-length&quot;&gt;That’s just a simple Ruby method on Array&lt;/a&gt;. If the ActiveRecord::Relation &lt;em&gt;isn’t&lt;/em&gt; loaded, we trigger a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;COUNT&lt;/code&gt; query.&lt;/p&gt;

&lt;p&gt;On the other hand, &lt;a href=&quot;https://github.com/rails/rails/blob/94b5cd3a20edadd6f6b8cf0bdf1a4d4919df86cb/activerecord/lib/active_record/relation/calculations.rb#L41&quot;&gt;here’s how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; is implemented&lt;/a&gt; (in ActiveRecord::Calculations):&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;column_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;nil&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;block_given?&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# ...&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;super&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

  &lt;span class=&quot;n&quot;&gt;calculate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;ss&quot;&gt;:count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;column_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And, of course, &lt;a href=&quot;https://github.com/rails/rails/blob/94b5cd3a20edadd6f6b8cf0bdf1a4d4919df86cb/activerecord/lib/active_record/relation/calculations.rb#L131&quot;&gt;the implementation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;calculate&lt;/code&gt;&lt;/a&gt; doesn’t memoize or cache anything, and executes a SQL calculation every time it is called.&lt;/p&gt;

&lt;p&gt;Simply changing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt; in our original example would have still triggered a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;COUNT&lt;/code&gt;. The records wouldn’t be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded?&lt;/code&gt; when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt; was called, so ActiveRecord would still execute a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;COUNT&lt;/code&gt;. Moving the method &lt;em&gt;after&lt;/em&gt; the records are loaded eliminates the query. Now, moving our header to the end of the partial doesn’t really make any logical sense. Instead, we can use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load&lt;/code&gt; method.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;h2&amp;gt;Unread Messages: &amp;lt;%= @messages.load.size %&amp;gt;&amp;lt;/h2&amp;gt;

&amp;lt;% @messages.each do |message| %&amp;gt;
blah blah blah
&amp;lt;% end %&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load&lt;/code&gt; just causes all of the records described by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@messages&lt;/code&gt; to load immediately, rather than lazily. &lt;a href=&quot;https://api.rubyonrails.org/classes/ActiveRecord/Relation.html#method-i-load&quot;&gt;It returns the ActiveRecord::Relation, not the records.&lt;/a&gt; So, when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt; is called, the records are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded?&lt;/code&gt; and a query is avoided. Voilà.&lt;/p&gt;

&lt;p&gt;What if, in that example, we used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@messages.load.count&lt;/code&gt;? We’d still trigger a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;COUNT&lt;/code&gt; query!&lt;/p&gt;
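
&lt;p&gt;To make that dispatch concrete, here’s a plain-Ruby sketch - the class below is purely illustrative (&lt;em&gt;not&lt;/em&gt; the real ActiveRecord internals), and it just records the SQL each method would issue:&lt;/p&gt;

```ruby
# Illustrative stand-in for ActiveRecord::Relation (NOT the real Rails
# implementation): it records every query it would issue, so we can see
# when count vs. size actually hits the database.
class FakeRelation
  attr_reader :queries

  def initialize(records)
    @records = records
    @loaded  = false
    @queries = []
  end

  def loaded?
    @loaded
  end

  # Like the real #load: fetches the records once and returns self.
  def load
    unless @loaded
      @queries << 'SELECT "messages".* FROM "messages"'
      @loaded = true
    end
    self
  end

  # count always executes a query, loaded or not.
  def count
    @queries << 'SELECT COUNT(*) FROM "messages"'
    @records.length
  end

  # size only queries when the records aren't loaded yet.
  def size
    loaded? ? @records.length : count
  end
end

messages = FakeRelation.new(%i[a b c])
messages.load.size # one SELECT, no COUNT
messages.count     # issues a COUNT even though the records are loaded
```

&lt;p&gt;The real &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt; makes the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded?&lt;/code&gt; check, which is why &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load.size&lt;/code&gt; costs one query while &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load.count&lt;/code&gt; costs two.&lt;/p&gt;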

&lt;p&gt;When &lt;em&gt;doesn’t&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; trigger a query? Only if the result has been cached by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ActiveRecord::QueryCache&lt;/code&gt;.&lt;sup class=&quot;sidenote-number&quot;&gt;2&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(I have some Opinions on the use of QueryCache, but that’s a post for another day.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;2&lt;/sup&gt; I have some Opinions on the use of QueryCache, but that’s a post for another day.&lt;/span&gt; This could occur by trying to run the same SQL query twice:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;h2&amp;gt;Unread Messages: &amp;lt;%= @messages.count %&amp;gt;&amp;lt;/h2&amp;gt;

... lots of other view code, then later:

&amp;lt;h2&amp;gt;Unread Messages: &amp;lt;%= @messages.count %&amp;gt;&amp;lt;/h2&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/pissed.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Every time you use count when you could have used size&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In my opinion, most Rails developers should be using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt; in most of the places that they use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt;.&lt;/strong&gt; I’m not sure why everyone seems to write &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt;. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt; falls back to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; where that’s appropriate, and skips the query when the records are already loaded. I think it’s because when you’re writing an ActiveRecord relation, you’re in the “SQL” mindset. You think: “This is SQL, I should write count because I want a COUNT!”&lt;/p&gt;

&lt;p&gt;So, when do you actually want to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt;? Use it when you won’t actually &lt;em&gt;ever&lt;/em&gt; be loading the full association that you’re &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt;ing. For example, take this view on Rubygems.org, which displays a single gem:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/posts/img/rspecview.png&quot; loading=&quot;lazy&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the “versions” list, the view does a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; to get the total number of releases (versions) of this gem.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/rubygems/rubygems.org/blob/d8a48488d29cbfc83efd2e936c74290c54041288/app/views/rubygems/show.html.erb#L36&quot;&gt;Here’s the actual code:&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;% if show_all_versions_link?(@rubygem) %&amp;gt;
  &amp;lt;%= link_to t('.show_all_versions', :count =&amp;gt; @rubygem.versions.count), rubygem_versions_url(@rubygem), :class =&amp;gt; &quot;gem__see-all-versions t-link--gray t-link--has-arrow&quot; %&amp;gt;
&amp;lt;% end %&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The thing is, this view &lt;em&gt;never&lt;/em&gt; loads &lt;em&gt;all&lt;/em&gt; of the Rubygem’s versions. It only loads five of the most recent ones, in order to show that versions list.&lt;/p&gt;

&lt;p&gt;So, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; makes perfect sense here. Even though &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt; would be logically equivalent (it would just execute a COUNT as well, because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@rubygem.versions&lt;/code&gt; is not &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded?&lt;/code&gt;), &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; states the intent of the code in a clear way.&lt;/p&gt;

&lt;p&gt;My advice is to grep through your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;app/views&lt;/code&gt; directory for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; calls and make sure that they actually make sense. If you’re not 100% sure that you really need a real SQL &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;COUNT&lt;/code&gt; right then and there, switch it to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt;. Worst case, ActiveRecord will still execute a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;COUNT&lt;/code&gt; if the association isn’t loaded. If you’re going to use the association later in the view, change it to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load.size&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;where-means-filtering-is-done-by-the-database&quot;&gt;.where means filtering is done by the database&lt;/h2&gt;

&lt;p&gt;What’s the problem with this code (let’s say it’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_post.html.erb&lt;/code&gt;)?&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;% @posts.each do |post| %&amp;gt;
  &amp;lt;%= post.content %&amp;gt;
  &amp;lt;%= render partial: :comment, collection: post.active_comments %&amp;gt;
&amp;lt;% end %&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;post.rb&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Post&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;ActiveRecord&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;Base&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;active_comments&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;comments&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;where&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;ss&quot;&gt;soft_deleted: &lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/whoaguy.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;If you said, “this causes a SQL query to be executed on every rendering of the post partial”, you’re correct! &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;where&lt;/code&gt; always causes a query. I didn’t even bother to write out the controller code, because &lt;em&gt;it doesn’t matter&lt;/em&gt;. You can’t use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;includes&lt;/code&gt; or other preloading methods to stop this query. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;where&lt;/code&gt; will always try to execute a query!&lt;/p&gt;

&lt;p&gt;This also happens when you call scopes on associations. Imagine instead our Comment model looked like this:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Comment&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;ActiveRecord&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;Base&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;belongs_to&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;:post&lt;/span&gt;

  &lt;span class=&quot;n&quot;&gt;scope&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;:active&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;where&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;ss&quot;&gt;soft_deleted: &lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;post.comments.active&lt;/code&gt; in the view has exactly the same problem: the scope executes a fresh query for every post. Allow me to sum this up with two rules: &lt;strong&gt;don’t call scopes on associations when you’re rendering collections&lt;/strong&gt;, and &lt;strong&gt;don’t put query methods, like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;where&lt;/code&gt;, in instance methods of an ActiveRecord::Base class&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Calling scopes on associations means we cannot preload the result. In the example above, we can preload the comments on a post, but we can’t preload the &lt;em&gt;active&lt;/em&gt; comments on a post, so we have to go back to the database and execute new queries for every element in the collection.&lt;/p&gt;

&lt;p&gt;This isn’t a problem when you only do it once, and not on every element of a collection (like every post, as above). Feel free to use scopes galore in those situations - for example, if this was a PostsController#show action that only displayed one post and its associated comments. But in collections, scopes on associations cause N+1s, every time.&lt;/p&gt;

&lt;p&gt;The best way I’ve found to fix this particular problem is to &lt;strong&gt;create a new association&lt;/strong&gt;. &lt;a href=&quot;https://www.justinweiss.com/&quot;&gt;Justin Weiss&lt;/a&gt;, of “Practicing Rails”, taught me this in &lt;a href=&quot;https://www.justinweiss.com/articles/how-to-preload-rails-scopes/&quot;&gt;this blog post about preloading Rails scopes&lt;/a&gt;. The idea is that you create a new association, which you &lt;em&gt;can&lt;/em&gt; preload:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Post&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;has_many&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;:comments&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;has_many&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;:active_comments&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;active&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;class_name: &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Comment&quot;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Comment&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;belongs_to&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;:post&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;scope&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;:active&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;where&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;ss&quot;&gt;soft_deleted: &lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;PostsController&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;index&lt;/span&gt;
    &lt;span class=&quot;vi&quot;&gt;@posts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;Post&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;includes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;ss&quot;&gt;:active_comments&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The view is unchanged, but now executes just 2 SQL queries, one on the Posts table and one on the Comments table. Nice!&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;% @posts.each do |post| %&amp;gt;
  &amp;lt;%= post.content %&amp;gt;
  &amp;lt;%= render partial: :comment, collection: post.active_comments %&amp;gt;
&amp;lt;% end %&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The second rule of thumb I mentioned, &lt;strong&gt;don’t put query methods, like where, in instance methods of an ActiveRecord::Base class&lt;/strong&gt;, may seem less obvious. Here’s an example:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Post&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;ActiveRecord&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;Base&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;has_many&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;:comments&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;latest_comment&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;comments&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;order&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'published_at desc'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;first&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What happens if the view looks like this?&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;% @posts.each do |post| %&amp;gt;
  &amp;lt;%= post.content %&amp;gt;
  &amp;lt;%= render post.latest_comment %&amp;gt;
&amp;lt;% end %&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/rules.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;That’s a SQL query on every post, regardless of what you preloaded. In my experience, &lt;strong&gt;every instance method on an ActiveRecord::Base class will eventually get called inside a collection&lt;/strong&gt;. Someone adds a new feature and isn’t paying attention. Maybe it’s by a different developer than the one who wrote the method originally, and they didn’t fully read the implementation. Ta-da, now you’ve got an N+1. The example I gave could be rewritten as an association, like I described earlier. That can still cause an N+1, but at least it can be fixed easily with the correct preloading.&lt;/p&gt;
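
&lt;p&gt;Here’s the shape that association rewrite could take. This is a sketch: the association name and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;published_at&lt;/code&gt; ordering come from the example above, and the stub base class exists only so the snippet stands alone without Rails:&lt;/p&gt;

```ruby
# Minimal stand-in for ActiveRecord::Base that just records which
# association macros were declared. In a real app you'd inherit from
# the real ActiveRecord::Base; this stub is only here so the sketch
# runs outside of Rails.
module ActiveRecord
  class Base
    def self.associations
      @associations ||= []
    end

    def self.has_many(name, scope = nil, **opts)
      associations << [:has_many, name, opts]
    end

    def self.has_one(name, scope = nil, **opts)
      associations << [:has_one, name, opts]
    end
  end
end

# Post#latest_comment rewritten as a preloadable has_one association:
class Post < ActiveRecord::Base
  has_many :comments
  has_one  :latest_comment, -> { order(published_at: :desc) },
           class_name: "Comment"
end

# In the controller, the association can now be preloaded:
#   @posts = Post.includes(:latest_comment)
```

&lt;p&gt;The N+1 is still possible if you forget the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;includes&lt;/code&gt;, but at least now it’s fixable without touching the view.&lt;/p&gt;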

&lt;p&gt;Which ActiveRecord methods should we &lt;em&gt;avoid&lt;/em&gt; inside of our ActiveRecord model instance methods? Generally, it’s pretty much everything in the &lt;a href=&quot;https://api.rubyonrails.org/classes/ActiveRecord/QueryMethods.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QueryMethods&lt;/code&gt;&lt;/a&gt;, &lt;a href=&quot;https://api.rubyonrails.org/classes/ActiveRecord/FinderMethods.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FinderMethods&lt;/code&gt;&lt;/a&gt;, and &lt;a href=&quot;https://api.rubyonrails.org/classes/ActiveRecord/Calculations.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Calculations&lt;/code&gt;&lt;/a&gt;. Any of these methods will usually &lt;em&gt;try&lt;/em&gt; to run a SQL query, and are resistant to preloading. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;where&lt;/code&gt; is the most frequent offender, however.&lt;/p&gt;

&lt;h2 id=&quot;any-exists-and-present&quot;&gt;any?, exists? and present?&lt;/h2&gt;

&lt;p&gt;Rails programmers have been struck by a major affliction - they’re adding a particular predicate method to just about every variable in their applications. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt; has spread across Rails codebases faster than the plague in 13th century Europe. The vast majority of the time, the predicate adds nothing but verbosity, and really, all the author needed was a truthy/falsey check, which they could have done by just writing the variable name.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/codetriage/codetriage/blob/b92e347e0f4714b4646be930e341be5a44761b95/app/models/doc_comment.rb#L9&quot;&gt;Here’s an example&lt;/a&gt; from &lt;a href=&quot;https://www.codetriage.com/&quot;&gt;CodeTriage&lt;/a&gt;, a free and open-source Rails application written by my friend &lt;a href=&quot;https://schneems.com/&quot;&gt;Richard Schneeman&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;DocComment&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;ActiveRecord&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;Base&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;belongs_to&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;:doc_method&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;ss&quot;&gt;counter_cache: &lt;/span&gt;&lt;span class=&quot;kp&quot;&gt;true&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;# ... things removed for clarity...&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;doc_method?&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;doc_method_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;present?&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt; doing here? One, it transforms the value of doc_method_id from either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nil&lt;/code&gt; or an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Integer&lt;/code&gt; into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;true&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;false&lt;/code&gt;. Some people have Strong Opinions about whether predicates should return true/false or can return truthy/falsey. I don’t. But adding &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt; also does something else, and we have to &lt;a href=&quot;https://github.com/rails/rails/blob/94b5cd3a20edadd6f6b8cf0bdf1a4d4919df86cb/activesupport/lib/active_support/core_ext/object/blank.rb#L26&quot;&gt;look at the implementation&lt;/a&gt; to figure out what:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Object&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;present?&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blank?&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blank?&lt;/code&gt; asks a more complicated question than “is this object truthy or falsey?”. Empty arrays and hashes are truthy but &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blank?&lt;/code&gt;, and empty strings are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blank?&lt;/code&gt; too. In the example above from CodeTriage, however, the only things that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;doc_method_id&lt;/code&gt; will &lt;em&gt;ever&lt;/em&gt; be are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nil&lt;/code&gt; or an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Integer&lt;/code&gt;, meaning &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt; is logically equivalent to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!!&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;doc_method?&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;!!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;doc_method_id&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# same as doc_method_id.present?&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
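
&lt;p&gt;If the distinction still feels fuzzy, here’s a standalone re-implementation of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt; logic, named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present_like?&lt;/code&gt; to make clear it’s a sketch and not ActiveSupport itself:&lt;/p&gt;

```ruby
# Approximates ActiveSupport's present? (i.e. !blank?) without Rails:
# nil/false are blank, and so are empty or whitespace-only strings and
# empty collections. Everything else is present.
def present_like?(value)
  return false if value.nil? || value == false
  return !value.strip.empty? if value.is_a?(String)
  return !value.empty? if value.respond_to?(:empty?)
  true
end

present_like?(nil)   # false -- agrees with truthiness
present_like?(42)    # true  -- agrees with truthiness
present_like?([])    # false -- but [] is truthy!
present_like?("  ")  # false -- but "  " is truthy!
```

&lt;p&gt;For a value that can only ever be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nil&lt;/code&gt; or an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Integer&lt;/code&gt;, those extra emptiness checks can never fire, which is why &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!!&lt;/code&gt; gives the same answer.&lt;/p&gt;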

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/oldmanyellscloud.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt; in cases like this is the wrong tool for the job. If you don’t care about “emptiness” in the value you’re calling the predicate on (i.e. the value cannot be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[]&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{}&lt;/code&gt;), use the simpler (and much faster) language features available to you. I sometimes see people even do this on values &lt;em&gt;which are already boolean&lt;/em&gt;, which means you’re just adding verbosity and making me wonder if there’s some weird edge cases I’m not seeing.&lt;/p&gt;

&lt;p&gt;Alright, that’s my style gripe. I understand that you may not agree. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt; makes more sense when dealing with strings, which can frequently be empty (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;&quot;&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where people get into trouble is calling predicates, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt;, on ActiveRecord::Relation objects.&lt;/strong&gt; Let’s say you need to know if an ActiveRecord::Relation has any records. You can use the English-language synonyms any?/present?/exists? or their negations none?/blank?/empty?. Surely it doesn’t matter which method you choose, right? Just pick the one that sounds the most natural when read aloud? Nope.&lt;/p&gt;

&lt;p&gt;What SQL queries do you think the following code will execute? Assume &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@comments&lt;/code&gt; is an ActiveRecord::Relation.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;- if @comments.any?
  h2 Comments on this Post
  - @comments.each do |comment|
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The answer is &lt;em&gt;two&lt;/em&gt;. One will be an existence check, triggered by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@comments.any?&lt;/code&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT  1 AS one FROM ... LIMIT 1&lt;/code&gt;), then the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@comments.each&lt;/code&gt; line will trigger a loading of the entire relation (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT &quot;comments&quot;.* FROM &quot;comments&quot; WHERE ...&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;What about this?&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;- unless @comments.load.empty?
  h2 Comments on this Post
  - @comments.each do |comment|
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This one only executes one query - &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@comments.load&lt;/code&gt; loads the entire relation right away with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT &quot;comments&quot;.* FROM &quot;comments&quot; WHERE ...&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;And this one?&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;- if @comments.exists?
  This post has
  = @comments.size
  comments
- if @comments.exists?
  h2 Comments on this Post
  - @comments.each do |comment|
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Four! &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exists?&lt;/code&gt; doesn’t memoize itself and it doesn’t load the relation. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exists?&lt;/code&gt; here triggers a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT 1 ...&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.size&lt;/code&gt; triggers a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;COUNT&lt;/code&gt; because the relation hasn’t been loaded yet, and then the next &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exists?&lt;/code&gt; triggers ANOTHER &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT 1 ...&lt;/code&gt; and finally &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@comments&lt;/code&gt; loads the entire relation! Yay! Isn’t this fun? You could reduce this down to just 1 query with the following:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;- if @comments.load.any?
  This post has
  = @comments.size
  comments
- if @comments.any?
  h2 Comments on this Post
  - @comments.each do |comment|
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And it just gets better - this behavior changes depending on whether you’re on Rails 4.2, Rails 5.0 or Rails 5.1+.&lt;/p&gt;

&lt;p&gt;Here’s how it works in Rails 5.1+:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;method&lt;/th&gt;
      &lt;th&gt;SQL generated&lt;/th&gt;
      &lt;th&gt;memoized?&lt;/th&gt;
      &lt;th&gt;implementation&lt;/th&gt;
      &lt;th&gt;Runs query if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded?&lt;/code&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;present?&lt;/td&gt;
      &lt;td&gt;SELECT “users”.* FROM “users”&lt;/td&gt;
      &lt;td&gt;yes (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load&lt;/code&gt;)&lt;/td&gt;
      &lt;td&gt;Object (!blank?)&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;blank?&lt;/td&gt;
      &lt;td&gt;SELECT “users”.* FROM “users”&lt;/td&gt;
      &lt;td&gt;yes (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load&lt;/code&gt;)&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load&lt;/code&gt;; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blank?&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;any?&lt;/td&gt;
      &lt;td&gt;SELECT 1 AS one FROM “users” LIMIT 1&lt;/td&gt;
      &lt;td&gt;no unless &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!empty?&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;empty?&lt;/td&gt;
      &lt;td&gt;SELECT 1 AS one FROM “users” LIMIT 1&lt;/td&gt;
      &lt;td&gt;no unless &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exists?&lt;/code&gt; if !&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded?&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;none?&lt;/td&gt;
      &lt;td&gt;SELECT 1 AS one FROM “users” LIMIT 1&lt;/td&gt;
      &lt;td&gt;no unless &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;empty?&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;exists?&lt;/td&gt;
      &lt;td&gt;SELECT 1 AS one FROM &quot;users&quot; LIMIT 1&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
      &lt;td&gt;ActiveRecord::Calculations&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Here’s how it works in Rails 5.0:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;method&lt;/th&gt;
      &lt;th&gt;SQL generated&lt;/th&gt;
      &lt;th&gt;memoized?&lt;/th&gt;
      &lt;th&gt;implementation&lt;/th&gt;
      &lt;th&gt;Runs query if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded?&lt;/code&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;present?&lt;/td&gt;
      &lt;td&gt;SELECT &quot;users&quot;.* FROM &quot;users&quot;&lt;/td&gt;
      &lt;td&gt;yes (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load&lt;/code&gt;)&lt;/td&gt;
      &lt;td&gt;Object (!blank?)&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;blank?&lt;/td&gt;
      &lt;td&gt;SELECT &quot;users&quot;.* FROM &quot;users&quot;&lt;/td&gt;
      &lt;td&gt;yes (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load&lt;/code&gt;)&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load&lt;/code&gt;; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blank?&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;any?&lt;/td&gt;
      &lt;td&gt;SELECT COUNT(*) FROM &quot;users&quot;&lt;/td&gt;
      &lt;td&gt;no unless &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!empty?&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;empty?&lt;/td&gt;
      &lt;td&gt;SELECT COUNT(*) FROM &quot;users&quot;&lt;/td&gt;
      &lt;td&gt;no unless &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;count(:all) &amp;gt; 0&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;none?&lt;/td&gt;
      &lt;td&gt;SELECT COUNT(*) FROM &quot;users&quot;&lt;/td&gt;
      &lt;td&gt;no unless &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;empty?&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;exists?&lt;/td&gt;
      &lt;td&gt;SELECT 1 AS one FROM &quot;users&quot; LIMIT 1&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
      &lt;td&gt;ActiveRecord::Calculations&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Here’s how it works in Rails 4.2:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;method&lt;/th&gt;
      &lt;th&gt;SQL generated&lt;/th&gt;
      &lt;th&gt;memoized?&lt;/th&gt;
      &lt;th&gt;implementation&lt;/th&gt;
      &lt;th&gt;Runs query if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded?&lt;/code&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;present?&lt;/td&gt;
      &lt;td&gt;SELECT &quot;users&quot;.* FROM &quot;users&quot;&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
      &lt;td&gt;Object (!blank?)&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;blank?&lt;/td&gt;
      &lt;td&gt;SELECT &quot;users&quot;.* FROM &quot;users&quot;&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
      &lt;td&gt;to_a.blank?&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;any?&lt;/td&gt;
      &lt;td&gt;SELECT COUNT(*) FROM &quot;users&quot;&lt;/td&gt;
      &lt;td&gt;no unless &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!empty?&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;empty?&lt;/td&gt;
      &lt;td&gt;SELECT COUNT(*) FROM &quot;users&quot;&lt;/td&gt;
      &lt;td&gt;no unless &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;count(:all) &amp;gt; 0&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;none?&lt;/td&gt;
      &lt;td&gt;SELECT &quot;users&quot;.* FROM &quot;users&quot;&lt;/td&gt;
      &lt;td&gt;yes (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load&lt;/code&gt; called)&lt;/td&gt;
      &lt;td&gt;Array&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;exists?&lt;/td&gt;
      &lt;td&gt;SELECT 1 AS one FROM &quot;users&quot; LIMIT 1&lt;/td&gt;
      &lt;td&gt;no&lt;/td&gt;
      &lt;td&gt;ActiveRecord::Calculations&lt;/td&gt;
      &lt;td&gt;yes&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;any?&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;empty?&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;none?&lt;/code&gt; remind me of the implementation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt; - if the records are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;loaded?&lt;/code&gt;, they do a simple method call on a plain Array; if they’re not loaded, they &lt;em&gt;always run a SQL query&lt;/em&gt;. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exists?&lt;/code&gt; has no caching or memoization built in, just like the other ActiveRecord::Calculations. This means that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exists?&lt;/code&gt;, another method people like to reach for in these circumstances, is actually much worse than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt; in some cases!&lt;/p&gt;
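
&lt;p&gt;To make the memoization difference concrete, here’s a toy model in plain Ruby - &lt;em&gt;not&lt;/em&gt; real ActiveRecord internals, just a sketch of the semantics described above, with a counter standing in for actual SQL queries:&lt;/p&gt;

```ruby
# Toy stand-in for an ActiveRecord::Relation (a sketch of the semantics
# above, not the real implementation). @query_count stands in for SQL.
class ToyRelation
  attr_reader :query_count

  def initialize(rows)
    @rows = rows
    @records = nil
    @query_count = 0
  end

  def loaded?
    !@records.nil?
  end

  def load
    unless loaded?
      @query_count += 1  # SELECT "users".* FROM "users"
      @records = @rows.dup
    end
    self
  end

  def present?
    load  # memoized: subsequent calls reuse @records
    !@records.empty?
  end

  def exists?
    @query_count += 1  # SELECT 1 AS one FROM "users" LIMIT 1, every time
    !@rows.empty?
  end
end

rel = ToyRelation.new([:alice, :bob])
rel.present?
rel.present?
puts rel.query_count  # 1 - the load was memoized
rel.exists?
rel.exists?
puts rel.query_count  # 3 - exists? never memoizes
```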

&lt;p&gt;&lt;strong&gt;These six predicate methods, which are English-language synonyms all asking the same question, have completely different implementations and performance implications, and these consequences depend on which version of Rails you are using.&lt;/strong&gt; So, let me distill all of the above into some concrete advice:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blank?&lt;/code&gt; should not be used if the ActiveRecord::Relation will never be used in its entirety after you call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blank?&lt;/code&gt;. For example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@my_relation.present?; @my_relation.first(3).each&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;any?&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;none?&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;empty?&lt;/code&gt; should probably be replaced with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blank?&lt;/code&gt; unless you will only take a section of the ActiveRecord::Relation using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;first&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;last&lt;/code&gt;. They will generate an extra existence SQL check if you’re just going to use the entire relation if it exists. In essence, change &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@users.any?; @users.each...&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@users.present?; @users.each...&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@users.load.any?; @users.each...&lt;/code&gt;, but &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@users.any?; @users.first(3).each&lt;/code&gt; is fine.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exists?&lt;/code&gt; is a lot like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; - it is never memoized, and always executes a SQL query. Most people probably do not actually want this behavior, and would be better off using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blank?&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
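
&lt;p&gt;As a toy model in plain Ruby (again a sketch of the semantics, not real ActiveRecord code), here’s why an existence check followed by iteration costs two queries, while calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load&lt;/code&gt; first costs one:&lt;/p&gt;

```ruby
# Toy model (not real ActiveRecord): counting the queries issued by
# an existence check followed by iterating the whole relation.
class ToyUsers
  attr_reader :queries

  def initialize(rows)
    @rows = rows
    @records = nil
    @queries = 0
  end

  def loaded?
    !@records.nil?
  end

  def load
    unless loaded?
      @queries += 1  # SELECT "users".* FROM "users"
      @records = @rows.dup
    end
    self
  end

  def any?
    return !@records.empty? if loaded?
    @queries += 1  # SELECT 1 AS one FROM "users" LIMIT 1
    !@rows.empty?
  end

  def each
    load
    @records.each { |r| yield r }
  end
end

users = ToyUsers.new([:a, :b])
users.any?          # existence check: query 1
users.each { |u| }  # full load: query 2
puts users.queries  # 2

users = ToyUsers.new([:a, :b])
users.load.any?     # load once; any? reuses the loaded records
users.each { |u| }  # already loaded, no new query
puts users.queries  # 1
```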

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/doless.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;As your app grows in size and complexity, unnecessary SQL can become a real drag on your application’s performance. Each SQL query involves a round-trip to the database, which usually costs at &lt;em&gt;least&lt;/em&gt; a millisecond, and sometimes much more for complex &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; clauses. Even if one extra &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exists?&lt;/code&gt; check isn’t a big deal, if it suddenly happens in every row of a table or a partial in a collection, you’ve got a big problem!&lt;/p&gt;

&lt;p&gt;ActiveRecord is a powerful abstraction, but since database access will never be “free”, we need to be aware of how ActiveRecord works internally so that we can avoid database access in unnecessary cases.&lt;/p&gt;

&lt;h2 id=&quot;app-checklist&quot;&gt;App Checklist&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Look for uses of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;none?&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;any?&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blank?&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;empty?&lt;/code&gt; on objects which may be ActiveRecord::Relations. Are you just going to load the entire array later if the relation is present? If so, add &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load&lt;/code&gt; to the call (e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@my_relation.load.any?&lt;/code&gt;)&lt;/li&gt;
  &lt;li&gt;Be careful with your use of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exists?&lt;/code&gt; - it ALWAYS executes a SQL query. Only use it in cases where that is appropriate - otherwise use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;present?&lt;/code&gt; or one of the other methods which use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;empty?&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Be extremely careful using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;where&lt;/code&gt; in instance methods on ActiveRecord objects - it breaks preloading and often causes N+1s when used in rendering collections.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count&lt;/code&gt; always executes a SQL query - audit its use in your codebase, and determine if a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt; check would be more appropriate.&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Thu, 10 Jan 2019 07:00:00 +0000</pubDate>
        <link>https://www.speedshop.co/2019/01/10/three-activerecord-mistakes.html</link>
        <guid isPermaLink="true">https://www.speedshop.co/2019/01/10/three-activerecord-mistakes.html</guid>
        
        
      </item>
    
      <item>
        <title>The Complete Guide to Rails Performance, Version 2</title>
        <description>&lt;p&gt;Today, the Complete Guide to Rails Performance has been updated to version 2.0. &lt;a href=&quot;https://www.railsspeed.com&quot;&gt;You can purchase it here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;All existing purchasers have had their copies updated on Gumroad. When I started this project, I always believed that a digital course should be &lt;em&gt;better&lt;/em&gt; than a typical paperback programming book. That’s why I don’t include any DRM or proprietary video codecs. That’s why I think, like most software, updates should be free.&lt;/p&gt;

&lt;p&gt;“Version 2.0” isn’t quite as drastic a change as a software v2.0, though. The world of Rails performance has actually changed very little since I wrote the course 2 years ago. The apps I consult on still have many of the same problems. The V2 update reflects this: I have revised the content for clarity, and updated a few places to reflect changes in Ruby 2.5 and Rails 5.2, but it is mostly still the same. I have also added four lessons: memory fragmentation, application server config, GC tuning, and PGBouncer config. These lessons were added based on new problems and thinking I’ve had since the course was released. Web-Scale Package purchasers will also get a new interview with Noah Gibbs of Appfolio next week.&lt;/p&gt;

&lt;p&gt;So, what does it mean that not much has changed in the Rails performance world?&lt;/p&gt;

&lt;p&gt;This tweet put me in an introspective mood this morning:&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;It is profoundly sad how Rails has institutionalized a &amp;quot;nobody cares&amp;quot; attitude toward performance. &lt;a href=&quot;https://t.co/UhzvxyLjuz&quot;&gt;https://t.co/UhzvxyLjuz&lt;/a&gt;&lt;/p&gt;&amp;mdash; Jeff Atwood (@codinghorror) &lt;a href=&quot;https://twitter.com/codinghorror/status/1002448764630470656?ref_src=twsrc%5Etfw&quot;&gt;June 1, 2018&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;To summarize, Jeff’s cofounder, Sam Saffron (who I interviewed for the CGRP), wrote a great, in-depth blog post about memory use in ActiveRecord. In short, Sam finds that ActiveRecord creates excessive amounts of objects, even when doing simple and supposedly “optimized” work. Sam posted a proof-of-concept patch which improves this quite a bit.&lt;/p&gt;

&lt;p&gt;Jeff’s tweet diminishes the work of many Rails contributors. Aaron Patterson has spent the last two years working on Rails performance and a compacting garbage collector. Richard Schneeman has improved Sprockets’ performance a great deal. Sam Saffron himself has contributed over a dozen performance improvements to Rails, which, as far as I can tell, have all been accepted. I know also that Andrew White, Eileen Uchitelle, and Rafael Franca are all Rails core members that care deeply about performance (probably because all of them have day-jobs running large Rails applications!). So any idea that Rails’ contributors or core members “don’t care” about performance is laughably misguided, and is an opinion that can only really be held by someone outside the community. The way Jeff tried to turn it around in the replies into a “hot take” that people should “get angry” and “punk rock” about the “status quo” just made it more obvious.&lt;/p&gt;

&lt;p&gt;It’s pretty easy to take potshots at a mature framework like Ruby on Rails. It has almost 13 years of history behind it. There’s going to be cruft, baggage, and outdated decisions baked in. That’s what happens. But there’s also tremendous productivity, something gained from the thousands of contributors who have all contributed their “lessons learned” back to the framework. But if you forget about that history, it’s easy to craft a benchmark to make it look like that history has overtaken its usefulness in the present.&lt;/p&gt;

&lt;p&gt;This is the gap I’ve tried to bridge in my writing and in publishing The Complete Guide to Rails Performance. &lt;strong&gt;I believe that performance problems in Rails are pedagogical, not technical&lt;/strong&gt;. It’s not because we don’t have enough people working on performance (though it helps!). It’s not because we don’t value it as a community (how many times do I have to cite all of the top 10,000 websites that run Rails at speed?). &lt;strong&gt;It’s because Rails (and Ruby) optimizes for programmer happiness, and that means we provide sharp tools which are easy to cut yourself on.&lt;/strong&gt; Rather than throw the tools out, I think we need to teach people to use them safely.&lt;/p&gt;

&lt;p&gt;ActiveRecord is probably the best example of what I’m talking about. It’s an extremely productive tool. It works very well for 80% of web-app use-cases. But every year, someone wants to throw it out and thinks that some other Rubygem or pattern (e.g. DataMapper) will save them. It’s so easy to craft a line of code with ActiveRecord that will slow your application to a crawl if you’re not thinking through the consequences, as anyone who has written &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;User.all.each&lt;/code&gt; can tell you.&lt;/p&gt;

&lt;p&gt;There is no One True Pattern or One True Framework. But there is a Thing Which Works For Most People. And if you end up being one of the 20% for whom it doesn’t work so well, or the tool’s productivity preference means that it’s easier to make performance mistakes, I don’t think that’s the tool or framework’s fault.&lt;/p&gt;

&lt;p&gt;In this way, I think publishing the Complete Guide to Rails Performance was placing my faith in the developer community of Rails. If I didn’t think that people could make their Rails apps faster through knowledge and skills, and instead they had to wait until the framework or the language itself got faster, I would have gone to work at Github or Shopify and made a bunch of patches to Rails and Ruby. I might have started an alternative, “lightweight” framework or ORM that prioritized performance over usefulness and productivity. Instead, I think that &lt;strong&gt;teaching Rails developers how to find and fix performance problems&lt;/strong&gt; will make a bigger dent in the average Rails app’s response time than improving the language or framework’s performance by even 2-3x, or by removing “dangerous” features.&lt;/p&gt;

&lt;p&gt;As I think we’ve slowly discovered over the course of trying to make Ruby 3x faster, there is no “waste” or “bloat” that can be cut out of a framework or language without cost that suddenly makes the whole thing faster. It’s sort of like how politicians always promise to “cut waste in government spending”, but no one can ever tell you exactly which programs will be cut, or how. Everything was implemented for a reason. There is no magic wand or amount of man-hours that can be waved at these problems. I’ve discovered this in my consulting and writing as well. I wish it was that easy. But it isn’t.&lt;/p&gt;

&lt;p&gt;However, far from Jeff’s doomsday attitude, I believe that the macro picture for Ruby and Rails performance looks good, as it always had. Ruby 2.0 to 2.5 made a number of incremental performance improvements, particularly in garbage collection. I feel like the community has become more mature and performance-savvy over the last few years too. We’re waking up the mainstream Rails developer to things like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jemalloc&lt;/code&gt; and teaching them how to use ActiveRecord and avoid performance issues.&lt;/p&gt;

&lt;p&gt;The technical future of Ruby looks strong, too. Ruby 2.6 will contain a JIT compiler. How cool is that? TruffleRuby has made great progress toward becoming usable enough to run a Rails application. JRuby continues to truck along with more performance improvements and compatibility fixes all the time. The future of the language hardly looks dim - in fact, I think it’s much brighter than it was in 2011, when I got started in Ruby and Rails.&lt;/p&gt;

&lt;p&gt;I’ll continue to do my part for the Rails performance community by publishing and writing, to improve the technical skills and capacity of the average Rails developer so that they can make their apps faster. Here’s to you, developers!&lt;/p&gt;
</description>
        <pubDate>Fri, 01 Jun 2018 07:00:00 +0000</pubDate>
        <link>https://www.speedshop.co/2018/06/01/rails-performance-version-two.html</link>
        <guid isPermaLink="true">https://www.speedshop.co/2018/06/01/rails-performance-version-two.html</guid>
        
        
      </item>
    
      <item>
        <title>A New Ruby Application Server: NGINX Unit</title>
        <description>&lt;p&gt;There’s a new application server on the block for Rubyists - NGINX Unit. As you could probably guess by the name, it’s a project of &lt;a href=&quot;https://www.nginx.com/company/&quot;&gt;NGINX Inc.&lt;/a&gt;, the for-profit open-source company that owns the NGINX web server. In fall of 2017, they announced the &lt;a href=&quot;https://unit.nginx.org/&quot;&gt;NGINX Unit&lt;/a&gt; project. It’s essentially an application server designed to replace all of the various application servers used with NGINX. In Ruby’s case, that’s Puma, Unicorn, and Passenger.&lt;sup class=&quot;sidenote-number&quot;&gt;1&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(For a far more in-depth comparison of these application servers, read &lt;a href=&quot;/2017/10/12/appserver.html&quot;&gt;my article about configuring Puma, Passenger and Unicorn&lt;/a&gt;)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;1&lt;/sup&gt; For a far more in-depth comparison of these application servers, read &lt;a href=&quot;/2017/10/12/appserver.html&quot;&gt;my article about configuring Puma, Passenger and Unicorn&lt;/a&gt;&lt;/span&gt; NGINX Unit also runs Python, Go, PHP and Perl.&lt;/p&gt;

&lt;p&gt;The overarching idea seems to be to make microservice administration a lot easier. One NGINX Unit process can run any number of applications running any number of languages - for example, one NGINX Unit server can manage a half-dozen different Ruby applications, each running a different version of the Ruby runtime. Or you can run a Ruby application and a Python application side-by-side. The combinations are only limited by your system resources.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/bullshit-meter.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, the “microservice” space is quite prone to buzzword-laden marketing pages.&lt;sup class=&quot;sidenote-number&quot;&gt;2&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(I really don’t like when software projects advertise themselves as “modern”. It’s like “subtweeting” all pre-existing software projects in this problem space and saying they’re all old and busted, and this is the New Way To Do Things. Why it’s better than the “old busted ways” is never explicitly stated. This kind of marketing preys on software developer’s fear of becoming obsolete in their skillset, rather than making any substantive point.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;2&lt;/sup&gt; I really don’t like when software projects advertise themselves as “modern”. It’s like “subtweeting” all pre-existing software projects in this problem space and saying they’re all old and busted, and this is the New Way To Do Things. Why it’s better than the “old busted ways” is never explicitly stated. This kind of marketing preys on software developer’s fear of becoming obsolete in their skillset, rather than making any substantive point.&lt;/span&gt; Words like “dynamic”, “modular”, “lightweight” are mixed in with “service mesh”, “seamless” and “graceful”. This article is going to be about cutting through the marketing and getting into what NGINX Unit means for those of us running production Ruby applications.&lt;/p&gt;

&lt;p&gt;Before I move on to more about NGINX Unit’s architecture and what makes it unique, let’s make sure we all understand the difference between an application server and a web server. A &lt;strong&gt;web server&lt;/strong&gt; connects to clients over HTTP, usually serving static files or acting as a middleman that &lt;strong&gt;proxies&lt;/strong&gt; to other HTTP-enabled servers. An &lt;strong&gt;application server&lt;/strong&gt; is the thing which actually starts and runs the language runtime. In Ruby, these functions are sometimes combined. For example, all of the major Ruby application servers &lt;em&gt;also&lt;/em&gt; are web servers. However, many web servers, such as NGINX and Apache, are &lt;em&gt;not&lt;/em&gt; also application servers. NGINX Unit is both a web &lt;em&gt;and&lt;/em&gt; an application server.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote &quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/nginx-unit.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;NGINX Unit runs four different types of processes: main, router, controller, and application. Application processes are the self-explanatory ones - this would just be the Ruby runtime running your Rails application. The router and controller processes, and how they interact with each other and the application processes, are what define how NGINX Unit works.&lt;/p&gt;

&lt;p&gt;The main process creates the router and application processes. That’s really all it does. Application processes in NGINX Unit are dynamic, however - the number of processes running can be changed at any time, Ruby versions can be changed, or even entire new applications can be added while the server is running. The thing that tells the main process what application processes to run is the controller process.&lt;/p&gt;

&lt;p&gt;The controller process (like the main process, there’s only one) has two jobs: expose a JSON configuration API over HTTP, and configure the router and main processes. This is probably the most novel and interesting part of NGINX Unit for Rubyists. Rather than working with configuration files, you send JSON objects to the controller process over HTTP to tell it what to do. For example, with this JSON file:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{
    &quot;listeners&quot;: {
        &quot;*:3000&quot;: {
            &quot;application&quot;: &quot;rails&quot;
        }
    },

    &quot;applications&quot;: {
        &quot;rails&quot;: {
            &quot;type&quot;: &quot;ruby&quot;,
            &quot;processes&quot;: 5,
            &quot;script&quot;: &quot;/www/railsapp/config.ru&quot;
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;… we can PUT it to an NGINX Unit controller process with this (assuming our NGINX Unit server is listening on port 8443):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;curl -X PUT -d @myappconfig.json '127.0.0.1:8443'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;… and create a new Ruby application.&lt;/p&gt;
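
&lt;p&gt;For reference, the rackup file that the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;script&lt;/code&gt; key points at is just an ordinary Rack application. Here’s a minimal, hypothetical example of what such a file might contain:&lt;/p&gt;

```ruby
# A bare-bones Rack app, like what /www/railsapp/config.ru might contain.
# (Hypothetical example; a real config.ru would end with `run app` rather
# than calling the app directly, as we do below for illustration.)
app = proc do |env|
  [200, { "Content-Type" => "text/plain" }, ["Hello from NGINX Unit"]]
end

# Calling it directly to show the Rack status/headers/body triplet:
status, _headers, body = app.call({})
puts status      # 200
puts body.first  # Hello from NGINX Unit
```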

&lt;p&gt;NGINX Unit’s JSON configuration object is divided into &lt;em&gt;listeners&lt;/em&gt; and &lt;em&gt;applications&lt;/em&gt;. Applications are the actual apps you want to run. Listeners are where those apps are exposed to the world (i.e. what port they’re on).&lt;/p&gt;

&lt;p&gt;Changes in application and listener configuration are supposed to be seamless. For example, a “hot deploy” of a new version of your application would be accomplished by adding a new application to the configuration:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{
  &quot;rails-new&quot;: {
      &quot;type&quot;: &quot;ruby&quot;,
      &quot;processes&quot;: 5,
      &quot;script&quot;: &quot;/www/rails-new-app/config.ru&quot;
  }
}

curl -X PUT -d @mynewappconfig.json '127.0.0.1:8443/applications/rails-new'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and then switching the listener to the new application:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;curl -X PUT -d '&quot;rails-new&quot;' '127.0.0.1:8443/listeners/*:3000/application'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This transition is (supposedly) seamless, and clients won’t notice. This is similar to a Puma “phased restart”. In a phased restart in Puma, each worker process is restarted one at a time, which means that the other worker processes are up and available to take requests. Puma accomplishes this using a control server (managed by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pumactl&lt;/code&gt; utility). However, unlike Puma, NGINX Unit “hot restarts” will not have two versions of the application taking requests at the same time.&lt;/p&gt;

&lt;p&gt;In a &lt;a href=&quot;https://github.com/puma/puma/blob/master/docs/restart.md&quot;&gt;Puma phased restart&lt;/a&gt;, say your application has six workers. Halfway through the phased restart, three workers will be running the old code, and three will be running the new code. This can cause some problems with database schema changes, for example. NGINX Unit restarts happen “all at once”, so while two versions of the code will be running at once, only one version will be taking requests at any point in time.&lt;/p&gt;

&lt;p&gt;This functionality seems quite useful to those who are running their own Ruby applications on a service such as AWS, where you have to manage your own deployment. Heroku users won’t find any of this useful, though, as you’ve already had this sort of “hot deploy” functionality using &lt;a href=&quot;https://devcenter.heroku.com/articles/preboot&quot;&gt;Heroku’s preboot system&lt;/a&gt;. That said, these two features aren’t doing exactly the same thing. Heroku creates an entirely new virtual server and hot-swaps the whole thing, whereas NGINX Unit is just changing processes on a single machine - but from a client’s perspective, they behave the same.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote &quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/nginx-unit-router.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The router process is pretty much what it sounds like - the thing which turns HTTP connections from clients into requests to the web application processes. NGINX claims a single Unit router can handle thousands of simultaneous connections. The router works a lot like an NGINX web server, and has a number of worker threads to accept, buffer and parse incoming connections.&lt;/p&gt;

&lt;p&gt;To me, this is one of the most exciting parts of NGINX Unit for Rubyists. It is very difficult for Ruby application servers to deal with HTTP connections without some kind of reverse proxy in front of the app server. Unicorn, for example, is recommended for use only behind a reverse proxy because it cannot buffer requests. That is, if a client sends one byte of their request and then stops (due to network conditions, a bad cellphone connection perhaps), then the Unicorn process just stops all work and cannot continue until that request has finished buffering. Using NGINX, for example, in front of Unicorn allows NGINX to buffer that request before it reaches Unicorn. Since NGINX is written in highly optimized C and is &lt;em&gt;not&lt;/em&gt; restricted by Ruby’s GVL, it can buffer hundreds of connections for Unicorn. Passenger solves this problem by basically just being an addon for NGINX or Apache&lt;sup class=&quot;sidenote-number&quot;&gt;3&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(Now you know why it’s called &lt;em&gt;Passenger&lt;/em&gt;!)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;3&lt;/sup&gt; Now you know why it’s called &lt;em&gt;Passenger&lt;/em&gt;!&lt;/span&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mod_rails&lt;/code&gt;!) and offloading all of the connection-related work to the webserver. In this way, NGINX Unit is more similar to Passenger than it is to Unicorn.&lt;/p&gt;

&lt;p&gt;The application configuration has a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;processes&lt;/code&gt; key. This key can have a minimum number and maximum number of processes:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{
  &quot;rails-new&quot;: {
      &quot;type&quot;: &quot;ruby&quot;,
      &quot;processes&quot;: {
        &quot;spare&quot;: 5,
        &quot;max&quot;: 10
      },
      &quot;script&quot;: &quot;/www/rails-new-app/config.ru&quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For some reason, the “minimum” number of processes is called “spare”. The config above will start 5 processes immediately, and will scale to 10 if the load requires it.&lt;/p&gt;

&lt;p&gt;No word yet on whether anything like Puma’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preload_app!&lt;/code&gt; (or the equivalent settings in Passenger and Unicorn) is available, which would let you start processes before they are needed &lt;em&gt;and&lt;/em&gt; take advantage of copy-on-write memory.&lt;/p&gt;
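&lt;p&gt;For reference, this is roughly what that combination looks like in Puma today (a sketch of a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;config/puma.rb&lt;/code&gt; - the worker and thread counts are illustrative, not recommendations):&lt;/p&gt;

```ruby
# config/puma.rb (illustrative values - tune for your own app)
workers 2        # two forked worker processes
threads 5, 5     # five threads in each worker
preload_app!     # boot the app once before forking, so workers
                 # share memory via copy-on-write
```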

&lt;p&gt;This leaves the application processes. The interesting and novel thing here is that the router does not communicate with the application processes via HTTP - it uses Unix sockets and shared memory. This looks like an optimization aimed at microservice architectures, as communicating between services on the same machine will be considerably faster without any HTTP in between. I have yet to see any Ruby code examples of how this could work, however.&lt;/p&gt;

&lt;p&gt;It is unclear to me whether, in the long term, you are intended to run NGINX in front of NGINX Unit, or whether NGINX Unit can run on its own without anything in front of it. As of right now (Q1 2018), you should probably be running NGINX &lt;em&gt;in front&lt;/em&gt; of NGINX Unit as a reverse proxy, because NGINX Unit lacks static file serving, HTTPS (TLS), and HTTP/2. Unsurprisingly, the &lt;a href=&quot;http://unit.nginx.org/integration/&quot;&gt;integration is pretty seamless&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;NGINX Unit is approaching a stable 1.0 release, but you can’t really run it in production for Ruby applications yet: as I write this sentence, the Ruby module is literally 5 days old. It’s still under very active development - minor versions are released every few weeks. TLS and HTTP-related features seem like the next “big features” to come down the pipe, with static file serving after that. There is &lt;em&gt;some&lt;/em&gt; discussion about support for Java, which could probably be turned into support for JRuby and TruffleRuby as well.&lt;/p&gt;

&lt;p&gt;There is no Windows support, and I don’t think I would hold my breath for any in the future. NGINX Unit only supports Ruby 2.0 and above.&lt;/p&gt;

&lt;p&gt;I will not be benchmarking NGINX Unit in this post. Its Ruby module is extremely new and probably not ready for any kind of benchmarking. However, the real reason I won’t be benchmarking NGINX Unit against Puma, Unicorn or Passenger is that application server choice in Ruby is not a matter of speed (technically, latency) but of throughput. Application servers tend to differ in &lt;em&gt;how many requests&lt;/em&gt; they can serve in parallel, rather than &lt;em&gt;how quickly they do it&lt;/em&gt;. Application servers impose very little latency overhead on the applications they serve, probably on the order of a couple of milliseconds.&lt;/p&gt;

&lt;p&gt;The most important Ruby application server setting which affects throughput is &lt;em&gt;threading&lt;/em&gt;. The reason is that it is the only application server setting which can increase the number of requests served &lt;em&gt;concurrently&lt;/em&gt;. A multithreaded Ruby application server can make greater and more efficient use of the available CPU and memory resources and serve more requests-per-minute than a single-threaded Ruby application process.&lt;/p&gt;

&lt;p&gt;Currently, the only &lt;em&gt;free&lt;/em&gt; application server which runs Ruby web applications in multiple threads is Puma. Passenger Enterprise will do it, but you must pay for a license.&lt;/p&gt;

&lt;p&gt;NGINX Unit plans support for multiple threads in Python applications, so it is not inconceivable that it will support Ruby applications in multiple threads sometime in the future.&lt;/p&gt;

&lt;p&gt;So, how does NGINX Unit currently “shake out” in comparison to Unicorn, Passenger and Puma? I think the traditional Rails setup - one monolithic application, run on a Platform-as-a-Service provider like Heroku - will probably not see any benefit at all from NGINX Unit’s current features and planned roadmap. Puma already serves these users very well.&lt;/p&gt;

&lt;p&gt;NGINX Unit may be interesting for Unicorn users who want to stop using a reverse proxy. Once NGINX Unit’s HTTP features are fleshed out, it could replace a Unicorn/NGINX setup with just a single NGINX Unit server.&lt;/p&gt;

&lt;p&gt;NGINX Unit is probably most &lt;em&gt;directly&lt;/em&gt; comparable to Phusion Passenger, which also recently moved into the “microservice” realm by supporting JavaScript and Python as well as Ruby applications. NGINX Unit currently supports more languages and will probably support even more in the future, so those that need greater language support will probably switch. However, Phusion is a Ruby-first company, so I expect Passenger to always “support” Ruby in a better, more complete way than NGINX Unit ever will. And, as mentioned above, Phusion Passenger Enterprise supports multithreaded execution &lt;em&gt;today&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So, what is the ideal NGINX Unit app? If you’re running your own cloud (that is, not on a service which manages the routing for you, like Heroku) and you have many Ruby applications running on different Ruby versions or many services in many different languages &lt;em&gt;and&lt;/em&gt; those services/apps need to talk to each other, quickly, it looks like NGINX Unit was designed for you. If you don’t fit that profile, though, it’s probably best to stick to the existing top three options (Puma, Passenger, and Unicorn).&lt;/p&gt;
</description>
        <pubDate>Wed, 28 Mar 2018 07:00:00 +0000</pubDate>
        <link>https://www.speedshop.co/2018/03/28/nginx-unit-for-ruby.html</link>
        <guid isPermaLink="true">https://www.speedshop.co/2018/03/28/nginx-unit-for-ruby.html</guid>
        
        
      </item>
    
      <item>
        <title>Malloc Can Double Multi-threaded Ruby Program Memory Usage</title>
        <description>&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/easy-button.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Sometimes, it really is that simple.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;It’s not every day that a simple configuration change can completely solve a problem.&lt;/p&gt;

&lt;p&gt;I had a client whose Sidekiq processes were using a lot of memory - about a gigabyte each. They would start at about 300MB each, then slowly grow over the course of several hours to almost a gigabyte, where they would start to level off.&lt;/p&gt;

&lt;p&gt;I asked him to change a single environment variable: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MALLOC_ARENA_MAX&lt;/code&gt;. “Please set it to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2&lt;/code&gt;.”&lt;/p&gt;

&lt;p&gt;His processes restarted, and immediately the slow growth was eliminated. Processes settled at about half the memory usage they had before - around 512MB each.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/ilied.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Actually, it’s not that simple. There are no free lunches. Though this one might be close to free. Like a ten cent lunch.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Now, before you go copy-pasting this “magical” environment variable into all of your application environments, know this: there are drawbacks. You may not be suffering the problem it solves. There are no silver bullets.&lt;/p&gt;
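&lt;p&gt;If you do decide to try it, note that the variable must be set in the process environment &lt;em&gt;before&lt;/em&gt; Ruby boots - the allocator reads it at startup. A couple of examples (service and app names are hypothetical; adjust to your own setup):&lt;/p&gt;

```shell
# Heroku: applies to every dyno in the app
heroku config:set MALLOC_ARENA_MAX=2

# systemd: in the [Service] section of your Sidekiq unit file
#   Environment="MALLOC_ARENA_MAX=2"

# Plain shell, for a one-off test run
MALLOC_ARENA_MAX=2 bundle exec sidekiq
```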

&lt;p&gt;Ruby is not known for being a language that’s light on memory use. Many Rails applications suffer from up to a gigabyte of memory use &lt;em&gt;per process&lt;/em&gt;. That’s approaching Java levels. &lt;a href=&quot;https://github.com/mperham/sidekiq&quot;&gt;Sidekiq&lt;/a&gt;, the popular Ruby background job processor, has processes which can get just as large or even larger. The reasons are many, but one reason in particular is extremely difficult to diagnose and debug: fragmentation.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote &quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/log.jpeg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Typical Ruby memory growth looks logarithmic.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;The problem manifests itself as a slow, creeping memory growth in Ruby processes. It is often mistaken for a memory leak. However, unlike a memory leak, memory growth due to fragmentation is logarithmic, while memory leaks are linear.&lt;/p&gt;

&lt;p&gt;A memory leak in a Ruby program is usually caused by a C-extension bug. For example, if your Markdown parser leaks 10kb every time you call it, your memory growth will continue forever &lt;em&gt;at a linear rate&lt;/em&gt;, since you tend to call the markdown parser at a regular frequency.&lt;/p&gt;

&lt;p&gt;Memory fragmentation causes logarithmic growth in memory. It looks like a long curve, approaching some unseen limit. All Ruby processes experience &lt;em&gt;some&lt;/em&gt; memory fragmentation. It’s an inevitable consequence of how Ruby manages memory.&lt;/p&gt;
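&lt;p&gt;A toy model makes the two shapes easy to tell apart (the numbers are made up for illustration - only the shapes of the curves matter):&lt;/p&gt;

```ruby
# Toy model: resident memory in MB after n hours of uptime.
# A leak adds the same amount every hour; fragmentation-driven
# growth adds less and less, approaching a plateau.
leak = ->(hours) { 300 + 10 * hours }               # linear
frag = ->(hours) { 300 + 80 * Math.log(1 + hours) } # logarithmic

early_leak_growth = leak.(1)  - leak.(0)
late_leak_growth  = leak.(24) - leak.(23) # same as early: a leak never slows
early_frag_growth = frag.(1)  - frag.(0)
late_frag_growth  = frag.(24) - frag.(23) # much smaller: leveling off
```

&lt;p&gt;If your process’s memory graph looks like the first curve, suspect a leak (often in a C extension); if it looks like the second, suspect fragmentation.&lt;/p&gt;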

&lt;p&gt;In particular, Ruby cannot &lt;em&gt;move&lt;/em&gt; objects in memory. Doing so would potentially break any C language extensions which are holding raw pointers to a Ruby object. If we can’t move objects in memory, fragmentation is an inevitable result. It’s a fairly common issue in C programs, not just Ruby.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote &quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/malloc-arena-max.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Actual client graph. This is what fragmentation looks like. Note the enormous drop after MALLOC_ARENA_MAX changed to 2.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;However, fragmentation can sometimes cause Ruby programs to use &lt;em&gt;twice&lt;/em&gt; as much memory as they would otherwise, sometimes as much as four times more!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ruby programmers aren’t used to thinking about memory, especially not at the level of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc&lt;/code&gt;. And that’s OK: the entire language is designed to abstract memory away from the programmer. It’s right in the manpage. But while Ruby can guarantee memory &lt;em&gt;safety&lt;/em&gt;, it cannot provide perfect memory &lt;em&gt;abstraction&lt;/em&gt;. One cannot be completely ignorant of memory. Because Ruby programmers are often inexperienced with how computer memory works, when problems occur, they often have no idea where to even start with debugging it, and may dismiss it as an intrinsic feature of a dynamic, interpreted language like Ruby.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/princess.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;“And underneath 4 layers of memory abstraction, she noticed some fragmentation!”&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;What makes it worse is that memory is abstracted away from Rubyists through &lt;em&gt;four separate layers&lt;/em&gt;. First is the Ruby virtual machine itself, which has its own internal organization and memory tracking features (sometimes called the &lt;a href=&quot;http://ruby-doc.org/core-2.4.0/ObjectSpace.html&quot;&gt;ObjectSpace&lt;/a&gt;). Second is the allocator, which differs &lt;em&gt;greatly&lt;/em&gt; in behavior depending on the particular implementation you’re using. Third is the operating system, which abstracts actual physical memory addresses away into virtual memory addresses. The way it does this varies significantly depending on the kernel - Mach does this much differently than Linux, for example. Finally, there’s the actual hardware itself, which uses several strategies to keep frequently-accessed data in “hot” locations where it can be more quickly accessed. There are even special parts of the CPU involved here, such as the &lt;a href=&quot;https://en.wikipedia.org/wiki/Translation_lookaside_buffer&quot;&gt;translation lookaside buffer&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is what makes memory fragmentation so difficult for Rubyists to deal with. It’s a problem that generally happens at the level of the virtual machine and the allocator, parts of the Ruby language that 95% of Rubyists are probably unfamiliar with.&lt;/p&gt;

&lt;p&gt;Some fragmentation is inevitable, but it can also get so bad that it doubles the memory usage of your Ruby processes. How can you know if you’re suffering the latter rather than the former? What causes critical levels of memory fragmentation? Well, I have one thesis about a cause of memory fragmentation which affects multithreaded Ruby applications, like webapps running on Puma or Passenger Enterprise, and multithreaded job processors such as Sidekiq or Sucker Punch.&lt;/p&gt;

&lt;h2 id=&quot;per-thread-memory-arenas-in-glibc-malloc&quot;&gt;Per-Thread Memory Arenas in glibc Malloc&lt;/h2&gt;

&lt;p&gt;It all boils down to a particular feature of the standard &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;glibc&lt;/code&gt; malloc implementation called “per-thread memory arenas”.&lt;/p&gt;

&lt;p&gt;To understand why, I need to explain how garbage collection works in CRuby &lt;em&gt;really quickly&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote &quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/heapfrag.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;ObjectSpace visualization by Aaron Patterson. Each pixel is an RVALUE. Green is “new”, red is “old”. See &lt;a href=&quot;https://github.com/tenderlove/heapfrag&quot;&gt;heapfrag&lt;/a&gt;.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;All objects have an entry in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ObjectSpace&lt;/code&gt;. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ObjectSpace&lt;/code&gt; is a big list which contains an entry for &lt;em&gt;every&lt;/em&gt; Ruby object currently alive in the process. The list entries take the form of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RVALUE&lt;/code&gt;s, which are 40-byte C &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;struct&lt;/code&gt;s that contain some basic data about the object. The exact contents of these structs vary depending on the class of the object. As an example, if it is a very short String like “hello”, the actual bits that contain the character data are embedded directly in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RVALUE&lt;/code&gt;. However, we only have 40 bytes - if the string is longer than 23 bytes, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RVALUE&lt;/code&gt; contains only a raw pointer to where the object data &lt;em&gt;actually&lt;/em&gt; lies in memory, outside the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RVALUE&lt;/code&gt;.&lt;/p&gt;
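&lt;p&gt;You can see the embedded-versus-pointer distinction with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;objspace&lt;/code&gt; standard library (exact byte counts vary by Ruby version and platform; the 40-byte slot figure assumes 64-bit CRuby):&lt;/p&gt;

```ruby
require "objspace"

short = "hello"    # character data fits inside the 40-byte RVALUE
long  = "x" * 1000 # RVALUE holds a pointer; the bytes live in a malloc'd region

short_size = ObjectSpace.memsize_of(short) # roughly just the slot size
long_size  = ObjectSpace.memsize_of(long)  # slot plus the malloc'd buffer

puts "short: #{short_size} bytes, long: #{long_size} bytes"
```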

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RVALUE&lt;/code&gt;s are further organized in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ObjectSpace&lt;/code&gt; into 16KB “pages”. Each page contains about 408 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RVALUE&lt;/code&gt;s.&lt;/p&gt;

&lt;p&gt;These numbers can be confirmed by looking at the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GC::INTERNAL_CONSTANTS&lt;/code&gt; constant in any Ruby process:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;no&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;INTERNAL_CONSTANTS&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;ss&quot;&gt;:RVALUE_SIZE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;ss&quot;&gt;:HEAP_PAGE_OBJ_LIMIT&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;408&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# ...&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Creating a long string (let’s say it’s a 1000-character HTTP response for example) looks like this:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Add an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RVALUE&lt;/code&gt; to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ObjectSpace&lt;/code&gt; list. If we are out of free slots in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ObjectSpace&lt;/code&gt;, we lengthen the list by 1 heap page, calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc(16384)&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc(1000)&lt;/code&gt; and receive an address to a 1000-byte memory location.&lt;sup class=&quot;sidenote-number&quot;&gt;1&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(Actually, Ruby will request an area slightly larger than it needs in case the string is added to or resized.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;1&lt;/sup&gt; Actually, Ruby will request an area slightly larger than it needs in case the string is added to or resized.&lt;/span&gt; This is where we’ll put our HTTP response.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The malloc calls here are what I want to bring your attention to. All we’re doing is asking for a memory location of a particular size, &lt;em&gt;somewhere&lt;/em&gt;. &lt;strong&gt;Actually, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc&lt;/code&gt;’s contiguity is &lt;em&gt;undefined&lt;/em&gt;&lt;/strong&gt;, that is, it makes no guarantees about &lt;em&gt;where&lt;/em&gt; that memory location will actually be. This means that, from the perspective of the Ruby VM, fragmentation (which is fundamentally a problem about &lt;em&gt;where&lt;/em&gt; memory is) is a problem of the allocator.&lt;sup class=&quot;sidenote-number&quot;&gt;2&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(However, allocation patterns and sizes can definitely make things harder for the allocator.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;2&lt;/sup&gt; However, allocation patterns and sizes can definitely make things harder for the allocator.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Ruby can, in a way, measure the fragmentation of its own &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ObjectSpace&lt;/code&gt;. A method in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GC&lt;/code&gt; module, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GC.stat&lt;/code&gt;, provides a wealth of information about the current memory and GC state. It’s a little overwhelming and is under-documented, but the output is a hash that looks like this:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;no&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;stat&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;ss&quot;&gt;:count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;ss&quot;&gt;:heap_allocated_pages&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;91&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;ss&quot;&gt;:heap_sorted_length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;91&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# ... way more keys ...&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There are two keys in this hash that I want to point your attention to: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GC.stat[:heap_live_slots]&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GC.stat[:heap_eden_pages]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:heap_live_slots&lt;/code&gt; refers to the number of slots in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ObjectSpace&lt;/code&gt; currently occupied by live (not marked for freeing) &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RVALUE&lt;/code&gt; structs. This is roughly the same as “currently live Ruby objects”.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/eden.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;The Eden heap&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:heap_eden_pages&lt;/code&gt; is the number of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ObjectSpace&lt;/code&gt; pages which currently contain &lt;em&gt;at least one&lt;/em&gt; live slot. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ObjectSpace&lt;/code&gt; pages which have at least one live slot are called eden pages. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ObjectSpace&lt;/code&gt; pages which contain no live objects are called tomb pages. This distinction is important from the GC’s perspective, because tomb pages can be returned back to the operating system. Also, the GC will put new objects into eden pages first, and then tomb pages after all the eden pages have filled up. This reduces fragmentation.&lt;/p&gt;

&lt;p&gt;If you divide the number of live slots by the number of slots in all eden pages, you get a measure of the current fragmentation of the ObjectSpace. As an example, here’s what I get in a fresh &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;irb&lt;/code&gt; process:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;times&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;start&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;no&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;stat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;ss&quot;&gt;:heap_live_slots&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# 24508&lt;/span&gt;
&lt;span class=&quot;no&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;stat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;ss&quot;&gt;:heap_eden_pages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# 83&lt;/span&gt;
&lt;span class=&quot;no&quot;&gt;GC&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;INTERNAL_CONSTANTS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;ss&quot;&gt;:HEAP_PAGE_OBJ_LIMIT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# 408&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# live_slots / (eden_pages * slots_per_page)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# 24508 / (83 * 408) = 72.3%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;About 28% of my eden page slots are currently unoccupied. A high percentage of free slots indicates that the ObjectSpace’s RVALUEs are spread across many more heap pages than they would be if we could move them around. This is a kind of internal memory fragmentation.&lt;/p&gt;

&lt;p&gt;Another measure of internal fragmentation in the Ruby VM comes from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GC.stat[:heap_sorted_length]&lt;/code&gt;. This key is the “length” of the heap. If we have three ObjectSpace pages, and we &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;free&lt;/code&gt; the 2nd one (the one in the middle), we only have two heap pages remaining. However, we cannot move heap pages around in memory, so the “length” of the heap (essentially the highest index of the heap pages) is still 3.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/swisscheese.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Yes, this heap is fragmented, but it looks &lt;em&gt;really tasty&lt;/em&gt;.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Dividing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GC.stat[:heap_eden_pages]&lt;/code&gt; by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GC.stat[:heap_sorted_length]&lt;/code&gt; gives a measure of internal fragmentation at the level of ObjectSpace pages - a low percentage here would indicate a lot of heap-page-sized “holes” in the ObjectSpace list.&lt;/p&gt;
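&lt;p&gt;Both measures can be wrapped into a small console diagnostic. This is a sketch: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GC.stat&lt;/code&gt; key names are as of CRuby 2.4-2.5, so it guards with fallbacks in case a future Ruby renames or drops one:&lt;/p&gt;

```ruby
# Two rough gauges of ObjectSpace fragmentation, per the text above.
def objectspace_fragmentation
  s = GC.stat
  slots_per_page = GC::INTERNAL_CONSTANTS[:HEAP_PAGE_OBJ_LIMIT] ||
                   (s[:heap_available_slots] / s[:heap_allocated_pages])
  sorted_length = s[:heap_sorted_length] || s[:heap_allocated_pages]

  {
    # Low occupancy: live RVALUEs are spread thinly across eden pages.
    slot_occupancy: s[:heap_live_slots].fdiv(s[:heap_eden_pages] * slots_per_page),
    # Low density: heap-page-sized "holes" in the ObjectSpace list.
    page_density: s[:heap_eden_pages].fdiv(sorted_length)
  }
end

GC.start
p objectspace_fragmentation
```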

&lt;p&gt;While these measures are interesting, most memory fragmentation (and most allocation) doesn’t happen in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ObjectSpace&lt;/code&gt; - it happens in the process of allocating space for objects which don’t fit inside a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RVALUE&lt;/code&gt;. It turns out that’s most of them, according to experiments performed by Aaron Patterson and Sam Saffron. In a typical Rails app, 50-80% of memory usage comes from these &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc&lt;/code&gt; calls for objects larger than a few bytes.&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;Well this sucks. Looks like only 15% of the heap in a basic Rails app is managed by the GC. 85% is just mallocs &lt;a href=&quot;https://t.co/sPbtAq4g8j&quot;&gt;pic.twitter.com/sPbtAq4g8j&lt;/a&gt;&lt;/p&gt;&amp;mdash; Aaron Patterson (@tenderlove) &lt;a href=&quot;https://twitter.com/tenderlove/status/879870368680255489?ref_src=twsrc%5Etfw&quot;&gt;June 28, 2017&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;When Aaron says “managed by the GC” here, he means “inside the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ObjectSpace&lt;/code&gt; list”.&lt;/p&gt;

&lt;p&gt;Ok, so let’s talk about where per-thread memory arenas come in.&lt;/p&gt;

&lt;p&gt;The per-thread memory arena was an optimization introduced in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;glibc&lt;/code&gt; 2.10, &lt;a href=&quot;https://github.molgen.mpg.de/git-mirror/glibc/blob/master/malloc/arena.c&quot;&gt;and lives today in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;arena.c&lt;/code&gt;&lt;/a&gt;. It’s designed to decrease contention between threads when accessing memory.&lt;/p&gt;

&lt;p&gt;In a naive, basic allocator design, the allocator makes sure only one thread can request a memory chunk from the main arena at a time. This ensures that two threads don’t accidentally get the same chunk of memory. If they did, that would cause some pretty nasty multi-threading bugs. However, for programs with a lot of threads, this can be slow, since there’s a lot of contention for the lock. &lt;em&gt;All&lt;/em&gt; memory access for &lt;em&gt;all&lt;/em&gt; threads is gated through this lock, so you can see how this could be a bottleneck.&lt;/p&gt;

&lt;p&gt;Removing this lock has been an area of major effort in allocator design because of its performance impact. There are even a few lockless allocators out there.&lt;/p&gt;

&lt;p&gt;The per-thread memory arena implementation alleviates lock contention with the following process (paraphrased from &lt;a href=&quot;https://siddhesh.in/posts/malloc-per-thread-arenas-in-glibc.html&quot;&gt;this article by Siddhesh Poyarekar&lt;/a&gt;):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;We call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc&lt;/code&gt; in a thread. The thread attempts to obtain the lock for the memory arena it accessed previously (or the main arena, if no other arenas have been created).&lt;/li&gt;
  &lt;li&gt;If that arena is not available, try the next memory arena (if there are any other memory arenas).&lt;/li&gt;
  &lt;li&gt;If none of the memory arenas are available, create a new arena and use that. The new arena is linked to the last arena in a linked list.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this way, the main arena is basically extended into a linked list of arenas/heaps. The number of arenas is limited by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mallopt&lt;/code&gt;, specifically the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_ARENA_MAX&lt;/code&gt; parameter (documented &lt;a href=&quot;http://man7.org/linux/man-pages/man3/mallopt.3.html&quot;&gt;here&lt;/a&gt;, note the “environment variables” section). By default, the limit on the number of per-thread memory arenas that can be created is 8 times the number of available cores. Most Ruby web applications run about 5 threads per core, and Sidekiq clusters can often run far more than that. In practice, this means that many, many per-thread memory arenas can get created by a Ruby application.&lt;/p&gt;
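&lt;p&gt;To make that arithmetic concrete: on a 4-core machine the default cap works out to 32 arenas, which a 25-thread Sidekiq process doing lots of I/O can plausibly exhaust. You can compute the default cap for your own machine:&lt;/p&gt;

```ruby
require "etc"

# glibc's default M_ARENA_MAX on 64-bit systems is 8 x core count.
cores = Etc.nprocessors
default_arena_cap = 8 * cores
puts "glibc would allow up to #{default_arena_cap} malloc arenas here"
```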

&lt;p&gt;Let’s take a look at exactly how this would play out in a multithreaded Ruby application.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;You are running a Sidekiq process with the default setting of 25 threads.&lt;/li&gt;
  &lt;li&gt;Sidekiq begins running 5 new jobs. Their job is to communicate with an external credit card processor - so they POST a request via HTTPS and receive a response ~3 seconds later.&lt;/li&gt;
  &lt;li&gt;Each job (which is running a separate thread in Rubyland) sends an HTTP request and waits for a response using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IO&lt;/code&gt; module. Generally, almost all IO in CRuby releases the Global VM lock, which means that these threads are working &lt;em&gt;in parallel&lt;/em&gt; and may contend for the main memory arena lock, causing the creation of new memory arenas.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If multiple CRuby threads are running but &lt;em&gt;not&lt;/em&gt; doing I/O, it is pretty much impossible for them to contend for the main memory arena because the Global VM Lock prevents two Ruby threads from executing Ruby code at the same time. Thus, per-thread-memory arenas only affect CRuby applications which are both multithreaded and performing I/O.&lt;/p&gt;
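&lt;p&gt;A quick sketch of that parallelism (using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sleep&lt;/code&gt; as a stand-in for network I/O, since it also releases the GVL):&lt;/p&gt;

```ruby
require "benchmark"

# Five threads that each block for 1 second. Because sleep - like most I/O
# in CRuby - releases the Global VM Lock, the threads wait concurrently.
elapsed = Benchmark.realtime do
  5.times.map { Thread.new { sleep 1 } }.each(&:join)
end

puts format("%.1fs", elapsed) # about 1 second, not 5
```

&lt;p&gt;If the GVL serialized these waits, this would take roughly 5 seconds instead of 1.&lt;/p&gt;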

&lt;p&gt;How does this lead to memory fragmentation?&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/tetris.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Bin-packing can be fun, too!&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Memory fragmentation is essentially a &lt;a href=&quot;https://en.wikipedia.org/wiki/Bin_packing_problem&quot;&gt;bin packing problem&lt;/a&gt; - how can we efficiently distribute oddly-sized items between multiple bins so that they take up the least amount of space? Bin-packing is made much more difficult for the allocator because a) Ruby never moves memory locations around (once we allocate a location, the object/data stays there until it is freed) b) per-thread memory arenas essentially create a &lt;em&gt;lot&lt;/em&gt; of different bins, which cannot be combined or “packed” together. Bin-packing is already NP-hard, and these constraints just make it even more difficult to achieve an optimal solution.&lt;/p&gt;

&lt;p&gt;Per-thread memory arenas leading to large amounts of RSS use over time is something of a &lt;a href=&quot;https://sourceware.org/bugzilla/show_bug.cgi?id=11261&quot;&gt;known issue on the glibc malloc tracker&lt;/a&gt;. In fact, the &lt;a href=&quot;https://sourceware.org/glibc/wiki/MallocInternals&quot;&gt;MallocInternals wiki&lt;/a&gt; says specifically:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;As pressure from thread collisions increases, additional arenas are created via mmap to relieve the pressure. The number of arenas is capped at eight times the number of CPUs in the system (unless the user specifies otherwise, see mallopt), which means a heavily threaded application will still see some contention, but the trade-off is that there will be less fragmentation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There you have it - lowering the number of available memory arenas reduces fragmentation. There’s an explicit tradeoff here: fewer arenas decreases memory use, but may slow the program down by increasing lock contention.&lt;/p&gt;

&lt;p&gt;Heroku discovered this side-effect of per-thread memory arenas when they created the Cedar-14 stack, which upgraded glibc to version 2.19.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://devcenter.heroku.com/articles/tuning-glibc-memory-behavior&quot;&gt;Heroku customers reported greater memory consumption of their applications when upgrading their apps to the new stack.&lt;/a&gt; Testing by Terrence Hone of Heroku produced some interesting results:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Configuration&lt;/th&gt;
      &lt;th&gt;Memory Use&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Base (unlimited arenas)&lt;/td&gt;
      &lt;td&gt;1.73x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Base (before arenas introduced)&lt;/td&gt;
      &lt;td&gt;1x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;MALLOC_ARENA_MAX=1&lt;/td&gt;
      &lt;td&gt;0.86x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;MALLOC_ARENA_MAX=2&lt;/td&gt;
      &lt;td&gt;0.87x&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Basically, the default memory arena behavior in glibc 2.19 reduced execution time by 10%, but increased memory use by 75%! Reducing the maximum number of memory arenas to 2 essentially eliminated the speed gains, but reduced memory usage over the old Cedar-10 stack by 10% (and reduced memory usage by about 2X over the default memory arena behavior!).&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Configuration&lt;/th&gt;
      &lt;th&gt;Response Times&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Base (unlimited arenas)&lt;/td&gt;
      &lt;td&gt;0.9x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Base (before arenas introduced)&lt;/td&gt;
      &lt;td&gt;1x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;MALLOC_ARENA_MAX=1&lt;/td&gt;
      &lt;td&gt;1.15x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;MALLOC_ARENA_MAX=2&lt;/td&gt;
      &lt;td&gt;1.03x&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;For almost &lt;em&gt;all&lt;/em&gt; Ruby applications, a 75% memory increase for a 10% speed gain is &lt;em&gt;not&lt;/em&gt; an appropriate tradeoff. But let’s get some more real-world results in here.&lt;/p&gt;

&lt;h2 id=&quot;a-replicating-program&quot;&gt;A Replicating Program&lt;/h2&gt;

&lt;p&gt;&lt;span class=&quot;marginnote &quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/2arenas.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;I wrote &lt;a href=&quot;https://github.com/speedshop/sidekiqdemo&quot;&gt;a demo application&lt;/a&gt;: a Sidekiq job that generates some random data and writes it to a database.&lt;/p&gt;

&lt;p&gt;After switching &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MALLOC_ARENA_MAX&lt;/code&gt; to 2, memory usage was 15% lower after 24 hours.&lt;/p&gt;

&lt;p&gt;I’ve noticed that real-world workloads magnify this effect greatly, which means I don’t fully understand the allocation pattern which can cause this fragmentation yet. I’ve seen plenty of memory graphs on the &lt;a href=&quot;https://www.railsspeed.com/&quot;&gt;Complete Guide to Rails Performance&lt;/a&gt; Slack channel that show 2-3x memory savings in production with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MALLOC_ARENA_MAX=2&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;fixing-the-problem&quot;&gt;Fixing the Problem&lt;/h2&gt;

&lt;p&gt;There are two main solutions for this problem, along with one possible solution for the future.&lt;/p&gt;

&lt;h3 id=&quot;fix-1-reduce-memory-arenas&quot;&gt;Fix 1: Reduce Memory Arenas&lt;/h3&gt;

&lt;p&gt;One fairly obvious fix would be to reduce the maximum number of memory arenas available. We can do this by changing the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MALLOC_ARENA_MAX&lt;/code&gt; environment variable. As mentioned before, this increases lock contention in the allocator and &lt;em&gt;will&lt;/em&gt; have a negative impact on the performance of your application across the board.&lt;/p&gt;

&lt;p&gt;It’s impossible to recommend a generic setting here, but it seems like 2 to 4 arenas is appropriate for most Ruby applications. Setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MALLOC_ARENA_MAX&lt;/code&gt; to 1 seems to have a high negative impact on performance with only a very marginal improvement to memory usage (1-2%). Experiment with these settings and &lt;em&gt;measure the results&lt;/em&gt; both in memory use reduction and performance reduction until you’ve made a tradeoff appropriate for your app.&lt;/p&gt;
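&lt;p&gt;glibc reads this variable once, at process boot, so it has to be in the environment before Ruby starts - e.g. in your Procfile, Dockerfile or systemd unit (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;puma&lt;/code&gt; command below is just an example):&lt;/p&gt;

```shell
# glibc checks MALLOC_ARENA_MAX at startup; setting it after the process
# has already booted has no effect.
export MALLOC_ARENA_MAX=2
echo "$MALLOC_ARENA_MAX"
# ...then start the server in the same environment, e.g.:
# bundle exec puma -C config/puma.rb
```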

&lt;h3 id=&quot;fix-2-use-jemalloc&quot;&gt;Fix 2: Use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jemalloc&lt;/code&gt;&lt;/h3&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;This is CodeTriage&amp;#39;s Sidekiq worker memory use with and without jemalloc. I&amp;#39;m really starting to wonder how much of Ruby&amp;#39;s memory problems are just caused by the allocator. &lt;a href=&quot;https://t.co/FD0fVbJCLt&quot;&gt;pic.twitter.com/FD0fVbJCLt&lt;/a&gt;&lt;/p&gt;&amp;mdash; Nate Berkopec (@nateberkopec) &lt;a href=&quot;https://twitter.com/nateberkopec/status/936627901071466496?ref_src=twsrc%5Etfw&quot;&gt;December 1, 2017&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;Another possible solution is to simply use a different allocator. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jemalloc&lt;/code&gt; also implements per-thread arenas, but its design seems to avoid the fragmentation issues present in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The above tweet was from when I removed jemalloc from &lt;a href=&quot;https://www.codetriage.com/&quot;&gt;CodeTriage&lt;/a&gt;’s background job processes. As you can see, the effect was pretty drastic. I also experimented with using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MALLOC_ARENA_MAX=2&lt;/code&gt;, but memory usage was still almost &lt;em&gt;4 times&lt;/em&gt; greater than memory usage with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jemalloc&lt;/code&gt;. &lt;strong&gt;If you can switch to jemalloc with Ruby, do it.&lt;/strong&gt; It seems to have the same or better performance than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc&lt;/code&gt; with far less memory use.&lt;/p&gt;

&lt;p&gt;This isn’t a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jemalloc&lt;/code&gt; blog post, but some finer points on using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jemalloc&lt;/code&gt; with Ruby:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/mojodna/heroku-buildpack-jemalloc&quot;&gt;You can use it on Heroku with this buildpack.&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Do not use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jemalloc&lt;/code&gt; 4.x with Ruby. It has a bad interaction with Transparent Huge Pages that reduces the memory savings you’ll see. Instead, use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jemalloc&lt;/code&gt; 3.6. 5.0’s performance with Ruby is currently unknown.&lt;/li&gt;
  &lt;li&gt;You do not need to compile Ruby with jemalloc (though you can). &lt;a href=&quot;https://github.com/jemalloc/jemalloc/wiki/Getting-Started&quot;&gt;You can dynamically load it with LD_PRELOAD.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
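&lt;p&gt;A minimal &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LD_PRELOAD&lt;/code&gt; sketch - note the shared-library path is an assumption and varies by distro and jemalloc version:&lt;/p&gt;

```shell
# Example path only - find yours with: ldconfig -p | grep jemalloc
JEMALLOC_PATH=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1

if [ -f "$JEMALLOC_PATH" ]; then
  # Every malloc/free in the Ruby process is now served by jemalloc.
  LD_PRELOAD="$JEMALLOC_PATH" bundle exec puma -C config/puma.rb
else
  echo "jemalloc not found at $JEMALLOC_PATH" >&2
fi
```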

&lt;h3 id=&quot;fix-3-compacting-gc&quot;&gt;Fix 3: Compacting GC&lt;/h3&gt;

&lt;p&gt;Fragmentation can generally be reduced if one can &lt;em&gt;move&lt;/em&gt; locations in memory around. We can’t do that in CRuby because C-extensions may use raw pointers to refer to Ruby’s memory - moving that location would cause a segfault or incorrect data to be read.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=8Q7M513vewk&quot;&gt;Aaron Patterson has been working on a compacting garbage collector for a while now.&lt;/a&gt; The work looks promising, but perhaps a ways off in the future.&lt;/p&gt;

&lt;h2 id=&quot;tldr&quot;&gt;TL;DR:&lt;/h2&gt;

&lt;p&gt;Multithreaded Ruby programs may be consuming 2 to 4 times the amount of memory that they really need, due to fragmentation caused by per-thread memory arenas in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;malloc&lt;/code&gt;. To fix this, you can reduce the maximum number of arenas by setting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MALLOC_ARENA_MAX&lt;/code&gt; environment variable or by switching to an allocator with better performance, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jemalloc&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The potential memory savings here are so great and the penalties so minor that &lt;strong&gt;I would recommend that if you are using Ruby and Puma or Sidekiq in production, you should always use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jemalloc&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;While this effect is most pronounced in CRuby, &lt;a href=&quot;https://github.com/cloudfoundry/java-buildpack/issues/320&quot;&gt;it may also affect the JVM and JRuby.&lt;/a&gt;&lt;/p&gt;
</description>
        <pubDate>Mon, 04 Dec 2017 07:00:00 +0000</pubDate>
        <link>https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html</link>
        <guid isPermaLink="true">https://www.speedshop.co/2017/12/04/malloc-doubles-ruby-memory.html</guid>
        
        
      </item>
    
      <item>
        <title>Configuring Puma, Unicorn and Passenger for Maximum Efficiency</title>
        <description>&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/unicorn_car.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;
In Ruby, web application servers are like gasoline in a car: the fancy stuff won’t make your car go any faster, but the nasty stuff will bring you grinding to a halt. Application servers can’t actually make your app significantly &lt;em&gt;faster&lt;/em&gt; - no, they’re all pretty much the same and changing from one to the other won’t improve your throughput or response times by much. But it &lt;em&gt;is&lt;/em&gt; easy to shoot yourself in the foot with a bad setting or misconfigured server. It’s one of the most common problems I see on client applications.&lt;/p&gt;

&lt;p&gt;This post will be about optimizing resource usage (memory and CPU) and maximizing throughput (that is, requests-per-second) from the three major Ruby application servers: Puma, Unicorn and Passenger. I’m going to use the terms “server” and “container” interchangeably, because nothing here is specific to a virtualized environment.&lt;/p&gt;

&lt;p&gt;I can cover all three of the popular application servers in a single guide because they all use, fundamentally, the same design. With the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fork&lt;/code&gt; system call, these application servers create several child processes, which then do the job of serving requests. &lt;sup class=&quot;sidenote-number&quot;&gt;1&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(In all three app servers, the ‘master’ process that creates the child processes does not actually answer any requests. Passenger will actually shut down the ‘master’ preload process after a while if you haven’t forked recently.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;1&lt;/sup&gt; In all three app servers, the ‘master’ process that creates the child processes does not actually answer any requests. Passenger will actually shut down the ‘master’ preload process after a while if you haven’t forked recently.&lt;/span&gt; Most of the differences between these servers lie in the finer details (which I’ll also cover here, where important for maximum performance).&lt;/p&gt;

&lt;p&gt;Throughout this guide, we’re going to try to maximize our throughput-per-server-dollar. We want to serve the greatest number of requests per second for the lowest amount of server resources (and therefore, cash).&lt;/p&gt;

&lt;h2 id=&quot;the-most-important-configuration-settings-for-performance&quot;&gt;The most important configuration settings for performance&lt;/h2&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/dyno.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Timeouts are fairly important too, but they’re not really throughput-related. I’ll leave them for another day.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;There are 4 fundamental settings on your application server that determine its performance and resource consumption:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Number of child processes.&lt;/li&gt;
  &lt;li&gt;Number of threads.&lt;/li&gt;
  &lt;li&gt;Copy-on-write.&lt;/li&gt;
  &lt;li&gt;Container size.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s go through each in turn.&lt;/p&gt;

&lt;h3 id=&quot;child-process-count&quot;&gt;Child process count&lt;/h3&gt;

&lt;p&gt;Unicorn, Puma and Passenger all use a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fork&lt;/code&gt;ing design.&lt;sup class=&quot;sidenote-number&quot;&gt;2&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(JRuby people can probably skip to the next section.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;2&lt;/sup&gt; JRuby people can probably skip to the next section.&lt;/span&gt; This means that they create one application process and call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fork&lt;/code&gt; a number of times to create copies of that application process. We call these copies child processes. The number of child processes we have on each server is probably the most important setting for maximizing throughput-per-server-dollar. &lt;sup class=&quot;sidenote-number&quot;&gt;3&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(This is because of CRuby’s Global VM Lock. Only one thread can execute Ruby code at a time, so the only way to achieve parallel Ruby work is to run multiple processes.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;3&lt;/sup&gt; This is because of CRuby’s Global VM Lock. Only one thread can execute Ruby code at a time, so the only way to achieve parallel Ruby work is to run multiple processes.&lt;/span&gt; We want to run &lt;em&gt;as many processes per server as possible&lt;/em&gt; without exceeding the resources of the server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I recommend that all Ruby webapps run at least 3 processes per server or container&lt;/strong&gt;. This maximizes routing performance. Puma and Unicorn both use a design where the child processes listen directly on a single socket, and then let the operating system balance load between the processes. Passenger uses a reverse proxy (nginx or Apache) to route requests to a child process.&lt;sup class=&quot;sidenote-number&quot;&gt;4&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(Passenger’s &lt;a href=&quot;https://www.phusionpassenger.com/library/indepth/ruby/request_load_balancing.html&quot;&gt;least-busy-process-first&lt;/a&gt; routing is actually one of my favorite features of theirs.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;4&lt;/sup&gt; Passenger’s &lt;a href=&quot;https://www.phusionpassenger.com/library/indepth/ruby/request_load_balancing.html&quot;&gt;least-busy-process-first&lt;/a&gt; routing is actually one of my favorite features of theirs.&lt;/span&gt; Both approaches are pretty efficient and mean that a request will be quickly routed to a worker that is idle. 
Routing at higher layers (that is, at the load balancer or Heroku’s HTTP mesh) is far more difficult to do efficiently, because the load balancer usually has no idea whether the servers it’s routing to are busy.&lt;sup class=&quot;sidenote-number&quot;&gt;5&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(For one client I had, moving from 30 servers with 2 processes each to 3 servers with 20 processes each almost &lt;em&gt;completely&lt;/em&gt; eliminated the timeout errors they were having (which were being caused by fast requests piling up behind slow ones).)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;5&lt;/sup&gt; For one client I had, moving from 30 servers with 2 processes each to 3 servers with 20 processes each almost &lt;em&gt;completely&lt;/em&gt; eliminated the timeout errors they were having (which were being caused by fast requests piling up behind slow ones).&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Consider a setup with 3 servers, each running 1 process (so a total of 3 processes). How does the load balancer optimally route a request to one of the three servers? It could do so randomly or in a round-robin fashion, but this does not guarantee that the request will be routed to a server with an idle, waiting process. For example, with a round-robin strategy, let’s say Request A is routed to server #1. Request B is then routed to server #2, and Request C to server #3. &lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/unicornhead.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;My face when you give me a request but all my children are busy.&lt;/span&gt; Now here comes a fourth request, Request D. What happens if Requests B and C have already been successfully served and those servers (2 and 3) are idle, but Request A was somebody’s CSV export and will take 20 seconds to complete? The load balancer will continue to give requests to server #1 even though it’s busy and won’t process them until it’s done with Request A. All load balancers have ways of knowing if a server is &lt;em&gt;completely&lt;/em&gt; dead, but most of these methods have a long lag time (i.e. 30 seconds or more of delay). Running higher numbers of processes per server insulates us from the risk of long-lived requests “hogging” the majority of a server’s child processes, because at the &lt;em&gt;server&lt;/em&gt; level, requests will &lt;em&gt;never&lt;/em&gt; be given to an already-busy process. Instead, they’ll back up at the socket level or the reverse proxy until a worker is free. From experience, I find that 3 processes per server is a good minimum to achieve this. If you can’t run at least 3 processes per server due to resource constraints, get a bigger server (more on that later).&lt;/p&gt;

&lt;p&gt;So, we should run at least 3 child processes per container. But what’s the maximum? That’s constrained by our memory and CPU resources.&lt;/p&gt;

&lt;p&gt;Let’s start with memory. Each child process uses a certain amount of memory. Obviously, we shouldn’t add more child processes than our server’s RAM can support!&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/log.jpeg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Actual memory usage of Ruby processes is logarithmic. Due to memory fragmentation, memory usually doesn’t level off, but only approaches a limit.&lt;/span&gt;
Measuring the actual memory usage of a single Ruby application process can be tricky, however. It’s not enough to just start up a process on your computer or production environment and check the number right away. &lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/puma_bloat.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;After a while, Puma workers can get rather… large.&lt;/span&gt; For a number of reasons, &lt;strong&gt;Ruby web application processes increase in memory usage over time&lt;/strong&gt;, even as much as doubling or tripling their memory usage from when they are spawned. To get an accurate measurement of how much memory your Ruby application processes are using, &lt;em&gt;disable all process restarts&lt;/em&gt; (worker killers) and wait 12-24 hours to take a measurement with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ps&lt;/code&gt;. If you’re on Heroku, you can use the new &lt;a href=&quot;https://devcenter.heroku.com/articles/exec&quot;&gt;Heroku Exec&lt;/a&gt; to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ps&lt;/code&gt; on a running dyno, or simply divide Heroku’s memory usage metric by the number of processes you are running per dyno. Most Ruby applications will use between 200 and 400 MB per process, but some can use as much as 1GB.&lt;/p&gt;
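&lt;p&gt;Once the processes have been up that long, a one-liner like this reads each worker’s resident set size (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;puma&lt;/code&gt; here is just an example process name):&lt;/p&gt;

```shell
# RSS is reported in kilobytes. The [p] trick stops grep matching itself.
ps axo pid,rss,command | grep "[p]uma" || echo "no matching processes"
```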

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/david_meme.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;1 upvote = 1 prayer&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Be sure to give yourself some headroom on the memory number - if you want an equation, set your child process count to something like (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TOTAL_RAM&lt;/code&gt; / (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RAM_PER_PROCESS&lt;/code&gt; * 1.2)).&lt;/p&gt;
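&lt;p&gt;With illustrative numbers - say a 4 GB server and workers that settle at 300 MB each after a day - that equation works out like so:&lt;/p&gt;

```ruby
# Hypothetical figures; substitute your server's RAM and the per-process
# RSS you measured after 12-24 hours.
total_ram_mb       = 4096
ram_per_process_mb = 300

# The 1.2 multiplier is the ~20% headroom from the equation above.
worker_count = (total_ram_mb / (ram_per_process_mb * 1.2)).floor
puts worker_count # => 11
```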

&lt;p&gt;Exceeding the available memory capacity of a server/container can cause major slowdowns as memory is overcommitted and swapping starts to occur. This is why you want your application’s memory usage to be predictable and consistent with no sudden spikes. Sudden increases in memory usage are a condition I call &lt;em&gt;memory bloat&lt;/em&gt;. Solving this is a topic for another day or post, but the topic is covered in &lt;a href=&quot;http://www.railsspeed.com&quot;&gt;The Complete Guide to Rails Performance&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Second, we don’t want to exceed the available CPU capacity of our server. Ideally, we don’t spend more than 5% of our total deployed time at 100% CPU usage - more than that means that we’re being bottlenecked by the available CPU capacity. Most Ruby and Rails applications tend to be memory-bottlenecked on most cloud providers, but sometimes CPU can be the bottlenecking resource too. How do you know? Just use your favorite server monitoring tool - AWS’s built-in tools are probably good enough for figuring out if CPU usage is frequently maxing out.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/thatwasalie.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;You said that OS context switching was expensive. Actual production use determined that was a lie.&lt;/span&gt;
It’s frequently said that you shouldn’t have more child processes per server than CPUs. This is only &lt;em&gt;partly&lt;/em&gt; true. It’s a good starting point, but actual CPU usage is the metric you should watch and optimize. In practice, most applications will probably settle at a process count that is 1.25-1.5x the number of available hyperthreads.&lt;/p&gt;

&lt;p&gt;On Heroku, use &lt;a href=&quot;https://devcenter.heroku.com/articles/log-runtime-metrics&quot;&gt;log-runtime-metrics&lt;/a&gt; to get a CPU load metric written to your logs. I would look at the 5 and 15 minute load averages - if they are consistently close to or higher than 1, you are maxing out CPU and need to reduce child process counts.&lt;/p&gt;

&lt;p&gt;Setting child process counts is pretty easy in every application server:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Puma&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;puma&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Command-line option&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;workers&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# in your config/puma.rb&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Unicorn&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;worker_processes&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# config/unicorn.rb&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Passenger (nginx/Standalone)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Passenger can automatically scale workers up and down - I don't find this&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# super useful. Instead, just run a constant number by setting the max and min:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;passenger_max_pool_size&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;passenger_min_instances&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Instead of setting this to a hard number, you may want to set it to an environment variable such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WEB_CONCURRENCY&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;workers&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;ENV&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;WEB_CONCURRENCY&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In summary, most applications will want to use 3-8 processes per server, depending on available resources. Highly memory-constrained applications or apps which have high 95th percentile times (5-10 seconds or more) may want to run higher numbers, up to 4x the available hyperthread count. Most apps’ child process counts should not exceed 1.5x the amount of available hyperthreads.&lt;/p&gt;

&lt;h3 id=&quot;thread-count&quot;&gt;Thread count&lt;/h3&gt;

&lt;p&gt;Puma and Passenger Enterprise support multi-threading your application, so this discussion is aimed at those servers.&lt;/p&gt;

&lt;p&gt;Threads can be a resource-light way of improving your application’s concurrency (and, therefore, throughput). Rails is already threadsafe, and most applications aren’t doing weird things like creating their own threads or using globals to access shared resources, like database connections (looking at you, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$redis&lt;/code&gt;!). So, &lt;em&gt;most&lt;/em&gt; Ruby web-applications are thread-safe. The only &lt;em&gt;real&lt;/em&gt; way to know is to actually give it a shot. Ruby applications tend to surface threading bugs in loud, exception-raising ways, so it’s easy to try it and see the results.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote &quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/amdahl.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;
So how many threads should we use? The speedup you can gain from additional parallelism depends on the &lt;em&gt;portion of your program’s execution which can be done in parallel&lt;/em&gt;. &lt;a href=&quot;https://en.wikipedia.org/wiki/Amdahl%27s_law&quot;&gt;This is known as Amdahl’s Law&lt;/a&gt;. In MRI/C Ruby, we can only parallelize waiting on IO (waiting on a database result, for example). For &lt;em&gt;most&lt;/em&gt; web applications, this is probably 10-25% of their total time. You can check for your own application by looking at the amount of time you spend “in the database” per request. Unfortunately, what Amdahl’s law reveals is that for programs that have small parallel portions (less than 50%), there is little to no benefit past a handful of threads. This matches my own experience: on client applications, thread settings of more than 5 have no effect. &lt;a href=&quot;https://appfolio-engineering.squarespace.com/appfolio-engineering/2017/1/31/the-benchmark-and-the-rails&quot;&gt;Noah Gibbs also tested this against the Discourse homepage benchmark&lt;/a&gt; and settled on a thread count of 6.&lt;/p&gt;
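&lt;p&gt;A back-of-the-envelope with Amdahl’s formula shows why the returns fall off so quickly. Taking 25% as the parallelizable (I/O) fraction - the top of the range above:&lt;/p&gt;

```ruby
# Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n), where p is the fraction
# of work that can run in parallel and n is the thread count.
def speedup(parallel_fraction, threads)
  1.0 / ((1 - parallel_fraction) + parallel_fraction / threads)
end

[2, 5, 16].each do |n|
  puts format("%2d threads: %.2fx", n, speedup(0.25, n))
end
# 2 threads: 1.14x, 5 threads: 1.25x, 16 threads: 1.31x
```

&lt;p&gt;Tripling the thread count past 5 buys only a few percent more throughput, which is why a handful of threads is enough.&lt;/p&gt;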

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/setit.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;
Unlike process count, where I advise you to constantly check the metrics against your settings and tune appropriately, with threads, it’s usually OK to just “set it and forget it” to 5 threads per application server process.&lt;/p&gt;

&lt;p&gt;In MRI/C Ruby, threads can have a surprisingly large memory impact. This is due to a host of complicated reasons (which I’ll probably get into in a future post). Be sure to check memory consumption before and after adding threads to the application. Do &lt;em&gt;not&lt;/em&gt; expect that each thread will only consume an additional 8MB of stack space; they will often increase total memory usage by &lt;em&gt;far&lt;/em&gt; more than that.&lt;/p&gt;

&lt;p&gt;Here’s how to set thread counts:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Puma. Again, I don't really use the &quot;automatic&quot; spin-up/spin-down features, so&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# I set the max and min to the same number.&lt;/span&gt;
&lt;span class=&quot;err&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;puma&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Command-line option&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;threads&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# in your config/puma.rb&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Passenger (nginx/Standalone)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;passenger_concurrency_model&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;thread&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;passenger_thread_count&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For JRuby users, threads are fully parallelizable, so the parallel portion of your program is much larger and Amdahl’s Law works in your favor. Setting thread counts will be more like setting process counts under MRI (described above): increase the count until you run out of memory or CPU resources.&lt;/p&gt;

&lt;h3 id=&quot;copy-on-write-behavior&quot;&gt;Copy-on-write behavior&lt;/h3&gt;

&lt;p&gt;All Unix-based operating systems implement copy-on-write memory behavior. It’s pretty simple: when a process &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fork&lt;/code&gt;s and creates a child, that child process’ memory is &lt;em&gt;shared&lt;/em&gt;, completely, with the parent process. All memory reads from the child process will simply read from the parent’s memory. However, modifying a memory location creates a copy, solely for the private use of the child process. It’s extremely useful for reducing the memory usage of forking webservers, since child processes should, in theory, be able to share things like shared libraries and other “read-only” memory with the parent, rather than creating their own copy.&lt;/p&gt;
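&lt;p&gt;You can see the sharing in action with a toy example (this assumes MRI on a Unix system; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fork&lt;/code&gt; isn’t available on Windows or JRuby):&lt;/p&gt;

```ruby
# Illustrative sketch of copy-on-write across fork. The child reads memory
# allocated by the parent; no copy is made until one side writes to a page.
data = "x" * 1_000_000 # about 1MB, allocated in the parent before forking

reader, writer = IO.pipe
pid = fork do
  reader.close
  # Reading the parent's memory is free: reads are shared, not copied
  writer.write(data.bytesize.to_s)
  writer.close
end
writer.close
child_saw = reader.read.to_i
Process.wait(pid)

puts child_saw # the child saw all 1,000,000 bytes without duplicating them
```

&lt;p&gt;Preloading simply moves your application boot to before that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fork&lt;/code&gt; call, so the whole loaded app sits in shareable parent memory.&lt;/p&gt;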

&lt;p&gt;Copy-on-write &lt;em&gt;just happens&lt;/em&gt;. &lt;sup class=&quot;sidenote-number&quot;&gt;6&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(You can’t really ‘support’ copy-on-write so much as just ‘make it more effective at saving you memory’.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;6&lt;/sup&gt; You can’t really ‘support’ copy-on-write so much as just ‘make it more effective at saving you memory’.&lt;/span&gt; It can’t be “turned off”, but you can make it more effective. Basically, we want to load all of our application &lt;em&gt;before&lt;/em&gt; forking. Most Ruby webapp servers call this “preloading”. All it does is change &lt;em&gt;when&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fork&lt;/code&gt; is called - before or after your application is initialized.&lt;/p&gt;

&lt;p&gt;You’ll also need to re-connect to any databases you’re using after forking. For example, with ActiveRecord:&lt;/p&gt;

&lt;div class=&quot;language-ruby highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Puma&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;preload_app!&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;on_worker_boot&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# Valid on Rails 4.1+ using the `config/database.yml` method of setting `pool` size&lt;/span&gt;
  &lt;span class=&quot;no&quot;&gt;ActiveRecord&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;Base&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;establish_connection&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Unicorn&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;preload_app&lt;/span&gt; &lt;span class=&quot;kp&quot;&gt;true&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;after_fork&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;server&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;worker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;
	&lt;span class=&quot;no&quot;&gt;ActiveRecord&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;Base&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;establish_connection&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Passenger uses preloading by default, so no need to turn it on.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Passenger automatically establishes connections to ActiveRecord,&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# but for other DBs, you will have to:&lt;/span&gt;
&lt;span class=&quot;no&quot;&gt;PhusionPassenger&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;on_event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;ss&quot;&gt;:starting_worker_process&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;forked&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;forked&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;reestablish_connection_to_database&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# depends on the DB&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In theory, you have to do this for every database your application uses. In practice, many clients connect lazily - Sidekiq, for example, doesn’t try to connect to Redis until you actually try to do something - so unless you’re running Sidekiq jobs during application boot, you don’t have to reconnect it after forking.&lt;/p&gt;

&lt;p&gt;Unfortunately, there are limits to the gains of copy-on-write. Transparent Huge Pages can cause even a 1-bit memory modification to copy an entire 2MB page, and &lt;a href=&quot;https://brandur.org/ruby-memory&quot;&gt;fragmentation can also limit savings&lt;/a&gt;. But it doesn’t hurt, so turn on
preloading anyway.&lt;/p&gt;

&lt;h3 id=&quot;container-size&quot;&gt;Container size&lt;/h3&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/hungry.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Gimme some of that memory, boi&lt;/span&gt;
In general, we want to make sure we’re utilizing 70-80% of our server’s available CPU and memory. These needs will differ between applications, and the ratio between CPU cores and GB of memory will differ in turn. One application might be happiest on a 4 vCPU / 4 GB of RAM server with 6 Ruby processes, while another less-memory-hungry and more CPU-heavy application might do well with 8 vCPUs and 2GB of RAM. There’s no one perfect container size, but the ratio between CPU and memory should be chosen based on your actual production metrics.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/spicywinner.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;
The &lt;strong&gt;amount of memory available to our server or container&lt;/strong&gt; is probably one of the most important resources we can tune. On many providers, this number is exceedingly low - 512MB on the standard Heroku dyno, for example - and Ruby applications, especially sufficiently complex and mature ones, are memory-hungry.&lt;/p&gt;

&lt;p&gt;Because most Rails applications use ~300MB of RAM and I think everyone should be running at least 3 processes per server, most Rails applications will need a server with at least 1 GB of RAM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our server’s CPU resources&lt;/strong&gt; are another important lever we can tune. We need to know how many CPU cores are available to us, and how many threads we can execute at a single time (basically, does this server support Hyper-Threading or not?).&lt;/p&gt;

&lt;p&gt;As I mentioned in the discussion of child process counts, &lt;strong&gt;your container should support at least 3 child processes&lt;/strong&gt;. Even better would be 8 or more processes per server/container. Higher process counts per container improve request routing and decrease latency.&lt;/p&gt;

&lt;h2 id=&quot;tldr&quot;&gt;TL;DR&lt;/h2&gt;

&lt;p&gt;This was an overview of how best to maximize the throughput of your Ruby web application servers. In short, here are the steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Figure out how much memory 1 worker with 5 threads uses.&lt;/strong&gt; If you’re using Unicorn, obviously no threads required. Run just a few workers on a single server under production load for at least 12 hours without restarting. Use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ps&lt;/code&gt; to get the memory usage of a typical worker.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Choose a container size with memory equal to at least 3X that number&lt;/strong&gt;. Most Rails applications will use ~300-400MB of RAM per worker. So, most Rails apps will need at least 1 GB container/server. This gives us enough memory headroom to run at least 3 processes per server. You can run a number of child processes equal to (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TOTAL_RAM&lt;/code&gt; / (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RAM_PER_PROCESS&lt;/code&gt; * 1.2)).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Check CPU core/hyperthread counts.&lt;/strong&gt; If your container has &lt;em&gt;fewer&lt;/em&gt; hyperthreads (vCPUs on AWS) than your memory can support, you can either choose a container size with less memory or more CPU. Ideally, the number of child processes you run should equal 1.25-1.5x the number of hyperthreads.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Deploy and watch CPU and memory consumption&lt;/strong&gt;. Tune child process count and container size as appropriate to maximize usage.&lt;/li&gt;
&lt;/ol&gt;
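&lt;p&gt;The arithmetic in steps 2 and 3 is easy to script. A sketch with hypothetical numbers:&lt;/p&gt;

```ruby
# Worked example of the sizing formula above: TOTAL_RAM / (RAM_PER_PROCESS * 1.2).
# The 1.2 multiplier leaves 20% headroom per process for memory growth over
# time. The numbers below are hypothetical.
def max_processes(total_ram_mb:, ram_per_process_mb:)
  (total_ram_mb / (ram_per_process_mb * 1.2)).floor
end

# A 4 GB server running Rails workers that settle at ~350MB each:
puts max_processes(total_ram_mb: 4096, ram_per_process_mb: 350) # => 9
```

&lt;p&gt;Nine processes would then want roughly 6-7 hyperthreads (processes at 1.25-1.5x hyperthreads), so you’d compare that against the vCPU count of the container before settling on a size.&lt;/p&gt;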
</description>
        <pubDate>Thu, 12 Oct 2017 07:00:00 +0000</pubDate>
        <link>https://www.speedshop.co/2017/10/12/appserver.html</link>
        <guid isPermaLink="true">https://www.speedshop.co/2017/10/12/appserver.html</guid>
        
        
      </item>
    
      <item>
        <title>Is Ruby Too Slow For Web-Scale?</title>
        <description>&lt;p&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;1&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(Okay, okay, I know. &lt;a href=&quot;https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines&quot;&gt;Betteridge’s Law of Headlines&lt;/a&gt;. Of course Ruby and Rails are fast enough for big websites - Shopify makes it work and they’re one of the largest in the world. But some people &lt;em&gt;genuinely do seem to think&lt;/em&gt; that Rails ‘isn’t fast enough’. That’s what this article is about.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;1&lt;/sup&gt; Okay, okay, I know. &lt;a href=&quot;https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines&quot;&gt;Betteridge’s Law of Headlines&lt;/a&gt;. Of course Ruby and Rails are fast enough for big websites - Shopify makes it work and they’re one of the largest in the world. But some people &lt;em&gt;genuinely do seem to think&lt;/em&gt; that Rails ‘isn’t fast enough’. That’s what this article is about.&lt;/span&gt; How does one choose a framework or programming language for a new web application?&lt;/p&gt;

&lt;p&gt;You almost certainly need one, unless you’re doing something pretty trivial. All web applications have a lot of boilerplate they need to get running: security, object-relational mapping,
templating and testing. So how do you know which one to choose?&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/Tiny-trains-on-track.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;This is what Rails is, right?&lt;/span&gt;
Well, you certainly don’t want to pick a &lt;em&gt;slow&lt;/em&gt; framework, do you? That wouldn’t be good - we want a &lt;em&gt;fast&lt;/em&gt;, &lt;em&gt;modern&lt;/em&gt;, and &lt;em&gt;lightweight&lt;/em&gt; web framework, not some &lt;em&gt;heavy&lt;/em&gt;, &lt;em&gt;old&lt;/em&gt;, &lt;em&gt;slow&lt;/em&gt;, web framework. Heavy, old, and slow…like Ruby on Rails, right? Ruby on Rails, the king of the all-in-one web framework space for the last 10 years, is constantly under assault by faster, nippier, lighter competitors. Is Rails a dinosaur that can no longer compete?&lt;/p&gt;

&lt;p&gt;Well, we could look at some benchmarks to find out. Surely a “fast” and “lightweight” framework would do well on a benchmark, while an old, busted framework would do poorly.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/rails-sucks.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Yeah! Rails sucks!&lt;/span&gt;
You would be forgiven for thinking that Ruby on Rails was somehow irretrievably graveyard-bound if you looked at the benchmarks posted by sites such as &lt;a href=&quot;https://www.techempower.com/benchmarks/&quot;&gt;TechEmpower&lt;/a&gt;. Sequel author Jeremy Evans recently pointed out that even &lt;a href=&quot;https://twitter.com/jeremyevans0/status/864212426618675200&quot;&gt;other Ruby frameworks can bury Rails&lt;/a&gt; in these comparisons. You look at those benchmarks and think: “Wow, Sequel is &lt;em&gt;ten times faster&lt;/em&gt; than ActiveRecord and Rails!”&lt;/p&gt;

&lt;p&gt;And in a narrow sense, you’d be right. Benchmarks are like statistics - it’s easy to give the right answer to the wrong question, and allow the reader to draw a conclusion which isn’t supported by the data. If you looked at those benchmarks and thought: “If I take my Rails application and rewrite it in Sequel and Sinatra, it will be ten times faster than it is now!”, you would be wrong.&lt;/p&gt;

&lt;p&gt;And, even if it &lt;em&gt;was&lt;/em&gt; faster, would it matter? &lt;strong&gt;Is there such a thing as a &lt;em&gt;fast enough&lt;/em&gt; web application?&lt;/strong&gt; Just how important is performance when choosing a web framework or even a programming language for a web application?&lt;/p&gt;

&lt;h2 id=&quot;latency-and-throughput&quot;&gt;Latency and Throughput&lt;/h2&gt;

&lt;p&gt;Let’s start with some definitions.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/funnel.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Servers are like funnels: latency is how long it takes one molecule of water to pass through the funnel, throughput is how much water passes through the funnel every second. A high-latency high-throughput server would be something like a long, wide tube, and a low-latency low-throughput server would look like a short, wide disc with a narrow opening.&lt;/span&gt;
In server application design, &lt;em&gt;latency&lt;/em&gt; and &lt;em&gt;throughput&lt;/em&gt; are king. &lt;em&gt;Latency&lt;/em&gt; is the amount of time it takes our server to respond to a single request. &lt;em&gt;Throughput&lt;/em&gt; is how many requests we can serve in a given amount of time, usually measured in a unit like responses/second.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Throughput&lt;/em&gt; of a web application is generally governed by CPU and parallelism - how many CPU cycles does it take to respond to a web request, and how efficiently can you saturate all the CPU cores of the host machine? The amount of CPU cycles is governed by the application’s domain, framework, and language - complicated apps take more time, and dynamic languages like Ruby generate more CPU instructions than compiled languages like C or Rust. Efficiently using all the available CPU resources varies depending on the language - Go’s goroutines, Elixir’s “processes”, multi-process servers to get around global VM locks like Python and Ruby, event-driven architectures like Node, or true threading like Java.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Latency&lt;/em&gt;, however, is even more important. This is because &lt;em&gt;latency is inversely proportional to throughput&lt;/em&gt;. If we halve the latency of our web application, we double its maximum throughput. Latency also affects the end-user experience - a 500 millisecond response time manifests as an extra 500 milliseconds the user must spend waiting for the webpage to load.&lt;/p&gt;
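&lt;p&gt;To see that inverse relationship concretely: a worker that handles one request at a time serves 1 / latency requests per second, so a pool’s maximum throughput is workers / latency. A sketch with hypothetical numbers:&lt;/p&gt;

```ruby
# Each single-request-at-a-time worker serves (1 / latency) requests per
# second, so the pool's maximum throughput is workers / latency.
# The numbers below are hypothetical.
def max_throughput(workers:, latency_seconds:)
  workers / latency_seconds
end

puts max_throughput(workers: 10, latency_seconds: 0.5)  # 20 requests/second
puts max_throughput(workers: 10, latency_seconds: 0.25) # halving latency doubles it: 40
```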

&lt;h2 id=&quot;benchmark-trip-ups&quot;&gt;Benchmark Trip-Ups&lt;/h2&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/topfuel.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;TechEmpower’s servers&lt;/span&gt;
Let’s take a look at the &lt;a href=&quot;https://www.techempower.com/benchmarks/&quot;&gt;TechEmpower web framework benchmarks&lt;/a&gt;. TechEmpower measures latency and maximum throughput across six synthetic benches. These benchmarks are run on pretty fat servers - they’ve got 4 CPUs with 10 cores and 20 threads &lt;em&gt;each&lt;/em&gt; (so, 40 cores and 80 hyperthreads in total). Oh yeah, and 528 GB of RAM.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/multiquery.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Rails implementation of the benchmark&lt;/span&gt;
One of the more relevant benchmarks is the multiple-queries benchmark. It’s pretty simple - it executes 20 queries, sequentially, against a SQL database, and then returns the result. This is a pretty common web application workload - most Rails applications I’ve worked on roughly look like this. As we render the template, we execute a few SQL queries to get the results to populate the template, and return it.&lt;/p&gt;

&lt;p&gt;In Round 14, the typical Rails setup (puma-mri-rails) clocks in at a measly ~531 requests per second. &lt;a href=&quot;https://github.com/jeremyevans/roda&quot;&gt;Roda&lt;/a&gt;, an &lt;em&gt;extreme&lt;/em&gt; lightweight Ruby web framework, when used with Sequel, clocks in at about 7000 requests/second, depending on the webserver used.&lt;/p&gt;

&lt;p&gt;So does that mean Rails is more than 10 times slower than Roda and Sequel? On an 80 core machine, is 531 requests/second really all you can get out of Ruby on Rails?&lt;/p&gt;

&lt;p&gt;TechEmpower’s Rails setup is unbelievably crippled compared to their Roda setup. &lt;a href=&quot;https://github.com/TechEmpower/FrameworkBenchmarks/blob/e784c36f255b318611d3a0a2c91ad57255eb19d5/frameworks/Ruby/rails/run_mri_puma.sh#L7&quot;&gt;Their Puma server is configured to run just 8 processes&lt;/a&gt;, while &lt;a href=&quot;https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/Ruby/roda-sequel/config/mri_puma.rb&quot;&gt;Roda auto-tunes itself&lt;/a&gt;, ending up with around 100 processes. So the Rails benchmark is using, at best, about 15-20% of the available hyperthreads, while the Roda benchmark is using all of them. So that’s &lt;em&gt;at least&lt;/em&gt; a 5-8x throughput penalty for the Rails benchmark &lt;em&gt;out of the gate&lt;/em&gt;. But that’s fixable - TechEmpower is open source and &lt;a href=&quot;https://github.com/TechEmpower/FrameworkBenchmarks/pull/2850&quot;&gt;we can just open a pull request and fix this&lt;/a&gt;, and we’ll get better results for Round 15.&lt;/p&gt;

&lt;p&gt;Let’s take a look at another TechEmpower measurement - average request latency. Focusing on request latency allows us to put all languages and frameworks on a somewhat more even footing, because things like global VM locks and other concurrency features usually don’t really matter when processing a single request.&lt;sup class=&quot;sidenote-number&quot;&gt;2&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(Concurrency features generally increase throughput, not decrease latency.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;2&lt;/sup&gt; Concurrency features generally increase throughput, not decrease latency.&lt;/span&gt; On the multiple-query database test, Puma and Rails clock in at 129 milliseconds. The Roda/Sequel/Puma stack clocks in at 31.3 milliseconds.&lt;/p&gt;

&lt;p&gt;Now, as I said, the Puma settings for Rails on TechEmpower are incredibly crippled compared to the Roda settings, so Rails could probably still shave a lot off of that time, but let’s take it as it is. Let’s just say Rails adds &lt;strong&gt;one hundred milliseconds&lt;/strong&gt; of latency to the average web application response over a microframework or other competing platform like Phoenix. (Actually, Phoenix is slower on this test than Rails. &lt;a href=&quot;https://www.reddit.com/r/elixir/comments/48ke69/any_reason_why_elixirphoenix_did_so_badly_in/&quot;&gt;The framework creators dispute this result though&lt;/a&gt;, and I don’t doubt it if the Rails benchmark is this gimped too).&lt;/p&gt;

&lt;h2 id=&quot;the-computer-changes-but-the-human-does-not&quot;&gt;The Computer Changes, But The Human Does Not&lt;/h2&gt;

&lt;p&gt;&lt;span class=&quot;marginnote &quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/room-sized-computer.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;/span&gt;
The funny thing about computers is that although they keep getting faster, squishy human beings stay the same speed. Just &lt;em&gt;how fast&lt;/em&gt; a human-computer interaction has to be has been studied since the 1960s. You can understand their interest in this, back in the times when computers were the size of rooms and computations took hours rather than microseconds. If the computer was going to move out of the mainframe and the science lab and into public life, it was going to have to be faster. But &lt;em&gt;how much&lt;/em&gt; faster?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.nngroup.com/articles/response-times-3-important-limits/&quot;&gt;Jakob Nielsen summarized the results in 1993:&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote &quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/jakob_mouse_big.jpg&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Jakob Nielsen. I am glad that this photo exists.&lt;/span&gt;&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;0.1 second: Limit for users feeling that they are directly manipulating objects in the UI. (…)&lt;/p&gt;

  &lt;p&gt;1 second: Limit for users feeling that they are freely navigating the command space without having to unduly wait for the computer. (…)&lt;/p&gt;

  &lt;p&gt;10 seconds: Limit for users keeping their attention on the task. (…)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can read his full article on the topic &lt;a href=&quot;https://www.nngroup.com/articles/response-times-3-important-limits/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;on-the-web-how-fast-is-fast-enough&quot;&gt;On the web, how fast is fast enough?&lt;/h3&gt;

&lt;p&gt;Let’s assume that all our little web application does is return an HTML response with &lt;em&gt;no&lt;/em&gt; JavaScript or CSS. It’s just a flat, HTML document with the default browser styling.&lt;sup class=&quot;sidenote-number&quot;&gt;3&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(Imagine if you would, for a moment, a website whose styling is even more boring than this one.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;3&lt;/sup&gt; Imagine if you would, for a moment, a website whose styling is even more boring than this one.&lt;/span&gt; How long would it take for a user to visit &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;www.oursite.com&lt;/code&gt; and receive a response?&lt;/p&gt;

&lt;p&gt;Well, if our user is on a desktop computer in the same country as our servers, it will take about 20 milliseconds for their packets to get from their computer to our servers, and another 20 milliseconds back. This is a &lt;em&gt;best case scenario&lt;/em&gt;: if they’re on the other side of the world, this could easily be 100 milliseconds each way. If they’re on a mobile cellular connection, we’re talking ~300-400 milliseconds. My home DSL connection fluctuates from 50-150 milliseconds to most US servers.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote &quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/brentrambo.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;150 milliseconds time-to-first-byte? That’s Brent Rambo Approved.&lt;/span&gt;
So, if we’ve already got ~40 milliseconds of round-trip network latency in the first place, will our users be able to perceive the difference in a web application which renders a response in 1 millisecond or 100 milliseconds? That is, one application will take 41 milliseconds in total and the other 141. The answer &lt;strong&gt;is emphatically no&lt;/strong&gt;. Both applications will appear almost instantaneous to the user. And in the worst cases of network conditions, the difference will completely vanish. So minor latency differences (100 milliseconds or less, as in the difference between web frameworks) only matter in their contribution to improving throughput.&lt;/p&gt;

&lt;h3 id=&quot;your-server-is-just-a-small-part-of-the-user-experience&quot;&gt;Your Server is Just a Small Part of the User Experience&lt;/h3&gt;

&lt;p&gt;&lt;span class=&quot;marginnote &quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/modern-web.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;WELCOME TO THE MODERN WEB, BITCH.&lt;/span&gt;
It’s 2017 and web applications don’t return flat HTML files anymore. Websites are gargantuan, with JavaScript bundles stretching into the size of megabytes and stylesheets that couldn’t fit in ten Apollo Guidance Computers. So how much of a difference does a web application which responds in 1 millisecond or less make in this environment?&lt;/p&gt;

&lt;p&gt;Vanishingly little. Nowadays, the average webpage takes 5 seconds to render. Some JavaScript single-page-applications can take 12 seconds or more on initial render.&lt;/p&gt;

&lt;p&gt;Server response times simply make up a teeny-tiny part of the actual user experience of loading and interacting with a webpage - cutting 99 milliseconds off the server response time just doesn’t make a difference.&lt;/p&gt;

&lt;h3 id=&quot;theres-a-ceiling-web-apps-arent-video-games&quot;&gt;There’s a Ceiling: Web Apps Aren’t Video Games&lt;/h3&gt;

&lt;p&gt;In the video gaming world, speed matters. Faster languages can mean more polygons on the screen per frame. There’s really no upper limit for this - more polygons will always be good, so a faster language will always help with increasing the fidelity of the simulation.
&lt;span class=&quot;marginnote &quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/sortafast.gif&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=hU7EHKFNMQg&quot;&gt;for those unfamiliar with the meme&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Web applications are not like this. Fundamentally, 90% of them are simple CRUD applications. A faster language does not open more possibilities for functionality or features, it just takes the same HTML webform we’ve been rendering and renders it a few milliseconds faster. There’s a &lt;em&gt;ceiling&lt;/em&gt; on the usefulness of reduced request latency.&lt;/p&gt;

&lt;h3 id=&quot;ruby-is-slow-so-more-ruby-is-slower&quot;&gt;Ruby is Slow, so More Ruby is Slower&lt;/h3&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/hashtables.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;&lt;a href=&quot;https://twitter.com/mperham/status/884126933255995392&quot;&gt;Mike Perham&lt;/a&gt;. And, ultimately, most of Ruby’s internals boil down to hash tables, so…&lt;/span&gt;
Ruby isn’t a fast language. So, if you execute less of it, you’ll have a faster benchmark result.&lt;/p&gt;

&lt;p&gt;Feature-rich frameworks like Rails have a &lt;em&gt;lot&lt;/em&gt; of code, and execute a lot more on each request because they are &lt;em&gt;doing more stuff&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This seems like 101-level stuff, but again, TechEmpower and other benchmarks typically do &lt;em&gt;not&lt;/em&gt; make the difference in features obvious. On TechEmpower, all you get is this impossible-to-skim array of tags.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;marginnote no-mobile&quot;&gt;&lt;img src=&quot;https://www.speedshop.co/assets/posts/img/techempower.png&quot; loading=&quot;lazy&quot; /&gt;&lt;br /&gt;Yes, this is an easy-to-understand feature comparison which humans can read.&lt;/span&gt;
On throughput microbenchmarks like TechEmpower, where differences are measured in milliseconds (or even microseconds), what you’re really measuring is how many &lt;em&gt;CPU instructions&lt;/em&gt; a particular language runtime generates in response to a particular request. And since there’s no real way to compare featuresets between frameworks on TechEmpower, all frameworks are placed on an “equal footing” and you’ll think that Rails is the slowest web framework in the world.&lt;/p&gt;

&lt;p&gt;The truth is that Rails does &lt;em&gt;a lot&lt;/em&gt; on every request. Just create a new Rails app and look at the middleware stack (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rake middleware&lt;/code&gt;). There’s a lot of work being done here that &lt;em&gt;every good web application should do&lt;/em&gt; but many frameworks &lt;em&gt;do not do for you&lt;/em&gt;, at least by default.&lt;/p&gt;

&lt;h3 id=&quot;performance-is-more-complicated-than-cpu-or-maximum-throughput&quot;&gt;Performance is More Complicated than CPU or Maximum Throughput&lt;/h3&gt;

&lt;p&gt;While on TechEmpower CPU usage is the bottleneck, in the real world, the CPU performance of the language or the framework is almost never the &lt;em&gt;bottleneck&lt;/em&gt; for a web application’s performance. Web applications are fairly I/O heavy, especially as they grow more complicated. The modern Rails application may interact with three separate databases or more - their SQL database, Redis for their backend job processor, and Memcache for caching. Often, time spent interacting with these databases can make up 25% or more of a response.&lt;/p&gt;

&lt;p&gt;In addition, as a Ruby on Rails performance consultant, I’ve seen so many problems with application deployments that have nothing to do with the CPU performance of the framework or language: poor server configurations, memory leaks or bloat, or poor use of caching. Programmers, mysteriously, seem to find a way to completely degrade the performance of their application all on their own!&lt;/p&gt;

&lt;p&gt;Finally, most mature web applications spend &lt;em&gt;at most&lt;/em&gt; 50% of their execution time in the framework itself, and far more time in the actual application code and other added dependencies. This is pretty easy to see in Ruby - take a look at a stacktrace and count how many of the top frames are from your framework. It won’t be many. If your application could be rewritten in a faster framework in the same language, you would halve its response times &lt;em&gt;at best&lt;/em&gt;.&lt;/p&gt;
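&lt;p&gt;As a sketch of that exercise (the backtrace below is invented for illustration, not a real trace), partitioning frames by gem path is a one-liner:&lt;/p&gt;

```ruby
# Count how many backtrace frames come from gems (the framework plus
# other dependencies) versus your own application code. These paths are
# made-up examples for illustration.
backtrace = [
  "/app/app/controllers/orders_controller.rb:12:in `show'",
  "/app/app/models/order.rb:40:in `total'",
  "/gems/activerecord-6.1.0/lib/active_record/core.rb:571:in `find'",
  "/app/app/services/tax_calculator.rb:8:in `call'",
  "/gems/actionpack-6.1.0/lib/action_controller/metal.rb:190:in `dispatch'",
]

framework_frames, app_frames = backtrace.partition { |f| f.include?("/gems/") }
```

&lt;p&gt;In a live process you would run the same partition over &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;caller&lt;/code&gt; or an exception’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backtrace&lt;/code&gt;.&lt;/p&gt;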

&lt;h2 id=&quot;rewrite-your-entire-application-to-save-1000month&quot;&gt;Rewrite Your Entire Application to Save $1,000/month&lt;/h2&gt;

&lt;p&gt;What I worry about is what people do with the information presented in relative benchmarks like TechEmpower. Do they go home and rewrite their applications in the flavor-of-the-week framework or stack? Or, when choosing a stack for a &lt;em&gt;new&lt;/em&gt; product or service, do people choose the “faster” stack over the “slower” one?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://medium.com/@Pinterest_Engineering/introducing-new-open-source-tools-for-the-elixir-community-2f7bb0bb7d8c&quot;&gt;Heck, Pinterest rewrote its Ads API in Elixir and now they have response times of less than a millisecond.&lt;/a&gt; Surely, that’s just &lt;em&gt;better&lt;/em&gt;, right?&lt;/p&gt;

&lt;p&gt;The question is, &lt;em&gt;why&lt;/em&gt;? As we’ve already established, there’s no difference for the end-user experience. So there are really only two reasons to choose one framework over another: a) it’s faster, and therefore I’ll spend less on server costs to host it, or b) it’s easier to develop with, and helps me ship quality features faster.&lt;/p&gt;

&lt;p&gt;Let’s take a look at that server cost one, for a second.&lt;/p&gt;

&lt;p&gt;The majority of web applications handle far less than 1000 requests per second. I’d go as far as to say that most web application developers are employed by a company whose entire webapp does far less than 1000 requests/second. Most of them do less than 1000 requests/&lt;em&gt;minute&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Let’s say you have a Rails application which serves 20,000 RPM (requests/minute, or about 300 req/sec) at an average response time of 250 milliseconds. That’s a pretty average profile for a large, mature Rails application. Such an application will take about 200 Puma processes to serve properly. That’s equal to roughly a dozen Performance-L dynos on Heroku, or $6,000/month.&lt;/p&gt;
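&lt;p&gt;The sizing arithmetic behind that estimate is just Little’s Law (average requests in flight equal throughput times response time), plus headroom. A sketch, where the 40% target utilization is my own assumption for illustration, not a Puma recommendation:&lt;/p&gt;

```ruby
# Little's Law: average requests in flight = throughput (req/sec) * response time (sec).
rpm = 20_000
response_time = 0.25        # seconds
throughput = rpm / 60.0     # ~333 req/sec

in_flight = throughput * response_time # ~83 requests in flight on average

# Each request needs a free Puma process to serve it, and you want
# headroom for spikes and queueing. The 40% target utilization here is
# an assumption for illustration, not a Puma recommendation.
target_utilization = 0.4
processes = (in_flight / target_utilization).ceil # ~209, i.e. "about 200"
```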

&lt;p&gt;Now, let’s say you rewrite it in Phoenix, Node, or whatever flavor of the week you want and reduce that to 125 milliseconds. Before you jump out of your seat, remember that you’re not going to reduce latency to 12 milliseconds or some other stupid-low amount: you’re still going to be limited by I/O to the databases that back this application.&lt;/p&gt;

&lt;p&gt;Halving our application’s latency means we need about half the number of servers we needed before. So, congratulations: you rewrote your application (or chose your framework) to save $3,000/month. The load on the relational database backing this application won’t change, so those costs will remain the same. When your application is big enough to be doing 20,000 RPM, you will have anywhere from a half-dozen to even fifty engineers, depending on your application’s domain. A single software engineer costs a company at least $10,000/month in employee benefits and salary. So we’re choosing our frameworks based on saving one-third of an engineer per month? And if that framework caused your development cycles to slow down by even &lt;em&gt;one third&lt;/em&gt; of a mythical man-month, you’ve &lt;em&gt;increased&lt;/em&gt; your costs, not decreased them. Choosing a web framework based on server costs is clearly a sucker’s game.&lt;/p&gt;
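&lt;p&gt;In round numbers, using the figures above:&lt;/p&gt;

```ruby
# The cost comparison restated as arithmetic, using the post's figures.
server_cost_before = 6_000                        # $/month for ~200 Puma processes
server_cost_after  = server_cost_before / 2       # half the servers after halving latency
savings = server_cost_before - server_cost_after  # $3,000/month

engineer_cost = 10_000                            # $/month, salary plus benefits (low end)
fraction_of_engineer = savings.to_f / engineer_cost # ~one-third of an engineer
```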

&lt;p&gt;Why cargo-cult engineering practices from huge companies where a few milliseconds can save tens of thousands per month? You’re not Pinterest (or Netflix, or…), you have different problems, and that’s OK.&lt;/p&gt;

&lt;h3 id=&quot;it-isnt-getting-worse&quot;&gt;It Isn’t Getting Worse&lt;/h3&gt;

&lt;p&gt;Computers aren’t getting slower. While &lt;a href=&quot;https://en.wikipedia.org/wiki/Wirth%27s_law&quot;&gt;Wirth’s Law&lt;/a&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;4&lt;/sup&gt;&lt;span class=&quot;sidenote-parens&quot;&gt;(Software is getting slower more rapidly than hardware becomes faster.)&lt;/span&gt;&lt;span class=&quot;sidenote&quot;&gt;&lt;sup class=&quot;sidenote-number&quot;&gt;4&lt;/sup&gt; Software is getting slower more rapidly than hardware becomes faster.&lt;/span&gt; certainly holds for most end-user applications like your mobile phone apps, it doesn’t really hold for your typical web application. Ruby web applications (and any web application) will continue to get faster because the slow grind of progress in hardware will continue to find ways to jam more CPU instructions into a clock cycle, or to make those clock cycles even faster, or to cram more cores onto a die.&lt;/p&gt;

&lt;p&gt;And the language isn’t getting slower, either. Noah Gibbs of Appfolio has shown that &lt;a href=&quot;http://engineering.appfolio.com/appfolio-engineering/2017/5/22/rails-speed-with-ruby-240-and-discourse-180&quot;&gt;each minor version of Ruby decreases average response times by about 5-10%.&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;lets-talk-about-happiness&quot;&gt;Let’s Talk About Happiness&lt;/h2&gt;

&lt;p&gt;The performance doomsayers have always been wrong, and will continue to be wrong. &lt;a href=&quot;http://archive.oreilly.com/pub/post/multicore_hardware_and_the_fut.html&quot;&gt;Take this gentleman from 2007&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;No matter what implementation becomes the next de-facto Ruby platform, one thing is clear: People are interested in taking advantage of their newer, more powerful multi-core systems (as the recent surge in interest in Erlang in recent RailsConf and RubyConfs has shown). As Ruby becomes increasingly part of solutions that deal in high volumes of data processing, this demand can only increase.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ten years later, and scaling across multiple cores through preforking webservers like Puma and Unicorn is still plenty Good Enough. Ruby still isn’t dead. I’m excited for the possibilities afforded by the &lt;a href=&quot;http://olivierlacan.com/posts/concurrency-in-ruby-3-with-guilds/&quot;&gt;proposed Guild model&lt;/a&gt;, but is the language unusable until then? Nope.&lt;/p&gt;

&lt;p&gt;What I want is for the conversation around web frameworks and programming languages to change. There’s too much talk of performance and concurrency, when in reality the margins are narrow and the costs minimal and getting lower. Languages aren’t dying based on their concurrency or performance features alone.&lt;/p&gt;

&lt;p&gt;The better conversation, the more meaningful and impactful one, is which framework &lt;strong&gt;helps me write software faster, with more quality, and with more happiness&lt;/strong&gt;. I know what the answer to that question is for me, and maybe the answer is different for you.&lt;/p&gt;

&lt;h3 id=&quot;polyglots-and-the-new-hotness-stack&quot;&gt;“Polyglots” and “The New Hotness Stack”&lt;/h3&gt;

&lt;p&gt;There’s a subset of engineers who will never be happy writing software which isn’t on the “new hotness stack”. Engineers are always looking for a new problem to solve, something new to learn - and that’s great! I’ve never related. GORUCO, the NYC Ruby conference, started calling itself a “polyglot conference” this year, and the speaker schedule features talks on Python, Elixir, Rust, React and static typing. &lt;a href=&quot;https://medium.com/@flavorjones/ruby-values-7b5ffe45aea7&quot;&gt;Conference organizer Mike Dalessio’s blog post announcing this&lt;/a&gt; reads like a tombstone.&lt;/p&gt;

&lt;p&gt;Benchmarks are often waved around in this “is X dead” discussion. As I hope I’ve shown above, there really is no benchmark which can prove that any language or framework is not suitable for writing web applications. Performance isn’t the concern.&lt;/p&gt;

&lt;p&gt;Instead, the performance discussion regarding web applications is mostly FUD, spread by those trying to justify the engineering time they just spent rewriting their entire stack, or by those telling management whatever it takes to get to play with the coolest new toy they saw on Hacker News.&lt;/p&gt;

&lt;p&gt;Programmers are perpetually terrified of career obsolescence. Some are afraid of intellectual stagnation - that they’ll become the crusty old person in the back office writing RPG to keep a truck parts company’s order system running. But almost all of them are afraid of unemployment. They’re worried that the world will move on from their particular stack, leaving their salaries and jobs in jeopardy. These fears are real - but let’s realize that most of the discussion around “is stack X dead?!” is driven by &lt;em&gt;fear&lt;/em&gt;, not concerns for the &lt;em&gt;requirements&lt;/em&gt; of web applications.&lt;/p&gt;

&lt;h2 id=&quot;fun-and-games&quot;&gt;Fun and Games&lt;/h2&gt;

&lt;p&gt;Let’s be clear - performance still matters. Most organizations can and should save on server costs by focusing on speeding up their endpoints, and particularly slow endpoints probably do impact the customer experience &lt;a href=&quot;https://wpostats.com/tags/revenue/&quot;&gt;or the bottom line&lt;/a&gt; and should be sped up. What I’ve talked about above is just how little &lt;em&gt;framework choice&lt;/em&gt; matters in the performance of your web application.&lt;/p&gt;

&lt;p&gt;Also, I’m not ragging on TechEmpower. It’s a massive project, and they depend on domain experts creating PRs that fix any problems with the results. They’re genuinely good people in my opinion, and aren’t trying to push an agenda or participate in benchmarketing in favor of any particular stack.&lt;/p&gt;

&lt;p&gt;In conclusion, JavaScript, Go, Elixir and Python all suck, write Ruby :) No, of course not - write what you’re productive in. If you’re a web programmer, count your lucky stars that you get to choose your tools based on ergonomics, not on performance.&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; data-conversation=&quot;none&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;The highest rule of computing: computers SHOULD exist to accommodate their creators, never the other way around.&lt;/p&gt;&amp;mdash; Gary Bernhardt (@garybernhardt) &lt;a href=&quot;https://twitter.com/garybernhardt/status/879945502556422144&quot;&gt;June 28, 2017&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

</description>
        <pubDate>Tue, 11 Jul 2017 07:00:00 +0000</pubDate>
        <link>https://www.speedshop.co/2017/07/11/is-ruby-too-slow-for-web-scale.html</link>
        <guid isPermaLink="true">https://www.speedshop.co/2017/07/11/is-ruby-too-slow-for-web-scale.html</guid>
        
        
      </item>
    
  </channel>
</rss>
