Railsconf 2017: The Performance Update
When you just can’t conf any more Hello readers! Railsconf 2017 has just wrapped up, and as I did for RubyConf 2016, here’s a rundown of all the Ruby-performance-related stuff that happened or conversations that I had.
Shopify recently released bootsnap, a Rubygem designed to boot large Ruby apps faster. It was released just a week or so before the conference, but Discourse honcho Sam Saffron was telling everyone about how great it was. It’s fairly infrequently that someone is able to come up with one of these “just throw it in your Gemfile and voila your app is faster” projects, but it looks like this is one of them.
50% faster, you say? Bootsnap reduced bootup time in development for Discourse by 50%.
You may have heard of or used bootscale - Bootsnap is intended to be an evolution/replacement of that gem.
How does it work? Well, unlike a lot of performance projects, Bootsnap’s README is actually really good and goes into depth on how it accomplishes these boot speedups. Basically, it does two big things: makes
require faster, and caches the compilation of your Ruby code.
require speedups are pretty straightforward -
bootsnap uses caches to reduce the number of system calls that Ruby makes. Normally if you
require 'mygem', Ruby tries to open a file called
mygem.rb on every folder on your LOAD_PATH. Ouch. Bootsnap thought ahead too - your application code is only cached for 30 seconds, so no worries about file changes not being picked up.
The second feature is caching of compiled Ruby code. This idea has been around for a while - if I recall, Eileen Uchitelle and Aaron Patterson were working on something like this for a while but either gave up or got sidetracked. Basically, Bootsnap stores the compilation results of any given Ruby file in the extended file attributes of the file itself. It’s a neat little hack. Unfortunately it doesn’t really work on Linux for a few reasons - if you’re using ext2 or ext3 filesystems, you probably don’t have extended file attributes turned on, and even if you did, the maximum size of xattrs on Linux is very, very limited and probably can’t fit the data Bootsnap generates.
There was some discussion at the conference that, eventually, the load path caching features could be merged into Bundler or Rubygems.
When the conf wifi doesn’t co-operate
I gave a workshop entitled “Front End Performance for Full-Stack Developers”. The idea was to give an introduction to using Chrome’s Developer Tools to profile and diagnose problems with first page load experiences.
Application Server Performance
After a recent experience with a client, I had a mini-mission at Railsconf to try to diagnose and improve some issues with performance in
The issue was with how Puma processes accept requests for processing. Every Puma process (“worker”) has an internal “reactor”. The reactor’s job is to listen to the socket, buffer the request, and then hand requests to available threads.
Puma’s reactor, accepting requests
The problem was that Puma’s default behavior is for the reactor to accept as many requests as possible, without limit. This leads to poor load-balancing between Puma worker processes, especially during reboot scenarios.
Imagine you’ve restarted your
puma-powered Rails application. While you were restarting, 100 requests have piled up on the socket and are now waiting to be processed. What could sometimes happen is that just a few of those Puma processes could accept a majority of those requests. This would lead to excessive request queueing times.
This behavior didn’t make a lot of sense. If a Puma worker has 5 threads, for example, why should it ever accept more than 5 requests at a time? There may be other worker processes that are completely empty and waiting for work to do - we should let those processes accept new work instead!
So, Evan fixed it. Now, Puma workers will not accept more requests than they could possibly process at once. This should really improve performance for single-threaded Puma apps, and should improve performance for multithreaded apps too.
In the long term, I still think request load-balancing could be improved in Puma. For example - if I have 5 Puma worker processes, and 4 currently have a request being processed and 1 is completely empty, it’s possible that a new request could be picked up by one of the already-busy workers. For example, if we’re using MRI/CRuby and one of those busy workers hits an IO block (say it’s waiting on a result from the database), it could pick up a new request instead of our totally-free worker. That’s no good. And, as far as I know, routing is completely random between all the processes available and listening to the socket.
Basically, the only way Puma can “get smarter” with it’s request routing is to put some kind of “master routing process” on the socket, instead of letting the Puma workers listen directly to the socket themselves. One idea Evan had was to just put the Reactor (the thing that buffers and listens for new requests) in Puma’s “master” process, and then have the master process decide which child process to give it to. This would let Puma implement more complex routing algorithms, such as round-robin or Passenger’s “least-busy-process-first”.
Speaking of Passenger, Phusion founder Hongli spitballed the idea that Passenger could even act as a reverse proxy/load-balancer for Puma. It could definitely work (and would give Puma other benefits like offloading static file serving to Passenger) but I think Puma using the master process as a kind of “master reactor” is more likely.
Is my app threadsafe? Survey says… definitely maybe.
One question that frequently comes up around performance is “how do I know if my Ruby application is thread-safe or not?” My stock is answer is usually to run your tests in multiple threads. There are two problems with this suggestion though - one, you can’t run RSpec in multiple threads, so this is Minitest-only, and two, this really only helps you find threading bugs in your unit tests and application units, it doesn’t cover most of your dependencies.
One source of threading bugs is Rack middleware. Basically, the problem looks something like this:
class NonThreadSafeMiddleware def initialize(app) @app = app @state = 0 end def call(env) @state += 1 return @app.call(env) end end
A interesting way to surface these problems is to just
freeze everything in all of your Rack middlewares. In the example above,
@state += 1 would now blow up and return a RuntimeError, rather than just silently adding incorrectly in a multithreaded app. That’s exactly what rack-freeze does (which is where the example above is from). Hat-tip to @schneems for bringing this up.
When talking to Kevin Deisz in the hallway (I don’t recall what about), he told me about his gem called
snip_snip. Many of you have probably tried
bullet at some point -
bullet’s job is to help you find N+1 queries in your app.
snip_snip is sort of similar, but it looks for database columns which you
SELECTed but didn’t use. For example:
class MyModel < ActiveRecord::Base # has attributes - :foo, :bar, :baz, :qux end class SomeController < ApplicationController def my_action @my_model_instance = MyModel.first end end
# somewhere in my_action.html.erb @my_model_instance.bar @my_model_instance.foo
snip_snip will tell me that I
:qux attributes but didn’t use them. I could rewrite my controller action as:
class SomeController < ApplicationController def my_action @my_model_instance = MyModel.select(:bar, :foo).first end end
Selecting fewer attributes, rather than all of the attributes (default behavior) can provide a decent speedup when you’re creating many (hundreds or more, usually) ActiveRecord objects at once, or when you’re grabbing objects which have many attributes (User, for example).
In a hallway conversation with Noah Gibbs, Noah mentioned that he’s found that increasing the compiler’s inline threshold when compiling Ruby can lead to a minor speed improvement.
The inline threshold is basically how aggressively the compiler decides to copy-paste sections of code, inlining it into a function rather than calling out to a separate function. Inlining is usually always faster than jumping to a different area of a program, but of course if we just inlined the entire program we’d probably have a 1GB Ruby binary!
Noah found that increasing the inline threshold a little led to a 5-10% speedup on the optcarrot benchmark, at the cost of a ~3MB larger Ruby binary. That’s a pretty good tradeoff for most people.
Here’s how to try this yourself. We can pass some options to our compiler using the
CFLAGS environment variable - if you’re using Clang (if you’re on a Mac, this is the default compiler):
CFLAGS="-O3 -inline-threshold=5000" Example with ruby-install ruby-install ruby 2.4.0 -- --enable-jemalloc CFLAGS="-O3 -inline-threshold=5000"
If you’re using GCC:
I wouldn’t try this in production just yet though - it seems to cause a few segfaults for me locally from time to time. Worth playing around with on your development box though!
Your App Server Config is Wrong
I gave a sponsored talk for Heroku that I titled “Your App Server Config is Wrong”. Confreaks still hasn’t posted the video, but you can follow me on Twitter and I’ll retweet it as soon as it’s posted.
Basically, the number one problem I see when consulting on people’s applications is misconfigured app servers (Puma, Unicorn, Passenger and the like). This can end up costing companies thousands of dollars a month, or even costing them 30-40% of their application’s performance. Bad stuff. Give the talk a watch.
Attenddee Savannah made this cool mind-mappy-thing:
More Performance Talks
There are a few more talks from Railsconf you should watch if you’re interested in Ruby performance:
- 5 Years of Scaling Rails to 80,000 RPS with Simon Eskildsen of Shopify. Simon’s talks are always really good to begin with, so if you want to hear how Rails is used at one of the top-100 sites by traffic in the world, you should probably watch this talk.
- The Secret Life of SQL: How to Optimize Database Performance A (short) introduction to making those SQL queries as fast as possible from Bryana Knight, mostly discussing indexes and how you know if they’re being used.
- High Performance Political Revolutions Another “performance war story” from Braulio Carreno.
Son, once you start adding stuff to $(document).ready…
turbolinks:load hooks, you’re Gonna Have a Bad Time. This framework looks like it could fix that by providing a “golden path” for attaching behaviors to pages.
I’ve advocated that you just throw an HTTP/2-enabled CDN in front of your app and Be Done With It before, and Aaron and I pretty much agree on that. Aaron wants to add an HTTP/2-specific key to the Rack env hash, which could take a callback so you can do whatever fancy HTTP/2-y stuff you want in your application if Rack tells you it’s an HTTP/2-enabled request. I see the uses of this being pretty limited, however, as Server Push can mostly be implemented by your CDN or your reverse proxy.
In my Rubyconf 2016 update, I said:
Finally, there was some great discussion during the Performance Birds of a Feather meeting about various issues. Two big things came out of it - the creation of a Ruby Performance Research Group, and a Ruby Performance community group.
I want to say I’m still working on both of these projects. You should see something about the Research Group very soon (I have something I want to test surrounding memory fragmentation in highly multithreaded Ruby apps) and the community group some time after that.
Jon McCartie, everyone
That pretty much sums up my Railsconf 2017. Looking forward to next year, with even more Ruby performance and karaoke.
Want a faster website?
I'm Nate Berkopec (@nateberkopec). I write online about web performance from a full-stack developer's perspective. I primarily write about frontend performance and Ruby backends. If you liked this article and want to hear about the next one, click below. I don't spam - you'll receive about 1 email per week. It's all low-key, straight from me.
The Complete Guide to Rails Performance
Look what I wrote! The Complete Guide to Rails Performance is a full-stack course that gives you the tools to make Ruby on Rails applications faster, more scalable, and simpler to maintain. It includes a 361 page PDF, private Slack, and over 15 hours of video content.Learn more
NGINX Inc. has just released Ruby support for their new multi-language application server, NGINX Unit. What does this mean for Ruby web applications? Should you be paying attention to NGINX Unit?
Memory fragmentation is difficult to measure and diagnose, but it can also sometimes be very easy to fix. Let's look at one source of memory fragmentation in multi-threaded CRuby programs: malloc's per-thread memory arenas.
Application server configuration can make a major impact on the throughput and performance-per-dollar of your Ruby web application. Let's talk about the most important settings.
Have you ever wondered how the heck Ruby's GC works? Let's see what we can learn by reading some of the statistics it provides us in the GC.stat hash.
Full HTTP/2 support for Ruby web frameworks is a long way off - but that doesn't mean you can't benefit from HTTP/2 today!
WebFonts are awesome and here to stay. However, if used improperly, they can also impose a huge performance penalty. In this post, I explain how Rubygems.org painted 10x faster just by making a few changes to its WebFonts.
The total size of a webpage, measured in bytes, has little to do with its load time. Instead, increase network utilization: make your site preloader-friendly, minimize parser blocking, and start downloading resources ASAP with Resource Hints.
One of the most important parts of any webpage's performance is the content and organization of the head element. We'll take a deep dive on some easy optimizations that can be applied to any site.
New Relic is a great tool for getting the overview of the performance bottlenecks of a Ruby application. But it's pretty extensive - where do you start? What's the most important part to pay attention to?
Your website is slow, but the backend is fast. How do you diagnose performance issues on the frontend of your site? We'll discuss everything involved in constructing a webpage and how to profile it at sub-millisecond resolution with Chrome Timeline, Google's flamegraph-for-the-browser.
Action Cable will be one of the main features of Rails 5, to be released sometime this winter. But what can Action Cable do for Rails developers? Are WebSockets really as useful as everyone says?
rack-mini-profiler is a powerful Swiss army knife for Rack app performance. Measure SQL queries, memory allocation and CPU time.
Most "scaling" resources for Ruby apps are written by companies with hundreds of requests per second. What about scaling for the rest of us?
Caching in a Rails app is a little bit like that one friend you sometimes have around for dinner, but should really have around more often.