Railsconf 2017: The Performance Update
When you just can’t conf any more
Hello readers! Railsconf 2017 has just wrapped up, and as I did for RubyConf 2016, here’s a rundown of all the Ruby-performance-related stuff that happened and the conversations that I had.
Bootsnap
Shopify recently released bootsnap, a Rubygem designed to boot large Ruby apps faster. It was released just a week or so before the conference, but Discourse honcho Sam Saffron was telling everyone how great it was. It’s fairly rare that someone comes up with one of these “just throw it in your Gemfile and voila, your app is faster” projects, but it looks like this is one of them.
50% faster, you say?
Bootsnap reduced bootup time in development for Discourse by 50%.
You may have heard of or used bootscale - Bootsnap is intended to be an evolution/replacement of that gem.
How does it work? Well, unlike a lot of performance projects, Bootsnap’s README is actually really good and goes into depth on how it accomplishes these boot speedups. Basically, it does two big things: it makes `require` faster, and it caches the compilation of your Ruby code.

The `require` speedups are pretty straightforward - bootsnap uses caches to reduce the number of system calls that Ruby makes. Normally, if you `require 'mygem'`, Ruby tries to open a file called `mygem.rb` in every folder on your LOAD_PATH. Ouch. Bootsnap thought ahead, too - your application code is only cached for 30 seconds, so no worries about file changes not being picked up.
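To get a feel for the cost Bootsnap is avoiding, here’s a rough sketch of the lookup a plain `require` implies (using the stdlib’s `set` as the example file; this simulates the search with `File.exist?` rather than reproducing Ruby’s internals exactly):

```ruby
# A rough sketch of the work behind a plain `require 'set'`: Ruby checks
# each $LOAD_PATH entry in order until it finds a matching .rb file.
# Every miss costs filesystem calls - Bootsnap caches this lookup.
feature = "set"

hit = $LOAD_PATH
        .map { |dir| File.join(dir, "#{feature}.rb") }
        .find { |path| File.exist?(path) }

puts "searched up to #{$LOAD_PATH.size} directories, found: #{hit}"
```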
The second feature is caching of compiled Ruby code. This idea has been around for a while - if I recall, Eileen Uchitelle and Aaron Patterson were working on something like this for a while but either gave up or got sidetracked. Basically, Bootsnap stores the compilation results of any given Ruby file in the extended file attributes of the file itself. It’s a neat little hack. Unfortunately it doesn’t really work on Linux for a few reasons - if you’re using ext2 or ext3 filesystems, you probably don’t have extended file attributes turned on, and even if you did, the maximum size of xattrs on Linux is very, very limited and probably can’t fit the data Bootsnap generates.
There was some discussion at the conference that, eventually, the load path caching features could be merged into Bundler or Rubygems.
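In the meantime, per Bootsnap’s README, wiring it up is a two-line change (the require should happen as early as possible in boot):

```ruby
# Gemfile
gem 'bootsnap', require: false

# config/boot.rb - immediately after 'bundler/setup'
require 'bootsnap/setup'
```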
Frontend Performance
When the conf wifi doesn’t co-operate
I gave a workshop entitled “Front End Performance for Full-Stack Developers”. The idea was to give an introduction to using Chrome’s Developer Tools to profile and diagnose problems with first page load experiences.
I thought it went okay - on conference wifi, many of the pages I had planned to use as examples suddenly had far different load behaviors than what I had practiced with, so I felt a little lost! However, it must have gone okay, as Richard managed to halve CodeTriage’s paint times by marking his JavaScript bundle as `async`.
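In a Rails view, that change can be a one-liner. A sketch (the bundle name is assumed; `javascript_include_tag` passes extra options through as HTML attributes):

```erb
<%# app/views/layouts/application.html.erb - illustrative sketch %>
<%= javascript_include_tag 'application', async: true %>
```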
Application Server Performance
After a recent experience with a client, I had a mini-mission at Railsconf: to try to diagnose and improve some performance issues in `puma`.
The issue was with how Puma processes accept requests for processing. Every Puma process (“worker”) has an internal “reactor”. The reactor’s job is to listen to the socket, buffer the request, and then hand requests to available threads.
Puma’s reactor, accepting requests
The problem was that Puma’s default behavior is for the reactor to accept as many requests as possible, without limit. This leads to poor load-balancing between Puma worker processes, especially during reboot scenarios.
Imagine you’ve restarted your `puma`-powered Rails application. While you were restarting, 100 requests piled up on the socket and are now waiting to be processed. What could sometimes happen is that just a few of those Puma processes would accept the majority of those requests, leading to excessive request queueing times.
This behavior didn’t make a lot of sense. If a Puma worker has 5 threads, for example, why should it ever accept more than 5 requests at a time? There may be other worker processes that are completely empty and waiting for work to do - we should let those processes accept new work instead!
So, Evan fixed it. Now, Puma workers will not accept more requests than they could possibly process at once. This should really improve performance for single-threaded Puma apps, and should improve performance for multithreaded apps too.
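If you want to sidestep the reactor’s buffering entirely, Puma’s config also exposes a `queue_requests` option (a real Puma setting; the worker and thread counts below are just illustrative):

```ruby
# config/puma.rb - illustrative values
workers 3        # number of forked worker processes
threads 5, 5     # min, max threads per worker

# Setting this to false disables the reactor described above: each
# thread accepts a request off the socket only when it is actually
# free to process it immediately.
queue_requests false
```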
In the long term, I still think request load-balancing could be improved in Puma. Say I have 5 Puma worker processes, 4 of which are busy with a request while 1 is completely empty. It’s possible that a new request will be picked up by one of the already-busy workers: if we’re using MRI/CRuby and one of those busy workers hits an IO block (say it’s waiting on a result from the database), it could pick up the new request instead of our totally-free worker. That’s no good. And, as far as I know, routing is completely random between all the processes listening on the socket.
Basically, the only way Puma can “get smarter” with its request routing is to put some kind of “master routing process” on the socket, instead of letting the Puma workers listen directly to the socket themselves. One idea Evan had was to put the Reactor (the thing that buffers and listens for new requests) in Puma’s “master” process, and then have the master process decide which child process to hand each request to. This would let Puma implement more complex routing algorithms, such as round-robin or Passenger’s “least-busy-process-first”.
Speaking of Passenger, Phusion founder Hongli spitballed the idea that Passenger could even act as a reverse proxy/load-balancer for Puma. It could definitely work (and would give Puma other benefits like offloading static file serving to Passenger) but I think Puma using the master process as a kind of “master reactor” is more likely.
rack-freeze
Is my app threadsafe? Survey says… definitely maybe.
One question that frequently comes up around performance is “how do I know if my Ruby application is thread-safe or not?” My stock answer is usually to run your tests in multiple threads. There are two problems with this suggestion, though: one, you can’t run RSpec in multiple threads, so this is Minitest-only, and two, this really only helps you find threading bugs in your unit tests and application code - it doesn’t cover most of your dependencies.
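For the Minitest side, the built-in hook is `parallelize_me!`, which runs a test class’s methods concurrently in threads. A minimal sketch:

```ruby
require 'minitest/autorun'

class ThreadSafetyTest < Minitest::Test
  parallelize_me! # run the tests in this class concurrently, in threads

  def test_addition
    assert_equal 2, 1 + 1
  end

  def test_multiplication
    assert_equal 4, 2 * 2
  end
end
```

Tests that pass serially but fail under `parallelize_me!` are a decent smoke signal for shared-state bugs.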
One source of threading bugs is Rack middleware. Basically, the problem looks something like this:
class NonThreadSafeMiddleware
  def initialize(app)
    @app = app
    @state = 0
  end

  def call(env)
    @state += 1
    return @app.call(env)
  end
end
An interesting way to surface these problems is to just `freeze` everything in all of your Rack middlewares. In the example above, `@state += 1` would now raise a RuntimeError, rather than silently adding incorrectly in a multithreaded app. That’s exactly what rack-freeze does (which is where the example above is from). Hat-tip to @schneems for bringing this up.
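The mechanics here are plain Ruby: mutating any frozen object raises. A minimal sketch of the failure mode that freezing converts into a loud error (on Ruby 2.5+ the error is FrozenError, a subclass of RuntimeError):

```ruby
class Counter
  def initialize
    @state = 0
  end

  def increment
    @state += 1 # mutates instance state - not thread-safe
  end
end

counter = Counter.new.freeze

begin
  counter.increment
rescue RuntimeError => e
  puts "mutation caught: #{e.class}"
end
```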
snip_snip
When talking to Kevin Deisz in the hallway (I don’t recall what about), he told me about his gem called `snip_snip`. Many of you have probably tried `bullet` at some point - `bullet`’s job is to help you find N+1 queries in your app. `snip_snip` is sort of similar, but it looks for database columns which you SELECTed but didn’t use. For example:
class MyModel < ActiveRecord::Base
  # has attributes - :foo, :bar, :baz, :qux
end

class SomeController < ApplicationController
  def my_action
    @my_model_instance = MyModel.first
  end
end
…and then…
# somewhere in my_action.html.erb
@my_model_instance.bar
@my_model_instance.foo
…then `snip_snip` will tell me that I SELECTed the `:baz` and `:qux` attributes but didn’t use them. I could rewrite my controller action as:
class SomeController < ApplicationController
  def my_action
    @my_model_instance = MyModel.select(:bar, :foo).first
  end
end
Selecting fewer attributes, rather than all of them (the default behavior), can provide a decent speedup when you’re creating many ActiveRecord objects at once (usually hundreds or more), or when you’re grabbing objects that have many attributes (User, for example).
Inlining Ruby
In a hallway conversation with Noah Gibbs, Noah mentioned that he’s found that increasing the compiler’s inline threshold when compiling Ruby can lead to a minor speed improvement.
The inline threshold is basically how aggressively the compiler decides to copy-paste sections of code into the calling function, rather than emitting a call out to a separate function. Inlining is almost always faster than jumping to a different area of the program, but of course if we just inlined the entire program we’d probably have a 1GB Ruby binary!
Noah found that increasing the inline threshold a little led to a 5-10% speedup on the optcarrot benchmark, at the cost of a ~3MB larger Ruby binary. That’s a pretty good tradeoff for most people.
Here’s how to try this yourself. We can pass options to our compiler using the CFLAGS environment variable. If you’re using Clang (the default compiler on a Mac):
CFLAGS="-O3 -inline-threshold=5000"
Example with ruby-install:
ruby-install ruby 2.4.0 -- --enable-jemalloc CFLAGS="-O3 -inline-threshold=5000"
If you’re using GCC:
CFLAGS="-O3 -finline-limit=5000"
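And the matching ruby-install invocation for GCC (an untested sketch, mirroring the Clang example above):

```shell
ruby-install ruby 2.4.0 -- --enable-jemalloc CFLAGS="-O3 -finline-limit=5000"
```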
I wouldn’t try this in production just yet though - it seems to cause a few segfaults for me locally from time to time. Worth playing around with on your development box though!
Your App Server Config is Wrong
I gave a sponsored talk for Heroku that I titled “Your App Server Config is Wrong”. Confreaks still hasn’t posted the video, but you can follow me on Twitter and I’ll retweet it as soon as it’s posted.
Basically, the number one problem I see when consulting on people’s applications is misconfigured app servers (Puma, Unicorn, Passenger and the like). This can end up costing companies thousands of dollars a month, or even costing them 30-40% of their application’s performance. Bad stuff. Give the talk a watch.
Performance Panel
On the last day of the conference, Sam Saffron hosted a panel on performance with Richard, Eileen, Rafael and myself. Here’s the video.
Attendee Savannah made this cool mind-map:
the penultimate talk: a panel on performance with @nateberkopec @rafaelfranca @samsaffron @schneems @eileencodes #railsconf pic.twitter.com/srRe4ebPSW
— savannah (@Savannahdworth) April 27, 2017
More Performance Talks
There are a few more talks from Railsconf you should watch if you’re interested in Ruby performance:
- 5 Years of Scaling Rails to 80,000 RPS with Simon Eskildsen of Shopify. Simon’s talks are always really good to begin with, so if you want to hear how Rails is used at one of the top-100 sites by traffic in the world, you should probably watch this talk.
- The Secret Life of SQL: How to Optimize Database Performance with Bryana Knight. A (short) introduction to making your SQL queries as fast as possible, mostly discussing indexes and how to know if they’re being used.
- High Performance Political Revolutions with Braulio Carreno. Another “performance war story”.
Secret Project
So, I won’t go into too much detail here, but somebody showed me a very cool JavaScript project which was basically a “JavaScript framework for people who don’t have a single-page app”. It looked like it would work extremely well with Turbolinks applications, or just apps which have a lot of JavaScript behaviors but don’t already use another framework. If you can imagine “Unobtrusive JavaScript: The Framework”, that’s what this looked like. I’ll let you know when this project gets a public release.
Son, once you start adding stuff to $(document).ready…
One of Turbolinks’ problems, IMO, is that it lacks a lot of teaching resources or pedagogy around “How To Build Complex Turbolinks-enabled Applications”. Turbolinks requires a different approach to JavaScript in your app, and if you try to use an SPA framework such as Backbone or Angular with it, or if you try to just write your JavaScript the way you did before by dumping the kitchen sink into `turbolinks:load` hooks, you’re Gonna Have a Bad Time. This framework looks like it could fix that by providing a “golden path” for attaching behaviors to pages.
HTTP/2
This was touched on briefly in Aaron’s keynote, but in hallway conversations with Aaron and Evan, the path forward on HTTP/2 support in Rack was discussed.
I’ve advocated before that you just throw an HTTP/2-enabled CDN in front of your app and Be Done With It, and Aaron and I pretty much agree on that. Aaron wants to add an HTTP/2-specific key to the Rack env hash, which could take a callback so you can do whatever fancy HTTP/2-y stuff you want in your application when Rack tells you it’s an HTTP/2-enabled request. I see the uses of this being pretty limited, however, as Server Push can mostly be implemented by your CDN or your reverse proxy.
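No such key exists yet, so to be clear this is a purely hypothetical sketch of what a middleware using it might look like - the `rack.http2?` and `rack.http2_push` names are invented here for illustration and are not part of the Rack spec:

```ruby
# Hypothetical sketch only: neither 'rack.http2?' nor 'rack.http2_push'
# exists in the Rack spec - both names are made up for illustration.
class PushAssetsMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    if env['rack.http2?'] && (push = env['rack.http2_push'])
      # Ask the server to push a critical asset before the HTML arrives.
      push.call('/assets/application.css')
    end
    @app.call(env)
  end
end
```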
RPRG/Chat Update
In my Rubyconf 2016 update, I said:
Finally, there was some great discussion during the Performance Birds of a Feather meeting about various issues. Two big things came out of it - the creation of a Ruby Performance Research Group, and a Ruby Performance community group.
I want to say I’m still working on both of these projects. You should see something about the Research Group very soon (I have something I want to test surrounding memory fragmentation in highly multithreaded Ruby apps) and the community group some time after that.
And Karaoke!
Jon McCartie, everyone
That pretty much sums up my Railsconf 2017. Looking forward to next year, with even more Ruby performance and karaoke.