Railsconf 2017: The Performance Update

by Nate Berkopec (@nateberkopec) of (who?), a Rails performance consultancy.
Summary: Did you miss Railsconf 2017? Or maybe you went, but wonder if you missed something on the performance front? Let me fill you in! (2330 words/12 minutes)


When you just can’t conf any more
Hello readers! Railsconf 2017 has just wrapped up, and as I did for RubyConf 2016, here’s a rundown of all the Ruby-performance-related stuff that happened or conversations that I had.

Bootsnap

Shopify recently released bootsnap, a Rubygem designed to boot large Ruby apps faster. It was released just a week or so before the conference, but Discourse honcho Sam Saffron was telling everyone about how great it was. It’s fairly infrequently that someone is able to come up with one of these “just throw it in your Gemfile and voila your app is faster” projects, but it looks like this is one of them.
50% faster, you say?
Bootsnap reduced bootup time in development for Discourse by 50%.

You may have heard of or used bootscale - Bootsnap is intended to be an evolution/replacement of that gem.

How does it work? Well, unlike a lot of performance projects, Bootsnap’s README is actually really good and goes into depth on how it accomplishes these boot speedups. Basically, it does two big things: makes require faster, and caches the compilation of your Ruby code.

The require speedups are pretty straightforward - bootsnap uses caches to reduce the number of system calls that Ruby makes. Normally if you require 'mygem', Ruby tries to open a file called mygem.rb on every folder on your LOAD_PATH. Ouch. Bootsnap thought ahead too - your application code is only cached for 30 seconds, so no worries about file changes not being picked up.

The second feature is caching of compiled Ruby code. This idea has been around for a while - if I recall, Eileen Uchitelle and Aaron Patterson were working on something like this for a while but either gave up or got sidetracked. Basically, Bootsnap stores the compilation results of any given Ruby file in the extended file attributes of the file itself. It’s a neat little hack. Unfortunately it doesn’t really work on Linux for a few reasons - if you’re using ext2 or ext3 filesystems, you probably don’t have extended file attributes turned on, and even if you did, the maximum size of xattrs on Linux is very, very limited and probably can’t fit the data Bootsnap generates.

There was some discussion at the conference that, eventually, the load path caching features could be merged into Bundler or Rubygems.

Frontend Performance


When the conf wifi doesn’t co-operate

I gave a workshop entitled “Front End Performance for Full-Stack Developers”. The idea was to give an introduction to using Chrome’s Developer Tools to profile and diagnose problems with first page load experiences.

I thought it went okay - on conference wifi, many of the pages I had planned to use as examples suddenly had far far different load behaviors than what I had practiced with, so I felt a little lost! However, it must have gone okay, as Richard managed to halve CodeTriage’s paint times by marking his Javascript bundle as async.

Application Server Performance

After a recent experience with a client, I had a mini-mission at Railsconf to try to diagnose and improve some issues with performance in puma.

The issue was with how Puma processes accept requests for processing. Every Puma process (“worker”) has an internal “reactor”. The reactor’s job is to listen to the socket, buffer the request, and then hand requests to available threads.


Puma’s reactor, accepting requests

The problem was that Puma’s default behavior is for the reactor to accept as many requests as possible, without limit. This leads to poor load-balancing between Puma worker processes, especially during reboot scenarios.

Imagine you’ve restarted your puma-powered Rails application. While you were restarting, 100 requests have piled up on the socket and are now waiting to be processed. What could sometimes happen is that just a few of those Puma processes could accept a majority of those requests. This would lead to excessive request queueing times.

This behavior didn’t make a lot of sense. If a Puma worker has 5 threads, for example, why should it ever accept more than 5 requests at a time? There may be other worker processes that are completely empty and waiting for work to do - we should let those processes accept new work instead!

So, Evan fixed it. Now, Puma workers will not accept more requests than they could possibly process at once. This should really improve performance for single-threaded Puma apps, and should improve performance for multithreaded apps too.

In the long term, I still think request load-balancing could be improved in Puma. For example - if I have 5 Puma worker processes, and 4 currently have a request being processed and 1 is completely empty, it’s possible that a new request could be picked up by one of the already-busy workers. For example, if we’re using MRI/CRuby and one of those busy workers hits an IO block (say it’s waiting on a result from the database), it could pick up a new request instead of our totally-free worker. That’s no good. And, as far as I know, routing is completely random between all the processes available and listening to the socket.

Basically, the only way Puma can “get smarter” with it’s request routing is to put some kind of “master routing process” on the socket, instead of letting the Puma workers listen directly to the socket themselves. One idea Evan had was to just put the Reactor (the thing that buffers and listens for new requests) in Puma’s “master” process, and then have the master process decide which child process to give it to. This would let Puma implement more complex routing algorithms, such as round-robin or Passenger’s “least-busy-process-first”.

Speaking of Passenger, Phusion founder Hongli spitballed the idea that Passenger could even act as a reverse proxy/load-balancer for Puma. It could definitely work (and would give Puma other benefits like offloading static file serving to Passenger) but I think Puma using the master process as a kind of “master reactor” is more likely.

rack-freeze


Is my app threadsafe? Survey says… definitely maybe.

One question that frequently comes up around performance is “how do I know if my Ruby application is thread-safe or not?” My stock is answer is usually to run your tests in multiple threads. There are two problems with this suggestion though - one, you can’t run RSpec in multiple threads, so this is Minitest-only, and two, this really only helps you find threading bugs in your unit tests and application units, it doesn’t cover most of your dependencies.

One source of threading bugs is Rack middleware. Basically, the problem looks something like this:

class NonThreadSafeMiddleware
  def initialize(app)
    @app = app
    @state = 0
  end

  def call(env)
    @state += 1

    return @app.call(env)
  end
end

A interesting way to surface these problems is to just freeze everything in all of your Rack middlewares. In the example above, @state += 1 would now blow up and return a RuntimeError, rather than just silently adding incorrectly in a multithreaded app. That’s exactly what rack-freeze does (which is where the example above is from). Hat-tip to @schneems for bringing this up.

snip_snip

When talking to Kevin Deisz in the hallway (I don’t recall what about), he told me about his gem called snip_snip. Many of you have probably tried bullet at some point - bullet’s job is to help you find N+1 queries in your app.

snip_snip is sort of similar, but it looks for database columns which you SELECTed but didn’t use. For example:

class MyModel < ActiveRecord::Base
  # has attributes - :foo, :bar, :baz, :qux
end

class SomeController < ApplicationController
  def my_action
    @my_model_instance = MyModel.first
  end
end

…and then…

# somewhere in my_action.html.erb

@my_model_instance.bar
@my_model_instance.foo

…then snip_snip will tell me that I SELECTed the :baz and :qux attributes but didn’t use them. I could rewrite my controller action as:

class SomeController < ApplicationController
  def my_action
    @my_model_instance = MyModel.select(:bar, :foo).first
  end
end

Selecting fewer attributes, rather than all of the attributes (default behavior) can provide a decent speedup when you’re creating many (hundreds or more, usually) ActiveRecord objects at once, or when you’re grabbing objects which have many attributes (User, for example).

Inlining Ruby

In a hallway conversation with Noah Gibbs, Noah mentioned that he’s found that increasing the compiler’s inline threshold when compiling Ruby can lead to a minor speed improvement.

The inline threshold is basically how aggressively the compiler decides to copy-paste sections of code, inlining it into a function rather than calling out to a separate function. Inlining is usually always faster than jumping to a different area of a program, but of course if we just inlined the entire program we’d probably have a 1GB Ruby binary!

Noah found that increasing the inline threshold a little led to a 5-10% speedup on the optcarrot benchmark, at the cost of a ~3MB larger Ruby binary. That’s a pretty good tradeoff for most people.

Here’s how to try this yourself. We can pass some options to our compiler using the CFLAGS environment variable - if you’re using Clang (if you’re on a Mac, this is the default compiler):

CFLAGS="-O3 -inline-threshold=5000"

Example with ruby-install
ruby-install ruby 2.4.0 -- --enable-jemalloc CFLAGS="-O3 -inline-threshold=5000"

If you’re using GCC:

CFLAGS="-O3 -finline-limit=5000"

I wouldn’t try this in production just yet though - it seems to cause a few segfaults for me locally from time to time. Worth playing around with on your development box though!

Your App Server Config is Wrong

I gave a sponsored talk for Heroku that I titled “Your App Server Config is Wrong”. Confreaks still hasn’t posted the video, but you can follow me on Twitter and I’ll retweet it as soon as it’s posted.

Basically, the number one problem I see when consulting on people’s applications is misconfigured app servers (Puma, Unicorn, Passenger and the like). This can end up costing companies thousands of dollars a month, or even costing them 30-40% of their application’s performance. Bad stuff. Give the talk a watch.

Performance Panel

On the last day of the conference, Sam Saffron hosted a panel on performance with Richard, Eileen, Rafael and myself. Here’s the video.

Attenddee Savannah made this cool mind-mappy-thing:

More Performance Talks

There are a few more talks from Railsconf you should watch if you’re interested in Ruby performance:

Secret Project

So, I won’t go into too much detail here, but somebody showed me a very cool JavaScript project which was basically a “Javascript framework people who don’t have a single-page-app”. It looked like it would work extremely well with Turbolinks applications, or just apps which have a lot of Javascript behaviors but don’t already use another framework. If you could imagine “Unobtrusive JavaScript: The Framework”, that’s what this looked like. I’ll let you know when this project gets a public release.


Son, once you start adding stuff to $(document).ready…

One of Turbolinks’ problems, IMO, is that it lacks a lot of teaching resources or pedagogy around “How To Build Complex Turbolinks-enabled Applications”. Turbolinks requires a different approach to JavaScript in your app, and if you try to use an SPA framework such as Backbone or Angular with it, or if you try to just write your JavaScript the way you had before by dumping the kitchen sink into turbolinks:load hooks, you’re Gonna Have a Bad Time. This framework looks like it could fix that by providing a “golden path” for attaching behaviors to pages.

HTTP/2

This was touched on briefly in Aaron’s keynote, but in hallway conversations with Aaron and Evan, the path forward on HTTP/2 support in Rack was discussed.

I’ve advocated that you just throw an HTTP/2-enabled CDN in front of your app and Be Done With It before, and Aaron and I pretty much agree on that. Aaron wants to add an HTTP/2-specific key to the Rack env hash, which could take a callback so you can do whatever fancy HTTP/2-y stuff you want in your application if Rack tells you it’s an HTTP/2-enabled request. I see the uses of this being pretty limited, however, as Server Push can mostly be implemented by your CDN or your reverse proxy.

RPRG/Chat Update

In my Rubyconf 2016 update, I said:

Finally, there was some great discussion during the Performance Birds of a Feather meeting about various issues. Two big things came out of it - the creation of a Ruby Performance Research Group, and a Ruby Performance community group.

I want to say I’m still working on both of these projects. You should see something about the Research Group very soon (I have something I want to test surrounding memory fragmentation in highly multithreaded Ruby apps) and the community group some time after that.

And Karaoke!


Jon McCartie, everyone

That pretty much sums up my Railsconf 2017. Looking forward to next year, with even more Ruby performance and karaoke.

SHARE:
Reddit

Want a faster website?

I'm Nate Berkopec (@nateberkopec). I write online about web performance from a full-stack developer's perspective. I primarily write about frontend performance and Ruby backends. If you liked this article and want to hear about the next one, click below. I don't spam - you'll receive about 1 email per week. It's all low-key, straight from me.

Products from Speedshop

The Complete Guide to Rails Performance is a full-stack performance book that gives you the tools to make Ruby on Rails applications faster, more scalable, and simpler to maintain.

Learn more

The Rails Performance Workshop is the big brother to my book. Learn step-by-step how to make your Rails app as fast as possible through a comprehensive video and hands-on workshop. Available for individuals, groups and large teams.

Learn more

More Posts

Announcing the Rails Performance Apocrypha

I've written a new book, compiled from 4 years of my email newsletter.

Read more

We Made Puma Faster With Sleep Sort

Puma 5 is a huge major release for the project. It brings several new experimental performance features, along with tons of bugfixes and features. Let's talk about some of the most important ones.

Read more

The Practical Effects of the GVL on Scaling in Ruby

MRI Ruby's Global VM Lock: frequently mislabeled, misunderstood and maligned. Does the GVL mean that Ruby has no concurrency story or CaN'T sCaLe? To understand completely, we have to dig through Ruby's Virtual Machine, queueing theory and Amdahl's Law. Sounds simple, right?

Read more

The World Follows Power Laws: Why Premature Optimization is Bad

Programmers vaguely realize that 'premature optimization is bad'. But what is premature optimization? I'll argue that any optimization that does not come from observed measurement, usually in production, is premature, and that this fact stems from natural facts about our world. By applying an empirical mindset to performance, we can...

Read more

Close

Get notified on new posts.

Straight from the author. No spam, no bullshit. Frequent email-only content.