Railsconf 2017: The Performance Update

by Nate Berkopec (@nateberkopec) of (who?), a Rails performance consultancy.

Summary: Did you miss Railsconf 2017? Or maybe you went, but wonder if you missed something on the performance front? Let me fill you in! (2330 words/12 minutes)

When you just can’t conf any more

Hello readers! Railsconf 2017 has just wrapped up, and as I did for RubyConf 2016, here’s a rundown of all the Ruby-performance-related things that happened and the conversations I had.

Bootsnap

Shopify recently released bootsnap, a Rubygem designed to make large Ruby apps boot faster. It came out just a week or so before the conference, but Discourse honcho Sam Saffron was already telling everyone how great it was. It’s rare that someone comes up with one of these “just throw it in your Gemfile and voila, your app is faster” projects, but this looks like one of them.
50% faster, you say? Bootsnap reduced bootup time in development for Discourse by 50%.

You may have heard of or used bootscale - Bootsnap is intended to be an evolution of, and replacement for, that gem.

How does it work? Well, unlike a lot of performance projects, Bootsnap’s README is actually really good and goes into depth on how it accomplishes these boot speedups. Basically, it does two big things: makes require faster, and caches the compilation of your Ruby code.

The require speedups are pretty straightforward: bootsnap uses caches to reduce the number of system calls that Ruby makes. Normally, if you require 'mygem', Ruby tries to open a file called mygem.rb in every directory on your $LOAD_PATH, in order, until it finds one. Ouch. Bootsnap caches where each file lives, so a require becomes a lookup instead of a string of failed file opens. Bootsnap thought ahead, too: your application code is only cached for 30 seconds, so you don’t have to worry about file changes not being picked up.
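A minimal sketch of the load-path caching idea, assuming a simplified require that only looks for .rb files (real Ruby also checks native extensions). LoadPathIndex and its methods are hypothetical names, not Bootsnap's API, and unlike Bootsnap this index lives only in memory rather than being persisted between boots. The naive lookup probes each directory in turn; the cached lookup scans the load path once and then answers from a Hash.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical illustration of load-path caching (not Bootsnap's code).
# Simplification: only ".rb" files are considered.
class LoadPathIndex
  attr_reader :probes

  def initialize(load_path)
    @load_path = load_path
    @probes = 0   # how many candidate files the naive lookup checked
    @index = nil  # feature name => absolute path, built on first cached lookup
  end

  # Uncached: try each directory in order, like a plain require does.
  def naive_resolve(feature)
    @load_path.each do |dir|
      candidate = File.join(dir, "#{feature}.rb")
      @probes += 1
      return candidate if File.file?(candidate)
    end
    nil
  end

  # Cached: scan the load path once, then every lookup is a Hash access.
  def cached_resolve(feature)
    @index ||= build_index
    @index[feature]
  end

  private

  def build_index
    index = {}
    @load_path.each do |dir|
      Dir.glob("**/*.rb", base: dir).each do |relative|
        # ||= keeps the first match, preserving load-path precedence.
        index[relative.delete_suffix(".rb")] ||= File.join(dir, relative)
      end
    end
    index
  end
end

# Usage: five load-path directories, with mygem.rb only in the last one.
Dir.mktmpdir do |root|
  dirs = (1..5).map { |i| File.join(root, "lib#{i}") }
  dirs.each { |dir| FileUtils.mkdir_p(dir) }
  File.write(File.join(dirs.last, "mygem.rb"), "")

  index = LoadPathIndex.new(dirs)
  index.naive_resolve("mygem")  # checks all 5 directories
  index.cached_resolve("mygem") # answered from the in-memory index
end
```

The scan in build_index still costs system calls, but only once; persisting that index to disk between boots is what turns it into a real startup win.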

The second feature is caching of compiled Ruby code. This idea has been around for a while - if I recall, Eileen Uchitelle and Aaron Patterson were working on something like this for a while but either gave up or got sidetracked. Basically, Bootsnap stores the compilation results of any given Ruby file in the extended file attributes of the file itself. It’s a neat little hack. Unfortunately it doesn’t really work on Linux for a few reasons - if you’re using ext2 or ext3 filesystems, you probably don’t have extended file attributes turned on, and even if you did, the maximum size of xattrs on Linux is very, very limited and probably can’t fit the data Bootsnap generates.
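A rough sketch of compile caching using Ruby's own RubyVM::InstructionSequence API (compile_file, to_binary and load_from_binary, available since Ruby 2.3). One assumption to flag: it stores compiled output in a plain cache directory rather than in extended attributes, which sidesteps the Linux xattr limits. IseqCache and its cache-key scheme are hypothetical. A real implementation would also hook RubyVM::InstructionSequence.load_iseq so that require picks up the cache automatically; this sketch does not.

```ruby
require "digest"
require "fileutils"
require "tmpdir"

# Hypothetical compile cache. Assumption: results live in a cache directory
# on disk instead of in the source file's extended attributes.
module IseqCache
  CACHE_DIR = File.join(Dir.tmpdir, "iseq_cache_sketch")

  # The key changes whenever the file's mtime or size changes, or when a
  # different Ruby version or platform reads it (compiled iseqs aren't portable).
  def self.cache_path(source_path)
    stat = File.stat(source_path)
    key = [File.expand_path(source_path), stat.mtime.to_f, stat.size,
           RUBY_VERSION, RUBY_PLATFORM].join(":")
    File.join(CACHE_DIR, Digest::SHA256.hexdigest(key))
  end

  # Returns a RubyVM::InstructionSequence, compiling only on a cache miss.
  def self.fetch(source_path)
    cached = cache_path(source_path)
    if File.exist?(cached)
      RubyVM::InstructionSequence.load_from_binary(File.binread(cached))
    else
      iseq = RubyVM::InstructionSequence.compile_file(source_path)
      FileUtils.mkdir_p(CACHE_DIR)
      File.binwrite(cached, iseq.to_binary)
      iseq
    end
  end
end

# Usage: the first fetch compiles and writes the cache, the second reads it.
Dir.mktmpdir do |dir|
  path = File.join(dir, "answer.rb")
  File.write(path, "21 * 2\n")
  IseqCache.fetch(path).eval
  IseqCache.fetch(path).eval
end
```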

There was some discussion at the conference that, eventually, the load path caching features could be merged into Bundler or Rubygems.

Frontend Performance

When the conf wifi doesn’t co-operate

I gave a workshop entitled “Front End Performance for Full-Stack Developers”. The idea was to give an introduction to using Chrome’s Developer Tools to profile and diagnose problems with first page load experiences.

I thought it went okay. On conference wifi, many of the pages I had planned to use as examples suddenly behaved far differently than they had when I practiced, so I felt a little lost! It must have gone reasonably well, though, as Richard managed to halve CodeTriage’s paint times by marking his JavaScript bundle as async.

Application Server Performance

After a recent experience with a client, I had a mini-mission at Railsconf: diagnose and improve some performance issues in Puma.

The issue was with how Puma worker processes accept incoming requests. Every Puma process (“worker”) has an internal “reactor”. The reactor’s job is to listen to the socket, buffer requests, and then hand them to available threads.

Puma’s reactor, accepting requests

The problem was that Puma’s default behavior is for the reactor to accept as many requests as possible, without limit. This leads to poor load-balancing between Puma worker processes, especially during reboot scenarios.

Imagine you’ve restarted your Puma-powered Rails application. While it was restarting, 100 requests piled up on the socket and are now waiting to be processed. Sometimes just a few of your Puma processes would accept the majority of those requests while the others sat idle, leading to excessive request queueing times.

This behavior didn’t make a lot of sense. If a Puma worker has 5 threads, for example, why should it ever accept more than 5 requests at a time? There may be other worker processes that are completely empty and waiting for work to do - we should let those processes accept new work instead!

So, Evan fixed it. Now, Puma workers will not accept more requests than they could possibly process at once. This should really improve performance for single-threaded Puma apps, and should improve performance for multithreaded apps too.
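To see why capping accepts helps, here’s a toy model (not Puma’s code) of the restart scenario above: 100 queued requests, 4 workers with 5 threads each, where each tick a worker finishes at most 5 requests. It exaggerates the worst case by assuming worker 0 always wins the race for the socket, so treat the numbers as illustrative only.

```ruby
# Toy model (not Puma's implementation): how many ticks until every queued
# request is served. Each tick, workers accept in index order, so worker 0
# always wins the race for the socket; then each worker finishes at most
# `threads` of the requests it has accepted.
def ticks_to_drain(queued, workers:, threads:, capped:)
  accepted = Array.new(workers, 0) # accepted but unfinished, per worker
  ticks = 0
  until queued.zero? && accepted.all?(&:zero?)
    workers.times do |w|
      # Capped: never hold more requests than there are threads.
      # Uncapped: take everything still waiting on the socket.
      room = capped ? threads - accepted[w] : queued
      taken = [queued, room].min
      accepted[w] += taken
      queued -= taken
    end
    accepted.map! { |n| n - [n, threads].min }
    ticks += 1
  end
  ticks
end

ticks_to_drain(100, workers: 4, threads: 5, capped: false) # => 20
ticks_to_drain(100, workers: 4, threads: 5, capped: true)  # => 5
```

In this model the greedy worker needs 20 ticks to work through its backlog while its siblings sit idle; with accepts capped at the thread count, all four workers share the load and the queue drains in 5.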

In the long term, I still think request load-balancing in Puma could be improved. Say I have 5 Puma worker processes: 4 are currently processing a request and 1 is completely empty. It’s possible that a new request gets picked up by one of the already-busy workers. For example, if we’re using MRI/CRuby and one of those busy workers hits an IO block (say it’s waiting on a result from the database), it could pick up the new request instead of our totally-free worker. That’s no good. And, as far as I know, routing between all the processes available and listening to the socket is completely random.

Basically, the only way Puma can “get smarter” with its request routing is to put some kind of “master routing process” on the socket, instead of letting the Puma workers listen to the socket directly. One idea Evan had was to move the Reactor (the thing that listens for and buffers new requests) into Puma’s “master” process, and then have the master process decide which child process gets each request. This would let Puma implement more complex routing algorithms, such as round-robin or Passenger’s “least-busy-process-first”.
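Here’s a hypothetical sketch of the two routing policies a master process could use, modeled inside a single Ruby process. WorkerState, LeastBusyRouter and RoundRobinRouter are illustrative names, not Puma or Passenger APIs. The scenario matches the one above: four workers each have one request in flight and the fifth is idle.

```ruby
# Hypothetical model of a routing master process (not Puma/Passenger code).
WorkerState = Struct.new(:id, :in_flight)

class LeastBusyRouter
  def initialize(workers)
    @workers = workers
  end

  # Pick the worker with the fewest requests in flight.
  # Ties go to the earliest worker, because min_by returns the first minimum.
  def route
    worker = @workers.min_by(&:in_flight)
    worker.in_flight += 1
    worker
  end

  def finished(worker)
    worker.in_flight -= 1
  end
end

class RoundRobinRouter
  def initialize(workers)
    @workers = workers
    @next = 0
  end

  # Cycle through workers in order, ignoring how busy each one is.
  def route
    worker = @workers[@next]
    @next = (@next + 1) % @workers.size
    worker.in_flight += 1
    worker
  end
end

def five_workers
  [1, 1, 1, 1, 0].each_with_index.map { |n, i| WorkerState.new(i, n) }
end

LeastBusyRouter.new(five_workers).route.id  # => 4 (the idle worker)
RoundRobinRouter.new(five_workers).route.id # => 0 (already busy)
```

Least-busy routing needs the master to know each worker’s current load, which is exactly the information a “master reactor” would have and workers listening directly on the socket do not.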

Speaking of Passenger, Phusion founder Hongli spitballed the idea that Passenger could even act as a reverse proxy/load-balancer for Puma. It could definitely work (and would give Puma other benefits like offloading static file serving to Passenger) but I think Puma using the master process as a kind of “master reactor” is more likely.

rack-freeze

Is my app threadsafe? Survey says… definitely maybe.

One question that frequently comes up around performance is “how do I know if my Ruby application is thread-safe or not?” My stock answer is usually to run your tests in multiple threads. There are two problems with this suggestion, though: first, you can’t run RSpec in multiple threads, so this is Minitest-only; and second, it really only helps you find threading bugs in your application code and its unit tests. It doesn’t cover most of your dependencies.

One source of threading bugs is Rack middleware. Basically, the problem looks something like this:

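class NonThreadSafeMiddleware
  def initialize(app)
    @app = app
    @state = 0
  end

  # One middleware instance is shared by every request, so under a
  # multithreaded server, concurrent threads race on this increment.
  def call(env)
    @state += 1

    @app.call(env)
  end
end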

An interesting way to surface these problems is to freeze all of your Rack middleware instances. In the example above, @state += 1 would then raise an error (a RuntimeError; FrozenError on Ruby 2.5 and later) instead of silently miscounting in a multithreaded app. That’s exactly what rack-freeze does (which is where the example above is from). Hat-tip to @schneems for bringing this up.
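A minimal sketch of the freezing trick in plain Ruby (Ruby 2.5+ for FrozenError), without using the rack-freeze gem itself; mutates_own_state? and ThreadSafeMiddleware are hypothetical helpers for illustration. A middleware that keeps per-request data in env survives being frozen, while the counter example fails loudly.

```ruby
# The middleware from the example above, repeated so this runs on its own.
class NonThreadSafeMiddleware
  def initialize(app)
    @app = app
    @state = 0
  end

  def call(env)
    @state += 1
    @app.call(env)
  end
end

# Hypothetical thread-safe variant: per-request data goes into env,
# which is a fresh Hash for every request.
class ThreadSafeMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    env["example.started_at"] = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @app.call(env)
  end
end

# Freeze the middleware, then call it: any write to its own instance
# variables raises FrozenError instead of silently racing.
def mutates_own_state?(middleware)
  middleware.freeze.call({})
  false
rescue FrozenError
  true
end

app = ->(_env) { [200, {}, ["ok"]] }
mutates_own_state?(NonThreadSafeMiddleware.new(app)) # => true
mutates_own_state?(ThreadSafeMiddleware.new(app))    # => false
```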

snip_snip

When talking to Kevin Deisz in the hallway (I don’t recall what about), he told me about his gem called snip_snip. Many of you have probably tried bullet at some point - bullet’s job is to help you find N+1 queries in your app.

snip_snip is sort of similar, but it looks for database columns which you SELECTed but didn’t use. For example:

class MyModel < ActiveRecord::Base
  # has attributes - :foo, :bar, :baz, :qux
end

class SomeController < ApplicationController
  def my_action
    @my_model_instance = MyModel.first
  end
end

…and then…

<%# somewhere in my_action.html.erb %>

<%= @my_model_instance.bar %>
<%= @my_model_instance.foo %>

…then snip_snip will tell me that I SELECTed the :baz and :qux attributes but didn’t use them. I could rewrite my controller action as:

class SomeController < ApplicationController
  def my_action
    @my_model_instance = MyModel.select(:bar, :foo).first
  end
end

Selecting fewer attributes, rather than all of them (the default behavior), can provide a decent speedup when you’re creating many ActiveRecord objects at once (usually hundreds or more), or when you’re loading objects that have many attributes (User, for example).
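How snip_snip does this internally isn’t covered here, so this is only a toy illustration of the underlying idea in plain Ruby (no ActiveRecord): wrap each loaded attribute in a reader that records access, then report whatever was loaded but never read. TrackedRecord is a hypothetical class, not part of snip_snip or Rails.

```ruby
require "set"

# Toy stand-in for a loaded database row that remembers which
# attributes were actually read (hypothetical, not snip_snip's code).
class TrackedRecord
  def initialize(attributes)
    @attributes = attributes
    @read = Set.new
    attributes.each_key do |name|
      # Each reader logs the access before returning the value.
      define_singleton_method(name) do
        @read << name
        @attributes[name]
      end
    end
  end

  # Attributes that were loaded (SELECTed) but never used.
  def unused_attributes
    @attributes.keys - @read.to_a
  end
end

record = TrackedRecord.new(foo: 1, bar: 2, baz: 3, qux: 4)
record.bar
record.foo
record.unused_attributes # => [:baz, :qux]
```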

Inlining Ruby

In a hallway conversation with Noah Gibbs, Noah mentioned that he’s found that increasing the compiler’s inline threshold when compiling Ruby can lead to a minor speed improvement.

The inline threshold is basically how aggressively the compiler copies the body of a called function directly into its caller, rather than emitting a call to a separate function. Inlining is usually faster than jumping to a different area of the program, but of course if we inlined everything we’d probably end up with a 1GB Ruby binary!

Noah found that increasing the inline threshold a little led to a 5-10% speedup on the optcarrot benchmark, at the cost of a ~3MB larger Ruby binary. That’s a pretty good tradeoff for most people.

Here’s how to try this yourself. We can pass some options to our compiler using the CFLAGS environment variable - if you’re using Clang (if you’re on a Mac, this is the default compiler):

CFLAGS="-O3 -mllvm -inline-threshold=5000"

Example with ruby-install:
ruby-install ruby 2.4.0 -- --enable-jemalloc CFLAGS="-O3 -mllvm -inline-threshold=5000"

If you’re using GCC:

CFLAGS="-O3 -finline-limit=5000"

I wouldn’t try this in production just yet though - it seems to cause a few segfaults for me locally from time to time. Worth playing around with on your development box though!

Your App Server Config is Wrong

I gave a sponsored talk for Heroku that I titled “Your App Server Config is Wrong”. Confreaks still hasn’t posted the video, but you can follow me on Twitter and I’ll retweet it as soon as it’s posted.

Basically, the number one problem I see when consulting on people’s applications is misconfigured app servers (Puma, Unicorn, Passenger and the like). This can end up costing companies thousands of dollars a month, or even costing them 30-40% of their application’s performance. Bad stuff. Give the talk a watch.

Performance Panel

On the last day of the conference, Sam Saffron hosted a panel on performance with Richard, Eileen, Rafael and myself. Here’s the video.

Attendee Savannah made this cool mind-mappy-thing:

the penultimate talk: a panel on performance with @nateberkopec @rafaelfranca @samsaffron @schneems @eileencodes #railsconf pic.twitter.com/srRe4ebPSW
— savannah (@Savannahdworth) April 27, 2017

More Performance Talks

There are a few more talks from Railsconf you should watch if you’re interested in Ruby performance:

5 Years of Scaling Rails to 80,000 RPS - Simon Eskildsen of Shopify. Simon’s talks are always really good to begin with, so if you want to hear how Rails is used at one of the top-100 sites by traffic in the world, you should probably watch this talk.
The Secret Life of SQL: How to Optimize Database Performance - Bryana Knight. A (short) introduction to making those SQL queries as fast as possible, mostly discussing indexes and how you know if they’re being used.
High Performance Political Revolutions - Braulio Carreno. Another “performance war story”.

Secret Project

So, I won’t go into too much detail here, but somebody showed me a very cool JavaScript project which was basically “a JavaScript framework for people who don’t have a single-page app”. It looked like it would work extremely well with Turbolinks applications, or just apps which have a lot of JavaScript behaviors but don’t already use another framework. If you can imagine “Unobtrusive JavaScript: The Framework”, that’s what this looked like. I’ll let you know when this project gets a public release.

Son, once you start adding stuff to $(document).ready…

One of Turbolinks’ problems, IMO, is that it lacks a lot of teaching resources or pedagogy around “How To Build Complex Turbolinks-enabled Applications”. Turbolinks requires a different approach to JavaScript in your app, and if you try to use an SPA framework such as Backbone or Angular with it, or if you try to just write your JavaScript the way you had before by dumping the kitchen sink into turbolinks:load hooks, you’re Gonna Have a Bad Time. This framework looks like it could fix that by providing a “golden path” for attaching behaviors to pages.

HTTP/2

This was touched on briefly in Aaron’s keynote, but in hallway conversations with Aaron and Evan, the path forward on HTTP/2 support in Rack was discussed.

I’ve advocated before that you just throw an HTTP/2-enabled CDN in front of your app and Be Done With It, and Aaron and I pretty much agree on that. Aaron wants to add an HTTP/2-specific key to the Rack env hash, which could take a callback so you can do whatever fancy HTTP/2-y stuff you want in your application when Rack tells you it’s an HTTP/2-enabled request. I see the uses of this being pretty limited, however, as Server Push can mostly be implemented by your CDN or your reverse proxy.
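A hypothetical sketch of what an application might do with such a callback. The env key "rack.http2.push" is made up for illustration (it is not part of the Rack spec), as is AssetPushApp; the point is that the app only pushes when the server hands it a callable, and behaves normally for HTTP/1 requests.

```ruby
# Made-up env key standing in for the proposed HTTP/2 hook; no such key
# exists in the Rack spec.
PUSH_KEY = "rack.http2.push"

class AssetPushApp
  ASSETS = ["/assets/application.css", "/assets/application.js"].freeze

  def call(env)
    # Only an HTTP/2-aware server would put a callable under this key.
    if (push = env[PUSH_KEY])
      ASSETS.each { |path| push.call(path) }
    end
    [200, { "content-type" => "text/html" }, ["<html></html>"]]
  end
end

pushed = []
AssetPushApp.new.call({ PUSH_KEY => ->(path) { pushed << path } })
pushed # => ["/assets/application.css", "/assets/application.js"]
AssetPushApp.new.call({}) # plain HTTP/1 request: nothing is pushed
```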

RPRG/Chat Update

In my Rubyconf 2016 update, I said:

Finally, there was some great discussion during the Performance Birds of a Feather meeting about various issues. Two big things came out of it - the creation of a Ruby Performance Research Group, and a Ruby Performance community group.

I want to say I’m still working on both of these projects. You should see something about the Research Group very soon (I have something I want to test surrounding memory fragmentation in highly multithreaded Ruby apps) and the community group some time after that.

And Karaoke!

Jon McCartie, everyone

That pretty much sums up my Railsconf 2017. Looking forward to next year, with even more Ruby performance and karaoke.
