Hey Rootly team,

Thanks for bringing me in to look at your app. I was pretty heartened by most of what I saw. I think you’ve done a good job keeping the app up to date with current versions of dependencies, and despite the large size (500k lines of Ruby) you really only have a limited set of latency issues in very specific areas.

I prefer to structure my work around managing a small number of performance indicators, and what we need to manage in your case is quite clear:

I’m confident that an application that meets these three metric goals would be a performant application that feels great to use.

Finally, I have a miscellaneous backlog of technical performance debt which may indirectly help improve those 3 metrics. However, I separate those recommendations from my others because I want to clearly delineate between things that I think immediately “move the needle” on those metrics versus things which have a ~10% chance of moving the needle but are just “good housekeeping”.

Speeding up Web

I noticed a few areas where web response times could be improved. I want to concentrate on a magnitude of improvement that would actually be visible to customers - mostly trying to find a handful of actions where we can shave ~500ms off rather than 50ms here or there. However, there’s one big change I would make to shave a few ms off every action…

Recommendation: Fix/optimize whatever is being stored in Flipper

Both rootly and rootly-api spend about 30 milliseconds per request in Marshal#load, deserializing the result of an MGET for the following Redis keys:

["flipper/v1/feature/disable_long_term_shift_creation", "flipper/v1/feature/azure_alert_sources", "flipper/v1/feature/enable_slack_basic_picker", "flipper/v1/feature/redesigned_incident_announcement", "flipper/v1/feature/enable_invite_users_to_slack_channel_by_email", "flipper/v1/feature/use_exte...]

The list is cut off in the trace, so I’m not sure whether there’s more. The Redis call itself returns quickly, in about 3-4ms. After that, there’s ~20-30ms spent in Ruby just deserializing the results.

Fixing this wouldn’t have a noticeable impact on customers but it might have a noticeable impact on server load, as this cache call corresponds to about 10-15% of total time in both apps.

There are a couple of possibilities here. It could just be that the list is super long and there’s lots to unpack, but more likely there’s a complex object somewhere in here that’s quite large. I would try to figure out what that list of keys actually is in production, and then run the GET myself in prod to get the actual Ruby objects. It could also just be that you have too many old feature flags in here.
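
If it helps, here’s the kind of one-off console script I’d run to find the heavy keys. This is a hedged sketch: the env var, and the assumption that these keys live in a plain Redis you can point a client at, are mine, and KEYS is only acceptable here because it’s a one-off.

require "redis"

redis = Redis.new(url: ENV["REDIS_URL"])  # assumption: whichever Redis holds these Flipper cache keys
keys = redis.keys("flipper/v1/feature/*") # one-off only; don't use KEYS in a hot path
values = redis.mget(*keys)

puts "#{keys.size} feature keys"
keys.zip(values)
    .sort_by { |_key, value| -(value&.bytesize || 0) }
    .first(10)
    .each { |key, value| puts "#{key}: #{value&.bytesize || 0} bytes serialized" }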

Recommendation: Update to Postgres 17

Major versions of Postgres are just kind of a free speed boost, but PG 17 in particular has some important stuff for Rails because of its impact on IN clauses, which Rails uses a lot.

Here’s a deep dive on the whole thing. I think it’s worth the lift.

Recommendation: Install Datadog RUM for the browser

It looks like this is hooked up but not fully turned on yet? It’s worth spending ~$500 a month on full browser RUM. The way I think about it is this: we are fundamentally shipping web applications, not HTML APIs. Once the web server has generated a string of HTML, we are not done. There’s still a whole bunch of stuff left to do in the browser. It is not unusual to have 100ms server response times and 5-second page load times. If what we’re trying to do is improve how real humans use our site with their eyeballs and keyboards, we need to measure what’s happening in their web browser.

You mentioned a couple of questions that would be answered by browser RUM:

  1. On what pages are we using tons of memory or potentially have a memory leak, crashing our client’s browsers?
  2. Is our Turbo based frontend approach working?

We really don’t know without this data.

Recommendation: Fix whatever is grouping together traces in rootly-api

I only saw this in rootly-api. Something is broken with Datadog’s distributed trace setup. Some traces, like this one, have multiple requests to the API grouped together, but those requests are unrelated AFAICT. This gets really confusing and annoying when traces become >100MB and Datadog refuses to even open them.

If you don’t know where to start with this one, I’d just start a case with Datadog support.

Recommendation: Decide on a timeout measurement, metric-ize it.

We need to agree on how we’ll measure timeouts, and we need that measurement to be a metric. It doesn’t have to start as one (it can be log based and then get turned into a metric). What’s really important is that it isn’t sampled: it has to capture 100% of all timeouts. APM, for example, is a bad place to do this for that reason (spans are sampled).

Most of my clients do this with logs. Error tracking doesn’t work because errors don’t aggregate very well in Datadog (and you can’t turn them into a metric), and your ELB metrics don’t really work either because they can’t tell from the status code alone what’s a timeout versus an ordinary 500.
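
As a hedged sketch, assuming you’re on rack-timeout (adjust to whatever actually raises your timeouts, and double-check the observer API against the version you run): a structured log line emitted on every timeout gives you an unsampled event that can be turned into a Datadog log-based metric.

# config/initializers/timeout_logging.rb (hypothetical file name)
Rack::Timeout.register_state_change_observer(:timeout_metric) do |env|
  info = env["rack-timeout.info"]
  next unless info&.state == :timed_out

  # One JSON line per timeout; build a Datadog log-based metric on event:request.timeout
  Rails.logger.warn({ event: "request.timeout", path: env["PATH_INFO"] }.to_json)
end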

Recommendation: Fix N+1 on Api::V1::ServicesController#index

12% of the traces for this endpoint on rootly-api are >1 second. When you look at those traces (when they actually load, that is), you can see a huge amount of N+1 querying.

My preferred N+1 toolset/fix-path for you would be:

  1. Reproducing the N+1s in development, using seed data or copies from prod (let’s discuss). Optionally, finding the line numbers of these N+1s in production using rack-mini-profiler or, in a more limited way, the Datadog profiler.
  2. Creating a test with Prosopite coverage to lock the fix in (see the sketch after this list).
  3. Removing the N+1.
  4. Posting a PR showing that the query count has gone down and/or the Prosopite test passes.
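
For step 2, here’s a hedged sketch of what the regression test could look like. The factory name, the auth_headers helper, and the prosopite tag wiring are all assumptions (the tag itself is covered in the Prosopite recommendation further down).

RSpec.describe "GET /api/v1/services", type: :request do
  it "returns services without N+1 queries", prosopite: true do
    create_list(:service, 5)                      # assumed factory name
    get "/api/v1/services", headers: auth_headers # auth_headers stands in for however you authenticate
    expect(response).to have_http_status(:ok)
  end
end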

Recommendation: Api::V1::IncidentsController#index: split off memberships and active storage.

About 70% of the requests to this endpoint are longer than 1 second. Yee-ouch! This is your worst endpoint across all applications.

There’s a small amount of N+1 going on here, which would be nice to fix.

However, the main thing that seems to slow this down is that it’s a “kitchen sink” REST endpoint which is returning a huge amount of data.

The two biggest parts of the request, by time, are actually retrieving a huge number of memberships and a huge number of Active Storage-related associations (blobs, attachments, and variant records). If we split this endpoint into three (one for the core incident data, one for memberships, one for Active Storage), we could probably cut latency by ~50%, because the frontend could make those requests in parallel.

You will also, eventually, have to paginate this endpoint. However, in my experience, that’s usually a harder lift than splitting it, so I’m suggesting the split first.
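
A hedged sketch of what the split might look like in config/routes.rb; the endpoint names are hypothetical, and the real shape depends on what the frontend actually consumes from this payload.

namespace :api do
  namespace :v1 do
    resources :incidents, only: [:index] do
      collection do
        get :memberships # membership payload, fetched in parallel by the frontend
        get :attachments # Active Storage payload (blobs, attachments, variant records)
      end
    end
  end
end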

Recommendation: Optimize date parsing in filter_active_shifts_and_overrides (shift_manager_service.rb:L#163) of Mobile::V1::ShiftsController#on_call

20% of responses here are >1 second. Datadog profiler reveals something really interesting: they’re all CPU blocked and spend a bunch of time in filter_active_shifts_and_overrides (shift_manager_service.rb:L#163) parsing dates. This could be fixed.
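
I haven’t read the method, but the usual shape of this fix is to parse any date boundaries once, outside the loop, and compare typed values per record. A hedged sketch, with every name hypothetical:

window_start = Time.zone.parse(params[:from]) # parse the window boundaries exactly once
window_end   = Time.zone.parse(params[:to])

active = shifts.select do |shift|
  # compare on datetime attributes directly instead of re-parsing strings per shift
  shift.starts_at <= window_end && shift.ends_at >= window_start
end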

Recommendation: Deploy index changes to Account::AlertsController#index

This is what Vishnu was working on after the incident the other day. Almost 45% of requests to this endpoint take more than 1 second, and that one SQL query is the vast majority of it. Once that fix is in place, we can re-evaluate next steps on this controller. I suspect there will be more to do.

Recommendation: Add alert_groups.deleted_at condition to Mobile::V2::AlertsController#index

This endpoint has 20% of requests >1 second, but the reason is simple: a SQL query that looks very, very similar to what we saw in Account::AlertsController#index. While the indexes will help here, this query also suffers from not including the alert_groups.deleted_at condition, the same omission that caused the other controller to not use the index at all. So that code change must be made here as well.
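
A hedged sketch of the shape of that change; the model and association names are assumptions, and it should mirror whatever the Account::AlertsController fix ends up looking like so the same index applies.

# Include the alert_groups.deleted_at condition so the planner can use the same index
alerts = Alert.left_joins(:alert_group)
              .where(alert_groups: { deleted_at: nil })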

Recommendation: Mobile::V1::AlertsController#acknowledge has very strange Redis behavior

While only 6% of requests here are >1 second, there are a lot of requests, so it’s important.

There’s a super weird pattern at the beginning of many requests to this endpoint:

  1. 100ms I/O call to Redis for rack-attack
  2. 100ms I/O call to Redis for incrementing rack-attack
  3. 100ms I/O call to Redis for Flipper
  4. 400-500ms of deserializing time

What’s weird is that I’m not approximating: these I/O calls are often EXACTLY 100ms. That points to a GVL issue, since 100ms is Ruby’s default thread quantum: a thread whose I/O has already completed can sit waiting a full quantum to reacquire the GVL, and that wait gets billed to the I/O span. I’m wondering if your Redis pool is misconfigured, or if this service is running with too many Puma threads.

Probably the easiest thing to try would be to set the env var RUBY_THREAD_TIMESLICE to 10. If we see this problem suddenly go away, then we know it was a GVL issue.

If that doesn’t work, I think what’s possibly happening here is that the zlib c-extension is pinning CPU usage when deserializing flipper/v1/features. This kind of tracks with another issue we saw around flipper and feature list size. If we could remove that GET or decrease the size of the feature list, we could also potentially fix this problem.

Managing queue times

In any web application, managing the time work spends waiting to be processed (queue time) is the primary thing we want to control when scaling. In general, so long as your servers are configured correctly, increasing traffic does not increase how long Puma or Sidekiq actually takes to process each piece of work (service latency - this is what’s measured in e.g. the Datadog service section). However, as traffic increases, it becomes more and more likely that all available servers are busy and work has to wait. That queue time is the quality of service we’re trying to manage.

Recommendation: Create SLOs in Datadog for web request queue time and Sidekiq queue times

For each Ruby service, you can have two different (sets of) SLOs and corresponding monitors:

  1. Web request queue time p95 is less than 100ms. You don’t measure this today - I’ll explain how to track it next.
  2. For each queue, queue time is below its SLO.

SLOs help us to understand how often we’re actually meeting our goals, e.g. for a “within 5 minutes” queue: is that true 100% of the time? 99%? 80%? They provide super valuable information for adjusting our autoscaling rules. For example, how do you know if the minimum task count is correct? Well, if your SLO performance is only 80% and it’s a within-30-second SLO, you should bump that minimum up!

Recommendation: Track request queue times through Datadog and CloudWatch

Request queue time should be your primary scaling signal. It already is for background job queues, but we can do this for web too. However, you’re going to need to do a couple of things:

  1. Install an nginx sidecar in your ECS task definitions, and configure nginx to add “milliseconds since the Unix epoch” as the X-Request-Start header.
  2. Turn on request queueing in the Datadog Rails implementation.
  3. Report this to CloudWatch as well so it can be used as a scaling signal (you’ll need your own Rack middleware; see the sketch after this list).
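
Here’s a rough sketch of step 3. It’s not production-ready: the namespace and metric name are made up, and in practice you’d buffer these data points or send them asynchronously rather than calling CloudWatch inline on every request.

require "aws-sdk-cloudwatch"

class RequestQueueTimeMiddleware
  def initialize(app)
    @app = app
    @cloudwatch = Aws::CloudWatch::Client.new
  end

  def call(env)
    if (start_ms = env["HTTP_X_REQUEST_START"])
      # nginx writes milliseconds since the Unix epoch into X-Request-Start
      queue_ms = (Time.now.to_f * 1000) - start_ms.to_f
      @cloudwatch.put_metric_data(
        namespace: "Rootly/Web", # made-up namespace
        metric_data: [{ metric_name: "RequestQueueTimeMs", value: queue_ms, unit: "Milliseconds" }]
      )
    end
    @app.call(env)
  end
end

You’d insert it near the top of the middleware stack (config.middleware.insert_before 0, RequestQueueTimeMiddleware) so the queue time is read as early as possible.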

Recommendation: Puma utilization is measured incorrectly, alter dashboard

In puma_stats.rb, you measure utilization like this:

aggregated[:utilization] = (aggregated[:busy_threads] / aggregated[:max_threads].to_f) * 100

This is a kind of utilization, but it’s not a useful one because it’s highly dependent on the I/O qualities of the workload.

For example, consider a theoretical app which is completely CPU bound and makes no I/O calls at all. This application will almost never have busy_threads higher than 1, yet it will not accept new requests (I’m glossing over some details on the GVL here, but this is all mostly true) because the GVL is locked. So it’s “20% utilized” but requests are still queueing and waiting to be picked up by this Ruby process! You might have ~200-300ms of time per request spent queueing, not measured anywhere, while your utilization metric looks just great.

This dashboard is using this utilization metric.

The utilization that we really care about is actually every Puma process’ GVL: we want it to be fairly heavily utilized but not so much that it increases request queue time.

We have a gem called gvl_metrics_middleware that measures this directly. I recommend tracking this instead. You can also use information from this gem to set Sidekiq concurrency counts in a data-driven way.

Recommendation: Use Judoscale for ECS

I see a lot of people on ECS essentially rebuilding Judoscale. They offer ECS integration now and it’s $500 a month for 100 tasks or fewer. Everyone’s “build it or buy it” calculus is different, but I wouldn’t tell you to rebuild what Judoscale has already done for ECS scaling - I’d just buy it.

Note that if you go this path, you will still need to set up the nginx sidecar described in a previous recommendation.

Miscellaneous “tech debt”

Recommendation: Remove Makara

I noticed you have makara in your Gemfile, but it appears to be unused. Makara was a really old-school way to do multi-database setups and read replicas, and it has been superseded by the native multiple-database support built into modern Rails.

I think nothing needs to be done other than to delete the line from the Gemfile.

Recommendation: replace sidekiq-unique-jobs with Enterprise uniqueness

The mhenrixson community version of Sidekiq job uniqueness uses extra Redis commands compared to the official Enterprise implementation by Mike Perham. For that reason, I generally recommend the Enterprise implementation over the community version. However, you currently have a large number of add-ons/modifications to the gem, so it may not be practicable to make this replacement. The only benefits would be a reduction in Redis load (which is not a bottleneck for you) and getting on the “official” implementation.

Recommendation: Turn on Nate Heckler mode

You already have test-prof installed so I recommend adding this.

You just add:

require "test_prof/factory_prof/nate_heckler"

and get the following reminder to fix your dang factories every time you run rspec:

[TEST PROF INFO] Time spent in factories: 04:31.222 (54% of total time)

Recommendation: Remove OJ and replace with json

In the last few years, byroot has put in a lot of work on the default json gem so that it’s as fast or faster than OJ or any other alternative implementation.

You should remove oj and use the default, just for the benefits of being on the “mainline” json implementation.

Recommendation: Use Prosopite in Tests

Prosopite is great, but in my opinion the best possible use of it is as a per-example tag in RSpec.

it "does a thing", prosopite: true do
  # ...
end

… should turn on prosopite in raise mode for the duration of that example. It’s a super nice way to create regression tests around N+1 fixes.
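
The tag works with a small around hook in your RSpec configuration. A sketch, using Prosopite’s scan/finish API:

RSpec.configure do |config|
  config.around(:each, prosopite: true) do |example|
    Prosopite.raise = true
    Prosopite.scan   # start recording queries
    example.run
  ensure
    Prosopite.finish # raises Prosopite::NPlusOneQueriesError if an N+1 was detected
    Prosopite.raise = false
  end
end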

Recommendation: Should Redis use SSL in production?

You appear to be using ElastiCache. I noticed you don’t have SSL set up anywhere for this. I’m not sure what your requirements are for “encryption in transit”, but “we’re in a private VPC” may not be sufficient.

Recommendation: Don’t manually GC during tests

It looks like this was suggested by AI here. I don’t like this suggestion. It certainly won’t speed things up (because by definition it causes more GC) and I don’t think it will decrease memory usage by any meaningful amount either. You’re already using jemalloc, the best possible way to reduce memory usage for Ruby. This just feels like a footgun for making your tests potentially a lot slower.

Recommendation: Standardize on mise

I’ve really enjoyed mise a lot lately. I noticed your dev setup isn’t that prescriptive, but here’s what you get for free with mise:

Recommendation: Use Knapsack Pro or another RSpec centralized queue

I noticed your Semaphore tests can vary by as much as 100% in their execution time (RSpec jobs can take anywhere from 10 to 20 minutes). In my experience, trying to bin-pack tests into jobs by measuring execution time is a loser’s game that never pays off. Instead, pulling examples off a centralized queue (this is how Knapsack Pro works) ensures that no matter what, even if tests take longer or new tests are added, all jobs take about the same amount of time.

Knapsack Pro is the best vendor here but is not cheap. Recently a good open source alternative appeared but it’s quite new and doesn’t have a lot of adoption: specwrk.

Recommendation: Use separate cache stores for Sidekiq and cache

Using the same Redis instance for Sidekiq and Rails cache is dangerous.

Sidekiq stores a lot of data with an EXPIRE set (a TTL): mostly stats, but also some important stuff around locks.

Cache stores are designed to work at 100% utilization. If your Sidekiq Redis store is at 100% memory utilization, your entire site will go down.

Thus, keeping them on the same DB:

  1. Means you have to buy a way, way bigger instance than needed, because you can’t safely let Redis evict any keys.
  2. Means that at 100% memory utilization, you’ll end up evicting potentially critical Sidekiq lock data and taking your app down to boot.

Create a separate Redis store for cache.
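
A hedged sketch of the production config once a dedicated cache instance exists (the env var name is made up):

# config/environments/production.rb
config.cache_store = :redis_cache_store, {
  url: ENV["CACHE_REDIS_URL"], # a separate ElastiCache instance from the one Sidekiq uses
  # cache failures should degrade gracefully rather than take requests down
  error_handler: ->(method:, returning:, exception:) do
    Rails.logger.warn("Rails.cache #{method} failed: #{exception.class}")
  end
}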

Recommendation: Delete default queue

I noticed you have a few old jobs sitting in the default queue. I would try to figure out what put them there, because it looks like nothing is picking that queue up or listening to it, and then perhaps manually log in to Redis to delete it. Otherwise our “queue:*” graphs all look funny, because there’s this one queue with a super long latency all the time.
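
Rather than poking at raw Redis keys, the Sidekiq API can do this from a production console; inspect before you clear.

require "sidekiq/api"

queue = Sidekiq::Queue.new("default")
queue.each { |job| puts "#{job.klass} #{job.enqueued_at}" } # see what's actually sitting in there
queue.clear                                                 # deletes the queue and everything in it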

Recommendation: Turn on Rails query logs

Rails’ new query log tags feature works really well with Datadog database monitoring and helps you quickly understand where queries are coming from: every query gets a comment appended saying which controller, action, or job generated it. Queries now look like this:

SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT $2 /*application:rootly-api,controller:account/dashboard,action:show*/

You can turn this on by adding config.active_record.query_log_tags_enabled = true to your production.rb config file. I don’t recommend using the source_location tag, as it can be fairly slow, but controller/action/job is extremely helpful. Particularly during incidents, you often want to know, from the db’s perspective, where load is being generated.
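
For reference, a sketch of the relevant production.rb settings; the tag list shown is just the common controller/action/job set, without source_location.

# config/environments/production.rb
config.active_record.query_log_tags_enabled = true
config.active_record.query_log_tags = [:application, :controller, :action, :job]
# deliberately leaving out :source_location since it can be slow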