Hey Rootly team,
Thanks for bringing me in to look at your app. I was pretty heartened by most of what I saw. I think you’ve done a good job keeping the app up to date with current versions of dependencies, and despite the large size (500k lines of Ruby) you really only have a limited set of latency issues in very specific areas.
I prefer to structure my work around managing a small number of performance indicators, and what we need to manage in your case is quite clear:
I’m confident that an application that meets these three metric goals would be a performant application that feels great to use.
Finally, I have a miscellaneous backlog of technical performance debt which may indirectly help improve those three metrics. However, I separate those recommendations from my others because I want to clearly delineate between things that I think immediately “move the needle” on our three metrics versus things which have a ~10% chance of moving the needle but are just “good housekeeping”.
I noticed a few areas where web response times could be improved. I want to concentrate on a magnitude of improvement that would actually be visible to customers - mostly trying to find a handful of actions where we can shave ~500ms off rather than 50ms here or there. However, there’s one big change I would make to shave a few ms off every action…
Both rootly and rootly-api spend about 30 milliseconds per request in Marshal#load, deserializing the result of an MGET for the following Redis keys:
["flipper/v1/feature/disable_long_term_shift_creation", "flipper/v1/feature/azure_alert_sources", "flipper/v1/feature/enable_slack_basic_picker", "flipper/v1/feature/redesigned_incident_announcement", "flipper/v1/feature/enable_invite_users_to_slack_channel_by_email", "flipper/v1/feature/use_exte...]
It’s cut off, so I’m not sure if there’s more or not. The actual Redis call itself returns quickly, in about 3-4ms. After that, there’s ~20-30ms spent in Ruby just deserializing the result.
Fixing this wouldn’t have a noticeable impact on customers but it might have a noticeable impact on server load, as this cache call corresponds to about 10-15% of total time in both apps.
There’s a couple of possibilities here. It could just be that the list is super long and there’s lots to unpack, but more likely there’s a complex object somewhere in here that’s quite large. I would try to figure out what that list of keys actually is in production, and then run the GET myself in prod to get the actual Ruby objects. It could just be that you have too many old feature flags in here as well.
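If it helps, here’s the kind of console sketch I’d use to find the heavy keys. It assumes the Flipper/cache Redis is reachable via REDIS_URL (adjust for your setup), and it uses SCAN rather than KEYS so it’s safe to run in prod:

require "redis"

redis = Redis.new(url: ENV["REDIS_URL"]) # assumption: the same Redis the Flipper adapter points at

# Measure the raw serialized size of each feature key; the biggest ones are
# what Marshal#load is grinding through on every request.
sizes = redis.scan_each(match: "flipper/v1/feature/*").map do |key|
  [key, redis.get(key).to_s.bytesize]
end

sizes.sort_by { |_key, bytes| -bytes }.first(20).each do |key, bytes|
  puts format("%8d bytes  %s", bytes, key)
end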
Major versions of Postgres are just kind of a free speed boost, but PG 17 in particular has some important stuff for Rails because of its impact on IN clauses, which Rails uses a lot.
Here’s a deep dive on the whole thing. I think it’s worth the lift.
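For context on why this matters so much for a Rails app specifically: every includes/preload batches its follow-up query as one big IN list, so these queries show up on basically every page. The model names below are just illustrative:

# Eager loading batches the second query as one big IN list:
incidents = Incident.includes(:alerts).limit(100).to_a
# => SELECT "alerts".* FROM "alerts" WHERE "alerts"."incident_id" IN ($1, $2, ..., $100)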
It looks like this is hooked up but not fully turned on yet? It’s worth spending ~$500 a month on full browser RUM. The way I think about it is this: we are fundamentally shipping web applications, not HTML APIs. Once the web server has generated a string of HTML, we are not done. There’s still a whole bunch of stuff left to do in the browser. It is not unusual to have 100ms server response times and 5 second page load times. If what we’re trying to do is improve how real humans use our site with their eyeballs and keyboards, we need to measure what’s happening in their web browser.
You mentioned a couple of questions that would be answered by browser RUM:
We really don’t know without this data.
rootly-api
I only saw this in rootly-api.
Something is broken with Datadog’s distributed trace setup. Some traces, like this one, have multiple requests to the API grouped together, but those traces are unrelated AFAICT? This gets really confusing and annoying when traces become >100MB and Datadog refuses to even open them.
If you don’t know where to start with this one, I’d just start a case with Datadog support.
We need to agree on how we’ll measure timeouts, and we need that measurement to be a metric. It doesn’t have to start as one (it can be log based and then get turned into a metric). What’s really important is that it isn’t sampled: it has to capture 100% of all timeouts. APM, for example, is a bad place to do this for that reason (spans are sampled).
Most of my clients do this with logs. Errors don’t work because they don’t aggregate very well in Datadog (and you can’t turn them into a metric), and your ELB metrics don’t really work either because they can’t tell from the status code alone what’s a timeout vs a 500.
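As a concrete sketch of what I mean, assuming you’re enforcing timeouts with the rack-timeout gem (if it’s something else, the idea is the same): emit one unsampled, structured log line per timeout and build the Datadog log-based metric off of that.

# config/initializers/rack_timeout_observer.rb
Rack::Timeout.register_state_change_observer(:timeout_metric_logging) do |env|
  info = env["rack-timeout.info"]
  next unless info&.state == :timed_out

  # One log line per timeout, never sampled; turn this into a log-based metric in Datadog.
  Rails.logger.error(
    "request.timeout path=#{env["PATH_INFO"]} method=#{env["REQUEST_METHOD"]}"
  )
end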
12% of the traces for this endpoint on rootly-api are >1 second. When you look at those traces (when they actually load, that is), you can see that it’s a huge amount of N+1.
My preferred N+1 toolset/fix-path for you would be:
About 70% of the requests to this endpoint are longer than 1 second. Yee-ouch! This is your worst endpoint across all applications.
There’s a small amount of N+1 going on here, which would be nice to fix.
However, the main thing that seems to slow this down is that it’s a “kitchen sink” REST endpoint which is returning a huge amount of data.
The two biggest parts of the request, by time, are actually retrieving a huge amount of memberships and a huge amount of Active Storage related associations (blobs, attachments, and variant records). If we split this endpoint into three (one for the current data, one for memberships, one for Active Storage), we could probably cut down latency by ~50% because we could make those requests all in parallel from the frontend.
You will also, eventually, have to paginate this endpoint. However, in my experience, that’s usually a harder lift than splitting it, so I’m suggesting the split first.
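Roughly the shape I mean, with placeholder route names (these aren’t your actual routes, just a sketch of the split):

# Instead of one kitchen-sink #show, three smaller endpoints the frontend can hit in parallel:
resources :schedules, only: [:show] do
  member do
    get :memberships   # the membership payload, on its own
    get :attachments   # the Active Storage blobs/attachments/variant records
  end
end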
filter_active_shifts_and_overrides (shift_manager_service.rb:L#163) of Mobile::V1::ShiftsController#on_call
20% of responses here are >1 second. Datadog profiler reveals something really interesting: they’re all CPU-blocked and spend a bunch of time in filter_active_shifts_and_overrides (shift_manager_service.rb:L#163) parsing dates. This could be fixed.
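I haven’t dug into shift_manager_service.rb itself, but the fix usually looks something like this: parse each distinct timestamp string once, outside the hot loop, instead of re-parsing on every comparison. The method body below is illustrative, not your actual code:

def filter_active_shifts_and_overrides(shifts, now: Time.current)
  # Memoize parsing so each distinct string is only parsed once.
  parsed = Hash.new { |cache, raw| cache[raw] = Time.zone.parse(raw) }

  shifts.select do |shift|
    parsed[shift[:starts_at]] <= now && parsed[shift[:ends_at]] > now
  end
end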
This was what Vishnu was working on from the incident the other day. Almost 45% of requests to this endpoint take more than 1 second, and that one SQL query is the vast majority of it. Once that fix is in place, we can re-evaluate next steps on this controller. I suspect there will be more to do.
Straightforward. Similar to Api::V1::ServicesController#index. This controller has ~12% of its requests go over 1 second.
alert_groups.deleted_at condition to Mobile::V2::AlertsController#index
This endpoint has 20% of requests >1 second, but the reason is simple: a SQL query that looks very, very similar to what we saw in Account::AlertsController#index. While the indexes will help here, this query also suffers from not including the alert_groups.deleted_at condition, which caused the other controller to not use the index at all. So, that code change must be made here as well.
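Without seeing the exact scope, the change is probably a one-liner along these lines (association and column names assumed from the index work on the other controller):

# Adding the same deleted_at predicate lets Postgres use the index the other controller now uses:
alerts = current_account.alerts
                        .joins(:alert_group)
                        .where(alert_groups: { deleted_at: nil })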
While only 6% of requests here are >1 second, there’s a lot of requests, so it’s important.
There’s a super weird pattern at the beginning of many requests to this:
What’s weird is that I’m not approximating: these I/O calls are often EXACTLY 100ms. That points to a GVL issue. I’m wondering if your Redis pool is misconfigured, or if this service is running with too many Puma threads.
Probably the easiest thing to try would be to set the env var RUBY_THREAD_TIMESLICE to 10. If we see this problem suddenly go away, then we know it was a GVL issue.
If that doesn’t work, I think what’s possibly happening here is that the zlib c-extension is pinning CPU usage when deserializing flipper/v1/features. This kind of tracks with another issue we saw around flipper and feature list size. If we could remove that GET or decrease the size of the feature list, we could also potentially fix this problem.
In any web application, managing the time work spends waiting to be processed (queue time) is the primary thing we want to control when scaling. In general, so long as your servers are configured correctly, increasing traffic does not increase how long work actually takes to process in Puma or Sidekiq (service latency; this is what’s measured in e.g. the Datadog service section). However, as traffic increases, it becomes more and more likely that all available servers are busy and work has to wait. That queue time is the quality of service we’re trying to manage.
For each Ruby service, you can have two different (sets of) SLOs and corresponding monitors:
SLOs help us to understand how often we’re actually meeting our goals, e.g. a “within 5 minutes” queue latency target. Is that true 100% of the time? 99%? 80%? That provides super valuable information for adjusting our autoscaling rules. For example, how do you know if the minimum task count is correct? Well, if your SLO performance is only 80% and it’s a within-30-second SLO, you should bump that minimum up!
Request queue time should be your primary scaling signal. It already is for background job queues, but we can do this for web too. However, you’re going to need to do a couple of things:
In puma_stats.rb, you measure utilization like this:
aggregated[:utilization] = (aggregated[:busy_threads] / aggregated[:max_threads].to_f) * 100
This is a kind of utilization, but it’s not a useful one because it’s highly dependent on the I/O qualities of the workload.
For example, consider a theoretical app which is completely CPU bound and makes no I/O calls at all. This application will almost never have busy_threads higher than 1, and yet it will not accept new requests (I’m glossing over some details on the GVL here, but this is all mostly true) because the GVL is locked. So it’s “20% utilized”, but requests are still queueing and waiting to be picked up by this Ruby process! You might have ~200-300ms of time per request spent queueing, not measured anywhere, while your utilization metric looks just great.
This dashboard is using this utilization metric.
The utilization that we really care about is actually every Puma process’ GVL: we want it to be fairly heavily utilized but not so much that it increases request queue time.
We have a gem called gvl_metrics_middleware that measures this directly. I recommend tracking this instead. You can also use information from this gem to set Sidekiq concurrency counts in a data-driven way.
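If you want to see the shape of what that middleware measures, here’s a rough sketch built on byroot’s gvltools gem. This is purely illustrative (the gvl_metrics_middleware gem packages this idea up for you), and the statsd argument is assumed to be an injected Datadog::Statsd client:

require "gvltools"

GVLTools::LocalTimer.enable # per-thread nanoseconds spent waiting on the GVL

class GvlWaitMiddleware
  def initialize(app, statsd:)
    @app = app
    @statsd = statsd
  end

  def call(env)
    before = GVLTools::LocalTimer.monotonic_time
    response = @app.call(env)
    waited_ms = (GVLTools::LocalTimer.monotonic_time - before) / 1_000_000.0
    @statsd.distribution("puma.request.gvl_wait_ms", waited_ms)
    response
  end
end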
I see a lot of people on ECS essentially rebuilding Judoscale. They offer ECS integration now, and it’s $500 a month for 100 tasks or fewer. Everyone’s “build it or buy it” calculus is different, but I would literally just tell you to buy what Judoscale has already built for ECS scaling rather than rebuild it.
Note that if you go this route, you will still need the nginx sidecar process described in a previous recommendation.
I noticed you have makara in your Gemfile, but it appears to be unused. Makara was a really old-school way to do multi-db and read replicas, which has been superseded by the multiple database/read replica support now built into Rails itself.
I think nothing needs to be done other than to delete the line from the Gemfile.
The mhenrixson community version of Sidekiq job uniqueness uses extra Redis commands compared to the official Enterprise implementation by Mike Perham. For that reason, I generally recommend the Enterprise implementation over the community version. However, you currently have a large amount of addons/modifications to the gem, so it may not be practicable to make this replacement. The only benefits would be a reduction in Redis load (which is not a bottleneck for you) and getting on the “official” implementation.
You already have test-prof installed so I recommend adding this.
You just add:
require "test_prof/factory_prof/nate_heckler"
and get the following reminder to fix your dang factories every time you run rspec:
[TEST PROF INFO] Time spent in factories: 04:31.222 (54% of total time)
json
In the last few years, byroot has put in a lot of work on the default json gem so that it’s as fast or faster than Oj or any other alternative implementation. You should remove oj and use the default, just for the benefits of being on the “mainline” json implementation.
Prosopite is great, but in my opinion the best possible use of it is as a per-example filter in RSpec.
it "does a thing", prosopite: true do
# ...
end
… should turn on prosopite in raise mode for the duration of that example. It’s a super nice way to create regression tests around N+1 fixes.
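Wiring that tag up is a tiny around hook; something like this sketch, using prosopite’s standard scan/raise API:

# spec/support/prosopite.rb
RSpec.configure do |config|
  config.around(:each, prosopite: true) do |example|
    begin
      Prosopite.raise = true
      Prosopite.scan { example.run } # the block form handles scan/finish for us
    ensure
      Prosopite.raise = false
    end
  end
end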
You appear to be using ElastiCache. I noticed you don’t have SSL set up anywhere for this. I’m not sure what your requirements are for “encryption in transit”, but “we’re in a private VPC” may not be sufficient.
It looks like this was suggested by AI here. I don’t like this suggestion. It certainly won’t speed things up (because by definition it causes more GC) and I don’t think it will decrease memory usage by any meaningful amount either. You’re already using jemalloc, the best possible way to reduce memory usage for Ruby. This just feels like a footgun for making your tests potentially a lot slower.
I’ve really been enjoying mise lately. I noticed your dev setup isn’t that prescriptive, but here’s what you get for free with mise:
I noticed your Semaphore tests can vary by as much as 100% in their execution time (RSpec jobs can take anywhere from 10 to 20 minutes). In my experience, trying to binpack tests into jobs by measuring execution time is a loser’s game that never pays off. Instead, pulling examples off a centralized queue (this is how Knapsack Pro works) ensures that no matter what, even if tests take longer or new tests are added, all jobs take the same amount of time.
Knapsack Pro is the best vendor here but is not cheap. Recently a good open source alternative appeared but it’s quite new and doesn’t have a lot of adoption: specwrk.
Using the same Redis instance for Sidekiq and Rails cache is dangerous.
Sidekiq stores a lot of data with an EXPIRE set: mostly stats, but also some important stuff around locks.
Cache stores are designed to work at 100% utilization. If your Sidekiq Redis store is at 100% memory utilization, your entire site will go down.
Thus, keeping them on the same DB:
Create a separate Redis store for cache.
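Once the new instance exists, the Rails side is basically one config change. CACHE_REDIS_URL is a placeholder name, not something in your codebase:

# config/environments/production.rb
config.cache_store = :redis_cache_store, {
  url: ENV["CACHE_REDIS_URL"],
  # Cache reads/writes should degrade gracefully rather than raise if the cache has a bad day.
  error_handler: ->(method:, returning:, exception:) {
    Rails.logger.warn("cache #{method} failed: #{exception.class}")
  }
}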
I noticed you have a few old jobs sitting in the default queue. I would try to figure out what put those there, because it looks like nothing is picking that queue up/listening to it, and I would perhaps manually log in to Redis to delete them. Otherwise our “queue:*” graphs all look funny because there’s this one queue with a super long latency all the time.
Rails’ new query log feature works really well with Datadog database monitoring and helps you quickly understand where queries are coming from: every query gets a comment appended saying which application, controller, and action (or job) generated it. Queries now look like this:
SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT $2 /*application:rootly-api,controller:account/dashboard,action:show*/
You can turn this on by adding
config.active_record.query_log_tags_enabled = true
to your production.rb config file. I don’t recommend using the source_location tag, as it can be fairly slow, but controller/action/job is extremely helpful.
Particularly during incidents, you often want to know, from the db’s perspective, where load is being generated.
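For reference, the full config I’d suggest looks something like this (the tag list is my suggestion; tweak as you like):

# config/environments/production.rb
config.active_record.query_log_tags_enabled = true
config.active_record.query_log_tags = [:application, :controller, :action, :job]
# Deliberately leaving out :source_location, per the note above.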