Hey Beehiiv,
Thanks for bringing me in to look at your app. I think you’ve done a really great job building an app that regularly shoulders almost 2000 requests per second. That’s serious scale, particularly for the size of the team you’ve got. I think “100 transactions per second per team member” is probably one of the higher ratios I’ve ever seen. So, kudos!
I’m about to get into the whole pile of stuff I think should be in your backlog. However, I want to get out of the way the things that I think are NOT major issues:
With that in mind, here are the key metrics I propose to improve over our 6-month engagement (so, to be achieved by March 1st):
There will undoubtedly be other things we can improve (I detail a bunch of miscellaneous stuff in the “Other Tech Debt” section at the end), but I want to concentrate on these two areas.
I have a standard set of metrics I recommend everyone track. I’ve bolded ones that I think are especially important for you. I consider this “done” once these are all viewable in one spot.
This would be a nice “landing” page in Datadog for everybody concerned with perf.
SLOs are like a backstop. For example - we have autoscaling set up for web. However, it takes ~3-5 minutes to bring up new dynos. So, while your autoscaling is working as designed, you might be violating your “promised” queue time performance for up to 3-5 minutes while that new capacity comes online. We need to monitor and catalog these kinds of delays. SLOs allow us to do that: we set a standard of service (e.g.: p95 queue time must be 100ms or below) and then look at how often we actually attain that standard.
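To make “how often we actually attain that standard” concrete, here’s a tiny sketch (with made-up numbers, not your data) of how SLO attainment falls out of per-minute p95 samples:

```ruby
# Hypothetical example: one p95 queue-time sample (in ms) per minute.
# SLO attainment is just "what fraction of minutes met the 100ms target?"
TARGET_MS = 100

p95_samples_ms = [42, 61, 85, 340, 55, 1200, 78, 90, 95, 88] # made-up data

met = p95_samples_ms.count { |sample| sample <= TARGET_MS }
attainment = met.to_f / p95_samples_ms.size

puts format("SLO attainment: %.1f%% of minutes met the %dms target", attainment * 100, TARGET_MS)
# => SLO attainment: 80.0% of minutes met the 100ms target
```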
I recommend using Terraform to provision these: you have about 30 different queues, so setting up and maintaining ~60 objects in Datadog by hand is a PITA otherwise.
Perf-L dynos get access to 8 CPUs. You currently have this set to 6, which means there are 2 CPUs on each dyno that you’re paying for but not using. That’s not ideal. Memory usage looks fine to run 8 processes per dyno.
Changing this setting effectively results in 33% more Puma processes for the same $$$/month. This should have a significant effect on average request queue times (and will just be automatically picked up/handled by the autoscaler). After this change I expect (hope?) to see a large reduction in average web dyno count.
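For reference, here’s a minimal sketch of the relevant part of config/puma.rb, assuming the standard env-var-driven setup (if yours already reads WEB_CONCURRENCY, the whole change is just setting that var to 8):

```ruby
# config/puma.rb (sketch; assumes the standard env-var-driven setup)

# One Puma worker process per CPU on the Perf-L dyno: WEB_CONCURRENCY=8.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 8))

# Threads per worker; we'll revisit this number once the GVL contention data is in.
threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads threads_count, threads_count

# Load the app before forking workers so memory is shared via copy-on-write.
preload_app!
```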
After we get the SLO for web in place, I’d like to consider adjusting the web scale threshold. Judoscale works in terms of an average, which is OK, but is affected by outliers and may not align with how we measure our web queue time SLO.
The p95 request queue time I’m seeing in Datadog has a lot of “noise” and frequent spikes in it, often above 1 second. I think some of this will be addressed by the WEB_CONCURRENCY change but probably not all of it. We probably will need to tune the auto scale thresholds down in order to attain our SLO, but we might not.
So, this is blocking on both of those “tickets” being resolved first.
You’re starting a transition to latency queues, which is great.
I like the queue latency buckets you’ve already chosen. We should discuss potentially renaming within_30_seconds (you’ve chosen this queue to be your realtime/customer-blocking queue but 30 seconds is a long SLO for that) but that’s a nitpick.
What I really want is to move all other existing queues into the latency queues. There are going to be individual issues with each one (scheduling fairness is usually the big one), but it will be worth it in the long run.
More work in SLO queues means:
You’re using the approach of reporting Time.now - enqueued_at for each job as it executes. The main downside here is that enqueued_at doesn’t work very well with retries, which can cause this metric to fluctuate a lot, seemingly at random. It also can get thrown off on low-throughput queues (because you only take an observation every time you execute a job), and during periods where queue size is increasing, you won’t see that until the queue actually executes a job.
Sidekiq enterprise and Judoscale use a different approach: polling Sidekiq every ~10 seconds for “what is the oldest job in every queue right now”? This solves a lot of those issues. We can duplicate that logic ourselves and report queue latency that way.
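Sidekiq already exposes the building block for this: Sidekiq::Queue#latency returns the age (in seconds) of the oldest job sitting in a queue. A minimal sketch of the poller, assuming dogstatsd-ruby for reporting (metric name and tags are just illustrative):

```ruby
# Sketch: poll Sidekiq every ~10 seconds and report per-queue latency to Datadog.
# Run this in a small dedicated thread or clock process.
require "sidekiq/api"
require "datadog/statsd"

statsd = Datadog::Statsd.new

loop do
  Sidekiq::Queue.all.each do |queue|
    # queue.latency = age in seconds of the oldest job currently in the queue
    statsd.gauge("sidekiq.queue.latency", queue.latency, tags: ["queue:#{queue.name}"])
  end
  sleep 10
end
```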
We’re probably going to do more of these adjustments as we put SLOs in place, but one that immediately stuck out to me was within_one_minute. It’s constantly over its stated goal of 1 minute of latency, because it takes ~5 minutes to bring on new capacity.
The solution for this is to increase the dyno minimum. When you get SLOs in place the tuning here will become very easy and obvious (increase minimum until SLO is met) but even without that one in place it’s obvious you need to change it.
Almost all of your jobs are really quick, which is great! However, this particular job class, Posts::RequestedTtsGenerationWorker, is among your most executed, but regularly takes >60 seconds.
This is dangerous for two reasons:
We should look at opportunities to refactor this job: either to straight up optimize it, to split it into a child worker/fanout pattern, or to use Sidekiq’s Iterable feature.
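To illustrate the Iterable option (a hedged sketch only: it assumes Sidekiq 7.3+, and the model/method names inside are invented since I haven’t dug into this worker’s internals yet):

```ruby
# Sketch of Sidekiq's Iterable pattern (Sidekiq 7.3+). TtsSegment and
# generate_audio! are hypothetical names. The point: each_iteration is short,
# so the job can be interrupted at a deploy/restart and resumed from its cursor.
class Posts::RequestedTtsGenerationWorker
  include Sidekiq::IterableJob

  def build_enumerator(post_id, cursor:)
    active_record_records_enumerator(
      TtsSegment.where(post_id: post_id), # hypothetical model
      cursor: cursor
    )
  end

  def each_iteration(segment, _post_id)
    segment.generate_audio! # hypothetical; keep each iteration well under ~30 seconds
  end
end
```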
This was just one job I found, there are others. It would be beneficial to tell affected teams and authors that their jobs are potentially unsafe if the following is true:
This could be done via a regularly-running Slack bot. We can discuss how to best get this kind of warning on people’s plates.
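A rough sketch of what that bot could look like (the data source is a placeholder, the webhook env var name is made up, and the thresholds are yours to pick; the Slack side is just a standard incoming webhook):

```ruby
# Sketch: a scheduled job (cron, Sidekiq periodic job, etc.) that warns about
# long-running jobs sitting on short-SLO queues. `offending_jobs` is a
# placeholder; in practice it would come from your Datadog job-runtime metrics.
require "net/http"
require "json"
require "uri"

SLACK_WEBHOOK_URL = ENV.fetch("SLACK_JOB_WARNINGS_WEBHOOK") # hypothetical env var

def warn_about_slow_jobs(offending_jobs)
  # offending_jobs: e.g. [{ class: "Posts::RequestedTtsGenerationWorker", p95_seconds: 75 }]
  return if offending_jobs.empty?

  lines = offending_jobs.map do |job|
    "- #{job[:class]}: p95 runtime is #{job[:p95_seconds]}s, too long for its queue's SLO"
  end

  payload = { text: "Jobs that may violate their queue SLOs:\n#{lines.join("\n")}" }
  Net::HTTP.post(URI(SLACK_WEBHOOK_URL), JSON.generate(payload), "Content-Type" => "application/json")
end
```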
As you move more work into SLO queues, particularly for SLOs under 5 minutes, you’re going to run into people putting way too much work on queues at once and expecting it to be done immediately. If you can’t autoscale to meet the SLO in time, that results in SLO violations.
We generally need to manage this by measuring: over the last minute, what % of available runtime did each job use? Did any one job use far more than the others?
I can make a dashboard here.
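Here’s a minimal sketch of how we could collect the underlying data: a Sidekiq server middleware that reports wall-clock runtime per job class (STATSD here is assumed to be a configured Datadog::Statsd instance, and the metric name is illustrative):

```ruby
# Sketch: report wall-clock runtime per job class. Summing this over a 1-minute
# window in Datadog, divided by (processes * threads * 60s), gives
# "% of available runtime used" per class.
class JobRuntimeMiddleware
  include Sidekiq::ServerMiddleware # optional helper module in recent Sidekiq versions

  def call(_job_instance, job_payload, queue)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    STATSD.distribution(
      "sidekiq.job.runtime",
      elapsed,
      tags: ["worker:#{job_payload["class"]}", "queue:#{queue}"]
    )
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware { |chain| chain.add(JobRuntimeMiddleware) }
end
```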
We have a gem which measures Ruby’s GVL contention. We use this information to recommend the most efficient thread counts for Sidekiq and Puma (we’re automating this soon, but for now it’s manual).
Let’s install this gem, get it reporting to Datadog, and then I can recommend better thread counts for all processes. This will allow us to process the maximum transactions/second on each dyno type.
You’ve got the classic problem of a few particular jobs, the needles in the haystack, using lots of memory to complete. You’ve addressed this by increasing dynos to perf-M.
Ideally, I really want to get everything onto 2x dynos. They cost 5x less per dyno than Perf-M, which is substantial. Perf-M is truly the worst value available in cloud computing.
The process will be:
Having used both for a while now, I’m fully on the prosopite train:
I would switch it out. I like having prosopite available on a per-test basis, so I can effectively write a “no N+1s” test.
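For example, the per-test pattern I like looks roughly like this (a sketch; assumes RSpec, with prosopite configured to raise in the test environment):

```ruby
# spec/support/prosopite.rb (sketch)
RSpec.configure do |config|
  config.before(:suite) do
    Prosopite.rails_logger = true
    Prosopite.raise = true # turn detected N+1s into failures
  end

  # Opt in per example/group by tagging it with :n_plus_one
  config.around(:each, :n_plus_one) do |example|
    begin
      Prosopite.scan
      example.run
    ensure
      Prosopite.finish
    end
  end
end
```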
This gem is only safe for non-production use. If you trigger it in production (?pp=memory-profiler), you will turn on Ruby’s memory allocation extensions, which are not guaranteed to be stable and add a lot of overhead, in terms of memory and latency. That condition will continue until the process restarts.
It looks like you attempted to install it in prod, but it’s missing a user provider, which makes me think it probably doesn’t work. If it doesn’t work, the recommendation here is that we change the setup so that it does!
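Assuming this is rack-mini-profiler (and that the “user provider” is its user_provider config), the missing piece is roughly this; the warden lookup is just an example of however you identify staff:

```ruby
# config/initializers/rack_mini_profiler.rb (sketch; assumes rack-mini-profiler)
Rack::MiniProfiler.config.user_provider = proc do |env|
  # Tie stored profiles to the logged-in user instead of the default (request IP).
  # env["warden"] is an assumption; swap in however you identify the current user.
  env["warden"]&.user&.id&.to_s || Rack::Request.new(env).ip
end
```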
The new Heroku router works really poorly with Puma 6 and below, resulting in high request queue times. In Puma 7, this is addressed. I recommend the upgrade. If you don’t want to do this, or want something to do in the meantime, make sure you have the new router turned off via the Heroku CLI.
I noticed this only for the queue_latency histogram, but I think these are being applied to all metrics.
Datadog essentially charges you based on the tag cardinality of each metric. dyno and host are high-cardinality tags (they’re also basically duplicates of each other), and I’m not sure they’re that useful in most cases. Should they be removed to save you a bit on Datadog metric cost?
You already have test-prof installed so I recommend adding this.
You just add:
require "test_prof/factory_prof/nate_heckler"
and get the following reminder to fix your dang factories every time you run rspec:
[TEST PROF INFO] Time spent in factories: 04:31.222 (54% of total time)
At ~$70/month/database, I think this is the best deal available in observability right now. It’s like AWS Performance Insights, but turbocharged, and it brings so much useful stuff into Datadog, auto-correlated with all your other traces.
CrunchyData’s dashboards are just not enough, and they’re particularly limited when it comes to precision (e.g., show me the CPU load for a 15-minute period 7 days ago).
You’re at the scale now where you can easily DoS yourself by running a poorly performing query in large volume. You need to know where queries are coming from, from the DB’s perspective. Rails’ query logs help you do that.
Turn them on from here. If it’s not supported on your Rails version and you don’t feel like upgrading that soon, use marginalia.
I do not recommend starting out by logging source code location: that can be expensive. Just log job/controller to start.
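Concretely, that’s roughly this (Rails 7.0+; the exact tag list is up for discussion):

```ruby
# config/application.rb or an environment file (sketch; Rails 7.0+)
config.active_record.query_log_tags_enabled = true

# Start with controller/action and job context only; hold off on
# :source_location for now, since computing it on every query is the
# comparatively expensive part.
config.active_record.query_log_tags = [:controller, :action, :job]
```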
You’ve got a Swarm Timeout dashboard, but the metric you use to track this doesn’t work.
For example, over the last week, your logs have ~8k matches for timeouts but the metric says ~24k.
I’m pretty sure this is because the metric doesn’t filter by service name (even though you only have one service?). We should adjust, because having this as a long-term metric is useful.
I’m not sure we’re going to find anything catastrophic here, because your backend response times are great, but it’s the next frontier. I think we’ve got enough to focus on from the scalability/stability side, so we may not be able to make RUM data actionable right away. However, spending a couple hundred bucks on some RUM sessions every month will surely still be worth it.
Part of Crunchy’s value prop is that they’re all really experienced and you can Just Ask them about database scaling issues. The one thing I saw on your dashboards that I thought was weird was extremely large amounts of table bloat, despite autovacuuming. Ask Crunchy, see what they think. This is particularly bad on segment_step_build_subscribers, a very high-write table (unsurprising correlation).
You’re on the biggest database plan available, so you’ve run out of runway. You routinely touch the ~100k+ IOPS threshold, and your 1-minute load average is regularly above 50.
Nothing is falling over (yet), but you have officially run out of headroom to keep vertically scaling.
We should discuss what plans you’ve got around replication or other horizontal scaling strategies.
You have an enormous reader that you barely use.
One very effective strategy I saw at Gusto involved creating “read only” versions of the Sidekiq latency queues. So, you’d have within_30_seconds and also within_30_seconds_read_only, etc.
In the read only queue, there’s an additional Sidekiq middleware in the server stack. It switches ActiveRecord to only use the replica, and to raise on violation. In production, if there’s a violation, you re-enqueue the job on the non-read-only-version of the queue. You send an alert to the affected team.
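A sketch of that middleware, assuming Rails multi-database roles are configured so the reading role points at the replica (queue naming, alerting, and error handling here are illustrative):

```ruby
# Sketch: Sidekiq server middleware for *_read_only queues.
class ReadOnlyQueueMiddleware
  include Sidekiq::ServerMiddleware # optional helper module in recent Sidekiq versions

  def call(_job_instance, job_payload, queue)
    return yield unless queue.end_with?("_read_only")

    # connected_to(role: :reading) pins queries to the replica; in Rails 6.1+
    # it also raises ActiveRecord::ReadOnlyError if the job attempts a write.
    ActiveRecord::Base.connected_to(role: :reading) { yield }
  rescue ActiveRecord::ReadOnlyError
    # The job tried to write: re-enqueue it on the writer-backed queue
    # and notify the owning team (notification mechanism not shown).
    Sidekiq::Client.push(job_payload.merge("queue" => queue.delete_suffix("_read_only")))
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware { |chain| chain.add(ReadOnlyQueueMiddleware) }
end
```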
This was hugely successful in two aspects: