Hey Beehiiv,
Thanks for bringing me in to look at your app. I think you’ve done a really great job building an app that regularly shoulders almost 2000 requests per second. That’s serious scale, particularly for the size of the team you’ve got. I think “100 transactions per second per team member” is probably one of the higher ratios I’ve ever seen. So, kudos!
I’m about to get into the whole pile of stuff I think should be in your backlog. However, I want to get out of the way the things that I think are NOT major issues:
With that in mind, here are the key metrics I propose to improve over our 6-month engagement (so, to be achieved by March 1st):
There will undoubtedly be other things we can improve (I detail a bunch of miscellaneous stuff in the “Other Tech Debt” section at the end), but I want to concentrate on these two areas.
I have a standard set of metrics I recommend everyone track. I’ve bolded ones that I think are especially important for you. I consider this “done” once these are all viewable in one spot.
This would be a nice “landing” page in Datadog for everybody concerned with perf.
SLOs are like a backstop. For example - we have autoscaling set up for web. However, it takes ~3-5 minutes to bring up new dynos. So, while your autoscaling is working as designed, you might be violating your “promised” queue time performance for up to 3-5 minutes while that new capacity comes online. We need to monitor and catalog these kinds of delays. SLOs allow us to do that: we set a standard of service (e.g.: p95 queue time must be 100ms or below) and then look at how often we actually attain that standard.
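To make “how often we actually attain that standard” concrete, here’s a tiny sketch (with made-up numbers, not your data) of how SLO attainment falls out of per-minute p95 samples:

```ruby
# Hypothetical example: one p95 queue-time sample (in ms) per minute.
# SLO attainment is just "what fraction of minutes met the 100ms target?"
TARGET_MS = 100

p95_samples_ms = [42, 61, 85, 340, 55, 1200, 78, 90, 95, 88] # made-up data

met = p95_samples_ms.count { |sample| sample <= TARGET_MS }
attainment = met.to_f / p95_samples_ms.size

puts format("SLO attainment: %.1f%% of minutes met the %dms target", attainment * 100, TARGET_MS)
# => SLO attainment: 80.0% of minutes met the 100ms target
```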
I recommend using Terraform to provision these: you have about 30 different queues, so setting up and maintaining ~60 objects in Datadog by hand is a PITA otherwise.
Perf-L dynos get access to 8 CPUs. You currently have this set to 6, which means there are 2 CPUs on each dyno that you’re paying for but not using. That’s not ideal. Memory usage looks fine to run 8 processes per dyno.
Changing this setting effectively results in 33% more Puma processes for the same $$$/month. This should have a significant effect on average request queue times (and will just be automatically picked up/handled by the autoscaler). After this change I expect (hope?) to see a large reduction in average web dyno count.
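For reference, here’s a minimal sketch of the relevant part of config/puma.rb, assuming the standard env-var-driven setup (if yours already reads WEB_CONCURRENCY, the whole change is just setting that var to 8):

```ruby
# config/puma.rb (sketch; assumes the standard env-var-driven setup)

# One Puma worker process per CPU on the Perf-L dyno: WEB_CONCURRENCY=8.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 8))

# Threads per worker; we'll revisit this number once the GVL contention data is in.
threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads threads_count, threads_count

# Load the app before forking workers so memory is shared via copy-on-write.
preload_app!
```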
After we get the SLO for web in place, I’d like to consider adjusting the web scale threshold. Judoscale works in terms of an average, which is OK, but is affected by outliers and may not align with how we measure our web queue time SLO.
The p95 request queue time I’m seeing in Datadog has a lot of “noise” and frequent spikes in it, often above 1 second. I think some of this will be addressed by the WEB_CONCURRENCY change but probably not all of it. We probably will need to tune the auto scale thresholds down in order to attain our SLO, but we might not.
So, this is blocking on both of those “tickets” being resolved first.
You’re starting a transition to latency queues, which is great.
I like the queue latency buckets you’ve already chosen. We should discuss potentially renaming within_30_seconds (you’ve chosen this queue to be your realtime/customer-blocking queue but 30 seconds is a long SLO for that) but that’s a nitpick.
What I really want is to move all other existing queues into the latency queues. There are going to be individual issues with each one (scheduling fairness is usually the big one), but it will be worth it in the long run.
More work in SLO queues means:
You’re using the approach of reporting Time.now - enqueued_at for each job as it executes. The main downside here is that enqueued_at doesn’t work very well with retries, which can cause this metric to fluctuate a lot, seemingly at random. It also can get thrown off on low-throughput queues (because you only take an observation every time you execute a job), and during periods where queue size is increasing, you won’t see that until the queue actually executes a job.
Sidekiq enterprise and Judoscale use a different approach: polling Sidekiq every ~10 seconds for “what is the oldest job in every queue right now”? This solves a lot of those issues. We can duplicate that logic ourselves and report queue latency that way.
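Sidekiq already exposes the building block for this: Sidekiq::Queue#latency returns the age (in seconds) of the oldest job sitting in a queue. A minimal sketch of the poller, assuming dogstatsd-ruby for reporting (metric name and tags are just illustrative):

```ruby
# Sketch: poll Sidekiq every ~10 seconds and report per-queue latency to Datadog.
# Run this in a small dedicated thread or clock process.
require "sidekiq/api"
require "datadog/statsd"

statsd = Datadog::Statsd.new

loop do
  Sidekiq::Queue.all.each do |queue|
    # queue.latency = age in seconds of the oldest job currently in the queue
    statsd.gauge("sidekiq.queue.latency", queue.latency, tags: ["queue:#{queue.name}"])
  end
  sleep 10
end
```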
We’re probably going to do more of these adjustments as we put SLOs in place, but one that immediately stuck out to me was within_one_minute. It’s constantly over its stated goal of 1 minute of latency, because it takes ~5 minutes to bring on new capacity.
The solution for this is to increase the dyno minimum. When you get SLOs in place the tuning here will become very easy and obvious (increase minimum until SLO is met) but even without that one in place it’s obvious you need to change it.
Almost all of your jobs are really quick, which is great! However, this particular job class, Posts::RequestedTtsGenerationWorker, is among your most executed, but regularly takes >60 seconds.
This is dangerous for two reasons:
We should look at opportunities to refactor this job: either to straight up optimize it, to split it into a child worker/fanout pattern, or to use Sidekiq’s Iterable feature.
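To illustrate the Iterable option (a hedged sketch only: it assumes Sidekiq 7.3+, and the model/method names inside are invented since I haven’t dug into this worker’s internals yet):

```ruby
# Sketch of Sidekiq's Iterable pattern (Sidekiq 7.3+). TtsSegment and
# generate_audio! are hypothetical names. The point: each_iteration is short,
# so the job can be interrupted at a deploy/restart and resumed from its cursor.
class Posts::RequestedTtsGenerationWorker
  include Sidekiq::IterableJob

  def build_enumerator(post_id, cursor:)
    active_record_records_enumerator(
      TtsSegment.where(post_id: post_id), # hypothetical model
      cursor: cursor
    )
  end

  def each_iteration(segment, _post_id)
    segment.generate_audio! # hypothetical; keep each iteration well under ~30 seconds
  end
end
```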
This was just one job I found, there are others. It would be beneficial to tell affected teams and authors that their jobs are potentially unsafe if the following is true:
This could be done via a regularly-running Slack bot. We can discuss how to best get this kind of warning on people’s plates.
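A rough sketch of what that bot could look like (the data source is a placeholder, the webhook env var name is made up, and the thresholds are yours to pick; the Slack side is just a standard incoming webhook):

```ruby
# Sketch: a scheduled job (cron, Sidekiq periodic job, etc.) that warns about
# long-running jobs sitting on short-SLO queues. `offending_jobs` is a
# placeholder; in practice it would come from your Datadog job-runtime metrics.
require "net/http"
require "json"
require "uri"

SLACK_WEBHOOK_URL = ENV.fetch("SLACK_JOB_WARNINGS_WEBHOOK") # hypothetical env var

def warn_about_slow_jobs(offending_jobs)
  # offending_jobs: e.g. [{ class: "Posts::RequestedTtsGenerationWorker", p95_seconds: 75 }]
  return if offending_jobs.empty?

  lines = offending_jobs.map do |job|
    "- #{job[:class]}: p95 runtime is #{job[:p95_seconds]}s, too long for its queue's SLO"
  end

  payload = { text: "Jobs that may violate their queue SLOs:\n#{lines.join("\n")}" }
  Net::HTTP.post(URI(SLACK_WEBHOOK_URL), JSON.generate(payload), "Content-Type" => "application/json")
end
```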
As you move more work into SLO queues, particularly for SLOs under 5 minutes, you’re going to run into people putting way too much work on queues at once and expecting it to be done immediately. If you can’t autoscale to meet the SLO in time, that results in SLO violations.
We generally need to manage this by measuring: over the last minute, what % of available runtime did each job use? Did any one job use far more than the others?
I can make a dashboard here.
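Here’s a minimal sketch of how we could collect the underlying data: a Sidekiq server middleware that reports wall-clock runtime per job class (STATSD here is assumed to be a configured Datadog::Statsd instance, and the metric name is illustrative):

```ruby
# Sketch: report wall-clock runtime per job class. Summing this over a 1-minute
# window in Datadog, divided by (processes * threads * 60s), gives
# "% of available runtime used" per class.
class JobRuntimeMiddleware
  include Sidekiq::ServerMiddleware # optional helper module in recent Sidekiq versions

  def call(_job_instance, job_payload, queue)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    STATSD.distribution(
      "sidekiq.job.runtime",
      elapsed,
      tags: ["worker:#{job_payload["class"]}", "queue:#{queue}"]
    )
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware { |chain| chain.add(JobRuntimeMiddleware) }
end
```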
We have a gem which measures Ruby’s GVL contention. We use this information to recommend the most efficient thread counts for Sidekiq and Puma (we’re automating this soon, but for now it’s manual).
Let’s install this gem, get it reporting to Datadog, and then I can recommend better thread counts for all processes. This will allow us to process the maximum transactions/second on each dyno type.
You’ve got the classic problem of a few particular jobs, the needles in the haystack, using lots of memory to complete. You’ve addressed this by increasing dynos to perf-M.
Ideally, I really want to get everything onto 2x dynos. They cost 5x less per dyno than Perf-M, which is substantial. Perf-M is truly the worst value available in cloud computing.
The process will be:
Having used both for a while now, I’m fully on the prosopite train:
I would switch it out. I like having prosopite available on a per-test basis, so I can effectively write a “no N+1s” test.
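For example, the per-test pattern I like looks roughly like this (a sketch; assumes RSpec, with prosopite configured to raise in the test environment):

```ruby
# spec/support/prosopite.rb (sketch)
RSpec.configure do |config|
  config.before(:suite) do
    Prosopite.rails_logger = true
    Prosopite.raise = true # turn detected N+1s into failures
  end

  # Opt in per example/group by tagging it with :n_plus_one
  config.around(:each, :n_plus_one) do |example|
    begin
      Prosopite.scan
      example.run
    ensure
      Prosopite.finish
    end
  end
end
```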
This gem is only safe for non-production use. If you trigger it in production (?pp=memory-profiler), you will turn on Ruby’s memory allocation extensions, which are not guaranteed to be stable and add a lot of overhead, in terms of memory and latency. That condition will continue until the process restarts.
It looks like you attempted to install it in prod, but it’s missing a user provider, which makes me think it probably doesn’t work. If it doesn’t work, the recommendation here is that we change the setup so that it does!
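Assuming this is rack-mini-profiler (and that the “user provider” is its user_provider config), the missing piece is roughly this; the warden lookup is just an example of however you identify staff:

```ruby
# config/initializers/rack_mini_profiler.rb (sketch; assumes rack-mini-profiler)
Rack::MiniProfiler.config.user_provider = proc do |env|
  # Tie stored profiles to the logged-in user instead of the default (request IP).
  # env["warden"] is an assumption; swap in however you identify the current user.
  env["warden"]&.user&.id&.to_s || Rack::Request.new(env).ip
end
```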
The new Heroku router works really poorly with Puma 6 and below, resulting in high request queue times. In Puma 7, this is addressed. I recommend the upgrade. If you don’t want to do this, or want something to do in the meantime, make sure you have the new router turned off via the Heroku CLI.
I noticed this only for the queue_latency histogram, but I think these are being applied to all metrics.
Datadog essentially charges you based on the tag cardinality of each metric. dyno and host are high-cardinality tags (they’re also basically duplicates of each other), and I’m not sure they’re that useful in most cases. Should they be removed to save you a bit on Datadog metric cost?
You already have test-prof installed so I recommend adding this.
You just add:
require "test_prof/factory_prof/nate_heckler"
and get the following reminder to fix your dang factories every time you run rspec:
[TEST PROF INFO] Time spent in factories: 04:31.222 (54% of total time)
At ~$70/month/database, I think this is the best deal available in observability right now. It’s like AWS Performance Insights, but turbocharged, and it brings so much useful stuff into Datadog, auto-correlated with all your other traces.
CrunchyData’s dashboards are just not enough, and they’re particularly limited when it comes to precision (e.g., show me the CPU load for a 15-minute period 7 days ago).
You’re at the scale now where you can easily DoS yourself by running a poorly performing query in large volume. You need to know where queries are coming from, from the DB’s perspective. Rails’ query logs help you do that.
Turn them on from here. If it’s not supported on your Rails version and you don’t feel like upgrading that soon, use marginalia.
I do not recommend starting out by logging source code location: that can be expensive. Just log job/controller to start.
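Concretely, that’s roughly this (Rails 7.0+; the exact tag list is up for discussion):

```ruby
# config/application.rb or an environment file (sketch; Rails 7.0+)
config.active_record.query_log_tags_enabled = true

# Start with controller/action and job context only; hold off on
# :source_location for now, since computing it on every query is the
# comparatively expensive part.
config.active_record.query_log_tags = [:controller, :action, :job]
```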
You’ve got a Swarm Timeout dashboard, but the metric you use to track this doesn’t work.
For example, over the last week, your logs have ~8k matches for timeouts but the metric says ~24k.
I’m pretty sure this is because the metric doesn’t filter by service name (even though you only have one service?). We should adjust, because having this as a long-term metric is useful.
I’m not sure we’re going to find anything catastrophic here, because your backend response times are great, but it’s the next frontier. I think we’ve got enough to focus on from the scalability/stability side, so we may not be able to make RUM data actionable right away. However, spending a couple hundred bucks on some RUM sessions every month will surely still be worth it.
Part of Crunchy’s value prop is that they’re all really experienced and you can Just Ask them about database scaling issues. The one thing I saw on your dashboards that I thought was weird was extremely large amounts of table bloat, despite autovacuuming. Ask Crunchy, see what they think. This is particularly bad on segment_step_build_subscribers, a very high-write table (unsurprising correlation).
You’re on the biggest database plan available, so you’ve run out of runway. You routinely touch the ~100k+ IOPS threshold, and your 1-minute load average is regularly above 50.
Nothing is falling over (yet), but you have officially run out of headroom to keep vertically scaling.
We should discuss what plans you’ve got around replication or other horizontal scaling strategies.
You have an enormous reader that you barely use.
One very effective strategy I saw at Gusto involved creating “read only” versions of the Sidekiq latency queues. So, you’d have within_30_seconds and also within_30_seconds_read_only, etc.
In the read only queue, there’s an additional Sidekiq middleware in the server stack. It switches ActiveRecord to only use the replica, and to raise on violation. In production, if there’s a violation, you re-enqueue the job on the non-read-only-version of the queue. You send an alert to the affected team.
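A sketch of that middleware, assuming Rails multi-database roles are configured so the reading role points at the replica (queue naming, alerting, and error handling here are illustrative):

```ruby
# Sketch: Sidekiq server middleware for *_read_only queues.
class ReadOnlyQueueMiddleware
  include Sidekiq::ServerMiddleware # optional helper module in recent Sidekiq versions

  def call(_job_instance, job_payload, queue)
    return yield unless queue.end_with?("_read_only")

    # connected_to(role: :reading) pins queries to the replica; in Rails 6.1+
    # it also raises ActiveRecord::ReadOnlyError if the job attempts a write.
    ActiveRecord::Base.connected_to(role: :reading) { yield }
  rescue ActiveRecord::ReadOnlyError
    # The job tried to write: re-enqueue it on the writer-backed queue
    # and notify the owning team (notification mechanism not shown).
    Sidekiq::Client.push(job_payload.merge("queue" => queue.delete_suffix("_read_only")))
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware { |chain| chain.add(ReadOnlyQueueMiddleware) }
end
```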
This was hugely successful in two aspects: