Hey Kyle,

Thanks again for bringing me on to look at your setup. Overall, we’ve got a pretty simple app that needs a little “operational” love as far as improving reliability and robustness to changes in usage and load. Through improving observability and applying a bit of “SRE”, we can get this setup into a much better place.

As I mentioned, I view retainer engagements as a 6-month focus on a particular set of metrics. I said I would propose a set of metrics to be our yardsticks for this effort. Here they are:

Web request queue time SLO: >99.9% p95 100ms or less. Web request queue time is how long a request spends waiting (queueing) for a free Puma thread to start processing it. This queueing occurs when a request is routed to an ECS task but is not immediately picked up by a free Sidekiq thread. This is the fundamental signal for all web scaling: we are trying to manage this number in response to changes in traffic.
Sidekiq job latency SLO: All queues meeting latency SLO >99.9% of the time. All jobs spend some amount of time waiting in a Redis queue. Each job has, de facto, some level of acceptable amount of queue latency. I’ll discuss this more later, but we will assign an SLO to every job (actually to every queue), so our metric will be what % of the time we are achieving those SLOs.

There are other, secondary metrics which kind of by definition have to be “good” in order for these two metrics to succeed, but these high-level metrics “contain” those secondary metrics within them. For example, we don’t need a separate database uptime metric, because if your database goes down for 3 hours straight both of these SLOs will be toast for the quarter.

I arrange my recommendations here based on which of those two SLOs it best applies to.

Web Queue Time

Recommendation: Set up Judoscale for ECS

I’ve had the opportunity to work with Judoscale (and their founder Adam) at a number of clients now over the years, and I think for teams without a dedicated SRE resource, they represent a very good product that helps teams get autoscaling into place in a way that works very well with very little effort.

Frankly it probably even makes sense for most of the teams I work with who do have SRE, as we end up just rebuilding what Judoscale already does. It’s a simple cost tradeoff - once you’re doing more than ~250 tasks it probably makes sense to Build Your Own Judoscale because it’s just costing you $1,000/month.

Here’s what Judoscale does right, out of the box:

Queue based autoscaling for Puma and Sidekiq. You don’t have to rebuild anything in Cloudwatch, it just collects and sends and manages all that for you. It can do schedules, it can let you easily try and tweak different parameters, it’s all completely painless.
Utilization based autoscaling for web. This is become more of an important thing at clients that I have where large amounts of traffic can drop in over the course of ~5 minutes. It allows you to plan to have a very high “headroom” but in a data driven way that’s not just “oh yeah let’s throw a billion servers at it”.

I’m very happy with the product and have no drawbacks to speak of. It would give us a really nice autoscaling setup with a couple of clicks and a PR.

I read the autoscaling proposal completely but there’s a number of problems:

Request count per target only works if response times don’t change. They will. If your workload suddenly gets very slow, your autoscaling breaks.
Scaling based on CPU has a similar problem: it assumes constant ratios of I/O and a correctly configured server for that I/O. Let’s say you suddenly get a bunch of web traffic (or far more likely: sidekiq jobs) which just do a ton of I/O. Every Puma thread or Sidekiq thread on the box is full and busy, but CPU usage is <10% (because every thread is just waiting). Your autoscaling is now broken.
Queue time is the service objective. If queue time is constantly below a reasonable threshold, your customers by definition are not experiencing latency problems. So, we should track that directly rather than 2nd or 3rd order metrics.
Queue depth is also a 2nd order objective. If you have a million jobs in the queue but they all take a millisecond to process, that’s far different than a million jobs that all take 60 seconds. Again, latency is what matters, and we can easily measure it.

Recommendation: use Ruby 3.3+ to get YJIT auto-on

YJIT wasn’t really all that good until Ruby 3.2 (it also got better in 3.3) but once you’re in 3.3 with Rails framework defaults of 7.1+ (you are), then it’s just auto-on.

IME, YJIT is a consistent 10%+ speedup you just get for free. So, that’s 10% more capacity we can get just by upgrading Ruby. Let’s do that!

Recommendation: deploy rack-mini-profiler in production

Sentry’s profiler is actually really good, definitely one of my favorites, but it’s not perfect. You have this nice domain where everyone who uses your app is a small “enterprise-y” admin type user, so you can easily impersonate and view their data and repro the exact request they have.

You may not be feeling this pain yet, but apps like this typically end up with certain customers with certain data getting these awful performance experiences. “What do you mean, people will have 10,000 of X per Y?!” is a kind of thing that happens all the time. My preferred workflow here is: notice the problem with Sentry, and if I don’t have all the info I need, repro with rack-mini-profiler in production and get a full stacktrace or per-line accounting of every SQL query.

Recommendation: Don’t set WEB_CONCURRENCY based on processor count or test it

I saw this bit of your Puma config where you’re setting WEB_CONCURRENCY based on physical processor count.

Couple problems here:

We actually want to pin to the logical core count, not physical. IME hyperthreads are perfectly fine to treat as a fully parallel unit. It’s also worth noting that on AWS 1 vCPU = 1 hyperthread, so 2vCPU = 1 physical core on Intel.
I’m really not convinced that in all environments and all virtualization situations that these numbers correctly account for anything regarding CPU quotas or timesharing, or at least I’m not willing to bet my setup on it. Heroku for example reports 8 cores on 1x/2x dynos despite having 1024/2048 millicore of timeshare.

So I’d much rather just set WEB_CONCURRENCY in the task definition, right next to the CPU share I set, and have it always set to cpu share / 1024, rounding down.

Recommendation: Web service should have tasks of WEB_CONCURRENCY >=4

Queueing can only occur when all Puma threads on a task are busy. How do you get that to happen less often? You increase the number of threads per task! Even if request arrival rate doesn’t change, this decreases queue time in a way roughly equivalent to 1/n, where n is the number of Puma workers per box. This is actually even more important than setting the minimum task count to 3 as discussed here. It’s entirely possible to have a min task count of 1, with a WEB_CONCURRENCY of 16, perform better than a min task count of 8 with a WEB_CONCURRENCY of 2. We want as many processes per box as we can reasonably stomach given our min/max autoscaling needs and how much of a “step” we want to increment by, but 4 is my recommended minimum for cost efficiency here.

Recommendation: In the short term, web worker request count could be tuned upward

I hope we get Judoscale set up ASAP, but in the short term, you currently have a request count per target set to 80 for a web concurrency of what I think would be 2 (though I’m completely unsure about that). Assuming I’m right that’s 40 request per minute per process, which, unless your app is really slow, is probably too conservative. You could probably tune this upward to increase your cost effiency.

Recommendation; Run an nginx sidecar, measure request queue time and terminate HTTP/2 and SSL there

nginx is really good at a lot of things that Puma isn’t:

Uploads. Unclear to me so far how important this is to your app but nginx is far better at managing >5MB uploads than Puma is.
HTTP/2. Puma doesn’t support this at all. A CDN also works here to check this requirement.
SSL. Puma’s support is OK. nginx’s is the reference standard.
Not being Puma. In order to measure request queue time, you need something, anything, in front of Puma that can stamp every incoming request with a time-in-milliseconds-since-the-epoch header. Nginx can do this.

This setup is extremely common. If Amazon got their shit together and let you do this request queue time stamping at the ALB you wouldn’t have to necessarily run nginx as a sidecar. But they don’t so you do.

Note: this is a prereq for Judoscale to work.

Recommendation: Have a comprehensive alert and dashboard suite up in Cloudwatch.

So I thought a while about what I would really recommend you do as far as observability. I mentioned this on the call and while I love Datadog it’s not designed for small teams. I think we don’t need to reinvent the wheel and do a big migration, we can use Cloudwatch for everything we want to do.

I have a standard set of metrics I recommend everyone track. I’ve bolded ones that I think are especially important for you. I consider this “done” once these are all viewable in one spot. I recommend using terraform to provision this.:

Latency/Customer Experience
- Time to execute customer-blocking background jobs. For any background job where a customer is actively waiting on the result and is blocked until that job completes (password reset email), tracks total time from enqueued_at until completion.
- % of responses which took longer than 500ms, organized by controller action.
Scalability
- Web utilization
  - Total Puma process count
  - Concurrent request load (average req/sec * sec/req)
  - Process count / load
- HPA status (web and workers)
  - current, min, max
- Web request queue timing (p75,p95,pmax)
- Worker latency
  - For each queue, show queue latency (and SLA for that particular queue)
Reliability
- Database, cache DBs, and Redis DBs
  - CPU (load and utilization)
  - IOPs (if limited)
  - Read/write latency
  - Error rates
  - Hitrate (if cache)
- Error rates
  - Web, worker
- www.*.com uptime

This will also mean we have to report all this to Cloudwatch which may mean double-measuring (showing stuff in Judoscale AND Cloudwatch, for example request queue time)

Long-term aside: Planetscale

So I saw the Aurora serverless config. I’m sure it’s probably working fine right now, but I wanted to mention that I’ve watched a number of people migrate to Planetscale (which now supports Postgres) with really great success. It’s something to have on the back of your mind.

Job Queue Time

Recommendation: Set up Sentry Queues

[Sentry’s queue product] is actually a really cool. I’d like to get it turned on so we can see Queue latency and health over time in Sentry’s very nice interface. We’ll have it in Cloudwatch too, but this is also nice.

Recommendation: Sentry - separate projects for web and sidekiq

I really don’t like Sentry’s filtering regarding separating web/background jobs as different “operations”. It works poorly. Let’s change the config so that Sidekiq and web are completely different projects, which will make it a lot easier for the default dashboards to separate this stuff in a useful way.

Recommendation: Fix sample rate for web and remove the allowlist

We already merged the PR for fixing sample rates for web, but you’ve also got this thing in there that’s an allowlist for all Rails routes. It strikes me as unnecessarily complicated: are you really getting that much traffic to e.g. spam or other URLs which is clogging up your Sentry traces? I’d rather just delete all this and set the global sample rate for traces to ~0.01 which you’re 90% of the way there to.

Recommendation: SLA Queues

This is my most heavy-weight recommendation. You should migrate to a simple queue structure where the queue’s name is it’s “service level agreement” - how long jobs in this queue are allowed to remain queued:

immediate/asap/within_0_seconds (many ways to name this one, but the idea is that the SLO is the same as web requests)
within_1_minute
within_1_hour
within_24_hours
backfill (has no SLO)

This makes setting SLOs extremely clear. I recommend making this switch as we determine what SLOs actually are for each queue.

Step 1: group existing queues into various SLO buckets
Step 2: change services so that you only have ~5 which are SLO based
Step 3: remove old queues and only have SLO queues, 1 per ECS service.

We can discuss this more in detail or I can direct you to the explanation in Sidekiq in Practice about why SLA queues are so awesome!

Recommendation: Diagnose memory usage issue on delister queue by installing a memory logger

Delister seems to struggle with some kind of memory issue where it’s using ~1.5 gb and everything else uses much less. We created a gem that logs memory use and I think would help us to plug this hole. The primary benefit would just be smaller/more reasonable task sizes and lower cost.

Recommendation: Target a 25 second job execution time maximum except on “backfill” queues

You’ve got a handful of jobs which can take more than ~25 seconds on average. These jobs have two main problems:

They struggle to be idempotent. Sidekiq’s graceful shutdown timeout is 25 seconds. That means jobs which take longer than that will be axed in the middle of execution which is a very unstable thing for most jobs. Idempotency is good but hard to guarantee, and most jobs won’t hold up to being hard-terminated halfway through and then retried later. It’s easier to ensure robustness by just not hard-terminating jobs on the regular.
They create unstable queue conditions. Jobs which regularly execute for >30 seconds can cause sudden, difficult to predict spikes in queue latency, which is one of our key metrics.

To operationalize this, I would work on splitting up jobs with a >25 second 7 day max execution time.

Recommendation: Fix Sentry Sidekiq trace grouping

You’ve got this problem where unrelated Sidekiq traces are grouped together in Sentry. This is an issue I reported to them a year ago which they did fix but there’s a config setting you have to apply and/or a version upgrade we’ll need to do.

Recommendation: TDCUpsertEventsJob: Huge amounts of time consumed, investigate retry mechanism feedback

This job is your highest by total time consumed (i.e.: total time spent running), so I’d like to reduce that to increase our available capacity. There’s something very weird going on in this job regarding retries I think. We should talk through how Retriable works and what we think are reasonable defaults here. The Sentry data is unclear as to why this job takes so long but I have some hypotheses.

Recommendation: Install jemalloc and reduce Sidekiq task definition memory to 2048, 1024 long term

I noticed all Sidekiq tasks are set to 4GB. That’s usually more than required. We can install jemalloc and probably remove ~90% of your memory usage problems and reduce this to 2GB. Long term I’d like to get this down to 1GB.

Recommendation: Fix sidekiq pool/concurrency size mismatch

You’re currently running Sidekiq at default concurrency (10), which is good, but your database pool size is set to RAILS_MAX_THREADS

5. That leaves you by default with a Sidekiq concurrency too high for the database pool. That will eventually result in connection pool exceptions or phantom latency as threads wait on the connection pool.

I prefer people set these to the same value somehow. Usually the simplest way is to just rely on how Sidekiq uses RAILS_MAX_THREADS as it’s default concurrency value (Puma too), and use that to control database pool size AND concurrency everywhere.

Recommendation: volatile-lru is not ideal for sidekiq

Your Redis store is set to volatile-lru expiration. This could result in the permanent loss of Sidekiq stats data as well as any other data Mike decides now or in the future will have an expiry, such as the new profiler stuff or Iterable state.

Recommendation: Do not use same Redis store for cache, actioncable and sidekiq (env vars diff?)

It was unclear to me if the cache and Sidekiq are actually using the same Redis store, but I know that actioncable and Sidekiq definitely are. This is unsafe, as all three types of store should a) fail independently and b) generally should have different memory sizes and expiration policies (as just mentioned).

Migrate this to three different elasticache instances.