Hello Altrio team,
Thanks again for bringing me in to work with you. After taking a look at your application, metrics, and monitoring, I’ve settled on a few goals for the rest of our engagement:
To accomplish these goals, I want to track two key metrics over the course of this engagement:
Improving these two metrics will be a great yardstick for measuring our effort.
~95% of your web transaction latency issues are caused by N+1 queries. Each query in your application takes at least ~1 millisecond, but most actions are running 500+ queries across various tables. The “mode of failure” for a slow page looks something like:
Let’s go through some things we can do to fix this, in detail.
There’s this pattern in the codebase where you use a method called full_preloads on a model to define every single relationship this model has with any other model. You then use this hash almost everywhere in your controller layer when looking up data.
Includes/preload/eager_load are like salt: we almost always need a little, but we only want to add it as it’s called for, not pre-emptively.
Using full_preloads is leading to huge numbers of records being loaded, many of which probably are not even used by the view. Having nested has_many relationships (this has many of that, which has many of the other thing, and so on) leads to a geometric explosion of records being loaded:
# Example: Deal with 3 layers of nested has_many relationships
Deal.includes({
  tasks: {
    task_assigned_users: :user,
    comments: [:user, :attachments],
    field_values: :field
  },
  deal_participants: {
    user: [:profile, :permissions],
    notes: :attachments
  },
  recordings: {
    recording_participants: :user,
    transcripts: :sections
  }
})

# If you have:
# - 10 deals
# - Each deal has 50 tasks
# - Each task has 5 assigned users, 10 comments, 20 field values
# - Each deal has 8 participants with notes
# - Each deal has 3 recordings with participants
#
# You're loading:
#   10 + (10*50) + (10*50*5) + (10*50*10) + (10*50*20) + (10*8) + (10*3)
#   = 10 + 500 + 2,500 + 5,000 + 10,000 + 80 + 30
#   = ~18,000+ records for what might have been a simple list view
If we don’t need every single one of those records (and if we do, then perhaps our view is overloaded with responsibilities?), then we’re loading huge amounts of wasted data.
Instead, we should always define exactly what records are needed for the view we’re rendering right now. That’s a controller-level concern.
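As a sketch of what that looks like (the controller and the exact association set here are assumptions, borrowed from the example above), each action declares only what its view renders:

class DealsController < ApplicationController
  def index
    # Hypothetical list view: it only renders participants and tasks,
    # so that's all we preload - no full_preloads.
    @deals = Deal.includes(:deal_participants, :tasks)
                 .order(updated_at: :desc)
  end
end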
I have a rule of thumb: almost any instance method on a model (i.e. in the models directory or inheriting from ActiveRecord::Base) will eventually be called in a loop. That means if running that method causes a query every time it’s called, it will also eventually cause an N+1. Many of your N+1s in production are from this antipattern.
Take, for example, Task#task_assigned_users:
def task_assigned_users
  potential_users = users.presence || business_role_users
  deal_active_members.where(
    id: potential_users.select(:id)
  ).ordered_by_name
end
The deal_active_members.where(...) call causes a query every time this method is run. There are several ActiveRecord query methods which, every time they are run, cause a query, and there is no way to use includes, preload or eager_load to avoid this. Includes can only prevent the lazy-loading of an association; it cannot prevent query methods from running.
The most common offenders, which in my estimation should never be used in model instance methods, are:
To rewrite this method in a way that it cannot N+1, it must only work in terms of associations, and never call these query methods.
def task_assigned_users
  # both of these are associations, can be preloaded
  potential_users = users.presence || business_role_users
  potential_user_ids = potential_users.map(&:id) # select() would cause a query

  # Filter deal_active_members using Enumerable methods instead of where()
  deal_active_members.select { |member|
    potential_user_ids.include?(member.id)
  }.sort_by { |member|
    [member.first_name, member.last_name]
  }
end
This method now doesn’t N+1 if users, business_role_users, and deal_active_members have been preloaded.
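For that to hold, the call site has to do the preloading. A hypothetical sketch (assuming users, business_role_users, and deal_active_members are associations directly on Task; adjust the preload hash if any of them are nested through deal):

tasks = Task.preload(:users, :business_role_users, :deal_active_members)
            .where(deal_id: deal.id)

# No queries inside the loop now - everything was loaded up front.
tasks.each do |task|
  task.task_assigned_users
end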
You should move towards a “prove it didn’t N+1” workflow, given that N+1s are such a prevalent problem. A great way to do this is strict_loading. I prefer to use it on an entire relation or just a record, inside a controller, rather than trying to set this “all the time” on a particular relation (this is the only way you currently use it in the codebase).
This would allow you to find N+1s through the raising of exceptions in dev, and in prod you could set strict_loading to just log.
def set_recording
  @recording = Recording.strict_loading
                        .any_file_identity_sourced
                        .find(params[:id])
end
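For the log-in-production half, Rails has a built-in knob (available since Rails 6.1) that downgrades violations from exceptions to log lines:

# config/environments/production.rb
# Report strict_loading violations without raising in front of users.
config.active_record.action_on_strict_loading_violation = :log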
I would be working to insert this everywhere, and particularly when fixing N+1s as a kind of regression test (or just insert it somewhere on a page where you’re struggling with N+1s and then fix every exception that occurs).
In a handful of spots, but most critically in DataTeam::TasksController, you do this:
def set_recording
  @recording = DataTeam::Recording
    .includes(
      task: [:file_identity, taskable_recording: :recordable]
    )
    .where(recordable_type: "DataTeam::Task")
    .find(params[:task_recording_id])
end
Unfortunately, every place this is used, this query takes a few hundred milliseconds. The combination of where and includes turns this into an expensive JOIN across several tables. Rewriting this as a preload with a subquery for the recordable_type will probably lead to better performance.
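Here’s a sketch of that direction (association and column names are taken from the snippet above; whether the type filter needs to move into a subquery depends on which table recordable_type actually lives on):

def set_recording
  @recording = DataTeam::Recording
    # preload issues one simple query per association instead of one
    # giant multi-table JOIN
    .preload(task: [:file_identity, taskable_recording: :recordable])
    # if recordable_type lives on a joined table rather than on recordings,
    # filter by ids from a subquery instead of forcing the join back in
    .where(recordable_type: "DataTeam::Task")
    .find(params[:task_recording_id])
end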
Every request in the application is currently updating the Session record’s updated_at column before proceeding. This has a number of drawbacks, the most immediate being that every single request performs an extra write and COMMIT against your primary database before it can do anything else.
This is a low-impact-but-easy-to-do thing where we’d just record this in Redis to be updated later.
# In ApplicationController or wherever session updates happen
def track_session_activity
  minute_key = "session_updates:#{Time.current.strftime('%Y%m%d%H%M')}"
  Redis.current.sadd(minute_key, session.id)
end

# Background job to flush session updates
class SessionUpdateJob < ApplicationJob
  def perform
    # Find all minute keys that haven't been processed
    keys = Redis.current.keys("session_updates:*")
    return if keys.empty?

    keys.each do |key|
      # Extract timestamp from key (format: session_updates:YYYYMMDDHHMM)
      timestamp_str = key.split(':').last
      time = Time.strptime(timestamp_str, '%Y%m%d%H%M')

      # Get all session IDs for this minute
      session_ids = Redis.current.smembers(key)
      next if session_ids.empty?

      # Use a transaction to ensure we don't lose data
      Session.transaction do
        Session.where(id: session_ids)
               .where("updated_at < ?", time)
               .touch_all(time: time)
        Redis.current.del(key)
      end
    end
  end
end

# Schedule this job to run every minute
# In config/schedule.rb (whenever gem) or similar:
# every 1.minute do
#   runner "SessionUpdateJob.perform_later"
# end
I noticed that the Api::Public::V0 endpoints were taking up a lot of your total time consumed (about 25% of total server time). It’s something in Google Cloud that’s crawling you.
You could of course decide to just ban this crawler.
However, if you want to improve this endpoint, which routinely takes 10+ seconds (!!!), we really need to get rack-mini-profiler into production to figure out what’s going on: this endpoint spends lots of time just spinning in CPU, which is highly unusual. A stacktrace from rack-mini-profiler would help us figure out what Ruby code we’re actually running (and this would be trivial to repro, because the request comes from a crawler without any special state or permissions).
Your React application is having significant problems with multiple requests blocking on each other.
Many views will involve several blocking requests to the Altrio backend API. Request A is made, we wait for it to return, then make Request B, and so on:
Timeline (each dash = ~100ms):
Browser: |----[waiting]----[waiting]----[waiting]----[waiting]----|
0ms 400ms 800ms 1200ms 1600ms
Request 1: GET /api/view_groups/78494
|████████████|
0ms 300ms
↓ (blocks until complete)
Request 2: GET /api/view_groups/78494/overview
|██████████|
300ms 600ms
↓ (blocks until complete)
Request 3: GET /api/deals?view_type=group&view_id=78494&search_query=&page=1
|████████████████|
600ms 1000ms
↓ (blocks until complete)
Request 4: GET /api/report_style
|████████|
1000ms 1200ms
↓ (blocks until complete)
Request 5: GET /static/various-fonts.woff2
|████████|
1200ms 1400ms
Total time: 1400ms (could be ~400ms if parallelized)
The main problem here is that you’re making requests when rendering React components, and those requests are blocking child components from rendering. This is an extremely common issue, and I spent a year at Gusto rolling back this exact problem from their entire frontend.
The most common place this seems to occur is in ObjectList components, which, when rendered, block here waiting for their network request to complete:
const fetchResult = useObjectList({
  // ...
})
It’s not uncommon to see 5+ round trips to the backend like this, one after the other.
Collapsing 5 steps down to 1 step with 5 requests done in parallel would completely revolutionize the performance of these pages. For the key metrics, I’ve decided to focus on /deals/active and the single deal view /deals/active/*, because they’re particularly bad here and represent the vast majority of your traffic.
Unrolling this requires getting good with DevTools and the Performance tab. I’m also seeing what I think is some kind of bug on your site where many requests do not appear in the Network tab. However, the relevant caller information does appear correctly in the Performance tab, so I think it’s fine. I can record a video showing how to figure out where these requests are coming from if that would be helpful.
Next, there are a few minor configuration issues which I think would be useful to fix.
Sidekiq’s Redis databases should be configured to disallow eviction, and caches should be configured to use allkeys-lru. To be honest, I’m not sure what the ElastiCache defaults are here, but these settings are really important, so we shouldn’t rely on whatever the defaults are.
# Cache parameter group - allows LRU eviction when memory is full
resource "aws_elasticache_parameter_group" "cache_params" {
  family = "redis7"
  name   = "altrio-cache-params"

  parameter {
    name  = "maxmemory-policy"
    value = "allkeys-lru"
  }
}
allkeys-lru is important for caches, because we expect caches to evict the least-recently-used key when full. Sidekiq, however, should NEVER evict any jobs, as that would have catastrophic consequences!
Given that it’s not in your terraform and your Puma config file doesn’t have a workers directive, I’m a little concerned that you’re running Puma in single mode (workers 0) in production. That may not be the case if you’re booting Puma with bundle exec puma -w 8 or similar in production, but as far as I can tell from your Docker stuff, you’re using bundle exec rails server, which cannot set Puma’s worker count by itself. You may also be setting the WEB_CONCURRENCY env var somewhere I can’t see, in which case that’s fine and Puma will just read that and use it to set the worker count.
So - let’s make sure we’re running multiple Puma workers in production (equal to the CPU count of the container) and that we’re running at least 4 workers per container/server/box.
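If the worker count really isn’t being set anywhere, a minimal config/puma.rb sketch would be (the numbers here are placeholders; size them to the container as above):

# config/puma.rb
workers Integer(ENV.fetch("WEB_CONCURRENCY", 4))        # one per CPU, minimum 4
threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads threads_count, threads_count
preload_app!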
Your overall database performance is pretty mediocre. I can’t see your TF vars, but if you’re running the default, a t4g, that’s not a great instance type for a production database.
Pretty much all your SQL queries, no matter how big or small, take just over a millisecond to run, which is slower than I’d expect. This is compounded by the fact that you’re using SolidCache, which in some cases is taking 80ms to find a key!
Changing instance type (latest c7i or c8g) might help, but really RDS just isn’t that fast of a vendor. Some of my other clients are having really good success with Planetscale, which uses physically attached SSDs rather than network-attached storage like AWS, greatly improving latency.
If I had to sum your performance problems, I would say “you need to get really good at fixing N+1s”. To that end, I have three tools to install.
You already have this in dev, but RMP is designed to be used in production. I’ve already mentioned in another recommendation how this can be used to solve production issues.
I can put together the PR for this one as I’ve done it dozens of times. It does require a Redis cache store in production, and I’m unsure of the current state of yours given the switch to SolidCache.
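For reference, the production setup I usually do looks roughly like this (a sketch assuming a recent rack-mini-profiler and a REDIS_URL env var; the admin check at the bottom is hypothetical):

# config/initializers/rack_mini_profiler.rb
if Rails.env.production?
  # Only profile requests we explicitly authorize
  Rack::MiniProfiler.config.authorization_mode = :allow_authorized

  # Store results in Redis so they're visible across Puma workers
  Rack::MiniProfiler.config.storage = Rack::MiniProfiler::RedisStore
  Rack::MiniProfiler.config.storage_options = { url: ENV["REDIS_URL"] }
end

# In ApplicationController (hypothetical admin check):
# before_action { Rack::MiniProfiler.authorize_request if current_user&.admin? }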
Prosopite is an excellent N+1 detection library. In dev, it will give you excellent logging when an N+1 is detected, but I like it even better in tests, where it can raise if an N+1 happens.
This allows you to do regression tests when you fix an N+1:
# In spec/rails_helper.rb
require 'prosopite'

RSpec.configure do |config|
  config.before(:each, :n_plus_one_safe) do
    Prosopite.raise = true # raise instead of just logging
    Prosopite.scan
  end

  config.after(:each, :n_plus_one_safe) do
    Prosopite.finish
  end
end

# Example test that ensures no N+1 queries
RSpec.describe "DealsController", type: :request do
  describe "GET /deals/active", :n_plus_one_safe do
    let!(:deals) { create_list(:deal, 5, :with_tasks_and_participants) }

    it "loads the deals page without N+1 queries" do
      get "/deals/active"
      expect(response).to have_http_status(:ok)
      # If any N+1 queries occur, Prosopite will raise an exception and
      # this test will fail, forcing you to fix the N+1 before merging
    end
  end
end
You could also use this prophylactically, like I said with strict_loading. Unlike strict_loading, this works on all types of N+1s, not just those caused by attempting to load a non-preloaded association.
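The prophylactic version is just the pattern from Prosopite’s README: scan every request outside production and let the logs pile up wherever N+1s happen:

# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  unless Rails.env.production?
    around_action :n_plus_one_detection

    def n_plus_one_detection
      Prosopite.scan
      yield
    ensure
      Prosopite.finish
    end
  end
end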
Your current observability stack is missing a lot of important numbers. This is a somewhat complicated topic for teams of your size, because I think there is a big hole in the market for an affordable, easy-to-use tool for small teams which still has all the features I look for. The easy answer is “use Datadog” but the pricing is often out of reach for small teams, and the UI is difficult for most people to interpret and use. For power users like me, it’s a dream, but for people just trying to get by and ship useful stuff without a dedicated SRE or whatnot, it’s overkill.
For each topic, I’m going to list a few options. Hopefully by the end, you’ll have formed an impression as to what direction you want to go with this.
You currently don’t track request queue time. We talked on our call about how you’re going to need to set up a proxy that adds X-Request-Start to the headers of every request, so that Puma can read it in a Rack middleware and calculate the request queue time as “Time.now - X-Request-Start”:
Client Request
↓
┌─────────────────┐
│ Load Balancer │ ← Adds X-Request-Start: 1642774800.123
│ (nginx/thruster)│ (current timestamp in milliseconds)
└─────────────────┘
↓
┌─────────────────┐
│ Puma Server │ ← Receives request with X-Request-Start header
│ │
│ ┌─────────────┐ │
│ │ Rack │ │ ← Middleware calculates:
│ │ Middleware │ │ queue_time = Time.now - X-Request-Start
│ └─────────────┘ │ (e.g., 1642774800.145 - 1642774800.123 = 22ms)
│ ↓ │
│ ┌─────────────┐ │
│ │ Rails App │ │ ← Request processed normally
│ └─────────────┘ │
└─────────────────┘
↓
Response + Queue Time Metric
(logged/sent to monitoring)
Queue Time = Time request was picked up by Puma - Time request entered proxy
This measures how long requests wait in the queue before being processed.
You also want alerting when this value gets too high.
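A sketch of the middleware side (the header parsing covers the two common formats proxies use; the Rails.logger call stands in for whatever metrics client we end up choosing):

# app/middleware/request_queue_time.rb
class RequestQueueTime
  def initialize(app)
    @app = app
  end

  def call(env)
    if (header = env["HTTP_X_REQUEST_START"])
      started_at = header.sub(/^t=/, "").to_f                   # "t=1642774800.123" or raw milliseconds
      started_at /= 1000.0 if started_at > 1_000_000_000_000    # convert ms to seconds
      queue_ms = [(Time.now.to_f - started_at) * 1000.0, 0].max
      Rails.logger.info("request_queue_time_ms=#{queue_ms.round(1)}")
    end
    @app.call(env)
  end
end

Registering it with config.middleware.insert_before 0, RequestQueueTime keeps it ahead of the rest of the stack so the measurement happens as early as possible.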
Here are your options:
You currently have just three queues: high, default, and low. Each of these queues has an SLO, even if you haven’t made it explicit. You have some internal sense of “if a job waits on high_priority for x minutes without being processed, that is Bad”.
I prefer to just name the queue after the SLO: within_1_second, within_5_minutes, etc.
However, in any case, each queue does have an acceptable queue latency, and you should have long-term metrics showing you what those latencies actually are, plus alerting when they go over acceptable bounds.
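For reference, the raw number is easy to get: Sidekiq’s API reports per-queue latency (how many seconds the oldest job has been waiting) directly. A minute-ly reporter could look like this sketch (the logger call is a placeholder for whatever metrics client you choose):

require "sidekiq/api"

Sidekiq::Queue.all.each do |queue|
  # latency = how long the oldest job in this queue has been waiting, in seconds
  Rails.logger.info("sidekiq_queue_latency queue=#{queue.name} seconds=#{queue.latency.round(1)}")
end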
Options:
My standard set of database metrics is:
- CPU (load and utilization)
- IOPs (if limited)
- Read/write latency
- Error rates
- Hitrate (if cache)
You want that for each Redis or SQL store you’re connected to, with monitors on each.
Scout/Sentry can’t do this. Cloudwatch and Datadog certainly can.
I think where I’ve come down, after writing this section, is to drop Scout, use Sentry for APM, and use Cloudwatch in combination with Terraform for metrics and infrastructure monitoring.
Scout just doesn’t really do anything better than Sentry, and I prefer Sentry’s overall tracing product. Also, you’re already using them for frontend perf, so this would allow us to potentially connect frontend and backend traces into a single view.
There’s really nothing Datadog can do for you that Cloudwatch can’t, and you’re already on AWS, so it will cost basically nothing for us to just use more Cloudwatch for the numbers I’ve mentioned.