Hello Altrio team,

Thanks again for bringing me in to work with you. After taking a look at your application, metrics, and monitoring, I’ve settled on a few goals for the rest of our engagement:

  1. Make the site feel faster. Current web transaction times on the backend and frontend are both mediocre and show lots of room for improvement.
  2. Greatly improve observability. We can’t fix what we can’t understand, and the current observability setup leaves a lot to be desired.
  3. Fix up some configuration issues. You have a handful of small issues with the config of your app that have an impact on performance.

To accomplish these goals, I want to track two key metrics over the course of this engagement:

  1. Web transaction 95th percentile latency: currently 1.8 seconds; my target is 1.0 seconds or less.
  2. 50th percentile load time for /deals/active and /deals/active/*: currently about 2.5 seconds; my goal is 1 second or less.

Improving these two metrics will be a great yardstick for measuring our effort.

Web

~95% of your web transaction latency problems are caused by N+1 queries. Each query in your application takes about 1 millisecond at minimum, but most actions run 500+ queries across various tables. The “mode of failure” for a slow page looks something like:

  1. A massive query at the beginning of the transaction, with huge amounts of either preloads or joins (via eager_load).
  2. 100+ N+1 queries (I guess the includes didn’t work!)

Let’s go through some things we can do to fix this, in detail.

Recommendation: Pare back your includes, and refactor away from the #full_preloads method

There’s this pattern in the codebase where you use a method called full_preloads on a model to basically define every single relationship this model has with any other model. You then use this hash almost everywhere in your controller layer when looking up data.

Includes/preload/eager_load are like salt: we almost always need a little, but we only want to add it as it’s called for, not pre-emptively.

Using full_preloads is leading to huge amounts of records being loaded, many of which probably are not even used by the view. Having nested relationships of has_many-s (this has many of that which has many of the other thing and so on) leads to a geometric explosion of records being loaded:

# Example: Deal with 3 layers of nested has_many relationships
Deal.includes({
  tasks: {
    task_assigned_users: :user,
    comments: [:user, :attachments],
    field_values: :field
  },
  deal_participants: {
    user: [:profile, :permissions],
    notes: :attachments
  },
  recordings: {
    recording_participants: :user,
    transcripts: :sections
  }
})

# If you have:
# - 10 deals
# - Each deal has 50 tasks
# - Each task has 5 assigned users, 10 comments, 20 field values
# - Each deal has 8 participants with notes
# - Each deal has 3 recordings with participants
#
# You're loading: 10 + (10*50) + (10*50*5) + (10*50*10) + (10*50*20) +
#                 (10*8) + (10*3) = 10 + 500 + 2,500 + 5,000 + 10,000 +
#                 80 + 30 = ~18,000+ records for what might have been a
#                 simple list view

If we don’t need every single one of those records (and if we do, then perhaps our view is overloaded with responsibilities?), then we’re loading huge amounts of wasted data.

Instead, we should always define exactly what records are needed for the view we’re rendering right now. That’s a controller-level concern.
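
As a sketch of what that looks like (the controller action and ordering here are illustrative, but deal_participants and tasks are associations taken from your own full_preloads example above):

class DealsController < ApplicationController
  def index
    # Preload only the associations this particular list view renders,
    # instead of reaching for the model-wide full_preloads hash.
    @deals = Deal.includes(:deal_participants, :tasks)
                 .order(created_at: :desc)
  end
end

Each action gets its own minimal preload list, and when the view changes, the list changes with it.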

Recommendation: Avoid find_by, where, and find in model instance methods.

I have a rule of thumb: almost any instance method on a model (i.e. in the models directory or inheriting from ActiveRecord::Base) will eventually be called in a loop. That means if running that method causes a query every time it’s called, it will also eventually cause an N+1. Many of your N+1s in production are from this antipattern.

Take for example Task#task_assigned_users.

def task_assigned_users
  potential_users = users.presence || business_role_users

  deal_active_members.where(
    id: potential_users.select(:id)
  ).ordered_by_name
end

The where call causes a query every time this method is run. There are several ActiveRecord query methods which cause a query every single time they are run, and there is no way to use includes, preload, or eager_load to avoid this: includes can only prevent the lazy-loading of an association; it cannot prevent query methods from running.

The most common offenders, which in my estimation should never be used in model instance methods, are:

- where
- find and find_by
- pluck, exists?, and count

To rewrite this method so that it cannot N+1, it must work only in terms of associations, and never call these query methods.

def task_assigned_users
  # both of these are associations, can be preloaded
  potential_users = users.presence || business_role_users
  potential_user_ids = potential_users.map(&:id) # select() would cause a query

  # Filter deal_active_members using enumerable methods instead of where()
  deal_active_members.select { |member|
    potential_user_ids.include?(member.id)
  }.sort_by { |member|
    [member.first_name, member.last_name]
  }
end

This method now doesn’t N+1 if users, business_role_users, and deal_active_members have been preloaded.

Recommendation: Use strict_loading

You should move towards a “prove it didn’t N+1” workflow, given that N+1s are such a prevalent problem. A great way to do this is strict_loading. I prefer to apply it to an entire relation or a single record inside a controller, rather than setting it permanently on a particular association (which is the only way you currently use it in the codebase).

This lets you find N+1s via raised exceptions in dev, while in prod you can configure strict_loading violations to just log.

def set_recording
  @recording = Recording.strict_loading.any_file_identity_sourced
                        .find(params[:id])
end

I would work to insert this everywhere, particularly when fixing N+1s, as a kind of regression test (or just add it to a page where you’re struggling with N+1s and then fix every exception that occurs).
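
For the “just log in production” piece, here’s a minimal sketch of the relevant Rails configuration (this setting exists as of Rails 6.1):

# config/environments/production.rb
# Log strict_loading violations instead of raising
config.active_record.action_on_strict_loading_violation = :log

# config/environments/development.rb
# Raise in dev so N+1s are impossible to miss
config.active_record.action_on_strict_loading_violation = :raise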

Recommendation: Change how set_recording works

In a handful of spots, but most critically in DataTeam::TasksController, you do this:

def set_recording
  @recording = DataTeam::Recording
    .includes(
      task: [:file_identity, taskable_recording: :recordable]
    )
    .where(recordable_type: "DataTeam::Task")
    .find(params[:task_recording_id])
end

Unfortunately, every place this is used, this query takes a few hundred milliseconds. The combination of where and includes turns this into an expensive JOIN across several tables.

Rewriting this as a preload with a subquery for the recordable_type will probably lead to better performance.
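
Something along these lines, as a sketch (worth verifying which table recordable_type actually lives on, and checking the resulting query plan, before shipping it):

def set_recording
  @recording = DataTeam::Recording
    .preload(
      task: [:file_identity, { taskable_recording: :recordable }]
    )
    .where(recordable_type: "DataTeam::Task")
    .find(params[:task_recording_id])
end

preload loads each association with its own small query instead of letting includes fold everything into one giant JOIN, and the where stays a simple filter on the recordings table (or becomes a subquery, if the column turns out to live elsewhere).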

Recommendation: Update session records outside the request

Every request in the application is currently updating the Session record’s updated_at column before proceeding. This has a number of drawbacks:

  1. We now have a write on every GET. In the future, it might be nice to move GET requests entirely to a read replica, for example. Can’t do that if your GETs do data updates.
  2. It’s not the fastest thing. It takes 10-20ms to do, including the COMMIT.
  3. We could batch these updates in a gather->flush pattern (anything heavier than that would be over-engineering at this stage, given how low your overall load is, but a simple version is cheap)

This is a low-impact-but-easy-to-do change: we’d just record the activity in Redis and flush it to the database later.

# In ApplicationController or wherever session updates happen
def track_session_activity
  minute_key = "session_updates:#{Time.current.strftime('%Y%m%d%H%M')}"
  Redis.current.sadd(minute_key, session.id)
end

# Background job to flush session updates
class SessionUpdateJob < ApplicationJob
  def perform
    current_key = "session_updates:#{Time.current.strftime('%Y%m%d%H%M')}"

    # Find all minute keys that haven't been processed
    keys = Redis.current.keys("session_updates:*")
    return if keys.empty?

    keys.each do |key|
      # Skip the current minute's key: it may still be receiving writes
      next if key == current_key

      # Extract timestamp from key (format: session_updates:YYYYMMDDHHMM)
      timestamp_str = key.split(':').last
      time = Time.strptime(timestamp_str, '%Y%m%d%H%M')

      # Get all session IDs for this minute
      session_ids = Redis.current.smembers(key)
      next if session_ids.empty?

      # Bump only rows that are actually behind, then drop the processed key
      Session.where(id: session_ids)
             .where("updated_at < ?", time)
             .touch_all(time: time)
      Redis.current.del(key)
    end
  end
end

# Schedule this job to run every minute
# In config/schedule.rb (whenever gem) or similar:
# every 1.minute do
#   runner "SessionUpdateJob.perform_later"
# end

Recommendation: Figure out what’s hogging CPU in the public API (it’s crawler traffic)

I noticed that the Api::Public::V0 endpoints account for a lot of your total time consumed (about 25% of total server time). The traffic is coming from something in Google Cloud that’s crawling you.

You could of course decide to just ban this crawler.

However, if you want to improve this endpoint, which routinely takes 10+ seconds (!!!), we really need to get rack-mini-profiler into production to figure out what’s going on. This endpoint spends lots of time just spinning in CPU, which is highly unusual. A stack trace from rack-mini-profiler would show us what Ruby code we’re actually running, and reproducing the request would be trivial, because it comes from a crawler without any special state or permissions.

Frontend

Your React application has a significant problem with requests blocking on each other.

Many views involve several blocking requests to the Altrio backend API. Request A is made, we wait for it to return, then make Request B, and so on:

Timeline (each dash = ~100ms):

Browser:  |----[waiting]----[waiting]----[waiting]----[waiting]----|
          0ms              400ms         800ms        1200ms      1600ms

Request 1: GET /api/view_groups/78494
          |████████████|
          0ms         300ms
                      ↓ (blocks until complete)

Request 2: GET /api/view_groups/78494/overview
                      |██████████|
                      300ms     600ms
                                ↓ (blocks until complete)

Request 3: GET /api/deals?view_type=group&view_id=78494&search_query=&page=1
                                |████████████████|
                                600ms           1000ms
                                                ↓ (blocks until complete)

Request 4: GET /api/report_style
                                                |████████|
                                                1000ms  1200ms
                                                        ↓ (blocks until complete)

Request 5: GET /static/various-fonts.woff2
                                                        |████████|
                                                        1200ms  1400ms

Total time: 1400ms (could be ~400ms if parallelized)

The main problem here is that you’re making requests when rendering React components, and those requests are blocking child components from rendering. This is an extremely common issue, and I spent a year at Gusto rolling back this exact problem from their entire frontend.

The most common place this seems to occur is in ObjectList components, which, when rendered, block here waiting for their network request to complete:

const fetchResult = useObjectList({
  // ...
})

It’s not uncommon to see 5+ round trips to the backend like this, one after the other.

Collapsing 5 sequential steps into 1 step with the 5 requests made in parallel would completely revolutionize the performance of these pages. For the key metrics I’ve decided to focus on /deals/active and the single-deal view /deals/active/*, because they’re particularly bad here and represent the vast majority of your traffic.

Unrolling this requires getting good with DevTools and the Performance tab. I’m also seeing what I think is some kind of bug on your site where many requests do not appear in the Network tab; the relevant caller information does show up correctly in the Performance tab, though, so we can still trace where requests come from. I can record a video showing how to do this if it would be helpful.

Configuration

Next, there are a few minor configuration issues which I think would be useful to fix.

Recommendation: Set eviction policies for ElastiCache in the parameter groups for these resources

Sidekiq’s Redis should be configured to never evict keys (noeviction), and caches should be configured to use allkeys-lru. To be honest, I’m not sure what the ElastiCache defaults are here, but these settings are really important, so we shouldn’t rely on whatever the defaults are.

# Cache parameter group - allows LRU eviction when memory is full
resource "aws_elasticache_parameter_group" "cache_params" {
  family = "redis7"
  name   = "altrio-cache-params"

  parameter {
    name  = "maxmemory-policy"
    value = "allkeys-lru"
  }
}

allkeys-lru is important for caches, because we expect a cache to evict the least-recently-used keys when it’s full. Sidekiq’s Redis, however, should NEVER evict any jobs, as that would have catastrophic consequences; it should get noeviction instead!

Recommendation: Set Puma worker count

Given that it’s not in your terraform and your Puma config file doesn’t have a workers directive, I’m a little concerned that you’re running Puma in single mode (workers 0) in production. That may not be the case if you’re booting Puma with bundle exec puma -w 8 or similar in production, but as far as I can tell from your Docker stuff, you’re using bundle exec rails server, which cannot set Puma’s worker count by itself. You may also be setting the WEB_CONCURRENCY env var somewhere I can’t see, in which case that’s fine and Puma will just read that and use it to set the worker count.

So - let’s make sure we’re running multiple Puma workers in production (equal to the CPU count of the container) and that we’re running at least 4 workers per container/server/box.
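
A sketch of the relevant config/puma.rb lines (the defaults of 4 workers and 3 threads here are assumptions; tune them to the container’s CPU count):

# config/puma.rb
workers Integer(ENV.fetch("WEB_CONCURRENCY", 4))

threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 3))
threads threads_count, threads_count

# Boot the app before forking so workers share memory via copy-on-write
preload_app!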

Recommendation: Consider increasing database size or changing vendor

Your overall database performance is pretty mediocre. I can’t see your TF vars, but if you’re running the default, a t4g, that’s not a great instance type for a production database.

Pretty much all of your SQL queries, no matter how big or small, take just over a millisecond to run, which is slower than average. This is compounded by the fact that you’re using SolidCache, which in some cases takes 80ms to find a key!

Changing instance type (latest c7i or c8g) might help, but really, RDS just isn’t that fast a vendor. Some of my other clients are having really good success with PlanetScale, which uses physically attached SSDs rather than network-attached storage like AWS’s EBS, greatly improving latency.

Rails Observability

If I had to sum up your performance problems, I would say “you need to get really good at fixing N+1s”. To that end, here are the tools I’d install.

Recommendation: Use Rack::MiniProfiler in production

You already have this in dev, but RMP is designed to be used in production. I’ve already mentioned in another recommendation how this can be used to solve production issues.

I can put together the PR for this one as I’ve done it dozens of times. It does require a Redis cache store in production, and I’m unsure of the current state of yours given the switch to SolidCache.
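
For reference, the production setup is roughly this (a sketch, assuming REDIS_URL points at a Redis instance RMP can use for its storage):

# config/initializers/rack_mini_profiler.rb
if Rails.env.production?
  # Only render the profiler for requests we explicitly authorize
  Rack::MiniProfiler.config.authorization_mode = :allow_authorized

  # RMP needs shared storage when running multiple Puma workers
  Rack::MiniProfiler.config.storage_options = { url: ENV["REDIS_URL"] }
  Rack::MiniProfiler.config.storage = Rack::MiniProfiler::RedisStore
end

With :allow_authorized, nothing renders unless you call Rack::MiniProfiler.authorize_request for the request (e.g. in a before_action limited to staff accounts).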

Recommendation: Use Prosopite in development and tests

Prosopite is an excellent N+1 detection library. In dev, it will give you excellent logging for when an N+1 is detected, but I like it even better in tests, where it can raise if an N+1 happens.
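
In development, the setup is the around_action pattern from Prosopite’s README, scanning every request outside production (a sketch):

# config/initializers/prosopite.rb
Prosopite.rails_logger = true unless Rails.env.production?

# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  unless Rails.env.production?
    around_action :n_plus_one_detection

    def n_plus_one_detection
      Prosopite.scan
      yield
    ensure
      Prosopite.finish
    end
  end
end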

Using it in tests lets you write a regression test when you fix an N+1:

# In spec/rails_helper.rb
require 'prosopite'

RSpec.configure do |config|
  config.before(:suite) do
    Prosopite.raise = true # raise instead of just logging when an N+1 is found
  end

  config.before(:each, :n_plus_one_safe) do
    Prosopite.scan
  end

  config.after(:each, :n_plus_one_safe) do
    Prosopite.finish
  end
end

# Example test that ensures no N+1 queries
RSpec.describe "DealsController", type: :request do
  describe "GET /deals/active", :n_plus_one_safe do
    let!(:deals) {
      create_list(:deal, 5, :with_tasks_and_participants)
    }

    it "loads the deals page without N+1 queries" do
      get "/deals/active"

      expect(response).to have_http_status(:ok)
      # If any N+1 queries occur, Prosopite will raise an exception
      # This test will fail, forcing you to fix the N+1 before merging
    end
  end
end

You could also use this prophylactically, like I said with strict_loading. Unlike strict_loading, this works on all types of N+1s, not just those caused by attempting to load a non-preloaded association.

Observability

Your current observability stack is missing a lot of important numbers. This is a somewhat complicated topic for teams of your size, because I think there is a big hole in the market for an affordable, easy-to-use tool for small teams which still has all the features I look for. The easy answer is “use Datadog” but the pricing is often out of reach for small teams, and the UI is difficult for most people to interpret and use. For power users like me, it’s a dream, but for people just trying to get by and ship useful stuff without a dedicated SRE or whatnot, it’s overkill.

For each topic, I’m going to list a few options. Hopefully by the end, you’ll have formed an impression as to what direction you want to go with this.

Recommendation: Track request queue time, and alert on it

You currently don’t track request queue time. We talked on our call about how you’re going to need to set up a proxy that adds X-Request-Start to the headers of every request, such that a Rack middleware in the Puma/Rails stack can read it and calculate the request queue time as “Time.now - X-Request-Start” (a sketch of such a middleware follows the diagram):

Client Request
      ↓
┌─────────────────┐
│ Load Balancer   │ ← Adds X-Request-Start: 1642774800.123
│ (nginx/thruster)│   (Unix timestamp in seconds, ms precision)
└─────────────────┘
      ↓
┌─────────────────┐
│   Puma Server   │ ← Receives request with X-Request-Start header
│                 │
│ ┌─────────────┐ │
│ │ Rack        │ │ ← Middleware calculates:
│ │ Middleware  │ │   queue_time = Time.now - X-Request-Start
│ └─────────────┘ │   (e.g., 1642774800.145 - 1642774800.123 = 22ms)
│        ↓        │
│ ┌─────────────┐ │
│ │ Rails App   │ │ ← Request processed normally
│ └─────────────┘ │
└─────────────────┘
      ↓
  Response + Queue Time Metric
  (logged/sent to monitoring)

Queue Time = Time request was picked up by Puma - Time request entered proxy
This measures how long requests wait in the queue before being processed.
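
Here’s roughly what that middleware could look like (a sketch: it assumes an nginx-style header value like “t=1642774800.123” and just logs the number; swap the log line for whatever reporting backend you land on):

# lib/request_queue_time.rb
class RequestQueueTime
  def initialize(app)
    @app = app
  end

  def call(env)
    if (header = env["HTTP_X_REQUEST_START"])
      started_at = header.sub(/\At=/, "").to_f
      queue_ms = ((Time.now.to_f - started_at) * 1000).round
      # Guard against clock skew and malformed headers
      Rails.logger.info("request_queue_time_ms=#{queue_ms}") if queue_ms.between?(0, 60_000)
    end
    @app.call(env)
  end
end

Register it near the top of the stack with config.middleware.insert_before 0, RequestQueueTime; since middleware is loaded at boot, it lives in lib/ and gets required from config/application.rb.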

You also want alerting when this value gets too high.

Here are your options:

  1. Do this in Scout. Scout does allow setting up alerts when average request queue time exceeds a certain value. If you start adding the X-Request-Start header, no additional config is needed and you just set the alert and go.
  2. Do this yourself in Cloudwatch. You can also report this yourself to Cloudwatch and alert there.
  3. Do this in Datadog. Datadog would work the same way as Scout here.

Recommendation: Track queue latency for every queue, and alert if it goes above acceptable bounds

You currently have just three queues: high, default, and low. Each of these queues has an SLO, even if you haven’t made it explicit. You have some internal sense of “if a job waits on high for x minutes without being processed, that is Bad”.

I prefer to just name the queue after the SLO: within_1_second, within_5_minutes, etc.
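
So instead of guessing what “high” promises, each job declares its own deadline (the job class names here are hypothetical):

class SendDealDigestJob < ApplicationJob
  queue_as :within_5_minutes
end

class RecalculateDealMetricsJob < ApplicationJob
  queue_as :within_1_hour
end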

However you name them, each queue does have an acceptable queue latency, and you should have long-term metrics showing what those latencies actually are, plus alerts that fire when a queue goes over its acceptable bound.

Options:

  1. Scout doesn’t cut it. You can only see the max latency for a job over a particular period, and you can only alert on a per job or global basis, which isn’t helpful. We care about latency on the queue.
  2. You could report it yourself to Cloudwatch. Works fine (see the sketch after this list).
  3. You could use Datadog. You would still have to write your own queue latency reporting middleware, same as Cloudwatch. So not much different to Cloudwatch in that respect. Both also have SLOs.
  4. Sentry does have a Queue product here, but I’ve never used it and I’ve found Sentry’s alerting/monitoring capability about as sophisticated as Scout.
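
The “report it yourself” option is only a few lines. Here’s a sketch using Sidekiq’s queue API and the aws-sdk-cloudwatch gem (the namespace and metric names are placeholders):

# Run this every minute from a scheduled job or clock process
require "sidekiq/api"
require "aws-sdk-cloudwatch"

class QueueLatencyReporter
  def self.report
    cloudwatch = Aws::CloudWatch::Client.new

    metric_data = Sidekiq::Queue.all.map do |queue|
      {
        metric_name: "QueueLatencySeconds",
        dimensions: [{ name: "Queue", value: queue.name }],
        value: queue.latency, # seconds the oldest job in this queue has been waiting
        unit: "Seconds"
      }
    end

    cloudwatch.put_metric_data(namespace: "Altrio/Sidekiq", metric_data: metric_data)
  end
end

From there, one Cloudwatch alarm per queue (latency above its SLO for a few consecutive minutes) gives you the alerting.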

Recommendation: Set up database dashboards and alerts

My standard set of database metrics is:

- CPU (load and utilization)
- IOPS (if limited)
- Read/write latency
- Error rates
- Hit rate (for caches)

You want that for each Redis or SQL store you’re connected to, with monitors on each.

Scout/Sentry can’t do this. Cloudwatch and Datadog certainly can.

Recommendation: Drop Scout; use Sentry for APM, and Cloudwatch with Terraform for infrastructure

I think where I’ve come down, after writing this section, is to drop Scout, use Sentry for APM, and use Cloudwatch in combination with Terraform for metrics and infrastructure monitoring.

Scout just doesn’t really do anything better than Sentry, and I prefer Sentry’s overall tracing product. Also, you’re already using Sentry for frontend perf, so this would let us connect frontend and backend traces into a single view.

There’s really nothing Datadog can do for you that Cloudwatch can’t, and you’re already on AWS, so it will cost basically nothing for us to just use more Cloudwatch for the numbers I’ve mentioned.