Hello Huntregs team,
Thanks again for bringing me on to look at your setup. Overall, we’ve got a very simple app that needs a little “operational” love when it comes to improving reliability and robustness to changes in usage and load.
The app itself is extremely simple: just 10,000 lines of Ruby. That’s great! Most apps I have to work with are far more complex, and for the limited size of your full-time team, I think having the app be a reasonable size like this is an incredible asset. Most teams would be buried under ~5 years of tech debt and ~100k+ lines of app by this point, and you’re not. Awesome!
Your load is also very low. The Wordpress site sees 4,000 views a day on a big day, and based on Sentry and your sample rate, you’re probably in the ~100 requests/minute range for the Rails application. This is also a great place to be in, as you have so much room to scale vertically and horizontally!
I view retainer engagements as a long-term focus on a particular set of metrics. I said I would propose a set of metrics to be our yardsticks for this effort. Here they are:
We would track these SLOs on Cloudwatch.
There are other, secondary metrics which kind of by definition have to be “good” in order for these metrics to succeed, but these high-level metrics “contain” those secondary metrics within them. For example, we don’t need a separate database uptime metric, because if your database goes down for 3 hours straight, these SLOs will be toast for the quarter.
I’ve arranged my recommendations here based on which of those SLOs each one best applies to.
Your web response times in general look great. The app is simple enough that you don’t need major overhauls to get lower response times. My recommendations here focus on other issues as a result.
Ruby 3.2 and up are where YJIT started to get really effective, and YJIT is just a free 20-30% speed improvement for most Rails apps. If you use Ruby 3.3+ with the Rails 7.2 framework defaults (or newer), you get YJIT turned on automatically with no other changes.
The app is so small that I don’t anticipate this being a difficult upgrade.
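If the Ruby upgrade lands before the framework-defaults bump, a tiny initializer can flip YJIT on in the meantime. This is a minimal sketch (the file name is mine): it’s a no-op on Rubies that lack RubyVM::YJIT.enable, which arrived in 3.3 (on 3.2 you’d set RUBY_YJIT_ENABLE=1 in the environment instead).

```ruby
# config/initializers/enable_yjit.rb
# Turn YJIT on explicitly until the Rails framework defaults do it for us.
if defined?(RubyVM::YJIT) && RubyVM::YJIT.respond_to?(:enable) && !RubyVM::YJIT.enabled?
  RubyVM::YJIT.enable
end
```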
The app has a lot of very strict pessimistic gem version requirements that have resulted in a number of dependencies getting very, very far out of date.
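To illustrate what I mean about the constraints (the pg lines here are hypothetical, not copied from your Gemfile):

```ruby
# Gemfile
# Overly strict: "~> 1.2.0" allows only 1.2.x, so Bundler can never
# move to 1.3, 1.6, etc. without a Gemfile edit.
gem "pg", "~> 1.2.0"

# Looser but still pessimistic: "~> 1.6" allows any 1.x from 1.6 up,
# and only blocks a future 2.0.
gem "pg", "~> 1.6"
```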
Here are the most important ones that I recommend upgrading immediately for performance reasons:
Here’s the remaining list of all dependencies where your current version is 3 or more years older than the latest available version:
| Name | Current Version | Latest Version | Age |
|---|---|---|---|
| ast | 2.4.2 | 2.4.3 | 4y55d |
| bootstrap | 4.3.1 | 5.3.5 | 6y68d |
| diff-lcs | 1.5.0 | 1.6.2 | 3y140d |
| et-orbi | 1.2.7 | 1.4.0 | 3y203d |
| faraday-follow_redirects | 0.3.0 | 0.4.0 | 3y183d |
| ffi-compiler | 1.0.1 | 1.3.2 | 7y251d |
| http-2 | 0.11.0 | 1.1.1 | 4y100d |
| http-cookie | 1.0.5 | 1.1.0 | 3y125d |
| humanize | 2.5.1 | 3.1.0 | 3y129d |
| jbuilder | 2.11.5 | 2.14.1 | 3y235d |
| job-iteration | 1.3.6 | 1.11.0 | 3y127d |
| jquery-ui-rails | 6.0.1 | 8.0.0 | 8y151d |
| llhttp-ffi | 0.4.0 | 0.5.1 | 3y184d |
| matrix | 0.4.2 | 0.4.3 | 4y0d |
| multi_xml | 0.6.0 | 0.7.2 | 8y147d |
| pdf-core | 0.9.0 | 0.10.0 | 3y132d |
| pg | 1.2.2 | 1.6.2 | 5y238d |
| prawn | 2.4.0 | 2.5.0 | 3y64d |
| puma | 3.12.6 | 7.1.0 | 5y151d |
| rails-assets-jquery | 3.7.1 | 2.2.4 | unknown |
| rb-inotify | 0.10.1 | 0.11.1 | 4y147d |
| roo | 2.8.3 | 3.0.0 | 5y241d |
| rubyzip | 2.3.2 | 3.2.0 | 4y101d |
| semantic_range | 3.0.0 | 3.1.0 | 3y242d |
| spring | 2.1.1 | 4.4.0 | 4y348d |
| spring-watcher-listen | 2.0.1 | 2.1.0 | 5y358d |
| ttfunk | 1.7.0 | 1.8.0 | 3y66d |
| uglifier | 4.2.0 | 4.2.1 | 4y363d |
Again, your app is so small that I think you should be more on top of these dependency updates than you are. Upgrading these will only get more difficult if/as the app grows.
This is a good fit for Speedshop to implement.
You should be measuring the time between nginx seeing a request and when it first gets processed by Puma. This queue time is the time that will increase if your ECS task count is too low. It should be your primary scaling signal.
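Here’s a rough sketch of how that measurement can work. It assumes nginx is configured to stamp requests with something like proxy_set_header X-Request-Start "t=${msec}"; the middleware name and the logging call are mine, and you’d swap the log line for whatever actually emits metrics.

```ruby
# Rough sketch: compute how long each request waited between nginx and Puma.
class RequestQueueTime
  def initialize(app)
    @app = app
  end

  def call(env)
    header = env["HTTP_X_REQUEST_START"].to_s # e.g. "t=1700000000.123"
    if (started_at = header[/\d+\.\d+/])
      queue_ms = [(Time.now.to_f - started_at.to_f) * 1000, 0].max
      # Hand this off to whatever emits metrics (CloudWatch, StatsD, logs...).
      Rails.logger.info("request_queue_time_ms=#{queue_ms.round(1)}")
    end
    @app.call(env)
  end
end

# Registered near the top of the stack, e.g.:
#   config.middleware.insert_before 0, RequestQueueTime
```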
You will get better performance out of 1 ECS task running Puma in cluster mode with 4 workers and 4 CPUs than out of 4 tasks with 1 worker/1 CPU each. The reason is rooted in queueing theory, but the upshot is that it lets you run fewer servers for the same load. Doing this will allow you to realize lower request queue times for the same server spend.
My recommendation is to run 4 workers per task and allocate 4096 CPU units (4 vCPU) to each task. Minimum task count can be 2.
RAILS_MAX_THREADS is probably too high (should be 3)
The correct setting for RAILS_MAX_THREADS is based on your workload. We don’t have any data for your app specifically as to how much time is spent waiting on I/O, but, having done extensive testing on this, I can tell you that most Rails apps should be using a setting of 3 (yours is 10).
Having this set too high leads to overly slow response times under load. Having it set too low would lead to higher server costs. 3 is the right compromise for 99% of apps.
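Putting the worker and thread recommendations together, the relevant lines of config/puma.rb would look something like this (a sketch; the env var names are the common conventions, not necessarily what your task definitions set today):

```ruby
# config/puma.rb (relevant lines only)
workers Integer(ENV.fetch("WEB_CONCURRENCY", 4))          # 4 processes per 4-vCPU task
max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 3))  # 3 threads per process
threads max_threads, max_threads
preload_app!
```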
This endpoint is one of the few truly poor-performing endpoints in the entire app.
We need to turn on profiling in Sentry (using Vernier) to understand why some of these traces take so long. A profile will show us exactly what the stack looks like for these ~2 second requests you see so frequently. It could be anything, and a profile will tell us exactly what it is.
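For reference, turning this on is a small change to the existing Sentry initializer, roughly like the sketch below (double-check the option names against the current sentry-ruby docs; the sample rates are placeholders, and Vernier also needs the vernier gem and Ruby 3.2.1+, which is another nudge toward the Ruby upgrade above):

```ruby
# config/initializers/sentry.rb (additions, roughly)
Sentry.init do |config|
  # ...existing DSN and settings...
  config.traces_sample_rate = 0.1   # profiles are only collected for traced requests
  config.profiles_sample_rate = 0.1 # fraction of traced requests to profile
  config.profiler_class = Sentry::Vernier::Profiler
end
```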
From a latency perspective, this controller has a problem with making three somewhat slow HTTP requests to external providers, serially:
It may be possible to parallelize these using threads, but an even better approach (which has some benefits for scalability, discussed later) would be to move them into background jobs. This controller is one of the slower ones in the app (slower even than the Journals create endpoint), so the cost/benefit feels worth it to me.
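To make that concrete, here’s a rough sketch of what moving one of those calls into a job could look like. Every name here (FetchWeatherJob, WeatherClient, the weather_data column) is illustrative, not something that exists in your app today:

```ruby
# Illustrative only: job, client, and attribute names are made up.
class FetchWeatherJob
  include Sidekiq::Job

  def perform(journal_id)
    journal = Journal.find(journal_id)
    conditions = WeatherClient.current_conditions(journal.latitude, journal.longitude)
    journal.update!(weather_data: conditions)
  end
end

# In the controller, instead of calling the provider inline:
FetchWeatherJob.perform_async(journal.id)
```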
I tend to focus on queue times, because for Sidekiq, service time (the time to actually run the job) is not difficult to manage. Simply look at Sentry for jobs which routinely take ~1 minute or more to execute, and if any do, split them up into many smaller jobs (or use iterable jobs: Sidekiq::Job::Iterable, or the job-iteration gem you already have). Not too hard!
Queue times are the big thing to manage. When your app suddenly gets a bunch of traffic, queue times can easily balloon out of control. This is the thing you need visibility on.
This is a good issue for Speedshop to implement.
You currently don’t have a way to look back historically on queue latency. You can look at the Sidekiq dashboard to understand it moment to moment, but what about when no one’s looking? You should report this to Cloudwatch so you can understand how frequently your queues are violating their expected latencies.
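A rough sketch of what that reporting could look like (the namespace, metric name, and schedule are mine to pick; run it every minute or so via whatever scheduler you prefer):

```ruby
require "sidekiq/api"
require "aws-sdk-cloudwatch"

# Illustrative: push each queue's latency (seconds the oldest job has waited)
# to CloudWatch so you can graph and alarm on it historically.
class ReportQueueLatencyJob
  include Sidekiq::Job # Sidekiq::Worker on older Sidekiq versions

  def perform
    cloudwatch = Aws::CloudWatch::Client.new
    Sidekiq::Queue.all.each do |queue|
      cloudwatch.put_metric_data(
        namespace: "Huntregs/Sidekiq", # made-up namespace
        metric_data: [{
          metric_name: "QueueLatencySeconds",
          dimensions: [{ name: "QueueName", value: queue.name }],
          value: queue.latency,
          unit: "Seconds"
        }]
      )
    end
  end
end
```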
Similar to my comment about RAILS_MAX_THREADS on the web side, we believe the best default for Sidekiq thread count is 10. You’re currently running it at only 2, and it seems like RAILS_MAX_THREADS is not being read by the Sidekiq start script you’re using. I’d drop the concurrency setting from sidekiq.yml and use RAILS_MAX_THREADS to configure Sidekiq concurrency. I’m also slightly worried about what your database pool size is being set to in Sidekiq currently (the setup seems to assume concurrency is configured with RAILS_MAX_THREADS and not manually via config files).
within_10_minutes
Sidekiq workloads are defined by how quickly they need to be started. Currently, you’re operating with just one queue. That’s fine. However, I’m sure you’ve got at least one job in there which cannot wait. That means the SLO of your entire queue is effectively 0, or put another way “the default queue is the ASAP queue”.
That’s actually a great place to start, but I think you’ve probably also got work that can wait up to 10 minutes or so to be started. This is useful, because you can autoscale to meet that kind of demand. An ASAP queue needs capacity already online to meet its SLO goal, which means it needs high minimum task counts. That’s expensive!
As you grow, you will continue to add more queues and SLO buckets (within an hour, within 24 hours, etc).
So, concretely:
- Treat your current default queue as the within_0_seconds (ASAP) queue.
- Add a within_10_minutes queue. Move some jobs to this queue (see the sketch below).
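Opting a job into the slower queue is one line of sidekiq_options (the job name below is illustrative); you’d also add within_10_minutes to the queue list the Sidekiq process works, weighted below within_0_seconds.

```ruby
# Illustrative job name; the queue assignment is the important line.
class RegulationSyncJob
  include Sidekiq::Job

  sidekiq_options queue: "within_10_minutes"

  def perform(*args)
    # work that can tolerate up to ~10 minutes of queue latency
  end
end
```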
A Sidekiq process can’t use more than 1 CPU core at a time. You currently have 2 cores allocated to these tasks, so the second core is simply wasted.
This was already discussed in a Shortcut issue.
One of the few ways that Huntregs can fail is due to the failure of an external service provider. This may be for any number of reasons - maybe they’re just having a bad day and decide to rate limit you for no good reason. You need to be resilient to 3rd-party failure of your two non-AWS external APIs: weather and geocoding.
The best possible path forward for resiliency would be to never call these APIs during a web request. Always do it in a background job. That leads your frontend to the following pattern: the web request enqueues a job and returns immediately, the job talks to the external provider and stores the result, and the client polls a lightweight endpoint until that result is ready.
Polling sounds old fashioned, but in fact it can scale extremely well at this size of app. Each polling request is extremely cheap (~10ms) because until the result is ready, there’s no data to respond with!
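A sketch of the polling half, under the same made-up names as the job sketch earlier:

```ruby
# Illustrative: returns 202 until the background job has stored its result.
class WeatherStatusController < ApplicationController
  def show
    journal = Journal.find(params[:journal_id])
    if journal.weather_data.present?
      render json: journal.weather_data
    else
      head :accepted # "not ready yet, poll again shortly"
    end
  end
end
```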
There’s also a Shortcut story around “moving more geocoding calculation from Ruby to the database”. I want to disagree with this idea.
Your database is not (easily) horizontally scalable. Your Ruby compute is. Moving these CPU-bound calculations from Ruby to the DB makes them harder to scale.
They may be faster on Postgres, but I’d want to see a really, really good benchmark result before I started moving in this direction.
Rails and Ruby in general don’t really need to be involved in uploading assets to S3. You should use token-based workflows to just let the client upload directly to S3 and cut the Rails app out of it entirely. Big uploads of ~5mb+ pictures and other assets are no fun to scale and deal with. Let Amazon do that for you.
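Here’s a sketch of the token/presigned-URL approach using the aws-sdk-s3 gem (the bucket name and key layout are made up; if you’re on Active Storage, its built-in direct uploads accomplish the same thing):

```ruby
require "aws-sdk-s3"
require "securerandom"

# Returns a short-lived URL the client can PUT the file to directly,
# so the upload never passes through Puma at all.
def presigned_upload_url(filename)
  object = Aws::S3::Resource.new
             .bucket("huntregs-uploads") # made-up bucket name
             .object("uploads/#{SecureRandom.uuid}/#{filename}")
  object.presigned_url(:put, expires_in: 600) # valid for 10 minutes
end
```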
When you have a critical dependency on an external API provider like this, you need to use a circuit breaker pattern to decouple yourself from that provider should they go down.
A circuit breaker allows you to make the “error” path when the entire provider goes down much faster, which means that if your weather API provider goes down, your site doesn’t go down.
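The sketch below is purely illustrative of the idea: after enough consecutive failures the breaker “opens” and fails instantly for a cool-off period instead of waiting on timeouts. A production version needs shared state across your Puma and Sidekiq processes (usually Redis); gems like stoplight, circuitbox, or semian implement this properly.

```ruby
# Illustrative, in-process-only circuit breaker.
class SimpleCircuitBreaker
  CircuitOpenError = Class.new(StandardError)

  def initialize(threshold: 5, cool_off: 60)
    @threshold = threshold # consecutive failures before the circuit opens
    @cool_off = cool_off   # seconds to fail fast before trying the provider again
    @failures = 0
    @opened_at = nil
  end

  def call
    raise CircuitOpenError if open?
    result = yield
    @failures = 0
    @opened_at = nil
    result
  rescue CircuitOpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = Time.now if @failures >= @threshold
    raise
  end

  private

  def open?
    @opened_at && (Time.now - @opened_at) < @cool_off
  end
end

# Usage (WeatherClient is a hypothetical wrapper around your weather API):
#   WEATHER_BREAKER = SimpleCircuitBreaker.new
#   WEATHER_BREAKER.call { WeatherClient.current_conditions(lat, lon) }
```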
As a note, I prefer the Meteo weather API for reliability.
One of the reasons I was interested in working with you is because I think your userbase is so interesting: primarily mobile devices, often operating on poor network conditions.
A CDN is an immensely useful thing when network conditions get bad. By making the “first hop” on a poor network connection as close to the user as possible, you minimize the effect of that poor connection.
You’re on AWS, so Cloudfront is an option, but for a number of reasons, Cloudflare is my preferred vendor. They have a wide and useful product offering, and their CDN product is just better than AWS’ in a number of ways (more PoPs, HTTP/3, lots of nice image manipulation features, brotli, etc.).
In conjunction with the previous recommendation, you have a number of responses that could be HTTP cacheable. For example, the main web regulation search flow uses https://prod.huntregsapp.com/regulations/species?state_id=41, which is a classic example of an HTTP cacheable response. There’s no user data required for this request; I can just run curl https://prod.huntregsapp.com/regulations/species?state_id=41 and it just works!
HTTP caching would give you some really nice stuff:
HTTP caching is extremely useful but rarely applicable, and you have a great use case for it here.
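On the Rails side, marking a response like the species search as cacheable is essentially a one-liner (the controller and scope names here are illustrative, matching the URL above):

```ruby
# Sketch: tell the CDN and browsers this response is safe to cache.
class Regulations::SpeciesController < ApplicationController
  def index
    @species = Species.for_state(params[:state_id]) # however this is loaded today
    expires_in 12.hours, public: true               # Cache-Control: public, max-age=43200
  end
end
```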
If you do find you’re able to HTTP cache a lot of your GET requests, and those requests are served from a CDN, you can suddenly get way more aggressive with prefetching stuff on your clients.
For example, on the main regulation search page, you could just prefetch every state! It “Just Works” if you set the headers correctly, and since you’re not serving that request (your CDN is) you’re not meaningfully increasing your own load at all!
It means your users no longer have to wait for anything to load, because the data’s already there.
You could, of course, do this in your mobile clients as well. Again, the nice thing is that since this is all just HTTP, the semantics of caching are already built for you. You just have to mark your responses as HTTP cacheable, and the underlying client libraries should do the rest for you!
Another reason to use CDNs is that they will automatically compress responses. If you’re shipping JSON to mobile clients, a brotli-capable CDN could reduce payload sizes by 50% or more.
I can also tell by the stuff you’re returning on your API that you might have some especially large responses. I think you should pay attention to this as it’s definitely going to be a cause of latency for mobile clients.
I can’t be super concrete on this one because I think it will depend on what CDN you choose. Cloudfront and Cloudflare would both have different tools for monitoring response body sizes. I would pick one and keep track of how many response bodies I was sending that were larger than ~100kb.
You have an issue with your webfonts on prod.huntregsapp.com - fontawesome and Google Fonts are downloading late (they’re blocked on something else, I’m not quite sure what).
I don’t think webfonts are worth it for Huntregs, outside of the marketing site. If your clients are mostly mobile and potentially mostly on poor connections, any branding benefit here is outweighed by the ~500ms-1 sec it takes to download a webfont for the first time.
I would browse through https://modernfontstacks.com/ and find something I liked and just use that instead.
If you must keep webfonts, let’s host them first-party from huntregsapp.com to improve performance, and I’ll investigate the late-download issue.
You have several third-party CDNs serving javascript on your hunting regulation pages. This isn’t necessary and it’s much slower than just hosting everything yourself on prod.huntregsapp.com.
Javascript is on the “hot path” of your page load. By putting JS on a 3rd party domain, we now add the time it takes to connect to that 3rd party domain on the “hot path” and slow down our pageload by at least that amount. It’s one of those “100 ms problems that adds up”.
The HTML pages served by Basecamp are pretty simple and there’s just a few screens. I would consider using Turbo Drive to speed up the page transition and make it feel almost instant when you go between them.
The difference between a full page load and a Turbo Drive pageload will be something like a 10x difference. The other alternative to get that kind of performance boost is a React app, but that would add significant complexity.
This may be too late, but as I was looking through this, I thought: Huntregs is kind of the poster child for the “new” style of Rails deployment that DHH is pushing with Rails 8:
That kind of setup can easily handle ~500 requests per second. You’re currently doing about 2. You could serve your load for the next ~3 years for $200/month!
The drawback is that all the data stuff gets much harder: backup, recovery, and snapshotting are all “taken care of” for you by RDS today, and you’d have to reinvent that wheel. But, something to consider.
When you get circuit breakers in place, you should reduce the read and open timeouts for your connections to third parties. If OpenStreetMap or your weather API doesn’t respond in 2 seconds, they’re not gonna respond if you wait another 58 seconds.
The default read/open timeouts in Net::HTTP are 60 seconds each.
Lax timeouts just lead to worse behavior when third parties go down. I would also consider timing out your own responses on a tight leash: something like 10 seconds. It’s easy to do at this stage of an application when it’s this small, and much harder to do 3 years from now.
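Tightening this is a few lines on each outbound call. A sketch with plain Net::HTTP (the host is a placeholder; if these calls go through Faraday, set the equivalent conn.options.timeout / conn.options.open_timeout on the connection instead):

```ruby
require "net/http"

# Sketch: fail fast instead of tying up a Puma thread for a minute.
uri = URI("https://weather.example.com/forecast")
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
http.open_timeout = 2  # seconds to establish the TCP/TLS connection
http.read_timeout = 2  # seconds to wait for the response
http.write_timeout = 2 # seconds to send the request
response = http.request(Net::HTTP::Get.new(uri))
```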
Your AWS auto-scaling group is currently based on CPU autoscaling. It’s better to use request queue times (which, as I’ve recommended in other parts here, you need to report to Cloudwatch anyway).
The reason is that it’s trivial, especially in your app, to construct a workload that 100% utilizes your servers but not your CPU.
Consider what happens if your Weather API gets 2x slower. Let’s say every request now takes 1 second, and 0.7 seconds of that is I/O to the weather API, waiting on a response.
In these conditions, your Puma processes will be completely full and requests will time out waiting to be picked up, but your CPU usage will be just 30%, and your autoscaler threshold of 40% CPU will never be hit.
Request queue time is the only correct way to scale. We scale based on how long people wait.
Here are some more random things I noticed while looking around.
Your overall monorepo structure and the onboarding process is easily the best I’ve ever seen for a company of this stage and size, credit to the people who wrote it.
I’ve really been enjoying mise for tool version management. I would encourage you to adopt it officially in the monorepo and enforce/set it up for devs in the automated process you’ve got.
Speedshop can do this.
Sentry isn’t configured to do profiling. It costs a small amount extra (like in the $50/mo range IIRC) but it’s incredibly, incredibly useful (mentioned in another recommendation). You should turn it on.
I saw the performance testing report on Shortcut, but I don’t have access to the repo. I’d like to read it now that I’ve done my own report!
The marketing site on Wordpress is kind of hopeless in my opinion. I’d like to hear more about what your requirements are for this site.
The load time for the marketing site, from Tokyo, is ~2-3x worse than the app itself.
The way I see it:
You’re a tech company. You can be technical and solve problems in a technical way. You don’t need a WYSIWYG for content editing because you’ve got full time technical staff.
Your prod.huntregsapp.com site is very fast. It would be great if the marketing stuff was built on the same kind of frontend stack, because then it would also be really really fast.
You asked my opinion on how to deal with syncing ~1GB of regulations to the mobile apps. I’m not an expert on this kind of thing, but here’s my rough idea:
I’m assuming the database itself is SQLite, or it probably should be.
This is more or less how most video game companies do it. The patch diff files themselves can be HTTP cached as static binary data, which would also help download times.