Witnessing errors connected with database performance is never a pleasant experience. Could you have prepared yourself to handle them? Let’s find out.
Let’s divide the possible errors into categories. There are two common error types in the Elixir ecosystem that indicate performance issues connected with a database. Both are raised as
DBConnection.ConnectionError exceptions, but they have different messages:
- the first type informs you that there are not enough connections available to communicate with your database instance. Let’s call them pool timeout errors. They look like this:

```
(DBConnection.ConnectionError) connection not available and request was dropped from queue after 206ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:

  1. Ensuring your database is available and that you can connect to it
  2. Tracking down slow queries and making sure they are running fast enough
  3. Increasing the pool_size (although this increases resource consumption)
  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval

See DBConnection.start_link/2 for more information
```

You can see that the message includes suggestions on how to solve the problem 🎉 I will dig deeper into those points later on.
- the second type usually means that a transaction or query lasted too long and exceeded the timeout. I will refer to them as query timeout errors:

```
(DBConnection.ConnectionError) tcp recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)

(DBConnection.ConnectionError) client #PID<0.368.0> timed out because it checked out the connection for longer than 15000ms
```
Buying more time
There is a chance that you’ve noticed these errors when your system’s performance was degraded. In the worst-case scenario, angry users are submitting reports, so there can be a lot of pressure on you because the service isn’t working as desired. You want to act fast.
I’ve noticed during my work as a Site Reliability Engineer that people tend to react too quickly during incidents. This can be deceptive. In stressful situations, it is easy to forget that actions cause reactions, and operating in the blink of an eye isn’t always the best solution. Because of this, it’s a good idea to have a list of actions that are safe to apply during incidents: actions that won’t have a long-term impact if they fail and that will buy you more time to resolve issues.
I’ve prepared a list of actions that should be safe to apply if you notice these errors.
Start with tuning your settings
It should be quite easy with pool timeout errors. If you didn’t tune it before, there is a high chance that simply tweaking pool_size will buy you some time and let you debug further with better morale. But remember: pool_size isn’t magic, and it has its limits. As the error message says, “although this increases resource consumption” - it will increase the base memory usage on your application host and database.
It also won’t scale indefinitely. As you can see in the chart below, there is a point where increasing the number of connections won’t increase the number of operations your database can perform - throughput can even drop significantly. The breaking point differs between hardware setups and even database versions, so you need to experiment to find the optimal value for your system.
Influence of the number of concurrent connections on Transactions Per Seconds for two PostgreSQL versions (source: https://pganalyze.com/blog/postgres-14-performance-monitoring)
If increasing pool_size isn’t working for you, you can still tweak queue_target and queue_interval. They can help you avoid
DBConnection.ConnectionError, but there won’t be a difference in response times. They will allow you to avoid 500 responses, which is usually enough for a short period of time.
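As a sketch, all three knobs live in your Repo configuration. The app and module names below are placeholders, and the numbers are illustrative examples, not recommendations:

```elixir
# config/runtime.exs - illustrative values only
config :my_app, MyApp.Repo,
  pool_size: 20,          # default is 10; more connections cost memory on app and DB
  queue_target: 100,      # ms a connection checkout may take before the pool
                          # is considered overloaded (default: 50)
  queue_interval: 2_000   # if checkouts keep exceeding queue_target for this many ms,
                          # the pool starts dropping requests (default: 1_000)
```

Raising queue_target and queue_interval lets requests wait longer in the queue instead of being dropped, which trades latency for fewer errors.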
Query timeouts are a bigger problem. I think issues with query timeouts are more specific, and that is why they are harder to pinpoint and mitigate. Probably the easiest and fastest way to mitigate them is to increase the timeout, but that won’t serve you well for long.
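For illustration, the timeout can be raised for a single query or for the whole Repo (MyApp.Repo and heavy_report_query are placeholders):

```elixir
# Per-query: only the expensive query gets more time (the default is 15_000 ms)
Repo.all(heavy_report_query, timeout: 60_000)

# Repo-wide, in config - affects every query, so use with care:
config :my_app, MyApp.Repo,
  timeout: 30_000
```

Prefer the per-query option: a global increase just hides every slow query behind a bigger number.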
- Check whether other clients are overusing your database, if it is shared among multiple applications or services.
- If possible, try to scale your database vertically to increase performance.
If nothing helps, maybe you can turn off some features that aren’t core to your business? For example: generating reports, or fetching and displaying data that isn’t necessary for core business processes. It can be hard to decide during an incident what can be turned off, so it is good to prepare in advance a list of things your service can live without for some time.
It is even better if you use feature flags: you can use a flag as a safety switch and prepare selected features for being turned off. You can easily integrate and control feature flags at runtime thanks to the great Elixir library Fun with Flags, or some SaaS, for example, LaunchDarkly.
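A minimal sketch of the safety-switch idea using FunWithFlags (the :pdf_reports flag and the Reports module are made up for illustration):

```elixir
# Guard a non-critical feature behind a runtime-checked flag
def maybe_generate_report(user) do
  if FunWithFlags.enabled?(:pdf_reports) do
    Reports.generate_pdf(user)
  else
    {:error, :feature_disabled}
  end
end

# During an incident, flip the switch from a remote IEx session:
# FunWithFlags.disable(:pdf_reports)
```

The flag check costs almost nothing in normal operation, but gives you a one-line lever to shed load when the database is struggling.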
So let’s assume that you’ve mitigated the problems and have time to debug. It is crucial to find the root cause. There is a reason why, in pool timeout errors, “Tracking down slow queries and making sure they are running fast enough” is among the suggested steps. It is also a good moment to start working on a postmortem. It will help you avoid similar issues in the future and share knowledge with your teammates.
Finding the root cause can be challenging, but in the long term it will pay off. If you don’t have experience in looking for the core of an issue, you can try the Five Whys technique. With time and knowledge of your system, you will learn when you can stop asking why :D
To help you get started, I’ve listed issues that turned out to be root causes of
DBConnection.ConnectionError errors in the projects I’ve been working on.
You can start by searching for long-lasting queries. For example, you can use Query Insights if you host your Postgres database on Google Cloud Platform. There are a lot of other ways to find long-lasting queries, but they depend heavily on your environment.
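If you have the pg_stat_statements extension enabled, one environment-independent option is to query it straight from Elixir. A sketch (MyApp.Repo is a placeholder, and note that the mean_exec_time column is called mean_time on PostgreSQL versions before 13):

```elixir
# Top 10 statements by average execution time, via pg_stat_statements
{:ok, result} =
  Ecto.Adapters.SQL.query(MyApp.Repo, """
  SELECT query, calls, mean_exec_time
  FROM pg_stat_statements
  ORDER BY mean_exec_time DESC
  LIMIT 10
  """)

Enum.each(result.rows, fn [query, calls, mean_time] ->
  IO.puts("#{Float.round(mean_time, 1)} ms avg, #{calls} calls: #{query}")
end)
```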
If you found a slow query, it is time to fix it. There can be a lot of reasons why a query is slow - maybe it isn’t using the index you expect. If you use PostgreSQL, you can run EXPLAIN ANALYZE to learn the execution plan of the query. Maybe it is joining unnecessary data, or fetching too many columns. This can be painful if the query is executed many times per second: encoding, decoding, and network transfer all add up under high-rate traffic, so that’s also a place to look.
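You don’t even have to leave Elixir for this: Ecto exposes explains through Ecto.Adapters.SQL.explain/4. A sketch, with a made-up "users" table and placeholder Repo:

```elixir
import Ecto.Query

# Caution: analyze: true actually executes the query
query = from u in "users", where: u.email == "someone@example.com", select: u.id

MyApp.Repo
|> Ecto.Adapters.SQL.explain(:all, query, analyze: true)
|> IO.puts()
```

If the plan shows a sequential scan where you expected an index scan, you’ve likely found your culprit.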
I have also seen cases in which a long-lasting operation was performed inside a transaction. You should execute heavy computation or HTTP requests outside transactions, so you can check the connection back into the pool as soon as possible.
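Roughly, the difference looks like this (fetch_external/1 and build_record/1 are made-up helpers):

```elixir
# Anti-pattern: the slow HTTP call holds a pool connection for its whole duration
Repo.transaction(fn ->
  data = fetch_external(url)        # seconds of network I/O inside the transaction
  Repo.insert!(build_record(data))
end)

# Better: do the slow work first, keep the transaction (and the checkout) short
data = fetch_external(url)

Repo.transaction(fn ->
  Repo.insert!(build_record(data))
end)
```

With the default pool_size of 10, just a handful of transactions like the first one can exhaust the whole pool.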
The wrong model can also be the reason why your database can’t deliver the desired performance. If you have far more reads than writes, maybe you should prepare a read model and denormalize data. You can also think about drawing stronger boundaries between your contexts so they don’t have to live in the same database. I recommend trying domain-driven design, which can help you find those boundaries.
Or maybe your issues were caused by wrong configuration. If you didn’t tweak the configuration at all, there is a chance that the traffic on your website simply needs more DB connections; then increasing pool_size should be fine (if you still have resources on the application host). If not, maybe you should spread the load between more instances or scale your host vertically, which helps but will still limit you at some point.
If the problems occur while applying migrations, I highly recommend reading Safe Ecto Migrations by David Bernheisel. It is a complete guide on how to use Ecto migrations safely, so there is no point in adding more here.
What can be done better?
If you learned about the issues from users, your monitoring probably needs some work. From my experience, most problems can be detected earlier, so you won’t have to deal with
DBConnection.ConnectionError in a hurry.
It is all about observability - one of the non-functional requirements that is easily forgotten, yet crucial if you want to continuously deliver a stable and performant platform for your users.
If you collect metrics describing query times, or even just response times, you can detect in advance that something may cause issues in the near future. Just define limits and set alarms for when something goes outside them. Setting up automatic alarms is quite important: without them, there is a high chance you will miss the moment to act before the fire, and all your collected metrics won’t pay off.
I really like the BEAM ecosystem’s focus on standardizing metrics. telemetry is a great foundation that allows libraries to export metrics and is easy to consume, which brings joy. So it should be quite easy to observe your Elixir applications.
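For instance, Ecto emits a [:my_app, :repo, :query] telemetry event (the prefix depends on your otp_app and repo names) whose measurements include queue_time - long waits there are an early warning of pool timeouts. A sketch of a handler, with an illustrative threshold:

```elixir
require Logger

:telemetry.attach(
  "warn-on-slow-checkouts",
  [:my_app, :repo, :query],       # adjust to your otp_app and repo module
  fn _event, measurements, metadata, _config ->
    queue_ms =
      System.convert_time_unit(measurements[:queue_time] || 0, :native, :millisecond)

    # 100 ms is illustrative; in production, feed this into your metrics pipeline
    if queue_ms > 100 do
      Logger.warning("Waited #{queue_ms}ms for a DB connection: #{metadata.query}")
    end
  end,
  nil
)
```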
Also, there is a lot of work on OpenTelemetry for BEAM, which can help you easily boost your observability to the next level, especially if you are working in a microservices environment.
And big kudos to fly.io for supporting metrics at no cost (at the moment of writing), which can easily be used with Grafana Cloud Free. Thanks to that, metrics won’t cost you a penny.
We can sum up dealing with
DBConnection.ConnectionError as this 3-step process:
- Buying more time (mitigating issues)
  - Tune your settings
  - Scale the database instance
  - Decrease non-critical load
- Finding the root cause
  - Long-lasting queries
  - Polluted transactions
  - Wrong model
  - Inappropriate configuration
  - Heavy migrations
- Learn from the incident
This basic 3-step pattern - mitigate, find the root cause, learn from the incident - will also help you deal with other issues in a production environment, not only
DBConnection.ConnectionError. Feature flags and postmortems, which I introduced earlier, are generic tools, so introducing them will help you in many cases. If you don’t use them already, I recommend checking them out.
I hope that after reading this article, dealing with
DBConnection.ConnectionError and other production issues will be a bit easier for you and, what’s more important, that you will be prepared for them - or even able to predict and avoid them.