I maintain an internal-facing service at Heroku that does metadata processing. It's not real-time, so there's plenty of slack for when things go wrong. Recently I discovered that the system was getting bogged down to the point where no jobs were being executed at all. After hours of debugging, I found that the problem was an UPDATE of a single row on a single table causing the entire table to lock, which created a lock queue and ground the whole process to a halt. This post is the story of how the problem was debugged and fixed, and why such a seemingly simple query caused so much harm.

Update: On closer inspection, the lock was not on the table, but on a tuple. For more information on this locking mechanism, see the internal PostgreSQL tuple locking documentation. Postgres does not have lock promotion, as suggested in the debugging section of this post.

I started debugging when the backlog on our system began to grow and the number of jobs being processed fell to nearly zero. The system has been running in production for years, and while there have been occasional performance issues, nothing stood out as a huge problem. I checked our datastores, and they were well under their limits. I checked our error tracker and didn't see any smoking guns. My best guess was that the database where the results were being stored was having problems.

The first thing I did was run heroku pg:diagnose, which shows "red" (critical) and "yellow" (important but less critical) issues. It showed that I had queries that had been running for DAYS:

    68698    5 days 18:01:26.446979    UPDATE "table" SET WHERE ("uuid" = '')

The query in question was a simple update, and it's not even on the most massive table in the DB. When I checked heroku pg:outliers from the pg-extras CLI plugin, I was surprised to see this update taking up 80%+ of the time, even though the table it touches is smaller than the largest table in the database by a factor of 200.

Running the update statement manually didn't reproduce the issue, so I was fresh out of ideas. If it had, then I could have run it with EXPLAIN ANALYZE to see why it was so slow.

Luckily I work with some pretty fantastic database engineers, and I pinged them for possible ideas. They mentioned that there might be a locking issue with the database. The idea was strange to me since the system had been running relatively unchanged for an extremely long time and had only now started to see problems, but I decided to look into it.

I saw a ton of queries that had been hung for quite some time, and most of them pointed to my seemingly teeny UPDATE statement:

    Query | UPDATE "table" SET WHERE ("uuid" = '')

Up until this point, I basically knew nothing about how PostgreSQL uses locking other than explicit advisory locks, which can be used via a gem like pg_lock (that I maintain). Luckily Postgres has excellent docs around locks, but it's a bit much if you're new to the field: the PostgreSQL lock documentation. Looking up the name of the lock from before, Access Exclusive Lock, I saw that it locks the whole table: "Conflicts with locks of all modes (ACCESS SHARE, ROW SHARE, ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE)."
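If you want to poke at this kind of problem without Heroku's tooling, the same checks can be approximated straight from Postgres's own catalog views. These are minimal sketches, not the exact queries the CLI runs. First, the long-running-query check that pg:diagnose performed, against pg_stat_activity (the five-minute threshold here is an arbitrary choice of mine):

    -- Roughly what pg:diagnose flagged: non-idle queries that have
    -- been running longer than five minutes, longest first.
    SELECT pid,
           now() - query_start AS duration,
           state,
           query
    FROM pg_stat_activity
    WHERE state <> 'idle'
      AND now() - query_start > interval '5 minutes'
    ORDER BY duration DESC;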
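To see which queries are hung and what is holding them up, pg_blocking_pids (available since Postgres 9.6) can be joined back against pg_stat_activity. Again, a sketch rather than what the Heroku tooling emits:

    -- Pair each blocked query with the query holding the lock it wants.
    SELECT blocked.pid    AS blocked_pid,
           blocked.query  AS blocked_query,
           blocking.pid   AS blocking_pid,
           blocking.query AS blocking_query
    FROM pg_stat_activity AS blocked
    JOIN pg_stat_activity AS blocking
      ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));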
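For completeness, here's what the explicit advisory locking mentioned above looks like at the SQL level. This is the primitive that a gem like pg_lock wraps; the key 42 is arbitrary:

    -- Session A: take an exclusive advisory lock on an arbitrary key.
    SELECT pg_advisory_lock(42);

    -- Session B: a non-blocking attempt on the same key returns false
    -- rather than queueing behind session A.
    SELECT pg_try_advisory_lock(42);

    -- Session A: release the lock when finished.
    SELECT pg_advisory_unlock(42);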
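And to make that "conflicts with all modes" quote concrete, you can reproduce an Access Exclusive pile-up by hand in two psql sessions (the jobs table name here is made up for illustration):

    -- Session A: grab the strongest table lock and hold the
    -- transaction open.
    BEGIN;
    LOCK TABLE jobs IN ACCESS EXCLUSIVE MODE;

    -- Session B: even a plain read now queues behind session A,
    -- because ACCESS EXCLUSIVE conflicts with ACCESS SHARE.
    SELECT count(*) FROM jobs;  -- hangs until session A commits

    -- Session A: releasing the lock lets session B's read proceed.
    COMMIT;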