In PostgreSQL, whenever a row in a table is deleted, the existing tuple is marked as dead rather than physically removed. Likewise, an UPDATE marks the corresponding existing tuple as dead and inserts a new tuple, so in PostgreSQL an UPDATE is effectively a DELETE plus an INSERT. These dead tuples consume storage unnecessarily, and over time you end up with a bloated PostgreSQL database, a serious issue for any PostgreSQL DBA to solve. VACUUM reclaims the storage occupied by dead tuples. Please note that the reclaimed space is never given back to the operating system; it is merely defragmented within the same database pages, so the storage becomes available for reuse by future inserts into the same table. Does the pain stop here? No, it doesn't. Bloat seriously affects PostgreSQL query performance: tables and indexes are stored as arrays of fixed-size pages (usually 8KB). Whenever a query requests rows, PostgreSQL loads these pages into memory, and dead rows cause extra, expensive disk I/O while the data is being read.
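You can watch this bloat accumulate first-hand through the dead-tuple counter in pg_stat_user_tables. A minimal sketch, assuming a throwaway table (the name demo_bloat and the row count are illustrative):

```sql
-- Create a scratch table and update every row once: each UPDATE leaves
-- the old tuple version behind as a dead tuple.
CREATE TABLE demo_bloat (id int PRIMARY KEY, payload text);
INSERT INTO demo_bloat SELECT g, 'x' FROM generate_series(1, 10000) g;
UPDATE demo_bloat SET payload = 'y';

-- n_dead_tup should now be roughly 10000 (statistics are reported
-- asynchronously, so allow a moment for the counters to catch up).
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'demo_bloat';
```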
Blame it on PostgreSQL Multi-Version Concurrency Control (MVCC) for the bloating. Because MVCC in PostgreSQL is committed to keeping each transaction isolated and durable (ACID compliance in transaction management), readers never block writers and vice versa. Every transaction in PostgreSQL (an insert, update, or delete, as well as a group of statements explicitly wrapped in BEGIN – COMMIT) is identified by a transaction ID called an XID. When a transaction starts, Postgres increments the XID counter and assigns the new value to the current transaction. PostgreSQL also stores transaction information on every row in the system, which is used to determine whether a row is visible to a given transaction. Because different transactions have visibility into different sets of rows, PostgreSQL needs to retain potentially obsolete row versions. This is why an UPDATE actually creates a new row and why DELETE doesn't really remove the row: it merely marks it as deleted and sets the XID values appropriately. As transactions complete, the database accumulates rows that cannot possibly be visible to any future transaction. These are called dead rows (technically, the bloated records in PostgreSQL).
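The per-row transaction information mentioned above is exposed through the hidden system columns xmin (the XID that created the row version) and xmax (the XID that deleted or superseded it, or 0 while it is still live). A quick way to inspect them, reusing the hypothetical demo_bloat table from the earlier sketch:

```sql
-- xmin changes after every UPDATE because each update writes a brand-new
-- row version; the superseded version keeps the updater's XID in xmax.
SELECT xmin, xmax, id, payload
FROM demo_bloat
LIMIT 5;
```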
How does PostgreSQL handle bloat for optimal performance and storage efficiency?
In the past (pre PostgreSQL 8), DBAs used to reclaim the storage from dead tuples by running the VACUUM command manually. This was a daunting task, because a DBA had to balance the resource cost of vacuuming against the current transaction volume and load to plan when to run it, and also when to abort it. The PostgreSQL "autovacuum" feature has made a DBA's life much easier when it comes to managing database bloat and vacuuming.
How does Autovacuum work in PostgreSQL?
Autovacuum is one of the background utility processes; it removes dead tuples automatically once their number in a table, driven by updates and deletes, exceeds an effective threshold. The autovacuum launcher wakes up at an interval controlled by the configuration parameter autovacuum_naptime (default 1 minute) and attempts to start worker processes, up to the cap set by the configuration parameter autovacuum_max_workers (default 3). Each worker searches for tables where PostgreSQL's statistics indicate that a large enough number of rows has changed relative to the table size. The formula is:
```
## The formula applied by the autovacuum process to identify tables that
## are bloated and need immediate attention for vacuuming:

[estimated rows invalidated] >= autovacuum_vacuum_scale_factor * [estimated table size] + autovacuum_vacuum_threshold
```
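Putting the formula into practice, the following query is a sketch that lists tables which have already crossed the trigger threshold. It uses the standard pg_stat_user_tables and pg_class catalogs and assumes the global settings apply (per-table storage-parameter overrides are not taken into account):

```sql
-- Compare each table's dead-tuple count against the autovacuum trigger
-- threshold computed from the formula above.
SELECT s.relname,
       s.n_dead_tup,
       round(current_setting('autovacuum_vacuum_scale_factor')::numeric
             * c.reltuples::numeric
             + current_setting('autovacuum_vacuum_threshold')::numeric) AS trigger_threshold
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE s.n_dead_tup > current_setting('autovacuum_vacuum_scale_factor')::numeric
                     * c.reltuples::numeric
                     + current_setting('autovacuum_vacuum_threshold')::numeric;
```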
This is what happens internally within PostgreSQL during the autovacuum process. The worker processes remove dead tuples and compact pages aggressively, and all of this activity consumes substantial disk I/O throughput. Autovacuum tracks this work as I/O cost credits, and when the accumulated cost exceeds autovacuum_vacuum_cost_limit, it pauses the workers for a few milliseconds, as set by the configuration parameter autovacuum_vacuum_cost_delay (default 20 milliseconds). Vacuuming, as mentioned above, is a resource-hogging and time-consuming activity, because every vacuum worker must scan through the individual dead rows to remove the index entries pointing at them before the pages can be compacted. If you have deployed PostgreSQL on infrastructure with limited memory, the maintenance_work_mem configuration parameter will be very conservative, leaving each worker able to process only a limited number of dead rows per pass and causing vacuum to fall behind.
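Both knobs mentioned above can be adjusted cluster-wide. A sketch under illustrative assumptions (the values shown are examples, not one-size-fits-all recommendations): give each vacuum worker more memory for dead-tuple pointers and relax the cost-based throttling so workers pause less often.

```sql
-- Raise per-worker memory for dead-tuple tracking and the I/O cost budget.
ALTER SYSTEM SET maintenance_work_mem = '1GB';
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;

-- Reload the configuration so the new values take effect without a restart.
SELECT pg_reload_conf();
```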
How to configure Autovacuum parameters?
The default autovacuum settings work well for a PostgreSQL database of a few GBs, but they are definitely not recommended for larger PostgreSQL infrastructure: as data and transaction volume grow, vacuum falls behind. Once vacuum has fallen behind, it directly impacts query execution plans and performance, which tempts DBAs to run autovacuum less frequently, or not at all.
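These global settings can also be overridden for individual tables through storage parameters, which is often the more surgical fix for a handful of very large, update-heavy tables. A minimal sketch (the table name orders and the values are illustrative assumptions, not prescriptions):

```sql
-- Trigger autovacuum on "orders" once roughly 1% of its rows are dead,
-- instead of the global default of 20%, plus a small fixed threshold.
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold    = 1000
);
```

The following matrix recommends optimal sizing of the configuration parameters for larger PostgreSQL database instances: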