24.5. Hot Standby

Hot Standby is the term used to describe the ability to connect to the server and run queries while the server is in archive recovery. This is useful for both log shipping replication and for restoring a backup to an exact state with great precision. The term Hot Standby also refers to the ability of the server to move from recovery through to normal running while users continue running queries and/or continue their connections.

Running queries in recovery is in many ways the same as normal running though there are a large number of usage and administrative points to note.

24.5.1. User's Overview

Users can connect to the database while the server is in recovery and perform read-only queries. Read-only access to catalogs and views will also occur as normal.

The data on the standby takes some time to arrive from the primary server so there will be a measurable delay between primary and standby. Running the same query nearly simultaneously on both primary and standby might therefore return differing results. We say that data on the standby is eventually consistent with the primary. Queries executed on the standby will be correct with regard to the transactions that had been recovered at the start of the query, or start of first statement, in the case of serializable transactions. In comparison with the primary, the standby returns query results that could have been obtained on the primary at some exact moment in the past.

When a transaction is started in recovery, the parameter transaction_read_only will be forced to be true, regardless of the default_transaction_read_only setting in postgresql.conf. It can't be manually set to false either. As a result, all transactions started during recovery will be limited to read-only actions only. In all other ways, connected sessions will appear identical to sessions initiated during normal processing mode. There are no special commands required to initiate a connection at this time, so all interfaces work normally without change. After recovery finishes, the session will allow normal read-write transactions at the start of the next transaction, if these are requested.
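As a rough illustration of the behaviour described above, a session connected to a server that is still in recovery might try the following. This is only a sketch: the exact error text is deliberately not shown.

```sql
-- Run in a session connected to a server that is in recovery.
SHOW transaction_read_only;        -- expected to report "on" during recovery

-- Trying to make the transaction read-write should be rejected
-- for as long as recovery is in progress.
SET transaction_read_only = off;
```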

Read-only here means "no writes to the permanent database tables". There are no problems with queries that make use of transient sort and work files.

The following actions are allowed

These actions produce error messages

Note that current behaviour of read only transactions when not in recovery is to allow the last two actions, so there are small and subtle differences in behaviour between read-only transactions run on standby and during normal running. It is possible that the restrictions on LISTEN, UNLISTEN, NOTIFY and temporary tables may be lifted in a future release, if their internal implementation is altered to make this possible.

If failover or switchover occurs the database will switch to normal processing mode. Sessions will remain connected while the server changes mode. Current transactions will continue, though will remain read-only. After recovery is complete, it will be possible to initiate read-write transactions.

Users will be able to tell whether their session is read-only by issuing SHOW transaction_read_only. In addition, a set of functions (Table 9-57) allows users to access information about Hot Standby. These let you write functions that are aware of the current state of the database. They can be used to monitor the progress of recovery, or to write complex programs that restore the database to particular states.
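A minimal sketch of such a state check follows. It assumes that pg_is_in_recovery() is among the recovery information functions available in your release; the column alias server_state is purely illustrative.

```sql
-- Is this session currently limited to read-only transactions?
SHOW transaction_read_only;

-- Report the server state in a form a script can branch on.
-- pg_is_in_recovery() is assumed to be available in this release.
SELECT CASE WHEN pg_is_in_recovery()
            THEN 'in recovery (read-only)'
            ELSE 'normal running'
       END AS server_state;
```

Because sessions stay connected across the end of recovery, a long-lived client may want to repeat this check at the start of each transaction rather than only once at connect time.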

In recovery, transactions will not be permitted to take any table lock higher than RowExclusiveLock. In addition, transactions may never assign a TransactionId and may never write WAL. Any LOCK TABLE command that runs on the standby and requests a specific lock mode higher than ROW EXCLUSIVE MODE will be rejected.
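The lock-mode restriction can be sketched as follows. The table name accounts is hypothetical, and the comments describe the behaviour expected from the rule above rather than verified output.

```sql
BEGIN;

-- Modes up to and including ROW EXCLUSIVE are not higher than the limit,
-- so these requests should be accepted on the standby.
LOCK TABLE accounts IN ACCESS SHARE MODE;
LOCK TABLE accounts IN ROW EXCLUSIVE MODE;

-- SHARE MODE is higher than ROW EXCLUSIVE MODE, so this should be rejected.
-- The short form "LOCK TABLE accounts;" requests ACCESS EXCLUSIVE MODE
-- and would be expected to fail for the same reason.
LOCK TABLE accounts IN SHARE MODE;

ROLLBACK;
```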

In general, queries will not experience lock conflicts with the database changes made by recovery. This is because recovery follows normal concurrency control mechanisms, known as MVCC. There are some types of change that will cause conflicts, covered in the following section.

24.5.2. Handling query conflicts

The primary and standby nodes are in many ways loosely connected. Actions on the primary will have an effect on the standby. As a result, there is potential for negative interactions or conflicts between them. The easiest conflict to understand is performance: if a huge data load is taking place on the primary then this will generate a similar stream of WAL records on the standby, so standby queries may contend for system resources, such as I/O.

There are also additional types of conflict that can occur with Hot Standby. These conflicts are hard conflicts in the sense that we may need to cancel queries and in some cases disconnect sessions to resolve them. The user is provided with a number of optional ways to handle these conflicts, though we must first understand the possible reasons behind a conflict.

Some WAL redo actions will be for DDL actions. These DDL actions are repeating actions that have already committed on the primary node, so they must not fail on the standby node. These DDL locks take priority and will automatically *cancel* any read-only transactions that get in their way, after a grace period. This is similar to the possibility of being canceled by the deadlock detector, but in this case the standby process always wins, since the replayed actions must not fail. This also ensures that replication doesn't fall behind while we wait for a query to complete. Again, we assume that the standby is there for high availability purposes primarily.

For example, suppose an administrator on the primary server runs DROP TABLE on a table that is currently being queried on the standby server. Clearly the query cannot continue if we let the DROP TABLE proceed. If this situation occurred on the primary, the DROP TABLE would wait until the query had finished. When the query is on the standby and the DROP TABLE is on the primary, the primary has no information about which queries are running on the standby, so the DROP TABLE does not wait on the primary. The WAL change records come through to the standby while the standby query is still running, causing a conflict.

The most common reason for conflict between standby queries and WAL redo is "early cleanup". Normally, PostgreSQL allows cleanup of old row versions when there are no users who may need to see them to ensure correct visibility of data (the heart of MVCC). If there is a standby query that has been running for longer than any query on the primary then it is possible for old row versions to be removed by either a vacuum or HOT. This will then generate WAL records that, if applied, would remove data on the standby that might *potentially* be required by the standby query. In more technical language, the primary's xmin horizon is later than the standby's xmin horizon, allowing dead rows to be removed.

Experienced users should note that both row version cleanup and row version freezing will potentially conflict with recovery queries. Running a manual VACUUM FREEZE is likely to cause conflicts even on tables with no updated or deleted rows.

We have a number of choices for resolving query conflicts. The default is that we wait and hope the query completes. The server will wait automatically until the lag between primary and standby is at most max_standby_delay seconds. Once that grace period expires, we take one of the following actions:

max_standby_delay is set in postgresql.conf. The parameter applies to the server as a whole so if the delay is all used up by a single query then there may be little or no waiting for queries that follow immediately, though they will have benefited equally from the initial waiting period. The server may take time to catch up again before the grace period is available again, though if there is a heavy and constant stream of conflicts it may seldom catch up fully.

Users should be clear that tables that are regularly and heavily updated on the primary server will quickly cause cancellation of longer-running queries on the standby. In those cases max_standby_delay can be considered somewhat similar to, but not exactly the same as, setting statement_timeout.

Other remedial actions exist if the number of cancellations is unacceptable. The first option is to connect to primary server and keep a query active for as long as we need to run queries on the standby. This guarantees that a WAL cleanup record is never generated and we don't ever get query conflicts as described above. This could be done using contrib/dblink and pg_sleep(), or via other mechanisms. If you do this, you should note that this will delay cleanup of dead rows by vacuum or HOT and many people may find this undesirable. However, we should remember that primary and standby nodes are linked via the WAL, so this situation is no different to the case where we ran the query on the primary node itself except we have the benefit of off-loading the execution onto the standby.
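One way the dblink approach might look is sketched below. It assumes contrib/dblink is installed in the database; the connection name hold_cleanup, the connection string and the one-hour sleep are all hypothetical. It uses dblink's asynchronous call so that the same standby session stays free to run its own queries.

```sql
-- Open a named connection to the primary (connection string is hypothetical).
SELECT dblink_connect('hold_cleanup', 'host=primary.example.com dbname=app');

-- Start a long-running query on the primary without waiting for it.
-- While it runs, the primary is expected to hold back cleanup of the
-- row versions this query's snapshot could still see.
SELECT dblink_send_query('hold_cleanup', 'SELECT pg_sleep(3600)');

-- ... run the long standby queries here ...

-- Afterwards, stop the placeholder query and close the connection.
SELECT dblink_cancel_query('hold_cleanup');
SELECT dblink_disconnect('hold_cleanup');
```

The sleep duration needs to cover the longest standby query you intend to protect; anything longer only delays cleanup on the primary for no benefit.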

It is also possible to set vacuum_defer_cleanup_age on the primary to defer the cleanup of records by autovacuum, vacuum and HOT. This may allow more time for queries to execute before they are cancelled on the standby, without the need for setting a high max_standby_delay.

Three-way deadlocks are possible between AccessExclusiveLocks arriving from the primary, cleanup WAL records that require buffer cleanup locks and user requests that are waiting behind replayed AccessExclusiveLocks. Deadlocks are currently resolved by the cancellation of user processes that would need to wait on a lock. This is heavy-handed and generates more query cancellations than we need to, though does remove the possibility of deadlock. This behaviour is expected to improve substantially for the main release version of 8.5.

Dropping tablespaces or databases is discussed in the administrator's section since they are not typical user situations.

24.5.3. Administrator's Overview

If there is a recovery.conf file present the server will start in Hot Standby mode by default, though recovery_connections can be disabled via postgresql.conf, if required. The server may take some time to enable recovery connections, since it must first complete sufficient recovery to provide a consistent state against which queries can run before enabling read-only connections. Look for these messages in the server logs:

LOG:  initializing recovery connections

... then some time later ...

LOG:  consistent recovery state reached
LOG:  database system is ready to accept read only connections

Consistency information is recorded once per checkpoint on the primary, as long as recovery_connections is enabled (on the primary). If this parameter is disabled, it will not be possible to enable recovery connections on the standby. The consistent state can also be delayed in the presence of both of these conditions

If you are running file-based log shipping ("warm standby"), you may need to wait until the next WAL file arrives, which could be as long as the archive_timeout setting on the primary.

The setting of some parameters on the standby will need reconfiguration if they have been changed on the primary. The value on the standby must be equal to or greater than the value on the primary. If these parameters are not set high enough then the standby will not be able to track work correctly from recovering transactions. If these values are set too low, the server will halt. Higher values can then be supplied and the server restarted to begin recovery again.

It is important that the administrator consider the appropriate setting of max_standby_delay, set in postgresql.conf. There is no optimal setting; it should be set according to business priorities. For example, if the server is primarily tasked as a High Availability server, then you may wish to lower max_standby_delay or even set it to zero, though that is a very aggressive setting. If the standby server is tasked as an additional server for decision support queries then it may be acceptable to set this to a value of many hours (in seconds). It is also possible to set max_standby_delay to -1, which means wait forever for queries to complete if there are conflicts; this will be useful when performing an archive recovery from a backup.

Transaction status "hint bits" written on primary are not WAL-logged, so data on standby will likely re-write the hints again on the standby. Thus the main database blocks will produce write I/Os even though all users are read-only; no changes have occurred to the data values themselves. Users will be able to write large sort temp files and re-generate relcache info files, so there is no part of the database that is truly read-only during hot standby mode. There is no restriction on the use of set returning functions, or other users of tuplestore/tuplesort code. Note also that writes to remote databases will still be possible, even though the transaction is read-only locally.

The following types of administrator command are not accepted during recovery mode

Note again that some of these commands are actually allowed during "read only" mode transactions on the primary.

As a result, you cannot create additional indexes that exist solely on the standby, nor can you create statistics that exist solely on the standby. If these administrator commands are needed they should be executed on the primary so that the changes will propagate through to the standby.

pg_cancel_backend() will work on user backends, but not the Startup process, which performs recovery. pg_stat_activity does not show an entry for the Startup process, nor do recovering transactions show as active. As a result, pg_prepared_xacts is always empty during recovery. If you wish to resolve in-doubt prepared transactions then look at pg_prepared_xacts on the primary and issue commands to resolve those transactions there.
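Resolving in-doubt prepared transactions on the primary might look like the following sketch. The transaction identifier example_gid is hypothetical; use a gid value returned by the first query.

```sql
-- Run on the primary, not on the standby.
-- List prepared transactions that are still waiting to be resolved.
SELECT gid, prepared, owner, database
FROM pg_prepared_xacts;

-- Resolve one of them, choosing whichever outcome is appropriate:
COMMIT PREPARED 'example_gid';
-- ROLLBACK PREPARED 'example_gid';
```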

pg_locks will show locks held by backends as normal. pg_locks also shows a virtual transaction managed by the Startup process that owns all AccessExclusiveLocks held by transactions being replayed by recovery. Note that the Startup process does not acquire locks to make database changes, and thus locks other than AccessExclusiveLocks do not show in pg_locks for the Startup process; they are just presumed to exist.
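To see which AccessExclusiveLocks are currently held on the standby, including those owned by the Startup process on behalf of replayed transactions, a query along these lines could be used. It is a sketch only; the cast to regclass resolves names only for relations in the current database.

```sql
-- List AccessExclusiveLocks visible on the standby.
SELECT locktype,
       relation::regclass AS relation,
       virtualtransaction,
       pid,
       granted
FROM pg_locks
WHERE mode = 'AccessExclusiveLock';
```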

check_pgsql will work, but it is very simple. check_postgres will also work, though some actions could give different or confusing results. For example, last vacuum time will not be maintained, since no vacuum occurs on the standby (though vacuums running on the primary do send their changes to the standby).

WAL file control commands will not work during recovery, e.g. pg_start_backup, pg_switch_xlog, etc.

Dynamically loadable modules work, including pg_stat_statements.

Advisory locks work normally in recovery, including deadlock detection. Note that advisory locks are never WAL logged, so it is not possible for an advisory lock on either the primary or the standby to conflict with WAL replay. Nor is it possible to acquire an advisory lock on the primary and have it initiate a similar advisory lock on the standby. Advisory locks relate only to a single server on which they are acquired.

Trigger-based replication systems such as Slony, Londiste and Bucardo won't run on the standby at all, though they will run happily on the primary server as long as the changes are not sent to standby servers to be applied. WAL replay is not trigger-based so you cannot relay from the standby to any system that requires additional database writes or relies on the use of triggers.

New oids cannot be assigned, though some UUID generators may still work as long as they do not rely on writing new status to the database.

Currently, temp table creation is not allowed during read only transactions, so in some cases existing scripts will not run correctly. It is possible we may relax that restriction in a later release. This is both a SQL Standard compliance issue and a technical issue.

DROP TABLESPACE can only succeed if the tablespace is empty. Some standby users may be actively using the tablespace via their temp_tablespaces parameter. If there are temp files in the tablespace we currently cancel all active queries to ensure that temp files are removed, so that we can remove the tablespace and continue with WAL replay.

Running DROP DATABASE, ALTER DATABASE ... SET TABLESPACE, or ALTER DATABASE ... RENAME on primary will generate a log message that will cause all users connected to that database on the standby to be forcibly disconnected, once max_standby_delay has been reached.

In normal running, if you issue DROP USER or DROP ROLE for a role with login capability while that user is still connected then nothing happens to the connected user - they remain connected. The user cannot reconnect however. This behaviour applies in recovery also, so a DROP USER on the primary does not disconnect that user on the standby.

The stats collector is active during recovery. All scans, reads, blocks, index usage, etc. will be recorded normally on the standby. Replayed actions will not duplicate their effects on the primary, so replaying an insert will not increment the Inserts column of pg_stat_user_tables. The stats file is deleted at the start of recovery, so stats from primary and standby will differ; this is considered a feature, not a bug.

Autovacuum is not active during recovery, though will start normally at the end of recovery.

Background writer is active during recovery and will perform restartpoints (similar to checkpoints on primary) and normal block cleaning activities. The CHECKPOINT command is accepted during recovery, though performs a restartpoint rather than a new checkpoint.

24.5.4. Hot Standby Parameter Reference

Various parameters have been mentioned above in Section 24.5.2 and Section 24.5.3.

On the primary, parameters recovery_connections and vacuum_defer_cleanup_age can be used to enable and control the primary server to assist the successful configuration of Hot Standby servers. max_standby_delay has no effect if set on the primary.

On the standby, parameters recovery_connections and max_standby_delay can be used to enable and control Hot Standby. vacuum_defer_cleanup_age has no effect during recovery.

24.5.5. Caveats

At this writing, there are several limitations of Hot Standby. These can and probably will be fixed in future releases: