admin管理员组

文章数量:1667032

前言:
经过测试,postgres的主备切换后,同步会出现问题,新的备库不会从主库同步数据过来。需要使用pg_rewind重新修复新的备库


参考文档:
http://postgres/docs/10/app-pgrewind.html

-- 主备信息
PG版本 :10.15 
主库 192.168.2.80   主机名test
备库 192.168.2.81   主机名test1 

-- 查看主备库的信息

[root@test /opt/pgpool/bin]$pg_controldata  | grep "Database cluster state"
Database cluster state:               in production
[root@test /opt/pgpool/bin]$
[root@test1 /root]$pg_controldata  | grep "Database cluster state"
Database cluster state:               in archive recovery
[root@test1 /root]$

postgres=# \x
Expanded display is on.
postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 13456
usesysid         | 16406
usename          | repl
application_name | walreceiver
client_addr      | 192.168.2.81
client_hostname  | 
client_port      | 47797
backend_start    | 2021-08-27 08:13:11.217507+08
backend_xmin     | 
state            | streaming
sent_lsn         | 0/10000350
write_lsn        | 0/10000350
flush_lsn        | 0/10000350
replay_lsn       | 0/10000350
write_lag        | 
flush_lag        | 
replay_lag       | 
sync_priority    | 0
sync_state       | async

postgres=# 

-- 停掉主库 ,同时将原来备库上的recovery.con文件copy一份到主库,后面变更主库为备库的时候会用到。(或者不copy也行,在后面使用pg_rewind的时候,会自动copy过来,包括pg_hba.conf,postgres.conf,recovery.conf文件等,所以做好conf文件的备份)

pg_ctl stop -m fast 
pg_controldata  | grep "Database cluster state"

[postgres@test /home/postgres]$pg_ctl stop -m fast
waiting for server to shut down.... done
server stopped
[postgres@test /home/postgres]$

[postgres@test /home/postgres]$pg_controldata  | grep "Database cluster state"
Database cluster state:               shut down
[postgres@test /home/postgres]$

-- 提升从库为新的主库,可以看到原来的从库上面,recovery.conf变成了recovery.done 

pg_ctl promote 
pg_controldata  | grep "Database cluster state"

[postgres@test1 /home/postgres]$pg_ctl promote 
waiting for server to promote.... done
server promoted
[postgres@test1 /home/postgres]$pg_controldata  | grep "Database cluster state"
Database cluster state:               in production
[postgres@test1 /home/postgres]$

[postgres@test1 /opt/PostgreSQL/10/data]$ls *recov*
recovery.done
[postgres@test1 /opt/PostgreSQL/10/data]$more recovery.done 
standby_mode = 'on'
primary_conninfo = 'user=repl password=oracle host=192.168.2.80 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_s
ession_attrs=any'
[postgres@test1 /opt/PostgreSQL/10/data]$

-- 将原来的主库,部署为从库,编辑recovery.conf文件后重启即可 (这样做会有问题,问题是主库和从库不同步。后面使用pg_rewind来解决)

[postgres@test /opt/PostgreSQL/10/data]$more recovery.conf 
standby_mode=on
primary_conninfo = 'user=repl password=oracle host=192.168.2.81 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_s
ession_attrs=any'
archive_cleanup_command = 'pg_archivecleanup /opt/PostgreSQL/10/data/pg_wal %r'
primary_slot_name = 'node_a_slot'

[postgres@test /opt/PostgreSQL/10/data]$

[postgres@test /opt/PostgreSQL/10/data]$pg_ctl start
waiting for server to start....2021-08-27 08:44:41.640 CST [15175] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-08-27 08:44:41.640 CST [15175] LOG:  listening on IPv6 address "::", port 5432
2021-08-27 08:44:41.643 CST [15175] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-08-27 08:44:41.660 CST [15175] LOG:  redirecting log output to logging collector process
2021-08-27 08:44:41.660 CST [15175] HINT:  Future log output will appear in directory "log".
 done
server started
[postgres@test /opt/PostgreSQL/10/data]$

-- 验证,主备状态是正常的,但是实际测试是无法同步数据的。当然,如果把备库上的recover.conf改名为recovery.done或者其他名字,启动备库后,备库状态为 in production (这样就相当于failover了,单独的一个读写库了,但是数据就和原来的主库不一样了) 。

[postgres@test1 /opt/PostgreSQL/10/data]$pg_controldata  | grep "Database cluster state"
Database cluster state:               in production
[postgres@test1 /opt/PostgreSQL/10/data]$

[postgres@test /opt/PostgreSQL/10/data]$pg_controldata  | grep "Database cluster state"
Database cluster state:               in archive recovery
[postgres@test /opt/PostgreSQL/10/data]$

-- 新的从库中,提示没有复制槽,

[postgres@test /opt/PostgreSQL/10/data/log]$tail -f postgresql-2021-08-27_084441.log
2021-08-27 08:46:51.895 CST [15321] FATAL:  could not start WAL streaming: ERROR:  replication slot "node_a_slot" does not exist
2021-08-27 08:46:56.904 CST [15326] FATAL:  could not start WAL streaming: ERROR:  replication slot "node_a_slot" does not exist
2021-08-27 08:47:01.906 CST [15331] FATAL:  could not start WAL streaming: ERROR:  replication slot "node_a_slot" does not exist
2021-08-27 08:47:06.912 CST [15336] FATAL:  could not start WAL streaming: ERROR:  replication slot "node_a_slot" does not exist
2021-08-27 08:47:11.918 CST [15341] FATAL:  could not start WAL streaming: ERROR:  replication slot "node_a_slot" does not exist
2021-08-27 08:47:16.925 CST [15346] FATAL:  could not start WAL streaming: ERROR:  replication slot "node_a_slot" does not exist
2021-08-27 08:47:21.932 CST [15351] FATAL:  could not start WAL streaming: ERROR:  replication slot "node_a_slot" does not exist
2021-08-27 08:47:26.948 CST [15377] FATAL:  could not start WAL streaming: ERROR:  replication slot "node_a_slot" does not exist
2021-08-27 08:47:31.942 CST [15383] FATAL:  could not start WAL streaming: ERROR:  replication slot "node_a_slot" does not exist
2021-08-27 08:47:36.950 CST [15388] FATAL:  could not start WAL streaming: ERROR:  replication slot "node_a_slot" 

-- 在新的备库上,将recovery.conf中的复制槽删除掉,主备正常(其实这个时候,数据是不能同步的,在后面的日志中,可以看到replication terminated by primary server、End of WAL reached on timeline 1 at 0/11000098.表明这个日志时间线之后的日志不能同步)

mydb=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 78518
usesysid         | 16406
usename          | repl
application_name | walreceiver
client_addr      | 192.168.2.80
client_hostname  | 
client_port      | 29673
backend_start    | 2021-08-27 08:48:44.874848+08
backend_xmin     | 
state            | startup
sent_lsn         | 0/11000098
write_lsn        | 0/11000098
flush_lsn        | 0/11000098
replay_lsn       | 0/11000098
write_lag        | 
flush_lag        | 
replay_lag       | 
sync_priority    | 0
sync_state       | async

mydb=# 

-- 新的主库上插入数据,备库上显示数据没有同步过来,同时,新的备库上显示"replication terminated by primary server"

mydb=# insert into pub_t1 values(2,'bbb',now());
INSERT 0 1
mydb=# select * from pub_t1;
-[ RECORD 1 ]------------------------
id       | 1
name     | aaa
crt_time | 2021-08-27 08:17:23.672054
-[ RECORD 2 ]------------------------
id       | 2
name     | bbb
crt_time | 2021-08-27 08:56:33.630264

mydb=# 

mydb=# select * from pub_t1;
 id | name |          crt_time          
----+------+----------------------------
  1 | aaa  | 2021-08-27 08:17:23.672054
(1 row)

mydb=# 

2021-08-27 09:12:05.575 CST [16448] LOG:  replication terminated by primary server
2021-08-27 09:12:05.575 CST [16448] DETAIL:  End of WAL reached on timeline 1 at 0/11000098.
2021-08-27 09:12:10.581 CST [16448] LOG:  restarted WAL streaming at 0/11000000 on timeline 1
2021-08-27 09:12:10.582 CST [16448] LOG:  replication terminated by primary server
2021-08-27 09:12:10.582 CST [16448] DETAIL:  End of WAL reached on timeline 1 at 0/11000098.

-- 经过多次的测试,主库和备库切换后,虽然状态为in production和in archive recovery,但是后续是无法继续同步的。需要使用pg_rewind修复新的备库 

[postgres@test1 /opt/PostgreSQL/10/data/log]$tail -f postgresql-2021-08-27_104842.log
2021-08-27 10:49:37.163 CST [85899] DETAIL:  End of WAL reached on timeline 2 at 0/14000098.
2021-08-27 10:49:42.169 CST [85899] LOG:  restarted WAL streaming at 0/14000000 on timeline 2
2021-08-27 10:49:42.169 CST [85899] LOG:  replication terminated by primary server
2021-08-27 10:49:42.169 CST [85899] DETAIL:  End of WAL reached on timeline 2 at 0/14000098.
2021-08-27 10:49:47.181 CST [85899] LOG:  restarted WAL streaming at 0/14000000 on timeline 2
2021-08-27 10:49:47.181 CST [85899] LOG:  replication terminated by primary server
2021-08-27 10:49:47.181 CST [85899] DETAIL:  End of WAL reached on timeline 2 at 0/14000098.
2021-08-27 10:49:52.182 CST [85899] LOG:  restarted WAL streaming at 0/14000000 on timeline 2
2021-08-27 10:49:52.182 CST [85899] LOG:  replication terminated by primary server
2021-08-27 10:49:52.182 CST [85899] DETAIL:  End of WAL reached on timeline 2 at 0/14000098.
2021-08-27 10:49:57.188 CST [85899] LOG:  restarted WAL streaming at 0/14000000 on timeline 2

-- 使用pg_rewind ,将新的备库干净关闭。新的主库不用关闭。注意设置参数wal_log_hints。然后在新的备库上进行pg_rewind,在pg_rewind
的过程中,可以看到从这个0/11000028位置开始重新复制

[postgres@test /opt/PostgreSQL/10/data]$pg_rewind --target-pgdata=$PGDATA --source-server='host=192.168.2.81 port=5432 user=postgres password=oracle' -P
connected to server

target server needs to use either data checksums or "wal_log_hints = on"
Failure, exiting
[postgres@test /opt/PostgreSQL/10/data]$


[postgres@test /opt/PostgreSQL/10/data]$pg_ctl stop
waiting for server to shut down.... done
server stopped
[postgres@test /opt/PostgreSQL/10/data]$pg_rewind --target-pgdata=$PGDATA --source-server='host=192.168.2.81 port=5432 user=postgres password=oracle' -P
connected to server
servers diverged at WAL location 0/11000098 on timeline 1
rewinding from last common checkpoint at 0/11000028 on timeline 1
reading source file list
reading target file list
reading WAL in target
need to copy 103 MB (total source directory size is 135 MB)
105492/105492 kB (100%) copied
creating backup label and updating control file
syncing target data directory
Done!
[postgres@test /opt/PostgreSQL/10/data]$

-- backup_label文件

[postgres@test /opt/PostgreSQL/10/data]$more backup_label.old 
START WAL LOCATION: 0/11000028 (file 000000010000000000000011)
CHECKPOINT LOCATION: 0/11000028
BACKUP METHOD: pg_rewind
BACKUP FROM: standby
START TIME: 2021-08-27 09:38:25 CST
[postgres@test /opt/PostgreSQL/10/data]$

-- 编辑recovery.conf并启动新的备库 (可以看到新的备库上有个recovery.done文件)

[postgres@test /opt/PostgreSQL/10/data]$ll recover*
-rw------- 1 postgres postgres 172 Aug 27 09:38 recovery.done
[postgres@test /opt/PostgreSQL/10/data]$

[postgres@test /opt/PostgreSQL/10/data]$more recovery.done 
standby_mode = 'on'
primary_conninfo = 'user=repl password=oracle host=192.168.2.80 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_s
ession_attrs=any'
[postgres@test /opt/PostgreSQL/10/data]$cp recovery.done recovery.conf   -- 注意修改里面的IP地址 
[postgres@test /opt/PostgreSQL/10/data]$pg_ctl start
waiting for server to start....2021-08-27 09:41:54.533 CST [18217] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-08-27 09:41:54.534 CST [18217] LOG:  listening on IPv6 address "::", port 5432
2021-08-27 09:41:54.536 CST [18217] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-08-27 09:41:54.554 CST [18217] LOG:  redirecting log output to logging collector process
2021-08-27 09:41:54.554 CST [18217] HINT:  Future log output will appear in directory "log".
 done
server started
[postgres@test /opt/PostgreSQL/10/data]$

-- 验证新的备库的数据,可以看到id=2的记录已经同步过来 。

postgres=# \c mydb
You are now connected to database "mydb" as user "postgres".
mydb=# select * from pub_t1;
 id | name |          crt_time          
----+------+----------------------------
  1 | aaa  | 2021-08-27 08:17:23.672054
  2 | bbb  | 2021-08-27 08:56:33.630264
(2 rows)

mydb=# 

--再次验证数据 ,数据同步正常了  

mydb=# insert into pub_t1 values(3,'cc',now());
INSERT 0 1
mydb=# select * from pub_t1;
 id | name |          crt_time          
----+------+----------------------------
  1 | aaa  | 2021-08-27 08:17:23.672054
  2 | bbb  | 2021-08-27 08:56:33.630264
  3 | cc   | 2021-08-27 09:43:39.673485
(3 rows)

mydb=# 

[postgres@test /opt/PostgreSQL/10/data]$psql
Password: 
psql.bin (10.15)
Type "help" for help.

postgres=# \c mydb
You are now connected to database "mydb" as user "postgres".
mydb=# select * from pub_t1;
 id | name |          crt_time          
----+------+----------------------------
  1 | aaa  | 2021-08-27 08:17:23.672054
  2 | bbb  | 2021-08-27 08:56:33.630264
  3 | cc   | 2021-08-27 09:43:39.673485
(3 rows)

mydb=# 

END

本文标签: Postgrespgrewind