Table of Contents
  1. Introduction
  2. Final Result
  3. System Architecture
  4. Sync system time and change hostnames
  5. Install Pacemaker + Corosync
  6. Start pcsd and enable it at boot
  7. pcs cluster authentication
  8. Sync cluster configuration
  9. Start the cluster
  10. Verify Pacemaker and corosync
    1. Verify Pacemaker
    2. Verify corosync
  11. Install pgsql 9.6.6 from source and configure master/slave replication
    1. Install required dependencies
    2. Create the postgres account
    3. Extract, compile, and install pgsql 9.6.6
    4. Set environment variables
    5. Initialize the pgsql database on pgsql1
    6. Modify configuration files on pgsql1, pgsql2, and pgsql3
    7. Configure postgresql.conf on pgsql1 according to your system
    8. Configure pg_hba.conf on pgsql1
    9. Start pgsql on pgsql1 and configure the password
    10. Configure the slaves (run on pgsql2 and pgsql3)
    11. Create recovery.conf
    12. Start pgsql2 and pgsql3 and check cluster status
  12. Configure Pacemaker automatic failover and VIPs
  13. Check cluster status
  14. Failover
  15. Recover the killed pgsql
  16. Notes
  17. End

pgsql Hot-Standby High-Availability Deployment Guide

Introduction

In 2017 our cloud department went through its fourth architecture change. Awkward. With the company's architecture shifting yet again, my tech stack got another round of updates, which is not a bad thing: however the architecture changes, building up your own skills is the best way to grow. A new project team joined our cloud department, and they use PGSQL. High availability used to be the vendor's job, but since my manager knew I was around, the task landed on me. T_T. On to the main content.

Final Result

Build a pgsql cluster with one master and multiple slaves, read/write splitting (writes go to the master, reads to the slaves), and two VIPs, one for writes and one for reads.

System Architecture

Server IP     Hostname   OS        Role
10.10.0.128   pgsql1     CentOS 7  master
10.10.0.129   pgsql2     CentOS 7  slave
10.10.0.130   pgsql3     CentOS 7  slave

VIP           Role        Purpose
10.10.0.200   master-vip  virtual IP for the master
10.10.0.201   slave-vip   virtual IP for the slaves

Sync system time and change hostnames

# Rebuild the yum cache
yum clean all && yum clean metadata && yum clean dbcache && yum makecache
# Change the hostname
echo 'pgsql1' > /etc/hostname # use the matching name on each host (pgsql1/pgsql2/pgsql3)
# Set SELinux to permissive
setenforce 0
sed -i.bak "s/SELINUX=enforcing/SELINUX=permissive/g" /etc/selinux/config
# Stop and disable the firewall
systemctl disable firewalld.service
systemctl stop firewalld.service
# Install ntp and sync the time
yum -y install ntp
/usr/sbin/ntpdate asia.pool.ntp.org && /sbin/hwclock -w
# Install common tools
yum install -y sysstat net-tools vim
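
Pacemaker and corosync identify nodes by name, and the pcs status output later in this guide even warns when node names and the IPs used during setup do not line up, so it is worth mapping all three hostnames on every node. A minimal, idempotent sketch; it writes a local hosts.fragment file purely for illustration, on real nodes the same three lines belong in /etc/hosts:

```shell
# Build the name/IP mappings used throughout this guide.
# hosts.fragment stands in for /etc/hosts in this sketch.
hosts_fragment="hosts.fragment"
: > "$hosts_fragment"
for entry in "10.10.0.128 pgsql1" "10.10.0.129 pgsql2" "10.10.0.130 pgsql3"; do
    # idempotent: only add the mapping if it is not already present
    grep -qF "$entry" "$hosts_fragment" || printf '%s\n' "$entry" >> "$hosts_fragment"
done
cat "$hosts_fragment"
```

Because the loop skips entries that already exist, the snippet is safe to rerun on a node that is already configured.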

Install Pacemaker + Corosync

yum install -y pacemaker pcs psmisc policycoreutils-python

Start pcsd and enable it at boot

systemctl start pcsd.service
systemctl enable pcsd.service

Change the hacluster user's password to hacluster on all nodes:

echo hacluster | sudo passwd hacluster --stdin

pcs cluster authentication

Run this on any one of the machines.

[root@pgsql1 ~]# pcs cluster auth -u hacluster -p hacluster 10.10.0.128 10.10.0.129 10.10.0.130

Command output:

[root@pgsql1 ~]# pcs cluster auth -u hacluster -p hacluster 10.10.0.128 10.10.0.129 10.10.0.130
10.10.0.128: Authorized
10.10.0.129: Authorized
10.10.0.130: Authorized

This means all nodes authenticated successfully.

Note: if redis is present on the machine, run gem install orderedhash first, otherwise authentication will fail with an error.

Sync cluster configuration

pcs cluster setup --last_man_standing=1 --name pgcluster 10.10.0.128 10.10.0.129 10.10.0.130

Command output (the command can take a while to finish; be patient):

[root@pgsql1 ~]# pcs cluster setup --last_man_standing=1 --name pgcluster 10.10.0.128 10.10.0.129 10.10.0.130
Destroying cluster on nodes: 10.10.0.128, 10.10.0.129, 10.10.0.130...
10.10.0.128: Stopping Cluster (pacemaker)...
10.10.0.130: Stopping Cluster (pacemaker)...
10.10.0.129: Stopping Cluster (pacemaker)...
10.10.0.130: Successfully destroyed cluster
10.10.0.129: Successfully destroyed cluster
10.10.0.128: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to '10.10.0.128', '10.10.0.129', '10.10.0.130'
10.10.0.128: successful distribution of the file 'pacemaker_remote authkey'
10.10.0.129: successful distribution of the file 'pacemaker_remote authkey'
10.10.0.130: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
10.10.0.128: Succeeded
10.10.0.129: Succeeded
10.10.0.130: Succeeded

Synchronizing pcsd certificates on nodes 10.10.0.128, 10.10.0.129, 10.10.0.130...
10.10.0.128: Success
10.10.0.129: Success
10.10.0.130: Success

This indicates the sync finished.

Start the cluster

pcs cluster start --all

Command output:

[root@pgsql1 ~]# pcs cluster start --all
10.10.0.130: Starting Cluster...
10.10.0.128: Starting Cluster...
10.10.0.129: Starting Cluster...

Verify Pacemaker and corosync

Verify Pacemaker

pcs status

Command output:

[root@pgsql1 ~]# pcs status
Cluster name: pgcluster
WARNING: no stonith devices and stonith-enabled is not false
WARNING: corosync and pacemaker node names do not match (IPs used in setup?)
Stack: corosync
Current DC: pgsql1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Sat Jan 6 21:01:34 2018
Last change: Sat Jan 6 20:58:21 2018 by hacluster via crmd on pgsql1

3 nodes configured
0 resources configured

Online: [ pgsql1 pgsql2 pgsql3 ]

No resources


Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled

If the Online line lists all the nodes (Online: [ pgsql1 pgsql2 pgsql3 ]), Pacemaker is working.

Verify corosync

pcs status corosync

Command output:

[root@pgsql1 ~]# pcs status corosync

Membership information
----------------------
Nodeid Votes Name
1 1 10.10.0.128 (local)
2 1 10.10.0.129
3 1 10.10.0.130

All three nodes appear in the membership list, each with one vote, which means corosync is working.

Install pgsql 9.6.6 from source and configure master/slave replication

Install required dependencies

yum install -y gcc.x86_64 glibc.x86_64 glibc-devel.x86_64 vim-enhanced.x86_64 gcc-java apr apr-devel openssl openssl-devel libgcc.x86_64 java-1.8.0-openjdk.x86_64 java-1.8.0-openjdk-devel.x86_64 perl-Module-Install.noarch
yum install -y readline-devel.x86_64

Create the postgres account

groupadd postgres
useradd -g postgres postgres

Extract, compile, and install pgsql 9.6.6

The pgsql source tarball has already been placed in /tmp.

[root@pgsql3 tmp]# tar -zxf postgresql-9.6.6.tar.gz
[root@pgsql3 tmp]# cd postgresql-9.6.6
[root@pgsql3 postgresql-9.6.6]# ./configure --prefix=/home/postgres
[root@pgsql3 postgresql-9.6.6]# make
[root@pgsql3 postgresql-9.6.6]# make install
[root@pgsql3 postgresql-9.6.6]# cd contrib/
[root@pgsql3 contrib]# make
[root@pgsql3 contrib]# make install

--prefix sets the install directory. It goes under /home here because /home was given plenty of space during OS installation; the location itself is flexible, what really matters is that the pgsql data directory later lives somewhere with ample space.

Give the postgres user ownership of the entire install directory:

chown -R postgres.postgres /home/postgres/

Set environment variables

Edit the .bash_profile file in the user's home directory; the main point is to set the PGDATA variable.

Switch to the postgres account:

[root@pgsql1 ]# su - postgres

Create the needed directories and set permissions:

[postgres@pgsql2 ~]$ mkdir -p {data,xlog_archive}
[postgres@pgsql2 ~]$ chmod 0700 data

Edit the user's profile:

[postgres@pgsql1 ]# vim .bash_profile

Add the following:

PGHOME=/home/postgres
export PGHOME
PGDATA=$PGHOME/data
export PGDATA
PATH=$PATH:$HOME/.local/bin:$HOME/bin:$PGHOME/bin
export PATH

When you are done, save and exit (Esc, then :wq), and reload the profile:

[postgres@pg1 ]# source .bash_profile

Initialize the pgsql database on pgsql1

initdb -D $PGDATA

Command output:

[postgres@pgsql1 ~]$ echo $PGDATA
/home/postgres/data
[postgres@pgsql1 ~]$ initdb -D $PGDATA
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /home/postgres/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

pg_ctl -D /home/postgres/data -l logfile start

This output indicates the initialization succeeded.

Modify configuration files on pgsql1, pgsql2, and pgsql3

Create the required directories:

mkdir /var/log/pgsql
chown -R postgres:postgres /var/log/pgsql/
mkdir /var/run/pgsql
chown -R postgres:postgres /var/run/pgsql
mkdir -p /var/log/pg_log
chown postgres:postgres /var/log/pg_log/
chmod 700 /var/log/pg_log/
mkdir -p /var/lib/pgsql/tmp
# Run this line only on pgsql1
echo "synchronous_standby_names = 'pgsql3,pgsql2'" > /var/lib/pgsql/tmp/rep_mode.conf
# Run this line on pgsql2 and pgsql3
echo "synchronous_standby_names = ''" > /var/lib/pgsql/tmp/rep_mode.conf
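
The two per-node echo lines above can be folded into one snippet that is safe to run unchanged on any of the three nodes. A sketch only: the NODE_NAME override and the local tmpdir directory are illustration-only stand-ins (the real file lives in /var/lib/pgsql/tmp), and the hostname-to-role mapping follows this guide.

```shell
# Write the right rep_mode.conf for whichever node this runs on.
node="${NODE_NAME:-$(hostname -s)}"
mkdir -p tmpdir   # stands in for /var/lib/pgsql/tmp in this sketch
if [ "$node" = "pgsql1" ]; then
    # the master lists its synchronous standby candidates
    echo "synchronous_standby_names = 'pgsql3,pgsql2'" > tmpdir/rep_mode.conf
else
    # standbys start with an empty value
    echo "synchronous_standby_names = ''" > tmpdir/rep_mode.conf
fi
cat tmpdir/rep_mode.conf
```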

Configure postgresql.conf on pgsql1 according to your system

listen_addresses = '*'		# what IP address(es) to listen on;
max_connections = 1000 # (change requires restart)
shared_buffers = 128MB # min 128kB
huge_pages = try # on, off, or try
temp_buffers = 16MB # min 800kB
work_mem = 4MB # min 64kB
maintenance_work_mem = 64MB # min 1MB
dynamic_shared_memory_type = posix # the default is the first option
wal_level = replica # minimal, replica, or logical
fsync = off # flush data to disk for crash safety
synchronous_commit = local # synchronization level;

full_page_writes = on # recover from partial page writes
wal_buffers = 16MB # min 32kB, -1 sets based on shared_buffers
wal_writer_delay = 200ms # 1-10000 milliseconds

commit_siblings = 5 # range 1-1000

archive_mode = on # enables archiving; off, on, or always
archive_command = 'cp %p /home/postgres/xlog_archive/%f' # command to use to archive a logfile segment
max_wal_senders = 16 # max number of walsender processes
wal_keep_segments = 512 # in logfile segments, 16MB each; 0 disables
hot_standby = on # "on" allows queries during recovery
logging_collector = on # Enable capturing of stderr and csvlog
log_directory = '/var/log/pg_log' # directory where log files are written,

log_truncate_on_rotation = off # If on, an existing log file with the
log_rotation_age = 1d # Automatic rotation of logfiles will
log_rotation_size = 10MB # Automatic rotation of logfiles will

log_timezone = 'PRC'

datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'en_US.utf8' # locale for system error message
lc_monetary = 'en_US.utf8' # locale for monetary formatting
lc_numeric = 'en_US.utf8' # locale for number formatting
lc_time = 'en_US.utf8' # locale for time formatting
default_text_search_config = 'pg_catalog.english'
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 1000
pg_stat_statements.track = all
include '/var/lib/pgsql/tmp/rep_mode.conf'

Configure pg_hba.conf on pgsql1

Append the following three lines:

host    replication     postgres        10.10.0.0/24            trust
host    replication     replicator      10.10.0.0/24            trust
host    all             all             0.0.0.0/0               trust

Start pgsql on pgsql1 and configure the password

pg_ctl start -D $PGDATA

Command output:

server starting
[postgres@pgsql1 ~]$ LOG: redirecting log output to logging collector process
HINT: Future log output will appear in directory "/var/log/pg_log".

Create the replication account replicator with the password Dzer0:

[postgres@pgsql1 ~]$ psql -U postgres
psql (9.6.6)
Type "help" for help.

postgres=# create role replicator with login replication password 'Dzer0';
CREATE ROLE
postgres=# \q

Configure the slaves (run on pgsql2 and pgsql3)

su - postgres
pg_basebackup -h 10.10.0.128 -U postgres -D $PGDATA -X stream -P

Output:

[root@pgsql3 contrib]# su - postgres
Last login: Sat Jan 6 21:37:38 CST 2018 on pts/1
[postgres@pgsql3 ~]$ pg_basebackup -h 10.10.0.128 -U postgres -D $PGDATA -X stream -P
22827/22827 kB (100%), 1/1 tablespace

Create recovery.conf

vim $PGDATA/recovery.conf

With the following content:

standby_mode = 'on'
primary_conninfo = 'host=10.10.0.128 port=5432 user=replicator password=Dzer0 application_name=node2' # change application_name on each node (node2 on pgsql2, node3 on pgsql3)
restore_command = ''
recovery_target_timeline = 'latest'
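
Since application_name has to differ per node, generating the file can avoid copy-paste mistakes. A sketch under the assumption that pgsql2 maps to node2 and pgsql3 to node3 as in this guide; NODE_NAME is an illustration-only override for the hostname:

```shell
# Derive the per-node application_name from the hostname's trailing digit
# (pgsql2 -> node2, pgsql3 -> node3), then write recovery.conf in place.
node_host="${NODE_NAME:-$(hostname -s)}"
app_name="node$(printf '%s' "$node_host" | tail -c 1)"
cat > recovery.conf <<EOF
standby_mode = 'on'
primary_conninfo = 'host=10.10.0.128 port=5432 user=replicator password=Dzer0 application_name=${app_name}'
restore_command = ''
recovery_target_timeline = 'latest'
EOF
cat recovery.conf
```

On the real standbys the file goes to $PGDATA/recovery.conf; the sketch writes it to the current directory.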

Start pgsql2 and pgsql3 and check cluster status

pg_ctl start -D $PGDATA

Log in to pgsql1 and check the replication status:

[postgres@pgsql1 data]$  su - postgres
Password:
[postgres@pgsql1 data]$ psql -U postgres
psql (9.6.6)
Type "help" for help.

postgres=# select * from pg_stat_replication ;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+------------+------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
62339 | 16384 | replicator | node3 | 10.10.0.130 | | 59467 | 2018-01-06 22:44:27.958245+08 | | streaming | 0/12000060 | 0/12000060 | 0/12000060 | 0/12000060 | 0 | async
62374 | 16384 | replicator | node3 | 10.10.0.129 | | 34777 | 2018-01-06 22:45:52.61649+08 | | streaming | 0/12000060 | 0/12000060 | 0/12000060 | 0/12000060 | 0 | async
(2 rows)

postgres=#

This shows the cluster is built and replicating normally.
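
A small scripted version of the check above can be handy for monitoring. The sketch inlines a sample of psql's unaligned output so the logic is self-contained; on the real master you would capture the rows with psql -Atc "select client_addr, state from pg_stat_replication" instead:

```shell
# Count how many standbys are actively streaming.
rows='10.10.0.130|streaming
10.10.0.129|streaming'
streaming=$(printf '%s\n' "$rows" | grep -c '|streaming$')
echo "standbys streaming: $streaming"
```

With both slaves attached, the count should be 2; anything less means a standby has dropped off.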

Configure Pacemaker automatic failover and VIPs

Because there are quite a few commands, they are collected into a script here.
Create the script:

vim cluster_setup.sh

With the contents below; adjust values to your actual environment where needed:

# Save the current CIB to a file
pcs cluster cib pgsql_cfg
# Ignore quorum loss at the pacemaker level
pcs -f pgsql_cfg property set no-quorum-policy="ignore"
# Disable STONITH
pcs -f pgsql_cfg property set stonith-enabled="false"
# Resource stickiness: keep resources where they are after a node recovers
pcs -f pgsql_cfg resource defaults resource-stickiness="INFINITY"
# Number of failures before a resource is migrated
pcs -f pgsql_cfg resource defaults migration-threshold="2"
# Virtual IP for the master node
pcs -f pgsql_cfg resource create vip-master IPaddr2 ip="10.10.0.200" nic="eno16777736" cidr_netmask="24" op start timeout="60s" interval="0s" on-fail="restart" op monitor timeout="60s" interval="10s" on-fail="restart" op stop timeout="60s" interval="0s" on-fail="block"
# Virtual IP for the slave nodes
pcs -f pgsql_cfg resource create vip-slave IPaddr2 ip="10.10.0.201" nic="eno16777736" cidr_netmask="24" op start timeout="60s" interval="0s" on-fail="restart" op monitor timeout="60s" interval="10s" on-fail="restart" op stop timeout="60s" interval="0s" on-fail="block"
# The pgsql cluster resource
# adjust pgctl, psql, pgdata, config, etc. to your environment
pcs -f pgsql_cfg resource create pgsql pgsql pgctl="/home/postgres/bin/pg_ctl" psql="/home/postgres/bin/psql" pgdata="/home/postgres/data/" config="/home/postgres/data/postgresql.conf" rep_mode="sync" node_list="pgsql1 pgsql2 pgsql3" master_ip="10.10.0.200" repuser="replicator" primary_conninfo_opt="password=Dzer0 keepalives_idle=60 keepalives_interval=5 keepalives_count=5" restore_command="cp /home/postgres/xlog_archive/%f %p" restart_on_promote='true' op start timeout="60s" interval="0s" on-fail="restart" op monitor timeout="60s" interval="4s" on-fail="restart" op monitor timeout="60s" interval="3s" on-fail="restart" role="Master" op promote timeout="60s" interval="0s" on-fail="restart" op demote timeout="60s" interval="0s" on-fail="stop" op stop timeout="60s" interval="0s" on-fail="block"
# Master/slave mode
pcs -f pgsql_cfg resource master pgsql-cluster pgsql master-max=1 master-node-max=1 clone-max=3 clone-node-max=1 notify=true
# Resource group for the master VIP
pcs -f pgsql_cfg resource group add master-group vip-master
# Resource group for the slave VIP
pcs -f pgsql_cfg resource group add slave-group vip-slave
# Bind the master group to the master node
pcs -f pgsql_cfg constraint colocation add master-group with master pgsql-cluster INFINITY
# Start the master group after promotion
pcs -f pgsql_cfg constraint order promote pgsql-cluster then start master-group symmetrical=false score=INFINITY
# Stop the master group after demotion
pcs -f pgsql_cfg constraint order demote pgsql-cluster then stop master-group symmetrical=false score=0
# Bind the slave group to the slave nodes
pcs -f pgsql_cfg constraint colocation add slave-group with slave pgsql-cluster INFINITY
# Start the slave group after promotion
pcs -f pgsql_cfg constraint order promote pgsql-cluster then start slave-group symmetrical=false score=INFINITY
# Stop the slave group after demotion
pcs -f pgsql_cfg constraint order demote pgsql-cluster then stop slave-group symmetrical=false score=0
# Push the configuration into the live CIB
pcs cluster cib-push pgsql_cfg

Save the script, then run it.

Check cluster status

pcs status

Command output:

[root@pgsql1 tmp]# pcs status
Cluster name: pgcluster
WARNING: corosync and pacemaker node names do not match (IPs used in setup?)
Stack: corosync
Current DC: pgsql2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Sat Jan 6 23:24:47 2018
Last change: Sat Jan 6 23:19:25 2018 by root via crm_attribute on pgsql2

3 nodes configured
5 resources configured

Online: [ pgsql1 pgsql2 pgsql3 ]

Full list of resources:

Master/Slave Set: pgsql-cluster [pgsql]
Masters: [ pgsql2 ]
Slaves: [ pgsql1 pgsql3 ]
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started pgsql2
Resource Group: slave-group
vip-slave (ocf::heartbeat:IPaddr2): Started pgsql3

Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
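
Monitoring scripts often need just the master's name out of that output. A sketch of the extraction; the captured sample below is inlined so the logic is self-contained, on a live node you would use status=$(pcs status) instead:

```shell
# Pull the current master node name out of pcs status output.
status='Master/Slave Set: pgsql-cluster [pgsql]
Masters: [ pgsql2 ]
Slaves: [ pgsql1 pgsql3 ]'
# split on the square brackets, then strip spaces from the bracketed field
master=$(printf '%s\n' "$status" | awk -F'[][]' '/Masters:/ {gsub(/ /, "", $2); print $2}')
echo "current master: $master"
```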

Failover

The status output in the previous section shows the master on pgsql2. Log in to pgsql2, kill the pgsql process directly, and watch where the VIP 10.10.0.200 goes.

[root@pgsql2 tmp]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno16777736: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:0c:29:0f:80:fe brd ff:ff:ff:ff:ff:ff
inet 10.10.0.129/16 brd 10.10.255.255 scope global dynamic eno16777736
valid_lft 1748sec preferred_lft 1748sec
inet 10.10.0.200/24 brd 10.10.255.255 scope global eno16777736
valid_lft forever preferred_lft forever
inet6 fe80::20c:29ff:fe0f:80fe/64 scope link
valid_lft forever preferred_lft forever
[root@pgsql2 tmp]# ps -ef |grep postgres
postgres 82086 1 0 23:18 ? 00:00:00 /home/postgres/bin/postgres -D /home/postgres/data -c config_file=/home/postgres/data/postgresql.conf
postgres 82090 82086 0 23:18 ? 00:00:00 postgres: logger process
postgres 82093 82086 0 23:18 ? 00:00:00 postgres: checkpointer process
postgres 82094 82086 0 23:18 ? 00:00:00 postgres: writer process
postgres 82095 82086 0 23:18 ? 00:00:00 postgres: wal writer process
postgres 82096 82086 0 23:18 ? 00:00:00 postgres: autovacuum launcher process
postgres 82097 82086 0 23:18 ? 00:00:00 postgres: archiver process
postgres 82098 82086 0 23:18 ? 00:00:00 postgres: stats collector process
postgres 82863 82086 0 23:18 ? 00:00:00 postgres: wal sender process replicator 10.10.0.130(58485) streaming 0/17000220
postgres 85025 82086 0 23:19 ? 00:00:00 postgres: wal sender process replicator 10.10.0.128(40217) streaming 0/17000220
root 115329 41431 0 23:28 pts/1 00:00:00 grep --color=auto postgres
[root@pgsql2 tmp]# kill 82086
[root@pgsql2 tmp]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno16777736: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:0c:29:0f:80:fe brd ff:ff:ff:ff:ff:ff
inet 10.10.0.129/16 brd 10.10.255.255 scope global dynamic eno16777736
valid_lft 1739sec preferred_lft 1739sec
inet6 fe80::20c:29ff:fe0f:80fe/64 scope link
valid_lft forever preferred_lft forever

As the output shows, 10.10.0.200 is gone from pgsql2.

10.10.0.200 has drifted to pgsql3, and pgsql1 has taken over 10.10.0.201.
Check the IPs on pgsql3:

[root@pgsql3 contrib]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno16777736: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:0c:29:eb:e5:76 brd ff:ff:ff:ff:ff:ff
inet 10.10.0.130/16 brd 10.10.255.255 scope global dynamic eno16777736
valid_lft 1674sec preferred_lft 1674sec
inet 10.10.0.200/24 brd 10.10.255.255 scope global eno16777736
valid_lft forever preferred_lft forever
inet6 fe80::20c:29ff:feeb:e576/64 scope link
valid_lft forever preferred_lft forever
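
Eyeballing ip a works, but the VIP check is easy to script. A sketch; the ip_output sample is the pgsql3 capture from above, and on a live node you would pipe ip a straight into the grep:

```shell
# Does this node currently hold the master VIP 10.10.0.200?
ip_output='inet 10.10.0.130/16 brd 10.10.255.255 scope global dynamic eno16777736
inet 10.10.0.200/24 brd 10.10.255.255 scope global eno16777736'
if printf '%s\n' "$ip_output" | grep -q 'inet 10\.10\.0\.200/'; then
    echo "master VIP present"
else
    echo "master VIP absent"
fi
```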

Check the IPs on pgsql1:

[root@pgsql1 tmp]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno16777736: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:0c:29:cf:9c:4c brd ff:ff:ff:ff:ff:ff
inet 10.10.0.128/16 brd 10.10.255.255 scope global dynamic eno16777736
valid_lft 1256sec preferred_lft 1256sec
inet 10.10.0.201/24 brd 10.10.255.255 scope global eno16777736
valid_lft forever preferred_lft forever
inet6 fe80::20c:29ff:fecf:9c4c/64 scope link
valid_lft forever preferred_lft forever

Recover the killed pgsql

Because the process was killed outright, the node needs a fresh pg_basebackup before it can rejoin; the exact steps follow.

At this point pcs status shows:

[root@pgsql1 tmp]# pcs status
Cluster name: pgcluster
WARNING: corosync and pacemaker node names do not match (IPs used in setup?)
Stack: corosync
Current DC: pgsql2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Sat Jan 6 23:36:06 2018
Last change: Sat Jan 6 23:28:35 2018 by root via crm_attribute on pgsql3

3 nodes configured
5 resources configured

Online: [ pgsql1 pgsql2 pgsql3 ]

Full list of resources:

Master/Slave Set: pgsql-cluster [pgsql]
Masters: [ pgsql3 ]
Slaves: [ pgsql1 ]
Stopped: [ pgsql2 ]
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started pgsql3
Resource Group: slave-group
vip-slave (ocf::heartbeat:IPaddr2): Started pgsql1

Failed Actions:
* pgsql_start_0 on pgsql2 'unknown error' (1): call=227, status=complete, exitreason='My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
last-rc-change='Sat Jan 6 23:28:28 2018', queued=0ms, exec=301ms


Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled

The message "My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start" tells you that Pacemaker cannot start pgsql because the PGSQL.lock file is still present.

So the recovery procedure is as follows:

  1. Empty the pgsql data directory

    [root@pgsql2 ~]# su - postgres
    Last login: Sat Jan 6 23:34:54 CST 2018 on pts/1
    [postgres@pgsql2 ~]$ cd $PGDATA
    [postgres@pgsql2 data]$ ls
    backup_label.old global pg_commit_ts pg_hba.conf pg_logical pg_notify pg_serial pg_stat pg_subtrans pg_twophase pg_xlog postgresql.conf postmaster.opts
    base pg_clog pg_dynshmem pg_ident.conf pg_multixact pg_replslot pg_snapshots pg_stat_tmp pg_tblspc PG_VERSION postgresql.auto.conf postgresql.conf.bak recovery.conf
    [postgres@pgsql2 data]$ rm -rf *
    [postgres@pgsql2 data]$ ls
    [postgres@pgsql2 data]$
  2. Resync the data

    [postgres@pgsql2 data]$ pg_basebackup -h 10.10.0.200 -U postgres -D $PGDATA -X stream -P
    22828/22828 kB (100%), 1/1 tablespace
    [postgres@pgsql2 data]$ ls
    backup_label base pg_clog pg_dynshmem pg_ident.conf pg_multixact pg_replslot pg_snapshots pg_stat_tmp pg_tblspc PG_VERSION postgresql.auto.conf postgresql.conf.bak
    backup_label.old global pg_commit_ts pg_hba.conf pg_logical pg_notify pg_serial pg_stat pg_subtrans pg_twophase pg_xlog postgresql.conf
    [postgres@pgsql2 data]$
  3. On pgsql2, go to /var/lib/pgsql/tmp and delete the PGSQL.lock file (it exists to protect pgsql data consistency)

    [root@pgsql2 ~]# cd /var/lib/pgsql/tmp/
    [root@pgsql2 tmp]# ls
    PGSQL.lock rep_mode.conf
    [root@pgsql2 tmp]# rm -rf PGSQL.lock
    [root@pgsql2 tmp]# ls
    rep_mode.conf
    [root@pgsql2 tmp]#
  4. Start pgsql

    [postgres@pgsql2 data]$ pg_ctl start -D $PGDATA
    server starting
    [postgres@pgsql2 data]$ LOG: redirecting log output to logging collector process
    HINT: Future log output will appear in directory "/var/log/pg_log".
  5. Tell pcs the resource has recovered

    [root@pgsql2 ~]# pcs resource cleanup
    Waiting for 1 replies from the CRMd. OK
  6. Verify:

    [root@pgsql2 ~]# pcs status
    Cluster name: pgcluster
    WARNING: corosync and pacemaker node names do not match (IPs used in setup?)
    Stack: corosync
    Current DC: pgsql2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
    Last updated: Sat Jan 6 23:41:24 2018
    Last change: Sat Jan 6 23:28:35 2018 by root via crm_attribute on pgsql3

    3 nodes configured
    5 resources configured

    Online: [ pgsql1 pgsql2 pgsql3 ]

    Full list of resources:

    Master/Slave Set: pgsql-cluster [pgsql]
    Masters: [ pgsql3 ]
    Slaves: [ pgsql1 pgsql2 ]
    Resource Group: master-group
    vip-master (ocf::heartbeat:IPaddr2): Started pgsql3
    Resource Group: slave-group
    vip-slave (ocf::heartbeat:IPaddr2): Started pgsql1

    Daemon Status:
    corosync: active/disabled
    pacemaker: active/disabled
    pcsd: active/enabled
    [root@pgsql2 ~]#
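
For quick reference, the recovery commands above collapse into a short ordered plan. This sketch only prints the plan rather than executing it; on the failed node, run the first four lines as postgres and pcs resource cleanup as root:

```shell
# The node-recovery sequence from the steps above, in order.
plan='rm -rf /home/postgres/data/*
pg_basebackup -h 10.10.0.200 -U postgres -D /home/postgres/data -X stream -P
rm -f /var/lib/pgsql/tmp/PGSQL.lock
pg_ctl start -D /home/postgres/data
pcs resource cleanup'
printf '%s\n' "$plan"
```

Note that pg_basebackup points at the master VIP 10.10.0.200 rather than a node IP, so the plan keeps working no matter which node is currently master.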

Notes

  • Steps 6, 7, and 8 (cluster authentication, config sync, and cluster start) are all executed on the pgsql1 host.

  • You can also install crm, a command-line management tool for Pacemaker.

To install it:

cd /etc/yum.repos.d/
wget http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-7/network:ha-clustering:Stable.repo
# Install the tools
yum -y install crmsh pssh

Some crm usage:

[root@pgsql1 tmp]# crm
crm(live)# status
Stack: corosync
Current DC: pgsql2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Sat Jan 6 23:46:11 2018
Last change: Sat Jan 6 23:28:35 2018 by root via crm_attribute on pgsql3

3 nodes configured
5 resources configured

Online: [ pgsql1 pgsql2 pgsql3 ]

Full list of resources:

Master/Slave Set: pgsql-cluster [pgsql]
Masters: [ pgsql3 ]
Slaves: [ pgsql1 pgsql2 ]
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started pgsql3
Resource Group: slave-group
vip-slave (ocf::heartbeat:IPaddr2): Started pgsql1

crm(live)# edit    # edit the cluster configuration

End

Whew, this article took me a solid four hours. Pretty impressed with myself, haha.
