High-Availability Systems

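A long time ago, when the company was at its peak and still had an operations team, we would discuss the deployment plan with the ops engineers whenever we developed an application. The two terms they used most were load balancing and IP failover ("IP drift"): the former spreads the load across the application servers so that no single server is overwhelmed, and the latter keeps the system available when the load balancer itself goes down. Since all of this was handled by operations, we developers focused on implementing application features and paid little attention to deployment details; we only knew these things existed. Now the company has declined, and we have to do many of these things ourselves.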

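Load balancing essentially relieves pressure on the backend application servers and improves the system's scalability. There are many ways to do it: some balance on the client side, some write their own dispatcher, and the industry commonly uses F5 hardware load balancers (expensive and high-end) or open-source solutions such as LVS, haproxy and nginx. LVS works at layer 4, while haproxy and nginx support both layer-4 and layer-7 load balancing. LVS is built into the Linux kernel, but I have never used it. I once peeked at the internal data structures of haproxy and nginx without studying them in depth. nginx supports module development, so if you know nginx well, building your own module on top of it is quite an efficient approach. This time I test nginx's layer-4 load balancing. Note, however, that open-source nginx only performs passive ("lazy") health checks: a backend is marked as failed only after real client connections to it fail (governed by max_fails and fail_timeout), so nginx cannot detect a dead backend proactively. For this reason some people have written a dedicated module, nginx_upstream_check_module (it has not been updated for a long time, and I am not sure it still works).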
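Conceptually, an active health check such as the one nginx_upstream_check_module adds means probing each backend on a timer instead of waiting for client traffic to fail. Below is a minimal Python sketch of such a probe, assuming a plain TCP connect is a good enough liveness signal (real checkers may also send a request and validate the reply). tcp_alive and healthy_backends are hypothetical names, not part of nginx or the module.

```python
import socket


def tcp_alive(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP handshake with (host, port) completes within `timeout` seconds."""
    try:
        # create_connection resolves the address and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, host unreachable or timed out: treat the backend as down.
        return False


def healthy_backends(backends, timeout: float = 1.0):
    """Filter a list of (host, port) pairs down to those that currently accept connections."""
    return [(host, port) for host, port in backends if tcp_alive(host, port, timeout)]
```

For example, healthy_backends([("172.20.46.78", 9978), ("172.20.46.79", 9979), ("172.20.46.80", 9980)]) returns only the backends that accept a connection within one second; a checker would call it every few seconds and feed the result to the balancer.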

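Once the application is deployed behind a load balancer, the load balancer itself easily becomes a single point of failure. To keep the service up when it goes down, keepalived comes in. To understand how keepalived works, you first need to understand VRRP, the Virtual Router Redundancy Protocol; a detailed description is given in the article 虚拟路由器冗余协议【原理篇】VRRP详解 ("Virtual Router Redundancy Protocol [Principles]: VRRP Explained"). In short, VRRP puts N routers (machines) into one group consisting of one MASTER and N-1 BACKUPs, and the group exposes a single virtual IP, held by the machine that is currently MASTER. The MASTER periodically multicasts advertisements to the BACKUPs in the group. If the BACKUPs receive no advertisement from the MASTER within a set interval, they consider the MASTER down and elect a new MASTER among themselves by priority (the highest priority wins). The new MASTER then takes over the virtual IP, so from the outside it looks as if the "IP has drifted from one machine to another". Keepalived is a high-availability solution that implements VRRP.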
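The "set interval" and the election can be made concrete. Assuming VRRPv2 (RFC 3768, keepalived's default protocol version), a BACKUP declares the MASTER down after Master_Down_Interval = 3 × Advertisement_Interval + Skew_Time, where Skew_Time = (256 - Priority) / 256 seconds, so the highest-priority BACKUP times out first; when priorities are equal, the router with the higher primary IP address wins. The Python sketch below models only these two rules; Router, master_down_interval and elect_master are hypothetical names.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    ip: str            # primary IP address of the node
    priority: int      # 1..254 for backups; 255 is reserved for the address owner
    alive: bool = True


def master_down_interval(advert_int: float, priority: int) -> float:
    """VRRPv2 (RFC 3768): 3 * Advertisement_Interval + Skew_Time."""
    skew_time = (256 - priority) / 256
    return 3 * advert_int + skew_time


def elect_master(routers):
    """Return the live router with the highest priority; the higher primary IP breaks ties."""
    live = [r for r in routers if r.alive]
    if not live:
        return None
    return max(live, key=lambda r: (r.priority, ipaddress.ip_address(r.ip)))
```

With the values used later in this article (advert_int 2, BACKUP priority 100), master_down_interval(2, 100) gives 6.609375 seconds, and elect_master picks VNODE-01 (priority 101) until it is marked dead, after which VNODE-02 takes over.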

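I used an echo server that I wrote earlier while learning Node.js to stand in for the real application servers, and ran it on three virtual machines (VNODE-01 to VNODE-03). The test steps are briefly recorded below: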

  1. Start the application: deploy the application service on each of VNODE-01 to VNODE-03
VNODE-01:
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:43:db:4d brd ff:ff:ff:ff:ff:ff
    inet 172.20.46.78/23 brd 172.20.47.255 scope global dynamic noprefixroute enp0s3
       valid_lft 17744sec preferred_lft 17744sec
    inet6 fe80::950a:8286:4242:1da3/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

able@VNODE-01:~/luoguochun/privt/proj/privt-prj/web/nodejs/node-demo/echo-svr$ node echo-svr.JS -h 172.20.46.78 -p 9978 -i vnode-01
server(vnode-01) listen 172.20.46.78:9978 started

VNODE-02:
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:d0:b6:f5 brd ff:ff:ff:ff:ff:ff
    inet 172.20.46.80/23 brd 172.20.47.255 scope global dynamic noprefixroute enp0s3
       valid_lft 17513sec preferred_lft 17513sec
    inet6 fe80::146f:4835:5b3e:29c1/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

able@VNODE-02:~/luoguochun/privt/proj/privt-prj/web/nodejs/node-demo/echo-svr$ node echo-svr.JS -h 172.20.46.80 -p 9980 -i vnode-02
server(vnode-02) listen 172.20.46.80:9980 started

VNODE-03:
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:0e:28:a6 brd ff:ff:ff:ff:ff:ff
    inet 172.20.46.79/23 brd 172.20.47.255 scope global dynamic noprefixroute enp0s3
       valid_lft 17720sec preferred_lft 17720sec
    inet6 fe80::e7aa:1f55:1bdc:aaeb/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
able@VNODE-03:~/luoguochun/privt/proj/privt-prj/web/nodejs/node-demo/echo-svr$ node echo-svr.JS -h 172.20.46.79 -p 9979 -i vnode-03
server(vnode-03) listen 172.20.46.79:9979 started
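The source of echo-svr.JS is not shown here. For readers who want to reproduce the setup without it, the following is a hypothetical Python stand-in that only mimics the behaviour visible in the logs: it takes -h/-p/-i like the command lines above, and echoes every received line prefixed with the node id, as in echo(from vnode-01):hello, vnode. Everything beyond those options and that reply format is an assumption.

```python
import argparse
import socketserver
import sys


def make_handler(node_id: str):
    """Build a request handler that echoes every line back, prefixed with node_id."""
    prefix = ("echo(from %s):" % node_id).encode()

    class EchoHandler(socketserver.StreamRequestHandler):
        def handle(self):
            # self.rfile yields one line at a time until the client disconnects.
            for line in self.rfile:
                # The line keeps its own ending (telnet sends "\r\n").
                self.wfile.write(prefix + line)

    return EchoHandler


class EchoServer(socketserver.ThreadingTCPServer):
    allow_reuse_address = True   # allow quick restarts on the same port
    daemon_threads = True        # do not wait for client threads on exit


def main(argv=None):
    # add_help=False frees "-h" for the host, mirroring "node echo-svr.JS -h HOST -p PORT -i ID".
    parser = argparse.ArgumentParser(add_help=False, description="line-based echo server")
    parser.add_argument("-h", dest="host", default="0.0.0.0")
    parser.add_argument("-p", dest="port", type=int, default=9978)
    parser.add_argument("-i", dest="node_id", default="vnode")
    args = parser.parse_args(argv)
    with EchoServer((args.host, args.port), make_handler(args.node_id)) as server:
        print("server(%s) listen %s:%d started" % (args.node_id, args.host, args.port))
        server.serve_forever()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Only start when options are given, e.g.: python echo_svr.py -h 172.20.46.78 -p 9978 -i vnode-01
    main()
```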
  2. Layer-4 load balancing: deploy nginx on VNODE-01 and VNODE-02, with VNODE-01 as the primary and VNODE-02 as the standby
2.1 Deploy nginx: minimal configuration (load-balancing policy: weighted round-robin)
user able;
worker_processes 1;
error_log /tmp/nginx_error.log;
pid /tmp/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;

events {
 worker_connections 768;
 # multi_accept on;
}

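# stream {} = layer-4 (TCP) proxying
stream {
    # backend pool: new TCP connections are distributed in the ratio 40:30:30
    upstream echo_svr {
        server 172.20.46.78:9978 weight=40;   # vnode-01
        server 172.20.46.79:9979 weight=30;   # vnode-03
        server 172.20.46.80:9980 weight=30;   # vnode-02
    }
    server {
        # no address given: listens on all local addresses, including the VIP once keepalived adds it
        listen 9900;
        proxy_pass echo_svr;
    }
}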
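How the weights turn into picks: open-source nginx's round-robin balancer uses a "smooth" weighted round-robin. On every pick, each peer's current weight grows by its configured weight, the peer with the largest current weight is chosen, and the total weight is subtracted from the winner. The Python sketch below reproduces that idea for illustration only; it ignores nginx details such as max_fails, backup servers and per-worker state, and SmoothWRR is a hypothetical name.

```python
class SmoothWRR:
    """Smooth weighted round-robin, modelled on nginx's upstream round-robin peer selection."""

    def __init__(self, weights):
        # weights: mapping of peer name -> configured weight (e.g. weight=40 in nginx.conf)
        self.weights = dict(weights)
        self.total = sum(self.weights.values())
        # current weights start at 0 and sum to 0 again after every pick
        self.current = {peer: 0 for peer in self.weights}

    def pick(self):
        # 1. every peer's current weight grows by its configured weight
        for peer, weight in self.weights.items():
            self.current[peer] += weight
        # 2. choose the peer with the largest current weight (the first one wins ties)
        best = max(self.current, key=self.current.get)
        # 3. the winner is charged the total weight, so the others catch up later
        self.current[best] -= self.total
        return best
```

For example, 100 picks from SmoothWRR({"vnode-01": 40, "vnode-03": 30, "vnode-02": 30}) return exactly 40, 30 and 30 picks, interleaved as vnode-01, vnode-03, vnode-02, vnode-01, ... instead of 40 consecutive picks of vnode-01. At layer 4, each pick corresponds to one new TCP connection.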



2.2 Start nginx on both nodes and check it from the shell:
VNODE-01:
able@VNODE-01:/media/sf_luoguochun/privt/proj/privt-prj/arch/nginx$ sudo nginx -c /media/sf_luoguochun/privt/proj/privt-prj/arch/nginx/nginx-lb4layer.conf
[sudo] password for able: 
able@VNODE-01:/media/sf_luoguochun/privt/proj/privt-prj/arch/nginx$ ps -ef --forest | grep -Ev 'grep' | grep nginx
root      2131     1  0 09:41 ?        00:00:00 nginx: master process nginx -c /media/sf_luoguochun/privt/proj/privt-prj/arch/nginx/nginx-lb4layer.conf
able      2132  2131  0 09:41 ?        00:00:00  \_ nginx: worker process

VNODE-02:
able@VNODE-02:/media/sf_luoguochun/privt/proj/privt-prj/arch/nginx$ sudo nginx -c /media/sf_luoguochun/privt/proj/privt-prj/arch/nginx/nginx-lb4layer.conf 
[sudo] password for able: 
able@VNODE-02:/media/sf_luoguochun/privt/proj/privt-prj/arch/nginx$ ps -ef --forest | grep -Ev 'grep' | grep nginx
root      2335     1  0 09:44 ?        00:00:00 nginx: master process nginx -c /media/sf_luoguochun/privt/proj/privt-prj/arch/nginx/nginx-lb4layer.conf
able      2336  2335  0 09:44 ?        00:00:00  \_ nginx: worker process


2.3 Quick test: nginx is now load-balancing connections across the backend services:
^_^@/Users/luoguochun]$ telnet 172.20.46.78 9900
Trying 172.20.46.78...
Connected to bogon.
Escape character is '^]'.
hello, vnode
echo(from vnode-01):hello, vnode
^]
telnet> quit
Connection closed.
^_^@/Users/luoguochun]$ telnet 172.20.46.78 9900
Trying 172.20.46.78...
Connected to bogon.
Escape character is '^]'.
hello, vnode

echo(from vnode-03):hello, vnode
echo(from vnode-03):
^]
telnet> quit
Connection closed.
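Two telnet sessions only show two picks. To check the weight distribution more systematically, a small client can open many connections, send one line on each, and tally the backend named in the echo(from ...) prefix. A minimal Python sketch, assuming the reply format shown above; which_backend, tally and REPLY_RE are hypothetical names.

```python
import re
import socket
from collections import Counter

# The echo server prefixes every reply, e.g. b"echo(from vnode-01):hello, vnode\r\n".
REPLY_RE = re.compile(rb"^echo\(from ([^)]+)\):")


def which_backend(host: str, port: int, message: bytes = b"ping\r\n", timeout: float = 3.0) -> str:
    """Open one new TCP connection, send one line and return the backend id in the reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(message)
        with conn.makefile("rb") as reader:
            reply = reader.readline()
    match = REPLY_RE.match(reply)
    if match is None:
        raise ValueError("unexpected reply: %r" % reply)
    return match.group(1).decode()


def tally(host: str, port: int, connections: int = 100) -> Counter:
    """Count which backend served each of `connections` fresh TCP connections."""
    return Counter(which_backend(host, port) for _ in range(connections))
```

Against the setup above, tally("172.20.46.78", 9900) would be expected to come out close to 40/30/30 across vnode-01, vnode-03 and vnode-02. The unit being counted is a TCP connection: at layer 4, nginx picks a backend once per connection, which is why every line typed in one telnet session is echoed by the same node.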
  3. High availability: deploy keepalived on VNODE-01 and VNODE-02, with VNODE-01 as MASTER and VNODE-02 as BACKUP
3.1 keepalived configuration
Minimal MASTER configuration:
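global_defs {
    # free-form name identifying this node
    router_id MASTER_ROUTER_ID
}
vrrp_script chk_svr {
    # "killall -0" sends no signal; it only succeeds if a process named node (the echo server) exists
    script "killall -0 node"
    # run the check every 2 seconds
    interval 2
    # while the check is failing, lower this node's priority by 5
    weight -5
    # 3 consecutive failures mark the check as failed, 2 consecutive successes as healthy again
    fall 3
    rise 2
}

vrrp_instance VI_1 {
    # initial state only; the node with the higher priority ends up as MASTER
    state MASTER
    # NIC that carries the VRRP advertisements and the VIP
    interface enp0s3
    # source address of this node's VRRP advertisements
    mcast_src_ip 172.20.46.78
    # must be identical on MASTER and BACKUP
    virtual_router_id 99
    priority 101
    # advertisement interval in seconds
    advert_int 2
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    # the virtual IP that clients connect to
    virtual_ipaddress {
        172.20.46.222
    }
    track_script {
       chk_svr
    }
}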
Minimal BACKUP configuration:
Almost identical to the MASTER, except for:
state MASTER -> state BACKUP
mcast_src_ip 172.20.46.78 -> mcast_src_ip 172.20.46.80
priority 101 -> priority 100
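The track_script rule can be checked with simple arithmetic. With a negative weight, keepalived lowers a node's priority by that amount while the check is failing (a positive weight would instead be added while it succeeds), and the node with the higher effective priority holds the VIP. Here a failing check takes the MASTER from 101 to 96, below the BACKUP's 100, so the VIP moves; once the check passes again, 101 > 100 and the MASTER takes the VIP back, since keepalived preempts by default. Note that chk_svr as written probes the node process (the echo server), not nginx, so a crashed nginx alone would not trigger a failover; the failover test in this article instead stops keepalived entirely. A minimal Python sketch of the rule, with hypothetical names and ties ignored:

```python
def effective_priority(base: int, weight: int, check_ok: bool) -> int:
    """Priority of a keepalived node after applying one tracked script's weight.

    weight < 0: applied while the check is failing (the case in this article).
    weight > 0: applied while the check is succeeding.
    weight == 0 is not modelled here.
    """
    adjusted = base
    if (weight < 0 and not check_ok) or (weight > 0 and check_ok):
        adjusted = base + weight
    # keep the result inside 1..254, the priority range for non-owner routers
    return max(1, min(254, adjusted))


def vip_holder(nodes):
    """nodes: iterable of (name, base_priority, weight, check_ok); highest effective priority wins."""
    name, *_ = max(nodes, key=lambda n: effective_priority(n[1], n[2], n[3]))
    return name
```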

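3.2 Check from the shell: besides its real IP, the MASTER now also holds the virtual IP 172.20.46.222 (visible with ip a but not with ifconfig, because keepalived adds it as a secondary address without a label, which ifconfig does not list)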
VNODE-01:
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:43:db:4d brd ff:ff:ff:ff:ff:ff
    inet 172.20.46.78/23 brd 172.20.47.255 scope global dynamic noprefixroute enp0s3
       valid_lft 15610sec preferred_lft 15610sec
    inet 172.20.46.222/32 scope global enp0s3
       valid_lft forever preferred_lft forever
    inet6 fe80::950a:8286:4242:1da3/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
VNODE-02:
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:d0:b6:f5 brd ff:ff:ff:ff:ff:ff
    inet 172.20.46.80/23 brd 172.20.47.255 scope global dynamic noprefixroute enp0s3
       valid_lft 17513sec preferred_lft 17513sec
    inet6 fe80::146f:4835:5b3e:29c1/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
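Instead of reading the whole ip a listing by eye, the VIP check can be scripted. A minimal Python sketch, assuming iproute2's ip command is installed: ipv4_addresses parses the inet lines of output in the format shown above, and holds_vip (a hypothetical helper) runs ip -4 addr show dev enp0s3 and tests whether the VIP is among the addresses.

```python
import re
import subprocess

# Matches lines such as "    inet 172.20.46.222/32 scope global enp0s3" (inet6 lines are skipped).
INET_RE = re.compile(r"^\s*inet (\d+\.\d+\.\d+\.\d+)/(\d+)", re.MULTILINE)


def ipv4_addresses(ip_addr_output: str):
    """Extract the IPv4 addresses from `ip addr` output, in the order they appear."""
    return [addr for addr, _prefix_len in INET_RE.findall(ip_addr_output)]


def holds_vip(vip: str, interface: str = "enp0s3") -> bool:
    """Run `ip -4 addr show dev <interface>` and report whether `vip` is assigned to it."""
    out = subprocess.run(["ip", "-4", "addr", "show", "dev", interface],
                         capture_output=True, text=True, check=True).stdout
    return vip in ipv4_addresses(out)
```

On the MASTER shown above, holds_vip("172.20.46.222") would return True, and on the BACKUP False, until a failover swaps them.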

3.3 Tests
3.3.1 Application test: the service is reachable through the virtual IP and connections are forwarded normally
^_^@/Users/luoguochun]$ telnet 172.20.46.222 9900
Trying 172.20.46.222...
Connected to bogon.
Escape character is '^]'.
hello, keepalived
echo(from vnode-02):hello, keepalived
^]
telnet> quit
Connection closed.
^_^@/Users/luoguochun]$ telnet 172.20.46.222 9900
Trying 172.20.46.222...
Connected to bogon.
Escape character is '^]'.
hellow 
echo(from vnode-01):hellow
^]
telnet> quit
Connection closed.

3.3.2 Simulate a MASTER failure (the IP fails over to the BACKUP)
VNODE-01:
able@VNODE-01:~/luoguochun/privt/proj/privt-prj/web/nodejs/node-demo/echo-svr$ sudo killall nginx keepalived

2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:43:db:4d brd ff:ff:ff:ff:ff:ff
    inet 172.20.46.78/23 brd 172.20.47.255 scope global dynamic noprefixroute enp0s3
       valid_lft 15355sec preferred_lft 15355sec
    inet6 fe80::950a:8286:4242:1da3/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

VNODE-02:
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:d0:b6:f5 brd ff:ff:ff:ff:ff:ff
    inet 172.20.46.80/23 brd 172.20.47.255 scope global dynamic noprefixroute enp0s3
       valid_lft 15363sec preferred_lft 15363sec
    inet 172.20.46.222/32 scope global enp0s3
       valid_lft forever preferred_lft forever
    inet6 fe80::146f:4835:5b3e:29c1/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

^_^@/Users/luoguochun]$ telnet 172.20.46.222 9900
Trying 172.20.46.222...
Connected to bogon.
Escape character is '^]'.
hello, keepalived
echo(from vnode-02):hello, keepalived
^]
telnet> quit
Connection closed.
^_^@/Users/luoguochun]$ telnet 172.20.46.222 9900
Trying 172.20.46.222...
Connected to bogon.
Escape character is '^]'.
hellow 
echo(from vnode-01):hellow
^]
telnet> quit
Connection closed.

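The IP fails over to the standby as expected. Apart from connections that were open through the master at that moment, which broke, new connections are correctly load-balanced by the standby.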
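The failover above was checked with two manual telnet sessions. To quantify how long the VIP was unreachable during the switch, one can probe it at a fixed interval and measure the longest run of failures. The sketch below is a generic helper under that assumption; watch and longest_outage are hypothetical names, and the probe function is left to the caller (for example, a function that connects to 172.20.46.222:9900 and waits for the echo reply, returning True on success).

```python
import time
from typing import Callable, List, Optional, Tuple

# (seconds since the start of the run, did the probe succeed?)
Sample = Tuple[float, bool]


def watch(probe: Callable[[], bool], duration: float, interval: float = 0.5) -> List[Sample]:
    """Call probe() every `interval` seconds for `duration` seconds and record each result."""
    samples: List[Sample] = []
    start = time.monotonic()
    while time.monotonic() - start < duration:
        samples.append((time.monotonic() - start, probe()))
        time.sleep(interval)
    return samples


def longest_outage(samples: List[Sample]) -> float:
    """Longest time from a first failed probe to the next successful one.

    An outage still open at the end is measured up to the last sample.
    """
    longest = 0.0
    down_since: Optional[float] = None
    for t, ok in samples:
        if not ok and down_since is None:
            down_since = t          # outage begins
        elif ok and down_since is not None:
            longest = max(longest, t - down_since)
            down_since = None       # service is back
    if down_since is not None:
        longest = max(longest, samples[-1][0] - down_since)
    return longest
```

Started just before running killall on the MASTER, watch(probe, duration=30) followed by longest_outage gives a rough upper bound on the failover time, limited by the probe interval.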


3.3.3 Simulate MASTER recovery
After nginx and keepalived are restarted on VNODE-01, the IP moves back to the master:

VNODE-01:
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:43:db:4d brd ff:ff:ff:ff:ff:ff
    inet 172.20.46.78/23 brd 172.20.47.255 scope global dynamic noprefixroute enp0s3
       valid_lft 15206sec preferred_lft 15206sec
    inet 172.20.46.222/32 scope global enp0s3
       valid_lft forever preferred_lft forever
    inet6 fe80::950a:8286:4242:1da3/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

VNODE-02:
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:d0:b6:f5 brd ff:ff:ff:ff:ff:ff
    inet 172.20.46.80/23 brd 172.20.47.255 scope global dynamic noprefixroute enp0s3
       valid_lft 17513sec preferred_lft 17513sec
    inet6 fe80::146f:4835:5b3e:29c1/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

^_^@/Users/luoguochun]$ telnet 172.20.46.222 9900
Trying 172.20.46.222...
Connected to bogon.
Escape character is '^]'.
what
echo(from vnode-02):what
^]
telnet> quit
Connection closed.
^_^@/Users/luoguochun]$ telnet 172.20.46.222 9900
Trying 172.20.46.222...
Connected to bogon.
Escape character is '^]'.
youe
echo(from vnode-01):youe
^]
telnet> quit
Connection closed.
^_^@/Users/luoguochun]$

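The IP moves back to the master as expected. Apart from connections that were open through the standby at that moment, which broke, new connections are correctly load-balanced by the master.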

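So with simple open-source tools, a highly available load-balancing system can be deployed with relatively little effort. Of course, this exercise involves no real business logic, and real-world deployments can be considerably more complex. Still, thanks to nginx's good design, developing custom modules on top of nginx remains an efficient approach.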