5. LVS: The ARP Problem

5.1. The problem

If you follow the instructions and set up the examples in the LVS-mini-HOWTO, then you don't need to know about the arp problem. If you're going to set up grander LVSs, then you'll need to understand the arp problem.

Although this section comes early in the HOWTO, it has lots of pitfalls. You shouldn't be reading this unless you've at least set up a working LVS-NAT and LVS-DR LVS using the canned instructions in the LVS-mini-HOWTO.

The LVS allows several machines to function as one machine. For LVS-DR and LVS-Tun some trickery was needed to split the various handshakes etc, that are involved in establishing and maintaining a tcpip connection, so that some parts of the handshake come from one machine and other parts from another machine. The worst problem, which ironically only happens with realservers running Linux (2.2 and later kernels), is the "arp problem" (it's just as well we have the source code).

With LVS-DR and LVS-Tun, all the machines (director, realservers) in the LVS have an extra IP, the VIP. Here's an LVS-DR in a test setup where all machines and IPs are on the same physical network (i.e. are using the same link layer and can hear each other's broadcasts).


                      ________
                     |        |
                     | client |
                     |________|
                         |
                         |
                      (router)
                         |
                         |
                         |       __________
                         |  DIP |          |
                         |------| director |
                         |  VIP |__________|
                         |
                         |
                         |
       ------------------------------------
       |                 |                |
       |                 |                |
   RIP1, VIP         RIP2, VIP        RIP3, VIP
 ______________    ______________    ______________
|              |  |              |  |              |
| realserver1  |  | realserver2  |  | realserver3  |
|______________|  |______________|  |______________|


When the client requests a connection to the VIP, it must connect to the VIP on the director and not to the VIP on the realservers.

The director acts as a layer-4 (IP) router, accepting packets destined for the VIP and then sending them on to a realserver (where the real work is done and a reply is generated). For the LVS to function, when the client (or if present, the router) puts out the arp request "who has VIP, tell client", the client/router must receive the MAC address of the director (and not the MAC address of one of the realservers). After receiving the arp reply, the client will send the connect request to the director. (The director will update its ipvsadm tables to keep track of the connections that it's in charge of and then forward the connect request packet to the chosen realserver).

If instead, the client gets the MAC address of one of the realservers, then the packets will be sent directly to that realserver, bypassing the LVS action of the director. If nothing is done to direct the router's arp requests for the VIP to the director (arp bouncing), then in some setups, one particular realserver's MAC address will be in the client/router's arp table for the VIP and the client will only see one realserver. If the client's packets are consistently sent to the same realserver, then the client will have a normal session connected to that realserver. You can't count on this happening: in the middle of a tcpip session, the client/router might get the MAC address of another realserver as a result of an arp request, and the client will start getting packets for connections it knows nothing about (and the realserver will send tcp resets). (In my setup, the machine with the fastest CPU is in the client's arp table, suggesting that it's the first machine to reply that gets in. Horms and Steven Williams have written that they think it's the last machine to reply whose entry is in the client's arp table.) In other setups where the realservers are identical, the client will connect to different realservers each time the arp cache times out (see comment by Steven Williams elsewhere). If the director always gets its MAC address in the router arp table, then the LVS will work without any changes to the realservers (as happened in my case with a director with the fastest CPU in the LVS), although this is not a reliable solution for production.

Getting the MAC address of the VIP on the director (instead of the MAC address of the VIP on the realservers) to the client when the client/router does an arp request is the key to solving the "arp problem".
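One way to see which machine currently owns the VIP from the client's point of view is to look up the VIP in the client's arp table and compare the MAC address with that of the director. The following is a minimal sketch, assuming a Linux client whose arp table is readable in the /proc/net/arp format. The function name, the sample table and the IP addresses are invented for illustration; the director's MAC address is the one used in the static arp example later in this section.

```shell
#!/bin/sh
# Print the MAC address that an arp table holds for a given IP.
# Reads text in /proc/net/arp format on stdin:
#   IP address  HW type  Flags  HW address  Mask  Device
# Prints nothing if the IP has no entry.
arp_mac_for_ip() {
    awk -v ip="$1" 'NR > 1 && $1 == ip { print $4; exit }'
}

# Example table; the IPs and the second MAC are invented for illustration.
sample_table() {
    cat <<'EOF'
IP address       HW type     Flags       HW address            Mask     Device
192.168.1.110    0x1         0x2         00:80:C8:CA:A7:E4     *        eth0
192.168.1.254    0x1         0x2         00:A0:CC:66:22:20     *        eth0
EOF
}

VIP=192.168.1.110
DIRECTOR_MAC=00:80:C8:CA:A7:E4

seen=$(sample_table | arp_mac_for_ip "$VIP")
if [ "$seen" = "$DIRECTOR_MAC" ]; then
    echo "VIP resolves to the director"
else
    echo "VIP resolves to ${seen:-nothing}; expected $DIRECTOR_MAC"
fi
```

On a live client you would feed the real table to the function instead of the sample, e.g. `arp_mac_for_ip "$VIP" < /proc/net/arp`.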

The traditional ways of handling the arp problem (as explained here) all require fiddling with the settings of the VIP on the realservers. The assumption in the early days of LVS was that you wouldn't have access to the router (this being under the control of the IT department or your ISP and you would have to go through a lot of bureaucracy to change the settings on the router). However if you're paying good money to an ISP to house your LVS, or your inhouse LVS is doing something useful for your establishment, then you should have no trouble in having the router set up the way you want.

If you have access to the router (or can put one in front of your LVS - a low power linux box is just fine) and you can set it to route packets for the VIP only to the director(s) and not to the realservers, or you can use the arp filtering tools of iptables, and you understand what's been said above, then you've handled the arp problem and need read no further.

For those who don't have access to the router, or who want to set up an LVS on one network, then read on...

The arp problem did not arise with Linux 2.0.x kernels, because several devices which don't reply to arp requests (eg dummy0, tunl0, lo:0) were available for the VIP. For other OS's, the NOARP flag for ifconfig stops the VIP on the realservers from replying to arp requests.

However with 2.2.x (and now 2.4.x and later) kernels, the devices which didn't reply to arp requests in 2.0.x, now reply to arp requests. There is a "-arp" (NOARP) option for ifconfig which (according to the man pages) turns off replies to arp requests for that device, and an "arp" option which turns them back on again. Linux does not always honour this flag. You couldn't turn on replies to arp requests for the dummy0 devices in 2.0.36 kernels and you can't turn it off for tunl0 in 2.2.x kernels. eth0 behaves properly in 2.0.36 but in 2.2.x kernels it arps even when you tell it not to arp. This behaviour of not honouring the NOARP flag in the Linux 2.2.x kernels is not regarded as a "problem" by those writing the Linux TCPIP code and is not going to be "fixed".

Julian 22 May 2001

The flag is used to allow arp requests for the specified device. Although "lo" doesn't reply to arp requests, the requests for the VIP go through eth*, and so the NOARP flag is of no help to us. We can't drop the flag for eth.

Another wrinkle is that in 2.0.36 kernels, aliased devices (eg eth0:1) could be set up independently of the options on the primary (eth0) device. Thus eth0:1 behaved as if it were on a separate NIC and its arp'ing behaviour could be set independently of the primary interface. The settings of an aliased device belonged to the IP. With the 2.2.x kernels, the aliased devices are now just alternate names for each other: if you change an option (eg -arp) or up/down one alias (or the primary), the other aliases follow. With 2.2.x kernels, the settings of the aliased device belong to the primary device (there is only one device with several IPs).

When LVS was running on 2.0.36 machines, the VIP was usually configured as an alias (eg lo:0, tunl0) on the main ethernet device (eth0), allowing the nodes in an LVS to have only one NIC.

With 2.2.x kernels, care is needed when only one NIC is used on the realserver (the usual case). On a realserver with only one NIC, where eth0 carries the RIP, eth0 must reply to arp requests (to receive packets), so eth0:1 carrying the VIP will reply to arp requests too, even if you ifconfig it with -arp. Thus if a realserver is running a 2.2.x kernel and has the VIP on an ip_alias, then the VIP on the realserver will reply to arp requests received from the router.

With the 2.4 kernels, the use of ip_aliases is still allowed, but requires a "label" to be recognised by the new Policy Routing tools that come with 2.4 (iproute2 and ip_tables). The "label"ed IPs are now secondary IPs.
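As an illustration of a labelled secondary IP, here is a small sketch that prints (but does not run) the iproute2 command for adding a VIP with a label. It assumes the iproute2 convention that a label is either the device name itself or the device name followed by a colon and a suffix. The function name and the address 192.168.1.110 are invented for this example.

```shell
#!/bin/sh
# Print (not run) the iproute2 command that adds a labelled secondary IP.
# The label must be the device name, or the device name followed by ":suffix"
# (e.g. "lo:0" for device "lo").
labelled_ip_cmd() {
    addr=$1 dev=$2 label=$3
    case "$label" in
        "$dev"|"$dev":*) ;;
        *) echo "label '$label' must be '$dev' or start with '$dev:'" >&2
           return 1 ;;
    esac
    echo "ip addr add $addr dev $dev label $label"
}

# Hypothetical VIP on the loopback device, labelled like the old lo:0 alias.
labelled_ip_cmd 192.168.1.110/32 lo lo:0
```

Run the printed command as root on the realserver once you are satisfied with it; `ip addr show dev lo` should then list the address with its label.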

5.2. The Cure(s)

Several cures have been produced in an attempt to solve the arp problem. They involve either

  • stopping the realservers from replying to arp requests for the VIP.
  • hiding the VIP on the realservers so that they don't see the arp requests.
  • priming the client/router in front of the director with the correct MAC address for the VIP.
  • allowing the realserver to accept a packet with dst=VIP even though the realserver does not have a device with this IP (i.e. the host has nothing to reply to an arp request). This is implemented by transparent proxy or fwmark.
  • stopping arp requests for the VIP getting to the realservers.

Note: Most of these cures involve applying a patch to the kernel on the Linux 2.2.x or 2.4.x realserver. The patch (e.g. the "hidden" patch), which you apply to the realserver, is different to the patch which you apply to the director (the "ipvs" patch). For more on the hidden patch see Julian's page.

5.3. The Cure: 2.0 kernels

There is no arp problem with 2.0 kernels on the realservers. On the realservers, configure the VIP on the lo device with the -arp (NOARP) option as you would with any other Unix.

5.4. The Cure: 2.2.x kernels

5.4.1. The hidden patches

The "hidden" patches for kernel >=2.2.14 are now in the standard linux distribution (i.e. you can use the "hidden" feature with a standard kernel and don't have to patch the kernel on the realserver). The arp patches allow you to hide a device from arp requests, allowing the realserver to function in an LVS.

Note
The hidden patch hides the device (here the lo) and any IPs that are on it. The -arp (NOARP) flag in 2.0 kernels affects only the ip_alias (and not other IPs on the same device). These are different methods, but both stop the router/client from getting arp replies from the realserver for the VIP.

To hide devices from arp calls, on the realservers do

       #to activate the hidden feature
       echo 1 > /proc/sys/net/ipv4/conf/all/hidden
       #to make lo:0 not arp, put lo here
       echo 1 > /proc/sys/net/ipv4/conf/<interface_name>/hidden

then test that you've hidden the VIP (testing for arp).
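Before testing over the network, you can check on the realserver that both flags are actually set. The sketch below assumes that a device is hidden only when both all/hidden and the device's own hidden flag are 1, as in the commands above. The function name and the CONF_ROOT variable are invented for this example; CONF_ROOT exists only so that the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Report whether a device is hidden: all/hidden and <dev>/hidden must both be 1.
# CONF_ROOT can be overridden to try the function against a scratch tree.
CONF_ROOT=${CONF_ROOT:-/proc/sys/net/ipv4/conf}

is_hidden() {
    dev=$1
    # A missing file (kernel without the hidden feature) reads as empty.
    all_flag=$(cat "$CONF_ROOT/all/hidden" 2>/dev/null || true)
    dev_flag=$(cat "$CONF_ROOT/$dev/hidden" 2>/dev/null || true)
    [ "$all_flag" = 1 ] && [ "$dev_flag" = 1 ]
}

if is_hidden lo; then
    echo "lo is hidden from arp requests"
else
    echo "lo is NOT hidden (flag unset, or kernel lacks the hidden feature)"
fi
```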

There is a possible race condition in hiding the VIP -

Kyle Sparger, 15 Feb 2001

I've found an interesting, but not totally unexpected race condition under DR in 2.2.x that I've managed to create when installing VIP's on a machine in DR mode. Basically, the cause is this:

ifconfig dummy0 10.0.1.15
echo 1 > /proc/sys/net/ipv4/conf/dummy0/hidden

You'll notice that there's going to be a small gap between the two which allows an ARP request to come in, and for the server to reply. And yes, it is big enough to be bitten by -- I've been bitten twice by it so far :)

Julian

On boot:

echo 1 > /proc/sys/net/ipv4/conf/all/hidden
# For each hidden interface (dummy, lo, tunl):
modprobe dummy0
ifconfig dummy0 0.0.0.0 up
echo 1 > /proc/sys/net/ipv4/conf/dummy0/hidden
# Now set any other IP address

Kyle's suggestion

echo 1 > /proc/sys/net/ipv4/conf/default/hidden
ifconfig dummy0 10.0.1.15
echo 0 > /proc/sys/net/ipv4/conf/default/hidden

The echo 0 command is in case I want to configure other interfaces later that I _do_ want responding to ARP requests. Technically, it's not necessary, I just find it useful in my particular setup.
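The point of both suggestions is ordering: the device must be hidden before it is given the VIP. The following sketch prints the commands in Julian's order for a given device and VIP, without running anything. The function name is invented, and module loading (modprobe) is left out on the assumption that the device already exists.

```shell
#!/bin/sh
# Print, in a race-free order, the commands that bring up a hidden device
# and only afterwards give it the VIP. Nothing is executed.
hidden_vip_plan() {
    dev=$1 vip=$2
    conf=/proc/sys/net/ipv4/conf
    echo "echo 1 > $conf/all/hidden"    # activate the hidden feature
    echo "ifconfig $dev 0.0.0.0 up"     # device up, but with no address yet
    echo "echo 1 > $conf/$dev/hidden"   # hide the device
    echo "ifconfig $dev $vip"           # only now assign the VIP
}

hidden_vip_plan dummy0 10.0.1.15
```

Because the VIP is assigned last, there is no window in which the device holds the VIP while still answering arp requests.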

5.4.2. The Cure: Older 2.2 kernels (<2.2.12)

These are old and it would be better to upgrade (you won't get much help on the mailing list for these). However if you have them, you apply the arp patches to the kernel code of the 2.2.x realservers. These patches are separate from the ipvs patch applied to the kernel on the director.

For kernels <2.2.12, Julian's patch is on the lvs website.

http://www.linuxvirtualserver.org/arp_invisible-2213-2.diff

The patch by Stephen Williams is at

http://www.linuxvirtualserver.org/sdw_fullarpfix.patch

This patch is against a 2.2.5 kernel but can be applied to later kernels (tested to 2.2.13). The file appears to have DOS carriage control. Depending what you get on your disk, you may have to convert the file to unix carriage control (with `tr -d '\015'`) (the unix line extension of '\' doesn't work in combination with DOS carriage control).

The whitespace may not match your file so do

$ cd /usr/src/linux
$ patch -p1 -l < sdw_fullarpfix.patch
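The carriage-return conversion mentioned above can be sketched as a small filter around the same tr command. The function name is invented for this example; the demonstration feeds it two DOS-terminated lines and uses od to show that no \r bytes remain.

```shell
#!/bin/sh
# Remove DOS carriage returns (octal 015) from stdin.
strip_cr() {
    tr -d '\015'
}

# Demonstration on two CRLF-terminated lines; od -c shows the remaining bytes.
printf 'line one\r\nline two\r\n' | strip_cr | od -c
```

To convert the patch itself you would redirect the file through the filter into a new file (the output name is up to you), then apply the converted copy with patch as shown above.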

If you are using martian modification you will need the forward_shared-hidden patch as well (needed only on the director, but can be applied to both director and realservers).

5.4.3. Put an extra NIC (eth1) on the realserver to carry the VIP

Possible cards would be a discarded ISA card (WD80x3), or a cheap 100Mbit PCI card (eg Netgear FA310TX, $16 in USA in Nov 99). There is no traffic going through this NIC and it doesn't matter that it's an old slow card. The extra card is only required so that the realserver can have the VIP on the machine. With 2.2.x kernels you can't stop this device (eth1) from replying to arp requests, but if you don't connect the cable to it or don't put a route to it in the realserver's routing table, then the client won't be able to send it an arp request.

To set this up with the configure script, enter eth1 as the device for the VIP on the realserver.

Note

Apparently, the 2nd NIC doesn't handle arp problem for 2.4 kernels.

I tested the 2 NIC method of handling the arp problem with kernel 2.2.13. I haven't tried it with 2.4 kernels, but apparently it doesn't work. Julian and Ratz think it shouldn't work with 2.2.x kernels, but I haven't revisited the matter to see why we have come to different conclusions.

5.5. The Cure: 2.4.x kernels

There are several ways of handling the arp problem for 2.4.x kernels. They all work, but some of them have been around longer and so have been used more and people on the mailing list are more familiar with them.

5.5.1. 2.4 Hidden Patch

Julian's hidden patch has been around the longest and is well tested. Although included in the standard 2.2.x kernel, it is not being included in the 2.4.x kernels. You'll have to patch the kernel on the realservers.

For early 2.4.x kernels (eg x=0), the patch is available at http://www.linuxvirtualserver.org/hidden-2.3.41-1.diff. (This patches a part of the kernel that isn't being actively fiddled with, so hopefully the patch will work against later 2.4.x kernels too.)

The 2.4.x "hidden" patch is now being actively maintained and is included in ipvs-x.x.x/contrib/patches/hidden-x.x.x.diff

Assuming you are patching 2.4.2 with the ipvs-0.2.5 files

cd /usr/src/linux
patch -p1 <../ipvs-0.2.5/contrib/patches/hidden-2.4.2-1.diff

Then build the kernel (can use same options as for the 2.4 director kernel build).

You activate the hidden feature as for 2.2 (see hidden).

As to why the hidden patch is in the 2.2 kernels but not the 2.4 kernels, see the thread in the mailing list archives.

5.5.2. 2.4 arp_announce

The 2.6.x arp_announce, arp_ignore code has been ported to 2.4.26 (and later) kernels.

5.5.3. arp filtering

Julian has written an extension to iptables, which filters arp packets. You can use this to handle the arp problem. See Julian's software page for more details. This method does not require patching of the 2.4 kernel on the realserver. At the moment, the hidden method is well established.

5.5.4. Maurizio Sartori's noarp module

Maurizio Sartori masar (at) masarlabs (dot) com 28 Nov 2002

On my site is a simple kernel module for Linux 2.4.x to solve the ARP Problem. You don't have to patch the kernel but only to compile, install and configure the 'noarp' module, to use your loopback interface filtering its arp reply. I've tested it on Debian 'Sarge' and RedHat 8.

Sebastien Bonnet Sebastien (dot) Bonnet (at) experian (dot) fr 04 Jun 2003

Nobody seems to recall what a smart Italian guy named Maurizio Sartori did. Instead of the hidden patch, which requires a full kernel build, he's written a *module* called noarp, way more handy, as

  1. it requires only a one module build, doesn't require a kernel build, takes about 1 minute to install and get working.
  2. it allows hiding IPs, not interfaces.

I'm using it in production and it works perfectly.

Joe

Can you hide the VIP on eth0:x and not hide the RIP on eth0? (I should know this, but I don't)

Jan Abraham jan_abraham (at) gmx (dot) net 31 Oct 2003

Yes, you can :) I used Maurizio Sartori's noarp module, suggested in your HOWTO in chapter 4.5.3. It can be controlled by IP, not by interface.

5.5.5. extra NIC doesn't solve arp problem for 2.4 kernel realservers

Jean Paul Piccato j (dot) piccato (at) studenti (dot) to (dot) it

I'm setting up a DR_LVS with a director and two servers... I've to handle the ARP problem so I've put two NIC on the two realservers...

Julian Anastasov ja (at) ssi (dot) bg 16 Jan 2002

This works maybe only with Linux 2.0. (Joe: see 2.2 kernels with extra NIC). For 2.2+ you need a specific kind of ARP control. In Linux 2.2+ the operation of adding IP address involves the following 2 steps:

  1. Define a local IP address as a host property - remote hosts can talk to it through any device
  2. Define network link route on the specified device - you can talk with other hosts from this local network only through this device

(1) allows the Linux 2.2+ box to send ARP replies through any device that received the request. Additionally, the user can provide some filtering by setting some device specific values:

/proc/sys/net/ipv4/conf/*/<FLAG>

These are explained in /usr/src/linux/Documentation/networking/ip-sysctl.txt

The LVS setups depend mostly on the FLAGs rp_filter, hidden, arp_filter, send_redirects. (for more info on kernel flags see the section on /proc filesystem flags). On problems, check them after learning what they mean and how they can kill your setup.

By setting rp_filter or arp_filter on some device you can ignore the ARP requests (and the traffic if rp_filter is set) coming from addresses if we don't have a route to these addresses through the mentioned above device.

The send_redirects values must be checked for setups playing with NAT on one physical medium.

Information on using the hidden patch is in hidden.txt

It seems that eth0 reply to the server instead of eth1

Any device can reply if the ARP probe is not filtered. See hidden.txt from the above URL

Michael McConnell michaelm (at) eyeball (dot) com 10 Jun 2002

I currently have a system which has a Tyan 2515 Motherboard. This motherboard features a Dual Intel 82559 NIC.

The problem I face is that when using both ports of this dual interface network card (plugged into the same switch), I find that the second interface is answering arp requests (on rare occasions) that the first interface should be answering. I have used tcpdump and clearly seen eth1 answering arp requests that eth0 should be answering... how odd.... It's rare, but when it happens of course that address is offline. (Note this only seems to happen on alias IP addresses, it has never happened on the primary interface)

I am using the open source drivers provided with the 2.2.19 kernel, I'm wondering if the drivers provided by Intel would help this problem?

Roberto Nibali ratz (at) tac (dot) ch 11 Jun 2002

The drivers indeed can't make the difference but not because they are the same, but because the driver doesn't have anything to do with the arp/routing issue.

Julian

Classic problem of attaching multiple Linux interfaces to shared medium. You can set arp_filter on all your ARP devices or why not to restrict even the IP protocol by setting rp_filter.

Such answering (of arp requests) can never be a problem. If the Linux box answers via many interfaces then it is willing to accept traffic through these ifaces. Of course, the achieved failover when attaching two interfaces to same hub is not perfect because the remote LAN boxes will use the alive Linux interface but Linux routing still uses the first interface (even if it is failed on Layer 2) for the used subnet. If your goal is to restrict the talks for each subnet through one interface then you have to use arp_filter=1 but still to use rp_filter=0 to allow cross-subnet talks. One day rp_filter will be aware of the medium_id values for each interface and will allow the Linux box to interconnect multiple hubs securely (and still to use many interfaces to these hubs).

By default Linux replies to ARP probes for any local IP address configured on any device no matter on what device the probe is received. Such probes look like "who-has TARGET tell SENDER". If the probe is answered later we can receive IP traffic from SENDER to TARGET destined to the TARGET's MAC address.

When we have different subnets (network routes) configured on multiple interfaces attached to same hub sometimes we prefer (may be the reader can find good reason for this) the traffic to/from one subnet always to use one interface. In such cases replying through many interfaces is not desired. We have 2 options:

arp_filter:

when

/proc/sys/net/ipv4/conf/DEV/arp_filter is set to 1
or
/proc/sys/net/ipv4/conf/all/arp_filter is set to 1

then the flag will cause any probe received on interface DEV to be dropped if the route from TARGET to SENDER points to different interface. With the usual local networks in table main in the form "from 0/0 to local_net lookup main" we see that the TARGET is ignored. As a result, we drop probes received from SENDER that come from the wrong interface. As a result, if the route from TARGET to SENDER1 is via DEV1 and from TARGET to SENDER2 is via DEV2, then we will reply only through one device for each of the senders. Of course, the arp_filter relies on the routing and as a result the behavior depends on the used ip rules and routes. The above is a simple example for normal local networks. The arp_filter simply checks the route for the reversed addresses. It should point to the input device.

rp_filter:

The rp_filter flag (DEV/rp_filter or all/rp_filter) set to 1 has similar semantic. It has nearly the same function as arp_filter and can control the ARP for the same purposes: symmetric talks (in and out using same device) but it covers the IP traffic too. It is assumed that where ARP is received (replied more exactly) there the IP traffic will be accepted too. It has mostly security function and can defend against IP spoofing. It controls the reverse path protection: we accept traffic from SENDER to TARGET received on DEV only when the reverse path (from TARGET to SENDER) points to the input interface DEV. It is used usually for "external" interfaces.

How you can use it:

ifconfig eth0 192.168.1.2
ifconfig eth1 192.168.2.2

echo 1 > /proc/sys/net/ipv4/conf/eth0/arp_filter
echo 1 > /proc/sys/net/ipv4/conf/eth1/arp_filter
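When debugging which device answers an ARP probe, it helps to see the relevant flags for every device at once. The sketch below lists the four flags named above (rp_filter, arp_filter, hidden, send_redirects) for each device directory. The function name and the CONF_ROOT variable are invented for this example; flags the running kernel does not provide are printed as "-".

```shell
#!/bin/sh
# List the ARP-related flags for every device under the conf directory.
# Flags that the running kernel does not provide (e.g. hidden) print as "-".
CONF_ROOT=${CONF_ROOT:-/proc/sys/net/ipv4/conf}

show_arp_flags() {
    for dir in "$CONF_ROOT"/*/; do
        [ -d "$dir" ] || continue          # no devices found: nothing to print
        dev=$(basename "$dir")
        line=$dev
        for flag in rp_filter arp_filter hidden send_redirects; do
            if [ -r "$dir$flag" ]; then
                line="$line $flag=$(cat "$dir$flag")"
            else
                line="$line $flag=-"
            fi
        done
        echo "$line"
    done
}

show_arp_flags
```

Each output line has the form `eth0 rp_filter=0 arp_filter=1 hidden=- send_redirects=1`, one line per device (including the special entries all and default).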

5.5.6. Put the realservers on a different network to the VIP

Set up routing tables so that the client cannot route to the realserver network (Lars' method). This method requires the director to not forward packets for the VIP (easy to implement if 2 NICS on the director). The reply packets from the realservers return to the client via a different router to the one attached to the director. Thus the director's router cannot send arp requests to the realservers.

5.5.7. On the client(router), route packets with dst_addr=VIP to the director

You can hardwire the MAC address of the director as the MAC address of the VIP. You can do this with

# arp -s lvs.mack.net 00:80:C8:CA:A7:E4

or

# arp -f /etc/ethers

Here is my /etc/ethers file (on the client)

lvs.mack.net 00:80:C8:CA:A7:E4

This requires no extra NICs or patching of realservers. However in a production environment, redundant directors with heartbeat/failover may be required and some method (eg running send-arp) will be needed to change the static arp entry as the failover occurs. If multiple NICs are involved, it is possible that the above instruction will result in a route through the wrong NIC. In this case bring up the NIC of interest first and then run the above command.
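Since a mistyped MAC address in a static entry silently blackholes the VIP, it can be worth validating the address before installing it. The following sketch prints the arp -s command only if the MAC address is six colon-separated hex octets. The function name is invented for this example; the host name and MAC address are the ones from the /etc/ethers file above.

```shell
#!/bin/sh
# Print the arp command that pins a host name (or IP) to a MAC address,
# refusing anything that is not six colon-separated hex octets.
static_arp_cmd() {
    host=$1 mac=$2
    if printf '%s\n' "$mac" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'; then
        echo "arp -s $host $mac"
    else
        echo "bad MAC address: $mac" >&2
        return 1
    fi
}

static_arp_cmd lvs.mack.net 00:80:C8:CA:A7:E4
```

The printed command has to be run as root on the client or router. After a director failover the same function can be used to produce the replacement entry for the new director's MAC address.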

Alternately if the router has several NICs, use one for the director and another for the realservers. Route the VIP to the director.

5.5.8. Use transparent proxy to allow the incoming packet to be accepted locally - Horms' method.

See the sections on Transparent Proxy (Horms' method), and its setup for LVS-DR and LVS-Tun. The configure script will set this up for you.

5.6. The Cure: 2.6.x kernels

5.6.1. 2.6 arp_announce

Julian Anastasov ja (at) ssi (dot) bg 25 Feb 2004

2.4.26 and 2.6.4 will come with 2 new device flags for tuning the ARP stack: arp_announce and arp_ignore. All IPVS like setups can use arp_announce=2 and arp_ignore=1/2/3 to solve the "ARP problem" with DR/TUN setups. These flags are going to replace the "hidden" functionality which does not work well for directors when they are changing their role between master/slave for a particular VIP. The risk is that other hosts can probe for VIP using unicast packets which the hidden flag always replies. I'll continue to support the hidden flag for 2.4 and 2.6 to help existing setups but switching to the new device flags (or other solutions) is recommended.

Documentation is in the 2.6 kernel docs (linux/Documentation/networking/ip-sysctl.txt).

arp_ignore: 1 - reply only if the target IP address is local address configured on the incoming interface. if eth0/arp_ignore=1 then all IPs on eth0 are replied, all others (on lo) are not.

Brett Simpson simpsonb (at) hillsboroughcounty (dot) org 18 Jun 2004

With Redhat WS?/ES/AS 3.0 with the latest kernel update (includes Julian's ARP ignore patch).

/etc/sysctl.conf
net.ipv4.conf.lo.arp_ignore = 1
net.ipv4.conf.lo.arp_announce = 2
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2

/etc/sysconfig/network-scripts/ifcfg-lo:1
DEVICE=lo:1
IPADDR=192.168.0.57
NETMASK=255.255.255.255
NETWORK=192.168.0.0
ONBOOT=yes
ARP=no

On my LVS Director I'm using...

TCP  192.168.0.57:8080 wrr persistent 3600
  -> 192.168.0.59:8080            Route   1      0          0
  -> 192.168.0.58:8080            Route   1      0          0

and on one of the real servers I'm using...

[root@extend1 network-scripts]# ip addr
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet 192.168.0.57/32 brd 127.255.255.255 scope global lo:1
2: bond0: <BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue
    link/ether 00:08:02:f0:e4:30 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.58/24 brd 192.168.0.255 scope global bond0
3: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:08:02:f0:e3:80 brd ff:ff:ff:ff:ff:ff
4: eth1: <BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
    link/ether 00:08:02:f0:e4:30 brd ff:ff:ff:ff:ff:ff

kwijibo (at) zianet (dot) com 19 Mar 2004

I set a VIP on the loopback interface like so:

/sbin/ip addr add xxx.xxx.xxx.xxx/32 dev lo brd + scope host
echo '2' > /proc/sys/net/ipv4/conf/lo/arp_announce
echo '1' > /proc/sys/net/ipv4/conf/lo/arp_ignore

However I still get ARPs for this IP.

Julian

What about setting these flags on all involved ARP devices, e.g. eth0?

Ric Searle ric (at) dialogue (dot) co (dot) uk 19 Mar 2004

as well as your lo interface, try setting these flags on

/proc/sys/net/ipv4/conf/all/arp_announce
/proc/sys/net/ipv4/conf/all/arp_ignore
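Putting the replies above together, the flags go on "all" and on each involved device. A minimal sketch that generates the corresponding sysctl.conf lines, using the values arp_ignore=1 and arp_announce=2 from the examples above, might look like this. The function name is invented, and the device list (lo, eth0) is only an example.

```shell
#!/bin/sh
# Emit sysctl.conf lines that set arp_ignore=1 and arp_announce=2
# on "all" plus each device named on the command line.
arp_sysctl_lines() {
    for dev in all "$@"; do
        echo "net.ipv4.conf.$dev.arp_ignore = 1"
        echo "net.ipv4.conf.$dev.arp_announce = 2"
    done
}

# Example: the loopback device carrying the VIP and the NIC carrying the RIP.
arp_sysctl_lines lo eth0
```

The output can be appended to /etc/sysctl.conf and loaded with `sysctl -p`, in the same way as Brett's settings above.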

5.6.2. noarp v2.6

Masar masar (at) MasarLabs (dot) com 04 Mar 2004

noarp 2.0.0 (http://www.masarlabs.com) is now available. This is the port of noarp to the Linux 2.6.x kernel. For the 2.4.x kernel use noarp 1.x.x. I'm making separate packages of noarp for the two kernels, because the method for producing a module is different. If there is sufficient interest, I may produce a single package for both kernel versions.

5.7. Testing an interface for replies to arp requests

To test that the VIP on the realservers (here lo:0) is hidden from arp requests: You test from a client machine on the same network segment as the NICs on the realservers. For your sanity, you could try this with one realserver at a time. You do _NOT_ have the director (with its pingable VIP) connected to the network (unplug it).

  • Optional: on each realserver, to accumulate a list of the MAC addresses for each NIC.

    realserver: # ping VIP
    realserver: # arp -a	# look for the MAC address of the VIP
    realserver: # ifconfig	# should show the same MAC address
    
  • find the MAC address for the realserver's VIP from the test client.

    client: # ping VIP
    or ping the broadcast address
    client: # ping 192.168.1.255	#for a VIP in the 192.168.1.0/24 network
    then
    client: # arp -a	# look for the MAC address of the VIP on the realserver.
    			# if you have several realservers on-line,
    			# it could be the MAC address of the NIC on any of the realservers
    
  • Hide the lo interface on the realserver (hidden). Before the arp tables expire (15secs - 2mins depending on the OS), ping the VIP again from the test client. The realserver will still reply to the ping, since the MAC address for the VIP will still be in the arp table of the test client.

    client: # ping VIP
    
  • let the arp cache expire (wait 15sec - 2mins) or clear the arp cache of the test client.

    client: # sleep 120
    or
    client: # arp -d VIP	# delete/flush/clear the entry for the VIP
    then
    client: # arp -a	# show that the arp entry for the VIP is gone
    
  • ping the VIP. You should get no reply.

    client: # ping VIP
    
  • Do for all realservers, making sure you get no ping replies for the VIP.

  • On the director (don't connect it to the network yet) find the MAC address of the VIP

    director: # ping VIP		#VIP can be an IP or a resolvable name
    and/or
    director: # ifconfig		#look for MAC address of NIC with VIP
    then
    director: # cat /proc/net/arp	#shows list of IP-MAC address pairs
    or
    director: # arp -a 		#shows FQDN as well
    
  • Connect the director to the network. Just to be sure, clear the arp entry for the VIP on the test client (arp -d VIP) and ping the VIP again. You should get ping replies. The test client's arp cache should have the MAC address of the director's NIC for the VIP.
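The interpretation of the arp entries seen in the steps above can be sketched as a small helper that classifies the MAC address the test client holds for the VIP. This is only an illustration: the function names are invented, the realserver MAC address in the example is made up, and the comparison is made case-insensitive because different tools print MAC addresses in different case.

```shell
#!/bin/sh
# Classify the MAC that the test client holds for the VIP.
# usage: classify_vip_mac OBSERVED DIRECTOR_MAC [REALSERVER_MAC...]
# OBSERVED may be empty, meaning the client has no arp entry for the VIP.
lower() { printf '%s' "$1" | tr 'A-F' 'a-f'; }

classify_vip_mac() {
    if [ $# -lt 2 ]; then
        echo "usage: classify_vip_mac OBSERVED DIRECTOR_MAC [REALSERVER_MAC...]" >&2
        return 2
    fi
    observed=$(lower "$1"); director=$(lower "$2"); shift 2
    if [ -z "$observed" ]; then echo "no-entry"; return; fi
    if [ "$observed" = "$director" ]; then echo "director"; return; fi
    for rs in "$@"; do
        if [ "$observed" = "$(lower "$rs")" ]; then echo "realserver"; return; fi
    done
    echo "unknown"
}

# Hypothetical addresses: director first, then one realserver.
classify_vip_mac 00:80:c8:ca:a7:e4 00:80:C8:CA:A7:E4 00:A0:CC:66:22:20
```

With the director unplugged and the VIP hidden on the realservers, "no-entry" is the result you want; "realserver" means that realserver is still answering arp requests for the VIP. Once the director is connected, the result should be "director".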

5.8. problems with switches

There are other places in the network with arp caches, like "smart" switches. These will bite you if you don't know about them.

frederic (dot) buche (at) equant (dot) com 29 Oct 2003

OK Julian, you are right. The problem came from my network-switch, which keeps in memory the MAC address of all machines. So it just relays the arp request to the concerned server, by using a unicast arp request.

Just for a test, I have deleted the MAC entry on my switch. Then I reproduced the same test as before ... and the hidden patch works well!

Carlos J. Ramos cjramos (at) genasys (dot) com 15 Dec 2003

We are using an HP Procurve Switch 2124 in a cluster using Heartbeat and Ldirectord as HA and Balancing mechanisms. Previously we had similar working setups with a hub in the same location. Everything works fine, till we make a takeover on directors. As the switch documentation says, the switch automatically learns MAC addresses and associates them with its ports, so that although heartbeat changes IP address, the switch tries to use the same switch port. The situation remains for at least 1 hour... for this time the forwarding in the cluster does not work... and realservers are unable to be reached from outside... We are assuming this is an arp caching problem, although we haven't eliminated other possible causes yet.

Is there any way to force the switch to refresh its MAC address table? Is there any Linux tool that sends some kind of packet over the net to force the ARP table to be updated?

5.9. The ARP problem, the first inklings

History: The ARP behaviour changed between 2.0.x and 2.2.x kernels. Here's the original posting by Wensong and a reply from Alexey Kuznetsov (2.2 tcpip author).

Wensong Zhang wensong (at) iinchina (dot) net 24 Mar 1999

Today I upgraded the kernel to 2.2.3 with tunneling support on one of the real servers, and found that the Linux 2.2.3 tunnel device answers ARP requests, even when I used the NOARP option as follows:

realserver:# ifconfig tunl0 172.26.20.110 -arp netmask 255.255.255.255 broadcast 172.26.20.110

It still answers ARP requests. This will seriously interfere with virtual server via tunneling. In fact, the tunnel device shouldn't answer ARP requests from the ethernet. I think it is a bug in linux/net/ipv4/ipip.c, which is now a clone of ip_gre.c rather than the original tunneling code.

If you are interested, you can test this yourself on kernel 2.2.3: choose a free IP address on your ethernet, configure it on the tunl0 device, then telnet to that IP address from another host. I guess you will be able to connect. Finally, have a look at ipip.c, maybe you can debug it. :-)

But, what is the IFF_NOARP flag of the tunnel device for?

kuznet (at) ms2 (dot) inr (dot) ac (dot) ru

IFF_NOARP means that ARP is not used by THIS device. On normal IPIP tunnels it does not make much sense, but may be used for example to turn on/off endpoint reachability detection.

I do not see any reasons to disable answering ARP in such circumstances. Isolation of VPNs on adjacent segments is impossible at routing/arp level, it is just not well-defined behaviour.

If the isolation is made with firewall policy rules, then it is clear that arp policy must be handled at this level too.

In kernel 2.0.x, the tunnel device doesn't answer ARP requests.

Yes.

Yeah, we can have link-local addresses that don't answer ARP requests in kernel 2.2.x. For example, we can configure all the hosts in a network with the following command:

ifconfig lo:0 192.168.0.10 up

There will be no collision. The loopback alias interfaces don't answer ARP requests.

Are you sure? I am not. Please, test.

BTW, you take a risk adding non-loopback addresses to the loopback device. They have the HIGHEST preference to be used as router identifier, so VPN addresses cannot be added to loopback at all.

No, it doesn't fail. I tested it with kernel 2.0.36, it worked.

It does not work under 2.2. To be honest, I am about to stop understanding you. You talk about 2.2, but all your tests are made on 2.0. 8)

5.10. A posting to the mailinglist by Peter Kese explaining the "arp problem"

(saved for posterity by Ted Pavlic, minor editing by Joe)

peter (dot) kese (at) ijs (dot) si

Before we start, let's assume we have the following network configuration for an LVS running LVS-DR.

client		10.10.10.10

gw		192.168.1.1

director	192.168.1.10 	IP for admin (director IP)
        	192.168.1.110 	VIP (responds to arp requests)

real server	192.168.1.11 	IP to which each service is listening (realserver IP)
		192.168.1.110 	VIP (DOES NOT respond to arp requests)

The virtualserver is the combination of the director and the realserver running LVS.

Our goal is:

  1. Virtual server should respond to arp requests for both the VIP and the director IP.
  2. The realserver should respond to arp requests for the realserver IP but NOT the VIP.
  3. The gateway sends packets for the VIP to the director (the load balancer) no matter what.

Problem 1: Interface aliases

Realserver and director need to have an interface with the VIP in order to respond to packets for virtual server. A real interface is not needed, an IP alias will do just fine and this interface alias could be either eth0:0 or lo:0.

On the 2.0 kernels, the ARP responding ability of an interface alias (eg eth0:0) could be enabled or disabled independently of the main (eth0) interface. If you wanted eth0:0 not to respond to ARP requests, you could simply say:

        ifconfig eth0:0 192.168.1.2 -arp up

Thus in the 2.0 kernels it is possible, on a realserver, to have the realserver IP (on eth0) respond to arp requests and for the VIP (on eth0:0) to not respond.

In the 2.2 kernels this doesn't work any more. Whether an interface alias responds to ARP requests depends only on the way the real interface is configured. So if eth0 responds to ARP requests (which it normally will), eth0:0 carrying the VIP will also respond to ARP requests no matter what.

This means an ethernet alias (eth0:0) is not permitted on real servers, because real servers should not respond to ARP requests for the VIP.

On the other hand, loopback aliases never respond to ARP requests, which means that the loopback alias (lo:0) must not be used on the director for the VIP.

Problem 2: Loopback aliases

I haven't done much checking on loopback interface problem, but it seems that if an alias is used on a loopback interface (as is required for LVS-DR) on a real server running kernel 2.2.x, the whole ARP gets screwed.

It appears that loopback interfaces get special ARP treatment in the kernel, so I suggest avoiding loopback aliases altogether.

The question now is: What kind of an interface can I use on real servers?

As I already noted, an eth0:0 alias cannot be used, because such aliases respond to ARP requests. lo:0 aliases cannot be used, because they cause ARP problems too.

In case of tunneling VS configuration, the answer is trivial: tunl0. But to be honest, tunl0 interface can also be used for direct routing.

(from Joe, the dummy device is OK too)

With direct routing, the only thing we need an interface for is to let the kernel know we possess an additional IP address. This means we can set up any kind of interface, as long as it doesn't respond to ARP requests. Instead of tunl0, you could also set up a ppp0, slip0, eth1 or whatever. I suggest setting up a tunl0:

        ifconfig tunl0 192.168.1.2 -arp up

Problem 3: Real server ARP requests.

Suppose we have set up a virtual server as described at the beginning. All computers are running, but no requests have been made.

Then the client sends a request to the VIP.

When the packet arrives at the gateway, the gateway makes an ARP query for the VIP and the director responds. The gateway remembers the director's MAC address and sends the packet to the director. The director receives the packet, looks up its ipvsadm/LVS tables, chooses a real server and forwards the packet to it by the direct routing or tunneling method.

Real server receives the packet and generates a response packet with destination=client, source=VIP.

(until now everything works correctly)

When the real server wants to send the response packet to the gateway, it finds that it does not know the gateway's MAC address.

It sends an ARP request to the local network and asks for the gateway MAC address. This should look like:

ARP, who has 192.168.1.1 (gw), tell 192.168.1.11 (realserver IP)

But in reality, real server asks something like:

ARP, who has 192.168.1.1 (gw), tell 192.168.1.110 (VIP),

because it takes the source address from the packet it wants to send.

Here the problems come in.

The gateway receives the ARP request and responds to it, which is correct. But at the same time, the gateway does a little optimization. It notices that the realserver's MAC address is not listed in its ARP table and adds the entry, just in case it might need that address in the near future.

The ARP request contained the VIP and the realserver's MAC address, so from now on the gateway will send all packets destined for the VIP to the real server instead (due to the MAC address). This means all packets that follow will bypass the load balancer altogether and be answered by that realserver.

If the real server's ARP request had been:

ARP, who has 192.168.1.1 (gw), tell 192.168.1.11 (realserver IP)

all this would not have happened. Therefore I have patched the 2.2 VS kernel in such a way, that it composes ARP requests based on the address of the interface selected by the routing tables instead of the address taken from the packet itself.

In order for virtual server to work correctly, the real servers should have patched kernels as well, or at least copy the patched /usr/src/linux/net/ipv4/arp.c file to the real servers before compiling the kernels.
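
The cache poisoning described in Problem 3 can be modelled in a few lines of Python. This is a toy model and not kernel code: the class and function names are invented for the illustration, the MAC addresses are made up, and the gateway is assumed to record the sender's IP/MAC pair from any ARP request addressed to it, as described above. The IP addresses are the ones from the network configuration at the start of this posting.

```python
GW_IP, VIP, RIP = "192.168.1.1", "192.168.1.110", "192.168.1.11"
DIRECTOR_MAC = "00:00:00:00:00:10"    # made-up MAC for the director
REALSERVER_MAC = "00:00:00:00:00:11"  # made-up MAC for the realserver


class ArpCache:
    """Toy ARP cache for the gateway: maps IP -> MAC."""

    def __init__(self):
        self.table = {}

    def receive_request(self, sender_ip, sender_mac):
        # The target of an ARP request records the sender's IP/MAC pair
        # (the "little optimization" described in the text).
        self.table[sender_ip] = sender_mac


def arp_sender_ip(packet_source_ip, outgoing_interface_ip, patched):
    """Pick the sender IP the realserver puts in its ARP request."""
    # Unpatched 2.2: taken from the packet being sent (source = VIP).
    # Patched: taken from the interface selected by the routing table (RIP).
    return outgoing_interface_ip if patched else packet_source_ip


def gateway_table_after_reply(patched):
    """Return the gateway's ARP table after the realserver sends its first reply."""
    gateway = ArpCache()
    gateway.table[VIP] = DIRECTOR_MAC           # learned when the director answered
    sender = arp_sender_ip(VIP, RIP, patched)   # reply packet has source = VIP
    gateway.receive_request(sender, REALSERVER_MAC)
    return gateway.table
```

With patched=False the entry for the VIP in the gateway's table is overwritten with the realserver's MAC address; with patched=True the VIP entry still points at the director and the gateway simply learns the realserver IP.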

Conclusion

Those were my experiences with ARP problems and the 2.2 kernel virtual server.

I think it would be wise to add this letter to the web site and notify the network developers about our findings at some point in time.

Here are some golden rules I stick to, when I do virtual server configuration:

Rule 1:
        Do not use lo:0 alias on the director.
        Use eth0:0 alias instead.

Rule 2:
        Avoid using lo:0 aliases, even on realservers.
        Use tunl0 or some other simulated interface
        on real servers instead. (Joe: use dummy0)


Rule 3:
        Apply the VS patch to kernels on real servers.

5.11. arp bouncing

symptoms of realservers arp'ing - arp bouncing

Stephen Williams sdw (at) lig (dot) net (Stephen wrote one of the patches that stop devices in 2.2.x kernels from replying to arp requests)

If you don't use the patch you'll find that the 'active' box will bounce from machine to machine as each one sends an ARP reply that is heard last. Additionally you will get TCP resets as connections that were on one box suddenly start going to others. Very nasty and unusable.

5.12. Lars' Method

(This is called Lars' method)

Lars

I have thought about how the ARP problem can occur at all with direct routing, because I never noticed it. Then it occurred to me that your VIP comes from the same subnet as the RIP of the LVS and also all the real servers share this media.

To avoid the "ARP problem" in this case without adding a kernel patch or anything else, you can just add a direct route for the VIP using the RIP of the LVS as a gateway address on the router in front of the LVS. ("ip route VIP 255.255.255.255 real_ip" on a Cisco, or "route add -host VIP gw RIP" on Linux)

Since I just used 2 ethernet cards and had the LVS act as gateway/firewall anyway, I never noticed the ARP problem. (We have 2 LVS in a standby configuration to eliminate the SPOF)
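
Why a host route on the router sidesteps the ARP problem can be seen from how a router picks a route: the most specific matching prefix wins. The sketch below is only an illustration of longest-prefix matching using Python's ipaddress module, not a model of any particular router. The function name and the route table are hypothetical, the addresses are borrowed from Peter Kese's example (VIP 192.168.1.110, director 192.168.1.10), and it assumes that "the RIP of the LVS" in Lars' posting means the director's own IP.

```python
import ipaddress


def next_hop(routes, dest):
    """Return the next hop of the most specific route matching dest, or None."""
    dest = ipaddress.ip_address(dest)
    best = None
    for prefix, hop in routes:
        net = ipaddress.ip_network(prefix)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1] if best else None


# Hypothetical routing table on the router in front of the LVS.
ROUTES = [
    ("192.168.1.0/24", "on-link"),         # connected subnet: router ARPs for the destination itself
    ("192.168.1.110/32", "192.168.1.10"),  # host route: VIP reached via the director's IP
]
```

With the host route in place, next_hop(ROUTES, "192.168.1.110") gives the director's IP, so the router only ever ARPs for 192.168.1.10 and never for the VIP. Whatever the realservers might answer for the VIP no longer matters to the router; hosts on the LVS network itself are not helped by this.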

5.13. Static Routing to Director

The arp problem is handled if the router in front of the director has a static route for the VIP to the director (i.e. packets for the VIP from the outside world are sent to the director and cannot get to the realservers).

Wensong

For the clients who reach the virtual server through the router, there is no problem if a static route for VIP is added.

However, for clients on the same network as the virtual server, the "ARP problem" will arise. There is a fight between ARP responses, and the clients don't know whether to send packets to the load balancer or to a real server.

In my view, the VIP is shared by the director and the realservers in LVS-Tun or LVS-DR. Only the director answers ARP requests for the VIP, so that it accepts the request packets; the realservers have the VIP but don't answer ARP for it, so that they can process packets destined for the VIP.

5.14. iproute2 arp on|off flag

Joe, 21 May 2001

Was looking at the ip (i.e.iproute2) notes and it says

ip arp on|off

--change NOARP flag on the device

NB. This operation is not allowed if the device is in state UP.
Though neither ip utility nor kernel check for this condition, you can
get unpredictable results changing the flag while the device is running.

Is this like the old -noarp flag for ifconfig?

Julian Anastasov ja (at) ssi (dot) bg 21 May 2001

This is the device ARP flag, same as ifconfig [-]arp. The flag is used to allow ARP packets for the specified device. It is correct that "lo" does not talk ARP, but you connect to the VIPs on "lo" through eth*, so the flag is of no help for LVS. We can't drop the flag for eth device.

Andreas J. Koenig, 02 Jun 2001

kernel 2.4.5 has arp_filter

Julian Anastasov ja (at) ssi (dot) bg

arp_filter does not solve the ARP problem for LVS

This is a new proposal to control ARP probes and replies based on the route flag "noarp". It will be discussed on the netdev mailing list, and something like it may be included in 2.4, and maybe in 2.2 too; I'm not sure. As you all know, the hidden feature is not being considered for 2.4. The net developers have the final word. I'll try to maintain the hidden flag in future kernels as long as it is more usable than the new feature, because the hidden flag has different semantics and because there may be user space tools that rely on it.

5.15. Is the arp behaviour of 2.2.x kernel a bug?

Note

Julian Anastasov is replying to correct an error in a previous version of the HOWTO where I state that the dummy0 device in 2.2.x kernels does not arp. Julian wrote one of the realserver patches which fix the "arp problem".

Julian

In fact, the documentation is incorrect. There is no difference, all devices are reported in the ARP replies: lo, tunl and dummy. So, only the ARP patch can solve the problem. This can be tested using this configuration with any device (before the patch is applied):

Host A:
         eth:x 192.168.0.1

Host B:
         eth:x 192.168.0.2
         lo, dummy, tunl: 192.168.0.3

On host A try: ping 192.168.0.3

Host B replies for 192.168.0.3 through the 192.168.0.2 device.

So, the ARP problem means: "All local interfaces are reported" until the ARP patch is used. In fact, all ARP patches which use IFF_NOARP to hide the interface are incorrect. I don't expect them in the kernel.
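
Julian's point can be restated as a very small decision function. The following is a toy model only, written for this illustration (the function name, the device-to-address dictionary and the hidden_devices parameter are all invented here, and none of it is kernel code): an unpatched 2.2 host answers an ARP request for any address it holds locally, whatever device carries it, while a host with a patch that hides chosen devices skips the addresses on those devices. Note that the NOARP flag of the device does not appear anywhere in the decision.

```python
def answers_arp(target_ip, addresses, hidden_devices=()):
    """Toy model: does the host answer an ARP request for target_ip?

    addresses maps device name -> IP address configured on that device.
    hidden_devices lists devices whose addresses are not reported
    (empty for an unpatched 2.2 kernel).
    """
    for device, ip in addresses.items():
        if ip == target_ip and device not in hidden_devices:
            return True
    return False


# Host B from the example above: RIP on eth0, the extra address on dummy0.
HOST_B = {"eth0": "192.168.0.2", "dummy0": "192.168.0.3"}
```

In this model answers_arp("192.168.0.3", HOST_B) is true for the unpatched case, which is the ARP problem, and becomes false only when dummy0 is listed in hidden_devices.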

Stephen Williams (who wrote another of the patches to fix the arp problem).

Of course the ARP code in the kernel needs to be fixed so my filter code isn't needed. Still, I'm confused by this statement. The IFF_NOARP flag determines whether a device arp replies or not. What's wrong with honoring that?

If you mean that arp replies should never be sent on another interface, that is what I currently believe to be correct.

Julian

My understanding is that 2.2.x ARP code is not buggy and there is no need to be "fixed". I must say that your patch is working for the LVS folks but not for all linux users.

IFF_NOARP means "Don't talk ARP on this device", from the 'man ifconfig':

[-]arp Enable or disable the use of the ARP protocol on this interface.

So, where is the bug? The ARP code never talks through lo, dummy and tunl devices when they are set NOARP. It uses the eth (ARP) device. If you hide all NOARP interfaces from the ARP protocol, this is a bug. One example:

 +--------+ppp0                          +------+
 | Host A |------------ppp link----------|ROUTER|------ The World
 +--------+A.B.C.1 (www.domain.com)      +------+
   |eth0
   |A.B.C.2
   |
   |A.B.C.3
 +--------+
 | Host B |
 +--------+

Is it possible, after your patch, for Host B to access www.domain.com? How? Host A doesn't send replies for A.B.C.1 through eth0 after your patch. OK, maybe this is not fatal. Tell that to all kernel users. You hide all their NOARP interfaces. Maybe there are other examples where this is a problem too. Or maybe there is something wrong in this configuration?

I want to say that this patch hurts all users if present in the kernel. On Nov 6 I posted a patch proposal to the linux-kernel list which adds the ability to hide interfaces from ARP queries and replies. The difference is that only the specified interfaces are hidden, not all NOARP interfaces. Its arp_invisible sysctl can be used by LVS folks to hide lo, tunl or dummy interfaces, but this feature doesn't hurt all kernel users. I think this patch is more acceptable and can be included in the 2.2 kernel, maybe after some tuning. I'm still expecting comments from the net folks and from all LVS users.

5.16. Arp caching defeats Heartbeat switchover

Claudio Di-Martino claudio (at) claudio (dot) csita (dot) unige (dot) it

I've set up a VS using direct routing composed of two linux-2.2.9 boxes with the 0.4 patch applied. The load balancer acts as a local node too. I configured mon to monitor the state of the services and update the redirect table accordingly. I also configured heartbeat so that when the load balancer fails the second machine takes over the virtual ip, sets up the redirect table and starts mon. When the load balancer restarts, the backup reconfigures itself as a real server, drops the interface alias that carries the virtual ip, stops mon, clears the redirect table. Although the configuration of the two machines is set up correctly it fails to restore the load balancer due to arp caching problems.

It seems that the local gateway keeps routing requests for the virtual ip to the load balancer backup. Sending gratuitous arp packets from the load balancer has no effect since the interface of the backup is still alive and responding.

Has anyone encountered a similar problem and is there a hack or a proper solution to take back control of the virtual ip?

Antony Lee AntonyL (at) hwl (dot) com (dot) hk

I am new to LVS and I have a problem setting up two LVSes for failover. The problem is related to the ARP caching of the primary LVS's MAC address in the real servers and in the router connected to the Internet. It causes all Internet connections to stall until the ARP cache entries in the web servers and the router expire. Can anyone help to solve the problem by making some changes to the Linux LVS? (I am not able to change the router's ARP cache time. The router is owned by the web hosting company, not by me.)

In each LVS, there are two network cards installed. eth0 is connected to a router which is connected to the Internet. eth1 is connected to a private network, which is the same segment as the two NT IIS4 servers.

The eth0 of the primary LVS is assigned an IP address 202.53.128.56
The eth0 of the backup LVS is assigned an IP address 202.53.128.57
The eth1 of the primary LVS is assigned an IP address 192.128.1.9
The eth1 of the backup LVS is assigned an IP address 192.128.1.10

In addition, both primary and backup LVS have enabled the IPV4 FORWARD and IPV4 DEFRAG. In the file /etc/rc.d/rc.local the following command was also added:

ipchains -A forward -j MASQ -s 192.168.1.0/24 -d 0.0.0.0/0

I use piranha to configure the LVS so that the two LVSes have a common IP address 202.53.128.58 on eth0 as eth0:1, and an IP address 192.128.1.1 on eth1 as eth1:1.

The pulse daemon is also run automatically when the two LVSes boot.

In my configuration, Internet clients can still access our web server when one of the NT servers is disconnected from the LVS. The backup LVS --CAN AUTOMATICALLY-- take up the role of the primary LVS when the primary LVS is shut down or disconnected from the backup LVS. However, I found that none of the NT web servers can reach the backup LVS through the common IP address 192.128.1.1, and all Internet clients stall when connecting to our web servers.

Later, I found that the problem may be due to ARP caching in the web servers and the router. I tried limiting the ARP cache time to 5 seconds on the NT servers and half of the problem was solved, i.e. the NT web servers can reach the backup LVS through the common IP address 192.128.1.1 when the primary LVS is down. However, Internet clients still cannot connect when the LVS failover occurs.

Wensong

I just tried two LVS boxes with piranha 0.3.15. When the primary LVS stops or fails, the backup will take over and send out 5 Gratuitous Arp packets for the VIP and the NAT router IP respectively, which should clean the ARP caching in both the web servers and the external router.

After the LVS failover occurs, the established connections from the clients will be lost in the current version, and the clients need to reconnect to the LVS.

.. 5 ARP packets for each IP address, and 10 for both the VIP and the NAT router IP. I saw the log file as follows:

Mar  3 11:12:14 PDL-Linux2 pulse[4910]: running command "/sbin/ifconfig" "eth0:5" "192.168.10.1" "up"
Mar  3 11:12:14 PDL-Linux2 pulse[4908]: running command "/usr/sbin/send_arp" "-i" "eth0" "192.168.10.1" "00105A839CBE" "172.26.20.255" "ffffffffffff"
Mar  3 11:12:14 PDL-Linux2 pulse[4913]: running command  "/sbin/ifconfig" "eth0:1" "172.26.20.118" "up"
Mar  3 11:12:14 PDL-Linux2 kernel: send_arp uses obsolete (PF_INET,SOCK_PACKET)
Mar  3 11:12:14 PDL-Linux2 pulse[4909]: running command "/usr/sbin/send_arp" "-i" "eth0" "172.26.20.118" "00105A839CBE" "172.26.20.255" "ffffffffffff"
Mar  3 11:12:17 PDL-Linux2 nanny[4911]: making 192.168.10.2:80 available

I don't know if the target addresses of the 2 send_arp commands are set correctly. I am not sure if it is different when broadcast or source IP is used as target address, or any target address is OK.

Horms

Are there just 5 ARPs, or 5 to start with and then more gratuitous ARPs at regular intervals? If the gratuitous ARPs only occur at fail-over, then once the ARP caches on hosts expire there is a chance that a failed host, whose kernel is still functional, could reply to an ARP request.

wanger (at) redhat (dot) com

When we put this together, I talked to Alan Cox about this. His opinion was that we should send 5 ARPs out, 2 seconds apart. If there is something out there listening that cares, then it will pick it up.

The way piranha works, as long as the kernel is alive, the backup (or failed node) will not maintain any interfaces that are Piranha managed. In other words, it removes any of those IPs/interfaces from its routing table upon failure recovery.

5.17. Weird Hardware: Routers gratuitously cache arp data (failover is slow)

Some hardware manufacturers think that they know better than everyone else and release equipment with non-standard timeouts.

Sean Roe May 06, 2004

I was looking for some info on cisco catalyst switches to help speed up the failover between my two director boxes. I have the following LVS-NAT setup:

                   |--------|-----|WebServer1|
       -- |LVS01|--|Cisco   |-----|WebServer2|
       |           |        |-----|WebServer3|
 ------|           |Catalyst|-----|WebServer4|
       |           |        |-----|WebServer5|
       -- |LVS02|--|Switch  |-----|WebServer6|
                   |--------|
  Virt     LVS                    Real
  IP       Servers                Servers

My problem is that if lvs01 fails, lvs02 takes over the load, but it takes forever (5-6 minutes) for the real servers to start using the new director. It also seems to work faster if I actually restart the httpd on each webserver. This is an LVS-NAT with multiple virtual IPs going to different ports on the webservers.

John Reuning john (at) metalab (dot) unc (dot) edu 23 Apr 2004

I've seen a similar delay in failover when using cisco routers. They don't update the internal MAC address table after receiving gratuitous arp packets during an LVS director failover event. I don't know if the heartbeat package uses arps to fail over, but keepalived does. Cisco routers seem to need icmp packets before they'll update the MAC address table. For LVS, the problem here is that the router continues to send traffic to the VIP at the master's hw address instead of shifting to the backup's hw address.

However, this wouldn't explain why your real servers route to the wrong address. The real servers and the LVS directors are on the same network segment, right?

The problem isn't with the layer-2 switches, it's with the next-hop router (the external default gateway for the LVS directors). It's common behavior with Cisco routers to update their arp cache table in response to source-generated packets but not in response to gratuitous arp packets.

Peter Mueller

I've seen a similar delay in failover when using cisco routers.

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 24 Apr 2004

Me too. ISPs often configure managed routers to not respond to arp requests. You tend to have to ask them to flush the routing table if you change any of your router facing ips. I'm sure the routers can be configured to respond to ARPs.

Horms 07 May 2004

Sounds like there could be a problem with your gratuitous arps that are supposed to effect failover. I have used catalyst switches quite a lot; in fact both my test rack and the main switch for the network here at VA Japan used catalyst switches. I have found that they are quite aggressive about caching ARP information, and in some cases seem to effect proxy arp. But the current send_arp code in heartbeat seems to work just fine. Actually, I sometimes run that command manually after rearranging IP addresses on machines.

5.18. The device doesn't reply to arp requests, the kernel does.

ARP requests/replies are thought of as coming from a device and people make statements like

"the dummy device in 2.0.x kernels does not reply to arp requests while the same device in 2.2.x kernels does reply".

It is the kernel that handles arp requests according to a set of rules and not the device. The code for the dummy device is the same in 2.0.x and 2.2.x kernels and is not responsible for the change in arp behaviour.

(The RFC for ARP is at ftp://ftp.isi.edu/in-notes/std/std37.txt; also see rfc826 and rfc1122. The model system used there is 2 machines on a single ethernet. It doesn't shed any light on the implementation of ARP on multi-interface systems like LVS.)

5.19. Properties of devices for the VIP

In a previous version of the HOWTO I stated that the dummy0 device did not arp in 2.2.x kernels and therefore could be used as the device for the VIP on an unpatched 2.2.13 realserver. Julian Anastasov replied that they did arp (see below for his posting and the ensuing discussions).

I hadn't actually tested whether the dummy0 device arp'ed but had concluded that it wasn't arp'ing because I had a working LVS using the dummy0 interface for the VIP on unpatched 2.2.x realservers and because as everyone knows ;-) an LVS needs to have a non-arp'ing device on the VIP of the realservers.

I had a LVS-DR LVS which worked with dummy0, lo:0 and tunl0 as the VIP device and which on further testing, I found also worked with eth0:1 or eth1 as the VIP device on 2.2.13 realservers. Whatever the arp'ing status of dummy0, lo:0 or tunl0, clearly eth1 replies to arp requests, so despite the conventional wisdom, it is possible to build an LVS with arp'ing VIP's on the realservers.

On investigating why this LVS worked, I found that the MAC address for the VIP in the client's arp cache (# arp -a) was always the director. I assume this was because the director is 3-4x the speed of the other machines in the LVS and it replies to arp requests first for the VIP (another posting from Stephen Williams says that the address which replies last is stored in the arp cache - we'll figure out what's really going on here eventually). On another LVS where the realservers were all identical hardware with 2.2.13 unpatched kernels, one particular realserver always was the machine in the client's arp cache for the VIP (to check, delete entry for VIP with arp -d, then ping again, then look in arp cache).

I found that I could get a working LVS using almost anything to hold the VIP on the realservers, including eth0:1 and eth1 (another NIC in the realserver). These devices carrying the VIP were pingable from the client and I could get the corresponding MAC addresses in the arp table of the client if the director was not setup with a VIP. When I setup a working LVS this way, I found each time that the MAC address for the VIP in the client's arp cache was the director's MAC address. For some reason, that I don't know, whenever the client does an arp request for the VIP, it gets the director's MAC address.

Possible reasons for the MAC address of the director always being associated with the VIP in my LVS -

  1. I configure the director first and then the realservers. I don't make requests for a service till the realservers are setup. (Still I can't imagine the client asking for the MAC address of the VIP until it makes a connect request.)
  2. The director is 3 times faster (CPU speed) than the next machine in the LVS and it always replies to arp request first.
  3. I was lucky.

Since you can make a working LVS-DR LVS with the realserver VIP on an arp'ing eth0:1 device I decided that the relevant piece of information about arp'ing was (ta da!)

* an LVS will work if the client always gets the MAC address of the director when it asks for the MAC address of the VIP *

This provides an easy solution - you tell the client (or the router) the MAC address of the VIP with arp -s or arp -f.

here's my /etc/ethers

lvs.mack.net 00:A0:CC:55:7D:47

After installing the MAC address of the DIP (director) as the MAC address of the VIP (lvs) in the arp table (arp -f /etc/ethers) I get

client:/usr/src/temp/lvs# arp -a
realserver1.mack.net (192.168.1.1) at 00:90:27:66:CE:EB [ether] on eth0
lvs.mack.net (192.168.1.110) at 00:A0:CC:55:7D:47 [ether] PERM on eth0
director.mack.net (192.168.1.10) at 00:A0:CC:55:7D:47 [ether] on eth0

notice the "PERM" in the VIP entry on the client.

removing the permanent entry

client:/usr/src/temp/lvs# arp -d lvs.mack.net
client:/usr/src/temp/lvs# arp -a
realserver1.mack.net (192.168.1.1) at 00:90:27:66:CE:EB [ether] on eth0
lvs.mack.net (192.168.1.110) at <incomplete> on eth0
director.mack.net (192.168.1.10) at 00:A0:CC:55:7D:47 [ether] on eth0

If I edited /etc/ethers changing the MAC address of lvs to anything else, the LVS did not work anymore. So the arp information is coming from /etc/ethers rather than some uncontrolled variable I'm not aware of.

I had thought that in an LVS with the VIP on realservers on an arping device that the VIP would hop from one machine to another (see the postings in the MISC section). Since naturally occurring LVS's with arping VIP's on realservers existed and worked well (mine), I set up an LVS by making a permanent entry for the VIP of the director in the arp cache of the client (router). This can be done by

$ arp -f /etc/ethers
or
$ arp -s 192.168.1.110 MAC_ADDRESS

There are 2 results of this

  1. the realservers can have the VIP on an arp'ing device (eg eth0:1, eth1) - you don't need lo, dummy0 or tunl0 for realservers with 2.0.36 and 2.2.x kernels.
  2. If two (or more) directors are set up in failover mode, the mechanism for moving the VIP from one to another is broken by making a permanent entry for the VIP on the director in the arp cache of the router. This is not a problem for a test setup to demonstrate an LVS but may be a problem in a high availability environment (a solution may be found in the meantime too).

The normal method for changing directors (e.g. with heartbeat) includes a gratuitous arp. To force a gratuitous arp

Julian

You can use Yuri Volobuev's send_arp.c from the 'fake' package or Alexey Kuznetsov's arping from his iputils package:

  • fake - http://vergenet.net/linux/fake/
  • iputils - ftp://ftp.inr.ac.ru/ip-routing/iputils-ss991024.tar.gz iputils is also used for IPAT, IP address takeover

Joe Dec 2003

There is also http://www.vergenet.net/~acassen/software/garp-0.1.1.tar.gz which has been available for over a year, without me even knowing about it.
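
For readers who want to see what these tools put on the wire, here is a minimal sketch in Python of a gratuitous ARP frame. It is not the code of send_arp, arping or garp; the function names are invented for this illustration. It builds the request form of a gratuitous ARP: an Ethernet broadcast whose ARP sender and target IP are both the address being announced, with the target hardware address left as zeros. The optional sender is Linux only, needs root, and its defaults of 5 frames 2 seconds apart simply echo the figure quoted in the piranha discussion earlier in this section.

```python
import socket
import struct
import time


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' (or a dash/no-separator form) to 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", "").replace("-", ""))


def build_gratuitous_arp(ip, mac):
    """Return a 42-byte Ethernet frame announcing that ip lives at mac."""
    sha = mac_to_bytes(mac)         # sender hardware address
    spa = socket.inet_aton(ip)      # sender protocol address (the VIP)
    eth_header = b"\xff" * 6 + sha + struct.pack("!H", 0x0806)  # broadcast, EtherType ARP
    # hardware type 1 (Ethernet), protocol 0x0800 (IPv4), hlen 6, plen 4, opcode 1 (request)
    arp_fixed = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # target hardware address is zeros; target IP equals sender IP
    return eth_header + arp_fixed + sha + spa + b"\x00" * 6 + spa


def send_frame(ifname, frame, count=5, interval=2.0):
    """Send the frame count times on ifname (Linux only, requires root)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((ifname, 0))
        for i in range(count):
            sock.send(frame)
            if i < count - 1:
                time.sleep(interval)
```

A director taking over the VIP in the example above might call send_frame("eth0", build_gratuitous_arp("192.168.1.110", "00:A0:CC:55:7D:47")). Whether a given router or switch honours such frames is a separate question; see the reports about Cisco equipment in the section on weird hardware.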

Here's some tests I did

LVS equipment: 2.2.13 client, and 0.9.4/2.2.13 director.
2 realservers
a) 2.0.36 kernel, libc5, gcc-2.7.2.3, net-tools 1.42.
b) 2.2.13 kernel, glibc, gcc-2.95,    net-tools 1.52

Experiment 1: Result - arp'ing is independent of [-]arp

Summary: the -arp/+arp option for ifconfig had no effect on any devices back to 2.0.36 kernels with net-tools 1.42. If a device normally arps then -arp had no effect; if it normally doesn't arp, then "arp" doesn't turn it on (data below).

Method: IP=192.168.1.1/24 with VIP=192.168.1.110/32. The VIP was on dummy0. The test was to see if the VIP was pingable from another (external) machine on the 192.168.1.0/24 network or pingable from the machine itself (ie internally from the console). (I assume I had a route add -host for the VIP although I didn't record this). The test was done with ifconfig using arp or -arp (the output of ifconfig -a didn't change)

                 -----2.0.36------- -----2.2.13------
ping from        internal  external internal external
VIP device
dummy   ARP        +         -        +        +
        NOARP      +         -        +        +
        down       -         -        -        - (control)

Experiment 2: Can the VIP be on a separate NIC?

Summary: yes, as long as the NIC doesn't have a cable plugged into it.

Method: same as above except VIP on eth1 (another NIC).

                 -----2.0.36-------
ping from        internal  external
VIP device
eth1 has cable connected to 192.168.1.0 network
eth1    ARP        +         +
        NOARP      +         +

eth1 cable to network removed
eth1    ARP        +         -
        NOARP      +         -
        works as realserver in LVS - yes

One of the reasons a no_arp interface is used on the realserver is that it is not visible to the rest of the network. Does the LVS work if the eth1 VIP on the realserver is not visible to the rest of the network?

Conclusion: for 2.0.36 dummy0 doesn't arp, and eth1 does arp. The arp/-arp option to ifconfig has no effect on arp behaviour. LVS works with both dummy0 and eth1, I assume since the VIP need only be resolved as local on the realserver and does not need to be visible to the network.

Experiment 3: What devices and netmasks are necessary for a working LVS?

Using the /etc/ethers approach for setting the MAC address of the VIP I then set up an LVS with a pair of realservers serving telnet. All IPs are 192.168.1.x, all machines have a route to 192.168.1.0 via eth0. There is no default route.

1. 2.0.36, libc5, gcc 2.7.2.3, net-tools 1.42
2. 2.2.13, glibc-2.1.2, gcc-2.95, net-tools 1.52

with the following devices holding the VIP, tunl0, eth0:1, lo:0, dummy0, eth1. In each case there was no route entry for the VIP device and there was no cable connected to eth1 when it was used for the VIP. The table below shows whether the LVS worked. The VIP is installed with

ifconfig $DEVICE 192.168.1.110 netmask $NETMASK broadcast $BROADCAST
with $NETMASK="255.255.255.255" $BROADCAST="192.168.1.110"
or   $NETMASK="255.255.255.0"   $BROADCAST="192.168.1.255"

The results belong to one of 3 groups

+ works fine
- doesn't work
  (at $ prompt on client get
  "unable to connect to remote host.  Protocol not available"
  then client returns to regular unix $ prompt)
hang - client hangs, realserver cannot access network anymore,
  have to run rc.inet1 from console prompt on realserver to
  start network again.

netmask of VIP=255.255.255.255 (normal LVS setup)

LVS type  -----VS-Tun------     ----VS-DR------
kernel    2.0.36     2.2.13     2.0.36   2.2.13

VIP on
tunl0      +           +         +         +
eth0:1     +           -         +         +
lo:0       +           -         +         +
dummy0     +           -         +         +
eth1       +           -         +         +

netmask of VIP=255.255.255.0 (not normally used for LVS)

VIP on
tunl0      +           +         +         +
eth0:1     +           -         +         +
lo:0       +           hangs     +         hangs
dummy0     +           -         +         +
eth1       +           -         +         +

It would seem that any device and any netmask can be used for the VIP on a 2.0.36 realserver for both LVS-Tun and LVS-DR.

For a 2.2.13 realserver:

LVS-Tun: VIP on a tunl0 device only, any netmask (ie you need tunl0 on LVS-Tun with 2.2.x kernels)

LVS-DR:  lo:0 device netmask /32 only
         all other devices any netmask

For LVS-DR then on solaris/DEC/HP/NT... LVS can probably use a regular eth0 device rather than an lo:0 device (more work for Ratz to do :-).

Does anyone know why the lo:0 device has to be /32 for LVS-DR on kernel 2.2.13 while the other devices can be /24?

Jean-Francois Nadeau jna (at) microflex (dot) ca 6 Dec 99

In kernel 2.2.1x, with a virtual interface on lo:0 and a netmask of 255.255.255.0, the interface no longer arps.

Horms 29 Oct 2003 (4yrs later, presumably referring to the 2.4 kernels)

/sbin/ifconfig lo:110 192.168.1.110 broadcast 192.168.1.110 netmask 255.255.255.255

brings up lo:110 (a virtual interface on the loopback device) for 192.168.1.110 with the broadcast and netmask as specified. If you are using LVS-DR then the packets that arrive on the real servers have the destination IP address set to the VIP. So the real servers need some way of accepting this traffic as local. One way is to add an interface on the loopback device and hide it so it won't answer ARP requests. The netmask has to be 255.255.255.255 because the loopback interface will answer packets for _all_ hosts on any configured interface. So 192.168.1.110 with netmask of 255.255.255.0 will cause the machine to accept packets for _all_ addresses in the range 192.168.1.0 - 192.168.1.255, which is probably not what you want.
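
To make the netmask point concrete, the range of addresses a loopback alias would accept as local can be worked out from the address and netmask. The following is a small sketch using shell arithmetic only; the function names are made up for this example and nothing here touches the network configuration.

```shell
# Convert a dotted quad to an integer. The body runs in a subshell so
# that the IFS change does not leak into the caller.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Convert an integer back to a dotted quad.
int_to_ip() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# local_range ADDRESS NETMASK
# Print the first and last address covered by ADDRESS/NETMASK.
local_range() {
    ip=$(ip_to_int "$1")
    mask=$(ip_to_int "$2")
    first=$(( $ip & $mask ))
    last=$(( $first + 4294967295 - $mask ))
    echo "$(int_to_ip "$first") - $(int_to_ip "$last")"
}

local_range 192.168.1.110 255.255.255.255   # the usual LVS setting
local_range 192.168.1.110 255.255.255.0     # a /24 on the loopback alias
```

The first call covers only the VIP itself; the second covers 192.168.1.0 - 192.168.1.255, which is the whole range the text above warns about.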

Does anyone know why only the tunl0 device works for LVS-Tun on 2.2.x kernels?

Experiment 4: Effect of route entry for VIP and connection to VIP. The VIP normally has an entry in the routing table eg

route add -host 192.168.1.110 $DEVICE

I found in Experiment 2 that a route entry was not necessary for the LVS to work when the realserver had the VIP on eth0:1. Since I had always used a route entry for the VIP I wanted to find out when it was needed. The same LVS was used as for Experiment 3. The variables were

1) a route entry/no route entry for VIP/32
2) for eth1 whether the NIC was connected to the network by a cable.

kernel            ------2.0.36-------     -------2.2.13-------
VIP               eth1 eth1_nc eth0:1     eth1  eth1_nc eth0:1

no route
   LVS             +     +      +          +      +       +
   ping internal   -     -      -          +      +       +
   ping external   +     -      +          +      +       +

route
   LVS             +     +      +          +      +       +
   ping internal   +     +      +          +      +       +
   ping external   +     -      +          +      +       +

Conclusion 1: LVS works for both cases of route/no_route for the VIP for eth0:1 and eth1 (ie you don't need a route entry for the VIP on the realservers).

Conclusion 2: having a network cable/no network cable does not affect whether the LVS works.

Conclusion 3: for 2.0.36 kernels you can choose to have the VIP pingable from the outside world but not pingable by the local host by having it on eth1 with a cable connection (this seems weird and I can't think of any use for it just yet) or the reverse - pingable from the localhost but not by the external world by not having a cable connection.

Note
Using a host's routable IP as the target (the IP on eth0, say) you can make a host unpingable from the console if you down the lo device. The host is still pingable from elsewhere on the net.

5.20. Topologies for LVS-DR and LVS-Tun LVS's

5.20.1. Traditional

The conventional LVS-DR/VS-Tun topology which allows maximum scalability has each realserver with its own default gateway (to a router). (In a routerless test setup, the client would be the default gateway for the realservers. In a setup which is not network bound, i.e. is disk- or compute-bound, only one router may be needed. The changes in topology/routing are made by changing the IP of the default gw for the realservers)

Some method of handling the arp problem is needed here.

The packets sent to the realservers from the director generate replies which go directly to the client. Failure messages (eg if a realserver is not available) do not get returned to the director, which cannot tell if a realserver has failed (see discussion of monitoring agents).

                       -------------clients-----------------------
                       |                         |       |       |
                    (router)                  (router)(router)(router)
                       |                         |       |       |
         __________    |                         |       |       |
        |          |   |    VIP                  |       |       |
        | director |---     DIP                  |       |       |
        |__________|   |                         |       |       |
                       |                         |       |       |
                       |                         |       |       |
        ---------------------------------        |       |       |
        |              |                |        |       |       |
        |              |                |        |       |       |
       RIP1           RIP2             RIP3      |       |       |
       VIP            VIP              VIP       |       |       |
 _____________   _____________   _____________   |       |       |
|             | |             | |             |  |       |       |
| realserver  | | realserver  | | realserver  |  |       |       |
|_____________| |_____________| |_____________|  |       |       |
        |              |                |        |       |       |
        |              |                ----------       |       |
        |              -----------------------------------       |
        ----------------------------------------------------------

5.20.2. Director sees replies

(from Julian Anastasov)

Note
This discussion led to Julian's martian modification.

If the default gw for each realserver is changed to the DIP (see the Martian modification section) then

  • The director has to handle the reply packets as well as the incoming packets, doubling the network load.
  • The director sees all the reply packets. Connection failure can be detected (in principle).

                        clients
                           |
                         router
                           |
             __________    |
            |          |   |    VIP
            | director |---     DIP
            |__________|   |
                           |
                           |
          ------------------------------------
          |                |                 |
          |                |                 |
         RIP1             RIP2              RIP3
         VIP              VIP               VIP
   _____________     _____________     _____________
  |             |   |             |   |             |
  | realserver  |   | realserver  |   | realserver  |
  |_____________|   |_____________|   |_____________|

Here's the original posting by Horms horms (at) vergenet (dot) net

Hi, I have been setting up a test network to benchmark IPVS, the topology is as follows.

       node-1      node-6     node-7
       (client)   (client)   (client)
           |         |          |        client-net
  ---------+---------+----------+------ 192.168.2.0/24
                     |
                   node-3 (router)
                     |                   server-net
      ------+--------+----------+---     192.168.1.0/24
            |        |          |
         node-2    node-4     node-5
         (IPVS)   (server)   (server)

The question that I have is that the network I would really like to be testing is;

       node-1       node-6     node-7
       (client)   (client)   (client)
           |         |          |        client-net
  ---------+---------+----------+------ 192.168.2.0/24
                     |
                   node-2 (IPVS)
                     |                   server-net
      ---------+-----+----+---------     192.168.1.0/24
               |          |
             node-4     node-5
            (server)   (server)

.. other than using NAT, which has performance problems, is this possible? I tried this topology with direct routing and packets from the clients were multiplexed to the servers fine, but return packets from the servers to the client were not routed by the IPVS box.

Lars

Yes. The LVS box silently drops the return packets, since they have a src ip which is also bound as a local interface on the LVS. This is meant to be a simple anti-spoofing protection.

from Joe:

Note
The return packet from the realserver has src=VIP, dest=CIP. If this packet is routed via the director, which also has the VIP, the director will be receiving a packet from another machine with the src being one of its own IPs, and the director will drop the packet.

You can enable logging these packets via

echo 1 >/proc/sys/net/ipv4/conf/all/log_martians

The only way around this with current Linux kernels is to disable the check in the kernel source or to use a separate box as the outward gateway (which is how DR is meant to be used for full performance). This is not a problem as such, as it probably makes a lot of sense not to use an IPVS box as your gateway router. Actually it makes a lot of sense to do just that IMHO. Less points of failure, less hard- and software to duplicate in a failover configuration.
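
Before concluding that martians are the cause of lost replies, it helps to confirm that logging is actually switched on. The sketch below only reads the log_martians file named above and reports its state; the function name is made up for this example.

```shell
# Sketch: report whether martian logging is switched on.
LOG_MARTIANS=${LOG_MARTIANS:-/proc/sys/net/ipv4/conf/all/log_martians}

# Translate the value stored in the proc file into a word.
martian_state() {
    case "$1" in
        1) echo enabled ;;
        0) echo disabled ;;
        *) echo unknown ;;
    esac
}

if [ -r "$LOG_MARTIANS" ]; then
    martian_state "$(cat "$LOG_MARTIANS")"
else
    echo "no $LOG_MARTIANS on this machine"
fi
```

With logging enabled, packets dropped by the director for this reason should show up in the kernel log (for instance via dmesg), which would have saved the several man days mentioned below.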

Ray Bellis rpb (at) community (dot) net (dot) uk

It needs to be made more explicit in the documentation that LVS-DR will only work if you have a different return path.

Lars Marowsky-Bree lmb (at) teuto (dot) net

... or if you have a suitably patched kernel.

We spent several man days trying to get this to work before figuring out why the packets were being dropped, at which point we had no alternative but to use LVS-NAT instead.

I agree. We still assume too much knowledge on the network admin side.

FYI, we have our LVS system working now, with LVS redundancy achieved by running OSPF routing (gated) on the LVS-NAT servers and having the VIP within the same IP subnet as the RIPs so that IGP routing policies automatically determine which LVS router the packets arrive on.

Yes, that's one option. Even better than heartbeat and IPAT, if all your systems support running a routing protocol. (IPAT = IP address takeover, part of heartbeat) In essence, heartbeat and IPAT is nothing but reinventing a subset of the functionality of a hardened routing protocol like OSPF/RIPv2/EIGRP.

5.20.3. On other schemes for director/realservers to exchange roles

Julian Anastasov uli (at) linux (dot) tu-varna (dot) acad (dot) bg has pointed out on the mailing list that the prototype LVS can be redrawn as

                        ________
                       |        |
                       | client |
                       |________|
			   |
                           |
                        (router)
                           |
			   |
         ------------------------------------
         |                 |                |
         |                 |                |
      DIP, VIP         RIP1, VIP        RIP2, VIP
    ____________    ______________    ______________
   |            |  |              |  |              |
   |  director  |  | realserver1  |  | realserver2  |
   |____________|  |______________|  |______________|

and that any realserver is in a position to replace a failed director. No-one has bothered to write the code for this. It seems it's easier to have extra boxes in the director role (ready for failover) and others in the realserver role. It's easier to wheel in another box for a spare director than to configure realservers to do two jobs reliably.

Julian

The director and the backup are in a shared network for incoming traffic. The backup sniffs the packets and changes its connection state the same way as the director does (the director sees only the client-to-server half of the connection in LVS/TUN and LVS/DR), then drops the packets. It needs some investigation and probably lots of additional code too. ;-)

Wensong Zhang wensong (at) iinchina (dot) net

I don't even think so - the main trick is getting the kernel to sniff the packets, which is probably quite easy with a little messing around. Not sending the packets out again (which would confuse the realservers) is easy with a ipchains output rule which silently drops them.

This doesn't work with a switch though, you need a shared network like a hub.

However, I have been talking with Rusty about this. The problem is more general - HA shared-state firewalls are asked for all the time, so we want to do a generic thing for everything which builds upon Netfilter's state machine. This would not only cover LVS, but also masquerading and packet filtering in general. We intend to discuss this in greater detail at the Ottawa Linux Symposium latest.

Julian

You can see, the connections depend on the initial status and the realservers' realtime status. So another method is that when the director is down, the backup server sets up ipvs with the connections, but it seems too late. What do you think about this?

Wensong

TCP/IP should be able to cope with a few seconds delay and lost packets. You want to heartbeat once per second and take over after 3-4s though - this usually means takeover is complete in <10s, which TCP/IP should swallow.
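
The timing above can be illustrated with a toy dead-peer check. This is only a sketch of the arithmetic, not how heartbeat is implemented; DEADTIME and the function name are hypothetical, and times are plain seconds.

```shell
# Toy dead-peer check; DEADTIME is a hypothetical setting, in seconds.
DEADTIME=${DEADTIME:-4}

# peer_is_dead LAST_BEAT NOW [DEADTIME]
# Succeeds (exit 0) once no heartbeat has been seen for DEADTIME seconds.
peer_is_dead() {
    [ $(( $2 - $1 )) -ge "${3:-$DEADTIME}" ]
}

# With one heartbeat per second, a peer last heard at t=100 is still
# considered alive at t=103 and is declared dead at t=104.
for now in 103 104; do
    if peer_is_dead 100 "$now"; then
        echo "t=$now: take over the VIP"
    else
        echo "t=$now: peer still alive"
    fi
done
```

The takeover itself (configuring the VIP and sending the gratuitous arp) then has the remaining few seconds of the 10s budget.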

5.21. Why do all devices broadcast the arp replies

John Reuning (10 Apr 2003)

Why are arp replies sent for all interfaces, regardless of which interface receives the arp request?

Julian

Because Linux routing agrees that all these senders have access to this IP, so we give them access to valid link layer address. This behavior is usually observed on routers configured without source address validation enabled. As this is the default behavior specified in RFC1812 (rp_filter=0), Linux simply allows access to this IP on any interface.

arp is part of the transition from network layer to link layer, right? So why should an alias on lo, an interface that doesn't really generate network frames, trigger an arp reply? Do other unix tcp/ip

Note that these packets are not passed via the lo interface. Also, we do not send ARP replies via lo, so why should we care about lo's NOARP flag?

I can't seem to make a solaris 7 system generate arp replies for an lo alias.

Different systems have different policies for IP addresses configured on the loopback device. Note that in Linux, this behavior has nothing to do with the lo interface: you can configure an IP on eth1 and still see our ARP reply for it on eth0.

5.22. A discussion about the arp problem

(Joe and Julian)

Julian Anastasov uli (at) linux (dot) tu-varna (dot) acad (dot) bg

There is no difference between devices in 2.2.x, all devices are reported in the ARP replies: lo, tunl and dummy. This can be tested using this configuration with any device:


Host A:
        eth:x 192.168.0.1

Host B:
        eth:x 192.168.0.2
        lo, dummy, tunl: 192.168.0.3

On host A try: ping 192.168.0.3

Host B replies for 192.168.0.3 through 192.168.0.2 device
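
One way to see this from host A is to compare the hardware addresses it has learned for the two IPs. The sketch below assumes host A runs Linux and reads its /proc/net/arp table (columns: IP address, HW type, Flags, HW address, Mask, Device); the function names are made up for this example.

```shell
# Sketch: compare the MAC addresses host A has learned for two IPs.
ARP_TABLE=${ARP_TABLE:-/proc/net/arp}

# mac_of IP [TABLE] - print the hardware address recorded for IP.
mac_of() {
    awk -v ip="$1" '$1 == ip { print $4 }' "${2:-$ARP_TABLE}"
}

# same_host IP1 IP2 [TABLE] - succeed if both IPs resolve to one MAC.
same_host() {
    m1=$(mac_of "$1" "${3:-$ARP_TABLE}")
    m2=$(mac_of "$2" "${3:-$ARP_TABLE}")
    [ -n "$m1" ] && [ "$m1" = "$m2" ]
}

# On host A, after pinging both addresses:
#   same_host 192.168.0.2 192.168.0.3 && echo "host B answered for both"
```

If the two entries carry the same hardware address, host B's ethernet device has answered for the address configured on its lo, dummy or tunl device.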

The ARP problem means: "All local interfaces are reported" until the ARP patch is used. In fact, all ARP patches which use IFF_NOARP to hide the interface are incorrect. I don't expect them in the kernel.

ARP problem, some rules:

ARP responses

  • all local IP addresses are answered for: lo, eth, tunl*, dummy*, with some exceptions (see the next rules)
  • 127.0.0.0/8 (LOOPBACK) and 224.0.0.0/4 (MULTICAST) are not answered for
  • there is one exception for the "lo" interface: it is possible for the kernel to ignore the ARP request if the source IP is from the same net as the net used to configure the "lo" alias. The specified network is treated as local.

For example:

realserver# ifconfig lo:0 192.168.1.1 netmask 255.255.255.0 broadcast 192.168.1.255 up

"real" treats all packets with source addr from 192.168.1.0/24 which come from the other devices (eth0) as invalid, i.e. source address validation works in this case and the ARP requests are not answered. The kernel thinks: "The incoming packet arrived with saddr=local_IP1 and daddr=local_IP2(VIP), so it is invalid". This way a host on the LAN can't talk to the real server if its lo alias is configured with netmask != 255.255.255.255.

        ifconfig dummy0 192.168.1.1 netmask 255.255.255.255

registers only 192.168.1.1 as local ip but:

        ifconfig lo:0 192.168.1.1 netmask 255.255.255.0

all 256 IPs are local. All IFF_LOOPBACK devices treat all IPs as local according to the used netmask.
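
The source validation rule can be modelled in a few lines. This is a toy model of the behaviour described here, not the kernel's code: it only checks whether the requester's address falls inside the network configured on the lo alias. The function names are made up for this example.

```shell
# Convert a dotted quad to an integer (subshell keeps the IFS change local).
ip_to_int() ( IFS=.; set -- $1; echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 )) )

# arp_verdict SRC LO_IP LO_MASK
# Prints "dropped" when the requester's source address falls inside the
# network configured on the lo alias (so it looks like a local address),
# otherwise "answered".
arp_verdict() {
    src=$(ip_to_int "$1")
    lo=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    if [ $(( $src & $mask )) -eq $(( $lo & $mask )) ]; then
        echo dropped
    else
        echo answered
    fi
}

arp_verdict 192.168.1.2 192.168.1.1 255.255.255.0     # lo:0 configured as /24
arp_verdict 192.168.1.2 192.168.1.1 255.255.255.255   # lo:0 configured as /32
```

With the /24 alias the request from 192.168.1.2 looks like it comes from a local address and is dropped; with the /32 alias only 192.168.1.1 itself is local, so the request passes validation.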

Joe

I assume IFF_LOOPBACK devices are lo, lo:0..n?

Yes, currently only lo is marked as loopback. It is used to mark whole subnets as local.

lo:0 is not marked as loopback?

lo:0 is just an IP address attached to the same device "lo". You can try "ifconfig lo:0 192.168.0.1 netmask 255.255.255.255" and display the interfaces using "ifconfig". There is a LOOPBACK flag for lo:0 which is inherited from the device "lo". In Linux 2.2 all aliases inherit the device flags. Only the IFF_UP flag is used to add/delete the aliases.

Joe

Assume LVS-DR with VIP, RIPs all on the same /24 network on eth0 devices, realservers all have lo:0 with VIP/24 and have the standard 2.2.x kernel (no patches to hide interfaces). Router says "who has VIP", the arp request arrives at the realservers via eth0. Device lo:0 finds arp request which arrived on eth0 from router is on the same subnet as lo:0 and does not reply to the arp request.

Before deciding whether to answer the ARP request, the routing tables are checked, i.e. source validation of the packet is performed. If 192.168.0.2 asks "who-has 192.168.1.1 tell 192.168.1.2" the real server assumes that this is an invalid packet, i.e. from one local IP to another local IP (from me to me => drop).

Joe

I notice that with the 2.2.x kernel, that lo:0 has to have netmask=255.255.255.255 to work, whereas with the 2.0.x kernels (where lo:0 doesn't reply to arp requests), that lo:0 can have the VIP on a 255.255.255.0 netmask and still work.

The rule is to use netmask 255.255.255.255 and to hide lo. ARP works in a different way in 2.2. It looks in the "local" table to validate the source of the ARP request and after that it looks up the same table to check whether the daddr of the ARP request is a local ip.

ARP requests: - all local addresses can be used by the kernel to announce them as the source for the ARP request.

is it OK to say

the kernel can (does?) use all local addresses as the source of ARP requests

It can and does. The real server thinks that it can use any local ip address as saddr in the ARP request and the answer will be returned if this ip is unique in the LAN.

Joe

do you mean "the realserver will receive a reply if the s_addr is unique in the LAN"?

The real server will receive an answer if it uses the RIP as saddr in the ARP request, because the VIP (HIP) is hidden, or when using transparent proxy, because the VIP is not local. The real server must know how to ask (using a unique IP) or the traffic for the asked IP (ROUTER) will be blocked.

But the hidden addresses are not used because they are not unique (2.2.14) and the answer will be returned to the Director.

Joe

do you mean "the non-hidden VIP on the director"?

Yes, when the real server ask "who-has ROUTER tell VIP" the ARP reply is received in the Director and the transmission in the real servers is stopped. The ROUTER sends everything destined to VIP to the Director. This is true for all clients on the LAN too if they are not in this cluster (if they don't handle packets for VIP).

Joe

I would have thought that the main device on each NIC, eg eth0, eth1 would have been used as the source address.

No, it is extracted from the outgoing datagram and if saddr is local ip it is used. But if this is not local ip, i.e. when using transparent proxy or the address is marked as hidden the main device ip is used.

Joe

how is arping part of transparent proxy?

It is not. When VIP is not local IP address in the real server this IP is not used from the ARP code. It is not in the "local" table. But TCP, UDP and ICMP use it via transparent proxy support.

They are extracted from the outgoing packet.

Joe

what is "They"? the source addresses? When you say "extracted", do you mean "removed from packet" or "looked at/detected"

The saddr from the data packet is used to build the ARP request.

We tell the kernel that these addresses are not unique by setting <interface>/hidden=1 (starting with kernel 2.2.14). This way the kernel selects the device's primary IP as the source of the ARP request.

Joe

the kernel can use any local address as s_addr but the code for hiding IPs from arp requests prevents the kernel from using hidden addresses as s_addr in an arp request?

Yes, the code to hide the addresses is already part of the source address autoselection (saddr in the ARP request in our case). We never autoselect hidden addresses, i.e. if the source address is not specified from the higher level. The code to hide interface:

- ignores ARP replies for hidden local addresses
- doesn't select hidden local addresses as source of the ARP request
- doesn't autoselect hidden local addresses for the IP level

Joe

When you say "We expect it is uniq in the LAN" do you mean - we expect you've set up your network properly and that you don't have the same RIP on 2 realservers? :-)

The LVS administrator must ensure that the RIPs are unique; only the VIP is shared. We tell the kernel that the VIP addresses are not unique by setting <interface>/hidden=1 (2.2.14). This way the kernel selects the device's primary IP as the source of the ARP request. We expect it to be unique in the LAN.

So, the recommendation for using the "lo" interface in the real servers is:

- use netmask 255.255.255.255 when configuring the lo alias. This way source validation doesn't drop the incoming packets to this IP. LVS users usually define the net route through the eth interface, so we can talk to other hosts from this network, for example to send the packets to the client through the default gateway. There is no need to configure the alias with mask != 255.255.255.255

So, the interfaces which can be used in the real servers to listen for VIP are:

- lo aliases with netmask 255.255.255.255
- tunl*
- dummy*

All these devices must be marked as hidden to solve the ARP problem when using Linux 2.2.
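
Put together, the recommendation for a realserver holding the VIP on a lo alias might look like the sketch below. It assumes a kernel that has the hidden flag (2.2.14 and later 2.2.x kernels according to the text above); the VIP is a placeholder and the function name is made up for this example. The commands are printed rather than executed, so the sketch can be tried on any machine.

```shell
# Sketch for a realserver whose kernel has the "hidden" flag.
# VIP is a placeholder; the commands are printed, not executed.
VIP=${VIP:-192.168.1.110}

print_hidden_setup() {
    vip=$1
    cat <<EOF
echo 1 > /proc/sys/net/ipv4/conf/all/hidden
echo 1 > /proc/sys/net/ipv4/conf/lo/hidden
ifconfig lo:0 $vip netmask 255.255.255.255 broadcast $vip up
EOF
}

print_hidden_setup "$VIP"
```

The all/hidden and lo/hidden settings are written before the alias is brought up so that the VIP is hidden from the moment it is configured.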

In the Director: there is no problem to configure the VIP even on lo alias or dummy interface. If the interface is not marked as hidden this VIP is visible for all hosts on the LAN.

5.23. ATM/ethernet and router problems

LVS has only been tested on ethernet. One person had an ATM setup which didn't work with LVS-DR as the ATM router expects packets from the VIP to have the same MAC address (in LVS-DR packets coming from the VIP could have the MAC address of any of the realservers). Apparently this is not easily fixable in the ATM world. It should be possible to use Julian's martian modification to make LVS-DR work on ATM, but the person with the ATM setup disappeared off the mailing list without us convincing him of the joy in having the first ATM LVS.

Other people have found similar problems with ethernet -

Kyle Sparger ksparger (at) dialtoneinternet (dot) net

I don't know if someone has gone over this, but here's a consideration I've come across when setting up LVS in DR mode:

When the real servers reply, cisco routers (ours do, at least) will pick up on the fact that it's replying from a different MAC address, and will start arping soon thereafter. This is sub-optimal, as it causes a constant flood of arp requests on the network. Our solution has been to hardcode the MAC address into the router, but this can cause other issues, for example during failover. That can be worked around, as you can set the MAC address on most cards, but that in itself may cause other issues.

Has anyone else experienced this? Has anyone else come up with a better solution than hardcoding it into the router?

It should be possible to have the reply packets from the VIP come from a virtual MAC address (such as created by vrrpd), in which case all replies coming to the same port in a router from the VIP will have the same MAC address. No-one seems to be interested in writing the code to do this.

5.24. Same IP on multiple NICs

Bonnet Sebastien (dot) Bonnet (at) experian (dot) fr 2002-04-16

I'm setting up with LVS-DR. To allow a node to be both a realserver and a backup director, I have eth0:2 being the VIP, because at this point, "backup-and-node" is the director. But when it's not, I still need VIP to be setup on lo:1 to use "backup-and-node" as a realserver. I end up with the following config :

[root@backup-and-node root]# cat /proc/sys/net/ipv4/conf/all/hidden
1
[root@backup-and-node root]# cat /proc/sys/net/ipv4/conf/lo/hidden
1
[root@backup-and-node root]# cat /proc/sys/net/ipv4/conf/eth0/hidden
0

[root@backup-and-node root]# ifconfig
eth0    Link encap:Ethernet  HWaddr 00:40:05:5C:C2:04
        inet addr:172.22.48.208  Bcast:172.22.63.255  Mask:255.255.240.0
        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth0:2  Link encap:Ethernet  HWaddr 00:40:05:5C:C2:04
        inet addr:172.22.48.212  Bcast:172.22.63.255  Mask:255.255.240.0
        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

lo      Link encap:Local Loopback
        inet addr:127.0.0.1  Mask:255.0.0.0
        UP LOOPBACK RUNNING  MTU:16436  Metric:1

lo:1    Link encap:Local Loopback
        inet addr:172.22.48.212  Mask:255.255.255.255
        UP LOOPBACK RUNNING  MTU:16436  Metric:1

The problem is that when VIP is setup on both lo:1 and eth0:2, "backup-and-node" will not answer *any* ARP request for VIP, whereas it should via eth0 (as far as I understand the purpose of the hidden feature).

Julian

The problem is that this setup is ambiguous. The kernel doesn't know what device you are using for primary and for secondary IPs. Device lo is a valid device for primary IPs. It is not allowed to define one IP both as primary and secondary one. Yes, lo is first in the device list and we search for hidden IP in _any_ device. We don't have a preferred device to start from. Yes, this is a limitation that nobody wants to fix. Someone will have to persuade me with a clear fix for this.
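
A quick way to spot this ambiguous configuration is to look for addresses that appear on more than one device. The sketch below assumes its input is one "device address" pair per line; that format is an assumption made for this example, not the output of any particular tool, and the function name is made up.

```shell
# Sketch: flag addresses configured on more than one device.
# Input (stdin): one "device address" pair per line (assumed format).
dup_addrs() {
    awk '{
        count[$2]++
        devs[$2] = (count[$2] > 1) ? devs[$2] " " $1 : $1
    }
    END {
        for (ip in count)
            if (count[ip] > 1)
                print ip ": " devs[ip]
    }'
}

# The configuration from the posting above:
dup_addrs <<EOF
eth0 172.22.48.208
eth0:2 172.22.48.212
lo 127.0.0.1
lo:1 172.22.48.212
EOF
```

For the posted configuration this reports 172.22.48.212 on both eth0:2 and lo:1, which is the combination for which ARP requests for the VIP go unanswered.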

Joe

I'm surprised you're allowed to have the same IP on two different devices. Is there a reason why you'd want to do this or is it just not forbidden and therefore allowed? (I believe this is called the American philosophy).

Horms

It is actually something you may want to do. Imagine you have a dialup server, 192.168.0.1, which sits on the 192.168.0.0/24 network. Now each dialup user is going to get their own ip address, but 192.168.0.0/24 is your server network, so these ip addresses are on a different network, let's say 10.0.7.0/24. Now when the dialup users come in, there is no need for the dialup-server to have an address on the 10.0.7.0/24 network, it is just a point to point link, so you can have for instance:

[client]<-------->[dialup-server]
10.0.7.7          192.168.0.1
ppp0              ppp0

But the dialup-server already has 192.168.0.1 on eth0. Thus you have the same IP address on multiple interfaces. In fact it would have the same IP address on eth0 and each of the ppp interfaces.