HA Troubleshooting

Introduction

This document explains how to interpret HA logs and troubleshoot a potentially defective cluster. See the HA setup guide on this site for installation instructions, as well as upgrade and reboot procedures.

HA WAN IP Address Scenarios

Master and Slave with Public IP Addresses

At least three public IP addresses are required. Both the master and the slave unit can be reached directly via the Internet.

Master and Slave with Private IP Addresses

Only one public IP address is required. Only the master unit can be reached directly via the Internet.

Processes and Logs

DRBD Process

root      5201     2  0 Jul12 ?        00:00:00 [drbd-reissue]
root      5231     2  0 Jul12 ?        00:00:00 [drbd_submit]
root      5253     2  0 Jul12 ?        00:02:57 [drbd_w_drbd0]
root      6014     2  0 Jul12 ?        00:02:57 [drbd_r_drbd0]
root      6068     2  0 Jul12 ?        00:07:52 [drbd_a_drbd0]

Heartbeat Process

root      5389     1  0 Jul12 ?        00:02:03 heartbeat: heartbeat: master control process
nobody    5392  5389  0 Jul12 ?        00:00:00 heartbeat: heartbeat: FIFO reader
nobody    5393  5389  0 Jul12 ?        00:00:17 heartbeat: heartbeat: write: ucast eth0
nobody    5394  5389  0 Jul12 ?        00:00:20 heartbeat: heartbeat: read: ucast eth0
nobody    5395  5389  0 Jul12 ?        00:00:14 heartbeat: heartbeat: write: ucast eth4
nobody    5396  5389  0 Jul12 ?        00:00:13 heartbeat: heartbeat: read: ucast eth4

Important

Exactly the same processes must be running on both the master and the slave unit.
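
One way to check this requirement is to capture the output of `ps -ef` on each unit and compare the DRBD and heartbeat entries. The following is a minimal sketch, not an AXS Guard tool: the function names `ha_processes` and `compare` are hypothetical, and it assumes the `ps -ef` column layout shown in the listings above.

```python
def ha_processes(ps_output):
    """Return the set of DRBD/heartbeat command strings found in `ps -ef` output."""
    found = set()
    for line in ps_output.splitlines():
        # Assumed columns: UID PID PPID C STIME TTY TIME CMD
        parts = line.split(None, 7)
        if len(parts) < 8:
            continue  # skip blank or truncated lines
        cmd = parts[7].strip()
        if "drbd" in cmd or "heartbeat" in cmd:
            found.add(cmd)
    return found


def compare(master_ps, slave_ps):
    """Return (missing_on_slave, missing_on_master) as sorted lists."""
    master = ha_processes(master_ps)
    slave = ha_processes(slave_ps)
    return sorted(master - slave), sorted(slave - master)
```

If both returned lists are empty, the two units run the same set of DRBD and heartbeat processes. Process IDs and CPU times are deliberately ignored, since they naturally differ between units.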

DRBD Sync Logs

Except for some daily heartbeat statistics, the DRBD system will not log anything during normal operations.

Failover Logs

The failover process can be followed under High Availability > Logs > Failover.

Important

Never interrupt a slave system that is in the process of taking over the master services. Doing so might lead to a split-brain situation.

Log example:

16:00:20 -----------------------------------------------------------
16:00:58 Starting the AXS GUARD system
16:02:00 Starting high-availability services
16:02:01 First we check if the slave is executing an event.
16:02:01 Slave is not executing an event.
16:02:01 Starting service drbd (disk replication)
16:02:01 Drbd up
16:02:01 Drbd check for sync
16:02:02 Drbd decided this system is becoming the sync target
16:02:14 Starting service heartbeat ...
16:02:16 Heartbeat is running
16:02:16 Heartbeat wait until decided who is master or slave ...
16:02:16 Heartbeat other node ha event executing
16:04:44 Heartbeat other node ha event done
16:04:44 Heartbeat generated event master start  
16:04:45 Heartbeat decided this is the running master
16:04:45 Drbd wait until decided who is primary or secondary ...
16:04:45 Drbd trying to contact the other node
16:04:45 Drbd info is cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
16:04:45 Drbd decided this is the primary, drbd state is Secondary/Secondary
16:04:45 Drbd check if we can become the primary
16:04:45 Drbd info is cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
16:04:45 Drbd info is cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
16:04:45 Drbd primary
16:04:45 Drbd info is cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
16:04:45 Drbd file system check of the replicated volume
16:04:45 Drbd checking and repairing replication volume...
16:04:45 Drbd info is cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
16:04:45 Drbd trying to unmount /dev/mapper/lv-host--data
16:04:45 Drbd unmount of /dev/mapper/lv-host--data succeeded
16:04:45 Drbd mount the replication volume
16:04:45 Sync the configuration parameters
16:05:37 Back in sync with the master
16:05:37 I take the virtual ip(s)
16:06:11 High-availability services up and running
16:06:12 Start master services
16:08:58 Master services started
16:09:15 -----------------------------------------------------------
16:13:01 Heartbeat detected that the configured slave internet is down
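
The `Drbd info is ...` lines in the log above carry the connection state (`cs`), the roles of both nodes (`ro`) and their disk states (`ds`). The sketch below shows one way to split such a line into its fields. The names `parse_drbd_info` and `looks_in_sync` are hypothetical, and treating `cs:Connected` with `ds:UpToDate/UpToDate` as the in-sync state is an assumption based on the example log, not an official AXS Guard check.

```python
import re

# Matches e.g. "cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate"
INFO_RE = re.compile(
    r"cs:(?P<cs>\S+)\s+"
    r"ro:(?P<ro_self>[^/\s]+)/(?P<ro_peer>\S+)\s+"
    r"ds:(?P<ds_self>[^/\s]+)/(?P<ds_peer>\S+)"
)


def parse_drbd_info(line):
    """Return the cs/ro/ds fields of a DRBD info line as a dict, or None."""
    match = INFO_RE.search(line)
    return match.groupdict() if match else None


def looks_in_sync(info):
    """Assumed health check: connected and both disks up to date."""
    return (
        info["cs"] == "Connected"
        and info["ds_self"] == "UpToDate"
        and info["ds_peer"] == "UpToDate"
    )
```

In each pair, the first value describes the local node and the second describes the peer, so `ro:Primary/Secondary` on the master means the local unit holds the primary role.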

Split Brain Log Example

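A split brain is a situation in which the data on the master and the slave has diverged and cannot be reconciled. In the log below, the line marked with an arrow shows DRBD detecting this condition and dropping the connection.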

11:48:23 kernel d-con drbd0: Handshake successful: Agreed network protocol version 101
11:48:23 kernel d-con drbd0: Peer authenticated using 20 bytes HMAC
11:48:23 kernel d-con drbd0: conn( WFConnection -> WFReportParams )
11:48:23 kernel d-con drbd0: Starting asender thread (from drbd_r_drbd0 [7708])
11:48:23 kernel block drbd0: drbd_sync_handshake:
11:48:23 kernel block drbd0: self 3A02D9C0EDE50779:28A8902C96588045:14049F78F25841D9:14039F78F25841D9 bits:859 flags:0
11:48:23 kernel block drbd0: peer 65B4055A401632C7:28A8902C96588044:14049F78F25841D8:14039F78F25841D9 bits:267 flags:0
11:48:23 kernel block drbd0: uuid_compare()=100 by rule 90
11:48:23 kernel block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
11:48:23 kernel block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
11:48:23 kernel block drbd0: Split-Brain detected but unresolved, dropping connection!                              <<<<----------
11:48:23 kernel block drbd0: helper command: /sbin/drbdadm split-brain minor-0
11:48:23 kernel block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
11:48:23 kernel d-con drbd0: conn( WFReportParams -> Disconnecting )
11:48:23 kernel d-con drbd0: error receiving ReportState, e: -5 l: 0!
11:48:23 kernel d-con drbd0: asender terminated
11:48:23 kernel d-con drbd0: Terminating drbd_a_drbd0
11:48:23 kernel d-con drbd0: Connection closed
11:48:23 kernel d-con drbd0: conn( Disconnecting -> StandAlone )
11:48:23 kernel d-con drbd0: receiver terminated
11:48:23 kernel d-con drbd0: Terminating drbd_r_drbd0
11:48:28 heartbeat[5340] info: Link axsguard-slave.vzwstijn.be:eth0 dead.
11:48:34 heartbeat[5340] info: Link axsguard-slave.vzwstijn.be:eth0 up.
11:48:35 kernel d-con drbd0: conn( StandAlone -> Unconnected )
11:48:35 kernel d-con drbd0: Starting receiver thread (from drbd_w_drbd0 [5190])
11:48:35 kernel d-con drbd0: receiver (re)started
11:48:35 kernel d-con drbd0: conn( Unconnected -> WFConnection )
11:48:36 kernel d-con drbd0: Handshake successful: Agreed network protocol version 101
11:48:36 kernel d-con drbd0: Peer authenticated using 20 bytes HMAC
11:48:36 kernel d-con drbd0: conn( WFConnection -> WFReportParams )
11:48:36 kernel d-con drbd0: Starting asender thread (from drbd_r_drbd0 [16355])
11:48:36 kernel block drbd0: drbd_sync_handshake:
11:48:36 kernel block drbd0: self 3A02D9C0EDE50779:28A8902C96588045:14049F78F25841D9:14039F78F25841D9 bits:4122 flags:0
11:48:36 kernel block drbd0: peer 65B4055A401632C7:28A8902C96588044:14049F78F25841D8:14039F78F25841D9 bits:390 flags:0
11:48:36 kernel block drbd0: uuid_compare()=100 by rule 90
11:48:36 kernel block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
11:48:36 kernel block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
11:48:36 kernel block drbd0: Split-Brain detected but unresolved, dropping connection!
11:48:36 kernel block drbd0: helper command: /sbin/drbdadm split-brain minor-0
11:48:36 kernel block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
11:48:36 kernel d-con drbd0: conn( WFReportParams -> Disconnecting )
11:48:36 kernel d-con drbd0: error receiving ReportState, e: -5 l: 0!
11:48:36 kernel d-con drbd0: asender terminated
11:48:36 kernel d-con drbd0: Terminating drbd_a_drbd0
11:48:36 kernel d-con drbd0: Connection closed
11:48:36 kernel d-con drbd0: conn( Disconnecting -> StandAlone )
11:48:36 kernel d-con drbd0: receiver terminated
11:48:36 kernel d-con drbd0: Terminating drbd_r_drbd0
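
The log above repeats the same failure every time the nodes reconnect. A short script can pull the relevant timestamps out of a longer log. This is a minimal sketch assuming each line starts with an `HH:MM:SS` timestamp as in the example; the function name `unresolved_split_brains` is hypothetical.

```python
import re

# Timestamp at line start, followed by the DRBD message seen in the example log.
SPLIT_BRAIN_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2})\s.*Split-Brain detected but unresolved"
)


def unresolved_split_brains(log_text):
    """Return the timestamps of all unresolved split-brain detections."""
    timestamps = []
    for line in log_text.splitlines():
        match = SPLIT_BRAIN_RE.match(line)
        if match:
            timestamps.append(match.group(1))
    return timestamps
```

Applied to the example log, this returns the two timestamps at which DRBD reported the split brain (11:48:23 and 11:48:36). A non-empty result indicates that the resolution procedure is needed.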

Split Brain Resolution

AXS Guard is capable of recovering from a split-brain situation semi-automatically in most cases, provided the correct procedure is followed.

  1. Determine which unit of the HA cluster has the most recent and uncorrupted configuration.

  2. Make sure this system is fully booted and the compromised system is shut down.

  3. Verify that the status page reflects the following connection state changes:

    • StandAlone > Unconnected

    • Unconnected > WFConnection

  4. Boot the system that needs to be synced.

  5. Look for the following log entries, in particular the line marked with an arrow:

14:50:28 kernel d-con drbd0: Handshake successful: Agreed network protocol version 101
14:50:28 kernel d-con drbd0: Peer authenticated using 20 bytes HMAC
14:50:28 kernel d-con drbd0: conn( WFConnection -> WFReportParams )
14:50:28 kernel d-con drbd0: Starting asender thread (from drbd_r_drbd0 [6780])
14:50:28 kernel block drbd0: drbd_sync_handshake:
14:50:28 kernel block drbd0: self 3A02D9C0EDE50779:28A8902C96588045:14049F78F25841D9:14039F78F25841D9 bits:1072812 flags:0
14:50:28 kernel block drbd0: peer 65B4055A401632C6:28A8902C96588044:14049F78F25841D8:14039F78F25841D9 bits:19683 flags:0
14:50:28 kernel block drbd0: uuid_compare()=100 by rule 90
14:50:28 kernel block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
14:50:28 kernel block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
14:50:28 kernel block drbd0: Split-Brain detected, 1 primaries, automatically solved. Sync from this node                     <<<<---------
14:50:28 kernel block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
14:50:28 kernel block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 967(1), total 967; compression: 99.9%
14:50:28 kernel block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 1054(1), total 1054; compression: 99.8%
14:50:28 kernel block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
14:50:28 kernel block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
14:50:28 kernel block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
14:50:28 kernel block drbd0: Began resync as SyncSource (will sync 4292772 KB [1073193 bits set]).
14:50:28 kernel block drbd0: updated sync UUID 3A02D9C0EDE50779:28A9902C96588045:28A8902C96588045:14049F78F25841D9
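
To confirm a successful recovery without reading every line, the log can be scanned for the "automatically solved" message and for the last `conn( A -> B )` transition. The sketch below assumes the message formats shown in the log above; the function name `recovery_status` and the returned dictionary keys are hypothetical.

```python
import re

CONN_RE = re.compile(r"conn\( (\w+) -> (\w+) \)")
SOLVED_RE = re.compile(r"Split-Brain detected, .*automatically solved")


def recovery_status(log_text):
    """Report whether the split brain was auto-solved and the last conn state."""
    solved = False
    last_state = None
    for line in log_text.splitlines():
        if SOLVED_RE.search(line):
            solved = True
        match = CONN_RE.search(line)
        if match:
            last_state = match.group(2)  # state the connection moved into
    return {"auto_solved": solved, "last_conn_state": last_state}
```

For the log above, taken on the unit that holds the good data, the result is `auto_solved` set to `True` with `SyncSource` as the last connection state. The unit being synced would presumably report a different state, so adapt the check to the node on which the log was captured.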

Important

Contact support@axsguard.com as soon as possible if you experience difficulties.
