Manual Takeover of Master Services
About this Document
In this document we explain the manual takeover of master services in a High Availability (HA) cluster and how split-brain situations can be avoided. Note that the manual takeover of master services is disabled by default and can only be enabled by authorized Able personnel. This feature cannot be enabled via the AXS Guard web-based configuration tool.
A split-brain is a situation in which the data on the slave and master is different and cannot be reconciled. See the High Availability How To on this site for information about the correct HA maintenance and reboot procedures.
If you suspect that a split-brain situation has occurred, contact your reseller as soon as possible to avoid data loss.
By default, the slave unit will automatically start the services which are normally delivered by the master unit when the master goes down or when the connection between the master unit and the slave unit is lost. As a result, the slave unit will accept data on behalf of the master, e.g. e-mail traffic.
When the master comes back up and the slave is reachable, the master reassumes the master role and starts accepting data again, e.g. e-mail traffic. This results in two different data sets on the slave and the master unit, a.k.a. a split-brain, because both machines have assumed the master role. When the split-brain situation is finally resolved, only one data set survives, so data loss is inevitable.
To avoid split-brain situations, the automated takeover of master services can be disabled, allowing system administrators to manually start master services on the slave unit.
Only use this option if the master is powered off or unreachable due to physical network damage, e.g. a failing fiber connection between the master and the slave. We recommend that you physically disconnect the power cord from the failing master unit.
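The difference between automated and manual takeover can be illustrated with a small model. The sketch below is not AXS Guard code. The Unit class, the simulate function and the message names are hypothetical and exist only to show how two data sets diverge when the slave accepts data on behalf of the master and the master later reassumes its role.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """Hypothetical stand-in for one member of the HA cluster."""
    name: str
    serving: bool = False                 # True while the unit runs master services
    mailbox: list = field(default_factory=list)

    def accept(self, message: str) -> None:
        # Only a unit that is running master services accepts data.
        if self.serving:
            self.mailbox.append(message)


def simulate(auto_takeover: bool) -> bool:
    """Return True if both units end up holding data the other lacks."""
    # Both units start with the same replicated data set.
    master = Unit("master", serving=True, mailbox=["mail-0"])
    slave = Unit("slave", mailbox=["mail-0"])

    # Phase 1: the master goes down. With automated takeover enabled,
    # the slave starts the master services by itself.
    master.serving = False
    slave.serving = auto_takeover
    for unit in (master, slave):
        unit.accept("mail-1")

    # Phase 2: the master comes back up and reassumes the master role.
    master.serving = True
    slave.serving = False
    for unit in (master, slave):
        unit.accept("mail-2")

    only_on_master = set(master.mailbox) - set(slave.mailbox)
    only_on_slave = set(slave.mailbox) - set(master.mailbox)
    return bool(only_on_master) and bool(only_on_slave)


if __name__ == "__main__":
    print("automated takeover diverges:", simulate(auto_takeover=True))
    print("takeover disabled diverges:", simulate(auto_takeover=False))
```

In this simplified model, the data sets only diverge when the slave has started master services on its own. With the automated takeover disabled, the slave holds nothing that the master lacks.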
Starting Master Services Manually
Contact Able to disable the automated takeover of master services feature if you haven’t already done so. The "Start Master Services" button will only be visible on the slave unit when the master unit is unreachable.
Log in to the slave unit.
Go to High Availability > Tools.
Click on the "Start Master Services" button.
Please be patient while the master services are started. Contact Able to obtain a replacement for the defective unit if necessary. See the RMA procedures at the end of this document.
Slave Unit Unavailable
When the slave unit becomes unavailable, the master keeps delivering its services, e.g. email services. When the slave unit is up again, the master unit continues to deliver its services as normal.
Master Unit Unavailable
There are 2 possible scenarios in this case:
When the master unit becomes unavailable, the slave unit notices this and the "Start Master Services" button becomes visible on the slave unit. If the button is not pressed and the master unit comes back up, the master automatically resumes its services, resulting in a healthy cluster.
If the "Start Master Services" button is pressed, the slave unit will take over the master services, e.g. provide email services. When the master unit comes up again and detects the slave unit, both will start renegotiating roles. Eventually the master unit will take over the master services, resulting in a healthy cluster.
There are 2 possible scenarios when a network issue prevents the slave unit from detecting the master unit:
When the slave unit no longer detects the master unit due to a network issue, the "Start Master Services" button will appear on the slave unit. However, in this scenario the master unit is still operating normally and providing its services, so the button must not be pressed; pressing it would cause a split-brain. System administrators should not take any further action until the network issue has been resolved. Once the network issue is resolved, both units will automatically detect each other and the HA cluster will resume its normal functions.
If the button is pressed in this situation, the slave unit will start providing master services, e.g. handling email traffic. However, the master is still operational and is also handling email traffic; the slave unit simply cannot verify the status of the master unit because of the network connectivity issue. When the network issue is resolved, the master and the slave will be out of sync. In this case, you will need to contact our support engineers as soon as possible to repair the HA cluster, and you will need to decide which data set is to be kept. Data loss will occur.
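The four scenarios described above can be summarized as a small decision table. This is only a sketch of the reasoning in this document, not of the product's internal logic. The Outcome enum and the outcome function are hypothetical names introduced for illustration.

```python
from enum import Enum


class Outcome(Enum):
    HEALTHY = "healthy cluster"
    SPLIT_BRAIN = "split-brain, data loss"


def outcome(master_still_running: bool, button_pressed: bool) -> Outcome:
    """State of the cluster once both units can detect each other again.

    master_still_running: True if the master kept delivering its services
        while the slave could not reach it (network issue), False if the
        master was really down.
    button_pressed: True if "Start Master Services" was pressed on the slave.
    """
    # A split-brain only arises when both units provide master services
    # at the same time, i.e. the slave was started manually while the
    # master was in fact still operational.
    if master_still_running and button_pressed:
        return Outcome.SPLIT_BRAIN
    return Outcome.HEALTHY


if __name__ == "__main__":
    for running in (False, True):
        for pressed in (False, True):
            cause = "network issue" if running else "master down"
            action = "pressed" if pressed else "not pressed"
            print(f"{cause:14} button {action:12} -> {outcome(running, pressed).value}")
```

The table makes the rule of thumb explicit: pressing the button is only safe when you are certain that the master is really down, for example because it has been powered off.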