Thursday, September 25, 2014

NetApp Error: hw_assist: bind failed to port 4444 on IP address - How to fix

I've run across many filers - set up by NetApp "authorized" systems integrators - that throw this error.  I actually have four NetApp HA pairs installed by the professional services group - and the installers always miss this step.

cf.hwassist.socBindFailed:error]: hw_assist: bind failed to port 4444 on IP address X.X.X.X. Error 49

I've actually opened tickets with NetApp support - they told me to reboot the SP and then closed the ticket.  The problem is, the error kept coming back.

The Cause
After some research, I found that this error comes from the high-availability portion of the NetApp, which uses the SP to communicate with the partner controller.  By getting alerts about the partner's hardware (rather than waiting for a missed heartbeat), the system can do a takeover more quickly in the event of a true failure.

How hardware-assisted takeover speeds up takeover

The IP address it is complaining about in the error should be the IP address of the partner's management interface (not the SP/RLM IP).

Resolution
On each node of the HA pair, set the cf.hw_assist.partner.address option to the partner's management IP address.

First, verify that the setting is in fact incorrect.  The IP address in the configuration should NOT be the IP address of the partner's SP/RLM.

Check both nodes.

NETAPP-A> options cf.hw_assist
cf.hw_assist.enable          on         
cf.hw_assist.partner.address 192.163.1.2 
cf.hw_assist.partner.port    4444       

NETAPP-B> options cf.hw_assist
cf.hw_assist.enable          on         
cf.hw_assist.partner.address 192.163.1.3
cf.hw_assist.partner.port    4444       

Yeah - somehow those IP addresses were on my NetApp.  What noob uses 192.163...what? That's not even RFC1918...wtf?
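
If you want to sanity check an address like that without squinting at the RFC, here is a minimal Python sketch. It only tests an address against the three RFC 1918 ranges; the helper name is_rfc1918 is my own, not anything from NetApp.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if the IPv4 address falls inside an RFC 1918 range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


# The two addresses found on the filers, plus what was probably intended.
for addr in ("192.163.1.2", "192.163.1.3", "192.168.1.2"):
    print(addr, is_rfc1918(addr))
```

The two 192.163 addresses come back False and the 192.168 one comes back True, which lines up with a simple typo in the original install.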

Next, adjust the settings on each node of the HA pair to the correct IP address for the partner node.  You should use the IP address on e0M or another management port you may have designated.  Note - do not point this to the SP/RLM IP address - it won't work.

NETAPP-A> options cf.hw_assist.partner.address <CONTROLLER-B-IP>
Validating the new hw-assist configuration. Please wait... 
NETAPP-A> 

NETAPP-B> options cf.hw_assist.partner.address <CONTROLLER-A-IP>
Validating the new hw-assist configuration. Please wait... 
NETAPP-B> 

If you get an error, go back and double check your network - make sure that the correct ports are being used, and that you are putting in the correct IP.  Remember, it needs to be the management address of the partner controller.
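
If you manage more than a couple of HA pairs, the same check can be scripted. The following is a rough sketch only: it assumes you have already captured the text output of "options cf.hw_assist" somehow, and the function names and the SP address 10.160.99.51 are made up for illustration.

```python
def parse_hw_assist_options(output: str) -> dict:
    """Turn 'options cf.hw_assist' output into a {name: value} dict."""
    options = {}
    for line in output.splitlines():
        parts = line.split()
        # Only keep lines that look like "<option name> <value>".
        if len(parts) == 2 and parts[0].startswith("cf.hw_assist."):
            options[parts[0]] = parts[1]
    return options


def check_partner_address(options: dict, partner_mgmt_ip: str, partner_sp_ip: str) -> str:
    """Compare the configured partner address with the addresses we expect."""
    configured = options.get("cf.hw_assist.partner.address")
    if configured == partner_sp_ip:
        return "wrong: points at the partner's SP/RLM"
    if configured != partner_mgmt_ip:
        return f"wrong: {configured} is not the partner's management IP"
    return "ok"


# Captured output from NETAPP-A before the fix.
sample = """\
cf.hw_assist.enable          on
cf.hw_assist.partner.address 192.163.1.2
cf.hw_assist.partner.port    4444
"""

opts = parse_hw_assist_options(sample)
# 10.160.99.41 is the partner's management IP; 10.160.99.51 is a hypothetical SP IP.
print(check_partner_address(opts, "10.160.99.41", "10.160.99.51"))
```

Run against the broken config above, this reports that 192.163.1.2 is not the partner's management IP.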

Once it's configured correctly, you should see this message on each controller:

[NETAPP-A:cf.hwassist.hwasstActive:info]: hw_assist: hw_assist functionality is active on IP address: <CONTROLLER-A-IP> port: 4444

[NETAPP-B:cf.hwassist.hwasstActive:info]: hw_assist: hw_assist functionality is active on IP address: <CONTROLLER-B-IP> port: 4444 

You can validate the configuration this way:

NETAPP-A> cf hw_assist status 

The output will look like:
                          
Local Node(NETAPP-A) Status:
        Active: NETAPP-A monitoring alerts from partner(NETAPP-B)
        port 4444 IP address 10.160.99.40
Partner Node(NETAPP-B) Status:
        Active: NETAPP-B monitoring alerts from partner(NETAPP-A)
        port 4444 IP address 10.160.99.41
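
For anyone who wants to check this from a script rather than by eye, here is a small Python sketch that pulls the state, port and IP out of output shaped like the sample above. The regex is my own guess at the layout based on that one sample, so treat it as a starting point; it does not care about the exact wording after the state.

```python
import re

# Matches one "Local Node(...)" or "Partner Node(...)" stanza of three lines.
STATUS_RE = re.compile(
    r"^(?P<role>Local|Partner) Node\((?P<node>[^)]+)\) Status:\s*\n"
    r"\s*(?P<state>\w+):.*\n"
    r"\s*port (?P<port>\d+) IP address (?P<ip>\S+)",
    re.MULTILINE,
)


def parse_hw_assist_status(output: str) -> list:
    """Return one dict per node stanza found in 'cf hw_assist status' output."""
    return [
        {
            "role": m.group("role"),
            "node": m.group("node"),
            "state": m.group("state"),
            "port": int(m.group("port")),
            "ip": m.group("ip"),
        }
        for m in STATUS_RE.finditer(output)
    ]


sample = """\
Local Node(NETAPP-A) Status:
        Active: NETAPP-A monitoring alerts from partner(NETAPP-B)
        port 4444 IP address 10.160.99.40
Partner Node(NETAPP-B) Status:
        Active: NETAPP-B monitoring alerts from partner(NETAPP-A)
        port 4444 IP address 10.160.99.41
"""

nodes = parse_hw_assist_status(sample)
for node in nodes:
    print(node["role"], node["node"], node["state"], node["ip"], node["port"])

# Healthy means both stanzas were found and both report Active.
print(len(nodes) == 2 and all(n["state"] == "Active" for n in nodes))
```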

A few minutes after it is configured correctly, you should see this message in the system log of each controller:

[NETAPP-A:cf.hwassist.recvKeepAlive:info]: hw_assist: Received hw_assist KeepAlive alert from partner(NETAPP-B).  

[NETAPP-B:cf.hwassist.recvKeepAlive:info]: hw_assist: Received hw_assist KeepAlive alert from partner(NETAPP-A).  
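
If you forward the filer syslog somewhere, a few lines of Python can pick the hw_assist events out of it. This sketch only knows the three event names that appear in this post; the descriptions are my own wording, and anything else shows up as unknown.

```python
import re

# Captures the event name and severity from tags like "cf.hwassist.recvKeepAlive:info]".
EVENT_RE = re.compile(r"cf\.hwassist\.(\w+):(\w+)\]")

# Only the events seen in this post; descriptions are informal.
MEANING = {
    "socBindFailed": "bind failed - check cf.hw_assist.partner.address",
    "hwasstActive": "hw_assist is active",
    "recvKeepAlive": "keepalive received from partner",
}


def classify(line: str):
    """Return (event, severity, meaning) for a hw_assist log line, else None."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    event, severity = match.groups()
    return event, severity, MEANING.get(event, "unknown hw_assist event")


log_lines = [
    "cf.hwassist.socBindFailed:error]: hw_assist: bind failed to port 4444 on IP address X.X.X.X. Error 49",
    "[NETAPP-A:cf.hwassist.hwasstActive:info]: hw_assist: hw_assist functionality is active on IP address: <CONTROLLER-A-IP> port: 4444",
    "[NETAPP-A:cf.hwassist.recvKeepAlive:info]: hw_assist: Received hw_assist KeepAlive alert from partner(NETAPP-B).",
]

for line in log_lines:
    print(classify(line))
```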

