Overview
This blog is part of a series of troubleshooting blogs that each tell the story of how an issue got resolved. I will include the entire troubleshooting process to give you a fully transparent account of what went on. I hope you find these interesting. Please leave feedback in the comments on whether you like the format and on anything I can improve :smile:
Let's get started!
Problem Description
Registering the secondary site for System Replication fails with the error:
"remoteHost does not match with any host of the source site"
Environment Details
This incident occurred on Revision 73
Symptoms
Running the following command:
hdbnsutil -sr_register --name=SITEB --remoteHost=<hostname primary> --remoteInstance=<inst> --mode=<sync mode>
Gives error:
adding site ..., checking for inactive nameserver ..., nameserver <hostname_secondary>:3<inst>01
not responding., collecting information ..., Error while registering new
secondary site: remoteHost does not match with any host of the source site.
please ensure that all hosts of source and target site can resolve all
hostnames of both sites correctly., See primary master nameserver tracefile for
more information at <hostname_primary>, failed. trace file nameserver_<hostname_secondary>00000.000.trc
may contain more error details.]
Studio had a similar error as well.
Troubleshooting
The error message indicates that the secondary system could not be reached when performing sr_register.
Firstly, when dealing with System Replication, it is always good to double-check that all the prerequisites have been completed. Refer to the Administration Guide for this (http://help.sap.com/hana/SAP_HANA_Administration_Guide_en.pdf).
Let’s make sure the network connectivity is fine between the primary master nodes and the secondary master nodes.
Are the servers able to ping each other?
From the O/S, type “ping <hostname>”. Perform this from the primary to secondary and secondary to primary.
In this customer’s case, ping was successful.
What about firewalls? Could the ports be blocked?
From the O/S, type “telnet <hostname> <port>”. Perform this from the primary to the secondary and secondary to the primary.
The port to use is the SQL port, in this case 3<instance number>15.
In this customer’s case, the telnet connection was successful.
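The port check above can be scripted as a minimal sketch. It builds the SQL port from the two-digit instance number using the 3<instance number>15 pattern mentioned above. Note that `nc` (netcat) stands in for telnet here, which is my substitution, and `sql_port` and `check_port` are hypothetical helper names, not part of any SAP tool.

```shell
# Build the SQL port 3<instance number>15 from a two-digit instance number.
sql_port() {
    case "$1" in
        [0-9][0-9]) printf '3%s15\n' "$1" ;;
        *) echo "instance number must be two digits" >&2; return 1 ;;
    esac
}

# Try a TCP connection to host $1 on port $2.
# Assumption: a netcat that supports -z (connect only) and -w (timeout in seconds).
check_port() {
    if nc -z -w 5 "$1" "$2"; then
        echo "OK: $1:$2 is reachable"
    else
        echo "FAIL: $1:$2 is not reachable"
        return 1
    fi
}

# Example (run from the primary against the secondary, then the other way round):
#   check_port <hostname> "$(sql_port <instance number>)"
```

A successful connection only shows that the port is open for small packets. As the rest of this story shows, that does not rule out every network problem.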
Comparing the host files between the primary and secondary sites
The customer noticed an error in the /etc/hosts file: the short name was not filled in correctly. They fixed this, but the problem still occurred :sad:
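To make this kind of comparison less error-prone, a small helper can print the IP address that a hosts file assigns to a given name. This is only a sketch: `hosts_entry` is a hypothetical helper name, and it reads just the hosts file, so it ignores DNS and any other name service configured on the system.

```shell
# Print the IP address(es) that a hosts file maps to a name.
#   $1 = host name to look for (short or fully qualified)
#   $2 = hosts file to read (defaults to /etc/hosts)
hosts_entry() {
    awk -v name="$1" '
        { sub(/#.*/, "") }    # strip comments before matching
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; next } }
    ' "${2:-/etc/hosts}"
}

# Example: run on every host of both sites and compare the results.
#   hosts_entry <hostname primary>
#   hosts_entry <hostname secondary>
```

No output means the name is missing from the file (for example a short name that was never filled in). More than one line of output means the name is mapped to several addresses.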
Network Communication and System Replication
There is a note 1876398 - Network configuration for System Replication in HANA SP6.
The symptoms in the note match what we are experiencing: “When using SAP HANA Support Package 6, a System Replication secondary system may not be able to establish a connection to the primary system.”
The note explains: “Therefore, the listener hears only on the local network. System Replication also uses the infrastructure for internal network communication for exchanging data between the name servers of the primary and the secondary system. Therefore, the name servers of the two systems can no longer communicate with each other in this case.”
It is worth noting that this is a very common cause of the issue, but in the customer's case, it was not the problem.
Strace
We performed an strace. Here is some of the output:
sendto(13,"?\0\50\50\50\60\0\0\0\1\2\6,\0\0\0dr_gethdbversion"..., 86, 0, NULL, 0) = 86
recvfrom(13,0x7f1bd94549264, 8337, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13,events=POLLIN|POLLPRI}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13,"\323\346\v\333\333\333\333\333\333\333\333F\1I\nhdbversionI\0221.00."..., 8337, 0, NULL, NULL) = 52
recvfrom(13,0x7fff22c5277f, 1, 2, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(13,0x7fff22c528bf, 1, 2, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
gettid() = 35760
sendto(13,"?\0\32\33\45\33\0\0\0\1\2\0033\0\0\0dr_registerdatac"..., 413, 0,NULL, 0) = 413
recvfrom(13,0x7f1bd9745564, 8337, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13,events=POLLIN|POLLPRI}], 1, -1
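The first request (dr_gethdbversion) receives a reply, but after the second request (dr_registerdatac...) is sent, the final poll() never returns, so the reply never arrives. This looks like some sort of packet loss.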
Involving the Networking Team
We involved the customer’s networking team and found that the MTU size was set to 9000. They set the MTU size to 1500 and reran the register step, and it worked! The registration completed!
The networking team did not explain exactly what was going on, but we suspect they performed a tcpdump to see if there was packet loss.
** The MTU size may need to be changed back later for performance optimization; see SAP Note 2081065 - Troubleshooting SAP HANA Network **
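If you want to test an MTU setting yourself, one option is a ping with the "don't fragment" flag and a payload sized to fill exactly one frame. The sketch below rests on some assumptions: it uses IPv4 and the Linux iputils `ping` (for the `-M do` option), and `icmp_payload_for_mtu` and `probe_mtu` are hypothetical helper names. The explanation that large frames were being dropped somewhere on the path is also my assumption; the networking team did not confirm it.

```shell
# ICMP payload size that exactly fills one frame of the given MTU.
# Overhead: 20 bytes IPv4 header + 8 bytes ICMP header = 28 bytes.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send 3 pings of that size to host $1 with fragmentation forbidden.
#   $1 = host, $2 = MTU to test
# Assumption: Linux iputils ping, where "-M do" sets the don't-fragment flag.
probe_mtu() {
    ping -c 3 -M do -s "$(icmp_payload_for_mtu "$2")" "$1"
}

# Examples (run in both directions between primary and secondary):
#   ip link show                 # shows the MTU configured on each interface
#   probe_mtu <hostname> 1500    # standard frames
#   probe_mtu <hostname> 9000    # jumbo frames
```

If the 1500 probe succeeds but the 9000 probe fails, some device on the path does not pass jumbo frames. A default ping sends only a small packet, so it would still succeed in that situation.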
Disclaimer
This blog detailed the steps that SAP and the customer worked through to bring a problem to a resolution. This may not be the exact resolution for every incident that has the same symptoms. If you are encountering the same issue, you can review these steps with your HANA Administrator and Networking team.