Introduction
In this blog I will talk about some problems I recently experienced on a production system and how we made sure we won't experience similar problems in the future.
The customer keeps most of its information content in a back-end system, and it is fetched through the portal by a custom-made component. This component, and not the client browser, contacts the target system over HTTP and retrieves the current content page, as well as navigation information which it uses to dynamically build up a tree view.
Recently, the morning after a maintenance window for infrastructure changes, none of the iViews that used data from the back-end system worked. Since the ITIL process is well implemented at this customer, I was already aware that the cluster address of the back-end system had changed, and I suspected that this was the cause.
I ran an nslookup against the back-end system from the portal servers, and it returned the new, correct IP address. I also had no problems connecting to it via telnet on port 80 and retrieving the content manually. I therefore suspected that the problem was related to DNS lookups being cached in the portal, and restarted the first portal server.
After the restart the first portal server worked fine, so I went on to restart the rest of the cluster, and within a few minutes the whole cluster was functional. That was when I finally got the time to sit down and connect the dots on why this happened.
Since the connections to the back-end system used standard Java APIs, and not portal APIs, I knew that the problem was in the Java Virtual Machine (JVM) settings and not in the portal software.
After some googling I stumbled over this page describing networking properties in JVM 1.4.2. This confirmed my suspicion that by default the JVM caches all DNS lookups indefinitely instead of using the time-to-live value which is specified in the DNS record of each host.
For those of you who do not know it, a DNS lookup is a request sent to a DNS server which converts a readable hostname to an IP address. For example, it converts www.bouvet.no to the address 188.8.131.52. It is of course much more complex than this, and if you want to learn more, Wikipedia's entry on it is a good starting point.
The solution for Java 1.4.x (i.e. J2EE WAS 6.40)
The parameter which controls DNS caching in JVM 1.4.x is networkaddress.cache.ttl, and it can be set in the file "jre\lib\security\java.security". This must be done on each portal server. The property holds the number of seconds a successful name lookup is cached; a value of -1, which is the default in 1.4.x, means "cache forever".
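As an alternative to editing the file, the same security property can be set from code. This is only a minimal sketch: the class name DnsCacheTtl and the value 1200 are made up for the illustration, and it assumes the code runs before the JVM has resolved its first hostname, since as far as I know the Sun JVM reads the caching policy only once.

```java
import java.security.Security;

class DnsCacheTtl {
    // The security property named in the text above.
    static final String TTL_PROPERTY = "networkaddress.cache.ttl";

    /** Sets the cache lifetime for successful lookups (in seconds) and returns the value now stored. */
    static String configure(int seconds) {
        // Assumption: this is called before the first hostname lookup in this JVM.
        // A later call changes the property but may not change the active policy.
        Security.setProperty(TTL_PROPERTY, Integer.toString(seconds));
        return Security.getProperty(TTL_PROPERTY);
    }

    public static void main(String[] args) {
        // 1200 seconds (20 minutes) is only an example value.
        System.out.println(TTL_PROPERTY + " = " + configure(1200));
    }
}
```

For a portal installation the java.security file is probably still the better place, because it does not depend on the order in which applications start.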
The solution for Java 1.3.x (i.e. J2EE WAS 6.20)
For Java 1.3.x the parameter networkaddress.cache.ttl doesn't exist. However, you can set a Sun-specific parameter as a command-line argument to the JVM. That parameter is called sun.net.inetaddr.ttl. To set this parameter in the SAP EP 6 portal (which runs on Java 1.3.x), start the config tool and paste in the parameter together with a time-to-live (TTL) value. See image below:
This must be done on each portal server.
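JVM system properties are passed in the form -Dname=value, so the entry would look something like -Dsun.net.inetaddr.ttl=1200 (the value is just an example). To check what a JVM was actually started with, a small sketch like the following can help. The class name ShowDnsCacheSettings is made up, and the output is only meaningful when the code runs inside the JVM whose settings you want to inspect.

```java
import java.security.Security;

class ShowDnsCacheSettings {
    /** Returns the value itself, or "(not set)" when the property is absent. */
    static String orNotSet(String value) {
        return value == null ? "(not set)" : value;
    }

    public static void main(String[] args) {
        // Sun-specific system property, passed with -D on the command line.
        System.out.println("sun.net.inetaddr.ttl = "
                + orNotSet(System.getProperty("sun.net.inetaddr.ttl")));
        // Security property from jre/lib/security/java.security.
        System.out.println("networkaddress.cache.ttl = "
                + orNotSet(Security.getProperty("networkaddress.cache.ttl")));
    }
}
```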
Since I wanted to verify that changing this parameter really works, I wrote a Java program which resolves a hostname to an IP address approximately once a second. If DNS caching is set to forever, the result will never change. Since I don't have access to my DNS server, I used the hosts file which is provided on most operating systems. Entries in this file are consulted before the "proper" DNS server, so it is a great tool for debugging. On a Windows system the path to the file is C:\WINDOWS\system32\drivers\etc\hosts. All you need to do is add a line 127.0.0.1 www.bouvet.no to this file (or use another IP address). Then, while the program is running, you change the IP address in the file and see whether the program detects the change.
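A test program along those lines could look roughly like this. It is a sketch rather than the exact program I used: the class name DnsPoller is made up, and it prints a line only when the resolved address changes.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Date;

class DnsPoller {
    /** Resolves the host through the JVM, and therefore through the JVM's DNS cache. */
    static String resolve(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return "unresolved (" + e.getMessage() + ")";
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Hostname to watch; defaults to the example used in the text.
        String host = args.length > 0 ? args[0] : "www.bouvet.no";
        String previous = null;
        while (true) {
            String current = resolve(host);
            // Print only when the answer differs from the last one seen.
            if (!current.equals(previous)) {
                System.out.println(new Date() + "  " + host + " -> " + current);
                previous = current;
            }
            Thread.sleep(1000); // roughly one lookup per second
        }
    }
}
```

Started with for example java -Dsun.net.inetaddr.ttl=10 DnsPoller www.bouvet.no on a Sun JVM, a change in the hosts file should show up after about ten seconds. With the 1.4.x default of caching forever, the first line printed is also the last one.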
How long to DNS cache?
There is a trade-off between saving network traffic (caching DNS lookups) and having the latest IP address of a hostname (doing a DNS lookup). The value will depend on several things:
- How often do the IP addresses of hostnames change?
- What impact does it have if the system is unavailable for x minutes?
- At what time of day are DNS changes implemented?
- How much traffic is there on the portal system?
There is no correct answer to this, and sensible values vary from under a minute to hours. My suggestion would be to use a value of about 20 minutes. This means that on average it will take ten minutes from when the DNS record is changed until the change is picked up by the JVM (assuming the hostname is queried).
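The ten-minute figure comes from a simple estimate. Assume the DNS change happens at a uniformly random time t within the JVM's caching interval T, and that the hostname is looked up again as soon as the cached entry expires. Both are simplifications. The remaining delay is then T - t, which gives:

```latex
\[
\mathbb{E}[\text{delay}] = \frac{1}{T}\int_0^T (T - t)\,dt = \frac{T}{2},
\qquad T = 20\ \text{min} \;\Rightarrow\; \mathbb{E}[\text{delay}] = 10\ \text{min}.
\]
```

In the worst case the change happens right after a lookup was cached, and the delay is the full 20 minutes.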
Here I've shown how the JVM by default caches DNS lookups forever, and what problems this could bring for portal applications. Of course, I've also shown how you can change how the JVM caches DNS lookups so that problem situations can be avoided.
I was very surprised to see that Java doesn't use the time-to-live field in the DNS record by default, and I can't find a way to get it to use that instead of a static value. If anyone has, please let me know!