Thursday, August 31, 2006

Java Tip #11 - Beware the Fickle Session ID

In March of 2005 I posted the article Performance 101 - Avoiding Work. In that tip I described a process where we store the users last jdbc connection in a hashmap using the HTTP Session ID as the key. In the last few months, several of our customers had been experiencing random, unexplained connection leaks when under a heavy load.

We were unable to duplicate this issue in house, even though we executed the same business process against a copy of the customer database. This of course made it very hard to find... we had no choice but to debug a production system. A couple of weeks ago I flew to Minneapolis (nice place, btw) for a week and the hunt was on.

I had been looking at all our fancy connection handling code trying to find the leak, and missed the problem for several days. Finally after staring at logs for several hours on the day we were scheduled to leave, we noticed something odd. The HTTP Session ID changed for a LIVE session. I had never even considered that a Session ID could change after a user logged in. Here is the scenario that caused the leak.
1) Connection X goes into map with key 123!456!NONE
2) Session ID changes to 123!456!7890
3) The next request comes in, there is NO connection in the map using key 123!456!7890, so connection Y is retrieved from the connection pool.
4) Connection X just leaked.

The servlet api doesn't say anything about the ID changing, only that it must be unique. The underlying issue is Weblogic appending proprietary information to the session "cookie" and returning that as the ID when HttpSession.getId() is called. We use the default name of JSESSIONID for the cookie and a sample would look like:
JSESSIONID=1E9Xwn7nLYfOsc1oSD7iaWHMXzpHga5cQj!-1587343083!-1587348922

According to the HTTP Session Replication Failures support pattern, the JSESSIONID format is:
SessionId!PrimaryServer JVM Hash!SecondaryServer JVMHash

If you experience a session replication failure, the SecondaryServer JVMHash will change to NONE.
JSESSIONID=1E9Xwn7nLYfOsc1oSD7iaWHMXzpHga5cQj!-1587343083!NONE

We actually saw it start with NONE, and then get a JVMHash. The theory is that the replication failed, then succeeded at a later time. We also wonder if during a heavy load, did a delay in replication cause the ID to begin at NONE then eventually get assigned. We didn't have the opportunity to research, but plan to during our next load test.

Our fix was to use only the 1st 52 bytes of the value returned by HttpSession.getId() as the key into the hashmap. The length is configurable in weblogic.xml, so you need to use that configured value as the length.

I'm sure BEA just doesn't know what to do here. There is probably tons of code out there relying on HttpSession.getId() returning the full JSESSIONID cookie, so changing its default behavior would not be a good idea.

Watch out for the fickle session id! Please let me know if you've experienced this before on application servers other than Weblogic. I tested this on 8.1 - 9.2 and saw the same behavior.