Scott Oaks works in the Java Performance group at Sun Microsystems, where he focuses on the performance of Java Enterprise Edition. He has worked with Java technology since 1996 and is the co-author of four books in the O'Reilly Java Series, including Java Threads (now in its
third edition).
Yesterday, I wrote that I'm often asked which X is faster (for a variety of
X). The answer to that question always depends on your perspective. I answered that question in terms of hardware and concluded (as I always do) that the answer depends very much on your needs, but that a machine which appears slower in a single-threaded test will likely be faster in a multi-threaded world. You can't necessarily extrapolate results from a simple test to a complex system.
What about software? Today, I'll look at that question in terms of NIO. You're probably aware that the connection handler of glassfish is based on Grizzly, an NIO framework. Yet in recent weeks, we've read claims from the Mailinator author that
traditional I/O is faster than NIO. And a recent blog from Jonathan Campbell shows a traditional I/O-based appserver outperforming glassfish. So what gives?
Let's look more closely at the test Jonathan Campbell ran: even though it simulates multiple clients, the driver runs only a single request at a time. Though it doesn't appear so on the surface, this is exactly an NIO issue; it has to do with how you architect servers to handle single-request streams vs. a conversational stream. A little-known fact about glassfish is that it still contains a blocking, traditional I/O-based connector, based on the Coyote connector from
Tomcat. You can enable it in glassfish by adding the -Dcom.sun.enterprise.web.connector.useCoyoteConnector=true option to your jvm-options -- but read this whole blog before you decide that using that connector is a good thing.
So I enabled this connector, got out my two-CPU Linux machine running Red Hat AS 3.0, and re-ran the benchmark Jonathan ran on glassfish and jBoss (I tried Geronimo, but when it didn't work for me, I abandoned it -- I'm sure I'd just done something stupid in running it, but I didn't have the time to look into it). I ran each appserver with the same JVM options, but did no other tuning.
And now that we're comparing the blocking, traditional I/O connectors, Glassfish comes out well on top (and, by comparison with Jonathan's numbers, it would easily have beaten Geronimo as well).
So does this mean that traditional I/O is faster than NIO? For this test, yes. But in general? Not necessarily. So next, I wrote up a little Faban driver that uses the same war file as the original test, but Faban runs the clients simultaneously instead of sequentially, continually pounding on the same sessions. In my Faban test, I ran 100 clients, each of which had a 50 ms think time between repeated calls to the session-validation servlet of the test. This gave me these calls per second:
Glassfish with NIO (grizzly): 8192
Glassfish with Std IO: 3344
jBoss: 6953
Yes, those calls per second are vastly higher than in the original benchmark -- the jRealBench driver is able to drive the CPU usage of my appserver machine to only about 15%. Faban can do better, though since the test is dominated by network traffic, the CPU utilization is still only about 70%. And for glassfish's blocking connector, I had to increase the request-processing thread count to 100 (even so, there's probably something wrong with that result, but since the blocking connector is not really what we recommend you use in production, I'm not going to delve into it).
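To make that load pattern concrete, here is a minimal sketch of what "100 simultaneous clients, each with a 50 ms think time" means, in contrast to a driver that issues one request at a time. This is plain Java rather than the actual Faban driver, and the servlet URL, run length, and class name are hypothetical:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

public class ConcurrentLoadSketch {
    // Hypothetical endpoint standing in for the session-validation servlet.
    private static final String TARGET = "http://appserver:8080/test/validate";
    private static final int CLIENTS = 100;      // simultaneous clients
    private static final long THINK_MS = 50;     // think time between calls
    private static final long RUN_MS = 60_000;   // measurement interval

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(CLIENTS);
        AtomicLong calls = new AtomicLong();
        long end = System.currentTimeMillis() + RUN_MS;

        for (int i = 0; i < CLIENTS; i++) {
            pool.execute(() -> {
                try {
                    while (System.currentTimeMillis() < end) {
                        HttpURLConnection conn =
                                (HttpURLConnection) new URL(TARGET).openConnection();
                        try (InputStream in = conn.getInputStream()) {
                            while (in.read() != -1) { /* drain the response */ }
                        }
                        calls.incrementAndGet();
                        Thread.sleep(THINK_MS);  // think time before the next request
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(RUN_MS + 5000, TimeUnit.MILLISECONDS);
        System.out.println("calls/sec: " + calls.get() / (RUN_MS / 1000));
    }
}

The point of the sketch is simply that all 100 clients have requests in flight at once, which is what exposes the difference between the connector architectures.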
When scalability matters, NIO is faster than traditional blocking I/O. Which, of course, is why we use grizzly as the connector architecture for glassfish (and
why you probably should NOT run out and change your configuration to use the coyote connector, unless your appserver usage pattern is very much dominated by single request/response patterns). The complex is different from the simple.
As always, your mileage will vary -- but the point is, are there tests where traditional I/O is faster than NIO? Of course -- with NIO, you always have the overhead of a select() system call, so when you measure the individual path, traditional I/O will always be faster. But when you need to scale, NIO will generally be faster: the overhead of the select() call is outweighed by having fewer thread context switches, by having long keep-alive times, or by other options that architecture opens up. Just as we saw with hardware, you can't necessarily extrapolate performance from the single, simple case to the complex system: you must test it to see how it behaves.
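For readers who haven't looked at NIO directly, here is a minimal sketch of the pattern being discussed: a single thread calling select() to multiplex many connections, rather than parking one blocking thread per socket. It is illustrative only -- it is not Grizzly's implementation, and the port and buffer size are arbitrary:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;

public class SelectorSketch {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buf = ByteBuffer.allocate(8192);
        while (true) {
            selector.select();               // the select() call discussed above
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    // One thread can accept and register many connections.
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buf.clear();
                    if (client.read(buf) < 0) {
                        client.close();      // remote side closed the connection
                    }
                    // A real connector would parse the request and hand it off here.
                }
            }
        }
    }
}

With a thread-per-connection blocking design, each of those registered channels would instead be a thread sitting in a read() call -- cheap at low connection counts, expensive at high ones, which is the scalability trade-off described above.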
As a performance engineer, I'm often asked which X is faster (for a variety of X). The answer to that question always depends on your perspective.
Today, I'll talk about the answer in terms of hardware and application servers.
People quite often measure the performance of their appserver on, say, their laptop and on a 6-core, 24-thread Sun Fire T1000, and are surprised that the cheaper laptop can serve single requests much faster than the more expensive server.
There are technical reasons for this that I won't delve into -- there are architecture guides that go into all that. Rather, I want to explore the question
of which of these machines is actually faster, particularly in a Java EE context. In an appserver, you typically want to process multiple requests at the same time. So looking at the speed of a single request isn't really interesting: what
is the speed of multiple requests?
To answer this, I took a simple program that does a long-running nonsense calculation (a sketch of that kind of test appears below). Running this on my laptop and on the 24-thread T1000, I see the following times (in seconds) to calculate X items:
# Items   Laptop   T1000
1         .66      1.3
2         1.4      1.5
4         2.8      1.6
8         5.4      2.5
16        10.8     3.7
24        16.6     4.8
As you'd expect, the performance of the laptop degrades linearly, to where it takes 16.6 seconds to perform 24 calculations. The performance of the T1000 doesn't degrade linearly: even though it takes twice as long as the laptop to perform a single calculation, it can perform 24 calculations in less than one-third of the time of the laptop.
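For reference, here is a minimal sketch of the kind of test I mean: a CPU-bound nonsense calculation run N times simultaneously, with the elapsed time reported. The particular arithmetic and constants are my own placeholders rather than the code I actually ran, but the shape of the test is the same:

import java.util.concurrent.*;

public class ThroughputSketch {
    // A CPU-bound "nonsense" calculation; the particular math is arbitrary.
    static double busyWork() {
        double d = 0;
        for (int i = 1; i < 50_000_000; i++) {
            d += Math.sqrt(i) * Math.sin(i);
        }
        return d;
    }

    public static void main(String[] args) throws Exception {
        int items = Integer.parseInt(args[0]);   // how many calculations to run at once
        ExecutorService pool = Executors.newFixedThreadPool(items);
        CountDownLatch done = new CountDownLatch(items);
        long start = System.currentTimeMillis();

        // Submit all calculations simultaneously and wait for every one to finish.
        for (int i = 0; i < items; i++) {
            pool.execute(() -> { busyWork(); done.countDown(); });
        }
        done.await();

        System.out.println(items + " items: "
                + (System.currentTimeMillis() - start) / 1000.0 + " seconds");
        pool.shutdown();
    }
}

Run it with the number of items as an argument (for example, java ThroughputSketch 24) on each machine and compare the elapsed times as the count grows.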
In the context of an appserver, think of the calculation as the time required for the business methods of your app. I've walked through this explanation a number of times, and often I'm told that the business method is the critical part of
the app, and it must be done in .6 seconds for each user -- and hence the throughput of the T1000 isn't important. And that's fine: if you need to calculate a single method in .6 seconds, then you must use the single-threaded machine. But if you need to calculate two of those at the same time, then you'll need to get two of those machines, and if you need to calculate 24 of them, you'll need to get 24 machines.
So this brings us back to our question: which machine is faster? And it depends
on what you need. If you need to only do one calculation at a time, then the laptop is faster. If you need to do 3 or more calculations at the same time, then the T1000 is faster. Which is faster for you will depend on your application, your traffic model, and many other variables. As always, the best thing is to try your application, but if that's not feasible, be very careful about extrapolating whatever data you do have: you cannot simply extrapolate performance data from
a simple (single-threaded) model to a complex system.
Recently, I've been reading an article entitled
The Fallacy of Premature Optimization by Randall Hyde. I urge everyone to go read the full article, but I can't help
summarizing some of it here -- it meshes so well with some of my conversations with developers
over the past few years.
Most people can quote the line "Premature optimization is the root of all evil" (which was
popularized by Donald Knuth, but originally comes from Tony Hoare). Unfortunately, I (and
apparently Mr. Hyde) come across too many developers who have taken this to mean that they
don't have to care about the performance of their code at all, or at least not until the code
is completed. This is just wrong.
To begin, the complete quote is actually
We should forget about small efficiencies, say about 97% of the time: premature optimization
is the root of all evil.
I agree with the basic premise of what this says, and also with everything it does not say.
In particular, this quote is abused in three ways.
First, it is only talking about small efficiencies. If you're designing a multi-tier app that uses the network a lot, you want to pay attention to the number of network calls you make and the data involved in them. Network calls are a large inefficiency. And that's not just to pick on network calls -- experienced developers know which things are inefficient, and know to program them carefully from the start.
Second, Hoare is saying (and Hyde and I agree) that you can safely ignore the small inefficiencies 97% of the time. That means that you should pay attention to small inefficiencies in roughly 1 out of every 33 lines of code you write.
Third, and only somewhat relatedly, this quote feeds into the perception that 80% of the time an application spends will be in 20% of the code, so we don't have to worry about our code's performance until we find out our code is part of that hot 20%.
I'll present one example from glassfish to highlight those last two points. One day, we
discovered that a particular test case for glassfish was bottlenecked on calls to Vector.size --
in particular, because of loops like this:
Vector v;
for (int i = 0; i < v.size(); i++) {
    process(v.get(i));   // v.size() and v.get() each synchronize on v, every iteration
}
This is a suboptimal way to process a vector, and one of the 3% of cases you need to pay attention to. The key reason is the synchronization around the vector, which turns out to be quite expensive when this loop is the hot loop in your program. I know, you've been told that uncontended access to a synchronized block is almost free, but that's not quite true -- crossing a synchronization boundary means that the JVM must flush all instance variables presently held in registers to main memory. The synchronization boundary also prevents the JVM from performing certain optimizations, because it limits how the JVM can re-order the code. So we got a big performance boost by re-writing this as
ArrayList v;
for (int i = 0, j = v.size(); i < j; i++) {
    process(v.get(i));   // no synchronization, and the size is evaluated only once
}
Perhaps you're thinking that we needed to use a vector because of threading issues, but
look at that first loop again: it is not threadsafe. If this code is accessed by multiple
threads, then it's buggy in both cases.
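To make that concrete, here is a small sketch I put together to illustrate the race (it is not code from glassfish; the element counts and thread structure are arbitrary). Each individual Vector call is synchronized, but the size()/get() pair is not atomic, so the reader can fetch an index that no longer exists:

import java.util.Vector;

public class VectorRaceSketch {
    public static void main(String[] args) throws Exception {
        Vector<Integer> v = new Vector<>();
        for (int i = 0; i < 1_000; i++) v.add(i);

        // One thread iterates with the "size() then get()" pattern...
        Thread reader = new Thread(() -> {
            for (int i = 0; i < v.size(); i++) {
                v.get(i);   // each call is synchronized, but the pair is not atomic
            }
        });
        // ...while another thread shrinks the vector underneath it.
        Thread writer = new Thread(() -> {
            while (!v.isEmpty()) v.remove(v.size() - 1);
        });

        reader.start();
        writer.start();
        reader.join();
        writer.join();
        // The size the reader checked can be stale by the time get(i) runs,
        // so the reader can throw ArrayIndexOutOfBoundsException -- synchronized or not.
    }
}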
What about that 80/20 rule? It's true that we found this case because it was consuming a lot
(not 80%, but still a lot) of time in our program. [Which also means that fixing this case
is tardy optimization, but there it is.]
But the problem is that there wasn't just
one loop written like this in the code; there were (and still are...sigh) hundreds. We
fixed the few that were the worst offenders, but there are still many, many places in the
code where this construct lives on. It's considered "too hard" to go change all the places
where this occurs (though NetBeans could refactor it all pretty quickly, but there's a
risk that subtle differences in the loop would mean that it would need to be refactored
differently).
When we addressed performance in Glassfish V2 in order to get our excellent SPECjAppServer results,
we fixed a lot of little things like this, because we spend 80% of our time in about 50% of
our code. It's what I call performance death by a thousand cuts: it's great when you can
find a simple CPU-intensive set of code to optimize. But it's even better if developers
pay some attention to writing good, performant code at the outset and you don't have to
track down hundreds of small things to fix.
Hyde's full article
has some excellent references for further reading, as well as other important points about
why, in fact, paying attention to performance as you're developing is a necessary part of
coding.
I've written several times before about how you have to measure performance to understand how you're doing -- and so here's my favorite performance stat of the day: New York 17, New England 14.
I spent last week working with a customer in Phoenix (only a few weeks before the Giants go there to beat the Patriots), and one of the things we wanted to test was how their application would work with the new in-memory replication feature of the appserver. They brought along one of their apps, we installed it and used their jmeter test, and quickly verified that the in-memory session replication worked as expected in the face of a server failure.
Feeling confident about the functionality test, we did some performance testing using their jmeter script. We got quite good throughput from their test. But as we watched it run, we noticed jmeter reporting that the throughput kept continually decreasing. Since we were pulling the plug on instances in our 6-node cluster all the time, at first I just chalked it up to that. But then we ran a test without failing instances, and the same thing happened: continually decreasing performance.
Nothing is quite as embarrassing as showing off your product to a customer and having the product behave badly. I was ready to blame a host of things: botched installation, network interference, phases of the moon. Secretly, I was willing to blame the customer app: if there's a bug, it must be in their code, not ours.
Eventually, we simplified the test down to a single instance, no failover, and a single URL to a simple JSP: pretty basic stuff, and yet it still showed degradation over time (in fact, things got worse). Now there were two things left to blame: jmeter, or the phases of the moon. Neither seemed likely, until I took a closer look at what jmeter was doing: it turns out that the jmeter script was using an Aggregate Report. That report, in addition to updating the throughput for each request, also updates various statistics, including the 90% response time. It does this in real time, which may seem like a good idea, but the problem is that calculating the 90% response time is an O(n) operation: the more requests jmeter made, the longer it took to calculate the 90% time.
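To see where the time goes, here is a minimal sketch of the problem (my own illustration, not JMeter's code): if the report recomputes the 90% figure after every sample, each update has to copy and sort everything collected so far, so the cost of reporting grows as the run gets longer:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PercentileSketch {
    private final List<Long> samples = new ArrayList<>();

    // Called after every request: copying and sorting the whole history means
    // the per-request reporting cost grows with the number of requests made so far.
    long record(long responseTimeMs) {
        samples.add(responseTimeMs);
        List<Long> copy = new ArrayList<>(samples);
        Collections.sort(copy);
        return copy.get((int) (copy.size() * 0.9));   // 90th percentile so far
    }

    public static void main(String[] args) {
        PercentileSketch report = new PercentileSketch();
        long start = System.nanoTime();
        for (int i = 0; i < 20_000; i++) {
            report.record(i % 50);     // fake response times
        }
        System.out.println("reporting overhead: "
                + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}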
I've previously written in other contexts about why tests with 0 think time are subject to misleading results. And it turns out this is another case of that: because there is no think time in the jmeter script, the time to calculate the 90% penalizes the total throughput. As the time to calculate the 90% increases, the time available for jmeter to make requests decreases, and hence the reported throughput decreases over time.
I'm not actually sure if jmeter is smart enough to do this calculation correctly even if there is think time between requests: will it just blindly sleep for the think time, or will it correctly calculate the think time minus its own processing time? For my test, it doesn't matter: the simpler thing is to use a different reporting tool that doesn't have the 90% calculation (which, I'm happy to report, showed glassfish/SJSAS 9.1 performing quite well with in-memory replication across the cluster and no degradation over time).
But what's more important to me is that it reinforces a lesson that I seem to have to relearn a lot: sometimes, your intuition is smarter than your tools. I had a strong intuition from the beginning that the test was flawed, but despite that, we spent a fair amount of time tracking down possible bugs in glassfish or the servlets.
And I also don't mean to limit this to a discussion of this particular bug/design issue with jmeter. When we tested startup for the appserver, a particular engineer was convinced that glassfish was idle for most of its startup time: the UNIX time command reported that the elapsed time to run asadmin start-domain was 30 seconds, but the CPU time used was only 1 or 2 seconds. The conclusion from that was that glassfish sat idle for 28 seconds. But intuitively, we knew that wasn't true (for one thing, the disk was cranking away all that time, and a quick glance at a CPU meter would disprove the theory that the CPU wasn't being used). And of course, it turns out that asadmin was starting processes which started processes, and shell timing code didn't understand all the descendant structure (particularly when intermediate processes exited but the grandchild process -- the appserver -- was still executing). The time command was just not suited to giving the desired answer.
Tools that give you visibility into your applications are invaluable; I'm not suggesting that when a tool gives you a result that you don't expect that you should blindly cling to your hypothesis anyway. But when a tool and your intuition are in conflict, don't be afraid to examine the possibility that the tool isn't measuring what you wanted it to.