Performance Comparison between NIO Frameworks

posted in Uncategorized on October 6, 2008 by Trustin Lee

Most NIO frameworks can saturate 1 gigabit ethernet at some point. However, some frameworks can saturate the bandwidth with a smaller number of connections while others cannot. The performance numbers of five well-known open source NIO frameworks are presented here to help you judge how well Netty performs.

Where’s the Graph?

If you are in a hurry, please scroll down to see the graphs first. You can also download the PDF document which contains detailed numbers and graphs.

What’s the Bottom Line?

Contrary to common expectations, NIO frameworks have different performance characteristics even though they all use the same NIO selector provider.

The observed difference comes from fundamental factors such as data structures and thread contention management, and those factors should never be overlooked.

Netty has succeeded in introducing a breakthrough in NIO framework performance through careful engineering, while retaining its flexible architecture.

Test Scenario

A simple echo server and client exchange fixed-length messages one by one (i.e. synchronous ping-pong). The handler code, which sends the received data back verbatim, is executed in a separate thread pool that each NIO framework provides.
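
As a rough illustration of the synchronous ping-pong exchange described above, here is a minimal sketch that uses plain blocking sockets from the JDK rather than any of the benchmarked frameworks. The class and method names (EchoPingPong, startEchoServer, pingPong) are made up for this example, and it handles only a single connection, so it should not be read as the benchmark's actual code.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Arrays;

/** Blocking-socket illustration of the ping-pong exchange (hypothetical names). */
final class EchoPingPong {

    /** Starts a single-connection echo server on an ephemeral loopback port. */
    static ServerSocket startEchoServer() {
        try {
            ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
            Thread echoThread = new Thread(() -> {
                try (Socket s = server.accept()) {
                    InputStream in = s.getInputStream();
                    OutputStream out = s.getOutputStream();
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n); // send the received bytes back verbatim
                        out.flush();
                    }
                } catch (IOException ignored) {
                    // the connection or the server socket was closed; nothing to clean up here
                }
            });
            echoThread.setDaemon(true);
            echoThread.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sends count messages of size bytes one by one and returns how many echoes matched. */
    static int pingPong(int count, int size) {
        try (ServerSocket server = startEchoServer();
             Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort())) {
            DataInputStream in = new DataInputStream(client.getInputStream());
            OutputStream out = client.getOutputStream();
            byte[] message = new byte[size];
            Arrays.fill(message, (byte) 'x');
            byte[] reply = new byte[size];
            int matched = 0;
            for (int i = 0; i < count; i++) {
                out.write(message);
                out.flush();
                in.readFully(reply); // wait for the whole echo before sending the next message
                if (Arrays.equals(message, reply)) {
                    matched++;
                }
            }
            return matched;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(pingPong(1000, 64) + " echoes matched");
    }
}
```

Because the client waits for each echo before sending the next message, throughput in this pattern is bounded by round-trip latency per connection, which is why the number of connections matters when trying to saturate the link.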

The tests were run with different message lengths (64 to 16384 bytes) and different network configurations (loopback and 1 gigabit ethernet) to see how well each framework performs under various conditions.

Test Environment

Software
- The test client was written with Netty 3.0.0.CR5.
- Echo server implementations
  - Netty 3.0.0.CR5
  - Four other open source NIO frameworks
    - Grizzly, MINA, NIO Framework, and xSocket
    - Used the latest milestone releases as of October 3rd, 2008
    - Excluded inactive projects (no release in 2008)
    - Framework names were anonymized in no particular order.
  - Thread pool
    - The number of I/O threads – the number of CPU cores
    - The number of handler threads – 16
    - The default thread pool that each framework provides was used.
    - If the framework doesn’t provide a thread pool implementation which limits the maximum number of threads, Executors.newFixedThreadPool() was used instead.
  - Use of direct buffers was suppressed to avoid excessive memory consumption.
- JRE – Sun JDK 1.6.0_07
- JRE options – -server -Xms2048m -Xmx2048m -XX:+UseParallelGC -XX:+AggressiveOpts -XX:+UseFastAccessorMethods
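
The thread pool settings listed above could be set up roughly as follows. This is only a sketch: the class name BenchmarkThreadPools and its methods are hypothetical, and each framework wires its I/O and handler threads in its own way.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch of the thread pool sizing described in the list above (hypothetical names). */
final class BenchmarkThreadPools {

    /** Handler threads were fixed at 16 in the test. */
    static final int HANDLER_THREADS = 16;

    /** I/O threads: one per CPU core visible to the JVM. */
    static int ioThreadCount() {
        return Runtime.getRuntime().availableProcessors();
    }

    /** Fallback pool for frameworks without a bounded thread pool of their own. */
    static ExecutorService newHandlerPool() {
        return Executors.newFixedThreadPool(HANDLER_THREADS);
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService handlers = newHandlerPool();
        System.out.println("I/O threads: " + ioThreadCount());
        System.out.println("Handler threads: " + HANDLER_THREADS);
        handlers.shutdown();
        handlers.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```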
Hardware
- Server (Hostname: Eden)
  - CPU: 2 x quad-core Xeon 2.83GHz, ‘performance’ governor
  - O/S: Linux 2.6.25.11-97.fc9 (Fedora 9)
  - RAM: 6 GiB
  - NIC: Broadcom NetXtreme Gigabit Ethernet PCI express
- Client (Hostname: Serpent)
  - CPU: 2 x dual-core Xeon 3.00GHz, ‘performance’ governor
  - O/S: Linux 2.6.25.11-97.fc9 (Fedora 9)
  - RAM: 3 GiB
  - NIC: Broadcom NetXtreme Gigabit Ethernet PCI express
- No switching hub was used, to minimize possible network latency.
Common TCP/IP parameters
- TCP_NODELAY was turned on. (i.e. Nagle’s algorithm was disabled.)
- net.ipv4.tcp_tw_recycle was set to 1.
- Used the default MTU (i.e. 1500 – no jumbo frame)
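
For readers unfamiliar with TCP_NODELAY, the following sketch shows one way to turn it on for an NIO channel. It assumes Java 7 or later for StandardSocketOptions; on the JDK 1.6 used in this test, the equivalent call would be channel.socket().setTcpNoDelay(true). The class name NoDelayOption is made up for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;

/** Sketch: disabling Nagle's algorithm on an NIO socket channel (hypothetical class name). */
final class NoDelayOption {

    /** Opens an unconnected channel, enables TCP_NODELAY, and reports the resulting setting. */
    static boolean enableNoDelay() {
        try (SocketChannel channel = SocketChannel.open()) {
            channel.setOption(StandardSocketOptions.TCP_NODELAY, true);
            return channel.getOption(StandardSocketOptions.TCP_NODELAY);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("TCP_NODELAY enabled: " + enableNoDelay());
    }
}
```

With small messages exchanged one at a time, leaving Nagle's algorithm enabled can delay each write, so disabling it keeps the ping-pong latency from being dominated by the TCP stack's batching.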

Test Result

Client and Server on the Same Machine (Loopback Device)

The test client and servers ran on the same machine, Eden. (There are three graphs here.)

Client and Server on Different Machines (1 Gigabit Ethernet)

The test client ran on Serpent, and the servers ran on Eden. (There are three graphs here.)

Running the Tests Yourself

The test results should always be reproducible. Please give us your feedback to improve the accuracy of the results. The full source code is available at the Subversion repository:

svn co http://anonsvn.jboss.org/repos/netty/subproject/benchmark

All tests are run with Ant. Enter ‘ant -p’ to see the instructions.

29 Comments → Performance Comparison between NIO Frameworks

c.m. October 6, 2008 at 7:21 pm

What was the intention behind omitting the names of the other frameworks in this report? Doing a comparison and not saying what was compared against seems strange to me.
Trustin Lee October 6, 2008 at 8:24 pm
@c.m: Here’s the list of the other frameworks in alphabetical order:
- Grizzly
- MINA
- NIO Framework
- xSocket
However, it doesn’t necessarily mean that framework B is Grizzly and so on. This report has been anonymized for political reasons. It is actually pretty easy to reproduce the performance results because the test code is completely open source. You can browse the source code here to find which version of each framework was used:

http://anonsvn.jboss.org/repos/netty/subproject/benchmark
Trustin Lee October 6, 2008 at 8:48 pm
Ah, of course, there are no restrictive license terms that prohibit me from publishing the exact names. However, publishing the whole result in full detail might hurt some of the frameworks mentioned here. For example, one framework even had a resource leak, so I had to relaunch the server very often.

Anyway, the bottom line of this comparison is pretty obvious.
- Contrary to common expectations, NIO frameworks have different performance characteristics even though they all use the same NIO selector provider.
- My observation is that the difference comes from fundamental factors such as data structures and thread contention management, and those factors should never be overlooked.
- AND Netty has succeeded in introducing a breakthrough in NIO framework performance, while retaining its flexible architecture.
Trustin Lee October 6, 2008 at 10:57 pm

OK. I’ve just updated the post to avoid the confusion. HTH..
gregor October 7, 2008 at 1:19 am

Hi Trustin,

please note that your xSocket-based example includes an unnecessary copy of the incoming data. The readByteBufferByLength(size) method should have been used instead.

Gregor
Mike Heath October 7, 2008 at 5:11 am

Nice work Trustin. Netty is really a pleasure to use. It’s nice to see that it performs so well too. Even better than Grizzly!
Trustin Lee October 7, 2008 at 7:20 am

@gregor: Hi Gregor,

I was actually looking for exactly what you mentioned to avoid the memory copy. Let me check in the fix right now.

BTW, I’d love to mention that I was impressed by xSocket’s performance and scalability. Very stable and high-performing. 🙂

Thanks!
Trustin Lee October 7, 2008 at 7:24 am

@Mike Heath: Thanks Mike for your comment. Any idea on improvement though? 🙂
Example Citizen October 9, 2008 at 11:23 am

Your post seems to have been largely/completely copied without attribution at http://techmemo.org/2008/10/06/the-performance-numbers-of-the-5-well-known-open-source-nio-frameworks-are-presented-here-to-help-you-figure-out-the-excellence-of-netty-in-performance.html.
Trustin Lee October 9, 2008 at 11:50 am

@Example Citizen: Thanks for the information. I’ve contacted the site admin.
Trustin Lee October 14, 2008 at 8:34 pm

Interesting post and comments on NIO frameworks
Claudio Miranda October 14, 2008 at 9:57 pm

At the URL that “Example Citizen” pointed to, there is no article, only a parked site.
Trustin Lee October 14, 2008 at 10:17 pm

@Claudio Miranda: Perhaps a sort of SEO pump and dump attempt? The article was there at the time of the comment.
Kamel October 20, 2008 at 9:59 pm

I have started using Netty in my upcoming GPL project (Asynchronous Computation over Grid). I would like to say Great Job Lee!
Trustin Lee October 20, 2008 at 10:07 pm

@Kamel: Thanks for using Netty, and please feel free to contact me or the community if you have a question or suggestion. 🙂

Also, please ping me when you are ready to publish your project to the web. I’d like to publish a list of projects which use Netty.
Alan November 9, 2008 at 9:05 pm

There is a note in this blog saying that the “use of direct buffers was suppressed to avoid excessive memory consumption”. I’m not sure what this means, but there isn’t any way in NIO to suppress the use of direct buffers. If the framework/application uses non-direct buffers (i.e. ByteBuffers that encapsulate a byte[] in the Java heap), then the buffers are transparently substituted with direct buffers when doing I/O.
Trustin Lee November 10, 2008 at 2:17 pm

@Alan: I guess a certain buffer allocation pattern (?) causes direct buffer memory to grow indefinitely until it ends in an OOM, but I’m not sure what condition triggers that. What’s apparent though is that it works fine when heap buffers are used primarily.
Alan November 10, 2008 at 7:09 pm

It mostly depends on how you are managing the direct buffers. If you allocate and unreference a direct buffer then the memory will not be released until the corresponding ByteBuffer object is GC’ed. Direct buffers are intended to be re-used.
Trustin Lee November 10, 2008 at 8:23 pm

@Alan: AFAIK, most NIO frameworks don’t pool direct buffers because it’s not really user-friendly to ask a user to return the buffer to the pool explicitly. It would be great if I could explicitly control how a direct buffer is reclaimed.

A common technique among NIO frameworks so far is to allocate a big chunk of direct buffer and slice it as needed because it lessens the GC overhead for some reason, but I don’t think it scales as the load goes up.
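
To make the slicing technique concrete, here is a minimal sketch using only java.nio.ByteBuffer. The class DirectBufferSlicer and its take method are hypothetical names, not taken from any of the frameworks discussed, and the sketch deliberately never reclaims slices, which is exactly the limitation being debated in this thread.

```java
import java.nio.ByteBuffer;

/** Sketch: hand out slices of one large direct buffer (hypothetical class, no reclamation). */
final class DirectBufferSlicer {

    private final ByteBuffer chunk;

    DirectBufferSlicer(int chunkSize) {
        // One big allocation instead of many small direct buffers.
        chunk = ByteBuffer.allocateDirect(chunkSize);
    }

    /** Returns a slice of exactly size bytes, or null when the chunk is exhausted. */
    synchronized ByteBuffer take(int size) {
        if (size < 0 || chunk.remaining() < size) {
            return null;
        }
        chunk.limit(chunk.position() + size); // narrow the window to the requested region
        ByteBuffer slice = chunk.slice();     // shares memory with the chunk, no copy
        chunk.position(chunk.limit());        // advance past the region just handed out
        chunk.limit(chunk.capacity());        // restore the full window for the next call
        return slice;
    }

    public static void main(String[] args) {
        DirectBufferSlicer slicer = new DirectBufferSlicer(1024 * 1024);
        ByteBuffer buf = slicer.take(64);
        System.out.println("direct=" + buf.isDirect() + ", capacity=" + buf.capacity());
    }
}
```

Each slice keeps the whole chunk reachable, so the native memory is released only after every slice and the chunk itself have been garbage collected.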
Vikram November 11, 2008 at 4:26 am

Trustin,

In my project, the connection to the NIO server must be kept open. (Requests are sent over the same connection.)
This requirement makes it difficult to measure the response time for every single request.
I referred to the load test code you have posted on the JBoss svn repo
http://anonsvn.jboss.org/repos/netty/subproject/benchmark
You are measuring the execution time for the overall test and calculating the response time by dividing that time by the number of requests sent… please correct me if I am wrong.
How do I check the response time for each of the requests? This would enable me to plot a graph of the performance of the server-side code.
Trustin Lee November 11, 2008 at 2:57 pm

@Vikram: You’re right. It yields the average response time, which is acceptable in general. I could have measured the response time for each request-response pair, but I didn’t do that because I was worried about the overhead it would add.
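
For anyone who wants per-request numbers anyway, one possible approach is sketched below. It is an assumption-laden illustration rather than part of the benchmark: LatencyRecorder, measure, and the roundTrip callback are hypothetical names, and roundTrip stands for whatever code sends one request and blocks until its response arrives. Samples go into a preallocated array to keep the measurement overhead small.

```java
import java.util.Arrays;

/** Sketch: record one latency sample per request-response pair (hypothetical class). */
final class LatencyRecorder {

    private final long[] samples;
    private int count;

    LatencyRecorder(int capacity) {
        samples = new long[capacity]; // preallocated so recording does not allocate
    }

    /** Stores one sample; samples beyond the capacity are dropped. */
    void record(long nanos) {
        if (count < samples.length) {
            samples[count++] = nanos;
        }
    }

    /** Mean of the recorded samples, comparable to total time divided by request count. */
    double averageNanos() {
        if (count == 0) {
            return 0.0;
        }
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += samples[i];
        }
        return (double) sum / count;
    }

    /** Nearest-rank percentile (0 to 100) over the recorded samples. */
    long percentileNanos(double percentile) {
        if (count == 0) {
            throw new IllegalStateException("no samples recorded");
        }
        long[] sorted = Arrays.copyOf(samples, count);
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile / 100.0 * count);
        return sorted[Math.min(count - 1, Math.max(0, rank - 1))];
    }

    /** Times each call to roundTrip, which is assumed to block until the response arrives. */
    static LatencyRecorder measure(int requests, Runnable roundTrip) {
        LatencyRecorder recorder = new LatencyRecorder(requests);
        for (int i = 0; i < requests; i++) {
            long start = System.nanoTime();
            roundTrip.run();
            recorder.record(System.nanoTime() - start);
        }
        return recorder;
    }

    public static void main(String[] args) {
        LatencyRecorder r = measure(1000, () -> Math.sqrt(42.0)); // stand-in for a real request
        System.out.println("avg ns: " + r.averageNanos() + ", p99 ns: " + r.percentileNanos(99));
    }
}
```

Whether the two System.nanoTime() calls per request are cheap enough depends on the platform, so it is worth comparing the resulting average against the whole-run average before trusting the per-request figures.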
Alan November 11, 2008 at 7:47 pm

Unfortunately, the framework or application cannot control when direct buffers are released. In a multi-threaded environment you cannot release a native resource that may potentially be accessible or in use by other threads; i.e. an explicit free method creates the potential for crashes or security issues. Slicing a large buffer is a good approach when you need lots of small buffers.
Vikram November 12, 2008 at 3:14 am

Exactly, the client code will spend time mapping each request to its response, which will space the requests further apart. Can it be done on a separate thread, asynchronously, without affecting the client performance?
Trustin Lee November 12, 2008 at 9:27 am

@Vikram: I think you need to try that and see how much overhead it will have.
Trustin Lee November 12, 2008 at 9:29 am

@Alan: I think slicing doesn’t solve the fundamental issue because it just decreases the allocation and GC overhead to some degree. I can’t think of the ideal solution at this moment.
Testo November 16, 2009 at 1:41 am

Any chance to update this benchmark with Netty 3.1.5?
Zelalem Sintayehu May 11, 2011 at 11:27 pm

Hi, I don’t know if you are still following this thread. This is a wonderful analysis and I would like to refer to it. Have you published this work? I want to cite it in an academic paper.

Thanks again for the nice work.

Zelalem
Trustin Lee May 26, 2011 at 12:19 pm

No, I haven’t. However, you can run the tests yourself.
syuu1228 October 13, 2011 at 10:19 pm

Hi,

I’m trying to use your benchmark program to measure network performance on multicore systems, but I got “java.io.IOException: Connection reset by peer” in the “Message Size: 128, Connections: 10000” case.
#To do so, I set messageSizeTable = {64 … 16384}, connectionTable = {1 … 10000}.

This exception was raised on the client side; I can’t see any message on the server side.
At least xSocket worked perfectly; the other frameworks are still being tested right now.

Do you have any idea how to prevent this?
I already set net.ipv4.tcp_tw_recycle=1 on both sides, and also fs.file-max=100000.
And the machine is powerful enough – Core i7 X980, 24GB RAM, Intel 10GbE.

Maybe not so useful, but I uploaded full log here:
https://gist.github.com/1284191