Performance Comparison between NIO Frameworks

Most NIO frameworks can eventually saturate a 1 gigabit Ethernet link. However, some frameworks saturate the bandwidth with fewer connections than others. The performance numbers of five well-known open source NIO frameworks are presented here to show how Netty compares.

Where’s the Graph?

If you are in a hurry, please scroll down to see the graphs first. You can also download the PDF document which contains detailed numbers and graphs.

What’s the Bottom Line?

Contrary to common expectation, NIO frameworks have different performance characteristics even though they use the same NIO selector provider.

The difference appears to come from fundamental factors such as data structures and thread contention management, and those factors should never be overlooked.

Netty has succeeded in delivering a breakthrough in NIO framework performance through careful engineering, while retaining a flexible architecture.

Test Scenario

A simple echo server and client exchange fixed-length messages one at a time (i.e. synchronous ping-pong). The handler code, which sends the received data back verbatim, is executed in a separate thread pool that each NIO framework provides.

The tests were run with different message lengths (64 to 16384 bytes) and different network configurations (loopback and 1 gigabit Ethernet) to see how well each framework performs under various conditions.
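The ping-pong exchange described above can be sketched with plain blocking sockets from the Java standard library. This is only an illustration under stated assumptions, not the benchmark code: the class name PingPongSketch and its run method are hypothetical, the server echoes on a single thread instead of a framework-provided handler pool, and no NIO framework is involved. It shows the shape of the exchange (send one fixed-length message, wait for the full echo, repeat) and an average round-trip time obtained by dividing the elapsed time by the number of exchanges.

```java
import java.io.DataInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Arrays;

// Hypothetical sketch: synchronous ping-pong over loopback with blocking sockets.
class PingPongSketch {

    // Exchanges `rounds` messages of `size` bytes and returns how many echoes matched.
    static int run(int size, int rounds) throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            // Echo side: write back whatever arrives until the client closes.
            Thread echo = new Thread(() -> {
                try (Socket s = server.accept()) {
                    s.setTcpNoDelay(true);
                    InputStream in = s.getInputStream();
                    OutputStream out = s.getOutputStream();
                    byte[] buf = new byte[size];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                        out.flush();
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            echo.start();

            int matched = 0;
            try (Socket client = new Socket(loopback, server.getLocalPort())) {
                client.setTcpNoDelay(true);
                DataInputStream in = new DataInputStream(client.getInputStream());
                OutputStream out = client.getOutputStream();
                byte[] message = new byte[size];
                Arrays.fill(message, (byte) 'x');
                byte[] reply = new byte[size];

                long start = System.nanoTime();
                for (int i = 0; i < rounds; i++) {
                    out.write(message);   // ping
                    out.flush();
                    in.readFully(reply);  // wait for the complete pong before sending again
                    if (Arrays.equals(message, reply)) {
                        matched++;
                    }
                }
                long elapsedNanos = System.nanoTime() - start;
                // Average only: total elapsed time divided by the number of exchanges.
                System.out.printf("size=%d rounds=%d avg round trip=%.1f us%n",
                        size, rounds, elapsedNanos / 1000.0 / rounds);
            }
            echo.join();
            return matched;
        }
    }

    public static void main(String[] args) throws Exception {
        run(128, 1000);
    }
}
```

Because each message is sent only after the previous echo has fully arrived, a single connection in this sketch is bound by round-trip latency rather than by bandwidth.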

Test Environment

  • Software
    • The test client was written with Netty 3.0.0.CR5.
    • Echo server implementations
      • Netty 3.0.0.CR5
      • Four other open source NIO frameworks
        • Grizzly, MINA, NIO Framework, and xSocket
        • Used the latest milestone releases as of October 3rd, 2008
        • Excluded inactive projects (no release in 2008)
        • Framework names were anonymized in no particular order.
      • Thread pool
        • The number of I/O threads – the number of the CPU cores
        • The number of handler threads – 16
        • The default thread pool that each framework provides was used.
        • If the framework doesn’t provide a thread pool implementation which limits the maximum number of threads, Executors.newFixedThreadPool() was used instead.
      • Use of direct buffers was suppressed to avoid excessive memory consumption.
    • JRE – Sun JDK 1.6.0_07
    • JRE options – -server -Xms2048m -Xmx2048m -XX:+UseParallelGC -XX:+AggressiveOpts -XX:+UseFastAccessorMethods
  • Hardware
    • Server (Hostname: Eden)
      • CPU: 2 x quad-core Xeon 2.83GHz, ‘performance’ governor
      • O/S: Linux 2.6.25.11-97.fc9 (Fedora 9)
      • RAM: 6 GiB
      • NIC: Broadcom NetXtreme Gigabit Ethernet PCI express
    • Client (Hostname: Serpent)
      • CPU: 2 x dual-core Xeon 3.00GHz, ‘performance’ governor
      • O/S: Linux 2.6.25.11-97.fc9 (Fedora 9)
      • RAM: 3 GiB
      • NIC: Broadcom NetXtreme Gigabit Ethernet PCI express
    • No switching hub was used, to minimize possible network latency.
  • Common TCP/IP parameters
    • TCP_NODELAY was turned on. (i.e. Nagle’s algorithm was disabled.)
    • net.ipv4.tcp_tw_recycle was set to 1.
    • Used the default MTU (i.e. 1500 – no jumbo frame)
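
Several of the settings listed above map directly onto standard Java APIs. The following is a minimal sketch that assumes nothing about how each framework is configured internally. The class and method names (TestSettingsSketch, newHandlerPool, and so on) are hypothetical. Only Executors.newFixedThreadPool(), the thread counts, heap buffers, and TCP_NODELAY come from the list.

```java
import java.net.Socket;
import java.net.SocketException;
import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the listed settings using only the Java standard library.
class TestSettingsSketch {

    // Number of handler threads stated in the test environment.
    static final int HANDLER_THREADS = 16;

    // One I/O thread per CPU core, as seen by the JVM.
    static int ioThreadCount() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Bounded handler pool for frameworks that do not ship one of their own.
    static ExecutorService newHandlerPool() {
        return Executors.newFixedThreadPool(HANDLER_THREADS);
    }

    // Heap buffer (backed by a byte[]), i.e. not a direct buffer.
    static ByteBuffer newMessageBuffer(int size) {
        return ByteBuffer.allocate(size);
    }

    // Unconnected socket with TCP_NODELAY on, which disables Nagle's algorithm.
    static Socket newNoDelaySocket() throws SocketException {
        Socket socket = new Socket();
        socket.setTcpNoDelay(true);
        return socket;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("I/O threads: " + ioThreadCount());
        System.out.println("Direct buffer? " + newMessageBuffer(1024).isDirect());
        try (Socket socket = newNoDelaySocket()) {
            System.out.println("TCP_NODELAY: " + socket.getTcpNoDelay());
        }
        ExecutorService pool = newHandlerPool();
        pool.shutdown();
    }
}
```

The net.ipv4.tcp_tw_recycle and MTU settings are operating system level and are not covered by this sketch.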

Test Result

Client and Server on the Same Machine (Loopback Device)

The test client and servers ran on the same machine, Eden. There are three graphs here.

Size=128, Loopback
Size=1024, Loopback
Size=4096, Loopback

Client and Server on Different Machines (1 Gigabit Ethernet)

The test client ran on Serpent, and the servers ran on Eden. There are three graphs here.

Size=128, 1Gb Ethernet
Size=1024, 1Gb Ethernet
Size=4096, 1Gb Ethernet

Running the Tests by Yourself

The test results should always be reproducible. Please give us your feedback to help improve their accuracy. The full source code is available at the Subversion repository:

svn co http://anonsvn.jboss.org/repos/netty/subproject/benchmark

All tests are run with Ant. Enter ‘ant -p’ to see the instructions.

29 Comments

  1. c.m.

    what was the intention for omitting the names of the other frameworks here in this report? Doing a comparison and not saying what was compared against seems strange to me.

  2. Trustin Lee

    @c.m: Here’s the list of other frameworks in alphabetical order:

    However, it doesn’t necessarily mean that the framework B is Grizzly and so on. This report has been anonymized for a political reason. It is actually pretty easy to reproduce the performance result because the test code is completely open source. You can browse the source code here to find which version of each framework was used:

    http://anonsvn.jboss.org/repos/netty/subproject/benchmark

  3. Trustin Lee

    Ah, of course, there’s no restrictive terms of license which prohibits me from publishing the exact names. However, opening the whole result crystal-clearly might hurt some frameworks mentioned here. For example, some framework even had a resource leak so that I had to relaunch the server very often.

    Anyway, the bottom line of this comparison is pretty obvious.

    • Unlike usual expectations, NIO frameworks have different performance characteristics in spite of the fact that they are using the same NIO selector provider.
    • My observation is that the difference comes from the fundamental factors such as data structure and thread contention management, and those factors should never be overlooked.
    • AND Netty has succeeded to introduce the breakthrough in NIO framework performance, while retaining the flexible architecture.
  4. gregor

    Hi Trustin,

    please note that your xSocket-based example includes an unnecessary copy of the incoming data. The readByteBufferByLength(size) method should have been used instead.

    Gregor

  5. Trustin Lee

    @gregor: Hi Gregor,

    I was actually looking for what exactly you mentioned to avoid memory copy. Let me check in the fix right now.

    BTW, I’d love to mention that I was impressed by xSocket’s performance and scalability. Very stable and high-performing. 🙂

    Thanks!

  6. Kamel

    I start to use Netty on my future GPL project (Asynchronous Computation over Grid). I would like to say Great Job Lee!

  7. Trustin Lee

    @Kamel: Thanks for using Netty, and please feel free to contact me or the community if you have a question or suggestion. 🙂

    Also, please ping me when you are ready to publish your project to the web. I’d like to publish a list of projects which use Netty.

  8. Alan

    There is a note in this blog to say that the “use of direct buffers was suppressed to avoid excessive memory consumption”. I’m not sure what this means but there isn’t any way in NIO to suppress the use of direct buffers. If the framework/application uses non-direct buffers (ie: ByteBuffers that encapsulate byte[] in the java heap) then the buffers are transparently substituted with direct buffers when doing I/O.

  9. Trustin Lee

    @Alan: I guess a certain buffer allocation pattern (?) causes indefinitely increasing direct buffer memory which ends up with OOM, but I’m not sure what condition triggers that. What’s apparent though is that it just works fine when heap buffer is used primarily.

  10. Alan

    It mostly depends on how you are managing the direct buffers. If you allocate and unreference a direct buffer then the memory will not be released until the corresponding ByteBuffer object is GC’ed. Direct buffers are intended to be re-used.

  11. Trustin Lee

    @Alan: AFAIK, most NIO frameworks don’t pool direct buffers because it’s not really user-friendly to ask a user to return the buffer to the pool explicitly. It would be great if I can control obviously how a direct buffer is reclaimed.

    Common technique so far between NIO frameworks is to allocate a big chunk of direct buffer and slice it as needed because it lessens the GC overhead for some reason, but I don’t think it scales as the load goes up.

  12. Vikram

    Trustin,

    In my project the connection, to the NIO server, is required to be kept open. (Requests are sent over the same connection)
    This requirement makes it difficult to count the response time for every single request.
    I referred to the load test code you have posted on the JBoss svn repo
    http://anonsvn.jboss.org/repos/netty/subproject/benchmark
    You are counting the execution time for the overall test and calculating the response time by dividing that time by number of requests sent… please correct me if I am wrong.
    How do I check the response time for each of the requests. This will enable me to plot a graph for the performance of the server side code.

  13. Trustin Lee

    @Vikram: You’re right. It should yield the average response time which is acceptable in general. I could have measured the response time per each request-response pair, but I didn’t do that because I was worried about the overhead implication.

  14. Alan

    Unfortunately, the framework or application cannot control when direct buffers are released. In a multi-threaded environment you cannot release a native resource that may potentially be accessible or in use by other threads (i.e. an explicit free method creates the potential for crashes or security issues). Slicing a large buffer is a good approach when you need lots of small buffers.

  15. Vikram

    Exactly, the client code will spend time in mapping the request to response and will result in sending requests (more)spaced from each other. Can it be done on a separate thread, asynchronously, without affecting the client performance?

  16. Trustin Lee

    @Alan: I think slicing doesn’t solve the fundamental issue because it just decreases the allocation and GC overhead to some degrees. Can’t think of the ideal solution at this moment.

  17. Zelalem Sintayehu

    Hi I don’t know if you are still following this thread. It is a wonderful analysis and want to refer it. Have you published this work? I want to refer it in an academic paper.

    Thanks again for the nice work.

    Zelalem

  18. syuu1228

    Hi,

    I’m trying to use your benchmark program for measure network performance on multicore systems, but I got “java.io.IOException: Connection reset by peer” when “Message Size: 128, Connections: 10000” case.
    #To do so, I set messageSizeTable = {64 … 16384}, connectionTable = {1 … 10000}.

    This exception raised on client side, I can’t see any message from server side.
    And at least xsocket is worked perfectly, the other frameworks are still under testing right now.

    Do you have any idea how to prevent this?
    I already set net.ipv4.tcp_tw_recycle=1 on both side, also fs.file-max=100000.
    And machine power is enough – Core i7 X980, 24GB RAM, Intel 10GbE.

    Maybe not so useful, but I uploaded full log here:
    https://gist.github.com/1284191
