Java broker OOM due to DirectMemory

Java broker OOM due to DirectMemory

Ramayan Tiwari
Hi All,

We are using Java broker 6.0.5, with a patch to use the MultiQueueConsumer
feature. We just finished deploying to production and saw a couple of
instances of the broker going OOM due to running out of DirectMemory buffers
(exceptions at the end of this email).

Here is our setup:
1. Max heap 12g, max direct memory 4g (this is the opposite of the
recommendation; however, for our use case the message payload is really
small, ~400 bytes, well below the per-message overhead of 1KB). In
perf testing, we were able to put 2 million messages without any issues.
2. ~400 connections to the broker.
3. Each connection has 20 sessions, and there is one multi-queue consumer
attached to each session, listening to around 1000 queues.
4. We are still using the 0.16 client (I know).

With the above setup, the baseline utilization (without any messages) for
direct memory was around 230MB (with 410 connections each taking ~500KB).
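As a back-of-the-envelope check on that baseline (a sketch; the ~500KB-per-connection figure is our observed value, not a documented constant):

```java
// Rough baseline direct-memory estimate: per-connection network buffers only.
public class BaselineEstimate {
    public static void main(String[] args) {
        long connections = 410;
        long bytesPerConnection = 500L * 1024; // ~500KB observed per plain connection
        long baselineBytes = connections * bytesPerConnection;
        // ~200 MB, in the same ballpark as the ~230MB we measured;
        // the remainder is presumably other broker buffers.
        System.out.printf("baseline ~= %d MB%n", baselineBytes / (1024 * 1024));
    }
}
```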

Based on our understanding of broker memory allocation, message payload
should be the only thing adding to direct memory utilization (on top of the
baseline); however, we are experiencing something completely different. In
our last broker crash, we saw the broker constantly running with 90%+
direct memory allocated, even when the sum of message payloads across all
queues was only 6-8% (these percentages are against the available DM of
4GB). During these high-DM-usage periods, heap usage was around 60% (of 12GB).

We would like some help in understanding what could be the reason for these
high DM allocations. Are there things other than message payload and AMQP
connections that use DM and could be contributing to this high usage?

Another thing that puzzles us is the de-allocation of DM byte buffers.
From log mining of heap and DM utilization, de-allocation of DM doesn't
correlate with heap GC. If anyone has seen any documentation related to
this, it would be very helpful if you could share it.
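For reference, this is how we sample the JVM's own direct-buffer accounting (a generic monitoring sketch using the standard platform MXBeans, not broker-specific API). Our understanding is that a DirectByteBuffer's native memory is only released when the buffer object itself becomes unreachable and its Cleaner runs after GC, so de-allocation should track collection of the buffer objects rather than overall heap pressure:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Prints JVM-level direct buffer pool stats: number of live buffers,
// bytes in use, and total reserved capacity. Comparing these against the
// broker's own MBean numbers helps separate "allocated" from "leaked".
public class DirectPoolStats {
    public static void main(String[] args) {
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%s: count=%d used=%dB capacity=%dB%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
```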

Thanks
Ramayan


*Exceptions*

java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.restoreApplicationBufferForWrite(NonBlockingConnectionPlainDelegate.java:93) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.processData(NonBlockingConnectionPlainDelegate.java:60) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.NonBlockingConnection.doRead(NonBlockingConnection.java:506) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.NonBlockingConnection.doWork(NonBlockingConnection.java:285) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.NetworkConnectionScheduler.processConnection(NetworkConnectionScheduler.java:124) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.processConnection(SelectorThread.java:504) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.SelectorThread$SelectionTask.performSelect(SelectorThread.java:337) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(SelectorThread.java:87) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]



*Second exception*
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.<init>(NonBlockingConnectionPlainDelegate.java:45) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.NonBlockingConnection.setTransportEncryption(NonBlockingConnection.java:625) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.NonBlockingConnection.<init>(NonBlockingConnection.java:117) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.NonBlockingNetworkTransport.acceptSocketChannel(NonBlockingNetworkTransport.java:158) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.SelectorThread$SelectionTask$1.run(SelectorThread.java:191) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]

Re: Java broker OOM due to DirectMemory

Oleksandr Rudyy
Hi Ramayan,

Could you please share with us the details of the messaging use case(s)
which ended up in OOM on the broker side?
I would like to reproduce the issue on my local broker in order to fix it.
I would appreciate it if you could provide as many details as possible,
including messaging topology, message persistence type, message
sizes, volumes, etc.

Qpid Broker 6.0.x uses direct memory for keeping message content and for
receiving/sending data. Each plain connection utilizes 512K of direct
memory; each SSL connection uses 1M. Your memory settings look OK to me.
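Using those per-connection figures, a quick sketch of the expected direct-memory footprint for the setup described above (assuming 400 plain connections at 512K each and 2 million messages at ~400 bytes of payload, counting payload only per the earlier understanding):

```java
// Expected direct-memory footprint under the stated assumptions.
public class ExpectedFootprint {
    public static void main(String[] args) {
        long connections = 400;
        long perPlainConn = 512L * 1024;   // 512K per plain connection
        long messages = 2_000_000;
        long payloadBytes = 400;           // ~400 bytes per message
        long totalBytes = connections * perPlainConn + messages * payloadBytes;
        // ~962 MB, well under the 4096 MB direct-memory limit,
        // which is why the settings look reasonable on paper.
        System.out.printf("expected ~= %d MB of 4096 MB%n", totalBytes / (1024 * 1024));
    }
}
```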

Kind Regards,
Alex


On 18 April 2017 at 23:39, Ramayan Tiwari <[hidden email]> wrote:


Re: Java broker OOM due to DirectMemory

Ramayan Tiwari
Hi Alex,

Thanks for your response, here are the details:

We use a "direct" exchange, without persistence (we specify NON_PERSISTENT
while sending from the client), and use the BDB store. We use the JSON
virtual host type. We are not using SSL.

When the broker went OOM, we had around 1.3 million messages with an
average message size of 100 bytes. Direct memory allocation (value read
from the MBean) kept going up, even though it shouldn't need more DM to
store that many messages. DM allocation persisted at 99% for about three
and a half hours before crashing.

Today, on the same broker, we have 3 million messages (same message size)
and DM allocation is only at 8%. This looks like there is some issue with
de-allocation, or a leak.

I have uploaded the memory utilization graph here:
https://drive.google.com/file/d/0Bwi0MEV3srPRVHFEbDlIYUpLaUE/view?usp=sharing
Blue line is DM allocated, yellow is DM used (sum of queue payloads), and
red is heap usage.

Thanks
Ramayan

On Wed, Apr 19, 2017 at 4:10 AM, Oleksandr Rudyy <[hidden email]> wrote:


Re: Java broker OOM due to DirectMemory

Oleksandr Rudyy
Ramayan,
Thanks for the details. I would like to clarify: was flow to disk
triggered today for the 3 million messages?

The following logs are issued for flow to disk:
BRK-1014 : Message flow to disk active :  Message memory use {0,number,#}KB
exceeds threshold {1,number,#.##}KB
BRK-1015 : Message flow to disk inactive : Message memory use
{0,number,#}KB within threshold {1,number,#.##}KB

Kind Regards,
Alex


On 19 April 2017 at 17:10, Ramayan Tiwari <[hidden email]> wrote:


Re: Java broker OOM due to DirectMemory

Ramayan Tiwari
Alex,

Below are the flow to disk logs from the broker holding 3 million+
messages at this time. We only have one virtual host. Time is in GMT. It
looks like flow to disk is active on the whole virtual host, not at the
queue level.

When the same broker went OOM yesterday, I did not see any flow to disk
logs from when it was started until it crashed (it crashed twice within
4 hours).


4/19/17 4:17:43.509 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
BRK-1014 : Message flow to disk active :  Message memory use 3356539KB
exceeds threshold 3355443KB
4/19/17 2:31:13.502 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
BRK-1015 : Message flow to disk inactive : Message memory use 3354866KB
within threshold 3355443KB
4/19/17 2:28:43.511 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
BRK-1014 : Message flow to disk active :  Message memory use 3358509KB
exceeds threshold 3355443KB
4/19/17 2:20:13.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
BRK-1015 : Message flow to disk inactive : Message memory use 3353501KB
within threshold 3355443KB
4/19/17 2:18:13.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
BRK-1014 : Message flow to disk active :  Message memory use 3357544KB
exceeds threshold 3355443KB
4/19/17 2:08:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
BRK-1015 : Message flow to disk inactive : Message memory use 3353236KB
within threshold 3355443KB
4/19/17 2:08:13.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
BRK-1014 : Message flow to disk active :  Message memory use 3356704KB
exceeds threshold 3355443KB
4/19/17 2:00:43.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
BRK-1015 : Message flow to disk inactive : Message memory use 3353511KB
within threshold 3355443KB
4/19/17 2:00:13.504 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
BRK-1014 : Message flow to disk active :  Message memory use 3357948KB
exceeds threshold 3355443KB
4/19/17 1:50:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
BRK-1015 : Message flow to disk inactive : Message memory use 3355310KB
within threshold 3355443KB
4/19/17 1:47:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
BRK-1014 : Message flow to disk active :  Message memory use 3365624KB
exceeds threshold 3355443KB
4/19/17 1:43:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
BRK-1015 : Message flow to disk inactive : Message memory use 3355136KB
within threshold 3355443KB
4/19/17 1:31:43.509 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
BRK-1014 : Message flow to disk active :  Message memory use 3358683KB
exceeds threshold 3355443KB
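Incidentally, the 3355443KB threshold in these logs is consistent with a flow-to-disk threshold of 80% of our 4GB direct-memory limit (a sketch of that arithmetic; the 80% figure is an assumption about the broker's default, not something stated in this thread):

```java
// Check: does 80% of a 4 GiB direct-memory limit match the logged threshold?
public class ThresholdCheck {
    public static void main(String[] args) {
        long maxDirectBytes = 4L * 1024 * 1024 * 1024;            // max direct memory: 4g
        long thresholdKb = (long) (maxDirectBytes * 0.80) / 1024; // assumed 80% default
        System.out.println(thresholdKb + "KB");                   // 3355443KB, matching the logs
    }
}
```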


Since the production release (2 days ago), we have seen 4 crashes across 3
different brokers. This is the most pressing concern for us in deciding
whether we should roll back to 0.32. Any help is greatly appreciated.

Thanks
Ramayan

On Wed, Apr 19, 2017 at 9:36 AM, Oleksandr Rudyy <[hidden email]> wrote:

> > > > feature. We just finished deploying to production and saw couple of
> > > > instances of broker OOM due to running out of DirectMemory buffer
> > > > (exceptions at the end of this email).
> > > >
> > > > Here is our setup:
> > > > 1. Max heap 12g, max direct memory 4g (this is opposite of what the
> > > > recommendation is, however, for our use cause message payload is
> really
> > > > small ~400bytes and is way less than the per message overhead of
> 1KB).
> > In
> > > > perf testing, we were able to put 2 million messages without any
> > issues.
> > > > 2. ~400 connections to broker.
> > > > 3. Each connection has 20 sessions and there is one multi queue
> > consumer
> > > > attached to each session, listening to around 1000 queues.
> > > > 4. We are still using 0.16 client (I know).
> > > >
> > > > With the above setup, the baseline utilization (without any messages)
> > for
> > > > direct memory was around 230mb (with 410 connection each taking
> 500KB).
> > > >
> > > > Based on our understanding of broker memory allocation, message
> payload
> > > > should be the only thing adding to direct memory utilization (on top
> of
> > > > baseline), however, we are experiencing something completely
> different.
> > > In
> > > > our last broker crash, we see that broker is constantly running with
> > 90%+
> > > > direct memory allocated, even when message payload sum from all the
> > > queues
> > > > is only 6-8% (these % are against available DM of 4gb). During these
> > high
> > > > DM usage period, heap usage was around 60% (of 12gb).
> > > >
> > > > We would like some help in understanding what could be the reason of
> > > these
> > > > high DM allocations. Are there things other than message payload and
> > AMQP
> > > > connection, which use DM and could be contributing to these high
> usage?
> > > >
> > > > Another thing where we are puzzled is the de-allocation of DM byte
> > > buffers.
> > > > From log mining of heap and DM utilization, de-allocation of DM
> doesn't
> > > > correlate with heap GC. If anyone has seen any documentation related
> to
> > > > this, it would be very helpful if you could share that.
> > > >
> > > > Thanks
> > > > Ramayan
> > > >
> > > >
> > > > *Exceptions*
> > > >
> > > > java.lang.OutOfMemoryError: Direct buffer memory
> > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
> > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
> > > > ~[na:1.8.0_40]
> > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> > > ~[na:1.8.0_40]
> > > > at
> > > > org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(
> > > > QpidByteBuffer.java:474)
> > > > ~[qpid-common-6.0.5.jar:6.0.5]
> > > > at
> > > > org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.
> > > > restoreApplicationBufferForWrite(NonBlockingConnectionPlainDele
> > > > gate.java:93)
> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > at
> > > > org.apache.qpid.server.transport.NonBlockingConnectionPlainDele
> > > > gate.processData(NonBlockingConnectionPlainDelegate.java:60)
> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > at
> > > > org.apache.qpid.server.transport.NonBlockingConnection.doRead(
> > > > NonBlockingConnection.java:506)
> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > at
> > > > org.apache.qpid.server.transport.NonBlockingConnection.doWork(
> > > > NonBlockingConnection.java:285)
> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > at
> > > > org.apache.qpid.server.transport.NetworkConnectionScheduler.
> > > > processConnection(NetworkConnectionScheduler.java:124)
> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > at
> > > > org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.
> > > > processConnection(SelectorThread.java:504)
> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > at
> > > > org.apache.qpid.server.transport.SelectorThread$
> > > > SelectionTask.performSelect(SelectorThread.java:337)
> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > at
> > > > org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(
> > > > SelectorThread.java:87)
> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > at
> > > > org.apache.qpid.server.transport.SelectorThread.run(
> > > > SelectorThread.java:462)
> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > at
> > > > java.util.concurrent.ThreadPoolExecutor.runWorker(
> > > > ThreadPoolExecutor.java:1142)
> > > > ~[na:1.8.0_40]
> > > > at
> > > > java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > > > ThreadPoolExecutor.java:617)
> > > > ~[na:1.8.0_40]
> > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
> > > >
> > > >
> > > >
> > > > *Second exception*
> > > > java.lang.OutOfMemoryError: Direct buffer memory
> > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
> > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
> > > > ~[na:1.8.0_40]
> > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> > > ~[na:1.8.0_40]
> > > > at
> > > > org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(
> > > > QpidByteBuffer.java:474)
> > > > ~[qpid-common-6.0.5.jar:6.0.5]
> > > > at
> > > > org.apache.qpid.server.transport.NonBlockingConnectionPlainDele
> > > > gate.<init>(NonBlockingConnectionPlainDelegate.java:45)
> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > at
> > > > org.apache.qpid.server.transport.NonBlockingConnection.
> > > > setTransportEncryption(NonBlockingConnection.java:625)
> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > at
> > > > org.apache.qpid.server.transport.NonBlockingConnection.<init>(
> > > > NonBlockingConnection.java:117)
> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > at
> > > > org.apache.qpid.server.transport.NonBlockingNetworkTransport.
> > > > acceptSocketChannel(NonBlockingNetworkTransport.java:158)
> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > at
> > > > org.apache.qpid.server.transport.SelectorThread$SelectionTask$1.run(
> > > > SelectorThread.java:191)
> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > at
> > > > org.apache.qpid.server.transport.SelectorThread.run(
> > > > SelectorThread.java:462)
> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > at
> > > > java.util.concurrent.ThreadPoolExecutor.runWorker(
> > > > ThreadPoolExecutor.java:1142)
> > > > ~[na:1.8.0_40]
> > > > at
> > > > java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > > > ThreadPoolExecutor.java:617)
> > > > ~[na:1.8.0_40]
> > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
> > > >
> > >
> >
>

Re: Java broker OOM due to DirectMemory

Ramayan Tiwari
Another issue we noticed: when the broker goes OOM due to direct memory, it
doesn't produce a heap dump (requested via "-XX:+HeapDumpOnOutOfMemoryError"),
even though the error is the same "java.lang.OutOfMemoryError" described in
the Oracle JVM docs.

Has anyone been able to find a way to get to heap dump for DM OOM?
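(Our stack traces show the OOM being thrown from java.nio.Bits.reserveMemory, i.e. from library code rather than by the garbage collector, which may be why the flag never fires.) As a stopgap we have been considering polling the platform MBean for the direct buffer pool instead — a sketch using the standard java.lang.management API:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectPoolProbe {
    public static void main(String[] args) {
        // The JVM exposes a BufferPoolMXBean named "direct" covering
        // ByteBuffer.allocateDirect allocations (and "mapped" for mmap).
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                System.out.printf("direct buffers: count=%d used=%d capacity=%d%n",
                        pool.getCount(), pool.getMemoryUsed(),
                        pool.getTotalCapacity());
            }
        }
    }
}
```

Logging this periodically at least gives a time series of allocated direct memory to correlate against queue depths, even without a dump.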

- Ramayan


Re: Java broker OOM due to DirectMemory

Ramayan Tiwari
After a lot of log mining, we might have a way to explain the sustained
increase in direct memory allocation: the correlation seems to be with
growth in the size of a queue that is being consumed, but at a much slower
rate than producers are putting messages onto it.

The pattern we see is that in each instance of broker crash, there is at
least one queue (usually 1 queue) whose size kept growing steadily. It’d be
of significant size but not the largest queue -- usually there are multiple
larger queues -- but it was different from other queues in that its size
was growing steadily. The queue would also be moving, but its processing
rate was not keeping up with the enqueue rate.

Our theory that might be totally wrong: If a queue is moving the entire
time, maybe then the broker would keep reusing the same buffer in direct
memory for the queue, and keep on adding onto it at the end to accommodate
new messages. But because it’s active all the time and we’re pointing to
the same buffer, space allocated for messages at the head of the
queue/buffer doesn’t get reclaimed, even long after those messages have
been processed. Just a theory.
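A small illustration of the kind of retention we are theorizing about (purely illustrative — we have not confirmed this is how QpidByteBuffer manages its pool): a slice of a direct ByteBuffer shares the parent's backing memory, so as long as any slice near the tail of a pooled chunk is still referenced, the entire allocation — including the already-consumed head — cannot be reclaimed.

```java
import java.nio.ByteBuffer;

public class SliceRetention {
    public static void main(String[] args) {
        // Pretend this is one pooled network/message buffer.
        ByteBuffer chunk = ByteBuffer.allocateDirect(256 * 1024);

        // A "message" occupying only the last 400 bytes of the chunk.
        chunk.position(chunk.capacity() - 400);
        ByteBuffer message = chunk.slice();

        // The slice is tiny...
        System.out.println(message.capacity()); // 400

        // ...but it shares storage with the parent: an absolute write
        // through the parent is visible through the slice, showing the
        // full 256KB allocation stays reachable while any slice lives.
        chunk.put(chunk.capacity() - 400, (byte) 42);
        System.out.println(message.get(0)); // 42
    }
}
```

If the broker's pooled buffers behave like this, one long-lived reference per chunk would be enough to keep whole chunks pinned, which would look exactly like allocated-but-unused direct memory.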

We are also trying to reproduce this using some perf tests to enqueue with
same pattern, will update with the findings.

Thanks
Ramayan

>>> > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
>>> > > >
>>> > > >
>>> > > >
>>> > > > *Second exception*
>>> > > > java.lang.OutOfMemoryError: Direct buffer memory
>>> > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
>>> > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>>> > > > ~[na:1.8.0_40]
>>> > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
>>> > > ~[na:1.8.0_40]
>>> > > > at
>>> > > > org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(
>>> > > > QpidByteBuffer.java:474)
>>> > > > ~[qpid-common-6.0.5.jar:6.0.5]
>>> > > > at
>>> > > > org.apache.qpid.server.transport.NonBlockingConnectionPlainDele
>>> > > > gate.<init>(NonBlockingConnectionPlainDelegate.java:45)
>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at
>>> > > > org.apache.qpid.server.transport.NonBlockingConnection.
>>> > > > setTransportEncryption(NonBlockingConnection.java:625)
>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at
>>> > > > org.apache.qpid.server.transport.NonBlockingConnection.<init>(
>>> > > > NonBlockingConnection.java:117)
>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at
>>> > > > org.apache.qpid.server.transport.NonBlockingNetworkTransport.
>>> > > > acceptSocketChannel(NonBlockingNetworkTransport.java:158)
>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at
>>> > > > org.apache.qpid.server.transport.SelectorThread$SelectionTas
>>> k$1.run(
>>> > > > SelectorThread.java:191)
>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at
>>> > > > org.apache.qpid.server.transport.SelectorThread.run(
>>> > > > SelectorThread.java:462)
>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> > > > at
>>> > > > java.util.concurrent.ThreadPoolExecutor.runWorker(
>>> > > > ThreadPoolExecutor.java:1142)
>>> > > > ~[na:1.8.0_40]
>>> > > > at
>>> > > > java.util.concurrent.ThreadPoolExecutor$Worker.run(
>>> > > > ThreadPoolExecutor.java:617)
>>> > > > ~[na:1.8.0_40]
>>> > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>

Re: Java broker OOM due to DirectMemory

Keith Wall
Hi Ramayan

We have been discussing your problem here and have a couple of questions.

I have been experimenting with use-cases based on your descriptions
above, but so far I have been unsuccessful in reproducing a
"java.lang.OutOfMemoryError: Direct buffer memory" condition. The
direct memory usage reflects the expected model: it levels off when
the flow-to-disk threshold is reached, and direct memory is released as
messages are consumed, until the minimum size of the direct memory
cache is reached.

1] For clarity let me check: we believe when you say "patch to use
MultiQueueConsumer" you are referring to the patch attached to
QPID-7462 "Add experimental "pull" consumers to the broker"  and you
are using a combination of this "x-pull-only"  with the standard
"x-multiqueue" feature.  Is this correct?

2] One idea we had here relates to the size of the virtualhost IO
pool.   As you know from the documentation, the Broker caches/reuses
direct memory internally, but the documentation fails to mention that
each pooled virtualhost IO thread also grabs a chunk (256K) of direct
memory from this cache.  By default the virtualhost IO pool is sized
Math.max(Runtime.getRuntime().availableProcessors() * 2, 64), so if
you have a machine with a very large number of cores, you may have a
surprisingly large amount of direct memory assigned to virtualhost IO
threads.   Check the value of connectionThreadPoolSize on the
virtualhost (http://<server>:<port>/api/latest/virtualhost/<virtualhostnodename>/<virtualhostname>)
to see what value is in force.  What is it?  It is possible to tune
the pool size using the context variable
virtualhost.connectionThreadPool.size.
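To put rough numbers on this, here is a small sketch (an estimate based on the 256K-per-thread figure above, not broker code) of the direct memory the IO pool would pre-allocate:

```java
// Rough estimate of direct memory grabbed up-front by the virtualhost IO
// thread pool, assuming (per the description above) one 256K chunk per
// pooled IO thread. This mirrors the default sizing formula; it is not the
// broker's own code.
public class IoPoolDirectMemoryEstimate {
    static final long CHUNK_BYTES = 256 * 1024; // 256K per IO thread (assumption)

    static int defaultPoolSize(int availableProcessors) {
        return Math.max(availableProcessors * 2, 64);
    }

    static long estimatedDirectBytes(int availableProcessors) {
        return defaultPoolSize(availableProcessors) * CHUNK_BYTES;
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.printf("pool=%d threads -> ~%dMB direct memory%n",
                defaultPoolSize(cores), estimatedDirectBytes(cores) / (1024 * 1024));
    }
}
```

For a 40-core machine this is max(80, 64) = 80 threads, i.e. only about 20MB up-front, so on its own it would not explain exhausting a 4GB limit.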

3] Tell me if you are tuning the Broker in any way beyond the direct/heap
memory settings you have told us about already.  For instance, are you
changing any of the direct memory pooling settings
(broker.directByteBufferPoolSize), the default network buffer size
(qpid.broker.networkBufferSize), or applying any other non-standard
settings?

4] How many virtual hosts do you have on the Broker?

5] What is the consumption pattern of the messages?  Do you consume in a
strictly FIFO fashion, or are you making use of message selectors
and/or any of the out-of-order queue types (LVQs, priority queues or
sorted queues)?

6] Is it just the 0.16 client involved in the application?   Can I
check that you are not using any of the AMQP 1.0 clients
(org.apache.qpid:qpid-jms-client or
org.apache.qpid:qpid-amqp-1-0-client) in the software stack (as either
consumers or producers)?

Hopefully the answers to these questions will get us closer to a
reproduction.   If you are able to reliably reproduce it, please share
the steps with us.

Kind regards, Keith.


On 20 April 2017 at 10:21, Ramayan Tiwari <[hidden email]> wrote:

> After a lot of log mining, we might have a way to explain the sustained
> increase in DirectMemory allocation: the correlation seems to be with the
> growth in the size of a queue that is getting consumed, but at a much slower
> rate than producers putting messages on this queue.
>
> The pattern we see is that in each instance of broker crash, there is at
> least one queue (usually 1 queue) whose size kept growing steadily. It’d be
> of significant size but not the largest queue -- usually there are multiple
> larger queues -- but it was different from other queues in that its size
> was growing steadily. The queue would also be moving, but its processing
> rate was not keeping up with the enqueue rate.
>
> Our theory that might be totally wrong: If a queue is moving the entire
> time, maybe then the broker would keep reusing the same buffer in direct
> memory for the queue, and keep on adding onto it at the end to accommodate
> new messages. But because it’s active all the time and we’re pointing to
> the same buffer, space allocated for messages at the head of the
> queue/buffer doesn’t get reclaimed, even long after those messages have
> been processed. Just a theory.
>
> We are also trying to reproduce this using some perf tests to enqueue with
> same pattern, will update with the findings.
>
> Thanks
> Ramayan
>
> On Wed, Apr 19, 2017 at 6:52 PM, Ramayan Tiwari <[hidden email]>
> wrote:
>
>> Another issue that we noticed is when broker goes OOM due to direct
>> memory, it doesn't create heap dump (specified by "-XX:+
>> HeapDumpOnOutOfMemoryError"), even when the OOM error is same as what is
>> mentioned in the oracle JVM docs ("java.lang.OutOfMemoryError").
>>
>> Has anyone been able to find a way to get to heap dump for DM OOM?
>>
>> - Ramayan
>>
>> On Wed, Apr 19, 2017 at 11:21 AM, Ramayan Tiwari <[hidden email]
>> > wrote:
>>
>>> Alex,
>>>
>>> Below are the flow to disk logs from broker having 3million+ messages at
>>> this time. We only have one virtual host. Time is in GMT. Looks like flow
>>> to disk is active on the whole virtual host and not a queue level.
>>>
>>> When the same broker went OOM yesterday, I did not see any flow to disk
>>> logs from when it was started until it crashed (crashed twice within 4hrs).
>>>
>>>
>>> 4/19/17 4:17:43.509 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
>>> BRK-1014 : Message flow to disk active :  Message memory use 3356539KB
>>> exceeds threshold 3355443KB
>>> 4/19/17 2:31:13.502 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
>>> BRK-1015 : Message flow to disk inactive : Message memory use 3354866KB
>>> within threshold 3355443KB
>>> 4/19/17 2:28:43.511 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
>>> BRK-1014 : Message flow to disk active :  Message memory use 3358509KB
>>> exceeds threshold 3355443KB
>>> 4/19/17 2:20:13.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
>>> BRK-1015 : Message flow to disk inactive : Message memory use 3353501KB
>>> within threshold 3355443KB
>>> 4/19/17 2:18:13.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
>>> BRK-1014 : Message flow to disk active :  Message memory use 3357544KB
>>> exceeds threshold 3355443KB
>>> 4/19/17 2:08:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
>>> BRK-1015 : Message flow to disk inactive : Message memory use 3353236KB
>>> within threshold 3355443KB
>>> 4/19/17 2:08:13.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
>>> BRK-1014 : Message flow to disk active :  Message memory use 3356704KB
>>> exceeds threshold 3355443KB
>>> 4/19/17 2:00:43.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
>>> BRK-1015 : Message flow to disk inactive : Message memory use 3353511KB
>>> within threshold 3355443KB
>>> 4/19/17 2:00:13.504 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
>>> BRK-1014 : Message flow to disk active :  Message memory use 3357948KB
>>> exceeds threshold 3355443KB
>>> 4/19/17 1:50:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
>>> BRK-1015 : Message flow to disk inactive : Message memory use 3355310KB
>>> within threshold 3355443KB
>>> 4/19/17 1:47:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
>>> BRK-1014 : Message flow to disk active :  Message memory use 3365624KB
>>> exceeds threshold 3355443KB
>>> 4/19/17 1:43:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
>>> BRK-1015 : Message flow to disk inactive : Message memory use 3355136KB
>>> within threshold 3355443KB
>>> 4/19/17 1:31:43.509 AM INFO  [Housekeeping[test]] - [Housekeeping[test]]
>>> BRK-1014 : Message flow to disk active :  Message memory use 3358683KB
>>> exceeds threshold 3355443KB
>>>
>>>
>>> After production release (2days back), we have seen 4 crashes in 3
>>> different brokers, this is the most pressing concern for us in decision if
>>> we should roll back to 0.32. Any help is greatly appreciated.
>>>
>>> Thanks
>>> Ramayan
>>>
>>> On Wed, Apr 19, 2017 at 9:36 AM, Oleksandr Rudyy <[hidden email]>
>>> wrote:
>>>
>>>> Ramayan,
>>>> Thanks for the details. I would like to clarify whether flow to disk was
>>>> triggered today for 3 million messages?
>>>>
>>>> The following logs are issued for flow to disk:
>>>> BRK-1014 : Message flow to disk active :  Message memory use
>>>> {0,number,#}KB
>>>> exceeds threshold {1,number,#.##}KB
>>>> BRK-1015 : Message flow to disk inactive : Message memory use
>>>> {0,number,#}KB within threshold {1,number,#.##}KB
>>>>
>>>> Kind Regards,
>>>> Alex
>>>>
>>>>
>>>> On 19 April 2017 at 17:10, Ramayan Tiwari <[hidden email]>
>>>> wrote:
>>>>
>>>> > Hi Alex,
>>>> >
>>>> > Thanks for your response, here are the details:
>>>> >
>>>> > We use "direct" exchange, without persistence (we specify
>>>> NON_PERSISTENT
>>>> > that while sending from client) and use BDB store. We use JSON virtual
>>>> host
>>>> > type. We are not using SSL.
>>>> >
>>>> > When the broker went OOM, we had around 1.3 million messages with 100
>>>> bytes
>>>> > average message size. Direct memory allocation (value read from MBean)
>>>> kept
>>>> > going up, even though it wouldn't need more DM to store these many
>>>> > messages. DM allocated persisted at 99% for about 3 and half hours
>>>> before
>>>> > crashing.
>>>> >
>>>> > Today, on the same broker we have 3 million messages (same message
>>>> size)
>>>> > and DM allocated is only at 8%. This seems like there is some issue
>>>> with
>>>> > de-allocation or a leak.
>>>> >
>>>> > I have uploaded the memory utilization graph here:
>>>> > https://drive.google.com/file/d/0Bwi0MEV3srPRVHFEbDlIYUpLaUE/view?usp=sharing
>>>> > Blue line is DM allocated, Yellow is DM Used (sum of queue payload)
>>>> and Red
>>>> > is heap usage.
>>>> >
>>>> > Thanks
>>>> > Ramayan
>>>> >
>>>> > On Wed, Apr 19, 2017 at 4:10 AM, Oleksandr Rudyy <[hidden email]>
>>>> wrote:
>>>> >
>>>> > > Hi Ramayan,
>>>> > >
>>>> > > Could please share with us the details of messaging use case(s) which
>>>> > ended
>>>> > > up in OOM on broker side?
>>>> > > I would like to reproduce the issue on my local broker in order to
>>>> fix
>>>> > it.
>>>> > > I would appreciate if you could provide as much details as possible,
>>>> > > including, messaging topology, message persistence type, message
>>>> > > sizes,volumes, etc.
>>>> > >
>>>> > > Qpid Broker 6.0.x uses direct memory for keeping message content and
>>>> > > receiving/sending data. Each plain connection utilizes 512K of direct
>>>> > > memory. Each SSL connection uses 1M of direct memory. Your memory
>>>> > settings
>>>> > > look Ok to me.
>>>> > >
>>>> > > Kind Regards,
>>>> > > Alex
>>>> > >
>>>> > >

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Java broker OOM due to DirectMemory

Ramayan Tiwari
Hi Keith,

Thanks so much for your response and digging into the issue. Below are the
answer to your questions:

1) Yes, we are using QPID-7462 with 6.0.5. We couldn't use 6.1, where it was
released, because we need JMX support. Here is the destination format:
"%s ; {node : { type : queue }, link : { x-subscribes : { arguments : {
x-multiqueue : [%s], x-pull-only : true }}}}"
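For illustration, such an address string could be assembled per session like this (the queue names here are hypothetical placeholders, not our real queues):

```java
// Hypothetical helper that builds the multi-queue, pull-only destination
// address in the format shown above. Queue names are made up for
// illustration only.
public class MultiQueueAddress {
    static String buildAddress(String subject, java.util.List<String> queues) {
        StringBuilder names = new StringBuilder();
        for (String q : queues) {
            if (names.length() > 0) {
                names.append(", ");
            }
            names.append('\'').append(q).append('\'');
        }
        return String.format(
            "%s ; {node : { type : queue }, link : { x-subscribes : "
            + "{ arguments : { x-multiqueue : [%s], x-pull-only : true }}}}",
            subject, names);
    }

    public static void main(String[] args) {
        System.out.println(buildAddress("dest",
                java.util.Arrays.asList("queue1", "queue2")));
    }
}
```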

2) Our machines have 40 cores, which makes the number of threads 80.
This might not be an issue, because it would show up in the baseline DM
allocated, which is only 6% (of 4GB) when we just bring up the broker.

3) The only setting that we tuned WRT DM is flowToDiskThreshold, which
is set at 80% now.
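As a sanity check on that setting: 80% of the 4GB direct memory limit works out to exactly the 3355443KB threshold figure in the BRK-1014/BRK-1015 log lines quoted earlier in the thread (a back-of-the-envelope calculation, not broker code):

```java
// Back-of-the-envelope check: the flow-to-disk threshold reported in the
// BRK-1014/BRK-1015 log lines (3355443KB) matches 80% of 4GB direct memory.
public class FlowToDiskThresholdCheck {
    static long thresholdKb(long maxDirectBytes, double fraction) {
        return (long) (maxDirectBytes * fraction) / 1024;
    }

    public static void main(String[] args) {
        long fourGb = 4L * 1024 * 1024 * 1024;
        System.out.println(thresholdKb(fourGb, 0.80) + "KB"); // matches the broker logs
    }
}
```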

4) Only one virtual host in the broker.

5) Most of our queues (99%) are priority, we also have 8-10 sorted queues.

6) Yeah we are using the standard 0.16 client and not AMQP 1.0 clients. The
connection log line looks like:
CON-1001 : Open : Destination : AMQP(IP:5672) : Protocol Version : 0-10 :
Client ID : test : Client Version : 0.16 : Client Product : qpid

We had another broker crash about an hour back; we see the same
patterns:
1) There is a queue which is constantly growing, enqueue is faster than
dequeue on that queue for a long period of time.
2) Flow to disk didn't kick in at all.

This graph shows memory growth (red line - heap, blue - DM allocated,
yellow - DM used)
https://drive.google.com/file/d/0Bwi0MEV3srPRdVhXdTBncHJLY2c/view?usp=sharing

The below graph shows growth on a single queue (there are 10-12 other
queues with traffic as well, some larger in size than this queue):
https://drive.google.com/file/d/0Bwi0MEV3srPRWmNGbDNGUkJhQ0U/view?usp=sharing
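The buffer-reuse theory from earlier in the thread can be illustrated with plain NIO buffers (this is only a hypothesis about how the broker's pooled buffers might behave, not confirmed broker behaviour):

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of the pinning theory: a small slice of a large
// pooled direct buffer shares the same native memory, so as long as any
// slice remains reachable the whole chunk cannot be freed, even if most of
// it is logically unused.
public class SlicePinningSketch {
    static ByteBuffer messageView(ByteBuffer chunk, int offset, int length) {
        ByteBuffer dup = chunk.duplicate();
        dup.position(offset);
        dup.limit(offset + length);
        return dup.slice(); // view shares the chunk's native memory
    }

    public static void main(String[] args) {
        ByteBuffer chunk = ByteBuffer.allocateDirect(256 * 1024); // e.g. one pooled 256K chunk
        ByteBuffer message = messageView(chunk, 0, 400);          // one ~400-byte message
        chunk = null; // even if the pool drops its reference...
        // ...the 256K native allocation stays alive while `message` is reachable.
        System.out.println(message.capacity() + " " + message.isDirect());
    }
}
```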

Couple of questions:
1) Is there any developer-level doc/design spec on how Qpid uses DM?
2) We are not getting heap dumps automatically when the broker crashes due to
DM (HeapDumpOnOutOfMemoryError is not respected). Has anyone found a way to
get around this problem?
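On question 2, our working assumption (our reading of HotSpot behaviour, not something the broker documents): the "Direct buffer memory" error is thrown programmatically by java.nio.Bits.reserveMemory rather than by a failed Java-heap allocation, so -XX:+HeapDumpOnOutOfMemoryError never fires. A bounded sketch that can reproduce the error when run with a small -XX:MaxDirectMemorySize:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Bounded repro sketch: allocates 1MB direct buffers up to a budget. Run
// main with e.g. -XX:MaxDirectMemorySize=64m -XX:+HeapDumpOnOutOfMemoryError
// and a budget above 64MB to observe the "Direct buffer memory" OOM without
// a heap dump being written.
public class DirectOomRepro {
    static long allocateUpTo(long budgetBytes) {
        List<ByteBuffer> pinned = new ArrayList<>(); // keep buffers reachable
        long allocated = 0;
        try {
            while (allocated < budgetBytes) {
                pinned.add(ByteBuffer.allocateDirect(1024 * 1024)); // 1MB chunks
                allocated += 1024 * 1024;
            }
        } catch (OutOfMemoryError e) {
            // Raised by java.nio.Bits.reserveMemory, not by a heap allocation.
            System.out.println("caught: " + e.getMessage());
        }
        return allocated;
    }

    public static void main(String[] args) {
        // Small default budget so this is safe to run without any JVM flags.
        System.out.println(allocateUpTo(8L * 1024 * 1024) + " bytes allocated");
    }
}
```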

Thanks
Ramayan

On Thu, Apr 20, 2017 at 9:08 AM, Keith W <[hidden email]> wrote:

> Hi Ramayan
>
> We have been discussing your problem here and have a couple of questions.
>
> I have been experimenting with use-cases based on your descriptions
> above, but so far, have been unsuccessful in reproducing a
> "java.lang.OutOfMemoryError: Direct buffer memory"  condition. The
> direct memory usage reflects the expected model: it levels off when
> the flow to disk threshold is reached and direct memory is release as
> messages are consumed until the minimum size for caching of direct is
> reached.
>
> 1] For clarity let me check: we believe when you say "patch to use
> MultiQueueConsumer" you are referring to the patch attached to
> QPID-7462 "Add experimental "pull" consumers to the broker"  and you
> are using a combination of this "x-pull-only"  with the standard
> "x-multiqueue" feature.  Is this correct?
>
> 2] One idea we had here relates to the size of the virtualhost IO
> pool.   As you know from the documentation, the Broker caches/reuses
> direct memory internally but the documentation fails to mentions that
> each pooled virtualhost IO thread also grabs a chunk (256K) of direct
> memory from this cache.  By default the virtual host IO pool is sized
> Math.max(Runtime.getRuntime().availableProcessors() * 2, 64), so if
> you have a machine with a very large number of cores, you may have a
> surprising large amount of direct memory assigned to virtualhost IO
> threads.   Check the value of connectionThreadPoolSize on the
> virtualhost (http://<server>:<port>/api/latest/virtualhost/<virtualhostn
> odename>/<virtualhostname>)
> to see what value is in force.  What is it?  It is possible to tune
> the pool size using context variable
> virtualhost.connectionThreadPool.size.
>
> 3] Tell me if you are tuning the Broker in way beyond the direct/heap
> memory settings you have told us about already.  For instance you are
> changing any of the direct memory pooling settings
> broker.directByteBufferPoolSize, default network buffer size
> qpid.broker.networkBufferSize or applying any other non-standard
> settings?
>
> 4] How many virtual hosts do you have on the Broker?
>
> 5] What is the consumption pattern of the messages?  Do consume in a
> strictly FIFO fashion or are you making use of message selectors
> or/and any of the out-of-order queue types (LVQs, priority queue or
> sorted queues)?
>
> 6] Is it just the 0.16 client involved in the application?   Can I
> check that you are not using any of the AMQP 1.0 clients
> (org,apache.qpid:qpid-jms-client or
> org.apache.qpid:qpid-amqp-1-0-client) in the software stack (as either
> consumers or producers)
>
> Hopefully the answers to these questions will get us closer to a
> reproduction.   If you are able to reliable reproduce it, please share
> the steps with us.
>
> Kind regards, Keith.
>
>
> On 20 April 2017 at 10:21, Ramayan Tiwari <[hidden email]>
> wrote:
> > After a lot of log mining, we might have a way to explain the sustained
> > increased in DirectMemory allocation, the correlation seems to be with
> the
> > growth in the size of a Queue that is getting consumed but at a much
> slower
> > rate than producers putting messages on this queue.
> >
> > The pattern we see is that in each instance of broker crash, there is at
> > least one queue (usually 1 queue) whose size kept growing steadily. It’d
> be
> > of significant size but not the largest queue -- usually there are
> multiple
> > larger queues -- but it was different from other queues in that its size
> > was growing steadily. The queue would also be moving, but its processing
> > rate was not keeping up with the enqueue rate.
> >
> > Our theory that might be totally wrong: If a queue is moving the entire
> > time, maybe then the broker would keep reusing the same buffer in direct
> > memory for the queue, and keep on adding onto it at the end to
> accommodate
> > new messages. But because it’s active all the time and we’re pointing to
> > the same buffer, space allocated for messages at the head of the
> > queue/buffer doesn’t get reclaimed, even long after those messages have
> > been processed. Just a theory.
> >
> > We are also trying to reproduce this using some perf tests to enqueue
> with
> > same pattern, will update with the findings.
> >
> > Thanks
> > Ramayan
> >
> > On Wed, Apr 19, 2017 at 6:52 PM, Ramayan Tiwari <
> [hidden email]>
> > wrote:
> >
> >> Another issue that we noticed is when broker goes OOM due to direct
> >> memory, it doesn't create heap dump (specified by "-XX:+
> >> HeapDumpOnOutOfMemoryError"), even when the OOM error is same as what is
> >> mentioned in the oracle JVM docs ("java.lang.OutOfMemoryError").
> >>
> >> Has anyone been able to find a way to get to heap dump for DM OOM?
> >>
> >> - Ramayan
> >>
> >> On Wed, Apr 19, 2017 at 11:21 AM, Ramayan Tiwari <
> [hidden email]
> >> > wrote:
> >>
> >>> Alex,
> >>>
> >>> Below are the flow to disk logs from broker having 3million+ messages
> at
> >>> this time. We only have one virtual host. Time is in GMT. Looks like
> flow
> >>> to disk is active on the whole virtual host and not a queue level.
> >>>
> >>> When the same broker went OOM yesterday, I did not see any flow to disk
> >>> logs from when it was started until it crashed (crashed twice within
> 4hrs).
> >>>
> >>>
> >>> 4/19/17 4:17:43.509 AM INFO  [Housekeeping[test]] -
> [Housekeeping[test]]
> >>> BRK-1014 : Message flow to disk active :  Message memory use 3356539KB
> >>> exceeds threshold 3355443KB
> >>> 4/19/17 2:31:13.502 AM INFO  [Housekeeping[test]] -
> [Housekeeping[test]]
> >>> BRK-1015 : Message flow to disk inactive : Message memory use 3354866KB
> >>> within threshold 3355443KB
> >>> 4/19/17 2:28:43.511 AM INFO  [Housekeeping[test]] -
> [Housekeeping[test]]
> >>> BRK-1014 : Message flow to disk active :  Message memory use 3358509KB
> >>> exceeds threshold 3355443KB
> >>> 4/19/17 2:20:13.500 AM INFO  [Housekeeping[test]] -
> [Housekeeping[test]]
> >>> BRK-1015 : Message flow to disk inactive : Message memory use 3353501KB
> >>> within threshold 3355443KB
> >>> 4/19/17 2:18:13.500 AM INFO  [Housekeeping[test]] -
> [Housekeeping[test]]
> >>> BRK-1014 : Message flow to disk active :  Message memory use 3357544KB
> >>> exceeds threshold 3355443KB
> >>> 4/19/17 2:08:43.501 AM INFO  [Housekeeping[test]] -
> [Housekeeping[test]]
> >>> BRK-1015 : Message flow to disk inactive : Message memory use 3353236KB
> >>> within threshold 3355443KB
> >>> 4/19/17 2:08:13.501 AM INFO  [Housekeeping[test]] -
> [Housekeeping[test]]
> >>> BRK-1014 : Message flow to disk active :  Message memory use 3356704KB
> >>> exceeds threshold 3355443KB
> >>> 4/19/17 2:00:43.500 AM INFO  [Housekeeping[test]] -
> [Housekeeping[test]]
> >>> BRK-1015 : Message flow to disk inactive : Message memory use 3353511KB
> >>> within threshold 3355443KB
> >>> 4/19/17 2:00:13.504 AM INFO  [Housekeeping[test]] -
> [Housekeeping[test]]
> >>> BRK-1014 : Message flow to disk active :  Message memory use 3357948KB
> >>> exceeds threshold 3355443KB
> >>> 4/19/17 1:50:43.501 AM INFO  [Housekeeping[test]] -
> [Housekeeping[test]]
> >>> BRK-1015 : Message flow to disk inactive : Message memory use 3355310KB
> >>> within threshold 3355443KB
> >>> 4/19/17 1:47:43.501 AM INFO  [Housekeeping[test]] -
> [Housekeeping[test]]
> >>> BRK-1014 : Message flow to disk active :  Message memory use 3365624KB
> >>> exceeds threshold 3355443KB
> >>> 4/19/17 1:43:43.501 AM INFO  [Housekeeping[test]] -
> [Housekeeping[test]]
> >>> BRK-1015 : Message flow to disk inactive : Message memory use 3355136KB
> >>> within threshold 3355443KB
> >>> 4/19/17 1:31:43.509 AM INFO  [Housekeeping[test]] -
> [Housekeeping[test]]
> >>> BRK-1014 : Message flow to disk active :  Message memory use 3358683KB
> >>> exceeds threshold 3355443KB
> >>>
> >>>
> >>> After production release (2 days back), we have seen 4 crashes in 3
> >>> different brokers; this is the most pressing concern for us in deciding
> >>> whether we should roll back to 0.32. Any help is greatly appreciated.
> >>>
> >>> Thanks
> >>> Ramayan
> >>>
> >>> On Wed, Apr 19, 2017 at 9:36 AM, Oleksandr Rudyy <[hidden email]>
> >>> wrote:
> >>>
> >>>> Ramayan,
> >>>> Thanks for the details. I would like to clarify whether flow to disk
> was
> >>>> triggered today for 3 million messages?
> >>>>
> >>>> The following logs are issued for flow to disk:
> >>>> BRK-1014 : Message flow to disk active :  Message memory use
> >>>> {0,number,#}KB
> >>>> exceeds threshold {1,number,#.##}KB
> >>>> BRK-1015 : Message flow to disk inactive : Message memory use
> >>>> {0,number,#}KB within threshold {1,number,#.##}KB
> >>>>
> >>>> Kind Regards,
> >>>> Alex
> >>>>
> >>>>
> >>>> On 19 April 2017 at 17:10, Ramayan Tiwari <[hidden email]>
> >>>> wrote:
> >>>>
> >>>> > Hi Alex,
> >>>> >
> >>>> > Thanks for your response, here are the details:
> >>>> >
> >>>> > We use a "direct" exchange, without persistence (we specify
> >>>> > NON_PERSISTENT while sending from the client), and use the BDB store.
> >>>> > We use the JSON virtual host type. We are not using SSL.
> >>>> >
> >>>> > When the broker went OOM, we had around 1.3 million messages with
> 100
> >>>> bytes
> >>>> > average message size. Direct memory allocation (value read from
> MBean)
> >>>> kept
> >>>> > going up, even though it wouldn't need more DM to store these many
> >>>> > messages. DM allocated persisted at 99% for about 3 and half hours
> >>>> before
> >>>> > crashing.
> >>>> >
> >>>> > Today, on the same broker we have 3 million messages (same message
> >>>> size)
> >>>> > and DM allocated is only at 8%. This seems like there is some issue
> >>>> with
> >>>> > de-allocation or a leak.
> >>>> >
> >>>> > I have uploaded the memory utilization graph here:
> >>>> > https://drive.google.com/file/d/0Bwi0MEV3srPRVHFEbDlIYUpLaUE/
> >>>> > view?usp=sharing
> >>>> > Blue line is DM allocated, Yellow is DM Used (sum of queue payload)
> >>>> and Red
> >>>> > is heap usage.
> >>>> >
> >>>> > Thanks
> >>>> > Ramayan
> >>>> >
> >>>> > On Wed, Apr 19, 2017 at 4:10 AM, Oleksandr Rudyy <[hidden email]>
> >>>> wrote:
> >>>> >
> >>>> > > Hi Ramayan,
> >>>> > >
> >>>> > > Could please share with us the details of messaging use case(s)
> which
> >>>> > ended
> >>>> > > up in OOM on broker side?
> >>>> > > I would like to reproduce the issue on my local broker in order to
> >>>> fix
> >>>> > it.
> >>>> > > I would appreciate if you could provide as much details as
> possible,
> >>>> > > including, messaging topology, message persistence type, message
> >>>> > > sizes,volumes, etc.
> >>>> > >
> >>>> > > Qpid Broker 6.0.x uses direct memory for keeping message content
> and
> >>>> > > receiving/sending data. Each plain connection utilizes 512K of
> direct
> >>>> > > memory. Each SSL connection uses 1M of direct memory. Your memory
> >>>> > settings
> >>>> > > look Ok to me.
> >>>> > >
> >>>> > > Kind Regards,
> >>>> > > Alex
> >>>> > >
> >>>> > >
> >>>> > > On 18 April 2017 at 23:39, Ramayan Tiwari <
> [hidden email]>
> >>>> > > wrote:
> >>>> > >
> >>>> > > > Hi All,
> >>>> > > >
> >>>> > > > We are using Java broker 6.0.5, with patch to use
> >>>> MultiQueueConsumer
> >>>> > > > feature. We just finished deploying to production and saw
> couple of
> >>>> > > > instances of broker OOM due to running out of DirectMemory
> buffer
> >>>> > > > (exceptions at the end of this email).
> >>>> > > >
> >>>> > > > Here is our setup:
> >>>> > > > 1. Max heap 12g, max direct memory 4g (this is the opposite of
> >>>> > > > the recommendation; however, for our use case the message payload
> >>>> > > > is really small, ~400 bytes, way less than the per-message
> >>>> > > > overhead of 1KB). In perf testing, we were able to put 2 million
> >>>> > > > messages without any issues.
> >>>> > > > 2. ~400 connections to broker.
> >>>> > > > 3. Each connection has 20 sessions and there is one multi queue
> >>>> > consumer
> >>>> > > > attached to each session, listening to around 1000 queues.
> >>>> > > > 4. We are still using 0.16 client (I know).
> >>>> > > >
> >>>> > > > With the above setup, the baseline utilization (without any
> >>>> messages)
> >>>> > for
> >>>> > > > direct memory was around 230mb (with 410 connection each taking
> >>>> 500KB).
> >>>> > > >
> >>>> > > > Based on our understanding of broker memory allocation, message
> >>>> payload
> >>>> > > > should be the only thing adding to direct memory utilization (on
> >>>> top of
> >>>> > > > baseline), however, we are experiencing something completely
> >>>> different.
> >>>> > > In
> >>>> > > > our last broker crash, we see that broker is constantly running
> >>>> with
> >>>> > 90%+
> >>>> > > > direct memory allocated, even when message payload sum from all
> the
> >>>> > > queues
> >>>> > > > is only 6-8% (these % are against available DM of 4gb). During
> >>>> these
> >>>> > high
> >>>> > > > DM usage period, heap usage was around 60% (of 12gb).
> >>>> > > >
> >>>> > > > We would like some help in understanding what could be the
> reason
> >>>> of
> >>>> > > these
> >>>> > > > high DM allocations. Are there things other than message payload
> >>>> and
> >>>> > AMQP
> >>>> > > > connection, which use DM and could be contributing to these high
> >>>> usage?
> >>>> > > >
> >>>> > > > Another thing where we are puzzled is the de-allocation of DM
> byte
> >>>> > > buffers.
> >>>> > > > From log mining of heap and DM utilization, de-allocation of DM
> >>>> doesn't
> >>>> > > > correlate with heap GC. If anyone has seen any documentation
> >>>> related to
> >>>> > > > this, it would be very helpful if you could share that.
> >>>> > > >
> >>>> > > > Thanks
> >>>> > > > Ramayan
> >>>> > > >
> >>>> > > >
> >>>> > > > *Exceptions*
> >>>> > > >
> >>>> > > > java.lang.OutOfMemoryError: Direct buffer memory
> >>>> > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
> >>>> > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
> >>>> > > > ~[na:1.8.0_40]
> >>>> > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> >>>> > > ~[na:1.8.0_40]
> >>>> > > > at
> >>>> > > > org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(
> >>>> > > > QpidByteBuffer.java:474)
> >>>> > > > ~[qpid-common-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > org.apache.qpid.server.transport.NonBlockingConnectionPlainD
> >>>> elegate.
> >>>> > > > restoreApplicationBufferForWrite(NonBlockingConnectionPlainDele
> >>>> > > > gate.java:93)
> >>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > org.apache.qpid.server.transport.NonBlockingConnectionPlainDele
> >>>> > > > gate.processData(NonBlockingConnectionPlainDelegate.java:60)
> >>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > org.apache.qpid.server.transport.NonBlockingConnection.doRead(
> >>>> > > > NonBlockingConnection.java:506)
> >>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > org.apache.qpid.server.transport.NonBlockingConnection.doWork(
> >>>> > > > NonBlockingConnection.java:285)
> >>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > org.apache.qpid.server.transport.NetworkConnectionScheduler.
> >>>> > > > processConnection(NetworkConnectionScheduler.java:124)
> >>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > org.apache.qpid.server.transport.SelectorThread$ConnectionPr
> >>>> ocessor.
> >>>> > > > processConnection(SelectorThread.java:504)
> >>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > org.apache.qpid.server.transport.SelectorThread$
> >>>> > > > SelectionTask.performSelect(SelectorThread.java:337)
> >>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > org.apache.qpid.server.transport.SelectorThread$SelectionTas
> k.run(
> >>>> > > > SelectorThread.java:87)
> >>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > org.apache.qpid.server.transport.SelectorThread.run(
> >>>> > > > SelectorThread.java:462)
> >>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > java.util.concurrent.ThreadPoolExecutor.runWorker(
> >>>> > > > ThreadPoolExecutor.java:1142)
> >>>> > > > ~[na:1.8.0_40]
> >>>> > > > at
> >>>> > > > java.util.concurrent.ThreadPoolExecutor$Worker.run(
> >>>> > > > ThreadPoolExecutor.java:617)
> >>>> > > > ~[na:1.8.0_40]
> >>>> > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
> >>>> > > >
> >>>> > > >
> >>>> > > >
> >>>> > > > *Second exception*
> >>>> > > > java.lang.OutOfMemoryError: Direct buffer memory
> >>>> > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
> >>>> > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
> >>>> > > > ~[na:1.8.0_40]
> >>>> > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> >>>> > > ~[na:1.8.0_40]
> >>>> > > > at
> >>>> > > > org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(
> >>>> > > > QpidByteBuffer.java:474)
> >>>> > > > ~[qpid-common-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > org.apache.qpid.server.transport.NonBlockingConnectionPlainDele
> >>>> > > > gate.<init>(NonBlockingConnectionPlainDelegate.java:45)
> >>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > org.apache.qpid.server.transport.NonBlockingConnection.
> >>>> > > > setTransportEncryption(NonBlockingConnection.java:625)
> >>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > org.apache.qpid.server.transport.NonBlockingConnection.<init>(
> >>>> > > > NonBlockingConnection.java:117)
> >>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > org.apache.qpid.server.transport.NonBlockingNetworkTransport.
> >>>> > > > acceptSocketChannel(NonBlockingNetworkTransport.java:158)
> >>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > org.apache.qpid.server.transport.SelectorThread$SelectionTas
> >>>> k$1.run(
> >>>> > > > SelectorThread.java:191)
> >>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > org.apache.qpid.server.transport.SelectorThread.run(
> >>>> > > > SelectorThread.java:462)
> >>>> > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at
> >>>> > > > java.util.concurrent.ThreadPoolExecutor.runWorker(
> >>>> > > > ThreadPoolExecutor.java:1142)
> >>>> > > > ~[na:1.8.0_40]
> >>>> > > > at
> >>>> > > > java.util.concurrent.ThreadPoolExecutor$Worker.run(
> >>>> > > > ThreadPoolExecutor.java:617)
> >>>> > > > ~[na:1.8.0_40]
> >>>> > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
> >>>> > > >
> >>>> > >
> >>>> >
> >>>>
> >>>
> >>>
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

Re: Java broker OOM due to DirectMemory

Ramayan Tiwari
Hi All,

We have been monitoring the brokers every day, and today we found one
instance where a broker's DM was climbing steadily and the broker was about
to crash, so we tried some mitigations, one of which caused the DM to come
down. Following are the details, which might help in understanding the
issue:

*Traffic scenario:*

DM allocation had been climbing steadily and was at 90%. There were two
queues that seemed to align with our theories. Q1 had been large right
after broker start and had slow consumption of messages: its size only
reduced from 76MB to 75MB over a period of 6hrs. Q2, on the other hand,
started small and grew gradually, from 7MB to 10MB in 6hrs. There were
other queues with traffic during this time.

*Action taken:*

   1. Moved all the messages from Q2 (since this was our original theory)
   to Q3 (already created, with no messages in it). This did not stop the
   DM growth.
   2. Moved all the messages from Q1 to Q4 (already created, with no
   messages in it). This reduced DM allocation from 93% to 31%.

We have the heap dump and thread dump from when the broker was at 90% DM
allocation. We are going to analyze them to see if we can find some clues.
We wanted to share this new information, which might help in reasoning
about the memory issue.

- Ramayan


On Thu, Apr 20, 2017 at 11:20 AM, Ramayan Tiwari <[hidden email]>
wrote:

> Hi Keith,
>
> Thanks so much for your response and digging into the issue. Below are the
> answer to your questions:
>
> 1) Yes, we are using QPID-7462 with 6.0.5. We couldn't move to 6.1, where
> it was released, because we need JMX support. Here is the destination
> format:
> "%s ; {node : { type : queue }, link : { x-subscribes : { arguments : {
> x-multiqueue : [%s], x-pull-only : true }}}}"
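For reference, a minimal sketch of how such an address string might be assembled on the client side. The template is taken verbatim from the format above; the consumer name and queue list are hypothetical placeholders:

```java
public class MultiQueueAddress {
    // Builds an address string in the format quoted above. The template
    // comes from this thread; the names passed in are made up.
    static String address(String name, String queueList) {
        return String.format(
            "%s ; {node : { type : queue }, link : { x-subscribes : "
                + "{ arguments : { x-multiqueue : [%s], x-pull-only : true }}}}",
            name, queueList);
    }

    public static void main(String[] args) {
        System.out.println(address("consumer-1", "q1, q2, q3"));
    }
}
```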
>
> 2) Our machines have 40 cores, which makes the number of threads 80. This
> might not be an issue, because it would show up in the baseline DM
> allocation, which is only 6% (of 4GB) when we first bring up the broker.
>
> 3) The only setting we tuned with respect to DM is flowToDiskThreshold,
> which is set at 80% now.
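As a cross-check, 80% of the 4GB direct memory limit works out to exactly the 3355443KB threshold figure appearing in the BRK-1014/BRK-1015 log lines quoted in this thread. A quick sketch of the arithmetic, assuming the threshold is computed as a simple fraction of max direct memory:

```java
public class FlowToDiskThreshold {
    public static void main(String[] args) {
        long maxDirectKb = 4L * 1024 * 1024;             // 4 GB expressed in KB
        long thresholdKb = (long) (maxDirectKb * 0.80);  // 80% flow-to-disk threshold
        // Matches the "threshold 3355443KB" value in the broker logs.
        System.out.println(thresholdKb + "KB");          // 3355443KB
    }
}
```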
>
> 4) Only one virtual host in the broker.
>
> 5) Most of our queues (99%) are priority queues; we also have 8-10 sorted
> queues.
>
> 6) Yes, we are using the standard 0.16 client and not the AMQP 1.0
> clients. The connection log line looks like:
> CON-1001 : Open : Destination : AMQP(IP:5672) : Protocol Version : 0-10 :
> Client ID : test : Client Version : 0.16 : Client Product : qpid
>
> We had another broker crash about an hour back, and we see the same
> patterns:
> 1) There is a queue that is constantly growing; enqueue has been faster
> than dequeue on that queue for a long period of time.
> 2) Flow to disk didn't kick in at all.
>
> This graph shows memory growth (red line - heap, blue - DM allocated,
> yellow - DM used)
> https://drive.google.com/file/d/0Bwi0MEV3srPRdVhXdTBncHJLY2c/
> view?usp=sharing
>
> The below graph shows growth on a single queue (there are 10-12 other
> queues with traffic as well, some larger than this queue):
> https://drive.google.com/file/d/0Bwi0MEV3srPRWmNGbDNGUkJhQ0U/
> view?usp=sharing
>
> A couple of questions:
> 1) Is there any developer-level doc/design spec on how Qpid uses DM?
> 2) We are not getting heap dumps automatically when the broker crashes due
> to DM (HeapDumpOnOutOfMemoryError is not respected). Has anyone found a way
> to get around this problem?
>
> Thanks
> Ramayan
>
> On Thu, Apr 20, 2017 at 9:08 AM, Keith W <[hidden email]> wrote:
>
>> Hi Ramayan
>>
>> We have been discussing your problem here and have a couple of questions.
>>
>> I have been experimenting with use-cases based on your descriptions
>> above but, so far, have been unsuccessful in reproducing a
>> "java.lang.OutOfMemoryError: Direct buffer memory" condition. The
>> direct memory usage reflects the expected model: it levels off when
>> the flow to disk threshold is reached, and direct memory is released as
>> messages are consumed, until the minimum size of the direct memory
>> cache is reached.
>>
>> 1] For clarity let me check: we believe when you say "patch to use
>> MultiQueueConsumer" you are referring to the patch attached to
>> QPID-7462 "Add experimental "pull" consumers to the broker"  and you
>> are using a combination of this "x-pull-only"  with the standard
>> "x-multiqueue" feature.  Is this correct?
>>
>> 2] One idea we had here relates to the size of the virtualhost IO
>> pool.  As you know from the documentation, the Broker caches/reuses
>> direct memory internally, but the documentation fails to mention that
>> each pooled virtualhost IO thread also grabs a chunk (256K) of direct
>> memory from this cache.  By default the virtual host IO pool is sized
>> Math.max(Runtime.getRuntime().availableProcessors() * 2, 64), so if
>> you have a machine with a very large number of cores, you may have a
>> surprisingly large amount of direct memory assigned to virtualhost IO
>> threads.  Check the value of connectionThreadPoolSize on the
>> virtualhost
>> (http://<server>:<port>/api/latest/virtualhost/<virtualhostnodename>/<virtualhostname>)
>> to see what value is in force.  What is it?  It is possible to tune
>> the pool size using the context variable
>> virtualhost.connectionThreadPool.size.
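The sizing rule described above can be sanity-checked with a quick calculation. This is only a sketch: the pool formula and the 256K-per-thread figure come from the paragraph above, and the 40-core count from earlier in the thread:

```java
public class IoPoolDirectMemoryEstimate {
    // Default pool sizing rule quoted above: max(cores * 2, 64).
    static int poolSize(int cores) {
        return Math.max(cores * 2, 64);
    }

    public static void main(String[] args) {
        int cores = 40;                    // machine described in this thread
        int threads = poolSize(cores);     // 80 IO threads
        long chunkKb = 256;                // direct memory grabbed per IO thread
        long totalKb = threads * chunkKb;  // 20480 KB, i.e. ~20 MB
        System.out.println(threads + " IO threads -> " + totalKb + " KB direct memory");
    }
}
```

On a 40-core box this gives roughly 20 MB, which is small against a 4 GB direct memory limit and consistent with the low baseline usage reported earlier in the thread.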
>>
>> 3] Tell me if you are tuning the Broker in any way beyond the direct/heap
>> memory settings you have told us about already.  For instance, are you
>> changing any of the direct memory pooling settings
>> (broker.directByteBufferPoolSize), the default network buffer size
>> (qpid.broker.networkBufferSize), or applying any other non-standard
>> settings?
>>
>> 4] How many virtual hosts do you have on the Broker?
>>
>> 5] What is the consumption pattern of the messages?  Do you consume in a
>> strictly FIFO fashion, or are you making use of message selectors
>> and/or any of the out-of-order queue types (LVQs, priority queues or
>> sorted queues)?
>>
>> 6] Is it just the 0.16 client involved in the application?  Can I
>> check that you are not using any of the AMQP 1.0 clients
>> (org.apache.qpid:qpid-jms-client or
>> org.apache.qpid:qpid-amqp-1-0-client) in the software stack (as either
>> consumers or producers)?
>>
>> Hopefully the answers to these questions will get us closer to a
>> reproduction.  If you are able to reliably reproduce it, please share
>> the steps with us.
>>
>> Kind regards, Keith.
>>
>>
>> On 20 April 2017 at 10:21, Ramayan Tiwari <[hidden email]>
>> wrote:
>> > After a lot of log mining, we might have a way to explain the sustained
>> > increase in DirectMemory allocation: the correlation seems to be with
>> > the growth in the size of a queue that is being consumed, but at a much
>> > slower rate than producers are putting messages onto it.
>> >
>> > The pattern we see is that in each broker crash there is at least one
>> > queue (usually exactly one) whose size kept growing steadily. It would
>> > be of significant size but not the largest queue -- usually there are
>> > multiple larger queues -- yet it differed from the others in that its
>> > size grew steadily. The queue would also be moving, but its processing
>> > rate was not keeping up with the enqueue rate.
>> >
>> > Our theory, which might be totally wrong: if a queue is moving the
>> > entire time, maybe the broker keeps reusing the same buffer in direct
>> > memory for the queue, appending to it at the end to accommodate new
>> > messages. But because the queue is active all the time and we keep
>> > pointing at the same buffer, the space allocated for messages at the
>> > head of the queue/buffer doesn't get reclaimed, even long after those
>> > messages have been processed. Just a theory.
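The theory above can be illustrated with plain NIO buffers. The following sketch is not broker code; it simply shows why holding a small slice of a large direct buffer can pin the whole chunk: a slice shares the parent's backing storage, so the parent cannot be freed while any slice of it is still referenced:

```java
import java.nio.ByteBuffer;

public class SliceRetentionSketch {
    public static void main(String[] args) {
        // A large chunk of direct memory, like a pooled broker buffer.
        ByteBuffer chunk = ByteBuffer.allocateDirect(256 * 1024);

        // Carve out a small slice, as might hold a ~400 byte message payload.
        chunk.position(0).limit(400);
        ByteBuffer payload = chunk.slice();

        // The slice reports only its own capacity...
        System.out.println(payload.capacity()); // 400

        // ...but it shares storage with the parent: a write through the
        // parent is visible through the slice. While 'payload' stays
        // reachable, the entire 256K chunk remains allocated.
        chunk.put(0, (byte) 7);
        System.out.println(payload.get(0)); // 7
    }
}
```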
>> >
>> > We are also trying to reproduce this using some perf tests that enqueue
>> > with the same pattern; we will update with the findings.
>> >
>> > Thanks
>> > Ramayan
>> >
>> > On Wed, Apr 19, 2017 at 6:52 PM, Ramayan Tiwari <
>> [hidden email]>
>> > wrote:
>> >
>> >> Another issue we noticed: when the broker goes OOM due to direct
>> >> memory, it doesn't create a heap dump (as requested by
>> >> "-XX:+HeapDumpOnOutOfMemoryError"), even though the error is the same
>> >> "java.lang.OutOfMemoryError" mentioned in the Oracle JVM docs.
>> >>
>> >> - Ramayan
>> >>
>> >> On Wed, Apr 19, 2017 at 11:21 AM, Ramayan Tiwari <
>> [hidden email]
>> >> > wrote:
>> >>
>> >>> Alex,
>> >>>
>> >>> Below are the flow to disk logs from the broker holding 3 million+
>> >>> messages at this time. We only have one virtual host. Time is in GMT.
>> >>> It looks like flow to disk is active on the whole virtual host, not at
>> >>> a queue level.
>> >>>
>> >>> When the same broker went OOM yesterday, I did not see any flow to
>> >>> disk logs from when it was started until it crashed (it crashed twice
>> >>> within 4hrs).
>> >>>
>> >>>
>> >>> 4/19/17 4:17:43.509 AM INFO  [Housekeeping[test]] -
>> [Housekeeping[test]]
>> >>> BRK-1014 : Message flow to disk active :  Message memory use 3356539KB
>> >>> exceeds threshold 3355443KB
>> >>> 4/19/17 2:31:13.502 AM INFO  [Housekeeping[test]] -
>> [Housekeeping[test]]
>> >>> BRK-1015 : Message flow to disk inactive : Message memory use
>> 3354866KB
>> >>> within threshold 3355443KB
>> >>> 4/19/17 2:28:43.511 AM INFO  [Housekeeping[test]] -
>> [Housekeeping[test]]
>> >>> BRK-1014 : Message flow to disk active :  Message memory use 3358509KB
>> >>> exceeds threshold 3355443KB
>> >>> 4/19/17 2:20:13.500 AM INFO  [Housekeeping[test]] -
>> [Housekeeping[test]]
>> >>> BRK-1015 : Message flow to disk inactive : Message memory use
>> 3353501KB
>> >>> within threshold 3355443KB
>> >>> 4/19/17 2:18:13.500 AM INFO  [Housekeeping[test]] -
>> [Housekeeping[test]]
>> >>> BRK-1014 : Message flow to disk active :  Message memory use 3357544KB
>> >>> exceeds threshold 3355443KB
>> >>> 4/19/17 2:08:43.501 AM INFO  [Housekeeping[test]] -
>> [Housekeeping[test]]
>> >>> BRK-1015 : Message flow to disk inactive : Message memory use
>> 3353236KB
>> >>> within threshold 3355443KB
>> >>> 4/19/17 2:08:13.501 AM INFO  [Housekeeping[test]] -
>> [Housekeeping[test]]
>> >>> BRK-1014 : Message flow to disk active :  Message memory use 3356704KB
>> >>> exceeds threshold 3355443KB
>> >>> 4/19/17 2:00:43.500 AM INFO  [Housekeeping[test]] -
>> [Housekeeping[test]]
>> >>> BRK-1015 : Message flow to disk inactive : Message memory use
>> 3353511KB
>> >>> within threshold 3355443KB
>> >>> 4/19/17 2:00:13.504 AM INFO  [Housekeeping[test]] -
>> [Housekeeping[test]]
>> >>> BRK-1014 : Message flow to disk active :  Message memory use 3357948KB
>> >>> exceeds threshold 3355443KB
>> >>> 4/19/17 1:50:43.501 AM INFO  [Housekeeping[test]] -
>> [Housekeeping[test]]
>> >>> BRK-1015 : Message flow to disk inactive : Message memory use
>> 3355310KB
>> >>> within threshold 3355443KB
>> >>> 4/19/17 1:47:43.501 AM INFO  [Housekeeping[test]] -
>> [Housekeeping[test]]
>> >>> BRK-1014 : Message flow to disk active :  Message memory use 3365624KB
>> >>> exceeds threshold 3355443KB
>> >>> 4/19/17 1:43:43.501 AM INFO  [Housekeeping[test]] -
>> [Housekeeping[test]]
>> >>> BRK-1015 : Message flow to disk inactive : Message memory use
>> 3355136KB
>> >>> within threshold 3355443KB
>> >>> 4/19/17 1:31:43.509 AM INFO  [Housekeeping[test]] -
>> [Housekeeping[test]]
>> >>> BRK-1014 : Message flow to disk active :  Message memory use 3358683KB
>> >>> exceeds threshold 3355443KB
>> >>>
>> >>>
>> >>> After production release (2 days back), we have seen 4 crashes in 3
>> >>> different brokers; this is the most pressing concern for us in deciding
>> >>> whether we should roll back to 0.32. Any help is greatly appreciated.
>> >>>
>> >>> Thanks
>> >>> Ramayan
>> >>>

Re: Java broker OOM due to DirectMemory

Keith Wall
Hello Ramayan

I believe I understand the root cause of the problem.  We have
identified a flaw in the direct memory buffer management employed by
Qpid Broker-J which, for some messaging use-cases, can lead to the
direct memory OOM you describe.  For the issue to manifest, the
producing application needs to use a single connection for the
production of messages, some of which are short-lived (i.e. are
consumed quickly) whilst others remain on the queue for some time.
Priority queues, sorted queues, and consumers utilising selectors that
leave some messages on the queue can all produce this pattern.  The
pattern leads to sparsely occupied 256KB network buffers which cannot
be released or reused until every message that references a 'chunk' of
them is either consumed or flown to disk.  The problem was introduced
in Qpid v6.0 and exists in v6.1 and trunk too.
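To make the failure mode concrete, here is a toy sketch (hypothetical class and method names, not the broker's actual buffer code) of how a single long-lived message can pin an entire 256KB network buffer:

```java
// Toy model of the chunked buffer scheme described above. Message content
// takes "chunks" out of a shared 256KB network buffer; the buffer can only
// be released or reused once every chunk taken from it has been released.
class NetBuffer {
    static final int SIZE_KB = 256;
    private int liveChunks = 0;

    void takeChunk()    { liveChunks++; }
    void releaseChunk() { liveChunks--; }
    boolean reusable()  { return liveChunks == 0; }
}

public class SparseBufferDemo {
    public static void main(String[] args) {
        NetBuffer buffer = new NetBuffer();
        // 256 messages arrive on one connection and share the buffer.
        for (int i = 0; i < 256; i++) {
            buffer.takeChunk();
        }
        // 255 of them are consumed quickly...
        for (int i = 0; i < 255; i++) {
            buffer.releaseChunk();
        }
        // ...but one long-lived message still references its chunk, so the
        // whole 256KB stays allocated even though most of it is dead space.
        System.out.println("reusable=" + buffer.reusable());
    }
}
```

Scale this up to thousands of buffers, each pinned by a handful of slow-moving messages, and allocated direct memory climbs far above the live payload size — consistent with the 90%+ allocated versus 6-8% used figures reported earlier in the thread.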

The flow to disk feature does not help here because its algorithm
considers only the size of live messages on the queues: if the
cumulative live size does not exceed the threshold, the messages are
not flown to disk.  I speculate that when you observed moving messages
cause direct memory usage to drop earlier today, the message movement
pushed a queue over its threshold, causing messages to be flown to
disk and their direct memory references to be released.  The logs will
confirm whether this is so.
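The threshold decision described above can be sketched as follows (illustrative only; the names and figures are assumptions, not the broker's code — the point is that dead-but-pinned buffer space never enters the comparison):

```java
// Sketch of a flow-to-disk style check: only the sum of live message
// payload is compared against the threshold. Direct memory pinned by
// sparsely occupied network buffers is invisible to this check, so flow
// to disk can stay inactive while real allocation approaches the limit.
public class FlowToDiskCheck {
    static boolean shouldFlowToDisk(long liveMessageBytes, long thresholdBytes) {
        return liveMessageBytes > thresholdBytes;
    }

    public static void main(String[] args) {
        long threshold = 3355443L * 1024;        // ~80% of 4GB, as in the BRK-1014 logs
        long livePayload = 300L * 1024 * 1024;   // 300MB of live message content
        long pinnedDirect = 3700L * 1024 * 1024; // ~3.7GB actually held by sparse buffers
        // Only livePayload is consulted; pinnedDirect plays no part.
        System.out.println("flow to disk: " + shouldFlowToDisk(livePayload, threshold));
    }
}
```

Here the check stays inactive even though nearly all of the 4GB of direct memory is exhausted, matching the crashes where no BRK-1014 message was ever logged.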

I have not identified an easy workaround yet.   Decreasing the flow to
disk threshold and/or increasing the available direct memory should
alleviate the problem and may be an acceptable short-term workaround.
If it were possible for the publishing application to publish
short-lived and long-lived messages on two separate JMS connections,
that would avoid this defect.

QPID-7753 tracks this issue and QPID-7754 tracks a related problem.
We intend to start work on these early next week and will aim for a
fix that is back-portable to 6.0.

Apologies that you have run into this defect, and thanks for reporting it.

Thanks, Keith







On 21 April 2017 at 10:21, Ramayan Tiwari <[hidden email]> wrote:

> Hi All,
>
> We have been monitoring the brokers everyday and today we found one instance
> where broker’s DM was constantly going up and was about to crash, so we
> experimented some mitigations, one of which caused the DM to come down.
> Following are the details, which might help us understanding the issue:
>
> Traffic scenario:
>
> DM allocation had been constantly going up and was at 90%. There were two
> queues which seemed to align with the theories that we had. Q1’s size had
> been large right after the broker start and had slow consumption of
> messages, queue size only reduced from 76MB to 75MB over a period of 6hrs.
> Q2 on the other hand, started small and was gradually growing, queue size
> went from 7MB to 10MB in 6hrs. There were other queues with traffic during
> this time.
>
> Action taken:
>
> Moved all the messages from Q2 (since this was our original theory) to Q3
> (already created but no messages in it). This did not help with the DM
> growing up.
> Moved all the messages from Q1 to Q4 (already created but no messages in
> it). This reduced DM allocation from 93% to 31%.
>
> We have the heap dump and thread dump from when broker was 90% in DM
> allocation. We are going to analyze that to see if we can get some clue. We
> wanted to share this new information which might help in reasoning about the
> memory issue.
>
> - Ramayan
>
>
> On Thu, Apr 20, 2017 at 11:20 AM, Ramayan Tiwari <[hidden email]>
> wrote:
>>
>> Hi Keith,
>>
>> Thanks so much for your response and digging into the issue. Below are the
>> answer to your questions:
>>
>> 1) Yeah, we are using QPID-7462 with 6.0.5. We couldn't use 6.1 when it
>> was released because we need JMX support. Here is the destination format:
>> "%s ; {node : { type : queue }, link : { x-subscribes : { arguments : {
>> x-multiqueue : [%s], x-pull-only : true }}}}"
>>
>> 2) Our machines have 40 cores, which makes the number of threads
>> 80. This might not be an issue, because this will show up in the baseline DM
>> allocated, which is only 6% (of 4GB) when we just bring up the broker.
>>
>> 3) The only setting that we tuned WRT DM is flowToDiskThreshold, which
>> is set at 80% now.
>>
>> 4) Only one virtual host in the broker.
>>
>> 5) Most of our queues (99%) are priority, we also have 8-10 sorted queues.
>>
>> 6) Yeah we are using the standard 0.16 client and not AMQP 1.0 clients.
>> The connection log line looks like:
>> CON-1001 : Open : Destination : AMQP(IP:5672) : Protocol Version : 0-10 :
>> Client ID : test : Client Version : 0.16 : Client Product : qpid
>>
>> We had another broker crash about an hour back; we do see the same
>> patterns:
>> 1) There is a queue which is constantly growing, enqueue is faster than
>> dequeue on that queue for a long period of time.
>> 2) Flow to disk didn't kick in at all.
>>
>> This graph shows memory growth (red line - heap, blue - DM allocated,
>> yellow - DM used)
>>
>> https://drive.google.com/file/d/0Bwi0MEV3srPRdVhXdTBncHJLY2c/view?usp=sharing
>>
>> The below graph shows growth on a single queue (there are 10-12 other
>> queues with traffic as well, some larger in size than this queue):
>>
>> https://drive.google.com/file/d/0Bwi0MEV3srPRWmNGbDNGUkJhQ0U/view?usp=sharing
>>
>> Couple of questions:
>> 1) Is there any developer level doc/design spec on how Qpid uses DM?
>> 2) We are not getting heap dumps automatically when broker crashes due to
>> DM (HeapDumpOnOutOfMemoryError not respected). Has anyone found a way to get
>> around this problem?
>>
>> Thanks
>> Ramayan
>>
>> On Thu, Apr 20, 2017 at 9:08 AM, Keith W <[hidden email]> wrote:
>>>
>>> Hi Ramayan
>>>
>>> We have been discussing your problem here and have a couple of questions.
>>>
>>> I have been experimenting with use-cases based on your descriptions
>>> above, but so far, have been unsuccessful in reproducing a
>>> "java.lang.OutOfMemoryError: Direct buffer memory"  condition. The
>>> direct memory usage reflects the expected model: it levels off when
>>> the flow to disk threshold is reached, and direct memory is released as
>>> messages are consumed until the minimum size for caching of direct
>>> memory is reached.
>>>
>>> 1] For clarity let me check: we believe when you say "patch to use
>>> MultiQueueConsumer" you are referring to the patch attached to
>>> QPID-7462 "Add experimental "pull" consumers to the broker"  and you
>>> are using a combination of this "x-pull-only"  with the standard
>>> "x-multiqueue" feature.  Is this correct?
>>>
>>> 2] One idea we had here relates to the size of the virtualhost IO
>>> pool.   As you know from the documentation, the Broker caches/reuses
>>> direct memory internally but the documentation fails to mentions that
>>> each pooled virtualhost IO thread also grabs a chunk (256K) of direct
>>> memory from this cache.  By default the virtual host IO pool is sized
>>> Math.max(Runtime.getRuntime().availableProcessors() * 2, 64), so if
>>> you have a machine with a very large number of cores, you may have a
>>> surprising large amount of direct memory assigned to virtualhost IO
>>> threads.   Check the value of connectionThreadPoolSize on the
>>> virtualhost
>>> (http://<server>:<port>/api/latest/virtualhost/<virtualhostnodename>/<virtualhostname>)
>>> to see what value is in force.  What is it?  It is possible to tune
>>> the pool size using context variable
>>> virtualhost.connectionThreadPool.size.
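As a rough check of the pool-sizing arithmetic above (a sketch using the formula and the 256KB-per-thread figure quoted in this message; not broker code):

```java
// Estimate the direct memory claimed by the virtualhost IO thread pool,
// using the default sizing formula quoted above and assuming 256KB of
// direct memory per pooled IO thread.
public class IoPoolFootprint {
    static int defaultPoolSize(int cores) {
        return Math.max(cores * 2, 64);
    }

    public static void main(String[] args) {
        int cores = 40;                       // the reporter's machines
        int threads = defaultPoolSize(cores); // max(80, 64) = 80
        long directKB = threads * 256L;       // 80 * 256KB
        System.out.println(threads + " IO threads -> " + (directKB / 1024) + "MB direct memory");
    }
}
```

For the 40-core machines in this thread that is roughly 20MB, which is consistent with the small baseline usage reported and suggests the IO pool is not the main consumer here.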
>>>
>>> 3] Tell me if you are tuning the Broker in way beyond the direct/heap
>>> memory settings you have told us about already.  For instance you are
>>> changing any of the direct memory pooling settings
>>> broker.directByteBufferPoolSize, default network buffer size
>>> qpid.broker.networkBufferSize or applying any other non-standard
>>> settings?
>>>
>>> 4] How many virtual hosts do you have on the Broker?
>>>
>>> 5] What is the consumption pattern of the messages?  Do consume in a
>>> strictly FIFO fashion or are you making use of message selectors
>>> or/and any of the out-of-order queue types (LVQs, priority queue or
>>> sorted queues)?
>>>
>>> 6] Is it just the 0.16 client involved in the application?   Can I
>>> check that you are not using any of the AMQP 1.0 clients
>>> (org.apache.qpid:qpid-jms-client or
>>> org.apache.qpid:qpid-amqp-1-0-client) in the software stack (as either
>>> consumers or producers)
>>>
>>> Hopefully the answers to these questions will get us closer to a
>>> reproduction.   If you are able to reliably reproduce it, please share
>>> the steps with us.
>>>
>>> Kind regards, Keith.
>>>
>>>
>>> On 20 April 2017 at 10:21, Ramayan Tiwari <[hidden email]>
>>> wrote:
>>> > After a lot of log mining, we might have a way to explain the sustained
>>> > increase in DirectMemory allocation: the correlation seems to be with the
>>> > growth in size of a queue that is getting consumed, but at a much slower
>>> > rate than producers are putting messages on this queue.
>>> >
>>> > The pattern we see is that in each instance of broker crash, there is
>>> > at
>>> > least one queue (usually 1 queue) whose size kept growing steadily.
>>> > It’d be
>>> > of significant size but not the largest queue -- usually there are
>>> > multiple
>>> > larger queues -- but it was different from other queues in that its
>>> > size
>>> > was growing steadily. The queue would also be moving, but its
>>> > processing
>>> > rate was not keeping up with the enqueue rate.
>>> >
>>> > Our theory that might be totally wrong: If a queue is moving the entire
>>> > time, maybe then the broker would keep reusing the same buffer in
>>> > direct
>>> > memory for the queue, and keep on adding onto it at the end to
>>> > accommodate
>>> > new messages. But because it’s active all the time and we’re pointing
>>> > to
>>> > the same buffer, space allocated for messages at the head of the
>>> > queue/buffer doesn’t get reclaimed, even long after those messages have
>>> > been processed. Just a theory.
>>> >
>>> > We are also trying to reproduce this using some perf tests to enqueue
>>> > with
>>> > same pattern, will update with the findings.
>>> >
>>> > Thanks
>>> > Ramayan
>>> >
>>> > On Wed, Apr 19, 2017 at 6:52 PM, Ramayan Tiwari
>>> > <[hidden email]>
>>> > wrote:
>>> >
>>> >> Another issue that we noticed is when broker goes OOM due to direct
>>> >> memory, it doesn't create a heap dump (specified by
>>> >> "-XX:+HeapDumpOnOutOfMemoryError"), even when the OOM error is the same
>>> >> as what is mentioned in the Oracle JVM docs ("java.lang.OutOfMemoryError").
>>> >>
>>> >> Has anyone been able to find a way to get to heap dump for DM OOM?
>>> >>
>>> >> - Ramayan
>>> >>
>>> >> On Wed, Apr 19, 2017 at 11:21 AM, Ramayan Tiwari
>>> >> <[hidden email]
>>> >> > wrote:
>>> >>
>>> >>> Alex,
>>> >>>
>>> >>> Below are the flow to disk logs from a broker holding 3 million+ messages
>>> >>> at this time. We only have one virtual host. Time is in GMT. It looks like
>>> >>> flow to disk is active on the whole virtual host and not at a queue level.
>>> >>>
>>> >>> When the same broker went OOM yesterday, I did not see any flow to
>>> >>> disk
>>> >>> logs from when it was started until it crashed (crashed twice within
>>> >>> 4hrs).
>>> >>>
>>> >>>
>>> >>> 4/19/17 4:17:43.509 AM INFO  [Housekeeping[test]] -
>>> >>> [Housekeeping[test]]
>>> >>> BRK-1014 : Message flow to disk active :  Message memory use
>>> >>> 3356539KB
>>> >>> exceeds threshold 3355443KB
>>> >>> 4/19/17 2:31:13.502 AM INFO  [Housekeeping[test]] -
>>> >>> [Housekeeping[test]]
>>> >>> BRK-1015 : Message flow to disk inactive : Message memory use
>>> >>> 3354866KB
>>> >>> within threshold 3355443KB
>>> >>> 4/19/17 2:28:43.511 AM INFO  [Housekeeping[test]] -
>>> >>> [Housekeeping[test]]
>>> >>> BRK-1014 : Message flow to disk active :  Message memory use
>>> >>> 3358509KB
>>> >>> exceeds threshold 3355443KB
>>> >>> 4/19/17 2:20:13.500 AM INFO  [Housekeeping[test]] -
>>> >>> [Housekeeping[test]]
>>> >>> BRK-1015 : Message flow to disk inactive : Message memory use
>>> >>> 3353501KB
>>> >>> within threshold 3355443KB
>>> >>> 4/19/17 2:18:13.500 AM INFO  [Housekeeping[test]] -
>>> >>> [Housekeeping[test]]
>>> >>> BRK-1014 : Message flow to disk active :  Message memory use
>>> >>> 3357544KB
>>> >>> exceeds threshold 3355443KB
>>> >>> 4/19/17 2:08:43.501 AM INFO  [Housekeeping[test]] -
>>> >>> [Housekeeping[test]]
>>> >>> BRK-1015 : Message flow to disk inactive : Message memory use
>>> >>> 3353236KB
>>> >>> within threshold 3355443KB
>>> >>> 4/19/17 2:08:13.501 AM INFO  [Housekeeping[test]] -
>>> >>> [Housekeeping[test]]
>>> >>> BRK-1014 : Message flow to disk active :  Message memory use
>>> >>> 3356704KB
>>> >>> exceeds threshold 3355443KB
>>> >>> 4/19/17 2:00:43.500 AM INFO  [Housekeeping[test]] -
>>> >>> [Housekeeping[test]]
>>> >>> BRK-1015 : Message flow to disk inactive : Message memory use
>>> >>> 3353511KB
>>> >>> within threshold 3355443KB
>>> >>> 4/19/17 2:00:13.504 AM INFO  [Housekeeping[test]] -
>>> >>> [Housekeeping[test]]
>>> >>> BRK-1014 : Message flow to disk active :  Message memory use
>>> >>> 3357948KB
>>> >>> exceeds threshold 3355443KB
>>> >>> 4/19/17 1:50:43.501 AM INFO  [Housekeeping[test]] -
>>> >>> [Housekeeping[test]]
>>> >>> BRK-1015 : Message flow to disk inactive : Message memory use
>>> >>> 3355310KB
>>> >>> within threshold 3355443KB
>>> >>> 4/19/17 1:47:43.501 AM INFO  [Housekeeping[test]] -
>>> >>> [Housekeeping[test]]
>>> >>> BRK-1014 : Message flow to disk active :  Message memory use
>>> >>> 3365624KB
>>> >>> exceeds threshold 3355443KB
>>> >>> 4/19/17 1:43:43.501 AM INFO  [Housekeeping[test]] -
>>> >>> [Housekeeping[test]]
>>> >>> BRK-1015 : Message flow to disk inactive : Message memory use
>>> >>> 3355136KB
>>> >>> within threshold 3355443KB
>>> >>> 4/19/17 1:31:43.509 AM INFO  [Housekeeping[test]] -
>>> >>> [Housekeeping[test]]
>>> >>> BRK-1014 : Message flow to disk active :  Message memory use
>>> >>> 3358683KB
>>> >>> exceeds threshold 3355443KB
>>> >>>
>>> >>>
>>> >>> After production release (2 days back), we have seen 4 crashes in 3
>>> >>> different brokers; this is the most pressing concern for us in deciding
>>> >>> if we should roll back to 0.32. Any help is greatly appreciated.
>>> >>>
>>> >>> Thanks
>>> >>> Ramayan
>>> >>>
>>> >>> On Wed, Apr 19, 2017 at 9:36 AM, Oleksandr Rudyy <[hidden email]>
>>> >>> wrote:
>>> >>>
>>> >>>> Ramayan,
>>> >>>> Thanks for the details. I would like to clarify whether flow to disk
>>> >>>> was
>>> >>>> triggered today for 3 million messages?
>>> >>>>
>>> >>>> The following logs are issued for flow to disk:
>>> >>>> BRK-1014 : Message flow to disk active :  Message memory use
>>> >>>> {0,number,#}KB
>>> >>>> exceeds threshold {1,number,#.##}KB
>>> >>>> BRK-1015 : Message flow to disk inactive : Message memory use
>>> >>>> {0,number,#}KB within threshold {1,number,#.##}KB
>>> >>>>
>>> >>>> Kind Regards,
>>> >>>> Alex
>>> >>>>
>>> >>>>
>>> >>>> On 19 April 2017 at 17:10, Ramayan Tiwari <[hidden email]>
>>> >>>> wrote:
>>> >>>>
>>> >>>> > Hi Alex,
>>> >>>> >
>>> >>>> > Thanks for your response, here are the details:
>>> >>>> >
>>> >>>> > We use "direct" exchange, without persistence (we specify
>>> >>>> NON_PERSISTENT
>>> >>>> > that while sending from client) and use BDB store. We use JSON
>>> >>>> > virtual
>>> >>>> host
>>> >>>> > type. We are not using SSL.
>>> >>>> >
>>> >>>> > When the broker went OOM, we had around 1.3 million messages with
>>> >>>> > 100
>>> >>>> bytes
>>> >>>> > average message size. Direct memory allocation (value read from
>>> >>>> > MBean)
>>> >>>> kept
>>> >>>> > going up, even though it wouldn't need more DM to store these many
>>> >>>> > messages. DM allocated persisted at 99% for about 3 and half hours
>>> >>>> before
>>> >>>> > crashing.
>>> >>>> >
>>> >>>> > Today, on the same broker we have 3 million messages (same message
>>> >>>> size)
>>> >>>> > and DM allocated is only at 8%. This seems like there is some
>>> >>>> > issue
>>> >>>> with
>>> >>>> > de-allocation or a leak.
>>> >>>> >
>>> >>>> > I have uploaded the memory utilization graph here:
>>> >>>> > https://drive.google.com/file/d/0Bwi0MEV3srPRVHFEbDlIYUpLaUE/
>>> >>>> > view?usp=sharing
>>> >>>> > Blue line is DM allocated, Yellow is DM Used (sum of queue
>>> >>>> > payload)
>>> >>>> and Red
>>> >>>> > is heap usage.
>>> >>>> >
>>> >>>> > Thanks
>>> >>>> > Ramayan
>>> >>>> >
>>> >>>> > On Wed, Apr 19, 2017 at 4:10 AM, Oleksandr Rudyy
>>> >>>> > <[hidden email]>
>>> >>>> wrote:
>>> >>>> >
>>> >>>> > > Hi Ramayan,
>>> >>>> > >
>>> >>>> > > Could please share with us the details of messaging use case(s)
>>> >>>> > > which
>>> >>>> > ended
>>> >>>> > > up in OOM on broker side?
>>> >>>> > > I would like to reproduce the issue on my local broker in order
>>> >>>> > > to
>>> >>>> fix
>>> >>>> > it.
>>> >>>> > > I would appreciate if you could provide as much details as
>>> >>>> > > possible,
>>> >>>> > > including, messaging topology, message persistence type, message
>>> >>>> > > sizes,volumes, etc.
>>> >>>> > >
>>> >>>> > > Qpid Broker 6.0.x uses direct memory for keeping message content
>>> >>>> > > and
>>> >>>> > > receiving/sending data. Each plain connection utilizes 512K of
>>> >>>> > > direct
>>> >>>> > > memory. Each SSL connection uses 1M of direct memory. Your
>>> >>>> > > memory
>>> >>>> > settings
>>> >>>> > > look Ok to me.
>>> >>>> > >
>>> >>>> > > Kind Regards,
>>> >>>> > > Alex
>>> >>>> > >
>>> >>>> > >
>>> >>>> > > On 18 April 2017 at 23:39, Ramayan Tiwari
>>> >>>> > > <[hidden email]>
>>> >>>> > > wrote:
>>> >>>> > >
>>> >>>> > > > Hi All,
>>> >>>> > > >
>>> >>>> > > > We are using Java broker 6.0.5, with patch to use
>>> >>>> MultiQueueConsumer
>>> >>>> > > > feature. We just finished deploying to production and saw
>>> >>>> > > > couple of
>>> >>>> > > > instances of broker OOM due to running out of DirectMemory
>>> >>>> > > > buffer
>>> >>>> > > > (exceptions at the end of this email).
>>> >>>> > > >
>>> >>>> > > > Here is our setup:
>>> >>>> > > > 1. Max heap 12g, max direct memory 4g (this is opposite of
>>> >>>> > > > what the
>>> >>>> > > > recommendation is, however, for our use cause message payload
>>> >>>> > > > is
>>> >>>> really
>>> >>>> > > > small ~400bytes and is way less than the per message overhead
>>> >>>> > > > of
>>> >>>> 1KB).
>>> >>>> > In
>>> >>>> > > > perf testing, we were able to put 2 million messages without
>>> >>>> > > > any
>>> >>>> > issues.
>>> >>>> > > > 2. ~400 connections to broker.
>>> >>>> > > > 3. Each connection has 20 sessions and there is one multi
>>> >>>> > > > queue
>>> >>>> > consumer
>>> >>>> > > > attached to each session, listening to around 1000 queues.
>>> >>>> > > > 4. We are still using 0.16 client (I know).
>>> >>>> > > >
>>> >>>> > > > With the above setup, the baseline utilization (without any
>>> >>>> messages)
>>> >>>> > for
>>> >>>> > > > direct memory was around 230mb (with 410 connection each
>>> >>>> > > > taking
>>> >>>> 500KB).
>>> >>>> > > >
>>> >>>> > > > Based on our understanding of broker memory allocation,
>>> >>>> > > > message
>>> >>>> payload
>>> >>>> > > > should be the only thing adding to direct memory utilization
>>> >>>> > > > (on
>>> >>>> top of
>>> >>>> > > > baseline), however, we are experiencing something completely
>>> >>>> different.
>>> >>>> > > In
>>> >>>> > > > our last broker crash, we see that broker is constantly
>>> >>>> > > > running
>>> >>>> with
>>> >>>> > 90%+
>>> >>>> > > > direct memory allocated, even when message payload sum from
>>> >>>> > > > all the
>>> >>>> > > queues
>>> >>>> > > > is only 6-8% (these % are against available DM of 4gb). During
>>> >>>> these
>>> >>>> > high
>>> >>>> > > > DM usage period, heap usage was around 60% (of 12gb).
>>> >>>> > > >
>>> >>>> > > > We would like some help in understanding what could be the
>>> >>>> > > > reason
>>> >>>> of
>>> >>>> > > these
>>> >>>> > > > high DM allocations. Are there things other than message
>>> >>>> > > > payload
>>> >>>> and
>>> >>>> > AMQP
>>> >>>> > > > connection, which use DM and could be contributing to these
>>> >>>> > > > high
>>> >>>> usage?
>>> >>>> > > >
>>> >>>> > > > Another thing where we are puzzled is the de-allocation of DM
>>> >>>> > > > byte
>>> >>>> > > buffers.
>>> >>>> > > > From log mining of heap and DM utilization, de-allocation of
>>> >>>> > > > DM
>>> >>>> doesn't
>>> >>>> > > > correlate with heap GC. If anyone has seen any documentation
>>> >>>> related to
>>> >>>> > > > this, it would be very helpful if you could share that.
>>> >>>> > > >
>>> >>>> > > > Thanks
>>> >>>> > > > Ramayan
>>> >>>> > > >
>>> >>>> > > >
>>> >>>> > > > *Exceptions*
>>> >>>> > > >
>>> >>>> > > > java.lang.OutOfMemoryError: Direct buffer memory
>>> >>>> > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
>>> >>>> > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
>>> >>>> > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
>>> >>>> > > > at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
>>> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.restoreApplicationBufferForWrite(NonBlockingConnectionPlainDelegate.java:93) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.processData(NonBlockingConnectionPlainDelegate.java:60) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.doRead(NonBlockingConnection.java:506) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.doWork(NonBlockingConnection.java:285) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> >>>> > > > at org.apache.qpid.server.transport.NetworkConnectionScheduler.processConnection(NetworkConnectionScheduler.java:124) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> >>>> > > > at org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.processConnection(SelectorThread.java:504) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> >>>> > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask.performSelect(SelectorThread.java:337) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> >>>> > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(SelectorThread.java:87) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> >>>> > > > at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> >>>> > > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
>>> >>>> > > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
>>> >>>> > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
>>> >>>> > > >
>>> >>>> > > >
>>> >>>> > > > *Second exception*
>>> >>>> > > > java.lang.OutOfMemoryError: Direct buffer memory
>>> >>>> > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
>>> >>>> > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
>>> >>>> > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
>>> >>>> > > > at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
>>> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.<init>(NonBlockingConnectionPlainDelegate.java:45) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.setTransportEncryption(NonBlockingConnection.java:625) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.<init>(NonBlockingConnection.java:117) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> >>>> > > > at org.apache.qpid.server.transport.NonBlockingNetworkTransport.acceptSocketChannel(NonBlockingNetworkTransport.java:158) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> >>>> > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask$1.run(SelectorThread.java:191) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> >>>> > > > at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
>>> >>>> > > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
>>> >>>> > > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
>>> >>>> > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
>>> >>>> > > >
>>> >>>> > >
>>> >>>> >
>>> >>>>
>>> >>>
>>> >>>
>>> >>
>>>
>>>
>>
>



Re: Java broker OOM due to DirectMemory

Ramayan Tiwari
Thanks so much Keith and the team for finding the root cause. We are so
relieved that the root cause will be fixed shortly.

A couple of things that I forgot to mention about the mitigation steps we took
in the last incident:
1) We triggered GC from the JMX bean multiple times; it did not help in
reducing the allocated DM.
2) We also killed all the AMQP connections to the broker when DM was at
80%. This did not help either. To kill the connections, we used JMX to get
the list of all open AMQP connections and called close on each from the MBean.

I am hoping the above two are not related to the root cause, but wanted to
bring them up in case they are relevant.
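For what it's worth, the GC observation is expected JVM behaviour; a small illustration (generic, not broker-specific code) of why an explicit GC cannot shrink direct memory while buffers are still referenced:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// A direct buffer's native memory is only returned when the ByteBuffer
// object itself becomes unreachable and is collected. Buffers that are
// still referenced -- e.g. by message chunks sitting on a queue -- survive
// any number of System.gc() calls, so a JMX-triggered GC leaves the
// allocated direct memory unchanged.
public class DirectGcDemo {
    public static void main(String[] args) {
        List<ByteBuffer> stillReferenced = new ArrayList<>();
        stillReferenced.add(ByteBuffer.allocateDirect(256 * 1024));
        System.gc(); // cannot reclaim the 256KB above: it is still reachable
        System.out.println("capacity still held: "
                + stillReferenced.get(0).capacity() + " bytes");
    }
}
```

This matches the buffer-pinning explanation in Keith's reply: as long as queued messages reference chunks of the network buffers, neither GC nor closing the connections can release that memory.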

Thanks
Ramayan

On Fri, Apr 21, 2017 at 8:29 AM, Keith W <[hidden email]> wrote:

> On 21 April 2017 at 10:21, Ramayan Tiwari <[hidden email]>
> wrote:
> > Hi All,
> >
> > We have been monitoring the brokers everyday and today we found one
> instance
> > where broker’s DM was constantly going up and was about to crash, so we
> > experimented some mitigations, one of which caused the DM to come down.
> > Following are the details, which might help us understanding the issue:
> >
> > Traffic scenario:
> >
> > DM allocation had been constantly going up and was at 90%. There were two
> > queues which seemed to align with the theories that we had. Q1’s size had
> > been large right after the broker start and had slow consumption of
> > messages; its queue size only reduced from 76MB to 75MB over a period of
> > 6hrs. Q2, on the other hand, started small and was gradually growing; its
> > queue size went from 7MB to 10MB in 6hrs. There were other queues with
> > traffic during this time.
> >
> > Action taken:
> >
> > Moved all the messages from Q2 (since this was our original theory) to Q3
> > (already created, with no messages in it). This did not stop the DM
> > growth.
> > Moved all the messages from Q1 to Q4 (already created, with no messages in
> > it). This reduced DM allocation from 93% to 31%.
> >
> > We have the heap dump and thread dump from when the broker was at 90% DM
> > allocation. We are going to analyze them to see if we can get some clue.
> > We wanted to share this new information, which might help in reasoning
> > about the memory issue.
> >
> > - Ramayan
> >
> >
> > On Thu, Apr 20, 2017 at 11:20 AM, Ramayan Tiwari <
> [hidden email]>
> > wrote:
> >>
> >> Hi Keith,
> >>
> >> Thanks so much for your response and digging into the issue. Below are
> >> the answers to your questions:
> >>
> >> 1) Yes, we are using QPID-7462 with 6.0.5. We couldn't use 6.1, where it
> >> was released, because we need JMX support. Here is the destination format:
> >> "%s ; {node : { type : queue }, link : { x-subscribes : { arguments : {
> >> x-multiqueue : [%s], x-pull-only : true }}}}"
> >>
> >> 2) Our machines have 40 cores, which makes the number of threads 80.
> >> This might not be an issue, because it would show up in the baseline DM
> >> allocated, which is only 6% (of 4GB) when we just bring up the broker.
> >>
> >> 3) The only setting that we tuned with respect to DM is
> >> flowToDiskThreshold, which is set at 80% now.
> >>
> >> 4) Only one virtual host in the broker.
> >>
> >> 5) Most of our queues (99%) are priority queues; we also have 8-10
> >> sorted queues.
> >>
> >> 6) Yeah we are using the standard 0.16 client and not AMQP 1.0 clients.
> >> The connection log line looks like:
> >> CON-1001 : Open : Destination : AMQP(IP:5672) : Protocol Version : 0-10 :
> >> Client ID : test : Client Version : 0.16 : Client Product : qpid
> >>
> >> We had another broker crash about an hour back; we see the same
> >> patterns:
> >> 1) There is a queue which is constantly growing, enqueue is faster than
> >> dequeue on that queue for a long period of time.
> >> 2) Flow to disk didn't kick in at all.
> >>
> >> This graph shows memory growth (red line - heap, blue - DM allocated,
> >> yellow - DM used)
> >>
> >> https://drive.google.com/file/d/0Bwi0MEV3srPRdVhXdTBncHJLY2c/view?usp=sharing
> >>
> >> The below graph shows growth on a single queue (there are 10-12 other
> >> queues with traffic as well, some larger in size than this queue):
> >>
> >> https://drive.google.com/file/d/0Bwi0MEV3srPRWmNGbDNGUkJhQ0U/view?usp=sharing
> >>
> >> Couple of questions:
> >> 1) Is there any developer-level doc/design spec on how Qpid uses DM?
> >> 2) We are not getting heap dumps automatically when the broker crashes
> >> due to DM (HeapDumpOnOutOfMemoryError not respected). Has anyone found a
> >> way to get around this problem?
> >>
> >> Thanks
> >> Ramayan
> >>
> >> On Thu, Apr 20, 2017 at 9:08 AM, Keith W <[hidden email]> wrote:
> >>>
> >>> Hi Ramayan
> >>>
> >>> We have been discussing your problem here and have a couple of
> questions.
> >>>
> >>> I have been experimenting with use-cases based on your descriptions
> >>> above, but so far, have been unsuccessful in reproducing a
> >>> "java.lang.OutOfMemoryError: Direct buffer memory"  condition. The
> >>> direct memory usage reflects the expected model: it levels off when
> >>> the flow to disk threshold is reached, and direct memory is released as
> >>> messages are consumed until the minimum size for caching of direct
> >>> memory is reached.
> >>>
> >>> 1] For clarity let me check: we believe when you say "patch to use
> >>> MultiQueueConsumer" you are referring to the patch attached to
> >>> QPID-7462 "Add experimental "pull" consumers to the broker"  and you
> >>> are using a combination of this "x-pull-only"  with the standard
> >>> "x-multiqueue" feature.  Is this correct?
> >>>
> >>> 2] One idea we had here relates to the size of the virtualhost IO
> >>> pool.   As you know from the documentation, the Broker caches/reuses
> >>> direct memory internally but the documentation fails to mention that
> >>> each pooled virtualhost IO thread also grabs a chunk (256K) of direct
> >>> memory from this cache.  By default the virtual host IO pool is sized
> >>> Math.max(Runtime.getRuntime().availableProcessors() * 2, 64), so if
> >>> you have a machine with a very large number of cores, you may have a
> >>> surprisingly large amount of direct memory assigned to virtualhost IO
> >>> threads.   Check the value of connectionThreadPoolSize on the
> >>> virtualhost
> >>> (http://<server>:<port>/api/latest/virtualhost/<virtualhostnodename>/<virtualhostname>)
> >>> to see what value is in force.  What is it?  It is possible to tune
> >>> the pool size using context variable
> >>> virtualhost.connectionThreadPool.size.
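
The arithmetic behind point 2] can be sketched directly. The pool-size formula and the 256K per-thread chunk are taken from the paragraph above; the class itself is only illustrative, not broker code:

```java
// Illustrative estimate of direct memory claimed by the virtualhost IO pool.
// The formula and the 256 KB per-thread chunk are quoted from the email above;
// this is a sketch, not actual Qpid Broker code.
public class IoPoolDirectMemoryEstimate {
    static final long CHUNK_BYTES = 256 * 1024; // direct memory grabbed per pooled IO thread

    static int defaultPoolSize(int availableProcessors) {
        return Math.max(availableProcessors * 2, 64);
    }

    static long poolDirectMemoryBytes(int poolSize) {
        return poolSize * CHUNK_BYTES;
    }

    public static void main(String[] args) {
        int poolSize = defaultPoolSize(40); // the reporter's 40-core machine -> 80 threads
        System.out.println(poolSize + " IO threads claim "
                + poolDirectMemoryBytes(poolSize) / (1024 * 1024) + " MB of direct memory");
        // prints "80 IO threads claim 20 MB of direct memory"
    }
}
```

On the reporter's 40-core box this works out to roughly 20 MB, consistent with his later answer that the pool is unlikely to explain the growth.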
> >>>
> >>> 3] Tell me if you are tuning the Broker in way beyond the direct/heap
> >>> memory settings you have told us about already.  For instance you are
> >>> changing any of the direct memory pooling settings
> >>> broker.directByteBufferPoolSize, default network buffer size
> >>> qpid.broker.networkBufferSize or applying any other non-standard
> >>> settings?
> >>>
> >>> 4] How many virtual hosts do you have on the Broker?
> >>>
> >>> 5] What is the consumption pattern of the messages?  Do you consume in a
> >>> strictly FIFO fashion, or are you making use of message selectors
> >>> or/and any of the out-of-order queue types (LVQs, priority queue or
> >>> sorted queues)?
> >>>
> >>> 6] Is it just the 0.16 client involved in the application?   Can I
> >>> check that you are not using any of the AMQP 1.0 clients
> >>> (org.apache.qpid:qpid-jms-client or
> >>> org.apache.qpid:qpid-amqp-1-0-client) in the software stack (as either
> >>> consumers or producers)?
> >>>
> >>> Hopefully the answers to these questions will get us closer to a
> >>> reproduction.   If you are able to reliably reproduce it, please share
> >>> the steps with us.
> >>>
> >>> Kind regards, Keith.
> >>>
> >>>
> >>> On 20 April 2017 at 10:21, Ramayan Tiwari <[hidden email]>
> >>> wrote:
> >>> > After a lot of log mining, we might have a way to explain the sustained
> >>> > increase in DirectMemory allocation: the correlation seems to be with
> >>> > the growth in the size of a queue that is being consumed, but at a much
> >>> > slower rate than producers are putting messages onto it.
> >>> >
> >>> > The pattern we see is that in each instance of broker crash, there is
> >>> > at
> >>> > least one queue (usually 1 queue) whose size kept growing steadily.
> >>> > It’d be
> >>> > of significant size but not the largest queue -- usually there are
> >>> > multiple
> >>> > larger queues -- but it was different from other queues in that its
> >>> > size
> >>> > was growing steadily. The queue would also be moving, but its
> >>> > processing
> >>> > rate was not keeping up with the enqueue rate.
> >>> >
> >>> > Our theory that might be totally wrong: If a queue is moving the
> entire
> >>> > time, maybe then the broker would keep reusing the same buffer in
> >>> > direct
> >>> > memory for the queue, and keep on adding onto it at the end to
> >>> > accommodate
> >>> > new messages. But because it’s active all the time and we’re pointing
> >>> > to
> >>> > the same buffer, space allocated for messages at the head of the
> >>> > queue/buffer doesn’t get reclaimed, even long after those messages
> have
> >>> > been processed. Just a theory.
> >>> >
> >>> > We are also trying to reproduce this using some perf tests to enqueue
> >>> > with
> >>> > same pattern, will update with the findings.
> >>> >
> >>> > Thanks
> >>> > Ramayan
> >>> >
> >>> > On Wed, Apr 19, 2017 at 6:52 PM, Ramayan Tiwari
> >>> > <[hidden email]>
> >>> > wrote:
> >>> >
> >>> >> Another issue that we noticed: when the broker goes OOM due to direct
> >>> >> memory, it doesn't create a heap dump (specified by
> >>> >> "-XX:+HeapDumpOnOutOfMemoryError"), even though the OOM error is the
> >>> >> same "java.lang.OutOfMemoryError" mentioned in the Oracle JVM docs.
> >>> >>
> >>> >> Has anyone been able to find a way to get to heap dump for DM OOM?
> >>> >>
> >>> >> - Ramayan
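
One commonly used workaround for the heap-dump question above, on HotSpot JVMs, is to trigger a dump programmatically through the HotSpot diagnostic MXBean instead of relying on -XX:+HeapDumpOnOutOfMemoryError. A minimal sketch, assuming a HotSpot JVM and that the chosen output file does not already exist:

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

// Sketch: trigger a heap dump on demand via the HotSpot diagnostic MXBean.
// Assumes a HotSpot JVM; dumpHeap fails if the target file already exists.
public class ManualHeapDump {
    public static void dump(String outputPath, boolean liveObjectsOnly) throws Exception {
        HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        diagnostic.dumpHeap(outputPath, liveObjectsOnly);
    }

    public static void main(String[] args) throws Exception {
        // e.g. call this from a watchdog when direct memory use crosses a limit
        dump(System.getProperty("java.io.tmpdir") + "/broker-"
                + System.currentTimeMillis() + ".hprof", true);
    }
}
```

A watchdog could poll the same MBean the reporter is already reading DM figures from and dump before the OOM actually fires.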
> >>> >>
> >>> >> On Wed, Apr 19, 2017 at 11:21 AM, Ramayan Tiwari
> >>> >> <[hidden email]
> >>> >> > wrote:
> >>> >>
> >>> >>> Alex,
> >>> >>>
> >>> >>> Below are the flow to disk logs from the broker, which has 3 million+
> >>> >>> messages at this time. We only have one virtual host. Time is in GMT.
> >>> >>> It looks like flow to disk is active on the whole virtual host and not
> >>> >>> at the queue level.
> >>> >>>
> >>> >>> When the same broker went OOM yesterday, I did not see any flow to
> >>> >>> disk logs from when it was started until it crashed (it crashed twice
> >>> >>> within 4hrs).
> >>> >>>
> >>> >>>
> >>> >>> 4/19/17 4:17:43.509 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3356539KB exceeds threshold 3355443KB
> >>> >>> 4/19/17 2:31:13.502 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3354866KB within threshold 3355443KB
> >>> >>> 4/19/17 2:28:43.511 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3358509KB exceeds threshold 3355443KB
> >>> >>> 4/19/17 2:20:13.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353501KB within threshold 3355443KB
> >>> >>> 4/19/17 2:18:13.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3357544KB exceeds threshold 3355443KB
> >>> >>> 4/19/17 2:08:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353236KB within threshold 3355443KB
> >>> >>> 4/19/17 2:08:13.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3356704KB exceeds threshold 3355443KB
> >>> >>> 4/19/17 2:00:43.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353511KB within threshold 3355443KB
> >>> >>> 4/19/17 2:00:13.504 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3357948KB exceeds threshold 3355443KB
> >>> >>> 4/19/17 1:50:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355310KB within threshold 3355443KB
> >>> >>> 4/19/17 1:47:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3365624KB exceeds threshold 3355443KB
> >>> >>> 4/19/17 1:43:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355136KB within threshold 3355443KB
> >>> >>> 4/19/17 1:31:43.509 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3358683KB exceeds threshold 3355443KB
> >>> >>>
> >>> >>>
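
As a sanity check, the 3355443KB threshold in the logs above is exactly 80% of a 4 GB max direct memory setting, which matches the flowToDiskThreshold of 80% mentioned elsewhere in the thread. A quick sketch of the arithmetic:

```java
// Verifies that the logged flow-to-disk threshold (3355443KB) matches
// 80% of a 4 GB max direct memory setting. Illustrative arithmetic only.
public class ThresholdCheck {
    static long thresholdKb(long maxDirectMemoryBytes, double fraction) {
        return (long) (maxDirectMemoryBytes * fraction) / 1024;
    }

    public static void main(String[] args) {
        long fourGb = 4L * 1024 * 1024 * 1024; // -XX:MaxDirectMemorySize=4g
        System.out.println(thresholdKb(fourGb, 0.8) + "KB"); // prints "3355443KB", as in the logs
    }
}
```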
> >>> >>> After the production release (2 days back), we have seen 4 crashes in 3
> >>> >>> different brokers. This is the most pressing concern for us in deciding
> >>> >>> whether we should roll back to 0.32. Any help is greatly appreciated.
> >>> >>>
> >>> >>> Thanks
> >>> >>> Ramayan
> >>> >>>
> >>> >>> On Wed, Apr 19, 2017 at 9:36 AM, Oleksandr Rudyy <[hidden email]
> >
> >>> >>> wrote:
> >>> >>>
> >>> >>>> Ramayan,
> >>> >>>> Thanks for the details. I would like to clarify whether flow to
> >>> >>>> disk was triggered today for the 3 million messages.
> >>> >>>>
> >>> >>>> The following logs are issued for flow to disk:
> >>> >>>> BRK-1014 : Message flow to disk active :  Message memory use
> >>> >>>> {0,number,#}KB
> >>> >>>> exceeds threshold {1,number,#.##}KB
> >>> >>>> BRK-1015 : Message flow to disk inactive : Message memory use
> >>> >>>> {0,number,#}KB within threshold {1,number,#.##}KB
> >>> >>>>
> >>> >>>> Kind Regards,
> >>> >>>> Alex
> >>> >>>>
> >>> >>>>
> >>> >>>> On 19 April 2017 at 17:10, Ramayan Tiwari <
> [hidden email]>
> >>> >>>> wrote:
> >>> >>>>
> >>> >>>> > Hi Alex,
> >>> >>>> >
> >>> >>>> > Thanks for your response, here are the details:
> >>> >>>> >
> >>> >>>> > We use a "direct" exchange, without persistence (we specify
> >>> >>>> > NON_PERSISTENT while sending from the client), and use the BDB
> >>> >>>> > store. We use the JSON virtual host type. We are not using SSL.
> >>> >>>> >
> >>> >>>> > When the broker went OOM, we had around 1.3 million messages with
> >>> >>>> > a 100-byte average message size. Direct memory allocation (value
> >>> >>>> > read from the MBean) kept going up, even though it wouldn't need
> >>> >>>> > more DM to store that many messages. DM allocated persisted at 99%
> >>> >>>> > for about 3 and a half hours before crashing.
> >>> >>>> >
> >>> >>>> > Today, on the same broker we have 3 million messages (same message
> >>> >>>> > size) and DM allocated is only at 8%. This seems like there is some
> >>> >>>> > issue with de-allocation, or a leak.
> >>> >>>> >
> >>> >>>> > I have uploaded the memory utilization graph here:
> >>> >>>> > https://drive.google.com/file/d/0Bwi0MEV3srPRVHFEbDlIYUpLaUE/view?usp=sharing
> >>> >>>> > Blue line is DM allocated, Yellow is DM Used (sum of queue
> >>> >>>> > payload)
> >>> >>>> and Red
> >>> >>>> > is heap usage.
> >>> >>>> >
> >>> >>>> > Thanks
> >>> >>>> > Ramayan
> >>> >>>> >
> >>> >>>> > On Wed, Apr 19, 2017 at 4:10 AM, Oleksandr Rudyy
> >>> >>>> > <[hidden email]>
> >>> >>>> wrote:
> >>> >>>> >
> >>> >>>> > > Hi Ramayan,
> >>> >>>> > >
> >>> >>>> > > Could you please share with us the details of the messaging use
> >>> >>>> > > case(s) which ended up in OOM on the broker side?
> >>> >>>> > > I would like to reproduce the issue on my local broker in order
> >>> >>>> > > to fix it.
> >>> >>>> > > I would appreciate it if you could provide as many details as
> >>> >>>> > > possible, including messaging topology, message persistence type,
> >>> >>>> > > message sizes, volumes, etc.
> >>> >>>> > >
> >>> >>>> > > Qpid Broker 6.0.x uses direct memory for keeping message
> content
> >>> >>>> > > and
> >>> >>>> > > receiving/sending data. Each plain connection utilizes 512K of
> >>> >>>> > > direct
> >>> >>>> > > memory. Each SSL connection uses 1M of direct memory. Your
> >>> >>>> > > memory
> >>> >>>> > settings
> >>> >>>> > > look Ok to me.
> >>> >>>> > >
> >>> >>>> > > Kind Regards,
> >>> >>>> > > Alex
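
Using the per-connection figures Alex gives above (512K per plain connection, 1M per SSL connection), the reporter's connection baseline can be estimated. A sketch, assuming all connections are plain as stated in the report:

```java
// Estimates baseline direct memory from connection counts, using the
// per-connection figures quoted above (512 KB plain, 1 MB SSL).
// Illustrative only; not broker code.
public class ConnectionBaseline {
    static long baselineBytes(int plainConnections, int sslConnections) {
        return plainConnections * 512L * 1024 + sslConnections * 1024L * 1024;
    }

    public static void main(String[] args) {
        // ~410 plain connections, as in the original report
        System.out.println(baselineBytes(410, 0) / (1024 * 1024) + " MB"); // prints "205 MB"
    }
}
```

This lands close to the ~230 MB baseline observed in the original report.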
> >>> >>>> > >
> >>> >>>> > >
> >>> >>>> > > On 18 April 2017 at 23:39, Ramayan Tiwari
> >>> >>>> > > <[hidden email]>
> >>> >>>> > > wrote:
> >>> >>>> > >
> >>> >>>> > > > Hi All,
> >>> >>>> > > >
> >>> >>>> > > > We are using Java broker 6.0.5, with patch to use
> >>> >>>> MultiQueueConsumer
> >>> >>>> > > > feature. We just finished deploying to production and saw
> >>> >>>> > > > couple of
> >>> >>>> > > > instances of broker OOM due to running out of DirectMemory
> >>> >>>> > > > buffer
> >>> >>>> > > > (exceptions at the end of this email).
> >>> >>>> > > >
> >>> >>>> > > > Here is our setup:
> >>> >>>> > > > 1. Max heap 12g, max direct memory 4g (this is opposite of
> >>> >>>> > > > what the
> >>> >>>> > > > recommendation is, however, for our use cause message
> payload
> >>> >>>> > > > is
> >>> >>>> really
> >>> >>>> > > > small ~400bytes and is way less than the per message
> overhead
> >>> >>>> > > > of
> >>> >>>> 1KB).
> >>> >>>> > In
> >>> >>>> > > > perf testing, we were able to put 2 million messages without
> >>> >>>> > > > any
> >>> >>>> > issues.
> >>> >>>> > > > 2. ~400 connections to broker.
> >>> >>>> > > > 3. Each connection has 20 sessions and there is one multi
> >>> >>>> > > > queue
> >>> >>>> > consumer
> >>> >>>> > > > attached to each session, listening to around 1000 queues.
> >>> >>>> > > > 4. We are still using 0.16 client (I know).
> >>> >>>> > > >
> >>> >>>> > > > With the above setup, the baseline utilization (without any
> >>> >>>> messages)
> >>> >>>> > for
> >>> >>>> > > > direct memory was around 230mb (with 410 connection each
> >>> >>>> > > > taking
> >>> >>>> 500KB).
> >>> >>>> > > >
> >>> >>>> > > > Based on our understanding of broker memory allocation,
> >>> >>>> > > > message
> >>> >>>> payload
> >>> >>>> > > > should be the only thing adding to direct memory utilization
> >>> >>>> > > > (on
> >>> >>>> top of
> >>> >>>> > > > baseline), however, we are experiencing something completely
> >>> >>>> different.
> >>> >>>> > > In
> >>> >>>> > > > our last broker crash, we see that broker is constantly
> >>> >>>> > > > running
> >>> >>>> with
> >>> >>>> > 90%+
> >>> >>>> > > > direct memory allocated, even when message payload sum from
> >>> >>>> > > > all the
> >>> >>>> > > queues
> >>> >>>> > > > is only 6-8% (these % are against available DM of 4gb).
> During
> >>> >>>> these
> >>> >>>> > high
> >>> >>>> > > > DM usage period, heap usage was around 60% (of 12gb).
> >>> >>>> > > >
> >>> >>>> > > > We would like some help in understanding what could be the
> >>> >>>> > > > reason
> >>> >>>> of
> >>> >>>> > > these
> >>> >>>> > > > high DM allocations. Are there things other than message
> >>> >>>> > > > payload
> >>> >>>> and
> >>> >>>> > AMQP
> >>> >>>> > > > connection, which use DM and could be contributing to these
> >>> >>>> > > > high
> >>> >>>> usage?
> >>> >>>> > > >
> >>> >>>> > > > Another thing where we are puzzled is the de-allocation of
> DM
> >>> >>>> > > > byte
> >>> >>>> > > buffers.
> >>> >>>> > > > From log mining of heap and DM utilization, de-allocation of
> >>> >>>> > > > DM
> >>> >>>> doesn't
> >>> >>>> > > > correlate with heap GC. If anyone has seen any documentation
> >>> >>>> related to
> >>> >>>> > > > this, it would be very helpful if you could share that.
> >>> >>>> > > >
> >>> >>>> > > > Thanks
> >>> >>>> > > > Ramayan
> >>> >>>> > > >
> >>> >>>> > > >
> >>> >>>> > > > *Exceptions*
> >>> >>>> > > >
> >>> >>>> > > > java.lang.OutOfMemoryError: Direct buffer memory
> >>> >>>> > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
> >>> >>>> > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
> >>> >>>> > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
> >>> >>>> > > > at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
> >>> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.restoreApplicationBufferForWrite(NonBlockingConnectionPlainDelegate.java:93) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.processData(NonBlockingConnectionPlainDelegate.java:60) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.doRead(NonBlockingConnection.java:506) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.doWork(NonBlockingConnection.java:285) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>> >>>> > > > at org.apache.qpid.server.transport.NetworkConnectionScheduler.processConnection(NetworkConnectionScheduler.java:124) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>> >>>> > > > at org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.processConnection(SelectorThread.java:504) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>> >>>> > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask.performSelect(SelectorThread.java:337) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>> >>>> > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(SelectorThread.java:87) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>> >>>> > > > at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>> >>>> > > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
> >>> >>>> > > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
> >>> >>>> > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
> >>> >>>> > > >
> >>> >>>> > > >
> >>> >>>> > > >
> >>> >>>> > > > *Second exception*
> >>> >>>> > > > java.lang.OutOfMemoryError: Direct buffer memory
> >>> >>>> > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
> >>> >>>> > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
> >>> >>>> > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
> >>> >>>> > > > at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
> >>> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.<init>(NonBlockingConnectionPlainDelegate.java:45) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.setTransportEncryption(NonBlockingConnection.java:625) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.<init>(NonBlockingConnection.java:117) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>> >>>> > > > at org.apache.qpid.server.transport.NonBlockingNetworkTransport.acceptSocketChannel(NonBlockingNetworkTransport.java:158) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>> >>>> > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask$1.run(SelectorThread.java:191) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>> >>>> > > > at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>> >>>> > > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
> >>> >>>> > > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
> >>> >>>> > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
> >>> >>>> > > >
> >>> >>>> > >
> >>> >>>> >
> >>> >>>>
> >>> >>>
> >>> >>>
> >>> >>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: [hidden email]
> >>> For additional commands, e-mail: [hidden email]
> >>>
> >>
> >
>

Re: Java broker OOM due to DirectMemory

Lorenz Quack
Hello Ramayan,

We are still working on a fix for this issue.
In the meantime, we have an idea for a potential workaround until a proper fix is released.

The idea is to decrease the qpid network buffer size the broker uses.
While this still allows for sparsely populated buffers, it would improve the overall occupancy ratio.

Here are the steps to follow:
 * ensure you are not using TLS
 * apply the attached patch
 * figure out the size of the largest messages you are sending (including header and some overhead)
 * set the context variable "qpid.broker.networkBufferSize" to that value but not smaller than 4096 
 * test

Decreasing the qpid network buffer size automatically limits the maximum AMQP frame size.
Since you are using a very old client we are not sure how well it copes with small frame sizes where it has to split a message across multiple frames.
Therefore, to play it safe you should not set it smaller than the largest messages (+ header + overhead) you are sending.
I do not know what message sizes you are sending, but AMQP imposes the restriction that the frame size cannot be smaller than 4096 bytes.
In the qpid broker the default currently is 256 kB.

In the current state the broker does not allow setting the network buffer to values smaller than 64 kB to allow TLS frames to fit into one network buffer.
I attached a patch to this mail that lowers that restriction to the limit imposed by AMQP (4096 Bytes).
Obviously, you should not use this when using TLS.
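The sizing rule described above can be sketched as follows. The 4096-byte floor is the AMQP minimum frame size mentioned in this email; the overhead figure in the example is an assumption:

```java
// Sketch of choosing a qpid.broker.networkBufferSize value per the advice above:
// largest message plus header/overhead, but never below the AMQP minimum frame
// size of 4096 bytes. Not broker code; the overhead constant is an assumption.
public class NetworkBufferSizing {
    static final int AMQP_MIN_FRAME_SIZE = 4096;

    static int chooseBufferSize(int largestMessageBytes, int headerAndOverheadBytes) {
        return Math.max(AMQP_MIN_FRAME_SIZE, largestMessageBytes + headerAndOverheadBytes);
    }

    public static void main(String[] args) {
        // ~400-byte payloads (as in the original report) with an assumed 1 KB
        // overhead still fall below the AMQP floor, so 4096 wins here.
        System.out.println(chooseBufferSize(400, 1024)); // prints 4096
    }
}
```

The chosen value would then be supplied as the qpid.broker.networkBufferSize context variable described in this email.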


I hope this reduces the problems you are currently facing until we can complete the proper fix.

Kind regards,
Lorenz


On Fri, 2017-04-21 at 09:17 -0700, Ramayan Tiwari wrote:

> Thanks so much Keith and the team for finding the root cause. We are so
> relieved that the root cause will be fixed shortly.
>
> Couple of things that I forgot to mention on the mitigation steps we took
> in the last incident:
> 1) We triggered GC from JMX bean multiple times, it did not help in
> reducing DM allocated.
> 2) We also killed all the AMQP connections to the broker when DM was at
> 80%. This did not help either. The way we killed connections: using JMX,
> we got the list of all open AMQP connections and called close from the MBean.
>
> I am hoping the above two are not related to root cause, but wanted to
> bring it up in case this is relevant.
>
> Thanks
> Ramayan
>
> On Fri, Apr 21, 2017 at 8:29 AM, Keith W <[hidden email]> wrote:
>
> >
> > Hello Ramayan
> >
> > I believe I understand the root cause of the problem.  We have
> > identified a flaw in the direct memory buffer management employed by
> > Qpid Broker J which for some messaging use-cases can lead to the
> > direct memory OOM you describe.   For the issue to manifest, the
> > producing application needs to use a single connection for the
> > production of messages, some of which are short-lived (i.e. are
> > consumed quickly) whilst others remain on the queue for some time.
> > Priority queues, sorted queues and consumers utilising selectors that
> > result in some messages being left on the queue could all produce this
> > pattern.  The pattern leads to sparsely occupied 256K network buffers
> > which cannot be released or reused until every message that references
> > a 'chunk' of them is either consumed or flown to disk.   The problem
> > was introduced in Qpid v6.0 and exists in v6.1 and trunk too.
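
The buffer-pinning mechanism Keith describes can be illustrated with plain NIO buffers. This is a sketch of the failure mode only, not actual broker code:

```java
import java.nio.ByteBuffer;

// Illustrates the failure mode described above: a 256 KB direct network buffer
// is carved into per-message slices, and the backing allocation can only be
// released once *every* slice is unreachable, so one long-lived message pins
// the whole buffer. Sketch only; not Qpid Broker code.
public class SparseBufferPinning {
    public static void main(String[] args) {
        ByteBuffer netBuffer = ByteBuffer.allocateDirect(256 * 1024);

        netBuffer.limit(1024);                 // chunk for a short-lived message
        ByteBuffer shortLived = netBuffer.slice();

        netBuffer.position(1024).limit(2048);  // chunk for a long-lived message
        ByteBuffer longLived = netBuffer.slice();

        shortLived = null; // short-lived message consumed: its slice is dropped...
        // ...but 'longLived' still references the same 256 KB backing memory,
        // so none of it can be returned to the pool or freed by the GC.
        System.out.println(longLived.isDirect() + " " + longLived.capacity()); // prints "true 1024"
    }
}
```

Each slice shares the parent buffer's backing memory, which is why flowing the long-lived message to disk (or consuming it) is the only way the 256 KB allocation becomes reclaimable.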
> >
> > The flow to disk feature is not helping us here because its algorithm
> > considers only the size of live messages on the queues. If the
> > accumulative live size does not exceed the threshold, the messages
> > aren't flown to disk. I speculate that when you observed that moving
> > messages cause direct message usage to drop earlier today, your
> > message movement cause a queue to go over threshold, cause message to
> > be flown to disk and their direct memory references released.  The
> > logs will confirm this is so.
> >
> > I have not identified an easy workaround at the moment.   Decreasing
> > the flow to disk threshold and/or increasing available direct memory
> > should alleviate and may be an acceptable short term workaround.  If
> > it were possible for publishing application to publish short lived and
> > long lived messages on two separate JMS connections this would avoid
> > this defect.
> >
> > QPID-7753 tracks this issue and QPID-7754 is a related this problem.
> > We intend to be working on these early next week and will be aiming
> > for a fix that is back-portable to 6.0.
> >
> > Apologies that you have run into this defect and thanks for reporting.
> >
> > Thanks, Keith
> >
> >
> >
> >
> >
> >
> >
> > On 21 April 2017 at 10:21, Ramayan Tiwari <[hidden email]>
> > wrote:
> > >
> > > Hi All,
> > >
> > > We have been monitoring the brokers everyday and today we found one
> > instance
> > >
> > > where broker’s DM was constantly going up and was about to crash, so we
> > > experimented some mitigations, one of which caused the DM to come down.
> > > Following are the details, which might help us understanding the issue:
> > >
> > > Traffic scenario:
> > >
> > > DM allocation had been constantly going up and was at 90%. There were two
> > > queues which seemed to align with the theories that we had. Q1’s size had
> > > been large right after the broker start and had slow consumption of
> > > messages, queue size only reduced from 76MB to 75MB over a period of
> > 6hrs.
> > >
> > > Q2 on the other hand, started small and was gradually growing, queue size
> > > went from 7MB to 10MB in 6hrs. There were other queues with traffic
> > during
> > >
> > > this time.
> > >
> > > Action taken:
> > >
> > > Moved all the messages from Q2 (since this was our original theory) to Q3
> > > (already created but no messages in it). This did not help with the DM
> > > growing up.
> > > Moved all the messages from Q1 to Q4 (already created but no messages in
> > > it). This reduced DM allocation from 93% to 31%.
> > >
> > > We have the heap dump and thread dump from when broker was 90% in DM
> > > allocation. We are going to analyze that to see if we can get some clue.
> > We
> > >
> > > wanted to share this new information which might help in reasoning about
> > the
> > >
> > > memory issue.
> > >
> > > - Ramayan
> > >
> > >
> > > On Thu, Apr 20, 2017 at 11:20 AM, Ramayan Tiwari <
> > [hidden email]>
> > >
> > > wrote:
> > > >
> > > > Hi Keith,
> > > >
> > > > Thanks so much for your response and for digging into the issue. Below are the answers to your questions:
> > > >
> > > > 1) Yes, we are using QPID-7462 with 6.0.5. We couldn't use 6.1, where it was released, because we need JMX support. Here is the destination format:
> > > >
> > > > "%s ; {node : { type : queue }, link : { x-subscribes : { arguments : { x-multiqueue : [%s], x-pull-only : true }}}}"
> > > >
> > > > 2) Our machines have 40 cores, which makes the number of threads 80. This might not be an issue, because it would show up in the baseline DM allocated, which is only 6% (of 4GB) when we just bring up the broker.
> > > >
> > > > 3) The only setting that we tuned WRT DM is flowToDiskThreshold, which is set at 80% now.
> > > >
> > > > 4) Only one virtual host in the broker.
> > > >
> > > > 5) Most of our queues (99%) are priority queues; we also have 8-10 sorted queues.
> > > >
> > > > 6) Yes, we are using the standard 0.16 client and not the AMQP 1.0 clients. The connection log line looks like:
> > > > CON-1001 : Open : Destination : AMQP(IP:5672) : Protocol Version : 0-10 : Client ID : test : Client Version : 0.16 : Client Product : qpid
> > > >
> > > > We had another broker crash about an hour back, and we see the same patterns:
> > > > 1) There is a queue which is constantly growing; enqueue is faster than dequeue on that queue for a long period of time.
> > > > 2) Flow to disk didn't kick in at all.
> > > >
> > > > This graph shows memory growth (red line - heap, blue - DM allocated, yellow - DM used):
> > > >
> > > > https://drive.google.com/file/d/0Bwi0MEV3srPRdVhXdTBncHJLY2c/view?usp=sharing
> > > >
> > > > The graph below shows growth on a single queue (there are 10-12 other queues with traffic as well, some larger in size than this queue):
> > > >
> > > > https://drive.google.com/file/d/0Bwi0MEV3srPRWmNGbDNGUkJhQ0U/view?usp=sharing
> > > >
> > > > A couple of questions:
> > > > 1) Is there any developer-level doc/design spec on how Qpid uses DM?
> > > > 2) We are not getting heap dumps automatically when the broker crashes due to DM (HeapDumpOnOutOfMemoryError is not respected). Has anyone found a way to get around this problem?
> > > >
> > > > Thanks
> > > > Ramayan
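For readers unfamiliar with the address syntax quoted in answer 1) above, here is a small sketch of how such a destination string could be assembled. The consumer name and queue list are made-up examples; only the x-multiqueue/x-pull-only structure comes from the thread:

```java
public class MultiQueueAddress {
    // Fills the two %s placeholders of the address format quoted above.
    // "name" is the link name; "queueList" is the comma-separated list of
    // queue names for x-multiqueue. Both values here are illustrative.
    static String address(String name, String queueList) {
        return String.format(
            "%s ; {node : { type : queue }, link : { x-subscribes : "
                + "{ arguments : { x-multiqueue : [%s], x-pull-only : true }}}}",
            name, queueList);
    }

    public static void main(String[] args) {
        System.out.println(address("mq-consumer-1", "'queue1', 'queue2'"));
    }
}
```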
> > > >
> > > > On Thu, Apr 20, 2017 at 9:08 AM, Keith W <[hidden email]> wrote:
> > > > >
> > > > > Hi Ramayan
> > > > >
> > > > > We have been discussing your problem here and have a couple of questions.
> > > > >
> > > > > I have been experimenting with use-cases based on your descriptions above, but so far have been unsuccessful in reproducing a "java.lang.OutOfMemoryError: Direct buffer memory" condition. The direct memory usage reflects the expected model: it levels off when the flow to disk threshold is reached, and direct memory is released as messages are consumed, until the minimum size for caching of direct memory is reached.
> > > > >
> > > > > 1] For clarity let me check: we believe that when you say "patch to use MultiQueueConsumer" you are referring to the patch attached to QPID-7462 "Add experimental "pull" consumers to the broker", and you are using a combination of this "x-pull-only" with the standard "x-multiqueue" feature. Is this correct?
> > > > >
> > > > > 2] One idea we had here relates to the size of the virtualhost IO pool. As you know from the documentation, the Broker caches/reuses direct memory internally, but the documentation fails to mention that each pooled virtualhost IO thread also grabs a chunk (256K) of direct memory from this cache. By default the virtual host IO pool is sized Math.max(Runtime.getRuntime().availableProcessors() * 2, 64), so if you have a machine with a very large number of cores, you may have a surprisingly large amount of direct memory assigned to virtualhost IO threads. Check the value of connectionThreadPoolSize on the virtualhost (http://<server>:<port>/api/latest/virtualhost/<virtualhostnodename>/<virtualhostname>) to see what value is in force. What is it? It is possible to tune the pool size using the context variable virtualhost.connectionThreadPool.size.
> > > > >
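The baseline direct memory claimed by the virtualhost IO threads, as Keith describes it above, can be estimated with a quick sketch. The pool-size formula and the 256K-per-thread figure are taken from the message above, not verified against the broker source:

```java
public class IoPoolDirectMemoryEstimate {
    // Default virtualhost IO pool sizing, as quoted above.
    static int poolSize(int availableProcessors) {
        return Math.max(availableProcessors * 2, 64);
    }

    // Each pooled IO thread is said to grab a 256 KB chunk of direct memory.
    static long baselineDirectBytes(int availableProcessors) {
        return poolSize(availableProcessors) * 256L * 1024L;
    }

    public static void main(String[] args) {
        // A 40-core machine (as in this thread) gives an 80-thread pool,
        // i.e. 80 * 256 KB = 20 MB of direct memory before any messages arrive.
        System.out.println(poolSize(40));
        System.out.println(baselineDirectBytes(40) / (1024 * 1024) + " MB");
    }
}
```

On this estimate, the IO pool alone cannot explain the multi-gigabyte growth seen in the thread, which is consistent with Ramayan's observation that the baseline is small.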
> > > > > 3] Tell me if you are tuning the Broker in any way beyond the direct/heap memory settings you have told us about already. For instance, are you changing any of the direct memory pooling settings (broker.directByteBufferPoolSize), the default network buffer size (qpid.broker.networkBufferSize), or applying any other non-standard settings?
> > > > >
> > > > > 4] How many virtual hosts do you have on the Broker?
> > > > >
> > > > > 5] What is the consumption pattern of the messages? Do you consume in a strictly FIFO fashion, or are you making use of message selectors and/or any of the out-of-order queue types (LVQs, priority queues or sorted queues)?
> > > > >
> > > > > 6] Is it just the 0.16 client involved in the application? Can I check that you are not using any of the AMQP 1.0 clients (org.apache.qpid:qpid-jms-client or org.apache.qpid:qpid-amqp-1-0-client) in the software stack (as either consumers or producers)?
> > > > >
> > > > > Hopefully the answers to these questions will get us closer to a reproduction. If you are able to reliably reproduce it, please share the steps with us.
> > > > >
> > > > > Kind regards, Keith.
> > > > >
> > > > >
> > > > >
> > > > > On 20 April 2017 at 10:21, Ramayan Tiwari <[hidden email]> wrote:
> > > > > >
> > > > > > After a lot of log mining, we might have a way to explain the sustained increase in DirectMemory allocation: the correlation seems to be with the growth in size of a queue that is getting consumed, but at a much slower rate than producers are putting messages on it.
> > > > > >
> > > > > > The pattern we see is that in each instance of broker crash, there is at least one queue (usually 1 queue) whose size kept growing steadily. It'd be of significant size but not the largest queue -- usually there are multiple larger queues -- but it was different from other queues in that its size was growing steadily. The queue would also be moving, but its processing rate was not keeping up with the enqueue rate.
> > > > > >
> > > > > > Our theory, which might be totally wrong: if a queue is moving the entire time, maybe the broker keeps reusing the same buffer in direct memory for the queue, and keeps adding onto its end to accommodate new messages. But because the queue is active all the time and we're pointing to the same buffer, the space allocated for messages at the head of the queue/buffer doesn't get reclaimed, even long after those messages have been processed. Just a theory.
> > > > > >
> > > > > > We are also trying to reproduce this using some perf tests that enqueue with the same pattern, and will update with the findings.
> > > > > >
> > > > > > Thanks
> > > > > > Ramayan
> > > > > >
> > > > > > On Wed, Apr 19, 2017 at 6:52 PM, Ramayan Tiwari <[hidden email]> wrote:
> > > > > > >
> > > > > > > Another issue that we noticed: when the broker goes OOM due to direct memory, it doesn't create a heap dump (specified by "-XX:+HeapDumpOnOutOfMemoryError"), even though the OOM error is the same as what is mentioned in the Oracle JVM docs ("java.lang.OutOfMemoryError").
> > > > > > >
> > > > > > > Has anyone been able to find a way to get a heap dump for a DM OOM?
> > > > > > >
> > > > > > > - Ramayan
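Absent heap dumps on direct-memory OOM, direct-buffer usage can at least be sampled from inside the JVM via the platform buffer-pool MXBeans. This is the standard java.lang.management API, not a Qpid-specific interface, so it is one hedged option for tracking DM growth before a crash:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class DirectPoolMonitor {
    public static void main(String[] args) {
        // HotSpot exposes "direct" and "mapped" buffer pools as MXBeans;
        // getMemoryUsed()/getTotalCapacity() cover java.nio direct buffers.
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            System.out.printf("%s: count=%d used=%d capacity=%d%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
```

Polling this periodically (or exposing it via JMX alongside the broker's own MBeans) would give a timeline of direct allocation even when no dump is produced at crash time.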
> > > > > > >
> > > > > > > On Wed, Apr 19, 2017 at 11:21 AM, Ramayan Tiwari <[hidden email]> wrote:
> > > > > > > >
> > > > > > > > Alex,
> > > > > > > >
> > > > > > > > Below are the flow to disk logs from a broker that has 3 million+ messages at this time. We only have one virtual host. Time is in GMT. It looks like flow to disk is active on the whole virtual host and not at a queue level.
> > > > > > > >
> > > > > > > > When the same broker went OOM yesterday, I did not see any flow to disk logs from when it was started until it crashed (it crashed twice within 4hrs).
> > > > > > > >
> > > > > > > > 4/19/17 4:17:43.509 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3356539KB exceeds threshold 3355443KB
> > > > > > > > 4/19/17 2:31:13.502 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3354866KB within threshold 3355443KB
> > > > > > > > 4/19/17 2:28:43.511 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3358509KB exceeds threshold 3355443KB
> > > > > > > > 4/19/17 2:20:13.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353501KB within threshold 3355443KB
> > > > > > > > 4/19/17 2:18:13.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3357544KB exceeds threshold 3355443KB
> > > > > > > > 4/19/17 2:08:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353236KB within threshold 3355443KB
> > > > > > > > 4/19/17 2:08:13.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3356704KB exceeds threshold 3355443KB
> > > > > > > > 4/19/17 2:00:43.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353511KB within threshold 3355443KB
> > > > > > > > 4/19/17 2:00:13.504 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3357948KB exceeds threshold 3355443KB
> > > > > > > > 4/19/17 1:50:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355310KB within threshold 3355443KB
> > > > > > > > 4/19/17 1:47:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3365624KB exceeds threshold 3355443KB
> > > > > > > > 4/19/17 1:43:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355136KB within threshold 3355443KB
> > > > > > > > 4/19/17 1:31:43.509 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3358683KB exceeds threshold 3355443KB
> > > > > > > >
> > > > > > > > After the production release (2 days back), we have seen 4 crashes in 3 different brokers. This is the most pressing concern for us in deciding whether we should roll back to 0.32. Any help is greatly appreciated.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Ramayan
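The 3355443KB threshold in these BRK-1014/BRK-1015 logs is consistent with the 80% flowToDiskThreshold against 4GB of max direct memory mentioned earlier in the thread. A quick check, assuming the broker computes the threshold as a simple percentage of max direct memory and logs it in whole KB:

```java
public class FlowToDiskThresholdCheck {
    // Threshold as a fraction of max direct memory, expressed in whole KB
    // the way the broker log prints it (assumed computation).
    static long thresholdKb(long maxDirectBytes, double fraction) {
        return (long) (maxDirectBytes * fraction) / 1024;
    }

    public static void main(String[] args) {
        long fourGb = 4L * 1024 * 1024 * 1024;
        // Matches the 3355443KB threshold in the logs above.
        System.out.println(thresholdKb(fourGb, 0.8));
    }
}
```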
> > > > > > > >
> > > > > > > > On Wed, Apr 19, 2017 at 9:36 AM, Oleksandr Rudyy <[hidden email]> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Ramayan,
> > > > > > > > > Thanks for the details. I would like to clarify whether flow to disk was triggered today for the 3 million messages?
> > > > > > > > >
> > > > > > > > > The following logs are issued for flow to disk:
> > > > > > > > > BRK-1014 : Message flow to disk active : Message memory use {0,number,#}KB exceeds threshold {1,number,#.##}KB
> > > > > > > > > BRK-1015 : Message flow to disk inactive : Message memory use {0,number,#}KB within threshold {1,number,#.##}KB
> > > > > > > > >
> > > > > > > > > Kind Regards,
> > > > > > > > > Alex
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On 19 April 2017 at 17:10, Ramayan Tiwari <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Alex,
> > > > > > > > > >
> > > > > > > > > > Thanks for your response, here are the details:
> > > > > > > > > >
> > > > > > > > > > We use a "direct" exchange, without persistence (we specify NON_PERSISTENT while sending from the client) and use the BDB store. We use the JSON virtual host type. We are not using SSL.
> > > > > > > > > >
> > > > > > > > > > When the broker went OOM, we had around 1.3 million messages with a 100-byte average message size. Direct memory allocation (value read from the MBean) kept going up, even though it wouldn't need more DM to store that many messages. DM allocated persisted at 99% for about three and a half hours before crashing.
> > > > > > > > > >
> > > > > > > > > > Today, on the same broker, we have 3 million messages (same message size) and DM allocated is only at 8%. This seems like there is some issue with de-allocation, or a leak.
> > > > > > > > > >
> > > > > > > > > > I have uploaded the memory utilization graph here:
> > > > > > > > > > https://drive.google.com/file/d/0Bwi0MEV3srPRVHFEbDlIYUpLaUE/view?usp=sharing
> > > > > > > > > > Blue line is DM allocated, Yellow is DM used (sum of queue payload) and Red is heap usage.
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > > Ramayan
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 19, 2017 at 4:10 AM, Oleksandr Rudyy <[hidden email]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Ramayan,
> > > > > > > > > > >
> > > > > > > > > > > Could you please share with us the details of the messaging use case(s) which ended up in OOM on the broker side? I would like to reproduce the issue on my local broker in order to fix it.
> > > > > > > > > > >
> > > > > > > > > > > I would appreciate it if you could provide as many details as possible, including messaging topology, message persistence type, message sizes, volumes, etc.
> > > > > > > > > > >
> > > > > > > > > > > Qpid Broker 6.0.x uses direct memory for keeping message content and for receiving/sending data. Each plain connection utilizes 512K of direct memory. Each SSL connection uses 1M of direct memory. Your memory settings look OK to me.
> > > > > > > > > > >
> > > > > > > > > > > Kind Regards,
> > > > > > > > > > > Alex
> > > > > > > > > > >
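The per-connection figures Alex quotes above line up with the baseline Ramayan reported. A rough sketch of the arithmetic (the 512K/1M figures come from the message above; the 410-connection count from earlier in the thread):

```java
public class ConnectionBaselineEstimate {
    // Per-connection direct memory figures, as quoted above.
    static final long PLAIN_CONN_BYTES = 512L * 1024;
    static final long SSL_CONN_BYTES = 1024L * 1024;

    static long baselineMb(int plainConnections, int sslConnections) {
        return (plainConnections * PLAIN_CONN_BYTES
                + sslConnections * SSL_CONN_BYTES) / (1024 * 1024);
    }

    public static void main(String[] args) {
        // ~410 plain (non-SSL) connections, as reported in this thread:
        // 410 * 512 KB = 205 MB, close to the observed ~230 MB baseline.
        System.out.println(baselineMb(410, 0) + " MB");
    }
}
```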
> > > > > > > > > > >
> > > > > > > > > > > On 18 April 2017 at 23:39, Ramayan Tiwari <[hidden email]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Hi All,
> > > > > > > > > > > >
> > > > > > > > > > > > We are using Java broker 6.0.5, with a patch to use the MultiQueueConsumer feature. We just finished deploying to production and saw a couple of instances of broker OOM due to running out of DirectMemory buffer (exceptions at the end of this email).
> > > > > > > > > > > >
> > > > > > > > > > > > Here is our setup:
> > > > > > > > > > > > 1. Max heap 12g, max direct memory 4g (this is the opposite of the recommendation; however, for our use case the message payload is really small, ~400 bytes, which is way less than the per-message overhead of 1KB). In perf testing, we were able to put 2 million messages without any issues.
> > > > > > > > > > > > 2. ~400 connections to the broker.
> > > > > > > > > > > > 3. Each connection has 20 sessions, and there is one multi-queue consumer attached to each session, listening to around 1000 queues.
> > > > > > > > > > > > 4. We are still using the 0.16 client (I know).
> > > > > > > > > > > >
> > > > > > > > > > > > With the above setup, the baseline utilization (without any messages) for direct memory was around 230mb (with 410 connections each taking 500KB).
> > > > > > > > > > > >
> > > > > > > > > > > > Based on our understanding of broker memory allocation, message payload should be the only thing adding to direct memory utilization (on top of baseline); however, we are experiencing something completely different. In our last broker crash, we see that the broker is constantly running with 90%+ direct memory allocated, even when the message payload sum from all the queues is only 6-8% (these percentages are against the available DM of 4gb). During these high DM usage periods, heap usage was around 60% (of 12gb).
> > > > > > > > > > > >
> > > > > > > > > > > > We would like some help in understanding what could be the reason for these high DM allocations. Are there things other than message payload and AMQP connections which use DM and could be contributing to this high usage?
> > > > > > > > > > > >
> > > > > > > > > > > > Another thing where we are puzzled is the de-allocation of DM byte buffers. From log mining of heap and DM utilization, de-allocation of DM doesn't correlate with heap GC. If anyone has seen any documentation related to this, it would be very helpful if you could share that.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks
> > > > > > > > > > > > Ramayan
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > *Exceptions*
> > > > > > > > > > > >
> > > > > > > > > > > > java.lang.OutOfMemoryError: Direct buffer memory
> > > > > > > > > > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
> > > > > > > > > > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
> > > > > > > > > > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
> > > > > > > > > > > > at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.restoreApplicationBufferForWrite(NonBlockingConnectionPlainDelegate.java:93) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.processData(NonBlockingConnectionPlainDelegate.java:60) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at org.apache.qpid.server.transport.NonBlockingConnection.doRead(NonBlockingConnection.java:506) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at org.apache.qpid.server.transport.NonBlockingConnection.doWork(NonBlockingConnection.java:285) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at org.apache.qpid.server.transport.NetworkConnectionScheduler.processConnection(NetworkConnectionScheduler.java:124) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.processConnection(SelectorThread.java:504) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask.performSelect(SelectorThread.java:337) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(SelectorThread.java:87) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
> > > > > > > > > > > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
> > > > > > > > > > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
> > > > > > > > > > > >
> > > > > > > > > > > > *Second exception*
> > > > > > > > > > > > java.lang.OutOfMemoryError: Direct buffer memory
> > > > > > > > > > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
> > > > > > > > > > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
> > > > > > > > > > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
> > > > > > > > > > > > at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.<init>(NonBlockingConnectionPlainDelegate.java:45) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at org.apache.qpid.server.transport.NonBlockingConnection.setTransportEncryption(NonBlockingConnection.java:625) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at org.apache.qpid.server.transport.NonBlockingConnection.<init>(NonBlockingConnection.java:117) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at org.apache.qpid.server.transport.NonBlockingNetworkTransport.acceptSocketChannel(NonBlockingNetworkTransport.java:158) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask$1.run(SelectorThread.java:191) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
> > > > > > > > > > > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
> > > > > > > > > > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

QPID-7753-Temporary_workaround.patch (1K) Download Attachment

Re: Java broker OOM due to DirectMemory

Ramayan Tiwari
Hi Lorenz,

Thanks so much for the patch. We have a perf test now to reproduce this
issue, so we did test with 256KB, 64KB and 4KB network byte buffer. None of
these configurations help with the issue (or give any more breathing room)
for our use case. We would like to share the perf analysis with the
community:

https://docs.google.com/document/d/1Wc1e-id-WlpI7FGU1Lx8XcKaV8sauRp82T5XZVU-RiM/edit?usp=sharing

Feel free to comment on the doc if certain details are incorrect or if
there are questions.

Since the short-term solution doesn't help us, we are very interested in
some details on how the community plans to address this; a high-level
description of the approach would be very helpful for us in brainstorming
how our use cases fit with the solution.

- Ramayan

On Fri, Apr 28, 2017 at 9:34 AM, Lorenz Quack <[hidden email]>
wrote:

> Hello Ramayan,
>
> We are still working on a fix for this issue.
> In the meantime we had an idea to potentially work around the issue until
> a proper fix is released.
>
> The idea is to decrease the qpid network buffer size the broker uses.
> While this still allows for sparsely populated buffers it would improve
> the overall occupancy ratio.
>
> Here are the steps to follow:
>  * ensure you are not using TLS
>  * apply the attached patch
>  * figure out the size of the largest messages you are sending (including
> header and some overhead)
>  * set the context variable "qpid.broker.networkBufferSize" to that value
> but not smaller than 4096
>  * test
>
> Decreasing the qpid network buffer size automatically limits the maximum
> AMQP frame size.
> Since you are using a very old client we are not sure how well it copes
> with small frame sizes where it has to split a message across multiple
> frames.
> Therefore, to play it safe you should not set it smaller than the largest
> messages (+ header + overhead) you are sending.
> I do not know what message sizes you are sending but AMQP imposes the
> restriction that the framesize cannot be smaller than 4096 bytes.
> In the qpid broker the default currently is 256 kB.
>
> In the current state the broker does not allow setting the network buffer
> to values smaller than 64 kB to allow TLS frames to fit into one network
> buffer.
> I attached a patch to this mail that lowers that restriction to the limit
> imposed by AMQP (4096 Bytes).
> Obviously, you should not use this when using TLS.
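Lorenz's sizing rule above can be expressed as a small calculation (a sketch, not broker code; the helper name and the 1 KB header/overhead figure are illustrative assumptions):

```java
public class BufferSizeChooser {
    static final int AMQP_MIN_FRAME_SIZE = 4096; // AMQP-imposed minimum frame size

    // Pick a qpid.broker.networkBufferSize value per the guidance above:
    // at least the largest message plus header/overhead, never below 4096.
    static int chooseNetworkBufferSize(int largestMessageBytes, int headerAndOverheadBytes) {
        int needed = largestMessageBytes + headerAndOverheadBytes;
        return Math.max(needed, AMQP_MIN_FRAME_SIZE);
    }

    public static void main(String[] args) {
        // e.g. ~400-byte payloads plus a generous 1 KB of header/overhead
        System.out.println(chooseNetworkBufferSize(400, 1024)); // prints 4096
    }
}
```

For the small messages described in this thread the AMQP minimum dominates, so 4096 would be the value to set.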
>
>
> I hope this reduces the problems you are currently facing until we can
> complete the proper fix.
>
> Kind regards,
> Lorenz
>
>
> On Fri, 2017-04-21 at 09:17 -0700, Ramayan Tiwari wrote:
> > Thanks so much Keith and the team for finding the root cause. We are so
> > relieved that the root cause will be fixed shortly.
> >
> > A couple of things that I forgot to mention about the mitigation steps we
> > took during the last incident:
> > 1) We triggered GC from the JMX bean multiple times; it did not help in
> > reducing allocated DM.
> > 2) We also killed all the AMQP connections to the broker when DM was at
> > 80%. This did not help either. To kill the connections, we used JMX to
> > get the list of all open AMQP connections and called close on each from
> > the JMX MBean.
> >
> > I am hoping the above two are not related to the root cause, but wanted
> > to bring it up in case it is relevant.
> >
> > Thanks
> > Ramayan
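For readers following the thread: the "DM allocated" figures discussed here can be read from the JVM's standard direct buffer pool MXBean (a generic JDK sketch, not broker-specific code):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectMemoryProbe {
    public static void main(String[] args) {
        // The "direct" pool tracks all direct ByteBuffers in the JVM,
        // including those the broker allocates for network buffers.
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                System.out.printf("direct buffers: count=%d used=%d bytes capacity=%d bytes%n",
                        pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
            }
        }
    }
}
```

The same values are exposed remotely as the `java.nio:type=BufferPool,name=direct` MBean, which is presumably what the log-mining in this thread sampled.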
> >
> > On Fri, Apr 21, 2017 at 8:29 AM, Keith W <[hidden email]> wrote:
> >
> > >
> > > Hello Ramayan
> > >
> > > I believe I understand the root cause of the problem.  We have
> > > identified a flaw in the direct memory buffer management employed by
> > > Qpid Broker-J which for some messaging use-cases can lead to the
> > > direct memory OOM you describe.  For the issue to manifest, the
> > > producing application needs to use a single connection for the
> > > production of messages, some of which are short-lived (i.e. are
> > > consumed quickly) whilst others remain on the queue for some time.
> > > Priority queues, sorted queues and consumers utilising selectors that
> > > result in some messages being left on the queue could all produce
> > > this pattern.  The pattern leads to sparsely occupied 256K network
> > > buffers which cannot be released or reused until every message that
> > > references a 'chunk' of them is either consumed or flown to disk.
> > > The problem was introduced with Qpid v6.0 and exists in v6.1 and
> > > trunk too.
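The retention pattern Keith describes can be illustrated outside the broker (a hypothetical sketch, not broker code): slices of a direct ByteBuffer share the parent's native memory, so one long-lived slice pins the entire 256K buffer.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class SparseBufferSketch {
    public static void main(String[] args) {
        // One 256K "network buffer", like those the broker pools.
        ByteBuffer net = ByteBuffer.allocateDirect(256 * 1024);

        // Messages are handed out as slices ("chunks") of that buffer.
        List<ByteBuffer> chunks = new ArrayList<>();
        for (int offset = 0; offset < 256 * 1024; offset += 1024) {
            net.position(offset).limit(offset + 400); // ~400-byte payload per chunk
            chunks.add(net.slice());
            net.clear();
        }

        // Consuming all but one message still pins the full 256K: every
        // slice shares the parent's native memory, so a single long-lived
        // message keeps the whole buffer from being released or reused.
        ByteBuffer longLived = chunks.get(0);
        chunks.clear();                           // short-lived messages consumed
        System.out.println(longLived.capacity()); // prints 400, yet 256K stays pinned
    }
}
```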
> > >
> > > The flow to disk feature is not helping us here because its algorithm
> > > considers only the size of live messages on the queues. If the
> > > accumulative live size does not exceed the threshold, the messages
> > > aren't flown to disk. I speculate that when you observed earlier
> > > today that moving messages caused direct memory usage to drop, the
> > > message movement caused a queue to go over its threshold, causing
> > > messages to be flown to disk and their direct memory references to be
> > > released.  The logs will confirm whether this is so.
> > >
> > > I have not identified an easy workaround at the moment.  Decreasing
> > > the flow to disk threshold and/or increasing available direct memory
> > > should alleviate the problem and may be an acceptable short-term
> > > workaround.  If it were possible for the publishing application to
> > > publish short-lived and long-lived messages on two separate JMS
> > > connections, this would avoid the defect.
> > >
> > > QPID-7753 tracks this issue and QPID-7754 tracks a related problem.
> > > We intend to be working on these early next week and will be aiming
> > > for a fix that is back-portable to 6.0.
> > >
> > > Apologies that you have run into this defect and thanks for reporting.
> > >
> > > Thanks, Keith
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On 21 April 2017 at 10:21, Ramayan Tiwari <[hidden email]>
> > > wrote:
> > > >
> > > > Hi All,
> > > >
> > > > We have been monitoring the brokers every day, and today we found
> > > > one instance where a broker's DM was constantly going up and it was
> > > > about to crash, so we experimented with some mitigations, one of
> > > > which caused the DM to come down. Following are the details, which
> > > > might help in understanding the issue:
> > > >
> > > > Traffic scenario:
> > > >
> > > > DM allocation had been constantly going up and was at 90%. There
> > > > were two queues which seemed to align with the theories that we
> > > > had. Q1's size had been large right after the broker start and had
> > > > slow consumption of messages; queue size only reduced from 76MB to
> > > > 75MB over a period of 6hrs. Q2, on the other hand, started small
> > > > and was gradually growing; queue size went from 7MB to 10MB in
> > > > 6hrs. There were other queues with traffic during this time.
> > > >
> > > > Action taken:
> > > >
> > > > 1. Moved all the messages from Q2 (since this was our original
> > > > theory) to Q3 (already created but with no messages in it). This
> > > > did not help with the DM growth.
> > > > 2. Moved all the messages from Q1 to Q4 (already created but with
> > > > no messages in it). This reduced DM allocation from 93% to 31%.
> > > >
> > > > We have the heap dump and thread dump from when the broker was at
> > > > 90% DM allocation. We are going to analyze them to see if we can
> > > > get some clue. We wanted to share this new information, which might
> > > > help in reasoning about the memory issue.
> > > >
> > > > - Ramayan
> > > >
> > > >
> > > > On Thu, Apr 20, 2017 at 11:20 AM, Ramayan Tiwari <[hidden email]>
> > > > wrote:
> > > > >
> > > > >
> > > > > Hi Keith,
> > > > >
> > > > > Thanks so much for your response and for digging into the issue.
> > > > > Below are the answers to your questions:
> > > > >
> > > > > 1) Yes, we are using QPID-7462 with 6.0.5. We couldn't use 6.1,
> > > > > where it was released, because we need JMX support. Here is the
> > > > > destination format:
> > > > >
> > > > > "%s ; {node : { type : queue }, link : { x-subscribes : {
> > > > > arguments : { x-multiqueue : [%s], x-pull-only : true }}}}"
> > > > >
> > > > > 2) Our machines have 40 cores, which will make the number of
> > > > > threads 80. This might not be an issue, because this will show up
> > > > > in the baseline DM allocated, which is only 6% (of 4GB) when we
> > > > > just bring up the broker.
> > > > >
> > > > > 3) The only setting that we tuned WRT DM is flowToDiskThreshold,
> > > > > which is set at 80% now.
> > > > >
> > > > > 4) Only one virtual host in the broker.
> > > > >
> > > > > 5) Most of our queues (99%) are priority; we also have 8-10
> > > > > sorted queues.
> > > > >
> > > > > 6) Yes, we are using the standard 0.16 client and not AMQP 1.0
> > > > > clients. The connection log line looks like:
> > > > > CON-1001 : Open : Destination : AMQP(IP:5672) : Protocol Version :
> > > > > 0-10 : Client ID : test : Client Version : 0.16 : Client Product : qpid
> > > > >
> > > > > We had another broker crash about an hour back; we do see the
> > > > > same patterns:
> > > > > 1) There is a queue which is constantly growing; enqueue is
> > > > > faster than dequeue on that queue for a long period of time.
> > > > > 2) Flow to disk didn't kick in at all.
> > > > >
> > > > > This graph shows memory growth (red line - heap, blue - DM
> > > > > allocated, yellow - DM used):
> > > > > https://drive.google.com/file/d/0Bwi0MEV3srPRdVhXdTBncHJLY2c/view?usp=sharing
> > > > >
> > > > > The graph below shows growth on a single queue (there are 10-12
> > > > > other queues with traffic as well, some of larger size than this
> > > > > queue):
> > > > > https://drive.google.com/file/d/0Bwi0MEV3srPRWmNGbDNGUkJhQ0U/view?usp=sharing
> > > > >
> > > > > A couple of questions:
> > > > > 1) Is there any developer-level doc/design spec on how Qpid uses
> > > > > DM?
> > > > > 2) We are not getting heap dumps automatically when the broker
> > > > > crashes due to DM (HeapDumpOnOutOfMemoryError is not respected).
> > > > > Has anyone found a way around this problem?
> > > > >
> > > > > Thanks
> > > > > Ramayan
> > > > >
> > > > > On Thu, Apr 20, 2017 at 9:08 AM, Keith W <[hidden email]> wrote:
> > > > > >
> > > > > > Hi Ramayan
> > > > > >
> > > > > > We have been discussing your problem here and have a couple of
> > > > > > questions.
> > > > > >
> > > > > > I have been experimenting with use-cases based on your
> > > > > > descriptions above but, so far, have been unsuccessful in
> > > > > > reproducing a "java.lang.OutOfMemoryError: Direct buffer
> > > > > > memory" condition.  The direct memory usage reflects the
> > > > > > expected model: it levels off when the flow to disk threshold
> > > > > > is reached, and direct memory is released as messages are
> > > > > > consumed until the minimum size for caching of direct memory is
> > > > > > reached.
> > > > > >
> > > > > > 1] For clarity, let me check: we believe when you say "patch to
> > > > > > use MultiQueueConsumer" you are referring to the patch attached
> > > > > > to QPID-7462 "Add experimental "pull" consumers to the broker"
> > > > > > and you are using a combination of this "x-pull-only" with the
> > > > > > standard "x-multiqueue" feature.  Is this correct?
> > > > > >
> > > > > > 2] One idea we had here relates to the size of the virtualhost
> > > > > > IO pool.  As you know from the documentation, the Broker
> > > > > > caches/reuses direct memory internally, but the documentation
> > > > > > fails to mention that each pooled virtualhost IO thread also
> > > > > > grabs a chunk (256K) of direct memory from this cache.  By
> > > > > > default the virtual host IO pool is sized
> > > > > > Math.max(Runtime.getRuntime().availableProcessors() * 2, 64),
> > > > > > so if you have a machine with a very large number of cores, you
> > > > > > may have a surprisingly large amount of direct memory assigned
> > > > > > to virtualhost IO threads.  Check the value of
> > > > > > connectionThreadPoolSize on the virtualhost
> > > > > > (http://<server>:<port>/api/latest/virtualhost/<virtualhostnodename>/<virtualhostname>)
> > > > > > to see what value is in force.  What is it?  It is possible to
> > > > > > tune the pool size using the context variable
> > > > > > virtualhost.connectionThreadPool.size.
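On a 40-core machine, the default sizing Keith quotes works out as follows (a sketch of the arithmetic; the 256K-per-thread figure is taken from his description):

```java
public class IoPoolFootprint {
    // Default virtualhost IO pool sizing quoted above.
    static int defaultPoolSize(int availableProcessors) {
        return Math.max(availableProcessors * 2, 64);
    }

    public static void main(String[] args) {
        int cores = 40;                          // the reporter's machines
        int threads = defaultPoolSize(cores);    // 80 threads
        long chunkBytes = 256 * 1024;            // 256K direct memory per pooled IO thread
        System.out.printf("%d IO threads -> %d KB of direct memory%n",
                threads, threads * chunkBytes / 1024); // 80 threads -> 20480 KB (20 MB)
    }
}
```

At 20 MB this is real but modest next to 4 GB, which matches the reporter's observation that the baseline DM was only a few percent.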
> > > > > >
> > > > > > 3] Tell me if you are tuning the Broker in any way beyond the
> > > > > > direct/heap memory settings you have told us about already.
> > > > > > For instance, are you changing any of the direct memory pooling
> > > > > > settings (broker.directByteBufferPoolSize), the default network
> > > > > > buffer size (qpid.broker.networkBufferSize) or applying any
> > > > > > other non-standard settings?
> > > > > >
> > > > > > 4] How many virtual hosts do you have on the Broker?
> > > > > >
> > > > > > 5] What is the consumption pattern of the messages?  Do you
> > > > > > consume in a strictly FIFO fashion, or are you making use of
> > > > > > message selectors or/and any of the out-of-order queue types
> > > > > > (LVQs, priority queues or sorted queues)?
> > > > > >
> > > > > > 6] Is it just the 0.16 client involved in the application?  Can
> > > > > > I check that you are not using any of the AMQP 1.0 clients
> > > > > > (org.apache.qpid:qpid-jms-client or
> > > > > > org.apache.qpid:qpid-amqp-1-0-client) in the software stack (as
> > > > > > either consumers or producers)?
> > > > > >
> > > > > > Hopefully the answers to these questions will get us closer to
> > > > > > a reproduction.  If you are able to reliably reproduce it,
> > > > > > please share the steps with us.
> > > > > >
> > > > > > Kind regards, Keith.
> > > > > >
> > > > > > On 20 April 2017 at 10:21, Ramayan Tiwari <[hidden email]> wrote:
> > > > > > >
> > > > > > > After a lot of log mining, we might have a way to explain
> > > > > > > the sustained increase in DirectMemory allocation; the
> > > > > > > correlation seems to be with the growth in the size of a
> > > > > > > queue that is getting consumed, but at a much slower rate
> > > > > > > than producers are putting messages on it.
> > > > > > >
> > > > > > > The pattern we see is that in each instance of a broker
> > > > > > > crash, there is at least one queue (usually one queue) whose
> > > > > > > size kept growing steadily. It'd be of significant size but
> > > > > > > not the largest queue -- usually there are multiple larger
> > > > > > > queues -- but it was different from other queues in that its
> > > > > > > size was growing steadily. The queue would also be moving,
> > > > > > > but its processing rate was not keeping up with the enqueue
> > > > > > > rate.
> > > > > > >
> > > > > > > Our theory, which might be totally wrong: if a queue is
> > > > > > > moving the entire time, maybe the broker keeps reusing the
> > > > > > > same buffer in direct memory for the queue, and keeps adding
> > > > > > > onto it at the end to accommodate new messages. But because
> > > > > > > it's active all the time and we're pointing to the same
> > > > > > > buffer, space allocated for messages at the head of the
> > > > > > > queue/buffer doesn't get reclaimed, even long after those
> > > > > > > messages have been processed. Just a theory.
> > > > > > >
> > > > > > > We are also trying to reproduce this using some perf tests
> > > > > > > to enqueue with the same pattern; we will update with the
> > > > > > > findings.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Ramayan
> > > > > > >
> > > > > > > On Wed, Apr 19, 2017 at 6:52 PM, Ramayan Tiwari
> > > > > > > <[hidden email]> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > Another issue that we noticed is that when the broker goes
> > > > > > > > OOM due to direct memory, it doesn't create a heap dump
> > > > > > > > (specified by "-XX:+HeapDumpOnOutOfMemoryError"), even
> > > > > > > > though the OOM error is the same
> > > > > > > > "java.lang.OutOfMemoryError" mentioned in the Oracle JVM
> > > > > > > > docs.
> > > > > > > >
> > > > > > > > Has anyone been able to find a way to get a heap dump for
> > > > > > > > a DM OOM?
> > > > > > > >
> > > > > > > > - Ramayan
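One option, since the automatic flag does not fire here, is to trigger a dump from a monitoring hook when direct memory usage crosses a threshold, using the standard HotSpot diagnostic MXBean (a sketch; the output path is illustrative, and the dumped file must not already exist):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    // Write a heap dump (HPROF format) to the given path on demand.
    public static void dump(String path) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, true); // true = dump only live objects
    }

    public static void main(String[] args) throws Exception {
        dump("/tmp/broker-dm.hprof");
    }
}
```

The same `dumpHeap` operation is also reachable remotely through JMX, which may be more practical for a production broker.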
> > > > > > > >
> > > > > > > > On Wed, Apr 19, 2017 at 11:21 AM, Ramayan Tiwari
> > > > > > > > <[hidden email]> wrote:
> > > > > > > > >
> > > > > > > > > Alex,
> > > > > > > > >
> > > > > > > > > Below are the flow to disk logs from the broker, which
> > > > > > > > > has 3 million+ messages at this time. We only have one
> > > > > > > > > virtual host. Time is in GMT. It looks like flow to disk
> > > > > > > > > is active on the whole virtual host and not at the queue
> > > > > > > > > level.
> > > > > > > > >
> > > > > > > > > When the same broker went OOM yesterday, I did not see
> > > > > > > > > any flow to disk logs from when it was started until it
> > > > > > > > > crashed (it crashed twice within 4hrs).
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 4/19/17 4:17:43.509 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3356539KB exceeds threshold 3355443KB
> > > > > > > > > 4/19/17 2:31:13.502 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3354866KB within threshold 3355443KB
> > > > > > > > > 4/19/17 2:28:43.511 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3358509KB exceeds threshold 3355443KB
> > > > > > > > > 4/19/17 2:20:13.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353501KB within threshold 3355443KB
> > > > > > > > > 4/19/17 2:18:13.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3357544KB exceeds threshold 3355443KB
> > > > > > > > > 4/19/17 2:08:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353236KB within threshold 3355443KB
> > > > > > > > > 4/19/17 2:08:13.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3356704KB exceeds threshold 3355443KB
> > > > > > > > > 4/19/17 2:00:43.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353511KB within threshold 3355443KB
> > > > > > > > > 4/19/17 2:00:13.504 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3357948KB exceeds threshold 3355443KB
> > > > > > > > > 4/19/17 1:50:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355310KB within threshold 3355443KB
> > > > > > > > > 4/19/17 1:47:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3365624KB exceeds threshold 3355443KB
> > > > > > > > > 4/19/17 1:43:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355136KB within threshold 3355443KB
> > > > > > > > > 4/19/17 1:31:43.509 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3358683KB exceeds threshold 3355443KB
> > > > > > > > >
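The threshold in these logs is consistent with the 80% flowToDiskThreshold mentioned earlier, applied to 4 GB of direct memory (a quick check of the arithmetic, not broker code):

```java
public class ThresholdCheck {
    public static void main(String[] args) {
        long directMemoryKb = 4L * 1024 * 1024;      // 4 GB expressed in KB = 4194304
        long thresholdKb = directMemoryKb * 80 / 100; // 80% flow-to-disk threshold
        System.out.println(thresholdKb);              // prints 3355443, matching BRK-1014/BRK-1015
    }
}
```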
> > > > > > > > >
> > > > > > > > > After the production release (2 days back), we have
> > > > > > > > > seen 4 crashes in 3 different brokers; this is the most
> > > > > > > > > pressing concern for us in deciding whether we should
> > > > > > > > > roll back to 0.32. Any help is greatly appreciated.
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Ramayan
> > > > > > > > >
> > > > > > > > > On Wed, Apr 19, 2017 at 9:36 AM, Oleksandr Rudyy
> > > > > > > > > <[hidden email]> wrote:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Ramayan,
> > > > > > > > > > Thanks for the details. I would like to clarify
> > > > > > > > > > whether flow to disk was triggered today for the 3
> > > > > > > > > > million messages.
> > > > > > > > > >
> > > > > > > > > > The following logs are issued for flow to disk:
> > > > > > > > > > BRK-1014 : Message flow to disk active :  Message memory use {0,number,#}KB exceeds threshold {1,number,#.##}KB
> > > > > > > > > > BRK-1015 : Message flow to disk inactive : Message memory use {0,number,#}KB within threshold {1,number,#.##}KB
> > > > > > > > > >
> > > > > > > > > > Kind Regards,
> > > > > > > > > > Alex
> > > > > > > > > >
> > > > > > > > > > On 19 April 2017 at 17:10, Ramayan Tiwari
> > > > > > > > > > <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Hi Alex,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for your response; here are the details:
> > > > > > > > > > >
> > > > > > > > > > > We use a "direct" exchange, without persistence (we
> > > > > > > > > > > specify NON_PERSISTENT while sending from the
> > > > > > > > > > > client) and use the BDB store. We use the JSON
> > > > > > > > > > > virtual host type. We are not using SSL.
> > > > > > > > > > >
> > > > > > > > > > > When the broker went OOM, we had around 1.3 million
> > > > > > > > > > > messages with a 100-byte average message size.
> > > > > > > > > > > Direct memory allocation (value read from the
> > > > > > > > > > > MBean) kept going up, even though it wouldn't need
> > > > > > > > > > > more DM to store that many messages. DM allocated
> > > > > > > > > > > persisted at 99% for about three and a half hours
> > > > > > > > > > > before the crash.
> > > > > > > > > > >
> > > > > > > > > > > Today, on the same broker, we have 3 million
> > > > > > > > > > > messages (same message size) and DM allocated is
> > > > > > > > > > > only at 8%. This seems like an issue with
> > > > > > > > > > > de-allocation, or a leak.
> > > > > > > > > > >
> > > > > > > > > > > I have uploaded the memory utilization graph here:
> > > > > > > > > > > https://drive.google.com/file/d/0Bwi0MEV3srPRVHFEbDlIYUpLaUE/view?usp=sharing
> > > > > > > > > > > The blue line is DM allocated, yellow is DM used
> > > > > > > > > > > (sum of queue payload) and red is heap usage.
> > > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > > Ramayan
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Apr 19, 2017 at 4:10 AM, Oleksandr Rudyy
> > > > > > > > > > > <[hidden email]> wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Ramayan,
> > > > > > > > > > > >
> > > > > > > > > > > > Could you please share with us the details of the
> > > > > > > > > > > > messaging use case(s) which ended up in OOM on
> > > > > > > > > > > > the broker side? I would like to reproduce the
> > > > > > > > > > > > issue on my local broker in order to fix it.
> > > > > > > > > > > >
> > > > > > > > > > > > I would appreciate it if you could provide as
> > > > > > > > > > > > much detail as possible, including messaging
> > > > > > > > > > > > topology, message persistence type, message
> > > > > > > > > > > > sizes, volumes, etc.
> > > > > > > > > > > >
> > > > > > > > > > > > Qpid Broker 6.0.x uses direct memory for keeping
> > > > > > > > > > > > message content and for receiving/sending data.
> > > > > > > > > > > > Each plain connection utilizes 512K of direct
> > > > > > > > > > > > memory. Each SSL connection uses 1M of direct
> > > > > > > > > > > > memory. Your memory settings look OK to me.
> > > > > > > > > > > >
> > > > > > > > > > > > Kind Regards,
> > > > > > > > > > > > Alex
> > > > > > > > > > > >
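Alex's per-connection figure lines up with the baseline reported at the top of the thread (a quick sanity check of the numbers, assuming the 512K plain-connection figure):

```java
public class ConnectionFootprint {
    public static void main(String[] args) {
        int connections = 410;               // reported open connection count
        long perConnKb = 512;                // direct memory per plain connection, per Alex
        long totalMb = connections * perConnKb / 1024;
        System.out.println(totalMb + " MB"); // prints "205 MB", close to the ~230mb baseline observed
    }
}
```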
> > > > > > > > > > > >
> > > > > > > > > > > > On 18 April 2017 at 23:39, Ramayan Tiwari
> > > > > > > > > > > > <[hidden email]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > >
> > > > > > > > > > > > > We are using Java broker 6.0.5, with patch to use MultiQueueConsumer
> > > > > > > > > > > > > feature. We just finished deploying to production and saw couple of
> > > > > > > > > > > > > instances of broker OOM due to running out of DirectMemory buffer
> > > > > > > > > > > > > (exceptions at the end of this email).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Here is our setup:
> > > > > > > > > > > > > 1. Max heap 12g, max direct memory 4g (this is opposite of what the
> > > > > > > > > > > > > recommendation is, however, for our use cause message payload is really
> > > > > > > > > > > > > small ~400bytes and is way less than the per message overhead of 1KB). In
> > > > > > > > > > > > > perf testing, we were able to put 2 million messages without any issues.
> > > > > > > > > > > > > 2. ~400 connections to broker.
> > > > > > > > > > > > > 3. Each connection has 20 sessions and there is one multi queue consumer
> > > > > > > > > > > > > attached to each session, listening to around 1000 queues.
> > > > > > > > > > > > > 4. We are still using 0.16 client (I know).
> > > > > > > > > > > > >
> > > > > > > > > > > > > With the above setup, the baseline utilization (without any messages) for
> > > > > > > > > > > > > direct memory was around 230mb (with 410 connection each taking 500KB).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Based on our understanding of broker memory allocation, message payload
> > > > > > > > > > > > > should be the only thing adding to direct memory utilization (on top of
> > > > > > > > > > > > > baseline), however, we are experiencing something completely different. In
> > > > > > > > > > > > > our last broker crash, we see that broker is constantly running with 90%+
> > > > > > > > > > > > > direct memory allocated, even when message payload sum from all the queues
> > > > > > > > > > > > > is only 6-8% (these % are against available DM of 4gb). During these high
> > > > > > > > > > > > > DM usage period, heap usage was around 60% (of 12gb).
> > > > > > > > > > > > >
> > > > > > > > > > > > > We would like some help in understanding what could be the reason of these
> > > > > > > > > > > > > high DM allocations. Are there things other than message payload and AMQP
> > > > > > > > > > > > > connection, which use DM and could be contributing to these high usage?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Another thing where we are puzzled is the de-allocation of DM byte buffers.
> > > > > > > > > > > > > From log mining of heap and DM utilization, de-allocation of DM doesn't
> > > > > > > > > > > > > correlate with heap GC. If anyone has seen any
> documentation
> > > > > > > > > > related to
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > this, it would be very helpful if you could share
> that.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > Ramayan
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > *Exceptions*
> > > > > > > > > > > > >
> > > > > > > > > > > > > java.lang.OutOfMemoryError: Direct buffer memory
> > > > > > > > > > > > > at java.nio.Bits.reserveMemory(Bits.java:658)
> > > ~[na:1.8.0_40]
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > at java.nio.DirectByteBuffer.<
> init>(DirectByteBuffer.java:
> > > 123)
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.
> java:311)
> > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > >
> > > > > > > > > > > > > at
> > > > > > > > > > > > > org.apache.qpid.bytebuffer.
> QpidByteBuffer.allocateDirect(
> > > > > > > > > > > > > QpidByteBuffer.java:474)
> > > > > > > > > > > > > ~[qpid-common-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > NonBlockingConnectionPlainD
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > elegate.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > restoreApplicationBufferForWrite(
> > > NonBlockingConnectionPlainDele
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > gate.java:93)
> > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > >
> > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > NonBlockingConnectionPlainDele
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > gate.processData(NonBlockingConnectionPlainDele
> > > gate.java:60)
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > NonBlockingConnection.doRead(
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > NonBlockingConnection.java:506)
> > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > NonBlockingConnection.doWork(
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > NonBlockingConnection.java:285)
> > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > NetworkConnectionScheduler.
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > processConnection(NetworkConnectionScheduler.
> java:124)
> > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > > org.apache.qpid.server.transport.SelectorThread$
> > > ConnectionPr
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ocessor.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > processConnection(SelectorThread.java:504)
> > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > > org.apache.qpid.server.transport.SelectorThread$
> > > > > > > > > > > > > SelectionTask.performSelect(
> SelectorThread.java:337)
> > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > >
> > > > > > > > > > > > > org.apache.qpid.server.transport.SelectorThread$
> > > SelectionTask.run(
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > SelectorThread.java:87)
> > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > > org.apache.qpid.server.
> transport.SelectorThread.run(
> > > > > > > > > > > > > SelectorThread.java:462)
> > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > > java.util.concurrent.ThreadPoolExecutor.runWorker(
> > > > > > > > > > > > > ThreadPoolExecutor.java:1142)
> > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > at
> > > > > > > > > > > > > java.util.concurrent.
> ThreadPoolExecutor$Worker.run(
> > > > > > > > > > > > > ThreadPoolExecutor.java:617)
> > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > at java.lang.Thread.run(Thread.java:745)
> ~[na:1.8.0_40]
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > *Second exception*
> > > > > > > > > > > > > java.lang.OutOfMemoryError: Direct buffer memory
> > > > > > > > > > > > > at java.nio.Bits.reserveMemory(Bits.java:658)
> > > ~[na:1.8.0_40]
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > at java.nio.DirectByteBuffer.<
> init>(DirectByteBuffer.java:
> > > 123)
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.
> java:311)
> > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > >
> > > > > > > > > > > > > at
> > > > > > > > > > > > > org.apache.qpid.bytebuffer.
> QpidByteBuffer.allocateDirect(
> > > > > > > > > > > > > QpidByteBuffer.java:474)
> > > > > > > > > > > > > ~[qpid-common-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > >
> > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > NonBlockingConnectionPlainDele
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > gate.<init>(NonBlockingConnectionPlainDele
> gate.java:45)
> > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > > org.apache.qpid.server.transport.
> NonBlockingConnection.
> > > > > > > > > > > > > setTransportEncryption(NonBlockingConnection.java:
> 625)
> > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > NonBlockingConnection.<init>(
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > NonBlockingConnection.java:117)
> > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > NonBlockingNetworkTransport.
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > acceptSocketChannel(NonBlockingNetworkTransport.
> java:158)
> > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > > org.apache.qpid.server.transport.SelectorThread$
> > > SelectionTas
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > k$1.run(
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > SelectorThread.java:191)
> > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > > org.apache.qpid.server.
> transport.SelectorThread.run(
> > > > > > > > > > > > > SelectorThread.java:462)
> > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > at
> > > > > > > > > > > > > java.util.concurrent.ThreadPoolExecutor.runWorker(
> > > > > > > > > > > > > ThreadPoolExecutor.java:1142)
> > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > at
> > > > > > > > > > > > > java.util.concurrent.
> ThreadPoolExecutor$Worker.run(
> > > > > > > > > > > > > ThreadPoolExecutor.java:617)
> > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > at java.lang.Thread.run(Thread.java:745)
> ~[na:1.8.0_40]
> > > > > > > > > > > > >
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]

Re: Java broker OOM due to DirectMemory

Lorenz Quack
Hi Ramayan,

The high-level plan is currently as follows:
 1) Periodically try to compact sparse direct memory buffers.
 2) Increase accuracy of messages' direct memory usage estimation to more
reliably trigger flow to disk.
 3) Add an additional flow to disk trigger based on the amount of allocated
direct memory.

In a little more detail:
 1) We plan on periodically checking the amount of direct memory usage and,
if it is above a
    threshold (50%), comparing the sum of all queue sizes with the amount
of allocated direct memory.
    If the ratio falls below a certain threshold we trigger a compaction
task which goes through all queues
    and copies a certain number of old message buffers into new ones,
thereby freeing the old buffers so
    that they can be returned to the buffer pool and reused.

 2) Currently we trigger flow to disk based on an estimate of how much
memory the messages on the
    queues consume. We had to use estimates because we did not have
accurate size numbers for
    message headers. By having accurate size information for message
headers we can more reliably
    enforce queue memory limits.

 3) The flow to disk trigger based on message size had another problem
which is more pertinent to the
    current issue. We only considered the size of the messages and not how
much memory we allocate
    to store those messages. In the FIFO use case those numbers will be
very close to each other but in
    use cases like yours we can end up with sparse buffers and the numbers
will diverge. Because of this
    divergence we do not trigger flow to disk in time and the broker can go
OOM.
    To fix the issue we want to add an additional flow to disk trigger
based on the amount of allocated direct
    memory. This should prevent the broker from going OOM even if the
compaction strategy outlined above
    should fail for some reason (e.g., the compaction task cannot keep up
with the arrival of new messages).
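The triggering logic in points 1 and 3 can be sketched roughly as follows. This is an illustrative sketch only: the class name, method names, and threshold values are hypothetical, not the actual Qpid Broker-J implementation.

```java
// Illustrative sketch of the two proposed triggers; all names and
// thresholds are hypothetical, not Qpid Broker-J code.
public class DirectMemoryTriggers {
    static final double COMPACTION_USAGE_THRESHOLD = 0.50;       // point 1: only check above 50% usage
    static final double OCCUPANCY_RATIO_THRESHOLD = 0.50;        // point 1: compact if buffers are half empty
    static final double FLOW_TO_DISK_ALLOCATED_THRESHOLD = 0.80; // point 3: safety-net trigger

    /** Point 1: compact when usage is high but buffers are sparsely occupied. */
    static boolean shouldCompact(long allocatedDirect, long maxDirect, long sumOfQueueSizes) {
        double usage = (double) allocatedDirect / maxDirect;
        if (usage < COMPACTION_USAGE_THRESHOLD) {
            return false;
        }
        // Ratio of live message bytes to allocated direct memory.
        double occupancy = (double) sumOfQueueSizes / allocatedDirect;
        return occupancy < OCCUPANCY_RATIO_THRESHOLD;
    }

    /** Point 3: flow to disk based on allocated (not just live) direct memory. */
    static boolean shouldFlowToDisk(long allocatedDirect, long maxDirect) {
        return (double) allocatedDirect / maxDirect > FLOW_TO_DISK_ALLOCATED_THRESHOLD;
    }

    public static void main(String[] args) {
        long maxDirect = 4L << 30; // 4 GB, as in this thread's configuration
        // 90% allocated but only ~8% of it is live payload: both triggers fire.
        System.out.println(shouldCompact((long) (0.9 * maxDirect), maxDirect, (long) (0.08 * maxDirect)));
        System.out.println(shouldFlowToDisk((long) (0.9 * maxDirect), maxDirect));
    }
}
```

With the numbers reported in this thread (90% allocated, 6-8% live payload), both conditions would fire, which is the behaviour the broker currently lacks.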

Currently, there are patches for the above points but they suffer from some
thread-safety issues that need to be addressed.

I hope this description helps. Any feedback is, as always, welcome.

Kind regards,
Lorenz



On Sat, Apr 29, 2017 at 12:00 AM, Ramayan Tiwari <[hidden email]>
wrote:

> Hi Lorenz,
>
> Thanks so much for the patch. We have a perf test now to reproduce this
> issue, so we did test with 256KB, 64KB and 4KB network byte buffer. None of
> these configurations help with the issue (or give any more breathing room)
> for our use case. We would like to share the perf analysis with the
> community:
>
> https://docs.google.com/document/d/1Wc1e-id-WlpI7FGU1Lx8XcKaV8sauRp82T5XZVU-RiM/edit?usp=sharing
>
> Feel free to comment on the doc if certain details are incorrect or if
> there are questions.
>
> Since the short term solution doesn't help us, we are very interested in
> getting some details on how the community plans to address this, a high
> level description of the approach will be very helpful for us in order to
> brainstorm our use cases along with this solution.
>
> - Ramayan
>
> On Fri, Apr 28, 2017 at 9:34 AM, Lorenz Quack <[hidden email]>
> wrote:
>
> > Hello Ramayan,
> >
> > We are still working on a fix for this issue.
> > In the meantime we had an idea for a potential workaround until
> > a proper fix is released.
> >
> > The idea is to decrease the qpid network buffer size the broker uses.
> > While this still allows for sparsely populated buffers it would improve
> > the overall occupancy ratio.
> >
> > Here are the steps to follow:
> >  * ensure you are not using TLS
> >  * apply the attached patch
> >  * figure out the size of the largest messages you are sending (including
> > header and some overhead)
> >  * set the context variable "qpid.broker.networkBufferSize" to that
> value
> > but not smaller than 4096
> >  * test
> >
> > Decreasing the qpid network buffer size automatically limits the maximum
> > AMQP frame size.
> > Since you are using a very old client we are not sure how well it copes
> > with small frame sizes where it has to split a message across multiple
> > frames.
> > Therefore, to play it safe you should not set it smaller than the largest
> > messages (+ header + overhead) you are sending.
> > I do not know what message sizes you are sending but AMQP imposes the
> > restriction that the framesize cannot be smaller than 4096 bytes.
> > In the qpid broker the default currently is 256 kB.
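The arithmetic behind this workaround is worth spelling out: in the worst case, a single long-lived ~400-byte message pins an entire network buffer, so the pinned direct memory per retained message equals the buffer size. A back-of-the-envelope sketch (illustrative numbers, not broker code):

```java
// Rough arithmetic on why a smaller network buffer helps: one long-lived
// message can pin an entire buffer, so worst-case pinned memory per
// retained message equals the buffer size. Illustrative only.
public class BufferWaste {
    /** Worst-case direct memory pinned by N retained messages, one per buffer. */
    static long worstCasePinnedBytes(int retainedMessages, int bufferSize) {
        return (long) retainedMessages * bufferSize;
    }

    public static void main(String[] args) {
        int retained = 10_000; // hypothetical count of long-lived ~400-byte messages
        // Default 256 KB buffers: ~2.4 GB pinned by only ~4 MB of payload.
        System.out.println(worstCasePinnedBytes(retained, 256 * 1024) / (1024 * 1024));
        // 4 KB buffers: ~39 MB pinned for the same messages.
        System.out.println(worstCasePinnedBytes(retained, 4 * 1024) / (1024 * 1024));
    }
}
```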
> >
> > In the current state the broker does not allow setting the network buffer
> > to values smaller than 64 kB to allow TLS frames to fit into one network
> > buffer.
> > I attached a patch to this mail that lowers that restriction to the limit
> > imposed by AMQP (4096 Bytes).
> > Obviously, you should not use this when using TLS.
> >
> >
> > I hope this reduces the problems you are currently facing until we can
> > complete the proper fix.
> >
> > Kind regards,
> > Lorenz
> >
> >
> > On Fri, 2017-04-21 at 09:17 -0700, Ramayan Tiwari wrote:
> > > Thanks so much Keith and the team for finding the root cause. We are so
> > > relieved that the root cause will be fixed shortly.
> > >
> > > Couple of things that I forgot to mention on the mitigation steps we
> took
> > > in the last incident:
> > > 1) We triggered GC from JMX bean multiple times, it did not help in
> > > reducing DM allocated.
> > > 2) We also killed all the AMQP connections to the broker when DM was at
> > > 80%. This did not help either. The way we killed connections: using JMX,
> > > we got the list of all open AMQP connections and called close from the
> > > JMX mbean.
> > >
> > > I am hoping the above two are not related to root cause, but wanted to
> > > bring it up in case this is relevant.
> > >
> > > Thanks
> > > Ramayan
> > >
> > > On Fri, Apr 21, 2017 at 8:29 AM, Keith W <[hidden email]> wrote:
> > >
> > > >
> > > > Hello Ramayan
> > > >
> > > > I believe I understand the root cause of the problem.  We have
> > > > identified a flaw in the direct memory buffer management employed by
> > > > Qpid Broker J which for some messaging use-cases can lead to the direct
> > > > memory OOM you describe.   For the issue to manifest, the producing
> > > > application needs to use a single connection for the production of
> > > > messages, some of which are short-lived (i.e. are consumed quickly)
> > > > whilst others remain on the queue for some time.  Priority queues,
> > > > sorted queues and consumers utilising selectors that result in some
> > > > messages being left on the queue could all produce this pattern.  The
> > > > pattern leads to sparsely occupied 256K net buffers which cannot be
> > > > released or reused until every message that references a 'chunk' of them
> > > > is either consumed or flown to disk.   The problem was introduced with
> > > > Qpid v6.0 and exists in v6.1 and trunk too.
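The sparse-buffer mechanism described above can be modelled in miniature: a pooled 256 KB network buffer is carved into per-message chunks and can only be recycled once every chunk has been released. The class below is a hypothetical model for illustration, not Qpid Broker-J's actual buffer classes.

```java
// Hypothetical model of the sparse-buffer defect: a 256 KB network buffer
// carved into per-message chunks, returnable to the pool only when *every*
// chunk has been released. Names are illustrative, not Qpid's classes.
import java.util.BitSet;

public class NetBuffer {
    static final int BUFFER_SIZE = 256 * 1024;
    private final BitSet liveChunks = new BitSet();
    private int chunkCount = 0;

    /** A message payload claims a chunk of this buffer; returns the chunk id. */
    public int claimChunk() {
        int id = chunkCount++;
        liveChunks.set(id);
        return id;
    }

    /** Called when the message is consumed or flown to disk. */
    public void releaseChunk(int id) {
        liveChunks.clear(id);
    }

    /** The whole 256 KB stays pinned until the last chunk is released. */
    public boolean reusable() {
        return liveChunks.isEmpty();
    }

    public static void main(String[] args) {
        NetBuffer buf = new NetBuffer();
        int shortLived = buf.claimChunk(); // consumed quickly
        int longLived = buf.claimChunk();  // stays on a priority/sorted queue
        buf.releaseChunk(shortLived);
        // One ~400-byte long-lived message still pins the full 256 KB buffer.
        System.out.println(buf.reusable()); // false until longLived is released
        buf.releaseChunk(longLived);
        System.out.println(buf.reusable());
    }
}
```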
> > > >
> > > > The flow to disk feature is not helping us here because its algorithm
> > > > considers only the size of live messages on the queues. If the
> > > > accumulative live size does not exceed the threshold, the messages
> > > > aren't flown to disk. I speculate that when you observed that moving
> > > > messages caused direct memory usage to drop earlier today, your
> > > > message movement caused a queue to go over threshold, causing messages
> > > > to be flown to disk and their direct memory references released.  The
> > > > logs will confirm this is so.
> > > >
> > > > I have not identified an easy workaround at the moment.   Decreasing
> > > > the flow to disk threshold and/or increasing available direct memory
> > > > should alleviate the problem and may be an acceptable short-term
> > > > workaround.  If it were possible for the publishing application to
> > > > publish short-lived and long-lived messages on two separate JMS
> > > > connections, this would avoid this defect.
> > > >
> > > > QPID-7753 tracks this issue and QPID-7754 tracks a related problem.
> > > > We intend to work on these early next week and will be aiming
> > > > for a fix that is back-portable to 6.0.
> > > >
> > > > Apologies that you have run into this defect and thanks for
> reporting.
> > > >
> > > > Thanks, Keith
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On 21 April 2017 at 10:21, Ramayan Tiwari <[hidden email]>
> > > > wrote:
> > > > >
> > > > > Hi All,
> > > > >
> > > > > We have been monitoring the brokers every day, and today we found one
> > > > > instance where a broker’s DM was constantly going up and it was about
> > > > > to crash, so we experimented with some mitigations, one of which caused
> > > > > the DM to come down. Following are the details, which might help us
> > > > > understand the issue:
> > > > >
> > > > > Traffic scenario:
> > > > >
> > > > > DM allocation had been constantly going up and was at 90%. There were two
> > > > > queues which seemed to align with the theories that we had. Q1’s size had
> > > > > been large right after the broker start and had slow consumption of
> > > > > messages; queue size only reduced from 76MB to 75MB over a period of 6hrs.
> > > > >
> > > > > Q2, on the other hand, started small and was gradually growing; queue size
> > > > > went from 7MB to 10MB in 6hrs. There were other queues with traffic during
> > > > > this time.
> > > > >
> > > > > Action taken:
> > > > >
> > > > > Moved all the messages from Q2 (since this was our original theory) to Q3
> > > > > (already created, with no messages in it). This did not help with the DM
> > > > > growth.
> > > > > Moved all the messages from Q1 to Q4 (already created, with no messages
> > > > > in it). This reduced DM allocation from 93% to 31%.
> > > > >
> > > > > We have the heap dump and thread dump from when the broker was at 90% DM
> > > > > allocation. We are going to analyze them to see if we can get some clue.
> > > > > We wanted to share this new information, which might help in reasoning
> > > > > about the memory issue.
> > > > >
> > > > > - Ramayan
> > > > >
> > > > >
> > > > > On Thu, Apr 20, 2017 at 11:20 AM, Ramayan Tiwari <
> > > > [hidden email]>
> > > > >
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > > Hi Keith,
> > > > > >
> > > > > > Thanks so much for your response and for digging into the issue. Below
> > > > > > are the answers to your questions:
> > > > > >
> > > > > > 1) Yeah, we are using QPID-7462 with 6.0.5. We couldn't use 6.1, where
> > > > > > it was released, because we need JMX support. Here is the destination
> > > > > > format:
> > > > > >
> > > > > > "%s ; {node : { type : queue }, link : { x-subscribes : {
> > > > > > arguments : { x-multiqueue : [%s], x-pull-only : true }}}}"
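For readers unfamiliar with the address syntax, the format string above can be filled in with plain string formatting. The subscription name and queue names below are invented for illustration:

```java
// Sketch: filling in the multi-queue destination address format quoted
// above. Subscription and queue names are made up for illustration.
public class MultiQueueAddress {
    static String buildAddress(String subscriptionName, String... queues) {
        String queueList = String.join(", ", queues);
        return String.format(
            "%s ; {node : { type : queue }, link : { x-subscribes : { "
            + "arguments : { x-multiqueue : [%s], x-pull-only : true }}}}",
            subscriptionName, queueList);
    }

    public static void main(String[] args) {
        System.out.println(buildAddress("mySub", "queue1", "queue2"));
    }
}
```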
> > > > > >
> > > > > > 2) Our machines have 40 cores, which makes the number of threads 80.
> > > > > > This might not be an issue, because it would show up in the baseline DM
> > > > > > allocated, which is only 6% (of 4GB) when we just bring up the broker.
> > > > > >
> > > > > > 3) The only setting that we tuned WRT DM is flowToDiskThreshold, which
> > > > > > is set at 80% now.
> > > > > >
> > > > > > 4) Only one virtual host in the broker.
> > > > > >
> > > > > > 5) Most of our queues (99%) are priority queues; we also have 8-10
> > > > > > sorted queues.
> > > > >
> > > > > >
> > > > > >
> > > > > > 6) Yeah, we are using the standard 0.16 client and not AMQP 1.0 clients.
> > > > > > The connection log line looks like:
> > > > > > CON-1001 : Open : Destination : AMQP(IP:5672) : Protocol Version : 0-10 :
> > > > > > Client ID : test : Client Version : 0.16 : Client Product : qpid
> > > > > >
> > > > > > We had another broker crash about an hour back; we do see the same
> > > > > > patterns:
> > > > > > 1) There is a queue which is constantly growing; enqueue is faster than
> > > > > > dequeue on that queue for a long period of time.
> > > > > > 2) Flow to disk didn't kick in at all.
> > > > > >
> > > > > > This graph shows memory growth (red line - heap, blue - DM allocated,
> > > > > > yellow - DM used):
> > > > > >
> > > > > > https://drive.google.com/file/d/0Bwi0MEV3srPRdVhXdTBncHJLY2c/view?usp=sharing
> > > > >
> > > > > >
> > > > > >
> > > > > > The below graph shows growth on a single queue (there are 10-12 other
> > > > > > queues with traffic as well, some larger in size than this queue):
> > > > > >
> > > > > > https://drive.google.com/file/d/0Bwi0MEV3srPRWmNGbDNGUkJhQ0U/view?usp=sharing
> > > > >
> > > > > >
> > > > > >
> > > > > > Couple of questions:
> > > > > > 1) Is there any developer-level doc/design spec on how Qpid uses DM?
> > > > > > 2) We are not getting heap dumps automatically when the broker crashes
> > > > > > due to DM (HeapDumpOnOutOfMemoryError is not respected). Has anyone
> > > > > > found a way to get around this problem?
> > > > > >
> > > > > > Thanks
> > > > > > Ramayan
> > > > > >
> > > > > > On Thu, Apr 20, 2017 at 9:08 AM, Keith W <[hidden email]>
> > wrote:
> > > > > > >
> > > > > > >
> > > > > > > Hi Ramayan
> > > > > > >
> > > > > > > We have been discussing your problem here and have a couple of
> > > > questions.
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > I have been experimenting with use-cases based on your descriptions
> > > > > > > above, but so far, have been unsuccessful in reproducing a
> > > > > > > "java.lang.OutOfMemoryError: Direct buffer memory" condition.  The
> > > > > > > direct memory usage reflects the expected model: it levels off when
> > > > > > > the flow to disk threshold is reached, and direct memory is released
> > > > > > > as messages are consumed until the minimum size for caching of direct
> > > > > > > memory is reached.
> > > > > > >
> > > > > > > 1] For clarity, let me check: we believe when you say "patch to use
> > > > > > > MultiQueueConsumer" you are referring to the patch attached to
> > > > > > > QPID-7462 ("Add experimental "pull" consumers to the broker") and you
> > > > > > > are using a combination of this "x-pull-only" with the standard
> > > > > > > "x-multiqueue" feature.  Is this correct?
> > > > > > >
> > > > > > > 2] One idea we had here relates to the size of the virtualhost IO
> > > > > > > pool.   As you know from the documentation, the Broker caches/reuses
> > > > > > > direct memory internally, but the documentation fails to mention that
> > > > > > > each pooled virtualhost IO thread also grabs a chunk (256K) of direct
> > > > > > > memory from this cache.  By default the virtual host IO pool is sized
> > > > > > > Math.max(Runtime.getRuntime().availableProcessors() * 2, 64), so if
> > > > > > > you have a machine with a very large number of cores, you may have a
> > > > > > > surprisingly large amount of direct memory assigned to virtualhost IO
> > > > > > > threads.   Check the value of connectionThreadPoolSize on the
> > > > > > > virtualhost
> > > > > > > (http://<server>:<port>/api/latest/virtualhost/<virtualhostnodename>/<virtualhostname>)
> > > > > > > to see what value is in force.  What is it?  It is possible to tune
> > > > > > > the pool size using the context variable
> > > > > > > virtualhost.connectionThreadPool.size.
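To put numbers on Keith's point for the machines described in this thread, the default sizing formula and the 256K-per-thread figure give a simple estimate of the pool's baseline direct memory. This is back-of-the-envelope arithmetic, not broker code:

```java
// Back-of-the-envelope estimate of direct memory claimed by virtualhost IO
// threads, using the default pool sizing formula quoted above. Not broker
// code, just arithmetic.
public class IoPoolMemory {
    static final int NETWORK_BUFFER_SIZE = 256 * 1024; // 256 KB per pooled IO thread

    static int defaultPoolSize(int availableProcessors) {
        return Math.max(availableProcessors * 2, 64);
    }

    static long pooledDirectMemoryBytes(int availableProcessors) {
        return (long) defaultPoolSize(availableProcessors) * NETWORK_BUFFER_SIZE;
    }

    public static void main(String[] args) {
        // Ramayan's machines: 40 cores -> 80 IO threads -> 20 MB of direct memory.
        System.out.println(defaultPoolSize(40));
        System.out.println(pooledDirectMemoryBytes(40) / (1024 * 1024));
    }
}
```

For 40 cores this comes to about 20 MB, which is consistent with Ramayan's observation that the pool is not the dominant contributor to the ~230 MB baseline.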
> > > > > > >
> > > > > > > 3] Tell me if you are tuning the Broker in any way beyond the
> > > > > > > direct/heap memory settings you have told us about already.  For
> > > > > > > instance, are you changing any of the direct memory pooling settings
> > > > > > > (broker.directByteBufferPoolSize), the default network buffer size
> > > > > > > (qpid.broker.networkBufferSize) or applying any other non-standard
> > > > > > > settings?
> > > > > > >
> > > > > > > 4] How many virtual hosts do you have on the Broker?
> > > > > > >
> > > > > > > 5] What is the consumption pattern of the messages?  Do you consume
> > > > > > > in a strictly FIFO fashion, or are you making use of message selectors
> > > > > > > or/and any of the out-of-order queue types (LVQs, priority queues or
> > > > > > > sorted queues)?
> > > > > > >
> > > > > > > 6] Is it just the 0.16 client involved in the application?  Can I
> > > > > > > check that you are not using any of the AMQP 1.0 clients
> > > > > > > (org.apache.qpid:qpid-jms-client or
> > > > > > > org.apache.qpid:qpid-amqp-1-0-client) in the software stack (as
> > > > > > > either consumers or producers)?
> > > > > > >
> > > > > > > Hopefully the answers to these questions will get us closer to a
> > > > > > > reproduction.   If you are able to reliably reproduce it, please
> > > > > > > share the steps with us.
> > > > > > >
> > > > > > > Kind regards, Keith.
> > > > > > >
> > > > > > >
> > > > > > > On 20 April 2017 at 10:21, Ramayan Tiwari <
> > [hidden email]>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > After a lot of log mining, we might have a way to explain the
> > > > > > > > sustained increase in DirectMemory allocation; the correlation seems
> > > > > > > > to be with the growth in the size of a queue that is getting
> > > > > > > > consumed, but at a much slower rate than producers putting messages
> > > > > > > > on this queue.
> > > > > > > >
> > > > > > > > The pattern we see is that in each instance of broker crash, there
> > > > > > > > is at least one queue (usually 1 queue) whose size kept growing
> > > > > > > > steadily. It’d be of significant size but not the largest queue --
> > > > > > > > usually there are multiple larger queues -- but it was different
> > > > > > > > from other queues in that its size was growing steadily. The queue
> > > > > > > > would also be moving, but its processing rate was not keeping up
> > > > > > > > with the enqueue rate.
> > > > > > > >
> > > > > > > > Our theory, which might be totally wrong: if a queue is moving the
> > > > > > > > entire time, maybe the broker keeps reusing the same buffer in
> > > > > > > > direct memory for the queue, and keeps adding onto it at the end to
> > > > > > > > accommodate new messages. But because it’s active all the time and
> > > > > > > > we’re pointing to the same buffer, space allocated for messages at
> > > > > > > > the head of the queue/buffer doesn’t get reclaimed, even long after
> > > > > > > > those messages have been processed. Just a theory.
> > > > > > > >
> > > > > > > > We are also trying to reproduce this using some perf tests that
> > > > > > > > enqueue with the same pattern; we will update with the findings.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Ramayan
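The slow-consumer pattern described above can be simulated in miniature to see how queue depth (and with it the number of pinned buffers) grows without bound. This is a purely illustrative sketch, not a reproduction of the broker:

```java
// Miniature simulation of the slow-consumer pattern: producers enqueue
// faster than the consumer dequeues, so the queue grows steadily and, per
// the theory above, pins ever more buffer memory. Illustrative only.
import java.util.ArrayDeque;

public class SlowConsumerSim {
    static int queueDepthAfter(int ticks, int enqueuePerTick, int dequeuePerTick) {
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        for (int t = 0; t < ticks; t++) {
            for (int i = 0; i < enqueuePerTick; i++) queue.add(t);
            for (int i = 0; i < dequeuePerTick && !queue.isEmpty(); i++) queue.poll();
        }
        return queue.size();
    }

    public static void main(String[] args) {
        // 100 enqueues vs 90 dequeues per tick: depth grows by ~10 every tick.
        System.out.println(queueDepthAfter(1000, 100, 90));
    }
}
```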
> > > > > > > >
> > > > > > > > On Wed, Apr 19, 2017 at 6:52 PM, Ramayan Tiwari
> > > > > > > > <[hidden email]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Another issue that we noticed: when the broker goes OOM due to
> > > > > > > > > direct memory, it doesn't create a heap dump (specified by
> > > > > > > > > "-XX:+HeapDumpOnOutOfMemoryError"), even though the OOM error is
> > > > > > > > > the same ("java.lang.OutOfMemoryError") as what is mentioned in
> > > > > > > > > the Oracle JVM docs.
> > > > > > > > >
> > > > > > > > > Has anyone been able to find a way to get a heap dump for a DM
> > > > > > > > > OOM?
> > > > > > > > > - Ramayan
> > > > > > > > >
> > > > > > > > > On Wed, Apr 19, 2017 at 11:21 AM, Ramayan Tiwari
> > > > > > > > > <[hidden email]
> > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > Alex,
> > >
> > > Below are the flow to disk logs from the broker having 3 million+
> > > messages at this time. We only have one virtual host. Time is in GMT.
> > > Looks like flow to disk is active on the whole virtual host and not at
> > > a queue level.
> > >
> > > When the same broker went OOM yesterday, I did not see any flow to
> > > disk logs from when it was started until it crashed (crashed twice
> > > within 4hrs).
> > >
> > > 4/19/17 4:17:43.509 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3356539KB exceeds threshold 3355443KB
> > > 4/19/17 2:31:13.502 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3354866KB within threshold 3355443KB
> > > 4/19/17 2:28:43.511 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3358509KB exceeds threshold 3355443KB
> > > 4/19/17 2:20:13.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353501KB within threshold 3355443KB
> > > 4/19/17 2:18:13.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3357544KB exceeds threshold 3355443KB
> > > 4/19/17 2:08:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353236KB within threshold 3355443KB
> > > 4/19/17 2:08:13.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3356704KB exceeds threshold 3355443KB
> > > 4/19/17 2:00:43.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353511KB within threshold 3355443KB
> > > 4/19/17 2:00:13.504 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3357948KB exceeds threshold 3355443KB
> > > 4/19/17 1:50:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355310KB within threshold 3355443KB
> > > 4/19/17 1:47:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3365624KB exceeds threshold 3355443KB
> > > 4/19/17 1:43:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355136KB within threshold 3355443KB
> > > 4/19/17 1:31:43.509 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3358683KB exceeds threshold 3355443KB
> > >
> > > After the production release (2 days back), we have seen 4 crashes in
> > > 3 different brokers; this is the most pressing concern for us in
> > > deciding whether we should roll back to 0.32. Any help is greatly
> > > appreciated.
> > >
> > > Thanks
> > > Ramayan
> > >
> > > On Wed, Apr 19, 2017 at 9:36 AM, Oleksandr Rudyy <[hidden email]> wrote:
> > >
> > > > Ramayan,
> > > > Thanks for the details. I would like to clarify whether flow to disk
> > > > was triggered today for the 3 million messages?
> > > >
> > > > The following logs are issued for flow to disk:
> > > > BRK-1014 : Message flow to disk active :  Message memory use {0,number,#}KB exceeds threshold {1,number,#.##}KB
> > > > BRK-1015 : Message flow to disk inactive : Message memory use {0,number,#}KB within threshold {1,number,#.##}KB
> > > >
> > > > Kind Regards,
> > > > Alex
> > > >
> > > > On 19 April 2017 at 17:10, Ramayan Tiwari <[hidden email]> wrote:
> > > >
> > > > > Hi Alex,
> > > > >
> > > > > Thanks for your response, here are the details:
> > > > >
> > > > > We use a "direct" exchange, without persistence (we specify
> > > > > NON_PERSISTENT while sending from the client) and use the BDB
> > > > > store. We use the JSON virtual host type. We are not using SSL.
> > > > >
> > > > > When the broker went OOM, we had around 1.3 million messages with a
> > > > > 100-byte average message size. Direct memory allocation (value read
> > > > > from the MBean) kept going up, even though it wouldn't need more DM
> > > > > to store that many messages. DM allocated persisted at 99% for
> > > > > about three and a half hours before crashing.
> > > > >
> > > > > Today, on the same broker we have 3 million messages (same message
> > > > > size) and DM allocated is only at 8%. This seems like there is some
> > > > > issue with de-allocation, or a leak.
> > > > >
> > > > > I have uploaded the memory utilization graph here:
> > > > > https://drive.google.com/file/d/0Bwi0MEV3srPRVHFEbDlIYUpLaUE/view?usp=sharing
> > > > > Blue line is DM allocated, Yellow is DM Used (sum of queue payload)
> > > > > and Red is heap usage.
> > > > >
> > > > > Thanks
> > > > > Ramayan
> > > > >
> > > > > On Wed, Apr 19, 2017 at 4:10 AM, Oleksandr Rudyy <[hidden email]> wrote:
> > > > >
> > > > > > Hi Ramayan,
> > > > > >
> > > > > > Could you please share with us the details of the messaging use
> > > > > > case(s) which ended up in OOM on the broker side?
> > > > > > I would like to reproduce the issue on my local broker in order
> > > > > > to fix it.
> > > > > > I would appreciate it if you could provide as much detail as
> > > > > > possible, including messaging topology, message persistence type,
> > > > > > message sizes, volumes, etc.
> > > > > >
> > > > > > Qpid Broker 6.0.x uses direct memory for keeping message content
> > > > > > and receiving/sending data. Each plain connection utilizes 512K
> > > > > > of direct memory. Each SSL connection uses 1M of direct memory.
> > > > > > Your memory settings look OK to me.
> > > > > >
> > > > > > Kind Regards,
> > > > > > Alex
> > > > > >
> > > > > > On 18 April 2017 at 23:39, Ramayan Tiwari <[hidden email]> wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > We are using Java broker 6.0.5, with a patch to use the
> > > > > > > MultiQueueConsumer feature. We just finished deploying to
> > > > > > > production and saw a couple of instances of broker OOM due to
> > > > > > > running out of DirectMemory buffer (exceptions at the end of
> > > > > > > this email).
> > > > > > >
> > > > > > > Here is our setup:
> > > > > > > 1. Max heap 12g, max direct memory 4g (this is the opposite of
> > > > > > > what the recommendation is; however, for our use case the
> > > > > > > message payload is really small, ~400 bytes, and is way less
> > > > > > > than the per-message overhead of 1KB). In perf testing, we were
> > > > > > > able to put 2 million messages without any issues.
> > > > > > > 2. ~400 connections to the broker.
> > > > > > > 3. Each connection has 20 sessions and there is one multi queue
> > > > > > > consumer attached to each session, listening to around 1000
> > > > > > > queues.
> > > > > > > 4. We are still using the 0.16 client (I know).
> > > > > > >
> > > > > > > With the above setup, the baseline utilization (without any
> > > > > > > messages) for direct memory was around 230mb (with 410
> > > > > > > connections each taking 500KB).
> > > > > > >
> > > > > > > Based on our understanding of broker memory allocation, message
> > > > > > > payload should be the only thing adding to direct memory
> > > > > > > utilization (on top of baseline); however, we are experiencing
> > > > > > > something completely different. In our last broker crash, we
> > > > > > > see that the broker is constantly running with 90%+ direct
> > > > > > > memory allocated, even when the message payload sum from all
> > > > > > > the queues is only 6-8% (these % are against the available DM
> > > > > > > of 4gb). During these high DM usage periods, heap usage was
> > > > > > > around 60% (of 12gb).
> > > > > > >
> > > > > > > We would like some help in understanding what could be the
> > > > > > > reason for these high DM allocations. Are there things other
> > > > > > > than message payload and AMQP connections which use DM and
> > > > > > > could be contributing to this high usage?
> > > > > > >
> > > > > > > Another thing where we are puzzled is the de-allocation of DM
> > > > > > > byte buffers. From log mining of heap and DM utilization,
> > > > > > > de-allocation of DM doesn't correlate with heap GC. If anyone
> > > > > > > has seen any documentation related to this, it would be very
> > > > > > > helpful if you could share that.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Ramayan
> > > > > > >
> > > > > > > *Exceptions*
> > > > > > >
> > > > > > > java.lang.OutOfMemoryError: Direct buffer memory
> > > > > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
> > > > > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
> > > > > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
> > > > > > > at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
> > > > > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.restoreApplicationBufferForWrite(NonBlockingConnectionPlainDelegate.java:93) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.processData(NonBlockingConnectionPlainDelegate.java:60) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > at org.apache.qpid.server.transport.NonBlockingConnection.doRead(NonBlockingConnection.java:506) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > at org.apache.qpid.server.transport.NonBlockingConnection.doWork(NonBlockingConnection.java:285) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > at org.apache.qpid.server.transport.NetworkConnectionScheduler.processConnection(NetworkConnectionScheduler.java:124) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > at org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.processConnection(SelectorThread.java:504) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask.performSelect(SelectorThread.java:337) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(SelectorThread.java:87) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
> > > > > > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
> > > > > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
> > > > > > >
> > > > > > > *Second exception*
> > > > > > > java.lang.OutOfMemoryError: Direct buffer memory
> > > > > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
> > > > > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
> > > > > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
> > > > > > > at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
> > > > > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.<init>(NonBlockingConnectionPlainDelegate.java:45) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > at org.apache.qpid.server.transport.NonBlockingConnection.setTransportEncryption(NonBlockingConnection.java:625) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > at org.apache.qpid.server.transport.NonBlockingConnection.<init>(NonBlockingConnection.java:117) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > at org.apache.qpid.server.transport.NonBlockingNetworkTransport.acceptSocketChannel(NonBlockingNetworkTransport.java:158) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask$1.run(SelectorThread.java:191) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
> > > > > > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
> > > > > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>

Re: Java broker OOM due to DirectMemory

Oleksandr Rudyy
Hi Ramayan,

We have attached to QPID-7753 a patch with a workaround for the 6.0.x branch.
It triggers flow to disk based on direct memory consumption rather than an
estimate of the space occupied by the message content. The flow to disk
should evacuate message content, preventing the broker from running out of
direct memory. We have already committed the changes to the 6.0.x and 6.1.x
branches; they will be included in the upcoming 6.0.7 and 6.1.3 releases.

Please try and test the patch in your environment.

We are still working on finishing the fix for trunk.
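[Editorial aside: while testing the patch, actual direct memory consumption can be watched through the JVM's "direct" buffer pool MXBean -- the same source the broker's MBean-based graphs mentioned earlier in this thread draw on. A small standalone probe, illustrative only and not part of the patch:]

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

public class DirectPoolProbe {
    // Returns the bytes currently held by the JVM's "direct" buffer pool.
    static long directMemoryUsed() {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;  // no "direct" pool found (should not happen on HotSpot)
    }

    public static void main(String[] args) {
        long before = directMemoryUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);  // 1 MiB
        long after = directMemoryUsed();
        System.out.println("direct pool grew by " + (after - before) + " bytes");
    }
}
```

Polling this value alongside the broker's queue-depth statistics makes it easy to see the divergence between allocated direct memory and message payload that this thread describes.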

Kind Regards,
Alex

On 30 April 2017 at 15:45, Lorenz Quack <[hidden email]> wrote:

> Hi Ramayan,
>
> The high-level plan is currently as follows:
>  1) Periodically try to compact sparse direct memory buffers.
>  2) Increase accuracy of messages' direct memory usage estimation to more
> reliably trigger flow to disk.
>  3) Add an additional flow to disk trigger based on the amount of allocated
> direct memory.
>
> A little bit more detail:
>  1) We plan on periodically checking the amount of direct memory usage
>     and, if it is above a threshold (50%), comparing the sum of all queue
>     sizes with the amount of allocated direct memory. If the ratio falls
>     below a certain threshold we trigger a compaction task which goes
>     through all queues and copies a certain amount of old message buffers
>     into new ones, thereby freeing the old buffers so that they can be
>     returned to the buffer pool and be reused.
>
>  2) Currently we trigger flow to disk based on an estimate of how much
>     memory the messages on the queues consume. We had to use estimates
>     because we did not have accurate size numbers for message headers. By
>     having accurate size information for message headers we can more
>     reliably enforce queue memory limits.
>
>  3) The flow to disk trigger based on message size had another problem
>     which is more pertinent to the current issue. We only considered the
>     size of the messages and not how much memory we allocate to store
>     those messages. In the FIFO use case those numbers will be very close
>     to each other, but in use cases like yours we can end up with sparse
>     buffers and the numbers will diverge. Because of this divergence we do
>     not trigger flow to disk in time and the broker can go OOM.
>     To fix the issue we want to add an additional flow to disk trigger
>     based on the amount of allocated direct memory. This should prevent
>     the broker from going OOM even if the compaction strategy outlined
>     above should fail for some reason (e.g., the compaction task cannot
>     keep up with the arrival of new messages).
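[Editorial aside: the compaction idea in (1) can be sketched roughly as below -- a simplified illustration of copying the still-live slices of a sparsely occupied direct buffer into a fresh, right-sized one. The `liveSlices` offset/length bookkeeping is hypothetical; the real broker tracks this per message:]

```java
import java.nio.ByteBuffer;

public class BufferCompactor {
    // Copy the still-referenced slices of a sparsely occupied direct buffer
    // into a new, densely packed direct buffer. Once every reader has been
    // re-pointed at the new buffer, the old one becomes unreachable and its
    // native memory can be returned to the pool for reuse.
    static ByteBuffer compact(ByteBuffer sparse, int[][] liveSlices) {
        int needed = 0;
        for (int[] slice : liveSlices) {
            needed += slice[1];                   // slice = {offset, length}
        }
        ByteBuffer dense = ByteBuffer.allocateDirect(needed);
        for (int[] slice : liveSlices) {
            ByteBuffer view = sparse.duplicate(); // independent position/limit
            view.position(slice[0]);
            view.limit(slice[0] + slice[1]);
            dense.put(view);                      // bulk-copy the live slice
        }
        dense.flip();
        return dense;
    }

    public static void main(String[] args) {
        ByteBuffer sparse = ByteBuffer.allocateDirect(1024);
        sparse.put(100, (byte) 42);               // pretend one 1-byte message survives
        ByteBuffer dense = compact(sparse, new int[][] { { 100, 1 } });
        System.out.println(dense.capacity() + " " + dense.get(0)); // prints "1 42"
    }
}
```

The thread-safety difficulty mentioned below is visible even in this sketch: every reference into `sparse` must be swapped to `dense` atomically with respect to concurrent enqueues and deliveries.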
>
> Currently, there are patches for the above points but they suffer from some
> thread-safety issues that need to be addressed.
>
> I hope this description helps. Any feedback is, as always, welcome.
>
> Kind regards,
> Lorenz
>
>
>
> On Sat, Apr 29, 2017 at 12:00 AM, Ramayan Tiwari <[hidden email]> wrote:
>
> > Hi Lorenz,
> >
> > Thanks so much for the patch. We have a perf test now to reproduce this
> > issue, so we tested with 256KB, 64KB and 4KB network byte buffers. None of
> > these configurations helps with the issue (or gives any more breathing room)
> > for our use case. We would like to share the perf analysis with the
> > community:
> >
> > https://docs.google.com/document/d/1Wc1e-id-WlpI7FGU1Lx8XcKaV8sauRp82T5XZVU-RiM/edit?usp=sharing
> >
> > Feel free to comment on the doc if certain details are incorrect or if
> > there are questions.
> >
> > Since the short term solution doesn't help us, we are very interested in
> > getting some details on how the community plans to address this; a high
> > level description of the approach will be very helpful for us as we
> > brainstorm our use cases against this solution.
> >
> > - Ramayan
> >
> > On Fri, Apr 28, 2017 at 9:34 AM, Lorenz Quack <[hidden email]>
> > wrote:
> >
> > > Hello Ramayan,
> > >
> > > We are still working on a fix for this issue.
> > > In the meantime we had an idea for a potential workaround for the issue
> > > until a proper fix is released.
> > >
> > > The idea is to decrease the qpid network buffer size the broker uses.
> > > While this still allows for sparsely populated buffers it would improve
> > > the overall occupancy ratio.
> > >
> > > Here are the steps to follow:
> > >  * ensure you are not using TLS
> > >  * apply the attached patch
> > >  * figure out the size of the largest messages you are sending
> > >    (including header and some overhead)
> > >  * set the context variable "qpid.broker.networkBufferSize" to that
> > >    value, but not smaller than 4096
> > >  * test
> > >
> > > Decreasing the qpid network buffer size automatically limits the maximum
> > > AMQP frame size.
> > > Since you are using a very old client we are not sure how well it copes
> > > with small frame sizes where it has to split a message across multiple
> > > frames.
> > > Therefore, to play it safe you should not set it smaller than the largest
> > > message (+ header + overhead) you are sending.
> > > I do not know what message sizes you are sending, but AMQP imposes the
> > > restriction that the frame size cannot be smaller than 4096 bytes.
> > > In the qpid broker the default currently is 256 kB.
> > >
> > > In its current state the broker does not allow setting the network
> > > buffer to values smaller than 64 kB, so that TLS frames fit into one
> > > network buffer.
> > > I attached a patch to this mail that lowers that restriction to the
> > > limit imposed by AMQP (4096 bytes).
> > > Obviously, you should not use this when using TLS.
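[Editor's note] The sizing rule from the workaround steps above can be sketched as a small helper. Class and method names are hypothetical; the real broker reads qpid.broker.networkBufferSize as a context variable:

```java
// Rough helper reflecting the workaround's sizing rule: the buffer must hold
// the largest message plus header/overhead, but never go below AMQP's
// 4096-byte minimum frame size (or 64K if TLS is in use, per the unpatched
// broker). Illustrative only.
public final class NetworkBufferSizing {
    static final int AMQP_MIN_FRAME = 4096;
    static final int TLS_MIN_BUFFER = 64 * 1024;

    public static int recommendedBufferSize(int largestMessageBytes,
                                            int headerAndOverheadBytes, boolean tls) {
        int needed = largestMessageBytes + headerAndOverheadBytes;
        int floor = tls ? TLS_MIN_BUFFER : AMQP_MIN_FRAME;
        return Math.max(needed, floor);
    }
}
```

For Ramayan's ~400-byte payloads plus ~1KB overhead, this floors out at the AMQP minimum of 4096 bytes.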
> > >
> > >
> > > I hope this reduces the problems you are currently facing until we can
> > > complete the proper fix.
> > >
> > > Kind regards,
> > > Lorenz
> > >
> > >
> > > On Fri, 2017-04-21 at 09:17 -0700, Ramayan Tiwari wrote:
> > > > Thanks so much Keith and the team for finding the root cause. We are so
> > > > relieved that the root cause will be fixed shortly.
> > > >
> > > > A couple of things that I forgot to mention on the mitigation steps we
> > > > took in the last incident:
> > > > 1) We triggered GC from the JMX bean multiple times; it did not help in
> > > > reducing DM allocated.
> > > > 2) We also killed all the AMQP connections to the broker when DM was at
> > > > 80%. This did not help either. The way we killed connections: using JMX
> > > > we got the list of all the open AMQP connections and called close from
> > > > the JMX mbean.
> > > >
> > > > I am hoping the above two are not related to the root cause, but wanted
> > > > to bring them up in case they are relevant.
> > > >
> > > > Thanks
> > > > Ramayan
> > > >
> > > > On Fri, Apr 21, 2017 at 8:29 AM, Keith W <[hidden email]>
> wrote:
> > > >
> > > > >
> > > > > Hello Ramayan
> > > > >
> > > > > I believe I understand the root cause of the problem.  We have
> > > > > identified a flaw in the direct memory buffer management employed by
> > > > > Qpid Broker J which for some messaging use-cases can lead to the
> > > > > direct OOM you describe.   For the issue to manifest the producing
> > > > > application needs to use a single connection for the production of
> > > > > messages, some of which are short-lived (i.e. are consumed quickly)
> > > > > whilst others remain on the queue for some time.  Priority queues,
> > > > > sorted queues and consumers utilising selectors that result in some
> > > > > messages being left on the queue could all produce this pattern.  The
> > > > > pattern leads to sparsely occupied 256K network buffers which cannot
> > > > > be released or reused until every message that references a 'chunk'
> > > > > of it is either consumed or flown to disk.   The problem was
> > > > > introduced with Qpid v6.0 and exists in v6.1 and trunk too.
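[Editor's note] A toy model of the sparse-buffer retention described here (illustrative only; the broker's real buffer pooling is more involved):

```java
import java.util.BitSet;

// Toy model of the sparse-buffer problem: a 256K network buffer is sliced into
// chunks, one per message. The whole buffer stays allocated until every chunk
// is released, so one long-lived message pins the other ~255KB.
public final class NetBuffer {
    static final int BUFFER_SIZE = 256 * 1024;
    private final BitSet live;

    public NetBuffer(int chunkCount) {
        this.live = new BitSet(chunkCount);
        live.set(0, chunkCount); // all chunks referenced by messages initially
    }

    /** Called when the message referencing this chunk is consumed or flown to disk. */
    public void releaseChunk(int index) { live.clear(index); }

    /** The buffer can be returned to the pool only when no message references it. */
    public boolean reusable() { return live.isEmpty(); }

    /** Bytes pinned even if only a few chunks are still live. */
    public int pinnedBytes() { return reusable() ? 0 : BUFFER_SIZE; }
}
```

With this model, consuming 255 of 256 messages frees nothing: the single remaining long-lived message keeps the entire 256K buffer out of the pool, which is why allocation diverges from the live message payload.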
> > > > >
> > > > > The flow to disk feature is not helping us here because its algorithm
> > > > > considers only the size of live messages on the queues. If the
> > > > > accumulative live size does not exceed the threshold, the messages
> > > > > aren't flown to disk. I speculate that when you observed that moving
> > > > > messages caused direct memory usage to drop earlier today, your
> > > > > message movement caused a queue to go over threshold, causing
> > > > > messages to be flown to disk and their direct memory references
> > > > > released.  The logs will confirm whether this is so.
> > > > >
> > > > > I have not identified an easy workaround at the moment.  Decreasing
> > > > > the flow to disk threshold and/or increasing available direct memory
> > > > > should alleviate the problem and may be an acceptable short term
> > > > > workaround.  If it were possible for the publishing application to
> > > > > publish short-lived and long-lived messages on two separate JMS
> > > > > connections, this would avoid the defect.
> > > > >
> > > > > QPID-7753 tracks this issue and QPID-7754 is related to this problem.
> > > > > We intend to be working on these early next week and will be aiming
> > > > > for a fix that is back-portable to 6.0.
> > > > >
> > > > > Apologies that you have run into this defect and thanks for reporting.
> > > > >
> > > > > Thanks, Keith
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 21 April 2017 at 10:21, Ramayan Tiwari <
> [hidden email]>
> > > > > wrote:
> > > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > We have been monitoring the brokers every day and today we found one
> > > > > > instance where the broker's DM was constantly going up and it was
> > > > > > about to crash, so we experimented with some mitigations, one of
> > > > > > which caused the DM to come down. Following are the details, which
> > > > > > might help us understand the issue:
> > > > > >
> > > > > > Traffic scenario:
> > > > > >
> > > > > > DM allocation had been constantly going up and was at 90%. There were
> > > > > > two queues which seemed to align with the theories that we had. Q1's
> > > > > > size had been large right after the broker start and had slow
> > > > > > consumption of messages; queue size only reduced from 76MB to 75MB
> > > > > > over a period of 6hrs.
> > > > > > Q2, on the other hand, started small and was gradually growing; queue
> > > > > > size went from 7MB to 10MB in 6hrs. There were other queues with
> > > > > > traffic during this time.
> > > > > >
> > > > > > Action taken:
> > > > > >
> > > > > > Moved all the messages from Q2 (since this was our original theory)
> > > > > > to Q3 (already created, with no messages in it). This did not stop
> > > > > > the DM growth.
> > > > > > Moved all the messages from Q1 to Q4 (already created, with no
> > > > > > messages in it). This reduced DM allocation from 93% to 31%.
> > > > > >
> > > > > > We have the heap dump and thread dump from when the broker was at
> > > > > > 90% DM allocation. We are going to analyze them to see if we can get
> > > > > > some clue. We wanted to share this new information, which might help
> > > > > > in reasoning about the memory issue.
> > > > > >
> > > > > > - Ramayan
> > > > > >
> > > > > >
> > > > > > On Thu, Apr 20, 2017 at 11:20 AM, Ramayan Tiwari <
> > > > > [hidden email]>
> > > > > >
> > > > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > > Hi Keith,
> > > > > > >
> > > > > > > Thanks so much for your response and digging into the issue. Below
> > > > > > > are the answers to your questions:
> > > > > > >
> > > > > > > 1) Yeah, we are using QPID-7462 with 6.0.5. We couldn't use 6.1,
> > > > > > > where it was released, because we need JMX support. Here is the
> > > > > > > destination format:
> > > > > > >
> > > > > > > "%s ; {node : { type : queue }, link : { x-subscribes : {
> > > > > > > arguments : { x-multiqueue : [%s], x-pull-only : true }}}}"
> > > > > > >
> > > > > > > 2) Our machines have 40 cores, which makes the number of threads
> > > > > > > 80. This might not be an issue, because it would show up in the
> > > > > > > baseline DM allocated, which is only 6% (of 4GB) when we just
> > > > > > > bring up the broker.
> > > > > > >
> > > > > > > 3) The only setting that we tuned WRT DM is flowToDiskThreshold,
> > > > > > > which is set at 80% now.
> > > > > > >
> > > > > > > 4) Only one virtual host in the broker.
> > > > > > >
> > > > > > > 5) Most of our queues (99%) are priority queues; we also have 8-10
> > > > > > > sorted queues.
> > > > > > >
> > > > > > > 6) Yeah, we are using the standard 0.16 client and not AMQP 1.0
> > > > > > > clients. The connection log line looks like:
> > > > > > > CON-1001 : Open : Destination : AMQP(IP:5672) : Protocol Version :
> > > > > > > 0-10 : Client ID : test : Client Version : 0.16 : Client Product :
> > > > > > > qpid
> > > > > > >
> > > > > > > We had another broker crash about an hour back; we see the same
> > > > > > > patterns:
> > > > > > > 1) There is a queue which is constantly growing; enqueue was
> > > > > > > faster than dequeue on that queue for a long period of time.
> > > > > > > 2) Flow to disk didn't kick in at all.
> > > > > > >
> > > > > > > This graph shows memory growth (red line: heap, blue: DM
> > > > > > > allocated, yellow: DM used):
> > > > > > > https://drive.google.com/file/d/0Bwi0MEV3srPRdVhXdTBncHJLY2c/view?usp=sharing
> > > > > > >
> > > > > > > The graph below shows growth on a single queue (there are 10-12
> > > > > > > other queues with traffic as well, some larger than this queue):
> > > > > > > https://drive.google.com/file/d/0Bwi0MEV3srPRWmNGbDNGUkJhQ0U/view?usp=sharing
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > A couple of questions:
> > > > > > > 1) Is there any developer-level doc/design spec on how Qpid uses
> > > > > > > DM?
> > > > > > > 2) We are not getting heap dumps automatically when the broker
> > > > > > > crashes due to DM (HeapDumpOnOutOfMemoryError not respected). Has
> > > > > > > anyone found a way to get around this problem?
> > > > > > >
> > > > > > > Thanks
> > > > > > > Ramayan
> > > > > > >
> > > > > > > On Thu, Apr 20, 2017 at 9:08 AM, Keith W <[hidden email]
> >
> > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > Hi Ramayan
> > > > > > > >
> > > > > > > > We have been discussing your problem here and have a couple of
> > > > > > > > questions.
> > > > > > > >
> > > > > > > > I have been experimenting with use-cases based on your
> > > > > > > > descriptions above, but so far have been unsuccessful in
> > > > > > > > reproducing a "java.lang.OutOfMemoryError: Direct buffer memory"
> > > > > > > > condition.  The direct memory usage reflects the expected model:
> > > > > > > > it levels off when the flow to disk threshold is reached, and
> > > > > > > > direct memory is released as messages are consumed until the
> > > > > > > > minimum size for caching of direct memory is reached.
> > > > > > > >
> > > > > > > > 1] For clarity let me check: we believe that when you say "patch
> > > > > > > > to use MultiQueueConsumer" you are referring to the patch
> > > > > > > > attached to QPID-7462 "Add experimental "pull" consumers to the
> > > > > > > > broker" and you are using a combination of this "x-pull-only"
> > > > > > > > with the standard "x-multiqueue" feature.  Is this correct?
> > > > > > > >
> > > > > > > > 2] One idea we had here relates to the size of the virtualhost
> > > > > > > > IO pool.   As you know from the documentation, the Broker
> > > > > > > > caches/reuses direct memory internally, but the documentation
> > > > > > > > fails to mention that each pooled virtualhost IO thread also
> > > > > > > > grabs a chunk (256K) of direct memory from this cache.  By
> > > > > > > > default the virtual host IO pool is sized
> > > > > > > > Math.max(Runtime.getRuntime().availableProcessors() * 2, 64), so
> > > > > > > > if you have a machine with a very large number of cores, you may
> > > > > > > > have a surprisingly large amount of direct memory assigned to
> > > > > > > > virtualhost IO threads.   Check the value of
> > > > > > > > connectionThreadPoolSize on the virtualhost
> > > > > > > > (http://<server>:<port>/api/latest/virtualhost/<virtualhostnodename>/<virtualhostname>)
> > > > > > > > to see what value is in force.  What is it?  It is possible to
> > > > > > > > tune the pool size using the context variable
> > > > > > > > virtualhost.connectionThreadPool.size.
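[Editor's note] A back-of-envelope sketch of the IO-pool arithmetic above (class and method names are hypothetical; the pool-size formula and 256K chunk size are from Keith's description):

```java
// Back-of-envelope for point 2: each virtualhost IO thread grabs a 256K chunk
// of direct memory, and the default pool size is max(cores * 2, 64).
public final class IoPoolDirectMemory {
    static final long CHUNK = 256 * 1024;

    public static int defaultPoolSize(int availableProcessors) {
        return Math.max(availableProcessors * 2, 64);
    }

    /** Direct memory held by IO threads alone, before any message payload. */
    public static long ioThreadDirectBytes(int availableProcessors) {
        return defaultPoolSize(availableProcessors) * CHUNK;
    }
}
```

On Ramayan's 40-core machines this gives an 80-thread pool holding 80 * 256K = 20MB of direct memory, small relative to the 4GB limit and consistent with his observation that it only affects the baseline.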
> > > > > > > >
> > > > > > > > 3] Tell me if you are tuning the Broker in any way beyond the
> > > > > > > > direct/heap memory settings you have told us about already.  For
> > > > > > > > instance, are you changing any of the direct memory pooling
> > > > > > > > settings (broker.directByteBufferPoolSize), the default network
> > > > > > > > buffer size (qpid.broker.networkBufferSize) or applying any
> > > > > > > > other non-standard settings?
> > > > > > > >
> > > > > > > > 4] How many virtual hosts do you have on the Broker?
> > > > > > > >
> > > > > > > > 5] What is the consumption pattern of the messages?  Do you
> > > > > > > > consume in a strictly FIFO fashion, or are you making use of
> > > > > > > > message selectors and/or any of the out-of-order queue types
> > > > > > > > (LVQs, priority queues or sorted queues)?
> > > > > > > >
> > > > > > > > 6] Is it just the 0.16 client involved in the application?  Can
> > > > > > > > I check that you are not using any of the AMQP 1.0 clients
> > > > > > > > (org.apache.qpid:qpid-jms-client or
> > > > > > > > org.apache.qpid:qpid-amqp-1-0-client) in the software stack (as
> > > > > > > > either consumers or producers)?
> > > > > > > >
> > > > > > > > Hopefully the answers to these questions will get us closer to
> > > > > > > > a reproduction.   If you are able to reliably reproduce it,
> > > > > > > > please share the steps with us.
> > > > > > > >
> > > > > > > > Kind regards, Keith.
> > > > > > > >
> > > > > > > >
> > > > > > > > On 20 April 2017 at 10:21, Ramayan Tiwari <
> > > [hidden email]>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > After a lot of log mining, we might have a way to explain the
> > > > > > > > > sustained increase in DirectMemory allocation; the correlation
> > > > > > > > > seems to be with the growth in the size of a queue that is
> > > > > > > > > being consumed, but at a much slower rate than producers are
> > > > > > > > > putting messages on it.
> > > > > > > > >
> > > > > > > > > The pattern we see is that in each instance of a broker crash,
> > > > > > > > > there is at least one queue (usually one) whose size kept
> > > > > > > > > growing steadily. It'd be of significant size but not the
> > > > > > > > > largest queue -- usually there are multiple larger queues --
> > > > > > > > > but it was different from the other queues in that its size
> > > > > > > > > was growing steadily. The queue would also be moving, but its
> > > > > > > > > processing rate was not keeping up with the enqueue rate.
> > > > > > > > >
> > > > > > > > > Our theory, which might be totally wrong: if a queue is moving
> > > > > > > > > the entire time, maybe the broker keeps reusing the same
> > > > > > > > > buffer in direct memory for the queue, appending to the end of
> > > > > > > > > it to accommodate new messages. But because it's active all
> > > > > > > > > the time and we're pointing to the same buffer, space
> > > > > > > > > allocated for messages at the head of the queue/buffer doesn't
> > > > > > > > > get reclaimed, even long after those messages have been
> > > > > > > > > processed. Just a theory.
> > > > > > > > >
> > > > > > > > > We are also trying to reproduce this using some perf tests
> > > > > > > > > that enqueue with the same pattern; we will update with the
> > > > > > > > > findings.
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Ramayan
> > > > > > > > >
> > > > > > > > > On Wed, Apr 19, 2017 at 6:52 PM, Ramayan Tiwari
> > > > > > > > > <[hidden email]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Another issue that we noticed is that when the broker goes
> > > > > > > > > > OOM due to direct memory, it doesn't create a heap dump
> > > > > > > > > > (specified by "-XX:+HeapDumpOnOutOfMemoryError"), even
> > > > > > > > > > though the OOM error is the same as what is mentioned in the
> > > > > > > > > > Oracle JVM docs ("java.lang.OutOfMemoryError").
> > > > > > > > > >
> > > > > > > > > > Has anyone been able to find a way to get a heap dump for a
> > > > > > > > > > DM OOM?
> > > > > > > > > >
> > > > > > > > > > - Ramayan
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 19, 2017 at 11:21 AM, Ramayan Tiwari
> > > > > > > > > > <[hidden email]
> > > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Alex,
> > > > > > > > > > >
> > > > > > > > > > > Below are the flow to disk logs from the broker holding 3
> > > > > > > > > > > million+ messages at this time. We only have one virtual
> > > > > > > > > > > host. Times are in GMT. It looks like flow to disk is
> > > > > > > > > > > active on the whole virtual host and not at a queue level.
> > > > > > > > > > >
> > > > > > > > > > > When the same broker went OOM yesterday, I did not see any
> > > > > > > > > > > flow to disk logs from when it was started until it
> > > > > > > > > > > crashed (it crashed twice within 4hrs).
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > 4/19/17 4:17:43.509 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3356539KB exceeds threshold 3355443KB
> > > > > > > > > > > 4/19/17 2:31:13.502 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3354866KB within threshold 3355443KB
> > > > > > > > > > > 4/19/17 2:28:43.511 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3358509KB exceeds threshold 3355443KB
> > > > > > > > > > > 4/19/17 2:20:13.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353501KB within threshold 3355443KB
> > > > > > > > > > > 4/19/17 2:18:13.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3357544KB exceeds threshold 3355443KB
> > > > > > > > > > > 4/19/17 2:08:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353236KB within threshold 3355443KB
> > > > > > > > > > > 4/19/17 2:08:13.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3356704KB exceeds threshold 3355443KB
> > > > > > > > > > > 4/19/17 2:00:43.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353511KB within threshold 3355443KB
> > > > > > > > > > > 4/19/17 2:00:13.504 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3357948KB exceeds threshold 3355443KB
> > > > > > > > > > > 4/19/17 1:50:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355310KB within threshold 3355443KB
> > > > > > > > > > > 4/19/17 1:47:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3365624KB exceeds threshold 3355443KB
> > > > > > > > > > > 4/19/17 1:43:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355136KB within threshold 3355443KB
> > > > > > > > > > > 4/19/17 1:31:43.509 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3358683KB exceeds threshold 3355443KB
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > After the production release (2 days back), we have seen 4
> > > > > > > > > > > crashes in 3 different brokers; this is the most pressing
> > > > > > > > > > > concern for us in deciding whether we should roll back to
> > > > > > > > > > > 0.32. Any help is greatly appreciated.
> > > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > > Ramayan
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Apr 19, 2017 at 9:36 AM, Oleksandr Rudyy <
> > > [hidden email]
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Ramayan,
> > > > > > > > > > > > Thanks for the details. I would like to clarify whether
> > > > > > > > > > > > flow to disk was triggered today for the 3 million
> > > > > > > > > > > > messages.
> > > > > > > > > > > >
> > > > > > > > > > > > The following logs are issued for flow to disk:
> > > > > > > > > > > > BRK-1014 : Message flow to disk active :  Message
> > memory
> > > use
> > > > > > > > > > > > {0,number,#}KB
> > > > > > > > > > > > exceeds threshold {1,number,#.##}KB
> > > > > > > > > > > > BRK-1015 : Message flow to disk inactive : Message
> > > memory use
> > > > > > > > > > > > {0,number,#}KB within threshold {1,number,#.##}KB
> > > > > > > > > > > >
> > > > > > > > > > > > Kind Regards,
> > > > > > > > > > > > Alex
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On 19 April 2017 at 17:10, Ramayan Tiwari <
> > > > > [hidden email]>
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Alex,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for your response, here are the details:
> > > > > > > > > > > > >
> > > > > > > > > > > > > We use a "direct" exchange, without persistence (we
> > > > > > > > > > > > > specify NON_PERSISTENT while sending from the client)
> > > > > > > > > > > > > and use a BDB store. We use the JSON virtual host
> > > > > > > > > > > > > type. We are not using SSL.
> > > > > > > > > > > > >
> > > > > > > > > > > > > When the broker went OOM, we had around 1.3 million
> > > > > > > > > > > > > messages with a 100-byte average message size. Direct
> > > > > > > > > > > > > memory allocation (value read from the MBean) kept
> > > > > > > > > > > > > going up, even though it wouldn't need more DM to
> > > > > > > > > > > > > store that many messages. DM allocated persisted at
> > > > > > > > > > > > > 99% for about three and a half hours before crashing.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Today, on the same broker, we have 3 million messages
> > > > > > > > > > > > > (same message size) and DM allocated is only at 8%.
> > > > > > > > > > > > > This seems like there is some issue with
> > > > > > > > > > > > > de-allocation, or a leak.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I have uploaded the memory utilization graph here:
> > > > > > > > > > > > > https://drive.google.com/file/d/0Bwi0MEV3srPRVHFEbDlIYUpLaUE/view?usp=sharing
> > > > > > > > > > > > > The blue line is DM allocated, yellow is DM used (sum
> > > > > > > > > > > > > of queue payload) and red is heap usage.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > Ramayan
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Apr 19, 2017 at 4:10 AM, Oleksandr Rudyy
> > > > > > > > > > > > > <[hidden email]>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Ramayan,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Could you please share with us the details of the
> > > > > > > > > > > > > > messaging use case(s) which ended up in OOM on the
> > > > > > > > > > > > > > broker side?
> > > > > > > > > > > > > > I would like to reproduce the issue on my local
> > > > > > > > > > > > > > broker in order to fix it.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I would appreciate it if you could provide as many
> > > > > > > > > > > > > > details as possible, including messaging topology,
> > > > > > > > > > > > > > message persistence type, message sizes, volumes,
> > > > > > > > > > > > > > etc.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Qpid Broker 6.0.x uses direct memory for keeping
> > > > > > > > > > > > > > message content and for receiving/sending data. Each
> > > > > > > > > > > > > > plain connection utilizes 512K of direct memory.
> > > > > > > > > > > > > > Each SSL connection uses 1M of direct memory. Your
> > > > > > > > > > > > > > memory settings look OK to me.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Kind Regards,
> > > > > > > > > > > > > > Alex
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On 18 April 2017 at 23:39, Ramayan Tiwari
> > > > > > > > > > > > > > <[hidden email]>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We are using Java broker 6.0.5, with a patch to
> > > > > > > > > > > > > > > use the MultiQueueConsumer feature. We just
> > > > > > > > > > > > > > > finished deploying to production and saw a couple
> > > > > > > > > > > > > > > of instances of broker OOM due to running out of
> > > > > > > > > > > > > > > DirectMemory buffer (exceptions at the end of this
> > > > > > > > > > > > > > > email).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Here is our setup:
> > > > > > > > > > > > > > > 1. Max heap 12g, max direct memory 4g (this is the
> > > > > > > > > > > > > > > opposite of what the recommendation is; however,
> > > > > > > > > > > > > > > for our use case the message payload is really
> > > > > > > > > > > > > > > small, ~400 bytes, and is way less than the
> > > > > > > > > > > > > > > per-message overhead of 1KB). In perf testing, we
> > > > > > > > > > > > > > > were able to put 2 million messages without any
> > > > > > > > > > > > > > > issues.
> > > > > > > > > > > > > > > 2. ~400 connections to the broker.
> > > > > > > > > > > > > > > 3. Each connection has 20 sessions and there is
> > > > > > > > > > > > > > > one multi-queue consumer attached to each session,
> > > > > > > > > > > > > > > listening to around 1000 queues.
> > > > > > > > > > > > > > > 4. We are still using the 0.16 client (I know).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > With the above setup, the baseline utilization
> > > (without any
> > > > > > > > > > > > messages)
> > > > > > > > > > > > >
> > > > > > > > > > > > > for
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > direct memory was around 230mb (with 410
> > > connection each
> > > > > > > > > > > > > > > taking
> > > > > > > > > > > > 500KB).
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Based on our understanding of broker memory
> > > allocation,
> > > > > > > > > > > > > > > message
> > > > > > > > > > > > payload
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > should be the only thing adding to direct
> memory
> > > utilization
> > > > > > > > > > > > > > > (on
> > > > > > > > > > > > top of
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > baseline), however, we are experiencing
> something
> > > completely
> > > > > > > > > > > > different.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > our last broker crash, we see that broker is
> > > constantly
> > > > > > > > > > > > > > > running
> > > > > > > > > > > > with
> > > > > > > > > > > > >
> > > > > > > > > > > > > 90%+
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > direct memory allocated, even when message
> > payload
> > > sum from
> > > > > > > > > > > > > > > all the
> > > > > > > > > > > > > > queues
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > is only 6-8% (these % are against available DM
> of
> > > 4gb).
> > > > > During
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > these
> > > > > > > > > > > > >
> > > > > > > > > > > > > high
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > DM usage period, heap usage was around 60% (of
> > > 12gb).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We would like some help in understanding what
> > > could be the
> > > > > > > > > > > > > > > reason
> > > > > > > > > > > > of
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > these
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > high DM allocations. Are there things other
> than
> > > message
> > > > > > > > > > > > > > > payload
> > > > > > > > > > > > and
> > > > > > > > > > > > >
> > > > > > > > > > > > > AMQP
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > connection, which use DM and could be
> > contributing
> > > to these
> > > > > > > > > > > > > > > high
> > > > > > > > > > > > usage?
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Another thing where we are puzzled is the
> > > de-allocation of
> > > > > DM
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > byte
> > > > > > > > > > > > > > buffers.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From log mining of heap and DM utilization,
> > > de-allocation of
> > > > > > > > > > > > > > > DM
> > > > > > > > > > > > doesn't
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > correlate with heap GC. If anyone has seen any
> > > documentation
> > > > > > > > > > > > related to
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > this, it would be very helpful if you could
> share
> > > that.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > > > Ramayan
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > *Exceptions*
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > java.lang.OutOfMemoryError: Direct buffer
> memory
> > > > > > > > > > > > > > > at java.nio.Bits.reserveMemory(Bits.java:658)
> > > > > ~[na:1.8.0_40]
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > at java.nio.DirectByteBuffer.<
> > > init>(DirectByteBuffer.java:
> > > > > 123)
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > > > at java.nio.ByteBuffer.
> > allocateDirect(ByteBuffer.
> > > java:311)
> > > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > org.apache.qpid.bytebuffer.
> > > QpidByteBuffer.allocateDirect(
> > > > > > > > > > > > > > > QpidByteBuffer.java:474)
> > > > > > > > > > > > > > > ~[qpid-common-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > > > NonBlockingConnectionPlainD
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > elegate.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > restoreApplicationBufferForWrite(
> > > > > NonBlockingConnectionPlainDele
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > gate.java:93)
> > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > > > NonBlockingConnectionPlainDele
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > gate.processData(
> NonBlockingConnectionPlainDele
> > > > > gate.java:60)
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > > > NonBlockingConnection.doRead(
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > NonBlockingConnection.java:506)
> > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > > > NonBlockingConnection.doWork(
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > NonBlockingConnection.java:285)
> > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > > > NetworkConnectionScheduler.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > processConnection(NetworkConnectionScheduler.
> > > java:124)
> > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > org.apache.qpid.server.
> transport.SelectorThread$
> > > > > ConnectionPr
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > ocessor.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > processConnection(SelectorThread.java:504)
> > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > org.apache.qpid.server.
> transport.SelectorThread$
> > > > > > > > > > > > > > > SelectionTask.performSelect(
> > > SelectorThread.java:337)
> > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > org.apache.qpid.server.
> transport.SelectorThread$
> > > > > SelectionTask.run(
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > SelectorThread.java:87)
> > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > org.apache.qpid.server.
> > > transport.SelectorThread.run(
> > > > > > > > > > > > > > > SelectorThread.java:462)
> > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > java.util.concurrent.
> > ThreadPoolExecutor.runWorker(
> > > > > > > > > > > > > > > ThreadPoolExecutor.java:1142)
> > > > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > java.util.concurrent.
> > > ThreadPoolExecutor$Worker.run(
> > > > > > > > > > > > > > > ThreadPoolExecutor.java:617)
> > > > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > > > at java.lang.Thread.run(Thread.java:745)
> > > ~[na:1.8.0_40]
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > *Second exception*
> > > > > > > > > > > > > > > java.lang.OutOfMemoryError: Direct buffer
> memory
> > > > > > > > > > > > > > > at java.nio.Bits.reserveMemory(Bits.java:658)
> > > > > ~[na:1.8.0_40]
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > at java.nio.DirectByteBuffer.<
> > > init>(DirectByteBuffer.java:
> > > > > 123)
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > > > at java.nio.ByteBuffer.
> > allocateDirect(ByteBuffer.
> > > java:311)
> > > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > org.apache.qpid.bytebuffer.
> > > QpidByteBuffer.allocateDirect(
> > > > > > > > > > > > > > > QpidByteBuffer.java:474)
> > > > > > > > > > > > > > > ~[qpid-common-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > > > NonBlockingConnectionPlainDele
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > gate.<init>(NonBlockingConnectionPlainDele
> > > gate.java:45)
> > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > NonBlockingConnection.
> > > > > > > > > > > > > > > setTransportEncryption(
> > NonBlockingConnection.java:
> > > 625)
> > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > > > NonBlockingConnection.<init>(
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > NonBlockingConnection.java:117)
> > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > > > NonBlockingNetworkTransport.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > acceptSocketChannel(
> NonBlockingNetworkTransport.
> > > java:158)
> > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > org.apache.qpid.server.
> transport.SelectorThread$
> > > > > SelectionTas
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > k$1.run(
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > SelectorThread.java:191)
> > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > org.apache.qpid.server.
> > > transport.SelectorThread.run(
> > > > > > > > > > > > > > > SelectorThread.java:462)
> > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > java.util.concurrent.
> > ThreadPoolExecutor.runWorker(
> > > > > > > > > > > > > > > ThreadPoolExecutor.java:1142)
> > > > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > java.util.concurrent.
> > > ThreadPoolExecutor$Worker.run(
> > > > > > > > > > > > > > > ThreadPoolExecutor.java:617)
> > > > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > > > at java.lang.Thread.run(Thread.java:745)
> > > ~[na:1.8.0_40]
> > > > > > > > > > > > > > >

Re: Java broker OOM due to DirectMemory

Ramayan Tiwari
Hi Alex,

Thanks for providing the patch. I verified the fix with the same perf test,
and it does prevent the broker from going OOM; however, DM utilization
doesn't get any better after hitting the threshold (where flow to disk is
activated based on the total used % across the broker - graph in the link
below).

After hitting the final threshold, flow to disk activates and deactivates
pretty frequently across all the queues. The reason seems to be that there
is currently only one threshold to trigger flow to disk. Would it make
sense to break this down into high and low thresholds, so that once flow to
disk becomes active after hitting the high threshold, it stays active until
the queue utilization (or broker DM allocation) drops to the low threshold?
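The high/low watermark behaviour proposed here can be sketched as follows (a minimal illustration with hypothetical names; this is not Qpid Broker code):

```java
// Hypothetical sketch of a high/low watermark (hysteresis) trigger for
// flow to disk; class and method names are illustrative only.
public class FlowToDiskHysteresis {
    private final double highWatermark; // e.g. 0.80 of direct memory
    private final double lowWatermark;  // e.g. 0.60 of direct memory
    private boolean flowToDiskActive = false;

    public FlowToDiskHysteresis(double highWatermark, double lowWatermark) {
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    /** Returns whether flow to disk should be active for the given usage ratio. */
    public synchronized boolean update(double usedFraction) {
        if (!flowToDiskActive && usedFraction >= highWatermark) {
            flowToDiskActive = true;   // latch on at the high threshold
        } else if (flowToDiskActive && usedFraction <= lowWatermark) {
            flowToDiskActive = false;  // release only once usage falls to the low threshold
        }
        return flowToDiskActive;
    }
}
```

With a single threshold, usage oscillating around it flips the trigger on and off constantly; the latching above avoids that flapping.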

Graph and flow to disk logs are here:
https://docs.google.com/document/d/1Wc1e-id-WlpI7FGU1Lx8XcKaV8sauRp82T5XZVU-RiM/edit#heading=h.6400pltvjhy7

Thanks
Ramayan

On Thu, May 4, 2017 at 2:44 AM, Oleksandr Rudyy <[hidden email]> wrote:

> Hi Ramayan,
>
> We attached to QPID-7753 a patch with a workaround for the 6.0.x branch.
> It triggers flow to disk based on direct memory consumption rather than an
> estimation of the space occupied by the message content. The flow to disk
> should evacuate message content, preventing the broker from running out of
> direct memory. We have already committed the changes into the 6.0.x and
> 6.1.x branches. It will be included in the upcoming 6.0.7 and 6.1.3
> releases.
>
> Please try and test the patch in your environment.
>
> We are still working on finishing the fix for trunk.
>
> Kind Regards,
> Alex
>
> On 30 April 2017 at 15:45, Lorenz Quack <[hidden email]> wrote:
>
> > Hi Ramayan,
> >
> > The high-level plan is currently as follows:
> >  1) Periodically try to compact sparse direct memory buffers.
> >  2) Increase accuracy of messages' direct memory usage estimation to more
> > reliably trigger flow to disk.
> >  3) Add an additional flow to disk trigger based on the amount of
> >     allocated direct memory.
> >
> > A little bit more detail:
> >  1) We plan on periodically checking the amount of direct memory usage
> >     and, if it is above a threshold (50%), we compare the sum of all
> >     queue sizes with the amount of allocated direct memory. If the ratio
> >     falls below a certain threshold, we trigger a compaction task which
> >     goes through all queues and copies a certain amount of old message
> >     buffers into new ones, thereby freeing the old buffers so that they
> >     can be returned to the buffer pool and be reused.
> >
> >  2) Currently we trigger flow to disk based on an estimate of how much
> >     memory the messages on the queues consume. We had to use estimates
> >     because we did not have accurate size numbers for message headers.
> >     By having accurate size information for message headers we can more
> >     reliably enforce queue memory limits.
> >
> >  3) The flow to disk trigger based on message size had another problem
> >     which is more pertinent to the current issue. We only considered the
> >     size of the messages and not how much memory we allocate to store
> >     those messages. In the FIFO use case those numbers will be very close
> >     to each other, but in use cases like yours we can end up with sparse
> >     buffers and the numbers will diverge. Because of this divergence we
> >     do not trigger flow to disk in time and the broker can go OOM.
> >     To fix the issue we want to add an additional flow to disk trigger
> >     based on the amount of allocated direct memory. This should prevent
> >     the broker from going OOM even if the compaction strategy outlined
> >     above should fail for some reason (e.g., the compaction task cannot
> >     keep up with the arrival of new messages).
> >
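A minimal illustration of the compaction idea in point 1, using plain NIO rather than Qpid's actual QpidByteBuffer machinery (which differs in detail): copying a small live slice into a fresh, tightly sized buffer drops the reference that pins the large underlying buffer, so the large buffer can be returned to the pool.

```java
import java.nio.ByteBuffer;

public class CompactionSketch {
    /**
     * Copies the remaining bytes of a slice of a large (possibly sparse)
     * buffer into a new, tightly-sized buffer. Dropping the old slice then
     * removes the last reference pinning the large underlying buffer.
     */
    static ByteBuffer compact(ByteBuffer sparseSlice) {
        ByteBuffer tight = ByteBuffer.allocate(sparseSlice.remaining());
        tight.put(sparseSlice.duplicate()); // copy without disturbing position
        tight.flip();
        return tight;
    }

    public static void main(String[] args) {
        // A 256K "network buffer" of which only 400 bytes belong to a live message.
        ByteBuffer network = ByteBuffer.allocateDirect(256 * 1024);
        network.position(1024).limit(1024 + 400);
        ByteBuffer messageChunk = network.slice(); // pins all 256K while referenced

        ByteBuffer compacted = compact(messageChunk);
        System.out.println(compacted.remaining()); // 400
    }
}
```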
> > Currently, there are patches for the above points, but they suffer from
> > some thread-safety issues that need to be addressed.
> >
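For the trigger described in point 3, one standard way to observe how much direct memory the JVM has actually allocated is the platform `BufferPoolMXBean`. This is shown as an assumption about how such a trigger might obtain the number; Qpid's internal accounting may well differ:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public class DirectMemoryProbe {
    /** Returns bytes of direct buffer memory currently in use by the JVM, or -1. */
    static long directMemoryUsed() {
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hold a reference so the buffer is not collected before we probe.
        ByteBuffer pinned = ByteBuffer.allocateDirect(1024 * 1024);
        System.out.println(directMemoryUsed() >= pinned.capacity());
    }
}
```

Note that a broker with its own buffer pool (as Qpid has) will see pooled-but-idle buffers counted as "used" by this bean, which is exactly the allocated-rather-than-live quantity the proposed trigger cares about.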
> > I hope this description helps. Any feedback is, as always, welcome.
> >
> > Kind regards,
> > Lorenz
> >
> >
> >
> > On Sat, Apr 29, 2017 at 12:00 AM, Ramayan Tiwari <[hidden email]> wrote:
> >
> > > Hi Lorenz,
> > >
> > > Thanks so much for the patch. We have a perf test now to reproduce this
> > > issue, so we tested with 256KB, 64KB and 4KB network byte buffers. None
> > > of these configurations helps with the issue (or gives any more
> > > breathing room) for our use case. We would like to share the perf
> > > analysis with the community:
> > >
> > > https://docs.google.com/document/d/1Wc1e-id-WlpI7FGU1Lx8XcKaV8sauRp82T5XZVU-RiM/edit?usp=sharing
> > >
> > > Feel free to comment on the doc if certain details are incorrect or if
> > > there are questions.
> > >
> > > Since the short-term solution doesn't help us, we are very interested
> > > in some details on how the community plans to address this; a
> > > high-level description of the approach would be very helpful for us in
> > > order to brainstorm our use cases against this solution.
> > >
> > > - Ramayan
> > >
> > > On Fri, Apr 28, 2017 at 9:34 AM, Lorenz Quack <[hidden email]>
> > > wrote:
> > >
> > > > Hello Ramayan,
> > > >
> > > > We are still working on a fix for this issue.
> > > > In the meantime, we had an idea to potentially work around the issue
> > > > until a proper fix is released.
> > > >
> > > > The idea is to decrease the qpid network buffer size the broker uses.
> > > > While this still allows for sparsely populated buffers, it would
> > > > improve the overall occupancy ratio.
> > > >
> > > > Here are the steps to follow:
> > > >  * ensure you are not using TLS
> > > >  * apply the attached patch
> > > >  * figure out the size of the largest messages you are sending
> > > >    (including header and some overhead)
> > > >  * set the context variable "qpid.broker.networkBufferSize" to that
> > > >    value, but not smaller than 4096
> > > >  * test
> > > >
> > > > Decreasing the qpid network buffer size automatically limits the
> > > > maximum AMQP frame size.
> > > > Since you are using a very old client, we are not sure how well it
> > > > copes with small frame sizes where it has to split a message across
> > > > multiple frames.
> > > > Therefore, to play it safe, you should not set it smaller than the
> > > > largest messages (+ header + overhead) you are sending.
> > > > I do not know what message sizes you are sending, but AMQP imposes
> > > > the restriction that the frame size cannot be smaller than 4096
> > > > bytes. In the qpid broker the default is currently 256 kB.
> > > >
> > > > In the current state the broker does not allow setting the network
> > > > buffer to values smaller than 64 kB, to allow TLS frames to fit into
> > > > one network buffer.
> > > > I attached a patch to this mail that lowers that restriction to the
> > > > limit imposed by AMQP (4096 bytes).
> > > > Obviously, you should not use this when using TLS.
> > > >
> > > >
> > > > I hope this reduces the problems you are currently facing until we
> > > > can complete the proper fix.
> > > >
> > > > Kind regards,
> > > > Lorenz
> > > >
> > > >
> > > > On Fri, 2017-04-21 at 09:17 -0700, Ramayan Tiwari wrote:
> > > > > Thanks so much Keith and the team for finding the root cause. We
> > > > > are so relieved that the root cause will be fixed shortly.
> > > > >
> > > > > A couple of things that I forgot to mention about the mitigation
> > > > > steps we took in the last incident:
> > > > > 1) We triggered GC from the JMX bean multiple times; it did not
> > > > > help in reducing allocated DM.
> > > > > 2) We also killed all the AMQP connections to the broker when DM
> > > > > was at 80%. This did not help either. The way we killed
> > > > > connections: using JMX, we got a list of all the open AMQP
> > > > > connections and called close from the JMX MBean.
> > > > >
> > > > > I am hoping the above two are not related to the root cause, but I
> > > > > wanted to bring them up in case they are relevant.
> > > > >
> > > > > Thanks
> > > > > Ramayan
> > > > >
> > > > > On Fri, Apr 21, 2017 at 8:29 AM, Keith W <[hidden email]>
> > wrote:
> > > > >
> > > > > >
> > > > > > Hello Ramayan
> > > > > >
> > > > > > I believe I understand the root cause of the problem.  We have
> > > > > > identified a flaw in the direct memory buffer management employed
> > > > > > by Qpid Broker J which for some messaging use-cases can lead to
> > > > > > the direct memory OOM you describe.  For the issue to manifest,
> > > > > > the producing application needs to use a single connection for
> > > > > > the production of messages, some of which are short-lived (i.e.
> > > > > > are consumed quickly) whilst others remain on the queue for some
> > > > > > time.  Priority queues, sorted queues and consumers utilising
> > > > > > selectors that result in some messages being left on the queue
> > > > > > could all produce this pattern.  The pattern leads to sparsely
> > > > > > occupied 256K network buffers which cannot be released or reused
> > > > > > until every message that references a 'chunk' of them is either
> > > > > > consumed or flown to disk.  The problem was introduced with Qpid
> > > > > > v6.0 and exists in v6.1 and trunk too.
> > > > > >
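The sparse-buffer pattern described above can be illustrated with plain NIO slices (simplified; the broker's actual chunking via QpidByteBuffer differs in detail): each message holds a slice of a shared 256K network buffer, and a single long-lived slice keeps the whole buffer from being reused.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class SparseBufferSketch {
    public static void main(String[] args) {
        // One 256K network buffer, sliced into per-message chunks as frames arrive.
        ByteBuffer network = ByteBuffer.allocateDirect(256 * 1024);
        List<ByteBuffer> messages = new ArrayList<>();
        for (int i = 0; i < 640; i++) {            // 640 messages x 400 bytes
            network.position(i * 400).limit(i * 400 + 400);
            messages.add(network.slice());          // each slice references the 256K buffer
        }
        // Short-lived messages are consumed quickly...
        ByteBuffer longLived = messages.get(0);
        messages.clear();
        // ...but the one remaining long-lived slice still pins the entire
        // 256K direct buffer: it cannot be returned to the pool until this
        // reference is also released (consumed or flown to disk).
        System.out.println(longLived.isDirect() && longLived.capacity() == 400);
    }
}
```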
> > > > > > The flow to disk feature is not helping us here because its
> > > > > > algorithm considers only the size of live messages on the queues.
> > > > > > If the accumulative live size does not exceed the threshold, the
> > > > > > messages aren't flown to disk. I speculate that when you observed
> > > > > > moving messages cause direct memory usage to drop earlier today,
> > > > > > your message movement caused a queue to go over its threshold,
> > > > > > causing messages to be flown to disk and their direct memory
> > > > > > references released.  The logs will confirm whether this is so.
> > > > > >
> > > > > > I have not identified an easy workaround at the moment.
> > > > > > Decreasing the flow to disk threshold and/or increasing available
> > > > > > direct memory should alleviate the problem and may be an
> > > > > > acceptable short-term workaround.  If it were possible for the
> > > > > > publishing application to publish short-lived and long-lived
> > > > > > messages on two separate JMS connections, that would avoid this
> > > > > > defect.
> > > > > >
> > > > > > QPID-7753 tracks this issue and QPID-7754 is a related problem.
> > > > > > We intend to work on these early next week and will be aiming
> > > > > > for a fix that is back-portable to 6.0.
> > > > > >
> > > > > > Apologies that you have run into this defect, and thanks for
> > > > > > reporting it.
> > > > > >
> > > > > > Thanks, Keith
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 21 April 2017 at 10:21, Ramayan Tiwari <[hidden email]> wrote:
> > > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > We have been monitoring the brokers every day, and today we
> > > > > > > found one instance where the broker's DM was constantly going
> > > > > > > up and it was about to crash, so we experimented with some
> > > > > > > mitigations, one of which caused the DM to come down. Following
> > > > > > > are the details, which might help us understand the issue:
> > > > > > >
> > > > > > > Traffic scenario:
> > > > > > >
> > > > > > > DM allocation had been constantly going up and was at 90%.
> > > > > > > There were two queues which seemed to align with the theories
> > > > > > > that we had. Q1's size had been large right after the broker
> > > > > > > start and it had slow consumption of messages; queue size only
> > > > > > > reduced from 76MB to 75MB over a period of 6hrs.
> > > > > > >
> > > > > > > Q2, on the other hand, started small and was gradually growing;
> > > > > > > queue size went from 7MB to 10MB in 6hrs. There were other
> > > > > > > queues with traffic during this time.
> > > > > > >
> > > > > > > Action taken:
> > > > > > >
> > > > > > > Moved all the messages from Q2 (since this was our original
> > > > > > > theory) to Q3 (already created, but with no messages in it).
> > > > > > > This did not help with the DM growth.
> > > > > > > Moved all the messages from Q1 to Q4 (already created, but with
> > > > > > > no messages in it). This reduced DM allocation from 93% to 31%.
> > > > > > >
> > > > > > > We have the heap dump and thread dump from when the broker was
> > > > > > > at 90% DM allocation. We are going to analyze them to see if we
> > > > > > > can get some clues. We wanted to share this new information,
> > > > > > > which might help in reasoning about the memory issue.
> > > > > > >
> > > > > > > - Ramayan
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Apr 20, 2017 at 11:20 AM, Ramayan Tiwari <[hidden email]> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > Hi Keith,
> > > > > > > >
> > > > > > > > Thanks so much for your response and for digging into the
> > > > > > > > issue. Below are the answers to your questions:
> > > > > > > >
> > > > > > > > 1) Yes, we are using QPID-7462 with 6.0.5. We couldn't use
> > > > > > > > 6.1, where it was released, because we need JMX support. Here
> > > > > > > > is the destination format:
> > > > > > > >
> > > > > > > > "%s ; {node : { type : queue }, link : { x-subscribes : {
> > > > > > > > arguments : { x-multiqueue : [%s], x-pull-only : true }}}}"
> > > > > > > >
> > > > > > > > 2) Our machines have 40 cores, which makes the number of
> > > > > > > > threads 80. This might not be an issue, because it would show
> > > > > > > > up in the baseline DM allocation, which is only 6% (of 4GB)
> > > > > > > > when we just bring up the broker.
> > > > > > > >
> > > > > > > > 3) The only setting that we tuned with respect to DM is
> > > > > > > > flowToDiskThreshold, which is set at 80% now.
> > > > > > > >
> > > > > > > > 4) Only one virtual host in the broker.
> > > > > > > >
> > > > > > > > 5) Most of our queues (99%) are priority queues; we also have
> > > > > > > > 8-10 sorted queues.
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > 6) Yeah we are using the standard 0.16 client and not AMQP
> 1.0
> > > > clients.
> > > > > > > > The connection log line looks like:
> > > > > > > > CON-1001 : Open : Destination : AMQP(IP:5672) : Protocol
> > Version
> > > :
> > > > 0-10
> > > > > > :
> > > > > > >
> > > > > > > >
> > > > > > > > Client ID : test : Client Version : 0.16 : Client Product :
> > qpid
> > > > > > > >
> > > > > > > > We had another broker crashed about an hour back, we do see
> the
> > > > same
> > > > > > > > patterns:
> > > > > > > > 1) There is a queue which is constantly growing, enqueue is
> > > faster
> > > > than
> > > > > > > > dequeue on that queue for a long period of time.
> > > > > > > > 2) Flow to disk didn't kick in at all.
> > > > > > > >
> > > > > > > > This graph shows memory growth (red line - heap, blue - DM
> > > > allocated,
> > > > > > > > yellow - DM used)
> > > > > > > >
> > > > > > > > https://drive.google.com/file/d/
> 0Bwi0MEV3srPRdVhXdTBncHJLY2c/
> > > > > > view?usp=sharing
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > The below graph shows growth on a single queue (there are
> 10-12
> > > > other
> > > > > > > > queues with traffic as well, something large size than this
> > > queue):
> > > > > > > >
> > > > > > > > https://drive.google.com/file/d/
> 0Bwi0MEV3srPRWmNGbDNGUkJhQ0U/
> > > > > > view?usp=sharing
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Couple of questions:
> > > > > > > > 1) Is there any developer level doc/design spec on how Qpid
> > uses
> > > > DM?
> > > > > > > > 2) We are not getting heap dumps automatically when broker
> > > crashes
> > > > due
> > > > > > to
> > > > > > >
> > > > > > > >
> > > > > > > > DM (HeapDumpOnOutOfMemoryError not respected). Has anyone
> > found a
> > > > way
> > > > > > to get
> > > > > > >
> > > > > > > >
> > > > > > > > around this problem?
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Ramayan
> > > > > > > >
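The multi-queue destination address quoted in the reply above is an ordinary format string with two substitution points. As an illustration only (a sketch: the subscription name and queue list below are made-up placeholders; the address template itself is the one from QPID-7462 quoted in this thread):

```java
// Builds the x-multiqueue / x-pull-only destination address described in
// this thread. "sub-1" and the queue list are hypothetical placeholders.
public class MultiQueueAddress {
    static String address(String subscription, String queueList) {
        return String.format(
            "%s ; {node : { type : queue }, link : { x-subscribes : "
            + "{ arguments : { x-multiqueue : [%s], x-pull-only : true }}}}",
            subscription, queueList);
    }

    public static void main(String[] args) {
        System.out.println(address("sub-1", "queue-a, queue-b"));
    }
}
```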
> > > > > > > > On Thu, Apr 20, 2017 at 9:08 AM, Keith W <[hidden email]> wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi Ramayan
> > > > > > > > >
> > > > > > > > > We have been discussing your problem here and have a couple of
> > > > > > > > > questions.
> > > > > > > > >
> > > > > > > > > I have been experimenting with use-cases based on your descriptions
> > > > > > > > > above but, so far, have been unsuccessful in reproducing a
> > > > > > > > > "java.lang.OutOfMemoryError: Direct buffer memory" condition. The
> > > > > > > > > direct memory usage reflects the expected model: it levels off when
> > > > > > > > > the flow to disk threshold is reached, and direct memory is released
> > > > > > > > > as messages are consumed until the minimum size for caching of direct
> > > > > > > > > memory is reached.
> > > > > > > > >
> > > > > > > > > 1] For clarity let me check: we believe when you say "patch to use
> > > > > > > > > MultiQueueConsumer" you are referring to the patch attached to
> > > > > > > > > QPID-7462 "Add experimental "pull" consumers to the broker" and you
> > > > > > > > > are using a combination of this "x-pull-only" with the standard
> > > > > > > > > "x-multiqueue" feature.  Is this correct?
> > > > > > > > >
> > > > > > > > > 2] One idea we had here relates to the size of the virtualhost IO
> > > > > > > > > pool.   As you know from the documentation, the Broker caches/reuses
> > > > > > > > > direct memory internally, but the documentation fails to mention that
> > > > > > > > > each pooled virtualhost IO thread also grabs a chunk (256K) of direct
> > > > > > > > > memory from this cache.  By default the virtualhost IO pool is sized
> > > > > > > > > Math.max(Runtime.getRuntime().availableProcessors() * 2, 64), so if
> > > > > > > > > you have a machine with a very large number of cores, you may have a
> > > > > > > > > surprisingly large amount of direct memory assigned to virtualhost IO
> > > > > > > > > threads.   Check the value of connectionThreadPoolSize on the
> > > > > > > > > virtualhost
> > > > > > > > > (http://<server>:<port>/api/latest/virtualhost/<virtualhostnodename>/<virtualhostname>)
> > > > > > > > > to see what value is in force.  What is it?  It is possible to tune
> > > > > > > > > the pool size using the context variable
> > > > > > > > > virtualhost.connectionThreadPool.size.
> > > > > > > > >
> > > > > > > > > 3] Tell me if you are tuning the Broker in any way beyond the
> > > > > > > > > direct/heap memory settings you have told us about already.  For
> > > > > > > > > instance, are you changing any of the direct memory pooling settings
> > > > > > > > > (broker.directByteBufferPoolSize), the default network buffer size
> > > > > > > > > (qpid.broker.networkBufferSize) or applying any other non-standard
> > > > > > > > > settings?
> > > > > > > > >
> > > > > > > > > 4] How many virtual hosts do you have on the Broker?
> > > > > > > > >
> > > > > > > > > 5] What is the consumption pattern of the messages?  Do you consume
> > > > > > > > > in a strictly FIFO fashion, or are you making use of message selectors
> > > > > > > > > or/and any of the out-of-order queue types (LVQs, priority queues or
> > > > > > > > > sorted queues)?
> > > > > > > > >
> > > > > > > > > 6] Is it just the 0.16 client involved in the application?  Can I
> > > > > > > > > check that you are not using any of the AMQP 1.0 clients
> > > > > > > > > (org.apache.qpid:qpid-jms-client or
> > > > > > > > > org.apache.qpid:qpid-amqp-1-0-client) in the software stack (as either
> > > > > > > > > consumers or producers)?
> > > > > > > > >
> > > > > > > > > Hopefully the answers to these questions will get us closer to a
> > > > > > > > > reproduction.   If you are able to reliably reproduce it, please share
> > > > > > > > > the steps with us.
> > > > > > > > >
> > > > > > > > > Kind regards, Keith.
> > > > > > > > >
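Keith's pool-sizing formula can be combined with the per-connection figure mentioned elsewhere in this thread into a rough baseline estimate. A minimal sketch, assuming 256KB per pooled IO thread and 512KB per plain connection (both figures taken from this thread, not from broker source): for the 40-core, ~400-connection setup described it predicts about 220MB, close to the ~230MB baseline Ramayan observed.

```java
// Rough baseline direct-memory estimate for the setup described in this
// thread (40 cores, ~400 plain AMQP connections). The per-thread (256KB)
// and per-connection (512KB) figures are assumptions quoted in the thread.
public class DmBaseline {
    static final long KB = 1024;

    // Default virtualhost IO pool size quoted by Keith: max(cores * 2, 64).
    static int ioPoolSize(int cores) {
        return Math.max(cores * 2, 64);
    }

    static long baselineBytes(int cores, int plainConnections) {
        long ioThreads = ioPoolSize(cores) * 256 * KB;   // 256KB chunk per IO thread
        long connections = plainConnections * 512L * KB; // 512KB per plain connection
        return ioThreads + connections;
    }

    public static void main(String[] args) {
        long bytes = baselineBytes(40, 400);
        // 80 threads * 256KB = 20MB, plus 400 * 512KB = 200MB -> ~220MB
        System.out.printf("estimated baseline DM: %d MB%n", bytes / (KB * KB));
    }
}
```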
> > > > > > > > >
> > > > > > > > > On 20 April 2017 at 10:21, Ramayan Tiwari <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > After a lot of log mining, we might have a way to explain the
> > > > > > > > > > sustained increase in DirectMemory allocation; the correlation seems
> > > > > > > > > > to be with the growth in the size of a queue that is getting
> > > > > > > > > > consumed, but at a much slower rate than producers are putting
> > > > > > > > > > messages on it.
> > > > > > > > > >
> > > > > > > > > > The pattern we see is that in each instance of a broker crash, there
> > > > > > > > > > is at least one queue (usually one queue) whose size kept growing
> > > > > > > > > > steadily. It’d be of significant size but not the largest queue --
> > > > > > > > > > usually there are multiple larger queues -- but it was different
> > > > > > > > > > from other queues in that its size was growing steadily. The queue
> > > > > > > > > > would also be moving, but its processing rate was not keeping up
> > > > > > > > > > with the enqueue rate.
> > > > > > > > > >
> > > > > > > > > > Our theory, which might be totally wrong: if a queue is moving the
> > > > > > > > > > entire time, maybe the broker keeps reusing the same buffer in
> > > > > > > > > > direct memory for the queue, and keeps adding onto it at the end to
> > > > > > > > > > accommodate new messages. But because it’s active all the time and
> > > > > > > > > > we’re pointing to the same buffer, space allocated for messages at
> > > > > > > > > > the head of the queue/buffer doesn’t get reclaimed, even long after
> > > > > > > > > > those messages have been processed. Just a theory.
> > > > > > > > > >
> > > > > > > > > > We are also trying to reproduce this using some perf tests that
> > > > > > > > > > enqueue with the same pattern; will update with the findings.
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > > Ramayan
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 19, 2017 at 6:52 PM, Ramayan Tiwari <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Another issue that we noticed is that when the broker goes OOM due
> > > > > > > > > > > to direct memory, it doesn't create a heap dump (specified by
> > > > > > > > > > > "-XX:+HeapDumpOnOutOfMemoryError"), even though the OOM error is
> > > > > > > > > > > the same as what is mentioned in the Oracle JVM docs
> > > > > > > > > > > ("java.lang.OutOfMemoryError").
> > > > > > > > > > >
> > > > > > > > > > > Has anyone been able to find a way to get a heap dump for a DM OOM?
> > > > > > > > > > >
> > > > > > > > > > > - Ramayan
> > > > > > > > > > >
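Since heap dumps are not being produced for this direct-memory OOM, one way to observe direct-memory growth before the crash (a sketch, not broker-specific) is the JDK's `BufferPoolMXBean`, which exposes the buffer count, used bytes and total capacity of the "direct" pool; it can be polled from a small agent or over JMX to capture the state leading up to the OOM.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Prints the JDK's NIO buffer pool statistics. The pool named "direct"
// covers ByteBuffer.allocateDirect, which is what the broker uses for
// message content and network buffers.
public class DirectMemoryWatcher {
    public static void main(String[] args) {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            System.out.printf("%s: count=%d used=%d capacity=%d%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
```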
> > > > > > > > > > > On Wed, Apr 19, 2017 at 11:21 AM, Ramayan Tiwari <[hidden email]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Alex,
> > > > > > > > > > > >
> > > > > > > > > > > > Below are the flow to disk logs from the broker holding 3
> > > > > > > > > > > > million+ messages at this time. We only have one virtual host.
> > > > > > > > > > > > Time is in GMT. Looks like flow to disk is active on the whole
> > > > > > > > > > > > virtual host and not at queue level.
> > > > > > > > > > > >
> > > > > > > > > > > > When the same broker went OOM yesterday, I did not see any flow
> > > > > > > > > > > > to disk logs from when it was started until it crashed (crashed
> > > > > > > > > > > > twice within 4hrs).
> > > > > > > > > > > >
> > > > > > > > > > > > 4/19/17 4:17:43.509 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3356539KB exceeds threshold 3355443KB
> > > > > > > > > > > > 4/19/17 2:31:13.502 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3354866KB within threshold 3355443KB
> > > > > > > > > > > > 4/19/17 2:28:43.511 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3358509KB exceeds threshold 3355443KB
> > > > > > > > > > > > 4/19/17 2:20:13.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353501KB within threshold 3355443KB
> > > > > > > > > > > > 4/19/17 2:18:13.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3357544KB exceeds threshold 3355443KB
> > > > > > > > > > > > 4/19/17 2:08:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353236KB within threshold 3355443KB
> > > > > > > > > > > > 4/19/17 2:08:13.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3356704KB exceeds threshold 3355443KB
> > > > > > > > > > > > 4/19/17 2:00:43.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353511KB within threshold 3355443KB
> > > > > > > > > > > > 4/19/17 2:00:13.504 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3357948KB exceeds threshold 3355443KB
> > > > > > > > > > > > 4/19/17 1:50:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355310KB within threshold 3355443KB
> > > > > > > > > > > > 4/19/17 1:47:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3365624KB exceeds threshold 3355443KB
> > > > > > > > > > > > 4/19/17 1:43:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355136KB within threshold 3355443KB
> > > > > > > > > > > > 4/19/17 1:31:43.509 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3358683KB exceeds threshold 3355443KB
> > > > > > > > > > > >
> > > > > > > > > > > > After the production release (2 days back), we have seen 4
> > > > > > > > > > > > crashes in 3 different brokers; this is the most pressing
> > > > > > > > > > > > concern for us in deciding whether we should roll back to 0.32.
> > > > > > > > > > > > Any help is greatly appreciated.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks
> > > > > > > > > > > > Ramayan
> > > > > > > > > > > >
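As a sanity check on the numbers in these logs: the 3355443KB threshold is exactly 80% of the 4GB max direct memory, consistent with the flowToDiskThreshold of 80% mentioned earlier in the thread. A minimal sketch of that arithmetic:

```java
// Verifies that the BRK-1014/BRK-1015 threshold in the logs above matches
// 80% of the 4GB max direct memory configured on this broker.
public class FlowToDiskThreshold {
    static long thresholdKb(long maxDirectMemoryBytes, double fraction) {
        return (long) (maxDirectMemoryBytes * fraction / 1024);
    }

    public static void main(String[] args) {
        long fourGb = 4L * 1024 * 1024 * 1024;
        // 0.8 * 4GB = 3355443.2KB, truncated to 3355443KB as in the logs
        System.out.println(thresholdKb(fourGb, 0.80) + "KB");
    }
}
```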
> > > > > > > > > > > > On Wed, Apr 19, 2017 at 9:36 AM, Oleksandr Rudyy <[hidden email]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ramayan,
> > > > > > > > > > > > > Thanks for the details. I would like to clarify whether flow
> > > > > > > > > > > > > to disk was triggered today for the 3 million messages?
> > > > > > > > > > > > >
> > > > > > > > > > > > > The following logs are issued for flow to disk:
> > > > > > > > > > > > > BRK-1014 : Message flow to disk active :  Message memory use {0,number,#}KB exceeds threshold {1,number,#.##}KB
> > > > > > > > > > > > > BRK-1015 : Message flow to disk inactive : Message memory use {0,number,#}KB within threshold {1,number,#.##}KB
> > > > > > > > > > > > >
> > > > > > > > > > > > > Kind Regards,
> > > > > > > > > > > > > Alex
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 19 April 2017 at 17:10, Ramayan Tiwari <[hidden email]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Alex,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for your response, here are the details:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We use a "direct" exchange, without persistence (we specify
> > > > > > > > > > > > > > NON_PERSISTENT while sending from the client) and use the
> > > > > > > > > > > > > > BDB store. We use the JSON virtual host type. We are not
> > > > > > > > > > > > > > using SSL.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > When the broker went OOM, we had around 1.3 million messages
> > > > > > > > > > > > > > with a 100-byte average message size. Direct memory
> > > > > > > > > > > > > > allocation (value read from the MBean) kept going up, even
> > > > > > > > > > > > > > though it wouldn't need more DM to store that many messages.
> > > > > > > > > > > > > > DM allocated persisted at 99% for about 3 and a half hours
> > > > > > > > > > > > > > before crashing.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Today, on the same broker, we have 3 million messages (same
> > > > > > > > > > > > > > message size) and DM allocated is only at 8%. This seems
> > > > > > > > > > > > > > like there is some issue with de-allocation, or a leak.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I have uploaded the memory utilization graph here:
> > > > > > > > > > > > > > https://drive.google.com/file/d/0Bwi0MEV3srPRVHFEbDlIYUpLaUE/view?usp=sharing
> > > > > > > > > > > > > > Blue line is DM allocated, Yellow is DM Used (sum of queue
> > > > > > > > > > > > > > payload) and Red is heap usage.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > > Ramayan
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Apr 19, 2017 at 4:10 AM, Oleksandr Rudyy <[hidden email]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Ramayan,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Could you please share with us the details of the
> > > > > > > > > > > > > > > messaging use case(s) which ended up in OOM on the broker
> > > > > > > > > > > > > > > side? I would like to reproduce the issue on my local
> > > > > > > > > > > > > > > broker in order to fix it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I would appreciate it if you could provide as many details
> > > > > > > > > > > > > > > as possible, including messaging topology, message
> > > > > > > > > > > > > > > persistence type, message sizes, volumes, etc.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Qpid Broker 6.0.x uses direct memory for keeping message
> > > > > > > > > > > > > > > content and receiving/sending data. Each plain connection
> > > > > > > > > > > > > > > utilizes 512K of direct memory. Each SSL connection uses
> > > > > > > > > > > > > > > 1M of direct memory. Your memory settings look OK to me.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Kind Regards,
> > > > > > > > > > > > > > > Alex
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On 18 April 2017 at 23:39, Ramayan Tiwari <[hidden email]> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > We are using Java broker 6.0.5, with patch to
> > use
> > > > > > > > > > > > > MultiQueueConsumer
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > feature. We just finished deploying to
> > production
> > > > and saw
> > > > > > > > > > > > > > > > couple of
> > > > > > > > > > > > > > > > instances of broker OOM due to running out of
> > > > DirectMemory
> > > > > > > > > > > > > > > > buffer
> > > > > > > > > > > > > > > > (exceptions at the end of this email).
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Here is our setup:
> > > > > > > > > > > > > > > > 1. Max heap 12g, max direct memory 4g (this
> is
> > > > opposite of
> > > > > > > > > > > > > > > > what the
> > > > > > > > > > > > > > > > recommendation is, however, for our use cause
> > > > message
> > > > > > payload
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > really
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > small ~400bytes and is way less than the per
> > > > message
> > > > > > overhead
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > 1KB).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > perf testing, we were able to put 2 million
> > > > messages without
> > > > > > > > > > > > > > > > any
> > > > > > > > > > > > > > issues.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2. ~400 connections to broker.
> > > > > > > > > > > > > > > > 3. Each connection has 20 sessions and there
> is
> > > > one multi
> > > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > consumer
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > attached to each session, listening to around
> > > 1000
> > > > queues.
> > > > > > > > > > > > > > > > 4. We are still using 0.16 client (I know).
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > With the above setup, the baseline
> utilization
> > > > (without any
> > > > > > > > > > > > > messages)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > for
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > direct memory was around 230mb (with 410
> > > > connection each
> > > > > > > > > > > > > > > > taking
> > > > > > > > > > > > > 500KB).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Based on our understanding of broker memory
> > > > allocation,
> > > > > > > > > > > > > > > > message
> > > > > > > > > > > > > payload
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > should be the only thing adding to direct
> > memory
> > > > utilization
> > > > > > > > > > > > > > > > (on
> > > > > > > > > > > > > top of
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > baseline), however, we are experiencing
> > something
> > > > completely
> > > > > > > > > > > > > different.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > In
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > our last broker crash, we see that broker is
> > > > constantly
> > > > > > > > > > > > > > > > running
> > > > > > > > > > > > > with
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 90%+
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > direct memory allocated, even when message
> > > payload
> > > > sum from
> > > > > > > > > > > > > > > > all the
> > > > > > > > > > > > > > > queues
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > is only 6-8% (these % are against available
> DM
> > of
> > > > 4gb).
> > > > > > During
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > these
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > high
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > DM usage period, heap usage was around 60%
> (of
> > > > 12gb).
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > We would like some help in understanding what
> > > > could be the
> > > > > > > > > > > > > > > > reason
> > > > > > > > > > > > > of
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > high DM allocations. Are there things other
> > than
> > > > message
> > > > > > > > > > > > > > > > payload
> > > > > > > > > > > > > and
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > AMQP
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > connection, which use DM and could be
> > > contributing
> > > > to these
> > > > > > > > > > > > > > > > high
> > > > > > > > > > > > > usage?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Another thing where we are puzzled is the
> > > > de-allocation of
> > > > > > DM
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > byte
> > > > > > > > > > > > > > > buffers.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > From log mining of heap and DM utilization,
> > > > de-allocation of
> > > > > > > > > > > > > > > > DM
> > > > > > > > > > > > > doesn't
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > correlate with heap GC. If anyone has seen
> any
> > > > documentation
> > > > > > > > > > > > > related to
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > this, it would be very helpful if you could
> > share
> > > > that.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > > > > Ramayan
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > *Exceptions*
> > > > > > > > > > > > > > > >
java.lang.OutOfMemoryError: Direct buffer memory
        at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
        at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
        at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.restoreApplicationBufferForWrite(NonBlockingConnectionPlainDelegate.java:93) ~[qpid-broker-core-6.0.5.jar:6.0.5]
        at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.processData(NonBlockingConnectionPlainDelegate.java:60) ~[qpid-broker-core-6.0.5.jar:6.0.5]
        at org.apache.qpid.server.transport.NonBlockingConnection.doRead(NonBlockingConnection.java:506) ~[qpid-broker-core-6.0.5.jar:6.0.5]
        at org.apache.qpid.server.transport.NonBlockingConnection.doWork(NonBlockingConnection.java:285) ~[qpid-broker-core-6.0.5.jar:6.0.5]
        at org.apache.qpid.server.transport.NetworkConnectionScheduler.processConnection(NetworkConnectionScheduler.java:124) ~[qpid-broker-core-6.0.5.jar:6.0.5]
        at org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.processConnection(SelectorThread.java:504) ~[qpid-broker-core-6.0.5.jar:6.0.5]
        at org.apache.qpid.server.transport.SelectorThread$SelectionTask.performSelect(SelectorThread.java:337) ~[qpid-broker-core-6.0.5.jar:6.0.5]
        at org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(SelectorThread.java:87) ~[qpid-broker-core-6.0.5.jar:6.0.5]
        at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
*Second exception*

java.lang.OutOfMemoryError: Direct buffer memory
        at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
        at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
        at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.<init>(NonBlockingConnectionPlainDelegate.java:45) ~[qpid-broker-core-6.0.5.jar:6.0.5]
        at org.apache.qpid.server.transport.NonBlockingConnection.setTransportEncryption(NonBlockingConnection.java:625) ~[qpid-broker-core-6.0.5.jar:6.0.5]
        at org.apache.qpid.server.transport.NonBlockingConnection.<init>(NonBlockingConnection.java:117) ~[qpid-broker-core-6.0.5.jar:6.0.5]
        at org.apache.qpid.server.transport.NonBlockingNetworkTransport.acceptSocketChannel(NonBlockingNetworkTransport.java:158) ~[qpid-broker-core-6.0.5.jar:6.0.5]
        at org.apache.qpid.server.transport.SelectorThread$SelectionTask$1.run(SelectorThread.java:191) ~[qpid-broker-core-6.0.5.jar:6.0.5]
        at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Java broker OOM due to DirectMemory

Oleksandr Rudyy
Hi Ramayan,

Thanks for testing the patch and providing feedback.

Regarding direct memory utilization: the Qpid Broker caches up to 256MB of
direct memory internally in QpidByteBuffers. Thus, when testing the Broker
with only 256MB of direct memory, the entire direct memory could be cached
and it would look as if direct memory is never released. You can potentially
reduce the number of buffers cached on the broker by changing the context
variable 'broker.directByteBufferPoolSize'. By default it is set to 1000;
with a buffer size of 256KB, that gives ~256MB of cache.
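As a back-of-envelope check of that arithmetic, here is a small sketch (illustrative only; the pool size and 256KB buffer size are the defaults cited above, read from a system property of the same name as the context variable for demonstration):

```java
public class BufferCacheEstimate {
    public static void main(String[] args) {
        // Default of the broker.directByteBufferPoolSize context variable.
        int poolSize = Integer.getInteger("broker.directByteBufferPoolSize", 1000);
        // Default network buffer size (256 KB), per this thread.
        int bufferSizeBytes = 256 * 1024;
        long cacheBytes = (long) poolSize * bufferSizeBytes;
        System.out.printf("max cached direct memory: ~%d MiB%n",
                cacheBytes / (1024 * 1024));
    }
}
```

With the defaults this prints ~250 MiB, i.e. the "~256M" figure above in round decimal terms.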

Regarding introducing lower and upper thresholds for 'flow to disk': it
seems like a good idea, and we will try to implement it early this week on
trunk first.
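The high/low watermark behaviour being proposed can be sketched as follows (a minimal illustration with assumed class and method names and example thresholds, not the broker's actual API):

```java
public class FlowToDiskHysteresis {
    private final double highThreshold; // e.g. 0.80 of direct memory
    private final double lowThreshold;  // e.g. 0.60 of direct memory
    private boolean flowToDiskActive;

    public FlowToDiskHysteresis(double high, double low) {
        this.highThreshold = high;
        this.lowThreshold = low;
    }

    /** Returns whether flow to disk should be active for the given utilization (0..1). */
    public boolean update(double utilization) {
        if (!flowToDiskActive && utilization >= highThreshold) {
            flowToDiskActive = true;   // latch on at the high watermark
        } else if (flowToDiskActive && utilization <= lowThreshold) {
            flowToDiskActive = false;  // release only once usage falls to the low watermark
        }
        return flowToDiskActive;
    }
}
```

The gap between the two watermarks is what stops the rapid activate/deactivate oscillation described in the report, since utilization hovering near a single threshold no longer toggles the state.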

Kind Regards,
Alex


On 5 May 2017 at 23:49, Ramayan Tiwari <[hidden email]> wrote:

> Hi Alex,
>
> Thanks for providing the patch. I verified the fix with the same perf test,
> and it does prevent the broker from going OOM; however, DM utilization
> doesn't get any better after hitting the threshold (where flow to disk is
> activated based on total used % across the broker - graph in the link below).
>
> After hitting the final threshold, flow to disk activates and deactivates
> quite frequently across all the queues. The reason seems to be that there
> is currently only one threshold to trigger flow to disk. Would it make
> sense to break this into high and low thresholds, so that once flow to
> disk is activated by the high threshold, it stays active until the queue
> utilization (or broker DM allocation) drops to the low threshold?
>
> Graph and flow to disk logs are here:
> https://docs.google.com/document/d/1Wc1e-id-WlpI7FGU1Lx8XcKaV8sauRp82T5XZV
> U-RiM/edit#heading=h.6400pltvjhy7
>
> Thanks
> Ramayan
>
> On Thu, May 4, 2017 at 2:44 AM, Oleksandr Rudyy <[hidden email]> wrote:
>
> > Hi Ramayan,
> >
> > We attached to QPID-7753 a patch with a workaround for the 6.0.x branch.
> > It triggers flow to disk based on direct memory consumption rather than
> > estimation of the space occupied by the message content. The flow to disk
> > should evacuate message content preventing running out of direct memory.
> We
> > already committed the changes into 6.0.x and 6.1.x branches. It will be
> > included into upcoming 6.0.7 and 6.1.3 releases.
> >
> > Please try and test the patch in your environment.
> >
> > We are still working on finishing the fix for trunk.
> >
> > Kind Regards,
> > Alex
> >
> > On 30 April 2017 at 15:45, Lorenz Quack <[hidden email]> wrote:
> >
> > > Hi Ramayan,
> > >
> > > The high-level plan is currently as follows:
> > >  1) Periodically try to compact sparse direct memory buffers.
> > >  2) Increase accuracy of messages' direct memory usage estimation to
> more
> > > reliably trigger flow to disk.
> > >  3) Add an additional flow to disk trigger based on the amount of
> > allocated
> > > direct memory.
> > >
> > > A little bit more details:
> > >  1) We plan on periodically checking the amount of direct memory usage
> > and
> > > if it is above a
> > >     threshold (50%) we compare the sum of all queue sizes with the
> amount
> > > of allocated direct memory.
> > >     If the ratio falls below a certain threshold we trigger a compaction
> > >     task which goes through all queues and copies a certain number of
> > >     old message buffers into new ones, thereby freeing the old buffers
> > >     so that they can be returned to the buffer pool and be reused.
> > >
> > >  2) Currently we trigger flow to disk based on an estimate of how much
> > > memory the messages on the
> > >     queues consume. We had to use estimates because we did not have
> > > accurate size numbers for
> > >     message headers. By having accurate size information for message
> > > headers we can more reliably
> > >     enforce queue memory limits.
> > >
> > >  3) The flow to disk trigger based on message size had another problem
> > > which is more pertinent to the
> > >     current issue. We only considered the size of the messages and not
> > how
> > > much memory we allocate
> > >     to store those messages. In the FIFO use case those numbers will be
> > > very close to each other but in
> > >     use cases like yours we can end up with sparse buffers and the
> > numbers
> > > will diverge. Because of this
> > >     divergence we do not trigger flow to disk in time and the broker
> can
> > go
> > > OOM.
> > >     To fix the issue we want to add an additional flow to disk trigger
> > > based on the amount of allocated direct
> > >     memory. This should prevent the broker from going OOM even if the
> > > compaction strategy outlined above
> > >     should fail for some reason (e.g., the compaction task cannot keep
> up
> > > with the arrival of new messages).
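The allocation-versus-occupancy check described in point 1 of the plan above might look roughly like this (threshold values, class and method names are assumptions for illustration, not the actual patch):

```java
public class CompactionCheck {
    static final double DIRECT_USAGE_THRESHOLD = 0.50; // start checking above 50% usage
    static final double OCCUPANCY_THRESHOLD = 0.50;    // compact when buffers are half sparse

    /**
     * @param usedDirect      direct memory currently allocated by the broker
     * @param maxDirect       the -XX:MaxDirectMemorySize limit
     * @param sumOfQueueSizes total bytes of live message content on all queues
     * @return true if a compaction task should be scheduled
     */
    static boolean shouldCompact(long usedDirect, long maxDirect, long sumOfQueueSizes) {
        if ((double) usedDirect / maxDirect < DIRECT_USAGE_THRESHOLD) {
            return false;                        // plenty of headroom, no need to compact
        }
        double occupancy = (double) sumOfQueueSizes / usedDirect;
        return occupancy < OCCUPANCY_THRESHOLD;  // buffers are sparsely occupied
    }
}
```

The key idea is that a low ratio of live message bytes to allocated direct memory is exactly the sparse-buffer condition this thread describes, so it is a reasonable proxy for "compaction would free memory".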
> > >
> > > Currently, there are patches for the above points but they suffer from
> > some
> > > thread-safety issues that need to be addressed.
> > >
> > > I hope this description helps. Any feedback is, as always, welcome.
> > >
> > > Kind regards,
> > > Lorenz
> > >
> > >
> > >
> > > On Sat, Apr 29, 2017 at 12:00 AM, Ramayan Tiwari <
> > [hidden email]
> > > >
> > > wrote:
> > >
> > > > Hi Lorenz,
> > > >
> > > > Thanks so much for the patch. We have a perf test now to reproduce
> this
> > > > issue, so we did test with 256KB, 64KB and 4KB network byte buffer.
> > None
> > > of
> > > > these configurations help with the issue (or give any more breathing
> > > room)
> > > > for our use case. We would like to share the perf analysis with the
> > > > community:
> > > >
> > > > https://docs.google.com/document/d/1Wc1e-id-
> > > WlpI7FGU1Lx8XcKaV8sauRp82T5XZV
> > > > U-RiM/edit?usp=sharing
> > > >
> > > > Feel free to comment on the doc if certain details are incorrect or
> if
> > > > there are questions.
> > > >
> > > > Since the short term solution doesn't help us, we are very interested
> > in
> > > > getting some details on how the community plans to address this, a
> high
> > > > level description of the approach will be very helpful for us in
> order
> > to
> > > > brainstorm our use cases along with this solution.
> > > >
> > > > - Ramayan
> > > >
> > > > On Fri, Apr 28, 2017 at 9:34 AM, Lorenz Quack <
> [hidden email]>
> > > > wrote:
> > > >
> > > > > Hello Ramayan,
> > > > >
> > > > > We are still working on a fix for this issue.
> > > > > In the mean time we had an idea to potentially workaround the issue
> > > until
> > > > > a proper fix is released.
> > > > >
> > > > > The idea is to decrease the qpid network buffer size the broker
> uses.
> > > > > While this still allows for sparsely populated buffers it would
> > improve
> > > > > the overall occupancy ratio.
> > > > >
> > > > > Here are the steps to follow:
> > > > >  * ensure you are not using TLS
> > > > >  * apply the attached patch
> > > > >  * figure out the size of the largest messages you are sending
> > > (including
> > > > > header and some overhead)
> > > > >  * set the context variable "qpid.broker.networkBufferSize" to
> that
> > > > value
> > > > > but not smaller than 4096
> > > > >  * test
> > > > >
> > > > > Decreasing the qpid network buffer size automatically limits the
> > > maximum
> > > > > AMQP frame size.
> > > > > Since you are using a very old client we are not sure how well it
> > copes
> > > > > with small frame sizes where it has to split a message across
> > multiple
> > > > > frames.
> > > > > Therefore, to play it safe you should not set it smaller than the
> > > largest
> > > > > messages (+ header + overhead) you are sending.
> > > > > I do not know what message sizes you are sending but AMQP imposes
> the
> > > > > restriction that the framesize cannot be smaller than 4096 bytes.
> > > > > In the qpid broker the default currently is 256 kB.
> > > > >
> > > > > In the current state the broker does not allow setting the network
> > > buffer
> > > > > to values smaller than 64 kB to allow TLS frames to fit into one
> > > network
> > > > > buffer.
> > > > > I attached a patch to this mail that lowers that restriction to the
> > > limit
> > > > > imposed by AMQP (4096 Bytes).
> > > > > Obviously, you should not use this when using TLS.
> > > > >
> > > > >
> > > > > I hope this reduces the problems you are currently facing until we
> > can
> > > > > complete the proper fix.
> > > > >
> > > > > Kind regards,
> > > > > Lorenz
> > > > >
> > > > >
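The sizing rule in the steps above (no smaller than the largest message plus header and overhead, and never below the AMQP minimum of 4096 bytes) can be sketched as follows (class and method names are illustrative; the overhead figure is the caller's own estimate):

```java
public class NetworkBufferSizeCheck {
    // Minimum frame size imposed by AMQP, as noted above.
    static final int AMQP_MIN_FRAME_SIZE = 4096;

    /** Smallest safe value for qpid.broker.networkBufferSize under the advice above. */
    static int chooseBufferSize(int largestMessageBytes, int headerAndOverheadBytes) {
        int needed = largestMessageBytes + headerAndOverheadBytes;
        return Math.max(needed, AMQP_MIN_FRAME_SIZE);
    }

    public static void main(String[] args) {
        // e.g. ~400-byte payloads with an assumed 1 KB per-message overhead:
        // the result stays at the 4096-byte AMQP floor.
        System.out.println(chooseBufferSize(400, 1024));
    }
}
```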
> > > > > On Fri, 2017-04-21 at 09:17 -0700, Ramayan Tiwari wrote:
> > > > > > Thanks so much Keith and the team for finding the root cause. We
> > > > > > are so relieved that the root cause will be fixed shortly.
> > > > > >
> > > > > > Couple of things that I forgot to mention on the mitigation steps
> > we
> > > > took
> > > > > > in the last incident:
> > > > > > 1) We triggered GC from JMX bean multiple times, it did not help
> in
> > > > > > reducing DM allocated.
> > > > > > 2) We also killed all the AMQP connections to the broker when DM
> > was
> > > at
> > > > > > 80%. This did not help either. The way we killed connections -
> > using
> > > > JMX
> > > > > > got list of all the open AMQP connections and called close from
> JMX
> > > > > mbean.
> > > > > >
> > > > > > I am hoping the above two are not related to root cause, but
> wanted
> > > to
> > > > > > bring it up in case this is relevant.
> > > > > >
> > > > > > Thanks
> > > > > > Ramayan
> > > > > >
> > > > > > On Fri, Apr 21, 2017 at 8:29 AM, Keith W <[hidden email]>
> > > wrote:
> > > > > >
> > > > > > >
> > > > > > > Hello Ramayan
> > > > > > >
> > > > > > > I believe I understand the root cause of the problem.  We have
> > > > > > > identified a flaw in the direct memory buffer management
> employed
> > > by
> > > > > > > Qpid Broker-J which for some messaging use-cases can lead to
> > > > > > > the direct memory OOM you describe.  For the issue to manifest,
> > > > > > > the producing
> > > > > > > application needs to use a single connection for the production
> > of
> > > > > > > messages some of which are short-lived (i.e. are consumed
> > quickly)
> > > > > > > whilst others remain on the queue for some time.  Priority
> > queues,
> > > > > > > sorted queues and consumers utilising selectors that result in
> > some
> > > > > > > messages being left on the queue could all produce this pattern.
> > > > > > > The pattern leads to sparsely occupied 256K net buffers which
> > > > > > > cannot be released or reused until every message that references
> > > > > > > a 'chunk' of them is either consumed or flown to disk.  The
> > > > > > > problem was introduced with Qpid v6.0 and exists in v6.1 and
> > > > > > > trunk too.
> > > > > > >
> > > > > > > The flow to disk feature is not helping us here because its
> > > > > > > algorithm considers only the size of live messages on the queues.
> > > > > > > If the cumulative live size does not exceed the threshold, the
> > > > > > > messages aren't flown to disk. I speculate that when you observed
> > > > > > > moving messages cause direct memory usage to drop earlier today,
> > > > > > > your message movement caused a queue to go over threshold,
> > > > > > > causing messages to be flown to disk and their direct memory
> > > > > > > references to be released. The logs will confirm this.
> > > > > > >
> > > > > > > I have not identified an easy workaround at the moment.
> > >  Decreasing
> > > > > > > the flow to disk threshold and/or increasing available direct
> > > memory
> > > > > > > should alleviate and may be an acceptable short term
> workaround.
> > > If
> > > > > > > it were possible for publishing application to publish short
> > lived
> > > > and
> > > > > > > long lived messages on two separate JMS connections this would
> > > avoid
> > > > > > > this defect.
> > > > > > >
> > > > > > > QPID-7753 tracks this issue and QPID-7754 is related to this
> > > > > > > problem.
> > > > > > > We intend to be working on these early next week and will be
> > aiming
> > > > > > > for a fix that is back-portable to 6.0.
> > > > > > >
> > > > > > > Apologies that you have run into this defect and thanks for
> > > > reporting.
> > > > > > >
> > > > > > > Thanks, Keith
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 21 April 2017 at 10:21, Ramayan Tiwari <
> > > [hidden email]>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > We have been monitoring the brokers everyday and today we
> found
> > > one
> > > > > > > instance
> > > > > > > >
> > > > > > > > where broker’s DM was constantly going up and was about to
> > crash,
> > > > so
> > > > > we
> > > > > > > > experimented some mitigations, one of which caused the DM to
> > come
> > > > > down.
> > > > > > > > Following are the details, which might help us understanding
> > the
> > > > > issue:
> > > > > > > >
> > > > > > > > Traffic scenario:
> > > > > > > >
> > > > > > > > DM allocation had been constantly going up and was at 90%.
> > There
> > > > > were two
> > > > > > > > queues which seemed to align with the theories that we had.
> > Q1’s
> > > > > size had
> > > > > > > > been large right after the broker start and had slow
> > consumption
> > > of
> > > > > > > > messages, queue size only reduced from 76MB to 75MB over a
> > period
> > > > of
> > > > > > > 6hrs.
> > > > > > > >
> > > > > > > > Q2 on the other hand, started small and was gradually
> growing,
> > > > queue
> > > > > size
> > > > > > > > went from 7MB to 10MB in 6hrs. There were other queues with
> > > traffic
> > > > > > > during
> > > > > > > >
> > > > > > > > this time.
> > > > > > > >
> > > > > > > > Action taken:
> > > > > > > >
> > > > > > > > Moved all the messages from Q2 (since this was our original
> > > theory)
> > > > > to Q3
> > > > > > > > (already created but no messages in it). This did not help
> with
> > > the
> > > > > DM
> > > > > > > > growing up.
> > > > > > > > Moved all the messages from Q1 to Q4 (already created but no
> > > > > messages in
> > > > > > > > it). This reduced DM allocation from 93% to 31%.
> > > > > > > >
> > > > > > > > We have the heap dump and thread dump from when broker was
> 90%
> > in
> > > > DM
> > > > > > > > allocation. We are going to analyze that to see if we can get
> > > some
> > > > > clue.
> > > > > > > We
> > > > > > > >
> > > > > > > > wanted to share this new information which might help in
> > > reasoning
> > > > > about
> > > > > > > the
> > > > > > > >
> > > > > > > > memory issue.
> > > > > > > >
> > > > > > > > - Ramayan
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Apr 20, 2017 at 11:20 AM, Ramayan Tiwari <
> > > > > > > [hidden email]>
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi Keith,
> > > > > > > > >
> > > > > > > > > Thanks so much for your response and digging into the
> issue.
> > > > Below
> > > > > are
> > > > > > > the
> > > > > > > >
> > > > > > > > >
> > > > > > > > > answer to your questions:
> > > > > > > > >
> > > > > > > > > 1) Yeah we are using QPID-7462 with 6.0.5. We couldn't use
> > 6.1
> > > > > where it
> > > > > > > > > was released because we need JMX support. Here is the
> > > destination
> > > > > > > format:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > ""%s ; {node : { type : queue }, link : { x-subscribes : {
> > > > > arguments : {
> > > > > > > > > x-multiqueue : [%s], x-pull-only : true }}}}";"
> > > > > > > > >
> > > > > > > > > 2) Our machines have 40 cores, which makes the number of
> > > > > > > > > threads 80. This might not be an issue, because this will
> > > > > > > > > show up in the baseline DM allocated, which is only 6% (of
> > > > > > > > > 4GB) when we just bring up the broker.
> > > > > > > > >
> > > > > > > > > 3) The only setting that we tuned WRT to DM is
> > > > flowToDiskThreshold,
> > > > > > > which
> > > > > > > >
> > > > > > > > >
> > > > > > > > > is set at 80% now.
> > > > > > > > >
> > > > > > > > > 4) Only one virtual host in the broker.
> > > > > > > > >
> > > > > > > > > 5) Most of our queues (99%) are priority, we also have 8-10
> > > > sorted
> > > > > > > queues.
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 6) Yeah we are using the standard 0.16 client and not AMQP
> > 1.0
> > > > > clients.
> > > > > > > > > The connection log line looks like:
> > > > > > > > > CON-1001 : Open : Destination : AMQP(IP:5672) : Protocol
> > > Version
> > > > :
> > > > > 0-10
> > > > > > > :
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Client ID : test : Client Version : 0.16 : Client Product :
> > > qpid
> > > > > > > > >
> > > > > > > > > We had another broker crashed about an hour back, we do see
> > the
> > > > > same
> > > > > > > > > patterns:
> > > > > > > > > 1) There is a queue which is constantly growing, enqueue is
> > > > faster
> > > > > than
> > > > > > > > > dequeue on that queue for a long period of time.
> > > > > > > > > 2) Flow to disk didn't kick in at all.
> > > > > > > > >
> > > > > > > > > This graph shows memory growth (red line - heap, blue - DM
> > > > > allocated,
> > > > > > > > > yellow - DM used)
> > > > > > > > >
> > > > > > > > > https://drive.google.com/file/d/
> > 0Bwi0MEV3srPRdVhXdTBncHJLY2c/
> > > > > > > view?usp=sharing
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The below graph shows growth on a single queue (there are
> > 10-12
> > > > > other
> > > > > > > > > queues with traffic as well, something large size than this
> > > > queue):
> > > > > > > > >
> > > > > > > > > https://drive.google.com/file/d/
> > 0Bwi0MEV3srPRWmNGbDNGUkJhQ0U/
> > > > > > > view?usp=sharing
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Couple of questions:
> > > > > > > > > 1) Is there any developer level doc/design spec on how Qpid
> > > uses
> > > > > DM?
> > > > > > > > > 2) We are not getting heap dumps automatically when broker
> > > > crashes
> > > > > due
> > > > > > > to
> > > > > > > >
> > > > > > > > >
> > > > > > > > > DM (HeapDumpOnOutOfMemoryError not respected). Has anyone
> > > found a
> > > > > way
> > > > > > > to get
> > > > > > > >
> > > > > > > > >
> > > > > > > > > around this problem?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Ramayan
> > > > > > > > >
> > > > > > > > > On Thu, Apr 20, 2017 at 9:08 AM, Keith W <
> > [hidden email]
> > > >
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi Ramayan
> > > > > > > > > >
> > > > > > > > > > We have been discussing your problem here and have a
> couple
> > > of
> > > > > > > questions.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I have been experimenting with use-cases based on your
> > > > > descriptions
> > > > > > > > > > above, but so far, have been unsuccessful in reproducing
> a
> > > > > > > > > > "java.lang.OutOfMemoryError: Direct buffer memory"
> > > condition.
> > > > > The
> > > > > > > > > > direct memory usage reflects the expected model: it
> levels
> > > off
> > > > > when
> > > > > > > > > > the flow to disk threshold is reached and direct memory
> is
> > > > > release as
> > > > > > > > > > messages are consumed until the minimum size for caching
> of
> > > > > direct is
> > > > > > > > > > reached.
> > > > > > > > > >
> > > > > > > > > > 1] For clarity let me check: we believe when you say
> "patch
> > > to
> > > > > use
> > > > > > > > > > MultiQueueConsumer" you are referring to the patch
> attached
> > > to
> > > > > > > > > > QPID-7462 "Add experimental "pull" consumers to the
> broker"
> > > > and
> > > > > you
> > > > > > > > > > are using a combination of this "x-pull-only"  with the
> > > > standard
> > > > > > > > > > "x-multiqueue" feature.  Is this correct?
> > > > > > > > > >
> > > > > > > > > > 2] One idea we had here relates to the size of the
> > > virtualhost
> > > > IO
> > > > > > > > > > pool.   As you know from the documentation, the Broker
> > > > > caches/reuses
> > > > > > > > > > direct memory internally, but the documentation fails to
> > > > > > > > > > mention that
> > > > > > > > > > each pooled virtualhost IO thread also grabs a chunk
> (256K)
> > > of
> > > > > direct
> > > > > > > > > > memory from this cache.  By default the virtual host IO
> > pool
> > > is
> > > > > sized
> > > > > > > > > > Math.max(Runtime.getRuntime().availableProcessors() * 2,
> > > 64),
> > > > > so if
> > > > > > > > > > you have a machine with a very large number of cores, you
> > may
> > > > > have a
> > > > > > > > > > surprising large amount of direct memory assigned to
> > > > virtualhost
> > > > > IO
> > > > > > > > > > threads.   Check the value of connectionThreadPoolSize on
> > the
> > > > > > > > > > virtualhost
> > > > > > > > > > (http://<server>:<port>/api/latest/virtualhost/<
> > > > > virtualhostnodename>/<;
> > > > > > > virtualhostname>)
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > to see what value is in force.  What is it?  It is
> possible
> > > to
> > > > > tune
> > > > > > > > > > the pool size using context variable
> > > > > > > > > > virtualhost.connectionThreadPool.size.
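To see how much direct memory the IO pool alone can pin, here is a sketch of the default sizing Keith quotes above (the 256K chunk per thread is the figure from this message; class and method names are illustrative):

```java
public class IoPoolDirectMemory {
    /** Default virtualhost IO pool size: Math.max(availableProcessors() * 2, 64). */
    static int defaultPoolSize(int availableProcessors) {
        return Math.max(availableProcessors * 2, 64);
    }

    /** Direct memory grabbed by the pool, at one 256 KB chunk per IO thread. */
    static long pooledDirectBytes(int availableProcessors) {
        return (long) defaultPoolSize(availableProcessors) * 256 * 1024;
    }

    public static void main(String[] args) {
        // A 40-core machine, as reported elsewhere in this thread:
        // 80 IO threads, pinning 20 MiB of direct memory.
        System.out.println(pooledDirectBytes(40) / (1024 * 1024));
    }
}
```

On the 40-core hosts described in this thread that comes to about 20 MiB, which is small relative to 4GB of direct memory and consistent with the modest baseline usage reported.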
> > > > > > > > > >
> > > > > > > > > > 3] Tell me if you are tuning the Broker in way beyond the
> > > > > direct/heap
> > > > > > > > > > memory settings you have told us about already.  For
> > instance
> > > > > you are
> > > > > > > > > > changing any of the direct memory pooling settings
> > > > > > > > > > broker.directByteBufferPoolSize, default network buffer
> > size
> > > > > > > > > > qpid.broker.networkBufferSize or applying any other
> > > > non-standard
> > > > > > > > > > settings?
> > > > > > > > > >
> > > > > > > > > > 4] How many virtual hosts do you have on the Broker?
> > > > > > > > > >
> > > > > > > > > > 5] What is the consumption pattern of the messages?  Do
> > > consume
> > > > > in a
> > > > > > > > > > strictly FIFO fashion or are you making use of message
> > > > selectors
> > > > > > > > > > or/and any of the out-of-order queue types (LVQs,
> priority
> > > > queue
> > > > > or
> > > > > > > > > > sorted queues)?
> > > > > > > > > >
> > > > > > > > > > 6] Is it just the 0.16 client involved in the
> application?
> > > >  Can
> > > > > I
> > > > > > > > > > check that you are not using any of the AMQP 1.0 clients
> > > > > > > > > > (org,apache.qpid:qpid-jms-client or
> > > > > > > > > > org.apache.qpid:qpid-amqp-1-0-client) in the software
> > stack
> > > > (as
> > > > > either
> > > > > > > > > > consumers or producers)
> > > > > > > > > >
> > > > > > > > > > Hopefully the answers to these questions will get us
> closer
> > > to
> > > > a
> > > > > > > > > > reproduction.   If you are able to reliable reproduce it,
> > > > please
> > > > > share
> > > > > > > > > > the steps with us.
> > > > > > > > > >
> > > > > > > > > > Kind regards, Keith.
> > > > > > > > > >
> > > > > > > > > >
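The pool-size formula quoted above translates into a concrete direct-memory floor for the IO pool. A minimal sketch of that arithmetic (the 256K-per-thread figure is taken from this thread; the comment's 64-core example is illustrative):

```java
public class IoPoolFootprint {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();

        // Default virtualhost IO pool size, per the formula above.
        int poolSize = Math.max(cores * 2, 64);

        // Each pooled IO thread grabs a 256 KB chunk of direct memory.
        long directKb = poolSize * 256L;

        System.out.printf("pool=%d threads, direct memory >= %d KB%n",
                poolSize, directKb);

        // Example: on a 64-core box, max(128, 64) = 128 threads * 256 KB = 32 MB.
    }
}
```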
On 20 April 2017 at 10:21, Ramayan Tiwari <[hidden email]> wrote:

After a lot of log mining, we might have a way to explain the sustained
increase in DirectMemory allocation: the correlation seems to be with the
growth in the size of a queue that is getting consumed, but at a much
slower rate than producers are putting messages on it.

The pattern we see is that in each instance of broker crash, there is at
least one queue (usually 1 queue) whose size kept growing steadily. It'd
be of significant size but not the largest queue -- usually there are
multiple larger queues -- but it was different from other queues in that
its size was growing steadily. The queue would also be moving, but its
processing rate was not keeping up with the enqueue rate.

Our theory, which might be totally wrong: if a queue is moving the entire
time, maybe the broker keeps reusing the same buffer in direct memory for
the queue, appending onto the end of it to accommodate new messages. But
because the queue is active all the time and we're pointing to the same
buffer, space allocated for messages at the head of the queue/buffer
doesn't get reclaimed, even long after those messages have been
processed. Just a theory.

We are also trying to reproduce this using some perf tests that enqueue
with the same pattern; we will update with the findings.

Thanks
Ramayan
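One general ByteBuffer property that could produce retention like the theory describes is slicing: a small buffer sliced out of a pooled chunk shares the chunk's backing storage, so the whole chunk stays reachable while any slice is alive. This is an illustrative sketch of that JDK behaviour only, not a claim about the broker's actual buffer management:

```java
import java.nio.ByteBuffer;

public class SliceRetention {
    public static void main(String[] args) {
        // One pooled 256 KB chunk, sized like the broker's network chunks.
        ByteBuffer chunk = ByteBuffer.allocateDirect(256 * 1024);

        // A small 400-byte "message" sliced from the end of the chunk.
        chunk.position(chunk.capacity() - 400);
        ByteBuffer message = chunk.slice();

        // The slice shares the chunk's backing storage: an absolute write
        // through the parent is visible through the slice. While any slice
        // is alive, the entire 256 KB chunk cannot be garbage collected.
        chunk.put(chunk.capacity() - 400, (byte) 42);
        System.out.println(message.get(0)); // prints 42
    }
}
```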
On Wed, Apr 19, 2017 at 6:52 PM, Ramayan Tiwari <[hidden email]> wrote:

Another issue that we noticed: when the broker goes OOM due to direct
memory, it doesn't create a heap dump (requested via
"-XX:+HeapDumpOnOutOfMemoryError"), even though the OOM error is the same
"java.lang.OutOfMemoryError" mentioned in the Oracle JVM docs.

Has anyone been able to find a way to get a heap dump for a DM OOM?

- Ramayan
On Wed, Apr 19, 2017 at 11:21 AM, Ramayan Tiwari <[hidden email]> wrote:

Alex,

Below are the flow to disk logs from the broker holding 3 million+
messages at this time. We only have one virtual host. Time is in GMT.
Looks like flow to disk is active on the whole virtual host and not at a
queue level.

When the same broker went OOM yesterday, I did not see any flow to disk
logs from when it was started until it crashed (it crashed twice within
4hrs).

4/19/17 4:17:43.509 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3356539KB exceeds threshold 3355443KB
4/19/17 2:31:13.502 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3354866KB within threshold 3355443KB
4/19/17 2:28:43.511 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3358509KB exceeds threshold 3355443KB
4/19/17 2:20:13.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353501KB within threshold 3355443KB
4/19/17 2:18:13.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3357544KB exceeds threshold 3355443KB
4/19/17 2:08:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353236KB within threshold 3355443KB
4/19/17 2:08:13.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3356704KB exceeds threshold 3355443KB
4/19/17 2:00:43.500 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353511KB within threshold 3355443KB
4/19/17 2:00:13.504 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3357948KB exceeds threshold 3355443KB
4/19/17 1:50:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355310KB within threshold 3355443KB
4/19/17 1:47:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3365624KB exceeds threshold 3355443KB
4/19/17 1:43:43.501 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355136KB within threshold 3355443KB
4/19/17 1:31:43.509 AM INFO  [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active :  Message memory use 3358683KB exceeds threshold 3355443KB

Since the production release (2 days back), we have seen 4 crashes across
3 different brokers; this is the most pressing concern for us in deciding
whether we should roll back to 0.32. Any help is greatly appreciated.

Thanks
Ramayan
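The 3355443KB threshold in these logs works out to exactly 80% of the 4g direct-memory limit from the reported setup. A quick check of that arithmetic (the 80%-of-max-direct-memory interpretation is inferred from the numbers, not stated in this thread):

```java
public class FlowToDiskThreshold {
    public static void main(String[] args) {
        long maxDirectBytes = 4L * 1024 * 1024 * 1024; // max direct memory 4g

        // Hypothesis: flow to disk activates at 80% of max direct memory.
        long thresholdKb = (long) (maxDirectBytes * 0.8) / 1024;

        System.out.println(thresholdKb + "KB"); // 3355443KB, matching BRK-1014
    }
}
```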
On Wed, Apr 19, 2017 at 9:36 AM, Oleksandr Rudyy <[hidden email]> wrote:

Ramayan,
Thanks for the details. I would like to clarify whether flow to disk was
triggered today for the 3 million messages.

The following logs are issued for flow to disk:
BRK-1014 : Message flow to disk active :  Message memory use {0,number,#}KB exceeds threshold {1,number,#.##}KB
BRK-1015 : Message flow to disk inactive : Message memory use {0,number,#}KB within threshold {1,number,#.##}KB

Kind Regards,
Alex
On 19 April 2017 at 17:10, Ramayan Tiwari <[hidden email]> wrote:

Hi Alex,

Thanks for your response; here are the details:

We use a "direct" exchange, without persistence (we specify NON_PERSISTENT
while sending from the client), and use the BDB store. We use the JSON
virtual host type. We are not using SSL.

When the broker went OOM, we had around 1.3 million messages with a
100-byte average message size. Direct memory allocation (value read from
the MBean) kept going up, even though it wouldn't need more DM to store
that many messages. DM allocated persisted at 99% for about 3 and a half
hours before crashing.

Today, on the same broker, we have 3 million messages (same message size)
and DM allocated is only at 8%. This seems like there is some issue with
de-allocation, or a leak.

I have uploaded the memory utilization graph here:
https://drive.google.com/file/d/0Bwi0MEV3srPRVHFEbDlIYUpLaUE/view?usp=sharing
Blue line is DM allocated, yellow is DM used (sum of queue payload) and
red is heap usage.

Thanks
Ramayan
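The direct-memory counters read from the MBean here can also be sampled in-process via the standard platform MXBeans; a minimal sketch (this reports the JVM-wide "direct" buffer pool, not the broker's internal QpidByteBuffer pool):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectMemoryProbe {
    public static void main(String[] args) {
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                // Bytes currently held by direct ByteBuffers in this JVM.
                System.out.printf("direct buffers=%d used=%dKB capacity=%dKB%n",
                        pool.getCount(),
                        pool.getMemoryUsed() / 1024,
                        pool.getTotalCapacity() / 1024);
            }
        }
    }
}
```

Polling this periodically and logging it alongside queue depths gives the same allocated-vs-used comparison as the graph linked above.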
On Wed, Apr 19, 2017 at 4:10 AM, Oleksandr Rudyy <[hidden email]> wrote:

Hi Ramayan,

Could you please share with us the details of the messaging use case(s)
which ended up in OOM on the broker side? I would like to reproduce the
issue on my local broker in order to fix it.

I would appreciate it if you could provide as many details as possible,
including messaging topology, message persistence type, message sizes,
volumes, etc.

Qpid Broker 6.0.x uses direct memory for keeping message content and for
receiving/sending data. Each plain connection utilizes 512K of direct
memory. Each SSL connection uses 1M of direct memory. Your memory
settings look OK to me.

Kind Regards,
Alex
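Alex's per-connection figure lines up with the baseline reported elsewhere in the thread. A rough check (410 connections and the ~230mb baseline are Ramayan's numbers; 512K per plain connection is Alex's; attributing the remainder to the default 64-thread IO pool is an assumption):

```java
public class ConnectionFootprint {
    public static void main(String[] args) {
        int plainConnections = 410;        // from the reported setup
        long perConnection = 512L * 1024;  // 512K per plain connection

        long connectionMb = plainConnections * perConnection / (1024 * 1024);
        System.out.println("connections: ~" + connectionMb + " MB"); // ~205 MB

        // The default 64 IO threads * 256 KB add roughly 16 MB more, which
        // together lands near the ~230 MB no-message baseline observed.
        long ioPoolMb = 64L * 256 * 1024 / (1024 * 1024);
        System.out.println("io pool: ~" + ioPoolMb + " MB");         // ~16 MB
    }
}
```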
On 18 April 2017 at 23:39, Ramayan Tiwari <[hidden email]> wrote:

Hi All,

We are using Java broker 6.0.5, with a patch to use the MultiQueueConsumer
feature. We just finished deploying to production and saw a couple of
instances of broker OOM due to running out of DirectMemory buffer
(exceptions at the end of this email).

Here is our setup:
1. Max heap 12g, max direct memory 4g (this is the opposite of the
recommendation; however, for our use case the message payload is really
small, ~400 bytes, and is way less than the per-message overhead of 1KB).
In perf testing, we were able to put 2 million messages without any
issues.
2. ~400 connections to the broker.
3. Each connection has 20 sessions and there is one multi-queue consumer
attached to each session, listening to around 1000 queues.
4. We are still using the 0.16 client (I know).

With the above setup, the baseline utilization (without any messages) for
direct memory was around 230mb (with 410 connections each taking 500KB).

Based on our understanding of broker memory allocation, message payload
should be the only thing adding to direct memory utilization (on top of
the baseline); however, we are experiencing something completely
different. In our last broker crash, we see that the broker was
constantly running with 90%+ direct memory allocated, even when the
message payload sum across all the queues was only 6-8% (these
percentages are against the available DM of 4gb). During these
high-DM-usage periods, heap usage was around 60% (of 12gb).

We would like some help in understanding what could be the reason for
these high DM allocations. Are there things other than message payload
and AMQP connections which use DM and could be contributing to this high
usage?

Another thing that puzzles us is the de-allocation of DM byte buffers.
From log mining of heap and DM utilization, de-allocation of DM doesn't
correlate with heap GC. If anyone has seen any documentation related to
this, it would be very helpful if you could share it.

Thanks
Ramayan
*Exceptions*

java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.restoreApplicationBufferForWrite(NonBlockingConnectionPlainDelegate.java:93) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.processData(NonBlockingConnectionPlainDelegate.java:60) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.NonBlockingConnection.doRead(NonBlockingConnection.java:506) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.NonBlockingConnection.doWork(NonBlockingConnection.java:285) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.NetworkConnectionScheduler.processConnection(NetworkConnectionScheduler.java:124) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.processConnection(SelectorThread.java:504) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.SelectorThread$SelectionTask.performSelect(SelectorThread.java:337) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(SelectorThread.java:87) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
at java.lang.Thread.run(Thread.java:745)
> > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > *Second exception*
> > > > > > > > > > > > > > > > > java.lang.OutOfMemoryError: Direct buffer
> > > memory
> > > > > > > > > > > > > > > > > at java.nio.Bits.reserveMemory(
> > Bits.java:658)
> > > > > > > ~[na:1.8.0_40]
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > at java.nio.DirectByteBuffer.<
> > > > > init>(DirectByteBuffer.java:
> > > > > > > 123)
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > > > > > at java.nio.ByteBuffer.
> > > > allocateDirect(ByteBuffer.
> > > > > java:311)
> > > > > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > > > org.apache.qpid.bytebuffer.
> > > > > QpidByteBuffer.allocateDirect(
> > > > > > > > > > > > > > > > > QpidByteBuffer.java:474)
> > > > > > > > > > > > > > > > > ~[qpid-common-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > > > > > NonBlockingConnectionPlainDele
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > gate.<init>(NonBlockingConnectionPlainDele
> > > > > gate.java:45)
> > > > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > > > NonBlockingConnection.
> > > > > > > > > > > > > > > > > setTransportEncryption(
> > > > NonBlockingConnection.java:
> > > > > 625)
> > > > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > > > > > NonBlockingConnection.<init>(
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > NonBlockingConnection.java:117)
> > > > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > > > org.apache.qpid.server.transport.
> > > > > > > NonBlockingNetworkTransport.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > acceptSocketChannel(
> > > NonBlockingNetworkTransport.
> > > > > java:158)
> > > > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > > > org.apache.qpid.server.
> > > transport.SelectorThread$
> > > > > > > SelectionTas
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > k$1.run(
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > SelectorThread.java:191)
> > > > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > > > org.apache.qpid.server.
> > > > > transport.SelectorThread.run(
> > > > > > > > > > > > > > > > > SelectorThread.java:462)
> > > > > > > > > > > > > > > > > ~[qpid-broker-core-6.0.5.jar:6.0.5]
> > > > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > > > java.util.concurrent.
> > > > ThreadPoolExecutor.runWorker(
> > > > > > > > > > > > > > > > > ThreadPoolExecutor.java:1142)
> > > > > > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > > > > > at
> > > > > > > > > > > > > > > > > java.util.concurrent.
> > > > > ThreadPoolExecutor$Worker.run(
> > > > > > > > > > > > > > > > > ThreadPoolExecutor.java:617)
> > > > > > > > > > > > > > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > > > > > at java.lang.Thread.run(Thread.java:745)
> > > > > ~[na:1.8.0_40]
> > > > > > > > > > > > > > > > >
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]

Re: Java broker OOM due to DirectMemory

Ramayan Tiwari
Hi Alex,

Any update on the fix for this?
QPID-7753 is assigned a fix version of 7.0.0; I am hoping that the fix
will also be backported to 6.0.x.

Thanks
Ramayan

On Mon, May 8, 2017 at 2:14 AM, Oleksandr Rudyy <[hidden email]> wrote:

> Hi Ramayan,
>
> Thanks for testing the patch and providing feedback.
>
> Regarding direct memory utilization, the Qpid Broker caches up to 256MB of
> direct memory internally in QpidByteBuffers. Thus, when testing the Broker
> with only 256MB of direct memory, the entire direct memory could be cached
> and it would look as if direct memory is never released. Potentially, you
> can reduce the number of buffers cached on the broker by changing the
> context variable 'broker.directByteBufferPoolSize'. By default, it is set
> to 1000. With a buffer size of 256KB, that gives ~256MB of cache.
>
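As a rough illustration of the cache arithmetic described above — the pool size (1000) and buffer size (256KB) are the defaults quoted in this thread, and the helper name is mine, not a broker API:

```python
# Sketch of the broker's direct-memory buffer-pool cache arithmetic.
# Assumed defaults from the thread: pool of 1000 buffers, 256 KB each.
def pooled_cache_bytes(pool_size=1000, buffer_size=256 * 1024):
    """Upper bound on direct memory retained by the pooled buffers."""
    return pool_size * buffer_size

default_cache = pooled_cache_bytes()            # ~256 MB with the defaults
reduced_cache = pooled_cache_bytes(pool_size=500)  # halving the pool halves the cache
```

This is why, with only 256MB of direct memory configured, the pool alone can pin essentially all of it.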
> Regarding introducing lower and upper thresholds for 'flow to disk'. It
> seems like a good idea and we will try to implement it early this week on
> trunk first.
>
> Kind Regards,
> Alex
>
>
> On 5 May 2017 at 23:49, Ramayan Tiwari <[hidden email]> wrote:
>
> > Hi Alex,
> >
> > Thanks for providing the patch. I verified the fix with the same perf
> > test, and it does prevent the broker from going OOM; however, DM
> > utilization doesn't get any better after hitting the threshold (where
> > flow to disk is activated based on the total used % across the broker -
> > graph in the link below).
> >
> > After hitting the final threshold, flow to disk activates and deactivates
> > pretty frequently across all the queues. The reason seems to be that
> > there is currently only one threshold to trigger flow to disk. Would it
> > make sense to break this into high and low thresholds - so that once flow
> > to disk is active after hitting the high threshold, it stays active until
> > the queue utilization (or broker DM allocation) reaches the low threshold?
> >
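The two-threshold behaviour suggested above amounts to a simple hysteresis toggle; a minimal sketch, with illustrative watermark values rather than broker defaults:

```python
class FlowToDiskToggle:
    """Hysteresis: activate at the high watermark, deactivate at the low one."""

    def __init__(self, high=0.80, low=0.60):
        self.high, self.low = high, low
        self.active = False

    def update(self, utilisation):
        if not self.active and utilisation >= self.high:
            self.active = True   # start flowing message content to disk
        elif self.active and utilisation <= self.low:
            self.active = False  # enough memory reclaimed; stop
        return self.active
```

With a single threshold the state would oscillate around it as content is evacuated and re-loaded; the gap between the two watermarks dampens that flapping.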
> > Graph and flow to disk logs are here:
> > https://docs.google.com/document/d/1Wc1e-id-WlpI7FGU1Lx8XcKaV8sauRp82T5XZVU-RiM/edit#heading=h.6400pltvjhy7
> >
> > Thanks
> > Ramayan
> >
> > On Thu, May 4, 2017 at 2:44 AM, Oleksandr Rudyy <[hidden email]>
> wrote:
> >
> > > Hi Ramayan,
> > >
> > > We attached a patch to QPID-7753 with a workaround for the 6.0.x branch.
> > > It triggers flow to disk based on actual direct memory consumption
> > > rather than an estimate of the space occupied by message content. Flow
> > > to disk should evacuate message content, preventing the broker from
> > > running out of direct memory. We already committed the changes to the
> > > 6.0.x and 6.1.x branches; they will be included in the upcoming 6.0.7
> > > and 6.1.3 releases.
> > >
> > > Please try and test the patch in your environment.
> > >
> > > We are still working on finishing the fix for trunk.
> > >
> > > Kind Regards,
> > > Alex
> > >
> > > On 30 April 2017 at 15:45, Lorenz Quack <[hidden email]>
> wrote:
> > >
> > > > Hi Ramayan,
> > > >
> > > > The high-level plan is currently as follows:
> > > >  1) Periodically try to compact sparse direct memory buffers.
> > > >  2) Increase the accuracy of messages' direct memory usage estimation
> > > >     to more reliably trigger flow to disk.
> > > >  3) Add an additional flow to disk trigger based on the amount of
> > > >     allocated direct memory.
> > > >
> > > > A little bit more details:
> > > >  1) We plan on periodically checking the amount of direct memory usage
> > > >     and, if it is above a threshold (50%), comparing the sum of all
> > > >     queue sizes with the amount of allocated direct memory. If the
> > > >     ratio falls below a certain threshold, we trigger a compaction
> > > >     task which goes through all queues and copies a certain amount of
> > > >     old message buffers into new ones, thereby freeing the old buffers
> > > >     so that they can be returned to the buffer pool and reused.
> > > >
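The compaction trigger described in step 1 can be sketched as a pure predicate; the function name and both threshold values are assumptions for illustration, not the broker's actual parameters:

```python
def should_compact(used_direct, total_direct, queue_bytes,
                   usage_threshold=0.50, sparsity_threshold=0.75):
    """Trigger compaction when direct memory is busy but sparsely used.

    used_direct / total_direct above usage_threshold means the pool is
    under pressure; queue_bytes / used_direct below sparsity_threshold
    means the allocated buffers hold mostly dead space worth compacting.
    """
    if used_direct == 0 or used_direct / total_direct < usage_threshold:
        return False
    return queue_bytes / used_direct < sparsity_threshold
```

Compacting only when the queue-content/allocation ratio is low avoids paying the copy cost when the buffers are already densely packed.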
> > > >  2) Currently we trigger flow