0.14 cluster never survives more than an hour or so.

Paul Colby-2
Hi guys,

I'm having an issue with my new 0.14 cluster, where the same configuration
was fine with 0.12.

The cluster starts up, and all brokers are happy.  Then, with no client
activity at all, after some seemingly random amount of time (usually around
30 minutes to an hour) all brokers in the cluster (three, in this case)
report the following error:

critical Error delivering frames: Cluster timer drop non-existent task
ManagementAgent::periodicProcessing (qpid/cluster/ClusterTimer.cpp:128)

Then they all shut down, leaving their respective stores dirty :(

Any ideas what might be going wrong here?

Thanks,

pc
----
http://colby.id.au

Re: 0.14 cluster never survives more than an hour or so.

Pavel Moravec
Hi Paul,
this usually happens as a consequence of cluster split-brain. Are you using CMAN (Cluster Manager)?

(Technically, when split-brain occurs, two or more qpid brokers each think they are the elder node (the elder node is the "managing" node, usually the oldest member of the cluster). But a cluster can have only one elder node, because the elder periodically triggers the periodicProcessing task cluster-wide, and only one instance of that task can run at a time. When several elder nodes are present, each of them invokes the task on every cluster member, so more tasks would be executed than should be; the brokers shut down to prevent that.)
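
The failure mode can be pictured with a toy model. This is only a sketch and not qpid's actual ClusterTimer code: the class `ToyClusterTimer`, its methods, and `run_round` are made-up names, and the assumption that every would-be elder issues its own "drop" for the same task is a simplification for illustration.

```python
class ToyClusterTimer:
    """Toy stand-in for a cluster-wide timer (hypothetical, not qpid code)."""

    def __init__(self):
        self.tasks = {}  # task name -> number of wakeups delivered so far

    def add(self, name):
        self.tasks[name] = 0

    def deliver_wakeup(self, name):
        # A wakeup for a task nobody registered means cluster state diverged.
        if name not in self.tasks:
            raise RuntimeError(f"Cluster timer wakeup non-existent task {name}")
        self.tasks[name] += 1

    def deliver_drop(self, name):
        # Same check when a task is dropped.
        if name not in self.tasks:
            raise RuntimeError(f"Cluster timer drop non-existent task {name}")
        del self.tasks[name]


def run_round(timer, elders, name="ManagementAgent::periodicProcessing"):
    """Register the task once, then let each self-declared elder drop it."""
    timer.add(name)
    for _ in range(elders):
        timer.deliver_drop(name)


run_round(ToyClusterTimer(), elders=1)  # one elder: the round completes

try:
    run_round(ToyClusterTimer(), elders=2)  # two elders: second drop fails
except RuntimeError as err:
    print(err)
```

With a single elder the round completes; with two, the second drop refers to a task that is already gone, which is the kind of inconsistency a broker can only answer by shutting down.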

Kind regards,
Pavel Moravec


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: 0.14 cluster never survives more than an hour or so.

Gordon Sim
In reply to this post by Paul Colby-2
On 04/12/2012 04:08 AM, Paul Colby wrote:

> Hi guys,
>
> I'm having an issue with my new 0.14 cluster, where the same configuration
> was fine with 0.12.
>
> The cluster starts up, and all brokers are happy.  Then, with no client
> activity at all, after some seemingly random amount time (usually around 30
> minutes to an hour) all brokers in the cluster (three, in this case) report
> the following error:
>
> critical Error delivering frames: Cluster timer drop non-existent task
> ManagementAgent::periodicProcessing (qpid/cluster/ClusterTimer.cpp:128)

Could it be: https://issues.apache.org/jira/browse/QPID-3369? The error
message is similar.



Re: 0.14 cluster never survives more than an hour or so.

Paul Colby-2
In reply to this post by Pavel Moravec
Thanks Pavel and Gordon, I really appreciate you guys getting back to me so
quickly :)

I'm not currently using cman.  I hadn't been using it on 0.12 either.  I
suspect that split-brain is not the cause, since the test cluster in
question is on virtual machines all within a single host, with *very*
reliable virtual networking between them.  After reading your response, I
did have a quick look at setting up cman to verify either way, but that's
not proving to be quick and easy, so I'll come back to it shortly.

The https://issues.apache.org/jira/browse/QPID-3369 issue does look
interesting.  I'll apply the patch suggested there and see what difference
it makes.

Thanks again.  I'll let you know how it goes :)

pc
----
http://colby.id.au



Re: 0.14 cluster never survives more than an hour or so.

Paul Colby-2
Alas, the patch at https://issues.apache.org/jira/browse/QPID-3369 has not
fixed the issue.

Interestingly though, it did move the error to a different line, but with a
very similar message, e.g.:

Apr 13 17:04:17 gateway02 qpidd[32258]: 2012-04-13 17:04:17 critical Error
delivering frames: Cluster timer wakeup non-existent task
ManagementAgent::periodicProcessing (qpid/cluster/ClusterTimer.cpp:112)

So it's moved from ClusterTimer::deliverDrop to ClusterTimer::deliverWakeup
instead... but with the same end result.

pc
----
http://colby.id.au



Re: 0.14 cluster never survives more than an hour or so.

Pavel Moravec
Hi Paul,
both errors occur under very similar circumstances. I recommend enabling debug logging for the cluster component by adding:

log-enable=debug+:cluster
log-enable=notice+

to qpidd.conf, and posting the logs to a new JIRA. (You can try enabling trace logs, which might provide more verbose output, but running traces for half an hour would require a nontrivial amount of disk space.)

To alleviate the consequences, I think disabling management should help (but other problems may arise later somewhere else, as this only prevents the consequence and does not fix the root-cause bug). Also, some QMF-based services (like qpid-tool) won't work with management disabled.

To disable management, add to qpidd.conf:

mgmt-enable=no

Alternatively, you can lower the frequency of management updates (which are processed by the periodicProcessing task); see the mgmt-pub-interval option (10 seconds by default). If you set it to, e.g., 2 hours, your qpid cluster should run for at least 2 hours without the error. But again, some QMF-based services rely on the updates.
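
As a sketch of that last option, and assuming mgmt-pub-interval takes its value in seconds (which the 10-second default suggests, but which is worth checking against qpidd --help for your version), a two-hour interval in qpidd.conf would be:

```
# assumption: value is in seconds, so 2 hours = 7200
mgmt-pub-interval=7200
```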


Kind regards,
Pavel Moravec



Re: 0.14 cluster never survives more than an hour or so.

Paul Colby-2
Thanks for the direction Pavel!!  I've found the problem! :)

In short, it was a configuration management service (Puppet in this case)
restarting the network subsystem on all three servers in the cluster at
once.

I assume at this stage that the network restarts are a result of a Puppet
misconfiguration (I'll check that out with our ops guys on Monday).  But
effectively, as I understand it from looking at the debug logs, the
temporary loss of network causes all three brokers to think that they are
now the elder (a split-brain scenario), so when the network is restored
seconds later, they all realise something is horribly wrong, and shut down
immediately (as they ought).
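
To line the shutdowns up against network restarts, something like the following could pull the failure timestamps out of syslog. It is only a sketch: the regular expression is based on the single log line quoted earlier in this thread, and `find_timer_failures` is a made-up helper name, not part of any qpid tool.

```python
import re

# Matches the broker's own timestamp and the "wakeup"/"drop" variant of the error.
PATTERN = re.compile(
    r"(?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) critical "
    r"Error delivering frames: Cluster timer (?P<kind>wakeup|drop) "
    r"non-existent task (?P<task>\S+)"
)


def find_timer_failures(log_text):
    """Return (timestamp, kind, task) for each cluster timer failure found."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [(m["when"], m["kind"], m["task"]) for m in PATTERN.finditer(flat)]


sample = """Apr 13 17:04:17 gateway02 qpidd[32258]: 2012-04-13 17:04:17 critical Error
delivering frames: Cluster timer wakeup non-existent task
ManagementAgent::periodicProcessing (qpid/cluster/ClusterTimer.cpp:112)"""

for when, kind, task in find_timer_failures(sample):
    print(when, kind, task)
```

The timestamps it returns can then be compared by hand with the times of the Puppet runs or network restarts on each host.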

So, I guess in this case there are two things I should do to guard against
this happening (besides tweaking Puppet):
1. increase the cluster-size parameter (currently 0 for testing).
2. use cman - something I will definitely need to look into next :)

Thanks again!

pc
----
http://colby.id.au

