Defining the behaviour of Proton Engine API under error conditions


Keith Wall
Hi all

Phil and I are tasked with producing a comprehensive set of system
tests for Proton Engine.

The aim is to produce a test suite that will execute against all
Proton implementations, guaranteeing that all exhibit identical
behaviour and assuring conformance with the AMQP 1.0 specification.

This work has highlighted the need to define how Proton Engine API
behaves under error conditions.

To start a discussion, we have identified below six different
types of error and posed a number of questions about the behaviour
of the Engine API. Thoughts?

Regards, Keith.

Background
==========

We have identified two sources of test-cases:

A) The AMQP specification (parts 1 and 2).  For example, the spec states
(2.4.1) "the open frame can only be sent on channel 0", suggesting a
test case should exercise the path where Proton receives an Open on a
channel number other than 0; similarly, (2.4.2) "[The Close] frame
MUST be the last thing ever written onto a connection" suggests a test
case where Proton receives another frame after the receipt of Close.

B) The Proton API itself suggests test-cases. For example, what happens
if a user tries to bind a connection to more than one transport, opens
too many channels, or calls connection close on a connection that has
already been closed?
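
To make (A) concrete, a first test might look something like the sketch
below. This is only a sketch: it assumes a byte-oriented
pn_transport_push() style of input interface, and the assert is our
guess at the desired behaviour rather than a statement of it:

  #include <proton/engine.h>
  #include <assert.h>
  #include <sys/types.h>

  static void test_frame_size_below_minimum(void)
  {
      pn_connection_t *conn = pn_connection();
      pn_transport_t *transport = pn_transport();
      pn_transport_bind(transport, conn);

      /* Valid protocol header followed by a frame header whose size
       * field (4) is below the legal minimum of 8 (Part 2.3.1). */
      static const char bytes[] =
          "AMQP\x00\x01\x00\x00"
          "\x00\x00\x00\x04\x02\x00\x00\x00";
      ssize_t n = pn_transport_push(transport, bytes, sizeof(bytes) - 1);

      /* We would expect the transport to enter an error state: either
       * push reports an error, or a close frame is produced and
       * pn_transport_pending() eventually returns PN_EOS. */
      assert(n < 0 || pn_transport_pending(transport) != 0);

      pn_transport_free(transport);
      pn_connection_free(conn);
  }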


Error conditions
================

Numbers 1-4 relate to the bottom half (i.e. the transport I/O functions):

1) bytes that do not conform to AMQP 1.0 Part 1 [e.g. inconsistent size/doff]

2) bytes that while constituting a valid frame (conforms to Part
2.3.1) are an invalid AMQP frame (violates Part 2.3.2) [e.g.
frame-body containing a primitive string rather than performative]

3) bytes that constitute a valid AMQP frame (conforms to Part 2.3.2) but:
 3A) performative is malformed [e.g. field with unexpected type or
a mandatory field with value null]
 3B) performative with additional fields [e.g. a Close with additional fields ]
 3C) frame that breaks an AMQP business rule [e.g. Open received on
non-zero channel]

4) state error [e.g. Begin performative received before Open, Attach
on unknown channel number etc]

Numbers 5-6 relate to the top half (i.e. the functions relating to
Connection, Session, etc):

5) illegal parameters to a method call [e.g.
pn_connection_set_container with null container name]
6) illegal state [e.g. pn_connection_open called twice, pn_session
called on unopened connection, pn_session called too many times etc]

Questions
=========

When the bottom half encounters input characterised by 1-4, how does
the bottom half of the API behave? What is the effect on the top half?

1. Will the bottom half continue to accept more input?
2. Will the bottom half continue to produce output?
3. How will the application using the top-half API know an error has
occurred? What are the application's responsibilities when it learns of
an error?
4. If a connection is already opened, how (if at all) does the
presence of the error condition affect the connection?

When the top half is used in a manner characterised by 5-6, how does the
top half behave?  What, if any, is the effect on the bottom half?

Re: Defining the behaviour of Proton Engine API under error conditions

Ted Ross

On 03/27/2013 11:53 AM, Keith W wrote:

> [..snip..]

I think bottom-half errors should be handled more severely than top-half
errors.  In cases 1-4 (bottom half), the connection should be closed in
the same manner as if the socket had suddenly been closed.  The top half
can then respond to the loss of the connection in the normal way, as
though the transport connection had been lost.

I will go further and say that the bottom half should determine as early
as possible that the stream is ill-formed.  It would be bad for the
bottom half to receive and buffer a gigabyte of data before checking to
see if the header is well-formed.  We should be thinking about how
denial of service attacks might be perpetrated against Proton.
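
For instance, a minimal sketch of that early check (my own illustration,
not existing Proton code) could validate the frame size field against the
negotiated maximum before buffering anything further:

  #include <stdint.h>
  #include <stdbool.h>

  /* The 4-byte size prefix of a frame header is network byte order
   * (Part 2.3.1); max_frame would come from the negotiated
   * open.max-frame-size. */
  static bool frame_size_acceptable(const uint8_t header[4],
                                    uint32_t max_frame)
  {
      uint32_t size = ((uint32_t)header[0] << 24) |
                      ((uint32_t)header[1] << 16) |
                      ((uint32_t)header[2] << 8)  |
                       (uint32_t)header[3];
      /* A frame is at least 8 bytes (the header itself) and no larger
       * than the negotiated maximum, so a hostile gigabyte-sized frame
       * is rejected after reading only 4 bytes. */
      return size >= 8 && size <= max_frame;
  }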

The top-half cases should return error codes or throw exceptions to the
application.

-Ted


Re: Defining the behaviour of Proton Engine API under error conditions

Rafael Schloming-3
On Wed, Mar 27, 2013 at 11:53 AM, Keith W <[hidden email]> wrote:

> [..snip..]
>
> Numbers 5-6 relate to the top half (i.e. the functions relating to
> Connection, Session, etc):
>
> 5) illegal parameters to a method call [e.g.
> pn_connection_set_container with null container name]
> 6) illegal state [e.g. pn_connection_open called twice, pn_session
> called on unopened connection, pn_session called too many times etc]
>

One thing that pops out here is your comment about it being an error to
call pn_session() on an unopened connection. I believe this may indicate a
misunderstanding of a key property of the design.

The top half represents endpoint state. Connections, sessions, links, and
deliveries are all just data structures. These data structures can be
built up in one of two ways: either directly through the top half API by
calling the various constructors: pn_session(), pn_sender(), pn_receiver(),
pn_delivery(); or by binding a transport object to a connection and then
feeding bytes into that transport object. Now for the bytes that are fed
into that transport, there certainly is a constraint that you can't send a
begin frame without already having sent an open frame, and likewise you
can't send an attach frame without already having sent a begin frame.
However, these constraints are part of the protocol definition, and have
no bearing on how the same endpoint data structures can be constructed
directly through the top half API.

It is perfectly valid to construct a connection, session, and link, create
deliveries on them, supply data for those deliveries, even update and/or
settle those deliveries without ever opening any of the containing
connection/session/links, or indeed without ever binding a transport. It is
the job of the engine to figure out how to translate the current endpoint
state into a valid sequence of protocol primitives. So, for example, if a
transport is bound to a connection with a whole lot of open sessions and
links, but the connection itself isn't open yet, the transport will simply
not send any frames because there is nothing it can legally send until the
connection is opened.
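
For example, all of this is legal before any transport exists (a sketch
using the existing constructors; the container and link names are
arbitrary):

  #include <proton/engine.h>

  static pn_connection_t *build_endpoints(void)
  {
      pn_connection_t *conn = pn_connection();
      pn_connection_set_container(conn, "my-container");

      pn_session_t *ssn = pn_session(conn);   /* legal before open */
      pn_link_t *snd = pn_sender(ssn, "my-sender");

      pn_connection_open(conn);
      pn_session_open(ssn);
      pn_link_open(snd);

      /* No frames exist at this point; only once a transport is bound
       * will the engine translate this endpoint state into a legal
       * open/begin/attach sequence. */
      return conn;
  }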


>
> Questions
> =========
>

Your questions below are best answered in the context of the new
transport interface, so I'm going to describe that a bit first:

     +-------------+          +-------------+          +-------------+
     |             |  Input   |             |  Tail    |             |
     |             |--------->|             |--------->|             |
     |   Socket    |          |   Driver    |          |  Transport  |
     |             |<---------|             |<---------|             |
     |             |  Output  |             |  Head    |             |
     +-------------+          +-------------+          +-------------+

If you recall, conceptually we have the driver reading data available in
the socket's input buffer into free capacity at the tail of the transport,
and also writing pending bytes at the head of the transport into free space
available in the socket's output buffer.

Now it's important to understand the different possible conditions that can
arise and how each component will/should behave.

On the socket side of things we have the following possibilities:
  - the socket input can go into a state where it will never produce
any more bytes
    + there are two sub-variants of this state: it can happen normally, and
it can happen with some kind of error code returned by the socket API
  - the socket output can go into a state where it will never accept
any more bytes
    + this can happen due to some kind of error, or it can happen because
the driver explicitly closes the output

On the Transport side of things we have the following possibilities:
  - the tail can go into a state where it will never have free space for
more input
  - the head can go into a state where it will never again produce pending
bytes

Given the above there are a few things we can say about driver behaviour:
  - if the head goes into a state where it will never again produce pending
bytes, the driver should shut down the socket output
  - if the tail goes into a state where it will never again have capacity
for input, the driver should shut down the socket input
  - if the socket input goes into a state where it will never produce
any more bytes for the transport, the driver should inform the transport
that it will never get any more bytes
  - if the socket output goes into a state where it will never be able to
write any more bytes produced by the transport, the driver should inform
the transport

The above is all captured in the API for the transport. We have close_tail
for the driver to inform the transport that the socket input was closed, we
have capacity which can return an error code indicating that there will
never be any free space, we have pending which can return an error code
indicating that there will never be any more available bytes, and of course
we have close_head which the driver can use to inform the transport that no
bytes will ever be consumed again:

  int pn_transport_close_tail(pn_transport_t *transport);
  ssize_t pn_transport_capacity(pn_transport_t *transport);

  ssize_t pn_transport_pending(pn_transport_t *transport);
  int pn_transport_close_head(pn_transport_t *transport);
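
A rough sketch of a single driver step built on these calls, over a plain
POSIX socket (a sketch only; non-blocking handling, EAGAIN, and error
checking are elided for brevity):

  #include <proton/engine.h>
  #include <sys/socket.h>
  #include <unistd.h>

  static void driver_step(pn_transport_t *transport, int fd)
  {
      ssize_t capacity = pn_transport_capacity(transport);
      if (capacity > 0) {
          ssize_t n = read(fd, pn_transport_tail(transport), capacity);
          if (n > 0)
              pn_transport_process(transport, (size_t)n); /* consume input */
          else if (n == 0)
              pn_transport_close_tail(transport); /* peer closed its output */
      } else if (capacity < 0) {
          shutdown(fd, SHUT_RD);   /* tail will never have free space */
      }

      ssize_t pending = pn_transport_pending(transport);
      if (pending > 0) {
          ssize_t n = write(fd, pn_transport_head(transport), pending);
          if (n > 0)
              pn_transport_pop(transport, (size_t)n); /* drop written bytes */
      } else if (pending < 0) {
          shutdown(fd, SHUT_WR);   /* e.g. PN_EOS: no more output, ever */
      }
  }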


>
> When the bottom half encounters input characterised by 1-4, how does
> the bottom half of the API behave? What is the effect on the top half?
>
> 1. Will the bottom half continue to accept more input?
>

In a way this is kind of unimportant to specify. With the way the new
transport interface works, the driver will read anywhere from 0 up to
"capacity" bytes into the transport's tail. Depending on how network reads
end up being fragmented, this could end up being a large amount of
garbage data, a little amount of garbage data, or some amount of good data
with garbage somewhere in the middle. In all cases the transport will end
up going into an error state, possibly writing out some number of detach
and end frames, almost certainly writing out a close frame with some kind
of helpful debugging info in it, and then indicating that there will never
be any more pending bytes available by returning PN_EOS from
pn_transport_pending.
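
In driver terms, output handling keeps going after the error: the driver
keeps writing whatever pending bytes remain (the detach/end/close frames)
until pending goes negative. A sketch, with the same caveats as the driver
step above:

  #include <proton/engine.h>
  #include <proton/error.h>
  #include <sys/socket.h>
  #include <unistd.h>

  static void drain_output(pn_transport_t *transport, int fd)
  {
      ssize_t pending;
      while ((pending = pn_transport_pending(transport)) > 0) {
          ssize_t n = write(fd, pn_transport_head(transport), pending);
          if (n <= 0)
              break;                  /* socket output failed */
          pn_transport_pop(transport, (size_t)n);
      }
      if (pending == PN_EOS)
          shutdown(fd, SHUT_WR);      /* final close frame written */
  }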


> 2. Will the bottom half continue to produce output?
>

Yes, see above.


> 3. How will the application using the top-half API know an error has
> occurred? What are the application's responsibilities when it learns of
> an error?
>

The transport has an error variable which can be inspected to get more
information about what has happened. Also, if/when the transport is unbound
from the connection, all of the remote endpoint states will transition back
to UNINIT.
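
To be concrete, a sketch of that inspection (assuming the
pn_transport_error() accessor) might be:

  #include <proton/engine.h>
  #include <proton/error.h>
  #include <stdio.h>

  static void check_transport(pn_transport_t *transport)
  {
      pn_error_t *err = pn_transport_error(transport);
      if (err && pn_error_code(err)) {
          /* what to do next is application policy: log and shut down,
           * or unbind and retry against another peer */
          fprintf(stderr, "transport failed: %s\n", pn_error_text(err));
      }
  }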

I'm not sure how to answer what the application's responsibilities are.
That seems to depend on the application. It could just decide to shut down
with an error message, or it could decide to employ some sort of retry
strategy, e.g. connect to a backup service, create a new transport, and
bind it to the same top half. I'm not sure I'd say it has any hard and
fast responsibilities per se, though.
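
Such a retry might be sketched like this; connect_to_backup() is a
hypothetical application function:

  static void retry_elsewhere(pn_connection_t *conn,
                              pn_transport_t *failed)
  {
      /* unbinding resets the remote endpoint state back to UNINIT */
      pn_transport_unbind(failed);
      pn_transport_free(failed);

      /* bind a fresh transport to the same top-half state and resume
       * pumping bytes through the driver loop */
      pn_transport_t *fresh = pn_transport();
      pn_transport_bind(fresh, conn);
      connect_to_backup(fresh);
  }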


> 4. If a connection is already opened, how (if at all) does the
> presence of the error condition affect the connection?
>

Basically, at some point the transport can no longer process any more
input data, so it's kind of the same thing as if the wire were cut and
that input data was simply no longer available to process. Of course, the
fact that it was likely a programming error that led to that circumstance
would probably influence how you might react, e.g. you wouldn't try to
reconnect (to the same implementation at least) and do the same operation
again, but logically the top half should still be available and reflect
the endpoint state as of the last valid bytes that were processed. In fact
you could imagine failover in a situation like this, so long as you were
failing over to a different vendor.


>
> When the top half is used in a manner characterised by 5-6, how does the
> top half behave?  What, if any, is the effect on the bottom half?
>

So for 5 I would expect there to be no effect on the top half or the bottom
half, i.e. whenever possible we should pre-examine all input parameters and
ensure that we can successfully complete whatever it is we're being asked to
do before we actually go about changing any internal state. This may of
course not always be practical, but I'd have to consider any exceptions on
a case by case basis.

Likewise with 6, although I don't believe any of the examples given there
actually constitute illegal states for the endpoint data structures, with
the exception of calling pn_session() too many times which I don't believe
is really a state error, it's more of a resource error, akin to calling
malloc too many times or getting an out of memory exception in Java. (To be
clear, the number of sessions held by the top half data structure is not
limited by the max channel negotiated at the wire level, the max channel
only impacts how many sessions can be attached simultaneously, so the only
real limit on the number of calls to pn_session() is how much memory you
have.)

--Rafael

Re: Defining the behaviour of Proton Engine API under error conditions

Rafael Schloming-3
On Wed, Mar 27, 2013 at 4:16 PM, Rafael Schloming <[hidden email]> wrote:

>> 1. Will the bottom half continue to accept more input?
>
> In a way this is kind of unimportant to specify. [..snip..]

I realize I didn't quite finish my thought here. The reason it is
unimportant to specify is that the transport accepts data in chunks. This
is because we don't want the driver to ever have to buffer data on its
own, so when the transport reports capacity "X" to the driver, it has
pretty much already guaranteed to process whatever bytes the driver gives
it, up to "X". It can't really decide to stop after the first few
bytes because they were malformed; it pretty much needs to "process" the
rest of the garbage by just throwing it away. So a transport implementation
could decide to keep on accepting more garbage bytes until it has written
out its close frame with its helpful debugging info, or it could decide
to stop accepting input and leave it to the driver to either discard the
garbage on its own, or do a shutdown of the socket input.

I would say the most friendly thing to do here in a debugging context would
probably be to keep on accepting and throwing away the garbage until the
helpful close frame is written to the wire. In a production environment
it's possible that a more aggressive strategy might be preferred.

--Rafael

Re: Defining the behaviour of Proton Engine API under error conditions

rgodfrey
On 27 March 2013 21:16, Rafael Schloming <[hidden email]> wrote:
> On Wed, Mar 27, 2013 at 11:53 AM, Keith W <[hidden email]> wrote:
>

[..snip..]

>
>> 3. How will the application using the top-half API know an error has
>> occurred? What are the application's responsibilities when it learns of
>> an error?
>
> [..snip..] I'm not sure I'd say it has any hard and fast
> responsibilities per se, though.
>
>

I think the point Keith was trying to make here is that, in order for a
component to be compliant with the AMQP specification, anywhere the spec
says "The connection/session/link should be closed/ended/detached with
the X error", is it the application that actually has responsibility for
setting the local error state and calling close() on the appropriate
endpoint?  If it is, we are placing a burden on application authors to
preserve AMQP compliance.

-- Rob

Re: Defining the behaviour of Proton Engine API under error conditions

Rafael Schloming-3
On Wed, Mar 27, 2013 at 6:34 PM, Rob Godfrey <[hidden email]> wrote:

> [..snip..]
>
> I think the point Keith was trying to make here is that, in order for a
> component to be compliant with the AMQP specification, anywhere the spec
> says "The connection/session/link should be closed/ended/detached with
> the X error", is it the application that actually has responsibility for
> setting the local error state and calling close() on the appropriate
> endpoint?  If it is, we are placing a burden on application authors to
> preserve AMQP compliance.
>

Ah, that makes sense. Thanks for clarifying. I don't recall offhand all the
places the spec says close/end/detach with an X error, but in situations
like framing errors or missing/invalid arguments required by the transport
to properly maintain state, say an out-of-sequence delivery-id, that is
something the transport would handle automatically. I expect there may be
some cases that the top half would need to initiate explicitly though, e.g.
obviously a redirect error would need to be initiated by the top half.

--Rafael

Re: Defining the behaviour of Proton Engine API under error conditions

rgodfrey
On 28 March 2013 02:45, Rafael Schloming <[hidden email]> wrote:

> [..snip..]
>
> Ah, that makes sense. Thanks for clarifying. I don't recall offhand all the
> places the spec says close/end/detach with an X error, but in situations
> like framing errors or missing/invalid arguments required by the transport
> to properly maintain state, say an out-of-sequence delivery-id, that is
> something the transport would handle automatically. I expect there may be
> some cases that the top half would need to initiate explicitly though, e.g.
> obviously a redirect error would need to be initiated by the top half.
>
So when you say the transport handles this automatically, how does that
work from the app's perspective?  Is the transport updating the local
state of the endpoint, or the remote state?  If the transport is actually
injecting the closing performative into the outgoing stream (the detach,
end, or close), then it seems somewhat weird if the local app still sees
this endpoint as locally open... on the other hand, it is also weird if
the local state is now a shared responsibility between the app and the
transport.

Also what about cases where the error is in the opening performative for
the endpoint (open, begin, attach)... if the way to communicate the error
is by responding with the paired closing performative, this may first
require the sending of an opening performative... For example, if there is
an error in an incoming (unsolicited) open frame - say the container id is
null - then the only way to communicate this back to the other side is to
send a valid open immediately followed by a close frame with the
appropriate error code.  Are you saying that the transport would do all
this silently, without this being visible to the app?

-- Rob


> --Rafael
>

Re: Defining the behaviour of Proton Engine API under error conditions

Rafael Schloming-3
On Thu, Mar 28, 2013 at 5:31 AM, Rob Godfrey <[hidden email]> wrote:

> [..snip..]
>
> So when you say the transport handles this automatically, how does that
> work from the app's perspective?  Is the transport updating the local
> state of the endpoint, or the remote state?  If the transport is actually
> injecting the closing performative into the outgoing stream (the detach,
> end, or close), then it seems somewhat weird if the local app still sees
> this endpoint as locally open... on the other hand, it is also weird if
> the local state is now a shared responsibility between the app and the
> transport.
>

So initially I think we had a much muddier vision of the relationship
between the top half objects and the "state on the wire". The first few
iterations of the engine were much less sophisticated than the current one,
and the top half data structure and any changes to it were pretty much
directly mapped into on-the-wire actions. This proved kind of brittle, as
you can manipulate a data structure in much more flexible ways than you can
the wire.

Perhaps a helpful analogy here is serializing collection data structures.
Imagine a Map. A serialized map has to be represented in some linear form
"on disk". Now if you couple the Map data structure too closely to the
serialization code, then you might start confusing serialized format
constraints with data structure constraints and start getting arbitrary
un-map-like constraints, e.g. maybe you can only access the key/value pairs
in the order that they were deserialized, and since you don't really have
direct visibility into that order, you have an API that is very brittle.

So in the case of the engine, in order to make the top half API more
robust, I ended up loosening the coupling to the "on the wire" state. I
don't recall the specifics exactly, but I do remember cases where initially
the API would arbitrarily behave in different ways depending on what
had/had not happened on the wire, and there was no way for the application
programmer to directly know what actually did happen on the wire. (I
vaguely recall it was something to do with credit, and the number magically
changing depending on whether the value had been written to the socket or
not, something the app has no way of knowing.)

In any case, the upshot of this is that while it might be tempting to think
of the top half data structure as exactly matching on-the-wire state,
particularly because it is named in terms of
connections/sessions/links/deliveries, that's like thinking of your Map
data structure as only holding what happens to have been already serialized
to disk at a given moment. What the top half data structure actually
represents is the desired endpoint state of the local app, and the last
known endpoint state of the remote app.

To answer your question: say there is a framing error/the wire is cut
(really there isn't any way to know the difference, since you could cut the
wire half way through a frame header). The transport interface will write
out the close frame as required by the spec, and it will indicate through
its error interface that an error has occurred; however, it won't alter any
of the local/remote states of the top half endpoints. The local states
remain reflective of the local app's desired state, and the remote states
remain reflective of the remote app's last known desired state. This kind of
has to be this way because you don't want to confuse links being
involuntarily detached because the wire was cut with the remote endpoint
wanting to actively shut down the link.

So to summarize, there are three things an app can look at: local endpoint
state, remote endpoint state, and transport state. The first is always
what the app wants it to be, i.e. just because the wire was cut doesn't
mean the local app wanted the endpoint to be closed; the remote endpoint
state is always reflective of what the remote endpoint wanted (as last we
knew it), and again, just because the wire is cut doesn't mean we
actually know the remote endpoint wanted to be closed; and then of course
there is the transport state, which tells you whether or not the transport
is functioning normally or has reached an error condition.
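
In proton-c terms, those three things are roughly the following (a sketch
using the existing state bitmasks and error accessor):

  #include <proton/engine.h>

  static void inspect(pn_connection_t *conn, pn_transport_t *transport)
  {
      pn_state_t state = pn_connection_state(conn);
      int local_open  = state & PN_LOCAL_ACTIVE;   /* what the app wants  */
      int remote_open = state & PN_REMOTE_ACTIVE;  /* remote intent, as
                                                      last known          */
      pn_error_t *err = pn_transport_error(transport); /* transport health */
      /* A cut wire changes neither endpoint bit; it only shows up in err. */
      (void)local_open; (void)remote_open; (void)err;
  }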


>
> Also what about cases where the error is in the opening performative for
> the endpoint (open, begin, attach)... if the way to communicate the error
> is by responding with the paired closing performative, this may first
> require the sending of an opening performative... For example, if there is
> an error in an incoming (unsolicited) open frame - say the container id is
> null - then the only way to communicate this back to the other side is to
> send a valid open immediately followed by a close frame with the
> appropriate error code.  Are you saying that the transport would do all
> this silently, without this being visible to the app?
>

I'm saying it should do this. I don't know offhand how well the current
implementation handles that particular case. Looking at the code, I suspect
the C implementation would end up just sending a close frame without
sending the open frame.

--Rafael

Re: Defining the behaviour of Proton Engine API under error conditions

Phil Harvey-2
On 28 March 2013 13:17, Rafael Schloming <[hidden email]> wrote:

[..snip..]

>
> To answer your question: say there is a framing error/the wire is cut
> (really there isn't any way to know the difference, since you could cut the
> wire half way through a frame header). The transport interface will write
> out the close frame as required by the spec, and it will indicate through
> its error interface that an error has occurred; however, it won't alter any
> of the local/remote states of the top half endpoints. The local states
> remain reflective of the local app's desired state, and the remote states
> remain reflective of the remote app's last known desired state. This kind of
> has to be this way because you don't want to confuse links being
> involuntarily detached because the wire was cut with the remote endpoint
> wanting to actively shut down the link.
>

I find this a little confusing.  After the Transport has silently sent the
Close frame, what would the local Application typically do next in order to
get the Engine back to a usable state?

There is an alternative approach which I would find simpler.  In this
alternative, the Engine would not implicitly send the Close frame.
Instead, the Application would explicitly control this by doing the
following:

- The Application checks the Transport's error state as usual
- The Application discovers that the Transport is in an error state and
therefore calls Connection.setCondition(errorDetailsObtainedFromTransport)
followed by Connection.close()
- The Application calls Transport.output (or pn_transport_head in
proton-c), causing the Close frame bytes to be produced.

As a Proton Engine developer I would find this simpler to implement.

Moreover, as a Proton Engine user, it gives me a clearer separation of
responsibility between the application and the Engine.

Maybe I'm just not grokking the Engine's philosophy, but on the whole the
Engine API feels like it gives me control over what frames are being
produced (though not *when* - and that's fine by me), so I find the idea of
the Transport layer silently sending a Close rather surprising.

What are people's views on this?

[..snip..]

>
> --Rafael
>

Phil

Re: Defining the behaviour of Proton Engine API under error conditions

Rafael Schloming-3
On Thu, Mar 28, 2013 at 11:16 AM, Phil Harvey <[hidden email]> wrote:

> [..snip..]
>
> I find this a little confusing.  After the Transport has silently sent the
> Close frame, what would the local Application typically do next in order to
> get the Engine back to a usable state?
>

It would unbind the transport. When the transport is unbound, all the
remote state is cleared, and the app is free to use the connection/endpoint
data structures as if it had simply built them up into their current state
explicitly via constructors, as opposed to that state being the result of
network interactions.


>
> There is an alternative approach which I would find simpler. [..snip..]
>
> As a Proton Engine developer I would find this simpler to implement.
>

I'm not sure I follow why this would be any simpler to implement.


>
> Moreover, as a Proton Engine user, it gives me a clearer separation of
> responsibility between the application and the Engine.
>
> Maybe I'm just not grokking the Engine's philosophy, but on the whole the
> Engine API feels like it gives me control over what frames are being
> produced (though not *when* - and that's fine by me), so I find the idea of
> the Transport layer silently sending a Close rather surprising.
>

As a user of the top half, I don't think you should need to be at all aware
of what frames are/aren't sent by the engine. You should only need to think
about the semantics that the top half of the API provides. So I would ask
you, why do you care what frames are/aren't sent by the engine? It seems to
me that all you should care about are the intentions of the remote
application, e.g. did the remote application intend to open/close a link,
session, or connection.

I think it is key to understand the dual nature of the open/close,
begin/end, and attach/detach frames as defined by the protocol. They are
used to express two classes of things, i.e. they are frames that combine
information from multiple distinct layers. They are used at a purely
structural/framing/formatting level to manage on-the-wire constructs such
as frame sizes, channel numbers, handles, etc. - concepts that never need
to leak outside the transport. They are also used at a higher semantic
level, e.g. to express an application's intent to establish a link. Not
every use of one of these frames involves both elements; for example, if a
very small number of channels is negotiated by the transport, but the app
wants to use a lot of sessions, then the engine implementation might well
detach/reattach sessions on demand in order to share the limited number of
available channels. This would never be visible to the top half, and many
attach/detach frames would be sent without the application doing anything
to trigger them.

I think the particular case of a framing-error style close is somewhat
similar. Mechanically/structurally the close frame is required to conform
to the spec, but in such a case it is not expressing the intent of the
*application* to close the connection, it is expressing the intent of the
*transport* to close the connection.


>
> What are people's views on this?
>

The spec doesn't allow you to do anything other than send a close frame
with an appropriate condition if a framing error occurs. It seems odd for
us to ask the app developer to manually do this when the impl could just as
well do it on its own. What you describe would require the app developer to
always include extra boilerplate, and furthermore it puts the burden of
maintaining spec compliance onto that extra boilerplate code. I would say
anything that allows the app developer to make the engine violate the
protocol spec is probably a bad thing, and anything requiring the app
developer to be vigilant and follow good coding practices in order to avoid
violating the spec is certainly a really really bad thing. ;-)

--Rafael

Re: Defining the behaviour of Proton Engine API under error conditions

Phil Harvey-2
Thanks for the response.  It does clarify the Engine's semantics, and the
intended division of responsibility between the Engine and the
application.  I intend to document this soon in a short conceptual summary
under proton/docs/engine/.

I chatted to Keith about this and we're uncertain about some of the details
of the steps that follow an invalid frame being pushed into the Engine. To
illustrate this, we wrote the following pseudo-code for the main loop of a
typical application (e.g. a messaging client), similar to Driver's
pn_connector_process function.

1   tail_buf = pn_transport_tail()
2   tail_capacity = pn_transport_capacity()
3   read = socket_recv(tail_buf, tail_capacity)
4   # ... [1]
5
6   push_err_no = pn_transport_push(read) # see [Q1]
7
8   if (push_err_no < 0)
9     socket_shutdown(SHUTDOWN_READ)
10  end if
11
12  # ... [2]
13
14  head_pending = pn_transport_pending() # see [Q2]
15  if(head_pending > 0)
16    head_buf = pn_transport_head()
17    written = socket_send(head_buf, head_pending)
18    # ... [3]
19
20    pn_transport_pop(written)
21  else if (head_pending < 0)
22    socket_shutdown(SHUTDOWN_WRITE)
23  end if

Elided sections:
[1] A well-behaved application would call pn_transport_close_tail() if
socket_recv() < 0
[2] Application makes use of top half API - pn_session_head(),
pn_work_head() etc
[3] A well-behaved application would call pn_transport_close_head() if
socket_send() < 0


=== Questions about error handling ===

Imagine that the bytes read from the socket on line 3 represent a valid
frame followed by a frame that is invalid (e.g. because it contains a field
of an unexpected datatype). In this case:

[Q1] Should pn_transport_push return -1 on line 6, thereby signalling that
the application can't push any more bytes into it?

[Q2] On lines 14-21, what is in the transport's outgoing byte queue? We
expect that it would be:
  <frame1 corresponding to top-half API calls on line 12>
  <frame2 ... >
  <...>
  <the CLOSE frame triggered by the invalid input>.

Or maybe the CLOSE frame somehow replaces the other outgoing frames in the
transport's outgoing byte queue?

Note: if the application supports failover, it would subsequently unbind
the transport, create a new socket, create a new transport, and bind the
existing connection to it.


Phil



On 28 March 2013 16:25, Rafael Schloming <[hidden email]> wrote:

> [..snip..]