Hey all,
It is currently impossible to enable state checkpointing for iterative jobs, because en exception is thrown when creating the jobgraph. This behaviour is motivated by the lack of precise guarantees that we can give with the current fault-tolerance implementations for cyclic graphs. This PR <https://github.com/apache/flink/pull/812> adds an optional flag to force checkpoints even in case of iterations. The algorithm will take checkpoints periodically as before, but records in transit inside the loop will be lost. However even this guarantee is enough for most applications (Machine Learning for instance) and certainly much better than not having anything at all. I suggest we add this to the 0.9 release as currently many applications suffer from this limitation (SAMOA, ML pipelines, graph streaming etc.) Cheers, Gyula |
Hey Gyula,
I understand your reasoning, but I don't think its worth to rush this into the release. As you've said, we cannot give precise guarantees. But this is arguably one of the key requirements for any fault tolerance mechanism. Therefore I disagree that this is better than not having anything at all. I think it will already go a long way to have the non-iterative case working reliably. And as far as I know there are no users really suffering from this at the moment (in the sense that someone has complained on the mailing list). Hence, I vote to postpone this. – Ufuk On 10 Jun 2015, at 00:19, Gyula Fóra <[hidden email]> wrote: > Hey all, > > It is currently impossible to enable state checkpointing for iterative > jobs, because en exception is thrown when creating the jobgraph. This > behaviour is motivated by the lack of precise guarantees that we can give > with the current fault-tolerance implementations for cyclic graphs. > > This PR <https://github.com/apache/flink/pull/812> adds an optional flag to > force checkpoints even in case of iterations. The algorithm will take > checkpoints periodically as before, but records in transit inside the loop > will be lost. > > However even this guarantee is enough for most applications (Machine > Learning for instance) and certainly much better than not having anything > at all. > > > I suggest we add this to the 0.9 release as currently many applications > suffer from this limitation (SAMOA, ML pipelines, graph streaming etc.) > > > Cheers, > > Gyula |
As for people currently suffering from it:
An application King is developing requires iterations, and they need checkpoints. Practically all SAMOA programs would need this. It is very likely that the state interfaces will be changed after the release, so this is not something that we can just add later. I don't see a reason why we should not add it, as it is clearly documented. In this actual case not having guarantees at all means people will never use it in any production system. Having limited guarantees means that it will depend on the application. On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <[hidden email]> wrote: > Hey Gyula, > > I understand your reasoning, but I don't think its worth to rush this into > the release. > > As you've said, we cannot give precise guarantees. But this is arguably > one of the key requirements for any fault tolerance mechanism. Therefore I > disagree that this is better than not having anything at all. I think it > will already go a long way to have the non-iterative case working reliably. > > And as far as I know there are no users really suffering from this at the > moment (in the sense that someone has complained on the mailing list). > > Hence, I vote to postpone this. > > – Ufuk > > On 10 Jun 2015, at 00:19, Gyula Fóra <[hidden email]> wrote: > > > Hey all, > > > > It is currently impossible to enable state checkpointing for iterative > > jobs, because en exception is thrown when creating the jobgraph. This > > behaviour is motivated by the lack of precise guarantees that we can give > > with the current fault-tolerance implementations for cyclic graphs. > > > > This PR <https://github.com/apache/flink/pull/812> adds an optional > flag to > > force checkpoints even in case of iterations. The algorithm will take > > checkpoints periodically as before, but records in transit inside the > loop > > will be lost. > > > > However even this guarantee is enough for most applications (Machine > > Learning for instance) and certainly much better than not having anything > > at all. > > > > > > I suggest we add this to the 0.9 release as currently many applications > > suffer from this limitation (SAMOA, ML pipelines, graph streaming etc.) > > > > > > Cheers, > > > > Gyula > > |
I agree that for the sake of the above mentioned use cases it is reasonable
to add this to the release with the right documentation, for machine learning potentially loosing one round of feedback data should not matter. Let us not block prominent users until the next release on this. On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra <[hidden email]> wrote: > As for people currently suffering from it: > > An application King is developing requires iterations, and they need > checkpoints. Practically all SAMOA programs would need this. > > It is very likely that the state interfaces will be changed after the > release, so this is not something that we can just add later. I don't see a > reason why we should not add it, as it is clearly documented. In this > actual case not having guarantees at all means people will never use it in > any production system. Having limited guarantees means that it will depend > on the application. > > On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <[hidden email]> wrote: > > > Hey Gyula, > > > > I understand your reasoning, but I don't think its worth to rush this > into > > the release. > > > > As you've said, we cannot give precise guarantees. But this is arguably > > one of the key requirements for any fault tolerance mechanism. Therefore > I > > disagree that this is better than not having anything at all. I think it > > will already go a long way to have the non-iterative case working > reliably. > > > > And as far as I know there are no users really suffering from this at the > > moment (in the sense that someone has complained on the mailing list). > > > > Hence, I vote to postpone this. > > > > – Ufuk > > > > On 10 Jun 2015, at 00:19, Gyula Fóra <[hidden email]> wrote: > > > > > Hey all, > > > > > > It is currently impossible to enable state checkpointing for iterative > > > jobs, because en exception is thrown when creating the jobgraph. This > > > behaviour is motivated by the lack of precise guarantees that we can > give > > > with the current fault-tolerance implementations for cyclic graphs. > > > > > > This PR <https://github.com/apache/flink/pull/812> adds an optional > > flag to > > > force checkpoints even in case of iterations. The algorithm will take > > > checkpoints periodically as before, but records in transit inside the > > loop > > > will be lost. > > > > > > However even this guarantee is enough for most applications (Machine > > > Learning for instance) and certainly much better than not having > anything > > > at all. > > > > > > > > > I suggest we add this to the 0.9 release as currently many applications > > > suffer from this limitation (SAMOA, ML pipelines, graph streaming etc.) > > > > > > > > > Cheers, > > > > > > Gyula > > > > > |
This is the answer I gave on the PR (we should have one place for
discussing this, though): I would be against merging this in the current form. What I propose is to analyse the topology to verify that there are no checkpointed operators inside iterations. Operators before and after iterations can be checkpointed and we can safely allow the user to enable checkpointing. If we have the code to analyse which operators are inside iterations we could also disallow windows inside iterations. I think windows inside iterations don't make sense since elements in different "iterations" would end up in the same window. Maybe I'm wrong here though, then please correct me. On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi <[hidden email]> wrote: > I agree that for the sake of the above mentioned use cases it is reasonable > to add this to the release with the right documentation, for machine > learning potentially loosing one round of feedback data should not matter. > > Let us not block prominent users until the next release on this. > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra <[hidden email]> wrote: > >> As for people currently suffering from it: >> >> An application King is developing requires iterations, and they need >> checkpoints. Practically all SAMOA programs would need this. >> >> It is very likely that the state interfaces will be changed after the >> release, so this is not something that we can just add later. I don't see a >> reason why we should not add it, as it is clearly documented. In this >> actual case not having guarantees at all means people will never use it in >> any production system. Having limited guarantees means that it will depend >> on the application. >> >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <[hidden email]> wrote: >> >> > Hey Gyula, >> > >> > I understand your reasoning, but I don't think its worth to rush this >> into >> > the release. >> > >> > As you've said, we cannot give precise guarantees. But this is arguably >> > one of the key requirements for any fault tolerance mechanism. Therefore >> I >> > disagree that this is better than not having anything at all. I think it >> > will already go a long way to have the non-iterative case working >> reliably. >> > >> > And as far as I know there are no users really suffering from this at the >> > moment (in the sense that someone has complained on the mailing list). >> > >> > Hence, I vote to postpone this. >> > >> > – Ufuk >> > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <[hidden email]> wrote: >> > >> > > Hey all, >> > > >> > > It is currently impossible to enable state checkpointing for iterative >> > > jobs, because en exception is thrown when creating the jobgraph. This >> > > behaviour is motivated by the lack of precise guarantees that we can >> give >> > > with the current fault-tolerance implementations for cyclic graphs. >> > > >> > > This PR <https://github.com/apache/flink/pull/812> adds an optional >> > flag to >> > > force checkpoints even in case of iterations. The algorithm will take >> > > checkpoints periodically as before, but records in transit inside the >> > loop >> > > will be lost. >> > > >> > > However even this guarantee is enough for most applications (Machine >> > > Learning for instance) and certainly much better than not having >> anything >> > > at all. >> > > >> > > >> > > I suggest we add this to the 0.9 release as currently many applications >> > > suffer from this limitation (SAMOA, ML pipelines, graph streaming etc.) >> > > >> > > >> > > Cheers, >> > > >> > > Gyula >> > >> > >> |
I disagree. Not having checkpointed operators inside the iteration still
breaks the guarantees. It is not about the states it is about the loop itself. On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek <[hidden email]> wrote: > This is the answer I gave on the PR (we should have one place for > discussing this, though): > > I would be against merging this in the current form. What I propose is > to analyse the topology to verify that there are no checkpointed > operators inside iterations. Operators before and after iterations can > be checkpointed and we can safely allow the user to enable > checkpointing. > > If we have the code to analyse which operators are inside iterations > we could also disallow windows inside iterations. I think windows > inside iterations don't make sense since elements in different > "iterations" would end up in the same window. Maybe I'm wrong here > though, then please correct me. > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi > <[hidden email]> wrote: > > I agree that for the sake of the above mentioned use cases it is > reasonable > > to add this to the release with the right documentation, for machine > > learning potentially loosing one round of feedback data should not > matter. > > > > Let us not block prominent users until the next release on this. > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra <[hidden email]> > wrote: > > > >> As for people currently suffering from it: > >> > >> An application King is developing requires iterations, and they need > >> checkpoints. Practically all SAMOA programs would need this. > >> > >> It is very likely that the state interfaces will be changed after the > >> release, so this is not something that we can just add later. I don't > see a > >> reason why we should not add it, as it is clearly documented. In this > >> actual case not having guarantees at all means people will never use it > in > >> any production system. Having limited guarantees means that it will > depend > >> on the application. > >> > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <[hidden email]> wrote: > >> > >> > Hey Gyula, > >> > > >> > I understand your reasoning, but I don't think its worth to rush this > >> into > >> > the release. > >> > > >> > As you've said, we cannot give precise guarantees. But this is > arguably > >> > one of the key requirements for any fault tolerance mechanism. > Therefore > >> I > >> > disagree that this is better than not having anything at all. I think > it > >> > will already go a long way to have the non-iterative case working > >> reliably. > >> > > >> > And as far as I know there are no users really suffering from this at > the > >> > moment (in the sense that someone has complained on the mailing list). > >> > > >> > Hence, I vote to postpone this. > >> > > >> > – Ufuk > >> > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <[hidden email]> wrote: > >> > > >> > > Hey all, > >> > > > >> > > It is currently impossible to enable state checkpointing for > iterative > >> > > jobs, because en exception is thrown when creating the jobgraph. > This > >> > > behaviour is motivated by the lack of precise guarantees that we can > >> give > >> > > with the current fault-tolerance implementations for cyclic graphs. > >> > > > >> > > This PR <https://github.com/apache/flink/pull/812> adds an optional > >> > flag to > >> > > force checkpoints even in case of iterations. The algorithm will > take > >> > > checkpoints periodically as before, but records in transit inside > the > >> > loop > >> > > will be lost. > >> > > > >> > > However even this guarantee is enough for most applications (Machine > >> > > Learning for instance) and certainly much better than not having > >> anything > >> > > at all. > >> > > > >> > > > >> > > I suggest we add this to the 0.9 release as currently many > applications > >> > > suffer from this limitation (SAMOA, ML pipelines, graph streaming > etc.) > >> > > > >> > > > >> > > Cheers, > >> > > > >> > > Gyula > >> > > >> > > >> > |
To continue Gyula's point, for consistent snapshots we need to persist the records in transit within the loop and also slightly change the current protocol since it works only for DAGs. Before going into that direction though I would propose we first see whether there is a nice way to make iterations more structured. Paris ________________________________________ From: Gyula Fóra <[hidden email]> Sent: Wednesday, June 10, 2015 10:19 AM To: [hidden email] Subject: Re: Force enabling checkpoints for iterative streaming jobs I disagree. Not having checkpointed operators inside the iteration still breaks the guarantees. It is not about the states it is about the loop itself. On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek <[hidden email]> wrote: > This is the answer I gave on the PR (we should have one place for > discussing this, though): > > I would be against merging this in the current form. What I propose is > to analyse the topology to verify that there are no checkpointed > operators inside iterations. Operators before and after iterations can > be checkpointed and we can safely allow the user to enable > checkpointing. > > If we have the code to analyse which operators are inside iterations > we could also disallow windows inside iterations. I think windows > inside iterations don't make sense since elements in different > "iterations" would end up in the same window. Maybe I'm wrong here > though, then please correct me. > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi > <[hidden email]> wrote: > > I agree that for the sake of the above mentioned use cases it is > reasonable > > to add this to the release with the right documentation, for machine > > learning potentially loosing one round of feedback data should not > matter. > > > > Let us not block prominent users until the next release on this. > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra <[hidden email]> > wrote: > > > >> As for people currently suffering from it: > >> > >> An application King is developing requires iterations, and they need > >> checkpoints. Practically all SAMOA programs would need this. > >> > >> It is very likely that the state interfaces will be changed after the > >> release, so this is not something that we can just add later. I don't > see a > >> reason why we should not add it, as it is clearly documented. In this > >> actual case not having guarantees at all means people will never use it > in > >> any production system. Having limited guarantees means that it will > depend > >> on the application. > >> > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <[hidden email]> wrote: > >> > >> > Hey Gyula, > >> > > >> > I understand your reasoning, but I don't think its worth to rush this > >> into > >> > the release. > >> > > >> > As you've said, we cannot give precise guarantees. But this is > arguably > >> > one of the key requirements for any fault tolerance mechanism. > Therefore > >> I > >> > disagree that this is better than not having anything at all. I think > it > >> > will already go a long way to have the non-iterative case working > >> reliably. > >> > > >> > And as far as I know there are no users really suffering from this at > the > >> > moment (in the sense that someone has complained on the mailing list). > >> > > >> > Hence, I vote to postpone this. > >> > > >> > – Ufuk > >> > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <[hidden email]> wrote: > >> > > >> > > Hey all, > >> > > > >> > > It is currently impossible to enable state checkpointing for > iterative > >> > > jobs, because en exception is thrown when creating the jobgraph. > This > >> > > behaviour is motivated by the lack of precise guarantees that we can > >> give > >> > > with the current fault-tolerance implementations for cyclic graphs. > >> > > > >> > > This PR <https://github.com/apache/flink/pull/812> adds an optional > >> > flag to > >> > > force checkpoints even in case of iterations. The algorithm will > take > >> > > checkpoints periodically as before, but records in transit inside > the > >> > loop > >> > > will be lost. > >> > > > >> > > However even this guarantee is enough for most applications (Machine > >> > > Learning for instance) and certainly much better than not having > >> anything > >> > > at all. > >> > > > >> > > > >> > > I suggest we add this to the 0.9 release as currently many > applications > >> > > suffer from this limitation (SAMOA, ML pipelines, graph streaming > etc.) > >> > > > >> > > > >> > > Cheers, > >> > > > >> > > Gyula > >> > > >> > > >> > |
And also I would like to remind everyone that any fault tolerance we
provide is only as good as the fault tolerance of the master node. Which is non existent at the moment. So I don't see a reason why a user should not be able to choose whether he wants state checkpoints for iterations as well. In any case this will be used by King for instance, so making it part of the release would save a lot of work for everyone. Paris Carbone <[hidden email]> ezt írta (időpont: 2015. jún. 10., Sze, 10:29): > > To continue Gyula's point, for consistent snapshots we need to persist the > records in transit within the loop and also slightly change the current > protocol since it works only for DAGs. Before going into that direction > though I would propose we first see whether there is a nice way to make > iterations more structured. > > Paris > ________________________________________ > From: Gyula Fóra <[hidden email]> > Sent: Wednesday, June 10, 2015 10:19 AM > To: [hidden email] > Subject: Re: Force enabling checkpoints for iterative streaming jobs > > I disagree. Not having checkpointed operators inside the iteration still > breaks the guarantees. > > It is not about the states it is about the loop itself. > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek <[hidden email]> > wrote: > > > This is the answer I gave on the PR (we should have one place for > > discussing this, though): > > > > I would be against merging this in the current form. What I propose is > > to analyse the topology to verify that there are no checkpointed > > operators inside iterations. Operators before and after iterations can > > be checkpointed and we can safely allow the user to enable > > checkpointing. > > > > If we have the code to analyse which operators are inside iterations > > we could also disallow windows inside iterations. I think windows > > inside iterations don't make sense since elements in different > > "iterations" would end up in the same window. Maybe I'm wrong here > > though, then please correct me. > > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi > > <[hidden email]> wrote: > > > I agree that for the sake of the above mentioned use cases it is > > reasonable > > > to add this to the release with the right documentation, for machine > > > learning potentially loosing one round of feedback data should not > > matter. > > > > > > Let us not block prominent users until the next release on this. > > > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra <[hidden email]> > > wrote: > > > > > >> As for people currently suffering from it: > > >> > > >> An application King is developing requires iterations, and they need > > >> checkpoints. Practically all SAMOA programs would need this. > > >> > > >> It is very likely that the state interfaces will be changed after the > > >> release, so this is not something that we can just add later. I don't > > see a > > >> reason why we should not add it, as it is clearly documented. In this > > >> actual case not having guarantees at all means people will never use > it > > in > > >> any production system. Having limited guarantees means that it will > > depend > > >> on the application. > > >> > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <[hidden email]> wrote: > > >> > > >> > Hey Gyula, > > >> > > > >> > I understand your reasoning, but I don't think its worth to rush > this > > >> into > > >> > the release. > > >> > > > >> > As you've said, we cannot give precise guarantees. But this is > > arguably > > >> > one of the key requirements for any fault tolerance mechanism. > > Therefore > > >> I > > >> > disagree that this is better than not having anything at all. I > think > > it > > >> > will already go a long way to have the non-iterative case working > > >> reliably. > > >> > > > >> > And as far as I know there are no users really suffering from this > at > > the > > >> > moment (in the sense that someone has complained on the mailing > list). > > >> > > > >> > Hence, I vote to postpone this. > > >> > > > >> > – Ufuk > > >> > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <[hidden email]> wrote: > > >> > > > >> > > Hey all, > > >> > > > > >> > > It is currently impossible to enable state checkpointing for > > iterative > > >> > > jobs, because en exception is thrown when creating the jobgraph. > > This > > >> > > behaviour is motivated by the lack of precise guarantees that we > can > > >> give > > >> > > with the current fault-tolerance implementations for cyclic > graphs. > > >> > > > > >> > > This PR <https://github.com/apache/flink/pull/812> adds an > optional > > >> > flag to > > >> > > force checkpoints even in case of iterations. The algorithm will > > take > > >> > > checkpoints periodically as before, but records in transit inside > > the > > >> > loop > > >> > > will be lost. > > >> > > > > >> > > However even this guarantee is enough for most applications > (Machine > > >> > > Learning for instance) and certainly much better than not having > > >> anything > > >> > > at all. > > >> > > > > >> > > > > >> > > I suggest we add this to the 0.9 release as currently many > > applications > > >> > > suffer from this limitation (SAMOA, ML pipelines, graph streaming > > etc.) > > >> > > > > >> > > > > >> > > Cheers, > > >> > > > > >> > > Gyula > > >> > > > >> > > > >> > > > |
Without going into the details, how well tested is this feature? The PR
only extends one test by a few lines. Is that really enough to ensure that 1) the change does not cause trouble 2) is working as expected If this feature should go into the release, it must be thoroughly checked and we must take the time for that. Including code and hoping for the best because time is scarce is not an option IMO. Fabian 2015-06-10 11:05 GMT+02:00 Gyula Fóra <[hidden email]>: > And also I would like to remind everyone that any fault tolerance we > provide is only as good as the fault tolerance of the master node. Which is > non existent at the moment. > > So I don't see a reason why a user should not be able to choose whether he > wants state checkpoints for iterations as well. > > In any case this will be used by King for instance, so making it part of > the release would save a lot of work for everyone. > > Paris Carbone <[hidden email]> ezt írta (időpont: 2015. jún. 10., Sze, > 10:29): > > > > > To continue Gyula's point, for consistent snapshots we need to persist > the > > records in transit within the loop and also slightly change the current > > protocol since it works only for DAGs. Before going into that direction > > though I would propose we first see whether there is a nice way to make > > iterations more structured. > > > > Paris > > ________________________________________ > > From: Gyula Fóra <[hidden email]> > > Sent: Wednesday, June 10, 2015 10:19 AM > > To: [hidden email] > > Subject: Re: Force enabling checkpoints for iterative streaming jobs > > > > I disagree. Not having checkpointed operators inside the iteration still > > breaks the guarantees. > > > > It is not about the states it is about the loop itself. > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek <[hidden email]> > > wrote: > > > > > This is the answer I gave on the PR (we should have one place for > > > discussing this, though): > > > > > > I would be against merging this in the current form. What I propose is > > > to analyse the topology to verify that there are no checkpointed > > > operators inside iterations. Operators before and after iterations can > > > be checkpointed and we can safely allow the user to enable > > > checkpointing. > > > > > > If we have the code to analyse which operators are inside iterations > > > we could also disallow windows inside iterations. I think windows > > > inside iterations don't make sense since elements in different > > > "iterations" would end up in the same window. Maybe I'm wrong here > > > though, then please correct me. > > > > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi > > > <[hidden email]> wrote: > > > > I agree that for the sake of the above mentioned use cases it is > > > reasonable > > > > to add this to the release with the right documentation, for machine > > > > learning potentially loosing one round of feedback data should not > > > matter. > > > > > > > > Let us not block prominent users until the next release on this. > > > > > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra <[hidden email]> > > > wrote: > > > > > > > >> As for people currently suffering from it: > > > >> > > > >> An application King is developing requires iterations, and they need > > > >> checkpoints. Practically all SAMOA programs would need this. > > > >> > > > >> It is very likely that the state interfaces will be changed after > the > > > >> release, so this is not something that we can just add later. I > don't > > > see a > > > >> reason why we should not add it, as it is clearly documented. In > this > > > >> actual case not having guarantees at all means people will never use > > it > > > in > > > >> any production system. Having limited guarantees means that it will > > > depend > > > >> on the application. > > > >> > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <[hidden email]> > wrote: > > > >> > > > >> > Hey Gyula, > > > >> > > > > >> > I understand your reasoning, but I don't think its worth to rush > > this > > > >> into > > > >> > the release. > > > >> > > > > >> > As you've said, we cannot give precise guarantees. But this is > > > arguably > > > >> > one of the key requirements for any fault tolerance mechanism. > > > Therefore > > > >> I > > > >> > disagree that this is better than not having anything at all. I > > think > > > it > > > >> > will already go a long way to have the non-iterative case working > > > >> reliably. > > > >> > > > > >> > And as far as I know there are no users really suffering from this > > at > > > the > > > >> > moment (in the sense that someone has complained on the mailing > > list). > > > >> > > > > >> > Hence, I vote to postpone this. > > > >> > > > > >> > – Ufuk > > > >> > > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <[hidden email]> wrote: > > > >> > > > > >> > > Hey all, > > > >> > > > > > >> > > It is currently impossible to enable state checkpointing for > > > iterative > > > >> > > jobs, because en exception is thrown when creating the jobgraph. > > > This > > > >> > > behaviour is motivated by the lack of precise guarantees that we > > can > > > >> give > > > >> > > with the current fault-tolerance implementations for cyclic > > graphs. > > > >> > > > > > >> > > This PR <https://github.com/apache/flink/pull/812> adds an > > optional > > > >> > flag to > > > >> > > force checkpoints even in case of iterations. The algorithm will > > > take > > > >> > > checkpoints periodically as before, but records in transit > inside > > > the > > > >> > loop > > > >> > > will be lost. > > > >> > > > > > >> > > However even this guarantee is enough for most applications > > (Machine > > > >> > > Learning for instance) and certainly much better than not having > > > >> anything > > > >> > > at all. > > > >> > > > > > >> > > > > > >> > > I suggest we add this to the 0.9 release as currently many > > > applications > > > >> > > suffer from this limitation (SAMOA, ML pipelines, graph > streaming > > > etc.) > > > >> > > > > > >> > > > > > >> > > Cheers, > > > >> > > > > > >> > > Gyula > > > >> > > > > >> > > > > >> > > > > > > |
The other tests verify that the checkpointing algorithm runs properly. That
also ensures that it runs for iterations because a loop is just an extra source and sink in the jobgraph (so it is the same for the algorithm). Fabian Hueske <[hidden email]> ezt írta (időpont: 2015. jún. 10., Sze, 11:19): > Without going into the details, how well tested is this feature? The PR > only extends one test by a few lines. > > Is that really enough to ensure that > 1) the change does not cause trouble > 2) is working as expected > > If this feature should go into the release, it must be thoroughly checked > and we must take the time for that. > Including code and hoping for the best because time is scarce is not an > option IMO. > > Fabian > > > 2015-06-10 11:05 GMT+02:00 Gyula Fóra <[hidden email]>: > > > And also I would like to remind everyone that any fault tolerance we > > provide is only as good as the fault tolerance of the master node. Which > is > > non existent at the moment. > > > > So I don't see a reason why a user should not be able to choose whether > he > > wants state checkpoints for iterations as well. > > > > In any case this will be used by King for instance, so making it part of > > the release would save a lot of work for everyone. > > > > Paris Carbone <[hidden email]> ezt írta (időpont: 2015. jún. 10., Sze, > > 10:29): > > > > > > > > To continue Gyula's point, for consistent snapshots we need to persist > > the > > > records in transit within the loop and also slightly change the > current > > > protocol since it works only for DAGs. Before going into that direction > > > though I would propose we first see whether there is a nice way to make > > > iterations more structured. > > > > > > Paris > > > ________________________________________ > > > From: Gyula Fóra <[hidden email]> > > > Sent: Wednesday, June 10, 2015 10:19 AM > > > To: [hidden email] > > > Subject: Re: Force enabling checkpoints for iterative streaming jobs > > > > > > I disagree. Not having checkpointed operators inside the iteration > still > > > breaks the guarantees. > > > > > > It is not about the states it is about the loop itself. > > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek <[hidden email] > > > > > wrote: > > > > > > > This is the answer I gave on the PR (we should have one place for > > > > discussing this, though): > > > > > > > > I would be against merging this in the current form. What I propose > is > > > > to analyse the topology to verify that there are no checkpointed > > > > operators inside iterations. Operators before and after iterations > can > > > > be checkpointed and we can safely allow the user to enable > > > > checkpointing. > > > > > > > > If we have the code to analyse which operators are inside iterations > > > > we could also disallow windows inside iterations. I think windows > > > > inside iterations don't make sense since elements in different > > > > "iterations" would end up in the same window. Maybe I'm wrong here > > > > though, then please correct me. > > > > > > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi > > > > <[hidden email]> wrote: > > > > > I agree that for the sake of the above mentioned use cases it is > > > > reasonable > > > > > to add this to the release with the right documentation, for > machine > > > > > learning potentially loosing one round of feedback data should not > > > > matter. > > > > > > > > > > Let us not block prominent users until the next release on this. > > > > > > > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra <[hidden email]> > > > > wrote: > > > > > > > > > >> As for people currently suffering from it: > > > > >> > > > > >> An application King is developing requires iterations, and they > need > > > > >> checkpoints. Practically all SAMOA programs would need this. > > > > >> > > > > >> It is very likely that the state interfaces will be changed after > > the > > > > >> release, so this is not something that we can just add later. I > > don't > > > > see a > > > > >> reason why we should not add it, as it is clearly documented. In > > this > > > > >> actual case not having guarantees at all means people will never > use > > > it > > > > in > > > > >> any production system. Having limited guarantees means that it > will > > > > depend > > > > >> on the application. > > > > >> > > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <[hidden email]> > > wrote: > > > > >> > > > > >> > Hey Gyula, > > > > >> > > > > > >> > I understand your reasoning, but I don't think its worth to rush > > > this > > > > >> into > > > > >> > the release. > > > > >> > > > > > >> > As you've said, we cannot give precise guarantees. But this is > > > > arguably > > > > >> > one of the key requirements for any fault tolerance mechanism. > > > > Therefore > > > > >> I > > > > >> > disagree that this is better than not having anything at all. I > > > think > > > > it > > > > >> > will already go a long way to have the non-iterative case > working > > > > >> reliably. > > > > >> > > > > > >> > And as far as I know there are no users really suffering from > this > > > at > > > > the > > > > >> > moment (in the sense that someone has complained on the mailing > > > list). > > > > >> > > > > > >> > Hence, I vote to postpone this. > > > > >> > > > > > >> > – Ufuk > > > > >> > > > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <[hidden email]> wrote: > > > > >> > > > > > >> > > Hey all, > > > > >> > > > > > > >> > > It is currently impossible to enable state checkpointing for > > > > iterative > > > > >> > > jobs, because en exception is thrown when creating the > jobgraph. > > > > This > > > > >> > > behaviour is motivated by the lack of precise guarantees that > we > > > can > > > > >> give > > > > >> > > with the current fault-tolerance implementations for cyclic > > > graphs. > > > > >> > > > > > > >> > > This PR <https://github.com/apache/flink/pull/812> adds an > > > optional > > > > >> > flag to > > > > >> > > force checkpoints even in case of iterations. The algorithm > will > > > > take > > > > >> > > checkpoints periodically as before, but records in transit > > inside > > > > the > > > > >> > loop > > > > >> > > will be lost. > > > > >> > > > > > > >> > > However even this guarantee is enough for most applications > > > (Machine > > > > >> > > Learning for instance) and certainly much better than not > having > > > > >> anything > > > > >> > > at all. > > > > >> > > > > > > >> > > > > > > >> > > I suggest we add this to the 0.9 release as currently many > > > > applications > > > > >> > > suffer from this limitation (SAMOA, ML pipelines, graph > > streaming > > > > etc.) > > > > >> > > > > > > >> > > > > > > >> > > Cheers, > > > > >> > > > > > > >> > > Gyula > > > > >> > > > > > >> > > > > > >> > > > > > > > > > > |
I don't understand why having the state inside an iteration but not
the elements that correspond to this state or created this state is desirable. Maybe an example could help understand this better? On Wed, Jun 10, 2015 at 11:27 AM, Gyula Fóra <[hidden email]> wrote: > The other tests verify that the checkpointing algorithm runs properly. That > also ensures that it runs for iterations because a loop is just an extra > source and sink in the jobgraph (so it is the same for the algorithm). > > Fabian Hueske <[hidden email]> ezt írta (időpont: 2015. jún. 10., Sze, > 11:19): > >> Without going into the details, how well tested is this feature? The PR >> only extends one test by a few lines. >> >> Is that really enough to ensure that >> 1) the change does not cause trouble >> 2) is working as expected >> >> If this feature should go into the release, it must be thoroughly checked >> and we must take the time for that. >> Including code and hoping for the best because time is scarce is not an >> option IMO. >> >> Fabian >> >> >> 2015-06-10 11:05 GMT+02:00 Gyula Fóra <[hidden email]>: >> >> > And also I would like to remind everyone that any fault tolerance we >> > provide is only as good as the fault tolerance of the master node. Which >> is >> > non existent at the moment. >> > >> > So I don't see a reason why a user should not be able to choose whether >> he >> > wants state checkpoints for iterations as well. >> > >> > In any case this will be used by King for instance, so making it part of >> > the release would save a lot of work for everyone. >> > >> > Paris Carbone <[hidden email]> ezt írta (időpont: 2015. jún. 10., Sze, >> > 10:29): >> > >> > > >> > > To continue Gyula's point, for consistent snapshots we need to persist >> > the >> > > records in transit within the loop and also slightly change the >> current >> > > protocol since it works only for DAGs. Before going into that direction >> > > though I would propose we first see whether there is a nice way to make >> > > iterations more structured. >> > > >> > > Paris >> > > ________________________________________ >> > > From: Gyula Fóra <[hidden email]> >> > > Sent: Wednesday, June 10, 2015 10:19 AM >> > > To: [hidden email] >> > > Subject: Re: Force enabling checkpoints for iterative streaming jobs >> > > >> > > I disagree. Not having checkpointed operators inside the iteration >> still >> > > breaks the guarantees. >> > > >> > > It is not about the states it is about the loop itself. >> > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek <[hidden email] >> > >> > > wrote: >> > > >> > > > This is the answer I gave on the PR (we should have one place for >> > > > discussing this, though): >> > > > >> > > > I would be against merging this in the current form. What I propose >> is >> > > > to analyse the topology to verify that there are no checkpointed >> > > > operators inside iterations. Operators before and after iterations >> can >> > > > be checkpointed and we can safely allow the user to enable >> > > > checkpointing. >> > > > >> > > > If we have the code to analyse which operators are inside iterations >> > > > we could also disallow windows inside iterations. I think windows >> > > > inside iterations don't make sense since elements in different >> > > > "iterations" would end up in the same window. Maybe I'm wrong here >> > > > though, then please correct me. >> > > > >> > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi >> > > > <[hidden email]> wrote: >> > > > > I agree that for the sake of the above mentioned use cases it is >> > > > reasonable >> > > > > to add this to the release with the right documentation, for >> machine >> > > > > learning potentially loosing one round of feedback data should not >> > > > matter. >> > > > > >> > > > > Let us not block prominent users until the next release on this. >> > > > > >> > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra <[hidden email]> >> > > > wrote: >> > > > > >> > > > >> As for people currently suffering from it: >> > > > >> >> > > > >> An application King is developing requires iterations, and they >> need >> > > > >> checkpoints. Practically all SAMOA programs would need this. >> > > > >> >> > > > >> It is very likely that the state interfaces will be changed after >> > the >> > > > >> release, so this is not something that we can just add later. I >> > don't >> > > > see a >> > > > >> reason why we should not add it, as it is clearly documented. In >> > this >> > > > >> actual case not having guarantees at all means people will never >> use >> > > it >> > > > in >> > > > >> any production system. Having limited guarantees means that it >> will >> > > > depend >> > > > >> on the application. >> > > > >> >> > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <[hidden email]> >> > wrote: >> > > > >> >> > > > >> > Hey Gyula, >> > > > >> > >> > > > >> > I understand your reasoning, but I don't think its worth to rush >> > > this >> > > > >> into >> > > > >> > the release. >> > > > >> > >> > > > >> > As you've said, we cannot give precise guarantees. But this is >> > > > arguably >> > > > >> > one of the key requirements for any fault tolerance mechanism. >> > > > Therefore >> > > > >> I >> > > > >> > disagree that this is better than not having anything at all. I >> > > think >> > > > it >> > > > >> > will already go a long way to have the non-iterative case >> working >> > > > >> reliably. >> > > > >> > >> > > > >> > And as far as I know there are no users really suffering from >> this >> > > at >> > > > the >> > > > >> > moment (in the sense that someone has complained on the mailing >> > > list). >> > > > >> > >> > > > >> > Hence, I vote to postpone this. >> > > > >> > >> > > > >> > – Ufuk >> > > > >> > >> > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <[hidden email]> wrote: >> > > > >> > >> > > > >> > > Hey all, >> > > > >> > > >> > > > >> > > It is currently impossible to enable state checkpointing for >> > > > iterative >> > > > >> > > jobs, because en exception is thrown when creating the >> jobgraph. >> > > > This >> > > > >> > > behaviour is motivated by the lack of precise guarantees that >> we >> > > can >> > > > >> give >> > > > >> > > with the current fault-tolerance implementations for cyclic >> > > graphs. >> > > > >> > > >> > > > >> > > This PR <https://github.com/apache/flink/pull/812> adds an >> > > optional >> > > > >> > flag to >> > > > >> > > force checkpoints even in case of iterations. The algorithm >> will >> > > > take >> > > > >> > > checkpoints periodically as before, but records in transit >> > inside >> > > > the >> > > > >> > loop >> > > > >> > > will be lost. >> > > > >> > > >> > > > >> > > However even this guarantee is enough for most applications >> > > (Machine >> > > > >> > > Learning for instance) and certainly much better than not >> having >> > > > >> anything >> > > > >> > > at all. >> > > > >> > > >> > > > >> > > >> > > > >> > > I suggest we add this to the 0.9 release as currently many >> > > > applications >> > > > >> > > suffer from this limitation (SAMOA, ML pipelines, graph >> > streaming >> > > > etc.) >> > > > >> > > >> > > > >> > > >> > > > >> > > Cheers, >> > > > >> > > >> > > > >> > > Gyula >> > > > >> > >> > > > >> > >> > > > >> >> > > > >> > > >> > >> |
I don't understand the question, I vote for checkpointing all state in the
job, even inside iterations (its more of a loop). Aljoscha Krettek <[hidden email]> ezt írta (időpont: 2015. jún. 10., Sze, 12:34): > I don't understand why having the state inside an iteration but not > the elements that correspond to this state or created this state is > desirable. Maybe an example could help understand this better? > > On Wed, Jun 10, 2015 at 11:27 AM, Gyula Fóra <[hidden email]> wrote: > > The other tests verify that the checkpointing algorithm runs properly. > That > > also ensures that it runs for iterations because a loop is just an extra > > source and sink in the jobgraph (so it is the same for the algorithm). > > > > Fabian Hueske <[hidden email]> ezt írta (időpont: 2015. jún. 10., > Sze, > > 11:19): > > > >> Without going into the details, how well tested is this feature? The PR > >> only extends one test by a few lines. > >> > >> Is that really enough to ensure that > >> 1) the change does not cause trouble > >> 2) is working as expected > >> > >> If this feature should go into the release, it must be thoroughly > checked > >> and we must take the time for that. > >> Including code and hoping for the best because time is scarce is not an > >> option IMO. > >> > >> Fabian > >> > >> > >> 2015-06-10 11:05 GMT+02:00 Gyula Fóra <[hidden email]>: > >> > >> > And also I would like to remind everyone that any fault tolerance we > >> > provide is only as good as the fault tolerance of the master node. > Which > >> is > >> > non existent at the moment. > >> > > >> > So I don't see a reason why a user should not be able to choose > whether > >> he > >> > wants state checkpoints for iterations as well. > >> > > >> > In any case this will be used by King for instance, so making it part > of > >> > the release would save a lot of work for everyone. > >> > > >> > Paris Carbone <[hidden email]> ezt írta (időpont: 2015. jún. 10., Sze, > >> > 10:29): > >> > > >> > > > >> > > To continue Gyula's point, for consistent snapshots we need to > persist > >> > the > >> > > records in transit within the loop and also slightly change the > >> current > >> > > protocol since it works only for DAGs. Before going into that > direction > >> > > though I would propose we first see whether there is a nice way to > make > >> > > iterations more structured. > >> > > > >> > > Paris > >> > > ________________________________________ > >> > > From: Gyula Fóra <[hidden email]> > >> > > Sent: Wednesday, June 10, 2015 10:19 AM > >> > > To: [hidden email] > >> > > Subject: Re: Force enabling checkpoints for iterative streaming jobs > >> > > > >> > > I disagree. Not having checkpointed operators inside the iteration > >> still > >> > > breaks the guarantees. > >> > > > >> > > It is not about the states it is about the loop itself. > >> > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek < > [hidden email] > >> > > >> > > wrote: > >> > > > >> > > > This is the answer I gave on the PR (we should have one place for > >> > > > discussing this, though): > >> > > > > >> > > > I would be against merging this in the current form. What I > propose > >> is > >> > > > to analyse the topology to verify that there are no checkpointed > >> > > > operators inside iterations. Operators before and after iterations > >> can > >> > > > be checkpointed and we can safely allow the user to enable > >> > > > checkpointing. > >> > > > > >> > > > If we have the code to analyse which operators are inside > iterations > >> > > > we could also disallow windows inside iterations. I think windows > >> > > > inside iterations don't make sense since elements in different > >> > > > "iterations" would end up in the same window. Maybe I'm wrong here > >> > > > though, then please correct me. > >> > > > > >> > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi > >> > > > <[hidden email]> wrote: > >> > > > > I agree that for the sake of the above mentioned use cases it is > >> > > > reasonable > >> > > > > to add this to the release with the right documentation, for > >> machine > >> > > > > learning potentially loosing one round of feedback data should > not > >> > > > matter. > >> > > > > > >> > > > > Let us not block prominent users until the next release on this. > >> > > > > > >> > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra < > [hidden email]> > >> > > > wrote: > >> > > > > > >> > > > >> As for people currently suffering from it: > >> > > > >> > >> > > > >> An application King is developing requires iterations, and they > >> need > >> > > > >> checkpoints. Practically all SAMOA programs would need this. > >> > > > >> > >> > > > >> It is very likely that the state interfaces will be changed > after > >> > the > >> > > > >> release, so this is not something that we can just add later. I > >> > don't > >> > > > see a > >> > > > >> reason why we should not add it, as it is clearly documented. > In > >> > this > >> > > > >> actual case not having guarantees at all means people will > never > >> use > >> > > it > >> > > > in > >> > > > >> any production system. Having limited guarantees means that it > >> will > >> > > > depend > >> > > > >> on the application. > >> > > > >> > >> > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <[hidden email]> > >> > wrote: > >> > > > >> > >> > > > >> > Hey Gyula, > >> > > > >> > > >> > > > >> > I understand your reasoning, but I don't think its worth to > rush > >> > > this > >> > > > >> into > >> > > > >> > the release. > >> > > > >> > > >> > > > >> > As you've said, we cannot give precise guarantees. But this > is > >> > > > arguably > >> > > > >> > one of the key requirements for any fault tolerance > mechanism. > >> > > > Therefore > >> > > > >> I > >> > > > >> > disagree that this is better than not having anything at > all. I > >> > > think > >> > > > it > >> > > > >> > will already go a long way to have the non-iterative case > >> working > >> > > > >> reliably. > >> > > > >> > > >> > > > >> > And as far as I know there are no users really suffering from > >> this > >> > > at > >> > > > the > >> > > > >> > moment (in the sense that someone has complained on the > mailing > >> > > list). > >> > > > >> > > >> > > > >> > Hence, I vote to postpone this. > >> > > > >> > > >> > > > >> > – Ufuk > >> > > > >> > > >> > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <[hidden email]> > wrote: > >> > > > >> > > >> > > > >> > > Hey all, > >> > > > >> > > > >> > > > >> > > It is currently impossible to enable state checkpointing > for > >> > > > iterative > >> > > > >> > > jobs, because en exception is thrown when creating the > >> jobgraph. > >> > > > This > >> > > > >> > > behaviour is motivated by the lack of precise guarantees > that > >> we > >> > > can > >> > > > >> give > >> > > > >> > > with the current fault-tolerance implementations for cyclic > >> > > graphs. > >> > > > >> > > > >> > > > >> > > This PR <https://github.com/apache/flink/pull/812> adds an > >> > > optional > >> > > > >> > flag to > >> > > > >> > > force checkpoints even in case of iterations. The algorithm > >> will > >> > > > take > >> > > > >> > > checkpoints periodically as before, but records in transit > >> > inside > >> > > > the > >> > > > >> > loop > >> > > > >> > > will be lost. > >> > > > >> > > > >> > > > >> > > However even this guarantee is enough for most applications > >> > > (Machine > >> > > > >> > > Learning for instance) and certainly much better than not > >> having > >> > > > >> anything > >> > > > >> > > at all. > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > I suggest we add this to the 0.9 release as currently many > >> > > > applications > >> > > > >> > > suffer from this limitation (SAMOA, ML pipelines, graph > >> > streaming > >> > > > etc.) > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > Cheers, > >> > > > >> > > > >> > > > >> > > Gyula > >> > > > >> > > >> > > > >> > > >> > > > >> > >> > > > > >> > > > >> > > >> > |
The elements that are in-flight in an iteration are also state of the
job. I'm wondering whether the state inside iterations still makes sense without these in-flight elements. But I also don't know the King use-case, that's why I though an example could be helpful. On Wed, Jun 10, 2015 at 12:37 PM, Gyula Fóra <[hidden email]> wrote: > I don't understand the question, I vote for checkpointing all state in the > job, even inside iterations (its more of a loop). > > Aljoscha Krettek <[hidden email]> ezt írta (időpont: 2015. jún. 10., > Sze, 12:34): > >> I don't understand why having the state inside an iteration but not >> the elements that correspond to this state or created this state is >> desirable. Maybe an example could help understand this better? >> >> On Wed, Jun 10, 2015 at 11:27 AM, Gyula Fóra <[hidden email]> wrote: >> > The other tests verify that the checkpointing algorithm runs properly. >> That >> > also ensures that it runs for iterations because a loop is just an extra >> > source and sink in the jobgraph (so it is the same for the algorithm). >> > >> > Fabian Hueske <[hidden email]> ezt írta (időpont: 2015. jún. 10., >> Sze, >> > 11:19): >> > >> >> Without going into the details, how well tested is this feature? The PR >> >> only extends one test by a few lines. >> >> >> >> Is that really enough to ensure that >> >> 1) the change does not cause trouble >> >> 2) is working as expected >> >> >> >> If this feature should go into the release, it must be thoroughly >> checked >> >> and we must take the time for that. >> >> Including code and hoping for the best because time is scarce is not an >> >> option IMO. >> >> >> >> Fabian >> >> >> >> >> >> 2015-06-10 11:05 GMT+02:00 Gyula Fóra <[hidden email]>: >> >> >> >> > And also I would like to remind everyone that any fault tolerance we >> >> > provide is only as good as the fault tolerance of the master node. >> Which >> >> is >> >> > non existent at the moment. >> >> > >> >> > So I don't see a reason why a user should not be able to choose >> whether >> >> he >> >> > wants state checkpoints for iterations as well. >> >> > >> >> > In any case this will be used by King for instance, so making it part >> of >> >> > the release would save a lot of work for everyone. >> >> > >> >> > Paris Carbone <[hidden email]> ezt írta (időpont: 2015. jún. 10., Sze, >> >> > 10:29): >> >> > >> >> > > >> >> > > To continue Gyula's point, for consistent snapshots we need to >> persist >> >> > the >> >> > > records in transit within the loop and also slightly change the >> >> current >> >> > > protocol since it works only for DAGs. Before going into that >> direction >> >> > > though I would propose we first see whether there is a nice way to >> make >> >> > > iterations more structured. >> >> > > >> >> > > Paris >> >> > > ________________________________________ >> >> > > From: Gyula Fóra <[hidden email]> >> >> > > Sent: Wednesday, June 10, 2015 10:19 AM >> >> > > To: [hidden email] >> >> > > Subject: Re: Force enabling checkpoints for iterative streaming jobs >> >> > > >> >> > > I disagree. Not having checkpointed operators inside the iteration >> >> still >> >> > > breaks the guarantees. >> >> > > >> >> > > It is not about the states it is about the loop itself. >> >> > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek < >> [hidden email] >> >> > >> >> > > wrote: >> >> > > >> >> > > > This is the answer I gave on the PR (we should have one place for >> >> > > > discussing this, though): >> >> > > > >> >> > > > I would be against merging this in the current form. What I >> propose >> >> is >> >> > > > to analyse the topology to verify that there are no checkpointed >> >> > > > operators inside iterations. Operators before and after iterations >> >> can >> >> > > > be checkpointed and we can safely allow the user to enable >> >> > > > checkpointing. >> >> > > > >> >> > > > If we have the code to analyse which operators are inside >> iterations >> >> > > > we could also disallow windows inside iterations. I think windows >> >> > > > inside iterations don't make sense since elements in different >> >> > > > "iterations" would end up in the same window. Maybe I'm wrong here >> >> > > > though, then please correct me. >> >> > > > >> >> > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi >> >> > > > <[hidden email]> wrote: >> >> > > > > I agree that for the sake of the above mentioned use cases it is >> >> > > > reasonable >> >> > > > > to add this to the release with the right documentation, for >> >> machine >> >> > > > > learning potentially loosing one round of feedback data should >> not >> >> > > > matter. >> >> > > > > >> >> > > > > Let us not block prominent users until the next release on this. >> >> > > > > >> >> > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra < >> [hidden email]> >> >> > > > wrote: >> >> > > > > >> >> > > > >> As for people currently suffering from it: >> >> > > > >> >> >> > > > >> An application King is developing requires iterations, and they >> >> need >> >> > > > >> checkpoints. Practically all SAMOA programs would need this. >> >> > > > >> >> >> > > > >> It is very likely that the state interfaces will be changed >> after >> >> > the >> >> > > > >> release, so this is not something that we can just add later. I >> >> > don't >> >> > > > see a >> >> > > > >> reason why we should not add it, as it is clearly documented. >> In >> >> > this >> >> > > > >> actual case not having guarantees at all means people will >> never >> >> use >> >> > > it >> >> > > > in >> >> > > > >> any production system. Having limited guarantees means that it >> >> will >> >> > > > depend >> >> > > > >> on the application. >> >> > > > >> >> >> > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi <[hidden email]> >> >> > wrote: >> >> > > > >> >> >> > > > >> > Hey Gyula, >> >> > > > >> > >> >> > > > >> > I understand your reasoning, but I don't think its worth to >> rush >> >> > > this >> >> > > > >> into >> >> > > > >> > the release. >> >> > > > >> > >> >> > > > >> > As you've said, we cannot give precise guarantees. But this >> is >> >> > > > arguably >> >> > > > >> > one of the key requirements for any fault tolerance >> mechanism. >> >> > > > Therefore >> >> > > > >> I >> >> > > > >> > disagree that this is better than not having anything at >> all. I >> >> > > think >> >> > > > it >> >> > > > >> > will already go a long way to have the non-iterative case >> >> working >> >> > > > >> reliably. >> >> > > > >> > >> >> > > > >> > And as far as I know there are no users really suffering from >> >> this >> >> > > at >> >> > > > the >> >> > > > >> > moment (in the sense that someone has complained on the >> mailing >> >> > > list). >> >> > > > >> > >> >> > > > >> > Hence, I vote to postpone this. >> >> > > > >> > >> >> > > > >> > – Ufuk >> >> > > > >> > >> >> > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <[hidden email]> >> wrote: >> >> > > > >> > >> >> > > > >> > > Hey all, >> >> > > > >> > > >> >> > > > >> > > It is currently impossible to enable state checkpointing >> for >> >> > > > iterative >> >> > > > >> > > jobs, because en exception is thrown when creating the >> >> jobgraph. >> >> > > > This >> >> > > > >> > > behaviour is motivated by the lack of precise guarantees >> that >> >> we >> >> > > can >> >> > > > >> give >> >> > > > >> > > with the current fault-tolerance implementations for cyclic >> >> > > graphs. >> >> > > > >> > > >> >> > > > >> > > This PR <https://github.com/apache/flink/pull/812> adds an >> >> > > optional >> >> > > > >> > flag to >> >> > > > >> > > force checkpoints even in case of iterations. The algorithm >> >> will >> >> > > > take >> >> > > > >> > > checkpoints periodically as before, but records in transit >> >> > inside >> >> > > > the >> >> > > > >> > loop >> >> > > > >> > > will be lost. >> >> > > > >> > > >> >> > > > >> > > However even this guarantee is enough for most applications >> >> > > (Machine >> >> > > > >> > > Learning for instance) and certainly much better than not >> >> having >> >> > > > >> anything >> >> > > > >> > > at all. >> >> > > > >> > > >> >> > > > >> > > >> >> > > > >> > > I suggest we add this to the 0.9 release as currently many >> >> > > > applications >> >> > > > >> > > suffer from this limitation (SAMOA, ML pipelines, graph >> >> > streaming >> >> > > > etc.) >> >> > > > >> > > >> >> > > > >> > > >> >> > > > >> > > Cheers, >> >> > > > >> > > >> >> > > > >> > > Gyula >> >> > > > >> > >> >> > > > >> > >> >> > > > >> >> >> > > > >> >> > > >> >> > >> >> >> |
You are right, to have consistent results we would need to persist the
records. But since we cannot do that right now, we can still checkpoint all operator states and understand that inflight records in the loop are lost on failure. This is acceptable for most the use-cases that we have developed so far for iterations (machine learning, graph updates, etc.) What is not acceptable is to not have checkpointing at all. Aljoscha Krettek <[hidden email]> ezt írta (időpont: 2015. jún. 10., Sze, 12:43): > The elements that are in-flight in an iteration are also state of the > job. I'm wondering whether the state inside iterations still makes > sense without these in-flight elements. But I also don't know the King > use-case, that's why I though an example could be helpful. > > On Wed, Jun 10, 2015 at 12:37 PM, Gyula Fóra <[hidden email]> wrote: > > I don't understand the question, I vote for checkpointing all state in > the > > job, even inside iterations (its more of a loop). > > > > Aljoscha Krettek <[hidden email]> ezt írta (időpont: 2015. jún. > 10., > > Sze, 12:34): > > > >> I don't understand why having the state inside an iteration but not > >> the elements that correspond to this state or created this state is > >> desirable. Maybe an example could help understand this better? > >> > >> On Wed, Jun 10, 2015 at 11:27 AM, Gyula Fóra <[hidden email]> > wrote: > >> > The other tests verify that the checkpointing algorithm runs properly. > >> That > >> > also ensures that it runs for iterations because a loop is just an > extra > >> > source and sink in the jobgraph (so it is the same for the algorithm). > >> > > >> > Fabian Hueske <[hidden email]> ezt írta (időpont: 2015. jún. 10., > >> Sze, > >> > 11:19): > >> > > >> >> Without going into the details, how well tested is this feature? The > PR > >> >> only extends one test by a few lines. > >> >> > >> >> Is that really enough to ensure that > >> >> 1) the change does not cause trouble > >> >> 2) is working as expected > >> >> > >> >> If this feature should go into the release, it must be thoroughly > >> checked > >> >> and we must take the time for that. > >> >> Including code and hoping for the best because time is scarce is not > an > >> >> option IMO. > >> >> > >> >> Fabian > >> >> > >> >> > >> >> 2015-06-10 11:05 GMT+02:00 Gyula Fóra <[hidden email]>: > >> >> > >> >> > And also I would like to remind everyone that any fault tolerance > we > >> >> > provide is only as good as the fault tolerance of the master node. > >> Which > >> >> is > >> >> > non existent at the moment. > >> >> > > >> >> > So I don't see a reason why a user should not be able to choose > >> whether > >> >> he > >> >> > wants state checkpoints for iterations as well. > >> >> > > >> >> > In any case this will be used by King for instance, so making it > part > >> of > >> >> > the release would save a lot of work for everyone. > >> >> > > >> >> > Paris Carbone <[hidden email]> ezt írta (időpont: 2015. jún. 10., > Sze, > >> >> > 10:29): > >> >> > > >> >> > > > >> >> > > To continue Gyula's point, for consistent snapshots we need to > >> persist > >> >> > the > >> >> > > records in transit within the loop and also slightly change the > >> >> current > >> >> > > protocol since it works only for DAGs. Before going into that > >> direction > >> >> > > though I would propose we first see whether there is a nice way > to > >> make > >> >> > > iterations more structured. > >> >> > > > >> >> > > Paris > >> >> > > ________________________________________ > >> >> > > From: Gyula Fóra <[hidden email]> > >> >> > > Sent: Wednesday, June 10, 2015 10:19 AM > >> >> > > To: [hidden email] > >> >> > > Subject: Re: Force enabling checkpoints for iterative streaming > jobs > >> >> > > > >> >> > > I disagree. Not having checkpointed operators inside the > iteration > >> >> still > >> >> > > breaks the guarantees. > >> >> > > > >> >> > > It is not about the states it is about the loop itself. > >> >> > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek < > >> [hidden email] > >> >> > > >> >> > > wrote: > >> >> > > > >> >> > > > This is the answer I gave on the PR (we should have one place > for > >> >> > > > discussing this, though): > >> >> > > > > >> >> > > > I would be against merging this in the current form. What I > >> propose > >> >> is > >> >> > > > to analyse the topology to verify that there are no > checkpointed > >> >> > > > operators inside iterations. Operators before and after > iterations > >> >> can > >> >> > > > be checkpointed and we can safely allow the user to enable > >> >> > > > checkpointing. > >> >> > > > > >> >> > > > If we have the code to analyse which operators are inside > >> iterations > >> >> > > > we could also disallow windows inside iterations. I think > windows > >> >> > > > inside iterations don't make sense since elements in different > >> >> > > > "iterations" would end up in the same window. Maybe I'm wrong > here > >> >> > > > though, then please correct me. > >> >> > > > > >> >> > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi > >> >> > > > <[hidden email]> wrote: > >> >> > > > > I agree that for the sake of the above mentioned use cases > it is > >> >> > > > reasonable > >> >> > > > > to add this to the release with the right documentation, for > >> >> machine > >> >> > > > > learning potentially loosing one round of feedback data > should > >> not > >> >> > > > matter. > >> >> > > > > > >> >> > > > > Let us not block prominent users until the next release on > this. > >> >> > > > > > >> >> > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra < > >> [hidden email]> > >> >> > > > wrote: > >> >> > > > > > >> >> > > > >> As for people currently suffering from it: > >> >> > > > >> > >> >> > > > >> An application King is developing requires iterations, and > they > >> >> need > >> >> > > > >> checkpoints. Practically all SAMOA programs would need this. > >> >> > > > >> > >> >> > > > >> It is very likely that the state interfaces will be changed > >> after > >> >> > the > >> >> > > > >> release, so this is not something that we can just add > later. I > >> >> > don't > >> >> > > > see a > >> >> > > > >> reason why we should not add it, as it is clearly > documented. > >> In > >> >> > this > >> >> > > > >> actual case not having guarantees at all means people will > >> never > >> >> use > >> >> > > it > >> >> > > > in > >> >> > > > >> any production system. Having limited guarantees means that > it > >> >> will > >> >> > > > depend > >> >> > > > >> on the application. > >> >> > > > >> > >> >> > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi < > [hidden email]> > >> >> > wrote: > >> >> > > > >> > >> >> > > > >> > Hey Gyula, > >> >> > > > >> > > >> >> > > > >> > I understand your reasoning, but I don't think its worth > to > >> rush > >> >> > > this > >> >> > > > >> into > >> >> > > > >> > the release. > >> >> > > > >> > > >> >> > > > >> > As you've said, we cannot give precise guarantees. But > this > >> is > >> >> > > > arguably > >> >> > > > >> > one of the key requirements for any fault tolerance > >> mechanism. > >> >> > > > Therefore > >> >> > > > >> I > >> >> > > > >> > disagree that this is better than not having anything at > >> all. I > >> >> > > think > >> >> > > > it > >> >> > > > >> > will already go a long way to have the non-iterative case > >> >> working > >> >> > > > >> reliably. > >> >> > > > >> > > >> >> > > > >> > And as far as I know there are no users really suffering > from > >> >> this > >> >> > > at > >> >> > > > the > >> >> > > > >> > moment (in the sense that someone has complained on the > >> mailing > >> >> > > list). > >> >> > > > >> > > >> >> > > > >> > Hence, I vote to postpone this. > >> >> > > > >> > > >> >> > > > >> > – Ufuk > >> >> > > > >> > > >> >> > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <[hidden email]> > >> wrote: > >> >> > > > >> > > >> >> > > > >> > > Hey all, > >> >> > > > >> > > > >> >> > > > >> > > It is currently impossible to enable state checkpointing > >> for > >> >> > > > iterative > >> >> > > > >> > > jobs, because en exception is thrown when creating the > >> >> jobgraph. > >> >> > > > This > >> >> > > > >> > > behaviour is motivated by the lack of precise guarantees > >> that > >> >> we > >> >> > > can > >> >> > > > >> give > >> >> > > > >> > > with the current fault-tolerance implementations for > cyclic > >> >> > > graphs. > >> >> > > > >> > > > >> >> > > > >> > > This PR <https://github.com/apache/flink/pull/812> > adds an > >> >> > > optional > >> >> > > > >> > flag to > >> >> > > > >> > > force checkpoints even in case of iterations. The > algorithm > >> >> will > >> >> > > > take > >> >> > > > >> > > checkpoints periodically as before, but records in > transit > >> >> > inside > >> >> > > > the > >> >> > > > >> > loop > >> >> > > > >> > > will be lost. > >> >> > > > >> > > > >> >> > > > >> > > However even this guarantee is enough for most > applications > >> >> > > (Machine > >> >> > > > >> > > Learning for instance) and certainly much better than > not > >> >> having > >> >> > > > >> anything > >> >> > > > >> > > at all. > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > >> > > I suggest we add this to the 0.9 release as currently > many > >> >> > > > applications > >> >> > > > >> > > suffer from this limitation (SAMOA, ML pipelines, graph > >> >> > streaming > >> >> > > > etc.) > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > >> > > Cheers, > >> >> > > > >> > > > >> >> > > > >> > > Gyula > >> >> > > > >> > > >> >> > > > >> > > >> >> > > > >> > >> >> > > > > >> >> > > > >> >> > > >> >> > >> > |
Here is an example for you:
Parallel streaming kmeans, the state we keep is the current cluster centers, and we use iterations to sync the centers across parallel instances. We can afford lost model updated in the loop but we need the checkpoint the models. https://github.com/gyfora/stream-clustering/blob/master/src/main/scala/stream/clustering/StreamClustering.scala (checkpointing is not turned on but you will get the point) Gyula Fóra <[hidden email]> ezt írta (időpont: 2015. jún. 10., Sze, 12:47): > You are right, to have consistent results we would need to persist the > records. > > But since we cannot do that right now, we can still checkpoint all > operator states and understand that inflight records in the loop are lost > on failure. > > This is acceptable for most the use-cases that we have developed so far > for iterations (machine learning, graph updates, etc.) What is not > acceptable is to not have checkpointing at all. > > Aljoscha Krettek <[hidden email]> ezt írta (időpont: 2015. jún. 10., > Sze, 12:43): > >> The elements that are in-flight in an iteration are also state of the >> job. I'm wondering whether the state inside iterations still makes >> sense without these in-flight elements. But I also don't know the King >> use-case, that's why I though an example could be helpful. >> >> On Wed, Jun 10, 2015 at 12:37 PM, Gyula Fóra <[hidden email]> >> wrote: >> > I don't understand the question, I vote for checkpointing all state in >> the >> > job, even inside iterations (its more of a loop). >> > >> > Aljoscha Krettek <[hidden email]> ezt írta (időpont: 2015. jún. >> 10., >> > Sze, 12:34): >> > >> >> I don't understand why having the state inside an iteration but not >> >> the elements that correspond to this state or created this state is >> >> desirable. Maybe an example could help understand this better? >> >> >> >> On Wed, Jun 10, 2015 at 11:27 AM, Gyula Fóra <[hidden email]> >> wrote: >> >> > The other tests verify that the checkpointing algorithm runs >> properly. >> >> That >> >> > also ensures that it runs for iterations because a loop is just an >> extra >> >> > source and sink in the jobgraph (so it is the same for the >> algorithm). >> >> > >> >> > Fabian Hueske <[hidden email]> ezt írta (időpont: 2015. jún. 10., >> >> Sze, >> >> > 11:19): >> >> > >> >> >> Without going into the details, how well tested is this feature? >> The PR >> >> >> only extends one test by a few lines. >> >> >> >> >> >> Is that really enough to ensure that >> >> >> 1) the change does not cause trouble >> >> >> 2) is working as expected >> >> >> >> >> >> If this feature should go into the release, it must be thoroughly >> >> checked >> >> >> and we must take the time for that. >> >> >> Including code and hoping for the best because time is scarce is >> not an >> >> >> option IMO. >> >> >> >> >> >> Fabian >> >> >> >> >> >> >> >> >> 2015-06-10 11:05 GMT+02:00 Gyula Fóra <[hidden email]>: >> >> >> >> >> >> > And also I would like to remind everyone that any fault tolerance >> we >> >> >> > provide is only as good as the fault tolerance of the master node. >> >> Which >> >> >> is >> >> >> > non existent at the moment. >> >> >> > >> >> >> > So I don't see a reason why a user should not be able to choose >> >> whether >> >> >> he >> >> >> > wants state checkpoints for iterations as well. >> >> >> > >> >> >> > In any case this will be used by King for instance, so making it >> part >> >> of >> >> >> > the release would save a lot of work for everyone. >> >> >> > >> >> >> > Paris Carbone <[hidden email]> ezt írta (időpont: 2015. jún. 10., >> Sze, >> >> >> > 10:29): >> >> >> > >> >> >> > > >> >> >> > > To continue Gyula's point, for consistent snapshots we need to >> >> persist >> >> >> > the >> >> >> > > records in transit within the loop and also slightly change the >> >> >> current >> >> >> > > protocol since it works only for DAGs. Before going into that >> >> direction >> >> >> > > though I would propose we first see whether there is a nice way >> to >> >> make >> >> >> > > iterations more structured. >> >> >> > > >> >> >> > > Paris >> >> >> > > ________________________________________ >> >> >> > > From: Gyula Fóra <[hidden email]> >> >> >> > > Sent: Wednesday, June 10, 2015 10:19 AM >> >> >> > > To: [hidden email] >> >> >> > > Subject: Re: Force enabling checkpoints for iterative streaming >> jobs >> >> >> > > >> >> >> > > I disagree. Not having checkpointed operators inside the >> iteration >> >> >> still >> >> >> > > breaks the guarantees. >> >> >> > > >> >> >> > > It is not about the states it is about the loop itself. >> >> >> > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek < >> >> [hidden email] >> >> >> > >> >> >> > > wrote: >> >> >> > > >> >> >> > > > This is the answer I gave on the PR (we should have one place >> for >> >> >> > > > discussing this, though): >> >> >> > > > >> >> >> > > > I would be against merging this in the current form. What I >> >> propose >> >> >> is >> >> >> > > > to analyse the topology to verify that there are no >> checkpointed >> >> >> > > > operators inside iterations. Operators before and after >> iterations >> >> >> can >> >> >> > > > be checkpointed and we can safely allow the user to enable >> >> >> > > > checkpointing. >> >> >> > > > >> >> >> > > > If we have the code to analyse which operators are inside >> >> iterations >> >> >> > > > we could also disallow windows inside iterations. I think >> windows >> >> >> > > > inside iterations don't make sense since elements in different >> >> >> > > > "iterations" would end up in the same window. Maybe I'm wrong >> here >> >> >> > > > though, then please correct me. >> >> >> > > > >> >> >> > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi >> >> >> > > > <[hidden email]> wrote: >> >> >> > > > > I agree that for the sake of the above mentioned use cases >> it is >> >> >> > > > reasonable >> >> >> > > > > to add this to the release with the right documentation, for >> >> >> machine >> >> >> > > > > learning potentially loosing one round of feedback data >> should >> >> not >> >> >> > > > matter. >> >> >> > > > > >> >> >> > > > > Let us not block prominent users until the next release on >> this. >> >> >> > > > > >> >> >> > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra < >> >> [hidden email]> >> >> >> > > > wrote: >> >> >> > > > > >> >> >> > > > >> As for people currently suffering from it: >> >> >> > > > >> >> >> >> > > > >> An application King is developing requires iterations, and >> they >> >> >> need >> >> >> > > > >> checkpoints. Practically all SAMOA programs would need >> this. >> >> >> > > > >> >> >> >> > > > >> It is very likely that the state interfaces will be changed >> >> after >> >> >> > the >> >> >> > > > >> release, so this is not something that we can just add >> later. I >> >> >> > don't >> >> >> > > > see a >> >> >> > > > >> reason why we should not add it, as it is clearly >> documented. >> >> In >> >> >> > this >> >> >> > > > >> actual case not having guarantees at all means people will >> >> never >> >> >> use >> >> >> > > it >> >> >> > > > in >> >> >> > > > >> any production system. Having limited guarantees means >> that it >> >> >> will >> >> >> > > > depend >> >> >> > > > >> on the application. >> >> >> > > > >> >> >> >> > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi < >> [hidden email]> >> >> >> > wrote: >> >> >> > > > >> >> >> >> > > > >> > Hey Gyula, >> >> >> > > > >> > >> >> >> > > > >> > I understand your reasoning, but I don't think its worth >> to >> >> rush >> >> >> > > this >> >> >> > > > >> into >> >> >> > > > >> > the release. >> >> >> > > > >> > >> >> >> > > > >> > As you've said, we cannot give precise guarantees. But >> this >> >> is >> >> >> > > > arguably >> >> >> > > > >> > one of the key requirements for any fault tolerance >> >> mechanism. >> >> >> > > > Therefore >> >> >> > > > >> I >> >> >> > > > >> > disagree that this is better than not having anything at >> >> all. I >> >> >> > > think >> >> >> > > > it >> >> >> > > > >> > will already go a long way to have the non-iterative case >> >> >> working >> >> >> > > > >> reliably. >> >> >> > > > >> > >> >> >> > > > >> > And as far as I know there are no users really suffering >> from >> >> >> this >> >> >> > > at >> >> >> > > > the >> >> >> > > > >> > moment (in the sense that someone has complained on the >> >> mailing >> >> >> > > list). >> >> >> > > > >> > >> >> >> > > > >> > Hence, I vote to postpone this. >> >> >> > > > >> > >> >> >> > > > >> > – Ufuk >> >> >> > > > >> > >> >> >> > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <[hidden email]> >> >> wrote: >> >> >> > > > >> > >> >> >> > > > >> > > Hey all, >> >> >> > > > >> > > >> >> >> > > > >> > > It is currently impossible to enable state >> checkpointing >> >> for >> >> >> > > > iterative >> >> >> > > > >> > > jobs, because en exception is thrown when creating the >> >> >> jobgraph. >> >> >> > > > This >> >> >> > > > >> > > behaviour is motivated by the lack of precise >> guarantees >> >> that >> >> >> we >> >> >> > > can >> >> >> > > > >> give >> >> >> > > > >> > > with the current fault-tolerance implementations for >> cyclic >> >> >> > > graphs. >> >> >> > > > >> > > >> >> >> > > > >> > > This PR <https://github.com/apache/flink/pull/812> >> adds an >> >> >> > > optional >> >> >> > > > >> > flag to >> >> >> > > > >> > > force checkpoints even in case of iterations. The >> algorithm >> >> >> will >> >> >> > > > take >> >> >> > > > >> > > checkpoints periodically as before, but records in >> transit >> >> >> > inside >> >> >> > > > the >> >> >> > > > >> > loop >> >> >> > > > >> > > will be lost. >> >> >> > > > >> > > >> >> >> > > > >> > > However even this guarantee is enough for most >> applications >> >> >> > > (Machine >> >> >> > > > >> > > Learning for instance) and certainly much better than >> not >> >> >> having >> >> >> > > > >> anything >> >> >> > > > >> > > at all. >> >> >> > > > >> > > >> >> >> > > > >> > > >> >> >> > > > >> > > I suggest we add this to the 0.9 release as currently >> many >> >> >> > > > applications >> >> >> > > > >> > > suffer from this limitation (SAMOA, ML pipelines, graph >> >> >> > streaming >> >> >> > > > etc.) >> >> >> > > > >> > > >> >> >> > > > >> > > >> >> >> > > > >> > > Cheers, >> >> >> > > > >> > > >> >> >> > > > >> > > Gyula >> >> >> > > > >> > >> >> >> > > > >> > >> >> >> > > > >> >> >> >> > > > >> >> >> > > >> >> >> > >> >> >> >> >> >> > |
Thanks :D, now I see. It makes sense because we don't have another way
of keeping the cluster state synced/distributed across parallel instances of the operators. On Wed, Jun 10, 2015 at 12:52 PM, Gyula Fóra <[hidden email]> wrote: > Here is an example for you: > > Parallel streaming kmeans, the state we keep is the current cluster > centers, and we use iterations to sync the centers across parallel > instances. > We can afford lost model updated in the loop but we need the checkpoint the > models. > > https://github.com/gyfora/stream-clustering/blob/master/src/main/scala/stream/clustering/StreamClustering.scala > > (checkpointing is not turned on but you will get the point) > > > > Gyula Fóra <[hidden email]> ezt írta (időpont: 2015. jún. 10., Sze, > 12:47): > >> You are right, to have consistent results we would need to persist the >> records. >> >> But since we cannot do that right now, we can still checkpoint all >> operator states and understand that inflight records in the loop are lost >> on failure. >> >> This is acceptable for most the use-cases that we have developed so far >> for iterations (machine learning, graph updates, etc.) What is not >> acceptable is to not have checkpointing at all. >> >> Aljoscha Krettek <[hidden email]> ezt írta (időpont: 2015. jún. 10., >> Sze, 12:43): >> >>> The elements that are in-flight in an iteration are also state of the >>> job. I'm wondering whether the state inside iterations still makes >>> sense without these in-flight elements. But I also don't know the King >>> use-case, that's why I though an example could be helpful. >>> >>> On Wed, Jun 10, 2015 at 12:37 PM, Gyula Fóra <[hidden email]> >>> wrote: >>> > I don't understand the question, I vote for checkpointing all state in >>> the >>> > job, even inside iterations (its more of a loop). >>> > >>> > Aljoscha Krettek <[hidden email]> ezt írta (időpont: 2015. jún. >>> 10., >>> > Sze, 12:34): >>> > >>> >> I don't understand why having the state inside an iteration but not >>> >> the elements that correspond to this state or created this state is >>> >> desirable. Maybe an example could help understand this better? >>> >> >>> >> On Wed, Jun 10, 2015 at 11:27 AM, Gyula Fóra <[hidden email]> >>> wrote: >>> >> > The other tests verify that the checkpointing algorithm runs >>> properly. >>> >> That >>> >> > also ensures that it runs for iterations because a loop is just an >>> extra >>> >> > source and sink in the jobgraph (so it is the same for the >>> algorithm). >>> >> > >>> >> > Fabian Hueske <[hidden email]> ezt írta (időpont: 2015. jún. 10., >>> >> Sze, >>> >> > 11:19): >>> >> > >>> >> >> Without going into the details, how well tested is this feature? >>> The PR >>> >> >> only extends one test by a few lines. >>> >> >> >>> >> >> Is that really enough to ensure that >>> >> >> 1) the change does not cause trouble >>> >> >> 2) is working as expected >>> >> >> >>> >> >> If this feature should go into the release, it must be thoroughly >>> >> checked >>> >> >> and we must take the time for that. >>> >> >> Including code and hoping for the best because time is scarce is >>> not an >>> >> >> option IMO. >>> >> >> >>> >> >> Fabian >>> >> >> >>> >> >> >>> >> >> 2015-06-10 11:05 GMT+02:00 Gyula Fóra <[hidden email]>: >>> >> >> >>> >> >> > And also I would like to remind everyone that any fault tolerance >>> we >>> >> >> > provide is only as good as the fault tolerance of the master node. >>> >> Which >>> >> >> is >>> >> >> > non existent at the moment. >>> >> >> > >>> >> >> > So I don't see a reason why a user should not be able to choose >>> >> whether >>> >> >> he >>> >> >> > wants state checkpoints for iterations as well. >>> >> >> > >>> >> >> > In any case this will be used by King for instance, so making it >>> part >>> >> of >>> >> >> > the release would save a lot of work for everyone. >>> >> >> > >>> >> >> > Paris Carbone <[hidden email]> ezt írta (időpont: 2015. jún. 10., >>> Sze, >>> >> >> > 10:29): >>> >> >> > >>> >> >> > > >>> >> >> > > To continue Gyula's point, for consistent snapshots we need to >>> >> persist >>> >> >> > the >>> >> >> > > records in transit within the loop and also slightly change the >>> >> >> current >>> >> >> > > protocol since it works only for DAGs. Before going into that >>> >> direction >>> >> >> > > though I would propose we first see whether there is a nice way >>> to >>> >> make >>> >> >> > > iterations more structured. >>> >> >> > > >>> >> >> > > Paris >>> >> >> > > ________________________________________ >>> >> >> > > From: Gyula Fóra <[hidden email]> >>> >> >> > > Sent: Wednesday, June 10, 2015 10:19 AM >>> >> >> > > To: [hidden email] >>> >> >> > > Subject: Re: Force enabling checkpoints for iterative streaming >>> jobs >>> >> >> > > >>> >> >> > > I disagree. Not having checkpointed operators inside the >>> iteration >>> >> >> still >>> >> >> > > breaks the guarantees. >>> >> >> > > >>> >> >> > > It is not about the states it is about the loop itself. >>> >> >> > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek < >>> >> [hidden email] >>> >> >> > >>> >> >> > > wrote: >>> >> >> > > >>> >> >> > > > This is the answer I gave on the PR (we should have one place >>> for >>> >> >> > > > discussing this, though): >>> >> >> > > > >>> >> >> > > > I would be against merging this in the current form. What I >>> >> propose >>> >> >> is >>> >> >> > > > to analyse the topology to verify that there are no >>> checkpointed >>> >> >> > > > operators inside iterations. Operators before and after >>> iterations >>> >> >> can >>> >> >> > > > be checkpointed and we can safely allow the user to enable >>> >> >> > > > checkpointing. >>> >> >> > > > >>> >> >> > > > If we have the code to analyse which operators are inside >>> >> iterations >>> >> >> > > > we could also disallow windows inside iterations. I think >>> windows >>> >> >> > > > inside iterations don't make sense since elements in different >>> >> >> > > > "iterations" would end up in the same window. Maybe I'm wrong >>> here >>> >> >> > > > though, then please correct me. >>> >> >> > > > >>> >> >> > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi >>> >> >> > > > <[hidden email]> wrote: >>> >> >> > > > > I agree that for the sake of the above mentioned use cases >>> it is >>> >> >> > > > reasonable >>> >> >> > > > > to add this to the release with the right documentation, for >>> >> >> machine >>> >> >> > > > > learning potentially loosing one round of feedback data >>> should >>> >> not >>> >> >> > > > matter. >>> >> >> > > > > >>> >> >> > > > > Let us not block prominent users until the next release on >>> this. >>> >> >> > > > > >>> >> >> > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra < >>> >> [hidden email]> >>> >> >> > > > wrote: >>> >> >> > > > > >>> >> >> > > > >> As for people currently suffering from it: >>> >> >> > > > >> >>> >> >> > > > >> An application King is developing requires iterations, and >>> they >>> >> >> need >>> >> >> > > > >> checkpoints. Practically all SAMOA programs would need >>> this. >>> >> >> > > > >> >>> >> >> > > > >> It is very likely that the state interfaces will be changed >>> >> after >>> >> >> > the >>> >> >> > > > >> release, so this is not something that we can just add >>> later. I >>> >> >> > don't >>> >> >> > > > see a >>> >> >> > > > >> reason why we should not add it, as it is clearly >>> documented. >>> >> In >>> >> >> > this >>> >> >> > > > >> actual case not having guarantees at all means people will >>> >> never >>> >> >> use >>> >> >> > > it >>> >> >> > > > in >>> >> >> > > > >> any production system. Having limited guarantees means >>> that it >>> >> >> will >>> >> >> > > > depend >>> >> >> > > > >> on the application. >>> >> >> > > > >> >>> >> >> > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi < >>> [hidden email]> >>> >> >> > wrote: >>> >> >> > > > >> >>> >> >> > > > >> > Hey Gyula, >>> >> >> > > > >> > >>> >> >> > > > >> > I understand your reasoning, but I don't think its worth >>> to >>> >> rush >>> >> >> > > this >>> >> >> > > > >> into >>> >> >> > > > >> > the release. >>> >> >> > > > >> > >>> >> >> > > > >> > As you've said, we cannot give precise guarantees. But >>> this >>> >> is >>> >> >> > > > arguably >>> >> >> > > > >> > one of the key requirements for any fault tolerance >>> >> mechanism. >>> >> >> > > > Therefore >>> >> >> > > > >> I >>> >> >> > > > >> > disagree that this is better than not having anything at >>> >> all. I >>> >> >> > > think >>> >> >> > > > it >>> >> >> > > > >> > will already go a long way to have the non-iterative case >>> >> >> working >>> >> >> > > > >> reliably. >>> >> >> > > > >> > >>> >> >> > > > >> > And as far as I know there are no users really suffering >>> from >>> >> >> this >>> >> >> > > at >>> >> >> > > > the >>> >> >> > > > >> > moment (in the sense that someone has complained on the >>> >> mailing >>> >> >> > > list). >>> >> >> > > > >> > >>> >> >> > > > >> > Hence, I vote to postpone this. >>> >> >> > > > >> > >>> >> >> > > > >> > – Ufuk >>> >> >> > > > >> > >>> >> >> > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra <[hidden email]> >>> >> wrote: >>> >> >> > > > >> > >>> >> >> > > > >> > > Hey all, >>> >> >> > > > >> > > >>> >> >> > > > >> > > It is currently impossible to enable state >>> checkpointing >>> >> for >>> >> >> > > > iterative >>> >> >> > > > >> > > jobs, because en exception is thrown when creating the >>> >> >> jobgraph. >>> >> >> > > > This >>> >> >> > > > >> > > behaviour is motivated by the lack of precise >>> guarantees >>> >> that >>> >> >> we >>> >> >> > > can >>> >> >> > > > >> give >>> >> >> > > > >> > > with the current fault-tolerance implementations for >>> cyclic >>> >> >> > > graphs. >>> >> >> > > > >> > > >>> >> >> > > > >> > > This PR <https://github.com/apache/flink/pull/812> >>> adds an >>> >> >> > > optional >>> >> >> > > > >> > flag to >>> >> >> > > > >> > > force checkpoints even in case of iterations. The >>> algorithm >>> >> >> will >>> >> >> > > > take >>> >> >> > > > >> > > checkpoints periodically as before, but records in >>> transit >>> >> >> > inside >>> >> >> > > > the >>> >> >> > > > >> > loop >>> >> >> > > > >> > > will be lost. >>> >> >> > > > >> > > >>> >> >> > > > >> > > However even this guarantee is enough for most >>> applications >>> >> >> > > (Machine >>> >> >> > > > >> > > Learning for instance) and certainly much better than >>> not >>> >> >> having >>> >> >> > > > >> anything >>> >> >> > > > >> > > at all. >>> >> >> > > > >> > > >>> >> >> > > > >> > > >>> >> >> > > > >> > > I suggest we add this to the 0.9 release as currently >>> many >>> >> >> > > > applications >>> >> >> > > > >> > > suffer from this limitation (SAMOA, ML pipelines, graph >>> >> >> > streaming >>> >> >> > > > etc.) >>> >> >> > > > >> > > >>> >> >> > > > >> > > >>> >> >> > > > >> > > Cheers, >>> >> >> > > > >> > > >>> >> >> > > > >> > > Gyula >>> >> >> > > > >> > >>> >> >> > > > >> > >>> >> >> > > > >> >>> >> >> > > > >>> >> >> > > >>> >> >> > >>> >> >> >>> >> >>> >> |
Max suggested that I add this feature slightly hidden to the execution
config instance. The problem then is that I either make a public field in the config or once again add a method. Any ideas? Aljoscha Krettek <[hidden email]> ezt írta (időpont: 2015. jún. 10., Sze, 14:07): > Thanks :D, now I see. It makes sense because we don't have another way > of keeping the cluster state synced/distributed across parallel > instances of the operators. > > On Wed, Jun 10, 2015 at 12:52 PM, Gyula Fóra <[hidden email]> wrote: > > Here is an example for you: > > > > Parallel streaming kmeans, the state we keep is the current cluster > > centers, and we use iterations to sync the centers across parallel > > instances. > > We can afford lost model updated in the loop but we need the checkpoint > the > > models. > > > > > https://github.com/gyfora/stream-clustering/blob/master/src/main/scala/stream/clustering/StreamClustering.scala > > > > (checkpointing is not turned on but you will get the point) > > > > > > > > Gyula Fóra <[hidden email]> ezt írta (időpont: 2015. jún. 10., > Sze, > > 12:47): > > > >> You are right, to have consistent results we would need to persist the > >> records. > >> > >> But since we cannot do that right now, we can still checkpoint all > >> operator states and understand that inflight records in the loop are > lost > >> on failure. > >> > >> This is acceptable for most the use-cases that we have developed so far > >> for iterations (machine learning, graph updates, etc.) What is not > >> acceptable is to not have checkpointing at all. > >> > >> Aljoscha Krettek <[hidden email]> ezt írta (időpont: 2015. jún. > 10., > >> Sze, 12:43): > >> > >>> The elements that are in-flight in an iteration are also state of the > >>> job. I'm wondering whether the state inside iterations still makes > >>> sense without these in-flight elements. But I also don't know the King > >>> use-case, that's why I though an example could be helpful. > >>> > >>> On Wed, Jun 10, 2015 at 12:37 PM, Gyula Fóra <[hidden email]> > >>> wrote: > >>> > I don't understand the question, I vote for checkpointing all state > in > >>> the > >>> > job, even inside iterations (its more of a loop). > >>> > > >>> > Aljoscha Krettek <[hidden email]> ezt írta (időpont: 2015. jún. > >>> 10., > >>> > Sze, 12:34): > >>> > > >>> >> I don't understand why having the state inside an iteration but not > >>> >> the elements that correspond to this state or created this state is > >>> >> desirable. Maybe an example could help understand this better? > >>> >> > >>> >> On Wed, Jun 10, 2015 at 11:27 AM, Gyula Fóra <[hidden email]> > >>> wrote: > >>> >> > The other tests verify that the checkpointing algorithm runs > >>> properly. > >>> >> That > >>> >> > also ensures that it runs for iterations because a loop is just an > >>> extra > >>> >> > source and sink in the jobgraph (so it is the same for the > >>> algorithm). > >>> >> > > >>> >> > Fabian Hueske <[hidden email]> ezt írta (időpont: 2015. jún. > 10., > >>> >> Sze, > >>> >> > 11:19): > >>> >> > > >>> >> >> Without going into the details, how well tested is this feature? > >>> The PR > >>> >> >> only extends one test by a few lines. > >>> >> >> > >>> >> >> Is that really enough to ensure that > >>> >> >> 1) the change does not cause trouble > >>> >> >> 2) is working as expected > >>> >> >> > >>> >> >> If this feature should go into the release, it must be thoroughly > >>> >> checked > >>> >> >> and we must take the time for that. > >>> >> >> Including code and hoping for the best because time is scarce is > >>> not an > >>> >> >> option IMO. > >>> >> >> > >>> >> >> Fabian > >>> >> >> > >>> >> >> > >>> >> >> 2015-06-10 11:05 GMT+02:00 Gyula Fóra <[hidden email]>: > >>> >> >> > >>> >> >> > And also I would like to remind everyone that any fault > tolerance > >>> we > >>> >> >> > provide is only as good as the fault tolerance of the master > node. > >>> >> Which > >>> >> >> is > >>> >> >> > non existent at the moment. > >>> >> >> > > >>> >> >> > So I don't see a reason why a user should not be able to choose > >>> >> whether > >>> >> >> he > >>> >> >> > wants state checkpoints for iterations as well. > >>> >> >> > > >>> >> >> > In any case this will be used by King for instance, so making > it > >>> part > >>> >> of > >>> >> >> > the release would save a lot of work for everyone. > >>> >> >> > > >>> >> >> > Paris Carbone <[hidden email]> ezt írta (időpont: 2015. jún. > 10., > >>> Sze, > >>> >> >> > 10:29): > >>> >> >> > > >>> >> >> > > > >>> >> >> > > To continue Gyula's point, for consistent snapshots we need > to > >>> >> persist > >>> >> >> > the > >>> >> >> > > records in transit within the loop and also slightly change > the > >>> >> >> current > >>> >> >> > > protocol since it works only for DAGs. Before going into that > >>> >> direction > >>> >> >> > > though I would propose we first see whether there is a nice > way > >>> to > >>> >> make > >>> >> >> > > iterations more structured. > >>> >> >> > > > >>> >> >> > > Paris > >>> >> >> > > ________________________________________ > >>> >> >> > > From: Gyula Fóra <[hidden email]> > >>> >> >> > > Sent: Wednesday, June 10, 2015 10:19 AM > >>> >> >> > > To: [hidden email] > >>> >> >> > > Subject: Re: Force enabling checkpoints for iterative > streaming > >>> jobs > >>> >> >> > > > >>> >> >> > > I disagree. Not having checkpointed operators inside the > >>> iteration > >>> >> >> still > >>> >> >> > > breaks the guarantees. > >>> >> >> > > > >>> >> >> > > It is not about the states it is about the loop itself. > >>> >> >> > > On Wed, Jun 10, 2015 at 10:12 AM Aljoscha Krettek < > >>> >> [hidden email] > >>> >> >> > > >>> >> >> > > wrote: > >>> >> >> > > > >>> >> >> > > > This is the answer I gave on the PR (we should have one > place > >>> for > >>> >> >> > > > discussing this, though): > >>> >> >> > > > > >>> >> >> > > > I would be against merging this in the current form. What I > >>> >> propose > >>> >> >> is > >>> >> >> > > > to analyse the topology to verify that there are no > >>> checkpointed > >>> >> >> > > > operators inside iterations. Operators before and after > >>> iterations > >>> >> >> can > >>> >> >> > > > be checkpointed and we can safely allow the user to enable > >>> >> >> > > > checkpointing. > >>> >> >> > > > > >>> >> >> > > > If we have the code to analyse which operators are inside > >>> >> iterations > >>> >> >> > > > we could also disallow windows inside iterations. I think > >>> windows > >>> >> >> > > > inside iterations don't make sense since elements in > different > >>> >> >> > > > "iterations" would end up in the same window. Maybe I'm > wrong > >>> here > >>> >> >> > > > though, then please correct me. > >>> >> >> > > > > >>> >> >> > > > On Wed, Jun 10, 2015 at 10:08 AM, Márton Balassi > >>> >> >> > > > <[hidden email]> wrote: > >>> >> >> > > > > I agree that for the sake of the above mentioned use > cases > >>> it is > >>> >> >> > > > reasonable > >>> >> >> > > > > to add this to the release with the right documentation, > for > >>> >> >> machine > >>> >> >> > > > > learning potentially loosing one round of feedback data > >>> should > >>> >> not > >>> >> >> > > > matter. > >>> >> >> > > > > > >>> >> >> > > > > Let us not block prominent users until the next release > on > >>> this. > >>> >> >> > > > > > >>> >> >> > > > > On Wed, Jun 10, 2015 at 8:09 AM, Gyula Fóra < > >>> >> [hidden email]> > >>> >> >> > > > wrote: > >>> >> >> > > > > > >>> >> >> > > > >> As for people currently suffering from it: > >>> >> >> > > > >> > >>> >> >> > > > >> An application King is developing requires iterations, > and > >>> they > >>> >> >> need > >>> >> >> > > > >> checkpoints. Practically all SAMOA programs would need > >>> this. > >>> >> >> > > > >> > >>> >> >> > > > >> It is very likely that the state interfaces will be > changed > >>> >> after > >>> >> >> > the > >>> >> >> > > > >> release, so this is not something that we can just add > >>> later. I > >>> >> >> > don't > >>> >> >> > > > see a > >>> >> >> > > > >> reason why we should not add it, as it is clearly > >>> documented. > >>> >> In > >>> >> >> > this > >>> >> >> > > > >> actual case not having guarantees at all means people > will > >>> >> never > >>> >> >> use > >>> >> >> > > it > >>> >> >> > > > in > >>> >> >> > > > >> any production system. Having limited guarantees means > >>> that it > >>> >> >> will > >>> >> >> > > > depend > >>> >> >> > > > >> on the application. > >>> >> >> > > > >> > >>> >> >> > > > >> On Wed, Jun 10, 2015 at 12:53 AM, Ufuk Celebi < > >>> [hidden email]> > >>> >> >> > wrote: > >>> >> >> > > > >> > >>> >> >> > > > >> > Hey Gyula, > >>> >> >> > > > >> > > >>> >> >> > > > >> > I understand your reasoning, but I don't think its > worth > >>> to > >>> >> rush > >>> >> >> > > this > >>> >> >> > > > >> into > >>> >> >> > > > >> > the release. > >>> >> >> > > > >> > > >>> >> >> > > > >> > As you've said, we cannot give precise guarantees. But > >>> this > >>> >> is > >>> >> >> > > > arguably > >>> >> >> > > > >> > one of the key requirements for any fault tolerance > >>> >> mechanism. > >>> >> >> > > > Therefore > >>> >> >> > > > >> I > >>> >> >> > > > >> > disagree that this is better than not having anything > at > >>> >> all. I > >>> >> >> > > think > >>> >> >> > > > it > >>> >> >> > > > >> > will already go a long way to have the non-iterative > case > >>> >> >> working > >>> >> >> > > > >> reliably. > >>> >> >> > > > >> > > >>> >> >> > > > >> > And as far as I know there are no users really > suffering > >>> from > >>> >> >> this > >>> >> >> > > at > >>> >> >> > > > the > >>> >> >> > > > >> > moment (in the sense that someone has complained on > the > >>> >> mailing > >>> >> >> > > list). > >>> >> >> > > > >> > > >>> >> >> > > > >> > Hence, I vote to postpone this. > >>> >> >> > > > >> > > >>> >> >> > > > >> > – Ufuk > >>> >> >> > > > >> > > >>> >> >> > > > >> > On 10 Jun 2015, at 00:19, Gyula Fóra < > [hidden email]> > >>> >> wrote: > >>> >> >> > > > >> > > >>> >> >> > > > >> > > Hey all, > >>> >> >> > > > >> > > > >>> >> >> > > > >> > > It is currently impossible to enable state > >>> checkpointing > >>> >> for > >>> >> >> > > > iterative > >>> >> >> > > > >> > > jobs, because en exception is thrown when creating > the > >>> >> >> jobgraph. > >>> >> >> > > > This > >>> >> >> > > > >> > > behaviour is motivated by the lack of precise > >>> guarantees > >>> >> that > >>> >> >> we > >>> >> >> > > can > >>> >> >> > > > >> give > >>> >> >> > > > >> > > with the current fault-tolerance implementations for > >>> cyclic > >>> >> >> > > graphs. > >>> >> >> > > > >> > > > >>> >> >> > > > >> > > This PR <https://github.com/apache/flink/pull/812> > >>> adds an > >>> >> >> > > optional > >>> >> >> > > > >> > flag to > >>> >> >> > > > >> > > force checkpoints even in case of iterations. The > >>> algorithm > >>> >> >> will > >>> >> >> > > > take > >>> >> >> > > > >> > > checkpoints periodically as before, but records in > >>> transit > >>> >> >> > inside > >>> >> >> > > > the > >>> >> >> > > > >> > loop > >>> >> >> > > > >> > > will be lost. > >>> >> >> > > > >> > > > >>> >> >> > > > >> > > However even this guarantee is enough for most > >>> applications > >>> >> >> > > (Machine > >>> >> >> > > > >> > > Learning for instance) and certainly much better > than > >>> not > >>> >> >> having > >>> >> >> > > > >> anything > >>> >> >> > > > >> > > at all. > >>> >> >> > > > >> > > > >>> >> >> > > > >> > > > >>> >> >> > > > >> > > I suggest we add this to the 0.9 release as > currently > >>> many > >>> >> >> > > > applications > >>> >> >> > > > >> > > suffer from this limitation (SAMOA, ML pipelines, > graph > >>> >> >> > streaming > >>> >> >> > > > etc.) > >>> >> >> > > > >> > > > >>> >> >> > > > >> > > > >>> >> >> > > > >> > > Cheers, > >>> >> >> > > > >> > > > >>> >> >> > > > >> > > Gyula > >>> >> >> > > > >> > > >>> >> >> > > > >> > > >>> >> >> > > > >> > >>> >> >> > > > > >>> >> >> > > > >>> >> >> > > >>> >> >> > >>> >> > >>> > >> > |
On 10 Jun 2015, at 14:29, Gyula Fóra <[hidden email]> wrote:
> Max suggested that I add this feature slightly hidden to the execution > config instance. > > The problem then is that I either make a public field in the config or once > again add a method. > > Any ideas? I thought about this as well. The only way to add this in a "hidden" way would be have access to the Configuration instance – which the user does not. So I think there is no way w/o adding public access. |
We could add a method on the ExecutionConfig but mark it as deprecated
and explain that it will go away once the interplay of iterations, state and so on is properly figured out. On Wed, Jun 10, 2015 at 2:36 PM, Ufuk Celebi <[hidden email]> wrote: > On 10 Jun 2015, at 14:29, Gyula Fóra <[hidden email]> wrote: > >> Max suggested that I add this feature slightly hidden to the execution >> config instance. >> >> The problem then is that I either make a public field in the config or once >> again add a method. >> >> Any ideas? > > I thought about this as well. The only way to add this in a "hidden" way would be have access to the Configuration instance – which the user does not. So I think there is no way w/o adding public access. |
Then I suggest we leave it in the environment along with the other
checkpointing methods. I updated my PR so it includes hints how to force enable checkpoints (and the reduced guarantees) when an error is thrown for iterative jobs. On Wed, Jun 10, 2015 at 2:46 PM, Aljoscha Krettek <[hidden email]> wrote: > We could add a method on the ExecutionConfig but mark it as deprecated > and explain that it will go away once the interplay of iterations, > state and so on is properly figured out. > > On Wed, Jun 10, 2015 at 2:36 PM, Ufuk Celebi <[hidden email]> wrote: > > On 10 Jun 2015, at 14:29, Gyula Fóra <[hidden email]> wrote: > > > >> Max suggested that I add this feature slightly hidden to the execution > >> config instance. > >> > >> The problem then is that I either make a public field in the config or > once > >> again add a method. > >> > >> Any ideas? > > > > I thought about this as well. The only way to add this in a "hidden" way > would be have access to the Configuration instance – which the user does > not. So I think there is no way w/o adding public access. > |
Free forum by Nabble | Edit this page |