[VOTE] Release 1.11.0, release candidate #4


Re: [VOTE] Release 1.11.0, release candidate #4

Zhijiang(wangzhijiang999)
Hi all,

The vote has been open for more than 72 hours. Thanks, everyone, for helping to test and verify the release.
I will finalize the vote result soon in a separate email.

Best,
Zhijiang


------------------------------------------------------------------
From:Jingsong Li <[hidden email]>
Send Time: July 6, 2020 (Monday) 12:11
To:dev <[hidden email]>
Subject:Re: [VOTE] Release 1.11.0, release candidate #4

+1 (non-binding)

- verified signature and checksum
- build from source
- checked webui and log sanity
- played with filesystem and new connectors
- played with Hive connector

Best,
Jingsong

On Mon, Jul 6, 2020 at 9:50 AM Xintong Song <[hidden email]> wrote:

> +1 (non-binding)
>
> - verified signature and checksum
> - build from source
> - checked log sanity
> - checked webui
> - played with memory configurations
> - played with binding addresses/ports
>
> Thank you~
>
> Xintong Song
>
>
>
> On Sun, Jul 5, 2020 at 9:41 PM Benchao Li <[hidden email]> wrote:
>
> > +1 (non-binding)
> >
> > Checks:
> > - verified signature and shasum of release files [OK]
> > - build from source [OK]
> > - started standalone cluster, sql-client [mostly OK except one issue]
> >   - played with sql-client
> >   - played with new features: LIKE / Table Options
> >   - checked Web UI functionality
> >   - canceled job from UI
> >
> > While playing with the new table factories, I found one issue [1] that
> > surprised me.
> > I don't think this should be a blocker, so I'll still vote +1.
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-18487
> >
> > On Sun, Jul 5, 2020 at 1:10 PM Zhijiang <[hidden email]> wrote:
> >
> > > Hi Thomas,
> > >
> > > Regarding [2], there is more detail in the Jira description (
> > > https://issues.apache.org/jira/browse/FLINK-16404).
> > >
> > > I can also give a basic explanation here to address the concern.
> > > 1. Before 1.11, the buffers following the barrier were cached on the
> > > downstream side until alignment finished.
> > > 2. In 1.11, the upstream does not send the buffers that follow the
> > > barrier. When the downstream finishes the alignment, it notifies the
> > > upstream to continue sending the following buffers, since it can
> > > process them after alignment.
> > > 3. The only difference is whether the temporarily blocked buffers are
> > > cached on the downstream side or on the upstream side before alignment.
> > > 4. The side effect is an additional notification cost for every
> > > barrier alignment. If the downstream and upstream are deployed in
> > > separate TaskManagers, the cost is the network transport delay (the
> > > effect was negligible in our testing with a 1s checkpoint interval).
> > > With slot sharing, as in your case, the cost is only one method call in
> > > the processor, which is also negligible.
> > >
> > > You mentioned "In this case, the downstream task has a high average
> > > checkpoint duration (~30s, sync part)." This duration does not reflect
> > > the changes above; it only indicates the time spent calling
> > > `Operation.snapshotState`.
> > > If this duration is longer than you expect, you can check or debug
> > > whether the source/sink operations take more time to finish
> > > `snapshotState` in practice. E.g., you can make the implementation of
> > > this method empty to further verify the effect.
> > >
> > > Best,
> > > Zhijiang
> > >
> > >
> > > ------------------------------------------------------------------
> > > From:Thomas Weise <[hidden email]>
> > > Send Time: July 5, 2020 (Sunday) 12:22
> > > To:dev <[hidden email]>; Zhijiang <[hidden email]>
> > > Cc:Yingjie Cao <[hidden email]>
> > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > >
> > > Hi Zhijiang,
> > >
> > > Could you please point me to more details regarding: "[2]: Delay send
> > > the following buffers after checkpoint barrier on upstream side until
> > > barrier alignment on downstream side."
> > >
> > > In this case, the downstream task has a high average checkpoint
> > > duration (~30s, sync part). If there was a change to hold buffers
> > > depending on downstream performance, could this possibly apply to this
> > > case (even when there is no shuffle that would require alignment)?
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > > On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <[hidden email]> wrote:
> > >
> > > > Hi Thomas,
> > > >
> > > > Thanks for the additional information.
> > > >
> > > > I think we can rule out the network stack changes, since in your case
> > > > the downstream and upstream tasks are probably deployed in the same
> > > > slot, bypassing the network data shuffle.
> > > > I also think release-1.11 does not bring a general performance
> > > > regression in the runtime engine, as we ran performance tests for all
> > > > general cases [1] in a real cluster beforehand and the testing results
> > > > should match expectations. But as far as I know, we have not yet
> > > > tested the specific source and sink connectors.
> > > >
> > > > Regarding your 40% performance regression, I suspect it is related to
> > > > specific source/sink changes (e.g. Kinesis) or to an environment
> > > > issue in a corner case.
> > > > If possible, it would be helpful to check whether the regression is
> > > > caused by Kinesis, by replacing the Kinesis source and sink and
> > > > keeping everything else the same.
> > > >
> > > > As you said, it would be efficient to talk with you directly next
> > > > week to discuss this issue further. We are eager to provide any help
> > > > needed to resolve it soon.
> > > >
> > > > Besides that, I think this issue should not block the release, since
> > > > it is probably a corner case based on the current analysis.
> > > > If we conclude that anything needs to be fixed after the final
> > > > release, we can make the next minor release, 1.11.1, come soon.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-18433
> > > >
> > > > Best,
> > > > Zhijiang
> > > >
> > > >
> > > > ------------------------------------------------------------------
> > > > From:Thomas Weise <[hidden email]>
> > > > Send Time: July 4, 2020 (Saturday) 12:26
> > > > To:dev <[hidden email]>; Zhijiang <[hidden email]>
> > > > Cc:Yingjie Cao <[hidden email]>
> > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > >
> > > > Hi Zhijiang,
> > > >
> > > > It will probably be best if we connect next week and discuss the
> issue
> > > > directly since this could be quite difficult to reproduce.
> > > >
> > > > Before the testing result on our side comes out for your respective
> job
> > > > case, I have some other questions to confirm for further analysis:
> > > >     -  How much percentage regression you found after switching to
> > 1.11?
> > > >
> > > > ~40% throughput decline
> > > >
> > > >     -  Are there any network bottleneck in your cluster? E.g. the
> > network
> > > > bandwidth is full caused by other jobs? If so, it might have more
> > effects
> > > > by above [2]
> > > >
> > > > The test runs on a k8s cluster that is also used for other production
> > > > jobs. There is no reason to believe the network is the bottleneck.
> > > >
> > > >     -  Did you adjust the default network buffer setting? E.g.
> > > > "taskmanager.network.memory.floating-buffers-per-gate" or
> > > > "taskmanager.network.memory.buffers-per-channel"
> > > >
> > > > The job is using the defaults, i.e we don't configure the settings.
> If
> > > you
> > > > want me to try specific settings in the hope that it will help to
> > isolate
> > > > the issue please let me know.
> > > >
> > > >     -  I guess the topology has three vertexes "KinesisConsumer ->
> > > Chained
> > > > FlatMap -> KinesisProducer", and the partition mode for
> > "KinesisConsumer
> > > ->
> > > > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If so,
> the
> > > edge
> > > > connection is one-to-one, not all-to-all, then the above [1][2]
> should
> > no
> > > > effects in theory with default network buffer setting.
> > > >
> > > > There are only 2 vertices and the edge is "forward".
> > > >
> > > >     - By slot sharing, I guess these three vertex parallelism task
> > would
> > > > probably be deployed into the same slot, then the data shuffle is by
> > > memory
> > > > queue, not network stack. If so, the above [2] should no effect.
> > > >
> > > > Yes, vertices share slots.
> > > >
> > > >     - I also saw some Jira changes for kinesis in this release, could
> > you
> > > > confirm that these changes would not effect the performance?
> > > >
> > > > I will need to take a look. 1.10 already had a regression introduced
> by
> > > the
> > > > Kinesis producer update.
> > > >
> > > >
> > > > Thanks,
> > > > Thomas
> > > >
> > > >
> > > > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <[hidden email]> wrote:
> > > >
> > > > > Hi Thomas,
> > > > >
> > > > > Thanks for your reply with rich information!
> > > > >
> > > > > We are trying to reproduce your case in our cluster to verify it
> > > > > further, and @Yingjie Cao is working on it now.
> > > > > As we do not have a Kinesis consumer and producer internally, we
> > > > > will construct a common source and sink instead for the
> > > > > backpressure case.
> > > > >
> > > > > Firstly, we can rule out the RocksDB factor in this release, since
> > > > > you also mentioned that "filesystem leads to same symptoms".
> > > > >
> > > > > Secondly, if my understanding is right, you emphasize that the
> > > > > regression only exists for jobs with a low checkpoint interval (10s).
> > > > > Based on that, I suspect two network-related changes in this release:
> > > > >     - [1]: Limit the maximum backlog value (default 10) in the
> > > > > subpartition queue.
> > > > >     - [2]: Delay sending the buffers that follow the checkpoint
> > > > > barrier on the upstream side until barrier alignment on the
> > > > > downstream side.
> > > > >
> > > > > These changes are meant to reduce the in-flight buffers to speed up
> > > > > checkpoints, especially under backpressure.
> > > > > In theory they should have only a very minor performance effect, and
> > > > > we tested them in a cluster to verify that before merging them, but
> > > > > maybe there are other corner cases we have not thought of.
> > > > >
> > > > > Before the test results for your job case come out on our side, I
> > > > > have some other questions to confirm for further analysis:
> > > > >     - What percentage regression did you see after switching to 1.11?
> > > > >     - Is there any network bottleneck in your cluster? E.g., is the
> > > > > network bandwidth saturated by other jobs? If so, [2] above might
> > > > > have a larger effect.
> > > > >     - Did you adjust the default network buffer settings? E.g.
> > > > > "taskmanager.network.memory.floating-buffers-per-gate" or
> > > > > "taskmanager.network.memory.buffers-per-channel"
> > > > >     - I guess the topology has three vertices, "KinesisConsumer ->
> > > > > Chained FlatMap -> KinesisProducer", and the partition mode for
> > > > > "KinesisConsumer -> FlatMap" and "FlatMap->KinesisProducer" is
> > > > > "forward" in both cases? If so, the edge connection is one-to-one,
> > > > > not all-to-all, and [1][2] above should have no effect in theory
> > > > > with the default network buffer settings.
> > > > >     - With slot sharing, I guess the parallel tasks of these three
> > > > > vertices would probably be deployed into the same slot, so the data
> > > > > shuffle goes through a memory queue, not the network stack. If so,
> > > > > [2] above should have no effect.
> > > > >     - I also saw some Jira changes for Kinesis in this release. Could
> > > > > you confirm that these changes do not affect performance?
> > > > >
> > > > > Best,
> > > > > Zhijiang
> > > > >
> > > > >
> > > > > ------------------------------------------------------------------
> > > > > From:Thomas Weise <[hidden email]>
> > > > > Send Time: July 3, 2020 (Friday) 01:07
> > > > > To:dev <[hidden email]>; Zhijiang <[hidden email]>
> > > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > > >
> > > > > Hi Zhijiang,
> > > > >
> > > > > The performance degradation manifests as backpressure, which leads
> > > > > to a growing backlog in the source. I switched a few times between
> > > > > 1.10 and 1.11 and the behavior is consistent.
> > > > >
> > > > > The DAG is:
> > > > >
> > > > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map) -------- forward ---------> KinesisProducer
> > > > >
> > > > > Parallelism: 160
> > > > > No shuffle/rebalance.
> > > > >
> > > > > Checkpointing config:
> > > > >
> > > > > Checkpointing Mode: Exactly Once
> > > > > Interval: 10s
> > > > > Timeout: 10m 0s
> > > > > Minimum Pause Between Checkpoints: 10s
> > > > > Maximum Concurrent Checkpoints: 1
> > > > > Persist Checkpoints Externally: Enabled (delete on cancellation)
> > > > >
> > > > > State backend: rocksdb (filesystem leads to same symptoms)
> > > > > Checkpoint size is tiny (500KB)
> > > > >
> > > > > An interesting difference to another job that I had upgraded
> > > > > successfully is the low checkpointing interval.
> > > > >
> > > > > Thanks,
> > > > > Thomas
> > > > >
> > > > >
> > > > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <[hidden email]> wrote:
> > > > >
> > > > > > Hi Thomas,
> > > > > >
> > > > > > Thanks for the quick feedback.
> > > > > >
> > > > > > Regarding the suggestion to add the release notes document, I
> > > > > > agree with your point. Maybe we should adjust the vote template in
> > > > > > the respective wiki accordingly to guide future release processes.
> > > > > >
> > > > > > Regarding the performance regression, could you provide some more
> > > > > > details so that we can measure or reproduce it on our side?
> > > > > > E.g., I guess the topology only includes two vertices, source and
> > > > > > sink?
> > > > > > What is the parallelism of each vertex?
> > > > > > Does the upstream shuffle data to the downstream via the rebalance
> > > > > > partitioner or another one?
> > > > > > Is the checkpoint mode exactly-once with the RocksDB state backend?
> > > > > > Did backpressure occur in this case?
> > > > > > How large is the regression, as a percentage?
> > > > > >
> > > > > > Best,
> > > > > > Zhijiang
> > > > > >
> > > > > >
> > > > > >
> > > > > > ------------------------------------------------------------------
> > > > > > From:Thomas Weise <[hidden email]>
> > > > > > Send Time: July 2, 2020 (Thursday) 09:54
> > > > > > To:dev <[hidden email]>
> > > > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > > > >
> > > > > > Hi Till,
> > > > > >
> > > > > > Yes, we don't have the setting in flink-conf.yaml.
> > > > > >
> > > > > > Generally, we carry forward the existing configuration and any
> > > > > > change to default configuration values would impact the upgrade.
> > > > > >
> > > > > > Yes, since it is an incompatible change I would state it in the
> > > > > > release notes.
> > > > > >
> > > > > > Thanks,
> > > > > > Thomas
> > > > > >
> > > > > > BTW I found a performance regression while trying to upgrade
> > > > > > another pipeline with this RC. It is a simple Kinesis to Kinesis
> > > > > > job. I wasn't able to pin it down yet; symptoms include increased
> > > > > > checkpoint alignment time.
> > > > > >
> > > > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <[hidden email]> wrote:
> > > > > >
> > > > > > > Hi Thomas,
> > > > > > >
> > > > > > > just to confirm: When starting the image in local mode, you
> > > > > > > don't have any of the JobManager memory configuration settings
> > > > > > > configured in the effective flink-conf.yaml, right? Does this
> > > > > > > mean that you have explicitly removed
> > > > > > > `jobmanager.heap.size: 1024m` from the default configuration? If
> > > > > > > this is the case, then I believe it was more of an unintentional
> > > > > > > artifact that it worked before and it has been corrected now so
> > > > > > > that one needs to specify the memory of the JM process
> > > > > > > explicitly. Do you think it would help to explicitly state this
> > > > > > > in the release notes?
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
> > > > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <[hidden email]> wrote:
> > > > > > >
> > > > > > > > Thanks for preparing another RC!
> > > > > > > >
> > > > > > > > As mentioned in the previous RC thread, it would be super
> > > > > > > > helpful if the release notes that are part of the
> > > > > > > > documentation can be included [1]. It's a significant
> > > > > > > > time-saver to have read those first.
> > > > > > > >
> > > > > > > > I found one more non-backward compatible change that would be
> > > > > > > > worth addressing/mentioning:
> > > > > > > >
> > > > > > > > It is now necessary to configure the jobmanager heap size in
> > > > > > > > flink-conf.yaml (with either jobmanager.heap.size
> > > > > > > > or jobmanager.memory.heap.size). Why would I not want to do
> > > > > > > > that anyways? Well, we set it dynamically for a cluster
> > > > > > > > deployment via the flinkk8soperator, but the container image
> > > > > > > > can also be used for testing with local mode
> > > > > > > > (./bin/jobmanager.sh start-foreground local). That will fail
> > > > > > > > if the heap wasn't configured and that's how I noticed it.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Thomas
> > > > > > > >
> > > > > > > > [1]
> > > > > > > > https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
> > > > > > > >
> > > > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <[hidden email]> wrote:
> > > > > > > >
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > Please review and vote on the release candidate #4 for the
> > > > > > > > > version 1.11.0, as follows:
> > > > > > > > > [ ] +1, Approve the release
> > > > > > > > > [ ] -1, Do not approve the release (please provide specific
> > > > > > > > > comments)
> > > > > > > > >
> > > > > > > > > The complete staging area is available for your review,
> > > > > > > > > which includes:
> > > > > > > > > * JIRA release notes [1],
> > > > > > > > > * the official Apache source release and binary convenience
> > > > > > > > > releases to be deployed to dist.apache.org [2], which are
> > > > > > > > > signed with the key with fingerprint
> > > > > > > > > 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
> > > > > > > > > * all artifacts to be deployed to the Maven Central
> > > > > > > > > Repository [4],
> > > > > > > > > * source code tag "release-1.11.0-rc4" [5],
> > > > > > > > > * website pull request listing the new release and adding
> > > > > > > > > announcement blog post [6].
> > > > > > > > >
> > > > > > > > > The vote will be open for at least 72 hours. It is adopted
> > > > > > > > > by majority approval, with at least 3 PMC affirmative votes.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Release Manager
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> > > > > > > > > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
> > > > > > > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > > > > > > > [4]
> > > > > > > > > https://repository.apache.org/content/repositories/orgapacheflink-1377/
> > > > > > > > > [5] https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
> > > > > > > > > [6] https://github.com/apache/flink-web/pull/352
> >
> > --
> >
> > Best,
> > Benchao Li
> >
>
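
To illustrate Zhijiang's four-point explanation of [2] quoted above, here is a toy Python model of barrier alignment on one downstream task. It is not Flink code: the function name `run_alignment`, the `"B"` barrier marker, and the round-robin delivery are assumptions made only for this sketch. It contrasts the two places where post-barrier buffers can wait.

```python
from collections import deque

BARRIER = "B"  # illustrative marker for a checkpoint barrier (assumption)


def run_alignment(channels, hold_upstream):
    """Replay one aligned checkpoint on a single downstream task.

    channels: one list of buffers per input channel, each containing
        exactly one BARRIER.
    hold_upstream: False caches post-barrier buffers downstream (the
        pre-1.11 behaviour described in the thread); True keeps them
        upstream until the downstream reports alignment (1.11).
    Returns (processed order, peak downstream cache size, notifications).
    """
    if any(list(ch).count(BARRIER) != 1 for ch in channels):
        raise ValueError("each channel needs exactly one barrier")

    upstream = [deque(ch) for ch in channels]
    seen_barrier = [False] * len(upstream)
    cached = deque()      # buffers parked on the downstream side
    processed = []
    peak_cached = 0
    notifications = 0
    aligned = False

    while any(upstream):
        for i, queue in enumerate(upstream):
            if not queue:
                continue
            waiting = seen_barrier[i] and not aligned
            if waiting and hold_upstream:
                continue  # upstream holds the buffer until notified
            item = queue.popleft()
            if item == BARRIER:
                seen_barrier[i] = True
            elif waiting:
                cached.append(item)  # park it downstream until alignment
                peak_cached = max(peak_cached, len(cached))
            else:
                processed.append(item)
        if not aligned and all(seen_barrier):
            aligned = True
            processed.extend(cached)  # release anything parked downstream
            cached.clear()
            if hold_upstream:
                notifications = len(upstream)  # one "resume" per channel
    return processed, peak_cached, notifications
```

With the channels `["a1", "B", "a2", "a3"]` and `["b1", "b2", "b3", "B", "b4"]`, the first mode parks two buffers downstream and sends no notifications, while the second parks none and sends one notification per channel. Both process the same buffers, which matches point 3: only the place where blocked buffers wait changes.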


--
Best, Jingsong Lee
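
For the JobManager heap issue Thomas reports in the thread, a pre-flight check could catch a missing setting before starting local mode. The sketch below is a hypothetical helper, not part of Flink: `parse_flat_conf` and `jobmanager_heap_setting` are made-up names, the parser only understands flat `key: value` lines, and it checks only the two keys named in the thread.

```python
# The two keys named in the thread; other memory options are not considered.
HEAP_KEYS = ("jobmanager.heap.size", "jobmanager.memory.heap.size")


def parse_flat_conf(text):
    """Parse simple 'key: value' lines, skipping blanks and '#' comments."""
    conf = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def jobmanager_heap_setting(text):
    """Return (key, value) for the first heap key that is set, else None."""
    conf = parse_flat_conf(text)
    for key in HEAP_KEYS:
        if conf.get(key):
            return key, conf[key]
    return None
```

A container entrypoint could call `jobmanager_heap_setting` on the contents of flink-conf.yaml and fail with a clear message when it returns `None`.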


Re: [VOTE] Release 1.11.0, release candidate #4

Yang Wang
+1 (non-binding)

- Verified building from source
- Running Flink locally, submitting jobs via CLI and web UI
- Running Flink on Yarn
   - Test per-job, session, application modes
   - Test provided flink lib
   - Test remote user jar
- Running Flink on K8s
   - Standalone yaml submission, including session and application mode
   - Native submission, including session and application mode


Best,
Yang

On Mon, Jul 6, 2020 at 2:43 PM Zhijiang <[hidden email]> wrote:

> Hi all,
>
> The vote has been open for more than 72 hours. Thanks, everyone, for
> helping to test and verify the release.
> I will finalize the vote result soon in a separate email.
>
> Best,
> Zhijiang
>
>
> ------------------------------------------------------------------
> From:Jingsong Li <[hidden email]>
> Send Time: July 6, 2020 (Monday) 12:11
> To:dev <[hidden email]>
> Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>
> +1 (non-binding)
>
> - verified signature and checksum
> - build from source
> - checked webui and log sanity
> - played with filesystem and new connectors
> - played with Hive connector
>
> Best,
> Jingsong
>
> On Mon, Jul 6, 2020 at 9:50 AM Xintong Song <[hidden email]> wrote:
>
> > +1 (non-binding)
> >
> > - verified signature and checksum
> > - build from source
> > - checked log sanity
> > - checked webui
> > - played with memory configurations
> > - played with binding addresses/ports
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Sun, Jul 5, 2020 at 9:41 PM Benchao Li <[hidden email]> wrote:
> >
> > > +1 (non-binding)
> > >
> > > Checks:
> > > - verified signature and shasum of release files [OK]
> > > - build from source [OK]
> > > - started standalone cluster, sql-client [mostly OK except one issue]
> > >   - played with sql-client
> > >   - played with new features: LIKE / Table Options
> > >   - checked Web UI functionality
> > >   - canceled job from UI
> > >
> > > While playing with the new table factories, I found one issue [1] that
> > > surprised me.
> > > I don't think this should be a blocker, so I'll still vote +1.
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-18487
> > >
> > > On Sun, Jul 5, 2020 at 1:10 PM Zhijiang <[hidden email]> wrote:
> > >
> > > > Hi Thomas,
> > > >
> > > > Regarding [2], there is more detail in the Jira description (
> > > > https://issues.apache.org/jira/browse/FLINK-16404).
> > > >
> > > > I can also give a basic explanation here to address the concern.
> > > > 1. Before 1.11, the buffers following the barrier were cached on the
> > > > downstream side until alignment finished.
> > > > 2. In 1.11, the upstream does not send the buffers that follow the
> > > > barrier. When the downstream finishes the alignment, it notifies the
> > > > upstream to continue sending the following buffers, since it can
> > > > process them after alignment.
> > > > 3. The only difference is whether the temporarily blocked buffers are
> > > > cached on the downstream side or on the upstream side before
> > > > alignment.
> > > > 4. The side effect is an additional notification cost for every
> > > > barrier alignment. If the downstream and upstream are deployed in
> > > > separate TaskManagers, the cost is the network transport delay (the
> > > > effect was negligible in our testing with a 1s checkpoint interval).
> > > > With slot sharing, as in your case, the cost is only one method call
> > > > in the processor, which is also negligible.
> > > >
> > > > You mentioned "In this case, the downstream task has a high average
> > > > checkpoint duration (~30s, sync part)." This duration does not reflect
> > > > the changes above; it only indicates the time spent calling
> > > > `Operation.snapshotState`.
> > > > If this duration is longer than you expect, you can check or debug
> > > > whether the source/sink operations take more time to finish
> > > > `snapshotState` in practice. E.g., you can make the implementation of
> > > > this method empty to further verify the effect.
> > > >
> > > > Best,
> > > > Zhijiang
> > > >
> > > >
> > > > ------------------------------------------------------------------
> > > > From:Thomas Weise <[hidden email]>
> > > > Send Time: July 5, 2020 (Sunday) 12:22
> > > > To:dev <[hidden email]>; Zhijiang <[hidden email]>
> > > > Cc:Yingjie Cao <[hidden email]>
> > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > >
> > > > Hi Zhijiang,
> > > >
> > > > Could you please point me to more details regarding: "[2]: Delay send
> > the
> > > > following buffers after checkpoint barrier on upstream side until
> > barrier
> > > > alignment on downstream side."
> > > >
> > > > In this case, the downstream task has a high average checkpoint
> > duration
> > > > (~30s, sync part). If there was a change to hold buffers depending on
> > > > downstream performance, could this possibly apply to this case (even
> > when
> > > > there is no shuffle that would require alignment)?
> > > >
> > > > Thanks,
> > > > Thomas
> > > >
> > > >
> > > > On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <[hidden email]> wrote:
> > > >
> > > > > Hi Thomas,
> > > > >
> > > > > Thanks for the additional information.
> > > > >
> > > > > I think we can rule out the network stack changes, since in your
> > > > > case the downstream and upstream tasks are probably deployed in the
> > > > > same slot, bypassing the network data shuffle.
> > > > > I also think release-1.11 does not bring a general performance
> > > > > regression in the runtime engine, as we ran performance tests for
> > > > > all general cases [1] in a real cluster beforehand and the testing
> > > > > results should match expectations. But as far as I know, we have
> > > > > not yet tested the specific source and sink connectors.
> > > > >
> > > > > Regarding your 40% performance regression, I suspect it is related
> > > > > to specific source/sink changes (e.g. Kinesis) or to an environment
> > > > > issue in a corner case.
> > > > > If possible, it would be helpful to check whether the regression is
> > > > > caused by Kinesis, by replacing the Kinesis source and sink and
> > > > > keeping everything else the same.
> > > > >
> > > > > As you said, it would be efficient to talk with you directly next
> > > > > week to discuss this issue further. We are eager to provide any
> > > > > help needed to resolve it soon.
> > > > >
> > > > > Besides that, I think this issue should not block the release,
> > > > > since it is probably a corner case based on the current analysis.
> > > > > If we conclude that anything needs to be fixed after the final
> > > > > release, we can make the next minor release, 1.11.1, come soon.
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/FLINK-18433
> > > > >
> > > > > Best,
> > > > > Zhijiang
> > > > >
> > > > >
> > > > > ------------------------------------------------------------------
> > > > > From:Thomas Weise <[hidden email]>
> > > > > Send Time: July 4, 2020 (Saturday) 12:26
> > > > > To:dev <[hidden email]>; Zhijiang <[hidden email]>
> > > > > Cc:Yingjie Cao <[hidden email]>
> > > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > > >
> > > > > Hi Zhijiang,
> > > > >
> > > > > It will probably be best if we connect next week and discuss the
> > issue
> > > > > directly since this could be quite difficult to reproduce.
> > > > >
> > > > > Before the testing result on our side comes out for your respective
> > job
> > > > > case, I have some other questions to confirm for further analysis:
> > > > >     -  How much percentage regression you found after switching to
> > > 1.11?
> > > > >
> > > > > ~40% throughput decline
> > > > >
> > > > >     -  Are there any network bottleneck in your cluster? E.g. the
> > > network
> > > > > bandwidth is full caused by other jobs? If so, it might have more
> > > effects
> > > > > by above [2]
> > > > >
> > > > > The test runs on a k8s cluster that is also used for other
> > > > > production jobs. There is no reason to believe the network is the
> > > > > bottleneck.
> > > > >
> > > > >     -  Did you adjust the default network buffer setting? E.g.
> > > > > "taskmanager.network.memory.floating-buffers-per-gate" or
> > > > > "taskmanager.network.memory.buffers-per-channel"
> > > > >
> > > > > The job is using the defaults, i.e we don't configure the settings.
> > If
> > > > you
> > > > > want me to try specific settings in the hope that it will help to
> > > isolate
> > > > > the issue please let me know.
> > > > >
> > > > >     -  I guess the topology has three vertexes "KinesisConsumer ->
> > > > Chained
> > > > > FlatMap -> KinesisProducer", and the partition mode for
> > > "KinesisConsumer
> > > > ->
> > > > > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If so,
> > the
> > > > edge
> > > > > connection is one-to-one, not all-to-all, then the above [1][2]
> > should
> > > no
> > > > > effects in theory with default network buffer setting.
> > > > >
> > > > > There are only 2 vertices and the edge is "forward".
> > > > >
> > > > >     - By slot sharing, I guess these three vertex parallelism task
> > > would
> > > > > probably be deployed into the same slot, then the data shuffle is
> by
> > > > memory
> > > > > queue, not network stack. If so, the above [2] should no effect.
> > > > >
> > > > > Yes, vertices share slots.
> > > > >
> > > > >     - I also saw some Jira changes for kinesis in this release,
> could
> > > you
> > > > > confirm that these changes would not effect the performance?
> > > > >
> > > > > I will need to take a look. 1.10 already had a regression
> introduced
> > by
> > > > the
> > > > > Kinesis producer update.
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Thomas
> > > > >
> > > > >
> > > > > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <[hidden email]> wrote:
> > > > >
> > > > > > Hi Thomas,
> > > > > >
> > > > > > Thanks for your reply with rich information!
> > > > > >
> > > > > > We are trying to reproduce your case in our cluster to verify it
> > > > > > further, and @Yingjie Cao is working on it now.
> > > > > > As we do not have a Kinesis consumer and producer internally, we
> > > > > > will construct a common source and sink instead for the
> > > > > > backpressure case.
> > > > > >
> > > > > > Firstly, we can rule out the RocksDB factor in this release, since
> > > > > > you also mentioned that "filesystem leads to same symptoms".
> > > > > >
> > > > > > Secondly, if my understanding is right, you emphasize that the
> > > > > > regression only exists for jobs with a low checkpoint interval
> > > > > > (10s).
> > > > > > Based on that, I suspect two network-related changes in this
> > > > > > release:
> > > > > >     - [1]: Limit the maximum backlog value (default 10) in the
> > > > > > subpartition queue.
> > > > > >     - [2]: Delay sending the buffers that follow the checkpoint
> > > > > > barrier on the upstream side until barrier alignment on the
> > > > > > downstream side.
> > > > > >
> > > > > > These changes are meant to reduce the in-flight buffers to speed
> > > > > > up checkpoints, especially under backpressure.
> > > > > > In theory they should have only a very minor performance effect,
> > > > > > and we tested them in a cluster to verify that before merging
> > > > > > them, but maybe there are other corner cases we have not thought
> > > > > > of.
> > > > > >
> > > > > > Before the testing result on our side comes out for your
> respective
> > > job
> > > > > > case, I have some other questions to confirm for further
> analysis:
> > > > > >     -  How much percentage regression you found after switching
> to
> > > > 1.11?
> > > > > >     -  Are there any network bottleneck in your cluster? E.g. the
> > > > network
> > > > > > bandwidth is full caused by other jobs? If so, it might have more
> > > > effects
> > > > > > by above [2]
> > > > > >     -  Did you adjust the default network buffer setting? E.g.
> > > > > > "taskmanager.network.memory.floating-buffers-per-gate" or
> > > > > > "taskmanager.network.memory.buffers-per-channel"
> > > > > >     -  I guess the topology has three vertexes "KinesisConsumer
> ->
> > > > > Chained
> > > > > > FlatMap -> KinesisProducer", and the partition mode for
> > > > "KinesisConsumer
> > > > > ->
> > > > > > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If
> so,
> > > the
> > > > > edge
> > > > > > connection is one-to-one, not all-to-all, then the above [1][2]
> > > should
> > > > no
> > > > > > effects in theory with default network buffer setting.
> > > > > >     - By slot sharing, I guess these three vertex parallelism
> task
> > > > would
> > > > > > probably be deployed into the same slot, then the data shuffle is
> > by
> > > > > memory
> > > > > > queue, not network stack. If so, the above [2] should no effect.
> > > > > >     - I also saw some Jira changes for kinesis in this release,
> > could
> > > > you
> > > > > > confirm that these changes would not effect the performance?
> > > > > >
> > > > > > Best,
> > > > > > Zhijiang
> > > > > >
> > > > > >
> > > > > >
> ------------------------------------------------------------------
> > > > > > From:Thomas Weise <[hidden email]>
> > > > > > > Send Time:July 3, 2020 (Friday) 01:07
> > > > > > To:dev <[hidden email]>; Zhijiang <
> > [hidden email]>
> > > > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > > > >
> > > > > > Hi Zhijiang,
> > > > > >
> > > > > > The performance degradation manifests in backpressure which leads
> > to
> > > > > > growing backlog in the source. I switched a few times between
> 1.10
> > > and
> > > > > 1.11
> > > > > > and the behavior is consistent.
> > > > > >
> > > > > > The DAG is:
> > > > > >
> > > > > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map)   --------
> > forward
> > > > > > ---------> KinesisProducer
> > > > > >
> > > > > > Parallelism: 160
> > > > > > No shuffle/rebalance.
> > > > > >
> > > > > > Checkpointing config:
> > > > > >
> > > > > > Checkpointing Mode Exactly Once
> > > > > > Interval 10s
> > > > > > Timeout 10m 0s
> > > > > > Minimum Pause Between Checkpoints 10s
> > > > > > Maximum Concurrent Checkpoints 1
> > > > > > Persist Checkpoints Externally Enabled (delete on cancellation)
> > > > > >
> > > > > > State backend: rocksdb  (filesystem leads to same symptoms)
> > > > > > Checkpoint size is tiny (500KB)
> > > > > >
> > > > > > An interesting difference to another job that I had upgraded
> > > > successfully
> > > > > > is the low checkpointing interval.
> > > > > >
> > > > > > Thanks,
> > > > > > Thomas
> > > > > >
> > > > > >
> > > > > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <
> > [hidden email]
> > > > > > .invalid>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Thomas,
> > > > > > >
> > > > > > > Thanks for the efficient feedback.
> > > > > > >
> > > > > > > Regarding the suggestion of adding the release notes document,
> I
> > > > agree
> > > > > > > with your point. Maybe we should adjust the vote template
> > > accordingly
> > > > > in
> > > > > > > the respective wiki to guide the following release processes.
> > > > > > >
> > > > > > > Regarding the performance regression, could you provide some
> more
> > > > > details
> > > > > > > for our better measurement or reproducing on our sides?
> > > > > > > E.g. I guess the topology only includes two vertexes source and
> > > sink?
> > > > > > > What is the parallelism for every vertex?
> > > > > > > The upstream shuffles data to the downstream via rebalance
> > > > partitioner
> > > > > or
> > > > > > > other?
> > > > > > > The checkpoint mode is exactly-once with rocksDB state backend?
> > > > > > > The backpressure happened in this case?
> > > > > > > How much percentage regression in this case?
> > > > > > >
> > > > > > > Best,
> > > > > > > Zhijiang
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > ------------------------------------------------------------------
> > > > > > > From:Thomas Weise <[hidden email]>
> > > > > > > > Send Time:July 2, 2020 (Thursday) 09:54
> > > > > > > To:dev <[hidden email]>
> > > > > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > > > > >
> > > > > > > Hi Till,
> > > > > > >
> > > > > > > Yes, we don't have the setting in flink-conf.yaml.
> > > > > > >
> > > > > > > Generally, we carry forward the existing configuration and any
> > > change
> > > > > to
> > > > > > > default configuration values would impact the upgrade.
> > > > > > >
> > > > > > > Yes, since it is an incompatible change I would state it in the
> > > > release
> > > > > > > notes.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Thomas
> > > > > > >
> > > > > > > BTW I found a performance regression while trying to upgrade
> > > another
> > > > > > > pipeline with this RC. It is a simple Kinesis to Kinesis job.
> > > Wasn't
> > > > > able
> > > > > > > to pin it down yet, symptoms include increased checkpoint
> > alignment
> > > > > time.
> > > > > > >
> > > > > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <
> > > [hidden email]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Thomas,
> > > > > > > >
> > > > > > > > just to confirm: When starting the image in local mode, then
> > you
> > > > > don't
> > > > > > > have
> > > > > > > > any of the JobManager memory configuration settings
> configured
> > in
> > > > the
> > > > > > > > effective flink-conf.yaml, right? Does this mean that you
> have
> > > > > > explicitly
> > > > > > > > removed `jobmanager.heap.size: 1024m` from the default
> > > > configuration?
> > > > > > If
> > > > > > > > this is the case, then I believe it was more of an
> > unintentional
> > > > > > artifact
> > > > > > > > that it worked before and it has been corrected now so that
> one
> > > > needs
> > > > > > to
> > > > > > > > specify the memory of the JM process explicitly. Do you think
> > it
> > > > > would
> > > > > > > help
> > > > > > > > to explicitly state this in the release notes?
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Till
> > > > > > > >
> > > > > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <[hidden email]>
> > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for preparing another RC!
> > > > > > > > >
> > > > > > > > > As mentioned in the previous RC thread, it would be super
> > > helpful
> > > > > if
> > > > > > > the
> > > > > > > > > release notes that are part of the documentation can be
> > > included
> > > > > [1].
> > > > > > > > It's
> > > > > > > > > a significant time-saver to have read those first.
> > > > > > > > >
> > > > > > > > > I found one more non-backward compatible change that would
> be
> > > > worth
> > > > > > > > > addressing/mentioning:
> > > > > > > > >
> > > > > > > > > It is now necessary to configure the jobmanager heap size
> in
> > > > > > > > > flink-conf.yaml (with either jobmanager.heap.size
> > > > > > > > > or jobmanager.memory.heap.size). Why would I not want to do
> > > that
> > > > > > > anyways?
> > > > > > > > > Well, we set it dynamically for a cluster deployment via
> the
> > > > > > > > > flinkk8soperator, but the container image can also be used
> > for
> > > > > > testing
> > > > > > > > with
> > > > > > > > > local mode (./bin/jobmanager.sh start-foreground local).
> That
> > > > will
> > > > > > fail
> > > > > > > > if
> > > > > > > > > the heap wasn't configured and that's how I noticed it.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Thomas
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
> > > > > > > > >
> > > > > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <
> > > > > [hidden email]
> > > > > > > > > .invalid>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi everyone,
> > > > > > > > > >
> > > > > > > > > > Please review and vote on the release candidate #4 for
> the
> > > > > version
> > > > > > > > > 1.11.0,
> > > > > > > > > > as follows:
> > > > > > > > > > [ ] +1, Approve the release
> > > > > > > > > > [ ] -1, Do not approve the release (please provide
> specific
> > > > > > comments)
> > > > > > > > > >
> > > > > > > > > > The complete staging area is available for your review,
> > which
> > > > > > > includes:
> > > > > > > > > > * JIRA release notes [1],
> > > > > > > > > > * the official Apache source release and binary
> convenience
> > > > > > releases
> > > > > > > to
> > > > > > > > > be
> > > > > > > > > > deployed to dist.apache.org [2], which are signed with
> the
> > > key
> > > > > > with
> > > > > > > > > > fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
> > > > > > > > > > * all artifacts to be deployed to the Maven Central
> > > Repository
> > > > > [4],
> > > > > > > > > > * source code tag "release-1.11.0-rc4" [5],
> > > > > > > > > > * website pull request listing the new release and adding
> > > > > > > announcement
> > > > > > > > > > blog post [6].
> > > > > > > > > >
> > > > > > > > > > The vote will be open for at least 72 hours. It is
> adopted
> > by
> > > > > > > majority
> > > > > > > > > > approval, with at least 3 PMC affirmative votes.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Release Manager
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> > > > > > > > > > [2]
> > > > > https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
> > > > > > > > > > [3]
> https://dist.apache.org/repos/dist/release/flink/KEYS
> > > > > > > > > > [4]
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://repository.apache.org/content/repositories/orgapacheflink-1377/
> > > > > > > > > > [5]
> > > > > > https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
> > > > > > > > > > [6] https://github.com/apache/flink-web/pull/352
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > > --
> > >
> > > Best,
> > > Benchao Li
> > >
> >
>
>
> --
> Best, Jingsong Lee
>
>

Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

Thomas Weise
In reply to this post by Zhijiang(wangzhijiang999)
+ dev@ for visibility

I will investigate further today.


On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <[hidden email]> wrote:

> On 06.07.20 20:39, Stephan Ewen wrote:
> >    - Did sink checkpoint notifications change in a relevant way, for
> example
> > due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
>
> I think that's unrelated: the Kafka fixes were isolated in Kafka and the
> one bug I discovered on the way was about the Task reaper.
>
>
> On 07.07.20 17:51, Zhijiang wrote:
> > Sorry for my misunderstanding of the previous information, Thomas. I was
> > assuming that the sync checkpoint duration had increased after the upgrade,
> > as was mentioned before.
> >
> > If I remember correctly, the memory state backend also has the same
> > issue? If so, we can dismiss the RocksDB state changes. As slot sharing is
> > enabled, the downstream and upstream tasks are probably deployed into the
> > same slot, so there is no network shuffle effect.
> >
> > I think we need to find out whether any other symptoms changed
> > besides the performance regression, to further narrow down the scope.
> > E.g. any metrics changes, or deployment changes such as the number of
> > TaskManagers and the number of slots per TaskManager.
> > A 40% regression is really big; I guess the changes should also be
> > reflected in other places.
> >
> > I am not sure whether we can reproduce the regression in our AWS
> > environment by writing arbitrary Kinesis jobs, since, as Thomas mentioned,
> > there are also Kinesis jobs that behave normally after the upgrade.
> > So it looks like this touches some corner case. I am very willing
> > to provide any help with debugging if possible.
> >
> >
> > Best,
> > Zhijiang
> >
> >
> > ------------------------------------------------------------------
> > From:Thomas Weise <[hidden email]>
> > Send Time:July 7, 2020 (Tuesday) 23:01
> > To:Stephan Ewen <[hidden email]>
> > Cc:Aljoscha Krettek <[hidden email]>; Arvid Heise <
> [hidden email]>; Zhijiang <[hidden email]>
> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> release candidate #4)
> >
> > We are deploying our apps with FlinkK8sOperator. We have one job that
> works as expected after the upgrade and the one discussed here that has the
> performance regression.
> >
> > "The performance regression is obviously caused by the long duration of the
> > sync checkpoint process in the Kinesis sink operator, which would block
> > normal data processing until the source is backpressured."
> >
> > That's a constant. The sync checkpointing time is the same before (1.10) and
> > after the upgrade. The question is what change came in with the upgrade.
> >
> >
> >
> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <[hidden email]> wrote:
> >
> > @Thomas Just one thing real quick: Are you using the standalone setup
> > scripts (like start-cluster.sh, and the former "slaves" file)?
> > Be aware that this file is now called "workers", to avoid sensitive
> > names.
> > In one internal benchmark we saw quite a lot of slowdown initially,
> > before seeing that the cluster was not a distributed cluster any more ;-)
> >
> >
> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <[hidden email]>
> wrote:
> > Thanks for kicking this off and helping with the analysis, Stephan!
> > Thanks for the further feedback and investigation, Thomas!
> >
> > The performance regression is obviously caused by the long duration of the
> > sync checkpoint process in the Kinesis sink operator, which would block
> > normal data processing until the source is backpressured.
> > Maybe we could dig into the sync part of the checkpoint,
> > e.g. break down the steps inside the respective operator#snapshotState and
> > measure which operation costs most of the time. Then
> > we might find the root cause of that cost.
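
A minimal sketch of such a breakdown, assuming a hypothetical helper class (`SnapshotStepTimer` is not part of Flink) and placeholder step names. In a real sink, the timed bodies would be whatever calls its `snapshotState` actually makes:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Flink: accumulates wall-clock time per named step. */
final class SnapshotStepTimer {
    // Insertion-ordered so the report lists steps in the order they first ran.
    private final Map<String, Long> nanosByStep = new LinkedHashMap<>();

    /** Runs the step and adds its elapsed time to the running total for that name. */
    void time(String step, Runnable body) {
        long start = System.nanoTime();
        try {
            body.run();
        } finally {
            nanosByStep.merge(step, System.nanoTime() - start, Long::sum);
        }
    }

    /** Returns the step with the largest accumulated time, or null if nothing was timed. */
    String slowestStep() {
        String slowest = null;
        long max = -1L;
        for (Map.Entry<String, Long> entry : nanosByStep.entrySet()) {
            if (entry.getValue() > max) {
                max = entry.getValue();
                slowest = entry.getKey();
            }
        }
        return slowest;
    }

    /** Returns a copy of the accumulated nanoseconds per step. */
    Map<String, Long> totals() {
        return new LinkedHashMap<>(nanosByStep);
    }

    public static void main(String[] args) {
        SnapshotStepTimer timer = new SnapshotStepTimer();
        // Placeholder steps; the names and bodies are assumptions for illustration only.
        timer.time("checkAsyncErrors", () -> { });
        timer.time("flushPendingRecords", () -> pause(50));
        timer.totals().forEach((step, nanos) ->
                System.out.printf("%s: %d ms%n", step, nanos / 1_000_000));
        System.out.println("slowest: " + timer.slowestStep());
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Logging `totals()` once per checkpoint on both 1.10 and 1.11 would show whether any single step inside the sync part differs between the two versions.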
> >
> > Looking forward to further progress. :)
> >
> > Best,
> > Zhijiang
> >
> > ------------------------------------------------------------------
> > From:Stephan Ewen <[hidden email]>
> > Send Time:July 7, 2020 (Tuesday) 14:52
> > To:Thomas Weise <[hidden email]>
> > Cc:Stephan Ewen <[hidden email]>; Zhijiang <[hidden email]>;
> Aljoscha Krettek <[hidden email]>; Arvid Heise <[hidden email]>
> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> release candidate #4)
> >
> > Thank you for digging so deeply.
> > A mysterious thing, this regression.
> >
> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <[hidden email]> wrote:
> > @Stephan: yes, I refer to sync time in the web UI (it is unchanged
> between 1.10 and 1.11 for the specific pipeline).
> >
> > I verified that increasing the checkpointing interval does not make a
> difference.
> >
> > I looked at the Kinesis connector changes since 1.10.1 and don't see
> anything that could cause this.
> >
> > Another pipeline that is using the Kinesis consumer (but not the
> producer) performs as expected.
> >
> > I tried reverting the AWS SDK version change, symptoms remain unchanged:
> >
> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml b/flink-connectors/flink-connector-kinesis/pom.xml
> > index a6abce23ba..741743a05e 100644
> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
> > @@ -33,7 +33,7 @@ under the License.
> >          <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
> >          <name>flink-connector-kinesis</name>
> >          <properties>
> > -               <aws.sdk.version>1.11.754</aws.sdk.version>
> > +               <aws.sdk.version>1.11.603</aws.sdk.version>
> >                 <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
> >                 <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
> >                 <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
> >
> > I'm planning to take a look with a profiler next.
> >
> > Thomas
> >
> >
> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <[hidden email]> wrote:
> > Hi all!
> >
> > Forking this thread out of the release vote thread.
> >  From what Thomas describes, it really sounds like a sink-specific issue.
> >
> > @Thomas: When you say the sink has a long synchronous checkpoint time, do you
> > mean the time that is shown as "sync time" in the metrics and web UI? That
> > does not include any network buffer related operations. It is purely the
> > operator's time.
> >
> > Can we dig into the changes we did in sinks:
> >    - Kinesis version upgrade, AWS library updates
> >
> >    - Could it be that some call (checkpoint complete) that was
> > previously (1.10) in a separate thread is now in the mailbox and this
> > simply reduces the number of threads that do the work?
> >
> >    - Did sink checkpoint notifications change in a relevant way, for
> example due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
> >
> > Best,
> > Stephan
> >
> >
> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <[hidden email]>
> wrote:
> > Hi Thomas,
> >
> >   Regarding [2], there are more details in the Jira description (
> > https://issues.apache.org/jira/browse/FLINK-16404).
> >
> >   I can also give a basic explanation here to dismiss the concern.
> >   1. In the past, the buffers following the barrier were cached
> > on the downstream side before alignment.
> >   2. In 1.11, the upstream does not send the buffers following the barrier.
> > When the downstream finishes the alignment, it notifies the upstream
> > to continue sending the following buffers, since it can process them after
> > alignment.
> >   3. The only difference is whether the temporarily blocked buffers are
> > cached on the downstream side or on the upstream side before alignment.
> >   4. The side effect is the additional notification cost for every
> > barrier alignment. If the downstream and upstream are deployed in separate
> > TaskManagers, the cost is the network transport delay (the effect can be
> > ignored based on our testing with a 1s checkpoint interval). With slot
> > sharing, as in your case, the cost is only one method call in the
> > processor, which can also be ignored.
> >
> >   You mentioned "In this case, the downstream task has a high average
> > checkpoint duration(~30s, sync part)." This duration does not reflect the
> > changes above; it only indicates the duration of calling
> > `Operation.snapshotState`.
> >   If this duration is beyond your expectation, you can check or debug
> > whether the source/sink operators take more time to finish
> > `snapshotState` in practice. E.g. you can
> >   make the implementation of this method empty to further verify the
> > effect.
> >
> >   Best,
> >   Zhijiang
> >
> >
> >   ------------------------------------------------------------------
> >   From:Thomas Weise <[hidden email]>
> >   Send Time:July 5, 2020 (Sunday) 12:22
> >   To:dev <[hidden email]>; Zhijiang <[hidden email]>
> >   Cc:Yingjie Cao <[hidden email]>
> >   Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> >
> >   Hi Zhijiang,
> >
> >   Could you please point me to more details regarding: "[2]: Delay send
> the
> >   following buffers after checkpoint barrier on upstream side until
> barrier
> >   alignment on downstream side."
> >
> >   In this case, the downstream task has a high average checkpoint
> duration
> >   (~30s, sync part). If there was a change to hold buffers depending on
> >   downstream performance, could this possibly apply to this case (even
> when
> >   there is no shuffle that would require alignment)?
> >
> >   Thanks,
> >   Thomas
> >
> >
> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <[hidden email]
> .invalid>
> >   wrote:
> >
> >   > Hi Thomas,
> >   >
> >   > Thanks for the further update.
> >   >
> >   > I guess we can dismiss the network stack changes, since in your case
> >   > the downstream and upstream would probably be deployed in the same slot,
> >   > bypassing the network data shuffle.
> >   > Also I guess release-1.11 will not bring a general performance
> >   > regression in the runtime engine, as we also did performance testing for
> >   > all general cases by [1] in a real cluster before, and the testing results
> >   > should fit the expectation. But we indeed did not test the specific source
> >   > and sink connectors yet, as far as I know.
> >   >
> >   > Regarding your 40% performance regression, I suspect it is
> >   > related to specific source/sink changes (e.g. kinesis) or environment
> >   > issues with a corner case.
> >   > If possible, it would be helpful to further locate whether the
> >   > regression is caused by kinesis, by replacing the kinesis source & sink
> >   > and keeping the rest the same.
> >   >
> >   > As you said, it would be efficient to contact you directly next
> >   > week to further discuss this issue. And we are willing/eager to provide
> >   > any help to resolve this issue soon.
> >   >
> >   > Besides that, I guess this issue should not be a blocker for the
> >   > release, since it is probably a corner case based on the current
> >   > analysis.
> >   > If we really conclude that anything needs to be resolved after the final
> >   > release, then we can also make the next minor release-1.11.1 come
> >   > soon.
> >   >
> >   > [1] https://issues.apache.org/jira/browse/FLINK-18433
> >   >
> >   > Best,
> >   > Zhijiang
> >   >
> >   >
> >   > ------------------------------------------------------------------
> >   > From:Thomas Weise <[hidden email]>
> >   > Send Time:July 4, 2020 (Saturday) 12:26
> >   > To:dev <[hidden email]>; Zhijiang <[hidden email]>
> >   > Cc:Yingjie Cao <[hidden email]>
> >   > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> >   >
> >   > Hi Zhijiang,
> >   >
> >   > It will probably be best if we connect next week and discuss the
> issue
> >   > directly since this could be quite difficult to reproduce.
> >   >
> >   > Before the testing result on our side comes out for your respective
> job
> >   > case, I have some other questions to confirm for further analysis:
> >   >     -  How much percentage regression you found after switching to
> 1.11?
> >   >
> >   > ~40% throughput decline
> >   >
> >   >     -  Are there any network bottleneck in your cluster? E.g. the
> network
> >   > bandwidth is full caused by other jobs? If so, it might have more
> effects
> >   > by above [2]
> >   >
> >   > The test runs on a k8s cluster that is also used for other
> >   > production jobs.
> >   > There is no reason to believe the network is the bottleneck.
> >   >
> >   >     -  Did you adjust the default network buffer setting? E.g.
> >   > "taskmanager.network.memory.floating-buffers-per-gate" or
> >   > "taskmanager.network.memory.buffers-per-channel"
> >   >
> >   > The job is using the defaults, i.e. we don't configure the settings.
> >   > If you want me to try specific settings in the hope that it will help to
> >   > isolate the issue, please let me know.
> >   >
> >   >     -  I guess the topology has three vertexes "KinesisConsumer ->
> Chained
> >   > FlatMap -> KinesisProducer", and the partition mode for
> "KinesisConsumer ->
> >   > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If so,
> the edge
> >   > connection is one-to-one, not all-to-all, then the above [1][2]
> should no
> >   > effects in theory with default network buffer setting.
> >   >
> >   > There are only 2 vertices and the edge is "forward".
> >   >
> >   >     - By slot sharing, I guess these three vertex parallelism task
> would
> >   > probably be deployed into the same slot, then the data shuffle is by
> memory
> >   > queue, not network stack. If so, the above [2] should no effect.
> >   >
> >   > Yes, vertices share slots.
> >   >
> >   >     - I also saw some Jira changes for kinesis in this release,
> could you
> >   > confirm that these changes would not effect the performance?
> >   >
> >   > I will need to take a look. 1.10 already had a regression introduced
> by the
> >   > Kinesis producer update.
> >   >
> >   >
> >   > Thanks,
> >   > Thomas
> >   >
> >   >
> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <[hidden email]
> >   > .invalid>
> >   > wrote:
> >   >
> >   > > Hi Thomas,
> >   > >
> >   > > Thanks for your reply with rich information!
> >   > >
> >   > > We are trying to reproduce your case in our cluster to further
> verify it,
> >   > > and  @Yingjie Cao is working on it now.
> >   > >  As we have not kinesis consumer and producer internally, so we
> will
> >   > > construct the common source and sink instead in the case of
> backpressure.
> >   > >
> >   > > Firstly, we can dismiss the rocksdb factor in this release, since
> >   > > you also mentioned that "filesystem leads to same symptoms".
> >   > >
> >   > > Secondly, if my understanding is right, you emphasize that the
> >   > > regression only exists for jobs with a low checkpoint interval (10s).
> >   > > Based on that, I have two suspicions with the network related
> changes in
> >   > > this release:
> >   > >     - [1]: Limited the maximum backlog value (default 10) in
> subpartition
> >   > > queue.
> >   > >     - [2]: Delay send the following buffers after checkpoint
> barrier on
> >   > > upstream side until barrier alignment on downstream side.
> >   > >
> >   > > These changes are motivated for reducing the in-flight buffers to
> speedup
> >   > > checkpoint especially in the case of backpressure.
> >   > > In theory they should have very minor performance effect and
> actually we
> >   > > also tested in cluster to verify within expectation before merging
> them,
> >   > >  but maybe there are other corner cases we have not thought of
> before.
> >   > >
> >   > > Before the testing result on our side comes out for your
> respective job
> >   > > case, I have some other questions to confirm for further analysis:
> >   > >     -  How much percentage regression you found after switching to
> 1.11?
> >   > >     -  Are there any network bottleneck in your cluster? E.g. the
> network
> >   > > bandwidth is full caused by other jobs? If so, it might have more
> effects
> >   > > by above [2]
> >   > >     -  Did you adjust the default network buffer setting? E.g.
> >   > > "taskmanager.network.memory.floating-buffers-per-gate" or
> >   > > "taskmanager.network.memory.buffers-per-channel"
> >   > >     -  I guess the topology has three vertexes "KinesisConsumer ->
> >   > Chained
> >   > > FlatMap -> KinesisProducer", and the partition mode for
> "KinesisConsumer
> >   > ->
> >   > > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If so,
> the
> >   > edge
> >   > > connection is one-to-one, not all-to-all, then the above [1][2]
> should no
> >   > > effects in theory with default network buffer setting.
> >   > >     - By slot sharing, I guess these three vertex parallelism task
> would
> >   > > probably be deployed into the same slot, then the data shuffle is
> by
> >   > memory
> >   > > queue, not network stack. If so, the above [2] should no effect.
> >   > >     - I also saw some Jira changes for kinesis in this release,
> could you
> >   > > confirm that these changes would not effect the performance?
> >   > >
> >   > > Best,
> >   > > Zhijiang
> >   > >
> >   > >
> >   > > ------------------------------------------------------------------
> >   > > From:Thomas Weise <[hidden email]>
> >   > > Send Time:July 3, 2020 (Friday) 01:07
> >   > > To:dev <[hidden email]>; Zhijiang <
> [hidden email]>
> >   > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> >   > >
> >   > > Hi Zhijiang,
> >   > >
> >   > > The performance degradation manifests in backpressure which leads
> to
> >   > > growing backlog in the source. I switched a few times between 1.10
> and
> >   > 1.11
> >   > > and the behavior is consistent.
> >   > >
> >   > > The DAG is:
> >   > >
> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map)   --------
> forward
> >   > > ---------> KinesisProducer
> >   > >
> >   > > Parallelism: 160
> >   > > No shuffle/rebalance.
> >   > >
> >   > > Checkpointing config:
> >   > >
> >   > > Checkpointing Mode Exactly Once
> >   > > Interval 10s
> >   > > Timeout 10m 0s
> >   > > Minimum Pause Between Checkpoints 10s
> >   > > Maximum Concurrent Checkpoints 1
> >   > > Persist Checkpoints Externally Enabled (delete on cancellation)
> >   > >
> >   > > State backend: rocksdb  (filesystem leads to same symptoms)
> >   > > Checkpoint size is tiny (500KB)
> >   > >
> >   > > An interesting difference to another job that I had upgraded
> successfully
> >   > > is the low checkpointing interval.
> >   > >
> >   > > Thanks,
> >   > > Thomas
> >   > >
> >   > >
> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <
> [hidden email]
> >   > > .invalid>
> >   > > wrote:
> >   > >
> >   > > > Hi Thomas,
> >   > > >
> >   > > > Thanks for the efficient feedback.
> >   > > >
> >   > > > Regarding the suggestion of adding the release notes document, I
> agree
> >   > > > with your point. Maybe we should adjust the vote template
> accordingly
> >   > in
> >   > > > the respective wiki to guide the following release processes.
> >   > > >
> >   > > > Regarding the performance regression, could you provide some more
> >   > details
> >   > > > for our better measurement or reproducing on our sides?
> >   > > > E.g. I guess the topology only includes two vertexes source and
> sink?
> >   > > > What is the parallelism for every vertex?
> >   > > > The upstream shuffles data to the downstream via rebalance
> partitioner
> >   > or
> >   > > > other?
> >   > > > The checkpoint mode is exactly-once with rocksDB state backend?
> >   > > > The backpressure happened in this case?
> >   > > > How much percentage regression in this case?
> >   > > >
> >   > > > Best,
> >   > > > Zhijiang
> >   > > >
> >   > > >
> >   > > >
> >   > > >
> ------------------------------------------------------------------
> >   > > > From:Thomas Weise <[hidden email]>
> >   > > > Send Time:July 2, 2020 (Thursday) 09:54
> >   > > > To:dev <[hidden email]>
> >   > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> >   > > >
> >   > > > Hi Till,
> >   > > >
> >   > > > Yes, we don't have the setting in flink-conf.yaml.
> >   > > >
> >   > > > Generally, we carry forward the existing configuration and any
> change
> >   > to
> >   > > > default configuration values would impact the upgrade.
> >   > > >
> >   > > > Yes, since it is an incompatible change I would state it in the
> release
> >   > > > notes.
> >   > > >
> >   > > > Thanks,
> >   > > > Thomas
> >   > > >
> >   > > > BTW I found a performance regression while trying to upgrade
> another
> >   > > > pipeline with this RC. It is a simple Kinesis to Kinesis job.
> Wasn't
> >   > able
> >   > > > to pin it down yet, symptoms include increased checkpoint
> alignment
> >   > time.
> >   > > >
> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <
> [hidden email]>
> >   > > > wrote:
> >   > > >
> >   > > > > Hi Thomas,
> >   > > > >
> >   > > > > just to confirm: When starting the image in local mode, then
> you
> >   > don't
> >   > > > have
> >   > > > > any of the JobManager memory configuration settings configured
> in the
> >   > > > > effective flink-conf.yaml, right? Does this mean that you have
> >   > > explicitly
> >   > > > > removed `jobmanager.heap.size: 1024m` from the default
> configuration?
> >   > > If
> >   > > > > this is the case, then I believe it was more of an
> unintentional
> >   > > artifact
> >   > > > > that it worked before and it has been corrected now so that
> one needs
> >   > > to
> >   > > > > specify the memory of the JM process explicitly. Do you think
> it
> >   > would
> >   > > > help
> >   > > > > to explicitly state this in the release notes?
> >   > > > >
> >   > > > > Cheers,
> >   > > > > Till
> >   > > > >
> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <[hidden email]>
> wrote:
> >   > > > >
> >   > > > > > Thanks for preparing another RC!
> >   > > > > >
> >   > > > > > As mentioned in the previous RC thread, it would be super
> helpful
> >   > if
> >   > > > the
> >   > > > > > release notes that are part of the documentation can be
> included
> >   > [1].
> >   > > > > It's
> >   > > > > > a significant time-saver to have read those first.
> >   > > > > >
> >   > > > > > I found one more non-backward compatible change that would
> be worth
> >   > > > > > addressing/mentioning:
> >   > > > > >
> >   > > > > > It is now necessary to configure the jobmanager heap size in
> >   > > > > > flink-conf.yaml (with either jobmanager.heap.size
> >   > > > > > or jobmanager.memory.heap.size). Why would I not want to do
> that
> >   > > > anyways?
> >   > > > > > Well, we set it dynamically for a cluster deployment via the
> >   > > > > > flinkk8soperator, but the container image can also be used
> for
> >   > > testing
> >   > > > > with
> >   > > > > > local mode (./bin/jobmanager.sh start-foreground local).
> That will
> >   > > fail
> >   > > > > if
> >   > > > > > the heap wasn't configured and that's how I noticed it.
> >   > > > > >
> >   > > > > > Thanks,
> >   > > > > > Thomas
> >   > > > > >
> >   > > > > > [1]
> >   > > > > >
> >   > > > > >
> >   > > > >
> >   > > >
> >   > >
> >   >
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
> >   > > > > >
> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <
> >   > [hidden email]
> >   > > > > > .invalid>
> >   > > > > > wrote:
> >   > > > > >
> >   > > > > > > Hi everyone,
> >   > > > > > >
> >   > > > > > > Please review and vote on the release candidate #4 for the
> >   > version
> >   > > > > > 1.11.0,
> >   > > > > > > as follows:
> >   > > > > > > [ ] +1, Approve the release
> >   > > > > > > [ ] -1, Do not approve the release (please provide specific
> >   > > comments)
> >   > > > > > >
> >   > > > > > > The complete staging area is available for your review,
> which
> >   > > > includes:
> >   > > > > > > * JIRA release notes [1],
> >   > > > > > > * the official Apache source release and binary convenience
> >   > > releases
> >   > > > to
> >   > > > > > be
> >   > > > > > > deployed to dist.apache.org [2], which are signed with
> the key
> >   > > with
> >   > > > > > > fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
> >   > > > > > > * all artifacts to be deployed to the Maven Central
> Repository
> >   > [4],
> >   > > > > > > * source code tag "release-1.11.0-rc4" [5],
> >   > > > > > > * website pull request listing the new release and adding
> >   > > > announcement
> >   > > > > > > blog post [6].
> >   > > > > > >
> >   > > > > > > The vote will be open for at least 72 hours. It is adopted
> by
> >   > > > majority
> >   > > > > > > approval, with at least 3 PMC affirmative votes.
> >   > > > > > >
> >   > > > > > > Thanks,
> >   > > > > > > Release Manager
> >   > > > > > >
> >   > > > > > > [1]
> >   > > > > > >
> >   > > > > >
> >   > > > >
> >   > > >
> >   > >
> >   >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> >   > > > > > > [2]
> >   > https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
> >   > > > > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> >   > > > > > > [4]
> >   > > > > > >
> >   > > > >
> >   > >
> https://repository.apache.org/content/repositories/orgapacheflink-1377/
> >   > > > > > > [5]
> >   > > https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
> >   > > > > > > [6] https://github.com/apache/flink-web/pull/352
> >   > > > > > >
> >   > > > > > >
> >   > > > > >
> >   > > > >
> >   > > >
> >   > > >
> >   > >
> >   > >
> >   >
> >   >
> >
> >
> >
>
>

Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

Thomas Weise
Sorry for the delay.

I confirmed that the regression is due to the sink (unsurprising, since
another job with the same consumer, but not the producer, runs as expected).

As promised, I did CPU profiling on the problematic application, which gives
more insight into the regression [1].

The screenshots show that the average time for snapshotState increases from
~9s to ~28s. The data also shows the increase in sleep time during
snapshotState.

Does anyone, based on changes made in 1.11, have a theory why?

I had previously looked at the changes to the Kinesis connector and also
reverted the SDK upgrade, which did not change the situation.

It will likely be necessary to drill into the sink / checkpointing details
to understand the cause of the problem.
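
One way to drill in without a profiler is to time the individual steps of the sink's snapshot path separately. The following is only a minimal sketch under assumptions: `PhaseTimer` is a hypothetical helper, not part of Flink or the Kinesis connector, and the phase names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical helper (not a Flink class): accumulates elapsed time per named phase. */
public class PhaseTimer {
    private final Map<String, Long> totalsNanos = new LinkedHashMap<>();
    private final LongSupplier clock;

    /** The clock is injectable so the helper can be tested deterministically. */
    public PhaseTimer(LongSupplier clock) {
        this.clock = clock;
    }

    public PhaseTimer() {
        this(System::nanoTime);
    }

    /** Runs the work and adds its duration to the running total of the given phase. */
    public void time(String phase, Runnable work) {
        long start = clock.getAsLong();
        try {
            work.run();
        } finally {
            totalsNanos.merge(phase, clock.getAsLong() - start, Long::sum);
        }
    }

    public long totalNanos(String phase) {
        return totalsNanos.getOrDefault(phase, 0L);
    }

    public Map<String, Long> totals() {
        return new LinkedHashMap<>(totalsNanos);
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        timer.time("flush", () -> { /* stand-in for a flush call */ });
        timer.time("sleep", () -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        timer.totals().forEach((phase, nanos) ->
                System.out.println(phase + ": " + nanos / 1_000_000 + " ms"));
    }
}
```

Wrapping each step of a snapshot in `timer.time(...)` and logging `totals()` per checkpoint would show which step accounts for the increase.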

Let me know if anyone has specific questions that I can answer from the
profiling results.

Thomas

[1]
https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing

On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <[hidden email]> wrote:

> + dev@ for visibility
>
> I will investigate further today.
>
>
> On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <[hidden email]>
> wrote:
>
>> On 06.07.20 20:39, Stephan Ewen wrote:
>> >    - Did sink checkpoint notifications change in a relevant way, for
>> example
>> > due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
>>
>> I think that's unrelated: the Kafka fixes were isolated in Kafka and the
>> one bug I discovered on the way was about the Task reaper.
>>
>>
>> On 07.07.20 17:51, Zhijiang wrote:
>> > Sorry for my misunderstanding of the previous information, Thomas. I had
>> > assumed that the sync checkpoint duration increased after the upgrade, as
>> > was mentioned before.
>> >
>> > If I remember correctly, the memory state backend also shows the same
>> > issue? If so, we can rule out the RocksDB state changes. With slot sharing
>> > enabled, the downstream and upstream tasks are probably deployed into the
>> > same slot, so there is no network shuffle effect.
>> >
>> > I think we need to find out whether any other symptoms changed besides
>> > the performance regression, to further narrow down the scope, e.g. any
>> > metrics changes, or deployment changes in the number of TaskManagers and
>> > the number of slots per TaskManager.
>> > A 40% regression is really big, so I guess the changes should also be
>> > reflected in other places.
>> >
>> > I am not sure whether we can reproduce the regression in our AWS
>> > environment by writing arbitrary Kinesis jobs, since, as Thomas mentioned,
>> > other Kinesis jobs run normally after the upgrade.
>> > So it probably hits some corner case. I am very willing to provide any
>> > help for debugging if possible.
>> >
>> >
>> > Best,
>> > Zhijiang
>> >
>> >
>> > ------------------------------------------------------------------
>> > From:Thomas Weise <[hidden email]>
>> > Send Time:2020年7月7日(星期二) 23:01
>> > To:Stephan Ewen <[hidden email]>
>> > Cc:Aljoscha Krettek <[hidden email]>; Arvid Heise <
>> [hidden email]>; Zhijiang <[hidden email]>
>> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
>> release candidate #4)
>> >
>> > We are deploying our apps with FlinkK8sOperator. We have one job that
>> works as expected after the upgrade and the one discussed here that has the
>> performance regression.
>> >
>> > "The performance regression is obviously caused by the long duration of the
>> > sync checkpoint process in the Kinesis sink operator, which blocks normal
>> > data processing until the source is back-pressured."
>> >
>> > That's a constant. The sync checkpointing time is the same before (1.10)
>> > and after the upgrade. The question is what change came in with the upgrade.
>> >
>> >
>> >
>> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <[hidden email]> wrote:
>> >
>> > @Thomas Just one thing real quick: Are you using the standalone setup
>> > scripts (like start-cluster.sh, and the former "slaves" file)?
>> > Be aware that this file is now called "workers" to avoid sensitive names.
>> > In one internal benchmark we saw quite a lot of slowdown initially,
>> > before seeing that the cluster was not a distributed cluster any more ;-)
>> >
>> >
>> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <[hidden email]>
>> wrote:
>> > Thanks for this kickoff and help analysis, Stephan!
>> > Thanks for the further feedback and investigation, Thomas!
>> >
>> > The performance regression is obviously caused by the long duration of the
>> > sync checkpoint process in the Kinesis sink operator, which blocks normal
>> > data processing until the source is back-pressured.
>> > Maybe we could dig into the sync part of the checkpoint, e.g. break down
>> > the steps inside the respective operator#snapshotState to measure which
>> > operation costs most of the time. Then we can probably find the root cause
>> > of this cost.
>> >
>> > Look forward to the further progress. :)
>> >
>> > Best,
>> > Zhijiang
>> >
>> > ------------------------------------------------------------------
>> > From:Stephan Ewen <[hidden email]>
>> > Send Time:2020年7月7日(星期二) 14:52
>> > To:Thomas Weise <[hidden email]>
>> > Cc:Stephan Ewen <[hidden email]>; Zhijiang <
>> [hidden email]>; Aljoscha Krettek <[hidden email]>;
>> Arvid Heise <[hidden email]>
>> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
>> release candidate #4)
>> >
>> > Thank you for digging so deeply.
>> > A mysterious thing, this regression.
>> >
>> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <[hidden email]> wrote:
>> > @Stephan: yes, I refer to sync time in the web UI (it is unchanged
>> between 1.10 and 1.11 for the specific pipeline).
>> >
>> > I verified that increasing the checkpointing interval does not make a
>> difference.
>> >
>> > I looked at the Kinesis connector changes since 1.10.1 and don't see
>> anything that could cause this.
>> >
>> > Another pipeline that is using the Kinesis consumer (but not the
>> producer) performs as expected.
>> >
>> > I tried reverting the AWS SDK version change, symptoms remain unchanged:
>> >
>> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml b/flink-connectors/flink-connector-kinesis/pom.xml
>> > index a6abce23ba..741743a05e 100644
>> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
>> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
>> > @@ -33,7 +33,7 @@ under the License.
>> >          <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
>> >          <name>flink-connector-kinesis</name>
>> >          <properties>
>> > -               <aws.sdk.version>1.11.754</aws.sdk.version>
>> > +               <aws.sdk.version>1.11.603</aws.sdk.version>
>> >                 <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
>> >                 <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
>> >                 <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
>> >
>> > I'm planning to take a look with a profiler next.
>> >
>> > Thomas
>> >
>> >
>> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <[hidden email]> wrote:
>> > Hi all!
>> >
>> > Forking this thread out of the release vote thread.
>> >  From what Thomas describes, it really sounds like a sink-specific
>> issue.
>> >
>> > @Thomas: When you say sink has a long synchronous checkpoint time, you
>> mean the time that is shown as "sync time" on the metrics and web UI? That
>> is not including any network buffer related operations. It is purely the
>> operator's time.
>> >
>> > Can we dig into the changes we did in sinks:
>> >    - Kinesis version upgrade, AWS library updates
>> >
>> >    - Could it be that some call (checkpoint complete) that was
>> > previously (1.10) in a separate thread is now in the mailbox and this
>> > simply reduces the number of threads that do the work?
>> >
>> >    - Did sink checkpoint notifications change in a relevant way, for
>> example due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
>> >
>> > Best,
>> > Stephan
>> >
>> >
>> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <[hidden email]>
>> wrote:
>> > Hi Thomas,
>> >
>> >   Regarding [2], it has more detail infos in the Jira description (
>> https://issues.apache.org/jira/browse/FLINK-16404).
>> >
>> >   I can also give a basic explanation here to address the concern.
>> >   1. In the past, the buffers following the barrier were cached on the
>> >   downstream side before alignment.
>> >   2. In 1.11, the upstream does not send the buffers following the
>> >   barrier. When the downstream finishes the alignment, it notifies the
>> >   upstream to continue sending the following buffers, since it can
>> >   process them after alignment.
>> >   3. The only difference is whether the temporarily blocked buffers are
>> >   cached on the downstream side or on the upstream side before alignment.
>> >   4. The side effect is the additional notification cost for every
>> >   barrier alignment. If the downstream and upstream are deployed in
>> >   separate TaskManagers, the cost is the network transport delay (the
>> >   effect can be ignored based on our testing with a 1s checkpoint
>> >   interval). With slot sharing, as in your case, the cost is only one
>> >   method call in the processor, which can also be ignored.
>> >
>> >   You mentioned "In this case, the downstream task has a high average
>> >   checkpoint duration(~30s, sync part)." This duration does not reflect
>> >   the changes above; it only indicates the duration of calling
>> >   `Operation.snapshotState`.
>> >   If this duration is beyond your expectation, you can check or debug
>> >   whether the source/sink operations take more time to finish
>> >   `snapshotState` in practice. E.g. you can make the implementation of
>> >   this method empty to further verify the effect.
>> >
>> >   Best,
>> >   Zhijiang
>> >
>> >
>> >   ------------------------------------------------------------------
>> >   From:Thomas Weise <[hidden email]>
>> >   Send Time:2020年7月5日(星期日) 12:22
>> >   To:dev <[hidden email]>; Zhijiang <[hidden email]>
>> >   Cc:Yingjie Cao <[hidden email]>
>> >   Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>> >
>> >   Hi Zhijiang,
>> >
>> >   Could you please point me to more details regarding: "[2]: Delay send
>> the
>> >   following buffers after checkpoint barrier on upstream side until
>> barrier
>> >   alignment on downstream side."
>> >
>> >   In this case, the downstream task has a high average checkpoint
>> duration
>> >   (~30s, sync part). If there was a change to hold buffers depending on
>> >   downstream performance, could this possibly apply to this case (even
>> when
>> >   there is no shuffle that would require alignment)?
>> >
>> >   Thanks,
>> >   Thomas
>> >
>> >
>> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <[hidden email]
>> .invalid>
>> >   wrote:
>> >
>> >   > Hi Thomas,
>> >   >
>> >   > Thanks for the further update information.
>> >   >
>> >   > I guess we can rule out the network stack changes, since in your
>> >   > case the downstream and upstream would probably be deployed in the
>> >   > same slot, bypassing the network data shuffle.
>> >   > I also guess release-1.11 does not bring a general performance
>> >   > regression in the runtime engine, as we ran the performance tests
>> >   > for all general cases in [1] on a real cluster beforehand and the
>> >   > results met expectations. But as far as I know, we indeed have not
>> >   > tested the specific source and sink connectors yet.
>> >   >
>> >   > Regarding your 40% performance regression, I wonder whether it is
>> >   > related to specific source/sink changes (e.g. Kinesis) or to an
>> >   > environment issue in some corner case.
>> >   > If possible, it would be helpful to determine whether the regression
>> >   > is caused by Kinesis, by replacing the Kinesis source & sink and
>> >   > keeping everything else the same.
>> >   >
>> >   > As you said, it would be efficient to get in contact with you
>> >   > directly next week to further discuss this issue. We are eager to
>> >   > provide any help to resolve this issue soon.
>> >   >
>> >   > Besides that, I guess this issue should not be a blocker for the
>> >   > release, since it is probably a corner case based on the current
>> >   > analysis.
>> >   > If we conclude that anything needs to be resolved after the final
>> >   > release, then we can also bring the next minor release 1.11.1
>> >   > forward.
>> >   >
>> >   > [1] https://issues.apache.org/jira/browse/FLINK-18433
>> >   >
>> >   > Best,
>> >   > Zhijiang
>> >   >
>> >   >
>> >   > ------------------------------------------------------------------
>> >   > From:Thomas Weise <[hidden email]>
>> >   > Send Time:2020年7月4日(星期六) 12:26
>> >   > To:dev <[hidden email]>; Zhijiang <[hidden email]
>> >
>> >   > Cc:Yingjie Cao <[hidden email]>
>> >   > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>> >   >
>> >   > Hi Zhijiang,
>> >   >
>> >   > It will probably be best if we connect next week and discuss the
>> issue
>> >   > directly since this could be quite difficult to reproduce.
>> >   >
>> >   > Before the testing result on our side comes out for your respective
>> job
>> >   > case, I have some other questions to confirm for further analysis:
>> >   >     -  How much percentage regression you found after switching to
>> 1.11?
>> >   >
>> >   > ~40% throughput decline
>> >   >
>> >   >     -  Are there any network bottleneck in your cluster? E.g. the
>> network
>> >   > bandwidth is full caused by other jobs? If so, it might have more
>> effects
>> >   > by above [2]
>> >   >
>> >   > The test runs on a k8s cluster that is also used for other
>> production jobs.
>> >   > There is no reason be believe network is the bottleneck.
>> >   >
>> >   >     -  Did you adjust the default network buffer setting? E.g.
>> >   > "taskmanager.network.memory.floating-buffers-per-gate" or
>> >   > "taskmanager.network.memory.buffers-per-channel"
>> >   >
>> >   > The job is using the defaults, i.e we don't configure the settings.
>> If you
>> >   > want me to try specific settings in the hope that it will help to
>> isolate
>> >   > the issue please let me know.
>> >   >
>> >   >     -  I guess the topology has three vertexes "KinesisConsumer ->
>> Chained
>> >   > FlatMap -> KinesisProducer", and the partition mode for
>> "KinesisConsumer ->
>> >   > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If so,
>> the edge
>> >   > connection is one-to-one, not all-to-all, then the above [1][2]
>> should no
>> >   > effects in theory with default network buffer setting.
>> >   >
>> >   > There are only 2 vertices and the edge is "forward".
>> >   >
>> >   >     - By slot sharing, I guess these three vertex parallelism task
>> would
>> >   > probably be deployed into the same slot, then the data shuffle is
>> by memory
>> >   > queue, not network stack. If so, the above [2] should no effect.
>> >   >
>> >   > Yes, vertices share slots.
>> >   >
>> >   >     - I also saw some Jira changes for kinesis in this release,
>> could you
>> >   > confirm that these changes would not effect the performance?
>> >   >
>> >   > I will need to take a look. 1.10 already had a regression
>> introduced by the
>> >   > Kinesis producer update.
>> >   >
>> >   >
>> >   > Thanks,
>> >   > Thomas
>> >   >
>> >   >
>> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <
>> [hidden email]
>> >   > .invalid>
>> >   > wrote:
>> >   >
>> >   > > Hi Thomas,
>> >   > >
>> >   > > Thanks for your reply with rich information!
>> >   > >
>> >   > > We are trying to reproduce your case in our cluster to further
>> >   > > verify it, and @Yingjie Cao is working on it now.
>> >   > > As we do not have a Kinesis consumer and producer internally, we
>> >   > > will construct a common source and sink instead to reproduce the
>> >   > > backpressure case.
>> >   > >
>> >   > > Firstly, we can rule out the RocksDB factor in this release, since
>> >   > > you also mentioned that "filesystem leads to same symptoms".
>> >   > >
>> >   > > Secondly, if my understanding is right, you emphasize that the
>> >   > > regression only exists for jobs with a low checkpoint interval (10s).
>> >   > > Based on that, I have two suspicions regarding the network-related
>> >   > > changes in this release:
>> >   > >     - [1]: Limited the maximum backlog value (default 10) in
>> subpartition
>> >   > > queue.
>> >   > >     - [2]: Delay send the following buffers after checkpoint
>> barrier on
>> >   > > upstream side until barrier alignment on downstream side.
>> >   > >
>> >   > > These changes are motivated by reducing the in-flight buffers to
>> >   > > speed up checkpoints, especially in the case of backpressure.
>> >   > > In theory they should have a very minor performance effect, and we
>> >   > > also tested them in a cluster to verify they were within
>> >   > > expectation before merging them, but maybe there are other corner
>> >   > > cases we have not thought of before.
>> >   > >
>> >   > > Before the testing result on our side comes out for your
>> respective job
>> >   > > case, I have some other questions to confirm for further analysis:
>> >   > >     -  How much percentage regression you found after switching
>> to 1.11?
>> >   > >     -  Are there any network bottleneck in your cluster? E.g. the
>> network
>> >   > > bandwidth is full caused by other jobs? If so, it might have more
>> effects
>> >   > > by above [2]
>> >   > >     -  Did you adjust the default network buffer setting? E.g.
>> >   > > "taskmanager.network.memory.floating-buffers-per-gate" or
>> >   > > "taskmanager.network.memory.buffers-per-channel"
>> >   > >     -  I guess the topology has three vertexes "KinesisConsumer ->
>> >   > Chained
>> >   > > FlatMap -> KinesisProducer", and the partition mode for
>> "KinesisConsumer
>> >   > ->
>> >   > > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If
>> so, the
>> >   > edge
>> >   > > connection is one-to-one, not all-to-all, then the above [1][2]
>> should no
>> >   > > effects in theory with default network buffer setting.
>> >   > >     - By slot sharing, I guess these three vertex parallelism
>> task would
>> >   > > probably be deployed into the same slot, then the data shuffle is
>> by
>> >   > memory
>> >   > > queue, not network stack. If so, the above [2] should no effect.
>> >   > >     - I also saw some Jira changes for kinesis in this release,
>> could you
>> >   > > confirm that these changes would not effect the performance?
>> >   > >
>> >   > > Best,
>> >   > > Zhijiang
>> >   > >
>> >   > >
>> >   > > ------------------------------------------------------------------
>> >   > > From:Thomas Weise <[hidden email]>
>> >   > > Send Time:2020年7月3日(星期五) 01:07
>> >   > > To:dev <[hidden email]>; Zhijiang <
>> [hidden email]>
>> >   > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>> >   > >
>> >   > > Hi Zhijiang,
>> >   > >
>> >   > > The performance degradation manifests in backpressure which leads
>> to
>> >   > > growing backlog in the source. I switched a few times between
>> 1.10 and
>> >   > 1.11
>> >   > > and the behavior is consistent.
>> >   > >
>> >   > > The DAG is:
>> >   > >
>> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map)   --------
>> forward
>> >   > > ---------> KinesisProducer
>> >   > >
>> >   > > Parallelism: 160
>> >   > > No shuffle/rebalance.
>> >   > >
>> >   > > Checkpointing config:
>> >   > >
>> >   > > Checkpointing Mode Exactly Once
>> >   > > Interval 10s
>> >   > > Timeout 10m 0s
>> >   > > Minimum Pause Between Checkpoints 10s
>> >   > > Maximum Concurrent Checkpoints 1
>> >   > > Persist Checkpoints Externally Enabled (delete on cancellation)
>> >   > >
>> >   > > State backend: rocksdb  (filesystem leads to same symptoms)
>> >   > > Checkpoint size is tiny (500KB)
>> >   > >
>> >   > > An interesting difference to another job that I had upgraded
>> successfully
>> >   > > is the low checkpointing interval.
>> >   > >
>> >   > > Thanks,
>> >   > > Thomas
>> >   > >
>> >   > >
>> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <
>> [hidden email]
>> >   > > .invalid>
>> >   > > wrote:
>> >   > >
>> >   > > > Hi Thomas,
>> >   > > >
>> >   > > > Thanks for the efficient feedback.
>> >   > > >
>> >   > > > Regarding the suggestion of adding the release notes document,
>> I agree
>> >   > > > with your point. Maybe we should adjust the vote template
>> accordingly
>> >   > in
>> >   > > > the respective wiki to guide the following release processes.
>> >   > > >
>> >   > > > Regarding the performance regression, could you provide some
>> more
>> >   > details
>> >   > > > for our better measurement or reproducing on our sides?
>> >   > > > E.g. I guess the topology only includes two vertexes source and
>> sink?
>> >   > > > What is the parallelism for every vertex?
>> >   > > > The upstream shuffles data to the downstream via rebalance
>> partitioner
>> >   > or
>> >   > > > other?
>> >   > > > The checkpoint mode is exactly-once with rocksDB state backend?
>> >   > > > The backpressure happened in this case?
>> >   > > > How much percentage regression in this case?
>> >   > > >
>> >   > > > Best,
>> >   > > > Zhijiang
>> >   > > >
>> >   > > >
>> >   > > >
>> >   > > >
>> ------------------------------------------------------------------
>> >   > > > From:Thomas Weise <[hidden email]>
>> >   > > > Send Time:2020年7月2日(星期四) 09:54
>> >   > > > To:dev <[hidden email]>
>> >   > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>> >   > > >
>> >   > > > Hi Till,
>> >   > > >
>> >   > > > Yes, we don't have the setting in flink-conf.yaml.
>> >   > > >
>> >   > > > Generally, we carry forward the existing configuration and any
>> change
>> >   > to
>> >   > > > default configuration values would impact the upgrade.
>> >   > > >
>> >   > > > Yes, since it is an incompatible change I would state it in the
>> release
>> >   > > > notes.
>> >   > > >
>> >   > > > Thanks,
>> >   > > > Thomas
>> >   > > >
>> >   > > > BTW I found a performance regression while trying to upgrade
>> another
>> >   > > > pipeline with this RC. It is a simple Kinesis to Kinesis job.
>> Wasn't
>> >   > able
>> >   > > > to pin it down yet, symptoms include increased checkpoint
>> alignment
>> >   > time.
>> >   > > >
>> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <
>> [hidden email]>
>> >   > > > wrote:
>> >   > > >
>> >   > > > > Hi Thomas,
>> >   > > > >
>> >   > > > > just to confirm: When starting the image in local mode, then
>> you
>> >   > don't
>> >   > > > have
>> >   > > > > any of the JobManager memory configuration settings
>> configured in the
>> >   > > > > effective flink-conf.yaml, right? Does this mean that you have
>> >   > > explicitly
>> >   > > > > removed `jobmanager.heap.size: 1024m` from the default
>> configuration?
>> >   > > If
>> >   > > > > this is the case, then I believe it was more of an
>> unintentional
>> >   > > artifact
>> >   > > > > that it worked before and it has been corrected now so that
>> one needs
>> >   > > to
>> >   > > > > specify the memory of the JM process explicitly. Do you think
>> it
>> >   > would
>> >   > > > help
>> >   > > > > to explicitly state this in the release notes?
>> >   > > > >
>> >   > > > > Cheers,
>> >   > > > > Till
>> >   > > > >
>> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <[hidden email]>
>> wrote:
>> >   > > > >
>> >   > > > > > Thanks for preparing another RC!
>> >   > > > > >
>> >   > > > > > As mentioned in the previous RC thread, it would be super
>> helpful
>> >   > if
>> >   > > > the
>> >   > > > > > release notes that are part of the documentation can be
>> included
>> >   > [1].
>> >   > > > > It's
>> >   > > > > > a significant time-saver to have read those first.
>> >   > > > > >
>> >   > > > > > I found one more non-backward compatible change that would
>> be worth
>> >   > > > > > addressing/mentioning:
>> >   > > > > >
>> >   > > > > > It is now necessary to configure the jobmanager heap size in
>> >   > > > > > flink-conf.yaml (with either jobmanager.heap.size
>> >   > > > > > or jobmanager.memory.heap.size). Why would I not want to do
>> that
>> >   > > > anyways?
>> >   > > > > > Well, we set it dynamically for a cluster deployment via the
>> >   > > > > > flinkk8soperator, but the container image can also be used
>> for
>> >   > > testing
>> >   > > > > with
>> >   > > > > > local mode (./bin/jobmanager.sh start-foreground local).
>> That will
>> >   > > fail
>> >   > > > > if
>> >   > > > > > the heap wasn't configured and that's how I noticed it.
>> >   > > > > >
>> >   > > > > > Thanks,
>> >   > > > > > Thomas
>> >   > > > > >
>> >   > > > > > [1]
>> >   > > > > >
>> >   > > > > >
>> >   > > > >
>> >   > > >
>> >   > >
>> >   >
>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
>> >   > > > > >
>> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <
>> >   > [hidden email]
>> >   > > > > > .invalid>
>> >   > > > > > wrote:
>> >   > > > > >
>> >   > > > > > > Hi everyone,
>> >   > > > > > >
>> >   > > > > > > Please review and vote on the release candidate #4 for the
>> >   > version
>> >   > > > > > 1.11.0,
>> >   > > > > > > as follows:
>> >   > > > > > > [ ] +1, Approve the release
>> >   > > > > > > [ ] -1, Do not approve the release (please provide
>> specific
>> >   > > comments)
>> >   > > > > > >
>> >   > > > > > > The complete staging area is available for your review,
>> which
>> >   > > > includes:
>> >   > > > > > > * JIRA release notes [1],
>> >   > > > > > > * the official Apache source release and binary
>> convenience
>> >   > > releases
>> >   > > > to
>> >   > > > > > be
>> >   > > > > > > deployed to dist.apache.org [2], which are signed with
>> the key
>> >   > > with
>> >   > > > > > > fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
>> >   > > > > > > * all artifacts to be deployed to the Maven Central
>> Repository
>> >   > [4],
>> >   > > > > > > * source code tag "release-1.11.0-rc4" [5],
>> >   > > > > > > * website pull request listing the new release and adding
>> >   > > > announcement
>> >   > > > > > > blog post [6].
>> >   > > > > > >
>> >   > > > > > > The vote will be open for at least 72 hours. It is
>> adopted by
>> >   > > > majority
>> >   > > > > > > approval, with at least 3 PMC affirmative votes.
>> >   > > > > > >
>> >   > > > > > > Thanks,
>> >   > > > > > > Release Manager
>> >   > > > > > >
>> >   > > > > > > [1]
>> >   > > > > > >
>> >   > > > > >
>> >   > > > >
>> >   > > >
>> >   > >
>> >   >
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
>> >   > > > > > > [2]
>> >   > https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
>> >   > > > > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
>> >   > > > > > > [4]
>> >   > > > > > >
>> >   > > > >
>> >   > >
>> https://repository.apache.org/content/repositories/orgapacheflink-1377/
>> >   > > > > > > [5]
>> >   > > https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
>> >   > > > > > > [6] https://github.com/apache/flink-web/pull/352
>> >   > > > > > >
>> >   > > > > > >
>> >   > > > > >
>> >   > > > >
>> >   > > >
>> >   > > >
>> >   > >
>> >   > >
>> >   >
>> >   >
>> >
>> >
>> >
>>
>>

Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

Zhijiang(wangzhijiang999)
Hi Thomas,

Thanks for the additional profiling information; I am glad to see that we have now located where the regression occurs.
I was already suspicious of #snapshotState in the earlier discussion, since it takes a long time and blocks normal operator processing while it runs.

Based on your feedback below, the sleep time during #snapshotState might be the main concern, so I also dug into the implementation of FlinkKinesisProducer#snapshotState:
while (producer.getOutstandingRecordsCount() > 0) {
   producer.flush();
   try {
      Thread.sleep(500);
   } catch (InterruptedException e) {
      LOG.warn("Flushing was interrupted.");
      break;
   }
}
The sleep time seems to be determined mainly by the internals of the KinesisProducer implementation provided by amazonaws, which I am not very familiar with: the loop keeps sleeping in 500 ms steps until the producer reports no outstanding records.
However, I noticed two related upgrades in release-1.11.0: amazon-kinesis-producer to 0.14.0 [1] and aws-sdk-version to 1.11.754 [2].
You mentioned that you already reverted the SDK upgrade and saw no change. Did you also try reverting [1]?
[1] https://issues.apache.org/jira/browse/FLINK-17496
[2] https://issues.apache.org/jira/browse/FLINK-14881
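
To check how much of the sync time is spent sleeping rather than flushing, the loop above could be timed in isolation. The following is only a minimal sketch, not the connector code: `Producer` and `fakeProducer` are hypothetical stand-ins for the KPL producer (just the two calls the loop uses), and a backlog that drains by a fixed amount per flush is an assumption made for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: count sleeps and elapsed time of a flush-and-sleep loop like the one quoted above. */
public class FlushLoopTiming {

    /** Hypothetical stand-in for the KPL producer; only the two calls the loop needs. */
    interface Producer {
        int getOutstandingRecordsCount();
        void flush();
    }

    /** Outcome of one drain: number of sleeps and total wall-clock time of the loop. */
    static final class DrainStats {
        final int sleeps;
        final long elapsedMillis;

        DrainStats(int sleeps, long elapsedMillis) {
            this.sleeps = sleeps;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Same loop shape as the quoted snippet, with a configurable sleep and counters added. */
    static DrainStats drain(Producer producer, long sleepMillis) {
        long start = System.nanoTime();
        int sleeps = 0;
        while (producer.getOutstandingRecordsCount() > 0) {
            producer.flush();
            try {
                Thread.sleep(sleepMillis);
                sleeps++;
            } catch (InterruptedException e) {
                // Unlike the quoted snippet, this sketch restores the interrupt flag before leaving.
                Thread.currentThread().interrupt();
                break;
            }
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new DrainStats(sleeps, elapsedMillis);
    }

    /** Assumed behaviour for illustration: each flush removes a fixed number of records. */
    static Producer fakeProducer(int outstanding, int drainedPerFlush) {
        AtomicInteger remaining = new AtomicInteger(outstanding);
        return new Producer() {
            @Override
            public int getOutstandingRecordsCount() {
                return remaining.get();
            }

            @Override
            public void flush() {
                remaining.updateAndGet(n -> Math.max(0, n - drainedPerFlush));
            }
        };
    }

    public static void main(String[] args) {
        // 10 outstanding records, 4 drained per flush, 5 ms sleep instead of 500 ms.
        DrainStats stats = drain(fakeProducer(10, 4), 5);
        System.out.println("sleeps=" + stats.sleeps + " elapsedMs=" + stats.elapsedMillis);
    }
}
```

One thing the sketch makes visible: the loop sleeps after every flush, including the flush that empties the backlog. With the 500 ms sleep of the real loop, a snapshotState call that finds outstanding records therefore costs at least 500 ms, and its duration grows in 500 ms steps with the number of iterations.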

Best,
Zhijiang
------------------------------------------------------------------
From:Thomas Weise <[hidden email]>
Send Time: July 17, 2020 (Friday) 05:29
To:dev <[hidden email]>
Cc:Zhijiang <[hidden email]>; Stephan Ewen <[hidden email]>; Arvid Heise <[hidden email]>; Aljoscha Krettek <[hidden email]>
Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

Sorry for the delay.

I confirmed that the regression is due to the sink (unsurprising, since
another job with the same consumer, but not the producer, runs as expected).

As promised I did CPU profiling on the problematic application, which gives
more insight into the regression [1]

The screenshots show that the average time for snapshotState increases from
~9s to ~28s. The data also shows the increase in sleep time during
snapshotState.

Does anyone, based on changes made in 1.11, have a theory why?

I had previously looked at the changes to the Kinesis connector and also
reverted the SDK upgrade, which did not change the situation.

It will likely be necessary to drill into the sink / checkpointing details
to understand the cause of the problem.

Let me know if anyone has specific questions that I can answer from the
profiling results.

Thomas

[1]
https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing

On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <[hidden email]> wrote:

> + dev@ for visibility
>
> I will investigate further today.
>
>
> On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <[hidden email]>
> wrote:
>
>> On 06.07.20 20:39, Stephan Ewen wrote:
>> >    - Did sink checkpoint notifications change in a relevant way, for
>> example
>> > due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
>>
>> I think that's unrelated: the Kafka fixes were isolated in Kafka and the
>> one bug I discovered on the way was about the Task reaper.
>>
>>
>> On 07.07.20 17:51, Zhijiang wrote:
>> > Sorry for misunderstanding the previous information, Thomas. I was
>> assuming that the sync checkpoint duration increased after the upgrade, as
>> was mentioned before.
>> >
>> > If I remember correctly, the memory state backend shows the same
>> issue? If so, we can rule out the RocksDB state changes. With slot sharing
>> enabled, the downstream and upstream tasks are
>> > probably deployed into the same slot, so the network shuffle has no effect.
>> >
>> > I think we need to find out whether any other symptoms changed
>> besides the performance regression, to narrow down the scope further.
>> > E.g. any changes in metrics, or deployment changes such as the number of
>> TaskManagers and the number of slots per TaskManager.
>> > A 40% regression is really big, so I guess the change should also be
>> visible in other places.
>> >
>> > I am not sure whether we can reproduce the regression in our AWS
>> environment by writing arbitrary Kinesis jobs, since, as Thomas mentioned,
>> other Kinesis jobs run normally after the upgrade.
>> > So it probably hits some corner case. I am very willing
>> to help with debugging if possible.
>> >
>> >
>> > Best,
>> > Zhijiang
>> >
>> >
>> > ------------------------------------------------------------------
>> > From:Thomas Weise <[hidden email]>
>> > Send Time: July 7, 2020 (Tuesday) 23:01
>> > To:Stephan Ewen <[hidden email]>
>> > Cc:Aljoscha Krettek <[hidden email]>; Arvid Heise <
>> [hidden email]>; Zhijiang <[hidden email]>
>> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
>> release candidate #4)
>> >
>> > We are deploying our apps with FlinkK8sOperator. We have one job that
>> works as expected after the upgrade and the one discussed here that has the
>> performance regression.
>> >
>> > "The performance regression is obviously caused by the long duration of the sync
>> checkpoint process in the Kinesis sink operator, which blocks normal
>> data processing until the source is back-pressured."
>> >
>> > That's a constant. Before (1.10) and after the upgrade, the sync
>> checkpointing time is the same. The question is what change came in with the upgrade.
>> >
>> >
>> >
>> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <[hidden email]> wrote:
>> >
>> > @Thomas Just one thing real quick: Are you using the standalone setup
>> scripts (like start-cluster.sh, and the former "slaves" file) ?
>> > Be aware that this file is now called "workers", to avoid
>> sensitive names.
>> > In one internal benchmark we saw quite a lot of slowdown initially,
>> before seeing that the cluster was not a distributed cluster any more ;-)
>> >
>> >
>> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <[hidden email]>
>> wrote:
>> > Thanks for this kickoff and help analysis, Stephan!
>> > Thanks for the further feedback and investigation, Thomas!
>> >
>> > The performance regression is obviously caused by the long duration of the sync
>> checkpoint process in the Kinesis sink operator, which blocks normal
>> data processing until the source is back-pressured.
>> > Maybe we could dig into the synchronous part of the checkpoint.
>> E.g. break down the steps inside the respective operator#snapshotState and
>> measure which operation takes most of the time; then
>> > we can probably find the root cause of this cost.
>> >
>> > Look forward to the further progress. :)
>> >
>> > Best,
>> > Zhijiang
>> >
>> > ------------------------------------------------------------------
>> > From:Stephan Ewen <[hidden email]>
>> > Send Time: July 7, 2020 (Tuesday) 14:52
>> > To:Thomas Weise <[hidden email]>
>> > Cc:Stephan Ewen <[hidden email]>; Zhijiang <
>> [hidden email]>; Aljoscha Krettek <[hidden email]>;
>> Arvid Heise <[hidden email]>
>> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
>> release candidate #4)
>> >
>> > Thank you for digging so deeply.
>> > Mysterious thing, this regression.
>> >
>> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <[hidden email]> wrote:
>> > @Stephan: yes, I refer to sync time in the web UI (it is unchanged
>> between 1.10 and 1.11 for the specific pipeline).
>> >
>> > I verified that increasing the checkpointing interval does not make a
>> difference.
>> >
>> > I looked at the Kinesis connector changes since 1.10.1 and don't see
>> anything that could cause this.
>> >
>> > Another pipeline that is using the Kinesis consumer (but not the
>> producer) performs as expected.
>> >
>> > I tried reverting the AWS SDK version change, symptoms remain unchanged:
>> >
>> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml
>> b/flink-connectors/flink-connector-kinesis/pom.xml
>> > index a6abce23ba..741743a05e 100644
>> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
>> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
>> > @@ -33,7 +33,7 @@ under the License.
>> >
>> <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
>> >          <name>flink-connector-kinesis</name>
>> >          <properties>
>> > -               <aws.sdk.version>1.11.754</aws.sdk.version>
>> > +               <aws.sdk.version>1.11.603</aws.sdk.version>
>> >
>> <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
>> >
>> <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
>> >
>> <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
>> >
>> > I'm planning to take a look with a profiler next.
>> >
>> > Thomas
>> >
>> >
>> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <[hidden email]> wrote:
>> > Hi all!
>> >
>> > Forking this thread out of the release vote thread.
>> >  From what Thomas describes, it really sounds like a sink-specific
>> issue.
>> >
>> > @Thomas: When you say sink has a long synchronous checkpoint time, you
>> mean the time that is shown as "sync time" on the metrics and web UI? That
>> is not including any network buffer related operations. It is purely the
>> operator's time.
>> >
>> > Can we dig into the changes we did in sinks:
>> >    - Kinesis version upgrade, AWS library updates
>> >
>> >    - Could it be that some call (checkpoint complete) that was
>> previously (1.10) in a separate thread is now in the mailbox and this
>> simply reduces the number of threads that do the work?
>> >
>> >    - Did sink checkpoint notifications change in a relevant way, for
>> example due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
>> >
>> > Best,
>> > Stephan
>> >
>> >
>> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <[hidden email]>
>> wrote:
>> > Hi Thomas,
>> >
>> >   Regarding [2], there is more detailed information in the Jira description (
>> https://issues.apache.org/jira/browse/FLINK-16404).
>> >
>> >   I can also give a basic explanation here to address the concern.
>> >   1. In the past, the buffers following the barrier were
>> cached on the downstream side until alignment finished.
>> >   2. In 1.11, the upstream does not send the buffers that follow the
>> barrier. When the downstream finishes the alignment, it notifies the
>> upstream to continue sending the following buffers, since it can process
>> them after alignment.
>> >   3. The only difference is whether the temporarily blocked buffers are
>> cached on the downstream side or on the upstream side before alignment.
>> >   4. The side effect is the additional notification cost for
>> every barrier alignment. If the downstream and upstream are deployed in
>> separate TaskManagers, the cost is the network transport delay (negligible
>> based on our testing with a 1s checkpoint interval). With slot sharing,
>> as in your case, the cost is only one method call in the processor, which is
>> also negligible.
>> >
>> >   You mentioned "In this case, the downstream task has a high average
>> checkpoint duration (~30s, sync part)." This duration does not reflect the
>> changes above; it only indicates the time spent calling
>> `Operation.snapshotState`.
>> >   If this duration is longer than you expect, you can check or debug
>> whether the source/sink operators take more time to finish
>> `snapshotState` in practice. E.g. you can
>> >   make the implementation of this method empty to further verify the
>> effect.
>> >
>> >   Best,
>> >   Zhijiang
>> >
>> >
>> >   ------------------------------------------------------------------
>> >   From:Thomas Weise <[hidden email]>
>> >   Send Time: July 5, 2020 (Sunday) 12:22
>> >   To:dev <[hidden email]>; Zhijiang <[hidden email]>
>> >   Cc:Yingjie Cao <[hidden email]>
>> >   Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>> >
>> >   Hi Zhijiang,
>> >
>> >   Could you please point me to more details regarding: "[2]: Delay send
>> the
>> >   following buffers after checkpoint barrier on upstream side until
>> barrier
>> >   alignment on downstream side."
>> >
>> >   In this case, the downstream task has a high average checkpoint
>> duration
>> >   (~30s, sync part). If there was a change to hold buffers depending on
>> >   downstream performance, could this possibly apply to this case (even
>> when
>> >   there is no shuffle that would require alignment)?
>> >
>> >   Thanks,
>> >   Thomas
>> >
>> >
>> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <[hidden email]
>> .invalid>
>> >   wrote:
>> >
>> >   > Hi Thomas,
>> >   >
>> >   > Thanks for the further update information.
>> >   >
>> >   > I guess we can dismiss the network stack changes, since in your
>> case the
>> >   > downstream and upstream would probably be deployed in the same slot
>> >   > bypassing the network data shuffle.
>> >   > Also I guess release-1.11 does not bring a general performance
>> regression in
>> >   > the runtime engine, as we ran performance tests for all
>> general
>> >   > cases in [1] on a real cluster beforehand, and the results should
>> meet our
>> >   > expectations. But we indeed have not tested the specific source and sink
>> >   > connectors yet, as far as I know.
>> >   >
>> >   > Regarding your 40% performance regression, I suspect it is
>> probably
>> >   > related to specific source/sink changes (e.g. Kinesis) or an
>> environment
>> >   > issue in some corner case.
>> >   > If possible, it would be helpful to further locate whether the
>> regression
>> >   > is caused by kinesis, by replacing the kinesis source & sink and
>> keeping
>> >   > the others same.
>> >   >
>> >   > As you said, it would be efficient to contact with you directly
>> next week
>> >   > to further discuss this issue. And we are willing/eager to provide
>> any help
>> >   > to resolve this issue soon.
>> >   >
>> >   > Besides that, I guess this issue should not be the blocker for the
>> >   > release, since it is probably a corner case based on the current
>> analysis.
>> >   > If we really conclude anything need to be resolved after the final
>> >   > release, then we can also make the next minor release-1.11.1 come
>> soon.
>> >   >
>> >   > [1] https://issues.apache.org/jira/browse/FLINK-18433
>> >   >
>> >   > Best,
>> >   > Zhijiang
>> >   >
>> >   >
>> >   > ------------------------------------------------------------------
>> >   > From:Thomas Weise <[hidden email]>
>> >   > Send Time: July 4, 2020 (Saturday) 12:26
>> >   > To:dev <[hidden email]>; Zhijiang <[hidden email]
>> >
>> >   > Cc:Yingjie Cao <[hidden email]>
>> >   > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>> >   >
>> >   > Hi Zhijiang,
>> >   >
>> >   > It will probably be best if we connect next week and discuss the
>> issue
>> >   > directly since this could be quite difficult to reproduce.
>> >   >
>> >   > Before the testing result on our side comes out for your respective
>> job
>> >   > case, I have some other questions to confirm for further analysis:
>> >   >     -  How much percentage regression you found after switching to
>> 1.11?
>> >   >
>> >   > ~40% throughput decline
>> >   >
>> >   >     -  Are there any network bottleneck in your cluster? E.g. the
>> network
>> >   > bandwidth is full caused by other jobs? If so, it might have more
>> effects
>> >   > by above [2]
>> >   >
>> >   > The test runs on a k8s cluster that is also used for other
>> production jobs.
>> >   > There is no reason to believe the network is the bottleneck.
>> >   >
>> >   >     -  Did you adjust the default network buffer setting? E.g.
>> >   > "taskmanager.network.memory.floating-buffers-per-gate" or
>> >   > "taskmanager.network.memory.buffers-per-channel"
>> >   >
>> >   > The job is using the defaults, i.e we don't configure the settings.
>> If you
>> >   > want me to try specific settings in the hope that it will help to
>> isolate
>> >   > the issue please let me know.
>> >   >
>> >   >     -  I guess the topology has three vertexes "KinesisConsumer ->
>> Chained
>> >   > FlatMap -> KinesisProducer", and the partition mode for
>> "KinesisConsumer ->
>> >   > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If so,
>> the edge
>> >   > connection is one-to-one, not all-to-all, then the above [1][2]
>> should no
>> >   > effects in theory with default network buffer setting.
>> >   >
>> >   > There are only 2 vertices and the edge is "forward".
>> >   >
>> >   >     - By slot sharing, I guess these three vertex parallelism task
>> would
>> >   > probably be deployed into the same slot, then the data shuffle is
>> by memory
>> >   > queue, not network stack. If so, the above [2] should no effect.
>> >   >
>> >   > Yes, vertices share slots.
>> >   >
>> >   >     - I also saw some Jira changes for kinesis in this release,
>> could you
>> >   > confirm that these changes would not affect the performance?
>> >   >
>> >   > I will need to take a look. 1.10 already had a regression
>> introduced by the
>> >   > Kinesis producer update.
>> >   >
>> >   >
>> >   > Thanks,
>> >   > Thomas
>> >   >
>> >   >
>> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <
>> [hidden email]
>> >   > .invalid>
>> >   > wrote:
>> >   >
>> >   > > Hi Thomas,
>> >   > >
>> >   > > Thanks for your reply with rich information!
>> >   > >
>> >   > > We are trying to reproduce your case in our cluster to further
>> verify it,
>> >   > > and @Yingjie Cao is working on it now.
>> >   > > As we do not have a Kinesis consumer and producer internally, we
>> will
>> >   > > construct a generic source and sink instead, under
>> backpressure.
>> >   > >
>> >   > > Firstly, we can rule out the RocksDB factor in this release, since
>> you also
>> >   > > mentioned that "filesystem leads to same symptoms".
>> >   > >
>> >   > > Secondly, if my understanding is right, you emphasize that the
>> regression
>> >   > > only exists for jobs with a low checkpoint interval (10s).
>> >   > > Based on that, I have two suspicions with the network related
>> changes in
>> >   > > this release:
>> >   > >     - [1]: Limited the maximum backlog value (default 10) in
>> subpartition
>> >   > > queue.
>> >   > >     - [2]: Delay send the following buffers after checkpoint
>> barrier on
>> >   > > upstream side until barrier alignment on downstream side.
>> >   > >
>> >   > > These changes are motivated for reducing the in-flight buffers to
>> speedup
>> >   > > checkpoint especially in the case of backpressure.
>> >   > > In theory they should have very minor performance effect and
>> actually we
>> >   > > also tested in cluster to verify within expectation before
>> merging them,
>> >   > >  but maybe there are other corner cases we have not thought of
>> before.
>> >   > >
>> >   > > Before the testing result on our side comes out for your
>> respective job
>> >   > > case, I have some other questions to confirm for further analysis:
>> >   > >     -  How much percentage regression you found after switching
>> to 1.11?
>> >   > >     -  Are there any network bottleneck in your cluster? E.g. the
>> network
>> >   > > bandwidth is full caused by other jobs? If so, it might have more
>> effects
>> >   > > by above [2]
>> >   > >     -  Did you adjust the default network buffer setting? E.g.
>> >   > > "taskmanager.network.memory.floating-buffers-per-gate" or
>> >   > > "taskmanager.network.memory.buffers-per-channel"
>> >   > >     -  I guess the topology has three vertexes "KinesisConsumer ->
>> >   > Chained
>> >   > > FlatMap -> KinesisProducer", and the partition mode for
>> "KinesisConsumer
>> >   > ->
>> >   > > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If
>> so, the
>> >   > edge
>> >   > > connection is one-to-one, not all-to-all, then the above [1][2]
>> should no
>> >   > > effects in theory with default network buffer setting.
>> >   > >     - By slot sharing, I guess these three vertex parallelism
>> task would
>> >   > > probably be deployed into the same slot, then the data shuffle is
>> by
>> >   > memory
>> >   > > queue, not network stack. If so, the above [2] should no effect.
>> >   > >     - I also saw some Jira changes for kinesis in this release,
>> could you
>> >   > > confirm that these changes would not affect the performance?
>> >   > >
>> >   > > Best,
>> >   > > Zhijiang
>> >   > >
>> >   > >
>> >   > > ------------------------------------------------------------------
>> >   > > From:Thomas Weise <[hidden email]>
>> >   > > Send Time: July 3, 2020 (Friday) 01:07
>> >   > > To:dev <[hidden email]>; Zhijiang <
>> [hidden email]>
>> >   > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>> >   > >
>> >   > > Hi Zhijiang,
>> >   > >
>> >   > > The performance degradation manifests in backpressure which leads
>> to
>> >   > > growing backlog in the source. I switched a few times between
>> 1.10 and
>> >   > 1.11
>> >   > > and the behavior is consistent.
>> >   > >
>> >   > > The DAG is:
>> >   > >
>> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map)   --------
>> forward
>> >   > > ---------> KinesisProducer
>> >   > >
>> >   > > Parallelism: 160
>> >   > > No shuffle/rebalance.
>> >   > >
>> >   > > Checkpointing config:
>> >   > >
>> >   > > Checkpointing Mode Exactly Once
>> >   > > Interval 10s
>> >   > > Timeout 10m 0s
>> >   > > Minimum Pause Between Checkpoints 10s
>> >   > > Maximum Concurrent Checkpoints 1
>> >   > > Persist Checkpoints Externally Enabled (delete on cancellation)
>> >   > >
>> >   > > State backend: rocksdb  (filesystem leads to same symptoms)
>> >   > > Checkpoint size is tiny (500KB)
>> >   > >
>> >   > > An interesting difference to another job that I had upgraded
>> successfully
>> >   > > is the low checkpointing interval.
>> >   > >
>> >   > > Thanks,
>> >   > > Thomas
>> >   > >
>> >   > >
>> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <
>> [hidden email]
>> >   > > .invalid>
>> >   > > wrote:
>> >   > >
>> >   > > > Hi Thomas,
>> >   > > >
>> >   > > > Thanks for the efficient feedback.
>> >   > > >
>> >   > > > Regarding the suggestion of adding the release notes document,
>> I agree
>> >   > > > with your point. Maybe we should adjust the vote template
>> accordingly
>> >   > in
>> >   > > > the respective wiki to guide the following release processes.
>> >   > > >
>> >   > > > Regarding the performance regression, could you provide some
>> more
>> >   > details
>> >   > > > for our better measurement or reproducing on our sides?
>> >   > > > E.g. I guess the topology only includes two vertexes source and
>> sink?
>> >   > > > What is the parallelism for every vertex?
>> >   > > > The upstream shuffles data to the downstream via rebalance
>> partitioner
>> >   > or
>> >   > > > other?
>> >   > > > The checkpoint mode is exactly-once with rocksDB state backend?
>> >   > > > The backpressure happened in this case?
>> >   > > > How much percentage regression in this case?
>> >   > > >
>> >   > > > Best,
>> >   > > > Zhijiang
>> >   > > >
>> >   > > >
>> >   > > >
>> >   > > >
>> ------------------------------------------------------------------
>> >   > > > From:Thomas Weise <[hidden email]>
>> >   > > > Send Time: July 2, 2020 (Thursday) 09:54
>> >   > > > To:dev <[hidden email]>
>> >   > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>> >   > > >
>> >   > > > Hi Till,
>> >   > > >
>> >   > > > Yes, we don't have the setting in flink-conf.yaml.
>> >   > > >
>> >   > > > Generally, we carry forward the existing configuration and any
>> change
>> >   > to
>> >   > > > default configuration values would impact the upgrade.
>> >   > > >
>> >   > > > Yes, since it is an incompatible change I would state it in the
>> release
>> >   > > > notes.
>> >   > > >
>> >   > > > Thanks,
>> >   > > > Thomas
>> >   > > >
>> >   > > > BTW I found a performance regression while trying to upgrade
>> another
>> >   > > > pipeline with this RC. It is a simple Kinesis to Kinesis job.
>> Wasn't
>> >   > able
>> >   > > > to pin it down yet, symptoms include increased checkpoint
>> alignment
>> >   > time.
>> >   > > >
>> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <
>> [hidden email]>
>> >   > > > wrote:
>> >   > > >
>> >   > > > > Hi Thomas,
>> >   > > > >
>> >   > > > > just to confirm: When starting the image in local mode, then
>> you
>> >   > don't
>> >   > > > have
>> >   > > > > any of the JobManager memory configuration settings
>> configured in the
>> >   > > > > effective flink-conf.yaml, right? Does this mean that you have
>> >   > > explicitly
>> >   > > > > removed `jobmanager.heap.size: 1024m` from the default
>> configuration?
>> >   > > If
>> >   > > > > this is the case, then I believe it was more of an
>> unintentional
>> >   > > artifact
>> >   > > > > that it worked before and it has been corrected now so that
>> one needs
>> >   > > to
>> >   > > > > specify the memory of the JM process explicitly. Do you think
>> it
>> >   > would
>> >   > > > help
>> >   > > > > to explicitly state this in the release notes?
>> >   > > > >
>> >   > > > > Cheers,
>> >   > > > > Till
>> >   > > > >
>> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <[hidden email]>
>> wrote:
>> >   > > > >
>> >   > > > > > Thanks for preparing another RC!
>> >   > > > > >
>> >   > > > > > As mentioned in the previous RC thread, it would be super
>> helpful
>> >   > if
>> >   > > > the
>> >   > > > > > release notes that are part of the documentation can be
>> included
>> >   > [1].
>> >   > > > > It's
>> >   > > > > > a significant time-saver to have read those first.
>> >   > > > > >
>> >   > > > > > I found one more non-backward compatible change that would
>> be worth
>> >   > > > > > addressing/mentioning:
>> >   > > > > >
>> >   > > > > > It is now necessary to configure the jobmanager heap size in
>> >   > > > > > flink-conf.yaml (with either jobmanager.heap.size
>> >   > > > > > or jobmanager.memory.heap.size). Why would I not want to do
>> that
>> >   > > > anyways?
>> >   > > > > > Well, we set it dynamically for a cluster deployment via the
>> >   > > > > > flinkk8soperator, but the container image can also be used
>> for
>> >   > > testing
>> >   > > > > with
>> >   > > > > > local mode (./bin/jobmanager.sh start-foreground local).
>> That will
>> >   > > fail
>> >   > > > > if
>> >   > > > > > the heap wasn't configured and that's how I noticed it.
>> >   > > > > >
>> >   > > > > > Thanks,
>> >   > > > > > Thomas
>> >   > > > > >
>> >   > > > > > [1]
>> >   > > > > >
>> >   > > > > >
>> >   > > > >
>> >   > > >
>> >   > >
>> >   >
>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
>> >   > > > > >
>> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <
>> >   > [hidden email]
>> >   > > > > > .invalid>
>> >   > > > > > wrote:
>> >   > > > > >
>> >   > > > > > > Hi everyone,
>> >   > > > > > >
>> >   > > > > > > Please review and vote on the release candidate #4 for the
>> >   > version
>> >   > > > > > 1.11.0,
>> >   > > > > > > as follows:
>> >   > > > > > > [ ] +1, Approve the release
>> >   > > > > > > [ ] -1, Do not approve the release (please provide
>> specific
>> >   > > comments)
>> >   > > > > > >
>> >   > > > > > > The complete staging area is available for your review,
>> which
>> >   > > > includes:
>> >   > > > > > > * JIRA release notes [1],
>> >   > > > > > > * the official Apache source release and binary
>> convenience
>> >   > > releases
>> >   > > > to
>> >   > > > > > be
>> >   > > > > > > deployed to dist.apache.org [2], which are signed with
>> the key
>> >   > > with
>> >   > > > > > > fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
>> >   > > > > > > * all artifacts to be deployed to the Maven Central
>> Repository
>> >   > [4],
>> >   > > > > > > * source code tag "release-1.11.0-rc4" [5],
>> >   > > > > > > * website pull request listing the new release and adding
>> >   > > > announcement
>> >   > > > > > > blog post [6].
>> >   > > > > > >
>> >   > > > > > > The vote will be open for at least 72 hours. It is
>> adopted by
>> >   > > > majority
>> >   > > > > > > approval, with at least 3 PMC affirmative votes.
>> >   > > > > > >
>> >   > > > > > > Thanks,
>> >   > > > > > > Release Manager
>> >   > > > > > >
>> >   > > > > > > [1]
>> >   > > > > > >
>> >   > > > > >
>> >   > > > >
>> >   > > >
>> >   > >
>> >   >
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
>> >   > > > > > > [2]
>> >   > https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
>> >   > > > > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
>> >   > > > > > > [4]
>> >   > > > > > >
>> >   > > > >
>> >   > >
>> https://repository.apache.org/content/repositories/orgapacheflink-1377/
>> >   > > > > > > [5]
>> >   > > https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
>> >   > > > > > > [6] https://github.com/apache/flink-web/pull/352
>> >   > > > > > >
>> >   > > > > > >
>> >   > > > > >
>> >   > > > >
>> >   > > >
>> >   > > >
>> >   > >
>> >   > >
>> >   >
>> >   >
>> >
>> >
>> >
>>
>>


Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

Thomas Weise
The cause of the issue is anything but clear.

Previously I had mentioned that there is no suspect change to the Kinesis
connector and that I had reverted the AWS SDK change to no effect.

https://issues.apache.org/jira/browse/FLINK-17496 actually fixed another
regression in the previous release and is present before and after.

I repeated the run with 1.11.0 core and downgraded the entire Kinesis
connector to 1.10.1: Nothing changes, i.e. the regression is still present.
Therefore we will need to look elsewhere for the root cause.

Regarding the time spent in snapshotState, repeat runs reveal a wide range
for both versions, 1.10 and 1.11. So again, this does not point to a
root cause.

At this point, I have no ideas remaining other than doing a bisect to find
the culprit. Any other suggestions?
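
For reference, the bisect idea itself is just a binary search over the commit range between the last known good and the first known bad build, which is what `git bisect` automates. A minimal sketch under stated assumptions: the commit names are placeholders, and the `isBad` predicate is hypothetical (in practice it would mean building that commit, running the job, and comparing throughput). It also assumes the regression stays present once introduced.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Sketch: find the first bad commit by binary search, as a bisect would. */
public class CommitBisect {

    /**
     * Returns the index of the first bad commit. Assumes the first commit is good,
     * the last commit is bad, and every commit after the first bad one is also bad.
     */
    static <T> int firstBad(List<T> commits, Predicate<T> isBad) {
        int good = 0;                  // index of a commit known to be good
        int bad = commits.size() - 1;  // index of a commit known to be bad
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2;
            if (isBad.test(commits.get(mid))) {
                bad = mid;   // regression is at mid or earlier
            } else {
                good = mid;  // regression was introduced after mid
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Placeholder history; pretend the regression came in with "c5".
        List<String> commits = Arrays.asList("c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7");
        int idx = firstBad(commits, c -> c.compareTo("c5") >= 0);
        System.out.println("first bad: " + commits.get(idx));
    }
}
```

The number of builds to test grows only with the logarithm of the range, so even a few hundred commits between 1.10 and 1.11 would need fewer than ten test runs, provided each run gives a clear good/bad answer despite the run-to-run variance.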

Thomas


On Thu, Jul 16, 2020 at 9:19 PM Zhijiang <[hidden email]>
wrote:

> Hi Thomas,
>
> Thanks for the additional profiling information; I am glad to see that we
> have now located where the regression occurs.
> I was already suspicious of #snapshotState in the earlier
> discussion, since it takes a long time and blocks normal operator
> processing while it runs.
>
> Based on your feedback below, the sleep time during #snapshotState might
> be the main concern, so I also dug into the implementation of
> FlinkKinesisProducer#snapshotState:
> while (producer.getOutstandingRecordsCount() > 0) {
>    producer.flush();
>    try {
>       Thread.sleep(500);
>    } catch (InterruptedException e) {
>       LOG.warn("Flushing was interrupted.");
>       break;
>    }
> }
> The sleep time seems to be determined mainly by the internals of the
> KinesisProducer implementation provided by amazonaws, which I am not
> very familiar with.
> However, I noticed two related upgrades in release-1.11.0:
> amazon-kinesis-producer to 0.14.0 [1] and
> aws-sdk-version to 1.11.754 [2].
> You mentioned that you already reverted the SDK upgrade and saw no
> change. Did you also try reverting [1]?
> [1] https://issues.apache.org/jira/browse/FLINK-17496
> [2] https://issues.apache.org/jira/browse/FLINK-14881
>
> Best,
> Zhijiang
> ------------------------------------------------------------------
> From:Thomas Weise <[hidden email]>
> Send Time:2020年7月17日(星期五) 05:29
> To:dev <[hidden email]>
> Cc:Zhijiang <[hidden email]>; Stephan Ewen <[hidden email]>;
> Arvid Heise <[hidden email]>; Aljoscha Krettek <[hidden email]>
> Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release
> candidate #4)
>
> Sorry for the delay.
>
> I confirmed that the regression is due to the sink (unsurprising, since
> another job with the same consumer, but not the producer, runs as
> expected).
>
> As promised I did CPU profiling on the problematic application, which gives
> more insight into the regression [1]
>
> The screenshots show that the average time for snapshotState increases from
> ~9s to ~28s. The data also shows the increase in sleep time during
> snapshotState.
>
> Does anyone, based on changes made in 1.11, have a theory why?
>
> I had previously looked at the changes to the Kinesis connector and also
> reverted the SDK upgrade, which did not change the situation.
>
> It will likely be necessary to drill into the sink / checkpointing details
> to understand the cause of the problem.
>
> Let me know if anyone has specific questions that I can answer from the
> profiling results.
>
> Thomas
>
> [1]
>
> https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing
>
> On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <[hidden email]> wrote:
>
> > + dev@ for visibility
> >
> > I will investigate further today.
> >
> >
> > On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <[hidden email]>
> > wrote:
> >
> >> On 06.07.20 20:39, Stephan Ewen wrote:
> >> >    - Did sink checkpoint notifications change in a relevant way, for
> >> example
> >> > due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
> >>
> >> I think that's unrelated: the Kafka fixes were isolated in Kafka and the
> >> one bug I discovered on the way was about the Task reaper.
> >>
> >>
> >> On 07.07.20 17:51, Zhijiang wrote:
> > >> > Sorry for misunderstanding the previous information, Thomas. I was
> > >> > assuming that the sync checkpoint duration increased after the
> > >> > upgrade, as was mentioned before.
> > >> >
> > >> > If I remember correctly, the memory state backend also has the same
> > >> > issue? If so, we can dismiss the RocksDB state changes. With slot
> > >> > sharing enabled, the downstream and upstream tasks are probably
> > >> > deployed into the same slot, so there is no network shuffle effect.
> > >> >
> > >> > I think we need to find out whether any symptoms other than the
> > >> > performance regression changed, to narrow down the scope further,
> > >> > e.g. any metrics changes, or changes in the number of TaskManagers and
> > >> > the number of slots per TaskManager in the deployment.
> > >> > A 40% regression is really big; I guess the change should also be
> > >> > reflected in other places.
> > >> >
> > >> > I am not sure whether we can reproduce the regression in our AWS
> > >> > environment by writing an arbitrary Kinesis job, since, as Thomas
> > >> > mentioned, other Kinesis jobs run normally after the upgrade.
> > >> > So it probably touches some corner case. I am very willing to provide
> > >> > any help with debugging if possible.
> >> >
> >> >
> >> > Best,
> >> > Zhijiang
> >> >
> >> >
> >> > ------------------------------------------------------------------
> >> > From:Thomas Weise <[hidden email]>
> >> > Send Time:2020年7月7日(星期二) 23:01
> >> > To:Stephan Ewen <[hidden email]>
> >> > Cc:Aljoscha Krettek <[hidden email]>; Arvid Heise <
> >> [hidden email]>; Zhijiang <[hidden email]>
> >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> >> release candidate #4)
> >> >
> > >> > We are deploying our apps with FlinkK8sOperator. We have one job that
> > >> > works as expected after the upgrade, and the one discussed here that
> > >> > has the performance regression.
> > >> >
> > >> > "The performance regression is obvious caused by long duration of sync
> > >> > checkpoint process in Kinesis sink operator, which would block the
> > >> > normal data processing until back pressure the source."
> > >> >
> > >> > That's a constant. The sync checkpointing time is the same before
> > >> > (1.10) and after the upgrade. The question is what change came in with
> > >> > the upgrade.
> >> >
> >> >
> >> >
> >> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <[hidden email]> wrote:
> >> >
> >> > @Thomas Just one thing real quick: Are you using the standalone setup
> >> scripts (like start-cluster.sh, and the former "slaves" file) ?
> >> > Be aware that this is now called "workers" because of avoiding
> >> sensitive names.
> >> > In one internal benchmark we saw quite a lot of slowdown initially,
> >> before seeing that the cluster was not a distributed cluster any more
> ;-)
> >> >
> >> >
> >> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <[hidden email]>
> >> wrote:
> >> > Thanks for this kickoff and help analysis, Stephan!
> >> > Thanks for the further feedback and investigation, Thomas!
> >> >
> >> > The performance regression is obvious caused by long duration of sync
> >> checkpoint process in Kinesis sink operator, which would block the
> normal
> >> data processing until back pressure the source.
> >> > Maybe we could dig into the process of sync execution in checkpoint.
> >> E.g. break down the steps inside respective operator#snapshotState to
> >> statistic which operation cost most of the time, then
> >> > we might probably find the root cause to bring such cost.
> >> >
> >> > Look forward to the further progress. :)
> >> >
> >> > Best,
> >> > Zhijiang
> >> >
> >> > ------------------------------------------------------------------
> >> > From:Stephan Ewen <[hidden email]>
> >> > Send Time:2020年7月7日(星期二) 14:52
> >> > To:Thomas Weise <[hidden email]>
> >> > Cc:Stephan Ewen <[hidden email]>; Zhijiang <
> >> [hidden email]>; Aljoscha Krettek <[hidden email]>;
> >> Arvid Heise <[hidden email]>
> >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> >> release candidate #4)
> >> >
> > >> > Thank you for digging so deeply.
> > >> > Mysterious thing, this regression.
> >> >
> >> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <[hidden email]> wrote:
> >> > @Stephan: yes, I refer to sync time in the web UI (it is unchanged
> >> between 1.10 and 1.11 for the specific pipeline).
> >> >
> >> > I verified that increasing the checkpointing interval does not make a
> >> difference.
> >> >
> >> > I looked at the Kinesis connector changes since 1.10.1 and don't see
> >> anything that could cause this.
> >> >
> >> > Another pipeline that is using the Kinesis consumer (but not the
> >> producer) performs as expected.
> >> >
> >> > I tried reverting the AWS SDK version change, symptoms remain
> unchanged:
> >> >
> >> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml
> >> b/flink-connectors/flink-connector-kinesis/pom.xml
> >> > index a6abce23ba..741743a05e 100644
> >> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
> >> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
> >> > @@ -33,7 +33,7 @@ under the License.
> >> >
> >> <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
> >> >          <name>flink-connector-kinesis</name>
> >> >          <properties>
> >> > -               <aws.sdk.version>1.11.754</aws.sdk.version>
> >> > +               <aws.sdk.version>1.11.603</aws.sdk.version>
> >> >
> >> <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
> >> >
> >> <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
> >> >
> >>
> <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
> >> >
> >> > I'm planning to take a look with a profiler next.
> >> >
> >> > Thomas
> >> >
> >> >
> >> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <[hidden email]>
> wrote:
> >> > Hi all!
> >> >
> >> > Forking this thread out of the release vote thread.
> >> >  From what Thomas describes, it really sounds like a sink-specific
> >> issue.
> >> >
> >> > @Thomas: When you say sink has a long synchronous checkpoint time, you
> >> mean the time that is shown as "sync time" on the metrics and web UI?
> That
> >> is not including any network buffer related operations. It is purely the
> >> operator's time.
> >> >
> >> > Can we dig into the changes we did in sinks:
> >> >    - Kinesis version upgrade, AWS library updates
> >> >
> > >> >    - Could it be that some call (checkpoint complete) that was
> > >> > previously (1.10) in a separate thread is now in the mailbox and this
> > >> > simply reduces the number of threads that do the work?
> >> >
> >> >    - Did sink checkpoint notifications change in a relevant way, for
> >> example due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
> >> >
> >> > Best,
> >> > Stephan
> >> >
> >> >
> >> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <[hidden email]
> .invalid>
> >> wrote:
> >> > Hi Thomas,
> >> >
> >> >   Regarding [2], it has more detail infos in the Jira description (
> >> https://issues.apache.org/jira/browse/FLINK-16404).
> >> >
> >> >   I can also give some basic explanations here to dismiss the concern.
> > >> >   1. In the past, the buffers following the barrier were cached on
> > >> >   the downstream side before alignment.
> > >> >   2. In 1.11, the upstream does not send the buffers following the
> > >> >   barrier. When the downstream finishes the alignment, it notifies the
> > >> >   upstream to continue sending the following buffers, since it can
> > >> >   process them after alignment.
> > >> >   3. The only difference is whether the temporarily blocked buffers
> > >> >   are cached on the downstream side or on the upstream side before
> > >> >   alignment.
> > >> >   4. The side effect is the additional notification cost for every
> > >> >   barrier alignment. If the downstream and upstream are deployed in
> > >> >   separate TaskManagers, the cost is the network transport delay (the
> > >> >   effect can be ignored based on our testing with a 1s checkpoint
> > >> >   interval). With slot sharing, as in your case, the cost is only one
> > >> >   method call in the processor and can also be ignored.
> >> >
> > >> >   You mentioned "In this case, the downstream task has a high average
> > >> >   checkpoint duration (~30s, sync part)." This duration does not
> > >> >   reflect the changes above; it only indicates the duration of the
> > >> >   call to `Operation.snapshotState`.
> > >> >   If this duration is beyond your expectation, you can check or debug
> > >> >   whether the source/sink operations take more time to finish
> > >> >   `snapshotState` in practice. E.g. you can make the implementation of
> > >> >   this method empty to further verify the effect.
> >> >
> >> >   Best,
> >> >   Zhijiang
> >> >
> >> >
> >> >   ------------------------------------------------------------------
> >> >   From:Thomas Weise <[hidden email]>
> >> >   Send Time:2020年7月5日(星期日) 12:22
> >> >   To:dev <[hidden email]>; Zhijiang <[hidden email]
> >
> >> >   Cc:Yingjie Cao <[hidden email]>
> >> >   Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> >> >
> >> >   Hi Zhijiang,
> >> >
> >> >   Could you please point me to more details regarding: "[2]: Delay
> send
> >> the
> >> >   following buffers after checkpoint barrier on upstream side until
> >> barrier
> >> >   alignment on downstream side."
> >> >
> >> >   In this case, the downstream task has a high average checkpoint
> >> duration
> >> >   (~30s, sync part). If there was a change to hold buffers depending
> on
> >> >   downstream performance, could this possibly apply to this case (even
> >> when
> >> >   there is no shuffle that would require alignment)?
> >> >
> >> >   Thanks,
> >> >   Thomas
> >> >
> >> >
> >> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <[hidden email]
> >> .invalid>
> >> >   wrote:
> >> >
> >> >   > Hi Thomas,
> >> >   >
> >> >   > Thanks for the further update information.
> >> >   >
> >> >   > I guess we can dismiss the network stack changes, since in your
> >> case the
> >> >   > downstream and upstream would probably be deployed in the same
> slot
> >> >   > bypassing the network data shuffle.
> >> >   > Also I guess release-1.11 will not bring general performance
> >> regression in
> >> >   > runtime engine, as we also did the performance testing for all
> >> general
> >> >   > cases by [1] in real cluster before and the testing results should
> >> fit the
> >> >   > expectation. But we indeed did not test the specific source and
> sink
> >> >   > connectors yet as I known.
> >> >   >
> >> >   > Regarding your performance regression with 40%, I wonder it is
> >> probably
> >> >   > related to specific source/sink changes (e.g. kinesis) or
> >> environment
> >> >   > issues with corner case.
> >> >   > If possible, it would be helpful to further locate whether the
> >> regression
> >> >   > is caused by kinesis, by replacing the kinesis source & sink and
> >> keeping
> >> >   > the others same.
> >> >   >
> >> >   > As you said, it would be efficient to contact with you directly
> >> next week
> >> >   > to further discuss this issue. And we are willing/eager to provide
> >> any help
> >> >   > to resolve this issue soon.
> >> >   >
> >> >   > Besides that, I guess this issue should not be the blocker for the
> >> >   > release, since it is probably a corner case based on the current
> >> analysis.
> >> >   > If we really conclude anything need to be resolved after the final
> >> >   > release, then we can also make the next minor release-1.11.1 come
> >> soon.
> >> >   >
> >> >   > [1] https://issues.apache.org/jira/browse/FLINK-18433
> >> >   >
> >> >   > Best,
> >> >   > Zhijiang
> >> >   >
> >> >   >
> >> >   > ------------------------------------------------------------------
> >> >   > From:Thomas Weise <[hidden email]>
> >> >   > Send Time:2020年7月4日(星期六) 12:26
> >> >   > To:dev <[hidden email]>; Zhijiang <
> [hidden email]
> >> >
> >> >   > Cc:Yingjie Cao <[hidden email]>
> >> >   > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> >> >   >
> >> >   > Hi Zhijiang,
> >> >   >
> >> >   > It will probably be best if we connect next week and discuss the
> >> issue
> >> >   > directly since this could be quite difficult to reproduce.
> >> >   >
> >> >   > Before the testing result on our side comes out for your
> respective
> >> job
> >> >   > case, I have some other questions to confirm for further analysis:
> >> >   >     -  How much percentage regression you found after switching to
> >> 1.11?
> >> >   >
> >> >   > ~40% throughput decline
> >> >   >
> >> >   >     -  Are there any network bottleneck in your cluster? E.g. the
> >> network
> >> >   > bandwidth is full caused by other jobs? If so, it might have more
> >> effects
> >> >   > by above [2]
> >> >   >
> > >> >   > The test runs on a k8s cluster that is also used for other
> > >> >   > production jobs.
> > >> >   > There is no reason to believe the network is the bottleneck.
> >> >   >
> >> >   >     -  Did you adjust the default network buffer setting? E.g.
> >> >   > "taskmanager.network.memory.floating-buffers-per-gate" or
> >> >   > "taskmanager.network.memory.buffers-per-channel"
> >> >   >
> >> >   > The job is using the defaults, i.e we don't configure the
> settings.
> >> If you
> >> >   > want me to try specific settings in the hope that it will help to
> >> isolate
> >> >   > the issue please let me know.
> >> >   >
> >> >   >     -  I guess the topology has three vertexes "KinesisConsumer ->
> >> Chained
> >> >   > FlatMap -> KinesisProducer", and the partition mode for
> >> "KinesisConsumer ->
> >> >   > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If so,
> >> the edge
> >> >   > connection is one-to-one, not all-to-all, then the above [1][2]
> >> should no
> >> >   > effects in theory with default network buffer setting.
> >> >   >
> >> >   > There are only 2 vertices and the edge is "forward".
> >> >   >
> >> >   >     - By slot sharing, I guess these three vertex parallelism task
> >> would
> >> >   > probably be deployed into the same slot, then the data shuffle is
> >> by memory
> >> >   > queue, not network stack. If so, the above [2] should no effect.
> >> >   >
> >> >   > Yes, vertices share slots.
> >> >   >
> > >> >   >     - I also saw some Jira changes for kinesis in this release,
> > >> >   > could you confirm that these changes would not affect the
> > >> >   > performance?
> >> >   >
> >> >   > I will need to take a look. 1.10 already had a regression
> >> introduced by the
> >> >   > Kinesis producer update.
> >> >   >
> >> >   >
> >> >   > Thanks,
> >> >   > Thomas
> >> >   >
> >> >   >
> >> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <
> >> [hidden email]
> >> >   > .invalid>
> >> >   > wrote:
> >> >   >
> >> >   > > Hi Thomas,
> >> >   > >
> >> >   > > Thanks for your reply with rich information!
> >> >   > >
> > >> >   > > We are trying to reproduce your case in our cluster to verify it
> > >> >   > > further, and @Yingjie Cao is working on it now.
> > >> >   > > As we do not have a Kinesis consumer and producer internally, we
> > >> >   > > will construct a common source and sink instead, under
> > >> >   > > backpressure.
> > >> >   > >
> > >> >   > > Firstly, we can dismiss the RocksDB factor in this release, since
> > >> >   > > you also mentioned that "filesystem leads to same symptoms".
> > >> >   > >
> > >> >   > > Secondly, if my understanding is right, you emphasize that the
> > >> >   > > regression only exists for jobs with a low checkpoint interval
> > >> >   > > (10s).
> >> >   > > Based on that, I have two suspicions with the network related
> >> changes in
> >> >   > > this release:
> >> >   > >     - [1]: Limited the maximum backlog value (default 10) in
> >> subpartition
> >> >   > > queue.
> >> >   > >     - [2]: Delay send the following buffers after checkpoint
> >> barrier on
> >> >   > > upstream side until barrier alignment on downstream side.
> >> >   > >
> >> >   > > These changes are motivated for reducing the in-flight buffers
> to
> >> speedup
> >> >   > > checkpoint especially in the case of backpressure.
> >> >   > > In theory they should have very minor performance effect and
> >> actually we
> >> >   > > also tested in cluster to verify within expectation before
> >> merging them,
> >> >   > >  but maybe there are other corner cases we have not thought of
> >> before.
> >> >   > >
> >> >   > > Before the testing result on our side comes out for your
> >> respective job
> >> >   > > case, I have some other questions to confirm for further
> analysis:
> >> >   > >     -  How much percentage regression you found after switching
> >> to 1.11?
> >> >   > >     -  Are there any network bottleneck in your cluster? E.g.
> the
> >> network
> >> >   > > bandwidth is full caused by other jobs? If so, it might have
> more
> >> effects
> >> >   > > by above [2]
> >> >   > >     -  Did you adjust the default network buffer setting? E.g.
> >> >   > > "taskmanager.network.memory.floating-buffers-per-gate" or
> >> >   > > "taskmanager.network.memory.buffers-per-channel"
> >> >   > >     -  I guess the topology has three vertexes "KinesisConsumer
> ->
> >> >   > Chained
> >> >   > > FlatMap -> KinesisProducer", and the partition mode for
> >> "KinesisConsumer
> >> >   > ->
> >> >   > > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If
> >> so, the
> >> >   > edge
> >> >   > > connection is one-to-one, not all-to-all, then the above [1][2]
> >> should no
> >> >   > > effects in theory with default network buffer setting.
> >> >   > >     - By slot sharing, I guess these three vertex parallelism
> >> task would
> >> >   > > probably be deployed into the same slot, then the data shuffle
> is
> >> by
> >> >   > memory
> >> >   > > queue, not network stack. If so, the above [2] should no effect.
> > >> >   > >     - I also saw some Jira changes for kinesis in this release,
> > >> >   > > could you confirm that these changes would not affect the
> > >> >   > > performance?
> >> >   > >
> >> >   > > Best,
> >> >   > > Zhijiang
> >> >   > >
> >> >   > >
> >> >   > >
> ------------------------------------------------------------------
> >> >   > > From:Thomas Weise <[hidden email]>
> >> >   > > Send Time:2020年7月3日(星期五) 01:07
> >> >   > > To:dev <[hidden email]>; Zhijiang <
> >> [hidden email]>
> >> >   > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> >> >   > >
> >> >   > > Hi Zhijiang,
> >> >   > >
> >> >   > > The performance degradation manifests in backpressure which
> leads
> >> to
> >> >   > > growing backlog in the source. I switched a few times between
> >> 1.10 and
> >> >   > 1.11
> >> >   > > and the behavior is consistent.
> >> >   > >
> >> >   > > The DAG is:
> >> >   > >
> >> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map)   --------
> >> forward
> >> >   > > ---------> KinesisProducer
> >> >   > >
> >> >   > > Parallelism: 160
> >> >   > > No shuffle/rebalance.
> >> >   > >
> >> >   > > Checkpointing config:
> >> >   > >
> >> >   > > Checkpointing Mode Exactly Once
> >> >   > > Interval 10s
> >> >   > > Timeout 10m 0s
> >> >   > > Minimum Pause Between Checkpoints 10s
> >> >   > > Maximum Concurrent Checkpoints 1
> >> >   > > Persist Checkpoints Externally Enabled (delete on cancellation)
> >> >   > >
> >> >   > > State backend: rocksdb  (filesystem leads to same symptoms)
> >> >   > > Checkpoint size is tiny (500KB)
> >> >   > >
> >> >   > > An interesting difference to another job that I had upgraded
> >> successfully
> >> >   > > is the low checkpointing interval.
> >> >   > >
> >> >   > > Thanks,
> >> >   > > Thomas
> >> >   > >
> >> >   > >
> >> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <
> >> [hidden email]
> >> >   > > .invalid>
> >> >   > > wrote:
> >> >   > >
> >> >   > > > Hi Thomas,
> >> >   > > >
> >> >   > > > Thanks for the efficient feedback.
> >> >   > > >
> >> >   > > > Regarding the suggestion of adding the release notes document,
> >> I agree
> >> >   > > > with your point. Maybe we should adjust the vote template
> >> accordingly
> >> >   > in
> >> >   > > > the respective wiki to guide the following release processes.
> >> >   > > >
> >> >   > > > Regarding the performance regression, could you provide some
> >> more
> >> >   > details
> >> >   > > > for our better measurement or reproducing on our sides?
> >> >   > > > E.g. I guess the topology only includes two vertexes source
> and
> >> sink?
> >> >   > > > What is the parallelism for every vertex?
> >> >   > > > The upstream shuffles data to the downstream via rebalance
> >> partitioner
> >> >   > or
> >> >   > > > other?
> >> >   > > > The checkpoint mode is exactly-once with rocksDB state
> backend?
> >> >   > > > The backpressure happened in this case?
> >> >   > > > How much percentage regression in this case?
> >> >   > > >
> >> >   > > > Best,
> >> >   > > > Zhijiang
> >> >   > > >
> >> >   > > >
> >> >   > > >
> >> >   > > >
> >> ------------------------------------------------------------------
> >> >   > > > From:Thomas Weise <[hidden email]>
> >> >   > > > Send Time:2020年7月2日(星期四) 09:54
> >> >   > > > To:dev <[hidden email]>
> >> >   > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> >> >   > > >
> >> >   > > > Hi Till,
> >> >   > > >
> >> >   > > > Yes, we don't have the setting in flink-conf.yaml.
> >> >   > > >
> >> >   > > > Generally, we carry forward the existing configuration and any
> >> change
> >> >   > to
> >> >   > > > default configuration values would impact the upgrade.
> >> >   > > >
> >> >   > > > Yes, since it is an incompatible change I would state it in
> the
> >> release
> >> >   > > > notes.
> >> >   > > >
> >> >   > > > Thanks,
> >> >   > > > Thomas
> >> >   > > >
> >> >   > > > BTW I found a performance regression while trying to upgrade
> >> another
> >> >   > > > pipeline with this RC. It is a simple Kinesis to Kinesis job.
> >> Wasn't
> >> >   > able
> >> >   > > > to pin it down yet, symptoms include increased checkpoint
> >> alignment
> >> >   > time.
> >> >   > > >
> >> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <
> >> [hidden email]>
> >> >   > > > wrote:
> >> >   > > >
> >> >   > > > > Hi Thomas,
> >> >   > > > >
> >> >   > > > > just to confirm: When starting the image in local mode, then
> >> you
> >> >   > don't
> >> >   > > > have
> >> >   > > > > any of the JobManager memory configuration settings
> >> configured in the
> >> >   > > > > effective flink-conf.yaml, right? Does this mean that you
> have
> >> >   > > explicitly
> >> >   > > > > removed `jobmanager.heap.size: 1024m` from the default
> >> configuration?
> >> >   > > If
> >> >   > > > > this is the case, then I believe it was more of an
> >> unintentional
> >> >   > > artifact
> >> >   > > > > that it worked before and it has been corrected now so that
> >> one needs
> >> >   > > to
> >> >   > > > > specify the memory of the JM process explicitly. Do you
> think
> >> it
> >> >   > would
> >> >   > > > help
> >> >   > > > > to explicitly state this in the release notes?
> >> >   > > > >
> >> >   > > > > Cheers,
> >> >   > > > > Till
> >> >   > > > >
> >> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <[hidden email]
> >
> >> wrote:
> >> >   > > > >
> >> >   > > > > > Thanks for preparing another RC!
> >> >   > > > > >
> >> >   > > > > > As mentioned in the previous RC thread, it would be super
> >> helpful
> >> >   > if
> >> >   > > > the
> >> >   > > > > > release notes that are part of the documentation can be
> >> included
> >> >   > [1].
> >> >   > > > > It's
> >> >   > > > > > a significant time-saver to have read those first.
> >> >   > > > > >
> >> >   > > > > > I found one more non-backward compatible change that would
> >> be worth
> >> >   > > > > > addressing/mentioning:
> >> >   > > > > >
> >> >   > > > > > It is now necessary to configure the jobmanager heap size
> in
> >> >   > > > > > flink-conf.yaml (with either jobmanager.heap.size
> >> >   > > > > > or jobmanager.memory.heap.size). Why would I not want to
> do
> >> that
> >> >   > > > anyways?
> >> >   > > > > > Well, we set it dynamically for a cluster deployment via
> the
> >> >   > > > > > flinkk8soperator, but the container image can also be used
> >> for
> >> >   > > testing
> >> >   > > > > with
> >> >   > > > > > local mode (./bin/jobmanager.sh start-foreground local).
> >> That will
> >> >   > > fail
> >> >   > > > > if
> >> >   > > > > > the heap wasn't configured and that's how I noticed it.
> >> >   > > > > >
> >> >   > > > > > Thanks,
> >> >   > > > > > Thomas
> >> >   > > > > >
> >> >   > > > > > [1]
> >> >   > > > > >
> >> >   > > > > >
> >> >   > > > >
> >> >   > > >
> >> >   > >
> >> >   >
> >>
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
> >> >   > > > > >
> >> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <
> >> >   > [hidden email]
> >> >   > > > > > .invalid>
> >> >   > > > > > wrote:
> >> >   > > > > >
> >> >   > > > > > > Hi everyone,
> >> >   > > > > > >
> >> >   > > > > > > Please review and vote on the release candidate #4 for
> the
> >> >   > version
> >> >   > > > > > 1.11.0,
> >> >   > > > > > > as follows:
> >> >   > > > > > > [ ] +1, Approve the release
> >> >   > > > > > > [ ] -1, Do not approve the release (please provide
> >> specific
> >> >   > > comments)
> >> >   > > > > > >
> >> >   > > > > > > The complete staging area is available for your review,
> >> which
> >> >   > > > includes:
> >> >   > > > > > > * JIRA release notes [1],
> >> >   > > > > > > * the official Apache source release and binary
> >> convenience
> >> >   > > releases
> >> >   > > > to
> >> >   > > > > > be
> >> >   > > > > > > deployed to dist.apache.org [2], which are signed with
> >> the key
> >> >   > > with
> >> >   > > > > > > fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E
> [3],
> >> >   > > > > > > * all artifacts to be deployed to the Maven Central
> >> Repository
> >> >   > [4],
> >> >   > > > > > > * source code tag "release-1.11.0-rc4" [5],
> >> >   > > > > > > * website pull request listing the new release and
> adding
> >> >   > > > announcement
> >> >   > > > > > > blog post [6].
> >> >   > > > > > >
> >> >   > > > > > > The vote will be open for at least 72 hours. It is
> >> adopted by
> >> >   > > > majority
> >> >   > > > > > > approval, with at least 3 PMC affirmative votes.
> >> >   > > > > > >
> >> >   > > > > > > Thanks,
> >> >   > > > > > > Release Manager
> >> >   > > > > > >
> >> >   > > > > > > [1]
> >> >   > > > > > >
> >> >   > > > > >
> >> >   > > > >
> >> >   > > >
> >> >   > >
> >> >   >
> >>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> >> >   > > > > > > [2]
> >> >   > https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
> >> >   > > > > > > [3]
> https://dist.apache.org/repos/dist/release/flink/KEYS
> >> >   > > > > > > [4]
> >> >   > > > > > >
> >> >   > > > >
> >> >   > >
> >> https://repository.apache.org/content/repositories/orgapacheflink-1377/
> >> >   > > > > > > [5]
> >> >   > > https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
> >> >   > > > > > > [6] https://github.com/apache/flink-web/pull/352
> >> >   > > > > > >
> >> >   > > > > > >
> >> >   > > > > >
> >> >   > > > >
> >> >   > > >
> >> >   > > >
> >> >   > >
> >> >   > >
> >> >   >
> >> >   >
> >> >
> >> >
> >> >
> >>
> >>
>
>

Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

Kurt Young
From my experience, Java profilers are sometimes not accurate enough to
find the root cause of a performance regression. In this case, I would
suggest you try Intel VTune Amplifier to look at more detailed metrics.

Best,
Kurt


On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise <[hidden email]> wrote:

> The cause of the issue is anything but clear.
>
> Previously I had mentioned that there is no suspect change to the Kinesis
> connector and that I had reverted the AWS SDK change to no effect.
>
> https://issues.apache.org/jira/browse/FLINK-17496 actually fixed another
> regression in the previous release and is present before and after.
>
> I repeated the run with 1.11.0 core and downgraded the entire Kinesis
> connector to 1.10.1: Nothing changes, i.e. the regression is still present.
> Therefore we will need to look elsewhere for the root cause.
>
> Regarding the time spent in snapshotState, repeat runs reveal a wide range
> for both versions, 1.10 and 1.11. So again this is nothing pointing to a
> root cause.
>
> At this point, I have no ideas remaining other than doing a bisect to find
> the culprit. Any other suggestions?
>
> Thomas
>
>
> On Thu, Jul 16, 2020 at 9:19 PM Zhijiang <[hidden email]
> .invalid>
> wrote:
>
> > Hi Thomas,
> >
> > Thanks for the further profiling information. Glad to see we have already
> > narrowed down the location of the regression.
> > I was also suspicious of #snapshotState in previous discussions, since it
> > takes a long time and blocks normal operator processing.
> >
> > Based on your feedback below, the sleep time during #snapshotState might
> > be the main concern, so I also dug into the implementation of
> > FlinkKinesisProducer#snapshotState:
> > while (producer.getOutstandingRecordsCount() > 0) {
> >    producer.flush();
> >    try {
> >       Thread.sleep(500);
> >    } catch (InterruptedException e) {
> >       LOG.warn("Flushing was interrupted.");
> >       break;
> >    }
> > }
> > The sleep time seems to depend mainly on the internals of the
> > KinesisProducer implementation provided by amazonaws, which I am not very
> > familiar with.
> > But I noticed two related upgrades in release-1.11.0: one upgrades
> > amazon-kinesis-producer to 0.14.0 [1] and the other upgrades
> > aws-sdk-version to 1.11.754 [2].
> > You mentioned that you already reverted the SDK upgrade and saw no change.
> > Did you also revert [1] to verify?
> > [1] https://issues.apache.org/jira/browse/FLINK-17496
> > [2] https://issues.apache.org/jira/browse/FLINK-14881
> >
> > Best,
> > Zhijiang
> > ------------------------------------------------------------------
> > From:Thomas Weise <[hidden email]>
> > Send Time:2020年7月17日(星期五) 05:29
> > To:dev <[hidden email]>
> > Cc:Zhijiang <[hidden email]>; Stephan Ewen <[hidden email]
> >;
> > Arvid Heise <[hidden email]>; Aljoscha Krettek <[hidden email]
> >
> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release
> > candidate #4)
> >
> > Sorry for the delay.
> >
> > I confirmed that the regression is due to the sink (unsurprising, since
> > another job with the same consumer, but not the producer, runs as
> > expected).
> >
> > As promised I did CPU profiling on the problematic application, which
> gives
> > more insight into the regression [1]
> >
> > The screenshots show that the average time for snapshotState increases
> from
> > ~9s to ~28s. The data also shows the increase in sleep time during
> > snapshotState.
> >
> > Does anyone, based on changes made in 1.11, have a theory why?
> >
> > I had previously looked at the changes to the Kinesis connector and also
> > reverted the SDK upgrade, which did not change the situation.
> >
> > It will likely be necessary to drill into the sink / checkpointing
> details
> > to understand the cause of the problem.
> >
> > Let me know if anyone has specific questions that I can answer from the
> > profiling results.
> >
> > Thomas
> >
> > [1]
> >
> >
> https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing
> >
> > On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <[hidden email]> wrote:
> >
> > > + dev@ for visibility
> > >
> > > I will investigate further today.
> > >
> > >
> > > On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <[hidden email]>
> > > wrote:
> > >
> > >> On 06.07.20 20:39, Stephan Ewen wrote:
> > >> >    - Did sink checkpoint notifications change in a relevant way, for
> > >> example
> > >> > due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
> > >>
> > >> I think that's unrelated: the Kafka fixes were isolated in Kafka and
> the
> > >> one bug I discovered on the way was about the Task reaper.
> > >>
> > >>
> > >> On 07.07.20 17:51, Zhijiang wrote:
> > >> > Sorry for my misunderstanding of the previous information, Thomas. I was
> > >> > assuming that the sync checkpoint duration increased after the upgrade,
> > >> > as it was mentioned before.
> > >> >
> > >> > If I remember correctly, the memory state backend also has the same
> > >> > issue? If so, we can dismiss the RocksDB state changes. As slot sharing
> > >> > is enabled, the downstream and upstream are probably deployed into the
> > >> > same slot, so there is no network shuffle effect.
> > >> >
> > >> > I think we need to find out whether it has other symptoms changed
> > >> besides the performance regression to further figure out the scope.
> > >> > E.g. any metrics changes, the number of TaskManager and the number
> of
> > >> slots per TaskManager from deployment changes.
> > >> > 40% regression is really big, I guess the changes should also be
> > >> reflected in other places.
> > >> >
> > >> > I am not sure whether we can reproduce the regression in our AWS
> > >> environment by writing any Kinesis jobs, since there are also normal
> > >> Kinesis jobs as Thomas mentioned after upgrade.
> > >> > So it probably looks like to touch some corner case. I am very
> willing
> > >> to provide any help for debugging if possible.
> > >> >
> > >> >
> > >> > Best,
> > >> > Zhijiang
> > >> >
> > >> >
> > >> > ------------------------------------------------------------------
> > >> > From:Thomas Weise <[hidden email]>
> > >> > Send Time:2020年7月7日(星期二) 23:01
> > >> > To:Stephan Ewen <[hidden email]>
> > >> > Cc:Aljoscha Krettek <[hidden email]>; Arvid Heise <
> > >> [hidden email]>; Zhijiang <[hidden email]>
> > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> > >> release candidate #4)
> > >> >
> > >> > We are deploying our apps with FlinkK8sOperator. We have one job
> that
> > >> works as expected after the upgrade and the one discussed here that
> has
> > the
> > >> performance regression.
> > >> >
> > >> > "The performance regression is obvious caused by long duration of
> sync
> > >> checkpoint process in Kinesis sink operator, which would block the
> > normal
> > >> data processing until back pressure the source."
> > >> >
> > >> > That's a constant. Before (1.10) and after the upgrade, the sync
> > >> > checkpointing time is the same. The question is what change came in
> > >> > with the upgrade.
> > >> >
> > >> >
> > >> >
> > >> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <[hidden email]>
> wrote:
> > >> >
> > >> > @Thomas Just one thing real quick: Are you using the standalone
> setup
> > >> scripts (like start-cluster.sh, and the former "slaves" file) ?
> > >> > Be aware that this is now called "workers" because of avoiding
> > >> sensitive names.
> > >> > In one internal benchmark we saw quite a lot of slowdown initially,
> > >> before seeing that the cluster was not a distributed cluster any more
> > ;-)
> > >> >
> > >> >
> > >> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <[hidden email]
> >
> > >> wrote:
> > >> > Thanks for this kickoff and help analysis, Stephan!
> > >> > Thanks for the further feedback and investigation, Thomas!
> > >> >
> > >> > The performance regression is obviously caused by the long duration of
> > >> > the sync checkpoint process in the Kinesis sink operator, which blocks
> > >> > normal data processing until the source is backpressured.
> > >> > Maybe we could dig into the sync part of the checkpoint, e.g. break down
> > >> > the steps inside the respective operator#snapshotState to measure which
> > >> > operation costs the most time; then we might find the root cause of
> > >> > this cost.
> > >> >
> > >> > Look forward to the further progress. :)
> > >> >
> > >> > Best,
> > >> > Zhijiang
> > >> >
> > >> > ------------------------------------------------------------------
> > >> > From:Stephan Ewen <[hidden email]>
> > >> > Send Time:2020年7月7日(星期二) 14:52
> > >> > To:Thomas Weise <[hidden email]>
> > >> > Cc:Stephan Ewen <[hidden email]>; Zhijiang <
> > >> [hidden email]>; Aljoscha Krettek <[hidden email]>;
> > >> Arvid Heise <[hidden email]>
> > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> > >> release candidate #4)
> > >> >
> > >> > Thank you for digging so deeply.
> > >> > Mysterious thing, this regression.
> > >> >
> > >> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <[hidden email]> wrote:
> > >> > @Stephan: yes, I refer to sync time in the web UI (it is unchanged
> > >> between 1.10 and 1.11 for the specific pipeline).
> > >> >
> > >> > I verified that increasing the checkpointing interval does not make
> a
> > >> difference.
> > >> >
> > >> > I looked at the Kinesis connector changes since 1.10.1 and don't see
> > >> anything that could cause this.
> > >> >
> > >> > Another pipeline that is using the Kinesis consumer (but not the
> > >> producer) performs as expected.
> > >> >
> > >> > I tried reverting the AWS SDK version change, symptoms remain
> > unchanged:
> > >> >
> > >> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml
> > >> b/flink-connectors/flink-connector-kinesis/pom.xml
> > >> > index a6abce23ba..741743a05e 100644
> > >> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
> > >> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
> > >> > @@ -33,7 +33,7 @@ under the License.
> > >> >
> > >>
> <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
> > >> >          <name>flink-connector-kinesis</name>
> > >> >          <properties>
> > >> > -               <aws.sdk.version>1.11.754</aws.sdk.version>
> > >> > +               <aws.sdk.version>1.11.603</aws.sdk.version>
> > >> >
> > >> <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
> > >> >
> > >> <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
> > >> >
> > >>
> >
> <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
> > >> >
> > >> > I'm planning to take a look with a profiler next.
> > >> >
> > >> > Thomas
> > >> >
> > >> >
> > >> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <[hidden email]>
> > wrote:
> > >> > Hi all!
> > >> >
> > >> > Forking this thread out of the release vote thread.
> > >> >  From what Thomas describes, it really sounds like a sink-specific
> > >> issue.
> > >> >
> > >> > @Thomas: When you say sink has a long synchronous checkpoint time,
> you
> > >> mean the time that is shown as "sync time" on the metrics and web UI?
> > That
> > >> is not including any network buffer related operations. It is purely
> the
> > >> operator's time.
> > >> >
> > >> > Can we dig into the changes we did in sinks:
> > >> >    - Kinesis version upgrade, AWS library updates
> > >> >
> > >> >    - Could it be that some call (checkpoint complete) that was
> > >> >      previously (1.10) in a separate thread is now in the mailbox and this
> > >> >      simply reduces the number of threads that do the work?
> > >> >
> > >> >    - Did sink checkpoint notifications change in a relevant way, for
> > >> example due to some Kafka issues we addressed in 1.11 (@Aljoscha
> maybe?)
> > >> >
> > >> > Best,
> > >> > Stephan
> > >> >
> > >> >
> > >> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <[hidden email]
> > .invalid>
> > >> wrote:
> > >> > Hi Thomas,
> > >> >
> > >> >   Regarding [2], it has more detail infos in the Jira description (
> > >> https://issues.apache.org/jira/browse/FLINK-16404).
> > >> >
> > >> >   I can also give some basic explanations here to dismiss the
> concern.
> > >> >   1. In the past, the buffers following the barrier were cached on the
> > >> >      downstream side before alignment.
> > >> >   2. In 1.11, the upstream does not send the buffers following the
> > >> >      barrier. When the downstream finishes the alignment, it notifies the
> > >> >      upstream to continue sending the following buffers, since it can
> > >> >      process them after alignment.
> > >> >   3. The only difference is whether the temporarily blocked buffers are
> > >> >      cached on the downstream side or on the upstream side before alignment.
> > >> >   4. The side effect is the additional notification cost for every
> > >> >      barrier alignment. If the downstream and upstream are deployed in
> > >> >      separate TaskManagers, the cost is the network transport delay (the
> > >> >      effect can be ignored based on our testing with a 1s checkpoint
> > >> >      interval). With slot sharing, as in your case, the cost is only one
> > >> >      method call in the processor, which can also be ignored.
> > >> >
> > >> >   You mentioned "In this case, the downstream task has a high
> average
> > >> checkpoint duration(~30s, sync part)." This duration is not reflecting
> > the
> > >> changes above, and it is only indicating the duration for calling
> > >> `Operation.snapshotState`.
> > >> >   If this duration is beyond your expectation, you can check or
> debug
> > >> whether the source/sink operations might take more time to finish
> > >> `snapshotState` in practice. E.g. you can
> > >> >   make the implementation of this method as empty to further verify
> > the
> > >> effect.
> > >> >
> > >> >   Best,
> > >> >   Zhijiang
> > >> >
> > >> >
> > >> >   ------------------------------------------------------------------
> > >> >   From:Thomas Weise <[hidden email]>
> > >> >   Send Time:2020年7月5日(星期日) 12:22
> > >> >   To:dev <[hidden email]>; Zhijiang <
> [hidden email]
> > >
> > >> >   Cc:Yingjie Cao <[hidden email]>
> > >> >   Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > >> >
> > >> >   Hi Zhijiang,
> > >> >
> > >> >   Could you please point me to more details regarding: "[2]: Delay
> > send
> > >> the
> > >> >   following buffers after checkpoint barrier on upstream side until
> > >> barrier
> > >> >   alignment on downstream side."
> > >> >
> > >> >   In this case, the downstream task has a high average checkpoint
> > >> duration
> > >> >   (~30s, sync part). If there was a change to hold buffers depending
> > on
> > >> >   downstream performance, could this possibly apply to this case
> (even
> > >> when
> > >> >   there is no shuffle that would require alignment)?
> > >> >
> > >> >   Thanks,
> > >> >   Thomas
> > >> >
> > >> >
> > >> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <
> [hidden email]
> > >> .invalid>
> > >> >   wrote:
> > >> >
> > >> >   > Hi Thomas,
> > >> >   >
> > >> >   > Thanks for the further update information.
> > >> >   >
> > >> >   > I guess we can dismiss the network stack changes, since in your
> > >> case the
> > >> >   > downstream and upstream would probably be deployed in the same
> > slot
> > >> >   > bypassing the network data shuffle.
> > >> >   > Also I guess release-1.11 will not bring general performance
> > >> regression in
> > >> >   > runtime engine, as we also did the performance testing for all
> > >> general
> > >> >   > cases by [1] in real cluster before and the testing results
> should
> > >> fit the
> > >> >   > expectation. But we indeed did not test the specific source and
> > sink
> > >> >   > connectors yet, as far as I know.
> > >> >   >
> > >> >   > Regarding your performance regression with 40%, I wonder it is
> > >> probably
> > >> >   > related to specific source/sink changes (e.g. kinesis) or
> > >> environment
> > >> >   > issues with corner case.
> > >> >   > If possible, it would be helpful to further locate whether the
> > >> regression
> > >> >   > is caused by kinesis, by replacing the kinesis source & sink and
> > >> keeping
> > >> >   > the others same.
> > >> >   >
> > >> >   > As you said, it would be efficient to contact with you directly
> > >> next week
> > >> >   > to further discuss this issue. And we are willing/eager to
> provide
> > >> any help
> > >> >   > to resolve this issue soon.
> > >> >   >
> > >> >   > Besides that, I guess this issue should not be the blocker for
> the
> > >> >   > release, since it is probably a corner case based on the current
> > >> analysis.
> > >> >   > If we really conclude anything need to be resolved after the
> final
> > >> >   > release, then we can also make the next minor release-1.11.1
> come
> > >> soon.
> > >> >   >
> > >> >   > [1] https://issues.apache.org/jira/browse/FLINK-18433
> > >> >   >
> > >> >   > Best,
> > >> >   > Zhijiang
> > >> >   >
> > >> >   >
> > >> >   >
> ------------------------------------------------------------------
> > >> >   > From:Thomas Weise <[hidden email]>
> > >> >   > Send Time:2020年7月4日(星期六) 12:26
> > >> >   > To:dev <[hidden email]>; Zhijiang <
> > [hidden email]
> > >> >
> > >> >   > Cc:Yingjie Cao <[hidden email]>
> > >> >   > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > >> >   >
> > >> >   > Hi Zhijiang,
> > >> >   >
> > >> >   > It will probably be best if we connect next week and discuss the
> > >> issue
> > >> >   > directly since this could be quite difficult to reproduce.
> > >> >   >
> > >> >   > Before the testing result on our side comes out for your
> > respective
> > >> job
> > >> >   > case, I have some other questions to confirm for further
> analysis:
> > >> >   >     -  How much percentage regression you found after switching
> to
> > >> 1.11?
> > >> >   >
> > >> >   > ~40% throughput decline
> > >> >   >
> > >> >   >     -  Are there any network bottleneck in your cluster? E.g.
> the
> > >> network
> > >> >   > bandwidth is full caused by other jobs? If so, it might have
> more
> > >> effects
> > >> >   > by above [2]
> > >> >   >
> > >> >   > The test runs on a k8s cluster that is also used for other
> > >> production jobs.
> > >> >   > There is no reason to believe the network is the bottleneck.
> > >> >   >
> > >> >   >     -  Did you adjust the default network buffer setting? E.g.
> > >> >   > "taskmanager.network.memory.floating-buffers-per-gate" or
> > >> >   > "taskmanager.network.memory.buffers-per-channel"
> > >> >   >
> > >> >   > The job is using the defaults, i.e we don't configure the
> > settings.
> > >> If you
> > >> >   > want me to try specific settings in the hope that it will help
> to
> > >> isolate
> > >> >   > the issue please let me know.
> > >> >   >
> > >> >   >     -  I guess the topology has three vertexes "KinesisConsumer
> ->
> > >> Chained
> > >> >   > FlatMap -> KinesisProducer", and the partition mode for
> > >> "KinesisConsumer ->
> > >> >   > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If
> so,
> > >> the edge
> > >> >   > connection is one-to-one, not all-to-all, then the above [1][2]
> > >> should no
> > >> >   > effects in theory with default network buffer setting.
> > >> >   >
> > >> >   > There are only 2 vertices and the edge is "forward".
> > >> >   >
> > >> >   >     - By slot sharing, I guess these three vertex parallelism
> task
> > >> would
> > >> >   > probably be deployed into the same slot, then the data shuffle
> is
> > >> by memory
> > >> >   > queue, not network stack. If so, the above [2] should no effect.
> > >> >   >
> > >> >   > Yes, vertices share slots.
> > >> >   >
> > >> >   >     - I also saw some Jira changes for kinesis in this release,
> > >> could you
> > >> >   > confirm that these changes would not affect performance?
> > >> >   >
> > >> >   > I will need to take a look. 1.10 already had a regression
> > >> introduced by the
> > >> >   > Kinesis producer update.
> > >> >   >
> > >> >   >
> > >> >   > Thanks,
> > >> >   > Thomas
> > >> >   >
> > >> >   >
> > >> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <
> > >> [hidden email]
> > >> >   > .invalid>
> > >> >   > wrote:
> > >> >   >
> > >> >   > > Hi Thomas,
> > >> >   > >
> > >> >   > > Thanks for your reply with rich information!
> > >> >   > >
> > >> >   > > We are trying to reproduce your case in our cluster to further
> > >> verify it,
> > >> >   > > and  @Yingjie Cao is working on it now.
> > >> >   > >  As we have not kinesis consumer and producer internally, so
> we
> > >> will
> > >> >   > > construct the common source and sink instead in the case of
> > >> backpressure.
> > >> >   > >
> > >> >   > > Firstly, we can dismiss the RocksDB factor in this release, since
> > >> >   > > you also mentioned that "filesystem leads to same symptoms".
> > >> >   > >
> > >> >   > > Secondly, if my understanding is right, you emphasize that the
> > >> >   > > regression only exists for jobs with a low checkpoint interval (10s).
> > >> >   > > Based on that, I have two suspicions with the network related
> > >> changes in
> > >> >   > > this release:
> > >> >   > >     - [1]: Limited the maximum backlog value (default 10) in
> > >> subpartition
> > >> >   > > queue.
> > >> >   > >     - [2]: Delay send the following buffers after checkpoint
> > >> barrier on
> > >> >   > > upstream side until barrier alignment on downstream side.
> > >> >   > >
> > >> >   > > These changes are motivated for reducing the in-flight buffers
> > to
> > >> speedup
> > >> >   > > checkpoint especially in the case of backpressure.
> > >> >   > > In theory they should have very minor performance effect and
> > >> actually we
> > >> >   > > also tested in cluster to verify within expectation before
> > >> merging them,
> > >> >   > >  but maybe there are other corner cases we have not thought of
> > >> before.
> > >> >   > >
> > >> >   > > Before the testing result on our side comes out for your
> > >> respective job
> > >> >   > > case, I have some other questions to confirm for further
> > analysis:
> > >> >   > >     -  How much percentage regression you found after
> switching
> > >> to 1.11?
> > >> >   > >     -  Are there any network bottleneck in your cluster? E.g.
> > the
> > >> network
> > >> >   > > bandwidth is full caused by other jobs? If so, it might have
> > more
> > >> effects
> > >> >   > > by above [2]
> > >> >   > >     -  Did you adjust the default network buffer setting? E.g.
> > >> >   > > "taskmanager.network.memory.floating-buffers-per-gate" or
> > >> >   > > "taskmanager.network.memory.buffers-per-channel"
> > >> >   > >     -  I guess the topology has three vertexes
> "KinesisConsumer
> > ->
> > >> >   > Chained
> > >> >   > > FlatMap -> KinesisProducer", and the partition mode for
> > >> "KinesisConsumer
> > >> >   > ->
> > >> >   > > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If
> > >> so, the
> > >> >   > edge
> > >> >   > > connection is one-to-one, not all-to-all, then the above
> [1][2]
> > >> should no
> > >> >   > > effects in theory with default network buffer setting.
> > >> >   > >     - By slot sharing, I guess these three vertex parallelism
> > >> task would
> > >> >   > > probably be deployed into the same slot, then the data shuffle
> > is
> > >> by
> > >> >   > memory
> > >> >   > > queue, not network stack. If so, the above [2] should no
> effect.
> > >> >   > >     - I also saw some Jira changes for kinesis in this
> release,
> > >> could you
> > >> >   > > confirm that these changes would not affect performance?
> > >> >   > >
> > >> >   > > Best,
> > >> >   > > Zhijiang
> > >> >   > >
> > >> >   > >
> > >> >   > >
> > ------------------------------------------------------------------
> > >> >   > > From:Thomas Weise <[hidden email]>
> > >> >   > > Send Time:2020年7月3日(星期五) 01:07
> > >> >   > > To:dev <[hidden email]>; Zhijiang <
> > >> [hidden email]>
> > >> >   > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > >> >   > >
> > >> >   > > Hi Zhijiang,
> > >> >   > >
> > >> >   > > The performance degradation manifests in backpressure which
> > leads
> > >> to
> > >> >   > > growing backlog in the source. I switched a few times between
> > >> 1.10 and
> > >> >   > 1.11
> > >> >   > > and the behavior is consistent.
> > >> >   > >
> > >> >   > > The DAG is:
> > >> >   > >
> > >> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map)   --------
> > >> forward
> > >> >   > > ---------> KinesisProducer
> > >> >   > >
> > >> >   > > Parallelism: 160
> > >> >   > > No shuffle/rebalance.
> > >> >   > >
> > >> >   > > Checkpointing config:
> > >> >   > >
> > >> >   > > Checkpointing Mode Exactly Once
> > >> >   > > Interval 10s
> > >> >   > > Timeout 10m 0s
> > >> >   > > Minimum Pause Between Checkpoints 10s
> > >> >   > > Maximum Concurrent Checkpoints 1
> > >> >   > > Persist Checkpoints Externally Enabled (delete on
> cancellation)
> > >> >   > >
> > >> >   > > State backend: rocksdb  (filesystem leads to same symptoms)
> > >> >   > > Checkpoint size is tiny (500KB)
> > >> >   > >
> > >> >   > > An interesting difference to another job that I had upgraded
> > >> successfully
> > >> >   > > is the low checkpointing interval.
> > >> >   > >
> > >> >   > > Thanks,
> > >> >   > > Thomas
> > >> >   > >
> > >> >   > >
> > >> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <
> > >> [hidden email]
> > >> >   > > .invalid>
> > >> >   > > wrote:
> > >> >   > >
> > >> >   > > > Hi Thomas,
> > >> >   > > >
> > >> >   > > > Thanks for the efficient feedback.
> > >> >   > > >
> > >> >   > > > Regarding the suggestion of adding the release notes
> document,
> > >> I agree
> > >> >   > > > with your point. Maybe we should adjust the vote template
> > >> accordingly
> > >> >   > in
> > >> >   > > > the respective wiki to guide the following release
> processes.
> > >> >   > > >
> > >> >   > > > Regarding the performance regression, could you provide some
> > >> more
> > >> >   > details
> > >> >   > > > for our better measurement or reproducing on our sides?
> > >> >   > > > E.g. I guess the topology only includes two vertexes source
> > and
> > >> sink?
> > >> >   > > > What is the parallelism for every vertex?
> > >> >   > > > The upstream shuffles data to the downstream via rebalance
> > >> partitioner
> > >> >   > or
> > >> >   > > > other?
> > >> >   > > > The checkpoint mode is exactly-once with rocksDB state
> > backend?
> > >> >   > > > The backpressure happened in this case?
> > >> >   > > > How much percentage regression in this case?
> > >> >   > > >
> > >> >   > > > Best,
> > >> >   > > > Zhijiang
> > >> >   > > >
> > >> >   > > >
> > >> >   > > >
> > >> >   > > >
> > >> ------------------------------------------------------------------
> > >> >   > > > From:Thomas Weise <[hidden email]>
> > >> >   > > > Send Time:2020年7月2日(星期四) 09:54
> > >> >   > > > To:dev <[hidden email]>
> > >> >   > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > >> >   > > >
> > >> >   > > > Hi Till,
> > >> >   > > >
> > >> >   > > > Yes, we don't have the setting in flink-conf.yaml.
> > >> >   > > >
> > >> >   > > > Generally, we carry forward the existing configuration and
> any
> > >> change
> > >> >   > to
> > >> >   > > > default configuration values would impact the upgrade.
> > >> >   > > >
> > >> >   > > > Yes, since it is an incompatible change I would state it in
> > the
> > >> release
> > >> >   > > > notes.
> > >> >   > > >
> > >> >   > > > Thanks,
> > >> >   > > > Thomas
> > >> >   > > >
> > >> >   > > > BTW I found a performance regression while trying to upgrade
> > >> another
> > >> >   > > > pipeline with this RC. It is a simple Kinesis to Kinesis
> job.
> > >> Wasn't
> > >> >   > able
> > >> >   > > > to pin it down yet, symptoms include increased checkpoint
> > >> alignment
> > >> >   > time.
> > >> >   > > >
> > >> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <
> > >> [hidden email]>
> > >> >   > > > wrote:
> > >> >   > > >
> > >> >   > > > > Hi Thomas,
> > >> >   > > > >
> > >> >   > > > > just to confirm: When starting the image in local mode,
> then
> > >> you
> > >> >   > don't
> > >> >   > > > have
> > >> >   > > > > any of the JobManager memory configuration settings
> > >> configured in the
> > >> >   > > > > effective flink-conf.yaml, right? Does this mean that you
> > have
> > >> >   > > explicitly
> > >> >   > > > > removed `jobmanager.heap.size: 1024m` from the default
> > >> configuration?
> > >> >   > > If
> > >> >   > > > > this is the case, then I believe it was more of an
> > >> unintentional
> > >> >   > > artifact
> > >> >   > > > > that it worked before and it has been corrected now so
> that
> > >> one needs
> > >> >   > > to
> > >> >   > > > > specify the memory of the JM process explicitly. Do you
> > think
> > >> it
> > >> >   > would
> > >> >   > > > help
> > >> >   > > > > to explicitly state this in the release notes?
> > >> >   > > > >
> > >> >   > > > > Cheers,
> > >> >   > > > > Till
> > >> >   > > > >
> > >> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <
> [hidden email]
> > >
> > >> wrote:
> > >> >   > > > >
> > >> >   > > > > > Thanks for preparing another RC!
> > >> >   > > > > >
> > >> >   > > > > > As mentioned in the previous RC thread, it would be
> super
> > >> helpful
> > >> >   > if
> > >> >   > > > the
> > >> >   > > > > > release notes that are part of the documentation can be
> > >> included
> > >> >   > [1].
> > >> >   > > > > It's
> > >> >   > > > > > a significant time-saver to have read those first.
> > >> >   > > > > >
> > >> >   > > > > > I found one more non-backward compatible change that
> would
> > >> be worth
> > >> >   > > > > > addressing/mentioning:
> > >> >   > > > > >
> > >> >   > > > > > It is now necessary to configure the jobmanager heap
> size
> > in
> > >> >   > > > > > flink-conf.yaml (with either jobmanager.heap.size
> > >> >   > > > > > or jobmanager.memory.heap.size). Why would I not want to
> > do
> > >> that
> > >> >   > > > anyways?
> > >> >   > > > > > Well, we set it dynamically for a cluster deployment via
> > the
> > >> >   > > > > > flinkk8soperator, but the container image can also be
> used
> > >> for
> > >> >   > > testing
> > >> >   > > > > with
> > >> >   > > > > > local mode (./bin/jobmanager.sh start-foreground local).
> > >> That will
> > >> >   > > fail
> > >> >   > > > > if
> > >> >   > > > > > the heap wasn't configured and that's how I noticed it.
> > >> >   > > > > >
> > >> >   > > > > > Thanks,
> > >> >   > > > > > Thomas
> > >> >   > > > > >
> > >> >   > > > > > [1]
> > >> >   > > > > >
> > >> >   > > > > >
> > >> >   > > > >
> > >> >   > > >
> > >> >   > >
> > >> >   >
> > >>
> >
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
> > >> >   > > > > >
> > >> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <
> > >> >   > [hidden email]
> > >> >   > > > > > .invalid>
> > >> >   > > > > > wrote:
> > >> >   > > > > >
> > >> >   > > > > > > Hi everyone,
> > >> >   > > > > > >
> > >> >   > > > > > > Please review and vote on the release candidate #4 for
> > the
> > >> >   > version
> > >> >   > > > > > 1.11.0,
> > >> >   > > > > > > as follows:
> > >> >   > > > > > > [ ] +1, Approve the release
> > >> >   > > > > > > [ ] -1, Do not approve the release (please provide
> > >> specific
> > >> >   > > comments)
> > >> >   > > > > > >
> > >> >   > > > > > > The complete staging area is available for your
> review,
> > >> which
> > >> >   > > > includes:
> > >> >   > > > > > > * JIRA release notes [1],
> > >> >   > > > > > > * the official Apache source release and binary
> > >> convenience
> > >> >   > > releases
> > >> >   > > > to
> > >> >   > > > > > be
> > >> >   > > > > > > deployed to dist.apache.org [2], which are signed
> with
> > >> the key
> > >> >   > > with
> > >> >   > > > > > > fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E
> > [3],
> > >> >   > > > > > > * all artifacts to be deployed to the Maven Central
> > >> Repository
> > >> >   > [4],
> > >> >   > > > > > > * source code tag "release-1.11.0-rc4" [5],
> > >> >   > > > > > > * website pull request listing the new release and
> > adding
> > >> >   > > > announcement
> > >> >   > > > > > > blog post [6].
> > >> >   > > > > > >
> > >> >   > > > > > > The vote will be open for at least 72 hours. It is
> > >> adopted by
> > >> >   > > > majority
> > >> >   > > > > > > approval, with at least 3 PMC affirmative votes.
> > >> >   > > > > > >
> > >> >   > > > > > > Thanks,
> > >> >   > > > > > > Release Manager
> > >> >   > > > > > >
> > >> >   > > > > > > [1]
> > >> >   > > > > > >
> > >> >   > > > > >
> > >> >   > > > >
> > >> >   > > >
> > >> >   > >
> > >> >   >
> > >>
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> > >> >   > > > > > > [2]
> > >> >   > https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
> > >> >   > > > > > > [3]
> > https://dist.apache.org/repos/dist/release/flink/KEYS
> > >> >   > > > > > > [4]
> > >> >   > > > > > >
> > >> >   > > > >
> > >> >   > >
> > >>
> https://repository.apache.org/content/repositories/orgapacheflink-1377/
> > >> >   > > > > > > [5]
> > >> >   > >
> https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
> > >> >   > > > > > > [6] https://github.com/apache/flink-web/pull/352
> > >> >   > > > > > >
> > >> >   > > > > > >
> > >> >   > > > > >
> > >> >   > > > >
> > >> >   > > >
> > >> >   > > >
> > >> >   > >
> > >> >   > >
> > >> >   >
> > >> >   >
> > >> >
> > >> >
> > >> >
> > >>
> > >>
> >
> >
>

Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

Thomas Weise-2
I ran git bisect and the first commit that shows the regression is:

https://github.com/apache/flink/commit/355184d69a8519d29937725c8d85e8465d7e3a90
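
For readers following along: on a linear history, git bisect amounts to a binary search between a known-good and a known-bad commit, testing the midpoint each time. Below is a minimal sketch of that idea, assuming every commit after the first bad one is also bad. The commit names and the isBad check are hypothetical placeholders, not real Flink commits or the actual benchmark used here.

```java
import java.util.List;
import java.util.function.Predicate;

public class BisectSketch {

    /**
     * Returns the index of the first "bad" commit in a history ordered from
     * oldest to newest. Assumes the first commit is good, the last one is bad,
     * and every commit after the first bad one is also bad.
     */
    public static <T> int firstBad(List<T> commits, Predicate<T> isBad) {
        int good = 0;                  // index of a commit known to be good
        int bad = commits.size() - 1;  // index of a commit known to be bad
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2;
            if (isBad.test(commits.get(mid))) {
                bad = mid;   // regression already present at mid
            } else {
                good = mid;  // regression introduced after mid
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Hypothetical history; placeholders only.
        List<String> history = List.of("c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7");
        // Pretend the regression was introduced in "c5".
        int idx = firstBad(history, c -> c.compareTo("c5") >= 0);
        System.out.println("first bad commit: " + history.get(idx));
    }
}
```

Each step halves the remaining range, so even a few hundred commits between two releases need only around eight to ten test runs, which matters when every run is a full job deployment.
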


On Thu, Jul 23, 2020 at 6:46 PM Kurt Young <[hidden email]> wrote:

> From my experience, Java profilers are sometimes not accurate enough to
> find the root cause of a performance regression. In this case, I would
> suggest you try out Intel VTune Amplifier to look at more detailed metrics.
>
> Best,
> Kurt
>
>
> On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise <[hidden email]> wrote:
>
> > The cause of the issue is anything but clear.
> >
> > Previously I had mentioned that there is no suspect change to the Kinesis
> > connector and that I had reverted the AWS SDK change to no effect.
> >
> > https://issues.apache.org/jira/browse/FLINK-17496 actually fixed another
> > regression in the previous release and is present before and after.
> >
> > I repeated the run with 1.11.0 core and downgraded the entire Kinesis
> > connector to 1.10.1: Nothing changes, i.e. the regression is still
> present.
> > Therefore we will need to look elsewhere for the root cause.
> >
> > Regarding the time spent in snapshotState, repeat runs reveal a wide
> range
> > for both versions, 1.10 and 1.11. So again this is nothing pointing to a
> > root cause.
> >
> > At this point, I have no ideas remaining other than doing a bisect to
> find
> > the culprit. Any other suggestions?
> >
> > Thomas
> >
> >
> > On Thu, Jul 16, 2020 at 9:19 PM Zhijiang <[hidden email]
> > .invalid>
> > wrote:
> >
> > > Hi Thomas,
> > >
> > > > Thanks for the further profiling information. Glad to see we have
> > > > already narrowed down the location that causes the regression.
> > > Actually I was also suspicious of the point of #snapshotState in
> previous
> > > discussions since it indeed cost much time to block normal operator
> > > processing.
> > >
> > > Based on your below feedback, the sleep time during #snapshotState
> might
> > > > be the main concern, and I also dug into the implementation of
> > > FlinkKinesisProducer#snapshotState.
> > > while (producer.getOutstandingRecordsCount() > 0) {
> > >    producer.flush();
> > >    try {
> > >       Thread.sleep(500);
> > >    } catch (InterruptedException e) {
> > >       LOG.warn("Flushing was interrupted.");
> > >       break;
> > >    }
> > > }
> > > It seems that the sleep time is mainly affected by the internal
> > operations
> > > inside KinesisProducer implementation provided by amazonaws, which I am
> > not
> > > quite familiar with.
> > > But I noticed there were two upgrades related to it in release-1.11.0.
> > One
> > > is for upgrading amazon-kinesis-producer to 0.14.0 [1] and another is
> for
> > > upgrading aws-sdk-version to 1.11.754 [2].
> > > > You mentioned that you already reverted the SDK upgrade and saw no
> > > > change. Did you also revert [1] to verify?
> > > [1] https://issues.apache.org/jira/browse/FLINK-17496
> > > [2] https://issues.apache.org/jira/browse/FLINK-14881
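
The polling pattern in the loop quoted above can be modeled without any AWS dependency. The following is a minimal sketch, not the Flink or KPL implementation: the Producer interface and the draining fake are hypothetical stand-ins, and sleeping is only counted rather than performed. It shows that the time spent in such a loop is the poll interval multiplied by the number of polls needed until the producer reports zero outstanding records.

```java
/** Illustrative model of a flush-and-poll loop; not the Flink or KPL implementation. */
final class FlushLoopSketch {

    /** Hypothetical stand-in for a producer that buffers records. */
    interface Producer {
        int getOutstandingRecordsCount();
        void flush();
    }

    /**
     * Polls until the producer is drained and returns the simulated wait time
     * in milliseconds. No real sleeping happens; the interval is only added up.
     */
    static long simulatedWaitMillis(Producer producer, long pollIntervalMillis) {
        long waited = 0;
        while (producer.getOutstandingRecordsCount() > 0) {
            producer.flush();
            waited += pollIntervalMillis; // stands in for Thread.sleep(pollIntervalMillis)
        }
        return waited;
    }

    /** Fake producer that drains a fixed number of records per flush call. */
    static Producer draining(int outstanding, int drainedPerFlush) {
        return new Producer() {
            private int remaining = outstanding;

            @Override
            public int getOutstandingRecordsCount() {
                return remaining;
            }

            @Override
            public void flush() {
                remaining = Math.max(0, remaining - drainedPerFlush);
            }
        };
    }

    public static void main(String[] args) {
        // Same backlog, different drain rates: the slower producer needs more polls.
        System.out.println(simulatedWaitMillis(draining(1000, 500), 500));
        System.out.println(simulatedWaitMillis(draining(1000, 100), 500));
    }
}
```

Under this model, a single outstanding record already costs one full poll interval, and anything that slows draining inside the producer shows up directly as additional sleep time in the snapshot.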
> > >
> > > Best,
> > > Zhijiang
> > > ------------------------------------------------------------------
> > > From:Thomas Weise <[hidden email]>
> > > > Send Time: July 17, 2020 (Friday) 05:29
> > > To:dev <[hidden email]>
> > > Cc:Zhijiang <[hidden email]>; Stephan Ewen <
> [hidden email]
> > >;
> > > Arvid Heise <[hidden email]>; Aljoscha Krettek <
> [hidden email]
> > >
> > > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> release
> > > candidate #4)
> > >
> > > Sorry for the delay.
> > >
> > > I confirmed that the regression is due to the sink (unsurprising, since
> > > another job with the same consumer, but not the producer, runs as
> > > expected).
> > >
> > > As promised I did CPU profiling on the problematic application, which
> > gives
> > > more insight into the regression [1]
> > >
> > > The screenshots show that the average time for snapshotState increases
> > from
> > > ~9s to ~28s. The data also shows the increase in sleep time during
> > > snapshotState.
> > >
> > > Does anyone, based on changes made in 1.11, have a theory why?
> > >
> > > I had previously looked at the changes to the Kinesis connector and
> also
> > > reverted the SDK upgrade, which did not change the situation.
> > >
> > > It will likely be necessary to drill into the sink / checkpointing
> > details
> > > to understand the cause of the problem.
> > >
> > > Let me know if anyone has specific questions that I can answer from the
> > > profiling results.
> > >
> > > Thomas
> > >
> > > [1]
> > >
> > >
> >
> https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing
> > >
> > > On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <[hidden email]> wrote:
> > >
> > > > + dev@ for visibility
> > > >
> > > > I will investigate further today.
> > > >
> > > >
> > > > On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <[hidden email]
> >
> > > > wrote:
> > > >
> > > >> On 06.07.20 20:39, Stephan Ewen wrote:
> > > >> >    - Did sink checkpoint notifications change in a relevant way,
> for
> > > >> example
> > > >> > due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
> > > >>
> > > >> I think that's unrelated: the Kafka fixes were isolated in Kafka and
> > the
> > > >> one bug I discovered on the way was about the Task reaper.
> > > >>
> > > >>
> > > >> On 07.07.20 17:51, Zhijiang wrote:
> > > > >> > Sorry for my misunderstanding of the previous information, Thomas. I
> > was
> > > >> assuming that the sync checkpoint duration increased after upgrade
> as
> > it
> > > >> was mentioned before.
> > > >> >
> > > > >> > If I remember correctly, the memory state backend also has the same
> > > > >> > issue? If so, we can dismiss the RocksDB state changes. As slot sharing
> > > > >> > is enabled, the downstream and upstream should probably be deployed into
> > > > >> > the same slot, so there is no network shuffle effect.
> > > >> >
> > > > >> > I think we need to find out whether any other symptoms changed besides
> > > > >> > the performance regression, to further narrow down the scope, e.g. any
> > > > >> > metrics changes, or changes in the number of TaskManagers and the number
> > > > >> > of slots per TaskManager in the deployment.
> > > > >> > A 40% regression is really big; I guess the changes should also be
> > > > >> > reflected in other places.
> > > > >> >
> > > > >> > I am not sure whether we can reproduce the regression in our AWS
> > > > >> > environment by writing arbitrary Kinesis jobs, since, as Thomas mentioned,
> > > > >> > there are also Kinesis jobs that behave normally after the upgrade.
> > > > >> > So it probably hits some corner case. I am very willing to help with
> > > > >> > debugging if possible.
> > > >> >
> > > >> >
> > > >> > Best,
> > > >> > Zhijiang
> > > >> >
> > > >> >
> > > >> > ------------------------------------------------------------------
> > > >> > From:Thomas Weise <[hidden email]>
> > > > >> > Send Time: July 7, 2020 (Tuesday) 23:01
> > > >> > To:Stephan Ewen <[hidden email]>
> > > >> > Cc:Aljoscha Krettek <[hidden email]>; Arvid Heise <
> > > >> [hidden email]>; Zhijiang <[hidden email]>
> > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> > > >> release candidate #4)
> > > >> >
> > > >> > We are deploying our apps with FlinkK8sOperator. We have one job
> > that
> > > >> works as expected after the upgrade and the one discussed here that
> > has
> > > the
> > > >> performance regression.
> > > >> >
> > > >> > "The performance regression is obvious caused by long duration of
> > sync
> > > >> checkpoint process in Kinesis sink operator, which would block the
> > > normal
> > > >> data processing until back pressure the source."
> > > >> >
> > > > >> > That's a constant. The sync checkpointing time is the same before
> > > > >> > (1.10) and after the upgrade. The question is what change came in
> > > > >> > with the upgrade.
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <[hidden email]>
> > wrote:
> > > >> >
> > > >> > @Thomas Just one thing real quick: Are you using the standalone
> > setup
> > > >> scripts (like start-cluster.sh, and the former "slaves" file) ?
> > > > >> > Be aware that this is now called "workers" to avoid sensitive names.
> > > >> > In one internal benchmark we saw quite a lot of slowdown
> initially,
> > > >> before seeing that the cluster was not a distributed cluster any
> more
> > > ;-)
> > > >> >
> > > >> >
> > > >> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <
> [hidden email]
> > >
> > > >> wrote:
> > > >> > Thanks for this kickoff and help analysis, Stephan!
> > > >> > Thanks for the further feedback and investigation, Thomas!
> > > >> >
> > > > >> > The performance regression is obviously caused by the long duration of
> > > > >> > the sync checkpoint process in the Kinesis sink operator, which blocks
> > > > >> > normal data processing until the source is back-pressured.
> > > > >> > Maybe we could dig into the sync part of the checkpoint, e.g. break
> > > > >> > down the steps inside the respective operator#snapshotState to measure
> > > > >> > which operation costs the most time; then we would probably find the
> > > > >> > root cause of that cost.
> > > >> >
> > > >> > Look forward to the further progress. :)
> > > >> >
> > > >> > Best,
> > > >> > Zhijiang
> > > >> >
> > > >> > ------------------------------------------------------------------
> > > >> > From:Stephan Ewen <[hidden email]>
> > > > >> > Send Time: July 7, 2020 (Tuesday) 14:52
> > > >> > To:Thomas Weise <[hidden email]>
> > > >> > Cc:Stephan Ewen <[hidden email]>; Zhijiang <
> > > >> [hidden email]>; Aljoscha Krettek <[hidden email]
> >;
> > > >> Arvid Heise <[hidden email]>
> > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> > > >> release candidate #4)
> > > >> >
> > > > >> > Thank you for digging so deeply.
> > > > >> > Mysterious thing, this regression.
> > > >> >
> > > >> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <[hidden email]> wrote:
> > > >> > @Stephan: yes, I refer to sync time in the web UI (it is unchanged
> > > >> between 1.10 and 1.11 for the specific pipeline).
> > > >> >
> > > >> > I verified that increasing the checkpointing interval does not
> make
> > a
> > > >> difference.
> > > >> >
> > > >> > I looked at the Kinesis connector changes since 1.10.1 and don't
> see
> > > >> anything that could cause this.
> > > >> >
> > > >> > Another pipeline that is using the Kinesis consumer (but not the
> > > >> producer) performs as expected.
> > > >> >
> > > >> > I tried reverting the AWS SDK version change, symptoms remain
> > > unchanged:
> > > >> >
> > > >> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml
> > > >> b/flink-connectors/flink-connector-kinesis/pom.xml
> > > >> > index a6abce23ba..741743a05e 100644
> > > >> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
> > > >> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
> > > >> > @@ -33,7 +33,7 @@ under the License.
> > > >> >
> > > >>
> > <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
> > > >> >          <name>flink-connector-kinesis</name>
> > > >> >          <properties>
> > > >> > -               <aws.sdk.version>1.11.754</aws.sdk.version>
> > > >> > +               <aws.sdk.version>1.11.603</aws.sdk.version>
> > > >> >
> > > >> <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
> > > >> >
> > > >> <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
> > > >> >
> > > >>
> > >
> >
> <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
> > > >> >
> > > >> > I'm planning to take a look with a profiler next.
> > > >> >
> > > >> > Thomas
> > > >> >
> > > >> >
> > > >> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <[hidden email]>
> > > wrote:
> > > >> > Hi all!
> > > >> >
> > > >> > Forking this thread out of the release vote thread.
> > > >> >  From what Thomas describes, it really sounds like a sink-specific
> > > >> issue.
> > > >> >
> > > >> > @Thomas: When you say sink has a long synchronous checkpoint time,
> > you
> > > >> mean the time that is shown as "sync time" on the metrics and web
> UI?
> > > That
> > > >> is not including any network buffer related operations. It is purely
> > the
> > > >> operator's time.
> > > >> >
> > > >> > Can we dig into the changes we did in sinks:
> > > >> >    - Kinesis version upgrade, AWS library updates
> > > >> >
> > > > >> >    - Could it be that some call (checkpoint complete) that was
> > > > >> > previously (1.10) in a separate thread is now in the mailbox, and this
> > > > >> > simply reduces the number of threads that do the work?
> > > >> >
> > > >> >    - Did sink checkpoint notifications change in a relevant way,
> for
> > > >> example due to some Kafka issues we addressed in 1.11 (@Aljoscha
> > maybe?)
> > > >> >
> > > >> > Best,
> > > >> > Stephan
> > > >> >
> > > >> >
> > > >> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <
> [hidden email]
> > > .invalid>
> > > >> wrote:
> > > >> > Hi Thomas,
> > > >> >
> > > > >> >   Regarding [2], there are more details in the Jira description
> > > > >> >   (https://issues.apache.org/jira/browse/FLINK-16404).
> > > >> >
> > > > >> >   I can also give a basic explanation here to dismiss the concern.
> > > > >> >   1. In the past, the buffers following the barrier were cached on the
> > > > >> >   downstream side before alignment.
> > > > >> >   2. In 1.11, the upstream does not send the buffers after the barrier.
> > > > >> >   When the downstream finishes the alignment, it notifies the upstream
> > > > >> >   to continue sending the following buffers, since it can process them
> > > > >> >   after alignment.
> > > > >> >   3. The only difference is that the temporarily blocked buffers are
> > > > >> >   cached either on the downstream side or on the upstream side before
> > > > >> >   alignment.
> > > > >> >   4. The side effect is the additional notification cost for every
> > > > >> >   barrier alignment. If the downstream and upstream are deployed in
> > > > >> >   separate TaskManagers, the cost is the network transport delay (the
> > > > >> >   effect can be ignored based on our testing with a 1s checkpoint
> > > > >> >   interval). With slot sharing, as in your case, the cost is only one
> > > > >> >   method call in the processor, which can also be ignored.
> > > >> >
> > > > >> >   You mentioned "In this case, the downstream task has a high average
> > > > >> >   checkpoint duration(~30s, sync part)." This duration does not reflect
> > > > >> >   the changes above; it only indicates the duration of calling
> > > > >> >   `Operator.snapshotState`.
> > > > >> >   If this duration is beyond your expectation, you can check or debug
> > > > >> >   whether the source/sink operators take more time to finish
> > > > >> >   `snapshotState` in practice. E.g. you can make the implementation of
> > > > >> >   this method empty to further verify the effect.
> > > >> >
> > > >> >   Best,
> > > >> >   Zhijiang
> > > >> >
> > > >> >
> > > >> >
>  ------------------------------------------------------------------
> > > >> >   From:Thomas Weise <[hidden email]>
> > > > >> >   Send Time: July 5, 2020 (Sunday) 12:22
> > > >> >   To:dev <[hidden email]>; Zhijiang <
> > [hidden email]
> > > >
> > > >> >   Cc:Yingjie Cao <[hidden email]>
> > > >> >   Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > >> >
> > > >> >   Hi Zhijiang,
> > > >> >
> > > >> >   Could you please point me to more details regarding: "[2]: Delay
> > > send
> > > >> the
> > > >> >   following buffers after checkpoint barrier on upstream side
> until
> > > >> barrier
> > > >> >   alignment on downstream side."
> > > >> >
> > > >> >   In this case, the downstream task has a high average checkpoint
> > > >> duration
> > > >> >   (~30s, sync part). If there was a change to hold buffers
> depending
> > > on
> > > >> >   downstream performance, could this possibly apply to this case
> > (even
> > > >> when
> > > >> >   there is no shuffle that would require alignment)?
> > > >> >
> > > >> >   Thanks,
> > > >> >   Thomas
> > > >> >
> > > >> >
> > > >> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <
> > [hidden email]
> > > >> .invalid>
> > > >> >   wrote:
> > > >> >
> > > >> >   > Hi Thomas,
> > > >> >   >
> > > >> >   > Thanks for the further update information.
> > > >> >   >
> > > >> >   > I guess we can dismiss the network stack changes, since in
> your
> > > >> case the
> > > >> >   > downstream and upstream would probably be deployed in the same
> > > slot
> > > >> >   > bypassing the network data shuffle.
> > > >> >   > Also I guess release-1.11 will not bring general performance
> > > >> regression in
> > > >> >   > runtime engine, as we also did the performance testing for all
> > > >> general
> > > >> >   > cases by [1] in real cluster before and the testing results
> > should
> > > >> fit the
> > > > >> >   > expectation. But we indeed have not tested the specific source and
> > > > >> >   > sink connectors yet, as far as I know.
> > > >> >   >
> > > > >> >   > Regarding your 40% performance regression, I suspect it is related
> > > > >> >   > to specific source/sink changes (e.g. Kinesis) or to an environment
> > > > >> >   > issue in a corner case.
> > > > >> >   > If possible, it would be helpful to further determine whether the
> > > > >> >   > regression is caused by Kinesis, by replacing the Kinesis source and
> > > > >> >   > sink and keeping everything else the same.
> > > >> >   >
> > > > >> >   > As you said, it would be efficient to contact you directly next week
> > > > >> >   > to further discuss this issue. We are eager to provide any help to
> > > > >> >   > resolve this issue soon.
> > > >> >   >
> > > > >> >   > Besides that, I guess this issue should not be a blocker for the
> > > > >> >   > release, since it is probably a corner case based on the current
> > > > >> >   > analysis. If we conclude that anything needs to be resolved after the
> > > > >> >   > final release, we can make the next minor release-1.11.1 come soon.
> > > >> >   >
> > > >> >   > [1] https://issues.apache.org/jira/browse/FLINK-18433
> > > >> >   >
> > > >> >   > Best,
> > > >> >   > Zhijiang
> > > >> >   >
> > > >> >   >
> > > >> >   >
> > ------------------------------------------------------------------
> > > >> >   > From:Thomas Weise <[hidden email]>
> > > > >> >   > Send Time: July 4, 2020 (Saturday) 12:26
> > > >> >   > To:dev <[hidden email]>; Zhijiang <
> > > [hidden email]
> > > >> >
> > > >> >   > Cc:Yingjie Cao <[hidden email]>
> > > >> >   > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > >> >   >
> > > >> >   > Hi Zhijiang,
> > > >> >   >
> > > >> >   > It will probably be best if we connect next week and discuss
> the
> > > >> issue
> > > >> >   > directly since this could be quite difficult to reproduce.
> > > >> >   >
> > > >> >   > Before the testing result on our side comes out for your
> > > respective
> > > >> job
> > > >> >   > case, I have some other questions to confirm for further
> > analysis:
> > > >> >   >     -  How much percentage regression you found after
> switching
> > to
> > > >> 1.11?
> > > >> >   >
> > > >> >   > ~40% throughput decline
> > > >> >   >
> > > >> >   >     -  Are there any network bottleneck in your cluster? E.g.
> > the
> > > >> network
> > > >> >   > bandwidth is full caused by other jobs? If so, it might have
> > more
> > > >> effects
> > > >> >   > by above [2]
> > > >> >   >
> > > >> >   > The test runs on a k8s cluster that is also used for other
> > > >> production jobs.
> > > > >> >   > There is no reason to believe the network is the bottleneck.
> > > >> >   >
> > > >> >   >     -  Did you adjust the default network buffer setting? E.g.
> > > >> >   > "taskmanager.network.memory.floating-buffers-per-gate" or
> > > >> >   > "taskmanager.network.memory.buffers-per-channel"
> > > >> >   >
> > > > >> >   > The job is using the defaults, i.e. we don't configure the
> > > settings.
> > > >> If you
> > > >> >   > want me to try specific settings in the hope that it will help
> > to
> > > >> isolate
> > > >> >   > the issue please let me know.
> > > >> >   >
> > > >> >   >     -  I guess the topology has three vertexes
> "KinesisConsumer
> > ->
> > > >> Chained
> > > >> >   > FlatMap -> KinesisProducer", and the partition mode for
> > > >> "KinesisConsumer ->
> > > >> >   > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If
> > so,
> > > >> the edge
> > > >> >   > connection is one-to-one, not all-to-all, then the above
> [1][2]
> > > >> should no
> > > >> >   > effects in theory with default network buffer setting.
> > > >> >   >
> > > >> >   > There are only 2 vertices and the edge is "forward".
> > > >> >   >
> > > >> >   >     - By slot sharing, I guess these three vertex parallelism
> > task
> > > >> would
> > > >> >   > probably be deployed into the same slot, then the data shuffle
> > is
> > > >> by memory
> > > >> >   > queue, not network stack. If so, the above [2] should no
> effect.
> > > >> >   >
> > > >> >   > Yes, vertices share slots.
> > > >> >   >
> > > >> >   >     - I also saw some Jira changes for kinesis in this
> release,
> > > >> could you
> > > >> >   > confirm that these changes would not effect the performance?
> > > >> >   >
> > > >> >   > I will need to take a look. 1.10 already had a regression
> > > >> introduced by the
> > > >> >   > Kinesis producer update.
> > > >> >   >
> > > >> >   >
> > > >> >   > Thanks,
> > > >> >   > Thomas
> > > >> >   >
> > > >> >   >
> > > >> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <
> > > >> [hidden email]
> > > >> >   > .invalid>
> > > >> >   > wrote:
> > > >> >   >
> > > >> >   > > Hi Thomas,
> > > >> >   > >
> > > >> >   > > Thanks for your reply with rich information!
> > > >> >   > >
> > > >> >   > > We are trying to reproduce your case in our cluster to
> further
> > > >> verify it,
> > > >> >   > > and  @Yingjie Cao is working on it now.
> > > > >> >   > > As we do not have a Kinesis consumer and producer internally, we will
> > > > >> >   > > construct a common source and sink instead, under backpressure.
> > > > >> >   > >
> > > > >> >   > > Firstly, we can dismiss the RocksDB factor in this release, since you
> > > > >> >   > > also mentioned that "filesystem leads to same symptoms".
> > > > >> >   > >
> > > > >> >   > > Secondly, if my understanding is right, you emphasize that the
> > > > >> >   > > regression only exists for jobs with a low checkpoint interval (10s).
> > > >> >   > > Based on that, I have two suspicions with the network
> related
> > > >> changes in
> > > >> >   > > this release:
> > > >> >   > >     - [1]: Limited the maximum backlog value (default 10) in
> > > >> subpartition
> > > >> >   > > queue.
> > > >> >   > >     - [2]: Delay send the following buffers after checkpoint
> > > >> barrier on
> > > >> >   > > upstream side until barrier alignment on downstream side.
> > > >> >   > >
> > > > >> >   > > These changes are motivated by reducing the in-flight buffers to
> > > > >> >   > > speed up checkpoints, especially under backpressure.
> > > > >> >   > > In theory they should have a very minor performance effect, and we
> > > > >> >   > > also tested them in a cluster to verify this before merging, but
> > > > >> >   > > maybe there are other corner cases we have not thought of.
> > > >> >   > >
> > > >> >   > > Before the testing result on our side comes out for your
> > > >> respective job
> > > >> >   > > case, I have some other questions to confirm for further
> > > analysis:
> > > > >> >   > >     -  What percentage regression did you find after switching to
> > > > >> >   > > 1.11?
> > > > >> >   > >     -  Is there any network bottleneck in your cluster? E.g. is the
> > > > >> >   > > network bandwidth saturated by other jobs? If so, [2] above might
> > > > >> >   > > have more effect.
> > > > >> >   > >     -  Did you adjust the default network buffer settings? E.g.
> > > > >> >   > > "taskmanager.network.memory.floating-buffers-per-gate" or
> > > > >> >   > > "taskmanager.network.memory.buffers-per-channel"
> > > > >> >   > >     -  I guess the topology has three vertices, "KinesisConsumer ->
> > > > >> >   > > Chained FlatMap -> KinesisProducer", and the partition mode for both
> > > > >> >   > > "KinesisConsumer -> FlatMap" and "FlatMap->KinesisProducer" is
> > > > >> >   > > "forward"? If so, the edge connection is one-to-one, not all-to-all,
> > > > >> >   > > so [1] and [2] above should have no effect in theory with the
> > > > >> >   > > default network buffer settings.
> > > > >> >   > >     - By slot sharing, I guess the parallel tasks of these three
> > > > >> >   > > vertices would probably be deployed into the same slot, so the data
> > > > >> >   > > shuffle goes through a memory queue, not the network stack. If so,
> > > > >> >   > > [2] above should have no effect.
> > > > >> >   > >     - I also saw some Jira changes for Kinesis in this release. Could
> > > > >> >   > > you confirm that these changes would not affect the performance?
> > > >> >   > >
> > > >> >   > > Best,
> > > >> >   > > Zhijiang
> > > >> >   > >
> > > >> >   > >
> > > >> >   > >
> > > ------------------------------------------------------------------
> > > >> >   > > From:Thomas Weise <[hidden email]>
> > > > >> >   > > Send Time: July 3, 2020 (Friday) 01:07
> > > >> >   > > To:dev <[hidden email]>; Zhijiang <
> > > >> [hidden email]>
> > > >> >   > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > >> >   > >
> > > >> >   > > Hi Zhijiang,
> > > >> >   > >
> > > >> >   > > The performance degradation manifests in backpressure which
> > > leads
> > > >> to
> > > >> >   > > growing backlog in the source. I switched a few times
> between
> > > >> 1.10 and
> > > >> >   > 1.11
> > > >> >   > > and the behavior is consistent.
> > > >> >   > >
> > > >> >   > > The DAG is:
> > > >> >   > >
> > > >> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map)   --------
> > > >> forward
> > > >> >   > > ---------> KinesisProducer
> > > >> >   > >
> > > >> >   > > Parallelism: 160
> > > >> >   > > No shuffle/rebalance.
> > > >> >   > >
> > > >> >   > > Checkpointing config:
> > > >> >   > >
> > > >> >   > > Checkpointing Mode Exactly Once
> > > >> >   > > Interval 10s
> > > >> >   > > Timeout 10m 0s
> > > >> >   > > Minimum Pause Between Checkpoints 10s
> > > >> >   > > Maximum Concurrent Checkpoints 1
> > > >> >   > > Persist Checkpoints Externally Enabled (delete on
> > cancellation)
> > > >> >   > >
> > > >> >   > > State backend: rocksdb  (filesystem leads to same symptoms)
> > > >> >   > > Checkpoint size is tiny (500KB)
> > > >> >   > >
> > > >> >   > > An interesting difference to another job that I had upgraded
> > > >> successfully
> > > >> >   > > is the low checkpointing interval.
> > > >> >   > >
> > > >> >   > > Thanks,
> > > >> >   > > Thomas
> > > >> >   > >
> > > >> >   > >
> > > >> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <
> > > >> [hidden email]
> > > >> >   > > .invalid>
> > > >> >   > > wrote:
> > > >> >   > >
> > > >> >   > > > Hi Thomas,
> > > >> >   > > >
> > > >> >   > > > Thanks for the efficient feedback.
> > > >> >   > > >
> > > >> >   > > > Regarding the suggestion of adding the release notes
> > document,
> > > >> I agree
> > > >> >   > > > with your point. Maybe we should adjust the vote template
> > > >> accordingly
> > > >> >   > in
> > > >> >   > > > the respective wiki to guide the following release
> > processes.
> > > >> >   > > >
> > > > >> >   > > > Regarding the performance regression, could you provide some more
> > > > >> >   > > > details so that we can measure or reproduce it on our side?
> > > > >> >   > > > E.g. I guess the topology only includes two vertices, source and sink?
> > > > >> >   > > > What is the parallelism of every vertex?
> > > > >> >   > > > Does the upstream shuffle data to the downstream via a rebalance
> > > > >> >   > > > partitioner or something else?
> > > > >> >   > > > Is the checkpoint mode exactly-once with the RocksDB state backend?
> > > > >> >   > > > Did backpressure happen in this case?
> > > > >> >   > > > What is the percentage regression in this case?
> > > >> >   > > >
> > > >> >   > > > Best,
> > > >> >   > > > Zhijiang
> > > >> >   > > >
> > > >> >   > > >
> > > >> >   > > >
> > > >> >   > > >
> > > >> ------------------------------------------------------------------
> > > >> >   > > > From:Thomas Weise <[hidden email]>
> > > > >> >   > > > Send Time: July 2, 2020 (Thursday) 09:54
> > > >> >   > > > To:dev <[hidden email]>
> > > >> >   > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > >> >   > > >
> > > >> >   > > > Hi Till,
> > > >> >   > > >
> > > >> >   > > > Yes, we don't have the setting in flink-conf.yaml.
> > > >> >   > > >
> > > >> >   > > > Generally, we carry forward the existing configuration and
> > any
> > > >> change
> > > >> >   > to
> > > >> >   > > > default configuration values would impact the upgrade.
> > > >> >   > > >
> > > >> >   > > > Yes, since it is an incompatible change I would state it
> in
> > > the
> > > >> release
> > > >> >   > > > notes.
> > > >> >   > > >
> > > >> >   > > > Thanks,
> > > >> >   > > > Thomas
> > > >> >   > > >
> > > >> >   > > > BTW I found a performance regression while trying to
> upgrade
> > > >> another
> > > >> >   > > > pipeline with this RC. It is a simple Kinesis to Kinesis
> > job.
> > > >> Wasn't
> > > >> >   > able
> > > >> >   > > > to pin it down yet, symptoms include increased checkpoint
> > > >> alignment
> > > >> >   > time.
> > > >> >   > > >
> > > >> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <
> > > >> [hidden email]>
> > > >> >   > > > wrote:
> > > >> >   > > >
> > > >> >   > > > > Hi Thomas,
> > > >> >   > > > >
> > > >> >   > > > > just to confirm: When starting the image in local mode,
> > then
> > > >> you
> > > >> >   > don't
> > > >> >   > > > have
> > > >> >   > > > > any of the JobManager memory configuration settings
> > > >> configured in the
> > > >> >   > > > > effective flink-conf.yaml, right? Does this mean that
> you
> > > have
> > > >> >   > > explicitly
> > > >> >   > > > > removed `jobmanager.heap.size: 1024m` from the default
> > > >> configuration?
> > > >> >   > > If
> > > >> >   > > > > this is the case, then I believe it was more of an
> > > >> unintentional
> > > >> >   > > artifact
> > > >> >   > > > > that it worked before and it has been corrected now so
> > that
> > > >> one needs
> > > >> >   > > to
> > > >> >   > > > > specify the memory of the JM process explicitly. Do you
> > > think
> > > >> it
> > > >> >   > would
> > > >> >   > > > help
> > > >> >   > > > > to explicitly state this in the release notes?
> > > >> >   > > > >
> > > >> >   > > > > Cheers,
> > > >> >   > > > > Till
> > > >> >   > > > >
> > > >> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <
> > [hidden email]
> > > >
> > > >> wrote:
> > > >> >   > > > >
> > > >> >   > > > > > Thanks for preparing another RC!
> > > >> >   > > > > >
> > > >> >   > > > > > As mentioned in the previous RC thread, it would be
> > super
> > > >> helpful
> > > >> >   > if
> > > >> >   > > > the
> > > >> >   > > > > > release notes that are part of the documentation can
> be
> > > >> included
> > > >> >   > [1].
> > > >> >   > > > > It's
> > > >> >   > > > > > a significant time-saver to have read those first.
> > > >> >   > > > > >
> > > >> >   > > > > > I found one more non-backward compatible change that
> > would
> > > >> be worth
> > > >> >   > > > > > addressing/mentioning:
> > > >> >   > > > > >
> > > >> >   > > > > > It is now necessary to configure the jobmanager heap
> > size
> > > in
> > > >> >   > > > > > flink-conf.yaml (with either jobmanager.heap.size
> > > >> >   > > > > > or jobmanager.memory.heap.size). Why would I not want
> to
> > > do
> > > >> that
> > > >> >   > > > anyways?
> > > >> >   > > > > > Well, we set it dynamically for a cluster deployment
> via
> > > the
> > > >> >   > > > > > flinkk8soperator, but the container image can also be
> > used
> > > >> for
> > > >> >   > > testing
> > > >> >   > > > > with
> > > >> >   > > > > > local mode (./bin/jobmanager.sh start-foreground
> local).
> > > >> That will
> > > >> >   > > fail
> > > >> >   > > > > if
> > > >> >   > > > > > the heap wasn't configured and that's how I noticed
> it.
> > > >> >   > > > > >
> > > >> >   > > > > > Thanks,
> > > >> >   > > > > > Thomas
> > > >> >   > > > > >
> > > >> >   > > > > > [1]
> > > >> >   > > > > >
> > > >> >   > > > > >
> > > >> >   > > > >
> > > >> >   > > >
> > > >> >   > >
> > > >> >   >
> > > >>
> > >
> >
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
> > > >> >   > > > > >
> > > > >> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <[hidden email].invalid> wrote:
> > > >> >   > > > > >
> > > >> >   > > > > > > Hi everyone,
> > > >> >   > > > > > >
> > > > >> >   > > > > > > Please review and vote on the release candidate #4 for the version 1.11.0,
> > > > >> >   > > > > > > as follows:
> > > > >> >   > > > > > > [ ] +1, Approve the release
> > > > >> >   > > > > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > > >> >   > > > > > >
> > > > >> >   > > > > > > The complete staging area is available for your review, which includes:
> > > > >> >   > > > > > > * JIRA release notes [1],
> > > > >> >   > > > > > > * the official Apache source release and binary convenience releases to be
> > > > >> >   > > > > > > deployed to dist.apache.org [2], which are signed with the key with
> > > > >> >   > > > > > > fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
> > > > >> >   > > > > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > >> >   > > > > > > * source code tag "release-1.11.0-rc4" [5],
> > > > >> >   > > > > > > * website pull request listing the new release and adding announcement
> > > > >> >   > > > > > > blog post [6].
> > > > >> >   > > > > > >
> > > > >> >   > > > > > > The vote will be open for at least 72 hours. It is adopted by majority
> > > > >> >   > > > > > > approval, with at least 3 PMC affirmative votes.
> > > >> >   > > > > > >
> > > >> >   > > > > > > Thanks,
> > > >> >   > > > > > > Release Manager
> > > >> >   > > > > > >
> > > >> >   > > > > > > [1]
> > > >> >   > > > > > >
> > > >> >   > > > > >
> > > >> >   > > > >
> > > >> >   > > >
> > > >> >   > >
> > > >> >   >
> > > >>
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> > > >> >   > > > > > > [2]
> > > >> >   >
> https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
> > > >> >   > > > > > > [3]
> > > https://dist.apache.org/repos/dist/release/flink/KEYS
> > > >> >   > > > > > > [4]
> > > >> >   > > > > > >
> > > >> >   > > > >
> > > >> >   > >
> > > >>
> > https://repository.apache.org/content/repositories/orgapacheflink-1377/
> > > >> >   > > > > > > [5]
> > > >> >   > >
> > https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
> > > >> >   > > > > > > [6] https://github.com/apache/flink-web/pull/352
> > > >> >   > > > > > >
> > > >> >   > > > > > >
> > > >> >   > > > > >
> > > >> >   > > > >
> > > >> >   > > >
> > > >> >   > > >
> > > >> >   > >
> > > >> >   > >
> > > >> >   >
> > > >> >   >
> > > >> >
> > > >> >
> > > >> >
> > > >>
> > > >>
> > >
> > >
> >
>

Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

Roman Khachatryan
Hi Thomas,

Thanks a lot for the analysis.

The first thing that I'd check is whether checkpoints became more frequent
with this commit (each of them adds at least 500 ms if there is at least
one unsent record, according to FlinkKinesisProducer.snapshotState).
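
To make the 500 ms point concrete, here is a minimal sketch of the flush loop
quoted further down in this thread, run against a fake producer. The names
FlushLoopSketch, RecordBuffer and simulate are hypothetical and only for
illustration; this is not the connector code, and the sleep interval is a
parameter so the demo finishes quickly.

```java
import java.util.concurrent.atomic.AtomicInteger;

class FlushLoopSketch {

    /** Hypothetical stand-in for the two producer calls used by the loop. */
    interface RecordBuffer {
        int getOutstandingRecordsCount();
        void flush();
    }

    /**
     * Same shape as the flush loop quoted in this thread, with the sleep
     * interval as a parameter. Returns how many times the caller slept.
     */
    static int flushSync(RecordBuffer producer, long sleepMillis) {
        int sleeps = 0;
        while (producer.getOutstandingRecordsCount() > 0) {
            producer.flush();
            try {
                Thread.sleep(sleepMillis);
                sleeps++;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag
                break;
            }
        }
        return sleeps;
    }

    /** Runs the loop against a fake buffer that needs `rounds` flush calls to drain. */
    static int simulate(int rounds, long sleepMillis) {
        AtomicInteger remaining = new AtomicInteger(rounds);
        RecordBuffer fake = new RecordBuffer() {
            @Override public int getOutstandingRecordsCount() { return remaining.get(); }
            @Override public void flush() { remaining.decrementAndGet(); }
        };
        return flushSync(fake, sleepMillis);
    }

    public static void main(String[] args) {
        long realSleepMillis = 500; // value from the quoted loop
        for (int rounds = 0; rounds <= 3; rounds++) {
            int sleeps = simulate(rounds, 1); // 1 ms so the demo runs quickly
            System.out.println(rounds + " flush round(s) -> at least "
                    + sleeps * realSleepMillis + " ms blocked in snapshotState");
        }
    }
}
```

Even when a single flush() drains everything, the loop still sleeps once before
it re-checks the count, so any checkpoint that finds outstanding records pays
the full interval at least once.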

Can you share checkpointing statistics (1.10 vs 1.11 or last "good" vs
first "bad" commits)?

On Fri, Jul 31, 2020 at 5:29 AM Thomas Weise <[hidden email]> wrote:

> I ran git bisect and the first commit that shows the regression is:
>
>
> https://github.com/apache/flink/commit/355184d69a8519d29937725c8d85e8465d7e3a90
>
>
> On Thu, Jul 23, 2020 at 6:46 PM Kurt Young <[hidden email]> wrote:
>
> > From my experience, Java profilers are sometimes not accurate enough to
> > find the root cause of a performance regression. In this case, I would
> > suggest trying Intel VTune Amplifier to look at more detailed metrics.
> >
> > Best,
> > Kurt
> >
> >
> > On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise <[hidden email]> wrote:
> >
> > > The cause of the issue is anything but clear.
> > >
> > > Previously I had mentioned that there is no suspect change to the Kinesis
> > > connector and that I had reverted the AWS SDK change to no effect.
> > >
> > > https://issues.apache.org/jira/browse/FLINK-17496 actually fixed another
> > > regression in the previous release and is present before and after.
> > >
> > > I repeated the run with 1.11.0 core and downgraded the entire Kinesis
> > > connector to 1.10.1: Nothing changes, i.e. the regression is still present.
> > > Therefore we will need to look elsewhere for the root cause.
> > >
> > > Regarding the time spent in snapshotState, repeat runs reveal a wide range
> > > for both versions, 1.10 and 1.11. So again this is nothing pointing to a
> > > root cause.
> > >
> > > At this point, I have no ideas remaining other than doing a bisect to find
> > > the culprit. Any other suggestions?
> > >
> > > Thomas
> > >
> > >
> > > > On Thu, Jul 16, 2020 at 9:19 PM Zhijiang <[hidden email].invalid> wrote:
> > >
> > > > Hi Thomas,
> > > >
> > > > Thanks for the further profiling information. Glad to see we have
> > > > narrowed down where the regression comes from.
> > > > I was also suspicious of #snapshotState in previous discussions, since
> > > > it takes a long time and blocks normal operator processing.
> > > >
> > > > Based on your feedback below, the sleep time during #snapshotState might
> > > > be the main concern, so I also dug into the implementation of
> > > > FlinkKinesisProducer#snapshotState.
> > > > while (producer.getOutstandingRecordsCount() > 0) {
> > > >    producer.flush();
> > > >    try {
> > > >       Thread.sleep(500);
> > > >    } catch (InterruptedException e) {
> > > >       LOG.warn("Flushing was interrupted.");
> > > >       break;
> > > >    }
> > > > }
> > > > It seems that the sleep time is mainly determined by the internal
> > > > operations of the KinesisProducer implementation provided by amazonaws,
> > > > which I am not very familiar with.
> > > > But I noticed there were two related upgrades in release-1.11.0. One
> > > > upgrades amazon-kinesis-producer to 0.14.0 [1] and the other upgrades
> > > > aws-sdk-version to 1.11.754 [2].
> > > > You mentioned that you already reverted the SDK upgrade and saw no
> > > > change. Did you also revert [1] to verify?
> > > > [1] https://issues.apache.org/jira/browse/FLINK-17496
> > > > [2] https://issues.apache.org/jira/browse/FLINK-14881
> > > >
> > > > Best,
> > > > Zhijiang
> > > > ------------------------------------------------------------------
> > > > From:Thomas Weise <[hidden email]>
> > > > Send Time: July 17, 2020 (Friday) 05:29
> > > > To:dev <[hidden email]>
> > > > Cc:Zhijiang <[hidden email]>; Stephan Ewen <[hidden email]>;
> > > > Arvid Heise <[hidden email]>; Aljoscha Krettek <[hidden email]>
> > > > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> > > > release candidate #4)
> > > >
> > > > Sorry for the delay.
> > > >
> > > > I confirmed that the regression is due to the sink (unsurprising, since
> > > > another job with the same consumer, but not the producer, runs as
> > > > expected).
> > > >
> > > > As promised I did CPU profiling on the problematic application, which gives
> > > > more insight into the regression [1]
> > > >
> > > > The screenshots show that the average time for snapshotState increases from
> > > > ~9s to ~28s. The data also shows the increase in sleep time during
> > > > snapshotState.
> > > >
> > > > Does anyone, based on changes made in 1.11, have a theory why?
> > > >
> > > > I had previously looked at the changes to the Kinesis connector and also
> > > > reverted the SDK upgrade, which did not change the situation.
> > > >
> > > > It will likely be necessary to drill into the sink / checkpointing details
> > > > to understand the cause of the problem.
> > > >
> > > > Let me know if anyone has specific questions that I can answer from the
> > > > profiling results.
> > > >
> > > > Thomas
> > > >
> > > > [1] https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing
> > > >
> > > > On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <[hidden email]> wrote:
> > > >
> > > > > + dev@ for visibility
> > > > >
> > > > > I will investigate further today.
> > > > >
> > > > >
> > > > > On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <
> [hidden email]
> > >
> > > > > wrote:
> > > > >
> > > > >> On 06.07.20 20:39, Stephan Ewen wrote:
> > > > >> >    - Did sink checkpoint notifications change in a relevant way, for
> > > > >> > example due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
> > > > >>
> > > > >> I think that's unrelated: the Kafka fixes were isolated in Kafka and the
> > > > >> one bug I discovered on the way was about the Task reaper.
> > > > >>
> > > > >>
> > > > >> On 07.07.20 17:51, Zhijiang wrote:
> > > > >> > Sorry for my misunderstanding of the previous information, Thomas. I was
> > > > >> > assuming that the sync checkpoint duration increased after the upgrade,
> > > > >> > as was mentioned before.
> > > > >> >
> > > > >> > If I remember correctly, the memory state backend also has the same
> > > > >> > issue? If so, we can rule out the RocksDB state changes. As slot sharing
> > > > >> > is enabled, the downstream and upstream tasks are probably deployed into
> > > > >> > the same slot, so there is no network shuffle effect.
> > > > >> >
> > > > >> > I think we need to find out whether any other symptoms changed besides
> > > > >> > the performance regression, to further narrow down the scope.
> > > > >> > E.g. any metrics changes, or changes in the number of TaskManagers and
> > > > >> > the number of slots per TaskManager in the deployment.
> > > > >> > A 40% regression is really big, so I guess the changes should also be
> > > > >> > reflected in other places.
> > > > >> >
> > > > >> > I am not sure whether we can reproduce the regression in our AWS
> > > > >> > environment by writing arbitrary Kinesis jobs, since, as Thomas
> > > > >> > mentioned, there are also Kinesis jobs that behave normally after the
> > > > >> > upgrade.
> > > > >> > So it probably hits some corner case. I am very willing to provide any
> > > > >> > help with debugging if possible.
> > > > >> >
> > > > >> >
> > > > >> > Best,
> > > > >> > Zhijiang
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > ------------------------------------------------------------------
> > > > >> > From:Thomas Weise <[hidden email]>
> > > > >> > Send Time: July 7, 2020 (Tuesday) 23:01
> > > > >> > To:Stephan Ewen <[hidden email]>
> > > > >> > Cc:Aljoscha Krettek <[hidden email]>; Arvid Heise <[hidden email]>;
> > > > >> > Zhijiang <[hidden email]>
> > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> > > > >> > release candidate #4)
> > > > >> >
> > > > >> > We are deploying our apps with FlinkK8sOperator. We have one job that
> > > > >> > works as expected after the upgrade and the one discussed here that has
> > > > >> > the performance regression.
> > > > >> >
> > > > >> > "The performance regression is obviously caused by the long duration of
> > > > >> > the sync checkpoint process in the Kinesis sink operator, which blocks
> > > > >> > normal data processing until the source is back-pressured."
> > > > >> >
> > > > >> > That's a constant. Before (1.10) and after the upgrade have the same sync
> > > > >> > checkpointing time. The question is what change came in with the upgrade.
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <[hidden email]>
> > > wrote:
> > > > >> >
> > > > >> > @Thomas Just one thing real quick: Are you using the standalone setup
> > > > >> > scripts (like start-cluster.sh, and the former "slaves" file)?
> > > > >> > Be aware that this is now called "workers" to avoid sensitive names.
> > > > >> > In one internal benchmark we saw quite a lot of slowdown initially,
> > > > >> > before seeing that the cluster was not a distributed cluster any more ;-)
> > > > >> >
> > > > >> >
> > > > >> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <[hidden email]> wrote:
> > > > >> > Thanks for kicking this off and helping with the analysis, Stephan!
> > > > >> > Thanks for the further feedback and investigation, Thomas!
> > > > >> >
> > > > >> > The performance regression is obviously caused by the long duration of
> > > > >> > the sync checkpoint process in the Kinesis sink operator, which blocks
> > > > >> > normal data processing until the source is back-pressured.
> > > > >> > Maybe we could dig into the synchronous part of the checkpoint, e.g.
> > > > >> > break down the steps inside the respective operator#snapshotState and
> > > > >> > measure which operation costs most of the time. Then we can probably
> > > > >> > find the root cause of that cost.
> > > > >> >
> > > > >> > Looking forward to further progress. :)
> > > > >> >
> > > > >> > Best,
> > > > >> > Zhijiang
> > > > >> >
> > > > >> >
> > > > >> > ------------------------------------------------------------------
> > > > >> > From:Stephan Ewen <[hidden email]>
> > > > >> > Send Time: July 7, 2020 (Tuesday) 14:52
> > > > >> > To:Thomas Weise <[hidden email]>
> > > > >> > Cc:Stephan Ewen <[hidden email]>; Zhijiang <[hidden email]>;
> > > > >> > Aljoscha Krettek <[hidden email]>; Arvid Heise <[hidden email]>
> > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> > > > >> > release candidate #4)
> > > > >> >
> > > > >> > Thank you for digging so deeply.
> > > > >> > A mysterious thing, this regression.
> > > > >> >
> > > > >> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <[hidden email]> wrote:
> > > > >> > @Stephan: yes, I refer to sync time in the web UI (it is unchanged
> > > > >> > between 1.10 and 1.11 for the specific pipeline).
> > > > >> >
> > > > >> > I verified that increasing the checkpointing interval does not make a
> > > > >> > difference.
> > > > >> >
> > > > >> > I looked at the Kinesis connector changes since 1.10.1 and don't see
> > > > >> > anything that could cause this.
> > > > >> >
> > > > >> > Another pipeline that is using the Kinesis consumer (but not the
> > > > >> > producer) performs as expected.
> > > > >> >
> > > > >> > I tried reverting the AWS SDK version change, symptoms remain unchanged:
> > > > >> >
> > > > >> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml b/flink-connectors/flink-connector-kinesis/pom.xml
> > > > >> > index a6abce23ba..741743a05e 100644
> > > > >> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
> > > > >> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
> > > > >> > @@ -33,7 +33,7 @@ under the License.
> > > > >> >          <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
> > > > >> >          <name>flink-connector-kinesis</name>
> > > > >> >          <properties>
> > > > >> > -               <aws.sdk.version>1.11.754</aws.sdk.version>
> > > > >> > +               <aws.sdk.version>1.11.603</aws.sdk.version>
> > > > >> >                 <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
> > > > >> >                 <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
> > > > >> >                 <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
> > > > >> >
> > > > >> > I'm planning to take a look with a profiler next.
> > > > >> >
> > > > >> > Thomas
> > > > >> >
> > > > >> >
> > > > >> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <[hidden email]> wrote:
> > > > >> > Hi all!
> > > > >> >
> > > > >> > Forking this thread out of the release vote thread.
> > > > >> > From what Thomas describes, it really sounds like a sink-specific issue.
> > > > >> >
> > > > >> > @Thomas: When you say the sink has a long synchronous checkpoint time,
> > > > >> > you mean the time that is shown as "sync time" on the metrics and web UI?
> > > > >> > That does not include any network buffer related operations. It is purely
> > > > >> > the operator's time.
> > > > >> >
> > > > >> > Can we dig into the changes we did in sinks:
> > > > >> >    - Kinesis version upgrade, AWS library updates
> > > > >> >
> > > > >> >    - Could it be that some call (checkpoint complete) that was
> > > > >> > previously (1.10) in a separate thread is now in the mailbox and this
> > > > >> > simply reduces the number of threads that do the work?
> > > > >> >
> > > > >> >    - Did sink checkpoint notifications change in a relevant way, for
> > > > >> > example due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
> > > > >> >
> > > > >> > Best,
> > > > >> > Stephan
> > > > >> >
> > > > >> >
> > > > >> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <
> > [hidden email]
> > > > .invalid>
> > > > >> wrote:
> > > > >> > Hi Thomas,
> > > > >> >
> > > > >> >   Regarding [2], there is more detail in the Jira description
> > > > >> >   (https://issues.apache.org/jira/browse/FLINK-16404).
> > > > >> >
> > > > >> >   I can also give a basic explanation here to address the concern.
> > > > >> >   1. In the past, the buffers following the barrier were cached on the
> > > > >> >   downstream side before alignment.
> > > > >> >   2. In 1.11, the upstream does not send the buffers after the barrier.
> > > > >> >   When the downstream finishes the alignment, it notifies the upstream to
> > > > >> >   continue sending the following buffers, since it can process them after
> > > > >> >   alignment.
> > > > >> >   3. The only difference is that the temporarily blocked buffers are now
> > > > >> >   cached on the upstream side instead of the downstream side before
> > > > >> >   alignment.
> > > > >> >   4. The side effect is the additional notification cost for every barrier
> > > > >> >   alignment. If the downstream and upstream are deployed in separate
> > > > >> >   TaskManagers, the cost is the network transport delay (the effect can be
> > > > >> >   ignored based on our testing with a 1s checkpoint interval). With slot
> > > > >> >   sharing, as in your case, the cost is only one method call in the
> > > > >> >   processor, which can also be ignored.
> > > > >> >
> > > > >> >   You mentioned "In this case, the downstream task has a high average
> > > > >> >   checkpoint duration(~30s, sync part)." This duration does not reflect
> > > > >> >   the changes above; it only indicates the duration of calling
> > > > >> >   `Operation.snapshotState`.
> > > > >> >   If this duration is beyond your expectation, you can check or debug
> > > > >> >   whether the source/sink operations take more time to finish
> > > > >> >   `snapshotState` in practice. E.g. you can make the implementation of
> > > > >> >   this method empty to further verify the effect.
> > > > >> >
> > > > >> >   Best,
> > > > >> >   Zhijiang
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >   ------------------------------------------------------------------
> > > > >> >   From:Thomas Weise <[hidden email]>
> > > > >> >   Send Time: July 5, 2020 (Sunday) 12:22
> > > > >> >   To:dev <[hidden email]>; Zhijiang <[hidden email]>
> > > > >> >   Cc:Yingjie Cao <[hidden email]>
> > > > >> >   Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > > >> >
> > > > >> >   Hi Zhijiang,
> > > > >> >
> > > > >> >   Could you please point me to more details regarding: "[2]: Delay send the
> > > > >> >   following buffers after checkpoint barrier on upstream side until barrier
> > > > >> >   alignment on downstream side."
> > > > >> >
> > > > >> >   In this case, the downstream task has a high average checkpoint duration
> > > > >> >   (~30s, sync part). If there was a change to hold buffers depending on
> > > > >> >   downstream performance, could this possibly apply to this case (even when
> > > > >> >   there is no shuffle that would require alignment)?
> > > > >> >
> > > > >> >   Thanks,
> > > > >> >   Thomas
> > > > >> >
> > > > >> >
> > > > >> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <
> > > [hidden email]
> > > > >> .invalid>
> > > > >> >   wrote:
> > > > >> >
> > > > >> >   > Hi Thomas,
> > > > >> >   >
> > > > >> >   > Thanks for the further update information.
> > > > >> >   >
> > > > >> >   > I guess we can rule out the network stack changes, since in your case the
> > > > >> >   > downstream and upstream would probably be deployed in the same slot,
> > > > >> >   > bypassing the network data shuffle.
> > > > >> >   > Also I guess release-1.11 will not bring a general performance regression
> > > > >> >   > in the runtime engine, as we also did performance testing for all general
> > > > >> >   > cases in [1] on a real cluster beforehand and the results matched
> > > > >> >   > expectations. But we indeed did not test the specific source and sink
> > > > >> >   > connectors yet, as far as I know.
> > > > >> >   >
> > > > >> >   > Regarding your 40% performance regression, I suspect it is related to
> > > > >> >   > specific source/sink changes (e.g. kinesis) or environment issues in a
> > > > >> >   > corner case.
> > > > >> >   > If possible, it would be helpful to further determine whether the
> > > > >> >   > regression is caused by kinesis, by replacing the kinesis source & sink
> > > > >> >   > and keeping everything else the same.
> > > > >> >   >
> > > > >> >   > As you said, it would be efficient to contact you directly next week to
> > > > >> >   > discuss this issue further. We are willing/eager to provide any help to
> > > > >> >   > resolve this issue soon.
> > > > >> >   >
> > > > >> >   > Besides that, I guess this issue should not be a blocker for the
> > > > >> >   > release, since it is probably a corner case based on the current analysis.
> > > > >> >   > If we conclude that anything needs to be resolved after the final
> > > > >> >   > release, then we can also make the next minor release-1.11.1 come soon.
> > > > >> >   >
> > > > >> >   > [1] https://issues.apache.org/jira/browse/FLINK-18433
> > > > >> >   >
> > > > >> >   > Best,
> > > > >> >   > Zhijiang
> > > > >> >   >
> > > > >> >   >
> > > > >> >   >
> > > > >> >   > ------------------------------------------------------------------
> > > > >> >   > From:Thomas Weise <[hidden email]>
> > > > >> >   > Send Time: July 4, 2020 (Saturday) 12:26
> > > > >> >   > To:dev <[hidden email]>; Zhijiang <[hidden email]>
> > > > >> >   > Cc:Yingjie Cao <[hidden email]>
> > > > >> >   > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > > >> >   >
> > > > >> >   > Hi Zhijiang,
> > > > >> >   >
> > > > >> >   > It will probably be best if we connect next week and discuss the issue
> > > > >> >   > directly since this could be quite difficult to reproduce.
> > > > >> >   >
> > > > >> >   > Before the testing result on our side comes out for your respective job
> > > > >> >   > case, I have some other questions to confirm for further analysis:
> > > > >> >   >     -  How much percentage regression you found after switching to 1.11?
> > > > >> >   >
> > > > >> >   > ~40% throughput decline
> > > > >> >   >
> > > > >> >   >     -  Are there any network bottleneck in your cluster? E.g. the network
> > > > >> >   > bandwidth is full caused by other jobs? If so, it might have more effects
> > > > >> >   > by above [2]
> > > > >> >   >
> > > > >> >   > The test runs on a k8s cluster that is also used for other production jobs.
> > > > >> >   > There is no reason to believe the network is the bottleneck.
> > > > >> >   >
> > > > >> >   >     -  Did you adjust the default network buffer setting? E.g.
> > > > >> >   > "taskmanager.network.memory.floating-buffers-per-gate" or
> > > > >> >   > "taskmanager.network.memory.buffers-per-channel"
> > > > >> >   >
> > > > >> >   > The job is using the defaults, i.e. we don't configure the settings. If you
> > > > >> >   > want me to try specific settings in the hope that it will help to isolate
> > > > >> >   > the issue please let me know.
> > > > >> >   >
> > > > >> >   >     -  I guess the topology has three vertexes "KinesisConsumer -> Chained
> > > > >> >   > FlatMap -> KinesisProducer", and the partition mode for "KinesisConsumer ->
> > > > >> >   > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If so, the edge
> > > > >> >   > connection is one-to-one, not all-to-all, then the above [1][2] should no
> > > > >> >   > effects in theory with default network buffer setting.
> > > > >> >   >
> > > > >> >   > There are only 2 vertices and the edge is "forward".
> > > > >> >   >
> > > > >> >   >     - By slot sharing, I guess these three vertex parallelism task would
> > > > >> >   > probably be deployed into the same slot, then the data shuffle is by memory
> > > > >> >   > queue, not network stack. If so, the above [2] should no effect.
> > > > >> >   >
> > > > >> >   > Yes, vertices share slots.
> > > > >> >   >
> > > > >> >   >     - I also saw some Jira changes for kinesis in this release, could you
> > > > >> >   > confirm that these changes would not effect the performance?
> > > > >> >   >
> > > > >> >   > I will need to take a look. 1.10 already had a regression introduced by the
> > > > >> >   > Kinesis producer update.
> > > > >> >   >
> > > > >> >   >
> > > > >> >   > Thanks,
> > > > >> >   > Thomas
> > > > >> >   >
> > > > >> >   >
> > > > >> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <
> > > > >> [hidden email]
> > > > >> >   > .invalid>
> > > > >> >   > wrote:
> > > > >> >   >
> > > > >> >   > > Hi Thomas,
> > > > >> >   > >
> > > > >> >   > > Thanks for your reply with rich information!
> > > > >> >   > >
> > > > >> >   > > We are trying to reproduce your case in our cluster to further verify it,
> > > > >> >   > > and @Yingjie Cao is working on it now.
> > > > >> >   > > As we do not have a kinesis consumer and producer internally, we will
> > > > >> >   > > construct a common source and sink instead for the backpressure case.
> > > > >> >   > >
> > > > >> >   > > Firstly, we can rule out the RocksDB factor in this release, since you also
> > > > >> >   > > mentioned that "filesystem leads to same symptoms".
> > > > >> >   > >
> > > > >> >   > > Secondly, if my understanding is right, you emphasize that the regression
> > > > >> >   > > only exists for jobs with a low checkpoint interval (10s).
> > > > >> >   > > Based on that, I have two suspicions regarding the network-related changes
> > > > >> >   > > in this release:
> > > > >> >   > >     - [1]: Limited the maximum backlog value (default 10) in subpartition
> > > > >> >   > > queue.
> > > > >> >   > >     - [2]: Delay send the following buffers after checkpoint barrier on
> > > > >> >   > > upstream side until barrier alignment on downstream side.
> > > > >> >   > >
> > > > >> >   > > These changes are meant to reduce the in-flight buffers to speed up
> > > > >> >   > > checkpoints, especially in the case of backpressure.
> > > > >> >   > > In theory they should have a very minor performance effect, and we
> > > > >> >   > > tested them in a cluster to verify this before merging them,
> > > > >> >   > > but maybe there are other corner cases we have not thought of before.
> > > > >> >   > >
> > > > >> >   > > Before the testing result on our side comes out for your respective job
> > > > >> >   > > case, I have some other questions to confirm for further analysis:
> > > > >> >   > >     -  What percentage regression did you find after switching to 1.11?
> > > > >> >   > >     -  Is there any network bottleneck in your cluster? E.g. is the network
> > > > >> >   > > bandwidth saturated by other jobs? If so, [2] above might have more effect.
> > > > >> >   > >     -  Did you adjust the default network buffer settings? E.g.
> > > > >> >   > > "taskmanager.network.memory.floating-buffers-per-gate" or
> > > > >> >   > > "taskmanager.network.memory.buffers-per-channel"
> > > > >> >   > >     -  I guess the topology has three vertices "KinesisConsumer -> Chained
> > > > >> >   > > FlatMap -> KinesisProducer", and the partition mode for "KinesisConsumer ->
> > > > >> >   > > FlatMap" and "FlatMap->KinesisProducer" is "forward" in both cases? If so,
> > > > >> >   > > the edge connection is one-to-one, not all-to-all, so [1][2] above should
> > > > >> >   > > have no effect in theory with the default network buffer settings.
> > > > >> >   > >     - With slot sharing, I guess the parallel tasks of these three vertices
> > > > >> >   > > would probably be deployed into the same slot, so the data shuffle goes
> > > > >> >   > > through a memory queue, not the network stack. If so, [2] above should
> > > > >> >   > > have no effect.
> > > > >> >   > >     - I also saw some Jira changes for kinesis in this release, could you
> > > > >> >   > > confirm that these changes would not affect the performance?
> > > > >> >   > >
> > > > >> >   > > Best,
> > > > >> >   > > Zhijiang
> > > > >> >   > >
> > > > >> >   > >
> > > > >> >   > >
> > > > >> >   > > ------------------------------------------------------------------
> > > > >> >   > > From:Thomas Weise <[hidden email]>
> > > > >> >   > > Send Time: July 3, 2020 (Friday) 01:07
> > > > >> >   > > To:dev <[hidden email]>; Zhijiang <[hidden email]>
> > > > >> >   > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > > >> >   > >
> > > > >> >   > > Hi Zhijiang,
> > > > >> >   > >
> > > > >> >   > > The performance degradation manifests in backpressure which leads to
> > > > >> >   > > growing backlog in the source. I switched a few times between 1.10 and 1.11
> > > > >> >   > > and the behavior is consistent.
> > > > >> >   > >
> > > > >> >   > > The DAG is:
> > > > >> >   > >
> > > > >> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map) -------- forward
> > > > >> >   > > ---------> KinesisProducer
> > > > >> >   > >
> > > > >> >   > > Parallelism: 160
> > > > >> >   > > No shuffle/rebalance.
> > > > >> >   > >
> > > > >> >   > > Checkpointing config:
> > > > >> >   > >
> > > > >> >   > > Checkpointing Mode: Exactly Once
> > > > >> >   > > Interval: 10s
> > > > >> >   > > Timeout: 10m 0s
> > > > >> >   > > Minimum Pause Between Checkpoints: 10s
> > > > >> >   > > Maximum Concurrent Checkpoints: 1
> > > > >> >   > > Persist Checkpoints Externally: Enabled (delete on cancellation)
> > > > >> >   > >
> > > > >> >   > > State backend: rocksdb  (filesystem leads to same symptoms)
> > > > >> >   > > Checkpoint size is tiny (500KB)
> > > > >> >   > >
> > > > >> >   > > An interesting difference to another job that I had upgraded successfully
> > > > >> >   > > is the low checkpointing interval.
> > > > >> >   > >
> > > > >> >   > > Thanks,
> > > > >> >   > > Thomas
> > > > >> >   > >
> > > > >> >   > >
> > > > >> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <
> > > > >> [hidden email]
> > > > >> >   > > .invalid>
> > > > >> >   > > wrote:
> > > > >> >   > >
> > > > >> >   > > > Hi Thomas,
> > > > >> >   > > >
> > > > >> >   > > > Thanks for the efficient feedback.
> > > > >> >   > > >
> > > > >> >   > > > Regarding the suggestion of adding the release notes
> > > document,
> > > > >> I agree
> > > > >> >   > > > with your point. Maybe we should adjust the vote
> template
> > > > >> accordingly
> > > > >> >   > in
> > > > >> >   > > > the respective wiki to guide the following release
> > > processes.
> > > > >> >   > > >
> > > > >> >   > > > Regarding the performance regression, could you provide some
> > > > >> >   > > > more details so that we can measure or reproduce it on our
> > > > >> >   > > > side?
> > > > >> >   > > > E.g. I guess the topology only includes two vertices, source
> > > > >> >   > > > and sink?
> > > > >> >   > > > What is the parallelism of each vertex?
> > > > >> >   > > > Does the upstream shuffle data to the downstream via the
> > > > >> >   > > > rebalance partitioner or something else?
> > > > >> >   > > > Is the checkpoint mode exactly-once with the RocksDB state
> > > > >> >   > > > backend?
> > > > >> >   > > > Did backpressure occur in this case?
> > > > >> >   > > > How large is the regression, as a percentage?
> > > > >> >   > > >
> > > > >> >   > > > Best,
> > > > >> >   > > > Zhijiang
> > > > >> >   > > >
> > > > >> >   > > >
> > > > >> >   > > >
> > > > >> >   > > >
> > > > >> ------------------------------------------------------------------
> > > > >> >   > > > From:Thomas Weise <[hidden email]>
> > > > >> >   > > > Send Time:2020年7月2日(星期四) 09:54
> > > > >> >   > > > To:dev <[hidden email]>
> > > > >> >   > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > > >> >   > > >
> > > > >> >   > > > Hi Till,
> > > > >> >   > > >
> > > > >> >   > > > Yes, we don't have the setting in flink-conf.yaml.
> > > > >> >   > > >
> > > > >> >   > > > Generally, we carry forward the existing configuration
> and
> > > any
> > > > >> change
> > > > >> >   > to
> > > > >> >   > > > default configuration values would impact the upgrade.
> > > > >> >   > > >
> > > > >> >   > > > Yes, since it is an incompatible change I would state it
> > in
> > > > the
> > > > >> release
> > > > >> >   > > > notes.
> > > > >> >   > > >
> > > > >> >   > > > Thanks,
> > > > >> >   > > > Thomas
> > > > >> >   > > >
> > > > >> >   > > > BTW I found a performance regression while trying to
> > upgrade
> > > > >> another
> > > > >> >   > > > pipeline with this RC. It is a simple Kinesis to Kinesis
> > > job.
> > > > >> Wasn't
> > > > >> >   > able
> > > > >> >   > > > to pin it down yet, symptoms include increased
> checkpoint
> > > > >> alignment
> > > > >> >   > time.
> > > > >> >   > > >
> > > > >> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <
> > > > >> [hidden email]>
> > > > >> >   > > > wrote:
> > > > >> >   > > >
> > > > >> >   > > > > Hi Thomas,
> > > > >> >   > > > >
> > > > >> >   > > > > just to confirm: When starting the image in local
> mode,
> > > then
> > > > >> you
> > > > >> >   > don't
> > > > >> >   > > > have
> > > > >> >   > > > > any of the JobManager memory configuration settings
> > > > >> configured in the
> > > > >> >   > > > > effective flink-conf.yaml, right? Does this mean that
> > you
> > > > have
> > > > >> >   > > explicitly
> > > > >> >   > > > > removed `jobmanager.heap.size: 1024m` from the default
> > > > >> configuration?
> > > > >> >   > > If
> > > > >> >   > > > > this is the case, then I believe it was more of an
> > > > >> unintentional
> > > > >> >   > > artifact
> > > > >> >   > > > > that it worked before and it has been corrected now so
> > > that
> > > > >> one needs
> > > > >> >   > > to
> > > > >> >   > > > > specify the memory of the JM process explicitly. Do
> you
> > > > think
> > > > >> it
> > > > >> >   > would
> > > > >> >   > > > help
> > > > >> >   > > > > to explicitly state this in the release notes?
> > > > >> >   > > > >
> > > > >> >   > > > > Cheers,
> > > > >> >   > > > > Till
> > > > >> >   > > > >
> > > > >> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <
> > > [hidden email]
> > > > >
> > > > >> wrote:
> > > > >> >   > > > >
> > > > >> >   > > > > > Thanks for preparing another RC!
> > > > >> >   > > > > >
> > > > >> >   > > > > > As mentioned in the previous RC thread, it would be
> > > super
> > > > >> helpful
> > > > >> >   > if
> > > > >> >   > > > the
> > > > >> >   > > > > > release notes that are part of the documentation can
> > be
> > > > >> included
> > > > >> >   > [1].
> > > > >> >   > > > > It's
> > > > >> >   > > > > > a significant time-saver to have read those first.
> > > > >> >   > > > > >
> > > > >> >   > > > > > I found one more non-backward compatible change that
> > > would
> > > > >> be worth
> > > > >> >   > > > > > addressing/mentioning:
> > > > >> >   > > > > >
> > > > >> >   > > > > > It is now necessary to configure the jobmanager heap
> > > size
> > > > in
> > > > >> >   > > > > > flink-conf.yaml (with either jobmanager.heap.size
> > > > >> >   > > > > > or jobmanager.memory.heap.size). Why would I not
> want
> > to
> > > > do
> > > > >> that
> > > > >> >   > > > anyways?
> > > > >> >   > > > > > Well, we set it dynamically for a cluster deployment
> > via
> > > > the
> > > > >> >   > > > > > flinkk8soperator, but the container image can also
> be
> > > used
> > > > >> for
> > > > >> >   > > testing
> > > > >> >   > > > > with
> > > > >> >   > > > > > local mode (./bin/jobmanager.sh start-foreground
> > local).
> > > > >> That will
> > > > >> >   > > fail
> > > > >> >   > > > > if
> > > > >> >   > > > > > the heap wasn't configured and that's how I noticed
> > it.
> > > > >> >   > > > > >
> > > > >> >   > > > > > Thanks,
> > > > >> >   > > > > > Thomas
> > > > >> >   > > > > >
> > > > >> >   > > > > > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
> > > > >> >   > > > > >
> > > > >> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <
> > > > >> >   > [hidden email]
> > > > >> >   > > > > > .invalid>
> > > > >> >   > > > > > wrote:
> > > > >> >   > > > > >
> > > > >> >   > > > > > > Hi everyone,
> > > > >> >   > > > > > >
> > > > >> >   > > > > > > Please review and vote on the release candidate #4
> > for
> > > > the
> > > > >> >   > version
> > > > >> >   > > > > > 1.11.0,
> > > > >> >   > > > > > > as follows:
> > > > >> >   > > > > > > [ ] +1, Approve the release
> > > > >> >   > > > > > > [ ] -1, Do not approve the release (please provide
> > > > >> specific
> > > > >> >   > > comments)
> > > > >> >   > > > > > >
> > > > >> >   > > > > > > The complete staging area is available for your
> > > review,
> > > > >> which
> > > > >> >   > > > includes:
> > > > >> >   > > > > > > * JIRA release notes [1],
> > > > >> >   > > > > > > * the official Apache source release and binary
> > > > >> convenience
> > > > >> >   > > releases
> > > > >> >   > > > to
> > > > >> >   > > > > > be
> > > > >> >   > > > > > > deployed to dist.apache.org [2], which are signed
> > > with
> > > > >> the key
> > > > >> >   > > with
> > > > >> >   > > > > > > fingerprint
> 2DA85B93244FDFA19A6244500653C0A2CEA00D0E
> > > > [3],
> > > > >> >   > > > > > > * all artifacts to be deployed to the Maven
> Central
> > > > >> Repository
> > > > >> >   > [4],
> > > > >> >   > > > > > > * source code tag "release-1.11.0-rc4" [5],
> > > > >> >   > > > > > > * website pull request listing the new release and
> > > > adding
> > > > >> >   > > > announcement
> > > > >> >   > > > > > > blog post [6].
> > > > >> >   > > > > > >
> > > > >> >   > > > > > > The vote will be open for at least 72 hours. It is
> > > > >> adopted by
> > > > >> >   > > > majority
> > > > >> >   > > > > > > approval, with at least 3 PMC affirmative votes.
> > > > >> >   > > > > > >
> > > > >> >   > > > > > > Thanks,
> > > > >> >   > > > > > > Release Manager
> > > > >> >   > > > > > >
> > > > >> >   > > > > > > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> > > > >> >   > > > > > > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
> > > > >> >   > > > > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > > >> >   > > > > > > [4] https://repository.apache.org/content/repositories/orgapacheflink-1377/
> > > > >> >   > > > > > > [5] https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
> > > > >> >   > > > > > > [6] https://github.com/apache/flink-web/pull/352
> > > > >> >   > > > > > >
> > > > >> >   > > > > > >
> > > > >> >   > > > > >
> > > > >> >   > > > >
> > > > >> >   > > >
> > > > >> >   > > >
> > > > >> >   > >
> > > > >> >   > >
> > > > >> >   >
> > > > >> >   >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >>
> > > > >>
> > > >
> > > >
> > >
> >
>


--
Regards,
Roman

Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

Thomas Weise
Hi Roman,

Indeed there are more frequent checkpoints with this change! The
application was configured to checkpoint every 10s. With 1.10 ("good
commit"), that leads to fewer completed checkpoints compared to 1.11 ("bad
commit"). Just to be clear, the only difference between the two runs was
the commit 355184d69a8519d29937725c8d85e8465d7e3a90.

Since the sync part of checkpoints with the Kinesis producer always takes
~30 seconds, the 10s configured checkpoint frequency really had no effect
before 1.11. I confirmed that both commits perform comparably by setting
the checkpoint frequency and min pause to 60s.

I still have to verify with the final 1.11.0 release commit.

It's probably good to take a look at the Kinesis producer. Is it really
necessary to have 500ms sleep time? What's responsible for the ~30s
duration in snapshotState?
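
To make the question concrete, here is a minimal sketch of the polling
pattern in the FlinkKinesisProducer#snapshotState loop quoted further down
in this thread. It is only an illustration: FakeProducer and its fixed drain
rate are made-up stand-ins (not the KPL API), and the sleep is simulated by
adding up milliseconds instead of calling Thread.sleep. Under those
assumptions, the time spent in the loop grows in 500ms steps for as long as
records are outstanding:

```java
public class FlushLoopSketch {

    /** Hypothetical stand-in for the producer: each flush drains a fixed number of records. */
    public static final class FakeProducer {
        private long outstanding;
        private final long drainedPerFlush;

        public FakeProducer(long outstanding, long drainedPerFlush) {
            this.outstanding = outstanding;
            this.drainedPerFlush = drainedPerFlush;
        }

        public long getOutstandingRecordsCount() {
            return outstanding;
        }

        public void flush() {
            outstanding = Math.max(0, outstanding - drainedPerFlush);
        }
    }

    /**
     * Mirrors the shape of the flush loop (check, flush, sleep, re-check) and
     * returns the simulated milliseconds spent sleeping.
     */
    public static long simulateSnapshot(FakeProducer producer, long sleepMillis) {
        long sleptMillis = 0;
        while (producer.getOutstandingRecordsCount() > 0) {
            producer.flush();
            sleptMillis += sleepMillis; // stands in for Thread.sleep(sleepMillis)
        }
        return sleptMillis;
    }

    public static void main(String[] args) {
        // No outstanding records: the loop body never runs.
        System.out.println(simulateSnapshot(new FakeProducer(0, 100), 500));    // 0
        // A single unsent record still costs one full sleep.
        System.out.println(simulateSnapshot(new FakeProducer(1, 100), 500));    // 500
        // 60 flush-and-sleep rounds add up to 30 seconds.
        System.out.println(simulateSnapshot(new FakeProducer(6000, 100), 500)); // 30000
    }
}
```

In this model a single unsent record costs a full 500ms, and 60 rounds of
flush-and-sleep would add up to roughly the ~30s observed. Whether the real
producer drains at anything like a fixed rate is exactly the open question.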

As things stand it doesn't make sense to use checkpoint intervals < 30s
when using the Kinesis producer.
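
As a rough illustration of that point, the sketch below estimates the
spacing between checkpoint starts. It assumes a simplified model (at most
one concurrent checkpoint, the next one starts no earlier than one interval
after the previous start and no earlier than the minimum pause after the
previous end, async time ignored). The class and method names are
hypothetical, and this is not how the checkpoint coordinator is actually
implemented:

```java
public class CheckpointPeriodSketch {

    /** Approximate seconds between consecutive checkpoint starts under the simplified model. */
    public static long effectivePeriodSeconds(long intervalSec, long minPauseSec, long durationSec) {
        return Math.max(intervalSec, durationSec + minPauseSec);
    }

    /** Share of wall-clock time (in percent, rounded down) spent in the sync phase. */
    public static long syncSharePercent(long intervalSec, long minPauseSec, long durationSec) {
        return durationSec * 100 / effectivePeriodSeconds(intervalSec, minPauseSec, durationSec);
    }

    public static void main(String[] args) {
        // Configuration from this thread: 10s interval, 10s min pause, ~30s sync part.
        System.out.println(effectivePeriodSeconds(10, 10, 30)); // 40
        System.out.println(syncSharePercent(10, 10, 30));       // 75
        // After raising interval and min pause to 60s.
        System.out.println(effectivePeriodSeconds(60, 60, 30)); // 90
        System.out.println(syncSharePercent(60, 60, 30));       // 33
    }
}
```

Under this model the configured 10s interval is never the limiting factor
when the sync part takes ~30s. The spacing is set by the snapshot duration
plus the min pause.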

Thanks,
Thomas

On Sat, Aug 1, 2020 at 2:53 PM Roman Khachatryan <[hidden email]>
wrote:

> Hi Thomas,
>
> Thanks a lot for the analysis.
>
> The first thing that I'd check is whether checkpoints became more frequent
> with this commit (each of them adds at least 500ms if there is at least
> one unsent record, according to FlinkKinesisProducer.snapshotState).
>
> Can you share checkpointing statistics (1.10 vs 1.11 or last "good" vs
> first "bad" commits)?
>
> On Fri, Jul 31, 2020 at 5:29 AM Thomas Weise <[hidden email]>
> wrote:
>
> > I ran git bisect and the first commit that shows the regression is:
> >
> > https://github.com/apache/flink/commit/355184d69a8519d29937725c8d85e8465d7e3a90
> >
> >
> > On Thu, Jul 23, 2020 at 6:46 PM Kurt Young <[hidden email]> wrote:
> >
> > > From my experience, Java profilers are sometimes not accurate enough to
> > > find the root cause of a performance regression. In this case, I would
> > > suggest trying Intel VTune Amplifier to get more detailed metrics.
> > >
> > > Best,
> > > Kurt
> > >
> > >
> > > On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise <[hidden email]> wrote:
> > >
> > > > The cause of the issue is anything but clear.
> > > >
> > > > Previously I had mentioned that there is no suspect change to the
> > Kinesis
> > > > connector and that I had reverted the AWS SDK change to no effect.
> > > >
> > > > https://issues.apache.org/jira/browse/FLINK-17496 actually fixed
> > another
> > > > regression in the previous release and is present before and after.
> > > >
> > > > I repeated the run with 1.11.0 core and downgraded the entire Kinesis
> > > > connector to 1.10.1: Nothing changes, i.e. the regression is still
> > > present.
> > > > Therefore we will need to look elsewhere for the root cause.
> > > >
> > > > Regarding the time spent in snapshotState, repeat runs reveal a wide
> > > range
> > > > for both versions, 1.10 and 1.11. So again this is nothing pointing
> to
> > a
> > > > root cause.
> > > >
> > > > At this point, I have no ideas remaining other than doing a bisect to
> > > find
> > > > the culprit. Any other suggestions?
> > > >
> > > > Thomas
> > > >
> > > >
> > > > On Thu, Jul 16, 2020 at 9:19 PM Zhijiang <[hidden email]
> > > > .invalid>
> > > > wrote:
> > > >
> > > > > Hi Thomas,
> > > > >
> > > > > > Thanks for the further profiling information. I am glad to see that
> > > > > > we have already located the cause of the regression.
> > > > > > I was also suspicious of #snapshotState in previous discussions,
> > > > > > since it takes a long time and blocks normal operator processing.
> > > > > >
> > > > > > Based on your feedback below, the sleep time during #snapshotState
> > > > > > might be the main concern, so I also dug into the implementation of
> > > > > > FlinkKinesisProducer#snapshotState.
> > > > > while (producer.getOutstandingRecordsCount() > 0) {
> > > > >    producer.flush();
> > > > >    try {
> > > > >       Thread.sleep(500);
> > > > >    } catch (InterruptedException e) {
> > > > >       LOG.warn("Flushing was interrupted.");
> > > > >       break;
> > > > >    }
> > > > > }
> > > > > > It seems that the sleep time is mainly affected by the internal
> > > > > > operations of the KinesisProducer implementation provided by
> > > > > > amazonaws, which I am not very familiar with.
> > > > > > However, I noticed two related upgrades in release-1.11.0: one
> > > > > > upgrades amazon-kinesis-producer to 0.14.0 [1] and the other
> > > > > > upgrades aws-sdk-version to 1.11.754 [2].
> > > > > > You mentioned that you already reverted the SDK upgrade and saw no
> > > > > > change. Did you also revert [1] to verify?
> > > > > [1] https://issues.apache.org/jira/browse/FLINK-17496
> > > > > [2] https://issues.apache.org/jira/browse/FLINK-14881
> > > > >
> > > > > Best,
> > > > > Zhijiang
> > > > > ------------------------------------------------------------------
> > > > > From:Thomas Weise <[hidden email]>
> > > > > Send Time:2020年7月17日(星期五) 05:29
> > > > > To:dev <[hidden email]>
> > > > > Cc:Zhijiang <[hidden email]>; Stephan Ewen <
> > > [hidden email]
> > > > >;
> > > > > Arvid Heise <[hidden email]>; Aljoscha Krettek <
> > > [hidden email]
> > > > >
> > > > > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> > > release
> > > > > candidate #4)
> > > > >
> > > > > Sorry for the delay.
> > > > >
> > > > > I confirmed that the regression is due to the sink (unsurprising,
> > since
> > > > > another job with the same consumer, but not the producer, runs as
> > > > > expected).
> > > > >
> > > > > As promised I did CPU profiling on the problematic application,
> which
> > > > gives
> > > > > more insight into the regression [1]
> > > > >
> > > > > The screenshots show that the average time for snapshotState
> > increases
> > > > from
> > > > > ~9s to ~28s. The data also shows the increase in sleep time during
> > > > > snapshotState.
> > > > >
> > > > > Does anyone, based on changes made in 1.11, have a theory why?
> > > > >
> > > > > I had previously looked at the changes to the Kinesis connector and
> > > also
> > > > > reverted the SDK upgrade, which did not change the situation.
> > > > >
> > > > > It will likely be necessary to drill into the sink / checkpointing
> > > > details
> > > > > to understand the cause of the problem.
> > > > >
> > > > > Let me know if anyone has specific questions that I can answer from
> > the
> > > > > profiling results.
> > > > >
> > > > > Thomas
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing
> > > > >
> > > > > On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <[hidden email]>
> > wrote:
> > > > >
> > > > > > + dev@ for visibility
> > > > > >
> > > > > > I will investigate further today.
> > > > > >
> > > > > >
> > > > > > On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <
> > [hidden email]
> > > >
> > > > > > wrote:
> > > > > >
> > > > > >> On 06.07.20 20:39, Stephan Ewen wrote:
> > > > > >> >    - Did sink checkpoint notifications change in a relevant
> way,
> > > for
> > > > > >> example
> > > > > >> > due to some Kafka issues we addressed in 1.11 (@Aljoscha
> maybe?)
> > > > > >>
> > > > > >> I think that's unrelated: the Kafka fixes were isolated in Kafka
> > and
> > > > the
> > > > > >> one bug I discovered on the way was about the Task reaper.
> > > > > >>
> > > > > >>
> > > > > >> On 07.07.20 17:51, Zhijiang wrote:
> > > > > > >> > Sorry for my misunderstanding of the previous information, Thomas.
> > > > > > >> > I was assuming that the sync checkpoint duration increased after
> > > > > > >> > the upgrade, as was mentioned before.
> > > > > > >> >
> > > > > > >> > If I remember correctly, the memory state backend also has the same
> > > > > > >> > issue? If so, we can rule out the RocksDB state changes. With slot
> > > > > > >> > sharing enabled, the downstream and upstream tasks are probably
> > > > > > >> > deployed into the same slot, so the network shuffle has no effect.
> > > > > > >> >
> > > > > > >> > I think we need to find out whether any other symptoms changed
> > > > > > >> > besides the performance regression, to narrow down the scope,
> > > > > > >> > e.g. any metrics changes, or changes in the number of TaskManagers
> > > > > > >> > and the number of slots per TaskManager in the deployment.
> > > > > > >> > A 40% regression is really big, so I guess the change should also
> > > > > > >> > be visible in other places.
> > > > > > >> >
> > > > > > >> > I am not sure whether we can reproduce the regression in our AWS
> > > > > > >> > environment by writing arbitrary Kinesis jobs, since, as Thomas
> > > > > > >> > mentioned, other Kinesis jobs run normally after the upgrade.
> > > > > > >> > So this probably hits some corner case. I am very willing to help
> > > > > > >> > with debugging if possible.
> > > > > >> >
> > > > > >> >
> > > > > >> > Best,
> > > > > >> > Zhijiang
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > ------------------------------------------------------------------
> > > > > >> > From:Thomas Weise <[hidden email]>
> > > > > >> > Send Time:2020年7月7日(星期二) 23:01
> > > > > >> > To:Stephan Ewen <[hidden email]>
> > > > > >> > Cc:Aljoscha Krettek <[hidden email]>; Arvid Heise <
> > > > > >> [hidden email]>; Zhijiang <[hidden email]>
> > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release
> > 1.11.0,
> > > > > >> release candidate #4)
> > > > > >> >
> > > > > >> > We are deploying our apps with FlinkK8sOperator. We have one
> job
> > > > that
> > > > > >> works as expected after the upgrade and the one discussed here
> > that
> > > > has
> > > > > the
> > > > > >> performance regression.
> > > > > >> >
> > > > > >> > "The performance regression is obvious caused by long duration
> > of
> > > > sync
> > > > > >> checkpoint process in Kinesis sink operator, which would block
> the
> > > > > normal
> > > > > >> data processing until back pressure the source."
> > > > > >> >
> > > > > >> > That's a constant. Before (1.10) and upgrade have the same
> sync
> > > > > >> checkpointing time. The question is what change came in with the
> > > > > upgrade.
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <[hidden email]
> >
> > > > wrote:
> > > > > >> >
> > > > > >> > @Thomas Just one thing real quick: Are you using the
> standalone
> > > > setup
> > > > > >> scripts (like start-cluster.sh, and the former "slaves" file) ?
> > > > > >> > Be aware that this is now called "workers" because of avoiding
> > > > > >> sensitive names.
> > > > > >> > In one internal benchmark we saw quite a lot of slowdown
> > > initially,
> > > > > >> before seeing that the cluster was not a distributed cluster any
> > > more
> > > > > ;-)
> > > > > >> >
> > > > > >> >
> > > > > >> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <
> > > [hidden email]
> > > > >
> > > > > >> wrote:
> > > > > > >> > Thanks for kicking this off and helping with the analysis, Stephan!
> > > > > > >> > Thanks for the further feedback and investigation, Thomas!
> > > > > > >> >
> > > > > > >> > The performance regression is obviously caused by the long duration
> > > > > > >> > of the sync checkpoint process in the Kinesis sink operator, which
> > > > > > >> > blocks normal data processing until the source is backpressured.
> > > > > > >> > Maybe we could dig into the sync part of the checkpoint, e.g. break
> > > > > > >> > down the steps inside the respective operator#snapshotState and
> > > > > > >> > measure which operation takes most of the time. Then we would
> > > > > > >> > probably find the root cause of this cost.
> > > > > > >> >
> > > > > > >> > Looking forward to further progress. :)
> > > > > >> >
> > > > > >> > Best,
> > > > > >> > Zhijiang
> > > > > >> >
> > > > > >> >
> > ------------------------------------------------------------------
> > > > > >> > From:Stephan Ewen <[hidden email]>
> > > > > >> > Send Time:2020年7月7日(星期二) 14:52
> > > > > >> > To:Thomas Weise <[hidden email]>
> > > > > >> > Cc:Stephan Ewen <[hidden email]>; Zhijiang <
> > > > > >> [hidden email]>; Aljoscha Krettek <
> > [hidden email]
> > > >;
> > > > > >> Arvid Heise <[hidden email]>
> > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release
> > 1.11.0,
> > > > > >> release candidate #4)
> > > > > >> >
> > > > > > >> > Thank you for digging so deeply.
> > > > > > >> > Mysterious thing, this regression.
> > > > > >> >
> > > > > >> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <[hidden email]>
> wrote:
> > > > > >> > @Stephan: yes, I refer to sync time in the web UI (it is
> > unchanged
> > > > > >> between 1.10 and 1.11 for the specific pipeline).
> > > > > >> >
> > > > > >> > I verified that increasing the checkpointing interval does not
> > > make
> > > > a
> > > > > >> difference.
> > > > > >> >
> > > > > >> > I looked at the Kinesis connector changes since 1.10.1 and
> don't
> > > see
> > > > > >> anything that could cause this.
> > > > > >> >
> > > > > >> > Another pipeline that is using the Kinesis consumer (but not
> the
> > > > > >> producer) performs as expected.
> > > > > >> >
> > > > > >> > I tried reverting the AWS SDK version change, symptoms remain
> > > > > unchanged:
> > > > > >> >
> > > > > > >> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml b/flink-connectors/flink-connector-kinesis/pom.xml
> > > > > > >> > index a6abce23ba..741743a05e 100644
> > > > > > >> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
> > > > > > >> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
> > > > > > >> > @@ -33,7 +33,7 @@ under the License.
> > > > > > >> >          <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
> > > > > > >> >          <name>flink-connector-kinesis</name>
> > > > > > >> >          <properties>
> > > > > > >> > -               <aws.sdk.version>1.11.754</aws.sdk.version>
> > > > > > >> > +               <aws.sdk.version>1.11.603</aws.sdk.version>
> > > > > > >> >                 <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
> > > > > > >> >                 <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
> > > > > > >> >                 <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
> > > > > >> >
> > > > > >> > I'm planning to take a look with a profiler next.
> > > > > >> >
> > > > > >> > Thomas
> > > > > >> >
> > > > > >> >
> > > > > >> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <
> [hidden email]>
> > > > > wrote:
> > > > > >> > Hi all!
> > > > > >> >
> > > > > >> > Forking this thread out of the release vote thread.
> > > > > >> >  From what Thomas describes, it really sounds like a
> > sink-specific
> > > > > >> issue.
> > > > > >> >
> > > > > >> > @Thomas: When you say sink has a long synchronous checkpoint
> > time,
> > > > you
> > > > > >> mean the time that is shown as "sync time" on the metrics and
> web
> > > UI?
> > > > > That
> > > > > >> is not including any network buffer related operations. It is
> > purely
> > > > the
> > > > > >> operator's time.
> > > > > >> >
> > > > > >> > Can we dig into the changes we did in sinks:
> > > > > >> >    - Kinesis version upgrade, AWS library updates
> > > > > >> >
> > > > > > >> >    - Could it be that some call (checkpoint complete) that was
> > > > > > >> > previously (1.10) in a separate thread is now in the mailbox and
> > > > > > >> > this simply reduces the number of threads that do the work?
> > > > > >> >
> > > > > >> >    - Did sink checkpoint notifications change in a relevant
> way,
> > > for
> > > > > >> example due to some Kafka issues we addressed in 1.11 (@Aljoscha
> > > > maybe?)
> > > > > >> >
> > > > > >> > Best,
> > > > > >> > Stephan
> > > > > >> >
> > > > > >> >
> > > > > >> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <
> > > [hidden email]
> > > > > .invalid>
> > > > > >> wrote:
> > > > > >> > Hi Thomas,
> > > > > >> >
> > > > > > >> >   Regarding [2], there is more detail in the Jira description
> > > > > > >> >   (https://issues.apache.org/jira/browse/FLINK-16404).
> > > > > > >> >
> > > > > > >> >   I can also give a basic explanation here to address the concern.
> > > > > > >> >   1. In the past, the buffers following the barrier were cached on
> > > > > > >> >   the downstream side before alignment.
> > > > > > >> >   2. In 1.11, the upstream does not send the buffers after the
> > > > > > >> >   barrier. When the downstream finishes the alignment, it notifies
> > > > > > >> >   the upstream to continue sending the following buffers, since it
> > > > > > >> >   can process them after alignment.
> > > > > > >> >   3. The only difference is whether the temporarily blocked buffers
> > > > > > >> >   are cached on the downstream side or on the upstream side before
> > > > > > >> >   alignment.
> > > > > > >> >   4. The side effect is an additional notification cost for every
> > > > > > >> >   barrier alignment. If the downstream and upstream are deployed in
> > > > > > >> >   separate TaskManagers, the cost is the network transport delay
> > > > > > >> >   (the effect can be ignored based on our testing with a 1s
> > > > > > >> >   checkpoint interval). With slot sharing, as in your case, the cost
> > > > > > >> >   is only one method call in the processor and can also be ignored.
> > > > > >> >
> > > > > > >> >   You mentioned "In this case, the downstream task has a high
> > > > > > >> >   average checkpoint duration (~30s, sync part)." This duration does
> > > > > > >> >   not reflect the changes above; it only indicates the time spent
> > > > > > >> >   calling `Operation.snapshotState`.
> > > > > > >> >   If this duration is longer than you expect, you can check or debug
> > > > > > >> >   whether the source/sink operations take more time to finish
> > > > > > >> >   `snapshotState` in practice. E.g. you can make the implementation
> > > > > > >> >   of this method empty to further verify the effect.
> > > > > >> >
> > > > > >> >   Best,
> > > > > >> >   Zhijiang
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > >  ------------------------------------------------------------------
> > > > > >> >   From:Thomas Weise <[hidden email]>
> > > > > >> >   Send Time:2020年7月5日(星期日) 12:22
> > > > > >> >   To:dev <[hidden email]>; Zhijiang <
> > > > [hidden email]
> > > > > >
> > > > > >> >   Cc:Yingjie Cao <[hidden email]>
> > > > > >> >   Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > > > >> >
> > > > > >> >   Hi Zhijiang,
> > > > > >> >
> > > > > >> >   Could you please point me to more details regarding: "[2]:
> > Delay
> > > > > send
> > > > > >> the
> > > > > >> >   following buffers after checkpoint barrier on upstream side
> > > until
> > > > > >> barrier
> > > > > >> >   alignment on downstream side."
> > > > > >> >
> > > > > >> >   In this case, the downstream task has a high average
> > checkpoint
> > > > > >> duration
> > > > > >> >   (~30s, sync part). If there was a change to hold buffers
> > > depending
> > > > > on
> > > > > >> >   downstream performance, could this possibly apply to this
> case
> > > > (even
> > > > > >> when
> > > > > >> >   there is no shuffle that would require alignment)?
> > > > > >> >
> > > > > >> >   Thanks,
> > > > > >> >   Thomas
> > > > > >> >
> > > > > >> >
> > > > > >> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <
> > > > [hidden email]
> > > > > >> .invalid>
> > > > > >> >   wrote:
> > > > > >> >
> > > > > >> >   > Hi Thomas,
> > > > > >> >   >
> > > > > > >> >   > Thanks for the further update.
> > > > > > >> >   >
> > > > > > >> >   > I guess we can rule out the network stack changes, since in your
> > > > > > >> >   > case the downstream and upstream would probably be deployed in
> > > > > > >> >   > the same slot, bypassing the network data shuffle.
> > > > > > >> >   > I also guess that release-1.11 does not bring a general
> > > > > > >> >   > performance regression in the runtime engine, as we ran the
> > > > > > >> >   > performance tests for all general cases in [1] on a real cluster
> > > > > > >> >   > before, and the results should match expectations. But as far as
> > > > > > >> >   > I know, we have not yet tested the specific source and sink
> > > > > > >> >   > connectors.
> > > > > > >> >   >
> > > > > > >> >   > Regarding your 40% performance regression, I wonder whether it is
> > > > > > >> >   > related to specific source/sink changes (e.g. Kinesis) or to
> > > > > > >> >   > environment issues in a corner case.
> > > > > > >> >   > If possible, it would be helpful to determine whether the
> > > > > > >> >   > regression is caused by Kinesis, by replacing the Kinesis source
> > > > > > >> >   > and sink and keeping everything else the same.
> > > > > > >> >   >
> > > > > > >> >   > As you said, it would be efficient to contact you directly next
> > > > > > >> >   > week to discuss this issue further. We are eager to provide any
> > > > > > >> >   > help to resolve this issue soon.
> > > > > > >> >   >
> > > > > > >> >   > Besides that, I guess this issue should not be a blocker for the
> > > > > > >> >   > release, since it is probably a corner case based on the current
> > > > > > >> >   > analysis. If we conclude that anything needs to be resolved after
> > > > > > >> >   > the final release, we can also make the next minor release,
> > > > > > >> >   > 1.11.1, come soon.
> > > > > >> >   >
> > > > > >> >   > [1] https://issues.apache.org/jira/browse/FLINK-18433
> > > > > >> >   >
> > > > > >> >   > Best,
> > > > > >> >   > Zhijiang
> > > > > >> >   >
> > > > > >> >   >
> > > > > >> >   >
> > > > ------------------------------------------------------------------
> > > > > >> >   > From:Thomas Weise <[hidden email]>
> > > > > >> >   > Send Time:2020年7月4日(星期六) 12:26
> > > > > >> >   > To:dev <[hidden email]>; Zhijiang <
> > > > > [hidden email]
> > > > > >> >
> > > > > >> >   > Cc:Yingjie Cao <[hidden email]>
> > > > > >> >   > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > > > >> >   >
> > > > > >> >   > Hi Zhijiang,
> > > > > >> >   >
> > > > > > >> >   > It will probably be best if we connect next week and
> > > > > > >> >   > discuss the issue directly since this could be quite
> > > > > > >> >   > difficult to reproduce.
> > > > > >> >   >
> > > > > > >> >   > Before the test results on our side come out for your
> > > > > > >> >   > specific job, I have some other questions to confirm for
> > > > > > >> >   > further analysis:
> > > > > > >> >   >     -  What percentage of regression did you observe after
> > > > > > >> >   > switching to 1.11?
> > > > > > >> >   >
> > > > > > >> >   > ~40% throughput decline
> > > > > > >> >   >
> > > > > > >> >   >     -  Is there any network bottleneck in your cluster?
> > > > > > >> >   > E.g. is the network bandwidth saturated by other jobs? If
> > > > > > >> >   > so, [2] above might have a larger effect.
> > > > > > >> >   >
> > > > > > >> >   > The test runs on a k8s cluster that is also used for other
> > > > > > >> >   > production jobs. There is no reason to believe the network
> > > > > > >> >   > is the bottleneck.
> > > > > > >> >   >
> > > > > > >> >   >     -  Did you adjust the default network buffer settings?
> > > > > > >> >   > E.g. "taskmanager.network.memory.floating-buffers-per-gate"
> > > > > > >> >   > or "taskmanager.network.memory.buffers-per-channel"
> > > > > > >> >   >
> > > > > > >> >   > The job is using the defaults, i.e. we don't configure
> > > > > > >> >   > these settings. If you want me to try specific settings in
> > > > > > >> >   > the hope that it will help to isolate the issue, please let
> > > > > > >> >   > me know.
> > > > > > >> >   >
> > > > > > >> >   >     -  I guess the topology has three vertices,
> > > > > > >> >   > "KinesisConsumer -> Chained FlatMap -> KinesisProducer",
> > > > > > >> >   > and the partition mode for "KinesisConsumer -> FlatMap" and
> > > > > > >> >   > "FlatMap -> KinesisProducer" is "forward" in both cases? If
> > > > > > >> >   > so, the edge connection is one-to-one, not all-to-all, so
> > > > > > >> >   > [1] and [2] above should have no effect in theory with the
> > > > > > >> >   > default network buffer settings.
> > > > > > >> >   >
> > > > > > >> >   > There are only 2 vertices and the edge is "forward".
> > > > > > >> >   >
> > > > > > >> >   >     - With slot sharing, I guess the parallel tasks of
> > > > > > >> >   > these three vertices would probably be deployed into the
> > > > > > >> >   > same slot, so the data shuffle goes through an in-memory
> > > > > > >> >   > queue, not the network stack. If so, [2] above should have
> > > > > > >> >   > no effect.
> > > > > > >> >   >
> > > > > > >> >   > Yes, vertices share slots.
> > > > > > >> >   >
> > > > > > >> >   >     - I also saw some Jira changes for Kinesis in this
> > > > > > >> >   > release. Could you confirm that these changes would not
> > > > > > >> >   > affect the performance?
> > > > > > >> >   >
> > > > > > >> >   > I will need to take a look. 1.10 already had a regression
> > > > > > >> >   > introduced by the Kinesis producer update.
> > > > > >> >   >
> > > > > >> >   >
> > > > > >> >   > Thanks,
> > > > > >> >   > Thomas
> > > > > >> >   >
> > > > > >> >   >
> > > > > >> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <
> > > > > >> [hidden email]
> > > > > >> >   > .invalid>
> > > > > >> >   > wrote:
> > > > > >> >   >
> > > > > >> >   > > Hi Thomas,
> > > > > >> >   > >
> > > > > >> >   > > Thanks for your reply with rich information!
> > > > > >> >   > >
> > > > > > >> >   > > We are trying to reproduce your case in our cluster to
> > > > > > >> >   > > verify it further, and @Yingjie Cao is working on it now.
> > > > > > >> >   > > As we do not have a Kinesis consumer and producer
> > > > > > >> >   > > internally, we will construct a common source and sink
> > > > > > >> >   > > instead, under backpressure.
> > > > > >> >   > >
> > > > > > >> >   > > Firstly, we can rule out the RocksDB factor in this
> > > > > > >> >   > > release, since you also mentioned that "filesystem leads
> > > > > > >> >   > > to same symptoms".
> > > > > > >> >   > >
> > > > > > >> >   > > Secondly, if my understanding is right, you emphasize
> > > > > > >> >   > > that the regression only exists for jobs with a low
> > > > > > >> >   > > checkpoint interval (10s).
> > > > > > >> >   > > Based on that, I suspect two network-related changes in
> > > > > > >> >   > > this release:
> > > > > > >> >   > >     - [1]: Limit the maximum backlog value (default 10)
> > > > > > >> >   > > in the subpartition queue.
> > > > > > >> >   > >     - [2]: Delay sending the buffers that follow a
> > > > > > >> >   > > checkpoint barrier on the upstream side until barrier
> > > > > > >> >   > > alignment on the downstream side.
> > > > > >> >   > >
> > > > > > >> >   > > These changes are motivated by reducing the in-flight
> > > > > > >> >   > > buffers to speed up checkpoints, especially under
> > > > > > >> >   > > backpressure.
> > > > > > >> >   > > In theory they should have a very minor performance
> > > > > > >> >   > > effect, and we also tested them in a cluster to verify
> > > > > > >> >   > > that they were within expectations before merging them,
> > > > > > >> >   > > but maybe there are other corner cases we have not
> > > > > > >> >   > > thought of.
> > > > > >> >   > >
> > > > > > >> >   > > Before the test results on our side come out for your
> > > > > > >> >   > > specific job, I have some other questions to confirm for
> > > > > > >> >   > > further analysis:
> > > > > > >> >   > >     -  What percentage of regression did you observe
> > > > > > >> >   > > after switching to 1.11?
> > > > > > >> >   > >     -  Is there any network bottleneck in your cluster?
> > > > > > >> >   > > E.g. is the network bandwidth saturated by other jobs?
> > > > > > >> >   > > If so, [2] above might have a larger effect.
> > > > > > >> >   > >     -  Did you adjust the default network buffer
> > > > > > >> >   > > settings? E.g.
> > > > > > >> >   > > "taskmanager.network.memory.floating-buffers-per-gate" or
> > > > > > >> >   > > "taskmanager.network.memory.buffers-per-channel"
> > > > > > >> >   > >     -  I guess the topology has three vertices,
> > > > > > >> >   > > "KinesisConsumer -> Chained FlatMap -> KinesisProducer",
> > > > > > >> >   > > and the partition mode for "KinesisConsumer -> FlatMap"
> > > > > > >> >   > > and "FlatMap -> KinesisProducer" is "forward" in both
> > > > > > >> >   > > cases? If so, the edge connection is one-to-one, not
> > > > > > >> >   > > all-to-all, so [1] and [2] above should have no effect
> > > > > > >> >   > > in theory with the default network buffer settings.
> > > > > > >> >   > >     - With slot sharing, I guess the parallel tasks of
> > > > > > >> >   > > these three vertices would probably be deployed into the
> > > > > > >> >   > > same slot, so the data shuffle goes through an in-memory
> > > > > > >> >   > > queue, not the network stack. If so, [2] above should
> > > > > > >> >   > > have no effect.
> > > > > > >> >   > >     - I also saw some Jira changes for Kinesis in this
> > > > > > >> >   > > release. Could you confirm that these changes would not
> > > > > > >> >   > > affect the performance?
> > > > > >> >   > >
> > > > > >> >   > > Best,
> > > > > >> >   > > Zhijiang
> > > > > >> >   > >
> > > > > >> >   > >
> > > > > >> >   > >
> > > > > ------------------------------------------------------------------
> > > > > >> >   > > From:Thomas Weise <[hidden email]>
> > > > > >> >   > > Send Time:2020年7月3日(星期五) 01:07
> > > > > >> >   > > To:dev <[hidden email]>; Zhijiang <
> > > > > >> [hidden email]>
> > > > > >> >   > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > > > >> >   > >
> > > > > >> >   > > Hi Zhijiang,
> > > > > >> >   > >
> > > > > > >> >   > > The performance degradation manifests in backpressure
> > > > > > >> >   > > which leads to growing backlog in the source. I switched
> > > > > > >> >   > > a few times between 1.10 and 1.11 and the behavior is
> > > > > > >> >   > > consistent.
> > > > > > >> >   > >
> > > > > > >> >   > > The DAG is:
> > > > > > >> >   > >
> > > > > > >> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map) -------- forward ---------> KinesisProducer
> > > > > > >> >   > >
> > > > > > >> >   > > Parallelism: 160
> > > > > > >> >   > > No shuffle/rebalance.
> > > > > > >> >   > >
> > > > > > >> >   > > Checkpointing config:
> > > > > > >> >   > >
> > > > > > >> >   > > Checkpointing Mode Exactly Once
> > > > > > >> >   > > Interval 10s
> > > > > > >> >   > > Timeout 10m 0s
> > > > > > >> >   > > Minimum Pause Between Checkpoints 10s
> > > > > > >> >   > > Maximum Concurrent Checkpoints 1
> > > > > > >> >   > > Persist Checkpoints Externally Enabled (delete on cancellation)
> > > > > > >> >   > >
> > > > > > >> >   > > State backend: rocksdb  (filesystem leads to same symptoms)
> > > > > > >> >   > > Checkpoint size is tiny (500KB)
> > > > > > >> >   > >
> > > > > > >> >   > > An interesting difference to another job that I had
> > > > > > >> >   > > upgraded successfully is the low checkpointing interval.
> > > > > >> >   > >
> > > > > >> >   > > Thanks,
> > > > > >> >   > > Thomas
> > > > > >> >   > >
> > > > > >> >   > >
> > > > > >> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <
> > > > > >> [hidden email]
> > > > > >> >   > > .invalid>
> > > > > >> >   > > wrote:
> > > > > >> >   > >
> > > > > >> >   > > > Hi Thomas,
> > > > > >> >   > > >
> > > > > >> >   > > > Thanks for the efficient feedback.
> > > > > >> >   > > >
> > > > > > >> >   > > > Regarding the suggestion of adding the release notes
> > > > > > >> >   > > > document, I agree with your point. Maybe we should
> > > > > > >> >   > > > adjust the vote template in the respective wiki
> > > > > > >> >   > > > accordingly to guide future release processes.
> > > > > > >> >   > > >
> > > > > > >> >   > > > Regarding the performance regression, could you provide
> > > > > > >> >   > > > some more details so that we can measure or reproduce
> > > > > > >> >   > > > it on our side?
> > > > > > >> >   > > > E.g. I guess the topology only includes two vertices,
> > > > > > >> >   > > > source and sink?
> > > > > > >> >   > > > What is the parallelism of each vertex?
> > > > > > >> >   > > > Does the upstream shuffle data to the downstream via
> > > > > > >> >   > > > the rebalance partitioner or another one?
> > > > > > >> >   > > > Is the checkpoint mode exactly-once with the RocksDB
> > > > > > >> >   > > > state backend?
> > > > > > >> >   > > > Did backpressure occur in this case?
> > > > > > >> >   > > > What is the percentage of regression in this case?
> > > > > >> >   > > >
> > > > > >> >   > > > Best,
> > > > > >> >   > > > Zhijiang
> > > > > >> >   > > >
> > > > > >> >   > > >
> > > > > >> >   > > >
> > > > > >> >   > > >
> > > > > >>
> ------------------------------------------------------------------
> > > > > >> >   > > > From:Thomas Weise <[hidden email]>
> > > > > >> >   > > > Send Time:2020年7月2日(星期四) 09:54
> > > > > >> >   > > > To:dev <[hidden email]>
> > > > > > >> >   > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > > > >> >   > > >
> > > > > >> >   > > > Hi Till,
> > > > > >> >   > > >
> > > > > > >> >   > > > Yes, we don't have the setting in flink-conf.yaml.
> > > > > > >> >   > > >
> > > > > > >> >   > > > Generally, we carry forward the existing configuration
> > > > > > >> >   > > > and any change to default configuration values would
> > > > > > >> >   > > > impact the upgrade.
> > > > > > >> >   > > >
> > > > > > >> >   > > > Yes, since it is an incompatible change I would state
> > > > > > >> >   > > > it in the release notes.
> > > > > > >> >   > > >
> > > > > > >> >   > > > Thanks,
> > > > > > >> >   > > > Thomas
> > > > > > >> >   > > >
> > > > > > >> >   > > > BTW I found a performance regression while trying to
> > > > > > >> >   > > > upgrade another pipeline with this RC. It is a simple
> > > > > > >> >   > > > Kinesis to Kinesis job. Wasn't able to pin it down yet,
> > > > > > >> >   > > > symptoms include increased checkpoint alignment time.
> > > > > >> >   > > >
> > > > > >> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <
> > > > > >> [hidden email]>
> > > > > >> >   > > > wrote:
> > > > > >> >   > > >
> > > > > >> >   > > > > Hi Thomas,
> > > > > >> >   > > > >
> > > > > > >> >   > > > > just to confirm: When starting the image in local
> > > > > > >> >   > > > > mode, then you don't have any of the JobManager
> > > > > > >> >   > > > > memory configuration settings configured in the
> > > > > > >> >   > > > > effective flink-conf.yaml, right? Does this mean that
> > > > > > >> >   > > > > you have explicitly removed
> > > > > > >> >   > > > > `jobmanager.heap.size: 1024m` from the default
> > > > > > >> >   > > > > configuration? If this is the case, then I believe it
> > > > > > >> >   > > > > was more of an unintentional artifact that it worked
> > > > > > >> >   > > > > before and it has been corrected now so that one
> > > > > > >> >   > > > > needs to specify the memory of the JM process
> > > > > > >> >   > > > > explicitly. Do you think it would help to explicitly
> > > > > > >> >   > > > > state this in the release notes?
> > > > > >> >   > > > >
> > > > > >> >   > > > > Cheers,
> > > > > >> >   > > > > Till
> > > > > >> >   > > > >
> > > > > >> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <
> > > > [hidden email]
> > > > > >
> > > > > >> wrote:
> > > > > >> >   > > > >
> > > > > >> >   > > > > > Thanks for preparing another RC!
> > > > > >> >   > > > > >
> > > > > > >> >   > > > > > As mentioned in the previous RC thread, it would be
> > > > > > >> >   > > > > > super helpful if the release notes that are part of
> > > > > > >> >   > > > > > the documentation can be included [1]. It's a
> > > > > > >> >   > > > > > significant time-saver to have read those first.
> > > > > > >> >   > > > > >
> > > > > > >> >   > > > > > I found one more non-backward compatible change
> > > > > > >> >   > > > > > that would be worth addressing/mentioning:
> > > > > > >> >   > > > > >
> > > > > > >> >   > > > > > It is now necessary to configure the jobmanager
> > > > > > >> >   > > > > > heap size in flink-conf.yaml (with either
> > > > > > >> >   > > > > > jobmanager.heap.size or
> > > > > > >> >   > > > > > jobmanager.memory.heap.size). Why would I not want
> > > > > > >> >   > > > > > to do that anyways? Well, we set it dynamically for
> > > > > > >> >   > > > > > a cluster deployment via the flinkk8soperator, but
> > > > > > >> >   > > > > > the container image can also be used for testing
> > > > > > >> >   > > > > > with local mode (./bin/jobmanager.sh
> > > > > > >> >   > > > > > start-foreground local). That will fail if the heap
> > > > > > >> >   > > > > > wasn't configured and that's how I noticed it.
> > > > > >> >   > > > > >
> > > > > >> >   > > > > > Thanks,
> > > > > >> >   > > > > > Thomas
> > > > > >> >   > > > > >
> > > > > >> >   > > > > > [1]
> > > > > >> >   > > > > >
> > > > > >> >   > > > > >
> > > > > >> >   > > > >
> > > > > >> >   > > >
> > > > > >> >   > >
> > > > > >> >   >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
> > > > > >> >   > > > > >
> > > > > >> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <
> > > > > >> >   > [hidden email]
> > > > > >> >   > > > > > .invalid>
> > > > > >> >   > > > > > wrote:
> > > > > >> >   > > > > >
> > > > > >> >   > > > > > > Hi everyone,
> > > > > >> >   > > > > > >
> > > > > > >> >   > > > > > > Please review and vote on the release candidate
> > > > > > >> >   > > > > > > #4 for the version 1.11.0, as follows:
> > > > > > >> >   > > > > > > [ ] +1, Approve the release
> > > > > > >> >   > > > > > > [ ] -1, Do not approve the release (please
> > > > > > >> >   > > > > > > provide specific comments)
> > > > > > >> >   > > > > > >
> > > > > > >> >   > > > > > > The complete staging area is available for your
> > > > > > >> >   > > > > > > review, which includes:
> > > > > > >> >   > > > > > > * JIRA release notes [1],
> > > > > > >> >   > > > > > > * the official Apache source release and binary
> > > > > > >> >   > > > > > > convenience releases to be deployed to
> > > > > > >> >   > > > > > > dist.apache.org [2], which are signed with the
> > > > > > >> >   > > > > > > key with fingerprint
> > > > > > >> >   > > > > > > 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
> > > > > > >> >   > > > > > > * all artifacts to be deployed to the Maven
> > > > > > >> >   > > > > > > Central Repository [4],
> > > > > > >> >   > > > > > > * source code tag "release-1.11.0-rc4" [5],
> > > > > > >> >   > > > > > > * website pull request listing the new release
> > > > > > >> >   > > > > > > and adding announcement blog post [6].
> > > > > > >> >   > > > > > >
> > > > > > >> >   > > > > > > The vote will be open for at least 72 hours. It
> > > > > > >> >   > > > > > > is adopted by majority approval, with at least 3
> > > > > > >> >   > > > > > > PMC affirmative votes.
> > > > > >> >   > > > > > >
> > > > > >> >   > > > > > > Thanks,
> > > > > >> >   > > > > > > Release Manager
> > > > > >> >   > > > > > >
> > > > > >> >   > > > > > > [1]
> > > > > >> >   > > > > > >
> > > > > >> >   > > > > >
> > > > > >> >   > > > >
> > > > > >> >   > > >
> > > > > >> >   > >
> > > > > >> >   >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> > > > > >> >   > > > > > > [2]
> > > > > >> >   >
> > > https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
> > > > > >> >   > > > > > > [3]
> > > > > https://dist.apache.org/repos/dist/release/flink/KEYS
> > > > > >> >   > > > > > > [4]
> > > > > >> >   > > > > > >
> > > > > >> >   > > > >
> > > > > >> >   > >
> > > > > >>
> > > >
> > https://repository.apache.org/content/repositories/orgapacheflink-1377/
> > > > > >> >   > > > > > > [5]
> > > > > >> >   > >
> > > > https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
> > > > > >> >   > > > > > > [6]
> https://github.com/apache/flink-web/pull/352
> > > > > >> >   > > > > > >
> > > > > >> >   > > > > > >
> > > > > >> >   > > > > >
> > > > > >> >   > > > >
> > > > > >> >   > > >
> > > > > >> >   > > >
> > > > > >> >   > >
> > > > > >> >   > >
> > > > > >> >   >
> > > > > >> >   >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >>
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> Regards,
> Roman
>

Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

Roman Khachatryan
Hi Thomas,

Thanks for your reply!

I think you are right, we can remove this sleep and improve KinesisProducer.
Its snapshotState can probably also be sped up by forcing records to be
flushed more often.
Do you see that the 30s checkpointing duration is caused by KinesisProducer
(or maybe other operators)?
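
To make the idea concrete, here is a minimal sketch of the flush loop with
the poll interval turned into a parameter. It assumes a hypothetical
`Producer` interface that only mirrors the two calls used in
FlinkKinesisProducer#snapshotState as quoted further down in this thread
(getOutstandingRecordsCount() and flush()). The names `FlushSketch`,
`Producer` and `pollMillis` are made up for illustration and are not part of
the connector or of the Kinesis Producer Library, and whether a shorter poll
actually helps would still need to be measured.

```java
/** Hypothetical sketch; not the FlinkKinesisProducer or KPL code. */
class FlushSketch {

    /** Stand-in for the two producer calls used by the loop quoted in this thread. */
    interface Producer {
        int getOutstandingRecordsCount();

        void flush();
    }

    /**
     * Flushes until nothing is outstanding, sleeping pollMillis between
     * checks instead of a fixed 500 ms. Returns the number of sleeps started.
     */
    static int flushUntilDrained(Producer producer, long pollMillis) {
        int polls = 0;
        while (producer.getOutstandingRecordsCount() > 0) {
            producer.flush();
            polls++;
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Like the quoted loop, stop waiting when interrupted,
                // but restore the interrupt flag for the caller.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return polls;
    }
}
```

With a small pollMillis the loop notices an empty queue sooner, at the cost
of calling flush() more often.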

I'd also like to understand the reason behind this increase in checkpoint
frequency.
Can you please share these values:
 - execution.checkpointing.min-pause
 - execution.checkpointing.max-concurrent-checkpoints
 - execution.checkpointing.timeout

And what is the "new" observed checkpoint frequency (or how many
checkpoints are created) compared to older versions?
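
As a rough aid for reading those three values together, the sketch below
computes a lower bound on the spacing between two checkpoint starts. It is a
simplified model under stated assumptions (at most one concurrent checkpoint,
the next checkpoint starts no earlier than one interval after the previous
start and no earlier than min-pause after the previous completion), not the
scheduling code of either version, and it says nothing about why the observed
frequency differs between the two commits. `CheckpointSpacingSketch` and
`minSpacing` are hypothetical names, not Flink APIs.

```java
import java.time.Duration;

/** Hypothetical helper for back-of-the-envelope estimates; not a Flink API. */
class CheckpointSpacingSketch {

    /**
     * Lower bound on the time between two consecutive checkpoint starts,
     * assuming at most one concurrent checkpoint.
     */
    static Duration minSpacing(Duration interval, Duration minPause, Duration checkpointDuration) {
        // Earliest next start measured from the previous start, as imposed by min-pause.
        Duration afterCompletion = checkpointDuration.plus(minPause);
        // The larger of the two constraints wins.
        return interval.compareTo(afterCompletion) >= 0 ? interval : afterCompletion;
    }
}
```

Under this model, a 10s interval with a 10s min pause and a checkpoint that
takes about 30s cannot start checkpoints more often than every 40s.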


On Fri, Aug 7, 2020 at 4:49 AM Thomas Weise <[hidden email]> wrote:

> Hi Roman,
>
> Indeed there are more frequent checkpoints with this change! The
> application was configured to checkpoint every 10s. With 1.10 ("good
> commit"), that leads to fewer completed checkpoints compared to 1.11 ("bad
> commit"). Just to be clear, the only difference between the two runs was
> the commit 355184d69a8519d29937725c8d85e8465d7e3a90
>
> Since the sync part of checkpoints with the Kinesis producer always takes
> ~30 seconds, the 10s configured checkpoint frequency really had no effect
> before 1.11. I confirmed that both commits perform comparably by setting
> the checkpoint frequency and min pause to 60s.
>
> I still have to verify with the final 1.11.0 release commit.
>
> It's probably good to take a look at the Kinesis producer. Is it really
> necessary to have 500ms sleep time? What's responsible for the ~30s
> duration in snapshotState?
>
> As things stand it doesn't make sense to use checkpoint intervals < 30s
> when using the Kinesis producer.
>
> Thanks,
> Thomas
>
> On Sat, Aug 1, 2020 at 2:53 PM Roman Khachatryan <[hidden email]>
> wrote:
>
> > Hi Thomas,
> >
> > Thanks a lot for the analysis.
> >
> > The first thing that I'd check is whether checkpoints became more
> > frequent with this commit (as each of them adds at least 500ms if there
> > is at least one unsent record, according to
> > FlinkKinesisProducer.snapshotState).
> >
> > Can you share checkpointing statistics (1.10 vs 1.11 or last "good" vs
> > first "bad" commits)?
> >
> > On Fri, Jul 31, 2020 at 5:29 AM Thomas Weise <[hidden email]>
> > wrote:
> >
> > > I ran git bisect and the first commit that shows the regression is:
> > >
> > > https://github.com/apache/flink/commit/355184d69a8519d29937725c8d85e8465d7e3a90
> > >
> > >
> > > On Thu, Jul 23, 2020 at 6:46 PM Kurt Young <[hidden email]> wrote:
> > >
> > > > From my experience, Java profilers are sometimes not accurate enough
> > > > to find the root cause of a performance regression. In this case, I
> > > > would suggest you try Intel VTune Amplifier to watch more detailed
> > > > metrics.
> > > >
> > > > Best,
> > > > Kurt
> > > >
> > > >
> > > > On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise <[hidden email]> wrote:
> > > >
> > > > > The cause of the issue is anything but clear.
> > > > >
> > > > > Previously I had mentioned that there is no suspect change to the
> > > > > Kinesis connector and that I had reverted the AWS SDK change to no
> > > > > effect.
> > > > >
> > > > > https://issues.apache.org/jira/browse/FLINK-17496 actually fixed
> > > > > another regression in the previous release and is present before
> > > > > and after.
> > > > >
> > > > > I repeated the run with 1.11.0 core and downgraded the entire
> > > > > Kinesis connector to 1.10.1: Nothing changes, i.e. the regression
> > > > > is still present. Therefore we will need to look elsewhere for the
> > > > > root cause.
> > > > >
> > > > > Regarding the time spent in snapshotState, repeat runs reveal a
> > > > > wide range for both versions, 1.10 and 1.11. So again this is
> > > > > nothing pointing to a root cause.
> > > > >
> > > > > At this point, I have no ideas remaining other than doing a bisect
> > > > > to find the culprit. Any other suggestions?
> > > > >
> > > > > Thomas
> > > > >
> > > > >
> > > > > On Thu, Jul 16, 2020 at 9:19 PM Zhijiang <
> [hidden email]
> > > > > .invalid>
> > > > > wrote:
> > > > >
> > > > > > Hi Thomas,
> > > > > >
> > > > > > Thanks for the further profiling information. Glad to see we have
> > > > > > already narrowed down where the regression comes from.
> > > > > > Actually, I was also suspicious of #snapshotState in previous
> > > > > > discussions, since it indeed takes a long time and blocks normal
> > > > > > operator processing.
> > > > > >
> > > > > > Based on your feedback below, the sleep time during
> > > > > > #snapshotState might be the main concern, so I also dug into the
> > > > > > implementation of FlinkKinesisProducer#snapshotState:
> > > > > > while (producer.getOutstandingRecordsCount() > 0) {
> > > > > >    producer.flush();
> > > > > >    try {
> > > > > >       Thread.sleep(500);
> > > > > >    } catch (InterruptedException e) {
> > > > > >       LOG.warn("Flushing was interrupted.");
> > > > > >       break;
> > > > > >    }
> > > > > > }
> > > > > > It seems that the sleep time is mainly determined by the internal
> > > > > > operations of the KinesisProducer implementation provided by
> > > > > > amazonaws, which I am not very familiar with.
> > > > > > But I noticed there were two related upgrades in release-1.11.0.
> > > > > > One upgrades amazon-kinesis-producer to 0.14.0 [1] and the other
> > > > > > upgrades aws-sdk-version to 1.11.754 [2].
> > > > > > You mentioned that you already reverted the SDK upgrade and saw
> > > > > > no change. Did you also revert [1] to verify?
> > > > > > [1] https://issues.apache.org/jira/browse/FLINK-17496
> > > > > > [2] https://issues.apache.org/jira/browse/FLINK-14881
> > > > > >
> > > > > > Best,
> > > > > > Zhijiang
> > > > > >
> ------------------------------------------------------------------
> > > > > > From:Thomas Weise <[hidden email]>
> > > > > > Send Time:2020年7月17日(星期五) 05:29
> > > > > > To:dev <[hidden email]>
> > > > > > Cc:Zhijiang <[hidden email]>; Stephan Ewen <[hidden email]>;
> > > > > > Arvid Heise <[hidden email]>; Aljoscha Krettek <[hidden email]>
> > > > > > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> > > > > > release candidate #4)
> > > > > >
> > > > > > Sorry for the delay.
> > > > > >
> > > > > > I confirmed that the regression is due to the sink (unsurprising,
> > > > > > since another job with the same consumer, but not the producer,
> > > > > > runs as expected).
> > > > > >
> > > > > > As promised I did CPU profiling on the problematic application,
> > > > > > which gives more insight into the regression [1]
> > > > > >
> > > > > > The screenshots show that the average time for snapshotState
> > > > > > increases from ~9s to ~28s. The data also shows the increase in
> > > > > > sleep time during snapshotState.
> > > > > >
> > > > > > Does anyone, based on changes made in 1.11, have a theory why?
> > > > > >
> > > > > > I had previously looked at the changes to the Kinesis connector
> > > > > > and also reverted the SDK upgrade, which did not change the
> > > > > > situation.
> > > > > >
> > > > > > It will likely be necessary to drill into the sink /
> > > > > > checkpointing details to understand the cause of the problem.
> > > > > >
> > > > > > Let me know if anyone has specific questions that I can answer
> > > > > > from the profiling results.
> > > > > >
> > > > > > Thomas
> > > > > >
> > > > > > [1]
> > > > > > https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing
> > > > > >
> > > > > > On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <[hidden email]>
> > > wrote:
> > > > > >
> > > > > > > + dev@ for visibility
> > > > > > >
> > > > > > > I will investigate further today.
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <[hidden email]>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> On 06.07.20 20:39, Stephan Ewen wrote:
> > > > > > > >> >    - Did sink checkpoint notifications change in a relevant
> > > > > > > >> > way, for example due to some Kafka issues we addressed in
> > > > > > > >> > 1.11 (@Aljoscha maybe?)
> > > > > > > >>
> > > > > > > >> I think that's unrelated: the Kafka fixes were isolated in
> > > > > > > >> Kafka and the one bug I discovered on the way was about the
> > > > > > > >> Task reaper.
> > > > > > >>
> > > > > > >> On 07.07.20 17:51, Zhijiang wrote:
> > > > > > > >> > Sorry for misunderstanding the previous information, Thomas.
> > > > > > > >> > I was assuming that the sync checkpoint duration increased
> > > > > > > >> > after the upgrade, as was mentioned before.
> > > > > > > >> >
> > > > > > > >> > If I remember correctly, the memory state backend also has
> > > > > > > >> > the same issue? If so, we can rule out the RocksDB state
> > > > > > > >> > changes. As slot sharing is enabled, the downstream and
> > > > > > > >> > upstream are probably deployed into the same slot, so there
> > > > > > > >> > is no network shuffle effect.
> > > > > > > >> >
> > > > > > > >> > I think we need to find out whether any other symptoms
> > > > > > > >> > changed besides the performance regression, to further
> > > > > > > >> > narrow down the scope. E.g. any metrics changes, or changes
> > > > > > > >> > in the deployment such as the number of TaskManagers and the
> > > > > > > >> > number of slots per TaskManager.
> > > > > > > >> > A 40% regression is really big, so I guess the changes
> > > > > > > >> > should also be reflected in other places.
> > > > > > > >> >
> > > > > > > >> > I am not sure whether we can reproduce the regression in our
> > > > > > > >> > AWS environment by writing arbitrary Kinesis jobs, since, as
> > > > > > > >> > Thomas mentioned, there are also Kinesis jobs that behave
> > > > > > > >> > normally after the upgrade.
> > > > > > > >> > So it probably hits some corner case. I am very willing to
> > > > > > > >> > provide any help with debugging if possible.
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > Best,
> > > > > > >> > Zhijiang
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > ------------------------------------------------------------------
> > > > > > >> > From:Thomas Weise <[hidden email]>
> > > > > > >> > Send Time:2020年7月7日(星期二) 23:01
> > > > > > >> > To:Stephan Ewen <[hidden email]>
> > > > > > >> > Cc:Aljoscha Krettek <[hidden email]>; Arvid Heise <
> > > > > > >> [hidden email]>; Zhijiang <[hidden email]>
> > > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release
> > > 1.11.0,
> > > > > > >> release candidate #4)
> > > > > > >> >
> > > > > > > >> > We are deploying our apps with FlinkK8sOperator. We have one
> > > > > > > >> > job that works as expected after the upgrade and the one
> > > > > > > >> > discussed here that has the performance regression.
> > > > > > > >> >
> > > > > > > >> > "The performance regression is obviously caused by the long
> > > > > > > >> > duration of the sync checkpoint process in the Kinesis sink
> > > > > > > >> > operator, which blocks normal data processing until the
> > > > > > > >> > source is backpressured."
> > > > > > > >> >
> > > > > > > >> > That's a constant. Before (1.10) and after the upgrade, the
> > > > > > > >> > sync checkpointing time is the same. The question is what
> > > > > > > >> > change came in with the upgrade.
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <
> [hidden email]
> > >
> > > > > wrote:
> > > > > > >> >
> > > > > > > >> > @Thomas Just one thing real quick: Are you using the standalone setup
> > > > > > > >> > scripts (like start-cluster.sh, and the former "slaves" file)?
> > > > > > > >> > Be aware that this file is now called "workers", to avoid sensitive names.
> > > > > > > >> > In one internal benchmark we saw quite a lot of slowdown initially,
> > > > > > > >> > before seeing that the cluster was not a distributed cluster any more ;-)
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <
> > > > [hidden email]
> > > > > >
> > > > > > >> wrote:
> > > > > > >> > Thanks for this kickoff and help analysis, Stephan!
> > > > > > >> > Thanks for the further feedback and investigation, Thomas!
> > > > > > >> >
> > > > > > >> > The performance regression is obvious caused by long
> duration
> > of
> > > > > sync
> > > > > > >> checkpoint process in Kinesis sink operator, which would block
> > the
> > > > > > normal
> > > > > > >> data processing until back pressure the source.
> > > > > > > >> > Maybe we could dig into the sync part of the checkpoint, e.g. break down
> > > > > > > >> > the steps inside the respective operator#snapshotState to measure which
> > > > > > > >> > operation costs most of the time; then we would probably find the root
> > > > > > > >> > cause of this cost.
> > > > > > >> >
> > > > > > >> > Look forward to the further progress. :)
> > > > > > >> >
> > > > > > >> > Best,
> > > > > > >> > Zhijiang
> > > > > > >> >
> > > > > > >> >
> > > ------------------------------------------------------------------
> > > > > > >> > From:Stephan Ewen <[hidden email]>
> > > > > > >> > Send Time:2020年7月7日(星期二) 14:52
> > > > > > >> > To:Thomas Weise <[hidden email]>
> > > > > > >> > Cc:Stephan Ewen <[hidden email]>; Zhijiang <
> > > > > > >> [hidden email]>; Aljoscha Krettek <
> > > [hidden email]
> > > > >;
> > > > > > >> Arvid Heise <[hidden email]>
> > > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release
> > > 1.11.0,
> > > > > > >> release candidate #4)
> > > > > > >> >
> > > > > > > >> > Thank you for digging so deeply.
> > > > > > > >> > Mysterious thing, this regression.
> > > > > > >> >
> > > > > > >> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <[hidden email]>
> > wrote:
> > > > > > >> > @Stephan: yes, I refer to sync time in the web UI (it is
> > > unchanged
> > > > > > >> between 1.10 and 1.11 for the specific pipeline).
> > > > > > >> >
> > > > > > >> > I verified that increasing the checkpointing interval does
> not
> > > > make
> > > > > a
> > > > > > >> difference.
> > > > > > >> >
> > > > > > >> > I looked at the Kinesis connector changes since 1.10.1 and
> > don't
> > > > see
> > > > > > >> anything that could cause this.
> > > > > > >> >
> > > > > > >> > Another pipeline that is using the Kinesis consumer (but not
> > the
> > > > > > >> producer) performs as expected.
> > > > > > >> >
> > > > > > >> > I tried reverting the AWS SDK version change, symptoms
> remain
> > > > > > unchanged:
> > > > > > >> >
> > > > > > > >> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml b/flink-connectors/flink-connector-kinesis/pom.xml
> > > > > > > >> > index a6abce23ba..741743a05e 100644
> > > > > > > >> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
> > > > > > > >> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
> > > > > > > >> > @@ -33,7 +33,7 @@ under the License.
> > > > > > > >> >          <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
> > > > > > > >> >          <name>flink-connector-kinesis</name>
> > > > > > > >> >          <properties>
> > > > > > > >> > -               <aws.sdk.version>1.11.754</aws.sdk.version>
> > > > > > > >> > +               <aws.sdk.version>1.11.603</aws.sdk.version>
> > > > > > > >> >                 <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
> > > > > > > >> >                 <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
> > > > > > > >> >                 <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
> > > > > > >> >
> > > > > > >> > I'm planning to take a look with a profiler next.
> > > > > > >> >
> > > > > > >> > Thomas
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <
> > [hidden email]>
> > > > > > wrote:
> > > > > > >> > Hi all!
> > > > > > >> >
> > > > > > >> > Forking this thread out of the release vote thread.
> > > > > > >> >  From what Thomas describes, it really sounds like a
> > > sink-specific
> > > > > > >> issue.
> > > > > > >> >
> > > > > > >> > @Thomas: When you say sink has a long synchronous checkpoint
> > > time,
> > > > > you
> > > > > > >> mean the time that is shown as "sync time" on the metrics and
> > web
> > > > UI?
> > > > > > That
> > > > > > >> is not including any network buffer related operations. It is
> > > purely
> > > > > the
> > > > > > >> operator's time.
> > > > > > >> >
> > > > > > >> > Can we dig into the changes we did in sinks:
> > > > > > >> >    - Kinesis version upgrade, AWS library updates
> > > > > > >> >
> > > > > > > >> >    - Could it be that some call (checkpoint complete) that was
> > > > > > > >> > previously (1.10) in a separate thread is now in the mailbox, and this
> > > > > > > >> > simply reduces the number of threads that do the work?
> > > > > > >> >
> > > > > > >> >    - Did sink checkpoint notifications change in a relevant
> > way,
> > > > for
> > > > > > >> example due to some Kafka issues we addressed in 1.11
> (@Aljoscha
> > > > > maybe?)
> > > > > > >> >
> > > > > > >> > Best,
> > > > > > >> > Stephan
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <
> > > > [hidden email]
> > > > > > .invalid>
> > > > > > >> wrote:
> > > > > > >> > Hi Thomas,
> > > > > > >> >
> > > > > > >> >   Regarding [2], it has more detail infos in the Jira
> > > description
> > > > (
> > > > > > >> https://issues.apache.org/jira/browse/FLINK-16404).
> > > > > > >> >
> > > > > > > >> >   I can also give a basic explanation here to dismiss the concern.
> > > > > > > >> >   1. In the past, the buffers following the barrier were cached on the
> > > > > > > >> >   downstream side before alignment.
> > > > > > > >> >   2. In 1.11, the upstream does not send the buffers after the barrier.
> > > > > > > >> >   When the downstream finishes the alignment, it notifies the upstream to
> > > > > > > >> >   continue sending the following buffers, since it can process them after
> > > > > > > >> >   alignment.
> > > > > > > >> >   3. The only difference is whether the temporarily blocked buffers are
> > > > > > > >> >   cached on the downstream side or on the upstream side before alignment.
> > > > > > > >> >   4. The side effect is the additional notification cost for every barrier
> > > > > > > >> >   alignment. If the downstream and upstream are deployed in separate
> > > > > > > >> >   TaskManagers, the cost is the network transport delay (the effect can be
> > > > > > > >> >   ignored based on our testing with a 1s checkpoint interval). For slot
> > > > > > > >> >   sharing, as in your case, the cost is only one method call in the
> > > > > > > >> >   processor, which can also be ignored.
> > > > > > >> >
> > > > > > > >> >   You mentioned "In this case, the downstream task has a high average
> > > > > > > >> >   checkpoint duration(~30s, sync part)." This duration does not reflect
> > > > > > > >> >   the changes above; it only indicates the duration of calling
> > > > > > > >> >   `Operation.snapshotState`.
> > > > > > > >> >   If this duration is beyond your expectation, you can check or debug
> > > > > > > >> >   whether the source/sink operations take more time to finish
> > > > > > > >> >   `snapshotState` in practice. E.g. you can make the implementation of
> > > > > > > >> >   this method empty to further verify the effect.
> > > > > > >> >
> > > > > > >> >   Best,
> > > > > > >> >   Zhijiang
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > >  ------------------------------------------------------------------
> > > > > > >> >   From:Thomas Weise <[hidden email]>
> > > > > > >> >   Send Time:2020年7月5日(星期日) 12:22
> > > > > > >> >   To:dev <[hidden email]>; Zhijiang <
> > > > > [hidden email]
> > > > > > >
> > > > > > >> >   Cc:Yingjie Cao <[hidden email]>
> > > > > > >> >   Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > > > > >> >
> > > > > > >> >   Hi Zhijiang,
> > > > > > >> >
> > > > > > >> >   Could you please point me to more details regarding: "[2]:
> > > Delay
> > > > > > send
> > > > > > >> the
> > > > > > >> >   following buffers after checkpoint barrier on upstream
> side
> > > > until
> > > > > > >> barrier
> > > > > > >> >   alignment on downstream side."
> > > > > > >> >
> > > > > > >> >   In this case, the downstream task has a high average
> > > checkpoint
> > > > > > >> duration
> > > > > > >> >   (~30s, sync part). If there was a change to hold buffers
> > > > depending
> > > > > > on
> > > > > > >> >   downstream performance, could this possibly apply to this
> > case
> > > > > (even
> > > > > > >> when
> > > > > > >> >   there is no shuffle that would require alignment)?
> > > > > > >> >
> > > > > > >> >   Thanks,
> > > > > > >> >   Thomas
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <
> > > > > [hidden email]
> > > > > > >> .invalid>
> > > > > > >> >   wrote:
> > > > > > >> >
> > > > > > >> >   > Hi Thomas,
> > > > > > >> >   >
> > > > > > >> >   > Thanks for the further update information.
> > > > > > >> >   >
> > > > > > >> >   > I guess we can dismiss the network stack changes, since
> in
> > > > your
> > > > > > >> case the
> > > > > > >> >   > downstream and upstream would probably be deployed in
> the
> > > same
> > > > > > slot
> > > > > > >> >   > bypassing the network data shuffle.
> > > > > > >> >   > Also I guess release-1.11 will not bring general
> > performance
> > > > > > >> regression in
> > > > > > >> >   > runtime engine, as we also did the performance testing
> for
> > > all
> > > > > > >> general
> > > > > > >> >   > cases by [1] in real cluster before and the testing
> > results
> > > > > should
> > > > > > >> fit the
> > > > > > > >> >   > expectation. But we have indeed not yet tested the specific source and
> > > > > > > >> >   > sink connectors, as far as I know.
> > > > > > >> >   >
> > > > > > > >> >   > Regarding your 40% performance regression, I suspect it is
> > > > > > > >> >   > related to specific source/sink changes (e.g. Kinesis) or to environment
> > > > > > > >> >   > issues in a corner case.
> > > > > > > >> >   > If possible, it would be helpful to further narrow down whether the
> > > > > > > >> >   > regression is caused by Kinesis, by replacing the Kinesis source & sink
> > > > > > > >> >   > and keeping everything else the same.
> > > > > > > >> >   >
> > > > > > > >> >   > As you said, it would be efficient to contact you directly next week
> > > > > > > >> >   > to further discuss this issue. And we are willing/eager to provide any
> > > > > > > >> >   > help to resolve this issue soon.
> > > > > > >> >   >
> > > > > > >> >   > Besides that, I guess this issue should not be the
> blocker
> > > for
> > > > > the
> > > > > > >> >   > release, since it is probably a corner case based on the
> > > > current
> > > > > > >> analysis.
> > > > > > >> >   > If we really conclude anything need to be resolved after
> > the
> > > > > final
> > > > > > >> >   > release, then we can also make the next minor
> > release-1.11.1
> > > > > come
> > > > > > >> soon.
> > > > > > >> >   >
> > > > > > >> >   > [1] https://issues.apache.org/jira/browse/FLINK-18433
> > > > > > >> >   >
> > > > > > >> >   > Best,
> > > > > > >> >   > Zhijiang
> > > > > > >> >   >
> > > > > > >> >   >
> > > > > > >> >   >
> > > > > ------------------------------------------------------------------
> > > > > > >> >   > From:Thomas Weise <[hidden email]>
> > > > > > >> >   > Send Time:2020年7月4日(星期六) 12:26
> > > > > > >> >   > To:dev <[hidden email]>; Zhijiang <
> > > > > > [hidden email]
> > > > > > >> >
> > > > > > >> >   > Cc:Yingjie Cao <[hidden email]>
> > > > > > >> >   > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
> > > > > > >> >   >
> > > > > > >> >   > Hi Zhijiang,
> > > > > > >> >   >
> > > > > > >> >   > It will probably be best if we connect next week and
> > discuss
> > > > the
> > > > > > >> issue
> > > > > > >> >   > directly since this could be quite difficult to
> reproduce.
> > > > > > >> >   >
> > > > > > >> >   > Before the testing result on our side comes out for your
> > > > > > respective
> > > > > > >> job
> > > > > > >> >   > case, I have some other questions to confirm for further
> > > > > analysis:
> > > > > > >> >   >     -  How much percentage regression you found after
> > > > switching
> > > > > to
> > > > > > >> 1.11?
> > > > > > >> >   >
> > > > > > >> >   > ~40% throughput decline
> > > > > > >> >   >
> > > > > > >> >   >     -  Are there any network bottleneck in your cluster?
> > > E.g.
> > > > > the
> > > > > > >> network
> > > > > > >> >   > bandwidth is full caused by other jobs? If so, it might
> > have
> > > > > more
> > > > > > >> effects
> > > > > > >> >   > by above [2]
> > > > > > >> >   >
> > > > > > >> >   > The test runs on a k8s cluster that is also used for
> other
> > > > > > >> production jobs.
> > > > > > > >> >   > There is no reason to believe the network is the bottleneck.
> > > > > > >> >   >
> > > > > > >> >   >     -  Did you adjust the default network buffer
> setting?
> > > E.g.
> > > > > > >> >   > "taskmanager.network.memory.floating-buffers-per-gate"
> or
> > > > > > >> >   > "taskmanager.network.memory.buffers-per-channel"
> > > > > > >> >   >
> > > > > > >> >   > The job is using the defaults, i.e we don't configure
> the
> > > > > > settings.
> > > > > > >> If you
> > > > > > >> >   > want me to try specific settings in the hope that it
> will
> > > help
> > > > > to
> > > > > > >> isolate
> > > > > > >> >   > the issue please let me know.
> > > > > > >> >   >
> > > > > > >> >   >     -  I guess the topology has three vertexes
> > > > "KinesisConsumer
> > > > > ->
> > > > > > >> Chained
> > > > > > >> >   > FlatMap -> KinesisProducer", and the partition mode for
> > > > > > >> "KinesisConsumer ->
> > > > > > >> >   > FlatMap" and "FlatMap->KinesisProducer" are both
> > "forward"?
> > > If
> > > > > so,
> > > > > > >> the edge
> > > > > > >> >   > connection is one-to-one, not all-to-all, then the above
> > > > [1][2]
> > > > > > >> should no
> > > > > > >> >   > effects in theory with default network buffer setting.
> > > > > > >> >   >
> > > > > > >> >   > There are only 2 vertices and the edge is "forward".
> > > > > > >> >   >
> > > > > > >> >   >     - By slot sharing, I guess these three vertex
> > > parallelism
> > > > > task
> > > > > > >> would
> > > > > > >> >   > probably be deployed into the same slot, then the data
> > > shuffle
> > > > > is
> > > > > > >> by memory
> > > > > > >> >   > queue, not network stack. If so, the above [2] should no
> > > > effect.
> > > > > > >> >   >
> > > > > > >> >   > Yes, vertices share slots.
> > > > > > >> >   >
> > > > > > > >> >   >     - I also saw some Jira changes for Kinesis in this release; could you
> > > > > > > >> >   > confirm that these changes would not affect performance?
> > > > > > >> >   >
> > > > > > >> >   > I will need to take a look. 1.10 already had a
> regression
> > > > > > >> introduced by the
> > > > > > >> >   > Kinesis producer update.
> > > > > > >> >   >
> > > > > > >> >   >
> > > > > > >> >   > Thanks,
> > > > > > >> >   > Thomas
> > > > > > >> >   >
> > > > > > >> >   >
> > > > > > >> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <
> > > > > > >> [hidden email]
> > > > > > >> >   > .invalid>
> > > > > > >> >   > wrote:
> > > > > > >> >   >
> > > > > > >> >   > > Hi Thomas,
> > > > > > >> >   > >
> > > > > > >> >   > > Thanks for your reply with rich information!
> > > > > > >> >   > >
> > > > > > >> >   > > We are trying to reproduce your case in our cluster to
> > > > further
> > > > > > >> verify it,
> > > > > > >> >   > > and  @Yingjie Cao is working on it now.
> > > > > > > >> >   > > As we have no Kinesis consumer and producer internally, we will
> > > > > > > >> >   > > construct a common source and sink instead, in the case of backpressure.
> > > > > > > >> >   > >
> > > > > > > >> >   > > Firstly, we can dismiss the RocksDB factor in this release, since you also
> > > > > > > >> >   > > mentioned that "filesystem leads to same symptoms".
> > > > > > > >> >   > >
> > > > > > > >> >   > > Secondly, if my understanding is right, you emphasize that the regression
> > > > > > > >> >   > > only exists for jobs with a low checkpoint interval (10s).
> > > > > > >> >   > > Based on that, I have two suspicions with the network
> > > > related
> > > > > > >> changes in
> > > > > > >> >   > > this release:
> > > > > > >> >   > >     - [1]: Limited the maximum backlog value (default
> > 10)
> > > in
> > > > > > >> subpartition
> > > > > > >> >   > > queue.
> > > > > > >> >   > >     - [2]: Delay send the following buffers after
> > > checkpoint
> > > > > > >> barrier on
> > > > > > >> >   > > upstream side until barrier alignment on downstream
> > side.
> > > > > > >> >   > >
> > > > > > >> >   > > These changes are motivated for reducing the in-flight
> > > > buffers
> > > > > > to
> > > > > > >> speedup
> > > > > > >> >   > > checkpoint especially in the case of backpressure.
> > > > > > >> >   > > In theory they should have very minor performance
> effect
> > > and
> > > > > > >> actually we
> > > > > > >> >   > > also tested in cluster to verify within expectation
> > before
> > > > > > >> merging them,
> > > > > > >> >   > >  but maybe there are other corner cases we have not
> > > thought
> > > > of
> > > > > > >> before.
> > > > > > >> >   > >
> > > > > > >> >   > > Before the testing result on our side comes out for
> your
> > > > > > >> respective job
> > > > > > >> >   > > case, I have some other questions to confirm for
> further
> > > > > > analysis:
> > > > > > >> >   > >     -  How much percentage regression you found after
> > > > > switching
> > > > > > >> to 1.11?
> > > > > > >> >   > >     -  Are there any network bottleneck in your
> cluster?
> > > > E.g.
> > > > > > the
> > > > > > >> network
> > > > > > >> >   > > bandwidth is full caused by other jobs? If so, it
> might
> > > have
> > > > > > more
> > > > > > >> effects
> > > > > > >> >   > > by above [2]
> > > > > > >> >   > >     -  Did you adjust the default network buffer
> > setting?
> > > > E.g.
> > > > > > >> >   > > "taskmanager.network.memory.floating-buffers-per-gate"
> > or
> > > > > > >> >   > > "taskmanager.network.memory.buffers-per-channel"
> > > > > > >> >   > >     -  I guess the topology has three vertexes
> > > > > "KinesisConsumer
> > > > > > ->
> > > > > > >> >   > Chained
> > > > > > >> >   > > FlatMap -> KinesisProducer", and the partition mode
> for
> > > > > > >> "KinesisConsumer
> > > > > > >> >   > ->
> > > > > > >> >   > > FlatMap" and "FlatMap->KinesisProducer" are both
> > > "forward"?
> > > > If
> > > > > > >> so, the
> > > > > > >> >   > edge
> > > > > > >> >   > > connection is one-to-one, not all-to-all, then the
> above
> > > > > [1][2]
> > > > > > >> should no
> > > > > > >> >   > > effects in theory with default network buffer setting.
> > > > > > >> >   > >     - By slot sharing, I guess these three vertex
> > > > parallelism
> > > > > > >> task would
> > > > > > >> >   > > probably be deployed into the same slot, then the data
> > > > shuffle
> > > > > > is
> > > > > > >> by
> > > > > > >> >   > memory
> > > > > > >> >   > > queue, not network stack. If so, the above [2] should
> no
> > > > > effect.
> > > > > > >> >   > >     - I also saw some Jira changes for kinesis in this
> > > > > release,
> > > > > > >> could you
> > > > > > >> >   > > confirm that these changes would not effect the
> > > performance?
> > > > > > >> >   > >
> > > > > > >> >   > > Best,
> > > > > > >> >   > > Zhijiang
> > > > > > >> >   > >
> > > > > > >> >   > >
> > > > > > >> >   > >
> > > > > >
> ------------------------------------------------------------------
> > > > > > >> >   > > From:Thomas Weise <[hidden email]>
> > > > > > >> >   > > Send Time:2020年7月3日(星期五) 01:07
> > > > > > >> >   > > To:dev <[hidden email]>; Zhijiang <
> > > > > > >> [hidden email]>
> > > > > > >> >   > > Subject:Re: [VOTE] Release 1.11.0, release candidate
> #4
> > > > > > >> >   > >
> > > > > > >> >   > > Hi Zhijiang,
> > > > > > >> >   > >
> > > > > > >> >   > > The performance degradation manifests in backpressure
> > > which
> > > > > > leads
> > > > > > >> to
> > > > > > >> >   > > growing backlog in the source. I switched a few times
> > > > between
> > > > > > >> 1.10 and
> > > > > > >> >   > 1.11
> > > > > > >> >   > > and the behavior is consistent.
> > > > > > >> >   > >
> > > > > > >> >   > > The DAG is:
> > > > > > >> >   > >
> > > > > > >> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map)
> > >  --------
> > > > > > >> forward
> > > > > > >> >   > > ---------> KinesisProducer
> > > > > > >> >   > >
> > > > > > >> >   > > Parallelism: 160
> > > > > > >> >   > > No shuffle/rebalance.
> > > > > > >> >   > >
> > > > > > >> >   > > Checkpointing config:
> > > > > > >> >   > >
> > > > > > >> >   > > Checkpointing Mode Exactly Once
> > > > > > >> >   > > Interval 10s
> > > > > > >> >   > > Timeout 10m 0s
> > > > > > >> >   > > Minimum Pause Between Checkpoints 10s
> > > > > > >> >   > > Maximum Concurrent Checkpoints 1
> > > > > > >> >   > > Persist Checkpoints Externally Enabled (delete on
> > > > > cancellation)
> > > > > > >> >   > >
> > > > > > >> >   > > State backend: rocksdb  (filesystem leads to same
> > > symptoms)
> > > > > > >> >   > > Checkpoint size is tiny (500KB)
> > > > > > >> >   > >
> > > > > > >> >   > > An interesting difference to another job that I had
> > > upgraded
> > > > > > >> successfully
> > > > > > >> >   > > is the low checkpointing interval.
> > > > > > >> >   > >
> > > > > > >> >   > > Thanks,
> > > > > > >> >   > > Thomas
> > > > > > >> >   > >
> > > > > > >> >   > >
> > > > > > >> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <
> > > > > > >> [hidden email]
> > > > > > >> >   > > .invalid>
> > > > > > >> >   > > wrote:
> > > > > > >> >   > >
> > > > > > >> >   > > > Hi Thomas,
> > > > > > >> >   > > >
> > > > > > >> >   > > > Thanks for the efficient feedback.
> > > > > > >> >   > > >
> > > > > > >> >   > > > Regarding the suggestion of adding the release notes
> > > > > document,
> > > > > > >> I agree
> > > > > > >> >   > > > with your point. Maybe we should adjust the vote
> > > template
> > > > > > >> accordingly
> > > > > > >> >   > in
> > > > > > >> >   > > > the respective wiki to guide the following release
> > > > > processes.
> > > > > > >> >   > > >
> > > > > > >> >   > > > Regarding the performance regression, could you
> > provide
> > > > some
> > > > > > >> more
> > > > > > >> >   > details
> > > > > > >> >   > > > for our better measurement or reproducing on our
> > sides?
> > > > > > >> >   > > > E.g. I guess the topology only includes two vertexes
> > > > source
> > > > > > and
> > > > > > >> sink?
> > > > > > >> >   > > > What is the parallelism for every vertex?
> > > > > > >> >   > > > The upstream shuffles data to the downstream via
> > > rebalance
> > > > > > >> partitioner
> > > > > > >> >   > or
> > > > > > >> >   > > > other?
> > > > > > >> >   > > > The checkpoint mode is exactly-once with rocksDB
> state
> > > > > > backend?
> > > > > > >> >   > > > The backpressure happened in this case?
> > > > > > >> >   > > > How much percentage regression in this case?
> > > > > > >> >   > > >
> > > > > > >> >   > > > Best,
> > > > > > >> >   > > > Zhijiang
> > > > > > >> >   > > >
> > > > > > >> >   > > >
> > > > > > >> >   > > >
> > > > > > >> >   > > >
> > > > > > >>
> > ------------------------------------------------------------------
> > > > > > >> >   > > > From:Thomas Weise <[hidden email]>
> > > > > > >> >   > > > Send Time:2020年7月2日(星期四) 09:54
> > > > > > >> >   > > > To:dev <[hidden email]>
> > > > > > >> >   > > > Subject:Re: [VOTE] Release 1.11.0, release candidate
> > #4
> > > > > > >> >   > > >
> > > > > > >> >   > > > Hi Till,
> > > > > > >> >   > > >
> > > > > > >> >   > > > Yes, we don't have the setting in flink-conf.yaml.
> > > > > > >> >   > > >
> > > > > > >> >   > > > Generally, we carry forward the existing
> configuration
> > > and
> > > > > any
> > > > > > >> change
> > > > > > >> >   > to
> > > > > > >> >   > > > default configuration values would impact the
> upgrade.
> > > > > > >> >   > > >
> > > > > > >> >   > > > Yes, since it is an incompatible change I would
> state
> > it
> > > > in
> > > > > > the
> > > > > > >> release
> > > > > > >> >   > > > notes.
> > > > > > >> >   > > >
> > > > > > >> >   > > > Thanks,
> > > > > > >> >   > > > Thomas
> > > > > > >> >   > > >
> > > > > > >> >   > > > BTW I found a performance regression while trying to
> > > > upgrade
> > > > > > >> another
> > > > > > >> >   > > > pipeline with this RC. It is a simple Kinesis to
> > Kinesis
> > > > > job.
> > > > > > >> Wasn't
> > > > > > >> >   > able
> > > > > > >> >   > > > to pin it down yet, symptoms include increased
> > > checkpoint
> > > > > > >> alignment
> > > > > > >> >   > time.
> > > > > > >> >   > > >
> > > > > > >> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <
> > > > > > >> [hidden email]>
> > > > > > >> >   > > > wrote:
> > > > > > >> >   > > >
> > > > > > >> >   > > > > Hi Thomas,
> > > > > > >> >   > > > >
> > > > > > >> >   > > > > just to confirm: When starting the image in local
> > > mode,
> > > > > then
> > > > > > >> you
> > > > > > >> >   > don't
> > > > > > >> >   > > > have
> > > > > > >> >   > > > > any of the JobManager memory configuration
> settings
> > > > > > >> configured in the
> > > > > > >> >   > > > > effective flink-conf.yaml, right? Does this mean
> > that
> > > > you
> > > > > > have
> > > > > > >> >   > > explicitly
> > > > > > >> >   > > > > removed `jobmanager.heap.size: 1024m` from the
> > default
> > > > > > >> configuration?
> > > > > > >> >   > > If
> > > > > > >> >   > > > > this is the case, then I believe it was more of an
> > > > > > >> unintentional
> > > > > > >> >   > > artifact
> > > > > > >> >   > > > > that it worked before and it has been corrected
> now
> > so
> > > > > that
> > > > > > >> one needs
> > > > > > >> >   > > to
> > > > > > >> >   > > > > specify the memory of the JM process explicitly.
> Do
> > > you
> > > > > > think
> > > > > > >> it
> > > > > > >> >   > would
> > > > > > >> >   > > > help
> > > > > > >> >   > > > > to explicitly state this in the release notes?
> > > > > > >> >   > > > >
> > > > > > >> >   > > > > Cheers,
> > > > > > >> >   > > > > Till
> > > > > > >> >   > > > >
> > > > > > >> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <
> > > > > [hidden email]
> > > > > > >
> > > > > > >> wrote:
> > > > > > >> >   > > > >
> > > > > > >> >   > > > > > Thanks for preparing another RC!
> > > > > > >> >   > > > > >
> > > > > > >> >   > > > > > As mentioned in the previous RC thread, it would
> > be
> > > > > super
> > > > > > >> helpful
> > > > > > >> >   > if
> > > > > > >> >   > > > the
> > > > > > >> >   > > > > > release notes that are part of the documentation
> > can
> > > > be
> > > > > > >> included
> > > > > > >> >   > [1].
> > > > > > >> >   > > > > It's
> > > > > > >> >   > > > > > a significant time-saver to have read those
> first.
> > > > > > >> >   > > > > >
> > > > > > >> >   > > > > > I found one more non-backward compatible change
> > that
> > > > > would
> > > > > > >> be worth
> > > > > > >> >   > > > > > addressing/mentioning:
> > > > > > >> >   > > > > >
> > > > > > >> >   > > > > > It is now necessary to configure the jobmanager
> > heap
> > > > > size
> > > > > > in
> > > > > > >> >   > > > > > flink-conf.yaml (with either
> jobmanager.heap.size
> > > > > > >> >   > > > > > or jobmanager.memory.heap.size). Why would I not
> > > want
> > > > to
> > > > > > do
> > > > > > >> that
> > > > > > >> >   > > > anyways?
> > > > > > >> >   > > > > > Well, we set it dynamically for a cluster
> > deployment
> > > > via
> > > > > > the
> > > > > > >> >   > > > > > flinkk8soperator, but the container image can
> also
> > > be
> > > > > used
> > > > > > >> for
> > > > > > >> >   > > testing
> > > > > > >> >   > > > > with
> > > > > > >> >   > > > > > local mode (./bin/jobmanager.sh start-foreground
> > > > local).
> > > > > > >> That will
> > > > > > >> >   > > fail
> > > > > > >> >   > > > > if
> > > > > > >> >   > > > > > the heap wasn't configured and that's how I
> > noticed
> > > > it.
> > > > > > >> >   > > > > >
> > > > > > >> >   > > > > > Thanks,
> > > > > > >> >   > > > > > Thomas
> > > > > > >> >   > > > > >
> > > > > > >> >   > > > > > [1]
> > > > > > >> >   > > > > >
> > > > > > >> >   > > > > >
> > > > > > >> >   > > > >
> > > > > > >> >   > > >
> > > > > > >> >   > >
> > > > > > >> >   >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
> > > > > > >> >   > > > > >
> > > > > > >> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <
> > > > > > >> >   > [hidden email]
> > > > > > >> >   > > > > > .invalid>
> > > > > > >> >   > > > > > wrote:
> > > > > > >> >   > > > > >
> > > > > > >> >   > > > > > > Hi everyone,
> > > > > > >> >   > > > > > >
> > > > > > >> >   > > > > > > Please review and vote on the release
> candidate
> > #4
> > > > for
> > > > > > the
> > > > > > >> >   > version
> > > > > > >> >   > > > > > 1.11.0,
> > > > > > >> >   > > > > > > as follows:
> > > > > > >> >   > > > > > > [ ] +1, Approve the release
> > > > > > >> >   > > > > > > [ ] -1, Do not approve the release (please
> > provide
> > > > > > >> specific
> > > > > > >> >   > > comments)
> > > > > > >> >   > > > > > >
> > > > > > >> >   > > > > > > The complete staging area is available for
> your
> > > > > review,
> > > > > > >> which
> > > > > > >> >   > > > includes:
> > > > > > >> >   > > > > > > * JIRA release notes [1],
> > > > > > >> >   > > > > > > * the official Apache source release and
> binary
> > > > > > >> convenience
> > > > > > >> >   > > releases
> > > > > > >> >   > > > to
> > > > > > >> >   > > > > > be
> > > > > > >> >   > > > > > > deployed to dist.apache.org [2], which are
> > signed
> > > > > with
> > > > > > >> the key
> > > > > > >> >   > > with
> > > > > > >> >   > > > > > > fingerprint
> > > 2DA85B93244FDFA19A6244500653C0A2CEA00D0E
> > > > > > [3],
> > > > > > >> >   > > > > > > * all artifacts to be deployed to the Maven
> > > Central
> > > > > > >> Repository
> > > > > > >> >   > [4],
> > > > > > >> >   > > > > > > * source code tag "release-1.11.0-rc4" [5],
> > > > > > >> >   > > > > > > * website pull request listing the new release
> > and
> > > > > > adding
> > > > > > >> >   > > > announcement
> > > > > > >> >   > > > > > > blog post [6].
> > > > > > >> >   > > > > > >
> > > > > > >> >   > > > > > > The vote will be open for at least 72 hours.
> It
> > is
> > > > > > >> adopted by
> > > > > > >> >   > > > majority
> > > > > > >> >   > > > > > > approval, with at least 3 PMC affirmative
> votes.
> > > > > > >> >   > > > > > >
> > > > > > >> >   > > > > > > Thanks,
> > > > > > >> >   > > > > > > Release Manager
> > > > > > >> >   > > > > > >
> > > > > > >> >   > > > > > > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> > > > > > >> >   > > > > > > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
> > > > > > >> >   > > > > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > > > > >> >   > > > > > > [4] https://repository.apache.org/content/repositories/orgapacheflink-1377/
> > > > > > >> >   > > > > > > [5] https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
> > > > > > >> >   > > > > > > [6] https://github.com/apache/flink-web/pull/352
> > > > > > >> >   > > > > > >
> > > > > > >> >   > > > > > >
> > > > > > >> >   > > > > >
> > > > > > >> >   > > > >
> > > > > > >> >   > > >
> > > > > > >> >   > > >
> > > > > > >> >   > >
> > > > > > >> >   > >
> > > > > > >> >   >
> > > > > > >> >   >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >>
> > > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Regards,
> > Roman
> >
>


--
Regards,
Roman

Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

Thomas Weise
Hi Roman,

Here are the checkpoint summaries for both commits:

https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit#slide=id.g86d15b2fc7_0_0

The config:

    CheckpointConfig checkpointConfig = env.getCheckpointConfig();
    checkpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
    checkpointConfig.setCheckpointInterval(10_000);         // 60_000 in the comparison run
    checkpointConfig.setMinPauseBetweenCheckpoints(10_000); // 60_000 in the comparison run
    // DELETE_ON_CANCELLATION is a static import from CheckpointConfig.ExternalizedCheckpointCleanup
    checkpointConfig.enableExternalizedCheckpoints(DELETE_ON_CANCELLATION);
    checkpointConfig.setCheckpointTimeout(600_000);
    checkpointConfig.setMaxConcurrentCheckpoints(1);
    checkpointConfig.setFailOnCheckpointingErrors(true);

Changing the checkpoint interval and the min pause (both 10_000 above) to
60_000 makes the symptom disappear.
I have meanwhile also verified this with the 1.11.0 release commit.
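
As a rough illustration of why the longer interval hides the symptom, here is
a back-of-the-envelope model. It is a sketch under stated assumptions, not
Flink's scheduler: it assumes one checkpoint at a time, a fixed ~30s sync
duration (the figure reported in this thread), and that the next checkpoint
starts once both the interval since the last start and the min pause since the
last completion have elapsed. The class and method names are made up for the
example.

```java
final class CheckpointCadence {
    /** Start-to-start spacing in ms under the simplified model described above. */
    static long cycleMillis(long intervalMs, long minPauseMs, long durationMs) {
        // The next checkpoint waits for the interval (measured from the last start)
        // and for the min pause (measured from the last completion), whichever is later.
        return Math.max(intervalMs, durationMs + minPauseMs);
    }

    /** Fraction of wall-clock time spent inside a checkpoint under this model. */
    static double busyFraction(long intervalMs, long minPauseMs, long durationMs) {
        return (double) durationMs / cycleMillis(intervalMs, minPauseMs, durationMs);
    }

    public static void main(String[] args) {
        long syncMs = 30_000; // assumption: ~30s sync part, as reported in this thread
        System.out.println(cycleMillis(10_000, 10_000, syncMs));  // 40000
        System.out.println(cycleMillis(60_000, 60_000, syncMs));  // 90000
        System.out.println(busyFraction(10_000, 10_000, syncMs)); // 0.75
        System.out.println(busyFraction(60_000, 60_000, syncMs));
    }
}
```

Under these assumptions the 10s settings keep the sink in snapshotState for
roughly three quarters of the time, versus about a third with 60s.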

I will take a look at the sleep time issue.
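
For context on the sleep time: the loop in FlinkKinesisProducer#snapshotState
quoted further down in this thread flushes and then sleeps 500 ms for as long
as records are outstanding, so every pass adds at least one sleep period to the
sync part. The sketch below mirrors that loop shape with a configurable sleep
so the cost per pass is easy to see. `FlushWait`, `flushAndWait`,
`minAddedMillis` and the fake in-memory producer are hypothetical stand-ins,
not the KPL or connector API.

```java
import java.util.function.IntSupplier;

final class FlushWait {
    /**
     * Flushes and sleeps until nothing is outstanding, like the quoted loop.
     * Returns the number of completed sleeps.
     */
    static int flushAndWait(IntSupplier outstandingRecords, Runnable flush, long sleepMs) {
        int sleeps = 0;
        while (outstandingRecords.getAsInt() > 0) {
            flush.run();
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                break;
            }
            sleeps++;
        }
        return sleeps;
    }

    /** Lower bound on the time the sleeps alone add to the sync part. */
    static long minAddedMillis(int sleeps, long sleepMs) {
        return sleeps * sleepMs;
    }

    public static void main(String[] args) {
        // Fake producer: three outstanding records, one drained per flush.
        int[] remaining = {3};
        int sleeps = flushAndWait(() -> remaining[0], () -> remaining[0]--, 1);
        System.out.println(sleeps + " passes; at 500 ms each that is at least "
                + minAddedMillis(sleeps, 500) + " ms");
    }
}
```

The point of the sketch is only that the sleep granularity sets a floor on the
sync duration whenever at least one record is outstanding; it says nothing
about what keeps records outstanding for ~30s.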

Thanks,
Thomas


On Fri, Aug 7, 2020 at 1:44 AM Roman Khachatryan <[hidden email]>
wrote:

> Hi Thomas,
>
> Thanks for your reply!
>
> I think you are right, we can remove this sleep and improve
> KinesisProducer.
> Probably, its snapshotState can also be sped up by forcing record flushes
> more often.
> Do you see that 30s checkpointing duration is caused by KinesisProducer
> (or maybe other operators)?
>
> I'd also like to understand the reason behind this increase in checkpoint
> frequency.
> Can you please share these values:
>  - execution.checkpointing.min-pause
>  - execution.checkpointing.max-concurrent-checkpoints
>  - execution.checkpointing.timeout
>
> And what is the "new" observed checkpoint frequency (or how many
> checkpoints are created) compared to older versions?
>
>
> On Fri, Aug 7, 2020 at 4:49 AM Thomas Weise <[hidden email]> wrote:
>
>> Hi Roman,
>>
>> Indeed there are more frequent checkpoints with this change! The
>> application was configured to checkpoint every 10s. With 1.10 ("good
>> commit"), that leads to fewer completed checkpoints compared to 1.11 ("bad
>> commit"). Just to be clear, the only difference between the two runs was
>> the commit 355184d69a8519d29937725c8d85e8465d7e3a90
>>
>> Since the sync part of checkpoints with the Kinesis producer always takes
>> ~30 seconds, the 10s configured checkpoint frequency really had no effect
>> before 1.11. I confirmed that both commits perform comparably by setting
>> the checkpoint frequency and min pause to 60s.
>>
>> I still have to verify with the final 1.11.0 release commit.
>>
>> It's probably good to take a look at the Kinesis producer. Is it really
>> necessary to have 500ms sleep time? What's responsible for the ~30s
>> duration in snapshotState?
>>
>> As things stand it doesn't make sense to use checkpoint intervals < 30s
>> when using the Kinesis producer.
>>
>> Thanks,
>> Thomas
>>
>> On Sat, Aug 1, 2020 at 2:53 PM Roman Khachatryan <[hidden email]
>> >
>> wrote:
>>
>> > Hi Thomas,
>> >
>> > Thanks a lot for the analysis.
>> >
>> > The first thing that I'd check is whether checkpoints became more
>> frequent
>> > with this commit (as each of them adds at least 500ms if there is at
>> least
>> > one not sent record, according to FlinkKinesisProducer.snapshotState).
>> >
>> > Can you share checkpointing statistics (1.10 vs 1.11 or last "good" vs
>> > first "bad" commits)?
>> >
>> > On Fri, Jul 31, 2020 at 5:29 AM Thomas Weise <[hidden email]>
>> > wrote:
>> >
>> > > I ran git bisect and the first commit that shows the regression is:
>> > >
>> > > https://github.com/apache/flink/commit/355184d69a8519d29937725c8d85e8465d7e3a90
>> > >
>> > >
>> > > On Thu, Jul 23, 2020 at 6:46 PM Kurt Young <[hidden email]> wrote:
>> > >
>> > > > From my experience, java profilers are sometimes not accurate
>> enough to
>> > > > find out the performance regression
>> > > > root cause. In this case, I would suggest you try out intel vtune
>> > > amplifier
>> > > > to watch more detailed metrics.
>> > > >
>> > > > Best,
>> > > > Kurt
>> > > >
>> > > >
>> > > > On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise <[hidden email]>
>> wrote:
>> > > >
>> > > > > The cause of the issue is anything but clear.
>> > > > >
>> > > > > Previously I had mentioned that there is no suspect change to the
>> > > Kinesis
>> > > > > connector and that I had reverted the AWS SDK change to no effect.
>> > > > >
>> > > > > https://issues.apache.org/jira/browse/FLINK-17496 actually fixed
>> > > another
>> > > > > regression in the previous release and is present before and
>> after.
>> > > > >
>> > > > > I repeated the run with 1.11.0 core and downgraded the entire
>> Kinesis
>> > > > > connector to 1.10.1: Nothing changes, i.e. the regression is still
>> > > > present.
>> > > > > Therefore we will need to look elsewhere for the root cause.
>> > > > >
>> > > > > Regarding the time spent in snapshotState, repeat runs reveal a
>> wide
>> > > > range
>> > > > > for both versions, 1.10 and 1.11. So again this is nothing
>> pointing
>> > to
>> > > a
>> > > > > root cause.
>> > > > >
>> > > > > At this point, I have no ideas remaining other than doing a
>> bisect to
>> > > > find
>> > > > > the culprit. Any other suggestions?
>> > > > >
>> > > > > Thomas
>> > > > >
>> > > > >
>> > > > > On Thu, Jul 16, 2020 at 9:19 PM Zhijiang <
>> [hidden email]
>> > > > > .invalid>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi Thomas,
>> > > > > >
>> > > > > > Thanks for your further profiling information and glad to see we
>> > > > already
>> > > > > > finalized the location to cause the regression.
>> > > > > > Actually I was also suspicious of the point of #snapshotState in
>> > > > previous
>> > > > > > discussions since it indeed cost much time to block normal
>> operator
>> > > > > > processing.
>> > > > > >
>> > > > > > Based on your below feedback, the sleep time during
>> #snapshotState
>> > > > might
>> > > > > > be the main concern, and I also dug into the implementation
>> of
>> > > > > > FlinkKinesisProducer#snapshotState.
>> > > > > > while (producer.getOutstandingRecordsCount() > 0) {
>> > > > > >    producer.flush();
>> > > > > >    try {
>> > > > > >       Thread.sleep(500);
>> > > > > >    } catch (InterruptedException e) {
>> > > > > >       LOG.warn("Flushing was interrupted.");
>> > > > > >       break;
>> > > > > >    }
>> > > > > > }
>> > > > > > It seems that the sleep time is mainly affected by the internal
>> > > > > operations
>> > > > > > inside KinesisProducer implementation provided by amazonaws,
>> which
>> > I
>> > > am
>> > > > > not
>> > > > > > quite familiar with.
>> > > > > > But I noticed there were two upgrades related to it in
>> > > release-1.11.0.
>> > > > > One
>> > > > > > is for upgrading amazon-kinesis-producer to 0.14.0 [1] and
>> another
>> > is
>> > > > for
>> > > > > > upgrading aws-sdk-version to 1.11.754 [2].
>> > > > > > You mentioned that you already reverted the SDK upgrade to
>> verify
>> > no
>> > > > > > changes. Did you also revert the [1] to verify?
>> > > > > > [1] https://issues.apache.org/jira/browse/FLINK-17496
>> > > > > > [2] https://issues.apache.org/jira/browse/FLINK-14881
>> > > > > >
>> > > > > > Best,
>> > > > > > Zhijiang
>> > > > > >
>> ------------------------------------------------------------------
>> > > > > > From:Thomas Weise <[hidden email]>
>> > > > > > Send Time:2020年7月17日(星期五) 05:29
>> > > > > > To:dev <[hidden email]>
>> > > > > > Cc:Zhijiang <[hidden email]>; Stephan Ewen <
>> > > > [hidden email]
>> > > > > >;
>> > > > > > Arvid Heise <[hidden email]>; Aljoscha Krettek <
>> > > > [hidden email]
>> > > > > >
>> > > > > > Subject:Re: Kinesis Performance Issue (was [VOTE] Release
>> 1.11.0,
>> > > > release
>> > > > > > candidate #4)
>> > > > > >
>> > > > > > Sorry for the delay.
>> > > > > >
>> > > > > > I confirmed that the regression is due to the sink
>> (unsurprising,
>> > > since
>> > > > > > another job with the same consumer, but not the producer, runs
>> as
>> > > > > > expected).
>> > > > > >
>> > > > > > As promised I did CPU profiling on the problematic application,
>> > which
>> > > > > gives
>> > > > > > more insight into the regression [1]
>> > > > > >
>> > > > > > The screenshots show that the average time for snapshotState
>> > > increases
>> > > > > from
>> > > > > > ~9s to ~28s. The data also shows the increase in sleep time
>> during
>> > > > > > snapshotState.
>> > > > > >
>> > > > > > Does anyone, based on changes made in 1.11, have a theory why?
>> > > > > >
>> > > > > > I had previously looked at the changes to the Kinesis connector
>> and
>> > > > also
>> > > > > > reverted the SDK upgrade, which did not change the situation.
>> > > > > >
>> > > > > > It will likely be necessary to drill into the sink /
>> checkpointing
>> > > > > details
>> > > > > > to understand the cause of the problem.
>> > > > > >
>> > > > > > Let me know if anyone has specific questions that I can answer
>> from
>> > > the
>> > > > > > profiling results.
>> > > > > >
>> > > > > > Thomas
>> > > > > >
>> > > > > > [1] https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing
>> > > > > >
>> > > > > > On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <[hidden email]>
>> > > wrote:
>> > > > > >
>> > > > > > > + dev@ for visibility
>> > > > > > >
>> > > > > > > I will investigate further today.
>> > > > > > >
>> > > > > > >
>> > > > > > > On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <
>> > > [hidden email]
>> > > > >
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > >> On 06.07.20 20:39, Stephan Ewen wrote:
>> > > > > > >> >    - Did sink checkpoint notifications change in a relevant
>> > way,
>> > > > for
>> > > > > > >> example
>> > > > > > >> > due to some Kafka issues we addressed in 1.11 (@Aljoscha
>> > maybe?)
>> > > > > > >>
>> > > > > > >> I think that's unrelated: the Kafka fixes were isolated in
>> Kafka
>> > > and
>> > > > > the
>> > > > > > >> one bug I discovered on the way was about the Task reaper.
>> > > > > > >>
>> > > > > > >>
>> > > > > > >> On 07.07.20 17:51, Zhijiang wrote:
>> > > > > > >> > Sorry for my misunderstanding of the previous information,
>> > Thomas.
>> > > I
>> > > > > was
>> > > > > > >> assuming that the sync checkpoint duration increased after
>> > upgrade
>> > > > as
>> > > > > it
>> > > > > > >> was mentioned before.
>> > > > > > >> >
>> > > > > > >> > If I remembered correctly, the memory state backend also
>> has
>> > the
>> > > > > same
>> > > > > > >> issue? If so, we can dismiss the rocksDB state changes. As
>> the
>> > > slot
>> > > > > > sharing
>> > > > > > >> enabled, the downstream and upstream should
>> > > > > > >> > probably deployed into the same slot, then no network
>> shuffle
>> > > > > effect.
>> > > > > > >> >
>> > > > > > >> > I think we need to find out whether it has other symptoms
>> > > changed
>> > > > > > >> besides the performance regression to further figure out the
>> > > scope.
>> > > > > > >> > E.g. any metrics changes, the number of TaskManager and the
>> > > number
>> > > > > of
>> > > > > > >> slots per TaskManager from deployment changes.
>> > > > > > >> > 40% regression is really big, I guess the changes should
>> also
>> > be
>> > > > > > >> reflected in other places.
>> > > > > > >> >
>> > > > > > >> > I am not sure whether we can reproduce the regression in
>> our
>> > AWS
>> > > > > > >> environment by writing any Kinesis jobs, since there are also
>> > > normal
>> > > > > > >> Kinesis jobs as Thomas mentioned after upgrade.
>> > > > > > >> > So it probably touches some corner case. I am
>> very
>> > > > > willing
>> > > > > > >> to provide any help for debugging if possible.
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> > Best,
>> > > > > > >> > Zhijiang
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> >
>> > > ------------------------------------------------------------------
>> > > > > > >> > From:Thomas Weise <[hidden email]>
>> > > > > > >> > Send Time:2020年7月7日(星期二) 23:01
>> > > > > > >> > To:Stephan Ewen <[hidden email]>
>> > > > > > >> > Cc:Aljoscha Krettek <[hidden email]>; Arvid Heise <
>> > > > > > >> [hidden email]>; Zhijiang <[hidden email]>
>> > > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release
>> > > 1.11.0,
>> > > > > > >> release candidate #4)
>> > > > > > >> >
>> > > > > > >> > We are deploying our apps with FlinkK8sOperator. We have
>> one
>> > job
>> > > > > that
>> > > > > > >> works as expected after the upgrade and the one discussed
>> here
>> > > that
>> > > > > has
>> > > > > > the
>> > > > > > >> performance regression.
>> > > > > > >> >
>> > > > > > >> > "The performance regression is obviously caused by long
>> duration
>> > > of
>> > > > > sync
>> > > > > > >> checkpoint process in Kinesis sink operator, which would
>> block
>> > the
>> > > > > > normal
>> > > > > > >> data processing until back pressure the source."
>> > > > > > >> >
>> > > > > > >> > That's a constant. 1.10 and the upgraded version have the same
>> > sync
>> > > > > > >> checkpointing time. The question is what change came in with
>> the
>> > > > > > upgrade.
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <
>> [hidden email]
>> > >
>> > > > > wrote:
>> > > > > > >> >
>> > > > > > >> > @Thomas Just one thing real quick: Are you using the
>> > standalone
>> > > > > setup
>> > > > > > >> scripts (like start-cluster.sh, and the former "slaves"
>> file) ?
>> > > > > > >> > Be aware that this is now called "workers" because of
>> avoiding
>> > > > > > >> sensitive names.
>> > > > > > >> > In one internal benchmark we saw quite a lot of slowdown
>> > > > initially,
>> > > > > > >> before seeing that the cluster was not a distributed cluster
>> any
>> > > > more
>> > > > > > ;-)
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <
>> > > > [hidden email]
>> > > > > >
>> > > > > > >> wrote:
>> > > > > > >> > Thanks for this kickoff and help analysis, Stephan!
>> > > > > > >> > Thanks for the further feedback and investigation, Thomas!
>> > > > > > >> >
>> > > > > > >> > The performance regression is obviously caused by long
>> duration
>> > of
>> > > > > sync
>> > > > > > >> checkpoint process in Kinesis sink operator, which would
>> block
>> > the
>> > > > > > normal
>> > > > > > >> data processing until back pressure the source.
>> > > > > > >> > Maybe we could dig into the process of sync execution in
>> > > > checkpoint.
>> > > > > > >> E.g. break down the steps inside respective
>> > operator#snapshotState
>> > > > to
>> > > > > > >> statistic which operation cost most of the time, then
>> > > > > > >> > we might probably find the root cause to bring such cost.
>> > > > > > >> >
>> > > > > > >> > Look forward to the further progress. :)
>> > > > > > >> >
>> > > > > > >> > Best,
>> > > > > > >> > Zhijiang
>> > > > > > >> >
>> > > > > > >> >
>> > > ------------------------------------------------------------------
>> > > > > > >> > From:Stephan Ewen <[hidden email]>
>> > > > > > >> > Send Time:2020年7月7日(星期二) 14:52
>> > > > > > >> > To:Thomas Weise <[hidden email]>
>> > > > > > >> > Cc:Stephan Ewen <[hidden email]>; Zhijiang <
>> > > > > > >> [hidden email]>; Aljoscha Krettek <
>> > > [hidden email]
>> > > > >;
>> > > > > > >> Arvid Heise <[hidden email]>
>> > > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release
>> > > 1.11.0,
>> > > > > > >> release candidate #4)
>> > > > > > >> >
>> > > > > > >> > Thank you for the digging so deeply.
>> > > > > > >> > Mysterious thing, this regression.
>> > > > > > >> >
>> > > > > > >> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <[hidden email]>
>> > wrote:
>> > > > > > >> > @Stephan: yes, I refer to sync time in the web UI (it is
>> > > unchanged
>> > > > > > >> between 1.10 and 1.11 for the specific pipeline).
>> > > > > > >> >
>> > > > > > >> > I verified that increasing the checkpointing interval does
>> not
>> > > > make
>> > > > > a
>> > > > > > >> difference.
>> > > > > > >> >
>> > > > > > >> > I looked at the Kinesis connector changes since 1.10.1 and
>> > don't
>> > > > see
>> > > > > > >> anything that could cause this.
>> > > > > > >> >
>> > > > > > >> > Another pipeline that is using the Kinesis consumer (but
>> not
>> > the
>> > > > > > >> producer) performs as expected.
>> > > > > > >> >
>> > > > > > >> > I tried reverting the AWS SDK version change, symptoms
>> remain
>> > > > > > unchanged:
>> > > > > > >> >
>> > > > > > >> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml b/flink-connectors/flink-connector-kinesis/pom.xml
>> > > > > > >> > index a6abce23ba..741743a05e 100644
>> > > > > > >> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
>> > > > > > >> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
>> > > > > > >> > @@ -33,7 +33,7 @@ under the License.
>> > > > > > >> >          <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
>> > > > > > >> >          <name>flink-connector-kinesis</name>
>> > > > > > >> >          <properties>
>> > > > > > >> > -               <aws.sdk.version>1.11.754</aws.sdk.version>
>> > > > > > >> > +               <aws.sdk.version>1.11.603</aws.sdk.version>
>> > > > > > >> >                 <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
>> > > > > > >> >                 <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
>> > > > > > >> >                 <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
>> > > > > > >> >
>> > > > > > >> > I'm planning to take a look with a profiler next.
>> > > > > > >> >
>> > > > > > >> > Thomas
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <
>> > [hidden email]>
>> > > > > > wrote:
>> > > > > > >> > Hi all!
>> > > > > > >> >
>> > > > > > >> > Forking this thread out of the release vote thread.
>> > > > > > >> >  From what Thomas describes, it really sounds like a
>> > > sink-specific
>> > > > > > >> issue.
>> > > > > > >> >
>> > > > > > >> > @Thomas: When you say sink has a long synchronous
>> checkpoint
>> > > time,
>> > > > > you
>> > > > > > >> mean the time that is shown as "sync time" on the metrics and
>> > web
>> > > > UI?
>> > > > > > That
>> > > > > > >> is not including any network buffer related operations. It is
>> > > purely
>> > > > > the
>> > > > > > >> operator's time.
>> > > > > > >> >
>> > > > > > >> > Can we dig into the changes we did in sinks:
>> > > > > > >> >    - Kinesis version upgrade, AWS library updates
>> > > > > > >> >
>> > > > > > >> >    - Could it be that some call (checkpoint complete) that
>> was
>> > > > > > >> previously (1.10) in a separate thread is not in the mailbox
>> and
>> > > > this
>> > > > > > >> simply reduces the number of threads that do the work?
>> > > > > > >> >
>> > > > > > >> >    - Did sink checkpoint notifications change in a relevant
>> > way,
>> > > > for
>> > > > > > >> example due to some Kafka issues we addressed in 1.11
>> (@Aljoscha
>> > > > > maybe?)
>> > > > > > >> >
>> > > > > > >> > Best,
>> > > > > > >> > Stephan
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <
>> > > > [hidden email]
>> > > > > > .invalid>
>> > > > > > >> wrote:
>> > > > > > >> > Hi Thomas,
>> > > > > > >> >
>> > > > > > >> >   Regarding [2], it has more detail infos in the Jira
>> > > description
>> > > > (
>> > > > > > >> https://issues.apache.org/jira/browse/FLINK-16404).
>> > > > > > >> >
>> > > > > > >> >   I can also give some basic explanations here to dismiss
>> the
>> > > > > concern.
>> > > > > > >> >   1. In the past, the following buffers after the barrier
>> will
>> > > be
>> > > > > > >> cached on downstream side before alignment.
>> > > > > > >> >   2. In 1.11, the upstream would not send the buffers after
>> > the
>> > > > > > >> barrier. When the downstream finishes the alignment, it will
>> > > notify
>> > > > > the
>> > > > > > >> upstream to continue sending the following buffers, since it
>> can
>> > > > > process
>> > > > > > >> them after alignment.
>> > > > > > >> >   3. The only difference is that the temporary blocked
>> buffers
>> > > are
>> > > > > > >> cached either on downstream side or on upstream side before
>> > > > alignment.
>> > > > > > >> >   4. The side effect would be the additional notification
>> cost
>> > > for
>> > > > > > >> every barrier alignment. If the downstream and upstream are
>> > > deployed
>> > > > > in
>> > > > > > >> separate TaskManager, the cost is network transport delay
>> (the
>> > > > effect
>> > > > > > can
>> > > > > > >> be ignored based on our testing with 1s checkpoint interval).
>> > For
>> > > > > > sharing
>> > > > > > >> slot in your case, the cost is only one method call in
>> > processor,
>> > > > can
>> > > > > be
>> > > > > > >> ignored also.
>> > > > > > >> >
>> > > > > > >> >   You mentioned "In this case, the downstream task has a
>> high
>> > > > > average
>> > > > > > >> checkpoint duration(~30s, sync part)." This duration is not
>> > > > reflecting
>> > > > > > the
>> > > > > > >> changes above, and it is only indicating the duration for
>> > calling
>> > > > > > >> `Operation.snapshotState`.
>> > > > > > >> >   If this duration is beyond your expectation, you can
>> check
>> > or
>> > > > > debug
>> > > > > > >> whether the source/sink operations might take more time to
>> > finish
>> > > > > > >> `snapshotState` in practice. E.g. you can
>> > > > > > >> >   make the implementation of this method as empty to
>> further
>> > > > verify
>> > > > > > the
>> > > > > > >> effect.
>> > > > > > >> >
>> > > > > > >> >   Best,
>> > > > > > >> >   Zhijiang
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> >
>> > > >  ------------------------------------------------------------------
>> > > > > > >> >   From:Thomas Weise <[hidden email]>
>> > > > > > >> >   Send Time:2020年7月5日(星期日) 12:22
>> > > > > > >> >   To:dev <[hidden email]>; Zhijiang <
>> > > > > [hidden email]
>> > > > > > >
>> > > > > > >> >   Cc:Yingjie Cao <[hidden email]>
>> > > > > > >> >   Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>> > > > > > >> >
>> > > > > > >> >   Hi Zhijiang,
>> > > > > > >> >
>> > > > > > >> >   Could you please point me to more details regarding:
>> "[2]:
>> > > Delay
>> > > > > > send
>> > > > > > >> the
>> > > > > > >> >   following buffers after checkpoint barrier on upstream
>> side
>> > > > until
>> > > > > > >> barrier
>> > > > > > >> >   alignment on downstream side."
>> > > > > > >> >
>> > > > > > >> >   In this case, the downstream task has a high average
>> > > checkpoint
>> > > > > > >> duration
>> > > > > > >> >   (~30s, sync part). If there was a change to hold buffers
>> > > > depending
>> > > > > > on
>> > > > > > >> >   downstream performance, could this possibly apply to this
>> > case
>> > > > > (even
>> > > > > > >> when
>> > > > > > >> >   there is no shuffle that would require alignment)?
>> > > > > > >> >
>> > > > > > >> >   Thanks,
>> > > > > > >> >   Thomas
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <
>> > > > > [hidden email]
>> > > > > > >> .invalid>
>> > > > > > >> >   wrote:
>> > > > > > >> >
>> > > > > > >> >   > Hi Thomas,
>> > > > > > >> >   >
>> > > > > > >> >   > Thanks for the further update information.
>> > > > > > >> >   >
>> > > > > > >> >   > I guess we can dismiss the network stack changes,
>> since in
>> > > > your
>> > > > > > >> case the
>> > > > > > >> >   > downstream and upstream would probably be deployed in
>> the
>> > > same
>> > > > > > slot
>> > > > > > >> >   > bypassing the network data shuffle.
>> > > > > > >> >   > Also I guess release-1.11 will not bring general
>> > performance
>> > > > > > >> regression in
>> > > > > > >> >   > runtime engine, as we also did the performance testing
>> for
>> > > all
>> > > > > > >> general
>> > > > > > >> >   > cases by [1] in real cluster before and the testing
>> > results
>> > > > > should
>> > > > > > >> fit the
>> > > > > > >> >   > expectation. But we indeed did not test the specific
>> > source
>> > > > and
>> > > > > > sink
>> > > > > > >> >   > connectors yet as far as I know.
>> > > > > > >> >   >
>> > > > > > >> >   > Regarding your performance regression with 40%, I
>> wonder
>> > it
>> > > is
>> > > > > > >> probably
>> > > > > > >> >   > related to specific source/sink changes (e.g. kinesis)
>> or
>> > > > > > >> environment
>> > > > > > >> >   > issues with corner case.
>> > > > > > >> >   > If possible, it would be helpful to further locate
>> whether
>> > > the
>> > > > > > >> regression
>> > > > > > >> >   > is caused by kinesis, by replacing the kinesis source &
>> > sink
>> > > > and
>> > > > > > >> keeping
>> > > > > > >> >   > the others same.
>> > > > > > >> >   >
>> > > > > > >> >   > As you said, it would be efficient to contact with you
>> > > > directly
>> > > > > > >> next week
>> > > > > > >> >   > to further discuss this issue. And we are
>> willing/eager to
>> > > > > provide
>> > > > > > >> any help
>> > > > > > >> >   > to resolve this issue soon.
>> > > > > > >> >   >
>> > > > > > >> >   > Besides that, I guess this issue should not be the
>> blocker
>> > > for
>> > > > > the
>> > > > > > >> >   > release, since it is probably a corner case based on
>> the
>> > > > current
>> > > > > > >> analysis.
>> > > > > > >> >   > If we really conclude anything need to be resolved
>> after
>> > the
>> > > > > final
>> > > > > > >> >   > release, then we can also make the next minor
>> > release-1.11.1
>> > > > > come
>> > > > > > >> soon.
>> > > > > > >> >   >
>> > > > > > >> >   > [1] https://issues.apache.org/jira/browse/FLINK-18433
>> > > > > > >> >   >
>> > > > > > >> >   > Best,
>> > > > > > >> >   > Zhijiang
>> > > > > > >> >   >
>> > > > > > >> >   >
>> > > > > > >> >   >
>> > > > > ------------------------------------------------------------------
>> > > > > > >> >   > From:Thomas Weise <[hidden email]>
>> > > > > > >> >   > Send Time:2020年7月4日(星期六) 12:26
>> > > > > > >> >   > To:dev <[hidden email]>; Zhijiang <
>> > > > > > [hidden email]
>> > > > > > >> >
>> > > > > > >> >   > Cc:Yingjie Cao <[hidden email]>
>> > > > > > >> >   > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>> > > > > > >> >   >
>> > > > > > >> >   > Hi Zhijiang,
>> > > > > > >> >   >
>> > > > > > >> >   > It will probably be best if we connect next week and
>> > discuss
>> > > > the
>> > > > > > >> issue
>> > > > > > >> >   > directly since this could be quite difficult to
>> reproduce.
>> > > > > > >> >   >
>> > > > > > >> >   > Before the testing result on our side comes out for
>> your
>> > > > > > respective
>> > > > > > >> job
>> > > > > > >> >   > case, I have some other questions to confirm for
>> further
>> > > > > analysis:
>> > > > > > >> >   >     -  How much percentage regression you found after
>> > > > switching
>> > > > > to
>> > > > > > >> 1.11?
>> > > > > > >> >   >
>> > > > > > >> >   > ~40% throughput decline
>> > > > > > >> >   >
>> > > > > > >> >   >     -  Are there any network bottleneck in your
>> cluster?
>> > > E.g.
>> > > > > the
>> > > > > > >> network
>> > > > > > >> >   > bandwidth is full caused by other jobs? If so, it might
>> > have
>> > > > > more
>> > > > > > >> effects
>> > > > > > >> >   > by above [2]
>> > > > > > >> >   >
>> > > > > > >> >   > The test runs on a k8s cluster that is also used for
>> other
>> > > > > > >> production jobs.
>> > > > > > >> >   > There is no reason to believe the network is the
>> bottleneck.
>> > > > > > >> >   >
>> > > > > > >> >   >     -  Did you adjust the default network buffer
>> setting?
>> > > E.g.
>> > > > > > >> >   > "taskmanager.network.memory.floating-buffers-per-gate"
>> or
>> > > > > > >> >   > "taskmanager.network.memory.buffers-per-channel"
>> > > > > > >> >   >
>> > > > > > >> >   > The job is using the defaults, i.e we don't configure
>> the
>> > > > > > settings.
>> > > > > > >> If you
>> > > > > > >> >   > want me to try specific settings in the hope that it
>> will
>> > > help
>> > > > > to
>> > > > > > >> isolate
>> > > > > > >> >   > the issue please let me know.
>> > > > > > >> >   >
>> > > > > > >> >   >     -  I guess the topology has three vertexes
>> > > > "KinesisConsumer
>> > > > > ->
>> > > > > > >> Chained
>> > > > > > >> >   > FlatMap -> KinesisProducer", and the partition mode for
>> > > > > > >> "KinesisConsumer ->
>> > > > > > >> >   > FlatMap" and "FlatMap->KinesisProducer" are both
>> > "forward"?
>> > > If
>> > > > > so,
>> > > > > > >> the edge
>> > > > > > >> >   > connection is one-to-one, not all-to-all, then the
>> above
>> > > > [1][2]
>> > > > > > >> should no
>> > > > > > >> >   > effects in theory with default network buffer setting.
>> > > > > > >> >   >
>> > > > > > >> >   > There are only 2 vertices and the edge is "forward".
>> > > > > > >> >   >
>> > > > > > >> >   >     - By slot sharing, I guess these three vertex
>> > > parallelism
>> > > > > task
>> > > > > > >> would
>> > > > > > >> >   > probably be deployed into the same slot, then the data
>> > > shuffle
>> > > > > is
>> > > > > > >> by memory
>> > > > > > >> >   > queue, not network stack. If so, the above [2] should
>> no
>> > > > effect.
>> > > > > > >> >   >
>> > > > > > >> >   > Yes, vertices share slots.
>> > > > > > >> >   >
>> > > > > > >> >   >     - I also saw some Jira changes for kinesis in this
>> > > > release,
>> > > > > > >> could you
>> > > > > > >> >   > confirm that these changes would not effect the
>> > performance?
>> > > > > > >> >   >
>> > > > > > >> >   > I will need to take a look. 1.10 already had a
>> regression
>> > > > > > >> introduced by the
>> > > > > > >> >   > Kinesis producer update.
>> > > > > > >> >   >
>> > > > > > >> >   >
>> > > > > > >> >   > Thanks,
>> > > > > > >> >   > Thomas
>> > > > > > >> >   >
>> > > > > > >> >   >
>> > > > > > >> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <
>> > > > > > >> [hidden email]
>> > > > > > >> >   > .invalid>
>> > > > > > >> >   > wrote:
>> > > > > > >> >   >
>> > > > > > >> >   > > Hi Thomas,
>> > > > > > >> >   > >
>> > > > > > >> >   > > Thanks for your reply with rich information!
>> > > > > > >> >   > >
>> > > > > > >> >   > > We are trying to reproduce your case in our cluster
>> to
>> > > > further
>> > > > > > >> verify it,
>> > > > > > >> >   > > and  @Yingjie Cao is working on it now.
>> > > > > > >> >   > > As we have no Kinesis consumer and producer internally, we will
>> > > > > > >> >   > > construct a generic source and sink instead for the
>> > > > > > >> >   > > backpressure case.
>> > > > > > >> >   > >
>> > > > > > >> >   > > Firstly, we can rule out the RocksDB factor in this release,
>> > > > > > >> >   > > since you also mentioned that "filesystem leads to same
>> > > > > > >> >   > > symptoms".
>> > > > > > >> >   > >
>> > > > > > >> >   > > Secondly, if my understanding is right, you emphasize that the
>> > > > > > >> >   > > regression only exists for jobs with a low checkpoint interval
>> > > > > > >> >   > > (10s).
>> > > > > > >> >   > > Based on that, I have two suspicions with the network
>> > > > related
>> > > > > > >> changes in
>> > > > > > >> >   > > this release:
>> > > > > > >> >   > >     - [1]: Limited the maximum backlog value (default 10) in the
>> > > > > > >> >   > > subpartition queue.
>> > > > > > >> >   > >     - [2]: Delay sending the buffers that follow a checkpoint
>> > > > > > >> >   > > barrier on the upstream side until barrier alignment on the
>> > > > > > >> >   > > downstream side.
>> > > > > > >> >   > >
>> > > > > > >> >   > > These changes are meant to reduce the in-flight buffers and speed
>> > > > > > >> >   > > up checkpoints, especially under backpressure.
>> > > > > > >> >   > > In theory they should have a very minor performance effect, and we
>> > > > > > >> >   > > also tested them in a cluster to verify they were within
>> > > > > > >> >   > > expectations before merging, but maybe there are other corner
>> > > > > > >> >   > > cases we have not thought of.
>> > > > > > >> >   > >
>> > > > > > >> >   > > Before the test results on our side come out for your job, I have
>> > > > > > >> >   > > some other questions to confirm for further analysis:
>> > > > > > >> >   > >     - What percentage regression did you find after switching to
>> > > > > > >> >   > > 1.11?
>> > > > > > >> >   > >     - Is there any network bottleneck in your cluster, e.g. is the
>> > > > > > >> >   > > network bandwidth saturated by other jobs? If so, [2] above might
>> > > > > > >> >   > > have more effect.
>> > > > > > >> >   > >     - Did you adjust the default network buffer settings, e.g.
>> > > > > > >> >   > > "taskmanager.network.memory.floating-buffers-per-gate" or
>> > > > > > >> >   > > "taskmanager.network.memory.buffers-per-channel"?
>> > > > > > >> >   > >     - I guess the topology has three vertices, "KinesisConsumer ->
>> > > > > > >> >   > > Chained FlatMap -> KinesisProducer", and the partition mode for
>> > > > > > >> >   > > "KinesisConsumer -> FlatMap" and "FlatMap -> KinesisProducer" is
>> > > > > > >> >   > > "forward" for both? If so, the edge connection is one-to-one, not
>> > > > > > >> >   > > all-to-all, so [1] and [2] above should have no effect in theory
>> > > > > > >> >   > > with the default network buffer settings.
>> > > > > > >> >   > >     - With slot sharing, I guess the parallel tasks of these three
>> > > > > > >> >   > > vertices would probably be deployed into the same slot, so the
>> > > > > > >> >   > > data shuffle goes through an in-memory queue, not the network
>> > > > > > >> >   > > stack. If so, [2] above should have no effect.
>> > > > > > >> >   > >     - I also saw some Jira changes for Kinesis in this release.
>> > > > > > >> >   > > Could you confirm that these changes do not affect performance?
>> > > > > > >> >   > >
>> > > > > > >> >   > > Best,
>> > > > > > >> >   > > Zhijiang
>> > > > > > >> >   > >
>> > > > > > >> >   > >
>> > > > > > >> >   > >
>> > > > > >
>> ------------------------------------------------------------------
>> > > > > > >> >   > > From:Thomas Weise <[hidden email]>
>> > > > > > >> >   > > Send Time:2020年7月3日(星期五) 01:07
>> > > > > > >> >   > > To:dev <[hidden email]>; Zhijiang <
>> > > > > > >> [hidden email]>
>> > > > > > >> >   > > Subject:Re: [VOTE] Release 1.11.0, release candidate
>> #4
>> > > > > > >> >   > >
>> > > > > > >> >   > > Hi Zhijiang,
>> > > > > > >> >   > >
>> > > > > > >> >   > > The performance degradation manifests in backpressure
>> > > which
>> > > > > > leads
>> > > > > > >> to
>> > > > > > >> >   > > growing backlog in the source. I switched a few times
>> > > > between
>> > > > > > >> 1.10 and
>> > > > > > >> >   > 1.11
>> > > > > > >> >   > > and the behavior is consistent.
>> > > > > > >> >   > >
>> > > > > > >> >   > > The DAG is:
>> > > > > > >> >   > >
>> > > > > > >> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map)
>> > >  --------
>> > > > > > >> forward
>> > > > > > >> >   > > ---------> KinesisProducer
>> > > > > > >> >   > >
>> > > > > > >> >   > > Parallelism: 160
>> > > > > > >> >   > > No shuffle/rebalance.
>> > > > > > >> >   > >
>> > > > > > >> >   > > Checkpointing config:
>> > > > > > >> >   > >
>> > > > > > >> >   > > Checkpointing Mode Exactly Once
>> > > > > > >> >   > > Interval 10s
>> > > > > > >> >   > > Timeout 10m 0s
>> > > > > > >> >   > > Minimum Pause Between Checkpoints 10s
>> > > > > > >> >   > > Maximum Concurrent Checkpoints 1
>> > > > > > >> >   > > Persist Checkpoints Externally Enabled (delete on
>> > > > > cancellation)
>> > > > > > >> >   > >
>> > > > > > >> >   > > State backend: rocksdb  (filesystem leads to same
>> > > symptoms)
>> > > > > > >> >   > > Checkpoint size is tiny (500KB)
>> > > > > > >> >   > >
>> > > > > > >> >   > > An interesting difference to another job that I had
>> > > upgraded
>> > > > > > >> successfully
>> > > > > > >> >   > > is the low checkpointing interval.
>> > > > > > >> >   > >
>> > > > > > >> >   > > Thanks,
>> > > > > > >> >   > > Thomas
>> > > > > > >> >   > >
>> > > > > > >> >   > >
>> > > > > > >> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <
>> > > > > > >> [hidden email]
>> > > > > > >> >   > > .invalid>
>> > > > > > >> >   > > wrote:
>> > > > > > >> >   > >
>> > > > > > >> >   > > > Hi Thomas,
>> > > > > > >> >   > > >
>> > > > > > >> >   > > > Thanks for the efficient feedback.
>> > > > > > >> >   > > >
>> > > > > > >> >   > > > Regarding the suggestion of adding the release
>> notes
>> > > > > document,
>> > > > > > >> I agree
>> > > > > > >> >   > > > with your point. Maybe we should adjust the vote
>> > > template
>> > > > > > >> accordingly
>> > > > > > >> >   > in
>> > > > > > >> >   > > > the respective wiki to guide the following release
>> > > > > processes.
>> > > > > > >> >   > > >
>> > > > > > >> >   > > > Regarding the performance regression, could you
>> > provide
>> > > > some
>> > > > > > >> more
>> > > > > > >> >   > details
>> > > > > > >> >   > > > for our better measurement or reproducing on our
>> > sides?
>> > > > > > >> >   > > > E.g. I guess the topology only includes two
>> vertexes
>> > > > source
>> > > > > > and
>> > > > > > >> sink?
>> > > > > > >> >   > > > What is the parallelism for every vertex?
>> > > > > > >> >   > > > The upstream shuffles data to the downstream via
>> > > rebalance
>> > > > > > >> partitioner
>> > > > > > >> >   > or
>> > > > > > >> >   > > > other?
>> > > > > > >> >   > > > The checkpoint mode is exactly-once with rocksDB
>> state
>> > > > > > backend?
>> > > > > > >> >   > > > The backpressure happened in this case?
>> > > > > > >> >   > > > How much percentage regression in this case?
>> > > > > > >> >   > > >
>> > > > > > >> >   > > > Best,
>> > > > > > >> >   > > > Zhijiang
>> > > > > > >> >   > > >
>> > > > > > >> >   > > >
>> > > > > > >> >   > > >
>> > > > > > >> >   > > >
>> > > > > > >>
>> > ------------------------------------------------------------------
>> > > > > > >> >   > > > From:Thomas Weise <[hidden email]>
>> > > > > > >> >   > > > Send Time:2020年7月2日(星期四) 09:54
>> > > > > > >> >   > > > To:dev <[hidden email]>
>> > > > > > >> >   > > > Subject:Re: [VOTE] Release 1.11.0, release
>> candidate
>> > #4
>> > > > > > >> >   > > >
>> > > > > > >> >   > > > Hi Till,
>> > > > > > >> >   > > >
>> > > > > > >> >   > > > Yes, we don't have the setting in flink-conf.yaml.
>> > > > > > >> >   > > >
>> > > > > > >> >   > > > Generally, we carry forward the existing
>> configuration
>> > > and
>> > > > > any
>> > > > > > >> change
>> > > > > > >> >   > to
>> > > > > > >> >   > > > default configuration values would impact the
>> upgrade.
>> > > > > > >> >   > > >
>> > > > > > >> >   > > > Yes, since it is an incompatible change I would
>> state
>> > it
>> > > > in
>> > > > > > the
>> > > > > > >> release
>> > > > > > >> >   > > > notes.
>> > > > > > >> >   > > >
>> > > > > > >> >   > > > Thanks,
>> > > > > > >> >   > > > Thomas
>> > > > > > >> >   > > >
>> > > > > > >> >   > > > BTW I found a performance regression while trying
>> to
>> > > > upgrade
>> > > > > > >> another
>> > > > > > >> >   > > > pipeline with this RC. It is a simple Kinesis to
>> > Kinesis
>> > > > > job.
>> > > > > > >> Wasn't
>> > > > > > >> >   > able
>> > > > > > >> >   > > > to pin it down yet, symptoms include increased
>> > > checkpoint
>> > > > > > >> alignment
>> > > > > > >> >   > time.
>> > > > > > >> >   > > >
>> > > > > > >> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <
>> > > > > > >> [hidden email]>
>> > > > > > >> >   > > > wrote:
>> > > > > > >> >   > > >
>> > > > > > >> >   > > > > Hi Thomas,
>> > > > > > >> >   > > > >
>> > > > > > >> >   > > > > just to confirm: When starting the image in local
>> > > mode,
>> > > > > then
>> > > > > > >> you
>> > > > > > >> >   > don't
>> > > > > > >> >   > > > have
>> > > > > > >> >   > > > > any of the JobManager memory configuration
>> settings
>> > > > > > >> configured in the
>> > > > > > >> >   > > > > effective flink-conf.yaml, right? Does this mean
>> > that
>> > > > you
>> > > > > > have
>> > > > > > >> >   > > explicitly
>> > > > > > >> >   > > > > removed `jobmanager.heap.size: 1024m` from the
>> > default
>> > > > > > >> configuration?
>> > > > > > >> >   > > If
>> > > > > > >> >   > > > > this is the case, then I believe it was more of
>> an
>> > > > > > >> unintentional
>> > > > > > >> >   > > artifact
>> > > > > > >> >   > > > > that it worked before and it has been corrected
>> now
>> > so
>> > > > > that
>> > > > > > >> one needs
>> > > > > > >> >   > > to
>> > > > > > >> >   > > > > specify the memory of the JM process explicitly.
>> Do
>> > > you
>> > > > > > think
>> > > > > > >> it
>> > > > > > >> >   > would
>> > > > > > >> >   > > > help
>> > > > > > >> >   > > > > to explicitly state this in the release notes?
>> > > > > > >> >   > > > >
>> > > > > > >> >   > > > > Cheers,
>> > > > > > >> >   > > > > Till
>> > > > > > >> >   > > > >
>> > > > > > >> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <
>> > > > > [hidden email]
>> > > > > > >
>> > > > > > >> wrote:
>> > > > > > >> >   > > > >
>> > > > > > >> >   > > > > > Thanks for preparing another RC!
>> > > > > > >> >   > > > > >
>> > > > > > >> >   > > > > > As mentioned in the previous RC thread, it
>> would
>> > be
>> > > > > super
>> > > > > > >> helpful
>> > > > > > >> >   > if
>> > > > > > >> >   > > > the
>> > > > > > >> >   > > > > > release notes that are part of the
>> documentation
>> > can
>> > > > be
>> > > > > > >> included
>> > > > > > >> >   > [1].
>> > > > > > >> >   > > > > It's
>> > > > > > >> >   > > > > > a significant time-saver to have read those
>> first.
>> > > > > > >> >   > > > > >
>> > > > > > >> >   > > > > > I found one more non-backward compatible change
>> > that
>> > > > > would
>> > > > > > >> be worth
>> > > > > > >> >   > > > > > addressing/mentioning:
>> > > > > > >> >   > > > > >
>> > > > > > >> >   > > > > > It is now necessary to configure the jobmanager
>> > heap
>> > > > > size
>> > > > > > in
>> > > > > > >> >   > > > > > flink-conf.yaml (with either
>> jobmanager.heap.size
>> > > > > > >> >   > > > > > or jobmanager.memory.heap.size). Why would I
>> not
>> > > want
>> > > > to
>> > > > > > do
>> > > > > > >> that
>> > > > > > >> >   > > > anyways?
>> > > > > > >> >   > > > > > Well, we set it dynamically for a cluster
>> > deployment
>> > > > via
>> > > > > > the
>> > > > > > >> >   > > > > > flinkk8soperator, but the container image can
>> also
>> > > be
>> > > > > used
>> > > > > > >> for
>> > > > > > >> >   > > testing
>> > > > > > >> >   > > > > with
>> > > > > > >> >   > > > > > local mode (./bin/jobmanager.sh
>> start-foreground
>> > > > local).
>> > > > > > >> That will
>> > > > > > >> >   > > fail
>> > > > > > >> >   > > > > if
>> > > > > > >> >   > > > > > the heap wasn't configured and that's how I
>> > noticed
>> > > > it.
>> > > > > > >> >   > > > > >
>> > > > > > >> >   > > > > > Thanks,
>> > > > > > >> >   > > > > > Thomas
>> > > > > > >> >   > > > > >
>> > > > > > >> >   > > > > > [1]
>> > > > > > >> >   > > > > > https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
>> > > > > > >> >   > > > > >
>> > > > > > >> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <
>> > > > > > >> >   > [hidden email]
>> > > > > > >> >   > > > > > .invalid>
>> > > > > > >> >   > > > > > wrote:
>> > > > > > >> >   > > > > >
>> > > > > > >> >   > > > > > > Hi everyone,
>> > > > > > >> >   > > > > > >
>> > > > > > >> >   > > > > > > Please review and vote on the release
>> candidate
>> > #4
>> > > > for
>> > > > > > the
>> > > > > > >> >   > version
>> > > > > > >> >   > > > > > 1.11.0,
>> > > > > > >> >   > > > > > > as follows:
>> > > > > > >> >   > > > > > > [ ] +1, Approve the release
>> > > > > > >> >   > > > > > > [ ] -1, Do not approve the release (please
>> > provide
>> > > > > > >> specific
>> > > > > > >> >   > > comments)
>> > > > > > >> >   > > > > > >
>> > > > > > >> >   > > > > > > The complete staging area is available for
>> your
>> > > > > review,
>> > > > > > >> which
>> > > > > > >> >   > > > includes:
>> > > > > > >> >   > > > > > > * JIRA release notes [1],
>> > > > > > >> >   > > > > > > * the official Apache source release and
>> binary
>> > > > > > >> convenience
>> > > > > > >> >   > > releases
>> > > > > > >> >   > > > to
>> > > > > > >> >   > > > > > be
>> > > > > > >> >   > > > > > > deployed to dist.apache.org [2], which are
>> > signed
>> > > > > with
>> > > > > > >> the key
>> > > > > > >> >   > > with
>> > > > > > >> >   > > > > > > fingerprint
>> > > 2DA85B93244FDFA19A6244500653C0A2CEA00D0E
>> > > > > > [3],
>> > > > > > >> >   > > > > > > * all artifacts to be deployed to the Maven
>> > > Central
>> > > > > > >> Repository
>> > > > > > >> >   > [4],
>> > > > > > >> >   > > > > > > * source code tag "release-1.11.0-rc4" [5],
>> > > > > > >> >   > > > > > > * website pull request listing the new
>> release
>> > and
>> > > > > > adding
>> > > > > > >> >   > > > announcement
>> > > > > > >> >   > > > > > > blog post [6].
>> > > > > > >> >   > > > > > >
>> > > > > > >> >   > > > > > > The vote will be open for at least 72 hours.
>> It
>> > is
>> > > > > > >> adopted by
>> > > > > > >> >   > > > majority
>> > > > > > >> >   > > > > > > approval, with at least 3 PMC affirmative
>> votes.
>> > > > > > >> >   > > > > > >
>> > > > > > >> >   > > > > > > Thanks,
>> > > > > > >> >   > > > > > > Release Manager
>> > > > > > >> >   > > > > > >
>> > > > > > >> >   > > > > > > [1]
>> > > > > > >> >   > > > > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
>> > > > > > >> >   > > > > > > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
>> > > > > > >> >   > > > > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
>> > > > > > >> >   > > > > > > [4] https://repository.apache.org/content/repositories/orgapacheflink-1377/
>> > > > > > >> >   > > > > > > [5] https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
>> > > > > > >> >   > > > > > > [6] https://github.com/apache/flink-web/pull/352
>> > > > > > >> >   > > > > > >
>> > > > > > >> >   > > > > > >
>> > > > > > >> >   > > > > >
>> > > > > > >> >   > > > >
>> > > > > > >> >   > > >
>> > > > > > >> >   > > >
>> > > > > > >> >   > >
>> > > > > > >> >   > >
>> > > > > > >> >   >
>> > > > > > >> >   >
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >>
>> > > > > > >>
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> >
>> > --
>> > Regards,
>> > Roman
>> >
>>
>
>
> --
> Regards,
> Roman
>

Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

Thomas Weise
Just another update:

The duration of snapshotState is capped by the Kinesis
producer's "RecordTtl" setting (default 30s). The sleep time in flushSync
does not contribute to the observed behavior.
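
For reference, "RecordTtl" is one of the Kinesis Producer Library settings
that can be handed to the sink as plain producer properties. Below is a
minimal sketch, assuming the properties object is later passed to the
FlinkKinesisProducer constructor. The class name, the helper method and the
"us-east-1" region are placeholders for illustration and are not taken from
the job discussed here.

```java
import java.util.Properties;

/** Sketch: KPL settings expressed as plain properties for the Kinesis sink. */
class KinesisProducerConfigSketch {

    /** Builds producer properties with an explicit RecordTtl (milliseconds). */
    static Properties producerConfig(long recordTtlMillis) {
        Properties config = new Properties();
        // Assumed placeholder region; replace with the stream's real region.
        config.setProperty("aws.region", "us-east-1");
        // KPL setting named in this thread; 30000 ms is the default mentioned above.
        config.setProperty("RecordTtl", Long.toString(recordTtlMillis));
        return config;
    }

    public static void main(String[] args) {
        Properties config = producerConfig(30_000L);
        // In a real job these properties would be handed to the sink, e.g.
        // new FlinkKinesisProducer<>(schema, config); the sink is not built here.
        System.out.println("RecordTtl = " + config.getProperty("RecordTtl") + " ms");
    }
}
```

The sketch only makes the setting explicit so that it can be varied between
test runs; it does not suggest a particular value.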

I guess the open question is: why, with the same settings, has 1.11 been
processing more checkpoints since commit
355184d69a8519d29937725c8d85e8465d7e3a90?
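
To put rough numbers on why more completed checkpoints hurt throughput here,
the following back-of-the-envelope sketch estimates the share of time the
sink spends inside a synchronous snapshot. It is a simplified model, not
Flink's scheduler: it assumes the next checkpoint starts at the later of
"interval after the previous trigger" and "min pause after the previous
completion", and that the asynchronous part is negligible. The class and
method names are hypothetical. It does not explain why 1.10 and 1.11 differ.

```java
/** Back-of-the-envelope model of time spent in synchronous snapshots. */
class CheckpointBlockingSketch {

    /**
     * Fraction of wall-clock time spent in the sync snapshot, assuming one
     * checkpoint at a time and a cycle of max(interval, sync + minPause).
     */
    static double blockedFraction(long intervalMs, long minPauseMs, long syncMs) {
        long cycleMs = Math.max(intervalMs, syncMs + minPauseMs);
        return (double) syncMs / cycleMs;
    }

    public static void main(String[] args) {
        long syncMs = 30_000L; // ~30s snapshotState duration reported in this thread
        // Settings from the thread: 10s interval/min pause vs. 60s interval/min pause.
        System.out.println("10s settings: " + blockedFraction(10_000L, 10_000L, syncMs));
        System.out.println("60s settings: " + blockedFraction(60_000L, 60_000L, syncMs));
    }
}
```

Under these assumptions the sink is blocked for 30s out of every 40s with
the 10s settings, and for 30s out of every 90s with the 60s settings.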


On Fri, Aug 7, 2020 at 9:15 AM Thomas Weise <[hidden email]> wrote:

> Hi Roman,
>
> Here are the checkpoint summaries for both commits:
>
>
> https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit#slide=id.g86d15b2fc7_0_0
>
> The config:
>
>     CheckpointConfig checkpointConfig = env.getCheckpointConfig();
>     checkpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
>     checkpointConfig.setCheckpointInterval(*10_000*);
>     checkpointConfig.setMinPauseBetweenCheckpoints(*10_000*);
>     checkpointConfig.enableExternalizedCheckpoints(DELETE_ON_CANCELLATION);
>     checkpointConfig.setCheckpointTimeout(600_000);
>     checkpointConfig.setMaxConcurrentCheckpoints(1);
>     checkpointConfig.setFailOnCheckpointingErrors(true);
>
> Changing the values marked in bold to *60_000* makes the symptom
> disappear. Meanwhile, I also verified this with the 1.11.0 release commit.
>
> I will take a look at the sleep time issue.
>
> Thanks,
> Thomas
>
>
> On Fri, Aug 7, 2020 at 1:44 AM Roman Khachatryan <[hidden email]>
> wrote:
>
>> Hi Thomas,
>>
>> Thanks for your reply!
>>
>> I think you are right, we can remove this sleep and improve
>> KinesisProducer.
>> Its snapshotState can probably also be sped up by forcing record flushes
>> more often.
>> Do you see that 30s checkpointing duration is caused by KinesisProducer
>> (or maybe other operators)?
>>
>> I'd also like to understand the reason behind this increase in checkpoint
>> frequency.
>> Can you please share these values:
>>  - execution.checkpointing.min-pause
>>  - execution.checkpointing.max-concurrent-checkpoints
>>  - execution.checkpointing.timeout
>>
>> And what is the "new" observed checkpoint frequency (or how many
>> checkpoints are created) compared to older versions?
>>
>>
>> On Fri, Aug 7, 2020 at 4:49 AM Thomas Weise <[hidden email]> wrote:
>>
>>> Hi Roman,
>>>
>>> Indeed there are more frequent checkpoints with this change! The
>>> application was configured to checkpoint every 10s. With 1.10 ("good
>>> commit"), that leads to fewer completed checkpoints compared to 1.11
>>> ("bad
>>> commit"). Just to be clear, the only difference between the two runs was
>>> the commit 355184d69a8519d29937725c8d85e8465d7e3a90
>>>
>>> Since the sync part of checkpoints with the Kinesis producer always takes
>>> ~30 seconds, the 10s configured checkpoint frequency really had no effect
>>> before 1.11. I confirmed that both commits perform comparably by setting
>>> the checkpoint frequency and min pause to 60s.
>>>
>>> I still have to verify with the final 1.11.0 release commit.
>>>
>>> It's probably good to take a look at the Kinesis producer. Is it really
>>> necessary to have 500ms sleep time? What's responsible for the ~30s
>>> duration in snapshotState?
>>>
>>> As things stand it doesn't make sense to use checkpoint intervals < 30s
>>> when using the Kinesis producer.
>>>
>>> Thanks,
>>> Thomas
>>>
>>> On Sat, Aug 1, 2020 at 2:53 PM Roman Khachatryan <
>>> [hidden email]>
>>> wrote:
>>>
>>> > Hi Thomas,
>>> >
>>> > Thanks a lot for the analysis.
>>> >
>>> > The first thing that I'd check is whether checkpoints became more
>>> frequent
>>> > with this commit (as each of them adds at least 500ms if there is at
>>> least
>>> > one unsent record, according to FlinkKinesisProducer.snapshotState).
>>> >
>>> > Can you share checkpointing statistics (1.10 vs 1.11 or last "good" vs
>>> > first "bad" commits)?
>>> >
>>> > On Fri, Jul 31, 2020 at 5:29 AM Thomas Weise <[hidden email]>
>>> > wrote:
>>> >
>>> > > I ran git bisect and the first commit that shows the regression is:
>>> > >
>>> > > https://github.com/apache/flink/commit/355184d69a8519d29937725c8d85e8465d7e3a90
>>> > >
>>> > >
>>> > > On Thu, Jul 23, 2020 at 6:46 PM Kurt Young <[hidden email]> wrote:
>>> > >
>>> > > > In my experience, Java profilers are sometimes not accurate enough
>>> > > > to find the root cause of a performance regression. In this case, I
>>> > > > would suggest trying Intel VTune Amplifier to look at more detailed
>>> > > > metrics.
>>> > > >
>>> > > > Best,
>>> > > > Kurt
>>> > > >
>>> > > >
>>> > > > On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise <[hidden email]>
>>> wrote:
>>> > > >
>>> > > > > The cause of the issue is anything but clear.
>>> > > > >
>>> > > > > Previously I had mentioned that there is no suspect change to the
>>> > > Kinesis
>>> > > > > connector and that I had reverted the AWS SDK change to no
>>> effect.
>>> > > > >
>>> > > > > https://issues.apache.org/jira/browse/FLINK-17496 actually fixed
>>> > > another
>>> > > > > regression in the previous release and is present before and
>>> after.
>>> > > > >
>>> > > > > I repeated the run with 1.11.0 core and downgraded the entire
>>> Kinesis
>>> > > > > connector to 1.10.1: Nothing changes, i.e. the regression is
>>> still
>>> > > > present.
>>> > > > > Therefore we will need to look elsewhere for the root cause.
>>> > > > >
>>> > > > > Regarding the time spent in snapshotState, repeat runs reveal a
>>> wide
>>> > > > range
>>> > > > > for both versions, 1.10 and 1.11. So again this is nothing
>>> pointing
>>> > to
>>> > > a
>>> > > > > root cause.
>>> > > > >
>>> > > > > At this point, I have no ideas remaining other than doing a
>>> bisect to
>>> > > > find
>>> > > > > the culprit. Any other suggestions?
>>> > > > >
>>> > > > > Thomas
>>> > > > >
>>> > > > >
>>> > > > > On Thu, Jul 16, 2020 at 9:19 PM Zhijiang <
>>> [hidden email]
>>> > > > > .invalid>
>>> > > > > wrote:
>>> > > > >
>>> > > > > > Hi Thomas,
>>> > > > > >
>>> > > > > > Thanks for the further profiling information; glad to see we have
>>> > > > > > already narrowed down where the regression comes from.
>>> > > > > > I was also suspicious of #snapshotState in previous discussions,
>>> > > > > > since it does take a long time and blocks normal operator
>>> > > > > > processing.
>>> > > > > >
>>> > > > > > Based on your feedback below, the sleep time during #snapshotState
>>> > > > > > might be the main concern, so I also dug into the implementation of
>>> > > > > > FlinkKinesisProducer#snapshotState:
>>> > > > > > while (producer.getOutstandingRecordsCount() > 0) {
>>> > > > > >    producer.flush();
>>> > > > > >    try {
>>> > > > > >       Thread.sleep(500);
>>> > > > > >    } catch (InterruptedException e) {
>>> > > > > >       LOG.warn("Flushing was interrupted.");
>>> > > > > >       break;
>>> > > > > >    }
>>> > > > > > }
>>> > > > > > It seems that the sleep time is mainly determined by the internal
>>> > > > > > operations of the KinesisProducer implementation provided by
>>> > > > > > amazonaws, which I am not very familiar with.
>>> > > > > > But I noticed two related upgrades in release-1.11.0: one upgrades
>>> > > > > > amazon-kinesis-producer to 0.14.0 [1] and the other upgrades
>>> > > > > > aws-sdk-version to 1.11.754 [2].
>>> > > > > > You mentioned that you already reverted the SDK upgrade and saw no
>>> > > > > > change. Did you also revert [1] to verify?
>>> > > > > > [1] https://issues.apache.org/jira/browse/FLINK-17496
>>> > > > > > [2] https://issues.apache.org/jira/browse/FLINK-14881
>>> > > > > >
>>> > > > > > Best,
>>> > > > > > Zhijiang
>>> > > > > >
>>> ------------------------------------------------------------------
>>> > > > > > From:Thomas Weise <[hidden email]>
>>> > > > > > Send Time:2020年7月17日(星期五) 05:29
>>> > > > > > To:dev <[hidden email]>
>>> > > > > > Cc:Zhijiang <[hidden email]>; Stephan Ewen <
>>> > > > [hidden email]
>>> > > > > >;
>>> > > > > > Arvid Heise <[hidden email]>; Aljoscha Krettek <
>>> > > > [hidden email]
>>> > > > > >
>>> > > > > > Subject:Re: Kinesis Performance Issue (was [VOTE] Release
>>> 1.11.0,
>>> > > > release
>>> > > > > > candidate #4)
>>> > > > > >
>>> > > > > > Sorry for the delay.
>>> > > > > >
>>> > > > > > I confirmed that the regression is due to the sink
>>> (unsurprising,
>>> > > since
>>> > > > > > another job with the same consumer, but not the producer, runs
>>> as
>>> > > > > > expected).
>>> > > > > >
>>> > > > > > As promised I did CPU profiling on the problematic application,
>>> > which
>>> > > > > gives
>>> > > > > > more insight into the regression [1]
>>> > > > > >
>>> > > > > > The screenshots show that the average time for snapshotState
>>> > > increases
>>> > > > > from
>>> > > > > > ~9s to ~28s. The data also shows the increase in sleep time
>>> during
>>> > > > > > snapshotState.
>>> > > > > >
>>> > > > > > Does anyone, based on changes made in 1.11, have a theory why?
>>> > > > > >
>>> > > > > > I had previously looked at the changes to the Kinesis
>>> connector and
>>> > > > also
>>> > > > > > reverted the SDK upgrade, which did not change the situation.
>>> > > > > >
>>> > > > > > It will likely be necessary to drill into the sink /
>>> checkpointing
>>> > > > > details
>>> > > > > > to understand the cause of the problem.
>>> > > > > >
>>> > > > > > Let me know if anyone has specific questions that I can answer
>>> from
>>> > > the
>>> > > > > > profiling results.
>>> > > > > >
>>> > > > > > Thomas
>>> > > > > >
>>> > > > > > [1]
>>> > > > > > https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing
>>> > > > > >
>>> > > > > > On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <[hidden email]>
>>> > > wrote:
>>> > > > > >
>>> > > > > > > + dev@ for visibility
>>> > > > > > >
>>> > > > > > > I will investigate further today.
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <
>>> > > [hidden email]
>>> > > > >
>>> > > > > > > wrote:
>>> > > > > > >
>>> > > > > > >> On 06.07.20 20:39, Stephan Ewen wrote:
>>> > > > > > >> >    - Did sink checkpoint notifications change in a
>>> relevant
>>> > way,
>>> > > > for
>>> > > > > > >> example
>>> > > > > > >> > due to some Kafka issues we addressed in 1.11 (@Aljoscha
>>> > maybe?)
>>> > > > > > >>
>>> > > > > > >> I think that's unrelated: the Kafka fixes were isolated in
>>> Kafka
>>> > > and
>>> > > > > the
>>> > > > > > >> one bug I discovered on the way was about the Task reaper.
>>> > > > > > >>
>>> > > > > > >>
>>> > > > > > >> On 07.07.20 17:51, Zhijiang wrote:
>>> > > > > > >> > Sorry for misunderstanding the previous information, Thomas. I
>>> > > > > > >> > was assuming that the sync checkpoint duration increased after
>>> > > > > > >> > the upgrade, as was mentioned before.
>>> > > > > > >> >
>>> > > > > > >> > If I remember correctly, the memory state backend also has the
>>> > > > > > >> > same issue? If so, we can rule out the RocksDB state changes.
>>> > > > > > >> > As slot sharing is enabled, the downstream and upstream tasks
>>> > > > > > >> > are probably deployed into the same slot, so there is no
>>> > > > > > >> > network shuffle effect.
>>> > > > > > >> >
>>> > > > > > >> > I think we need to find out whether any other symptoms changed
>>> > > > > > >> > besides the performance regression, to narrow down the scope,
>>> > > > > > >> > e.g. any metrics changes, or changes in the number of
>>> > > > > > >> > TaskManagers and the number of slots per TaskManager in the
>>> > > > > > >> > deployment.
>>> > > > > > >> > A 40% regression is really big; I guess the changes should also
>>> > > > > > >> > be reflected in other places.
>>> > > > > > >> >
>>> > > > > > >> > I am not sure whether we can reproduce the regression in our AWS
>>> > > > > > >> > environment by writing arbitrary Kinesis jobs, since, as Thomas
>>> > > > > > >> > mentioned, other Kinesis jobs run normally after the upgrade.
>>> > > > > > >> > So this probably hits some corner case. I am very willing to
>>> > > > > > >> > help with debugging if possible.
>>> > > > > > >> >
>>> > > > > > >> >
>>> > > > > > >> > Best,
>>> > > > > > >> > Zhijiang
>>> > > > > > >> >
>>> > > > > > >> >
>>> > > > > > >> >
>>> > > ------------------------------------------------------------------
>>> > > > > > >> > From:Thomas Weise <[hidden email]>
>>> > > > > > >> > Send Time:2020年7月7日(星期二) 23:01
>>> > > > > > >> > To:Stephan Ewen <[hidden email]>
>>> > > > > > >> > Cc:Aljoscha Krettek <[hidden email]>; Arvid Heise <
>>> > > > > > >> [hidden email]>; Zhijiang <[hidden email]>
>>> > > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release
>>> > > 1.11.0,
>>> > > > > > >> release candidate #4)
>>> > > > > > >> >
>>> > > > > > >> > We are deploying our apps with FlinkK8sOperator. We have
>>> one
>>> > job
>>> > > > > that
>>> > > > > > >> works as expected after the upgrade and the one discussed
>>> here
>>> > > that
>>> > > > > has
>>> > > > > > the
>>> > > > > > >> performance regression.
>>> > > > > > >> >
>>> > > > > > >> > "The performance regression is obviously caused by the long
>>> > > > > > >> > duration of the sync checkpoint process in the Kinesis sink
>>> > > > > > >> > operator, which blocks normal data processing until the
>>> > > > > > >> > source is backpressured."
>>> > > > > > >> >
>>> > > > > > >> > That's a constant. The sync checkpointing time is the same
>>> > > > > > >> > before (1.10) and after the upgrade. The question is what
>>> > > > > > >> > change came in with the upgrade.
>>> > > > > > >> >
>>> > > > > > >> >
>>> > > > > > >> >
>>> > > > > > >> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <[hidden email]> wrote:
>>> > > > > > >> >
>>> > > > > > >> > @Thomas Just one thing real quick: Are you using the standalone
>>> > > > > > >> > setup scripts (like start-cluster.sh, and the former "slaves"
>>> > > > > > >> > file)?
>>> > > > > > >> > Be aware that this file is now called "workers" to avoid
>>> > > > > > >> > sensitive names.
>>> > > > > > >> > In one internal benchmark we saw quite a lot of slowdown
>>> > > > > > >> > initially, before seeing that the cluster was not a distributed
>>> > > > > > >> > cluster any more ;-)
>>> > > > > > >> >
>>> > > > > > >> >
>>> > > > > > >> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <[hidden email]> wrote:
>>> > > > > > >> > Thanks for kicking this off and helping with the analysis, Stephan!
>>> > > > > > >> > Thanks for the further feedback and investigation, Thomas!
>>> > > > > > >> >
>>> > > > > > >> > The performance regression is obviously caused by the long
>>> > > > > > >> > duration of the sync checkpoint process in the Kinesis sink
>>> > > > > > >> > operator, which would block normal data processing until it
>>> > > > > > >> > back-pressures the source.
>>> > > > > > >> > Maybe we could dig into the sync part of the checkpoint, e.g.
>>> > > > > > >> > break down the steps inside the respective operator#snapshotState
>>> > > > > > >> > to measure which operation costs most of the time; then we can
>>> > > > > > >> > probably find the root cause of that cost.
>>> > > > > > >> >
>>> > > > > > >> > Look forward to the further progress. :)
>>> > > > > > >> >
>>> > > > > > >> > Best,
>>> > > > > > >> > Zhijiang
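The suggestion above, breaking the sync part of snapshotState into steps and measuring each one, could be sketched roughly as follows. This is only an illustration under stated assumptions: `StepTimer` is a hypothetical helper built on the JDK alone, not a Flink API, and the step names in the demo are placeholders rather than the actual phases of the Kinesis sink.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Flink): accumulates wall-clock time per named step. */
final class StepTimer {
    private final Map<String, Long> nanosByStep = new LinkedHashMap<>();

    /** Runs the step and adds its elapsed time to the total kept under its name. */
    void time(String step, Runnable body) {
        long start = System.nanoTime();
        try {
            body.run();
        } finally {
            nanosByStep.merge(step, System.nanoTime() - start, Long::sum);
        }
    }

    /** Returns the step with the largest accumulated time, or null if nothing was timed. */
    String slowestStep() {
        String slowest = null;
        long max = -1L;
        for (Map.Entry<String, Long> e : nanosByStep.entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                slowest = e.getKey();
            }
        }
        return slowest;
    }

    /** Read-only view of the accumulated nanoseconds per step, in first-seen order. */
    Map<String, Long> totals() {
        return Collections.unmodifiableMap(nanosByStep);
    }
}

class StepTimerDemo {
    /** Stand-in for real work; sleeps for the given number of milliseconds. */
    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        StepTimer timer = new StepTimer();
        // Placeholder step names; in a real operator each lambda would wrap
        // one phase of the snapshotState implementation under investigation.
        timer.time("checkPendingErrors", () -> sleep(2));
        timer.time("flushOutstandingRecords", () -> sleep(50));
        timer.time("writeOperatorState", () -> sleep(5));
        timer.totals().forEach((step, nanos) ->
                System.out.println(step + ": " + nanos / 1_000_000 + " ms"));
        System.out.println("slowest step: " + timer.slowestStep());
    }
}
```

Wrapping each phase in `time(...)` for a few checkpoints on both 1.10 and 1.11 and comparing the totals would show whether any single phase of the sync part differs between versions. A profiler gives the same information with less code change.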
>>> > > > > > >> >
>>> > > > > > >> >
>>> > > ------------------------------------------------------------------
>>> > > > > > >> > From:Stephan Ewen <[hidden email]>
>>> > > > > > >> > Send Time: July 7, 2020 (Tuesday) 14:52
>>> > > > > > >> > To:Thomas Weise <[hidden email]>
>>> > > > > > >> > Cc:Stephan Ewen <[hidden email]>; Zhijiang <
>>> > > > > > >> [hidden email]>; Aljoscha Krettek <
>>> > > [hidden email]
>>> > > > >;
>>> > > > > > >> Arvid Heise <[hidden email]>
>>> > > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release
>>> > > 1.11.0,
>>> > > > > > >> release candidate #4)
>>> > > > > > >> >
>>> > > > > > >> > Thank you for digging so deeply.
>>> > > > > > >> > Mysterious thing, this regression.
>>> > > > > > >> >
>>> > > > > > >> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <[hidden email]> wrote:
>>> > > > > > >> > @Stephan: yes, I refer to sync time in the web UI (it is
>>> > > > > > >> > unchanged between 1.10 and 1.11 for the specific pipeline).
>>> > > > > > >> >
>>> > > > > > >> > I verified that increasing the checkpointing interval does not
>>> > > > > > >> > make a difference.
>>> > > > > > >> >
>>> > > > > > >> > I looked at the Kinesis connector changes since 1.10.1 and don't
>>> > > > > > >> > see anything that could cause this.
>>> > > > > > >> >
>>> > > > > > >> > Another pipeline that is using the Kinesis consumer (but not the
>>> > > > > > >> > producer) performs as expected.
>>> > > > > > >> >
>>> > > > > > >> > I tried reverting the AWS SDK version change, symptoms remain
>>> > > > > > >> > unchanged:
>>> > > > > > >> >
>>> > > > > > >> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml b/flink-connectors/flink-connector-kinesis/pom.xml
>>> > > > > > >> > index a6abce23ba..741743a05e 100644
>>> > > > > > >> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
>>> > > > > > >> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
>>> > > > > > >> > @@ -33,7 +33,7 @@ under the License.
>>> > > > > > >> >          <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
>>> > > > > > >> >          <name>flink-connector-kinesis</name>
>>> > > > > > >> >          <properties>
>>> > > > > > >> > -                <aws.sdk.version>1.11.754</aws.sdk.version>
>>> > > > > > >> > +                <aws.sdk.version>1.11.603</aws.sdk.version>
>>> > > > > > >> >                  <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
>>> > > > > > >> >                  <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
>>> > > > > > >> >                  <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
>>> > > > > > >> >
>>> > > > > > >> > I'm planning to take a look with a profiler next.
>>> > > > > > >> >
>>> > > > > > >> > Thomas
>>> > > > > > >> >
>>> > > > > > >> >
>>> > > > > > >> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <
>>> > [hidden email]>
>>> > > > > > wrote:
>>> > > > > > >> > Hi all!
>>> > > > > > >> >
>>> > > > > > >> > Forking this thread out of the release vote thread.
>>> > > > > > >> > From what Thomas describes, it really sounds like a
>>> > > > > > >> > sink-specific issue.
>>> > > > > > >> >
>>> > > > > > >> > @Thomas: When you say the sink has a long synchronous checkpoint
>>> > > > > > >> > time, you mean the time that is shown as "sync time" on the
>>> > > > > > >> > metrics and web UI? That does not include any network buffer
>>> > > > > > >> > related operations. It is purely the operator's time.
>>> > > > > > >> >
>>> > > > > > >> > Can we dig into the changes we did in sinks:
>>> > > > > > >> >    - Kinesis version upgrade, AWS library updates
>>> > > > > > >> >
>>> > > > > > >> >    - Could it be that some call (checkpoint complete) that was
>>> > > > > > >> >      previously (1.10) in a separate thread is now in the mailbox
>>> > > > > > >> >      and this simply reduces the number of threads that do the
>>> > > > > > >> >      work?
>>> > > > > > >> >
>>> > > > > > >> >    - Did sink checkpoint notifications change in a relevant way,
>>> > > > > > >> >      for example due to some Kafka issues we addressed in 1.11
>>> > > > > > >> >      (@Aljoscha maybe?)
>>> > > > > > >> >
>>> > > > > > >> > Best,
>>> > > > > > >> > Stephan
>>> > > > > > >> >
>>> > > > > > >> >
>>> > > > > > >> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <
>>> > > > [hidden email]
>>> > > > > > .invalid>
>>> > > > > > >> wrote:
>>> > > > > > >> > Hi Thomas,
>>> > > > > > >> >
>>> > > > > > >> >   Regarding [2], there is more detailed information in the Jira
>>> > > > > > >> >   description (https://issues.apache.org/jira/browse/FLINK-16404).
>>> > > > > > >> >
>>> > > > > > >> >   I can also give some basic explanations here to dismiss the
>>> > > > > > >> >   concern.
>>> > > > > > >> >   1. In the past, the buffers following the barrier were cached
>>> > > > > > >> >      on the downstream side before alignment.
>>> > > > > > >> >   2. In 1.11, the upstream does not send the buffers after the
>>> > > > > > >> >      barrier. When the downstream finishes the alignment, it
>>> > > > > > >> >      notifies the upstream to continue sending the following
>>> > > > > > >> >      buffers, since it can process them after alignment.
>>> > > > > > >> >   3. The only difference is that the temporarily blocked buffers
>>> > > > > > >> >      are cached either on the downstream side or on the upstream
>>> > > > > > >> >      side before alignment.
>>> > > > > > >> >   4. The side effect is the additional notification cost for
>>> > > > > > >> >      every barrier alignment. If the downstream and upstream are
>>> > > > > > >> >      deployed in separate TaskManagers, the cost is the network
>>> > > > > > >> >      transport delay (the effect can be ignored based on our
>>> > > > > > >> >      testing with a 1s checkpoint interval). With slot sharing,
>>> > > > > > >> >      as in your case, the cost is only one method call in the
>>> > > > > > >> >      processor, which can also be ignored.
>>> > > > > > >> >
>>> > > > > > >> >   You mentioned "In this case, the downstream task has a high
>>> > > > > > >> >   average checkpoint duration (~30s, sync part)." This duration
>>> > > > > > >> >   does not reflect the changes above; it only indicates the
>>> > > > > > >> >   duration of calling `Operator.snapshotState`.
>>> > > > > > >> >   If this duration is beyond your expectation, you can check or
>>> > > > > > >> >   debug whether the source/sink operations might take more time
>>> > > > > > >> >   to finish `snapshotState` in practice. E.g. you can make the
>>> > > > > > >> >   implementation of this method empty to further verify the
>>> > > > > > >> >   effect.
>>> > > > > > >> >
>>> > > > > > >> >   Best,
>>> > > > > > >> >   Zhijiang
>>> > > > > > >> >
>>> > > > > > >> >
>>> > > > > > >> >
>>> > > >  ------------------------------------------------------------------
>>> > > > > > >> >   From:Thomas Weise <[hidden email]>
>>> > > > > > >> >   Send Time: July 5, 2020 (Sunday) 12:22
>>> > > > > > >> >   To:dev <[hidden email]>; Zhijiang <
>>> > > > > [hidden email]
>>> > > > > > >
>>> > > > > > >> >   Cc:Yingjie Cao <[hidden email]>
>>> > > > > > >> >   Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>>> > > > > > >> >
>>> > > > > > >> >   Hi Zhijiang,
>>> > > > > > >> >
>>> > > > > > >> >   Could you please point me to more details regarding: "[2]:
>>> > > > > > >> >   Delay sending the following buffers after checkpoint barrier
>>> > > > > > >> >   on upstream side until barrier alignment on downstream side."
>>> > > > > > >> >
>>> > > > > > >> >   In this case, the downstream task has a high average
>>> > > > > > >> >   checkpoint duration (~30s, sync part). If there was a change
>>> > > > > > >> >   to hold buffers depending on downstream performance, could
>>> > > > > > >> >   this possibly apply to this case (even when there is no
>>> > > > > > >> >   shuffle that would require alignment)?
>>> > > > > > >> >
>>> > > > > > >> >   Thanks,
>>> > > > > > >> >   Thomas
>>> > > > > > >> >
>>> > > > > > >> >
>>> > > > > > >> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <
>>> > > > > [hidden email]
>>> > > > > > >> .invalid>
>>> > > > > > >> >   wrote:
>>> > > > > > >> >
>>> > > > > > >> >   > Hi Thomas,
>>> > > > > > >> >   >
>>> > > > > > >> >   > Thanks for the further update information.
>>> > > > > > >> >   >
>>> > > > > > >> >   > I guess we can dismiss the network stack changes, since in
>>> > > > > > >> >   > your case the downstream and upstream would probably be
>>> > > > > > >> >   > deployed in the same slot, bypassing the network data shuffle.
>>> > > > > > >> >   > Also, I guess release-1.11 will not bring a general
>>> > > > > > >> >   > performance regression in the runtime engine, as we also did
>>> > > > > > >> >   > performance testing for all general cases by [1] in a real
>>> > > > > > >> >   > cluster before, and the testing results should fit the
>>> > > > > > >> >   > expectation. But we indeed did not test the specific source
>>> > > > > > >> >   > and sink connectors yet, as far as I know.
>>> > > > > > >> >   >
>>> > > > > > >> >   > Regarding your 40% performance regression, I wonder whether
>>> > > > > > >> >   > it is related to specific source/sink changes (e.g. Kinesis)
>>> > > > > > >> >   > or to environment issues in a corner case.
>>> > > > > > >> >   > If possible, it would be helpful to further locate whether
>>> > > > > > >> >   > the regression is caused by Kinesis, by replacing the Kinesis
>>> > > > > > >> >   > source & sink and keeping everything else the same.
>>> > > > > > >> >   >
>>> > > > > > >> >   > As you said, it would be efficient to contact you directly
>>> > > > > > >> >   > next week to further discuss this issue. And we are
>>> > > > > > >> >   > willing/eager to provide any help to resolve this issue soon.
>>> > > > > > >> >   >
>>> > > > > > >> >   > Besides that, I guess this issue should not be a blocker for
>>> > > > > > >> >   > the release, since it is probably a corner case based on the
>>> > > > > > >> >   > current analysis.
>>> > > > > > >> >   > If we really conclude that anything needs to be resolved
>>> > > > > > >> >   > after the final release, then we can also make the next minor
>>> > > > > > >> >   > release-1.11.1 come soon.
>>> > > > > > >> >   >
>>> > > > > > >> >   > [1] https://issues.apache.org/jira/browse/FLINK-18433
>>> > > > > > >> >   >
>>> > > > > > >> >   > Best,
>>> > > > > > >> >   > Zhijiang
>>> > > > > > >> >   >
>>> > > > > > >> >   >
>>> > > > > > >> >   >
>>> > > > >
>>> ------------------------------------------------------------------
>>> > > > > > >> >   > From:Thomas Weise <[hidden email]>
>>> > > > > > >> >   > Send Time: July 4, 2020 (Saturday) 12:26
>>> > > > > > >> >   > To:dev <[hidden email]>; Zhijiang <
>>> > > > > > [hidden email]
>>> > > > > > >> >
>>> > > > > > >> >   > Cc:Yingjie Cao <[hidden email]>
>>> > > > > > >> >   > Subject:Re: [VOTE] Release 1.11.0, release candidate
>>> #4
>>> > > > > > >> >   >
>>> > > > > > >> >   > Hi Zhijiang,
>>> > > > > > >> >   >
>>> > > > > > >> >   > It will probably be best if we connect next week and discuss
>>> > > > > > >> >   > the issue directly since this could be quite difficult to
>>> > > > > > >> >   > reproduce.
>>> > > > > > >> >   >
>>> > > > > > >> >   > Before the testing result on our side comes out for your
>>> > > > > > >> >   > respective job case, I have some other questions to confirm
>>> > > > > > >> >   > for further analysis:
>>> > > > > > >> >   >     -  How large a regression (in percent) did you find after
>>> > > > > > >> >   > switching to 1.11?
>>> > > > > > >> >   >
>>> > > > > > >> >   > ~40% throughput decline
>>> > > > > > >> >   >
>>> > > > > > >> >   >     -  Is there any network bottleneck in your cluster? E.g.
>>> > > > > > >> >   > is the network bandwidth saturated by other jobs? If so, [2]
>>> > > > > > >> >   > above might have a larger effect.
>>> > > > > > >> >   >
>>> > > > > > >> >   > The test runs on a k8s cluster that is also used for other
>>> > > > > > >> >   > production jobs. There is no reason to believe network is the
>>> > > > > > >> >   > bottleneck.
>>> > > > > > >> >   >
>>> > > > > > >> >   >     -  Did you adjust the default network buffer settings?
>>> > > > > > >> >   > E.g. "taskmanager.network.memory.floating-buffers-per-gate" or
>>> > > > > > >> >   > "taskmanager.network.memory.buffers-per-channel"
>>> > > > > > >> >   >
>>> > > > > > >> >   > The job is using the defaults, i.e. we don't configure the
>>> > > > > > >> >   > settings. If you want me to try specific settings in the hope
>>> > > > > > >> >   > that it will help to isolate the issue, please let me know.
>>> > > > > > >> >   >
>>> > > > > > >> >   >     -  I guess the topology has three vertices
>>> > > > > > >> >   > "KinesisConsumer -> Chained FlatMap -> KinesisProducer", and
>>> > > > > > >> >   > the partition mode for "KinesisConsumer -> FlatMap" and
>>> > > > > > >> >   > "FlatMap->KinesisProducer" is "forward" in both cases? If so,
>>> > > > > > >> >   > the edge connection is one-to-one, not all-to-all, and the
>>> > > > > > >> >   > above [1][2] should have no effect in theory with the default
>>> > > > > > >> >   > network buffer settings.
>>> > > > > > >> >   >
>>> > > > > > >> >   > There are only 2 vertices and the edge is "forward".
>>> > > > > > >> >   >
>>> > > > > > >> >   >     - With slot sharing, I guess the parallel tasks of these
>>> > > > > > >> >   > three vertices would probably be deployed into the same slot,
>>> > > > > > >> >   > so the data shuffle goes through a memory queue, not the
>>> > > > > > >> >   > network stack. If so, the above [2] should have no effect.
>>> > > > > > >> >   >
>>> > > > > > >> >   > Yes, vertices share slots.
>>> > > > > > >> >   >
>>> > > > > > >> >   >     - I also saw some Jira changes for Kinesis in this
>>> > > > > > >> >   > release; could you confirm that these changes would not affect
>>> > > > > > >> >   > the performance?
>>> > > > > > >> >   >
>>> > > > > > >> >   > I will need to take a look. 1.10 already had a regression
>>> > > > > > >> >   > introduced by the Kinesis producer update.
>>> > > > > > >> >   >
>>> > > > > > >> >   >
>>> > > > > > >> >   > Thanks,
>>> > > > > > >> >   > Thomas
>>> > > > > > >> >   >
>>> > > > > > >> >   >
>>> > > > > > >> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <
>>> > > > > > >> [hidden email]
>>> > > > > > >> >   > .invalid>
>>> > > > > > >> >   > wrote:
>>> > > > > > >> >   >
>>> > > > > > >> >   > > Hi Thomas,
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > Thanks for your reply with rich information!
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > We are trying to reproduce your case in our cluster to
>>> > > > > > >> >   > > further verify it, and @Yingjie Cao is working on it now.
>>> > > > > > >> >   > > As we do not have a Kinesis consumer and producer
>>> > > > > > >> >   > > internally, we will construct a common source and sink
>>> > > > > > >> >   > > instead for the backpressure case.
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > Firstly, we can dismiss the RocksDB factor in this release,
>>> > > > > > >> >   > > since you also mentioned that "filesystem leads to same
>>> > > > > > >> >   > > symptoms".
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > Secondly, if my understanding is right, you emphasize that
>>> > > > > > >> >   > > the regression only exists for jobs with a low checkpoint
>>> > > > > > >> >   > > interval (10s).
>>> > > > > > >> >   > > Based on that, I have two suspicions regarding the
>>> > > > > > >> >   > > network-related changes in this release:
>>> > > > > > >> >   > >     - [1]: Limited the maximum backlog value (default 10) in
>>> > > > > > >> >   > > the subpartition queue.
>>> > > > > > >> >   > >     - [2]: Delay sending the following buffers after the
>>> > > > > > >> >   > > checkpoint barrier on the upstream side until barrier
>>> > > > > > >> >   > > alignment on the downstream side.
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > These changes are motivated by reducing the in-flight
>>> > > > > > >> >   > > buffers to speed up checkpoints, especially in the case of
>>> > > > > > >> >   > > backpressure.
>>> > > > > > >> >   > > In theory they should have a very minor performance effect,
>>> > > > > > >> >   > > and we also tested them in a cluster to verify that they were
>>> > > > > > >> >   > > within expectation before merging them, but maybe there are
>>> > > > > > >> >   > > other corner cases we have not thought of before.
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > Before the testing result on our side comes out for your
>>> > > > > > >> >   > > respective job case, I have some other questions to confirm
>>> > > > > > >> >   > > for further analysis:
>>> > > > > > >> >   > >     -  How large a regression (in percent) did you find
>>> > > > > > >> >   > > after switching to 1.11?
>>> > > > > > >> >   > >     -  Is there any network bottleneck in your cluster? E.g.
>>> > > > > > >> >   > > is the network bandwidth saturated by other jobs? If so, [2]
>>> > > > > > >> >   > > above might have a larger effect.
>>> > > > > > >> >   > >     -  Did you adjust the default network buffer settings?
>>> > > > > > >> >   > > E.g. "taskmanager.network.memory.floating-buffers-per-gate"
>>> > > > > > >> >   > > or "taskmanager.network.memory.buffers-per-channel"
>>> > > > > > >> >   > >     -  I guess the topology has three vertices
>>> > > > > > >> >   > > "KinesisConsumer -> Chained FlatMap -> KinesisProducer", and
>>> > > > > > >> >   > > the partition mode for "KinesisConsumer -> FlatMap" and
>>> > > > > > >> >   > > "FlatMap->KinesisProducer" is "forward" in both cases? If so,
>>> > > > > > >> >   > > the edge connection is one-to-one, not all-to-all, and the
>>> > > > > > >> >   > > above [1][2] should have no effect in theory with the default
>>> > > > > > >> >   > > network buffer settings.
>>> > > > > > >> >   > >     - With slot sharing, I guess the parallel tasks of these
>>> > > > > > >> >   > > three vertices would probably be deployed into the same slot,
>>> > > > > > >> >   > > so the data shuffle goes through a memory queue, not the
>>> > > > > > >> >   > > network stack. If so, the above [2] should have no effect.
>>> > > > > > >> >   > >     - I also saw some Jira changes for Kinesis in this
>>> > > > > > >> >   > > release; could you confirm that these changes would not
>>> > > > > > >> >   > > affect the performance?
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > Best,
>>> > > > > > >> >   > > Zhijiang
>>> > > > > > >> >   > >
>>> > > > > > >> >   > >
>>> > > > > > >> >   > >
>>> > > > > >
>>> ------------------------------------------------------------------
>>> > > > > > >> >   > > From:Thomas Weise <[hidden email]>
>>> > > > > > >> >   > > Send Time: July 3, 2020 (Friday) 01:07
>>> > > > > > >> >   > > To:dev <[hidden email]>; Zhijiang <
>>> > > > > > >> [hidden email]>
>>> > > > > > >> >   > > Subject:Re: [VOTE] Release 1.11.0, release
>>> candidate #4
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > Hi Zhijiang,
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > The performance degradation manifests in backpressure which
>>> > > > > > >> >   > > leads to growing backlog in the source. I switched a few
>>> > > > > > >> >   > > times between 1.10 and 1.11 and the behavior is consistent.
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > The DAG is:
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map) -------- forward ---------> KinesisProducer
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > Parallelism: 160
>>> > > > > > >> >   > > No shuffle/rebalance.
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > Checkpointing config:
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > Checkpointing Mode Exactly Once
>>> > > > > > >> >   > > Interval 10s
>>> > > > > > >> >   > > Timeout 10m 0s
>>> > > > > > >> >   > > Minimum Pause Between Checkpoints 10s
>>> > > > > > >> >   > > Maximum Concurrent Checkpoints 1
>>> > > > > > >> >   > > Persist Checkpoints Externally Enabled (delete on cancellation)
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > State backend: rocksdb (filesystem leads to same symptoms)
>>> > > > > > >> >   > > Checkpoint size is tiny (500KB)
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > An interesting difference to another job that I had upgraded
>>> > > > > > >> >   > > successfully is the low checkpointing interval.
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > Thanks,
>>> > > > > > >> >   > > Thomas
>>> > > > > > >> >   > >
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <
>>> > > > > > >> [hidden email]
>>> > > > > > >> >   > > .invalid>
>>> > > > > > >> >   > > wrote:
>>> > > > > > >> >   > >
>>> > > > > > >> >   > > > Hi Thomas,
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > > > Thanks for the efficient feedback.
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > > > Regarding the suggestion of adding the release notes
>>> > > > > > >> >   > > > document, I agree with your point. Maybe we should adjust
>>> > > > > > >> >   > > > the vote template accordingly in the respective wiki to
>>> > > > > > >> >   > > > guide the following release processes.
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > > > Regarding the performance regression, could you provide
>>> > > > > > >> >   > > > some more details so that we can better measure or
>>> > > > > > >> >   > > > reproduce it on our side?
>>> > > > > > >> >   > > > E.g. I guess the topology only includes two vertices,
>>> > > > > > >> >   > > > source and sink?
>>> > > > > > >> >   > > > What is the parallelism for every vertex?
>>> > > > > > >> >   > > > Does the upstream shuffle data to the downstream via the
>>> > > > > > >> >   > > > rebalance partitioner or another one?
>>> > > > > > >> >   > > > Is the checkpoint mode exactly-once with the RocksDB state
>>> > > > > > >> >   > > > backend?
>>> > > > > > >> >   > > > Did backpressure happen in this case?
>>> > > > > > >> >   > > > How large is the regression in percent?
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > > > Best,
>>> > > > > > >> >   > > > Zhijiang
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > > >
>>> > > > > > >>
>>> > ------------------------------------------------------------------
>>> > > > > > >> >   > > > From:Thomas Weise <[hidden email]>
>>> > > > > > >> >   > > > Send Time: July 2, 2020 (Thursday) 09:54
>>> > > > > > >> >   > > > To:dev <[hidden email]>
>>> > > > > > >> >   > > > Subject:Re: [VOTE] Release 1.11.0, release
>>> candidate
>>> > #4
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > > > Hi Till,
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > > > Yes, we don't have the setting in flink-conf.yaml.
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > > > Generally, we carry forward the existing configuration and
>>> > > > > > >> >   > > > any change to default configuration values would impact the
>>> > > > > > >> >   > > > upgrade.
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > > > Yes, since it is an incompatible change I would state it in
>>> > > > > > >> >   > > > the release notes.
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > > > Thanks,
>>> > > > > > >> >   > > > Thomas
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > > > BTW I found a performance regression while trying to
>>> > > > > > >> >   > > > upgrade another pipeline with this RC. It is a simple
>>> > > > > > >> >   > > > Kinesis to Kinesis job. Wasn't able to pin it down yet,
>>> > > > > > >> >   > > > symptoms include increased checkpoint alignment time.
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <
>>> > > > > > >> [hidden email]>
>>> > > > > > >> >   > > > wrote:
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > > > > Hi Thomas,
>>> > > > > > >> >   > > > >
>>> > > > > > >> >   > > > > just to confirm: When starting the image in local mode,
>>> > > > > > >> >   > > > > then you don't have any of the JobManager memory
>>> > > > > > >> >   > > > > configuration settings configured in the effective
>>> > > > > > >> >   > > > > flink-conf.yaml, right? Does this mean that you have
>>> > > > > > >> >   > > > > explicitly removed `jobmanager.heap.size: 1024m` from the
>>> > > > > > >> >   > > > > default configuration? If this is the case, then I believe
>>> > > > > > >> >   > > > > it was more of an unintentional artifact that it worked
>>> > > > > > >> >   > > > > before and it has been corrected now so that one needs to
>>> > > > > > >> >   > > > > specify the memory of the JM process explicitly. Do you
>>> > > > > > >> >   > > > > think it would help to explicitly state this in the
>>> > > > > > >> >   > > > > release notes?
>>> > > > > > >> >   > > > >
>>> > > > > > >> >   > > > > Cheers,
>>> > > > > > >> >   > > > > Till
>>> > > > > > >> >   > > > >
>>> > > > > > >> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <
>>> > > > > [hidden email]
>>> > > > > > >
>>> > > > > > >> wrote:
>>> > > > > > >> >   > > > >
>>> > > > > > >> >   > > > > > Thanks for preparing another RC!
>>> > > > > > >> >   > > > > >
>>> > > > > > >> >   > > > > > As mentioned in the previous RC thread, it would be super
>>> > > > > > >> >   > > > > > helpful if the release notes that are part of the
>>> > > > > > >> >   > > > > > documentation can be included [1]. It's a significant
>>> > > > > > >> >   > > > > > time-saver to have read those first.
>>> > > > > > >> >   > > > > >
>>> > > > > > >> >   > > > > > I found one more non-backward compatible change that
>>> > > > > > >> >   > > > > > would be worth addressing/mentioning:
>>> > > > > > >> >   > > > > >
>>> > > > > > >> >   > > > > > It is now necessary to configure the jobmanager heap size
>>> > > > > > >> >   > > > > > in flink-conf.yaml (with either jobmanager.heap.size or
>>> > > > > > >> >   > > > > > jobmanager.memory.heap.size). Why would I not want to do
>>> > > > > > >> >   > > > > > that anyways? Well, we set it dynamically for a cluster
>>> > > > > > >> >   > > > > > deployment via the flinkk8soperator, but the container
>>> > > > > > >> >   > > > > > image can also be used for testing with local mode
>>> > > > > > >> >   > > > > > (./bin/jobmanager.sh start-foreground local). That will
>>> > > > > > >> >   > > > > > fail if the heap wasn't configured and that's how I
>>> > > > > > >> >   > > > > > noticed it.
>>> > > > > > >> >   > > > > >
>>> > > > > > >> >   > > > > > Thanks,
>>> > > > > > >> >   > > > > > Thomas
>>> > > > > > >> >   > > > > >
>>> > > > > > >> >   > > > > > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
>>> > > > > > >> >   > > > > >
>>> > > > > > >> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <
>>> > > > > > >> >   > [hidden email]
>>> > > > > > >> >   > > > > > .invalid>
>>> > > > > > >> >   > > > > > wrote:
>>> > > > > > >> >   > > > > >
>>> > > > > > >> >   > > > > > > Hi everyone,
>>> > > > > > >> >   > > > > > >
>>> > > > > > >> >   > > > > > > Please review and vote on the release
>>> candidate
>>> > #4
>>> > > > for
>>> > > > > > the
>>> > > > > > >> >   > version
>>> > > > > > >> >   > > > > > 1.11.0,
>>> > > > > > >> >   > > > > > > as follows:
>>> > > > > > >> >   > > > > > > [ ] +1, Approve the release
>>> > > > > > >> >   > > > > > > [ ] -1, Do not approve the release (please
>>> > provide
>>> > > > > > >> specific
>>> > > > > > >> >   > > comments)
>>> > > > > > >> >   > > > > > >
>>> > > > > > >> >   > > > > > > The complete staging area is available for
>>> your
>>> > > > > review,
>>> > > > > > >> which
>>> > > > > > >> >   > > > includes:
>>> > > > > > >> >   > > > > > > * JIRA release notes [1],
>>> > > > > > >> >   > > > > > > * the official Apache source release and
>>> binary
>>> > > > > > >> convenience
>>> > > > > > >> >   > > releases
>>> > > > > > >> >   > > > to
>>> > > > > > >> >   > > > > > be
>>> > > > > > >> >   > > > > > > deployed to dist.apache.org [2], which are
>>> > signed
>>> > > > > with
>>> > > > > > >> the key
>>> > > > > > >> >   > > with
>>> > > > > > >> >   > > > > > > fingerprint
>>> > > 2DA85B93244FDFA19A6244500653C0A2CEA00D0E
>>> > > > > > [3],
>>> > > > > > >> >   > > > > > > * all artifacts to be deployed to the Maven
>>> > > Central
>>> > > > > > >> Repository
>>> > > > > > >> >   > [4],
>>> > > > > > >> >   > > > > > > * source code tag "release-1.11.0-rc4" [5],
>>> > > > > > >> >   > > > > > > * website pull request listing the new
>>> release
>>> > and
>>> > > > > > adding
>>> > > > > > >> >   > > > announcement
>>> > > > > > >> >   > > > > > > blog post [6].
>>> > > > > > >> >   > > > > > >
>>> > > > > > >> >   > > > > > > The vote will be open for at least 72
>>> hours. It
>>> > is
>>> > > > > > >> adopted by
>>> > > > > > >> >   > > > majority
>>> > > > > > >> >   > > > > > > approval, with at least 3 PMC affirmative
>>> votes.
>>> > > > > > >> >   > > > > > >
>>> > > > > > >> >   > > > > > > Thanks,
>>> > > > > > >> >   > > > > > > Release Manager
>>> > > > > > >> >   > > > > > >
>>> > > > > > >> >   > > > > > > [1]
>>> > > > > > >> >   > > > > > >
>>> > > > > > >> >   > > > > >
>>> > > > > > >> >   > > > >
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > >
>>> > > > > > >> >   >
>>> > > > > > >>
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
>>> > > > > > >> >   > > > > > > [2]
>>> > > > > > >> >   >
>>> > > > https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
>>> > > > > > >> >   > > > > > > [3]
>>> > > > > > https://dist.apache.org/repos/dist/release/flink/KEYS
>>> > > > > > >> >   > > > > > > [4]
>>> > > > > > >> >   > > > > > >
>>> > > > > > >> >   > > > >
>>> > > > > > >> >   > >
>>> > > > > > >>
>>> > > > >
>>> > >
>>> https://repository.apache.org/content/repositories/orgapacheflink-1377/
>>> > > > > > >> >   > > > > > > [5]
>>> > > > > > >> >   > >
>>> > > > > https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
>>> > > > > > >> >   > > > > > > [6]
>>> > https://github.com/apache/flink-web/pull/352
>>> > > > > > >> >   > > > > > >
>>> > > > > > >> >   > > > > > >
>>> > > > > > >> >   > > > > >
>>> > > > > > >> >   > > > >
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > > >
>>> > > > > > >> >   > >
>>> > > > > > >> >   > >
>>> > > > > > >> >   >
>>> > > > > > >> >   >
>>> > > > > > >> >
>>> > > > > > >> >
>>> > > > > > >> >
>>> > > > > > >>
>>> > > > > > >>
>>> > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> >
>>> > --
>>> > Regards,
>>> > Roman
>>> >
>>>
>>
>>
>> --
>> Regards,
>> Roman
>>
>

Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

Roman Khachatryan
Hi Thomas,

Thanks a lot for the detailed information.

I think the problem is in CheckpointCoordinator. It stores the last
checkpoint completion time after checking queued requests.
I've created a ticket to fix this:
https://issues.apache.org/jira/browse/FLINK-18856
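
To make the ordering problem concrete, here is a minimal sketch. It is a toy model and not the actual CheckpointCoordinator code: the class and method names are made up for illustration, and the numbers (10s min pause, a checkpoint that finishes at the 30s mark) are taken from the values discussed in this thread. It only shows why reading the last completion time before it has been updated lets a queued request start right away instead of waiting for the min pause.

```java
/** Toy model of the min-pause check; hypothetical names, not Flink's API. */
public final class MinPauseSketch {

    /** Earliest allowed start, given the completion time the scheduler currently knows. */
    public static long earliestStart(long lastCompletionMillis, long minPauseMillis) {
        return lastCompletionMillis + minPauseMillis;
    }

    /** Problematic order: the queued request is checked before the completion time is stored. */
    public static long startTimeStale(long previousCompletion, long now, long minPause) {
        long lastCompletion = previousCompletion;   // still the old value at check time
        long start = Math.max(now, earliestStart(lastCompletion, minPause));
        lastCompletion = now;                       // stored too late to affect the check
        return start;
    }

    /** Intended order: store the completion time first, then check the queued request. */
    public static long startTimeFresh(long previousCompletion, long now, long minPause) {
        long lastCompletion = now;                  // previousCompletion is intentionally ignored
        return Math.max(now, earliestStart(lastCompletion, minPause));
    }

    public static void main(String[] args) {
        long minPause = 10_000L;          // setMinPauseBetweenCheckpoints(10_000)
        long previousCompletion = 0L;     // completion time of the checkpoint before this one
        long now = 30_000L;               // the slow checkpoint completes here

        System.out.println("stale: " + startTimeStale(previousCompletion, now, minPause));
        System.out.println("fresh: " + startTimeFresh(previousCompletion, now, minPause));
    }
}
```

With these inputs the stale variant lets the next checkpoint start at 30000 ms, immediately after the previous one completes, while the fresh variant delays it to 40000 ms. That would be consistent with more checkpoints completing when a single checkpoint takes longer than the configured interval, though the real scheduling logic has more moving parts than this model.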


On Sat, Aug 8, 2020 at 5:25 AM Thomas Weise <[hidden email]> wrote:

> Just another update:
>
> The duration of snapshotState is capped by the Kinesis
> producer's "RecordTtl" setting (default 30s). The sleep time in flushSync
> does not contribute to the observed behavior.
>
> I guess the open question is why, with the same settings, is 1.11 since
> commit 355184d69a8519d29937725c8d85e8465d7e3a90 processing more checkpoints?
>
>
> On Fri, Aug 7, 2020 at 9:15 AM Thomas Weise <[hidden email]> wrote:
>
>> Hi Roman,
>>
>> Here are the checkpoint summaries for both commits:
>>
>>
>> https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit#slide=id.g86d15b2fc7_0_0
>>
>> The config:
>>
>>     CheckpointConfig checkpointConfig = env.getCheckpointConfig();
>>     checkpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
>>     checkpointConfig.setCheckpointInterval(*10_000*);
>>     checkpointConfig.setMinPauseBetweenCheckpoints(*10_000*);
>>     checkpointConfig.enableExternalizedCheckpoints(DELETE_ON_CANCELLATION);
>>     checkpointConfig.setCheckpointTimeout(600_000);
>>     checkpointConfig.setMaxConcurrentCheckpoints(1);
>>     checkpointConfig.setFailOnCheckpointingErrors(true);
>>
>> The values marked bold when changed to *60_000* make the symptom
>> disappear. I meanwhile also verified that with the 1.11.0 release commit.
>>
>> I will take a look at the sleep time issue.
>>
>> Thanks,
>> Thomas
>>
>>
>> On Fri, Aug 7, 2020 at 1:44 AM Roman Khachatryan <[hidden email]>
>> wrote:
>>
>>> Hi Thomas,
>>>
>>> Thanks for your reply!
>>>
>>> I think you are right, we can remove this sleep and improve
>>> KinesisProducer.
>>> Probably, its snapshotState can also be sped up by forcing records
>>> flush more often.
>>> Do you see that 30s checkpointing duration is caused by KinesisProducer
>>> (or maybe other operators)?
>>>
>>> I'd also like to understand the reason behind this increase in
>>> checkpoint frequency.
>>> Can you please share these values:
>>>  - execution.checkpointing.min-pause
>>>  - execution.checkpointing.max-concurrent-checkpoints
>>>  - execution.checkpointing.timeout
>>>
>>> And what is the "new" observed checkpoint frequency (or how many
>>> checkpoints are created) compared to older versions?
>>>
>>>
>>> On Fri, Aug 7, 2020 at 4:49 AM Thomas Weise <[hidden email]> wrote:
>>>
>>>> Hi Roman,
>>>>
>>>> Indeed there are more frequent checkpoints with this change! The
>>>> application was configured to checkpoint every 10s. With 1.10 ("good
>>>> commit"), that leads to fewer completed checkpoints compared to 1.11
>>>> ("bad
>>>> commit"). Just to be clear, the only difference between the two runs was
>>>> the commit 355184d69a8519d29937725c8d85e8465d7e3a90
>>>>
>>>> Since the sync part of checkpoints with the Kinesis producer always
>>>> takes
>>>> ~30 seconds, the 10s configured checkpoint frequency really had no
>>>> effect
>>>> before 1.11. I confirmed that both commits perform comparably by setting
>>>> the checkpoint frequency and min pause to 60s.
>>>>
>>>> I still have to verify with the final 1.11.0 release commit.
>>>>
>>>> It's probably good to take a look at the Kinesis producer. Is it really
>>>> necessary to have 500ms sleep time? What's responsible for the ~30s
>>>> duration in snapshotState?
>>>>
>>>> As things stand it doesn't make sense to use checkpoint intervals < 30s
>>>> when using the Kinesis producer.
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>> On Sat, Aug 1, 2020 at 2:53 PM Roman Khachatryan <
>>>> [hidden email]>
>>>> wrote:
>>>>
>>>> > Hi Thomas,
>>>> >
>>>> > Thanks a lot for the analysis.
>>>> >
>>>> > The first thing that I'd check is whether checkpoints became more
>>>> frequent
>>>> > with this commit (as each of them adds at least 500ms if there is at
>>>> least
>>>> > one not sent record, according to FlinkKinesisProducer.snapshotState).
>>>> >
>>>> > Can you share checkpointing statistics (1.10 vs 1.11 or last "good" vs
>>>> > first "bad" commits)?
>>>> >
>>>> > On Fri, Jul 31, 2020 at 5:29 AM Thomas Weise <[hidden email]>
>>>> > wrote:
>>>> >
>>>> > > I ran git bisect and the first commit that shows the regression is:
>>>> > >
>>>> > >
>>>> > >
>>>> >
>>>> https://github.com/apache/flink/commit/355184d69a8519d29937725c8d85e8465d7e3a90
>>>> > >
>>>> > >
>>>> > > On Thu, Jul 23, 2020 at 6:46 PM Kurt Young <[hidden email]>
>>>> wrote:
>>>> > >
>>>> > > > From my experience, java profilers are sometimes not accurate
>>>> enough to
>>>> > > > find out the performance regression
>>>> > > > root cause. In this case, I would suggest you try out intel vtune
>>>> > > amplifier
>>>> > > > to watch more detailed metrics.
>>>> > > >
>>>> > > > Best,
>>>> > > > Kurt
>>>> > > >
>>>> > > >
>>>> > > > On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise <[hidden email]>
>>>> wrote:
>>>> > > >
>>>> > > > > The cause of the issue is anything but clear.
>>>> > > > >
>>>> > > > > Previously I had mentioned that there is no suspect change to
>>>> the
>>>> > > Kinesis
>>>> > > > > connector and that I had reverted the AWS SDK change to no
>>>> effect.
>>>> > > > >
>>>> > > > > https://issues.apache.org/jira/browse/FLINK-17496 actually
>>>> fixed
>>>> > > another
>>>> > > > > regression in the previous release and is present before and
>>>> after.
>>>> > > > >
>>>> > > > > I repeated the run with 1.11.0 core and downgraded the entire
>>>> Kinesis
>>>> > > > > connector to 1.10.1: Nothing changes, i.e. the regression is
>>>> still
>>>> > > > present.
>>>> > > > > Therefore we will need to look elsewhere for the root cause.
>>>> > > > >
>>>> > > > > Regarding the time spent in snapshotState, repeat runs reveal a
>>>> wide
>>>> > > > range
>>>> > > > > for both versions, 1.10 and 1.11. So again this is nothing
>>>> pointing
>>>> > to
>>>> > > a
>>>> > > > > root cause.
>>>> > > > >
>>>> > > > > At this point, I have no ideas remaining other than doing a
>>>> bisect to
>>>> > > > find
>>>> > > > > the culprit. Any other suggestions?
>>>> > > > >
>>>> > > > > Thomas
>>>> > > > >
>>>> > > > >
>>>> > > > > On Thu, Jul 16, 2020 at 9:19 PM Zhijiang <
>>>> [hidden email]
>>>> > > > > .invalid>
>>>> > > > > wrote:
>>>> > > > >
>>>> > > > > > Hi Thomas,
>>>> > > > > >
>>>> > > > > > Thanks for your further profiling information and glad to see
>>>> we
>>>> > > > already
>>>> > > > > > finalized the location to cause the regression.
>>>> > > > > > Actually I was also suspicious of the point of #snapshotState
>>>> in
>>>> > > > previous
>>>> > > > > > discussions since it indeed cost much time to block normal
>>>> operator
>>>> > > > > > processing.
>>>> > > > > >
>>>> > > > > > Based on your below feedback, the sleep time during
>>>> #snapshotState
>>>> > > > might
>>>> > > > > > be the main concern, and I also dug into the
>>>> implementation of
>>>> > > > > > FlinkKinesisProducer#snapshotState.
>>>> > > > > > while (producer.getOutstandingRecordsCount() > 0) {
>>>> > > > > >    producer.flush();
>>>> > > > > >    try {
>>>> > > > > >       Thread.sleep(500);
>>>> > > > > >    } catch (InterruptedException e) {
>>>> > > > > >       LOG.warn("Flushing was interrupted.");
>>>> > > > > >       break;
>>>> > > > > >    }
>>>> > > > > > }
>>>> > > > > > It seems that the sleep time is mainly affected by the
>>>> internal
>>>> > > > > operations
>>>> > > > > > inside KinesisProducer implementation provided by amazonaws,
>>>> which
>>>> > I
>>>> > > am
>>>> > > > > not
>>>> > > > > > quite familiar with.
>>>> > > > > > But I noticed there were two upgrades related to it in
>>>> > > release-1.11.0.
>>>> > > > > One
>>>> > > > > > is for upgrading amazon-kinesis-producer to 0.14.0 [1] and
>>>> another
>>>> > is
>>>> > > > for
>>>> > > > > > upgrading aws-sdk-version to 1.11.754 [2].
>>>> > > > > > You mentioned that you already reverted the SDK upgrade to
>>>> verify
>>>> > no
>>>> > > > > > changes. Did you also revert the [1] to verify?
>>>> > > > > > [1] https://issues.apache.org/jira/browse/FLINK-17496
>>>> > > > > > [2] https://issues.apache.org/jira/browse/FLINK-14881
>>>> > > > > >
>>>> > > > > > Best,
>>>> > > > > > Zhijiang
>>>> > > > > >
>>>> ------------------------------------------------------------------
>>>> > > > > > From:Thomas Weise <[hidden email]>
>>>> > > > > > Send Time: 2020-07-17 (Friday) 05:29
>>>> > > > > > To:dev <[hidden email]>
>>>> > > > > > Cc:Zhijiang <[hidden email]>; Stephan Ewen <
>>>> > > > [hidden email]
>>>> > > > > >;
>>>> > > > > > Arvid Heise <[hidden email]>; Aljoscha Krettek <
>>>> > > > [hidden email]
>>>> > > > > >
>>>> > > > > > Subject:Re: Kinesis Performance Issue (was [VOTE] Release
>>>> 1.11.0,
>>>> > > > release
>>>> > > > > > candidate #4)
>>>> > > > > >
>>>> > > > > > Sorry for the delay.
>>>> > > > > >
>>>> > > > > > I confirmed that the regression is due to the sink
>>>> (unsurprising,
>>>> > > since
>>>> > > > > > another job with the same consumer, but not the producer,
>>>> runs as
>>>> > > > > > expected).
>>>> > > > > >
>>>> > > > > > As promised I did CPU profiling on the problematic
>>>> application,
>>>> > which
>>>> > > > > gives
>>>> > > > > > more insight into the regression [1]
>>>> > > > > >
>>>> > > > > > The screenshots show that the average time for snapshotState
>>>> > > increases
>>>> > > > > from
>>>> > > > > > ~9s to ~28s. The data also shows the increase in sleep time
>>>> during
>>>> > > > > > snapshotState.
>>>> > > > > >
>>>> > > > > > Does anyone, based on changes made in 1.11, have a theory why?
>>>> > > > > >
>>>> > > > > > I had previously looked at the changes to the Kinesis
>>>> connector and
>>>> > > > also
>>>> > > > > > reverted the SDK upgrade, which did not change the situation.
>>>> > > > > >
>>>> > > > > > It will likely be necessary to drill into the sink /
>>>> checkpointing
>>>> > > > > details
>>>> > > > > > to understand the cause of the problem.
>>>> > > > > >
>>>> > > > > > Let me know if anyone has specific questions that I can
>>>> answer from
>>>> > > the
>>>> > > > > > profiling results.
>>>> > > > > >
>>>> > > > > > Thomas
>>>> > > > > >
>>>> > > > > > [1]
>>>> > > > > >
>>>> > > > > >
>>>> > > > >
>>>> > > >
>>>> > >
>>>> >
>>>> https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing
>>>> > > > > >
>>>> > > > > > On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <[hidden email]
>>>> >
>>>> > > wrote:
>>>> > > > > >
>>>> > > > > > > + dev@ for visibility
>>>> > > > > > >
>>>> > > > > > > I will investigate further today.
>>>> > > > > > >
>>>> > > > > > >
>>>> > > > > > > On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <
>>>> > > [hidden email]
>>>> > > > >
>>>> > > > > > > wrote:
>>>> > > > > > >
>>>> > > > > > >> On 06.07.20 20:39, Stephan Ewen wrote:
>>>> > > > > > >> >    - Did sink checkpoint notifications change in a
>>>> relevant
>>>> > way,
>>>> > > > for
>>>> > > > > > >> example
>>>> > > > > > >> > due to some Kafka issues we addressed in 1.11 (@Aljoscha
>>>> > maybe?)
>>>> > > > > > >>
>>>> > > > > > >> I think that's unrelated: the Kafka fixes were isolated in
>>>> Kafka
>>>> > > and
>>>> > > > > the
>>>> > > > > > >> one bug I discovered on the way was about the Task reaper.
>>>> > > > > > >>
>>>> > > > > > >>
>>>> > > > > > >> On 07.07.20 17:51, Zhijiang wrote:
>>>> > > > > > >> > Sorry for my misunderstanding of the previous information,
>>>> > Thomas.
>>>> > > I
>>>> > > > > was
>>>> > > > > > >> assuming that the sync checkpoint duration increased after
>>>> > upgrade
>>>> > > > as
>>>> > > > > it
>>>> > > > > > >> was mentioned before.
>>>> > > > > > >> >
>>>> > > > > > >> > If I remembered correctly, the memory state backend also
>>>> has
>>>> > the
>>>> > > > > same
>>>> > > > > > >> issue? If so, we can dismiss the rocksDB state changes. As
>>>> the
>>>> > > slot
>>>> > > > > > sharing
>>>> > > > > > >> enabled, the downstream and upstream should
>>>> > > > > > >> > probably deployed into the same slot, then no network
>>>> shuffle
>>>> > > > > effect.
>>>> > > > > > >> >
>>>> > > > > > >> > I think we need to find out whether it has other symptoms
>>>> > > changed
>>>> > > > > > >> besides the performance regression to further figure out
>>>> the
>>>> > > scope.
>>>> > > > > > >> > E.g. any metrics changes, the number of TaskManager and
>>>> the
>>>> > > number
>>>> > > > > of
>>>> > > > > > >> slots per TaskManager from deployment changes.
>>>> > > > > > >> > 40% regression is really big, I guess the changes should
>>>> also
>>>> > be
>>>> > > > > > >> reflected in other places.
>>>> > > > > > >> >
>>>> > > > > > >> > I am not sure whether we can reproduce the regression in
>>>> our
>>>> > AWS
>>>> > > > > > >> environment by writing any Kinesis jobs, since there are
>>>> also
>>>> > > normal
>>>> > > > > > >> Kinesis jobs as Thomas mentioned after upgrade.
>>>> > > > > > >> > So it probably looks like to touch some corner case. I
>>>> am very
>>>> > > > > willing
>>>> > > > > > >> to provide any help for debugging if possible.
>>>> > > > > > >> >
>>>> > > > > > >> >
>>>> > > > > > >> > Best,
>>>> > > > > > >> > Zhijiang
>>>> > > > > > >> >
>>>> > > > > > >> >
>>>> > > > > > >> >
>>>> > > ------------------------------------------------------------------
>>>> > > > > > >> > From:Thomas Weise <[hidden email]>
>>>> > > > > > >> > Send Time: 2020-07-07 (Tuesday) 23:01
>>>> > > > > > >> > To:Stephan Ewen <[hidden email]>
>>>> > > > > > >> > Cc:Aljoscha Krettek <[hidden email]>; Arvid Heise <
>>>> > > > > > >> [hidden email]>; Zhijiang <[hidden email]
>>>> >
>>>> > > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release
>>>> > > 1.11.0,
>>>> > > > > > >> release candidate #4)
>>>> > > > > > >> >
>>>> > > > > > >> > We are deploying our apps with FlinkK8sOperator. We have
>>>> one
>>>> > job
>>>> > > > > that
>>>> > > > > > >> works as expected after the upgrade and the one discussed
>>>> here
>>>> > > that
>>>> > > > > has
>>>> > > > > > the
>>>> > > > > > >> performance regression.
>>>> > > > > > >> >
>>>> > > > > > >> > "The performance regression is obviously caused by long
>>>> duration
>>>> > > of
>>>> > > > > sync
>>>> > > > > > >> checkpoint process in Kinesis sink operator, which would
>>>> block
>>>> > the
>>>> > > > > > normal
>>>> > > > > > >> data processing until back pressure the source."
>>>> > > > > > >> >
>>>> > > > > > >> > That's a constant. Before (1.10) and upgrade have the
>>>> same
>>>> > sync
>>>> > > > > > >> checkpointing time. The question is what change came in
>>>> with the
>>>> > > > > > upgrade.
>>>> > > > > > >> >
>>>> > > > > > >> >
>>>> > > > > > >> >
>>>> > > > > > >> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <
>>>> [hidden email]
>>>> > >
>>>> > > > > wrote:
>>>> > > > > > >> >
>>>> > > > > > >> > @Thomas Just one thing real quick: Are you using the
>>>> > standalone
>>>> > > > > setup
>>>> > > > > > >> scripts (like start-cluster.sh, and the former "slaves"
>>>> file) ?
>>>> > > > > > >> > Be aware that this is now called "workers" because of
>>>> avoiding
>>>> > > > > > >> sensitive names.
>>>> > > > > > >> > In one internal benchmark we saw quite a lot of slowdown
>>>> > > > initially,
>>>> > > > > > >> before seeing that the cluster was not a distributed
>>>> cluster any
>>>> > > > more
>>>> > > > > > ;-)
>>>> > > > > > >> >
>>>> > > > > > >> >
>>>> > > > > > >> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <
>>>> > > > [hidden email]
>>>> > > > > >
>>>> > > > > > >> wrote:
>>>> > > > > > >> > Thanks for this kickoff and help analysis, Stephan!
>>>> > > > > > >> > Thanks for the further feedback and investigation,
>>>> Thomas!
>>>> > > > > > >> >
>>>> > > > > > >> > The performance regression is obviously caused by long
>>>> duration
>>>> > of
>>>> > > > > sync
>>>> > > > > > >> checkpoint process in Kinesis sink operator, which would
>>>> block
>>>> > the
>>>> > > > > > normal
>>>> > > > > > >> data processing until back pressure the source.
>>>> > > > > > >> > Maybe we could dig into the process of sync execution in
>>>> > > > checkpoint.
>>>> > > > > > >> E.g. break down the steps inside respective
>>>> > operator#snapshotState
>>>> > > > to
>>>> > > > > > >> statistic which operation cost most of the time, then
>>>> > > > > > >> > we might probably find the root cause to bring such cost.
>>>> > > > > > >> >
>>>> > > > > > >> > Look forward to the further progress. :)
>>>> > > > > > >> >
>>>> > > > > > >> > Best,
>>>> > > > > > >> > Zhijiang
>>>> > > > > > >> >
>>>> > > > > > >> >
>>>> > > ------------------------------------------------------------------
>>>> > > > > > >> > From:Stephan Ewen <[hidden email]>
>>>> > > > > > >> > Send Time: 2020-07-07 (Tuesday) 14:52
>>>> > > > > > >> > To:Thomas Weise <[hidden email]>
>>>> > > > > > >> > Cc:Stephan Ewen <[hidden email]>; Zhijiang <
>>>> > > > > > >> [hidden email]>; Aljoscha Krettek <
>>>> > > [hidden email]
>>>> > > > >;
>>>> > > > > > >> Arvid Heise <[hidden email]>
>>>> > > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release
>>>> > > 1.11.0,
>>>> > > > > > >> release candidate #4)
>>>> > > > > > >> >
>>>> > > > > > >> > Thank you for the digging so deeply.
>>>> > > > > > >> > Mysterious thing, this regression.
>>>> > > > > > >> >
>>>> > > > > > >> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <[hidden email]>
>>>> > wrote:
>>>> > > > > > >> > @Stephan: yes, I refer to sync time in the web UI (it is
>>>> > > unchanged
>>>> > > > > > >> between 1.10 and 1.11 for the specific pipeline).
>>>> > > > > > >> >
>>>> > > > > > >> > I verified that increasing the checkpointing interval
>>>> does not
>>>> > > > make
>>>> > > > > a
>>>> > > > > > >> difference.
>>>> > > > > > >> >
>>>> > > > > > >> > I looked at the Kinesis connector changes since 1.10.1
>>>> and
>>>> > don't
>>>> > > > see
>>>> > > > > > >> anything that could cause this.
>>>> > > > > > >> >
>>>> > > > > > >> > Another pipeline that is using the Kinesis consumer (but
>>>> not
>>>> > the
>>>> > > > > > >> producer) performs as expected.
>>>> > > > > > >> >
>>>> > > > > > >> > I tried reverting the AWS SDK version change, symptoms
>>>> remain
>>>> > > > > > unchanged:
>>>> > > > > > >> >
>>>> > > > > > >> > diff --git
>>>> a/flink-connectors/flink-connector-kinesis/pom.xml
>>>> > > > > > >> b/flink-connectors/flink-connector-kinesis/pom.xml
>>>> > > > > > >> > index a6abce23ba..741743a05e 100644
>>>> > > > > > >> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
>>>> > > > > > >> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
>>>> > > > > > >> > @@ -33,7 +33,7 @@ under the License.
>>>> > > > > > >> >
>>>> > > > > > >>
>>>> > > > >
>>>> > >
>>>> <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
>>>> > > > > > >> >          <name>flink-connector-kinesis</name>
>>>> > > > > > >> >          <properties>
>>>> > > > > > >> > -
>>>>  <aws.sdk.version>1.11.754</aws.sdk.version>
>>>> > > > > > >> > +
>>>>  <aws.sdk.version>1.11.603</aws.sdk.version>
>>>> > > > > > >> >
>>>> > > > > > >> <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
>>>> > > > > > >> >
>>>> > > > > > >> <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
>>>> > > > > > >> >
>>>> > > > > > >>
>>>> > > > > >
>>>> > > > >
>>>> > > >
>>>> > >
>>>> >
>>>> <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
>>>> > > > > > >> >
>>>> > > > > > >> > I'm planning to take a look with a profiler next.
>>>> > > > > > >> >
>>>> > > > > > >> > Thomas
>>>> > > > > > >> >
>>>> > > > > > >> >
>>>> > > > > > >> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <
>>>> > [hidden email]>
>>>> > > > > > wrote:
>>>> > > > > > >> > Hi all!
>>>> > > > > > >> >
>>>> > > > > > >> > Forking this thread out of the release vote thread.
>>>> > > > > > >> >  From what Thomas describes, it really sounds like a
>>>> > > sink-specific
>>>> > > > > > >> issue.
>>>> > > > > > >> >
>>>> > > > > > >> > @Thomas: When you say sink has a long synchronous
>>>> checkpoint
>>>> > > time,
>>>> > > > > you
>>>> > > > > > >> mean the time that is shown as "sync time" on the metrics
>>>> and
>>>> > web
>>>> > > > UI?
>>>> > > > > > That
>>>> > > > > > >> is not including any network buffer related operations. It
>>>> is
>>>> > > purely
>>>> > > > > the
>>>> > > > > > >> operator's time.
>>>> > > > > > >> >
>>>> > > > > > >> > Can we dig into the changes we did in sinks:
>>>> > > > > > >> >    - Kinesis version upgrade, AWS library updates
>>>> > > > > > >> >
>>>> > > > > > >> >    - Could it be that some call (checkpoint complete)
>>>> that was
>>>> > > > > > >> previously (1.10) in a separate thread is not in the
>>>> mailbox and
>>>> > > > this
>>>> > > > > > >> simply reduces the number of threads that do the work?
>>>> > > > > > >> >
>>>> > > > > > >> >    - Did sink checkpoint notifications change in a
>>>> relevant
>>>> > way,
>>>> > > > for
>>>> > > > > > >> example due to some Kafka issues we addressed in 1.11
>>>> (@Aljoscha
>>>> > > > > maybe?)
>>>> > > > > > >> >
>>>> > > > > > >> > Best,
>>>> > > > > > >> > Stephan
>>>> > > > > > >> >
>>>> > > > > > >> >
>>>> > > > > > >> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <
>>>> > > > [hidden email]
>>>> > > > > > .invalid>
>>>> > > > > > >> wrote:
>>>> > > > > > >> > Hi Thomas,
>>>> > > > > > >> >
>>>> > > > > > >> >   Regarding [2], there are more details in the Jira
>>>> > > description
>>>> > > > (
>>>> > > > > > >> https://issues.apache.org/jira/browse/FLINK-16404).
>>>> > > > > > >> >
>>>> > > > > > >> >   I can also give some basic explanations here to
>>>> dismiss the
>>>> > > > > concern.
>>>> > > > > > >> >   1. In the past, the following buffers after the
>>>> barrier will
>>>> > > be
>>>> > > > > > >> cached on downstream side before alignment.
>>>> > > > > > >> >   2. In 1.11, the upstream would not send the buffers
>>>> after
>>>> > the
>>>> > > > > > >> barrier. When the downstream finishes the alignment, it
>>>> will
>>>> > > notify
>>>> > > > > the
>>>> > > > > > >> upstream to continue sending the following buffers, since
>>>> it can
>>>> > > > > process
>>>> > > > > > >> them after alignment.
>>>> > > > > > >> >   3. The only difference is that the temporary blocked
>>>> buffers
>>>> > > are
>>>> > > > > > >> cached either on downstream side or on upstream side before
>>>> > > > alignment.
>>>> > > > > > >> >   4. The side effect would be the additional
>>>> notification cost
>>>> > > for
>>>> > > > > > >> every barrier alignment. If the downstream and upstream are
>>>> > > deployed
>>>> > > > > in
>>>> > > > > > >> separate TaskManager, the cost is network transport delay
>>>> (the
>>>> > > > effect
>>>> > > > > > can
>>>> > > > > > >> be ignored based on our testing with 1s checkpoint
>>>> interval).
>>>> > For
>>>> > > > > > sharing
>>>> > > > > > >> slot in your case, the cost is only one method call in
>>>> > processor,
>>>> > > > can
>>>> > > > > be
>>>> > > > > > >> ignored also.
>>>> > > > > > >> >
>>>> > > > > > >> >   You mentioned "In this case, the downstream task has a
>>>> high
>>>> > > > > average
>>>> > > > > > >> checkpoint duration(~30s, sync part)." This duration is not
>>>> > > > reflecting
>>>> > > > > > the
>>>> > > > > > >> changes above, and it is only indicating the duration for
>>>> > calling
>>>> > > > > > >> `Operation.snapshotState`.
>>>> > > > > > >> >   If this duration is beyond your expectation, you can
>>>> check
>>>> > or
>>>> > > > > debug
>>>> > > > > > >> whether the source/sink operations might take more time to
>>>> > finish
>>>> > > > > > >> `snapshotState` in practice. E.g. you can
>>>> > > > > > >> >   make the implementation of this method as empty to
>>>> further
>>>> > > > verify
>>>> > > > > > the
>>>> > > > > > >> effect.
>>>> > > > > > >> >
>>>> > > > > > >> >   Best,
>>>> > > > > > >> >   Zhijiang
>>>> > > > > > >> >
>>>> > > > > > >> >
>>>> > > > > > >> >
>>>> > > >
>>>> ------------------------------------------------------------------
>>>> > > > > > >> >   From:Thomas Weise <[hidden email]>
>>>> > > > > > >> >   Send Time: 2020-07-05 (Sunday) 12:22
>>>> > > > > > >> >   To:dev <[hidden email]>; Zhijiang <
>>>> > > > > [hidden email]
>>>> > > > > > >
>>>> > > > > > >> >   Cc:Yingjie Cao <[hidden email]>
>>>> > > > > > >> >   Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>>>> > > > > > >> >
>>>> > > > > > >> >   Hi Zhijiang,
>>>> > > > > > >> >
>>>> > > > > > >> >   Could you please point me to more details regarding:
>>>> "[2]:
>>>> > > Delay
>>>> > > > > > send
>>>> > > > > > >> the
>>>> > > > > > >> >   following buffers after checkpoint barrier on upstream
>>>> side
>>>> > > > until
>>>> > > > > > >> barrier
>>>> > > > > > >> >   alignment on downstream side."
>>>> > > > > > >> >
>>>> > > > > > >> >   In this case, the downstream task has a high average
>>>> > > checkpoint
>>>> > > > > > >> duration
>>>> > > > > > >> >   (~30s, sync part). If there was a change to hold
>>>> buffers
>>>> > > > depending
>>>> > > > > > on
>>>> > > > > > >> >   downstream performance, could this possibly apply to
>>>> this
>>>> > case
>>>> > > > > (even
>>>> > > > > > >> when
>>>> > > > > > >> >   there is no shuffle that would require alignment)?
>>>> > > > > > >> >
>>>> > > > > > >> >   Thanks,
>>>> > > > > > >> >   Thomas
>>>> > > > > > >> >
>>>> > > > > > >> >
>>>> > > > > > >> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <
>>>> > > > > [hidden email]
>>>> > > > > > >> .invalid>
>>>> > > > > > >> >   wrote:
>>>> > > > > > >> >
>>>> > > > > > >> >   > Hi Thomas,
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > Thanks for the further update information.
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > I guess we can dismiss the network stack changes,
>>>> since in
>>>> > > > your
>>>> > > > > > >> case the
>>>> > > > > > >> >   > downstream and upstream would probably be deployed
>>>> in the
>>>> > > same
>>>> > > > > > slot
>>>> > > > > > >> >   > bypassing the network data shuffle.
>>>> > > > > > >> >   > Also I guess release-1.11 will not bring general
>>>> > performance
>>>> > > > > > >> regression in
>>>> > > > > > >> >   > runtime engine, as we also did the performance
>>>> testing for
>>>> > > all
>>>> > > > > > >> general
>>>> > > > > > >> >   > cases by [1] in real cluster before and the testing
>>>> > results
>>>> > > > > should
>>>> > > > > > >> fit the
>>>> > > > > > >> >   > expectation. But we indeed did not test the specific
>>>> > source
>>>> > > > and
>>>> > > > > > sink
>>>> > > > > > >> >   > connectors yet, as far as I know.
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > Regarding your performance regression with 40%, I
>>>> wonder
>>>> > it
>>>> > > is
>>>> > > > > > >> probably
>>>> > > > > > >> >   > related to specific source/sink changes (e.g.
>>>> kinesis) or
>>>> > > > > > >> environment
>>>> > > > > > >> >   > issues with corner case.
>>>> > > > > > >> >   > If possible, it would be helpful to further locate
>>>> whether
>>>> > > the
>>>> > > > > > >> regression
>>>> > > > > > >> >   > is caused by kinesis, by replacing the kinesis
>>>> source &
>>>> > sink
>>>> > > > and
>>>> > > > > > >> keeping
>>>> > > > > > >> >   > the others same.
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > As you said, it would be efficient to contact with
>>>> you
>>>> > > > directly
>>>> > > > > > >> next week
>>>> > > > > > >> >   > to further discuss this issue. And we are
>>>> willing/eager to
>>>> > > > > provide
>>>> > > > > > >> any help
>>>> > > > > > >> >   > to resolve this issue soon.
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > Besides that, I guess this issue should not be the
>>>> blocker
>>>> > > for
>>>> > > > > the
>>>> > > > > > >> >   > release, since it is probably a corner case based on
>>>> the
>>>> > > > current
>>>> > > > > > >> analysis.
>>>> > > > > > >> >   > If we really conclude anything need to be resolved
>>>> after
>>>> > the
>>>> > > > > final
>>>> > > > > > >> >   > release, then we can also make the next minor
>>>> > release-1.11.1
>>>> > > > > come
>>>> > > > > > >> soon.
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > [1]
>>>> https://issues.apache.org/jira/browse/FLINK-18433
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > Best,
>>>> > > > > > >> >   > Zhijiang
>>>> > > > > > >> >   >
>>>> > > > > > >> >   >
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > ------------------------------------------------------------------
>>>> > > > > > >> >   > From:Thomas Weise <[hidden email]>
>>>> > > > > > >> >   > Send Time:July 4, 2020 (Saturday) 12:26
>>>> > > > > > >> >   > To:dev <[hidden email]>; Zhijiang <[hidden email]>
>>>> > > > > > >> >   > Cc:Yingjie Cao <[hidden email]>
>>>> > > > > > >> >   > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > Hi Zhijiang,
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > It will probably be best if we connect next week and discuss the issue
>>>> > > > > > >> >   > directly, since this could be quite difficult to reproduce.
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > Before the testing result on our side comes out for your respective job
>>>> > > > > > >> >   > case, I have some other questions to confirm for further analysis:
>>>> > > > > > >> >   >     -  What percentage regression did you find after switching to 1.11?
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > ~40% throughput decline
>>>> > > > > > >> >   >
>>>> > > > > > >> >   >     -  Is there any network bottleneck in your cluster? E.g. is the
>>>> > > > > > >> >   > network bandwidth saturated by other jobs? If so, [2] above might have
>>>> > > > > > >> >   > more effect.
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > The test runs on a k8s cluster that is also used for other production
>>>> > > > > > >> >   > jobs. There is no reason to believe the network is the bottleneck.
>>>> > > > > > >> >   >
>>>> > > > > > >> >   >     -  Did you adjust the default network buffer settings? E.g.
>>>> > > > > > >> >   > "taskmanager.network.memory.floating-buffers-per-gate" or
>>>> > > > > > >> >   > "taskmanager.network.memory.buffers-per-channel"
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > The job is using the defaults, i.e. we don't configure the settings. If
>>>> > > > > > >> >   > you want me to try specific settings in the hope that it will help to
>>>> > > > > > >> >   > isolate the issue, please let me know.
>>>> > > > > > >> >   >
>>>> > > > > > >> >   >     -  I guess the topology has three vertices "KinesisConsumer ->
>>>> > > > > > >> >   > Chained FlatMap -> KinesisProducer", and the partition mode for
>>>> > > > > > >> >   > "KinesisConsumer -> FlatMap" and "FlatMap -> KinesisProducer" is
>>>> > > > > > >> >   > "forward" in both cases? If so, the edge connection is one-to-one, not
>>>> > > > > > >> >   > all-to-all, and [1][2] above should have no effect in theory with the
>>>> > > > > > >> >   > default network buffer settings.
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > There are only 2 vertices and the edge is "forward".
>>>> > > > > > >> >   >
>>>> > > > > > >> >   >     - By slot sharing, I guess the parallel tasks of these three
>>>> > > > > > >> >   > vertices would probably be deployed into the same slot, so the data
>>>> > > > > > >> >   > shuffle goes through an in-memory queue, not the network stack. If so,
>>>> > > > > > >> >   > [2] above should have no effect.
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > Yes, vertices share slots.
>>>> > > > > > >> >   >
>>>> > > > > > >> >   >     - I also saw some Jira changes for Kinesis in this release. Could
>>>> > > > > > >> >   > you confirm that these changes do not affect the performance?
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > I will need to take a look. 1.10 already had a regression introduced by
>>>> > > > > > >> >   > the Kinesis producer update.
>>>> > > > > > >> >   >
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > Thanks,
>>>> > > > > > >> >   > Thomas
>>>> > > > > > >> >   >
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <[hidden email].invalid> wrote:
>>>> > > > > > >> >   >
>>>> > > > > > >> >   > > Hi Thomas,
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > Thanks for your reply with rich information!
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > We are trying to reproduce your case in our cluster to further verify
>>>> > > > > > >> >   > > it, and @Yingjie Cao is working on it now.
>>>> > > > > > >> >   > > As we do not have a Kinesis consumer and producer internally, we will
>>>> > > > > > >> >   > > construct a common source and sink instead, for the backpressure case.
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > Firstly, we can dismiss the RocksDB factor in this release, since you
>>>> > > > > > >> >   > > also mentioned that "filesystem leads to same symptoms".
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > Secondly, if my understanding is right, you emphasize that the
>>>> > > > > > >> >   > > regression only exists for jobs with a low checkpoint interval (10s).
>>>> > > > > > >> >   > > Based on that, I have two suspicions regarding the network-related
>>>> > > > > > >> >   > > changes in this release:
>>>> > > > > > >> >   > >     - [1]: Limited the maximum backlog value (default 10) in the
>>>> > > > > > >> >   > > subpartition queue.
>>>> > > > > > >> >   > >     - [2]: Delay sending the buffers that follow a checkpoint barrier
>>>> > > > > > >> >   > > on the upstream side until barrier alignment on the downstream side.
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > These changes are motivated by reducing the in-flight buffers to speed
>>>> > > > > > >> >   > > up checkpoints, especially in the case of backpressure.
>>>> > > > > > >> >   > > In theory they should have a very minor performance effect, and we
>>>> > > > > > >> >   > > actually also tested them in a cluster to verify they were within
>>>> > > > > > >> >   > > expectation before merging them, but maybe there are other corner
>>>> > > > > > >> >   > > cases we have not thought of before.
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > Before the testing result on our side comes out for your respective
>>>> > > > > > >> >   > > job case, I have some other questions to confirm for further analysis:
>>>> > > > > > >> >   > >     -  What percentage regression did you find after switching to 1.11?
>>>> > > > > > >> >   > >     -  Is there any network bottleneck in your cluster? E.g. is the
>>>> > > > > > >> >   > > network bandwidth saturated by other jobs? If so, [2] above might have
>>>> > > > > > >> >   > > more effect.
>>>> > > > > > >> >   > >     -  Did you adjust the default network buffer settings? E.g.
>>>> > > > > > >> >   > > "taskmanager.network.memory.floating-buffers-per-gate" or
>>>> > > > > > >> >   > > "taskmanager.network.memory.buffers-per-channel"
>>>> > > > > > >> >   > >     -  I guess the topology has three vertices "KinesisConsumer ->
>>>> > > > > > >> >   > > Chained FlatMap -> KinesisProducer", and the partition mode for
>>>> > > > > > >> >   > > "KinesisConsumer -> FlatMap" and "FlatMap -> KinesisProducer" is
>>>> > > > > > >> >   > > "forward" in both cases? If so, the edge connection is one-to-one, not
>>>> > > > > > >> >   > > all-to-all, and [1][2] above should have no effect in theory with the
>>>> > > > > > >> >   > > default network buffer settings.
>>>> > > > > > >> >   > >     - By slot sharing, I guess the parallel tasks of these three
>>>> > > > > > >> >   > > vertices would probably be deployed into the same slot, so the data
>>>> > > > > > >> >   > > shuffle goes through an in-memory queue, not the network stack. If so,
>>>> > > > > > >> >   > > [2] above should have no effect.
>>>> > > > > > >> >   > >     - I also saw some Jira changes for Kinesis in this release. Could
>>>> > > > > > >> >   > > you confirm that these changes do not affect the performance?
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > Best,
>>>> > > > > > >> >   > > Zhijiang
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > ------------------------------------------------------------------
>>>> > > > > > >> >   > > From:Thomas Weise <[hidden email]>
>>>> > > > > > >> >   > > Send Time:July 3, 2020 (Friday) 01:07
>>>> > > > > > >> >   > > To:dev <[hidden email]>; Zhijiang <[hidden email]>
>>>> > > > > > >> >   > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > Hi Zhijiang,
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > The performance degradation manifests in backpressure which leads to a
>>>> > > > > > >> >   > > growing backlog in the source. I switched a few times between 1.10 and
>>>> > > > > > >> >   > > 1.11 and the behavior is consistent.
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > The DAG is:
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map) -------- forward ---------> KinesisProducer
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > Parallelism: 160
>>>> > > > > > >> >   > > No shuffle/rebalance.
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > Checkpointing config:
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > Checkpointing Mode: Exactly Once
>>>> > > > > > >> >   > > Interval: 10s
>>>> > > > > > >> >   > > Timeout: 10m 0s
>>>> > > > > > >> >   > > Minimum Pause Between Checkpoints: 10s
>>>> > > > > > >> >   > > Maximum Concurrent Checkpoints: 1
>>>> > > > > > >> >   > > Persist Checkpoints Externally: Enabled (delete on cancellation)
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > State backend: rocksdb  (filesystem leads to same symptoms)
>>>> > > > > > >> >   > > Checkpoint size is tiny (500KB)
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > An interesting difference to another job that I had upgraded
>>>> > > > > > >> >   > > successfully is the low checkpointing interval.
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > Thanks,
>>>> > > > > > >> >   > > Thomas
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <[hidden email].invalid> wrote:
>>>> > > > > > >> >   > >
>>>> > > > > > >> >   > > > Hi Thomas,
>>>> > > > > > >> >   > > >
>>>> > > > > > >> >   > > > Thanks for the efficient feedback.
>>>> > > > > > >> >   > > >
>>>> > > > > > >> >   > > > Regarding the suggestion of adding the release notes document, I
>>>> > > > > > >> >   > > > agree with your point. Maybe we should adjust the vote template
>>>> > > > > > >> >   > > > accordingly in the respective wiki to guide the following release
>>>> > > > > > >> >   > > > processes.
>>>> > > > > > >> >   > > >
>>>> > > > > > >> >   > > > Regarding the performance regression, could you provide some more
>>>> > > > > > >> >   > > > details so that we can better measure or reproduce it on our side?
>>>> > > > > > >> >   > > > E.g. I guess the topology only includes two vertices, source and sink?
>>>> > > > > > >> >   > > > What is the parallelism for every vertex?
>>>> > > > > > >> >   > > > Does the upstream shuffle data to the downstream via the rebalance
>>>> > > > > > >> >   > > > partitioner or another one?
>>>> > > > > > >> >   > > > Is the checkpoint mode exactly-once with the RocksDB state backend?
>>>> > > > > > >> >   > > > Did backpressure happen in this case?
>>>> > > > > > >> >   > > > What is the percentage regression in this case?
>>>> > > > > > >> >   > > >
>>>> > > > > > >> >   > > > Best,
>>>> > > > > > >> >   > > > Zhijiang
>>>> > > > > > >> >   > > >
>>>> > > > > > >> >   > > >
>>>> > > > > > >> >   > > >
>>>> > > > > > >> >   > > >
>>>> > > > > > >> >   > > > ------------------------------------------------------------------
>>>> > > > > > >> >   > > > From:Thomas Weise <[hidden email]>
>>>> > > > > > >> >   > > > Send Time:July 2, 2020 (Thursday) 09:54
>>>> > > > > > >> >   > > > To:dev <[hidden email]>
>>>> > > > > > >> >   > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>>>> > > > > > >> >   > > >
>>>> > > > > > >> >   > > > Hi Till,
>>>> > > > > > >> >   > > >
>>>> > > > > > >> >   > > > Yes, we don't have the setting in flink-conf.yaml.
>>>> > > > > > >> >   > > >
>>>> > > > > > >> >   > > > Generally, we carry forward the existing configuration and any change
>>>> > > > > > >> >   > > > to default configuration values would impact the upgrade.
>>>> > > > > > >> >   > > >
>>>> > > > > > >> >   > > > Yes, since it is an incompatible change I would state it in the
>>>> > > > > > >> >   > > > release notes.
>>>> > > > > > >> >   > > >
>>>> > > > > > >> >   > > > Thanks,
>>>> > > > > > >> >   > > > Thomas
>>>> > > > > > >> >   > > >
>>>> > > > > > >> >   > > > BTW I found a performance regression while trying to upgrade another
>>>> > > > > > >> >   > > > pipeline with this RC. It is a simple Kinesis to Kinesis job. Wasn't
>>>> > > > > > >> >   > > > able to pin it down yet, symptoms include increased checkpoint
>>>> > > > > > >> >   > > > alignment time.
>>>> > > > > > >> >   > > >
>>>> > > > > > >> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <[hidden email]> wrote:
>>>> > > > > > >> >   > > >
>>>> > > > > > >> >   > > > > Hi Thomas,
>>>> > > > > > >> >   > > > >
>>>> > > > > > >> >   > > > > just to confirm: When starting the image in local mode, then you
>>>> > > > > > >> >   > > > > don't have any of the JobManager memory configuration settings
>>>> > > > > > >> >   > > > > configured in the effective flink-conf.yaml, right? Does this mean
>>>> > > > > > >> >   > > > > that you have explicitly removed `jobmanager.heap.size: 1024m` from
>>>> > > > > > >> >   > > > > the default configuration? If this is the case, then I believe it
>>>> > > > > > >> >   > > > > was more of an unintentional artifact that it worked before and it
>>>> > > > > > >> >   > > > > has been corrected now so that one needs to specify the memory of
>>>> > > > > > >> >   > > > > the JM process explicitly. Do you think it would help to explicitly
>>>> > > > > > >> >   > > > > state this in the release notes?
>>>> > > > > > >> >   > > > >
>>>> > > > > > >> >   > > > > Cheers,
>>>> > > > > > >> >   > > > > Till
>>>> > > > > > >> >   > > > >
>>>> > > > > > >> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <[hidden email]> wrote:
>>>> > > > > > >> >   > > > >
>>>> > > > > > >> >   > > > > > Thanks for preparing another RC!
>>>> > > > > > >> >   > > > > >
>>>> > > > > > >> >   > > > > > As mentioned in the previous RC thread, it would be super helpful
>>>> > > > > > >> >   > > > > > if the release notes that are part of the documentation can be
>>>> > > > > > >> >   > > > > > included [1]. It's a significant time-saver to have read those
>>>> > > > > > >> >   > > > > > first.
>>>> > > > > > >> >   > > > > >
>>>> > > > > > >> >   > > > > > I found one more non-backward compatible change that would be
>>>> > > > > > >> >   > > > > > worth addressing/mentioning:
>>>> > > > > > >> >   > > > > >
>>>> > > > > > >> >   > > > > > It is now necessary to configure the jobmanager heap size in
>>>> > > > > > >> >   > > > > > flink-conf.yaml (with either jobmanager.heap.size
>>>> > > > > > >> >   > > > > > or jobmanager.memory.heap.size). Why would I not want to do that
>>>> > > > > > >> >   > > > > > anyways? Well, we set it dynamically for a cluster deployment via
>>>> > > > > > >> >   > > > > > the flinkk8soperator, but the container image can also be used for
>>>> > > > > > >> >   > > > > > testing with local mode (./bin/jobmanager.sh start-foreground
>>>> > > > > > >> >   > > > > > local). That will fail if the heap wasn't configured and that's
>>>> > > > > > >> >   > > > > > how I noticed it.
>>>> > > > > > >> >   > > > > >
>>>> > > > > > >> >   > > > > > Thanks,
>>>> > > > > > >> >   > > > > > Thomas
>>>> > > > > > >> >   > > > > >
>>>> > > > > > >> >   > > > > > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
>>>> > > > > > >> >   > > > > >
>>>> > > > > > >> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <[hidden email].invalid> wrote:
>>>> > > > > > >> >   > > > > >
>>>> > > > > > >> >   > > > > > > Hi everyone,
>>>> > > > > > >> >   > > > > > >
>>>> > > > > > >> >   > > > > > > Please review and vote on the release candidate #4 for the
>>>> > > > > > >> >   > > > > > > version 1.11.0, as follows:
>>>> > > > > > >> >   > > > > > > [ ] +1, Approve the release
>>>> > > > > > >> >   > > > > > > [ ] -1, Do not approve the release (please provide specific
>>>> > > > > > >> >   > > > > > > comments)
>>>> > > > > > >> >   > > > > > >
>>>> > > > > > >> >   > > > > > > The complete staging area is available for your review, which
>>>> > > > > > >> >   > > > > > > includes:
>>>> > > > > > >> >   > > > > > > * JIRA release notes [1],
>>>> > > > > > >> >   > > > > > > * the official Apache source release and binary convenience
>>>> > > > > > >> >   > > > > > > releases to be deployed to dist.apache.org [2], which are signed
>>>> > > > > > >> >   > > > > > > with the key with fingerprint
>>>> > > > > > >> >   > > > > > > 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
>>>> > > > > > >> >   > > > > > > * all artifacts to be deployed to the Maven Central Repository
>>>> > > > > > >> >   > > > > > > [4],
>>>> > > > > > >> >   > > > > > > * source code tag "release-1.11.0-rc4" [5],
>>>> > > > > > >> >   > > > > > > * website pull request listing the new release and adding
>>>> > > > > > >> >   > > > > > > announcement blog post [6].
>>>> > > > > > >> >   > > > > > >
>>>> > > > > > >> >   > > > > > > The vote will be open for at least 72 hours. It is adopted by
>>>> > > > > > >> >   > > > > > > majority approval, with at least 3 PMC affirmative votes.
>>>> > > > > > >> >   > > > > > >
>>>> > > > > > >> >   > > > > > > Thanks,
>>>> > > > > > >> >   > > > > > > Release Manager
>>>> > > > > > >> >   > > > > > >
>>>> > > > > > >> >   > > > > > > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
>>>> > > > > > >> >   > > > > > > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
>>>> > > > > > >> >   > > > > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
>>>> > > > > > >> >   > > > > > > [4] https://repository.apache.org/content/repositories/orgapacheflink-1377/
>>>> > > > > > >> >   > > > > > > [5] https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
>>>> > > > > > >> >   > > > > > > [6] https://github.com/apache/flink-web/pull/352
>>>> >
>>>> > --
>>>> > Regards,
>>>> > Roman
>>>> >
>>>>
>>>
>>>
>>> --
>>> Regards,
>>> Roman
>>>
>>

--
Regards,
Roman

Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

Roman Khachatryan
Hi Thomas,

The fix is now merged to master and to release-1.11.
So if you'd like you can check if it solves your problem (it would be
helpful for us too).
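
For readers following along, here is a minimal sketch of the ordering problem described in the quoted message below. It is a toy model, not Flink's CheckpointCoordinator: the class, its fields, and the single request queue are all hypothetical, and it only illustrates why recording the completion time after checking queued requests can let a queued request bypass the minimum pause.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of a checkpoint scheduler; every name here is hypothetical, not Flink's. */
class MinPauseSketch {
    static final long MIN_PAUSE_MS = 10_000;

    private final boolean storeTimeFirst;
    private long lastCompletionMs; // completion time of the previous checkpoint
    private final Deque<Long> queuedRequests = new ArrayDeque<>();
    private int triggered;

    MinPauseSketch(boolean storeTimeFirst) {
        this.storeTimeFirst = storeTimeFirst;
    }

    /** The periodic timer fired while a checkpoint was still running. */
    void enqueueRequest(long nowMs) {
        queuedRequests.add(nowMs);
    }

    /** Called when the running checkpoint completes. */
    void onCheckpointCompleted(long nowMs) {
        if (storeTimeFirst) {
            lastCompletionMs = nowMs;       // pause is measured from this completion
            triggerQueuedIfAllowed(nowMs);
        } else {
            triggerQueuedIfAllowed(nowMs);  // still sees the previous completion time
            lastCompletionMs = nowMs;
        }
    }

    private void triggerQueuedIfAllowed(long nowMs) {
        boolean pauseElapsed = nowMs - lastCompletionMs >= MIN_PAUSE_MS;
        if (!queuedRequests.isEmpty() && pauseElapsed) {
            queuedRequests.poll();
            triggered++;
        }
    }

    int triggeredCount() {
        return triggered;
    }

    /** One 30s checkpoint starting at t=0, with timer requests queued at 10s and 20s. */
    static int simulate(boolean storeTimeFirst) {
        MinPauseSketch s = new MinPauseSketch(storeTimeFirst);
        s.enqueueRequest(10_000);
        s.enqueueRequest(20_000);
        s.onCheckpointCompleted(30_000);
        return s.triggeredCount();
    }

    public static void main(String[] args) {
        System.out.println("time stored after queue check:  " + simulate(false));
        System.out.println("time stored before queue check: " + simulate(true));
    }
}
```

In this model, with a 30s checkpoint and a 10s minimum pause, the first ordering starts one queued checkpoint immediately at completion, while the second starts none until the pause has elapsed.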

On Sat, Aug 8, 2020 at 9:26 AM Roman Khachatryan <[hidden email]> wrote:

> Hi Thomas,
>
> Thanks a lot for the detailed information.
>
> I think the problem is in CheckpointCoordinator. It stores the last
> checkpoint completion time after checking queued requests.
> I've created a ticket to fix this:
> https://issues.apache.org/jira/browse/FLINK-18856
>
>
> On Sat, Aug 8, 2020 at 5:25 AM Thomas Weise <[hidden email]> wrote:
>
>> Just another update:
>>
>> The duration of snapshotState is capped by the Kinesis
>> producer's "RecordTtl" setting (default 30s). The sleep time in flushSync
>> does not contribute to the observed behavior.
>>
>> I guess the open question is why, with the same settings, is 1.11 since
>> commit 355184d69a8519d29937725c8d85e8465d7e3a90 processing more checkpoints?
>>
>>
>> On Fri, Aug 7, 2020 at 9:15 AM Thomas Weise <[hidden email]> wrote:
>>
>>> Hi Roman,
>>>
>>> Here are the checkpoint summaries for both commits:
>>>
>>>
>>> https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit#slide=id.g86d15b2fc7_0_0
>>>
>>> The config:
>>>
>>>     CheckpointConfig checkpointConfig = env.getCheckpointConfig();
>>>     checkpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
>>>     checkpointConfig.setCheckpointInterval(10_000); // marked bold in the original mail
>>>     checkpointConfig.setMinPauseBetweenCheckpoints(10_000); // marked bold in the original mail
>>>     checkpointConfig.enableExternalizedCheckpoints(DELETE_ON_CANCELLATION);
>>>     checkpointConfig.setCheckpointTimeout(600_000);
>>>     checkpointConfig.setMaxConcurrentCheckpoints(1);
>>>     checkpointConfig.setFailOnCheckpointingErrors(true);
>>>
>>> Changing the two values marked bold (checkpoint interval and minimum
>>> pause) to 60_000 makes the symptom disappear. Meanwhile, I also verified
>>> that with the 1.11.0 release commit.
>>>
>>> I will take a look at the sleep time issue.
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>> On Fri, Aug 7, 2020 at 1:44 AM Roman Khachatryan <[hidden email]> wrote:
>>>
>>>> Hi Thomas,
>>>>
>>>> Thanks for your reply!
>>>>
>>>> I think you are right, we can remove this sleep and improve
>>>> KinesisProducer.
>>>> Probably, its snapshotState can also be sped up by forcing records to be
>>>> flushed more often.
>>>> Do you see that the 30s checkpointing duration is caused by KinesisProducer
>>>> (or maybe by other operators)?
>>>>
>>>> I'd also like to understand the reason behind this increase in
>>>> checkpoint frequency.
>>>> Can you please share these values:
>>>>  - execution.checkpointing.min-pause
>>>>  - execution.checkpointing.max-concurrent-checkpoints
>>>>  - execution.checkpointing.timeout
>>>>
>>>> And what is the "new" observed checkpoint frequency (or how many
>>>> checkpoints are created) compared to older versions?
>>>>
>>>>
>>>> On Fri, Aug 7, 2020 at 4:49 AM Thomas Weise <[hidden email]> wrote:
>>>>
>>>>> Hi Roman,
>>>>>
>>>>> Indeed there are more frequent checkpoints with this change! The
>>>>> application was configured to checkpoint every 10s. With 1.10 ("good
>>>>> commit"), that leads to fewer completed checkpoints compared to 1.11
>>>>> ("bad commit"). Just to be clear, the only difference between the two
>>>>> runs was the commit 355184d69a8519d29937725c8d85e8465d7e3a90.
>>>>>
>>>>> Since the sync part of checkpoints with the Kinesis producer always takes
>>>>> ~30 seconds, the 10s configured checkpoint frequency really had no effect
>>>>> before 1.11. I confirmed that both commits perform comparably by setting
>>>>> the checkpoint frequency and min pause to 60s.
>>>>>
>>>>> I still have to verify with the final 1.11.0 release commit.
>>>>>
>>>>> It's probably good to take a look at the Kinesis producer. Is it really
>>>>> necessary to have 500ms sleep time? What's responsible for the ~30s
>>>>> duration in snapshotState?
>>>>>
>>>>> As things stand it doesn't make sense to use checkpoint intervals < 30s
>>>>> when using the Kinesis producer.
>>>>>
>>>>> Thanks,
>>>>> Thomas
>>>>>
>>>>> On Sat, Aug 1, 2020 at 2:53 PM Roman Khachatryan <[hidden email]> wrote:
>>>>>
>>>>> > Hi Thomas,
>>>>> >
>>>>> > Thanks a lot for the analysis.
>>>>> >
>>>>> > The first thing that I'd check is whether checkpoints became more
>>>>> > frequent with this commit (as each of them adds at least 500ms if there
>>>>> > is at least one not sent record, according to
>>>>> > FlinkKinesisProducer.snapshotState).
>>>>> >
>>>>> > Can you share checkpointing statistics (1.10 vs 1.11 or last "good" vs
>>>>> > first "bad" commits)?
>>>>> >
>>>>> > On Fri, Jul 31, 2020 at 5:29 AM Thomas Weise <[hidden email]> wrote:
>>>>> >
>>>>> > > I ran git bisect and the first commit that shows the regression is:
>>>>> > >
>>>>> > > https://github.com/apache/flink/commit/355184d69a8519d29937725c8d85e8465d7e3a90
>>>>> > >
>>>>> > >
>>>>> > > On Thu, Jul 23, 2020 at 6:46 PM Kurt Young <[hidden email]> wrote:
>>>>> > >
>>>>> > > > From my experience, Java profilers are sometimes not accurate enough
>>>>> > > > to find the root cause of a performance regression. In this case, I
>>>>> > > > would suggest you try out Intel VTune Amplifier to look at more
>>>>> > > > detailed metrics.
>>>>> > > >
>>>>> > > > Best,
>>>>> > > > Kurt
>>>>> > > >
>>>>> > > >
>>>>> > > > On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise <[hidden email]> wrote:
>>>>> > > >
>>>>> > > > > The cause of the issue is anything but clear.
>>>>> > > > >
>>>>> > > > > Previously I had mentioned that there is no suspect change to the
>>>>> > > > > Kinesis connector and that I had reverted the AWS SDK change to no
>>>>> > > > > effect.
>>>>> > > > >
>>>>> > > > > https://issues.apache.org/jira/browse/FLINK-17496 actually fixed
>>>>> > > > > another regression in the previous release and is present before and
>>>>> > > > > after.
>>>>> > > > >
>>>>> > > > > I repeated the run with 1.11.0 core and downgraded the entire Kinesis
>>>>> > > > > connector to 1.10.1: Nothing changes, i.e. the regression is still
>>>>> > > > > present. Therefore we will need to look elsewhere for the root cause.
>>>>> > > > >
>>>>> > > > > Regarding the time spent in snapshotState, repeat runs reveal a wide
>>>>> > > > > range for both versions, 1.10 and 1.11. So again this is nothing
>>>>> > > > > pointing to a root cause.
>>>>> > > > >
>>>>> > > > > At this point, I have no ideas remaining other than doing a bisect to
>>>>> > > > > find the culprit. Any other suggestions?
>>>>> > > > >
>>>>> > > > > Thomas
>>>>> > > > >
>>>>> > > > >
>>>>> > > > > On Thu, Jul 16, 2020 at 9:19 PM Zhijiang <[hidden email].invalid> wrote:
>>>>> > > > >
>>>>> > > > > > Hi Thomas,
>>>>> > > > > >
>>>>> > > > > > Thanks for your further profiling information. Glad to see we have
>>>>> > > > > > already pinned down the location that causes the regression.
>>>>> > > > > > Actually, I was also suspicious of #snapshotState in previous
>>>>> > > > > > discussions, since it indeed takes a long time and blocks normal
>>>>> > > > > > operator processing.
>>>>> > > > > >
>>>>> > > > > > Based on your feedback below, the sleep time during #snapshotState
>>>>> > > > > > might be the main concern, and I also dug into the implementation of
>>>>> > > > > > FlinkKinesisProducer#snapshotState.
>>>>> > > > > > while (producer.getOutstandingRecordsCount() > 0) {
>>>>> > > > > >    producer.flush();
>>>>> > > > > >    try {
>>>>> > > > > >       Thread.sleep(500);
>>>>> > > > > >    } catch (InterruptedException e) {
>>>>> > > > > >       LOG.warn("Flushing was interrupted.");
>>>>> > > > > >       break;
>>>>> > > > > >    }
>>>>> > > > > > }
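
The polling pattern in the loop above can be sketched in isolation as follows. This is a minimal sketch under stated assumptions: `OutstandingRecords` is a hypothetical interface standing in for the two producer calls used above, and the poll interval and timeout parameters are illustrative, not settings of the Kinesis producer or of the Flink connector.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the two producer calls used in the quoted loop. */
interface OutstandingRecords {
    int getOutstandingRecordsCount();

    void flush();
}

class FlushSketch {
    /**
     * Flushes until nothing is outstanding, the timeout passes, or the thread is interrupted.
     * Returns true only if the buffer drained in time.
     */
    static boolean flushSync(OutstandingRecords producer, long pollMs, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (producer.getOutstandingRecordsCount() > 0) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: records are still outstanding
            }
            producer.flush();
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt visible to callers
                return false;
            }
        }
        return true;
    }

    /** Fake producer (assumption for the demo): each flush call delivers one record. */
    static OutstandingRecords fakeProducer(int outstanding) {
        AtomicInteger remaining = new AtomicInteger(outstanding);
        return new OutstandingRecords() {
            @Override
            public int getOutstandingRecordsCount() {
                return remaining.get();
            }

            @Override
            public void flush() {
                remaining.updateAndGet(n -> Math.max(0, n - 1));
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(flushSync(fakeProducer(3), 5, 1_000));
    }
}
```

A shorter poll interval only lets the loop notice an empty buffer sooner; it does not shorten the time the producer itself needs to deliver records.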
>>>>> > > > > > It seems that the sleep time is mainly affected by the
>>>>> > > > > > internal operations of the KinesisProducer implementation
>>>>> > > > > > provided by amazonaws, which I am not very familiar with.
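One way to see how much of the sync time the loop above spends sleeping is to instrument it. The following is a minimal sketch, not the connector's code: `Producer` is a hypothetical interface standing in for the two KinesisProducer calls used in the loop, `fakeProducer` is an invented backlog that drains at a fixed rate, and the sleep interval is made a parameter so the sketch runs quickly.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FlushLoopTiming {

    /** Hypothetical stand-in for the two producer calls used in the quoted loop. */
    interface Producer {
        int getOutstandingRecordsCount();
        void flush();
    }

    /** Result of one instrumented flush: number of sleeps and total wall time. */
    static final class FlushStats {
        final int sleeps;
        final long elapsedMillis;

        FlushStats(int sleeps, long elapsedMillis) {
            this.sleeps = sleeps;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Same control flow as the quoted loop, with a configurable sleep and counters added. */
    static FlushStats flushAndMeasure(Producer producer, long sleepMillis) {
        long start = System.nanoTime();
        int sleeps = 0;
        while (producer.getOutstandingRecordsCount() > 0) {
            producer.flush();
            try {
                Thread.sleep(sleepMillis);
                sleeps++;
            } catch (InterruptedException e) {
                // Unlike the quoted loop, restore the interrupt flag before leaving.
                Thread.currentThread().interrupt();
                break;
            }
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new FlushStats(sleeps, elapsedMillis);
    }

    /** Invented producer whose backlog shrinks by a fixed amount on every flush call. */
    static Producer fakeProducer(int outstanding, int drainedPerFlush) {
        AtomicInteger remaining = new AtomicInteger(outstanding);
        return new Producer() {
            @Override
            public int getOutstandingRecordsCount() {
                return remaining.get();
            }

            @Override
            public void flush() {
                remaining.updateAndGet(r -> Math.max(0, r - drainedPerFlush));
            }
        };
    }

    public static void main(String[] args) {
        FlushStats stats = flushAndMeasure(fakeProducer(100, 30), 10);
        System.out.println("sleeps=" + stats.sleeps + " elapsedMs=" + stats.elapsedMillis);
    }
}
```

Because the backlog is only re-checked after each sleep, the time spent in the loop grows in steps of the sleep interval. With the 500 ms used in the quoted code, four iterations would already cost about 2 s, regardless of how quickly the backlog actually drained within each interval.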
>>>>> > > > > > But I noticed two related upgrades in release-1.11.0: one
>>>>> > > > > > upgrades amazon-kinesis-producer to 0.14.0 [1] and the other
>>>>> > > > > > upgrades aws-sdk-version to 1.11.754 [2].
>>>>> > > > > > You mentioned that you already reverted the SDK upgrade and
>>>>> > > > > > verified that nothing changed. Did you also revert [1] to
>>>>> > > > > > verify?
>>>>> > > > > > [1] https://issues.apache.org/jira/browse/FLINK-17496
>>>>> > > > > > [2] https://issues.apache.org/jira/browse/FLINK-14881
>>>>> > > > > >
>>>>> > > > > > Best,
>>>>> > > > > > Zhijiang
>>>>> > > > > >
>>>>> ------------------------------------------------------------------
>>>>> > > > > > From:Thomas Weise <[hidden email]>
>>>>> > > > > > Send Time: 2020-07-17 (Friday) 05:29
>>>>> > > > > > To:dev <[hidden email]>
>>>>> > > > > > Cc:Zhijiang <[hidden email]>; Stephan Ewen <
>>>>> > > > [hidden email]
>>>>> > > > > >;
>>>>> > > > > > Arvid Heise <[hidden email]>; Aljoscha Krettek <
>>>>> > > > [hidden email]
>>>>> > > > > >
>>>>> > > > > > Subject:Re: Kinesis Performance Issue (was [VOTE] Release
>>>>> 1.11.0,
>>>>> > > > release
>>>>> > > > > > candidate #4)
>>>>> > > > > >
>>>>> > > > > > Sorry for the delay.
>>>>> > > > > >
>>>>> > > > > > I confirmed that the regression is due to the sink
>>>>> (unsurprising,
>>>>> > > since
>>>>> > > > > > another job with the same consumer, but not the producer,
>>>>> runs as
>>>>> > > > > > expected).
>>>>> > > > > >
>>>>> > > > > > As promised I did CPU profiling on the problematic
>>>>> application,
>>>>> > which
>>>>> > > > > gives
>>>>> > > > > > more insight into the regression [1]
>>>>> > > > > >
>>>>> > > > > > The screenshots show that the average time for snapshotState
>>>>> > > increases
>>>>> > > > > from
>>>>> > > > > > ~9s to ~28s. The data also shows the increase in sleep time
>>>>> during
>>>>> > > > > > snapshotState.
>>>>> > > > > >
>>>>> > > > > > Does anyone, based on changes made in 1.11, have a theory
>>>>> why?
>>>>> > > > > >
>>>>> > > > > > I had previously looked at the changes to the Kinesis
>>>>> connector and
>>>>> > > > also
>>>>> > > > > > reverted the SDK upgrade, which did not change the situation.
>>>>> > > > > >
>>>>> > > > > > It will likely be necessary to drill into the sink /
>>>>> checkpointing
>>>>> > > > > details
>>>>> > > > > > to understand the cause of the problem.
>>>>> > > > > >
>>>>> > > > > > Let me know if anyone has specific questions that I can
>>>>> answer from
>>>>> > > the
>>>>> > > > > > profiling results.
>>>>> > > > > >
>>>>> > > > > > Thomas
>>>>> > > > > >
>>>>> > > > > > [1]
>>>>> > > > > > https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing
>>>>> > > > > >
>>>>> > > > > > On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <
>>>>> [hidden email]>
>>>>> > > wrote:
>>>>> > > > > >
>>>>> > > > > > > + dev@ for visibility
>>>>> > > > > > >
>>>>> > > > > > > I will investigate further today.
>>>>> > > > > > >
>>>>> > > > > > >
>>>>> > > > > > > On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <
>>>>> > > [hidden email]
>>>>> > > > >
>>>>> > > > > > > wrote:
>>>>> > > > > > >
>>>>> > > > > > >> On 06.07.20 20:39, Stephan Ewen wrote:
>>>>> > > > > > >> >    - Did sink checkpoint notifications change in a
>>>>> relevant
>>>>> > way,
>>>>> > > > for
>>>>> > > > > > >> example
>>>>> > > > > > >> > due to some Kafka issues we addressed in 1.11 (@Aljoscha
>>>>> > maybe?)
>>>>> > > > > > >>
>>>>> > > > > > >> I think that's unrelated: the Kafka fixes were isolated
>>>>> in Kafka
>>>>> > > and
>>>>> > > > > the
>>>>> > > > > > >> one bug I discovered on the way was about the Task reaper.
>>>>> > > > > > >>
>>>>> > > > > > >>
>>>>> > > > > > >> On 07.07.20 17:51, Zhijiang wrote:
>>>>> > > > > > >> > Sorry for misunderstanding the previous information, Thomas.
>>>>> > > > > > >> > I was assuming that the sync checkpoint duration increased
>>>>> > > > > > >> > after the upgrade, as was mentioned before.
>>>>> > > > > > >> >
>>>>> > > > > > >> > If I remember correctly, the memory state backend also has
>>>>> > > > > > >> > the same issue? If so, we can rule out the RocksDB state
>>>>> > > > > > >> > changes. With slot sharing enabled, the downstream and
>>>>> > > > > > >> > upstream tasks are probably deployed into the same slot, so
>>>>> > > > > > >> > there is no network shuffle effect.
>>>>> > > > > > >> >
>>>>> > > > > > >> > I think we need to find out whether any other symptoms
>>>>> > > > > > >> > changed besides the performance regression, to narrow down
>>>>> > > > > > >> > the scope, e.g. any metrics changes, or changes in the number
>>>>> > > > > > >> > of TaskManagers and the number of slots per TaskManager in
>>>>> > > > > > >> > the deployment.
>>>>> > > > > > >> > A 40% regression is really big; I guess the change should
>>>>> > > > > > >> > also be reflected in other places.
>>>>> > > > > > >> >
>>>>> > > > > > >> > I am not sure whether we can reproduce the regression in our
>>>>> > > > > > >> > AWS environment by writing arbitrary Kinesis jobs, since, as
>>>>> > > > > > >> > Thomas mentioned, other Kinesis jobs run normally after the
>>>>> > > > > > >> > upgrade.
>>>>> > > > > > >> > So it probably hits some corner case. I am very willing to
>>>>> > > > > > >> > help with debugging if possible.
>>>>> > > > > > >> >
>>>>> > > > > > >> >
>>>>> > > > > > >> > Best,
>>>>> > > > > > >> > Zhijiang
>>>>> > > > > > >> >
>>>>> > > > > > >> >
>>>>> > > > > > >> >
>>>>> > > ------------------------------------------------------------------
>>>>> > > > > > >> > From:Thomas Weise <[hidden email]>
>>>>> > > > > > >> > Send Time: 2020-07-07 (Tuesday) 23:01
>>>>> > > > > > >> > To:Stephan Ewen <[hidden email]>
>>>>> > > > > > >> > Cc:Aljoscha Krettek <[hidden email]>; Arvid Heise
>>>>> <
>>>>> > > > > > >> [hidden email]>; Zhijiang <
>>>>> [hidden email]>
>>>>> > > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE]
>>>>> Release
>>>>> > > 1.11.0,
>>>>> > > > > > >> release candidate #4)
>>>>> > > > > > >> >
>>>>> > > > > > >> > We are deploying our apps with FlinkK8sOperator. We
>>>>> have one
>>>>> > job
>>>>> > > > > that
>>>>> > > > > > >> works as expected after the upgrade and the one discussed
>>>>> here
>>>>> > > that
>>>>> > > > > has
>>>>> > > > > > the
>>>>> > > > > > >> performance regression.
>>>>> > > > > > >> >
>>>>> > > > > > >> > "The performance regression is obviously caused by the long
>>>>> > > > > > >> > duration of the sync checkpoint process in the Kinesis sink
>>>>> > > > > > >> > operator, which blocks normal data processing until it
>>>>> > > > > > >> > back-pressures the source."
>>>>> > > > > > >> >
>>>>> > > > > > >> > That's a constant. Before (1.10) and after the upgrade, the
>>>>> > > > > > >> > sync checkpointing time is the same. The question is what
>>>>> > > > > > >> > change came in with the upgrade.
>>>>> > > > > > >> >
>>>>> > > > > > >> >
>>>>> > > > > > >> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <
>>>>> [hidden email]
>>>>> > >
>>>>> > > > > wrote:
>>>>> > > > > > >> >
>>>>> > > > > > >> > @Thomas Just one thing real quick: Are you using the
>>>>> > standalone
>>>>> > > > > setup
>>>>> > > > > > >> scripts (like start-cluster.sh, and the former "slaves"
>>>>> file) ?
>>>>> > > > > > >> > Be aware that this is now called "workers" because of
>>>>> avoiding
>>>>> > > > > > >> sensitive names.
>>>>> > > > > > >> > In one internal benchmark we saw quite a lot of slowdown
>>>>> > > > initially,
>>>>> > > > > > >> before seeing that the cluster was not a distributed
>>>>> cluster any
>>>>> > > > more
>>>>> > > > > > ;-)
>>>>> > > > > > >> >
>>>>> > > > > > >> >
>>>>> > > > > > >> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <
>>>>> > > > [hidden email]
>>>>> > > > > >
>>>>> > > > > > >> wrote:
>>>>> > > > > > >> > Thanks for kicking this off and helping with the analysis,
>>>>> > > > > > >> > Stephan!
>>>>> > > > > > >> > Thanks for the further feedback and investigation, Thomas!
>>>>> > > > > > >> >
>>>>> > > > > > >> > The performance regression is obviously caused by the long
>>>>> > > > > > >> > duration of the sync checkpoint process in the Kinesis sink
>>>>> > > > > > >> > operator, which blocks normal data processing until it
>>>>> > > > > > >> > back-pressures the source.
>>>>> > > > > > >> > Maybe we could dig into the sync part of the checkpoint, e.g.
>>>>> > > > > > >> > break down the steps inside the respective
>>>>> > > > > > >> > operator#snapshotState and measure which operation costs most
>>>>> > > > > > >> > of the time; then we can probably find the root cause of that
>>>>> > > > > > >> > cost.
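The step-by-step breakdown suggested above can be sketched with a small timer that accumulates wall time per named step. This is only an illustration: `SnapshotStepTimer` and the step names are hypothetical and not part of Flink or the Kinesis connector, and the clock is injected so the class can be exercised without real waiting.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

public class SnapshotStepTimer {

    // Accumulated nanoseconds per step name, in first-seen order.
    private final Map<String, Long> nanosByStep = new LinkedHashMap<>();
    private final LongSupplier nanoClock;

    SnapshotStepTimer(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    /** Runs the body and adds its duration to the total recorded for the given step. */
    void time(String step, Runnable body) {
        long start = nanoClock.getAsLong();
        try {
            body.run();
        } finally {
            nanosByStep.merge(step, nanoClock.getAsLong() - start, Long::sum);
        }
    }

    /** Read-only view of the accumulated time per step. */
    Map<String, Long> totals() {
        return Collections.unmodifiableMap(nanosByStep);
    }

    /** Name of the step with the largest accumulated time, or null if nothing was timed. */
    String slowestStep() {
        return nanosByStep.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    public static void main(String[] args) {
        SnapshotStepTimer timer = new SnapshotStepTimer(System::nanoTime);
        timer.time("flush", () -> { /* a flush call would go here */ });
        timer.time("sleep", () -> {
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println(timer.totals() + " slowest=" + timer.slowestStep());
    }
}
```

Wrapping each phase of a snapshot method in `time(...)` and logging `totals()` once per checkpoint would show whether the time goes into the flush calls themselves or into waiting between them.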
>>>>> > > > > > >> >
>>>>> > > > > > >> > Looking forward to further progress. :)
>>>>> > > > > > >> >
>>>>> > > > > > >> > Best,
>>>>> > > > > > >> > Zhijiang
>>>>> > > > > > >> >
>>>>> > > > > > >> >
>>>>> > > ------------------------------------------------------------------
>>>>> > > > > > >> > From:Stephan Ewen <[hidden email]>
>>>>> > > > > > >> > Send Time: 2020-07-07 (Tuesday) 14:52
>>>>> > > > > > >> > To:Thomas Weise <[hidden email]>
>>>>> > > > > > >> > Cc:Stephan Ewen <[hidden email]>; Zhijiang <
>>>>> > > > > > >> [hidden email]>; Aljoscha Krettek <
>>>>> > > [hidden email]
>>>>> > > > >;
>>>>> > > > > > >> Arvid Heise <[hidden email]>
>>>>> > > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE]
>>>>> Release
>>>>> > > 1.11.0,
>>>>> > > > > > >> release candidate #4)
>>>>> > > > > > >> >
>>>>> > > > > > >> > Thank you for digging so deeply.
>>>>> > > > > > >> > Mysterious thing, this regression.
>>>>> > > > > > >> >
>>>>> > > > > > >> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <[hidden email]
>>>>> >
>>>>> > wrote:
>>>>> > > > > > >> > @Stephan: yes, I refer to sync time in the web UI (it is
>>>>> > > unchanged
>>>>> > > > > > >> between 1.10 and 1.11 for the specific pipeline).
>>>>> > > > > > >> >
>>>>> > > > > > >> > I verified that increasing the checkpointing interval
>>>>> does not
>>>>> > > > make
>>>>> > > > > a
>>>>> > > > > > >> difference.
>>>>> > > > > > >> >
>>>>> > > > > > >> > I looked at the Kinesis connector changes since 1.10.1
>>>>> and
>>>>> > don't
>>>>> > > > see
>>>>> > > > > > >> anything that could cause this.
>>>>> > > > > > >> >
>>>>> > > > > > >> > Another pipeline that is using the Kinesis consumer
>>>>> (but not
>>>>> > the
>>>>> > > > > > >> producer) performs as expected.
>>>>> > > > > > >> >
>>>>> > > > > > >> > I tried reverting the AWS SDK version change, symptoms
>>>>> remain
>>>>> > > > > > unchanged:
>>>>> > > > > > >> >
>>>>> > > > > > >> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml b/flink-connectors/flink-connector-kinesis/pom.xml
>>>>> > > > > > >> > index a6abce23ba..741743a05e 100644
>>>>> > > > > > >> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
>>>>> > > > > > >> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
>>>>> > > > > > >> > @@ -33,7 +33,7 @@ under the License.
>>>>> > > > > > >> >          <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
>>>>> > > > > > >> >          <name>flink-connector-kinesis</name>
>>>>> > > > > > >> >          <properties>
>>>>> > > > > > >> > -                <aws.sdk.version>1.11.754</aws.sdk.version>
>>>>> > > > > > >> > +                <aws.sdk.version>1.11.603</aws.sdk.version>
>>>>> > > > > > >> >                  <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
>>>>> > > > > > >> >                  <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
>>>>> > > > > > >> >                  <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
>>>>> > > > > > >> >
>>>>> > > > > > >> > I'm planning to take a look with a profiler next.
>>>>> > > > > > >> >
>>>>> > > > > > >> > Thomas
>>>>> > > > > > >> >
>>>>> > > > > > >> >
>>>>> > > > > > >> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <
>>>>> > [hidden email]>
>>>>> > > > > > wrote:
>>>>> > > > > > >> > Hi all!
>>>>> > > > > > >> >
>>>>> > > > > > >> > Forking this thread out of the release vote thread.
>>>>> > > > > > >> >  From what Thomas describes, it really sounds like a
>>>>> > > sink-specific
>>>>> > > > > > >> issue.
>>>>> > > > > > >> >
>>>>> > > > > > >> > @Thomas: When you say sink has a long synchronous
>>>>> checkpoint
>>>>> > > time,
>>>>> > > > > you
>>>>> > > > > > >> mean the time that is shown as "sync time" on the metrics
>>>>> and
>>>>> > web
>>>>> > > > UI?
>>>>> > > > > > That
>>>>> > > > > > >> is not including any network buffer related operations.
>>>>> It is
>>>>> > > purely
>>>>> > > > > the
>>>>> > > > > > >> operator's time.
>>>>> > > > > > >> >
>>>>> > > > > > >> > Can we dig into the changes we did in sinks:
>>>>> > > > > > >> >    - Kinesis version upgrade, AWS library updates
>>>>> > > > > > >> >
>>>>> > > > > > >> >    - Could it be that some call (checkpoint complete) that was
>>>>> > > > > > >> >      previously (1.10) in a separate thread is now in the
>>>>> > > > > > >> >      mailbox, and this simply reduces the number of threads
>>>>> > > > > > >> >      that do the work?
>>>>> > > > > > >> >
>>>>> > > > > > >> >    - Did sink checkpoint notifications change in a
>>>>> relevant
>>>>> > way,
>>>>> > > > for
>>>>> > > > > > >> example due to some Kafka issues we addressed in 1.11
>>>>> (@Aljoscha
>>>>> > > > > maybe?)
>>>>> > > > > > >> >
>>>>> > > > > > >> > Best,
>>>>> > > > > > >> > Stephan
>>>>> > > > > > >> >
>>>>> > > > > > >> >
>>>>> > > > > > >> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <
>>>>> > > > [hidden email]
>>>>> > > > > > .invalid>
>>>>> > > > > > >> wrote:
>>>>> > > > > > >> > Hi Thomas,
>>>>> > > > > > >> >
>>>>> > > > > > >> >   Regarding [2], there are more details in the Jira description
>>>>> > > > > > >> >   (https://issues.apache.org/jira/browse/FLINK-16404).
>>>>> > > > > > >> >
>>>>> > > > > > >> >   I can also give a basic explanation here to address the
>>>>> > > > > > >> >   concern.
>>>>> > > > > > >> >   1. In the past, the buffers following the barrier were cached
>>>>> > > > > > >> >      on the downstream side before alignment.
>>>>> > > > > > >> >   2. In 1.11, the upstream does not send the buffers following
>>>>> > > > > > >> >      the barrier. When the downstream finishes the alignment,
>>>>> > > > > > >> >      it notifies the upstream to continue sending the following
>>>>> > > > > > >> >      buffers, since it can process them after alignment.
>>>>> > > > > > >> >   3. The only difference is whether the temporarily blocked
>>>>> > > > > > >> >      buffers are cached on the downstream side or on the
>>>>> > > > > > >> >      upstream side before alignment.
>>>>> > > > > > >> >   4. The side effect is the additional notification cost for
>>>>> > > > > > >> >      every barrier alignment. If the downstream and upstream
>>>>> > > > > > >> >      are deployed in separate TaskManagers, the cost is the
>>>>> > > > > > >> >      network transport delay (the effect can be ignored based
>>>>> > > > > > >> >      on our testing with a 1s checkpoint interval). With slot
>>>>> > > > > > >> >      sharing, as in your case, the cost is only one method call
>>>>> > > > > > >> >      in the processor, which can also be ignored.
>>>>> > > > > > >> >
>>>>> > > > > > >> >   You mentioned "In this case, the downstream task has a high
>>>>> > > > > > >> >   average checkpoint duration(~30s, sync part)." This duration
>>>>> > > > > > >> >   does not reflect the changes above; it only indicates the
>>>>> > > > > > >> >   duration of calling `Operation.snapshotState`.
>>>>> > > > > > >> >   If this duration is longer than you expect, you can check or
>>>>> > > > > > >> >   debug whether the source/sink operators take more time to
>>>>> > > > > > >> >   finish `snapshotState` in practice. E.g. you can make the
>>>>> > > > > > >> >   implementation of this method empty to further verify the
>>>>> > > > > > >> >   effect.
>>>>> > > > > > >> >
>>>>> > > > > > >> >   Best,
>>>>> > > > > > >> >   Zhijiang
>>>>> > > > > > >> >
>>>>> > > > > > >> >
>>>>> > > > > > >> >
>>>>> > > >
>>>>> ------------------------------------------------------------------
>>>>> > > > > > >> >   From:Thomas Weise <[hidden email]>
>>>>> > > > > > >> >   Send Time: 2020-07-05 (Sunday) 12:22
>>>>> > > > > > >> >   To:dev <[hidden email]>; Zhijiang <
>>>>> > > > > [hidden email]
>>>>> > > > > > >
>>>>> > > > > > >> >   Cc:Yingjie Cao <[hidden email]>
>>>>> > > > > > >> >   Subject:Re: [VOTE] Release 1.11.0, release candidate
>>>>> #4
>>>>> > > > > > >> >
>>>>> > > > > > >> >   Hi Zhijiang,
>>>>> > > > > > >> >
>>>>> > > > > > >> >   Could you please point me to more details regarding:
>>>>> "[2]:
>>>>> > > Delay
>>>>> > > > > > send
>>>>> > > > > > >> the
>>>>> > > > > > >> >   following buffers after checkpoint barrier on
>>>>> upstream side
>>>>> > > > until
>>>>> > > > > > >> barrier
>>>>> > > > > > >> >   alignment on downstream side."
>>>>> > > > > > >> >
>>>>> > > > > > >> >   In this case, the downstream task has a high average
>>>>> > > checkpoint
>>>>> > > > > > >> duration
>>>>> > > > > > >> >   (~30s, sync part). If there was a change to hold
>>>>> buffers
>>>>> > > > depending
>>>>> > > > > > on
>>>>> > > > > > >> >   downstream performance, could this possibly apply to
>>>>> this
>>>>> > case
>>>>> > > > > (even
>>>>> > > > > > >> when
>>>>> > > > > > >> >   there is no shuffle that would require alignment)?
>>>>> > > > > > >> >
>>>>> > > > > > >> >   Thanks,
>>>>> > > > > > >> >   Thomas
>>>>> > > > > > >> >
>>>>> > > > > > >> >
>>>>> > > > > > >> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <
>>>>> > > > > [hidden email]
>>>>> > > > > > >> .invalid>
>>>>> > > > > > >> >   wrote:
>>>>> > > > > > >> >
>>>>> > > > > > >> >   > Hi Thomas,
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > Thanks for the further update information.
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > I guess we can rule out the network stack changes, since in
>>>>> > > > > > >> >   > your case the downstream and upstream would probably be
>>>>> > > > > > >> >   > deployed in the same slot, bypassing the network data
>>>>> > > > > > >> >   > shuffle.
>>>>> > > > > > >> >   > Also, I guess release-1.11 will not bring a general
>>>>> > > > > > >> >   > performance regression in the runtime engine, as we did
>>>>> > > > > > >> >   > performance testing for all general cases in [1] on a real
>>>>> > > > > > >> >   > cluster beforehand and the results met expectations. But as
>>>>> > > > > > >> >   > far as I know, we have not yet tested the specific source
>>>>> > > > > > >> >   > and sink connectors.
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > Regarding your 40% performance regression, I suspect it is
>>>>> > > > > > >> >   > related to specific source/sink changes (e.g. kinesis) or
>>>>> > > > > > >> >   > to environment issues in a corner case.
>>>>> > > > > > >> >   > If possible, it would be helpful to check whether the
>>>>> > > > > > >> >   > regression is caused by kinesis, by replacing the kinesis
>>>>> > > > > > >> >   > source & sink and keeping everything else the same.
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > As you said, it would be efficient to talk with you
>>>>> > > > > > >> >   > directly next week to discuss this issue further. We are
>>>>> > > > > > >> >   > eager to provide any help to resolve this issue soon.
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > Besides that, I guess this issue should not be a blocker
>>>>> > > > > > >> >   > for the release, since it is probably a corner case based
>>>>> > > > > > >> >   > on the current analysis.
>>>>> > > > > > >> >   > If we conclude that anything needs to be resolved after the
>>>>> > > > > > >> >   > final release, we can also make the next minor
>>>>> > > > > > >> >   > release-1.11.1 come soon.
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > [1]
>>>>> https://issues.apache.org/jira/browse/FLINK-18433
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > Best,
>>>>> > > > > > >> >   > Zhijiang
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   >
>>>>> > > > >
>>>>> ------------------------------------------------------------------
>>>>> > > > > > >> >   > From:Thomas Weise <[hidden email]>
>>>>> > > > > > >> >   > Send Time: 2020-07-04 (Saturday) 12:26
>>>>> > > > > > >> >   > To:dev <[hidden email]>; Zhijiang <
>>>>> > > > > > [hidden email]
>>>>> > > > > > >> >
>>>>> > > > > > >> >   > Cc:Yingjie Cao <[hidden email]>
>>>>> > > > > > >> >   > Subject:Re: [VOTE] Release 1.11.0, release
>>>>> candidate #4
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > Hi Zhijiang,
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > It will probably be best if we connect next week and
>>>>> > discuss
>>>>> > > > the
>>>>> > > > > > >> issue
>>>>> > > > > > >> >   > directly since this could be quite difficult to
>>>>> reproduce.
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > Before the testing result on our side comes out for
>>>>> your
>>>>> > > > > > respective
>>>>> > > > > > >> job
>>>>> > > > > > >> >   > case, I have some other questions to confirm for
>>>>> further
>>>>> > > > > analysis:
>>>>> > > > > > >> >   >     -  How much percentage regression you found
>>>>> after
>>>>> > > > switching
>>>>> > > > > to
>>>>> > > > > > >> 1.11?
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > ~40% throughput decline
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   >     -  Are there any network bottleneck in your
>>>>> cluster?
>>>>> > > E.g.
>>>>> > > > > the
>>>>> > > > > > >> network
>>>>> > > > > > >> >   > bandwidth is full caused by other jobs? If so, it
>>>>> might
>>>>> > have
>>>>> > > > > more
>>>>> > > > > > >> effects
>>>>> > > > > > >> >   > by above [2]
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > The test runs on a k8s cluster that is also used for other
>>>>> > > > > > >> >   > production jobs.
>>>>> > > > > > >> >   > There is no reason to believe the network is the bottleneck.
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   >     -  Did you adjust the default network buffer
>>>>> setting?
>>>>> > > E.g.
>>>>> > > > > > >> >   >
>>>>> "taskmanager.network.memory.floating-buffers-per-gate" or
>>>>> > > > > > >> >   > "taskmanager.network.memory.buffers-per-channel"
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > The job is using the defaults, i.e we don't
>>>>> configure the
>>>>> > > > > > settings.
>>>>> > > > > > >> If you
>>>>> > > > > > >> >   > want me to try specific settings in the hope that
>>>>> it will
>>>>> > > help
>>>>> > > > > to
>>>>> > > > > > >> isolate
>>>>> > > > > > >> >   > the issue please let me know.
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   >     -  I guess the topology has three vertexes
>>>>> > > > "KinesisConsumer
>>>>> > > > > ->
>>>>> > > > > > >> Chained
>>>>> > > > > > >> >   > FlatMap -> KinesisProducer", and the partition mode
>>>>> for
>>>>> > > > > > >> "KinesisConsumer ->
>>>>> > > > > > >> >   > FlatMap" and "FlatMap->KinesisProducer" are both
>>>>> > "forward"?
>>>>> > > If
>>>>> > > > > so,
>>>>> > > > > > >> the edge
>>>>> > > > > > >> >   > connection is one-to-one, not all-to-all, then the
>>>>> above
>>>>> > > > [1][2]
>>>>> > > > > > >> should no
>>>>> > > > > > >> >   > effects in theory with default network buffer
>>>>> setting.
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > There are only 2 vertices and the edge is "forward".
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   >     - By slot sharing, I guess these three vertex
>>>>> > > parallelism
>>>>> > > > > task
>>>>> > > > > > >> would
>>>>> > > > > > >> >   > probably be deployed into the same slot, then the
>>>>> data
>>>>> > > shuffle
>>>>> > > > > is
>>>>> > > > > > >> by memory
>>>>> > > > > > >> >   > queue, not network stack. If so, the above [2]
>>>>> should no
>>>>> > > > effect.
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > Yes, vertices share slots.
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   >     - I also saw some Jira changes for kinesis in
>>>>> this
>>>>> > > > release,
>>>>> > > > > > >> could you
>>>>> > > > > > >> >   > confirm that these changes would not effect the
>>>>> > performance?
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > I will need to take a look. 1.10 already had a
>>>>> regression
>>>>> > > > > > >> introduced by the
>>>>> > > > > > >> >   > Kinesis producer update.
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > Thanks,
>>>>> > > > > > >> >   > Thomas
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <
>>>>> > > > > > >> [hidden email]
>>>>> > > > > > >> >   > .invalid>
>>>>> > > > > > >> >   > wrote:
>>>>> > > > > > >> >   >
>>>>> > > > > > >> >   > > Hi Thomas,
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > Thanks for your reply with rich information!
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > We are trying to reproduce your case in our cluster to
>>>>> > > > > > >> >   > > verify it further, and @Yingjie Cao is working on it now.
>>>>> > > > > > >> >   > > As we do not have a Kinesis consumer and producer
>>>>> > > > > > >> >   > > internally, we will construct a common source and sink
>>>>> > > > > > >> >   > > instead, under backpressure.
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > First, we can rule out the RocksDB factor in this release,
>>>>> > > > > > >> >   > > since you also mentioned that "filesystem leads to same
>>>>> > > > > > >> >   > > symptoms".
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > Second, if my understanding is right, you emphasize that
>>>>> > > > > > >> >   > > the regression only exists for jobs with a low checkpoint
>>>>> > > > > > >> >   > > interval (10s).
>>>>> > > > > > >> >   > > Based on that, I have two suspicions regarding the
>>>>> > > > > > >> >   > > network-related changes in this release:
>>>>> > > > > > >> >   > >     - [1]: Limited the maximum backlog value
>>>>> (default
>>>>> > 10)
>>>>> > > in
>>>>> > > > > > >> subpartition
>>>>> > > > > > >> >   > > queue.
>>>>> > > > > > >> >   > >     - [2]: Delay send the following buffers after
>>>>> > > checkpoint
>>>>> > > > > > >> barrier on
>>>>> > > > > > >> >   > > upstream side until barrier alignment on
>>>>> downstream
>>>>> > side.
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > These changes are motivated by reducing the in-flight
>>>>> > > > > > >> >   > > buffers to speed up checkpoints, especially under
>>>>> > > > > > >> >   > > backpressure.
>>>>> > > > > > >> >   > > In theory they should have a very minor performance
>>>>> > > > > > >> >   > > effect, and we also tested them in a cluster to verify
>>>>> > > > > > >> >   > > they were within expectation before merging them, but
>>>>> > > > > > >> >   > > maybe there are other corner cases we have not thought of
>>>>> > > > > > >> >   > > before.
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > Before the testing result on our side comes out
>>>>> for your
>>>>> > > > > > >> respective job
>>>>> > > > > > >> >   > > case, I have some other questions to confirm for
>>>>> further
>>>>> > > > > > analysis:
>>>>> > > > > > >> >   > >     -  How much percentage regression you found
>>>>> after
>>>>> > > > > switching
>>>>> > > > > > >> to 1.11?
>>>>> > > > > > >> >   > >     -  Are there any network bottleneck in your
>>>>> cluster?
>>>>> > > > E.g.
>>>>> > > > > > the
>>>>> > > > > > >> network
>>>>> > > > > > >> >   > > bandwidth is full caused by other jobs? If so, it
>>>>> might
>>>>> > > have
>>>>> > > > > > more
>>>>> > > > > > >> effects
>>>>> > > > > > >> >   > > by above [2]
>>>>> > > > > > >> >   > >     -  Did you adjust the default network buffer
>>>>> > setting?
>>>>> > > > E.g.
>>>>> > > > > > >> >   > >
>>>>> "taskmanager.network.memory.floating-buffers-per-gate"
>>>>> > or
>>>>> > > > > > >> >   > > "taskmanager.network.memory.buffers-per-channel"
>>>>> > > > > > >> >   > >     -  I guess the topology has three vertexes
>>>>> > > > > "KinesisConsumer
>>>>> > > > > > ->
>>>>> > > > > > >> >   > Chained
>>>>> > > > > > >> >   > > FlatMap -> KinesisProducer", and the partition
>>>>> mode for
>>>>> > > > > > >> "KinesisConsumer
>>>>> > > > > > >> >   > ->
>>>>> > > > > > >> >   > > FlatMap" and "FlatMap->KinesisProducer" are both
>>>>> > > "forward"?
>>>>> > > > If
>>>>> > > > > > >> so, the
>>>>> > > > > > >> >   > edge
>>>>> > > > > > >> >   > > connection is one-to-one, not all-to-all, then
>>>>> the above
>>>>> > > > > [1][2]
>>>>> > > > > > >> should no
>>>>> > > > > > >> >   > > effects in theory with default network buffer
>>>>> setting.
>>>>> > > > > > >> >   > >     - By slot sharing, I guess these three vertex
>>>>> > > > parallelism
>>>>> > > > > > >> task would
>>>>> > > > > > >> >   > > probably be deployed into the same slot, then the
>>>>> data
>>>>> > > > shuffle
>>>>> > > > > > is
>>>>> > > > > > >> by
>>>>> > > > > > >> >   > memory
>>>>> > > > > > >> >   > > queue, not network stack. If so, the above [2]
>>>>> should no
>>>>> > > > > effect.
>>>>> > > > > > >> >   > >     - I also saw some Jira changes for kinesis in
>>>>> this
>>>>> > > > > release,
>>>>> > > > > > >> could you
>>>>> > > > > > >> >   > > confirm that these changes would not effect the
>>>>> > > performance?
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > Best,
>>>>> > > > > > >> >   > > Zhijiang
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > >
>>>>> > > > > >
>>>>> ------------------------------------------------------------------
>>>>> > > > > > >> >   > > From:Thomas Weise <[hidden email]>
>>>>> > > > > > >> >   > > Send Time: 2020-07-03 (Friday) 01:07
>>>>> > > > > > >> >   > > To:dev <[hidden email]>; Zhijiang <
>>>>> > > > > > >> [hidden email]>
>>>>> > > > > > >> >   > > Subject:Re: [VOTE] Release 1.11.0, release
>>>>> candidate #4
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > Hi Zhijiang,
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > The performance degradation manifests in backpressure which
>>>>> > > > > > >> >   > > leads to growing backlog in the source. I switched a few
>>>>> > > > > > >> >   > > times between 1.10 and 1.11 and the behavior is consistent.
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > The DAG is:
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map) -------- forward ---------> KinesisProducer
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > Parallelism: 160
>>>>> > > > > > >> >   > > No shuffle/rebalance.
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > Checkpointing config:
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > Checkpointing Mode: Exactly Once
>>>>> > > > > > >> >   > > Interval: 10s
>>>>> > > > > > >> >   > > Timeout: 10m 0s
>>>>> > > > > > >> >   > > Minimum Pause Between Checkpoints: 10s
>>>>> > > > > > >> >   > > Maximum Concurrent Checkpoints: 1
>>>>> > > > > > >> >   > > Persist Checkpoints Externally: Enabled (delete on cancellation)
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > State backend: rocksdb  (filesystem leads to same symptoms)
>>>>> > > > > > >> >   > > Checkpoint size is tiny (500KB)
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > An interesting difference to another job that I had upgraded
>>>>> > > > > > >> >   > > successfully is the low checkpointing interval.
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > Thanks,
>>>>> > > > > > >> >   > > Thomas
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <
>>>>> > > > > > >> [hidden email]
>>>>> > > > > > >> >   > > .invalid>
>>>>> > > > > > >> >   > > wrote:
>>>>> > > > > > >> >   > >
>>>>> > > > > > >> >   > > > Hi Thomas,
>>>>> > > > > > >> >   > > >
>>>>> > > > > > >> >   > > > Thanks for the efficient feedback.
>>>>> > > > > > >> >   > > >
>>>>> > > > > > >> >   > > > Regarding the suggestion of adding the release notes
>>>>> > > > > > >> >   > > > document, I agree with your point. Maybe we should adjust
>>>>> > > > > > >> >   > > > the vote template in the respective wiki accordingly to
>>>>> > > > > > >> >   > > > guide future release processes.
>>>>> > > > > > >> >   > > >
>>>>> > > > > > >> >   > > > Regarding the performance regression, could you provide
>>>>> > > > > > >> >   > > > some more details so that we can better measure or
>>>>> > > > > > >> >   > > > reproduce it on our side?
>>>>> > > > > > >> >   > > > E.g. I guess the topology only includes two vertices,
>>>>> > > > > > >> >   > > > source and sink?
>>>>> > > > > > >> >   > > > What is the parallelism of every vertex?
>>>>> > > > > > >> >   > > > Does the upstream shuffle data to the downstream via the
>>>>> > > > > > >> >   > > > rebalance partitioner or another one?
>>>>> > > > > > >> >   > > > Is the checkpoint mode exactly-once with the RocksDB
>>>>> > > > > > >> >   > > > state backend?
>>>>> > > > > > >> >   > > > Did backpressure happen in this case?
>>>>> > > > > > >> >   > > > What is the percentage of the regression in this case?
>>>>> > > > > > >> >   > > >
>>>>> > > > > > >> >   > > > Best,
>>>>> > > > > > >> >   > > > Zhijiang
>>>>> > > > > > >> >   > > >
>>>>> > > > > > >> >   > > >
>>>>> > > > > > >> >   > > >
>>>>> > > > > > >> >   > > >
>>>>> > > > > > >>
>>>>> > ------------------------------------------------------------------
>>>>> > > > > > >> >   > > > From:Thomas Weise <[hidden email]>
>>>>> > > > > > >> >   > > > Send Time: 2020-07-02 (Thursday) 09:54
>>>>> > > > > > >> >   > > > To:dev <[hidden email]>
>>>>> > > > > > >> >   > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>>>>> > > > > > >> >   > > >
>>>>> > > > > > >> >   > > > Hi Till,
>>>>> > > > > > >> >   > > >
>>>>> > > > > > >> >   > > > Yes, we don't have the setting in flink-conf.yaml.
>>>>> > > > > > >> >   > > >
>>>>> > > > > > >> >   > > > Generally, we carry forward the existing configuration
>>>>> > > > > > >> >   > > > and any change to default configuration values would
>>>>> > > > > > >> >   > > > impact the upgrade.
>>>>> > > > > > >> >   > > >
>>>>> > > > > > >> >   > > > Yes, since it is an incompatible change I would state it
>>>>> > > > > > >> >   > > > in the release notes.
>>>>> > > > > > >> >   > > >
>>>>> > > > > > >> >   > > > Thanks,
>>>>> > > > > > >> >   > > > Thomas
>>>>> > > > > > >> >   > > >
>>>>> > > > > > >> >   > > > BTW I found a performance regression while trying to
>>>>> > > > > > >> >   > > > upgrade another pipeline with this RC. It is a simple
>>>>> > > > > > >> >   > > > Kinesis to Kinesis job. Wasn't able to pin it down yet,
>>>>> > > > > > >> >   > > > symptoms include increased checkpoint alignment time.
>>>>> > > > > > >> >   > > >
>>>>> > > > > > >> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <
>>>>> > > > > > >> [hidden email]>
>>>>> > > > > > >> >   > > > wrote:
>>>>> > > > > > >> >   > > >
>>>>> > > > > > >> >   > > > > Hi Thomas,
>>>>> > > > > > >> >   > > > >
>>>>> > > > > > >> >   > > > > just to confirm: When starting the image in local mode,
>>>>> > > > > > >> >   > > > > then you don't have any of the JobManager memory
>>>>> > > > > > >> >   > > > > configuration settings configured in the effective
>>>>> > > > > > >> >   > > > > flink-conf.yaml, right? Does this mean that you have
>>>>> > > > > > >> >   > > > > explicitly removed `jobmanager.heap.size: 1024m` from
>>>>> > > > > > >> >   > > > > the default configuration? If this is the case, then I
>>>>> > > > > > >> >   > > > > believe it was more of an unintentional artifact that
>>>>> > > > > > >> >   > > > > it worked before and it has been corrected now so that
>>>>> > > > > > >> >   > > > > one needs to specify the memory of the JM process
>>>>> > > > > > >> >   > > > > explicitly. Do you think it would help to explicitly
>>>>> > > > > > >> >   > > > > state this in the release notes?
>>>>> > > > > > >> >   > > > >
>>>>> > > > > > >> >   > > > > Cheers,
>>>>> > > > > > >> >   > > > > Till
>>>>> > > > > > >> >   > > > >
>>>>> > > > > > >> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <
>>>>> > > > > [hidden email]
>>>>> > > > > > >
>>>>> > > > > > >> wrote:
>>>>> > > > > > >> >   > > > >
>>>>> > > > > > >> >   > > > > > Thanks for preparing another RC!
>>>>> > > > > > >> >   > > > > >
>>>>> > > > > > >> >   > > > > > As mentioned in the previous RC thread, it would be
>>>>> > > > > > >> >   > > > > > super helpful if the release notes that are part of
>>>>> > > > > > >> >   > > > > > the documentation can be included [1]. It's a
>>>>> > > > > > >> >   > > > > > significant time-saver to have read those first.
>>>>> > > > > > >> >   > > > > >
>>>>> > > > > > >> >   > > > > > I found one more non-backward compatible change that
>>>>> > > > > > >> >   > > > > > would be worth addressing/mentioning:
>>>>> > > > > > >> >   > > > > >
>>>>> > > > > > >> >   > > > > > It is now necessary to configure the jobmanager heap
>>>>> > > > > > >> >   > > > > > size in flink-conf.yaml (with either
>>>>> > > > > > >> >   > > > > > jobmanager.heap.size or jobmanager.memory.heap.size).
>>>>> > > > > > >> >   > > > > > Why would I not want to do that anyways? Well, we set
>>>>> > > > > > >> >   > > > > > it dynamically for a cluster deployment via the
>>>>> > > > > > >> >   > > > > > flinkk8soperator, but the container image can also be
>>>>> > > > > > >> >   > > > > > used for testing with local mode (./bin/jobmanager.sh
>>>>> > > > > > >> >   > > > > > start-foreground local). That will fail if the heap
>>>>> > > > > > >> >   > > > > > wasn't configured and that's how I noticed it.
>>>>> > > > > > >> >   > > > > >
>>>>> > > > > > >> >   > > > > > Thanks,
>>>>> > > > > > >> >   > > > > > Thomas
>>>>> > > > > > >> >   > > > > >
>>>>> > > > > > >> >   > > > > > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
>>>>> > > > > > >> >   > > > > >
>>>>> > > > > > >> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <
>>>>> > > > > > >> >   > [hidden email]
>>>>> > > > > > >> >   > > > > > .invalid>
>>>>> > > > > > >> >   > > > > > wrote:
>>>>> > > > > > >> >   > > > > >
>>>>> > > > > > >> >   > > > > > > Hi everyone,
>>>>> > > > > > >> >   > > > > > >
>>>>> > > > > > >> >   > > > > > > Please review and vote on the release candidate #4
>>>>> > > > > > >> >   > > > > > > for the version 1.11.0, as follows:
>>>>> > > > > > >> >   > > > > > > [ ] +1, Approve the release
>>>>> > > > > > >> >   > > > > > > [ ] -1, Do not approve the release (please provide
>>>>> > > > > > >> >   > > > > > > specific comments)
>>>>> > > > > > >> >   > > > > > >
>>>>> > > > > > >> >   > > > > > > The complete staging area is available for your
>>>>> > > > > > >> >   > > > > > > review, which includes:
>>>>> > > > > > >> >   > > > > > > * JIRA release notes [1],
>>>>> > > > > > >> >   > > > > > > * the official Apache source release and binary
>>>>> > > > > > >> >   > > > > > > convenience releases to be deployed to
>>>>> > > > > > >> >   > > > > > > dist.apache.org [2], which are signed with the key
>>>>> > > > > > >> >   > > > > > > with fingerprint
>>>>> > > > > > >> >   > > > > > > 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
>>>>> > > > > > >> >   > > > > > > * all artifacts to be deployed to the Maven Central
>>>>> > > > > > >> >   > > > > > > Repository [4],
>>>>> > > > > > >> >   > > > > > > * source code tag "release-1.11.0-rc4" [5],
>>>>> > > > > > >> >   > > > > > > * website pull request listing the new release and
>>>>> > > > > > >> >   > > > > > > adding announcement blog post [6].
>>>>> > > > > > >> >   > > > > > >
>>>>> > > > > > >> >   > > > > > > The vote will be open for at least 72 hours. It is
>>>>> > > > > > >> >   > > > > > > adopted by majority approval, with at least 3 PMC
>>>>> > > > > > >> >   > > > > > > affirmative votes.
>>>>> > > > > > >> >   > > > > > >
>>>>> > > > > > >> >   > > > > > > Thanks,
>>>>> > > > > > >> >   > > > > > > Release Manager
>>>>> > > > > > >> >   > > > > > >
>>>>> > > > > > >> >   > > > > > > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
>>>>> > > > > > >> >   > > > > > > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
>>>>> > > > > > >> >   > > > > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
>>>>> > > > > > >> >   > > > > > > [4] https://repository.apache.org/content/repositories/orgapacheflink-1377/
>>>>> > > > > > >> >   > > > > > > [5] https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
>>>>> > > > > > >> >   > > > > > > [6] https://github.com/apache/flink-web/pull/352
>>>>> > > > > > >> >   > > > > > >
>>>>> >
>>>>> > --
>>>>> > Regards,
>>>>> > Roman
>>>>> >
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Roman
>>>>
>>>
>
> --
> Regards,
> Roman
>


--
Regards,
Roman

Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

Thomas Weise
Hi Roman,

Thanks for working on this! I deployed the change and it appears to be
working as expected.

Will monitor over a period of time to compare the checkpoint counts and get
back to you if there are still issues.

Thomas


On Thu, Aug 13, 2020 at 3:41 AM Roman Khachatryan <[hidden email]>
wrote:

> Hi Thomas,
>
> The fix is now merged to master and to release-1.11.
> So if you'd like you can check if it solves your problem (it would be
> helpful for us too).
>
> On Sat, Aug 8, 2020 at 9:26 AM Roman Khachatryan <[hidden email]>
> wrote:
>
>> Hi Thomas,
>>
>> Thanks a lot for the detailed information.
>>
>> I think the problem is in CheckpointCoordinator. It stores the last
>> checkpoint completion time after checking queued requests.
>> I've created a ticket to fix this:
>> https://issues.apache.org/jira/browse/FLINK-18856
>>
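The ordering problem described above can be illustrated with a small stand-alone sketch. This is not the CheckpointCoordinator code; the class and method names are made up for illustration, and it assumes the minimum pause is measured from the completion time of the previous checkpoint. It only shows why checking a queued request against a stale completion timestamp lets the request through, while the updated timestamp would hold it back.

```java
public class MinPauseSketch {
    /** Returns true if a queued checkpoint request may start at nowMs. */
    public static boolean mayStart(long nowMs, long lastCompletionMs, long minPauseMs) {
        return nowMs >= lastCompletionMs + minPauseMs;
    }

    public static void main(String[] args) {
        long minPause = 10_000;            // 10s minimum pause, as in the job config
        long previousCompletion = 0;       // checkpoint N-1 completed at t = 0
        long currentCompletion = 30_000;   // checkpoint N completes at t = 30s
        long now = currentCompletion;

        // Queue checked BEFORE the completion time is stored: stale timestamp.
        boolean staleOrder = mayStart(now, previousCompletion, minPause);
        // Completion time stored first: the min pause is honored.
        boolean freshOrder = mayStart(now, currentCompletion, minPause);

        System.out.println("stale timestamp allows start: " + staleOrder);
        System.out.println("fresh timestamp allows start: " + freshOrder);
    }
}
```

With the stale timestamp the next checkpoint may start immediately after the previous one completes, which would be consistent with the higher checkpoint counts reported in this thread.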
>>
>> On Sat, Aug 8, 2020 at 5:25 AM Thomas Weise <[hidden email]> wrote:
>>
>>> Just another update:
>>>
>>> The duration of snapshotState is capped by the Kinesis
>>> producer's "RecordTtl" setting (default 30s). The sleep time in flushSync
>>> does not contribute to the observed behavior.
>>>
>>> I guess the open question is why, with the same settings, is 1.11 since
>>> commit 355184d69a8519d29937725c8d85e8465d7e3a90 processing more checkpoints?
>>>
>>>
>>> On Fri, Aug 7, 2020 at 9:15 AM Thomas Weise <[hidden email]> wrote:
>>>
>>>> Hi Roman,
>>>>
>>>> Here are the checkpoint summaries for both commits:
>>>>
>>>>
>>>> https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit#slide=id.g86d15b2fc7_0_0
>>>>
>>>> The config:
>>>>
>>>>     CheckpointConfig checkpointConfig = env.getCheckpointConfig();
>>>>     checkpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
>>>>     checkpointConfig.setCheckpointInterval(*10_000*);
>>>>     checkpointConfig.setMinPauseBetweenCheckpoints(*10_000*);
>>>>     checkpointConfig.enableExternalizedCheckpoints(DELETE_ON_CANCELLATION);
>>>>     checkpointConfig.setCheckpointTimeout(600_000);
>>>>     checkpointConfig.setMaxConcurrentCheckpoints(1);
>>>>     checkpointConfig.setFailOnCheckpointingErrors(true);
>>>>
>>>> Changing the values marked in bold to *60_000* makes the symptom
>>>> disappear. I have meanwhile also verified that with the 1.11.0 release commit.
>>>>
>>>> I will take a look at the sleep time issue.
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>> On Fri, Aug 7, 2020 at 1:44 AM Roman Khachatryan <
>>>> [hidden email]> wrote:
>>>>
>>>>> Hi Thomas,
>>>>>
>>>>> Thanks for your reply!
>>>>>
>>>>> I think you are right, we can remove this sleep and improve
>>>>> KinesisProducer.
>>>>> Probably, its snapshotState can also be sped up by forcing records
>>>>> flush more often.
>>>>> Do you see that 30s checkpointing duration is caused
>>>>> by KinesisProducer (or maybe other operators)?
>>>>>
>>>>> I'd also like to understand the reason behind this increase in
>>>>> checkpoint frequency.
>>>>> Can you please share these values:
>>>>>  - execution.checkpointing.min-pause
>>>>>  - execution.checkpointing.max-concurrent-checkpoints
>>>>>  - execution.checkpointing.timeout
>>>>>
>>>>> And what is the "new" observed checkpoint frequency (or how many
>>>>> checkpoints are created) compared to older versions?
>>>>>
>>>>>
>>>>> On Fri, Aug 7, 2020 at 4:49 AM Thomas Weise <[hidden email]> wrote:
>>>>>
>>>>>> Hi Roman,
>>>>>>
>>>>>> Indeed there are more frequent checkpoints with this change! The
>>>>>> application was configured to checkpoint every 10s. With 1.10 ("good
>>>>>> commit"), that leads to fewer completed checkpoints compared to 1.11
>>>>>> ("bad commit"). Just to be clear, the only difference between the two
>>>>>> runs was the commit 355184d69a8519d29937725c8d85e8465d7e3a90
>>>>>>
>>>>>> Since the sync part of checkpoints with the Kinesis producer always
>>>>>> takes ~30 seconds, the 10s configured checkpoint frequency really had
>>>>>> no effect before 1.11. I confirmed that both commits perform comparably
>>>>>> by setting the checkpoint frequency and min pause to 60s.
>>>>>>
>>>>>> I still have to verify with the final 1.11.0 release commit.
>>>>>>
>>>>>> It's probably good to take a look at the Kinesis producer. Is it
>>>>>> really necessary to have 500ms sleep time? What's responsible for the
>>>>>> ~30s duration in snapshotState?
>>>>>>
>>>>>> As things stand it doesn't make sense to use checkpoint intervals < 30s
>>>>>> when using the Kinesis producer.
>>>>>>
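The arithmetic behind that statement can be sketched as follows. This is a simplified model, not Flink code: it assumes a constant checkpoint duration, at most one concurrent checkpoint, and a minimum pause counted from the end of the previous checkpoint. The class and method names are hypothetical.

```java
public class EffectivePeriodSketch {
    /** Lower bound on the time between two checkpoint starts, in ms. */
    public static long effectivePeriodMs(long intervalMs, long minPauseMs, long durationMs) {
        // The next checkpoint can start neither before the configured interval
        // has passed nor before the previous one has finished and paused.
        return Math.max(intervalMs, durationMs + minPauseMs);
    }

    public static void main(String[] args) {
        long duration = 30_000; // ~30s reported for the sync part with the Kinesis producer

        // interval 10s, min pause 10s (the job's settings)
        System.out.println(effectivePeriodMs(10_000, 10_000, duration));
        // interval 60s, min pause 60s (the settings that made the symptom disappear)
        System.out.println(effectivePeriodMs(60_000, 60_000, duration));
        // interval 10s with the min pause not taking effect
        System.out.println(effectivePeriodMs(10_000, 0, duration));
    }
}
```

Under these assumptions a 10s interval is never reached: the period is bounded by the ~30s duration plus whatever pause is actually enforced.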
>>>>>> Thanks,
>>>>>> Thomas
>>>>>>
>>>>>> On Sat, Aug 1, 2020 at 2:53 PM Roman Khachatryan <
>>>>>> [hidden email]>
>>>>>> wrote:
>>>>>>
>>>>>> > Hi Thomas,
>>>>>> >
>>>>>> > Thanks a lot for the analysis.
>>>>>> >
>>>>>> > The first thing that I'd check is whether checkpoints became more
>>>>>> > frequent with this commit (as each of them adds at least 500ms if
>>>>>> > there is at least one not sent record, according to
>>>>>> > FlinkKinesisProducer.snapshotState).
>>>>>> >
>>>>>> > Can you share checkpointing statistics (1.10 vs 1.11 or last "good" vs
>>>>>> > first "bad" commits)?
>>>>>> >
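The "at least 500ms per checkpoint" point can be reproduced with a stand-alone sketch of the flush-and-sleep loop that is quoted further down in this thread. The producer is replaced by two hypothetical callbacks (an outstanding-record counter and a flush action), so this does not use the real KinesisProducer class; it only counts how often the loop sleeps.

```java
import java.util.function.IntSupplier;

public class FlushLoopSketch {
    /**
     * Flushes and sleeps until nothing is outstanding.
     * Returns how many times the loop slept.
     */
    public static int flushUntilEmpty(IntSupplier outstanding, Runnable flush, long sleepMs) {
        int sleeps = 0;
        while (outstanding.getAsInt() > 0) {
            flush.run();
            try {
                Thread.sleep(sleepMs);
                sleeps++;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt visible to callers
                break;
            }
        }
        return sleeps;
    }

    public static void main(String[] args) {
        // Stand-in producer: one record outstanding, delivered by the first flush.
        int[] pending = {1};
        int sleeps = flushUntilEmpty(() -> pending[0], () -> pending[0] = 0, 500);
        System.out.println(sleeps * 500 + " ms spent sleeping");
    }
}
```

Even when the single record is delivered by the first flush, the loop sleeps one full interval before it re-checks the counter, so the cost scales with the number of checkpoints taken.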
>>>>>> > On Fri, Jul 31, 2020 at 5:29 AM Thomas Weise <
>>>>>> [hidden email]>
>>>>>> > wrote:
>>>>>> >
>>>>>> > > I ran git bisect and the first commit that shows the regression is:
>>>>>> > >
>>>>>> > > https://github.com/apache/flink/commit/355184d69a8519d29937725c8d85e8465d7e3a90
>>>>>> > >
>>>>>> > >
>>>>>> > > On Thu, Jul 23, 2020 at 6:46 PM Kurt Young <[hidden email]>
>>>>>> wrote:
>>>>>> > >
>>>>>> > > > From my experience, Java profilers are sometimes not accurate
>>>>>> > > > enough to find the root cause of a performance regression. In this
>>>>>> > > > case, I would suggest you try out Intel VTune Amplifier to watch
>>>>>> > > > more detailed metrics.
>>>>>> > > >
>>>>>> > > > Best,
>>>>>> > > > Kurt
>>>>>> > > >
>>>>>> > > >
>>>>>> > > > On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise <[hidden email]>
>>>>>> wrote:
>>>>>> > > >
>>>>>> > > > > The cause of the issue is anything but clear.
>>>>>> > > > >
>>>>>> > > > > Previously I had mentioned that there is no suspect change to the
>>>>>> > > > > Kinesis connector and that I had reverted the AWS SDK change to no
>>>>>> > > > > effect.
>>>>>> > > > >
>>>>>> > > > > https://issues.apache.org/jira/browse/FLINK-17496 actually fixed
>>>>>> > > > > another regression in the previous release and is present before
>>>>>> > > > > and after.
>>>>>> > > > >
>>>>>> > > > > I repeated the run with 1.11.0 core and downgraded the entire
>>>>>> > > > > Kinesis connector to 1.10.1: Nothing changes, i.e. the regression
>>>>>> > > > > is still present. Therefore we will need to look elsewhere for the
>>>>>> > > > > root cause.
>>>>>> > > > >
>>>>>> > > > > Regarding the time spent in snapshotState, repeat runs reveal a
>>>>>> > > > > wide range for both versions, 1.10 and 1.11. So again this is
>>>>>> > > > > nothing pointing to a root cause.
>>>>>> > > > >
>>>>>> > > > > At this point, I have no ideas remaining other than doing a bisect
>>>>>> > > > > to find the culprit. Any other suggestions?
>>>>>> > > > >
>>>>>> > > > > Thomas
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > On Thu, Jul 16, 2020 at 9:19 PM Zhijiang <
>>>>>> [hidden email]
>>>>>> > > > > .invalid>
>>>>>> > > > > wrote:
>>>>>> > > > >
>>>>>> > > > > > Hi Thomas,
>>>>>> > > > > >
>>>>>> > > > > > Thanks for the further profiling information. Glad to see we
>>>>>> > > > > > have already narrowed down the location that causes the
>>>>>> > > > > > regression.
>>>>>> > > > > > Actually, I was also suspicious of #snapshotState in previous
>>>>>> > > > > > discussions, since it indeed takes a long time and blocks normal
>>>>>> > > > > > operator processing.
>>>>>> > > > > >
>>>>>> > > > > > Based on your feedback below, the sleep time during
>>>>>> > > > > > #snapshotState might be the main concern, and I also dug into
>>>>>> > > > > > the implementation of FlinkKinesisProducer#snapshotState:
>>>>>> > > > > > while (producer.getOutstandingRecordsCount() > 0) {
>>>>>> > > > > >    producer.flush();
>>>>>> > > > > >    try {
>>>>>> > > > > >       Thread.sleep(500);
>>>>>> > > > > >    } catch (InterruptedException e) {
>>>>>> > > > > >       LOG.warn("Flushing was interrupted.");
>>>>>> > > > > >       break;
>>>>>> > > > > >    }
>>>>>> > > > > > }
>>>>>> > > > > > It seems that the sleep time is mainly affected by the internal
>>>>>> > > > > > operations of the KinesisProducer implementation provided by
>>>>>> > > > > > amazonaws, which I am not very familiar with.
>>>>>> > > > > > But I noticed there were two related upgrades in release-1.11.0.
>>>>>> > > > > > One upgrades amazon-kinesis-producer to 0.14.0 [1] and the other
>>>>>> > > > > > upgrades aws-sdk-version to 1.11.754 [2].
>>>>>> > > > > > You mentioned that you already reverted the SDK upgrade and saw
>>>>>> > > > > > no change. Did you also revert [1] to verify?
>>>>>> > > > > > [1] https://issues.apache.org/jira/browse/FLINK-17496
>>>>>> > > > > > [2] https://issues.apache.org/jira/browse/FLINK-14881
>>>>>> > > > > >
>>>>>> > > > > > Best,
>>>>>> > > > > > Zhijiang
>>>>>> > > > > >
>>>>>> ------------------------------------------------------------------
>>>>>> > > > > > From:Thomas Weise <[hidden email]>
>>>>>> > > > > > Send Time: 2020-07-17 (Friday) 05:29
>>>>>> > > > > > To:dev <[hidden email]>
>>>>>> > > > > > Cc:Zhijiang <[hidden email]>; Stephan Ewen <[hidden email]>; Arvid Heise <[hidden email]>; Aljoscha Krettek <[hidden email]>
>>>>>> > > > > > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)
>>>>>> > > > > >
>>>>>> > > > > > Sorry for the delay.
>>>>>> > > > > >
>>>>>> > > > > > I confirmed that the regression is due to the sink
>>>>>> > > > > > (unsurprising, since another job with the same consumer, but not
>>>>>> > > > > > the producer, runs as expected).
>>>>>> > > > > >
>>>>>> > > > > > As promised I did CPU profiling on the problematic application,
>>>>>> > > > > > which gives more insight into the regression [1]
>>>>>> > > > > >
>>>>>> > > > > > The screenshots show that the average time for snapshotState
>>>>>> > > > > > increases from ~9s to ~28s. The data also shows the increase in
>>>>>> > > > > > sleep time during snapshotState.
>>>>>> > > > > >
>>>>>> > > > > > Does anyone, based on changes made in 1.11, have a theory why?
>>>>>> > > > > >
>>>>>> > > > > > I had previously looked at the changes to the Kinesis connector
>>>>>> > > > > > and also reverted the SDK upgrade, which did not change the
>>>>>> > > > > > situation.
>>>>>> > > > > >
>>>>>> > > > > > It will likely be necessary to drill into the sink /
>>>>>> > > > > > checkpointing details to understand the cause of the problem.
>>>>>> > > > > >
>>>>>> > > > > > Let me know if anyone has specific questions that I can answer
>>>>>> > > > > > from the profiling results.
>>>>>> > > > > >
>>>>>> > > > > > Thomas
>>>>>> > > > > >
>>>>>> > > > > > [1] https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing
>>>>>> > > > > >
>>>>>> > > > > > On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <
>>>>>> [hidden email]>
>>>>>> > > wrote:
>>>>>> > > > > >
>>>>>> > > > > > > + dev@ for visibility
>>>>>> > > > > > >
>>>>>> > > > > > > I will investigate further today.
>>>>>> > > > > > >
>>>>>> > > > > > >
>>>>>> > > > > > > On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <
>>>>>> > > [hidden email]
>>>>>> > > > >
>>>>>> > > > > > > wrote:
>>>>>> > > > > > >
>>>>>> > > > > > >> On 06.07.20 20:39, Stephan Ewen wrote:
>>>>>> > > > > > >> >    - Did sink checkpoint notifications change in a relevant
>>>>>> > > > > > >> > way, for example due to some Kafka issues we addressed in
>>>>>> > > > > > >> > 1.11 (@Aljoscha maybe?)
>>>>>> > > > > > >>
>>>>>> > > > > > >> I think that's unrelated: the Kafka fixes were isolated in
>>>>>> > > > > > >> Kafka and the one bug I discovered on the way was about the
>>>>>> > > > > > >> Task reaper.
>>>>>> > > > > > >>
>>>>>> > > > > > >>
>>>>>> > > > > > >> On 07.07.20 17:51, Zhijiang wrote:
>>>>>> > > > > > >> > Sorry for my misunderstanding of the previous information,
>>>>>> > > > > > >> > Thomas. I was assuming that the sync checkpoint duration
>>>>>> > > > > > >> > increased after the upgrade, as was mentioned before.
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > If I remember correctly, the memory state backend also has
>>>>>> > > > > > >> > the same issue? If so, we can rule out the RocksDB state
>>>>>> > > > > > >> > changes. With slot sharing enabled, the downstream and
>>>>>> > > > > > >> > upstream are probably deployed into the same slot, so there
>>>>>> > > > > > >> > is no network shuffle effect.
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > I think we need to find out whether any other symptoms
>>>>>> > > > > > >> > changed besides the performance regression, to further
>>>>>> > > > > > >> > narrow down the scope.
>>>>>> > > > > > >> > E.g. any metrics changes, or changes in the number of
>>>>>> > > > > > >> > TaskManagers and the number of slots per TaskManager in the
>>>>>> > > > > > >> > deployment.
>>>>>> > > > > > >> > A 40% regression is really big, so I guess the changes
>>>>>> > > > > > >> > should also be reflected in other places.
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > I am not sure whether we can reproduce the regression in our
>>>>>> > > > > > >> > AWS environment by writing arbitrary Kinesis jobs, since, as
>>>>>> > > > > > >> > Thomas mentioned, there are also Kinesis jobs that behave
>>>>>> > > > > > >> > normally after the upgrade.
>>>>>> > > > > > >> > So it probably hits some corner case. I am very willing to
>>>>>> > > > > > >> > provide any help for debugging if possible.
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > Best,
>>>>>> > > > > > >> > Zhijiang
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >
>>>>>> > > ------------------------------------------------------------------
>>>>>> > > > > > >> > From:Thomas Weise <[hidden email]>
>>>>>> > > > > > >> > Send Time: 2020-07-07 (Tuesday) 23:01
>>>>>> > > > > > >> > To:Stephan Ewen <[hidden email]>
>>>>>> > > > > > >> > Cc:Aljoscha Krettek <[hidden email]>; Arvid Heise <[hidden email]>; Zhijiang <[hidden email]>
>>>>>> > > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > We are deploying our apps with FlinkK8sOperator. We have one
>>>>>> > > > > > >> > job that works as expected after the upgrade and the one
>>>>>> > > > > > >> > discussed here that has the performance regression.
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > "The performance regression is obviously caused by the long
>>>>>> > > > > > >> > duration of the sync checkpoint process in the Kinesis sink
>>>>>> > > > > > >> > operator, which blocks normal data processing until the
>>>>>> > > > > > >> > source is backpressured."
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > That's a constant. Before (1.10) and after the upgrade, the
>>>>>> > > > > > >> > sync checkpointing time is the same. The question is what
>>>>>> > > > > > >> > change came in with the upgrade.
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <[hidden email]> wrote:
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > @Thomas Just one thing real quick: Are you using the
>>>>>> > > > > > >> > standalone setup scripts (like start-cluster.sh, and the
>>>>>> > > > > > >> > former "slaves" file)?
>>>>>> > > > > > >> > Be aware that this is now called "workers" to avoid
>>>>>> > > > > > >> > sensitive names.
>>>>>> > > > > > >> > In one internal benchmark we saw quite a lot of slowdown
>>>>>> > > > > > >> > initially, before seeing that the cluster was not a
>>>>>> > > > > > >> > distributed cluster any more ;-)
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <[hidden email]> wrote:
>>>>>> > > > > > >> > Thanks for kicking this off and helping with the analysis,
>>>>>> > > > > > >> > Stephan!
>>>>>> > > > > > >> > Thanks for the further feedback and investigation, Thomas!
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > The performance regression is obviously caused by the long
>>>>>> > > > > > >> > duration of the sync checkpoint process in the Kinesis sink
>>>>>> > > > > > >> > operator, which blocks normal data processing until the
>>>>>> > > > > > >> > source is backpressured.
>>>>>> > > > > > >> > Maybe we could dig into the sync part of checkpoint
>>>>>> > > > > > >> > execution, e.g. break down the steps inside the respective
>>>>>> > > > > > >> > operator#snapshotState to measure which operation costs most
>>>>>> > > > > > >> > of the time. Then we might find the root cause of this cost.
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > Look forward to the further progress. :)
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > Best,
>>>>>> > > > > > >> > Zhijiang
>>>>>> > > > > > >> >
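The suggestion above, breaking down the steps inside snapshotState to see which one dominates the sync time, could be sketched roughly as follows. This is a minimal sketch in plain Java, not Flink code: the `StepTimer` class and the step names `flushPendingRecords` and `writeOperatorState` are hypothetical and only stand in for whatever a sink actually does while snapshotting.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Accumulates wall-clock time per named step, e.g. the steps inside a snapshotState call. */
class StepTimer {
    private final Map<String, Long> nanosByStep = new LinkedHashMap<>();

    /** Runs the step and adds its elapsed time to the running total for that name. */
    public void time(String step, Runnable body) {
        long start = System.nanoTime();
        try {
            body.run();
        } finally {
            nanosByStep.merge(step, System.nanoTime() - start, Long::sum);
        }
    }

    /** Returns the name of the step with the largest accumulated time, or null if none ran. */
    public String slowestStep() {
        String slowest = null;
        long max = -1L;
        for (Map.Entry<String, Long> e : nanosByStep.entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                slowest = e.getKey();
            }
        }
        return slowest;
    }

    /** Accumulated nanoseconds per step, in the order the steps were first timed. */
    public Map<String, Long> totals() {
        return nanosByStep;
    }

    public static void main(String[] args) {
        StepTimer timer = new StepTimer();
        // Hypothetical stand-ins for the work a sink might do while snapshotting.
        timer.time("flushPendingRecords", () -> pause(50));
        timer.time("writeOperatorState", () -> pause(5));
        System.out.println("slowest step: " + timer.slowestStep());
        timer.totals().forEach((step, nanos) ->
                System.out.println(step + " took " + nanos / 1_000_000 + " ms"));
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In a real operator one would wrap each phase of the snapshot in `time(...)` and log the totals after every checkpoint; a profiler gives the same answer with less code but more setup.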
>>>>>> > > > > > >> >
>>>>>> > > ------------------------------------------------------------------
>>>>>> > > > > > >> > From:Stephan Ewen <[hidden email]>
>>>>>> > > > > > >> > Send Time: July 7, 2020 (Tuesday) 14:52
>>>>>> > > > > > >> > To:Thomas Weise <[hidden email]>
>>>>>> > > > > > >> > Cc:Stephan Ewen <[hidden email]>; Zhijiang <
>>>>>> > > > > > >> [hidden email]>; Aljoscha Krettek <
>>>>>> > > [hidden email]
>>>>>> > > > >;
>>>>>> > > > > > >> Arvid Heise <[hidden email]>
>>>>>> > > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE]
>>>>>> Release
>>>>>> > > 1.11.0,
>>>>>> > > > > > >> release candidate #4)
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > Thank you for digging so deeply.
>>>>>> > > > > > >> > A mysterious thing, this regression.
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <
>>>>>> [hidden email]>
>>>>>> > wrote:
>>>>>> > > > > > >> > @Stephan: yes, I refer to sync time in the web UI (it
>>>>>> is
>>>>>> > > unchanged
>>>>>> > > > > > >> between 1.10 and 1.11 for the specific pipeline).
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > I verified that increasing the checkpointing interval
>>>>>> does not
>>>>>> > > > make
>>>>>> > > > > a
>>>>>> > > > > > >> difference.
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > I looked at the Kinesis connector changes since 1.10.1
>>>>>> and
>>>>>> > don't
>>>>>> > > > see
>>>>>> > > > > > >> anything that could cause this.
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > Another pipeline that is using the Kinesis consumer
>>>>>> (but not
>>>>>> > the
>>>>>> > > > > > >> producer) performs as expected.
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > I tried reverting the AWS SDK version change, symptoms
>>>>>> remain
>>>>>> > > > > > unchanged:
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml b/flink-connectors/flink-connector-kinesis/pom.xml
>>>>>> > > > > > >> > index a6abce23ba..741743a05e 100644
>>>>>> > > > > > >> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
>>>>>> > > > > > >> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
>>>>>> > > > > > >> > @@ -33,7 +33,7 @@ under the License.
>>>>>> > > > > > >> >          <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
>>>>>> > > > > > >> >          <name>flink-connector-kinesis</name>
>>>>>> > > > > > >> >          <properties>
>>>>>> > > > > > >> > -                <aws.sdk.version>1.11.754</aws.sdk.version>
>>>>>> > > > > > >> > +                <aws.sdk.version>1.11.603</aws.sdk.version>
>>>>>> > > > > > >> >                  <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
>>>>>> > > > > > >> >                  <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
>>>>>> > > > > > >> >                  <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > I'm planning to take a look with a profiler next.
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > Thomas
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <
>>>>>> > [hidden email]>
>>>>>> > > > > > wrote:
>>>>>> > > > > > >> > Hi all!
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > Forking this thread out of the release vote thread.
>>>>>> > > > > > >> >  From what Thomas describes, it really sounds like a
>>>>>> > > sink-specific
>>>>>> > > > > > >> issue.
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > @Thomas: When you say sink has a long synchronous
>>>>>> checkpoint
>>>>>> > > time,
>>>>>> > > > > you
>>>>>> > > > > > >> mean the time that is shown as "sync time" on the
>>>>>> metrics and
>>>>>> > web
>>>>>> > > > UI?
>>>>>> > > > > > That
>>>>>> > > > > > >> is not including any network buffer related operations.
>>>>>> It is
>>>>>> > > purely
>>>>>> > > > > the
>>>>>> > > > > > >> operator's time.
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > Can we dig into the changes we did in sinks:
>>>>>> > > > > > >> >    - Kinesis version upgrade, AWS library updates
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >    - Could it be that some call (checkpoint complete)
>>>>>> that was
>>>>>> > > > > > >> previously (1.10) in a separate thread is not in the
>>>>>> mailbox and
>>>>>> > > > this
>>>>>> > > > > > >> simply reduces the number of threads that do the work?
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >    - Did sink checkpoint notifications change in a
>>>>>> relevant
>>>>>> > way,
>>>>>> > > > for
>>>>>> > > > > > >> example due to some Kafka issues we addressed in 1.11
>>>>>> (@Aljoscha
>>>>>> > > > > maybe?)
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > Best,
>>>>>> > > > > > >> > Stephan
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >
>>>>>> > > > > > >> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <
>>>>>> > > > [hidden email]
>>>>>> > > > > > .invalid>
>>>>>> > > > > > >> wrote:
>>>>>> > > > > > >> > Hi Thomas,
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >   Regarding [2], more details can be found in the Jira
>>>>>> > > > > > >> >   description
>>>>>> > > > > > >> >   (https://issues.apache.org/jira/browse/FLINK-16404).
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >   I can also give a basic explanation here to dismiss the
>>>>>> > > > > > >> >   concern.
>>>>>> > > > > > >> >   1. In the past, the buffers following the barrier were
>>>>>> > > > > > >> >   cached on the downstream side before alignment.
>>>>>> > > > > > >> >   2. In 1.11, the upstream does not send the buffers after
>>>>>> > > > > > >> >   the barrier. When the downstream finishes the alignment,
>>>>>> > > > > > >> >   it notifies the upstream to continue sending the
>>>>>> > > > > > >> >   following buffers, since it can process them after
>>>>>> > > > > > >> >   alignment.
>>>>>> > > > > > >> >   3. The only difference is whether the temporarily blocked
>>>>>> > > > > > >> >   buffers are cached on the downstream side or on the
>>>>>> > > > > > >> >   upstream side before alignment.
>>>>>> > > > > > >> >   4. The side effect is the additional notification cost
>>>>>> > > > > > >> >   for every barrier alignment. If the downstream and
>>>>>> > > > > > >> >   upstream are deployed in separate TaskManagers, the cost
>>>>>> > > > > > >> >   is the network transport delay (the effect can be ignored
>>>>>> > > > > > >> >   based on our testing with a 1s checkpoint interval). With
>>>>>> > > > > > >> >   slot sharing, as in your case, the cost is only one
>>>>>> > > > > > >> >   method call in the processor, which can also be ignored.
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >   You mentioned "In this case, the downstream task has a
>>>>>> > > > > > >> >   high average checkpoint duration (~30s, sync part)." This
>>>>>> > > > > > >> >   duration does not reflect the changes above; it only
>>>>>> > > > > > >> >   indicates the time spent calling
>>>>>> > > > > > >> >   `Operation.snapshotState`.
>>>>>> > > > > > >> >   If this duration is beyond your expectation, you can
>>>>>> > > > > > >> >   check or debug whether the source/sink operators take
>>>>>> > > > > > >> >   more time to finish `snapshotState` in practice. E.g. you
>>>>>> > > > > > >> >   can make the implementation of this method empty to
>>>>>> > > > > > >> >   further verify the effect.
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >   Best,
>>>>>> > > > > > >> >   Zhijiang
>>>>>> > > > > > >> >
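Points 1 to 4 in the message above describe where the buffers that follow a barrier wait until alignment is done. The toy model below illustrates only the 1.11 behaviour as described there: the upstream holds later buffers until the downstream tells it to continue. It is a sketch under stated assumptions, not Flink's network stack; the classes `Upstream` and `Downstream`, the `resume` call, and the `"BARRIER"` marker are all hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model: buffers that follow a barrier wait on the upstream side until alignment is done. */
class BarrierHoldModel {
    static final String BARRIER = "BARRIER";

    /** Upstream side of one channel; after a barrier it holds later buffers until resumed. */
    static class Upstream {
        final Deque<String> held = new ArrayDeque<>();
        boolean blockedByBarrier = false;

        /** Sends the element downstream, or keeps it locally while blocked by a barrier. */
        void emit(String element, Downstream downstream) {
            if (blockedByBarrier) {
                held.add(element);
                return;
            }
            downstream.receive(element);
            if (element.equals(BARRIER)) {
                blockedByBarrier = true;
            }
        }

        /** The notification from the downstream after alignment: re-emit what was held. */
        void resume(Downstream downstream) {
            blockedByBarrier = false;
            List<String> pending = new ArrayList<>(held);
            held.clear();
            for (String element : pending) {
                emit(element, downstream); // a later barrier in the queue blocks again
            }
        }
    }

    /** Downstream side; it only records what it has received. */
    static class Downstream {
        final List<String> received = new ArrayList<>();

        void receive(String element) {
            received.add(element);
        }
    }

    public static void main(String[] args) {
        Upstream up = new Upstream();
        Downstream down = new Downstream();
        for (String element : new String[] {"b1", BARRIER, "b2", "b3"}) {
            up.emit(element, down);
        }
        System.out.println("before resume: received=" + down.received + " held=" + up.held);
        up.resume(down); // with slot sharing this notification is a plain method call
        System.out.println("after resume:  received=" + down.received + " held=" + up.held);
    }
}
```

The downstream ends up with the same elements in the same order either way; the model only moves the waiting place from the downstream to the upstream, at the price of one extra notification per alignment.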
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >
>>>>>> > > >
>>>>>> ------------------------------------------------------------------
>>>>>> > > > > > >> >   From:Thomas Weise <[hidden email]>
>>>>>> > > > > > >> >   Send Time: July 5, 2020 (Sunday) 12:22
>>>>>> > > > > > >> >   To:dev <[hidden email]>; Zhijiang <
>>>>>> > > > > [hidden email]
>>>>>> > > > > > >
>>>>>> > > > > > >> >   Cc:Yingjie Cao <[hidden email]>
>>>>>> > > > > > >> >   Subject:Re: [VOTE] Release 1.11.0, release candidate
>>>>>> #4
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >   Hi Zhijiang,
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >   Could you please point me to more details regarding:
>>>>>> "[2]:
>>>>>> > > Delay
>>>>>> > > > > > send
>>>>>> > > > > > >> the
>>>>>> > > > > > >> >   following buffers after checkpoint barrier on
>>>>>> upstream side
>>>>>> > > > until
>>>>>> > > > > > >> barrier
>>>>>> > > > > > >> >   alignment on downstream side."
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >   In this case, the downstream task has a high average
>>>>>> > > checkpoint
>>>>>> > > > > > >> duration
>>>>>> > > > > > >> >   (~30s, sync part). If there was a change to hold
>>>>>> buffers
>>>>>> > > > depending
>>>>>> > > > > > on
>>>>>> > > > > > >> >   downstream performance, could this possibly apply to
>>>>>> this
>>>>>> > case
>>>>>> > > > > (even
>>>>>> > > > > > >> when
>>>>>> > > > > > >> >   there is no shuffle that would require alignment)?
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >   Thanks,
>>>>>> > > > > > >> >   Thomas
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <
>>>>>> > > > > [hidden email]
>>>>>> > > > > > >> .invalid>
>>>>>> > > > > > >> >   wrote:
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >   > Hi Thomas,
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > Thanks for the further update information.
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > I guess we can dismiss the network stack changes,
>>>>>> since in
>>>>>> > > > your
>>>>>> > > > > > >> case the
>>>>>> > > > > > >> >   > downstream and upstream would probably be deployed
>>>>>> in the
>>>>>> > > same
>>>>>> > > > > > slot
>>>>>> > > > > > >> >   > bypassing the network data shuffle.
>>>>>> > > > > > >> >   > Also I guess release-1.11 will not bring general
>>>>>> > performance
>>>>>> > > > > > >> regression in
>>>>>> > > > > > >> >   > runtime engine, as we also did the performance
>>>>>> testing for
>>>>>> > > all
>>>>>> > > > > > >> general
>>>>>> > > > > > >> >   > cases by [1] in real cluster before and the testing
>>>>>> > results
>>>>>> > > > > should
>>>>>> > > > > > >> fit the
>>>>>> > > > > > >> >   > expectation. But we indeed have not tested the specific
>>>>>> > > > > > >> >   > source and sink connectors yet, as far as I know.
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > Regarding your 40% performance regression, I suspect it
>>>>>> > > > > > >> >   > is related to specific source/sink changes (e.g.
>>>>>> > > > > > >> >   > kinesis) or to environment issues in a corner case.
>>>>>> > > > > > >> >   > If possible, it would be helpful to further locate
>>>>>> whether
>>>>>> > > the
>>>>>> > > > > > >> regression
>>>>>> > > > > > >> >   > is caused by kinesis, by replacing the kinesis
>>>>>> source &
>>>>>> > sink
>>>>>> > > > and
>>>>>> > > > > > >> keeping
>>>>>> > > > > > >> >   > the others same.
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > As you said, it would be efficient to contact with
>>>>>> you
>>>>>> > > > directly
>>>>>> > > > > > >> next week
>>>>>> > > > > > >> >   > to further discuss this issue. And we are
>>>>>> willing/eager to
>>>>>> > > > > provide
>>>>>> > > > > > >> any help
>>>>>> > > > > > >> >   > to resolve this issue soon.
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > Besides that, I guess this issue should not be the
>>>>>> blocker
>>>>>> > > for
>>>>>> > > > > the
>>>>>> > > > > > >> >   > release, since it is probably a corner case based
>>>>>> on the
>>>>>> > > > current
>>>>>> > > > > > >> analysis.
>>>>>> > > > > > >> >   > If we really conclude anything need to be resolved
>>>>>> after
>>>>>> > the
>>>>>> > > > > final
>>>>>> > > > > > >> >   > release, then we can also make the next minor
>>>>>> > release-1.11.1
>>>>>> > > > > come
>>>>>> > > > > > >> soon.
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > [1]
>>>>>> https://issues.apache.org/jira/browse/FLINK-18433
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > Best,
>>>>>> > > > > > >> >   > Zhijiang
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   >
>>>>>> > > > >
>>>>>> ------------------------------------------------------------------
>>>>>> > > > > > >> >   > From:Thomas Weise <[hidden email]>
>>>>>> > > > > > >> >   > Send Time: July 4, 2020 (Saturday) 12:26
>>>>>> > > > > > >> >   > To:dev <[hidden email]>; Zhijiang <
>>>>>> > > > > > [hidden email]
>>>>>> > > > > > >> >
>>>>>> > > > > > >> >   > Cc:Yingjie Cao <[hidden email]>
>>>>>> > > > > > >> >   > Subject:Re: [VOTE] Release 1.11.0, release
>>>>>> candidate #4
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > Hi Zhijiang,
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > It will probably be best if we connect next week
>>>>>> and
>>>>>> > discuss
>>>>>> > > > the
>>>>>> > > > > > >> issue
>>>>>> > > > > > >> >   > directly since this could be quite difficult to
>>>>>> reproduce.
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > Before the testing result on our side comes out
>>>>>> for your
>>>>>> > > > > > respective
>>>>>> > > > > > >> job
>>>>>> > > > > > >> >   > case, I have some other questions to confirm for
>>>>>> further
>>>>>> > > > > analysis:
>>>>>> > > > > > >> >   >     -  How much percentage regression you found
>>>>>> after
>>>>>> > > > switching
>>>>>> > > > > to
>>>>>> > > > > > >> 1.11?
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > ~40% throughput decline
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   >     -  Are there any network bottleneck in your
>>>>>> cluster?
>>>>>> > > E.g.
>>>>>> > > > > the
>>>>>> > > > > > >> network
>>>>>> > > > > > >> >   > bandwidth is full caused by other jobs? If so, it
>>>>>> might
>>>>>> > have
>>>>>> > > > > more
>>>>>> > > > > > >> effects
>>>>>> > > > > > >> >   > by above [2]
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > The test runs on a k8s cluster that is also used
>>>>>> for other
>>>>>> > > > > > >> production jobs.
>>>>>> > > > > > >> >   > There is no reason to believe the network is the
>>>>>> > > > > > >> >   > bottleneck.
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   >     -  Did you adjust the default network buffer
>>>>>> setting?
>>>>>> > > E.g.
>>>>>> > > > > > >> >   >
>>>>>> "taskmanager.network.memory.floating-buffers-per-gate" or
>>>>>> > > > > > >> >   > "taskmanager.network.memory.buffers-per-channel"
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > The job is using the defaults, i.e we don't
>>>>>> configure the
>>>>>> > > > > > settings.
>>>>>> > > > > > >> If you
>>>>>> > > > > > >> >   > want me to try specific settings in the hope that
>>>>>> it will
>>>>>> > > help
>>>>>> > > > > to
>>>>>> > > > > > >> isolate
>>>>>> > > > > > >> >   > the issue please let me know.
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   >     -  I guess the topology has three vertexes
>>>>>> > > > "KinesisConsumer
>>>>>> > > > > ->
>>>>>> > > > > > >> Chained
>>>>>> > > > > > >> >   > FlatMap -> KinesisProducer", and the partition
>>>>>> mode for
>>>>>> > > > > > >> "KinesisConsumer ->
>>>>>> > > > > > >> >   > FlatMap" and "FlatMap->KinesisProducer" are both
>>>>>> > "forward"?
>>>>>> > > If
>>>>>> > > > > so,
>>>>>> > > > > > >> the edge
>>>>>> > > > > > >> >   > connection is one-to-one, not all-to-all, and the above
>>>>>> > > > > > >> >   > [1][2] should have no effect in theory with the default
>>>>>> > > > > > >> >   > network buffer setting.
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > There are only 2 vertices and the edge is
>>>>>> "forward".
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   >     - By slot sharing, I guess these three vertex
>>>>>> > > parallelism
>>>>>> > > > > task
>>>>>> > > > > > >> would
>>>>>> > > > > > >> >   > probably be deployed into the same slot, then the
>>>>>> data
>>>>>> > > shuffle
>>>>>> > > > > is
>>>>>> > > > > > >> by memory
>>>>>> > > > > > >> >   > queue, not network stack. If so, the above [2]
>>>>>> should no
>>>>>> > > > effect.
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > Yes, vertices share slots.
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   >     - I also saw some Jira changes for kinesis in this
>>>>>> > > > > > >> >   > release, could you confirm that these changes would
>>>>>> > > > > > >> >   > not affect the performance?
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > I will need to take a look. 1.10 already had a
>>>>>> regression
>>>>>> > > > > > >> introduced by the
>>>>>> > > > > > >> >   > Kinesis producer update.
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > Thanks,
>>>>>> > > > > > >> >   > Thomas
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <
>>>>>> > > > > > >> [hidden email]
>>>>>> > > > > > >> >   > .invalid>
>>>>>> > > > > > >> >   > wrote:
>>>>>> > > > > > >> >   >
>>>>>> > > > > > >> >   > > Hi Thomas,
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > Thanks for your reply with rich information!
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > We are trying to reproduce your case in our
>>>>>> cluster to
>>>>>> > > > further
>>>>>> > > > > > >> verify it,
>>>>>> > > > > > >> >   > > and  @Yingjie Cao is working on it now.
>>>>>> > > > > > >> >   > > Since we do not have a Kinesis consumer and producer
>>>>>> > > > > > >> >   > > internally, we will construct a common source and
>>>>>> > > > > > >> >   > > sink instead, for the backpressure case.
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > Firstly, we can dismiss the RocksDB factor in this
>>>>>> > > > > > >> >   > > release, since you also mentioned that "filesystem
>>>>>> > > > > > >> >   > > leads to same symptoms".
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > Secondly, if my understanding is right, you emphasize
>>>>>> > > > > > >> >   > > that the regression only exists for jobs with a low
>>>>>> > > > > > >> >   > > checkpoint interval (10s).
>>>>>> > > > > > >> >   > > Based on that, I have two suspicions with the
>>>>>> network
>>>>>> > > > related
>>>>>> > > > > > >> changes in
>>>>>> > > > > > >> >   > > this release:
>>>>>> > > > > > >> >   > >     - [1]: Limited the maximum backlog value
>>>>>> (default
>>>>>> > 10)
>>>>>> > > in
>>>>>> > > > > > >> subpartition
>>>>>> > > > > > >> >   > > queue.
>>>>>> > > > > > >> >   > >     - [2]: Delay send the following buffers after
>>>>>> > > checkpoint
>>>>>> > > > > > >> barrier on
>>>>>> > > > > > >> >   > > upstream side until barrier alignment on
>>>>>> downstream
>>>>>> > side.
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > These changes are intended to reduce the in-flight
>>>>>> > > > > > >> >   > > buffers and so speed up checkpoints, especially under
>>>>>> > > > > > >> >   > > backpressure.
>>>>>> > > > > > >> >   > > In theory they should have only a very minor
>>>>>> > > > > > >> >   > > performance effect, and we also tested them in a
>>>>>> > > > > > >> >   > > cluster and confirmed the results were within
>>>>>> > > > > > >> >   > > expectation before merging them, but maybe there are
>>>>>> > > > > > >> >   > > other corner cases we have not thought of.
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > Before the testing result on our side comes out
>>>>>> for your
>>>>>> > > > > > >> respective job
>>>>>> > > > > > >> >   > > case, I have some other questions to confirm for
>>>>>> further
>>>>>> > > > > > analysis:
>>>>>> > > > > > >> >   > >     -  How much percentage regression you found
>>>>>> after
>>>>>> > > > > switching
>>>>>> > > > > > >> to 1.11?
>>>>>> > > > > > >> >   > >     -  Are there any network bottleneck in your
>>>>>> cluster?
>>>>>> > > > E.g.
>>>>>> > > > > > the
>>>>>> > > > > > >> network
>>>>>> > > > > > >> >   > > bandwidth is full caused by other jobs? If so,
>>>>>> it might
>>>>>> > > have
>>>>>> > > > > > more
>>>>>> > > > > > >> effects
>>>>>> > > > > > >> >   > > by above [2]
>>>>>> > > > > > >> >   > >     -  Did you adjust the default network buffer
>>>>>> > setting?
>>>>>> > > > E.g.
>>>>>> > > > > > >> >   > >
>>>>>> "taskmanager.network.memory.floating-buffers-per-gate"
>>>>>> > or
>>>>>> > > > > > >> >   > > "taskmanager.network.memory.buffers-per-channel"
>>>>>> > > > > > >> >   > >     -  I guess the topology has three vertexes
>>>>>> > > > > "KinesisConsumer
>>>>>> > > > > > ->
>>>>>> > > > > > >> >   > Chained
>>>>>> > > > > > >> >   > > FlatMap -> KinesisProducer", and the partition
>>>>>> mode for
>>>>>> > > > > > >> "KinesisConsumer
>>>>>> > > > > > >> >   > ->
>>>>>> > > > > > >> >   > > FlatMap" and "FlatMap->KinesisProducer" are both
>>>>>> > > "forward"?
>>>>>> > > > If
>>>>>> > > > > > >> so, the
>>>>>> > > > > > >> >   > edge
>>>>>> > > > > > >> >   > > connection is one-to-one, not all-to-all, then
>>>>>> the above
>>>>>> > > > > [1][2]
>>>>>> > > > > > >> should no
>>>>>> > > > > > >> >   > > effects in theory with default network buffer
>>>>>> setting.
>>>>>> > > > > > >> >   > >     - By slot sharing, I guess these three vertex
>>>>>> > > > parallelism
>>>>>> > > > > > >> task would
>>>>>> > > > > > >> >   > > probably be deployed into the same slot, then
>>>>>> the data
>>>>>> > > > shuffle
>>>>>> > > > > > is
>>>>>> > > > > > >> by
>>>>>> > > > > > >> >   > memory
>>>>>> > > > > > >> >   > > queue, not network stack. If so, the above [2]
>>>>>> should no
>>>>>> > > > > effect.
>>>>>> > > > > > >> >   > >     - I also saw some Jira changes for kinesis in this
>>>>>> > > > > > >> >   > > release, could you confirm that these changes would
>>>>>> > > > > > >> >   > > not affect the performance?
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > Best,
>>>>>> > > > > > >> >   > > Zhijiang
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > >
>>>>>> > > > > >
>>>>>> ------------------------------------------------------------------
>>>>>> > > > > > >> >   > > From:Thomas Weise <[hidden email]>
>>>>>> > > > > > >> >   > > Send Time: July 3, 2020 (Friday) 01:07
>>>>>> > > > > > >> >   > > To:dev <[hidden email]>; Zhijiang <
>>>>>> > > > > > >> [hidden email]>
>>>>>> > > > > > >> >   > > Subject:Re: [VOTE] Release 1.11.0, release
>>>>>> candidate #4
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > Hi Zhijiang,
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > The performance degradation manifests in
>>>>>> backpressure
>>>>>> > > which
>>>>>> > > > > > leads
>>>>>> > > > > > >> to
>>>>>> > > > > > >> >   > > growing backlog in the source. I switched a few
>>>>>> times
>>>>>> > > > between
>>>>>> > > > > > >> 1.10 and
>>>>>> > > > > > >> >   > 1.11
>>>>>> > > > > > >> >   > > and the behavior is consistent.
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > The DAG is:
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map)
>>>>>> > >  --------
>>>>>> > > > > > >> forward
>>>>>> > > > > > >> >   > > ---------> KinesisProducer
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > Parallelism: 160
>>>>>> > > > > > >> >   > > No shuffle/rebalance.
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > Checkpointing config:
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > Checkpointing Mode Exactly Once
>>>>>> > > > > > >> >   > > Interval 10s
>>>>>> > > > > > >> >   > > Timeout 10m 0s
>>>>>> > > > > > >> >   > > Minimum Pause Between Checkpoints 10s
>>>>>> > > > > > >> >   > > Maximum Concurrent Checkpoints 1
>>>>>> > > > > > >> >   > > Persist Checkpoints Externally Enabled (delete on
>>>>>> > > > > cancellation)
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > State backend: rocksdb  (filesystem leads to same
>>>>>> > > symptoms)
>>>>>> > > > > > >> >   > > Checkpoint size is tiny (500KB)
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > An interesting difference to another job that I
>>>>>> had
>>>>>> > > upgraded
>>>>>> > > > > > >> successfully
>>>>>> > > > > > >> >   > > is the low checkpointing interval.
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > Thanks,
>>>>>> > > > > > >> >   > > Thomas
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <
>>>>>> > > > > > >> [hidden email]
>>>>>> > > > > > >> >   > > .invalid>
>>>>>> > > > > > >> >   > > wrote:
>>>>>> > > > > > >> >   > >
>>>>>> > > > > > >> >   > > > Hi Thomas,
>>>>>> > > > > > >> >   > > >
>>>>>> > > > > > >> >   > > > Thanks for the efficient feedback.
>>>>>> > > > > > >> >   > > >
>>>>>> > > > > > >> >   > > > Regarding the suggestion of adding the release
>>>>>> notes
>>>>>> > > > > document,
>>>>>> > > > > > >> I agree
>>>>>> > > > > > >> >   > > > with your point. Maybe we should adjust the
>>>>>> vote
>>>>>> > > template
>>>>>> > > > > > >> accordingly
>>>>>> > > > > > >> >   > in
>>>>>> > > > > > >> >   > > > the respective wiki to guide the following
>>>>>> release
>>>>>> > > > > processes.
>>>>>> > > > > > >> >   > > >
>>>>>> > > > > > >> >   > > > Regarding the performance regression, could you
>>>>>> > provide
>>>>>> > > > some
>>>>>> > > > > > >> more
>>>>>> > > > > > >> >   > details
>>>>>> > > > > > >> >   > > > for our better measurement or reproducing on
>>>>>> our
>>>>>> > sides?
>>>>>> > > > > > >> >   > > > E.g. I guess the topology only includes two
>>>>>> vertexes
>>>>>> > > > source
>>>>>> > > > > > and
>>>>>> > > > > > >> sink?
>>>>>> > > > > > >> >   > > > What is the parallelism for every vertex?
>>>>>> > > > > > >> >   > > > The upstream shuffles data to the downstream
>>>>>> via
>>>>>> > > rebalance
>>>>>> > > > > > >> partitioner
>>>>>> > > > > > >> >   > or
>>>>>> > > > > > >> >   > > > other?
>>>>>> > > > > > >> >   > > > The checkpoint mode is exactly-once with
>>>>>> rocksDB state
>>>>>> > > > > > backend?
>>>>>> > > > > > >> >   > > > The backpressure happened in this case?
>>>>>> > > > > > >> >   > > > How much percentage regression in this case?
>>>>>> > > > > > >> >   > > >
>>>>>> > > > > > >> >   > > > Best,
>>>>>> > > > > > >> >   > > > Zhijiang
>>>>>> > > > > > >> >   > > >
>>>>>> > > > > > >> >   > > >
>>>>>> > > > > > >> >   > > >
>>>>>> > > > > > >> >   > > >
>>>>>> > > > > > >>
>>>>>> > ------------------------------------------------------------------
>>>>>> > > > > > >> >   > > > From:Thomas Weise <[hidden email]>
>>>>>> > > > > > >> >   > > > Send Time: July 2, 2020 (Thursday) 09:54
>>>>>> > > > > > >> >   > > > To:dev <[hidden email]>
>>>>>> > > > > > >> >   > > > Subject:Re: [VOTE] Release 1.11.0, release
>>>>>> candidate
>>>>>> > #4
>>>>>> > > > > > >> >   > > >
>>>>>> > > > > > >> >   > > > Hi Till,
>>>>>> > > > > > >> >   > > >
>>>>>> > > > > > >> >   > > > Yes, we don't have the setting in
>>>>>> flink-conf.yaml.
>>>>>> > > > > > >> >   > > >
>>>>>> > > > > > >> >   > > > Generally, we carry forward the existing
>>>>>> configuration
>>>>>> > > and
>>>>>> > > > > any
>>>>>> > > > > > >> change
>>>>>> > > > > > >> >   > to
>>>>>> > > > > > >> >   > > > default configuration values would impact the
>>>>>> upgrade.
>>>>>> > > > > > >> >   > > >
>>>>>> > > > > > >> >   > > > Yes, since it is an incompatible change I
>>>>>> would state
>>>>>> > it
>>>>>> > > > in
>>>>>> > > > > > the
>>>>>> > > > > > >> release
>>>>>> > > > > > >> >   > > > notes.
>>>>>> > > > > > >> >   > > >
>>>>>> > > > > > >> >   > > > Thanks,
>>>>>> > > > > > >> >   > > > Thomas
>>>>>> > > > > > >> >   > > >
>>>>>> > > > > > >> >   > > > BTW I found a performance regression while
>>>>>> trying to
>>>>>> > > > upgrade
>>>>>> > > > > > >> another
>>>>>> > > > > > >> >   > > > pipeline with this RC. It is a simple Kinesis
>>>>>> to
>>>>>> > Kinesis
>>>>>> > > > > job.
>>>>>> > > > > > >> Wasn't
>>>>>> > > > > > >> >   > able
>>>>>> > > > > > >> >   > > > to pin it down yet, symptoms include increased
>>>>>> > > checkpoint
>>>>>> > > > > > >> alignment
>>>>>> > > > > > >> >   > time.
>>>>>> > > > > > >> >   > > >
>>>>>> > > > > > >> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <
>>>>>> > > > > > >> [hidden email]>
>>>>>> > > > > > >> >   > > > wrote:
>>>>>> > > > > > >> >   > > >
>>>>>> > > > > > >> >   > > > > Hi Thomas,
>>>>>> > > > > > >> >   > > > >
>>>>>> > > > > > >> >   > > > > just to confirm: When starting the image in
>>>>>> local
>>>>>> > > mode,
>>>>>> > > > > then
>>>>>> > > > > > >> you
>>>>>> > > > > > >> >   > don't
>>>>>> > > > > > >> >   > > > have
>>>>>> > > > > > >> >   > > > > any of the JobManager memory configuration
>>>>>> settings
>>>>>> > > > > > >> configured in the
>>>>>> > > > > > >> >   > > > > effective flink-conf.yaml, right? Does this
>>>>>> mean
>>>>>> > that
>>>>>> > > > you
>>>>>> > > > > > have
>>>>>> > > > > > >> >   > > explicitly
>>>>>> > > > > > >> >   > > > > removed `jobmanager.heap.size: 1024m` from
>>>>>> the
>>>>>> > default
>>>>>> > > > > > >> configuration?
>>>>>> > > > > > >> >   > > If
>>>>>> > > > > > >> >   > > > > this is the case, then I believe it was more
>>>>>> of an
>>>>>> > > > > > >> unintentional
>>>>>> > > > > > >> >   > > artifact
>>>>>> > > > > > >> >   > > > > that it worked before and it has been
>>>>>> corrected now
>>>>>> > so
>>>>>> > > > > that
>>>>>> > > > > > >> one needs
>>>>>> > > > > > >> >   > > to
>>>>>> > > > > > >> >   > > > > specify the memory of the JM process
>>>>>> explicitly. Do
>>>>>> > > you
>>>>>> > > > > > think
>>>>>> > > > > > >> it
>>>>>> > > > > > >> >   > would
>>>>>> > > > > > >> >   > > > help
>>>>>> > > > > > >> >   > > > > to explicitly state this in the release
>>>>>> notes?
>>>>>> > > > > > >> >   > > > >
>>>>>> > > > > > >> >   > > > > Cheers,
>>>>>> > > > > > >> >   > > > > Till
>>>>>> > > > > > >> >   > > > >
>>>>>> > > > > > >> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <[hidden email]> wrote:
>>>>>> > > > > > >> >   > > > >
>>>>>> > > > > > >> >   > > > > > Thanks for preparing another RC!
>>>>>> > > > > > >> >   > > > > >
>>>>>> > > > > > >> >   > > > > > As mentioned in the previous RC thread, it would be super helpful if the
>>>>>> > > > > > >> >   > > > > > release notes that are part of the documentation can be included [1]. It's
>>>>>> > > > > > >> >   > > > > > a significant time-saver to have read those first.
>>>>>> > > > > > >> >   > > > > >
>>>>>> > > > > > >> >   > > > > > I found one more non-backward compatible change that would be worth
>>>>>> > > > > > >> >   > > > > > addressing/mentioning:
>>>>>> > > > > > >> >   > > > > >
>>>>>> > > > > > >> >   > > > > > It is now necessary to configure the jobmanager heap size in
>>>>>> > > > > > >> >   > > > > > flink-conf.yaml (with either jobmanager.heap.size
>>>>>> > > > > > >> >   > > > > > or jobmanager.memory.heap.size). Why would I not want to do that anyways?
>>>>>> > > > > > >> >   > > > > > Well, we set it dynamically for a cluster deployment via the
>>>>>> > > > > > >> >   > > > > > flinkk8soperator, but the container image can also be used for testing with
>>>>>> > > > > > >> >   > > > > > local mode (./bin/jobmanager.sh start-foreground local). That will fail if
>>>>>> > > > > > >> >   > > > > > the heap wasn't configured and that's how I noticed it.
>>>>>> > > > > > >> >   > > > > >
>>>>>> > > > > > >> >   > > > > > Thanks,
>>>>>> > > > > > >> >   > > > > > Thomas
>>>>>> > > > > > >> >   > > > > >
>>>>>> > > > > > >> >   > > > > > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
>>>>>> > > > > > >> >   > > > > >
>>>>>> > > > > > >> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <[hidden email].invalid> wrote:
>>>>>> > > > > > >> >   > > > > >
>>>>>> > > > > > >> >   > > > > > > Hi everyone,
>>>>>> > > > > > >> >   > > > > > >
>>>>>> > > > > > >> >   > > > > > > Please review and vote on the release candidate #4 for the version 1.11.0,
>>>>>> > > > > > >> >   > > > > > > as follows:
>>>>>> > > > > > >> >   > > > > > > [ ] +1, Approve the release
>>>>>> > > > > > >> >   > > > > > > [ ] -1, Do not approve the release (please provide specific comments)
>>>>>> > > > > > >> >   > > > > > >
>>>>>> > > > > > >> >   > > > > > > The complete staging area is available for your review, which includes:
>>>>>> > > > > > >> >   > > > > > > * JIRA release notes [1],
>>>>>> > > > > > >> >   > > > > > > * the official Apache source release and binary convenience releases to be
>>>>>> > > > > > >> >   > > > > > > deployed to dist.apache.org [2], which are signed with the key with
>>>>>> > > > > > >> >   > > > > > > fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
>>>>>> > > > > > >> >   > > > > > > * all artifacts to be deployed to the Maven Central Repository [4],
>>>>>> > > > > > >> >   > > > > > > * source code tag "release-1.11.0-rc4" [5],
>>>>>> > > > > > >> >   > > > > > > * website pull request listing the new release and adding announcement
>>>>>> > > > > > >> >   > > > > > > blog post [6].
>>>>>> > > > > > >> >   > > > > > >
>>>>>> > > > > > >> >   > > > > > > The vote will be open for at least 72 hours. It is adopted by majority
>>>>>> > > > > > >> >   > > > > > > approval, with at least 3 PMC affirmative votes.
>>>>>> > > > > > >> >   > > > > > >
>>>>>> > > > > > >> >   > > > > > > Thanks,
>>>>>> > > > > > >> >   > > > > > > Release Manager
>>>>>> > > > > > >> >   > > > > > >
>>>>>> > > > > > >> >   > > > > > > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
>>>>>> > > > > > >> >   > > > > > > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
>>>>>> > > > > > >> >   > > > > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
>>>>>> > > > > > >> >   > > > > > > [4] https://repository.apache.org/content/repositories/orgapacheflink-1377/
>>>>>> > > > > > >> >   > > > > > > [5] https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
>>>>>> > > > > > >> >   > > > > > > [6] https://github.com/apache/flink-web/pull/352
>>>>>> > > > > > >> >   > > > > > >
>>>>>> > --
>>>>>> > Regards,
>>>>>> > Roman
>>>>>> >
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Roman
>>>>>
>>>>
>>
>> --
>> Regards,
>> Roman
>>
>
>
> --
> Regards,
> Roman
>

Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

Piotr Nowojski-5
Thanks, Thomas, for reporting the problem, for analysing which commit caused it,
and now for verifying that it was fixed :) Much appreciated.

Piotrek

On Thu, Aug 13, 2020 at 18:18 Thomas Weise <[hidden email]> wrote:

> Hi Roman,
>
> Thanks for working on this! I deployed the change and it appears to be
> working as expected.
>
> Will monitor over a period of time to compare the checkpoint counts and get
> back to you if there are still issues.
>
> Thomas
>
>
> On Thu, Aug 13, 2020 at 3:41 AM Roman Khachatryan <[hidden email]> wrote:
>
> > Hi Thomas,
> >
> > The fix is now merged to master and to release-1.11.
> > So if you'd like you can check if it solves your problem (it would be
> > helpful for us too).
> >
> > On Sat, Aug 8, 2020 at 9:26 AM Roman Khachatryan <[hidden email]> wrote:
> >
> >> Hi Thomas,
> >>
> >> Thanks a lot for the detailed information.
> >>
> >> I think the problem is in CheckpointCoordinator. It stores the last
> >> checkpoint completion time after checking queued requests.
> >> I've created a ticket to fix this:
> >> https://issues.apache.org/jira/browse/FLINK-18856
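
The diagnosis above is about the order of two steps when a checkpoint completes. What follows is a toy model in plain Java, not Flink's CheckpointCoordinator code: the class name, the method name, and the assumption that a request is always queued by the time a checkpoint completes are all illustrative. Using the numbers from this thread (a sync part of about 30s and a min pause of 10s), it sketches how recording the completion time after consulting the queue could let the next checkpoint start immediately, while recording it first enforces the pause.

```java
public class MinPauseSketch {

    /**
     * Counts the checkpoints that complete within horizonMs in a toy model where
     * only one checkpoint runs at a time and a new request is always waiting.
     *
     * @param recordCompletionFirst true: store the completion time, then check the
     *                              queue; false: check the queue against the stale
     *                              completion time, then store the new one
     */
    static int completedCheckpoints(long durationMs, long minPauseMs, long horizonMs,
                                    boolean recordCompletionFirst) {
        long lastCompletion = Long.MIN_VALUE / 2; // "no checkpoint completed yet"
        long start = 0;
        int completed = 0;
        while (start + durationMs <= horizonMs) {
            long now = start + durationMs; // the running checkpoint completes
            completed++;
            long earliestNext;
            if (recordCompletionFirst) {
                lastCompletion = now;
                earliestNext = Math.max(now, lastCompletion + minPauseMs);
            } else {
                // The pause is measured from the previous completion, which is
                // already far enough in the past, so the queued request runs now.
                earliestNext = Math.max(now, lastCompletion + minPauseMs);
                lastCompletion = now;
            }
            start = earliestNext;
        }
        return completed;
    }

    public static void main(String[] args) {
        // 30s per checkpoint, 10s min pause, 10 minute window.
        System.out.println(completedCheckpoints(30_000, 10_000, 600_000, false)); // prints 20
        System.out.println(completedCheckpoints(30_000, 10_000, 600_000, true));  // prints 15
    }
}
```

In this model the stale ordering yields back-to-back checkpoints (one every 30s), whereas the other ordering yields one every 40s. Whether the real scheduler behaves exactly this way is not established by the sketch; FLINK-18856 is the authoritative description.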
> >>
> >>
> >> On Sat, Aug 8, 2020 at 5:25 AM Thomas Weise <[hidden email]> wrote:
> >>
> >>> Just another update:
> >>>
> >>> The duration of snapshotState is capped by the Kinesis
> >>> producer's "RecordTtl" setting (default 30s). The sleep time in
> >>> flushSync does not contribute to the observed behavior.
> >>>
> >>> I guess the open question is why, with the same settings, 1.11 has been
> >>> processing more checkpoints since commit
> >>> 355184d69a8519d29937725c8d85e8465d7e3a90?
> >>>
> >>>
> >>> On Fri, Aug 7, 2020 at 9:15 AM Thomas Weise <[hidden email]> wrote:
> >>>
> >>>> Hi Roman,
> >>>>
> >>>> Here are the checkpoint summaries for both commits:
> >>>>
> >>>> https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit#slide=id.g86d15b2fc7_0_0
> >>>>
> >>>> The config:
> >>>>
> >>>>     CheckpointConfig checkpointConfig = env.getCheckpointConfig();
> >>>>     checkpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
> >>>>     checkpointConfig.setCheckpointInterval(*10_000*);
> >>>>     checkpointConfig.setMinPauseBetweenCheckpoints(*10_000*);
> >>>>     checkpointConfig.enableExternalizedCheckpoints(DELETE_ON_CANCELLATION);
> >>>>     checkpointConfig.setCheckpointTimeout(600_000);
> >>>>     checkpointConfig.setMaxConcurrentCheckpoints(1);
> >>>>     checkpointConfig.setFailOnCheckpointingErrors(true);
> >>>>
> >>>> Changing the values marked bold to *60_000* makes the symptom
> >>>> disappear. I have meanwhile also verified that with the 1.11.0 release
> >>>> commit.
> >>>>
> >>>> I will take a look at the sleep time issue.
> >>>>
> >>>> Thanks,
> >>>> Thomas
> >>>>
> >>>>
> >>>> On Fri, Aug 7, 2020 at 1:44 AM Roman Khachatryan <[hidden email]> wrote:
> >>>>
> >>>>> Hi Thomas,
> >>>>>
> >>>>> Thanks for your reply!
> >>>>>
> >>>>> I think you are right, we can remove this sleep and improve
> >>>>> KinesisProducer.
> >>>>> Its snapshotState can probably also be sped up by forcing record
> >>>>> flushes more often.
> >>>>> Do you see that the 30s checkpointing duration is caused
> >>>>> by KinesisProducer (or maybe by other operators)?
> >>>>>
> >>>>> I'd also like to understand the reason behind this increase in
> >>>>> checkpoint frequency.
> >>>>> Can you please share these values:
> >>>>>  - execution.checkpointing.min-pause
> >>>>>  - execution.checkpointing.max-concurrent-checkpoints
> >>>>>  - execution.checkpointing.timeout
> >>>>>
> >>>>> And what is the "new" observed checkpoint frequency (or how many
> >>>>> checkpoints are created) compared to older versions?
> >>>>>
> >>>>>
> >>>>> On Fri, Aug 7, 2020 at 4:49 AM Thomas Weise <[hidden email]> wrote:
> >>>>>
> >>>>>> Hi Roman,
> >>>>>>
> >>>>>> Indeed there are more frequent checkpoints with this change! The
> >>>>>> application was configured to checkpoint every 10s. With 1.10 ("good
> >>>>>> commit"), that leads to fewer completed checkpoints compared to 1.11
> >>>>>> ("bad commit"). Just to be clear, the only difference between the two
> >>>>>> runs was the commit 355184d69a8519d29937725c8d85e8465d7e3a90
> >>>>>>
> >>>>>> Since the sync part of checkpoints with the Kinesis producer always
> >>>>>> takes ~30 seconds, the 10s configured checkpoint frequency really had
> >>>>>> no effect before 1.11. I confirmed that both commits perform comparably
> >>>>>> by setting the checkpoint frequency and min pause to 60s.
> >>>>>>
> >>>>>> I still have to verify with the final 1.11.0 release commit.
> >>>>>>
> >>>>>> It's probably good to take a look at the Kinesis producer. Is it really
> >>>>>> necessary to have 500ms sleep time? What's responsible for the ~30s
> >>>>>> duration in snapshotState?
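
A minimal sketch of the polling granularity being asked about, assuming the flush loop quoted elsewhere in this thread (flush, then sleep 500ms for as long as records are outstanding). It is a plain-Java model with simulated time, not the connector's code: `blockedMs` and `ackDelayMs` are hypothetical names, and the cost of `flush()` itself is ignored. It illustrates only that such a loop blocks in whole multiples of the sleep interval; it says nothing about where the ~30s comes from.

```java
public class FlushLoopSketch {

    /**
     * Simulated time a flush-and-sleep loop blocks, given that the outstanding
     * records are acknowledged ackDelayMs after the loop starts.
     */
    static long blockedMs(long ackDelayMs, long sleepMs) {
        long elapsed = 0;
        while (elapsed < ackDelayMs) { // records are still outstanding
            // producer.flush() would be called here in the real loop
            elapsed += sleepMs;        // stands in for Thread.sleep(sleepMs)
        }
        return elapsed;
    }

    public static void main(String[] args) {
        System.out.println(blockedMs(1, 500));     // prints 500
        System.out.println(blockedMs(1_200, 500)); // prints 1500
        System.out.println(blockedMs(1_200, 50));  // prints 1200
    }
}
```

Under these assumptions a shorter sleep only trims the rounding overhead (at most one interval per snapshot); the bulk of the wait is whatever time the producer needs to drain its records.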
> >>>>>>
> >>>>>> As things stand it doesn't make sense to use checkpoint intervals < 30s
> >>>>>> when using the Kinesis producer.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Thomas
> >>>>>>
> >>>>>> On Sat, Aug 1, 2020 at 2:53 PM Roman Khachatryan <[hidden email]> wrote:
> >>>>>>
> >>>>>> > Hi Thomas,
> >>>>>> >
> >>>>>> > Thanks a lot for the analysis.
> >>>>>> >
> >>>>>> > The first thing that I'd check is whether checkpoints became more frequent
> >>>>>> > with this commit (as each of them adds at least 500ms if there is at least
> >>>>>> > one unsent record, according to FlinkKinesisProducer.snapshotState).
> >>>>>> >
> >>>>>> > Can you share checkpointing statistics (1.10 vs 1.11 or last "good" vs
> >>>>>> > first "bad" commits)?
> >>>>>> >
> >>>>>> > On Fri, Jul 31, 2020 at 5:29 AM Thomas Weise <[hidden email]> wrote:
> >>>>>> >
> >>>>>> > > I ran git bisect and the first commit that shows the regression is:
> >>>>>> > >
> >>>>>> > > https://github.com/apache/flink/commit/355184d69a8519d29937725c8d85e8465d7e3a90
> >>>>>> > >
> >>>>>> > >
> >>>>>> > > On Thu, Jul 23, 2020 at 6:46 PM Kurt Young <[hidden email]> wrote:
> >>>>>> > >
> >>>>>> > > > From my experience, Java profilers are sometimes not accurate enough to
> >>>>>> > > > find the root cause of a performance regression. In this case, I would
> >>>>>> > > > suggest you try out Intel VTune Amplifier to watch more detailed metrics.
> >>>>>> > > >
> >>>>>> > > > Best,
> >>>>>> > > > Kurt
> >>>>>> > > >
> >>>>>> > > >
> >>>>>> > > > On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise <[hidden email]> wrote:
> >>>>>> > > >
> >>>>>> > > > > The cause of the issue is anything but clear.
> >>>>>> > > > >
> >>>>>> > > > > Previously I had mentioned that there is no suspect change to the Kinesis
> >>>>>> > > > > connector and that I had reverted the AWS SDK change to no effect.
> >>>>>> > > > >
> >>>>>> > > > > https://issues.apache.org/jira/browse/FLINK-17496 actually fixed another
> >>>>>> > > > > regression in the previous release and is present before and after.
> >>>>>> > > > >
> >>>>>> > > > > I repeated the run with 1.11.0 core and downgraded the entire Kinesis
> >>>>>> > > > > connector to 1.10.1: Nothing changes, i.e. the regression is still present.
> >>>>>> > > > > Therefore we will need to look elsewhere for the root cause.
> >>>>>> > > > >
> >>>>>> > > > > Regarding the time spent in snapshotState, repeat runs reveal a wide range
> >>>>>> > > > > for both versions, 1.10 and 1.11. So again this is nothing pointing to a
> >>>>>> > > > > root cause.
> >>>>>> > > > >
> >>>>>> > > > > At this point, I have no ideas remaining other than doing a bisect to find
> >>>>>> > > > > the culprit. Any other suggestions?
> >>>>>> > > > >
> >>>>>> > > > > Thomas
> >>>>>> > > > >
> >>>>>> > > > >
> >>>>>> > > > > On Thu, Jul 16, 2020 at 9:19 PM Zhijiang <[hidden email].invalid> wrote:
> >>>>>> > > > >
> >>>>>> > > > > > Hi Thomas,
> >>>>>> > > > > >
> >>>>>> > > > > > Thanks for your further profiling information; glad to see we have already
> >>>>>> > > > > > narrowed down the location causing the regression.
> >>>>>> > > > > > Actually, I was also suspicious of #snapshotState in previous discussions,
> >>>>>> > > > > > since it indeed takes a long time and blocks normal operator processing.
> >>>>>> > > > > >
> >>>>>> > > > > > Based on your feedback below, the sleep time during #snapshotState might
> >>>>>> > > > > > be the main concern, and I also dug into the implementation of
> >>>>>> > > > > > FlinkKinesisProducer#snapshotState.
> >>>>>> > > > > > while (producer.getOutstandingRecordsCount() > 0) {
> >>>>>> > > > > >    producer.flush();
> >>>>>> > > > > >    try {
> >>>>>> > > > > >       Thread.sleep(500);
> >>>>>> > > > > >    } catch (InterruptedException e) {
> >>>>>> > > > > >       LOG.warn("Flushing was interrupted.");
> >>>>>> > > > > >       break;
> >>>>>> > > > > >    }
> >>>>>> > > > > > }
> >>>>>> > > > > > It seems that the sleep time is mainly affected by the internal operations
> >>>>>> > > > > > inside the KinesisProducer implementation provided by amazonaws, which I am
> >>>>>> > > > > > not quite familiar with.
> >>>>>> > > > > > But I noticed there were two upgrades related to it in release-1.11.0. One
> >>>>>> > > > > > is for upgrading amazon-kinesis-producer to 0.14.0 [1] and another is for
> >>>>>> > > > > > upgrading aws-sdk-version to 1.11.754 [2].
> >>>>>> > > > > > You mentioned that you already reverted the SDK upgrade and saw no
> >>>>>> > > > > > change. Did you also revert [1] to verify?
> >>>>>> > > > > > [1] https://issues.apache.org/jira/browse/FLINK-17496
> >>>>>> > > > > > [2] https://issues.apache.org/jira/browse/FLINK-14881
> >>>>>> > > > > >
> >>>>>> > > > > > Best,
> >>>>>> > > > > > Zhijiang
> >>>>>> > > > > >
> >>>>>> > > > > > ------------------------------------------------------------------
> >>>>>> > > > > > From:Thomas Weise <[hidden email]>
> >>>>>> > > > > > Send Time: July 17, 2020 (Friday) 05:29
> >>>>>> > > > > > To:dev <[hidden email]>
> >>>>>> > > > > > Cc:Zhijiang <[hidden email]>; Stephan Ewen <[hidden email]>;
> >>>>>> > > > > > Arvid Heise <[hidden email]>; Aljoscha Krettek <[hidden email]>
> >>>>>> > > > > > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release
> >>>>>> > > > > > candidate #4)
> >>>>>> > > > > >
> >>>>>> > > > > > Sorry for the delay.
> >>>>>> > > > > >
> >>>>>> > > > > > I confirmed that the regression is due to the sink (unsurprising, since
> >>>>>> > > > > > another job with the same consumer, but not the producer, runs as
> >>>>>> > > > > > expected).
> >>>>>> > > > > >
> >>>>>> > > > > > As promised I did CPU profiling on the problematic application, which gives
> >>>>>> > > > > > more insight into the regression [1]
> >>>>>> > > > > >
> >>>>>> > > > > > The screenshots show that the average time for snapshotState increases from
> >>>>>> > > > > > ~9s to ~28s. The data also shows the increase in sleep time during
> >>>>>> > > > > > snapshotState.
> >>>>>> > > > > >
> >>>>>> > > > > > Does anyone, based on changes made in 1.11, have a theory why?
> >>>>>> > > > > >
> >>>>>> > > > > > I had previously looked at the changes to the Kinesis connector and also
> >>>>>> > > > > > reverted the SDK upgrade, which did not change the situation.
> >>>>>> > > > > >
> >>>>>> > > > > > It will likely be necessary to drill into the sink / checkpointing details
> >>>>>> > > > > > to understand the cause of the problem.
> >>>>>> > > > > >
> >>>>>> > > > > > Let me know if anyone has specific questions that I can answer from the
> >>>>>> > > > > > profiling results.
> >>>>>> > > > > >
> >>>>>> > > > > > Thomas
> >>>>>> > > > > >
> >>>>>> > > > > > [1] https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing
> >>>>>> > > > > >
> >>>>>> > > > > > On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <[hidden email]> wrote:
> >>>>>> > > > > >
> >>>>>> > > > > > > + dev@ for visibility
> >>>>>> > > > > > >
> >>>>>> > > > > > > I will investigate further today.
> >>>>>> > > > > > >
> >>>>>> > > > > > >
> >>>>>> > > > > > > On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <[hidden email]> wrote:
> >>>>>> > > > > > >
> >>>>>> > > > > > >> On 06.07.20 20:39, Stephan Ewen wrote:
> >>>>>> > > > > > >> >    - Did sink checkpoint notifications change in a relevant way, for
> >>>>>> > > > > > >> > example due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
> >>>>>> > > > > > >>
> >>>>>> > > > > > >> I think that's unrelated: the Kafka fixes were isolated in Kafka and the
> >>>>>> > > > > > >> one bug I discovered on the way was about the Task reaper.
> >>>>>> > > > > > >>
> >>>>>> > > > > > >>
> >>>>>> > > > > > >> On 07.07.20 17:51, Zhijiang wrote:
> >>>>>> > > > > > >> > Sorry for my misunderstanding of the previous information, Thomas. I was
> >>>>>> > > > > > >> > assuming that the sync checkpoint duration increased after the upgrade, as
> >>>>>> > > > > > >> > it was mentioned before.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > If I remember correctly, the memory state backend also has the same
> >>>>>> > > > > > >> > issue? If so, we can rule out the RocksDB state changes. With slot sharing
> >>>>>> > > > > > >> > enabled, the downstream and upstream are probably deployed into the same
> >>>>>> > > > > > >> > slot, so there is no network shuffle effect.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > I think we need to find out whether any symptoms other than the
> >>>>>> > > > > > >> > performance regression have changed, to further narrow down the scope.
> >>>>>> > > > > > >> > E.g. any metrics changes, or changes in the number of TaskManagers and the
> >>>>>> > > > > > >> > number of slots per TaskManager in the deployment.
> >>>>>> > > > > > >> > A 40% regression is really big; I guess the changes should also be
> >>>>>> > > > > > >> > reflected in other places.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > I am not sure whether we can reproduce the regression in our AWS
> >>>>>> > > > > > >> > environment by writing arbitrary Kinesis jobs, since, as Thomas mentioned,
> >>>>>> > > > > > >> > there are also Kinesis jobs that run normally after the upgrade.
> >>>>>> > > > > > >> > So it probably hits some corner case. I am very willing to provide any
> >>>>>> > > > > > >> > help with debugging if possible.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Best,
> >>>>>> > > > > > >> > Zhijiang
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > ------------------------------------------------------------------
> >>>>>> > > > > > >> > From:Thomas Weise <[hidden email]>
> >>>>>> > > > > > >> > Send Time: July 7, 2020 (Tuesday) 23:01
> >>>>>> > > > > > >> > To:Stephan Ewen <[hidden email]>
> >>>>>> > > > > > >> > Cc:Aljoscha Krettek <[hidden email]>; Arvid
> >>>>>> Heise <
> >>>>>> > > > > > >> [hidden email]>; Zhijiang <
> >>>>>> [hidden email]>
> >>>>>> > > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE]
> >>>>>> Release
> >>>>>> > > 1.11.0,
> >>>>>> > > > > > >> release candidate #4)
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > We are deploying our apps with FlinkK8sOperator. We
> >>>>>> have one
> >>>>>> > job
> >>>>>> > > > > that
> >>>>>> > > > > > >> works as expected after the upgrade and the one
> >>>>>> discussed here
> >>>>>> > > that
> >>>>>> > > > > has
> >>>>>> > > > > > the
> >>>>>> > > > > > >> performance regression.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > "The performance regression is obvious caused by long
> >>>>>> duration
> >>>>>> > > of
> >>>>>> > > > > sync
> >>>>>> > > > > > >> checkpoint process in Kinesis sink operator, which
> would
> >>>>>> block
> >>>>>> > the
> >>>>>> > > > > > normal
> >>>>>> > > > > > >> data processing until back pressure the source."
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > That's a constant. Before (1.10) and upgrade have the
> >>>>>> same
> >>>>>> > sync
> >>>>>> > > > > > >> checkpointing time. The question is what change came in
> >>>>>> with the
> >>>>>> > > > > > upgrade.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <
> >>>>>> [hidden email]
> >>>>>> > >
> >>>>>> > > > > wrote:
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > @Thomas Just one thing real quick: Are you using the standalone setup
> >>>>>> > > > > > >> > scripts (like start-cluster.sh, and the former "slaves" file)?
> >>>>>> > > > > > >> > Be aware that this file is now called "workers", to avoid sensitive names.
> >>>>>> > > > > > >> > In one internal benchmark we saw quite a lot of slowdown initially,
> >>>>>> > > > > > >> > before seeing that the cluster was not a distributed cluster any more ;-)
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <
> >>>>>> > > > [hidden email]
> >>>>>> > > > > >
> >>>>>> > > > > > >> wrote:
> >>>>>> > > > > > >> > Thanks for this kickoff and help analysis, Stephan!
> >>>>>> > > > > > >> > Thanks for the further feedback and investigation,
> >>>>>> Thomas!
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > The performance regression is obvious caused by long
> >>>>>> duration
> >>>>>> > of
> >>>>>> > > > > sync
> >>>>>> > > > > > >> checkpoint process in Kinesis sink operator, which
> would
> >>>>>> block
> >>>>>> > the
> >>>>>> > > > > > normal
> >>>>>> > > > > > >> data processing until back pressure the source.
> >>>>>> > > > > > >> > Maybe we could dig into the process of sync execution
> >>>>>> in
> >>>>>> > > > checkpoint.
> >>>>>> > > > > > >> E.g. break down the steps inside respective
> >>>>>> > operator#snapshotState
> >>>>>> > > > to
> >>>>>> > > > > > >> statistic which operation cost most of the time, then
> >>>>>> > > > > > >> > we might probably find the root cause to bring such
> >>>>>> cost.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Look forward to the further progress. :)
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Best,
> >>>>>> > > > > > >> > Zhijiang
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > ------------------------------------------------------------------
> >>>>>> > > > > > >> > From:Stephan Ewen <[hidden email]>
> >>>>>> > > > > > >> > Send Time: July 7, 2020 (Tuesday) 14:52
> >>>>>> > > > > > >> > To:Thomas Weise <[hidden email]>
> >>>>>> > > > > > >> > Cc:Stephan Ewen <[hidden email]>; Zhijiang <
> >>>>>> > > > > > >> [hidden email]>; Aljoscha Krettek <
> >>>>>> > > [hidden email]
> >>>>>> > > > >;
> >>>>>> > > > > > >> Arvid Heise <[hidden email]>
> >>>>>> > > > > > >> > Subject:Re: Kinesis Performance Issue (was [VOTE]
> >>>>>> Release
> >>>>>> > > 1.11.0,
> >>>>>> > > > > > >> release candidate #4)
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Thank you for digging so deeply.
> >>>>>> > > > > > >> > A mysterious thing, this regression.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <
> >>>>>> [hidden email]>
> >>>>>> > wrote:
> >>>>>> > > > > > >> > @Stephan: yes, I refer to sync time in the web UI (it
> >>>>>> is
> >>>>>> > > unchanged
> >>>>>> > > > > > >> between 1.10 and 1.11 for the specific pipeline).
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > I verified that increasing the checkpointing interval
> >>>>>> does not
> >>>>>> > > > make
> >>>>>> > > > > a
> >>>>>> > > > > > >> difference.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > I looked at the Kinesis connector changes since
> 1.10.1
> >>>>>> and
> >>>>>> > don't
> >>>>>> > > > see
> >>>>>> > > > > > >> anything that could cause this.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Another pipeline that is using the Kinesis consumer
> >>>>>> (but not
> >>>>>> > the
> >>>>>> > > > > > >> producer) performs as expected.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > I tried reverting the AWS SDK version change,
> symptoms
> >>>>>> remain
> >>>>>> > > > > > unchanged:
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml b/flink-connectors/flink-connector-kinesis/pom.xml
> >>>>>> > > > > > >> > index a6abce23ba..741743a05e 100644
> >>>>>> > > > > > >> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
> >>>>>> > > > > > >> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
> >>>>>> > > > > > >> > @@ -33,7 +33,7 @@ under the License.
> >>>>>> > > > > > >> >          <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
> >>>>>> > > > > > >> >          <name>flink-connector-kinesis</name>
> >>>>>> > > > > > >> >          <properties>
> >>>>>> > > > > > >> > -                <aws.sdk.version>1.11.754</aws.sdk.version>
> >>>>>> > > > > > >> > +                <aws.sdk.version>1.11.603</aws.sdk.version>
> >>>>>> > > > > > >> >                  <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
> >>>>>> > > > > > >> >                  <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
> >>>>>> > > > > > >> >                  <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > I'm planning to take a look with a profiler next.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Thomas
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <
> >>>>>> > [hidden email]>
> >>>>>> > > > > > wrote:
> >>>>>> > > > > > >> > Hi all!
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Forking this thread out of the release vote thread.
> >>>>>> > > > > > >> > From what Thomas describes, it really sounds like a sink-specific issue.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > @Thomas: When you say sink has a long synchronous checkpoint time, you
> >>>>>> > > > > > >> > mean the time that is shown as "sync time" on the metrics and web UI? That
> >>>>>> > > > > > >> > is not including any network buffer related operations. It is purely the
> >>>>>> > > > > > >> > operator's time.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Can we dig into the changes we did in sinks:
> >>>>>> > > > > > >> >    - Kinesis version upgrade, AWS library updates
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >    - Could it be that some call (checkpoint complete) that was previously
> >>>>>> > > > > > >> > (1.10) in a separate thread is now in the mailbox and this simply reduces
> >>>>>> > > > > > >> > the number of threads that do the work?
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >    - Did sink checkpoint notifications change in a relevant way, for
> >>>>>> > > > > > >> > example due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > Best,
> >>>>>> > > > > > >> > Stephan
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang <[hidden email].invalid> wrote:
> >>>>>> > > > > > >> > Hi Thomas,
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >   Regarding [2], it has more detail infos in the Jira
> >>>>>> > > description
> >>>>>> > > > (
> >>>>>> > > > > > >> https://issues.apache.org/jira/browse/FLINK-16404).
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >   I can also give some basic explanations here to
> >>>>>> dismiss the
> >>>>>> > > > > concern.
> >>>>>> > > > > > >> >   1. In the past, the following buffers after the
> >>>>>> barrier will
> >>>>>> > > be
> >>>>>> > > > > > >> cached on downstream side before alignment.
> >>>>>> > > > > > >> >   2. In 1.11, the upstream would not send the buffers
> >>>>>> after
> >>>>>> > the
> >>>>>> > > > > > >> barrier. When the downstream finishes the alignment, it
> >>>>>> will
> >>>>>> > > notify
> >>>>>> > > > > the
> >>>>>> > > > > > >> downstream of continuing sending following buffers,
> >>>>>> since it can
> >>>>>> > > > > process
> >>>>>> > > > > > >> them after alignment.
> >>>>>> > > > > > >> >   3. The only difference is that the temporary
> blocked
> >>>>>> buffers
> >>>>>> > > are
> >>>>>> > > > > > >> cached either on downstream side or on upstream side
> >>>>>> before
> >>>>>> > > > alignment.
> >>>>>> > > > > > >> >   4. The side effect would be the additional
> >>>>>> notification cost
> >>>>>> > > for
> >>>>>> > > > > > >> every barrier alignment. If the downstream and upstream
> >>>>>> are
> >>>>>> > > deployed
> >>>>>> > > > > in
> >>>>>> > > > > > >> separate TaskManagers, the cost is the network transport
> >>>>>> delay (the
> >>>>>> > > > effect
> >>>>>> > > > > > can
> >>>>>> > > > > > >> be ignored based on our testing with 1s checkpoint
> >>>>>> interval).
> >>>>>> > For
> >>>>>> > > > > > sharing
> >>>>>> > > > > > >> slot in your case, the cost is only one method call in
> >>>>>> > the processor, which
> >>>>>> > > > can
> >>>>>> > > > > be
> >>>>>> > > > > > >> ignored as well.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >   You mentioned "In this case, the downstream task
> has
> >>>>>> a high
> >>>>>> > > > > average
> >>>>>> > > > > > >> checkpoint duration(~30s, sync part)." This duration is
> >>>>>> not
> >>>>>> > > > reflecting
> >>>>>> > > > > > the
> >>>>>> > > > > > >> changes above, and it is only indicating the duration
> for
> >>>>>> > calling
> >>>>>> > > > > > >> `Operation.snapshotState`.
> >>>>>> > > > > > >> >   If this duration is beyond your expectation, you
> can
> >>>>>> check
> >>>>>> > or
> >>>>>> > > > > debug
> >>>>>> > > > > > >> whether the source/sink operations might take more time
> >>>>>> to
> >>>>>> > finish
> >>>>>> > > > > > >> `snapshotState` in practice. E.g. you can
> >>>>>> > > > > > >> >   make the implementation of this method empty to
> >>>>>> further
> >>>>>> > > > verify
> >>>>>> > > > > > the
> >>>>>> > > > > > >> effect.
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >   Best,
> >>>>>> > > > > > >> >   Zhijiang
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >
> >>>>>> > > >
> >>>>>> ------------------------------------------------------------------
> >>>>>> > > > > > >> >   From:Thomas Weise <[hidden email]>
> >>>>>> > > > > > >> >   Send Time:July 5, 2020 (Sunday) 12:22
> >>>>>> > > > > > >> >   To:dev <[hidden email]>; Zhijiang <
> >>>>>> > > > > [hidden email]
> >>>>>> > > > > > >
> >>>>>> > > > > > >> >   Cc:Yingjie Cao <[hidden email]>
> >>>>>> > > > > > >> >   Subject:Re: [VOTE] Release 1.11.0, release
> candidate
> >>>>>> #4
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >   Hi Zhijiang,
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >   Could you please point me to more details
> regarding:
> >>>>>> "[2]:
> >>>>>> > > Delay
> >>>>>> > > > > > send
> >>>>>> > > > > > >> the
> >>>>>> > > > > > >> >   following buffers after checkpoint barrier on
> >>>>>> upstream side
> >>>>>> > > > until
> >>>>>> > > > > > >> barrier
> >>>>>> > > > > > >> >   alignment on downstream side."
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >   In this case, the downstream task has a high
> average
> >>>>>> > > checkpoint
> >>>>>> > > > > > >> duration
> >>>>>> > > > > > >> >   (~30s, sync part). If there was a change to hold
> >>>>>> buffers
> >>>>>> > > > depending
> >>>>>> > > > > > on
> >>>>>> > > > > > >> >   downstream performance, could this possibly apply
> to
> >>>>>> this
> >>>>>> > case
> >>>>>> > > > > (even
> >>>>>> > > > > > >> when
> >>>>>> > > > > > >> >   there is no shuffle that would require alignment)?
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >   Thanks,
> >>>>>> > > > > > >> >   Thomas
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <
> >>>>>> > > > > [hidden email]
> >>>>>> > > > > > >> .invalid>
> >>>>>> > > > > > >> >   wrote:
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >   > Hi Thomas,
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > Thanks for the further update information.
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > I guess we can dismiss the network stack changes,
> >>>>>> since in
> >>>>>> > > > your
> >>>>>> > > > > > >> case the
> >>>>>> > > > > > >> >   > downstream and upstream would probably be
> deployed
> >>>>>> in the
> >>>>>> > > same
> >>>>>> > > > > > slot
> >>>>>> > > > > > >> >   > bypassing the network data shuffle.
> >>>>>> > > > > > >> >   > Also I guess release-1.11 will not bring general
> >>>>>> > performance
> >>>>>> > > > > > >> regression in
> >>>>>> > > > > > >> >   > runtime engine, as we also did the performance
> >>>>>> testing for
> >>>>>> > > all
> >>>>>> > > > > > >> general
> >>>>>> > > > > > >> >   > cases by [1] in real cluster before and the
> testing
> >>>>>> > results
> >>>>>> > > > > should
> >>>>>> > > > > > >> fit the
> >>>>>> > > > > > >> >   > expectation. But we indeed did not test the
> >>>>>> specific
> >>>>>> > source
> >>>>>> > > > and
> >>>>>> > > > > > sink
> >>>>>> > > > > > >> >   > connectors yet, as far as I know.
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > Regarding your 40% performance regression, I
> >>>>>> suspect
> >>>>>> > it
> >>>>>> > > is
> >>>>>> > > > > > >> probably
> >>>>>> > > > > > >> >   > related to specific source/sink changes (e.g.
> >>>>>> kinesis) or
> >>>>>> > > > > > >> environment
> >>>>>> > > > > > >> >   > issues with corner case.
> >>>>>> > > > > > >> >   > If possible, it would be helpful to further
> locate
> >>>>>> whether
> >>>>>> > > the
> >>>>>> > > > > > >> regression
> >>>>>> > > > > > >> >   > is caused by kinesis, by replacing the kinesis
> >>>>>> source &
> >>>>>> > sink
> >>>>>> > > > and
> >>>>>> > > > > > >> keeping
> >>>>>> > > > > > >> >   > the others same.
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > As you said, it would be efficient to contact
> >>>>>> you
> >>>>>> > > > directly
> >>>>>> > > > > > >> next week
> >>>>>> > > > > > >> >   > to further discuss this issue. And we are
> >>>>>> willing/eager to
> >>>>>> > > > > provide
> >>>>>> > > > > > >> any help
> >>>>>> > > > > > >> >   > to resolve this issue soon.
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > Besides that, I guess this issue should not be
> the
> >>>>>> blocker
> >>>>>> > > for
> >>>>>> > > > > the
> >>>>>> > > > > > >> >   > release, since it is probably a corner case based
> >>>>>> on the
> >>>>>> > > > current
> >>>>>> > > > > > >> analysis.
> >>>>>> > > > > > >> >   > If we really conclude anything needs to be
> resolved
> >>>>>> after
> >>>>>> > the
> >>>>>> > > > > final
> >>>>>> > > > > > >> >   > release, then we can also make the next minor
> >>>>>> > release-1.11.1
> >>>>>> > > > > come
> >>>>>> > > > > > >> soon.
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > [1]
> >>>>>> https://issues.apache.org/jira/browse/FLINK-18433
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > Best,
> >>>>>> > > > > > >> >   > Zhijiang
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   >
> >>>>>> > > > >
> >>>>>> ------------------------------------------------------------------
> >>>>>> > > > > > >> >   > From:Thomas Weise <[hidden email]>
> >>>>>> > > > > > >> >   > Send Time:July 4, 2020 (Saturday) 12:26
> >>>>>> > > > > > >> >   > To:dev <[hidden email]>; Zhijiang <
> >>>>>> > > > > > [hidden email]
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >   > Cc:Yingjie Cao <[hidden email]>
> >>>>>> > > > > > >> >   > Subject:Re: [VOTE] Release 1.11.0, release
> >>>>>> candidate #4
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > Hi Zhijiang,
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > It will probably be best if we connect next week
> >>>>>> and
> >>>>>> > discuss
> >>>>>> > > > the
> >>>>>> > > > > > >> issue
> >>>>>> > > > > > >> >   > directly since this could be quite difficult to
> >>>>>> reproduce.
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > Before the testing result on our side comes out
> >>>>>> for your
> >>>>>> > > > > > respective
> >>>>>> > > > > > >> job
> >>>>>> > > > > > >> >   > case, I have some other questions to confirm for
> >>>>>> further
> >>>>>> > > > > analysis:
> >>>>>> > > > > > >> >   >     -  What percentage regression did you find
> >>>>>> after
> >>>>>> > > > switching
> >>>>>> > > > > to
> >>>>>> > > > > > >> 1.11?
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > ~40% throughput decline
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   >     -  Is there any network bottleneck in your
> >>>>>> cluster?
> >>>>>> > > E.g.
> >>>>>> > > > > the
> >>>>>> > > > > > >> network
> >>>>>> > > > > > >> >   > bandwidth is full caused by other jobs? If so, it
> >>>>>> might
> >>>>>> > have
> >>>>>> > > > > more
> >>>>>> > > > > > >> effects
> >>>>>> > > > > > >> >   > by above [2]
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > The test runs on a k8s cluster that is also used
> >>>>>> for other
> >>>>>> > > > > > >> production jobs.
> >>>>>> > > > > > >> >   > There is no reason to believe the network is the
> >>>>>> bottleneck.
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   >     -  Did you adjust the default network buffer
> >>>>>> setting?
> >>>>>> > > E.g.
> >>>>>> > > > > > >> >   >
> >>>>>> "taskmanager.network.memory.floating-buffers-per-gate" or
> >>>>>> > > > > > >> >   > "taskmanager.network.memory.buffers-per-channel"
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > The job is using the defaults, i.e we don't
> >>>>>> configure the
> >>>>>> > > > > > settings.
> >>>>>> > > > > > >> If you
> >>>>>> > > > > > >> >   > want me to try specific settings in the hope that
> >>>>>> it will
> >>>>>> > > help
> >>>>>> > > > > to
> >>>>>> > > > > > >> isolate
> >>>>>> > > > > > >> >   > the issue please let me know.
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   >     -  I guess the topology has three vertexes
> >>>>>> > > > "KinesisConsumer
> >>>>>> > > > > ->
> >>>>>> > > > > > >> Chained
> >>>>>> > > > > > >> >   > FlatMap -> KinesisProducer", and the partition
> >>>>>> mode for
> >>>>>> > > > > > >> "KinesisConsumer ->
> >>>>>> > > > > > >> >   > FlatMap" and "FlatMap->KinesisProducer" are both
> >>>>>> > "forward"?
> >>>>>> > > If
> >>>>>> > > > > so,
> >>>>>> > > > > > >> the edge
> >>>>>> > > > > > >> >   > connection is one-to-one, not all-to-all, then
> the
> >>>>>> above
> >>>>>> > > > [1][2]
> >>>>>> > > > > > >> should have no
> >>>>>> > > > > > >> >   > effects in theory with default network buffer
> >>>>>> setting.
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > There are only 2 vertices and the edge is
> >>>>>> "forward".
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   >     - By slot sharing, I guess these three vertex
> >>>>>> > > parallelism
> >>>>>> > > > > task
> >>>>>> > > > > > >> would
> >>>>>> > > > > > >> >   > probably be deployed into the same slot, then the
> >>>>>> data
> >>>>>> > > shuffle
> >>>>>> > > > > is
> >>>>>> > > > > > >> by memory
> >>>>>> > > > > > >> >   > queue, not network stack. If so, the above [2]
> >>>>>> should have no
> >>>>>> > > > effect.
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > Yes, vertices share slots.
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   >     - I also saw some Jira changes for kinesis in
> >>>>>> this
> >>>>>> > > > release,
> >>>>>> > > > > > >> could you
> >>>>>> > > > > > >> >   > confirm that these changes would not affect the
> >>>>>> > performance?
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > I will need to take a look. 1.10 already had a
> >>>>>> regression
> >>>>>> > > > > > >> introduced by the
> >>>>>> > > > > > >> >   > Kinesis producer update.
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > Thanks,
> >>>>>> > > > > > >> >   > Thomas
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <
> >>>>>> > > > > > >> [hidden email]
> >>>>>> > > > > > >> >   > .invalid>
> >>>>>> > > > > > >> >   > wrote:
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   > > Hi Thomas,
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > Thanks for your reply with rich information!
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > We are trying to reproduce your case in our
> >>>>>> cluster to
> >>>>>> > > > further
> >>>>>> > > > > > >> verify it,
> >>>>>> > > > > > >> >   > > and  @Yingjie Cao is working on it now.
> >>>>>> > > > > > >> >   > > As we do not have a Kinesis consumer and producer
> >>>>>> > internally,
> >>>>>> > > so
> >>>>>> > > > > we
> >>>>>> > > > > > >> will
> >>>>>> > > > > > >> >   > > construct the common source and sink instead in
> >>>>>> the case
> >>>>>> > > of
> >>>>>> > > > > > >> backpressure.
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > Firstly, we can dismiss the RocksDB factor in
> this
> >>>>>> > release,
> >>>>>> > > > > since
> >>>>>> > > > > > >> you also
> >>>>>> > > > > > >> >   > > mentioned that "filesystem leads to same
> >>>>>> symptoms".
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > Secondly, if my understanding is right, you
> >>>>>> emphasize
> >>>>>> > that
> >>>>>> > > > the
> >>>>>> > > > > > >> regression
> >>>>>> > > > > > >> >   > > only exists for the jobs with low checkpoint
> >>>>>> interval
> >>>>>> > > (10s).
> >>>>>> > > > > > >> >   > > Based on that, I have two suspicions with the
> >>>>>> network
> >>>>>> > > > related
> >>>>>> > > > > > >> changes in
> >>>>>> > > > > > >> >   > > this release:
> >>>>>> > > > > > >> >   > >     - [1]: Limited the maximum backlog value
> >>>>>> (default
> >>>>>> > 10)
> >>>>>> > > in
> >>>>>> > > > > > >> subpartition
> >>>>>> > > > > > >> >   > > queue.
> >>>>>> > > > > > >> >   > >     - [2]: Delay send the following buffers
> after
> >>>>>> > > checkpoint
> >>>>>> > > > > > >> barrier on
> >>>>>> > > > > > >> >   > > upstream side until barrier alignment on
> >>>>>> downstream
> >>>>>> > side.
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > These changes are motivated by reducing the
> >>>>>> in-flight
> >>>>>> > > > buffers
> >>>>>> > > > > > to
> >>>>>> > > > > > >> speed up
> >>>>>> > > > > > >> >   > > checkpoint especially in the case of
> >>>>>> backpressure.
> >>>>>> > > > > > >> >   > > In theory they should have very minor
> >>>>>> performance effect
> >>>>>> > > and
> >>>>>> > > > > > >> actually we
> >>>>>> > > > > > >> >   > > also tested in cluster to verify within
> >>>>>> expectation
> >>>>>> > before
> >>>>>> > > > > > >> merging them,
> >>>>>> > > > > > >> >   > >  but maybe there are other corner cases we have
> >>>>>> not
> >>>>>> > > thought
> >>>>>> > > > of
> >>>>>> > > > > > >> before.
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > Before the testing result on our side comes out
> >>>>>> for your
> >>>>>> > > > > > >> respective job
> >>>>>> > > > > > >> >   > > case, I have some other questions to confirm
> for
> >>>>>> further
> >>>>>> > > > > > analysis:
> >>>>>> > > > > > >> >   > >     -  What percentage regression did you find
> >>>>>> after
> >>>>>> > > > > switching
> >>>>>> > > > > > >> to 1.11?
> >>>>>> > > > > > >> >   > >     -  Is there any network bottleneck in your
> >>>>>> cluster?
> >>>>>> > > > E.g.
> >>>>>> > > > > > the
> >>>>>> > > > > > >> network
> >>>>>> > > > > > >> >   > > bandwidth is full caused by other jobs? If so,
> >>>>>> it might
> >>>>>> > > have
> >>>>>> > > > > > more
> >>>>>> > > > > > >> effects
> >>>>>> > > > > > >> >   > > by above [2]
> >>>>>> > > > > > >> >   > >     -  Did you adjust the default network
> buffer
> >>>>>> > setting?
> >>>>>> > > > E.g.
> >>>>>> > > > > > >> >   > >
> >>>>>> "taskmanager.network.memory.floating-buffers-per-gate"
> >>>>>> > or
> >>>>>> > > > > > >> >   > >
> "taskmanager.network.memory.buffers-per-channel"
> >>>>>> > > > > > >> >   > >     -  I guess the topology has three vertexes
> >>>>>> > > > > "KinesisConsumer
> >>>>>> > > > > > ->
> >>>>>> > > > > > >> >   > Chained
> >>>>>> > > > > > >> >   > > FlatMap -> KinesisProducer", and the partition
> >>>>>> mode for
> >>>>>> > > > > > >> "KinesisConsumer
> >>>>>> > > > > > >> >   > ->
> >>>>>> > > > > > >> >   > > FlatMap" and "FlatMap->KinesisProducer" are
> both
> >>>>>> > > "forward"?
> >>>>>> > > > If
> >>>>>> > > > > > >> so, the
> >>>>>> > > > > > >> >   > edge
> >>>>>> > > > > > >> >   > > connection is one-to-one, not all-to-all, then
> >>>>>> the above
> >>>>>> > > > > [1][2]
> >>>>>> > > > > > >> should have no
> >>>>>> > > > > > >> >   > > effects in theory with default network buffer
> >>>>>> setting.
> >>>>>> > > > > > >> >   > >     - By slot sharing, I guess these three
> vertex
> >>>>>> > > > parallelism
> >>>>>> > > > > > >> task would
> >>>>>> > > > > > >> >   > > probably be deployed into the same slot, then
> >>>>>> the data
> >>>>>> > > > shuffle
> >>>>>> > > > > > is
> >>>>>> > > > > > >> by
> >>>>>> > > > > > >> >   > memory
> >>>>>> > > > > > >> >   > > queue, not network stack. If so, the above [2]
> >>>>>> should have no
> >>>>>> > > > > effect.
> >>>>>> > > > > > >> >   > >     - I also saw some Jira changes for kinesis
> >>>>>> in this
> >>>>>> > > > > release,
> >>>>>> > > > > > >> could you
> >>>>>> > > > > > >> >   > > confirm that these changes would not affect the
> >>>>>> > > performance?
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > Best,
> >>>>>> > > > > > >> >   > > Zhijiang
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > >
> >>>>>> ------------------------------------------------------------------
> >>>>>> > > > > > >> >   > > From:Thomas Weise <[hidden email]>
> >>>>>> > > > > > >> >   > > Send Time:July 3, 2020 (Friday) 01:07
> >>>>>> > > > > > >> >   > > To:dev <[hidden email]>; Zhijiang <
> >>>>>> > > > > > >> [hidden email]>
> >>>>>> > > > > > >> >   > > Subject:Re: [VOTE] Release 1.11.0, release
> >>>>>> candidate #4
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > Hi Zhijiang,
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > The performance degradation manifests in
> >>>>>> backpressure
> >>>>>> > > which
> >>>>>> > > > > > leads
> >>>>>> > > > > > >> to
> >>>>>> > > > > > >> >   > > growing backlog in the source. I switched a few
> >>>>>> times
> >>>>>> > > > between
> >>>>>> > > > > > >> 1.10 and
> >>>>>> > > > > > >> >   > 1.11
> >>>>>> > > > > > >> >   > > and the behavior is consistent.
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > The DAG is:
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat
> Map)
> >>>>>> > >  --------
> >>>>>> > > > > > >> forward
> >>>>>> > > > > > >> >   > > ---------> KinesisProducer
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > Parallelism: 160
> >>>>>> > > > > > >> >   > > No shuffle/rebalance.
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > Checkpointing config:
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > Checkpointing Mode Exactly Once
> >>>>>> > > > > > >> >   > > Interval 10s
> >>>>>> > > > > > >> >   > > Timeout 10m 0s
> >>>>>> > > > > > >> >   > > Minimum Pause Between Checkpoints 10s
> >>>>>> > > > > > >> >   > > Maximum Concurrent Checkpoints 1
> >>>>>> > > > > > >> >   > > Persist Checkpoints Externally Enabled (delete
> on
> >>>>>> > > > > cancellation)
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > State backend: rocksdb  (filesystem leads to
> same
> >>>>>> > > symptoms)
> >>>>>> > > > > > >> >   > > Checkpoint size is tiny (500KB)
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > An interesting difference to another job that I
> >>>>>> had
> >>>>>> > > upgraded
> >>>>>> > > > > > >> successfully
> >>>>>> > > > > > >> >   > > is the low checkpointing interval.
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > Thanks,
> >>>>>> > > > > > >> >   > > Thomas
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <
> >>>>>> > > > > > >> [hidden email]
> >>>>>> > > > > > >> >   > > .invalid>
> >>>>>> > > > > > >> >   > > wrote:
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > > > Hi Thomas,
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > > > Thanks for the efficient feedback.
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > > > Regarding the suggestion of adding the
> release
> >>>>>> notes
> >>>>>> > > > > document,
> >>>>>> > > > > > >> I agree
> >>>>>> > > > > > >> >   > > > with your point. Maybe we should adjust the
> >>>>>> vote
> >>>>>> > > template
> >>>>>> > > > > > >> accordingly
> >>>>>> > > > > > >> >   > in
> >>>>>> > > > > > >> >   > > > the respective wiki to guide the following
> >>>>>> release
> >>>>>> > > > > processes.
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > > > Regarding the performance regression, could
> you
> >>>>>> > provide
> >>>>>> > > > some
> >>>>>> > > > > > >> more
> >>>>>> > > > > > >> >   > details
> >>>>>> > > > > > >> >   > > > for our better measurement or reproducing on
> >>>>>> our
> >>>>>> > sides?
> >>>>>> > > > > > >> >   > > > E.g. I guess the topology only includes two
> >>>>>> vertexes
> >>>>>> > > > source
> >>>>>> > > > > > and
> >>>>>> > > > > > >> sink?
> >>>>>> > > > > > >> >   > > > What is the parallelism for every vertex?
> >>>>>> > > > > > >> >   > > > The upstream shuffles data to the downstream
> >>>>>> via
> >>>>>> > > rebalance
> >>>>>> > > > > > >> partitioner
> >>>>>> > > > > > >> >   > or
> >>>>>> > > > > > >> >   > > > other?
> >>>>>> > > > > > >> >   > > > The checkpoint mode is exactly-once with
> >>>>>> rocksDB state
> >>>>>> > > > > > backend?
> >>>>>> > > > > > >> >   > > > Did backpressure happen in this case?
> >>>>>> > > > > > >> >   > > > What is the percentage regression in this case?
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > > > Best,
> >>>>>> > > > > > >> >   > > > Zhijiang
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >>
> >>>>>> > ------------------------------------------------------------------
> >>>>>> > > > > > >> >   > > > From:Thomas Weise <[hidden email]>
> >>>>>> > > > > > >> >   > > > Send Time:July 2, 2020 (Thursday) 09:54
> >>>>>> > > > > > >> >   > > > To:dev <[hidden email]>
> >>>>>> > > > > > >> >   > > > Subject:Re: [VOTE] Release 1.11.0, release
> >>>>>> candidate
> >>>>>> > #4
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > > > Hi Till,
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > > > Yes, we don't have the setting in
> >>>>>> flink-conf.yaml.
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > > > Generally, we carry forward the existing
> >>>>>> configuration
> >>>>>> > > and
> >>>>>> > > > > any
> >>>>>> > > > > > >> change
> >>>>>> > > > > > >> >   > to
> >>>>>> > > > > > >> >   > > > default configuration values would impact the
> >>>>>> upgrade.
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > > > Yes, since it is an incompatible change I
> >>>>>> would state
> >>>>>> > it
> >>>>>> > > > in
> >>>>>> > > > > > the
> >>>>>> > > > > > >> release
> >>>>>> > > > > > >> >   > > > notes.
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > > > Thanks,
> >>>>>> > > > > > >> >   > > > Thomas
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > > > BTW I found a performance regression while
> >>>>>> trying to
> >>>>>> > > > upgrade
> >>>>>> > > > > > >> another
> >>>>>> > > > > > >> >   > > > pipeline with this RC. It is a simple Kinesis
> >>>>>> to
> >>>>>> > Kinesis
> >>>>>> > > > > job.
> >>>>>> > > > > > >> Wasn't
> >>>>>> > > > > > >> >   > able
> >>>>>> > > > > > >> >   > > > to pin it down yet, symptoms include
> increased
> >>>>>> > > checkpoint
> >>>>>> > > > > > >> alignment
> >>>>>> > > > > > >> >   > time.
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till
> Rohrmann <
> >>>>>> > > > > > >> [hidden email]>
> >>>>>> > > > > > >> >   > > > wrote:
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > > > > Hi Thomas,
> >>>>>> > > > > > >> >   > > > >
> >>>>>> > > > > > >> >   > > > > just to confirm: When starting the image in
> >>>>>> local
> >>>>>> > > mode,
> >>>>>> > > > > then
> >>>>>> > > > > > >> you
> >>>>>> > > > > > >> >   > don't
> >>>>>> > > > > > >> >   > > > have
> >>>>>> > > > > > >> >   > > > > any of the JobManager memory configuration
> >>>>>> settings
> >>>>>> > > > > > >> configured in the
> >>>>>> > > > > > >> >   > > > > effective flink-conf.yaml, right? Does this
> >>>>>> mean
> >>>>>> > that
> >>>>>> > > > you
> >>>>>> > > > > > have
> >>>>>> > > > > > >> >   > > explicitly
> >>>>>> > > > > > >> >   > > > > removed `jobmanager.heap.size: 1024m` from
> >>>>>> the
> >>>>>> > default
> >>>>>> > > > > > >> configuration?
> >>>>>> > > > > > >> >   > > If
> >>>>>> > > > > > >> >   > > > > this is the case, then I believe it was
> more
> >>>>>> of an
> >>>>>> > > > > > >> unintentional
> >>>>>> > > > > > >> >   > > artifact
> >>>>>> > > > > > >> >   > > > > that it worked before and it has been
> >>>>>> corrected now
> >>>>>> > so
> >>>>>> > > > > that
> >>>>>> > > > > > >> one needs
> >>>>>> > > > > > >> >   > > to
> >>>>>> > > > > > >> >   > > > > specify the memory of the JM process
> >>>>>> explicitly. Do
> >>>>>> > > you
> >>>>>> > > > > > think
> >>>>>> > > > > > >> it
> >>>>>> > > > > > >> >   > would
> >>>>>> > > > > > >> >   > > > help
> >>>>>> > > > > > >> >   > > > > to explicitly state this in the release
> >>>>>> notes?
> >>>>>> > > > > > >> >   > > > >
> >>>>>> > > > > > >> >   > > > > Cheers,
> >>>>>> > > > > > >> >   > > > > Till
> >>>>>> > > > > > >> >   > > > >
> >>>>>> > > > > > >> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas
> Weise <
> >>>>>> > > > > [hidden email]
> >>>>>> > > > > > >
> >>>>>> > > > > > >> wrote:
> >>>>>> > > > > > >> >   > > > >
> >>>>>> > > > > > >> >   > > > > > Thanks for preparing another RC!
> >>>>>> > > > > > >> >   > > > > >
> >>>>>> > > > > > >> >   > > > > > As mentioned in the previous RC thread,
> it
> >>>>>> would
> >>>>>> > be
> >>>>>> > > > > super
> >>>>>> > > > > > >> helpful
> >>>>>> > > > > > >> >   > if
> >>>>>> > > > > > >> >   > > > the
> >>>>>> > > > > > >> >   > > > > > release notes that are part of the
> >>>>>> documentation
> >>>>>> > can
> >>>>>> > > > be
> >>>>>> > > > > > >> included
> >>>>>> > > > > > >> >   > [1].
> >>>>>> > > > > > >> >   > > > > It's
> >>>>>> > > > > > >> >   > > > > > a significant time-saver to have read
> >>>>>> those first.
> >>>>>> > > > > > >> >   > > > > >
> >>>>>> > > > > > >> >   > > > > > I found one more non-backward compatible
> >>>>>> change
> >>>>>> > that
> >>>>>> > > > > would
> >>>>>> > > > > > >> be worth
> >>>>>> > > > > > >> >   > > > > > addressing/mentioning:
> >>>>>> > > > > > >> >   > > > > >
> >>>>>> > > > > > >> >   > > > > > It is now necessary to configure the
> >>>>>> jobmanager
> >>>>>> > heap
> >>>>>> > > > > size
> >>>>>> > > > > > in
> >>>>>> > > > > > >> >   > > > > > flink-conf.yaml (with either
> >>>>>> jobmanager.heap.size
> >>>>>> > > > > > >> >   > > > > > or jobmanager.memory.heap.size). Why
> would
> >>>>>> I not
> >>>>>> > > want
> >>>>>> > > > to
> >>>>>> > > > > > do
> >>>>>> > > > > > >> that
> >>>>>> > > > > > >> >   > > > anyways?
> >>>>>> > > > > > >> >   > > > > > Well, we set it dynamically for a cluster
> >>>>>> > deployment
> >>>>>> > > > via
> >>>>>> > > > > > the
> >>>>>> > > > > > >> >   > > > > > flinkk8soperator, but the container image
> >>>>>> can also
> >>>>>> > > be
> >>>>>> > > > > used
> >>>>>> > > > > > >> for
> >>>>>> > > > > > >> >   > > testing
> >>>>>> > > > > > >> >   > > > > with
> >>>>>> > > > > > >> >   > > > > > local mode (./bin/jobmanager.sh
> >>>>>> start-foreground
> >>>>>> > > > local).
> >>>>>> > > > > > >> That will
> >>>>>> > > > > > >> >   > > fail
> >>>>>> > > > > > >> >   > > > > if
> >>>>>> > > > > > >> >   > > > > > the heap wasn't configured and that's
> how I
> >>>>>> > noticed
> >>>>>> > > > it.
> >>>>>> > > > > > >> >   > > > > >
> >>>>>> > > > > > >> >   > > > > > Thanks,
> >>>>>> > > > > > >> >   > > > > > Thomas
> >>>>>> > > > > > >> >   > > > > >
> >>>>>> > > > > > >> >   > > > > > [1]
> >>>>>> > > > > > >> >   > > > > >
> >>>>>> > > > > > >> >   > > > > >
> >>>>>> > > > > > >> >   > > > >
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >>
> >>>>>> > > > > >
> >>>>>> > > > >
> >>>>>> > > >
> >>>>>> > >
> >>>>>> >
> >>>>>>
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
> >>>>>> > > > > > >> >   > > > > >
> >>>>>> > > > > > >> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang
> <
> >>>>>> > > > > > >> >   > [hidden email]
> >>>>>> > > > > > >> >   > > > > > .invalid>
> >>>>>> > > > > > >> >   > > > > > wrote:
> >>>>>> > > > > > >> >   > > > > >
> >>>>>> > > > > > >> >   > > > > > > Hi everyone,
> >>>>>> > > > > > >> >   > > > > > >
> >>>>>> > > > > > >> >   > > > > > > Please review and vote on the release
> >>>>>> candidate
> >>>>>> > #4
> >>>>>> > > > for
> >>>>>> > > > > > the
> >>>>>> > > > > > >> >   > version
> >>>>>> > > > > > >> >   > > > > > 1.11.0,
> >>>>>> > > > > > >> >   > > > > > > as follows:
> >>>>>> > > > > > >> >   > > > > > > [ ] +1, Approve the release
> >>>>>> > > > > > >> >   > > > > > > [ ] -1, Do not approve the release
> >>>>>> (please
> >>>>>> > provide
> >>>>>> > > > > > >> specific
> >>>>>> > > > > > >> >   > > comments)
> >>>>>> > > > > > >> >   > > > > > >
> >>>>>> > > > > > >> >   > > > > > > The complete staging area is available
> >>>>>> for your
> >>>>>> > > > > review,
> >>>>>> > > > > > >> which
> >>>>>> > > > > > >> >   > > > includes:
> >>>>>> > > > > > >> >   > > > > > > * JIRA release notes [1],
> >>>>>> > > > > > >> >   > > > > > > * the official Apache source release
> and
> >>>>>> binary
> >>>>>> > > > > > >> convenience
> >>>>>> > > > > > >> >   > > releases
> >>>>>> > > > > > >> >   > > > to
> >>>>>> > > > > > >> >   > > > > > be
> >>>>>> > > > > > >> >   > > > > > > deployed to dist.apache.org [2], which
> >>>>>> are
> >>>>>> > signed
> >>>>>> > > > > with
> >>>>>> > > > > > >> the key
> >>>>>> > > > > > >> >   > > with
> >>>>>> > > > > > >> >   > > > > > > fingerprint
> >>>>>> > > 2DA85B93244FDFA19A6244500653C0A2CEA00D0E
> >>>>>> > > > > > [3],
> >>>>>> > > > > > >> >   > > > > > > * all artifacts to be deployed to the
> >>>>>> Maven
> >>>>>> > > Central
> >>>>>> > > > > > >> Repository
> >>>>>> > > > > > >> >   > [4],
> >>>>>> > > > > > >> >   > > > > > > * source code tag "release-1.11.0-rc4"
> >>>>>> [5],
> >>>>>> > > > > > >> >   > > > > > > * website pull request listing the new
> >>>>>> release
> >>>>>> > and
> >>>>>> > > > > > adding
> >>>>>> > > > > > >> >   > > > announcement
> >>>>>> > > > > > >> >   > > > > > > blog post [6].
> >>>>>> > > > > > >> >   > > > > > >
> >>>>>> > > > > > >> >   > > > > > > The vote will be open for at least 72
> >>>>>> hours. It
> >>>>>> > is
> >>>>>> > > > > > >> adopted by
> >>>>>> > > > > > >> >   > > > majority
> >>>>>> > > > > > >> >   > > > > > > approval, with at least 3 PMC
> >>>>>> affirmative votes.
> >>>>>> > > > > > >> >   > > > > > >
> >>>>>> > > > > > >> >   > > > > > > Thanks,
> >>>>>> > > > > > >> >   > > > > > > Release Manager
> >>>>>> > > > > > >> >   > > > > > >
> >>>>>> > > > > > >> >   > > > > > > [1]
> >>>>>> > > > > > >> >   > > > > > >
> >>>>>> > > > > > >> >   > > > > >
> >>>>>> > > > > > >> >   > > > >
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >>
> >>>>>> > > > > >
> >>>>>> > > > >
> >>>>>> > > >
> >>>>>> > >
> >>>>>> >
> >>>>>>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> >>>>>> > > > > > >> >   > > > > > > [2]
> >>>>>> > > > > > >> >   >
> >>>>>> > > >
> https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
> >>>>>> > > > > > >> >   > > > > > > [3]
> >>>>>> > > > > > https://dist.apache.org/repos/dist/release/flink/KEYS
> >>>>>> > > > > > >> >   > > > > > > [4]
> >>>>>> > > > > > >> >   > > > > > >
> >>>>>> > > > > > >> >   > > > >
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >>
> >>>>>> > > > >
> >>>>>> > >
> >>>>>>
> https://repository.apache.org/content/repositories/orgapacheflink-1377/
> >>>>>> > > > > > >> >   > > > > > > [5]
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > >
> >>>>>> https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
> >>>>>> > > > > > >> >   > > > > > > [6]
> >>>>>> > https://github.com/apache/flink-web/pull/352
> >>>>>> > > > > > >> >   > > > > > >
> >>>>>> > > > > > >> >   > > > > > >
> >>>>>> > > > > > >> >   > > > > >
> >>>>>> > > > > > >> >   > > > >
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > > >
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   > >
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >   >
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >> >
> >>>>>> > > > > > >>
> >>>>>> > > > > > >>
> >>>>>> > > > > >
> >>>>>> > > > > >
> >>>>>> > > > >
> >>>>>> > > >
> >>>>>> > >
> >>>>>> >
> >>>>>> >
> >>>>>> > --
> >>>>>> > Regards,
> >>>>>> > Roman
> >>>>>> >
> >>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Regards,
> >>>>> Roman
> >>>>>
> >>>>
> >>
> >> --
> >> Regards,
> >> Roman
> >>
> >
> >
> > --
> > Regards,
> > Roman
> >
>