[DISCUSS] Macro-benchmarking for performance tuning and regression detection

[DISCUSS] Macro-benchmarking for performance tuning and regression detection

Greg Hogan
I'd like to discuss the creation of a macro-benchmarking module for Flink.
This could be run during pre-release testing to detect performance
regressions and during development when refactoring or performance tuning
code on the hot path.

Many users have published benchmarks and the Flink libraries already
contain a modest selection of algorithms. Some benefits of creating a
consolidated collection of macro-benchmarks include:

- comprehensive code coverage: a diverse set of algorithms can stress every
aspect of Flink (streaming, batch, sorts, joins, spilling, cluster, ...)

- codify best practices: benchmarks should be relatively stable and
repeatable

- efficient: an automated system can run many more tests and generate more
accurate results

Macro-benchmarks would be useful in analyzing the performance improvements from the
proposed specialized serializers and comparators [FLINK-3599] or from making
Flink NUMA-aware [FLINK-3163].

I've also been looking recently at some of the hot code and see roughly a
12-14% total improvement when modifying NormalizedKeySorter.compare/swap
to use bitshift and bitmask rather than divide and modulo. The trade-off is
that aligning on a power-of-2 leaves holes in the buffers and requires
additional MemoryBuffers. Also, I'm testing on a single data type, IntValue,
and there may be different results for LongValue, StringValue, custom types,
or other algorithms. And replacing a multiply with a left shift reduces
performance, which demonstrates the need to test changes in isolation.
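
To illustrate the arithmetic change, here is a rough sketch in Java. This is
not the actual NormalizedKeySorter code (the names and layout are made up);
it only contrasts divide/modulo against shift/mask once the number of entries
per segment is padded to a power of two:

// Illustrative only, not NormalizedKeySorter itself: locate a sort-index
// entry by segment number and offset within that segment.
final class OffsetMathSketch {

    // arbitrary entriesPerSegment: division and modulo on the hot path
    static long locateWithDivMod(int record, int entriesPerSegment, int entrySize) {
        int segment = record / entriesPerSegment;
        int offset = (record % entriesPerSegment) * entrySize;
        return ((long) segment << 32) | offset;
    }

    // entriesPerSegment padded to a power of two (leaving unused bytes at the
    // end of each segment): the division becomes a shift, the modulo a mask
    static long locateWithShiftMask(int record, int entriesPerSegmentBits, int entrySize) {
        int segment = record >>> entriesPerSegmentBits;
        int offset = (record & ((1 << entriesPerSegmentBits) - 1)) * entrySize;
        return ((long) segment << 32) | offset;
    }
}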

There are many more ideas, e.g., NormalizedKeySorter writing the keys before
the pointer so that the offset computation is performed outside of the compare
and sort methods, or SpanningRecordSerializer skipping to the next buffer
rather than writing the record length across a buffer boundary. These changes
might each be worth a few percent. Other changes might each be less than a 1%
speedup but, taken in aggregate, would yield a noticeable performance increase.
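
As a rough illustration of the serializer idea (purely hypothetical types,
not Flink's actual serializer API; ByteBuffer stands in for the real buffer
type):

import java.nio.ByteBuffer;

// Hypothetical sketch: if the 4-byte length field does not fit into what is
// left of the current buffer, advance to the next buffer instead of splitting
// the field across the buffer boundary.
final class LengthFieldSketch {
    static ByteBuffer writeLength(ByteBuffer current, ByteBuffer next, int recordLength) {
        if (current.remaining() < Integer.BYTES) {
            current = next; // skip the tail of the current buffer
        }
        current.putInt(recordLength);
        return current;
    }
}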

I like the idea of profile first, measure second, then create and discuss
the pull request.

As for the actual macro-benchmarking framework, it would be nice if the
algorithms also verified correctness alongside performance. The algorithm
interface would consist of warmup (run only once) and execute, which would
be run multiple times in an interleaved manner. The benchmarking duration
should be tunable.
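
A minimal sketch of what such an interface could look like (placeholder
names only, nothing that exists in Flink today):

// Placeholder sketch of the proposed benchmark contract.
public interface BenchmarkAlgorithm {

    // name used when recording and comparing results
    String name();

    // run exactly once before timing, to load classes and warm up the JIT
    void warmup() throws Exception;

    // run repeatedly, interleaved with the other algorithms in the suite;
    // the returned checksum lets the harness verify correctness on every run
    long execute() throws Exception;
}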

The framework would be responsible for configuring, starting, and stopping
the cluster, executing the algorithms and recording their performance, and
comparing and analyzing the results.
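
Roughly, the harness driving those steps could look like the following
sketch, building on the placeholder interface above (cluster start/stop,
configuration, and result analysis are elided):

import java.util.List;

// Sketch of the harness loop: warm up each algorithm once, then interleave
// timed executions over a configurable number of rounds and record results.
final class BenchmarkHarness {
    static void run(List<BenchmarkAlgorithm> algorithms, int rounds) throws Exception {
        for (BenchmarkAlgorithm algorithm : algorithms) {
            algorithm.warmup(); // warmup runs only once per algorithm
        }
        for (int round = 0; round < rounds; round++) {
            for (BenchmarkAlgorithm algorithm : algorithms) { // interleaved order
                long start = System.nanoTime();
                long checksum = algorithm.execute();
                long nanos = System.nanoTime() - start;
                System.out.printf("%s round=%d checksum=%d nanos=%d%n",
                        algorithm.name(), round, checksum, nanos);
            }
        }
    }
}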

Greg

Re: [DISCUSS] Macro-benchmarking for performance tuning and regression detection

Till Rohrmann
Hi Greg,

I like the idea of having a macro-benchmarking suite to test exactly the
points you've mentioned. If we don't have reliable performance numbers, it
will always be hard to tell whether an improvement makes sense or not
(performance-wise).

I think we already undertook a first attempt to solve this problem with
Yoka [1]. The idea was to run a set of algorithms continuously on a machine
in the cloud. Yoka was running for some time, but I'm not sure whether that
is still the case.

Another tool I know of, and which people use to run benchmark suites with
Flink, is Peel [2]. Researchers at DIMA are using it to benchmark different
distributed engines against each other, but I have never really worked with
it myself.

[1] https://github.com/mxm/yoka
[2] https://github.com/stratosphere/peel

Cheers,
Till

Re: [DISCUSS] Macro-benchmarking for performance tuning and regression detection

aalexandrov
Hi Greg,

I just pushed v1.0.0-rc2 for Peel to Sonatype.

As Till said, we are using the framework extensively at the TU for
benchmarking and comparing different systems (mostly Flink and Spark).

We recently used Peel to conduct some experiments for FLINK-2237 [1]. If you
want to learn more about the framework, I suggest reading the repeatability
section of our blog post draft [2] on the subject, as well as the Peel
manual [3]. We also have a Google Group [4] and an issue tracker [5] in
case you want to use or contribute to the project.

[1] https://issues.apache.org/jira/browse/FLINK-2237
[2]
https://docs.google.com/document/d/12yx7olVrkooceaQPoR1nkk468lIq0xOObY5ukWuNEcM/edit#heading=h.w1uw5kmqciq7
[3] http://peel-framework.org
[4] https://groups.google.com/forum/#!forum/peel-framework
[5] https://github.com/stratosphere/peel/issues

Regards,
A.

Re: [DISCUSS] Macro-benchmarking for performance tuning and regression detection

Gábor Gévay
Hello,

I think that creating a macro-benchmarking module would be a very good
idea. It would make performance-related changes much easier and safer.

I have also used Peel, and can confirm that it would be a good fit for
this task.

> I've also been looking recently at some of the hot code and see about a
> ~12-14% total improvement when modifying NormalizedKeySorter.compare/swap
> to bitshift and bitmask rather than divide and modulo. The trade-off is
> that to align on a power-of-2 we have holes in and require additional
> MemoryBuffers.

I've also noticed the performance problem caused by those divisions in
NormalizedKeySorter.compare/swap, and I have an idea for eliminating them
without the power-of-2 alignment trade-off. I've opened a Jira issue [1]
where I explain it.

Best,
Gábor

[1] https://issues.apache.org/jira/browse/FLINK-3722

Re: [DISCUSS] Macro-benchmarking for performance tuning and regression detection

Stephan Ewen
Hi Greg!

The idea is very good, especially having these pre-built performance tests
available for release testing.

In your opinion, are the tests going to be self-contained, or will they
need an external environment (YARN, Mesos, Docker, etc.) to bring up a
Flink cluster and run things?

Greetings,
Stephan