(DEPRECATED) Apache Flink Mailing List archive.

[DISCUSS][CODE STYLE] Create collections always with initial capacity

Classic

List

Threaded

7 messages Options

Andrey Zagrebin-3

[DISCUSS][CODE STYLE] Create collections always with initial capacity

Hi all,

As you probably already noticed, Stephan has triggered a discussion thread
about code style guide for Flink [1]. Recently we were discussing
internally some smaller concerns and I would like start separate threads
for them.

This thread is about creating collections always with initial capacity. As
you might have seen, some parts of our code base always initialise
collections with some non-default capacity. You can even activate a check
in IntelliJ Idea that can monitor and highlight creation of collection
without initial capacity.

Pros:
- performance gain if there is a good reasoning about initial capacity
- the capacity is always deterministic and does not depend on any changes
of its default value in Java
- easy to follow: always initialise, has IDE support for detection

Cons (for initialising w/o good reasoning):
- We are trying to outsmart JVM. When there is no good reasoning about
initial capacity, we can rely on JVM default value.
- It is even confusing e.g. for hash maps as the real size depends on the
load factor.
- It would only add minor performance gain.
- a bit more code, increases maintenance burden.

The conclusion is the following at the moment:
Only set the initial capacity if you have a good idea about the expected
size.

Please, feel free to share you thoughts.

Best,
Andrey

[1]
http://mail-archives.apache.org/mod_mbox/flink-dev/201906.mbox/%3CED91DF4B-7CAB-4547-A430-85BC710FD157@...%3E

Xintong Song

Re: [DISCUSS][CODE STYLE] Create collections always with initial capacity

+1 on setting initial capacity only when have good expectation on the
collection size.

Thank you~

Xintong Song

On Thu, Aug 1, 2019 at 2:32 PM Andrey Zagrebin <[hidden email]> wrote:

> Hi all,
>
> As you probably already noticed, Stephan has triggered a discussion thread
> about code style guide for Flink [1]. Recently we were discussing
> internally some smaller concerns and I would like start separate threads
> for them.
>
> This thread is about creating collections always with initial capacity. As
> you might have seen, some parts of our code base always initialise
> collections with some non-default capacity. You can even activate a check
> in IntelliJ Idea that can monitor and highlight creation of collection
> without initial capacity.
>
> Pros:
> - performance gain if there is a good reasoning about initial capacity
> - the capacity is always deterministic and does not depend on any changes
> of its default value in Java
> - easy to follow: always initialise, has IDE support for detection
>
> Cons (for initialising w/o good reasoning):
> - We are trying to outsmart JVM. When there is no good reasoning about
> initial capacity, we can rely on JVM default value.
> - It is even confusing e.g. for hash maps as the real size depends on the
> load factor.
> - It would only add minor performance gain.
> - a bit more code, increases maintenance burden.
>
> The conclusion is the following at the moment:
> Only set the initial capacity if you have a good idea about the expected
> size.
>
> Please, feel free to share you thoughts.
>
> Best,
> Andrey
>
> [1]
>
> http://mail-archives.apache.org/mod_mbox/flink-dev/201906.mbox/%3CED91DF4B-7CAB-4547-A430-85BC710FD157@...%3E
>

Piotr Nowojski-3

Re: [DISCUSS][CODE STYLE] Create collections always with initial capacity

Hi,

> - a bit more code, increases maintenance burden.

I think there is even more to that. It’s almost like a code duplication, albeit expressed in very different way, with all of the drawbacks of duplicated code: initial capacity can drift out of sync, causing confusion. Also it’s not “a bit more code”, it might be non trivial reasoning/calculation how to set the initial value. Whenever we change something/refactor the code, "maintenance burden” will mostly come from that.

Also I think this just usually falls under a premature optimisation rule.

Besides:

> The conclusion is the following at the moment:
> Only set the initial capacity if you have a good idea about the expected size.

I would add a clause to set the initial capacity “only for good proven reasons”. It’s not about whether we can set it, but whether it makes sense to do so (to avoid the before mentioned "maintenance burden”).

Piotrek

> On 1 Aug 2019, at 14:41, Xintong Song <[hidden email]> wrote:
>
> +1 on setting initial capacity only when have good expectation on the
> collection size.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Thu, Aug 1, 2019 at 2:32 PM Andrey Zagrebin <[hidden email]> wrote:
>
>> Hi all,
>>
>> As you probably already noticed, Stephan has triggered a discussion thread
>> about code style guide for Flink [1]. Recently we were discussing
>> internally some smaller concerns and I would like start separate threads
>> for them.
>>
>> This thread is about creating collections always with initial capacity. As
>> you might have seen, some parts of our code base always initialise
>> collections with some non-default capacity. You can even activate a check
>> in IntelliJ Idea that can monitor and highlight creation of collection
>> without initial capacity.
>>
>> Pros:
>> - performance gain if there is a good reasoning about initial capacity
>> - the capacity is always deterministic and does not depend on any changes
>> of its default value in Java
>> - easy to follow: always initialise, has IDE support for detection
>>
>> Cons (for initialising w/o good reasoning):
>> - We are trying to outsmart JVM. When there is no good reasoning about
>> initial capacity, we can rely on JVM default value.
>> - It is even confusing e.g. for hash maps as the real size depends on the
>> load factor.
>> - It would only add minor performance gain.
>> - a bit more code, increases maintenance burden.
>>
>> The conclusion is the following at the moment:
>> Only set the initial capacity if you have a good idea about the expected
>> size.
>>
>> Please, feel free to share you thoughts.
>>
>> Best,
>> Andrey
>>
>> [1]
>>
>> http://mail-archives.apache.org/mod_mbox/flink-dev/201906.mbox/%3CED91DF4B-7CAB-4547-A430-85BC710FD157@...%3E
>>

Andrey Zagrebin-3

Re: [DISCUSS][CODE STYLE] Create collections always with initial capacity

Hi All,

It looks like this proposal has an approval and we can conclude this
discussion.
Additionally, I agree with Piotr we should really force the proven good
reasoning for setting the capacity to avoid confusion, redundancy and other
already mentioned things while reading and maintaining the code.
Ideally the need of setting the capacity should be either immediately clear
(e.g. perf etc) or explained in comments if it is non-trivial.
Although, it can easily enter a grey zone, so I would not demand strictly
performance measurement proof e.g. if the size is known and it is "per
record" code.
At the end of the day it is a decision of the code developer and reviewer.

The conclusion is then:
Set the initial capacity only if there is a good proven reason to do it.
Otherwise do not clutter the code with it.

Best,
Andrey

On Thu, Aug 1, 2019 at 5:10 PM Piotr Nowojski <[hidden email]> wrote:

> Hi,
>
> > - a bit more code, increases maintenance burden.
>
> I think there is even more to that. It’s almost like a code duplication,
> albeit expressed in very different way, with all of the drawbacks of
> duplicated code: initial capacity can drift out of sync, causing confusion.
> Also it’s not “a bit more code”, it might be non trivial
> reasoning/calculation how to set the initial value. Whenever we change
> something/refactor the code, "maintenance burden” will mostly come from
> that.
>
> Also I think this just usually falls under a premature optimisation rule.
>
> Besides:
>
> > The conclusion is the following at the moment:
> > Only set the initial capacity if you have a good idea about the expected
> size.
>
> I would add a clause to set the initial capacity “only for good proven
> reasons”. It’s not about whether we can set it, but whether it makes sense
> to do so (to avoid the before mentioned "maintenance burden”).
>
> Piotrek
>
> > On 1 Aug 2019, at 14:41, Xintong Song <[hidden email]> wrote:
> >
> > +1 on setting initial capacity only when have good expectation on the
> > collection size.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Thu, Aug 1, 2019 at 2:32 PM Andrey Zagrebin <[hidden email]>
> wrote:
> >
> >> Hi all,
> >>
> >> As you probably already noticed, Stephan has triggered a discussion
> thread
> >> about code style guide for Flink [1]. Recently we were discussing
> >> internally some smaller concerns and I would like start separate threads
> >> for them.
> >>
> >> This thread is about creating collections always with initial capacity.
> As
> >> you might have seen, some parts of our code base always initialise
> >> collections with some non-default capacity. You can even activate a
> check
> >> in IntelliJ Idea that can monitor and highlight creation of collection
> >> without initial capacity.
> >>
> >> Pros:
> >> - performance gain if there is a good reasoning about initial capacity
> >> - the capacity is always deterministic and does not depend on any
> changes
> >> of its default value in Java
> >> - easy to follow: always initialise, has IDE support for detection
> >>
> >> Cons (for initialising w/o good reasoning):
> >> - We are trying to outsmart JVM. When there is no good reasoning about
> >> initial capacity, we can rely on JVM default value.
> >> - It is even confusing e.g. for hash maps as the real size depends on
> the
> >> load factor.
> >> - It would only add minor performance gain.
> >> - a bit more code, increases maintenance burden.
> >>
> >> The conclusion is the following at the moment:
> >> Only set the initial capacity if you have a good idea about the expected
> >> size.
> >>
> >> Please, feel free to share you thoughts.
> >>
> >> Best,
> >> Andrey
> >>
> >> [1]
> >>
> >>
> http://mail-archives.apache.org/mod_mbox/flink-dev/201906.mbox/%3CED91DF4B-7CAB-4547-A430-85BC710FD157@...%3E
> >>
>
>

Stephan Ewen

Re: [DISCUSS][CODE STYLE] Create collections always with initial capacity

@Andrey Will you open a PR to add this to the code style?

On Mon, Aug 19, 2019 at 11:51 AM Andrey Zagrebin <[hidden email]>
wrote:

> Hi All,
>
> It looks like this proposal has an approval and we can conclude this
> discussion.
> Additionally, I agree with Piotr we should really force the proven good
> reasoning for setting the capacity to avoid confusion, redundancy and other
> already mentioned things while reading and maintaining the code.
> Ideally the need of setting the capacity should be either immediately clear
> (e.g. perf etc) or explained in comments if it is non-trivial.
> Although, it can easily enter a grey zone, so I would not demand strictly
> performance measurement proof e.g. if the size is known and it is "per
> record" code.
> At the end of the day it is a decision of the code developer and reviewer.
>
> The conclusion is then:
> Set the initial capacity only if there is a good proven reason to do it.
> Otherwise do not clutter the code with it.
>
> Best,
> Andrey
>
> On Thu, Aug 1, 2019 at 5:10 PM Piotr Nowojski <[hidden email]> wrote:
>
> > Hi,
> >
> > > - a bit more code, increases maintenance burden.
> >
> > I think there is even more to that. It’s almost like a code duplication,
> > albeit expressed in very different way, with all of the drawbacks of
> > duplicated code: initial capacity can drift out of sync, causing
> confusion.
> > Also it’s not “a bit more code”, it might be non trivial
> > reasoning/calculation how to set the initial value. Whenever we change
> > something/refactor the code, "maintenance burden” will mostly come from
> > that.
> >
> > Also I think this just usually falls under a premature optimisation rule.
> >
> > Besides:
> >
> > > The conclusion is the following at the moment:
> > > Only set the initial capacity if you have a good idea about the
> expected
> > size.
> >
> > I would add a clause to set the initial capacity “only for good proven
> > reasons”. It’s not about whether we can set it, but whether it makes
> sense
> > to do so (to avoid the before mentioned "maintenance burden”).
> >
> > Piotrek
> >
> > > On 1 Aug 2019, at 14:41, Xintong Song <[hidden email]> wrote:
> > >
> > > +1 on setting initial capacity only when have good expectation on the
> > > collection size.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Thu, Aug 1, 2019 at 2:32 PM Andrey Zagrebin <[hidden email]>
> > wrote:
> > >
> > >> Hi all,
> > >>
> > >> As you probably already noticed, Stephan has triggered a discussion
> > thread
> > >> about code style guide for Flink [1]. Recently we were discussing
> > >> internally some smaller concerns and I would like start separate
> threads
> > >> for them.
> > >>
> > >> This thread is about creating collections always with initial
> capacity.
> > As
> > >> you might have seen, some parts of our code base always initialise
> > >> collections with some non-default capacity. You can even activate a
> > check
> > >> in IntelliJ Idea that can monitor and highlight creation of collection
> > >> without initial capacity.
> > >>
> > >> Pros:
> > >> - performance gain if there is a good reasoning about initial capacity
> > >> - the capacity is always deterministic and does not depend on any
> > changes
> > >> of its default value in Java
> > >> - easy to follow: always initialise, has IDE support for detection
> > >>
> > >> Cons (for initialising w/o good reasoning):
> > >> - We are trying to outsmart JVM. When there is no good reasoning about
> > >> initial capacity, we can rely on JVM default value.
> > >> - It is even confusing e.g. for hash maps as the real size depends on
> > the
> > >> load factor.
> > >> - It would only add minor performance gain.
> > >> - a bit more code, increases maintenance burden.
> > >>
> > >> The conclusion is the following at the moment:
> > >> Only set the initial capacity if you have a good idea about the
> expected
> > >> size.
> > >>
> > >> Please, feel free to share you thoughts.
> > >>
> > >> Best,
> > >> Andrey
> > >>
> > >> [1]
> > >>
> > >>
> >
> http://mail-archives.apache.org/mod_mbox/flink-dev/201906.mbox/%3CED91DF4B-7CAB-4547-A430-85BC710FD157@...%3E
> > >>
> >
> >
>

Andrey Zagrebin-3

Re: [DISCUSS][CODE STYLE] Create collections always with initial capacity

I created an umbrella issue for the code style guide effort and a subtask
for this discussion:
https://issues.apache.org/jira/browse/FLINK-13804
I will also submit a PR to flink-web based on the conclusion.

On Mon, Aug 19, 2019 at 6:15 PM Stephan Ewen <[hidden email]> wrote:

> @Andrey Will you open a PR to add this to the code style?
>
> On Mon, Aug 19, 2019 at 11:51 AM Andrey Zagrebin <[hidden email]>
> wrote:
>
> > Hi All,
> >
> > It looks like this proposal has an approval and we can conclude this
> > discussion.
> > Additionally, I agree with Piotr we should really force the proven good
> > reasoning for setting the capacity to avoid confusion, redundancy and
> other
> > already mentioned things while reading and maintaining the code.
> > Ideally the need of setting the capacity should be either immediately
> clear
> > (e.g. perf etc) or explained in comments if it is non-trivial.
> > Although, it can easily enter a grey zone, so I would not demand strictly
> > performance measurement proof e.g. if the size is known and it is "per
> > record" code.
> > At the end of the day it is a decision of the code developer and
> reviewer.
> >
> > The conclusion is then:
> > Set the initial capacity only if there is a good proven reason to do it.
> > Otherwise do not clutter the code with it.
> >
> > Best,
> > Andrey
> >
> > On Thu, Aug 1, 2019 at 5:10 PM Piotr Nowojski <[hidden email]>
> wrote:
> >
> > > Hi,
> > >
> > > > - a bit more code, increases maintenance burden.
> > >
> > > I think there is even more to that. It’s almost like a code
> duplication,
> > > albeit expressed in very different way, with all of the drawbacks of
> > > duplicated code: initial capacity can drift out of sync, causing
> > confusion.
> > > Also it’s not “a bit more code”, it might be non trivial
> > > reasoning/calculation how to set the initial value. Whenever we change
> > > something/refactor the code, "maintenance burden” will mostly come from
> > > that.
> > >
> > > Also I think this just usually falls under a premature optimisation
> rule.
> > >
> > > Besides:
> > >
> > > > The conclusion is the following at the moment:
> > > > Only set the initial capacity if you have a good idea about the
> > expected
> > > size.
> > >
> > > I would add a clause to set the initial capacity “only for good proven
> > > reasons”. It’s not about whether we can set it, but whether it makes
> > sense
> > > to do so (to avoid the before mentioned "maintenance burden”).
> > >
> > > Piotrek
> > >
> > > > On 1 Aug 2019, at 14:41, Xintong Song <[hidden email]> wrote:
> > > >
> > > > +1 on setting initial capacity only when have good expectation on the
> > > > collection size.
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Thu, Aug 1, 2019 at 2:32 PM Andrey Zagrebin <[hidden email]
> >
> > > wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> As you probably already noticed, Stephan has triggered a discussion
> > > thread
> > > >> about code style guide for Flink [1]. Recently we were discussing
> > > >> internally some smaller concerns and I would like start separate
> > threads
> > > >> for them.
> > > >>
> > > >> This thread is about creating collections always with initial
> > capacity.
> > > As
> > > >> you might have seen, some parts of our code base always initialise
> > > >> collections with some non-default capacity. You can even activate a
> > > check
> > > >> in IntelliJ Idea that can monitor and highlight creation of
> collection
> > > >> without initial capacity.
> > > >>
> > > >> Pros:
> > > >> - performance gain if there is a good reasoning about initial
> capacity
> > > >> - the capacity is always deterministic and does not depend on any
> > > changes
> > > >> of its default value in Java
> > > >> - easy to follow: always initialise, has IDE support for detection
> > > >>
> > > >> Cons (for initialising w/o good reasoning):
> > > >> - We are trying to outsmart JVM. When there is no good reasoning
> about
> > > >> initial capacity, we can rely on JVM default value.
> > > >> - It is even confusing e.g. for hash maps as the real size depends
> on
> > > the
> > > >> load factor.
> > > >> - It would only add minor performance gain.
> > > >> - a bit more code, increases maintenance burden.
> > > >>
> > > >> The conclusion is the following at the moment:
> > > >> Only set the initial capacity if you have a good idea about the
> > expected
> > > >> size.
> > > >>
> > > >> Please, feel free to share you thoughts.
> > > >>
> > > >> Best,
> > > >> Andrey
> > > >>
> > > >> [1]
> > > >>
> > > >>
> > >
> >
> http://mail-archives.apache.org/mod_mbox/flink-dev/201906.mbox/%3CED91DF4B-7CAB-4547-A430-85BC710FD157@...%3E
> > > >>
> > >
> > >
> >
>

Andrey Zagrebin-3

Re: [DISCUSS][CODE STYLE] Create collections always with initial capacity

FYI the PR: https://github.com/apache/flink-web/pull/249
A review from the discussion participants would be appreciated.

On Tue, Aug 20, 2019 at 5:29 PM Andrey Zagrebin <[hidden email]>
wrote:

> I created an umbrella issue for the code style guide effort and a subtask
> for this discussion:
> https://issues.apache.org/jira/browse/FLINK-13804
> I will also submit a PR to flink-web based on the conclusion.
>
> On Mon, Aug 19, 2019 at 6:15 PM Stephan Ewen <[hidden email]> wrote:
>
>> @Andrey Will you open a PR to add this to the code style?
>>
>> On Mon, Aug 19, 2019 at 11:51 AM Andrey Zagrebin <[hidden email]>
>> wrote:
>>
>> > Hi All,
>> >
>> > It looks like this proposal has an approval and we can conclude this
>> > discussion.
>> > Additionally, I agree with Piotr we should really force the proven good
>> > reasoning for setting the capacity to avoid confusion, redundancy and
>> other
>> > already mentioned things while reading and maintaining the code.
>> > Ideally the need of setting the capacity should be either immediately
>> clear
>> > (e.g. perf etc) or explained in comments if it is non-trivial.
>> > Although, it can easily enter a grey zone, so I would not demand
>> strictly
>> > performance measurement proof e.g. if the size is known and it is "per
>> > record" code.
>> > At the end of the day it is a decision of the code developer and
>> reviewer.
>> >
>> > The conclusion is then:
>> > Set the initial capacity only if there is a good proven reason to do it.
>> > Otherwise do not clutter the code with it.
>> >
>> > Best,
>> > Andrey
>> >
>> > On Thu, Aug 1, 2019 at 5:10 PM Piotr Nowojski <[hidden email]>
>> wrote:
>> >
>> > > Hi,
>> > >
>> > > > - a bit more code, increases maintenance burden.
>> > >
>> > > I think there is even more to that. It’s almost like a code
>> duplication,
>> > > albeit expressed in very different way, with all of the drawbacks of
>> > > duplicated code: initial capacity can drift out of sync, causing
>> > confusion.
>> > > Also it’s not “a bit more code”, it might be non trivial
>> > > reasoning/calculation how to set the initial value. Whenever we change
>> > > something/refactor the code, "maintenance burden” will mostly come
>> from
>> > > that.
>> > >
>> > > Also I think this just usually falls under a premature optimisation
>> rule.
>> > >
>> > > Besides:
>> > >
>> > > > The conclusion is the following at the moment:
>> > > > Only set the initial capacity if you have a good idea about the
>> > expected
>> > > size.
>> > >
>> > > I would add a clause to set the initial capacity “only for good proven
>> > > reasons”. It’s not about whether we can set it, but whether it makes
>> > sense
>> > > to do so (to avoid the before mentioned "maintenance burden”).
>> > >
>> > > Piotrek
>> > >
>> > > > On 1 Aug 2019, at 14:41, Xintong Song <[hidden email]>
>> wrote:
>> > > >
>> > > > +1 on setting initial capacity only when have good expectation on
>> the
>> > > > collection size.
>> > > >
>> > > > Thank you~
>> > > >
>> > > > Xintong Song
>> > > >
>> > > >
>> > > >
>> > > > On Thu, Aug 1, 2019 at 2:32 PM Andrey Zagrebin <
>> [hidden email]>
>> > > wrote:
>> > > >
>> > > >> Hi all,
>> > > >>
>> > > >> As you probably already noticed, Stephan has triggered a discussion
>> > > thread
>> > > >> about code style guide for Flink [1]. Recently we were discussing
>> > > >> internally some smaller concerns and I would like start separate
>> > threads
>> > > >> for them.
>> > > >>
>> > > >> This thread is about creating collections always with initial
>> > capacity.
>> > > As
>> > > >> you might have seen, some parts of our code base always initialise
>> > > >> collections with some non-default capacity. You can even activate a
>> > > check
>> > > >> in IntelliJ Idea that can monitor and highlight creation of
>> collection
>> > > >> without initial capacity.
>> > > >>
>> > > >> Pros:
>> > > >> - performance gain if there is a good reasoning about initial
>> capacity
>> > > >> - the capacity is always deterministic and does not depend on any
>> > > changes
>> > > >> of its default value in Java
>> > > >> - easy to follow: always initialise, has IDE support for detection
>> > > >>
>> > > >> Cons (for initialising w/o good reasoning):
>> > > >> - We are trying to outsmart JVM. When there is no good reasoning
>> about
>> > > >> initial capacity, we can rely on JVM default value.
>> > > >> - It is even confusing e.g. for hash maps as the real size depends
>> on
>> > > the
>> > > >> load factor.
>> > > >> - It would only add minor performance gain.
>> > > >> - a bit more code, increases maintenance burden.
>> > > >>
>> > > >> The conclusion is the following at the moment:
>> > > >> Only set the initial capacity if you have a good idea about the
>> > expected
>> > > >> size.
>> > > >>
>> > > >> Please, feel free to share you thoughts.
>> > > >>
>> > > >> Best,
>> > > >> Andrey
>> > > >>
>> > > >> [1]
>> > > >>
>> > > >>
>> > >
>> >
>> http://mail-archives.apache.org/mod_mbox/flink-dev/201906.mbox/%3CED91DF4B-7CAB-4547-A430-85BC710FD157@...%3E
>> > > >>
>> > >
>> > >
>> >
>>
>