(DEPRECATED) Apache Flink Mailing List archive.

Dataset rowCount accumulator

Flavio Pompermaier

Dataset rowCount accumulator

Hi all,
we often need to track the number of rows in a dataset.
To avoid adding complexity to the job, we use accumulators to track this
information.
The problem is that we have to extend every InputFormat we use in
order to handle such a row-count accumulator properly. My question is: what
about introducing it as a first-class citizen (requiring all input formats to
handle a rowCount accumulator when needed)?
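
To make the burden concrete, here is a minimal plain-Java sketch of counting rows by wrapping a source rather than subclassing each format. RecordSource, CountingSource and ListSource are hypothetical names invented for this illustration; they only mirror the reachedEnd()/nextRecord() shape of an input format and are not Flink's InputFormat API.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical minimal source interface (NOT Flink's InputFormat); it only
// mirrors the reachedEnd()/nextRecord() reading loop for illustration.
interface RecordSource<T> {
    boolean reachedEnd();
    T nextRecord();
}

// Decorator that counts every record handed out by the wrapped source,
// so the wrapped source itself needs no row-count logic.
final class CountingSource<T> implements RecordSource<T> {
    private final RecordSource<T> delegate;
    private long rowCount = 0;

    CountingSource(RecordSource<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean reachedEnd() {
        return delegate.reachedEnd();
    }

    @Override
    public T nextRecord() {
        T record = delegate.nextRecord();
        rowCount++; // one increment per record read
        return record;
    }

    long getRowCount() {
        return rowCount;
    }
}

// Trivial in-memory source used only to exercise the wrapper.
final class ListSource<T> implements RecordSource<T> {
    private final Iterator<T> it;

    ListSource(List<T> rows) {
        this.it = rows.iterator();
    }

    @Override
    public boolean reachedEnd() {
        return !it.hasNext();
    }

    @Override
    public T nextRecord() {
        return it.next();
    }
}
```

In this sketch the counter is a plain field; in a real job it would have to be an accumulator so that the counts from parallel instances are merged.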

What do you think? Would it be useful in general?

Best,
Flavio

Flavio Pompermaier

Re: Dataset rowCount accumulator

Thinking about it further, I realized that adding a map function after the read is
probably more general.
Is there any significant performance difference between using such a
dedicated map function (one that just reads a row, increments an accumulator, and
returns immediately) and adding the accumulator directly to the input
formats?
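
For reference, the pass-through counting map described above can be sketched in plain Java, without any Flink dependency. CountingMap and CountingMapDemo are hypothetical names used only for this illustration. In a Flink job the same role would presumably be played by a RichMapFunction that registers a LongCounter through getRuntimeContext().addAccumulator(...); here a LongAdder stands in for the accumulator.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;
import java.util.stream.Collectors;

// Identity map that increments a counter for every row it sees.
// The LongAdder is a stand-in for a Flink accumulator (assumption of this sketch).
final class CountingMap<T> implements Function<T, T> {
    private final LongAdder rowCount = new LongAdder();

    @Override
    public T apply(T row) {
        rowCount.increment(); // count the row...
        return row;           // ...and return it unchanged
    }

    long getRowCount() {
        return rowCount.sum();
    }
}

final class CountingMapDemo {
    // Runs the rows through the counting map and returns the observed count.
    static long countRows(List<String> rows) {
        CountingMap<String> counter = new CountingMap<>();
        // collect() forces the stream to evaluate every element
        rows.stream().map(counter).collect(Collectors.toList());
        return counter.getRowCount();
    }

    public static void main(String[] args) {
        System.out.println(countRows(Arrays.asList("a", "b", "c")));
    }
}
```

The sketch shows only the shape of the function; it says nothing about the relative cost of the two approaches inside Flink, which would need to be measured on a real job.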

On Mon, Feb 4, 2019 at 10:18 AM Flavio Pompermaier <[hidden email]>
wrote:

> Hi all,
> we often need to track the number of rows in a dataset.
> To avoid adding complexity to the job, we use accumulators to track
> this information.
> The problem is that we have to extend every InputFormat we use in
> order to handle such a row-count accumulator properly. My question is: what
> about introducing it as a first-class citizen (requiring all input formats to
> handle a rowCount accumulator when needed)?
>
> What do you think? Would it be useful in general?
>
> Best,
> Flavio
>