Dataset rowCount accumulator

Dataset rowCount accumulator

Flavio Pompermaier
Hi all,
we often need to track the number of rows in a dataset.
To avoid adding to the job's complexity, we use accumulators to track this
information.
The problem is that we have to extend every InputFormat we use in order to
handle such a row-count accumulator properly. My question is: what about
introducing it as a first-class citizen (forcing all input formats to
handle a rowCount accumulator when required)?

What do you think? Would it be useful in general?

Best,
Flavio
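
For readers unfamiliar with the pattern described above, here is a minimal, Flink-free sketch of "count rows inside the reader". It assumes a plain java.util.Iterator as a stand-in for an input format; the names CountingSource and drainAndCount are hypothetical and only illustrate the idea. In Flink, the equivalent would presumably wrap or extend an InputFormat and increment an accumulator where each record is produced.

```java
import java.util.Iterator;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a record source. This is NOT a Flink type; it only
// models "a reader that counts every row it hands out".
final class CountingSource<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private final LongAdder rowCount;

    CountingSource(Iterator<T> delegate, LongAdder rowCount) {
        this.delegate = delegate;
        this.rowCount = rowCount;
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public T next() {
        // Fetch first: if the delegate is exhausted it throws, and nothing is counted.
        T row = delegate.next();
        rowCount.increment();
        return row;
    }

    // Drains the given source through a counting wrapper and returns the row count.
    static <T> long drainAndCount(Iterator<T> source) {
        LongAdder counter = new LongAdder();
        CountingSource<T> counted = new CountingSource<>(source, counter);
        while (counted.hasNext()) {
            counted.next();
        }
        return counter.sum();
    }
}
```

Because the wrapper is generic over the delegate, one such class can serve every source, which is roughly what making the counter a "first-class citizen" would save users from writing per format.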
Re: Dataset rowCount accumulator

Flavio Pompermaier
Thinking about it further, I realized that adding a map function after the
read is probably more general.
Is there any significant performance difference between using such a
dedicated map function (one that just reads a row, increments an accumulator,
and returns immediately) and adding the accumulator directly to the input
formats?
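
As a rough illustration of the map-based alternative, the sketch below models a pass-through map that only increments a counter. It is plain Java, not Flink code: RowCountingMap is a hypothetical name, and java.util.function.Function stands in for a Flink map function. In an actual Flink job the same logic would presumably live in a RichMapFunction that registers a LongCounter through getRuntimeContext().addAccumulator(...) in open().

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;

// Hypothetical pass-through map: returns each row unchanged and counts it.
// Models the idea only; it is not a Flink operator.
final class RowCountingMap<T> implements Function<T, T> {
    private final LongAdder rowCount = new LongAdder();

    @Override
    public T apply(T row) {
        rowCount.increment(); // the only per-row work besides the call itself
        return row;
    }

    // Number of rows seen so far.
    long count() {
        return rowCount.sum();
    }
}
```

It can be placed after any source without touching the source itself, for example `rows.stream().map(counter).collect(Collectors.toList())`, after which `counter.count()` gives the number of rows. The sketch says nothing about the performance question above; whether the extra operator costs anything measurable depends on how the runtime executes it next to the source.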

On Mon, Feb 4, 2019 at 10:18 AM Flavio Pompermaier <[hidden email]>
wrote:

> Hi all,
> we often need to track the number of rows in a dataset.
> To avoid adding to the job's complexity, we use accumulators to track
> this information.
> The problem is that we have to extend every InputFormat we use in order to
> handle such a row-count accumulator properly. My question is: what about
> introducing it as a first-class citizen (forcing all input formats to
> handle a rowCount accumulator when required)?
>
> What do you think? Would it be useful in general?
>
> Best,
> Flavio
>