Thinking about it some more, I realized that adding a map function after the
read is probably the more general solution.
Is there any significant performance difference between using such a
dedicated map function (one that just reads a row, increments an accumulator
and returns it immediately) and adding the accumulator directly to the input
formats?
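
To make the map-based variant concrete, here is a rough sketch in plain Java.
It is not actual Flink code: LongAdder only stands in for a Flink accumulator,
the in-memory list stands in for whatever the input format reads, and the
names RowCountSketch and CountingMap are hypothetical. The point is only that
the counting step is a pass-through that does no work beyond one increment per
row:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;
import java.util.stream.Collectors;

public class RowCountSketch {

    /** Pass-through step: counts each row and returns it unchanged. */
    static final class CountingMap<T> implements Function<T, T> {
        // Assumption: stand-in for a Flink accumulator registered by the job.
        private final LongAdder rowCount = new LongAdder();

        @Override
        public T apply(T row) {
            rowCount.increment(); // the only extra work per row
            return row;           // the row itself is not modified or copied
        }

        long count() {
            return rowCount.sum();
        }
    }

    public static void main(String[] args) {
        // Assumption: stand-in for the rows produced by an unmodified input format.
        List<String> rows = Arrays.asList("a", "b", "c");

        CountingMap<String> counter = new CountingMap<>();
        List<String> out = rows.stream().map(counter).collect(Collectors.toList());

        System.out.println(out + " rows=" + counter.count());
    }
}
```

With this shape the input formats stay untouched and the same counting step
can be attached after any source, which is why it looks more general to me.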
On Mon, Feb 4, 2019 at 10:18 AM Flavio Pompermaier <[hidden email]> wrote:
> Hi to all,
> we often need to track the number of rows of a dataset.
> In order not to burden the job with extra complexity, we use accumulators
> to track this information.
> The problem is that we have to extend all the InputFormats that we use in
> order to properly handle such a row-count accumulator. My question is: what
> about introducing it as a first-class citizen (forcing all input formats to
> handle a rowCount accumulator when required)?
>
> What do you think? Would it be useful in general?
>
> Best,
> Flavio
>