InputFormat API and current scanned row count

Flavio Pompermaier
Hi guys,

I was debugging an InputFormat and discovered that there is no way to
find out how many records have been processed in a split.
So I added a counter to my input format that is incremented on every
nextRecord() call. Do you think adding something like "public int
getProcessedRecordsCount()" to the InputFormat interface could be useful?
Or are you going to track this count in the caller of nextRecord()?
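
A minimal sketch of such a counter, written as a wrapper so it does not touch the wrapped format. RecordSource is a hypothetical stand-in for the two InputFormat methods that matter here, not the real Flink interface (which has more methods and different signatures); CountingSource and IteratorSource are likewise illustrative names. The sketch returns a long rather than the int proposed above, on the assumption that a split may hold more records than an int can count.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for the relevant part of an input format.
// This is NOT the Flink InputFormat interface.
interface RecordSource<T> {
    boolean reachedEnd();
    T nextRecord();
}

// Wraps any RecordSource and counts the records handed out so far.
final class CountingSource<T> implements RecordSource<T> {
    private final RecordSource<T> delegate;
    private long processedRecords;

    CountingSource(RecordSource<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean reachedEnd() {
        return delegate.reachedEnd();
    }

    @Override
    public T nextRecord() {
        T record = delegate.nextRecord();
        if (record != null) {
            processedRecords++; // count only records that were actually produced
        }
        return record;
    }

    // Method name taken from the proposal in this post.
    public long getProcessedRecordsCount() {
        return processedRecords;
    }
}

// Toy in-memory source, used only to exercise the wrapper.
final class IteratorSource<T> implements RecordSource<T> {
    private final Iterator<T> it;

    IteratorSource(Iterable<T> items) {
        this.it = items.iterator();
    }

    @Override
    public boolean reachedEnd() {
        return !it.hasNext();
    }

    @Override
    public T nextRecord() {
        return it.hasNext() ? it.next() : null;
    }
}

final class CountingDemo {
    public static void main(String[] args) {
        CountingSource<String> source =
            new CountingSource<>(new IteratorSource<>(Arrays.asList("a", "b", "c")));
        while (!source.reachedEnd()) {
            source.nextRecord();
        }
        System.out.println(source.getProcessedRecordsCount()); // prints 3
    }
}
```

Keeping the count in a wrapper corresponds to the second option in the question (the caller of nextRecord() manages it); putting the same field and getter directly into a format would correspond to the first.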

Best,
Flavio

Re: InputFormat API and current scanned row count

Fabian Hueske
Hi Flavio,

We recently started a few efforts to collect monitoring and runtime/data
statistics.
Counting the number of elements emitted by an operator (or data source)
will be included.

Do you want to count the number of produced tuples to monitor progress,
or do you see a different use case?

In reply to Flavio Pompermaier's message of 2014-11-28 9:37 GMT+01:00.

Re: InputFormat API and current scanned row count

Flavio Pompermaier
In my specific use case I was interested in understanding why scanning
the splits was taking a long time, so I wanted statistics about the
number of records contained in each split and the rate at which they are
read. Do you think that could be useful in general?
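
A rough sketch of the per-split statistics meant here: the number of records in a split and the read rate. SplitReadStats is a hypothetical helper, not an existing Flink class. Timestamps are passed in explicitly, on the assumption that the reader calls start() when it opens a split and finish() when it closes it.

```java
// Hypothetical per-split statistics holder; not part of any Flink API.
final class SplitReadStats {
    private long records;
    private long startNanos;
    private long endNanos;

    // Call when a split is opened.
    void start(long nowNanos) {
        records = 0;
        startNanos = nowNanos;
        endNanos = nowNanos;
    }

    // Call once for every record returned from the split.
    void recordRead() {
        records++;
    }

    // Call when the split is closed.
    void finish(long nowNanos) {
        endNanos = nowNanos;
    }

    long records() {
        return records;
    }

    // Read rate over the whole split; 0.0 if no time has elapsed.
    double recordsPerSecond() {
        long elapsed = endNanos - startNanos;
        return elapsed <= 0 ? 0.0 : records * 1_000_000_000.0 / elapsed;
    }
}

final class SplitReadStatsDemo {
    public static void main(String[] args) {
        SplitReadStats stats = new SplitReadStats();
        stats.start(System.nanoTime());
        for (int i = 0; i < 1000; i++) {
            stats.recordRead(); // stands in for reading one record
        }
        stats.finish(System.nanoTime());
        System.out.println(stats.records() + " records, "
            + stats.recordsPerSecond() + " records/s");
    }
}
```

Comparing records() and recordsPerSecond() across splits would show whether a slow scan comes from a split holding many more records than the others or from a low read rate.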

In reply to Fabian Hueske's message of Dec 2, 2014 9:56 PM.

Re: InputFormat API and current scanned row count

Fabian Hueske
Yes, sure.
Tracking records per split and UDF exec time per call (min, max, avg, or
histogram) would be valuable information when debugging the performance of
a program.
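
As an illustration of the per-call timing mentioned above, here is a small sketch that keeps min, max, and average over call durations. CallTimeStats is a hypothetical class, not something from the Flink code base, and it leaves out the histogram. It assumes System.nanoTime() is precise enough for the calls being measured.

```java
// Hypothetical accumulator for per-call execution times (in nanoseconds).
final class CallTimeStats {
    private long count;
    private long totalNanos;
    private long minNanos = Long.MAX_VALUE;
    private long maxNanos = Long.MIN_VALUE;

    // Records one already-measured duration.
    void add(long durationNanos) {
        count++;
        totalNanos += durationNanos;
        minNanos = Math.min(minNanos, durationNanos);
        maxNanos = Math.max(maxNanos, durationNanos);
    }

    // Runs the call and records how long it took, even if it throws.
    void time(Runnable call) {
        long start = System.nanoTime();
        try {
            call.run();
        } finally {
            add(System.nanoTime() - start);
        }
    }

    long count() {
        return count;
    }

    long min() {
        requireSamples();
        return minNanos;
    }

    long max() {
        requireSamples();
        return maxNanos;
    }

    double average() {
        requireSamples();
        return (double) totalNanos / count;
    }

    private void requireSamples() {
        if (count == 0) {
            throw new IllegalStateException("no calls recorded");
        }
    }
}

final class CallTimeStatsDemo {
    public static void main(String[] args) {
        CallTimeStats stats = new CallTimeStats();
        for (long d : new long[] {300, 100, 200}) {
            stats.add(d);
        }
        System.out.println(stats.min() + " " + stats.max() + " " + stats.average());
        // prints: 100 300 200.0
    }
}
```

Wrapping each UDF invocation in time(...) adds two clock reads per call, so for very cheap functions the measurement itself may dominate; sampling only some calls is one way around that.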

In reply to Flavio Pompermaier's message of 2014-12-02 22:08 GMT+01:00.