On Sun, Nov 2, 2014 at 9:57 AM, Flavio Pompermaier <[hidden email]> wrote:

As suggested by Fabian, I moved the discussion to this mailing list.

I think what is still to be discussed is how to retrigger the build on Travis (I don't have an account) and whether the PR can be integrated.

Maybe what I can do is move the HBase example into the test package (right now I left it in the main folder) so it will force Travis to rebuild. I'll do it within a couple of hours.

Another thing I forgot to say is that the HBase extension is now compatible with both Hadoop 1 and 2.

Best,
Flavio

On Sun, Nov 2, 2014 at 10:29 AM, Fabian Hueske <[hidden email]> wrote:

You can also set up Travis to build your own GitHub repositories by linking it to your GitHub account. That way Travis can build all your branches (and you can also trigger rebuilds if something fails). I'm not sure whether we can manually retrigger builds on the Apache repository.

Support for Hadoop 1 and 2 is indeed a very good addition :-)

For the discussion about the PR itself, I would need a bit more time to become more familiar with HBase. I also do not have an HBase setup available here. Maybe somebody else from the community who was involved with a previous version of the HBase connector could comment on your question.

Best, Fabian

On Sun, Nov 2, 2014 at 10:55 AM, Flavio Pompermaier <[hidden email]> wrote:

Indeed, this time the build has been successful :)

On Sun, Nov 2, 2014 at 11:05 PM, Flavio Pompermaier <[hidden email]> wrote:

Just one last thing: I removed the HbaseDataSink because I think it was using the old APIs. Can someone help me update that class?

On Mon, Nov 3, 2014 at 9:10 AM, Stephan Ewen <[hidden email]> wrote:

You do not really need an HBase data sink. You can call "DataSet.output(new HBaseOutputFormat())".
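For illustration, the call pattern would look roughly like this (a sketch only: HBaseOutputFormat does not exist yet, so PrintingOutputFormat stands in for it; a real HBase format would implement org.apache.flink.api.common.io.OutputFormat, open an HTable in open(), and issue Puts in writeRecord()):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.PrintingOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;

public class OutputFormatSketch {
	public static void main(String[] args) throws Exception {
		ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

		DataSet<Tuple2<String, String>> rows = env.fromElements(
				new Tuple2<String, String>("row1", "value1"),
				new Tuple2<String, String>("row2", "value2"));

		// Any OutputFormat can be plugged in here; an HBase-specific one
		// (the hypothetical HBaseOutputFormat) would be used the same way.
		rows.output(new PrintingOutputFormat<Tuple2<String, String>>());

		env.execute("output format sketch");
	}
}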
Stephan

On Mon, Nov 3, 2014 at 9:12 AM, Flavio Pompermaier <[hidden email]> wrote:

Ah ok, perfect! That was the reason why I removed it :)

On Mon, Nov 3, 2014 at 9:39 AM, Flavio Pompermaier <[hidden email]> wrote:

Maybe that's something I could add to the HBase example, and it could be better documented in the wiki.

Since we're talking about the wiki: I was looking at the Java API guide (http://flink.incubator.apache.org/docs/0.6-incubating/java_api_guide.html) and the link to the KMeans example is not working (where it says "For a complete example program, have a look at KMeans Algorithm").

Best,
Flavio

On Mon, Nov 3, 2014 at 9:44 AM, Flavio Pompermaier <[hidden email]> wrote:

I was trying to modify the example by setting hbaseDs.output(new HBaseOutputFormat()); but I can't see any HBaseOutputFormat class. Maybe we should use another class?

On Mon, Nov 3, 2014 at 9:51 AM, Fabian Hueske <[hidden email]> wrote:

I'm not familiar with the HBase connector code, but are you maybe looking for the GenericTableOutputFormat?

On Mon, Nov 3, 2014 at 10:11 AM, Flavio Pompermaier <[hidden email]> wrote:

That is one class I removed because it was using the deprecated GenericDataSink API. I can restore it, but then it would be a good idea to remove those warnings (also because, from what I understood, the Record APIs are going to be removed).

On Mon, Nov 3, 2014 at 10:19 AM, Stephan Ewen <[hidden email]> wrote:

It is fine to remove it, in my opinion.

On Mon, Nov 3, 2014 at 10:59 AM, Flavio Pompermaier <[hidden email]> wrote:

The problem is that I also removed the GenericTableOutputFormat, because there is an incompatibility between hadoop1 and hadoop2 for the classes TaskAttemptContext and TaskAttemptContextImpl. It would also be nice if the user didn't have to worry about passing the pact.hbase.jtkey and pact.job.id parameters. I think it is probably a good idea to drop hadoop1 compatibility, enable the HBase addon only for hadoop2 (as before), and decide how to manage those two parameters.

On Mon, Nov 3, 2014 at 11:13 AM, Stephan Ewen <[hidden email]> wrote:

Hi!

The way of passing parameters through the configuration is very old (the original HBase format dates back to that time). I would simply make the HBase format take those parameters through the constructor.
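Something in this direction (only a sketch; the class name is indicative and the InputFormat methods are omitted):

// The format keeps its parameters as plain serializable fields set via the
// constructor, instead of reading keys such as pact.hbase.jtkey / pact.job.id
// from the Configuration in configure().
public class ParameterizedTableInputFormat /* implements InputFormat<...> */ {

	private final String tableName;    // shipped with the serialized format
	private final String columnFamily;

	public ParameterizedTableInputFormat(String tableName, String columnFamily) {
		this.tableName = tableName;
		this.columnFamily = columnFamily;
	}

	// configure(), open(), nextRecord(), ... omitted
}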
Greetings,
Stephan

On Mon, Nov 3, 2014 at 11:51 AM, Fabian Hueske <[hidden email]> wrote:

Hi Flavio,

let me try to answer your last question on the user's list (to the best of my HBase knowledge): "I just wanted to know if and how region splitting is handled. Can you explain to me in detail how Flink and HBase work together? What is not fully clear to me is when computation is done by the region servers and when data starts flowing to a Flink worker (which in my test job is only my PC), and how to read the important logged info to understand whether my job is performing well."

HBase partitions its tables into so-called "regions" of keys and stores the regions distributed across the cluster using HDFS. I think an HBase region can be thought of as an HDFS block. To make reading an HBase table efficient, region reads should be done locally, i.e., an InputFormat should primarily read regions that are stored on the same machine the IF is running on. Flink's InputSplits partition the HBase input by regions and add information about the storage location of each region. During execution, input splits are assigned to InputFormats that can do local reads.
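As a rough sketch of that idea (simplified: the real format would also carry the scan range of each region in its split; I'm assuming Flink's LocatableInputSplit here, which carries host information for the scheduler):

import org.apache.flink.core.io.LocatableInputSplit;

public class RegionSplitSketch {

	// One split per HBase region, annotated with the region server's host
	// so the scheduler can hand the split to a co-located InputFormat.
	public static LocatableInputSplit[] createSplitsForRegions(String[] regionHosts) {
		LocatableInputSplit[] splits = new LocatableInputSplit[regionHosts.length];
		for (int i = 0; i < regionHosts.length; i++) {
			splits[i] = new LocatableInputSplit(i, regionHosts[i]);
		}
		return splits;
	}
}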
Best, Fabian

On Mon, Nov 3, 2014 at 12:05 PM, Flavio Pompermaier <[hidden email]> wrote:

Thanks for the detailed answer. So if I run a job from my machine, I'll have to download all the scanned data of a table, right?

Still regarding the GenericTableOutputFormat, it is not clear to me how to proceed. I saw in the hadoop compatibility addon that it is possible to get such compatibility using the HadoopUtils class, so the open method should become something like:

@Override
public void open(int taskNumber, int numTasks) throws IOException {
	if (Integer.toString(taskNumber + 1).length() > 6) {
		throw new IOException("Task id too large.");
	}
	TaskAttemptID taskAttemptID = TaskAttemptID.forName("attempt__0000_r_"
			+ String.format("%" + (6 - Integer.toString(taskNumber + 1).length()) + "s", " ").replace(" ", "0")
			+ Integer.toString(taskNumber + 1)
			+ "_0");
	this.configuration.set("mapred.task.id", taskAttemptID.toString());
	this.configuration.setInt("mapred.task.partition", taskNumber + 1);
	// for hadoop 2.2
	this.configuration.set("mapreduce.task.attempt.id", taskAttemptID.toString());
	this.configuration.setInt("mapreduce.task.partition", taskNumber + 1);
	try {
		this.context = HadoopUtils.instantiateTaskAttemptContext(this.configuration, taskAttemptID);
	} catch (Exception e) {
		throw new RuntimeException(e);
	}
	final HFileOutputFormat2 outFormat = new HFileOutputFormat2();
	try {
		this.writer = outFormat.getRecordWriter(this.context);
	} catch (InterruptedException iex) {
		throw new IOException("Opening the writer was interrupted.", iex);
	}
}

But I'm not sure about how to pass the JobConf to the class, whether to merge config files, where HFileOutputFormat2 writes the data, and how to implement the public void writeRecord(Record record) API.
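For writeRecord I imagine something along these lines (completely untested; the FAMILY/QUALIFIER constants and the field layout are made up, and I'm assuming the writer accepts (ImmutableBytesWritable, KeyValue) pairs):

// inside the output format:
private static final byte[] FAMILY = "cf".getBytes();      // invented for the example
private static final byte[] QUALIFIER = "col".getBytes();  // invented for the example

public void writeRecord(Record record) throws IOException {
	// Assumes field 0 holds the row key and field 1 the value.
	byte[] rowKey = record.getField(0, StringValue.class).getValue().getBytes();
	byte[] value = record.getField(1, StringValue.class).getValue().getBytes();
	KeyValue kv = new KeyValue(rowKey, FAMILY, QUALIFIER, value);
	try {
		this.writer.write(new ImmutableBytesWritable(rowKey), kv);
	} catch (InterruptedException e) {
		throw new IOException("Writing the record was interrupted.", e);
	}
}

Is that the right direction?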
Could I have a little chat off the mailing list with the implementor of this extension?

On Fri, Nov 7, 2014 at 12:44 PM, Flavio Pompermaier <[hidden email]> wrote:

I've just updated the code on my fork (synced with the current master and applied the improvements coming from the comments on the related PR). I still have to understand how to write results back to an HBase Sink/OutputFormat...

On Fri, Nov 7, 2014 at 12:49 PM, Flavio Pompermaier <[hidden email]> wrote:

I also fixed the profile for Cloudera CDH5.1.3. You can build it with the command:

mvn clean install -Dmaven.test.skip=true -Dhadoop.profile=2 -Pvendor-repos,cdh5.1.3

However, it would be good to generate the specific jar when releasing (e.g. flink-addons:flink-hbase:0.8.0-hadoop2-cdh5.1.3-incubating).

Best,
Flavio

On Mon, Nov 10, 2014 at 10:56 AM, Fabian Hueske <[hidden email]> wrote:

I don't think we need to bundle the HBase input and output format in a single PR. So, I think we can proceed with the IF only and target the OF later. However, the fix for Kryo should be in the master before merging the PR. Till is currently working on that and said he expects this to be done by the end of the week.

Cheers, Fabian
On Wed, Nov 12, 2014 at 7:40 PM, Flavio Pompermaier <[hidden email]> wrote:

Today we tried to execute a job on the cluster instead of on the local executor, and we found that the hbase-site.xml was basically ignored. Is there a reason why the TableInputFormat works correctly in the local environment but not on a cluster?
On Wed, Nov 12, 2014 at 8:03 PM, Robert Metzger <[hidden email]> wrote:

Hi,

Maybe it's an issue with the classpath? As far as I know, Hadoop reads the configuration files from the classpath. Maybe the hbase-site.xml file is not accessible through the classpath when running on the cluster?
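One thing you could try, to take the classpath out of the picture, is to point the configuration at the file explicitly. A sketch (the path is just an example and would have to exist on every worker):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ExplicitHBaseConf {
	public static Configuration create() {
		// HBaseConfiguration.create() picks hbase-site.xml up from the classpath;
		// if that fails on the workers, the file can be added explicitly.
		Configuration conf = HBaseConfiguration.create();
		conf.addResource(new Path("file:///etc/hbase/conf/hbase-site.xml")); // example path
		return conf;
	}
}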
On Wed, Nov 12, 2014 at 8:10 PM, Flavio Pompermaier <[hidden email]> wrote:

Usually, when I run a mapreduce job on Spark or Hadoop, I just put the *-site.xml files into the war I submit to the cluster and that's it. I think the problem appeared when I made the HTable a private transient field and moved the table instantiation into the configure method. Could that be a valid reason? We still have to do some deeper debugging, but I'm trying to figure out where to investigate..
Flavio Pompermaier <[hidden email]> wrote:

We definitely discovered that instantiating HTable and Scan in the configure() method of TableInputFormat causes problems in a distributed environment! If you look at my implementation at https://github.com/fpompermaier/incubator-flink/blob/master/flink-addons/flink-hbase/src/main/java/org/apache/flink/addons/hbase/TableInputFormat.java you can see that Scan and HTable were made transient and are recreated within configure, but this causes HBaseConfiguration.create() to fail when searching for the classpath files... could you help us understand why?
Any help with this? :(
Have you added the hbase.jar file with your HBase config to the ./lib folders of your Flink setup (JobManager, TaskManager) or is it bundled with your job.jar file?
-- Fabian Hueske Phone: +49 170 5549438 Email: [hidden email] Web: http://www.user.tu-berlin.de/fabian.hueske
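One quick way to test the classpath hypothesis is to ask the classloader for the file directly. A small debugging sketch, assuming the file is named hbase-site.xml; running the same lookup inside the job (e.g. from configure()) would test the TaskManager side rather than the client:

    import java.net.URL;

    public final class HBaseConfigCheck {
        public static void main(String[] args) {
            // Returns null if hbase-site.xml is not visible on this JVM's classpath,
            // which would explain HBaseConfiguration.create() silently using defaults.
            URL siteFile = HBaseConfigCheck.class.getClassLoader()
                    .getResource("hbase-site.xml");
            System.out.println(siteFile == null
                    ? "hbase-site.xml NOT found on the classpath"
                    : "hbase-site.xml found at: " + siteFile);
        }
    }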
The hbase jar is in the lib directory on each node while the config files
are within the jar file I submit from the web client.
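If the config files are bundled in the job jar rather than placed in ./lib, one possible workaround is to stop relying on the ambient classpath and load the file explicitly when building the HBase configuration. A sketch under that assumption; the helper class is hypothetical:

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public final class ExplicitHBaseConfig {

        // Builds an HBase Configuration, loading hbase-site.xml through the
        // classloader that sees the job jar instead of assuming it is on the
        // TaskManager classpath. Assumes the file is packaged at the jar root.
        public static Configuration fromJobJar() {
            Configuration conf = HBaseConfiguration.create();
            InputStream site = ExplicitHBaseConfig.class.getClassLoader()
                    .getResourceAsStream("hbase-site.xml");
            if (site == null) {
                throw new IllegalStateException(
                        "hbase-site.xml not found in the job jar / classpath");
            }
            conf.addResource(site); // overlay the site settings onto the defaults
            return conf;
        }
    }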
> >>> > > > >>> > > > > > > >>> > >> > >>> > > > >>> > > > > > > >>> > >> Best, Fabian > >>> > > > >>> > > > > > > >>> > >> > >>> > > > >>> > > > > > > >>> > >> 2014-11-02 9:57 GMT+01:00 Flavio > >>> Pompermaier < > >>> > > > >>> > > > > > > [hidden email] > >>> > > > >>> > > > > > > >>> >: > >>> > > > >>> > > > > > > >>> > >> > >>> > > > >>> > > > > > > >>> > >> > As suggestes by Fabian I moved the > >>> > discussion > >>> > > on > >>> > > > >>> this > >>> > > > >>> > > > > mailing > >>> > > > >>> > > > > > > >>> list. > >>> > > > >>> > > > > > > >>> > >> > > >>> > > > >>> > > > > > > >>> > >> > I think that what is still to be > >>> discussed > >>> > is > >>> > > > >>> how to > >>> > > > >>> > > > > > retrigger > >>> > > > >>> > > > > > > >>> the > >>> > > > >>> > > > > > > >>> > >> build > >>> > > > >>> > > > > > > >>> > >> > on Travis (I don't have an account) > and > >>> if > >>> > the > >>> > > > PR > >>> > > > >>> can > >>> > > > >>> > be > >>> > > > >>> > > > > > > >>> integrated. > >>> > > > >>> > > > > > > >>> > >> > > >>> > > > >>> > > > > > > >>> > >> > Maybe what I can do is to move the > HBase > >>> > > example > >>> > > > >>> in > >>> > > > >>> > the > >>> > > > >>> > > > test > >>> > > > >>> > > > > > > >>> package > >>> > > > >>> > > > > > > >>> > >> (right > >>> > > > >>> > > > > > > >>> > >> > now I left it in the main folder) so > it > >>> will > >>> > > > force > >>> > > > >>> > > Travis > >>> > > > >>> > > > to > >>> > > > >>> > > > > > > >>> rebuild. > >>> > > > >>> > > > > > > >>> > >> > I'll do it within a couple of hours. > >>> > > > >>> > > > > > > >>> > >> > > >>> > > > >>> > > > > > > >>> > >> > Another thing I forgot to say is that > >>> the > >>> > > hbase > >>> > > > >>> > > extension > >>> > > > >>> > > > is > >>> > > > >>> > > > > > now > >>> > > > >>> > > > > > > >>> > >> compatible > >>> > > > >>> > > > > > > >>> > >> > with both hadoop 1 and 2. > >>> > > > >>> > > > > > > >>> > >> > > >>> > > > >>> > > > > > > >>> > >> > Best, > >>> > > > >>> > > > > > > >>> > >> > Flavio > >>> > > > >>> > > > > > > >>> > >> > >>> > > > >>> > > > > > > >>> > > > >>> > > > >>> > > > > > > >>> > > >>> > > > >>> > > > > > > >>> > >>> > > > >>> > > > > > > >> > >>> > > > >>> > > > > > > > > >>> > > > >>> > > > > > > > >>> > > > >>> > > > > > > >>> > > > >>> > > > > > >>> > > > >>> > > > > >>> > > > >>> > > > >>> > > > >>> > > >>> > > > >>> > >>> > > > >> > >>> > > > >> > >>> > > > >> > >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >> > > > > |
Does the HBase jar in the lib folder contain a config that could be used instead of the config in the job jar file? Or is there simply no config at all available when the configure method is called?
--
Fabian Hueske
Phone: +49 170 5549438
Email: [hidden email]
Web: http://www.user.tu-berlin.de/fabian.hueske

From: Flavio Pompermaier
Sent: Thursday, 13 November 2014, 21:43
To: [hidden email]

The hbase jar is in the lib directory on each node, while the config files
are within the jar file I submit from the web client.
On Nov 13, 2014 9:37 PM, <[hidden email]> wrote:

> Have you added the hbase.jar file with your HBase config to the ./lib
> folders of your Flink setup (JobManager, TaskManager), or is it bundled
> with your job.jar file?
>
> From: Flavio Pompermaier
> Sent: Thursday, 13 November 2014, 18:36
> To: [hidden email]
>
> Any help with this? :(
>
> On Thu, Nov 13, 2014 at 2:06 PM, Flavio Pompermaier <[hidden email]>
> wrote:
>
> > We definitely discovered that instantiating HTable and Scan in the
> > configure() method of TableInputFormat causes problems in a distributed
> > environment! If you look at my implementation at
> > https://github.com/fpompermaier/incubator-flink/blob/master/flink-addons/flink-hbase/src/main/java/org/apache/flink/addons/hbase/TableInputFormat.java
> > you can see that Scan and HTable were made transient and recreated
> > within configure(), but this causes HBaseConfiguration.create() to fail
> > when searching for classpath files... could you help us understand why?
> >
> > On Wed, Nov 12, 2014 at 8:10 PM, Flavio Pompermaier <[hidden email]>
> > wrote:
> >
> >> Usually, when I run a mapreduce job both on Spark and Hadoop, I just
> >> put the *-site.xml files into the war I submit to the cluster and
> >> that's it. I think the problem appeared when I made the HTable a
> >> private transient field and the table instantiation was moved into the
> >> configure method. Could that be a valid reason? We still have to do a
> >> deeper debug, but I'm trying to figure out where to investigate..
> >> On Nov 12, 2014 8:03 PM, "Robert Metzger" <[hidden email]> wrote:
> >>
> >>> Hi,
> >>> Maybe it's an issue with the classpath? As far as I know, Hadoop reads
> >>> the configuration files from the classpath. Maybe the hbase-site.xml
> >>> file is not accessible through the classpath when running on the
> >>> cluster?
> >>>
> >>> On Wed, Nov 12, 2014 at 7:40 PM, Flavio Pompermaier <[hidden email]>
> >>> wrote:
> >>>
> >>> > Today we tried to execute a job on the cluster instead of on the
> >>> > local executor, and we found that the hbase-site.xml was basically
> >>> > ignored. Is there a reason why the TableInputFormat works correctly
> >>> > in the local environment while it doesn't on a cluster?
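The pattern in question, reduced to a minimal sketch (the class name and structure are illustrative only, not the actual TableInputFormat code): HTable and Scan are transient, so they must be rebuilt in configure() on each worker, and that is exactly where HBaseConfiguration.create() performs its classpath scan.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scan;

// Simplified stand-in for the TableInputFormat pattern under discussion.
public class SketchTableInputFormat {

    private final String tableName;

    // Transient: not shipped with the serialized job, rebuilt per worker.
    private transient HTable table;
    private transient Scan scan;

    public SketchTableInputFormat(String tableName) {
        this.tableName = tableName;
    }

    // Runs on the TaskManager after deserialization. HBaseConfiguration.create()
    // scans the classpath for hbase-default.xml / hbase-site.xml right here,
    // with whatever classloader is in effect on the worker.
    public void configure() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        this.table = new HTable(conf, tableName);
        this.scan = new Scan();
    }
}

In a local run the job jar is on the application classpath, so the scan succeeds; on a cluster the lookup happens under a different classloader, which is what the following replies dig into.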
The only config files available are within the submitted jar. Things work in
Eclipse using the local environment but fail when deploying to the cluster.
I think that this is a case where the wrong classloader is used:
If the HBase classes are part of the Flink lib directory, they are loaded
with the system classloader. When they look for anything on the classpath,
they will do so with the system classloader. Your configuration is in the
user-code jar that you submit, so it is only available through the
user-code classloader.

Is there any way you can load the configuration yourself and give that
configuration to HBase?

Stephan
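One way to do that, sketched under the assumption that hbase-site.xml is bundled in the job jar (the class and method names here are hypothetical): resolve the file through the classloader of a user-code class and add it to the Configuration explicitly, rather than relying on HBaseConfiguration.create()'s own classpath scan.

import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class UserJarHBaseConfig {

    // Resolve hbase-site.xml via the classloader that loaded this user-code
    // class: it can see resources inside the submitted job jar even when the
    // HBase jar itself sits in Flink's lib directory and was loaded by the
    // system classloader.
    public static Configuration fromJobJar() {
        Configuration conf = HBaseConfiguration.create();
        URL hbaseSite = UserJarHBaseConfig.class.getClassLoader()
                .getResource("hbase-site.xml");
        if (hbaseSite == null) {
            throw new IllegalStateException("hbase-site.xml not found in the job jar");
        }
        conf.addResource(hbaseSite);
        return conf;
    }
}

In configure(), new HTable(UserJarHBaseConfig.fromJobJar(), tableName) would then pick up the bundled settings regardless of which classloader loaded the HBase jar.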
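A minimal sketch of that workaround, assuming hbase-site.xml is bundled in the submitted job jar (the resource name, the helper method, and the table name are illustrative assumptions, not code from the thread):

import java.io.IOException;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

// Load the HBase config explicitly through the user-code classloader instead
// of relying on HBaseConfiguration.create() to find hbase-site.xml on the
// system classpath. getClass().getClassLoader() is the user-code classloader
// here, because the InputFormat class itself was loaded from the job jar.
private HTable createTable(String tableName) throws IOException {
    Configuration hConf = HBaseConfiguration.create();
    URL siteXml = getClass().getClassLoader().getResource("hbase-site.xml");
    if (siteXml == null) {
        throw new IOException("hbase-site.xml not found in the job jar");
    }
    // Resources added later override earlier ones, so the job jar's
    // settings take precedence over whatever create() picked up.
    hConf.addResource(siteXml);
    return new HTable(hConf, tableName);
}

Calling this from open() or configure() on the TaskManager should then find the file, since the lookup no longer depends on the system classpath.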
The strange thing is that everything works if I create the HTable outside of configure()..
On Nov 14, 2014 10:32 AM, "Stephan Ewen" <[hidden email]> wrote:
> I think that this is a case where the wrong classloader is used:
> If the HBase classes are part of the flink lib directory, they are loaded
> with the system class loader.
In this case, the initialization happens when the InputFormat is instantiated at the submission client, and the table info is serialized as part of the InputFormat and shipped to all TaskManagers for execution. However, if the initialization is done within configure(), it happens on each TaskManager when the InputFormat is initialized. These are two separate JVMs in a distributed setting, with different classpaths. How do you submit your job for execution?

2014-11-14 13:58 GMT+01:00 Flavio Pompermaier <[hidden email]>:
> The strange thing is that everything works if I create the HTable outside
> of configure()..
I > > > do > > > > > >>> also not > > > > > >>> > > > have > > > > > >>> > > > >>> a > > > > > >>> > > > >>> > > HBase > > > > > >>> > > > >>> > > > > > setup > > > > > >>> > > > >>> > > > > > > >>> > >> available > > > > > >>> > > > >>> > > > > > > >>> > >> here. > > > > > >>> > > > >>> > > > > > > >>> > >> Maybe somebody else of the > > community > > > > who > > > > > >>> was > > > > > >>> > > > >>> involved > > > > > >>> > > > >>> > > with a > > > > > >>> > > > >>> > > > > > > >>> previous > > > > > >>> > > > >>> > > > > > > >>> > >> version of the HBase connector > > could > > > > > >>> comment > > > > > >>> > on > > > > > >>> > > > your > > > > > >>> > > > >>> > > > question. > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > >>> > > > >>> > > > > > > >>> > >> Best, Fabian > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > >>> > > > >>> > > > > > > >>> > >> 2014-11-02 9:57 GMT+01:00 Flavio > > > > > >>> Pompermaier < > > > > > >>> > > > >>> > > > > > > [hidden email] > > > > > >>> > > > >>> > > > > > > >>> >: > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > >>> > > > >>> > > > > > > >>> > >> > As suggestes by Fabian I moved > > the > > > > > >>> > discussion > > > > > >>> > > on > > > > > >>> > > > >>> this > > > > > >>> > > > >>> > > > > mailing > > > > > >>> > > > >>> > > > > > > >>> list. > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > > >>> > > > >>> > > > > > > >>> > >> > I think that what is still to > be > > > > > >>> discussed > > > > > >>> > is > > > > > >>> > > > >>> how to > > > > > >>> > > > >>> > > > > > retrigger > > > > > >>> > > > >>> > > > > > > >>> the > > > > > >>> > > > >>> > > > > > > >>> > >> build > > > > > >>> > > > >>> > > > > > > >>> > >> > on Travis (I don't have an > > > account) > > > > > and > > > > > >>> if > > > > > >>> > the > > > > > >>> > > > PR > > > > > >>> > > > >>> can > > > > > >>> > > > >>> > be > > > > > >>> > > > >>> > > > > > > >>> integrated. > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > > >>> > > > >>> > > > > > > >>> > >> > Maybe what I can do is to move > > the > > > > > HBase > > > > > >>> > > example > > > > > >>> > > > >>> in > > > > > >>> > > > >>> > the > > > > > >>> > > > >>> > > > test > > > > > >>> > > > >>> > > > > > > >>> package > > > > > >>> > > > >>> > > > > > > >>> > >> (right > > > > > >>> > > > >>> > > > > > > >>> > >> > now I left it in the main > > folder) > > > so > > > > > it > > > > > >>> will > > > > > >>> > > > force > > > > > >>> > > > >>> > > Travis > > > > > >>> > > > >>> > > > to > > > > > >>> > > > >>> > > > > > > >>> rebuild. > > > > > >>> > > > >>> > > > > > > >>> > >> > I'll do it within a couple of > > > hours. > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > > >>> > > > >>> > > > > > > >>> > >> > Another thing I forgot to say > is > > > > that > > > > > >>> the > > > > > >>> > > hbase > > > > > >>> > > > >>> > > extension > > > > > >>> > > > >>> > > > is > > > > > >>> > > > >>> > > > > > now > > > > > >>> > > > >>> > > > > > > >>> > >> compatible > > > > > >>> > > > >>> > > > > > > >>> > >> > with both hadoop 1 and 2. 
> > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > > >>> > > > >>> > > > > > > >>> > >> > Best, > > > > > >>> > > > >>> > > > > > > >>> > >> > Flavio > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > >>> > > > >>> > > > > > > >>> > > > > > > > >>> > > > >>> > > > > > > >>> > > > > > > >>> > > > >>> > > > > > > >>> > > > > > >>> > > > >>> > > > > > > >> > > > > > >>> > > > >>> > > > > > > > > > > > > >>> > > > >>> > > > > > > > > > > > >>> > > > >>> > > > > > > > > > > >>> > > > >>> > > > > > > > > > >>> > > > >>> > > > > > > > > >>> > > > >>> > > > > > > > >>> > > > >>> > > > > > > >>> > > > >>> > > > > > >>> > > > >> > > > > > >>> > > > >> > > > > > >>> > > > >> > > > > > >>> > > > > > > > > > >>> > > > > > > > > >>> > > > > > > > >>> > > > > > > >>> > > > > > >> > > > > > > > > > > > > > > > > > > |
Both from the shell with the run command and from the web client.
On Nov 14, 2014 2:32 PM, "Fabian Hueske" <[hidden email]> wrote:
> In this case, the initialization happens when the InputFormat is
> instantiated at the submission client, and the table info is serialized
> as part of the InputFormat and shipped out to all TaskManagers for
> execution. However, if the initialization is done within configure(), it
> happens on each TaskManager when the InputFormat is initialized.
> These are two separate JVMs in a distributed setting, with different
> classpaths.
>
> How do you submit your job for execution?

2014-11-14 13:58 GMT+01:00 Flavio Pompermaier <[hidden email]>:
> The strange thing is that everything works if I create the HTable
> outside configure()..

On Nov 14, 2014 10:32 AM, "Stephan Ewen" <[hidden email]> wrote:
> I think that this is a case where the wrong classloader is used:
>
> If the HBase classes are part of the Flink lib directory, they are
> loaded with the system class loader. When they look for anything in the
> classpath, they will do so with the system classloader.
>
> Your configuration is in the user code jar that you submit, so it is
> only available through the user-code classloader.
>
> Any way you can load the configuration yourself and give that
> configuration to HBase?
>
> Stephan

Am 13.11.2014 22:06 schrieb "Flavio Pompermaier" <[hidden email]>:
> The only config files available are within the submitted jar. Things
> work in Eclipse using the local environment but fail when deploying to
> the cluster.

On Nov 13, 2014 10:01 PM, Fabian Hueske <[hidden email]> wrote:
> Does the HBase jar in the lib folder contain a config that could be
> used instead of the config in the job jar file? Or is there simply no
> config at all available when the configure method is called?

On Nov 13, 2014 9:43 PM, Flavio Pompermaier <[hidden email]> wrote:
> The hbase jar is in the lib directory on each node, while the config
> files are within the jar file I submit from the web client.

On Nov 13, 2014 9:37 PM, Fabian Hueske <[hidden email]> wrote:
> Have you added the hbase.jar file with your HBase config to the ./lib
> folders of your Flink setup (JobManager, TaskManager) or is it bundled
> with your job.jar file?

On Nov 13, 2014 6:36 PM, Flavio Pompermaier <[hidden email]> wrote:
> Any help with this? :(

On Thu, Nov 13, 2014 at 2:06 PM, Flavio Pompermaier <[hidden email]> wrote:
> We definitely discovered that instantiating HTable and Scan in the
> configure() method of TableInputFormat causes problems in a distributed
> environment!
> If you look at my implementation at
> https://github.com/fpompermaier/incubator-flink/blob/master/flink-addons/flink-hbase/src/main/java/org/apache/flink/addons/hbase/TableInputFormat.java
> you can see that Scan and HTable were made transient and recreated
> within configure, but this causes HBaseConfiguration.create() to fail
> searching for classpath files...could you help us understand why?

On Wed, Nov 12, 2014 at 8:10 PM, Flavio Pompermaier <[hidden email]> wrote:
> Usually, when I run a mapreduce job on both Spark and Hadoop, I just
> put the *-site.xml files into the war I submit to the cluster and
> that's it. I think the problem appeared when I made the HTable a
> private transient field and the table instantiation was moved into the
> configure method.
> Could it be a valid reason? We still have to do a deeper debug, but I'm
> trying to figure out where to investigate..

On Nov 12, 2014 8:03 PM, "Robert Metzger" <[hidden email]> wrote:
> Hi,
> Maybe it's an issue with the classpath? As far as I know, Hadoop reads
> the configuration files from the classpath. Maybe the hbase-site.xml
> file is not accessible through the classpath when running on the
> cluster?

On Wed, Nov 12, 2014 at 7:40 PM, Flavio Pompermaier <[hidden email]> wrote:
> Today we tried to execute a job on the cluster instead of on the local
> executor, and we faced the problem that the hbase-site.xml was
> basically ignored. Is there a reason why the TableInputFormat works
> correctly in the local environment while it doesn't on a cluster?
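To make Fabian's point concrete, here is a minimal sketch of the two initialization sites being discussed. The class and member names are hypothetical, not the actual Flink TableInputFormat: a non-transient field is set when the format is constructed at the submission client and travels with the serialized InputFormat, while a transient field recreated in configure() is built inside each TaskManager JVM, with whatever classpath that JVM has.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

// Hypothetical sketch -- not the real TableInputFormat.
public class SketchTableInputFormat implements java.io.Serializable {

	// Set at the submission client when the format is constructed;
	// serialized with the InputFormat and shipped to all TaskManagers.
	private final String tableName;

	// Transient: not serialized, so it must be recreated in configure(),
	// which runs in the TaskManager JVM. That JVM has a different classpath
	// than the client, which is where the job-jar hbase-site.xml gets lost.
	private transient HTable table;

	public SketchTableInputFormat(String tableName) {
		this.tableName = tableName;
	}

	public void configure() {
		try {
			// Runs on the TaskManager: create() searches *this* JVM's
			// classpath for hbase-site.xml.
			Configuration conf = HBaseConfiguration.create();
			this.table = new HTable(conf, tableName);
		} catch (IOException e) {
			throw new RuntimeException("Could not open HBase table", e);
		}
	}
}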
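Stephan's suggestion above, loading the configuration yourself and handing it to HBase, could look roughly like the following. This is a hedged sketch, assuming hbase-site.xml is bundled in the submitted job jar: the file is resolved explicitly through the classloader that loaded the user code, so HBaseConfiguration.create() no longer has to find it on the TaskManager's system classpath. The class and method names are illustrative.

import java.io.IOException;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

// Sketch of loading the HBase config through the user-code classloader.
public class UserCodeHBaseConfig {

	public static HTable openTable(String tableName) throws IOException {
		Configuration conf = HBaseConfiguration.create();
		// This class was loaded from the job jar, so its classloader can
		// see resources bundled there (unlike the system classloader).
		URL hbaseSite = UserCodeHBaseConfig.class.getClassLoader()
				.getResource("hbase-site.xml");
		if (hbaseSite == null) {
			throw new IOException("hbase-site.xml not found in the job jar");
		}
		conf.addResource(hbaseSite);
		return new HTable(conf, tableName);
	}
}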
What exactly is required to configure the TableInputFormat?
Would it be easier and more flexible to just set the hostname of the HBase master, the table name, etc., directly as strings in the InputFormat?

2014-11-14 15:34 GMT+01:00 Flavio Pompermaier <[hidden email]>:
> Both from the shell with the run command and from the web client.
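A hedged sketch of that idea follows, with a hypothetical class name. Note one assumption: HBase clients are usually pointed at the ZooKeeper quorum rather than the master host directly, so the sketch sets the standard hbase.zookeeper.quorum property. The connection details travel as plain serializable strings, and the HBase Configuration is only built where the format actually runs.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

// Hypothetical constructor-parameterized input format, not the actual API.
public class ParameterizedTableInputFormat implements java.io.Serializable {

	private final String zkQuorum;   // e.g. "zkhost1,zkhost2,zkhost3"
	private final String tableName;

	private transient HTable table;

	public ParameterizedTableInputFormat(String zkQuorum, String tableName) {
		this.zkQuorum = zkQuorum;
		this.tableName = tableName;
	}

	public void configure() throws IOException {
		// Built on whichever JVM runs the format; no config file needed.
		Configuration conf = HBaseConfiguration.create();
		conf.set("hbase.zookeeper.quorum", zkQuorum);
		this.table = new HTable(conf, tableName);
	}
}

A job would then construct the format as, say, new ParameterizedTableInputFormat("zkhost1", "my-table"), with no external config file to locate at runtime.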
> > > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > > > >>> > > > >>> > > > > > > >>> > >> Support for Hadoop 1 and 2 > is > > > > > indeed a > > > > > > > >>> very > > > > > > > >>> > good > > > > > > > >>> > > > >>> > addition > > > > > > > >>> > > > >>> > > > :-) > > > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > > > >>> > > > >>> > > > > > > >>> > >> For the discusion about the > PR > > > > > > itself, I > > > > > > > >>> would > > > > > > > >>> > > > need > > > > > > > >>> > > > >>> a > > > > > > > >>> > > > >>> > bit > > > > > > > >>> > > > >>> > > > more > > > > > > > >>> > > > >>> > > > > > > time > > > > > > > >>> > > > >>> > > > > > > >>> to > > > > > > > >>> > > > >>> > > > > > > >>> > >> become more familiar with > > > HBase. I > > > > > do > > > > > > > >>> also not > > > > > > > >>> > > > have > > > > > > > >>> > > > >>> a > > > > > > > >>> > > > >>> > > HBase > > > > > > > >>> > > > >>> > > > > > setup > > > > > > > >>> > > > >>> > > > > > > >>> > >> available > > > > > > > >>> > > > >>> > > > > > > >>> > >> here. > > > > > > > >>> > > > >>> > > > > > > >>> > >> Maybe somebody else of the > > > > community > > > > > > who > > > > > > > >>> was > > > > > > > >>> > > > >>> involved > > > > > > > >>> > > > >>> > > with a > > > > > > > >>> > > > >>> > > > > > > >>> previous > > > > > > > >>> > > > >>> > > > > > > >>> > >> version of the HBase > connector > > > > could > > > > > > > >>> comment > > > > > > > >>> > on > > > > > > > >>> > > > your > > > > > > > >>> > > > >>> > > > question. > > > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > > > >>> > > > >>> > > > > > > >>> > >> Best, Fabian > > > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > > > >>> > > > >>> > > > > > > >>> > >> 2014-11-02 9:57 GMT+01:00 > Flavio > > > > > > > >>> Pompermaier < > > > > > > > >>> > > > >>> > > > > > > [hidden email] > > > > > > > >>> > > > >>> > > > > > > >>> >: > > > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > > > >>> > > > >>> > > > > > > >>> > >> > As suggestes by Fabian I > moved > > > > the > > > > > > > >>> > discussion > > > > > > > >>> > > on > > > > > > > >>> > > > >>> this > > > > > > > >>> > > > >>> > > > > mailing > > > > > > > >>> > > > >>> > > > > > > >>> list. > > > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > > > > >>> > > > >>> > > > > > > >>> > >> > I think that what is still > to > > > be > > > > > > > >>> discussed > > > > > > > >>> > is > > > > > > > >>> > > > >>> how to > > > > > > > >>> > > > >>> > > > > > retrigger > > > > > > > >>> > > > >>> > > > > > > >>> the > > > > > > > >>> > > > >>> > > > > > > >>> > >> build > > > > > > > >>> > > > >>> > > > > > > >>> > >> > on Travis (I don't have an > > > > > account) > > > > > > > and > > > > > > > >>> if > > > > > > > >>> > the > > > > > > > >>> > > > PR > > > > > > > >>> > > > >>> can > > > > > > > >>> > > > >>> > be > > > > > > > >>> > > > >>> > > > > > > >>> integrated. 
> > > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > > > > >>> > > > >>> > > > > > > >>> > >> > Maybe what I can do is to > move > > > > the > > > > > > > HBase > > > > > > > >>> > > example > > > > > > > >>> > > > >>> in > > > > > > > >>> > > > >>> > the > > > > > > > >>> > > > >>> > > > test > > > > > > > >>> > > > >>> > > > > > > >>> package > > > > > > > >>> > > > >>> > > > > > > >>> > >> (right > > > > > > > >>> > > > >>> > > > > > > >>> > >> > now I left it in the main > > > > folder) > > > > > so > > > > > > > it > > > > > > > >>> will > > > > > > > >>> > > > force > > > > > > > >>> > > > >>> > > Travis > > > > > > > >>> > > > >>> > > > to > > > > > > > >>> > > > >>> > > > > > > >>> rebuild. > > > > > > > >>> > > > >>> > > > > > > >>> > >> > I'll do it within a couple > of > > > > > hours. > > > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > > > > >>> > > > >>> > > > > > > >>> > >> > Another thing I forgot to > say > > > is > > > > > > that > > > > > > > >>> the > > > > > > > >>> > > hbase > > > > > > > >>> > > > >>> > > extension > > > > > > > >>> > > > >>> > > > is > > > > > > > >>> > > > >>> > > > > > now > > > > > > > >>> > > > >>> > > > > > > >>> > >> compatible > > > > > > > >>> > > > >>> > > > > > > >>> > >> > with both hadoop 1 and 2. > > > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > > > > >>> > > > >>> > > > > > > >>> > >> > Best, > > > > > > > >>> > > > >>> > > > > > > >>> > >> > Flavio > > > > > > > >>> > > > >>> > > > > > > >>> > >> > > > > > > > >>> > > > >>> > > > > > > >>> > > > > > > > > > >>> > > > >>> > > > > > > >>> > > > > > > > > >>> > > > >>> > > > > > > >>> > > > > > > > >>> > > > >>> > > > > > > >> > > > > > > > >>> > > > >>> > > > > > > > > > > > > > > >>> > > > >>> > > > > > > > > > > > > > >>> > > > >>> > > > > > > > > > > > > >>> > > > >>> > > > > > > > > > > > >>> > > > >>> > > > > > > > > > > >>> > > > >>> > > > > > > > > > >>> > > > >>> > > > > > > > > >>> > > > >>> > > > > > > > >>> > > > >> > > > > > > > >>> > > > >> > > > > > > > >>> > > > >> > > > > > > > >>> > > > > > > > > > > > >>> > > > > > > > > > > >>> > > > > > > > > > >>> > > > > > > > > >>> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > |
I think that it's not standard..usually you only need to specify the table name, and everything in the *-site.xml files should be loaded automatically at runtime, as it is for all other frameworks like MapReduce or Spark. Don't you think? Why does Flink behave differently within the configure() method?

On Nov 14, 2014 9:26 PM, "Fabian Hueske" <[hidden email]> wrote:

> What exactly is required to configure the TableInputFormat?
> Would it be easier and more flexible to just set the hostname of the HBase
> master, the table name, etc., directly as strings in the InputFormat?
>
> 2014-11-14 15:34 GMT+01:00 Flavio Pompermaier <[hidden email]>:
>
> > Both from the shell with the run command and from the web client
> > On Nov 14, 2014 2:32 PM, "Fabian Hueske" <[hidden email]> wrote:
> >
> > > In this case, the initialization happens when the InputFormat is
> > > instantiated at the submission client, and the table info is serialized
> > > as part of the InputFormat and shipped out to all TaskManagers for
> > > execution. However, if the initialization is done within configure(), it
> > > happens on each TaskManager when the InputFormat is initialized.
> > > These are two separate JVMs in a distributed setting, with different
> > > classpaths.
> > >
> > > How do you submit your job for execution?
> > >
> > > 2014-11-14 13:58 GMT+01:00 Flavio Pompermaier <[hidden email]>:
> > >
> > > > The strange thing is that everything works if I create the HTable
> > > > outside configure()..
> > > > On Nov 14, 2014 10:32 AM, "Stephan Ewen" <[hidden email]> wrote:
> > > >
> > > > > I think that this is a case where the wrong classloader is used:
> > > > > If the HBase classes are part of the Flink lib directory, they are
> > > > > loaded with the system class loader. When they look for anything on
> > > > > the classpath, they will do so with the system classloader.
> > > > > Your configuration is in the user code jar that you submit, so it is
> > > > > only available through the user-code classloader.
> > > > > Any way you can load the configuration yourself and give that
> > > > > configuration to HBase?
> > > > >
> > > > > Stephan
> > > > > On 13.11.2014 22:06, "Flavio Pompermaier" <[hidden email]> wrote:
> > > > >
> > > > > > The only config files available are within the submitted jar.
> > > > > > Things work in Eclipse using the local environment but fail when
> > > > > > deploying to the cluster.
> > > > > > On Nov 13, 2014 10:01 PM, <[hidden email]> wrote:
> > > > > >
> > > > > > > Does the HBase jar in the lib folder contain a config that could
> > > > > > > be used instead of the config in the job jar file? Or is simply
> > > > > > > no config at all available when the configure method is called?
> > > > > > >
> > > > > > > -- Fabian Hueske
> > > > > > >
> > > > > > > From: Flavio Pompermaier
> > > > > > > Sent: Thursday, 13. November, 2014 21:43
> > > > > > > To: [hidden email]
> > > > > > >
> > > > > > > The HBase jar is in the lib directory on each node, while the
> > > > > > > config files are within the jar file I submit from the web
> > > > > > > client.
> > > > > > > On Nov 13, 2014 9:37 PM, <[hidden email]> wrote:
> > > > > > >
> > > > > > > > Have you added the hbase.jar file with your HBase config to the
> > > > > > > > ./lib folders of your Flink setup (JobManager, TaskManager), or is
> > > > > > > > it bundled with your job.jar file?
> > > > > > > >
> > > > > > > > -- Fabian Hueske
> > > > > > > >
> > > > > > > > From: Flavio Pompermaier
> > > > > > > > Sent: Thursday, 13. November, 2014 18:36
> > > > > > > > To: [hidden email]
> > > > > > > >
> > > > > > > > Any help with this? :(
> > > > > > > >
> > > > > > > > On Thu, Nov 13, 2014 at 2:06 PM, Flavio Pompermaier <[hidden email]> wrote:
> > > > > > > >
> > > > > > > > > We definitely discovered that instantiating HTable and Scan in the
> > > > > > > > > configure() method of TableInputFormat causes problems in a
> > > > > > > > > distributed environment! If you look at my implementation at
> > > > > > > > > https://github.com/fpompermaier/incubator-flink/blob/master/flink-addons/flink-hbase/src/main/java/org/apache/flink/addons/hbase/TableInputFormat.java
> > > > > > > > > you can see that Scan and HTable were made transient and recreated
> > > > > > > > > within configure(), but this causes HBaseConfiguration.create() to
> > > > > > > > > fail searching for classpath files...could you help us understand
> > > > > > > > > why?
> > > > > > > > >
> > > > > > > > > On Wed, Nov 12, 2014 at 8:10 PM, Flavio Pompermaier <[hidden email]> wrote:
> > > > > > > > >
> > > > > > > > > > Usually, when I run a mapreduce job on both Spark and Hadoop, I
> > > > > > > > > > just put the *-site.xml files into the war I submit to the
> > > > > > > > > > cluster and that's it. I think the problem appeared when I made
> > > > > > > > > > the HTable a private transient field and the table instantiation
> > > > > > > > > > was moved into the configure() method. Could that be a valid
> > > > > > > > > > reason? We still have to do a deeper debug, but I'm trying to
> > > > > > > > > > figure out where to investigate..
> > > > > > > > > > On Nov 12, 2014 8:03 PM, "Robert Metzger" <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > > Maybe it's an issue with the classpath? As far as I know,
> > > > > > > > > > > Hadoop reads the configuration files from the classpath. Maybe
> > > > > > > > > > > the hbase-site.xml file is not accessible through the classpath
> > > > > > > > > > > when running on the cluster?
> > > > > > > > > > > On Wed, Nov 12, 2014 at 7:40 PM, Flavio Pompermaier <[hidden email]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Today we tried to execute a job on the cluster instead of on
> > > > > > > > > > > > the local executor, and we faced the problem that the
> > > > > > > > > > > > hbase-site.xml was basically ignored. Is there a reason why the
> > > > > > > > > > > > TableInputFormat works correctly in the local environment while
> > > > > > > > > > > > it doesn't on a cluster?
Hi!
I still believe it is an issue of the class loader. To test that, can you remove the HBase jar file from the Flink lib directory and make it part of the job jar? (Either as a fat jar, or by creating a "lib" folder inside your job jar file and putting the hbase.jar into that folder.)

That way HBase would be loaded through the user-code class loader and would hopefully use that class loader to search for the XML files.

Please let us know if that solves it.

Stephan
Probably I can't test that before Tuesday..I think we can talk about it at the Berlin meeting if you cannot debug it :/
On Nov 14, 2014 11:41 PM, "Stephan Ewen" <[hidden email]> wrote:
> I still believe it is an issue of the class loader. To test that, can you
> remove the HBase jar file from the Flink lib directory and make it part of
> the job jar?