flink-orc or flink-orc-nohive

Sivaprasanna
Hello,

I'm working on an implementation of ORC BulkWriter [1]. For now, the entire
implementation lives in a separate module called "flink-orc-compress" under
"flink-formats", since I'm not sure whether it should go into one of the
existing ORC modules, i.e., flink-orc or flink-orc-nohive.

So my questions are:
1. What's the difference between these two ORC modules?
2. Should the ORC BulkWriter implementation go into one of these existing
modules? If yes, which one? Or can we keep it in a separate module to avoid
duplicating or causing any conflicts?

Note: My current implementation of ORC BulkWriter uses orc-core with the
nohive classifier as its dependency.

[1] https://issues.apache.org/jira/browse/FLINK-10114

Re: flink-orc or flink-orc-nohive

Jingsong Li
Hi,

You should probably use flink-orc, and depend on orc-core rather than
orc-core with the nohive classifier. We can provide a nohive version in the
future.

ORC and Hive are closely coupled, so ORC currently still relies on some Hive
classes. The nohive classifier of Apache ORC exists to create a variant of
the core and mapreduce jars that doesn't conflict with Hive 1.x [1].

So orc and orc-nohive have the same class names, but orc-nohive shades
(relocates) many classes, such as "ColumnVector" and "VectorizedRowBatch".
flink-orc-nohive currently depends on flink-orc, and the two share a lot of
code. They cannot be unified into a single module, because that would cause
a lot of conflicts.

[1] https://issues.apache.org/jira/browse/ORC-174
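
To make the relocation concrete, here is a minimal Java sketch (not part of
Flink or ORC) that checks which flavour of "VectorizedRowBatch" is visible on
the classpath. The class OrcVariantProbe and its methods are hypothetical
names made up for this illustration. The two package names are an assumption
about how the nohive jars relocate the Hive classes (org.apache.hadoop.hive
to org.apache.orc.storage), so please verify them against the jars you
actually use.

```java
// Hypothetical helper for illustration only; not a Flink or ORC class.
final class OrcVariantProbe {

    // Assumed name of the class as shipped with the regular (Hive-based) jars.
    static final String HIVE_BATCH =
            "org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch";

    // Assumed name of the same class after relocation in the nohive jars.
    static final String NOHIVE_BATCH =
            "org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch";

    /** Returns true if the class can be located; the class is not initialized. */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, OrcVariantProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Turns the two lookup results into a short human-readable summary. */
    static String describe(boolean hive, boolean nohive) {
        if (hive && nohive) {
            return "both Hive and relocated (nohive) vector classes found";
        }
        if (hive) {
            return "orc-core variant (Hive class names)";
        }
        if (nohive) {
            return "orc-core nohive variant (relocated class names)";
        }
        return "no ORC vector classes found";
    }

    public static void main(String[] args) {
        System.out.println(describe(isPresent(HIVE_BATCH), isPresent(NOHIVE_BATCH)));
    }
}
```

Code written against one variant imports a different package than code
written against the other, which is one way to see why the two Flink modules
are hard to merge into one.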

Best,
Jingsong Lee

On Tue, Apr 14, 2020 at 3:36 PM Sivaprasanna <[hidden email]>
wrote:

> 1. What's the difference between these two ORC modules?
> 2. Should the ORC BulkWriter implementation go into one of these existing
> modules? If yes, which one? Or can we keep it in a separate module to avoid
> duplicating or causing any conflicts?

Re: flink-orc or flink-orc-nohive

Sivaprasanna
On a similar note, I just noticed that Flink currently depends on ORC 1.4.3.
IMO, that is a little outdated. Can we bump ORC to a slightly newer
version, maybe 1.5.x or even 1.6.0?

-
Sivaprasanna

On Tue, Apr 14, 2020 at 1:42 PM Jingsong Li <[hidden email]> wrote:

> Maybe you should use flink-orc. And use orc-core instead of orc-core with
> nohive classifier. We can provide nohive version in the future.

Re: flink-orc or flink-orc-nohive

Jingsong Li
Hi, yes, we can bump the orc-core version to a newer one.

Best,
Jingsong Lee

On Tue, Apr 14, 2020 at 8:16 PM Sivaprasanna <[hidden email]>
wrote:

> On a similar note, I just checked that the Flink currently uses orc 1.4.3
> in the dependencies. IMO, it is a little outdated. Can we bump the ORC
> version to a slightly newer version - maybe 1.5.x or even 1.6.0?

Re: flink-orc or flink-orc-nohive

Sivaprasanna
I have created a ticket to update the ORC version:
https://issues.apache.org/jira/browse/FLINK-17142

On Tue, Apr 14, 2020 at 8:18 PM Jingsong Li <[hidden email]> wrote:

> Hi, yes, we can bump orc-core version to a newer.