Sqoop-like module in Flink

Sqoop-like module in Flink

Flavio Pompermaier
Hi to all,
we've recently migrated our Sqoop [1] import process to a Flink job, using
an improved version of the Flink JDBC InputFormat [2] that is able to
exploit the parallelism of the cluster (the current Flink version
implements NonParallelInput, so it reads with a single task).

We still need to improve the mapping of SQL types to Java ones (in the
addValue method, IMHO), but this could be the basis for a flink-sqoop
module that incrementally covers Sqoop's functionality as it is requested.
Do you think such a module could be of interest for Flink?

[1] https://sqoop.apache.org/
[2] https://gist.github.com/fpompermaier/bcd704abc93b25b6744ac76ac17ed351

Best,
Flavio
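
As a rough illustration of the approach (this is a minimal sketch, not the
code in the gist above; the class name, the fixed Tuple2<Long, String>
record type, and the two-placeholder query shape are assumptions made for
the example), a range-partitioned JDBC input format could look like this:

// Sketch: each split binds its own [lo, hi) key range, so partitions
// of the table can be read in parallel by different tasks.
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.flink.api.common.io.DefaultInputSplitAssigner;
import org.apache.flink.api.common.io.RichInputFormat;
import org.apache.flink.api.common.io.statistics.BaseStatistics;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.io.GenericInputSplit;
import org.apache.flink.core.io.InputSplitAssigner;

public class RangePartitionedJdbcInputFormat
    extends RichInputFormat<Tuple2<Long, String>, GenericInputSplit> {

  private final String url;
  private final String query; // e.g. "SELECT id, name FROM t WHERE id >= ? AND id < ?"
  private final long minKey;
  private final long maxKey;
  private final int numSplits;

  private transient Connection conn;
  private transient ResultSet rs;
  private transient boolean hasNext;

  public RangePartitionedJdbcInputFormat(
      String url, String query, long minKey, long maxKey, int numSplits) {
    this.url = url;
    this.query = query;
    this.minKey = minKey;
    this.maxKey = maxKey;
    this.numSplits = numSplits;
  }

  @Override
  public void configure(Configuration parameters) {}

  @Override
  public BaseStatistics getStatistics(BaseStatistics cachedStatistics) {
    return cachedStatistics; // no statistics available
  }

  @Override
  public GenericInputSplit[] createInputSplits(int minNumSplits) {
    int n = Math.max(numSplits, minNumSplits);
    GenericInputSplit[] splits = new GenericInputSplit[n];
    for (int i = 0; i < n; i++) {
      splits[i] = new GenericInputSplit(i, n);
    }
    return splits;
  }

  @Override
  public InputSplitAssigner getInputSplitAssigner(GenericInputSplit[] splits) {
    return new DefaultInputSplitAssigner(splits);
  }

  @Override
  public void open(GenericInputSplit split) throws IOException {
    try {
      conn = DriverManager.getConnection(url);
      // Derive this split's key range from its split number.
      long span = maxKey - minKey + 1;
      long range = (span + split.getTotalNumberOfSplits() - 1)
          / split.getTotalNumberOfSplits();
      long lo = minKey + split.getSplitNumber() * range;
      long hi = Math.min(lo + range, maxKey + 1);
      PreparedStatement stmt = conn.prepareStatement(query);
      stmt.setLong(1, lo);
      stmt.setLong(2, hi);
      rs = stmt.executeQuery();
      hasNext = rs.next();
    } catch (SQLException e) {
      throw new IOException("Could not open split", e);
    }
  }

  @Override
  public boolean reachedEnd() {
    return !hasNext;
  }

  @Override
  public Tuple2<Long, String> nextRecord(Tuple2<Long, String> reuse) throws IOException {
    try {
      reuse.f0 = rs.getLong(1);
      reuse.f1 = rs.getString(2);
      hasNext = rs.next();
      return reuse;
    } catch (SQLException e) {
      throw new IOException("Could not read record", e);
    }
  }

  @Override
  public void close() throws IOException {
    try {
      if (conn != null) {
        conn.close(); // also closes the statement and result set
      }
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }
}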
Re: Sqoop-like module in Flink

Stefano Bortoli
Hi Flavio,

I think this can be very handy when you have to run Sqoop-like jobs but
need to run the process with few resources. As with Cascading, Flink could
do the heavy lifting and make scans of large relational databases more
robust. Of course, to make it work in the real world, the JDBC InputFormat
must be improved. Besides parallelism, null values, and the related input
splits, we need to find a way to properly map Java types to the database
types. Having a wrapper POJO that implements a cast/transformation policy,
passed as a parameter of the InputFormat, could probably do it. Another
thing we need to take care of is connection management, which can be very
costly if the database is particularly large.
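
To illustrate the policy idea (the names TypeMappingPolicy and
NormalizingPolicy below are hypothetical, not existing Flink API), a
pluggable per-column mapping could look like this:

import java.io.Serializable;
import java.math.BigDecimal;
import java.sql.ResultSet;
import java.sql.SQLException;

// Strategy interface: maps one JDBC column of the current row to a Java value.
public interface TypeMappingPolicy extends Serializable {
  Object mapColumn(ResultSet rs, int columnIndex) throws SQLException;
}

// Example policy: rely on the driver's default mapping, but normalize
// NULLs explicitly and apply a custom rule for decimals.
class NormalizingPolicy implements TypeMappingPolicy {
  @Override
  public Object mapColumn(ResultSet rs, int columnIndex) throws SQLException {
    Object value = rs.getObject(columnIndex);
    if (rs.wasNull()) {
      return null; // explicit NULL handling instead of driver defaults
    }
    if (value instanceof BigDecimal) {
      return ((BigDecimal) value).doubleValue(); // illustrative cast rule
    }
    return value;
  }
}

The input format would then delegate each column conversion to the policy
instead of hard-coding it in addValue, so driver-specific rules can be
swapped in.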


saluti,
Stefano
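
On the connection-management point, one common mitigation is to open a
single JDBC connection per task lazily and reuse it across input splits
rather than reconnecting for every split; a minimal, Flink-agnostic sketch
(the holder class is illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Lazily opens one JDBC connection and reuses it across splits,
// instead of reconnecting for every split.
class SharedConnectionHolder {
  private final String url;
  private Connection conn;

  SharedConnectionHolder(String url) {
    this.url = url;
  }

  synchronized Connection get() throws SQLException {
    if (conn == null || conn.isClosed()) {
      conn = DriverManager.getConnection(url); // opened only on first use
    }
    return conn;
  }

  synchronized void close() throws SQLException {
    if (conn != null) {
      conn.close();
      conn = null;
    }
  }
}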



Re: Sqoop-like module in Flink

Fabian Hueske-2
Hi Flavio,

sorry for not replying earlier.
I think there is definitely a need to improve the JdbcInputFormat.
All your points about the current JdbcInputFormat are valid, and fixing
them would be a big improvement and a highly welcome contribution, IMO.

I am not so sure about adding a flink-sqoop module to Flink.
How much better/faster would flink-sqoop be compared to Apache Sqoop? With
YARN it is easy to use the two frameworks side by side.
Maybe you can share a few details about your use case and environment, and
why flink-sqoop would be a good addition.

Best, Fabian

