Hi to all,
we've recently migrated our Sqoop[1] import process to a Flink job, using an improved version of the Flink JDBC Input Format[2] that is able to exploit the parallelism of the cluster (the current Flink version implements NonParallelInput).

We still need to improve the mapping of SQL types to Java ones (in the addValue method, IMHO), but this could be the basis for a flink-sqoop module that will incrementally cover the Sqoop functionalities when requested.
Do you think that such a module could be of interest for Flink or not?

[1] https://sqoop.apache.org/
[2] https://gist.github.com/fpompermaier/bcd704abc93b25b6744ac76ac17ed351

Best,
Flavio
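To illustrate what "exploiting the parallelism of the cluster" could look like for a JDBC source, here is a minimal sketch that cuts a numeric key range into contiguous sub-ranges, one per parallel reader. It is not taken from the gist or from Flink: the class `RangeSplitter`, its `Range` type, and the idea of splitting on a numeric key are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Flink or of the gist): splits an inclusive
// numeric key range into contiguous sub-ranges that parallel readers could
// query independently.
final class RangeSplitter {

    // One inclusive [lower, upper] key interval.
    static final class Range {
        final long lower;
        final long upper;

        Range(long lower, long upper) {
            this.lower = lower;
            this.upper = upper;
        }
    }

    // Assumes max - min + 1 fits in a long (no overflow handling in this sketch).
    static List<Range> split(long min, long max, int parallelism) {
        if (parallelism < 1 || max < min) {
            throw new IllegalArgumentException("need parallelism >= 1 and max >= min");
        }
        long total = max - min + 1;
        // Never create more splits than there are keys.
        int n = (int) Math.min(parallelism, total);
        long base = total / n;
        long remainder = total % n;

        List<Range> ranges = new ArrayList<>();
        long lower = min;
        for (int i = 0; i < n; i++) {
            // The first `remainder` splits take one extra key each.
            long size = base + (i < remainder ? 1 : 0);
            long upper = lower + size - 1;
            ranges.add(new Range(lower, upper));
            lower = upper + 1;
        }
        return ranges;
    }
}
```

With this sketch, `RangeSplitter.split(1, 100, 4)` yields the intervals [1, 25], [26, 50], [51, 75] and [76, 100]. Each reader could then bind its bounds to a parameterized query such as `... WHERE id BETWEEN ? AND ?`, where `id` is a hypothetical numeric key column. Evenly sized key ranges only balance the load if the keys are roughly uniformly distributed.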
Hi Flavio,
I think this can be very handy when you have to run Sqoop-like jobs but need to run the process with few resources. As with Cascading, Flink could do the heavy lifting and make the scan of large relational databases more robust.

Of course, to make it work in the real world, the JDBC input format must be improved. Besides parallelism, null values, and the related input splits, we need to find a way to properly map the Java types to the database types. Probably a wrapper POJO implementing a cast/transformation policy, passed as a parameter of the InputFormat, could do. Another thing we need to take care of is the management of connections, which can be very costly if the database is particularly large.

Regards,
Stefano

2016-04-13 12:45 GMT+02:00 Flavio Pompermaier <[hidden email]>:
> [...]
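As a rough illustration of the type-mapping concern raised above, the following sketch maps `java.sql.Types` codes to Java classes. It loosely follows the default object mapping described in the JDBC specification. The class `SqlTypeMapping` and the method `javaClassFor` are hypothetical names, not part of Flink's JdbcInputFormat or of the gist, and a real policy would likely need per-database overrides.

```java
import java.math.BigDecimal;
import java.sql.Types;

// Hypothetical mapping policy (not Flink API): picks a Java class for a
// java.sql.Types code, loosely following the JDBC default object mapping.
final class SqlTypeMapping {

    static Class<?> javaClassFor(int sqlType) {
        switch (sqlType) {
            case Types.BIT:
            case Types.BOOLEAN:
                return Boolean.class;
            case Types.TINYINT:
            case Types.SMALLINT:
            case Types.INTEGER:
                return Integer.class;
            case Types.BIGINT:
                return Long.class;
            case Types.REAL:
                return Float.class;
            case Types.FLOAT:
            case Types.DOUBLE:
                return Double.class;
            case Types.NUMERIC:
            case Types.DECIMAL:
                return BigDecimal.class;
            case Types.CHAR:
            case Types.VARCHAR:
            case Types.LONGVARCHAR:
                return String.class;
            case Types.DATE:
                return java.sql.Date.class;
            case Types.TIME:
                return java.sql.Time.class;
            case Types.TIMESTAMP:
                return java.sql.Timestamp.class;
            default:
                // Unknown or vendor-specific types fall back to Object in this sketch.
                return Object.class;
        }
    }
}
```

The type codes for a query's columns can be read from `ResultSetMetaData.getColumnType`. Null handling is a separate concern: the primitive getters on `ResultSet` (such as `getInt`) return a default value for SQL NULL and require a `wasNull()` check, whereas `getObject` returns `null` directly.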
Hi Flavio,
sorry for not replying earlier. I think there is definitely a need to improve the JdbcInputFormat. All your points wrt the current JdbcInputFormat are valid, and fixing them would be a big improvement and a highly welcome contribution, IMO.

I am not so sure about adding a flink-sqoop module to Flink. How much better/faster would flink-sqoop be compared to Apache Sqoop? With YARN it is easy to use two frameworks side-by-side. Maybe you can share a few details about your use case / environment and why flink-sqoop would be a good addition.

Best,
Fabian

2016-04-15 10:03 GMT+02:00 Stefano Bortoli <[hidden email]>:
> [...]