Source Kafka and Sink Hive managed tables via Flink Job


Source Kafka and Sink Hive managed tables via Flink Job

Youssef Achbany
Dear all,

I'm working on a big project, and one of the challenges is to read Kafka
topics and copy them via Hive commands into Hive managed tables, in order to
enable Hive's ACID properties.
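
For context, the shape of the job is roughly the following sketch. It is an
outline only: the broker address, topic name, and the HiveBatchSink class are
placeholders, and the actual Hive-writing sink is not shown.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.windowing.AllWindowFunction;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.apache.flink.util.Collector;

    public class KafkaToHiveJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "broker:9092"); // placeholder
            props.setProperty("group.id", "hive-loader");          // placeholder

            DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer<>("events-topic", new SimpleStringSchema(), props));

            // Collect each one-minute window into a single batch, then hand
            // the whole batch to the Hive-writing sink in one call.
            events
                .windowAll(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .apply(new AllWindowFunction<String, List<String>, TimeWindow>() {
                    @Override
                    public void apply(TimeWindow window, Iterable<String> values,
                                      Collector<List<String>> out) {
                        List<String> batch = new ArrayList<>();
                        values.forEach(batch::add);
                        out.collect(batch);
                    }
                })
                .addSink(new HiveBatchSink()); // placeholder: issues Hive commands

            env.execute("kafka-to-hive");
        }
    }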

I tried it, but I have an issue with back pressure:
- The first window read 20,000 events and wrote them to the Hive tables.
- The second, third, ... windows send only about 100 events each, because
writing to Hive takes more time than reading from the Kafka topic. Yet
writing 100 events or 50,000 events takes roughly the same time for Hive
(see the sketch below).
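
Since one Hive write costs roughly the same regardless of batch size,
buffering events in the sink and flushing in large batches should amortize
that fixed cost. A minimal sketch, assuming a simple string stream;
flushToHive() is a placeholder, and a production version would also need to
checkpoint the buffer so records are not lost on failure:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    public class BufferingHiveSink extends RichSinkFunction<String> {
        private static final int BATCH_SIZE = 20_000;
        private transient List<String> buffer;

        @Override
        public void open(Configuration parameters) {
            buffer = new ArrayList<>();
        }

        @Override
        public void invoke(String value, Context context) {
            buffer.add(value);
            if (buffer.size() >= BATCH_SIZE) {
                flushToHive(buffer); // one Hive command for the whole batch
                buffer.clear();
            }
        }

        @Override
        public void close() {
            if (buffer != null && !buffer.isEmpty()) {
                flushToHive(buffer); // flush the remainder on shutdown
            }
        }

        private void flushToHive(List<String> batch) {
            // placeholder: build and execute one multi-row Hive INSERT here
        }
    }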

Has anyone already built this kind of source and sink? Could you help with
this, or share some tips?
It also seems that defining a window by number of events instead of time is
not possible. Is that true?
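
For reference, the DataStream API does provide count-based windows; a minimal
sketch, where the key selector and the window functions are placeholders:

    // On a keyed stream - fires every 10,000 events per key:
    events.keyBy(e -> extractKey(e))       // placeholder key selector
          .countWindow(10_000)
          .apply(new MyWindowFunction());  // placeholder window function

    // On a non-keyed stream (note: this runs with parallelism 1):
    events.countWindowAll(10_000)
          .apply(new MyAllWindowFunction()); // placeholder window function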

Thank you for your help

Youssef

--
♻ Be green, keep it on the screen

Re: Source Kafka and Sink Hive managed tables via Flink Job

bowen.li
Hi Youssef,

You need to provide more background context:

- Which Hive sink are you using? We are working on the official Hive sink
for the community, which will be released in 1.9. Did you develop yours in
house?
- What do you mean by the 1st, 2nd, 3rd window? Do you mean parallel
instances of the same operator, or do you have three windowing operations
chained?
- What does your Hive table look like? E.g. is it partitioned or
non-partitioned? If partitioned, how many partitions do you have? Is it
written in static partition or dynamic partition mode (see the sketch after
this list)? What format? How large is it?
- What does your sink do - is each parallel instance writing to multiple
partitions or to a single partition/table? Is it only appending data, or
upserting?
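
To illustrate the static vs. dynamic partition distinction above, a sketch
over Hive JDBC; the host, table, and column names are made up for
illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PartitionModes {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default");
                 Statement stmt = conn.createStatement()) {

                // Static partition: the partition value is fixed in the statement.
                stmt.execute(
                    "INSERT INTO TABLE events PARTITION (dt = '2019-07-03') "
                    + "SELECT id, payload FROM staging_events");

                // Dynamic partition: Hive derives the partition from the data.
                stmt.execute("SET hive.exec.dynamic.partition = true");
                stmt.execute("SET hive.exec.dynamic.partition.mode = nonstrict");
                stmt.execute(
                    "INSERT INTO TABLE events PARTITION (dt) "
                    + "SELECT id, payload, event_date AS dt FROM staging_events");
            }
        }
    }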


Re: Source Kafka and Sink Hive managed tables via Flink Job

bowen.li
BTW, I'm adding the user@ mailing list, since this is a user question and
should be asked there.

The dev@ mailing list is only for discussions of Flink development. Please see
https://flink.apache.org/community.html#mailing-lists
