Hi,
We are seeing an issue with Flink in our production environment. We are on version 1.7. We started seeing sudden lag on Kafka, and the consumers were no longer accepting/processing messages. After enabling debug mode, the errors below were seen:

[image: image.jpeg]

I am not sure why this occurs every day, and when it happens, I can see that the remaining workers aren't able to handle the load. Unless I restart my jobs, I am unable to start processing again, which also causes data loss.

On the graph below, there is a slight dip in consumption before 5:30. That is when this incident happens, correlated with the logs.

[image: image.jpeg]

Any pointers/suggestions would be appreciated.

Thanks.
Hi Ramya,
Unfortunately your images are blocked. Could you upload them somewhere and post the links here? Also, I think that the TaskManager logs may be able to help a bit more. Could you please provide them here?

Cheers,
Kostas
Hi Kostas,

Attaching the TaskManager logs regarding this issue. I have also attached the Kafka-related metrics; I hope you can see them this time.

I am not sure why we get this many disconnects from Kafka. Maybe because of these interruptions, we seem to slow down in our processing. At some point the memory also increases and the workers almost stagnate, not doing any processing. I have 3 GB of heap committed and have allotted 5 GB of memory to the pods.

Thanks for your help.

~Ramya.
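(For context on the consumer side: the Kafka client settings a Flink job uses are the ones handed to the connector when the job is built. Below is a minimal, self-contained sketch of such a property set; the broker addresses, group id, and values are illustrative placeholders, not the production settings from this thread.)

```java
import java.util.Properties;

public class ConsumerProps {
    // Builds the Kafka properties a Flink job would hand to the Kafka
    // connector. All values below are placeholders for illustration.
    public static Properties build() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker-1:9092,broker-2:9092");
        props.setProperty("group.id", "captcha-consumer");
        // The Kafka client closes idle connections after
        // connections.max.idle.ms (default 540000 ms = 9 minutes); raising
        // it is one knob to try if the periodic disconnects are a concern.
        props.setProperty("connections.max.idle.ms", "900000");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        System.out.println(props.getProperty("connections.max.idle.ms"));
    }
}
```

In the job itself, these properties would be passed to the Kafka source, e.g. `new FlinkKafkaConsumer<>("captchastream", schema, props)` in the Flink 1.7 connector API.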
Hi Ramya,

Unfortunately I cannot see them.

Kostas
Hi Kostas,
Copy-pasting the snippet where we see the fluctuations. Let me know if this helps.

2020-09-22 23:39:19,646 DEBUG org.apache.kafka.clients.NetworkClient - Node 3 disconnected.
2020-09-22 23:39:19,646 DEBUG org.apache.kafka.clients.NetworkClient - Initialize connection to node be-kafka-dragonpit-broker-4:8017 (id: 4 rack: null) for sending metadata request
2020-09-22 23:39:19,646 DEBUG org.apache.kafka.clients.NetworkClient - Initiating connection to node be-kafka-dragonpit-broker-4:8017 (id: 4 rack: null)
2020-09-22 23:39:19,664 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Fetch READ_UNCOMMITTED at offset 834984310 for partition captchastream-6 returned fetch data (error=NONE, highWaterMark=834984311, lastStableOffset = -1, logStartOffset = 834470755, abortedTransactions = null, recordsSizeInBytes=1516)
2020-09-22 23:39:19,665 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Added READ_UNCOMMITTED fetch request for partition captchastream-6 at offset 834984311 to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:39:19,665 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Sending READ_UNCOMMITTED fetch for partitions [captchastream-6] to broker be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:39:19,665 DEBUG org.apache.kafka.clients.NetworkClient - Sending metadata request (type=MetadataRequest, topics=captchastream) to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:39:19,665 DEBUG org.apache.kafka.common.network.Selector - Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to node 4
2020-09-22 23:39:19,665 DEBUG org.apache.kafka.clients.NetworkClient - Completed connection to node 4. Fetching API versions.
2020-09-22 23:39:19,665 DEBUG org.apache.kafka.clients.NetworkClient - Initiating API versions fetch from node 4.
2020-09-22 23:39:19,666 DEBUG org.apache.kafka.clients.Metadata - Updated cluster metadata version 319 to Cluster(id = 4ou4oBz8TU24ipwW8ws1Bw, nodes = [be-kafka-dragonpit-broker-6:8017 (id: 6 rack: null), be-kafka-dragonpit-broker-4:8017 (id: 4 rack: null), be-kafka-dragonpit-broker-8:8017 (id: 8 rack: null), be-kafka-dragonpit-broker-3:8017 (id: 3 rack: null), be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null), be-kafka-dragonpit-broker-7:8017 (id: 7 rack: null)], partitions = [Partition(topic = captchastream, partition = 8, leader = 7, replicas = [7,3], isr = [7,3]), Partition(topic = captchastream, partition = 9, leader = 8, replicas = [8,4], isr = [4,8]), Partition(topic = captchastream, partition = 4, leader = 3, replicas = [3,4], isr = [3,4]), Partition(topic = captchastream, partition = 5, leader = 4, replicas = [4,5], isr = [4,5]), Partition(topic = captchastream, partition = 6, leader = 5, replicas = [5,7], isr = [5,7]), Partition(topic = captchastream, partition = 7, leader = 6, replicas = [6,8], isr = [8,6]), Partition(topic = captchastream, partition = 0, leader = 5, replicas = [5,6], isr = [5,6]), Partition(topic = captchastream, partition = 1, leader = 6, replicas = [6,7], isr = [7,6]), Partition(topic = captchastream, partition = 2, leader = 7, replicas = [7,8], isr = [7,8]), Partition(topic = captchastream, partition = 3, leader = 8, replicas = [8,3], isr = [8,3])])
2020-09-22 23:39:19,666 DEBUG org.apache.kafka.clients.NetworkClient - Recorded API versions for node 4: (Produce(0): 0 to 5 [usable: 3], Fetch(1): 0 to 7 [usable: 5], Offsets(2): 0 to 2 [usable: 2], Metadata(3): 0 to 5 [usable: 4], LeaderAndIsr(4): 0 to 1 [usable: 0], StopReplica(5): 0 [usable: 0], UpdateMetadata(6): 0 to 4 [usable: 3], ControlledShutdown(7): 0 to 1 [usable: 1], OffsetCommit(8): 0 to 3 [usable: 3], OffsetFetch(9): 0 to 3 [usable: 3], FindCoordinator(10): 0 to 1 [usable: 1], JoinGroup(11): 0 to 2 [usable: 2], Heartbeat(12): 0 to 1 [usable: 1], LeaveGroup(13): 0 to 1 [usable: 1], SyncGroup(14): 0 to 1 [usable: 1], DescribeGroups(15): 0 to 1 [usable: 1], ListGroups(16): 0 to 1 [usable: 1], SaslHandshake(17): 0 to 1 [usable: 0], ApiVersions(18): 0 to 1 [usable: 1], CreateTopics(19): 0 to 2 [usable: 2], DeleteTopics(20): 0 to 1 [usable: 1], DeleteRecords(21): 0 [usable: 0], InitProducerId(22): 0 [usable: 0], OffsetForLeaderEpoch(23): 0 [usable: 0], AddPartitionsToTxn(24): 0 [usable: 0], AddOffsetsToTxn(25): 0 [usable: 0], EndTxn(26): 0 [usable: 0], WriteTxnMarkers(27): 0 [usable: 0], TxnOffsetCommit(28): 0 [usable: 0], DescribeAcls(29): 0 [usable: 0], CreateAcls(30): 0 [usable: 0], DeleteAcls(31): 0 [usable: 0], DescribeConfigs(32): 0 to 1 [usable: 0], AlterConfigs(33): 0 [usable: 0], UNKNOWN(34): 0, UNKNOWN(35): 0, UNKNOWN(36): 0, UNKNOWN(37): 0, UNKNOWN(38): 0, UNKNOWN(39): 0, UNKNOWN(40): 0, UNKNOWN(41): 0, UNKNOWN(42): 0)
2020-09-22 23:39:19,716 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Fetch READ_UNCOMMITTED at offset 834984311 for partition captchastream-6 returned fetch data (error=NONE, highWaterMark=834984312, lastStableOffset = -1, logStartOffset = 834470755, abortedTransactions = null, recordsSizeInBytes=3479)
2020-09-22 23:39:19,716 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Added READ_UNCOMMITTED fetch request for partition captchastream-6 at offset 834984312 to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:39:19,716 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Sending READ_UNCOMMITTED fetch for partitions [captchastream-6] to broker be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:39:19,815 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Fetch READ_UNCOMMITTED at offset 834984312 for partition captchastream-6 returned fetch data (error=NONE, highWaterMark=834984313, lastStableOffset = -1, logStartOffset = 834470755, abortedTransactions = null, recordsSizeInBytes=1523)
2020-09-22 23:39:19,815 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Added READ_UNCOMMITTED fetch request for partition captchastream-6 at offset 834984313 to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:39:19,815 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Sending READ_UNCOMMITTED fetch for partitions [captchastream-6] to broker be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:39:20,239 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Fetch READ_UNCOMMITTED at offset 834984313 for partition captchastream-6 returned fetch data (error=NONE, highWaterMark=834984314, lastStableOffset = -1, logStartOffset = 834470755, abortedTransactions = null, recordsSizeInBytes=1296)
2020-09-22 23:39:20,239 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Added READ_UNCOMMITTED fetch request for partition captchastream-6 at offset 834984314 to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:48:19,675 DEBUG org.apache.kafka.clients.NetworkClient - Node 4 disconnected.
2020-09-22 23:48:19,675 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Fetch READ_UNCOMMITTED at offset 834989827 for partition captchastream-6 returned fetch data (error=NONE, highWaterMark=834989828, lastStableOffset = -1, logStartOffset = 834470755, abortedTransactions = null, recordsSizeInBytes=1019)
2020-09-22 23:48:19,675 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Added READ_UNCOMMITTED fetch request for partition captchastream-6 at offset 834989828 to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:48:19,675 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Sending READ_UNCOMMITTED fetch for partitions [captchastream-6] to broker be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:48:19,675 DEBUG org.apache.kafka.clients.NetworkClient - Initialize connection to node be-kafka-dragonpit-broker-8:8017 (id: 8 rack: null) for sending metadata request
2020-09-22 23:48:19,675 DEBUG org.apache.kafka.clients.NetworkClient - Initiating connection to node be-kafka-dragonpit-broker-8:8017 (id: 8 rack: null)
2020-09-22 23:48:19,683 DEBUG org.apache.kafka.common.network.Selector - Created socket with SO_RCVBUF = 65536, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to node 8
2020-09-22 23:48:19,684 DEBUG org.apache.kafka.clients.NetworkClient - Completed connection to node 8. Fetching API versions.
2020-09-22 23:48:19,684 DEBUG org.apache.kafka.clients.NetworkClient - Initiating API versions fetch from node 8.
2020-09-22 23:48:19,684 DEBUG org.apache.kafka.clients.NetworkClient - Sending metadata request (type=MetadataRequest, topics=captchastream) to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:48:19,685 DEBUG org.apache.kafka.clients.NetworkClient - Recorded API versions for node 8: (Produce(0): 0 to 5 [usable: 3], Fetch(1): 0 to 7 [usable: 5], Offsets(2): 0 to 2 [usable: 2], Metadata(3): 0 to 5 [usable: 4], LeaderAndIsr(4): 0 to 1 [usable: 0], StopReplica(5): 0 [usable: 0], UpdateMetadata(6): 0 to 4 [usable: 3], ControlledShutdown(7): 0 to 1 [usable: 1], OffsetCommit(8): 0 to 3 [usable: 3], OffsetFetch(9): 0 to 3 [usable: 3], FindCoordinator(10): 0 to 1 [usable: 1], JoinGroup(11): 0 to 2 [usable: 2], Heartbeat(12): 0 to 1 [usable: 1], LeaveGroup(13): 0 to 1 [usable: 1], SyncGroup(14): 0 to 1 [usable: 1], DescribeGroups(15): 0 to 1 [usable: 1], ListGroups(16): 0 to 1 [usable: 1], SaslHandshake(17): 0 to 1 [usable: 0], ApiVersions(18): 0 to 1 [usable: 1], CreateTopics(19): 0 to 2 [usable: 2], DeleteTopics(20): 0 to 1 [usable: 1], DeleteRecords(21): 0 [usable: 0], InitProducerId(22): 0 [usable: 0], OffsetForLeaderEpoch(23): 0 [usable: 0], AddPartitionsToTxn(24): 0 [usable: 0], AddOffsetsToTxn(25): 0 [usable: 0], EndTxn(26): 0 [usable: 0], WriteTxnMarkers(27): 0 [usable: 0], TxnOffsetCommit(28): 0 [usable: 0], DescribeAcls(29): 0 [usable: 0], CreateAcls(30): 0 [usable: 0], DeleteAcls(31): 0 [usable: 0], DescribeConfigs(32): 0 to 1 [usable: 0], AlterConfigs(33): 0 [usable: 0], UNKNOWN(34): 0, UNKNOWN(35): 0, UNKNOWN(36): 0, UNKNOWN(37): 0, UNKNOWN(38): 0, UNKNOWN(39): 0, UNKNOWN(40): 0, UNKNOWN(41): 0, UNKNOWN(42): 0)
2020-09-22 23:48:19,685 DEBUG org.apache.kafka.clients.Metadata - Updated cluster metadata version 321 to Cluster(id = 4ou4oBz8TU24ipwW8ws1Bw, nodes = [be-kafka-dragonpit-broker-6:8017 (id: 6 rack: null), be-kafka-dragonpit-broker-4:8017 (id: 4 rack: null), be-kafka-dragonpit-broker-3:8017 (id: 3 rack: null), be-kafka-dragonpit-broker-7:8017 (id: 7 rack: null), be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null), be-kafka-dragonpit-broker-8:8017 (id: 8 rack: null)], partitions = [Partition(topic = captchastream, partition = 8, leader = 7, replicas = [7,3], isr = [7,3]), Partition(topic = captchastream, partition = 9, leader = 8, replicas = [8,4], isr = [4,8]), Partition(topic = captchastream, partition = 4, leader = 3, replicas = [3,4], isr = [3,4]), Partition(topic = captchastream, partition = 5, leader = 4, replicas = [4,5], isr = [4,5]), Partition(topic = captchastream, partition = 6, leader = 5, replicas = [5,7], isr = [5,7]), Partition(topic = captchastream, partition = 7, leader = 6, replicas = [6,8], isr = [8,6]), Partition(topic = captchastream, partition = 0, leader = 5, replicas = [5,6], isr = [5,6]), Partition(topic = captchastream, partition = 1, leader = 6, replicas = [6,7], isr = [7,6]), Partition(topic = captchastream, partition = 2, leader = 7, replicas = [7,8], isr = [7,8]), Partition(topic = captchastream, partition = 3, leader = 8, replicas = [8,3], isr = [8,3])])
2020-09-22 23:48:19,809 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Fetch READ_UNCOMMITTED at offset 834989828 for partition captchastream-6 returned fetch data (error=NONE, highWaterMark=834989829, lastStableOffset = -1, logStartOffset = 834470755, abortedTransactions = null, recordsSizeInBytes=3489)
2020-09-22 23:48:19,809 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Added READ_UNCOMMITTED fetch request for partition captchastream-6 at offset 834989829 to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:48:19,809 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Sending READ_UNCOMMITTED fetch for partitions [captchastream-6] to broker be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)
2020-09-22 23:48:19,902 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Fetch READ_UNCOMMITTED at offset 834989829 for partition captchastream-6 returned fetch data (error=NONE, highWaterMark=834989830, lastStableOffset = -1, logStartOffset = 834470755, abortedTransactions = null, recordsSizeInBytes=1736)
2020-09-22 23:48:19,903 DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - Added READ_UNCOMMITTED fetch request for partition captchastream-6 at offset 834989830 to node be-kafka-dragonpit-broker-5:8017 (id: 5 rack: null)