Fritz Budiyanto created FLINK-17853:
---------------------------------------

             Summary: JobGraph is not getting deleted after Job cancelation
                 Key: FLINK-17853
                 URL: https://issues.apache.org/jira/browse/FLINK-17853
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.9.2
         Environment: Flink 1.9.2
                      Zookeeper from AWS MSK
            Reporter: Fritz Budiyanto
         Attachments: flinkissue.txt

I have seen this issue several times: JobGraphs are not cleaned up properly after job deletion. Job deletion is performed using the "flink stop" command. As a result, the JobGraph node lingers in ZK. When the Flink cluster is restarted, it attempts HA restoration from a non-existent checkpoint, which prevents the Flink cluster from coming up.

2020-05-19 19:56:21,471 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Un-registering task and sending final execution state FINISHED to JobManager for task Source: kafkaConsumer[update_server] -> (DetectedUpdateMessageConverter -> Sink: update_server.detected_updates, DrivenCoordinatesMessageConverter -> Sink: update_server.driven_coordinates) 588902a8096f49845b09fa1f595d6065.
2020-05-19 19:56:21,622 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable - Free slot TaskSlot(index:0, state:ACTIVE, resource profile: ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, networkMemoryInMB=2147483647, managedMemoryInMB=642}, allocationId: 29f6a5f83c832486f2d7ebe5c779fa32, jobId: 86a028b3f7aada8ffe59859ca71d6385).
2020-05-19 19:56:21,622 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Remove job 86a028b3f7aada8ffe59859ca71d6385 from job leader monitoring.
2020-05-19 19:56:21,622 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/86a028b3f7aada8ffe59859ca71d6385/job_manager_lock.
2020-05-19 19:56:21,623 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job 86a028b3f7aada8ffe59859ca71d6385.
2020-05-19 19:56:21,624 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job 86a028b3f7aada8ffe59859ca71d6385.
2020-05-19 19:56:21,624 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Cannot reconnect to job 86a028b3f7aada8ffe59859ca71d6385 because it is not registered.
...

Zookeeper CLI:

ls /flink/cluster_update/jobgraphs
[86a028b3f7aada8ffe59859ca71d6385]

Attached are the Flink logs in reverse order.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)