[jira] [Created] (FLINK-17853) JobGraph is not getting deleted after Job cancelation


Shang Yuanchun (Jira)
Fritz Budiyanto created FLINK-17853:
---------------------------------------

             Summary: JobGraph is not getting deleted after Job cancelation
                 Key: FLINK-17853
                 URL: https://issues.apache.org/jira/browse/FLINK-17853
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.9.2
         Environment: Flink 1.9.2
Zookeeper from AWS MSK
            Reporter: Fritz Budiyanto
         Attachments: flinkissue.txt

I have seen this issue several times: JobGraphs are not cleaned up properly after a job is deleted. Jobs are deleted with the "flink stop" command. As a result, the JobGraph node lingers in ZooKeeper, and when the Flink cluster is restarted it attempts HA recovery from a non-existent checkpoint, which prevents the cluster from coming up.




2020-05-19 19:56:21,471 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Un-registering task and sending final execution state FINISHED to JobManager for task Source: kafkaConsumer[update_server] -> (DetectedUpdateMessageConverter -> Sink: update_server.detected_updates, DrivenCoordinatesMessageConverter -> Sink: update_server.driven_coordinates) 588902a8096f49845b09fa1f595d6065.
2020-05-19 19:56:21,622 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable - Free slot TaskSlot(index:0, state:ACTIVE, resource profile: ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, networkMemoryInMB=2147483647, managedMemoryInMB=642}, allocationId: 29f6a5f83c832486f2d7ebe5c779fa32, jobId: 86a028b3f7aada8ffe59859ca71d6385).
2020-05-19 19:56:21,622 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Remove job 86a028b3f7aada8ffe59859ca71d6385 from job leader monitoring.
2020-05-19 19:56:21,622 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/86a028b3f7aada8ffe59859ca71d6385/job_manager_lock.
2020-05-19 19:56:21,623 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job 86a028b3f7aada8ffe59859ca71d6385.
2020-05-19 19:56:21,624 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job 86a028b3f7aada8ffe59859ca71d6385.
2020-05-19 19:56:21,624 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Cannot reconnect to job 86a028b3f7aada8ffe59859ca71d6385 because it is not registered.


...
ZooKeeper CLI, after the job was stopped:

ls /flink/cluster_update/jobgraphs
[86a028b3f7aada8ffe59859ca71d6385]
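
The check above can be scripted. The following is only a minimal sketch and not part of the attached logs: it assumes the output of the ZooKeeper `ls` command has been captured as text and that the IDs of jobs still considered active (for example, copied from `flink list`) are available as a list. The helper names `parse_zk_ls` and `orphaned_jobgraphs` are hypothetical.

```python
import re

# Flink job IDs are printed as 32 lowercase hexadecimal characters.
JOB_ID = re.compile(r"\b[0-9a-f]{32}\b")


def parse_zk_ls(output):
    """Extract job IDs from the text printed by a ZooKeeper `ls` command."""
    return JOB_ID.findall(output)


def orphaned_jobgraphs(zk_children, active_job_ids):
    """Return jobgraph znodes that have no matching active job."""
    active = set(active_job_ids)
    return [job_id for job_id in zk_children if job_id not in active]


if __name__ == "__main__":
    # Captured output of: ls /flink/cluster_update/jobgraphs
    zk_output = "[86a028b3f7aada8ffe59859ca71d6385]"
    # Assumption: the job was stopped, so no jobs are active.
    leftovers = orphaned_jobgraphs(parse_zk_ls(zk_output), active_job_ids=[])
    for job_id in leftovers:
        print("leftover jobgraph node:", job_id)
```

Any ID it reports corresponds to a jobgraph node that would be picked up for HA recovery on the next cluster restart.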

 

Attached are the Flink logs, in reverse order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)