[jira] [Created] (FLINK-12381) Without failover (aka "HA") configured, full restarts' checkpointing crashes

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (FLINK-12381) Without failover (aka "HA") configured, full restarts' checkpointing crashes

Shang Yuanchun (Jira)
Henrik created FLINK-12381:
------------------------------

             Summary: Without failover (aka "HA") configured, full restarts' checkpointing crashes
                 Key: FLINK-12381
                 URL: https://issues.apache.org/jira/browse/FLINK-12381
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.8.0
         Environment: Same as FLINK-\{12379, 12377, 12376}
            Reporter: Henrik


{code:java}
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 'gs://example_bucket/flink/checkpoints/00000000000000000000000000000000/chk-16/_metadata' already exists
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
    at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
    at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
    at org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:65)
    at org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
    at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
    ... 8 more
{code}
Instead, it should either just overwrite the checkpoint or fail to start the job completely. Partial and undefined failure is not what should happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)