[jira] [Created] (FLINK-18811) if a disk is damaged, taskmanager should choose another disk for temp dir , rather than throw an IOException, which causes flink job restart over and over again

Shang Yuanchun (Jira)
Kai Chen created FLINK-18811:
--------------------------------

             Summary: if a disk is damaged, taskmanager should choose another disk for temp dir , rather than throw an IOException, which causes flink job restart over and over again
                 Key: FLINK-18811
                 URL: https://issues.apache.org/jira/browse/FLINK-18811
             Project: Flink
          Issue Type: Improvement
          Components: Table SQL / Runtime
         Environment: flink-1.10
            Reporter: Kai Chen
         Attachments: flink_disk_error.png

I hit this exception when a disk was damaged:

!flink_disk_error.png!

I think that if a disk is damaged, the TaskManager should choose another disk for its temp dir rather than throw an IOException, which causes the Flink job to restart over and over again.

We could change "SpillingAdaptiveSpanningRecordDeserializer" like this:
{code:java}
// SpillingAdaptiveSpanningRecordDeserializer.java
private FileChannel createSpillingChannel() throws IOException {
   if (spillFile != null) {
      throw new IllegalStateException("Spilling file already exists.");
   }

   // try to find a unique file name for the spilling channel
   int maxAttempts = 10;
   // local copy, so that failed directories can be dropped without touching the field
   String[] tempDirs = this.tempDirs;
   for (int attempt = 0; attempt < maxAttempts; attempt++) {
      int dirIndex = rnd.nextInt(tempDirs.length);
      String directory = tempDirs[dirIndex];
      spillFile = new File(directory, randomString(rnd) + ".inputchannel");
      try {
         if (spillFile.createNewFile()) {
            return new RandomAccessFile(spillFile, "rw").getChannel();
         }
         // createNewFile() returned false: the name is taken, retry with a new one
      } catch (IOException e) {
         // if there is no tempDir left to try
         if (tempDirs.length <= 1) {
            throw e;
         }
         LOG.warn("Caught an IOException when creating spill file in: " + directory + ". Attempt " + attempt, e);
         // exclude the failed directory from the remaining attempts
         tempDirs = (String[]) ArrayUtils.remove(tempDirs, dirIndex);
      }
   }

   // all attempts were used up without creating a file
   throw new IOException("Could not create a spill file in any of the configured temp directories.");
}
{code}
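
The same fallback idea can be tried outside of Flink. The following is only a minimal, self-contained sketch: the class name SpillFileFallback and the method createSpillFile are hypothetical and are not part of Flink's API, and a java.util.List stands in for the String[] plus ArrayUtils used above. A directory that throws an IOException is dropped from the candidates, and the call fails only when no usable directory is left or the attempts run out.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.UUID;

// Hypothetical helper for illustration only; not Flink code.
public class SpillFileFallback {

    /**
     * Creates an empty spill file in one of the given directories.
     * A directory that fails with an IOException is excluded from later attempts.
     */
    static File createSpillFile(List<String> tempDirs, Random rnd, int maxAttempts) throws IOException {
        // Work on a copy so the caller's list is left untouched.
        List<String> candidates = new ArrayList<>(tempDirs);
        IOException lastFailure = null;

        for (int attempt = 0; attempt < maxAttempts && !candidates.isEmpty(); attempt++) {
            int dirIndex = rnd.nextInt(candidates.size());
            File file = new File(candidates.get(dirIndex), UUID.randomUUID() + ".inputchannel");
            try {
                if (file.createNewFile()) {
                    return file;
                }
                // Name collision: loop again with a fresh random name.
            } catch (IOException e) {
                // Directory is unusable (for example a damaged disk): drop it and try another.
                lastFailure = e;
                candidates.remove(dirIndex);
            }
        }
        throw new IOException("Could not create a spill file in any of " + tempDirs, lastFailure);
    }
}
```

Calling createSpillFile with one broken and one healthy directory returns a file in the healthy one, whichever is picked first. With only broken directories it throws an IOException whose cause is the last failure seen.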



--
This message was sent by Atlassian Jira
(v8.3.4#803005)