Kai Chen created FLINK-18811:
--------------------------------
Summary: if a disk is damaged, taskmanager should choose another disk for temp dir , rather than throw an IOException, which causes flink job restart over and over again
Key: FLINK-18811
URL: https://issues.apache.org/jira/browse/FLINK-18811
Project: Flink
Issue Type: Improvement
Components: Table SQL / Runtime
Environment: flink-1.10
Reporter: Kai Chen
Attachments: flink_disk_error.png
I encountered this exception when a disk was damaged:
!flink_disk_error.png!
I think that if a disk is damaged, the TaskManager should choose another disk for the temp dir rather than throw an IOException, which causes the Flink job to restart over and over again.
Could we change "SpillingAdaptiveSpanningRecordDeserializer" like this:
{code:java}
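// SpillingAdaptiveSpanningRecordDeserializer.java
// Note: ArrayUtils is assumed to be the Apache Commons Lang class; java.util.Arrays is also needed.
private FileChannel createSpillingChannel() throws IOException {
    if (spillFile != null) {
        throw new IllegalStateException("Spilling file already exists.");
    }

    // try to find a unique file name for the spilling channel
    int maxAttempts = 10;
    // local copy, so failed directories can be dropped without touching the field
    String[] tempDirs = this.tempDirs;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
        int dirIndex = rnd.nextInt(tempDirs.length);
        String directory = tempDirs[dirIndex];
        spillFile = new File(directory, randomString(rnd) + ".inputchannel");
        try {
            if (spillFile.createNewFile()) {
                return new RandomAccessFile(spillFile, "rw").getChannel();
            }
        } catch (IOException e) {
            // if there is no tempDir left to try
            if (tempDirs.length <= 1) {
                throw e;
            }
            LOG.warn("Caught an IOException when creating spill file in: " + directory + ". Attempt " + attempt, e);
            // drop the failing directory so it is not picked again
            tempDirs = (String[]) ArrayUtils.remove(tempDirs, dirIndex);
        }
    }

    // all attempts used up without creating a file
    throw new IOException(
        "Could not find a unique file channel name in '" + Arrays.toString(this.tempDirs)
            + "' for spilling large records during deserialization.");
}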
{code}
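The same fallback idea can be tried outside of Flink. The following is only a minimal, self-contained sketch under my own assumptions: the class name SpillFileFallback and the method createSpillFile are hypothetical and do not exist in Flink, a UUID stands in for randomString(rnd), and it returns the File rather than an open FileChannel to keep the example short.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.UUID;

/** Hypothetical helper (not part of Flink): creates a spill file in the first usable temp directory. */
final class SpillFileFallback {

    private SpillFileFallback() {}

    /**
     * Picks directories at random and returns the first file that could be created.
     * A directory that throws an IOException is dropped from the candidate list,
     * so a damaged disk is tried at most once.
     */
    static File createSpillFile(String[] tempDirs, Random rnd, int maxAttempts) throws IOException {
        List<String> candidates = new ArrayList<>(Arrays.asList(tempDirs));
        IOException lastFailure = null;

        for (int attempt = 0; attempt < maxAttempts && !candidates.isEmpty(); attempt++) {
            int dirIndex = rnd.nextInt(candidates.size());
            String directory = candidates.get(dirIndex);
            File file = new File(directory, UUID.randomUUID() + ".inputchannel");
            try {
                if (file.createNewFile()) {
                    return file;
                }
                // createNewFile() returned false: the name is taken, retry with a new name
            } catch (IOException e) {
                // this directory is unusable, remove it by index and try another one
                lastFailure = e;
                candidates.remove(dirIndex);
            }
        }

        IOException failure = new IOException(
            "Could not create a spill file in any of " + Arrays.toString(tempDirs));
        if (lastFailure != null) {
            failure.addSuppressed(lastFailure);
        }
        throw failure;
    }
}
```

With one unusable and one healthy directory in tempDirs, the call should end up in the healthy one; if every directory fails, the caller still gets an IOException, with the last underlying failure attached as a suppressed exception.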