[jira] [Created] (FLINK-16468) BlobClient rapid retrieval retries on failure opens too many sockets

Shang Yuanchun (Jira)
Jason Kania created FLINK-16468:
-----------------------------------

             Summary: BlobClient rapid retrieval retries on failure opens too many sockets
                 Key: FLINK-16468
                 URL: https://issues.apache.org/jira/browse/FLINK-16468
             Project: Flink
          Issue Type: Bug
          Components: API / Core
    Affects Versions: 1.9.2
         Environment: Ubuntu Linux servers, patch-current on the latest Ubuntu release, running a patch-current Java 8 JRE
            Reporter: Jason Kania


When a BlobClient retrieval fails as in the following log, the client retries immediately, and the rapid retries exhaust the available sockets (open file descriptors). All the retries happen within a few milliseconds.

{{2020-03-06 17:19:07,116 ERROR org.apache.flink.runtime.blob.BlobClient - Failed to fetch BLOB cddd17ef76291dd60eee9fd36085647a/p-bcd61652baba25d6863cf17843a2ef64f4c801d5-c1781532477cf65ff1c1e7d72dccabc7 from aaa-1/10.0.1.1:45145 and store it under /tmp/blobStore-7328ed37-8bc7-4af7-a56c-474e264157c9/incoming/temp-00000004 Retrying...}}

The above is output repeatedly until the following error occurs:

{{java.io.IOException: Could not connect to BlobServer at address aaa-1/10.0.1.1:45145}}
{{ at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:100)}}
{{ at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:143)}}
{{ at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)}}
{{ at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:202)}}
{{ at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)}}
{{ at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:915)}}
{{ at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:595)}}
{{ at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)}}
{{ at java.lang.Thread.run(Thread.java:748)}}
{{Caused by: java.net.SocketException: Too many open files}}
{{ at java.net.Socket.createImpl(Socket.java:478)}}
{{ at java.net.Socket.connect(Socket.java:605)}}
{{ at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:95)}}
{{ ... 8 more}}

The retries should use some form of backoff in this situation to avoid flooding the logs and exhausting sockets and other resources on the server.
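
As a rough illustration of the kind of backoff meant here, the following is a minimal sketch of a capped exponential delay between attempts. It is not Flink's code and not a proposed patch: the class `BackoffRetry`, the methods `backoffMillis` and `retryWithBackoff`, and all parameter names are hypothetical, and the operation being retried is assumed to signal failure with an `IOException`.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; none of these names exist in Flink.
final class BackoffRetry {

    /**
     * Delay to wait after failed attempt number `attempt` (0-based):
     * initialMillis doubled `attempt` times, capped at maxMillis.
     */
    public static long backoffMillis(int attempt, long initialMillis, long maxMillis) {
        long delay = initialMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /**
     * Runs the operation up to maxAttempts times, sleeping between failed
     * attempts. Only IOException triggers a retry; the last one is rethrown.
     */
    public static <T> T retryWithBackoff(
            Callable<T> operation, int maxAttempts, long initialMillis, long maxMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (IOException e) {
                last = e;
                // Do not sleep after the final attempt; just report the failure.
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(backoffMillis(attempt, initialMillis, maxMillis));
                }
            }
        }
        throw last;
    }
}
```

With an initial delay of 100 ms and a cap of 1000 ms, the waits between attempts in this sketch would be 100, 200, 400, 800 and then 1000 ms, rather than several attempts within a few milliseconds. Any retried operation would also need to close its socket on failure, since a delay alone does not release descriptors that are already open.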

--
This message was sent by Atlassian Jira
(v8.3.4#803005)