[jira] [Created] (FLINK-22505) Limit the precision of Resource

Shang Yuanchun (Jira)
Yangze Guo created FLINK-22505:
----------------------------------

             Summary: Limit the precision of Resource
                 Key: FLINK-22505
                 URL: https://issues.apache.org/jira/browse/FLINK-22505
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.13.0
            Reporter: Yangze Guo


In our internal deployment, we found that a high-precision {{CPUResource}} can cause a resource requirement to never be fulfilled. Consider the following scenario:
- The {{SlotManager}} receives a slot request for 1.000000000000001 CPU and decides to allocate a pending task manager with that resource spec.
- The resource manager starts a task manager and sets its CPU via dynamic configuration. In this step, the {{CPUResource}} is converted to a double value, which is where the precision loss happens.

The task manager therefore registers with 1.0 CPU. This does not match the pending task manager's spec, so the registered task manager can neither deduct that pending task manager nor fulfill the slot request.
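A minimal Java sketch of the conversion step, assuming the CPU value is held as a {{java.math.BigDecimal}} and passed through a plain {{doubleValue()}} call. {{PrecisionLossDemo}} and {{roundTripThroughDouble}} are hypothetical names, not Flink code. The sketch uses 1.0000000000000001 because the double conversion alone rounds it to exactly 1.0: the nearest doubles around 1 are 1.0 and 1.0 + 2^-52 (about 1 + 2.2e-16), and 1e-16 is less than half that gap.

```java
import java.math.BigDecimal;

public class PrecisionLossDemo {
    // Hypothetical helper: mimics converting a BigDecimal-backed CPU value to a
    // double (e.g. when writing it into a dynamic config) and reading it back.
    static BigDecimal roundTripThroughDouble(BigDecimal cpu) {
        double asDouble = cpu.doubleValue();  // precision may be lost here
        return BigDecimal.valueOf(asDouble);  // value the task manager would register with
    }

    public static void main(String[] args) {
        BigDecimal requested = new BigDecimal("1.0000000000000001");
        BigDecimal registered = roundTripThroughDouble(requested);
        System.out.println("requested  = " + requested);
        System.out.println("registered = " + registered);
        // compareTo ignores scale, so this is a purely numeric comparison.
        System.out.println("match      = " + (requested.compareTo(registered) == 0));
    }
}
```

Because the registered value no longer equals the requested one, any exact matching between the two specs fails, which is the mismatch described above.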

To solve this issue, we propose limiting the precision of Resource to a safe value, e.g. 8 decimal places, so that no precision is lost when the value is converted to a double. Such a limit does not restrict any currently supported deployment:
- For {{CPUResource}}, Kubernetes supports a CPU scale of 3 (i.e. millicores), while on YARN the CPU must be an integer.
- For {{ExternalResource}}, the value is always treated as an integer.
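A minimal sketch of the proposed limit, again using hypothetical names: the value is truncated to a scale of 8 when it is created (the choice of {{RoundingMode.DOWN}} is an assumption, the issue does not specify a rounding mode). After that, the request from the scenario survives the double round trip, and values at Kubernetes granularity (scale 3) or YARN granularity (integers) are left unchanged. This only illustrates the idea and is not Flink's {{Resource}} implementation.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class ScaleLimitDemo {
    // Assumed limit, taken from the "e.g. 8" in the proposal.
    static final int MAX_SCALE = 8;

    // Hypothetical normalization applied when a resource value is created:
    // digits beyond MAX_SCALE are dropped (RoundingMode.DOWN truncates toward zero).
    static BigDecimal normalize(BigDecimal value) {
        return value.setScale(MAX_SCALE, RoundingMode.DOWN);
    }

    // True if converting to double and back yields the same numeric value.
    static boolean survivesDoubleRoundTrip(BigDecimal value) {
        return BigDecimal.valueOf(value.doubleValue()).compareTo(value) == 0;
    }

    public static void main(String[] args) {
        BigDecimal raw = new BigDecimal("1.0000000000000001");
        BigDecimal limited = normalize(raw);
        System.out.println("raw survives round trip:     " + survivesDoubleRoundTrip(raw));
        System.out.println("limited value:               " + limited);
        System.out.println("limited survives round trip: " + survivesDoubleRoundTrip(limited));

        // Kubernetes-style values (scale <= 3) and YARN-style integers are not
        // changed numerically by the limit.
        for (String s : new String[] {"0.25", "1.5", "4"}) {
            BigDecimal v = new BigDecimal(s);
            System.out.println(s + " unchanged: " + (normalize(v).compareTo(v) == 0));
        }
    }
}
```

Truncating rather than rounding to nearest means a normalized value never exceeds what was originally specified; whether that is the preferable behavior is a design choice left open here.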



--
This message was sent by Atlassian Jira
(v8.3.4#803005)