Hi,
I am working with Apache Flink and would like to know how to estimate the total time an application spends in recovery, including the input stream "catch-up" phase after a checkpoint is restored. Specifically, I am interested in the time needed to restore the state plus the time needed to catch up. Because the application's source tasks are reset to an earlier input position after recovery, the catch-up covers both the data that was already processed between the last checkpoint and the failure and the data that accumulated while the application was down.
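To make this concrete, here is a rough back-of-the-envelope sketch of the quantity I have in mind. It is only a simplified model, not anything taken from Flink: the function name and all parameters are hypothetical, and it assumes a constant input rate, a constant maximum processing throughput that is higher than the input rate, and no other bottlenecks during catch-up.

```python
def estimate_recovery_seconds(input_rate, max_throughput,
                              since_checkpoint_s, restart_s, restore_s):
    """Rough estimate of total recovery time in seconds.

    All parameters are hypothetical model inputs, not Flink settings:
      input_rate          records/s arriving at the sources
      max_throughput      records/s the job can process at full speed
      since_checkpoint_s  time between the last checkpoint and the failure
      restart_s           time to detect the failure and restart the job
      restore_s           time to load the checkpointed state
    """
    if max_throughput <= input_rate:
        raise ValueError("the job can never catch up if it is not faster than the input")

    # Time during which nothing is processed.
    downtime_s = restart_s + restore_s

    # Records to replay: those since the last checkpoint plus those
    # that arrived while the job was down.
    backlog = input_rate * (since_checkpoint_s + downtime_s)

    # New records keep arriving during catch-up, so the backlog
    # only shrinks at the spare capacity of the job.
    catch_up_s = backlog / (max_throughput - input_rate)

    return downtime_s + catch_up_s


# Example with made-up numbers: 1000 records/s in, 1500 records/s capacity,
# failure 30 s after the last checkpoint, 10 s restart, 20 s state restore.
print(estimate_recovery_seconds(1000, 1500, 30, 10, 20))
```

In this model the catch-up term dominates whenever the job has little spare capacity, which is why I am interested in both parts and not only in the state restore.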
My question is: what are the important considerations when estimating this time, and which parts of the Apache Flink codebase would be most helpful to look at?
Thanks!