My Flink job runs in Kubernetes. This is the setup:
1. One job running as a job cluster with one job manager
2. HA powered by ZooKeeper (works fine)
3. Job/Deployment manifests stored in GitHub and deployed to Kubernetes by Argo
4. State persisted to S3
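Roughly, the relevant parts of the Flink configuration look like this (hostnames, bucket paths, and the RocksDB backend below are illustrative placeholders, not the real values):

  # flink-conf.yaml: HA via ZooKeeper, checkpoints/savepoints on S3
  high-availability: zookeeper
  high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
  high-availability.storageDir: s3://my-bucket/flink/ha
  state.backend: rocksdb
  state.checkpoints.dir: s3://my-bucket/flink/checkpoints
  state.savepoints.dir: s3://my-bucket/flink/savepoints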
If I were to stop the job (drain and take a savepoint) and later resume it, I would have to update the job manager manifest with the savepoint location, commit it to GitHub, and redeploy. After the deployment, I would presumably have to modify the manifest again to remove the savepoint location, so that a later restart does not start the application from that same savepoint again. (A sketch of the relevant part of the manifest follows the questions below.) This raises some questions:
1. If the job manager were to crash before the manifest is updated again, won't Kubernetes restart the job manager from the savepoint rather than from the latest checkpoint?
2. Is there a way to ensure that restoration from a savepoint happens at most once, or at least not after the first successful checkpoint?
3. If even one checkpoint has been completed, the job should prefer that checkpoint over the savepoint. Will that happen automatically, given that ZooKeeper HA is in place?
4. Is it possible to avoid having to remove the savepoint path from the Kubernetes manifest and simply rely on newer checkpoints/savepoints? It feels rather clumsy to have to add the path and then remove it again manually. We could use a cron job to remove it, but that's still clumsy.
5. Is there a way of asking Flink to use the latest savepoint rather than specifying the exact savepoint location? If I were to manually rename the S3 savepoint location to something fixed (s3://fixed_savepoint_path_always), would there be any problem restoring the job?
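For reference, the savepoint handling described above boils down to one extra argument on the job manager container. A rough sketch, assuming the standalone job entry point of the official Flink image (image tag, job class, and savepoint path are placeholders):

  containers:
  - name: jobmanager
    image: flink:1.10.0
    args:
    - standalone-job
    - --job-classname
    - com.example.MyStreamingJob
    - --fromSavepoint
    - s3://my-bucket/flink/savepoints/savepoint-abc123
    # deleting the --fromSavepoint argument again after resuming is the manual step I'd like to avoid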