Stephan Ewen created FLINK-22749:
------------------------------------
Summary: Apply a robust default State Backend Configuration
Key: FLINK-22749
URL: https://issues.apache.org/jira/browse/FLINK-22749
Project: Flink
Issue Type: Improvement
Components: Stateful Functions
Reporter: Stephan Ewen
We should update the default state backend configuration with settings that reflect lessons learned about robust setups.
(1) Always use the RocksDB State Backend. That is already the case.
(2) Activate partitioned index filters by default. This may cost some overhead in specific cases, but it avoids massive performance regressions when we have so many ColumnFamilies (too many states) that the cache can no longer hold all index files.
We need to add {{state.backend.rocksdb.memory.partitioned-index-filters: true}} to the config.
See FLINK-20496 for details.
Flink may make this the default in the future as well, in which case we could remove it from the StateFun setup again.
(3) Activate local recovery by default.
That should speed up recovery considerably for all TMs that did not crash hard.
We need to configure
- {{state.backend.local-recovery: true}}
- {{taskmanager.state.local.root-dirs}} to some non-temp directory
For this to work reliably, we need a local directory that is not periodically wiped by the OS, so we should not rely on the default ({{/tmp}}) directory, but set up a dedicated non-temp state directory.
Flink will probably make this the default in the future, but having a non-{{/tmp}} directory for RocksDB and the local snapshots still makes a lot of sense.
(4) Increase state transfer threads by default, to speed up state restores.
Add to the config: {{state.backend.rocksdb.checkpoint.transfer.thread.num: 8}}
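Taken together, items (2) to (4) could look roughly like the following fragment of the Flink configuration. This is only a sketch: the keys and values are the ones named above, while the directory path is a hypothetical placeholder and would have to point to whatever non-temp location the StateFun setup actually provides.

```yaml
# (2) Partitioned index filters, see FLINK-20496
state.backend.rocksdb.memory.partitioned-index-filters: true

# (3) Local recovery, with local state kept outside of /tmp
state.backend.local-recovery: true
# Hypothetical path, chosen for illustration only; any directory that the OS
# does not periodically wipe would do.
taskmanager.state.local.root-dirs: /var/lib/statefun/local-state

# (4) More threads for state transfer, to speed up state restores
state.backend.rocksdb.checkpoint.transfer.thread.num: 8
```

If Flink later adopts any of these as defaults, the corresponding entries could be dropped from the StateFun setup again.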