[jira] [Created] (FLINK-22749) Apply a robust default State Backend Configuration


Shang Yuanchun (Jira)
Stephan Ewen created FLINK-22749:
------------------------------------

             Summary: Apply a robust default State Backend Configuration
                 Key: FLINK-22749
                 URL: https://issues.apache.org/jira/browse/FLINK-22749
             Project: Flink
          Issue Type: Improvement
          Components: Stateful Functions
            Reporter: Stephan Ewen


We should update the default state backend configuration with settings that reflect lessons learned about robust setups.

(1) Always use the RocksDB State Backend. That is already the case.

(2) Activate partitioned index filters by default. This may cost some overhead in specific cases, but it avoids massive performance regressions when we have so many ColumnFamilies (too many states) that the cache can no longer hold all index files.

We need to add {{state.backend.rocksdb.memory.partitioned-index-filters: true}} to the config.

See FLINK-20496 for details.
There is a chance that Flink makes this the default in the future as well; we could then remove it again from the StateFun setup.

(3) Activate local recovery by default.

That should speed up the recovery of all TMs that did not hard-crash by a lot.
We need to configure:
  - {{state.backend.local-recovery: true}}
  - {{taskmanager.state.local.root-dirs}} to some non-temp directory

For this to work reliably, we need a local directory that is not periodically wiped by the OS, so we should not rely on the default ({{/tmp}}) directory, but set up a dedicated non-temp state directory.

Flink will probably make this the default in the future, but having a non-{{/tmp}} directory for RocksDB and the local snapshots still makes a lot of sense.

(4) Increase state transfer threads by default, to speed up state restores.

Add to the config: {{state.backend.rocksdb.checkpoint.transfer.thread.num: 8}}
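
Taken together, points (1) to (4) could look roughly like the following fragment of the Flink configuration. This is only a sketch: the two directory paths are hypothetical placeholders (the issue only asks for "some non-temp directory"), and the {{state.backend.rocksdb.localdir}} entry is an assumption about how the RocksDB working directory would be moved off {{/tmp}}; the issue itself does not name that key.

```yaml
# (1) RocksDB state backend (already the case for StateFun)
state.backend: rocksdb

# (2) Partitioned index filters, see FLINK-20496
state.backend.rocksdb.memory.partitioned-index-filters: true

# (3) Local recovery, with a dedicated non-temp directory
state.backend.local-recovery: true
# Hypothetical path; any directory not periodically wiped by the OS should do
taskmanager.state.local.root-dirs: /var/lib/statefun/local-state
# Assumption: also keep RocksDB's working files out of /tmp (hypothetical path)
state.backend.rocksdb.localdir: /var/lib/statefun/rocksdb

# (4) More state transfer threads to speed up state restores
state.backend.rocksdb.checkpoint.transfer.thread.num: 8
```

Whichever directories are chosen, they would need to exist and be writable by the TaskManager process, for example via a mounted volume in a container setup.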




--
This message was sent by Atlassian Jira
(v8.3.4#803005)