Hi Till,
Sorry for the late response; I did some investigation into Spark. Spark adopts the SPI approach to obtain delegation tokens for different components. It has a HadoopDelegationTokenManager (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala) to manage all Hadoop delegation tokens, including obtaining and renewing them.

When the HadoopDelegationTokenManager initializes, it uses ServiceLoader to load all HadoopDelegationTokenProviders from the different connectors. As for Hive, the provider implements the HadoopDelegationTokenProvider interface.

Thanks,
Jie

On 2021/01/13 08:51:29, Till Rohrmann <[hidden email]> wrote:
> Hi Jie Wang,
>
> thanks for starting this discussion. To me the SPI approach sounds better
> because it is not as brittle as using reflection. Concerning the
> configuration, we could think about introducing some Hive-specific
> configuration options which allow us to specify these paths. How are other
> projects which integrate with Hive solving this problem?
>
> Cheers,
> Till
>
> On Tue, Jan 12, 2021 at 4:13 PM 王 杰 <[hidden email]> wrote:
>
> > Hi everyone,
> >
> > Currently, the Hive delegation token is not obtained when Flink submits
> > the application in YARN mode using kinit. The ticket is
> > https://issues.apache.org/jira/browse/FLINK-20714. I'd like to start a
> > discussion about how to support this feature.
> >
> > Maybe we have two options:
> > 1. Use reflection to construct a Hive client to obtain the token, the
> > same way as the org.apache.flink.yarn.Utils.obtainTokenForHBase
> > implementation.
> > 2. Introduce a pluggable delegation token provider via SPI. The provider
> > could be placed in connector-related code, so reflection is not needed
> > and it is more extensible.
> >
> > Both options have to handle how to specify the HiveConf to use. In the
> > Hive connector, users can specify both hiveConfDir and hadoopConfDir
> > when creating the HiveCatalog. The hadoopConfDir may not be the same as
> > the Hadoop configuration in the HadoopModule.
> >
> > Looking forward to your suggestions.
> >
> > --
> > Best regards!
> > Jie Wang
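[Editor's illustration] To make the SPI idea concrete, here is a rough Java sketch of what a pluggable delegation token provider and its ServiceLoader-based discovery could look like. All class names, the token alias, and the renewer principal are hypothetical, not an existing Flink or Spark API; each connector jar would register its implementation through a META-INF/services file.

import java.util.ServiceLoader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

/** Hypothetical SPI contract, loosely modeled on Spark's HadoopDelegationTokenProvider. */
interface DelegationTokenProvider {
    /** Name of the service this provider obtains tokens for, e.g. "hive". */
    String serviceName();

    /** Obtain delegation tokens for the service and add them to the given credentials. */
    void obtainDelegationTokens(Configuration hadoopConf, Credentials credentials) throws Exception;
}

/** Hypothetical Hive implementation that would live in the Hive connector module. */
class HiveDelegationTokenProvider implements DelegationTokenProvider {
    @Override
    public String serviceName() {
        return "hive";
    }

    @Override
    public void obtainDelegationTokens(Configuration hadoopConf, Credentials credentials) throws Exception {
        // Building the HiveConf inside the connector lets it decide which
        // hiveConfDir / hadoopConfDir to use (the configuration question raised above).
        HiveConf hiveConf = new HiveConf(hadoopConf, HiveConf.class);
        HiveMetaStoreClient client = new HiveMetaStoreClient(hiveConf);
        try {
            String owner = UserGroupInformation.getCurrentUser().getUserName();
            // Ask the metastore for a delegation token; using the owner as renewer is illustrative.
            String tokenStr = client.getDelegationToken(owner, owner);
            Token<TokenIdentifier> token = new Token<>();
            token.decodeFromUrlString(tokenStr);
            credentials.addToken(new Text("hive.metastore.delegation.token"), token);
        } finally {
            client.close();
        }
    }
}

/** Discovery side: every provider registered under META-INF/services is picked up automatically. */
class DelegationTokenManager {
    void obtainTokens(Configuration hadoopConf, Credentials credentials) throws Exception {
        for (DelegationTokenProvider provider : ServiceLoader.load(DelegationTokenProvider.class)) {
            provider.obtainDelegationTokens(hadoopConf, credentials);
        }
    }
}

Placing the provider in the connector this way could also reuse whatever hiveConfDir/hadoopConfDir the user configured on the HiveCatalog, rather than relying on the Hadoop configuration in the HadoopModule.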
Hi Jie,
Thanks for the investigation. I think we can first implement pluggable DT providers and add renewal abilities incrementally. I'm also curious where Spark runs its HadoopDelegationTokenManager when renewal is enabled. Since the HadoopDelegationTokenManager seems to need access to the keytab to create new tokens, does that mean it can only run on the client side?

--
Best regards!
Rui Li
Hi Rui,
I agree that we can implement pluggable DT providers first. I have created a new ticket to track it: https://issues.apache.org/jira/browse/FLINK-21232.

Spark's HadoopDelegationTokenManager can run on both the client and the driver (application master) side. On the client side, it is used to obtain tokens when users use a keytab or `kinit` (credential cache); on the driver side, it is used to obtain and renew DTs.

Some background is needed to explain this. Currently, Flink distributes the keytab to the JobManager and TaskManagers, and the Kerberos credentials are renewed from the keytab on each of them. Spark adopts a different solution: it only ships the keytab to the driver, the driver uses this keytab to renew all delegation tokens periodically, and it then distributes the renewed tokens to the executors. In this way, Spark reduces the load on the KDC. You can refer to this doc for details:
https://docs.google.com/document/d/10V7LiNlUJKeKZ58mkR7oVv1t6BrC6TZi3FGf2Dm6-i8/edit

Thanks,
Jie
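[Editor's illustration] As a rough sketch of the driver-side renewal pattern described above: the code below re-logs in from the keytab, obtains fresh tokens via the hypothetical DelegationTokenManager from the earlier sketch, and hands them off for distribution. The class name, the fixed 12-hour interval, and the distributeToWorkers hook are illustrative assumptions; a real implementation would derive the interval from the token lifetime and use the framework's own RPC to ship tokens to workers.

import java.security.PrivilegedExceptionAction;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;

/** Illustrative sketch of periodic token renewal on the driver / JobManager side. */
class TokenRenewer {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void start(String principal, String keytabPath, Configuration hadoopConf) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                // Re-login from the keytab so new tokens can be created here,
                // instead of every worker hitting the KDC with its own keytab copy.
                UserGroupInformation ugi =
                        UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytabPath);
                Credentials credentials = new Credentials();
                ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
                    // Run the SPI providers (see the earlier sketch) to collect fresh tokens.
                    new DelegationTokenManager().obtainTokens(hadoopConf, credentials);
                    return null;
                });
                distributeToWorkers(credentials); // hypothetical hook: ship renewed tokens to workers
            } catch (Exception e) {
                // A real implementation would retry and report the failure.
                e.printStackTrace();
            }
        }, 0, 12, TimeUnit.HOURS); // interval is illustrative; derive it from the token lifetime in practice
    }

    private void distributeToWorkers(Credentials credentials) {
        // Placeholder: the mechanism for shipping renewed tokens to workers is out of scope here.
    }
}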