[DISCUSS] Dashboard/HistoryServer authentication

Márton Balassi
Hi team,

Firstly, I would like to introduce Gabor, or G [1] for short, to the
community. He is a Spark committer who has recently transitioned to the
Flink Engineering team at Cloudera and is looking forward to contributing
to Apache Flink. Previously, G primarily focused on Spark Streaming and
security.

Based on requests from our customers, G has implemented Kerberos and HTTP
Basic Authentication for the Flink Dashboard and HistoryServer, which
previously lacked an authentication story.

We are looking to contribute this functionality back to the community. We
believe that, given Flink's maturity, there should be a common code solution
for this general pattern.

We are looking forward to your feedback on G's design. [2]

[1] http://gaborsomogyi.com/
[2] https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit

Re: [DISCUSS] Dashboard/HistoryServer authentication

Till Rohrmann
Hi Gabor, welcome to the Flink community!

Thanks for sharing this proposal with the community, Márton. In general, I
agree that authentication is missing and that it is required for using
Flink within an enterprise. What I am wondering is whether this feature
strictly needs to be implemented inside Flink, or whether a proxy setup
could do the job. Have you considered this option? If yes, then it would be
good to list it under rejected alternatives.

I do see the benefit of implementing this feature inside of Flink if many
users need it. If not, then it might be easier for the project to not
increase the surface area since it makes the overall maintenance harder.

Cheers,
Till

On Mon, May 31, 2021 at 4:57 PM Márton Balassi <[hidden email]> wrote:

> Hi team,
>
> Firstly I would like to introduce Gabor or G [1] for short to the
> community, he is a Spark committer who has recently transitioned to the
> Flink Engineering team at Cloudera and is looking forward to contributing
> to Apache Flink. Previously G primarily focused on Spark Streaming and
> security.
>
> Based on requests from our customers G has implemented Kerberos and HTTP
> Basic Authentication for the Flink Dashboard and HistoryServer. Previously
> lacked an authentication story.
>
> We are looking to contribute this functionality back to the community, we
> believe that given Flink's maturity there should be a common code solution
> for this general pattern.
>
> We are looking forward to your feedback on G's design. [2]
>
> [1] http://gaborsomogyi.com/
> [2]
>
> https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit
>

Re: [DISCUSS] Dashboard/HistoryServer authentication

Chesnay Schepler
There's a related effort: https://issues.apache.org/jira/browse/FLINK-21108

On 6/1/2021 10:14 AM, Till Rohrmann wrote:

> Hi Gabor, welcome to the Flink community!
>
> Thanks for sharing this proposal with the community Márton. In general, I
> agree that authentication is missing and that this is required for using
> Flink within an enterprise. The thing I am wondering is whether this
> feature strictly needs to be implemented inside of Flink or whether a proxy
> setup could do the job? Have you considered this option? If yes, then it
> would be good to list it under the point of rejected alternatives.
>
> I do see the benefit of implementing this feature inside of Flink if many
> users need it. If not, then it might be easier for the project to not
> increase the surface area since it makes the overall maintenance harder.
>
> Cheers,
> Till

Re: [DISCUSS] Dashboard/HistoryServer authentication

Márton Balassi
Thanks, Chesnay - I totally missed that. I answered on the ticket too; let
us continue there.

Till, I agree that we should keep this codepath as slim as possible. It is
an important design decision that we aim to keep the list of authentication
protocols to a minimum. We believe that this should not be a primary
concern of Flink and that a trusted proxy service (for example Apache Knox)
should be used to enable a multitude of end-user authentication mechanisms.
The bare minimum of authentication mechanisms to support consequently
consists of a single strong authentication protocol, for which Kerberos is
the enterprise solution, plus HTTP Basic, primarily for development and
lightweight scenarios.
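
As a rough illustration of the lightweight end of that spectrum, here is a
minimal Python sketch of how a server could check an HTTP Basic
"Authorization" header. It is not taken from G's design doc or from Flink;
the USERS table and the authenticate_basic function are made-up names, and
a plaintext password table is used only to keep the example short.

```python
import base64
import hmac
from typing import Optional

# Hypothetical in-memory user table (assumption for this sketch only).
# A real deployment would store salted password hashes, not plaintext.
USERS = {"alice": "s3cret"}


def authenticate_basic(header: Optional[str]) -> Optional[str]:
    """Return the user name if the Authorization header is valid, else None."""
    if not header:
        return None
    # Expected shape: "Basic <base64(user:password)>"
    scheme, _, payload = header.partition(" ")
    if scheme.lower() != "basic" or not payload:
        return None
    try:
        decoded = base64.b64decode(payload.strip(), validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 and invalid UTF-8 alike.
        return None
    user, sep, password = decoded.partition(":")
    if not sep:
        return None
    expected = USERS.get(user)
    if expected is None:
        return None
    # Constant-time comparison to avoid leaking how many characters matched.
    if hmac.compare_digest(password.encode("utf-8"), expected.encode("utf-8")):
        return user
    return None
```

Because the credentials are only base64-encoded, not encrypted, a check like
this is only meaningful over TLS, which fits its positioning for development
and lightweight setups.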

Added the above wording to G's doc.
https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit



On Tue, Jun 1, 2021 at 11:47 AM Chesnay Schepler <[hidden email]> wrote:

> There's a related effort:
> https://issues.apache.org/jira/browse/FLINK-21108

Re: [DISCUSS] Dashboard/HistoryServer authentication

Gabor Somogyi
Hi team,

Happy to be here and hope I can provide quality additions in the future.

Thank you all for the helpful suggestions!
Taking them into account, the FLIP has been modified and the work continues
on the already existing Jira.

BR,
G


On Wed, Jun 2, 2021 at 11:23 AM Márton Balassi <[hidden email]>
wrote:

> Thanks, Chesney - I totally missed that. Answered on the ticket too, let
> us continue there then.
>
> Till, I agree that we should keep this codepath as slim as possible. It is
> an important design decision that we aim to keep the list of authentication
> protocols to a minimum. We believe that this should not be a primary
> concern of Flink and a trusted proxy service (for example Apache Knox)
> should be used to enable a multitude of enduser authentication mechanisms.
> The bare minimum of authentication mechanisms to support consequently
> consist of a single strong authentication protocol for which Kerberos is
> the enterprise solution and HTTP Basic primary for development and
> light-weight scenarios.
>
> Added the above wording to G's doc.
>
> https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit
>

Re: [DISCUSS] Dashboard/HistoryServer authentication

Till Rohrmann
Thanks for updating the document, Márton. Why is it that banks will consider
it more secure if Flink comes with Kerberos authentication (assuming a
properly secured setup)? I mean, if an attacker can get access to one of the
machines, then it should also be possible to obtain the right Kerberos
token.

I am not an authentication expert, and that's why I wanted to ask which
authentication protocols exist besides Kerberos. Why did we select
Kerberos and not any other authentication protocol? Maybe you can list the
pros and cons of the different protocols. Is Kerberos also the standard
authentication protocol for Kubernetes deployments? If not, what would be
the answer when deploying on K8s?

Cheers,
Till

On Wed, Jun 2, 2021 at 12:07 PM Gabor Somogyi <[hidden email]>
wrote:

> Hi team,
>
> Happy to be here and hope I can provide quality additions in the future.
>
> Thank you all for helpful the suggestions!
> Considering them the FLIP has been modified and the work continues on the
> already existing Jira.
>
> BR,
> G

Re: [DISCUSS] Dashboard/HistoryServer authentication

Gabor Somogyi
Hi Till,

Since I have been working in the security area for 10+ years, let me share
my thoughts.
I would like to emphasise that there are better experts than me, but I know
the basics.
The discussion is open and I am not trying to dictate anything on my own.

> I mean if an attacker can get access to one of the machines, then it
> should also be possible to obtain the right Kerberos token.
Not necessarily. For example, if one gets access to a specific user's
credentials, then it is not possible to compromise other users' jobs, data,
etc.
Security is like an onion: the more layers have been added, the more time an
attacker needs to proceed.
At the end of the day, if one is in, then one can most probably find a way,
but this time is normally enough for sysadmins or security experts to
close down the system and minimize the damage.

The other thing is that all tokens have a timeout, and if a token is no
longer valid then the attacker can't proceed further.
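
The timeout argument can be sketched in a few lines of Python. This is only
an illustration of the expiry idea, not how Kerberos tickets are actually
encoded or validated; the Token class and its fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Token:
    """Hypothetical time-limited credential (assumption for this sketch)."""
    principal: str
    issued_at: datetime
    lifetime: timedelta

    def is_valid(self, now: datetime) -> bool:
        # Valid only inside the window [issued_at, issued_at + lifetime).
        return self.issued_at <= now < self.issued_at + self.lifetime
```

A stolen token stops being useful once the current time passes
issued_at + lifetime, which bounds how long an attacker can act with it.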

> Is Kerberos also the standard authentication protocol for Kubernetes
> deployments?
Kerberos is an industry standard which is cloud/deployment agnostic, and it
can be used in any deployment, including k8s.
The main intention is to use Kerberos in k8s deployments too, since we're
going in this direction as well.
Please see how Spark does this:
https://spark.apache.org/docs/latest/security.html#secure-interaction-with-kubernetes

Last but not least, the most important reason to add at least one strong
authentication mechanism is that we have users who have
hard requirements on this. They're doing security audits, and if they fail
then it's a deal breaker.
That is why we added Kerberos in the first place. Unfortunately we
can't name them on this public list; however,
the customers who specifically asked for this were mainly in the banking
and telco sectors.
BR,
G


On Thu, Jun 3, 2021 at 9:20 AM Till Rohrmann <[hidden email]> wrote:

> Thanks for updating the document Márton. Why is it that banks will
> consider it more secure if Flink comes with Kerberos authentication
> (assuming a properly secured setup)? I mean if an attacker can get access
> to one of the machines, then it should also be possible to obtain the right
> Kerberos token.
>
> I am not an authentication expert and that's why I wanted to ask what are
> other authentication protocols other than Kerberos? Why did we select
> Kerberos and not any other authentication protocol? Maybe you can list the
> pros and cons for the different protocols. Is Kerberos also the standard
> authentication protocol for Kubernetes deployments? If not, what would be
> the answer when deploying on K8s?
>
> Cheers,
> Till

Re: [DISCUSS] Dashboard/HistoryServer authentication

Till Rohrmann
Thanks for the information, Gabor. If it is about securing the
communication between the REST client and the REST server, then Flink
already supports enabling mutual SSL authentication [1]. Would this be
enough to secure the communication and to pass an audit?

[1] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/security/security-ssl/#external--rest-connectivity
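
For readers unfamiliar with the term, "mutual" here means the server also
demands a certificate from the client. The following Python sketch shows
that idea with the standard ssl module; it is a generic illustration, not
Flink's implementation or configuration, and the function name and the file
paths in the comments are placeholders.

```python
import ssl


def make_mutual_tls_server_context() -> ssl.SSLContext:
    """Build a server-side TLS context that requires client certificates."""
    # Purpose.CLIENT_AUTH creates a context for the server end of a connection.
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # Reject any client that does not present a certificate we can verify.
    ctx.verify_mode = ssl.CERT_REQUIRED
    # In a real setup the server identity and the trusted client CA would be
    # loaded here (placeholder paths, assumptions for this sketch):
    #   ctx.load_cert_chain("server.pem", "server.key")
    #   ctx.load_verify_locations("trusted-clients-ca.pem")
    return ctx
```

Note that this only establishes that the peer holds a trusted certificate;
it says nothing by itself about which user is behind the request.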

Cheers,
Till

On Thu, Jun 3, 2021 at 10:33 AM Gabor Somogyi <[hidden email]>
wrote:

> Hi Till,
>
> Since I'm working in security area 10+ years let me share my thought.
> I would like to emphasise there are experts better than me but I have some
> basics.
> The discussion is open and not trying to tell alone things...
>
> > I mean if an attacker can get access to one of the machines, then it
> should also be possible to obtain the right Kerberos token.
> Not necessarily. For example if one gets access to a specific user's
> credentials then it's not possible to compromise other user's jobs, data,
> etc...
> Security is like an onion, the more layers has been added the more time an
> attacker needs to proceed.
> At the end of the day if one is in, then most probably can find the way but
> this time is normally enough to sysadmins or security experts to
> close down the system and minimize the damage.
>
> The other thing is that all tokens has a timeout and if the token is
> invalid then the attacker can't proceed further.
>
> > Is Kerberos also the standard authentication protocol for Kubernetes
> deployments?
> Kerberos is an industry standard which is cloud/deployment agnostic and it
> can be used in any deployments including k8s.
> The main intention is to use kerberos in k8s deployments too since we're
> going this direction as well.
> Please see how Spark does this:
>
> https://spark.apache.org/docs/latest/security.html#secure-interaction-with-kubernetes
>
> Last but not least the most important reason to add at least one strong
> authentication is that we have users who has
> hard requirements on this. They're doing security audits and if they fail
> then it's deal breaking.
> That is why we have added kerberos at the first place. Unfortunately we
> can't name them in this public list, however
> the customers who specifically asked for this were mainly in the banking
> and telco sector.
>
> BR,
> G

Re: [DISCUSS] Dashboard/HistoryServer authentication

Gabor Somogyi
Thanks for giving options to fulfil the need.

Users are looking for a solution where users can be identified across the
whole cluster and access to resources/actions can be restricted.
A good example of such an action is cancelling other users' running jobs.

* SSL does provide mutual authentication, but once authentication has passed
there is no user identity on which restrictions can be based.
* The less problematic part is that generating/maintaining short-lived
certificates would be hard (that's the reason KDC-like servers exist).
Long-lived certificates would widen the attack surface, but
since the first concern already applies, this is a secondary issue.

All in all, using TLS certificates is unfortunately not sufficient in these
environments.
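
To make the first point concrete, here is a minimal Python sketch of the
kind of per-user check that becomes possible once each request carries an
authenticated user identity. The Job class, the may_cancel function and the
admin set are hypothetical and do not correspond to any Flink API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job record that remembers who submitted it."""
    job_id: str
    owner: str


def may_cancel(user: str, job: Job, admins: frozenset = frozenset()) -> bool:
    """Allow cancellation only for the job's owner or an administrator."""
    return user == job.owner or user in admins
```

With certificate-only authentication there is no "user" value to pass into
such a check, which is the gap described above.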

BR,
G


On Thu, Jun 3, 2021 at 12:49 PM Till Rohrmann <[hidden email]> wrote:

> Thanks for the information Gabor. If it is about securing the
> communication between the REST client and the REST server, then Flink
> already supports enabling mutual SSL authentication [1]. Would this be
> enough to secure the communication and to pass an audit?
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/security/security-ssl/#external--rest-connectivity
>
> Cheers,
> Till
>

Re: [DISCUSS] Dashboard/HistoryServer authentication

Till Rohrmann
I guess the idea would then be to let the proxy do the authentication and
only forward requests to the Flink cluster over a mutually authenticated
SSL connection. Would this be possible? The beauty of this setup is, in my
opinion, that it should work with all kinds of authentication mechanisms.
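
As a rough illustration of the proxy side of such a setup, here is a minimal
Python sketch of the TLS context a proxy could use when forwarding to the
Flink REST endpoint. The function name and the file paths are hypothetical;
on the Flink side this assumes mutual authentication is switched on as
described in the SSL documentation linked earlier in the thread
(security.ssl.rest.enabled and security.ssl.rest.authentication-enabled).

```python
import ssl
from typing import Optional


def build_proxy_client_context(ca_file: Optional[str] = None,
                               cert_file: Optional[str] = None,
                               key_file: Optional[str] = None) -> ssl.SSLContext:
    """TLS context for a (hypothetical) proxy that forwards to Flink's REST endpoint.

    ca_file:   CA that signed Flink's REST certificate; system CAs are used if omitted.
    cert_file: the proxy's own certificate, presented to Flink for mutual auth.
    key_file:  private key belonging to cert_file.
    """
    # SERVER_AUTH means "we are a client verifying a server": this enables
    # certificate verification and hostname checking by default.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    if cert_file:
        # Presenting a client certificate is what makes the handshake mutual.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

The context only authenticates the proxy towards Flink. The identity of the
end user is established by the proxy itself, with whatever mechanism it
supports.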

Cheers,
Till


Re: [DISCUSS] Dashboard/HistoryServer authentication

Márton Balassi
That is an interesting idea, Till.

The main issue with it is that TLS certificates have an expiration time;
they usually get approved for a couple of years. Forcing our users to
restart jobs to reprovision TLS certificates would be weird when we could
instead implement a single proper strong authentication mechanism in a
couple hundred lines of code. :-)

In many cases it is also impractical to go the mutual TLS route, because
the Flink Dashboard can end up on any node in the k8s/YARN cluster, which
means that we need a certificate per node (due to the mutual auth). If we
also want to protect the private keys from being leaked by users,
accidentally or intentionally, then we need one per user as well. That is,
we end up managing (number of users × number of machines) certificates and
having to renew them periodically, which, although automatable, is
unfortunately not yet automated in all large organizations.
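
To put rough numbers on this, here is a small Python sketch of the
arithmetic. The function names and the example figures (50 users, 200
machines, two-year validity) are made up for illustration and do not come
from any real deployment.

```python
def certificates_to_manage(users: int, machines: int) -> int:
    """One certificate per (user, machine) pair, as described above."""
    return users * machines


def renewals_per_year(users: int, machines: int, validity_days: int) -> int:
    """Average number of certificate renewals per year, rounded up."""
    certs = certificates_to_manage(users, machines)
    # Integer ceiling division: each certificate is renewed every validity_days.
    return (certs * 365 + validity_days - 1) // validity_days


# Hypothetical cluster: 50 users, 200 machines, certificates valid for 2 years.
print(certificates_to_manage(50, 200))   # total certificates
print(renewals_per_year(50, 200, 730))   # renewals per year on average
```

Shortening the validity period to limit the damage of a leaked key raises the
renewal rate proportionally.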

I fully agree that mutual authentication with TLS certificates has its nice
properties, especially on very large (multiple thousand node) clusters, but
it has its own challenges too. Thanks for bringing it up.

Happy to have this added to the rejected alternatives list so that we have
the full picture documented.


Re: [DISCUSS] Dashboard/HistoryServer authentication

Till Rohrmann
I am not saying that we shouldn't add a strong authentication mechanism if
there are good reasons for it. I primarily would like to understand the
context a bit better in order to give qualified feedback and come to a good
decision. To be honest, I have the feeling that we haven't yet fully
considered all the options that are on the table.

Does the problem of certificate expiry also apply to self-signed
certificates? If yes, then this should also be a problem for the internal
encryption of Flink's communication. If not, then one could use
self-signed certificates with a longer validity to solve the mentioned
issue.
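
For reference, X.509 certificates carry a notAfter validity field whether or
not they are self-signed, so expiry applies to both; with a self-signed
certificate the validity period is simply chosen by whoever generates it. A
minimal Python sketch for checking the remaining lifetime, using the standard
ssl module (the function name and the dates are illustrative only):

```python
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """Days left before a certificate expires.

    not_after: the certificate's notAfter field in the textual form used by
               Python's ssl module, e.g. "Jun 13 00:00:00 2031 GMT".
    now_epoch: current time in seconds since the epoch (e.g. time.time()).
    """
    return (ssl.cert_time_to_seconds(not_after) - now_epoch) / SECONDS_PER_DAY


# Illustrative: a certificate generated on Jun 13, 2021 with a 10-year validity.
issued = ssl.cert_time_to_seconds("Jun 13 00:00:00 2021 GMT")
print(days_until_expiry("Jun 13 00:00:00 2031 GMT", issued))
```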

I think you can set up Flink in such a way that you don't have to handle
all the different certificates. For example, you could deploy Flink with a
"sidecar proxy" which is responsible for the authentication using an
arbitrary method (e.g. Kerberos) and then bind the REST endpoint to a local
network interface. That way, the REST endpoint would only be available
through the sidecar proxy. Additionally, one could enable SSL for this
communication. Would this be a solution for the problem?
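
A minimal sketch of what such a sidecar could look like, in Python. Several
things here are assumptions made for illustration: HTTP Basic stands in for
the "arbitrary method" (a real deployment would plug in Kerberos/SPNEGO or
similar), the in-memory USERS table is hypothetical, only GET is forwarded,
and the Flink REST endpoint is assumed to be bound to loopback on port 8081
(for example via the rest.bind-address option).

```python
import base64
import hmac
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

FLINK_REST = "http://127.0.0.1:8081"  # assumed: REST endpoint bound to loopback
USERS = {"alice": "s3cret"}           # hypothetical credential store


def is_authorized(header) -> bool:
    """Return True if an 'Authorization: Basic ...' header matches USERS."""
    if not header or not header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(header[len("Basic "):]).decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    user, _, password = decoded.partition(":")
    expected = USERS.get(user)
    # Constant-time comparison to avoid leaking password prefixes via timing.
    return expected is not None and hmac.compare_digest(
        password.encode("utf-8"), expected.encode("utf-8"))


class SidecarHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Reject unauthenticated requests before they ever reach Flink.
        if not is_authorized(self.headers.get("Authorization")):
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="flink"')
            self.end_headers()
            return
        try:
            with urllib.request.urlopen(FLINK_REST + self.path) as upstream:
                status, body = upstream.status, upstream.read()
        except urllib.error.HTTPError as err:  # relay Flink's own error codes
            status, body = err.code, err.read()
        except urllib.error.URLError:          # Flink not reachable
            status, body = 502, b"upstream unavailable"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8080) -> HTTPServer:
    """Create the sidecar listener; call serve_forever() on the result to run it."""
    return HTTPServer(("", port), SidecarHandler)
```

Because the REST endpoint listens only on loopback, the sidecar is the only
way in, and swapping the authentication method means changing is_authorized
rather than Flink itself.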

Cheers,
Till

>>>>> >>>> On 6/1/2021 10:14 AM, Till Rohrmann wrote:
>>>>> >>>> > Hi Gabor, welcome to the Flink community!
>>>>> >>>> >
>>>>> >>>> > Thanks for sharing this proposal with the community Márton. In
>>>>> >>>> general, I
>>>>> >>>> > agree that authentication is missing and that this is required
>>>>> for
>>>>> >>>> using
>>>>> >>>> > Flink within an enterprise. The thing I am wondering is whether
>>>>> this
>>>>> >>>> > feature strictly needs to be implemented inside of Flink or
>>>>> whether a
>>>>> >>>> proxy
>>>>> >>>> > setup could do the job? Have you considered this option? If
>>>>> yes, then
>>>>> >>>> it
>>>>> >>>> > would be good to list it under the point of rejected
>>>>> alternatives.
>>>>> >>>> >
>>>>> >>>> > I do see the benefit of implementing this feature inside of
>>>>> Flink if
>>>>> >>>> many
>>>>> >>>> > users need it. If not, then it might be easier for the project
>>>>> to not
>>>>> >>>> > increase the surface area since it makes the overall maintenance
>>>>> >>>> harder.
>>>>> >>>> >
>>>>> >>>> > Cheers,
>>>>> >>>> > Till
>>>>> >>>> >
>>>>> >>>> > On Mon, May 31, 2021 at 4:57 PM Márton Balassi <
>>>>> [hidden email]>
>>>>> >>>> wrote:
>>>>> >>>> >
>>>>> >>>> >> Hi team,
>>>>> >>>> >>
>>>>> >>>> >> Firstly I would like to introduce Gabor or G [1] for short to
>>>>> the
>>>>> >>>> >> community, he is a Spark committer who has recently
>>>>> transitioned to
>>>>> >>>> the
>>>>> >>>> >> Flink Engineering team at Cloudera and is looking forward to
>>>>> >>>> contributing
>>>>> >>>> >> to Apache Flink. Previously G primarily focused on Spark
>>>>> Streaming
>>>>> >>>> and
>>>>> >>>> >> security.
>>>>> >>>> >>
>>>>> >>>> >> Based on requests from our customers G has implemented
>>>>> Kerberos and
>>>>> >>>> HTTP
>>>>> >>>> >> Basic Authentication for the Flink Dashboard and HistoryServer.
>>>>> >>>> Previously
>>>>> >>>> >> lacked an authentication story.
>>>>> >>>> >>
>>>>> >>>> >> We are looking to contribute this functionality back to the
>>>>> >>>> community, we
>>>>> >>>> >> believe that given Flink's maturity there should be a common
>>>>> code
>>>>> >>>> solution
>>>>> >>>> >> for this general pattern.
>>>>> >>>> >>
>>>>> >>>> >> We are looking forward to your feedback on G's design. [2]
>>>>> >>>> >>
>>>>> >>>> >> [1] http://gaborsomogyi.com/
>>>>> >>>> >> [2]
>>>>> >>>> >>
>>>>> >>>> >>
>>>>> >>>>
>>>>> https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit
>>>>> >>>> >>
>>>>> >>>>
>>>>> >>>>
>>>>>
>>>>

Re: [DISCUSS] Dashboard/HistoryServer authentication

Gabor Somogyi
Till, thanks for taking the time to suggest further options.
Marci, thanks for summarizing the use-case point of view.

We've arrived back at one of the original problems: if an attacker
gets access to a node, they can cancel other users' jobs (and do more
besides).
A self-signed certificate is almost a no-op as authentication in production
environments, because any user can sign their own certificate and no third
party is involved.
This problem just can't be solved with SSL, no matter from which point of
view we consider it.
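
As a minimal sketch of that concern (illustrative Python, not Flink code; `Job`, `may_cancel` and the principal names are hypothetical), a per-user restriction such as "only the owner may cancel a job" needs an authenticated user identity to compare against. A certificate that a user signed for themselves, or one shared by a whole node, gives the server no such identity.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Job:
    job_id: str
    owner: str  # principal that submitted the job (hypothetical field)


def may_cancel(principal: Optional[str], job: Job,
               admins: FrozenSet[str] = frozenset()) -> bool:
    """Allow cancellation only for the job owner or an administrator."""
    if principal is None:  # the request carried no verified user identity
        return False
    return principal == job.owner or principal in admins


job = Job(job_id="job-1", owner="alice")
print(may_cancel("alice", job))  # True: the owner
print(may_cancel("bob", job))    # False: another authenticated user
print(may_cancel(None, job))     # False: no per-user identity available
```

If every request arrives with the same shared certificate, they all map to one identity, so a check like this can only allow everyone or no one.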

BR,
G


On Fri, Jun 4, 2021 at 10:03 AM Till Rohrmann <[hidden email]> wrote:

> I am not saying that we shouldn't add a strong authentication mechanism if
> there are good reasons for it. I primarily would like to understand the
> context a bit better in order to give qualified feedback and come to a good
> decision. In order to do this, I have the feeling that we haven't fully
> considered all available options which are on the table, tbh.
>
> Does the problem of certificate expiry also apply for self-signed
> certificates? If yes, then this should then also be a problem for the
> internal encryption of Flink's communication. If not, then one could use
> self-signed certificates with a longer validity to solve the mentioned
> issue.
>
> I think you can set up Flink in such a way that you don't have to handle
> all the different certificates. For example, you could deploy Flink with a
> "sidecar proxy" which is responsible for the authentication using an
> arbitrary method (e.g. Kerberos) and then bind the REST endpoint to a local
> network interface. That way, the REST endpoint would only be available
> through the sidecar proxy. Additionally, one could enable SSL for this
> communication. Would this be a solution for the problem?
>
> Cheers,
> Till

Re: [DISCUSS] Dashboard/HistoryServer authentication

Gyula Fóra-2
In reply to this post by Till Rohrmann
Hi!

I think there might be alternatives, but Kerberos on the REST endpoint
seems to tick all the right boxes and provides a super clean and
simple solution for strong authentication.

I wouldn’t even consider sidecar proxies etc. if we can solve it in such a
simple way as proposed by G.

Cheers
Gyula

On Fri, 4 Jun 2021 at 10:03, Till Rohrmann <[hidden email]> wrote:

> I am not saying that we shouldn't add a strong authentication mechanism if
> there are good reasons for it. I primarily would like to understand the
> context a bit better in order to give qualified feedback and come to a good
> decision. In order to do this, I have the feeling that we haven't fully
> considered all available options which are on the table, tbh.
>
> Does the problem of certificate expiry also apply for self-signed
> certificates? If yes, then this should then also be a problem for the
> internal encryption of Flink's communication. If not, then one could use
> self-signed certificates with a longer validity to solve the mentioned
> issue.
>
> I think you can set up Flink in such a way that you don't have to handle
> all the different certificates. For example, you could deploy Flink with a
> "sidecar proxy" which is responsible for the authentication using an
> arbitrary method (e.g. Kerberos) and then bind the REST endpoint to a local
> network interface. That way, the REST endpoint would only be available
> through the sidecar proxy. Additionally, one could enable SSL for this
> communication. Would this be a solution for the problem?
>
> Cheers,
> Till

Re: [DISCUSS] Dashboard/HistoryServer authentication

Till Rohrmann
I did not mean for users to sign their own certificates, but for the
operator of the cluster to sign them. Once the user's request hits the
proxy, it should no longer be under the user's control. I do not yet fully
understand why this would not work.
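
To make the proxy idea concrete, here is a minimal sketch in Python (not Flink code). It assumes the Flink REST endpoint is bound to the loopback interface on Flink's default REST port 8081, and it uses HTTP Basic only as a stand-in for whatever authentication method the proxy would really use. `UPSTREAM`, `authenticate_basic` and `route` are hypothetical names.

```python
import base64
import hmac
from typing import Mapping, Optional, Tuple

# Assumption: the REST endpoint only listens on loopback, so it is
# reachable solely through the proxy running on the same host.
UPSTREAM = "http://127.0.0.1:8081"


def authenticate_basic(header: Optional[str],
                       users: Mapping[str, str]) -> Optional[str]:
    """Return the user name if the Authorization header is valid, else None."""
    if not header or not header.startswith("Basic "):
        return None
    try:
        decoded = base64.b64decode(header[len("Basic "):],
                                   validate=True).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return None  # malformed base64 or not UTF-8
    user, sep, password = decoded.partition(":")
    if not sep:
        return None  # no "user:password" separator
    expected = users.get(user)
    if expected is None:
        return None
    # Constant-time comparison to avoid leaking the password via timing.
    if hmac.compare_digest(expected.encode(), password.encode()):
        return user
    return None


def route(header: Optional[str], path: str,
          users: Mapping[str, str]) -> Tuple[int, Optional[str]]:
    """Decide what the proxy does: 401, or forward to the local endpoint."""
    if authenticate_basic(header, users) is None:
        return 401, None
    return 200, UPSTREAM + path


users = {"alice": "s3cret"}  # hypothetical credential store
good = "Basic " + base64.b64encode(b"alice:s3cret").decode()
print(route(good, "/jobs", users))  # (200, 'http://127.0.0.1:8081/jobs')
print(route(None, "/jobs", users))  # (401, None)
```

In this setup the authentication decision lives entirely in the proxy, which is what lets it support an arbitrary mechanism without changes inside Flink.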

What I would like to avoid is adding more complexity to Flink if there is
an easy solution that fulfills the requirements. That's why I would like
to go through the different alternatives thoroughly. Also, I am still
missing a comparison of Kerberos with other authentication mechanisms
and why they were rejected in favour of Kerberos.

Cheers,
Till


Re: [DISCUSS] Dashboard/HistoryServer authentication

Gabor Somogyi
> I did not mean for the user to sign its own certificates but for the
> operator of the cluster. Once the user request hits the proxy, it should no
> longer be under his control. I think I do not fully understand yet why this
> would not work.
I said it does not solve the authentication problem, whichever proxy is
used. Even if the operator signs the certificate, someone can still gain
access to an internal node.
In such a case anybody can craft certificates which are accepted by the
server. Once they are accepted, a bad guy can cancel jobs, causing huge impact.
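
To illustrate why identifying the user matters here, below is a minimal
sketch of a per-user check for the cancel action. It is not Flink code: Job,
may_cancel and the admins set are hypothetical names made up for this mail.
The assumption is that the authentication layer (e.g. Kerberos) hands over a
principal name. With one shared certificate there is no such name, so a rule
like this cannot be evaluated at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: str
    owner: str  # principal that submitted the job (hypothetical field)


def may_cancel(principal, job, admins=frozenset()):
    """Return True only if the authenticated principal may cancel the job."""
    # No authenticated identity (e.g. only a shared certificate): reject.
    if principal is None:
        return False
    # Owners may cancel their own jobs, admins may cancel any job.
    return principal == job.owner or principal in admins
```

With may_cancel("alice", Job("job-1", "bob")) the request is rejected, while
the owner "bob" or a principal listed in admins is allowed.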

> Also, I am missing a bit the comparison of Kerberos to other
> authentication mechanisms and why they were rejected in favour of Kerberos.
PROS:
* It does not depend on the cloud provider or on the deployment type (k8s,
bare-metal, etc.), which is the biggest plus
* It is centralized and comes with tooling, so no need to write tons of
tools around it
* There are clients/tools for almost all OSes and for several languages
* Very large users have been running it in production for years w/o huge
issues
* It provides cross-realm trust, amongst other features
* Several open source components use it, which could increase compatibility

CONS:
* Not everybody uses Kerberos
* It would increase the code footprint, but this is true for many features
(as a side note I'm here to maintain it)

Feel free to add your points because this only represents a single viewpoint.
Also, if you have a better option for strong authentication please share
it and we can consider the pros/cons here.

BR,
G



Re: [DISCUSS] Dashboard/HistoryServer authentication

Till Rohrmann
As I've said I am not a security expert and that's why I have to ask for
clarification, Gabor. You are saying that if we configure a truststore for
the REST endpoint with a single trusted certificate which has been
generated by the operator of the Flink cluster, then the attacker can
generate a new certificate, sign it and then talk to the Flink cluster if
he has access to the node on which the REST endpoint runs? My understanding
was that you need the corresponding private key which in my proposed setup
would be under the control of the operator as well (e.g. stored in a
keystore on the same machine but guarded by some secret). That way (if I am
not mistaken), only the entity which has access to the keystore is able to
talk to the Flink cluster.
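
A minimal sketch of the kind of setup described above, using Python's
standard ssl module rather than Flink's own SSL code. The function name and
the file arguments are made up for illustration. The server context trusts
only the certificates in the given truststore and refuses clients which do
not present one. During the TLS handshake the client also has to prove that
it holds the private key belonging to the certificate it presents, so a copy
of the trusted certificate alone is not enough. The sketch does not cover an
attacker who can read the keystore and its secret.

```python
import ssl


def build_rest_server_context(truststore_pem=None, cert_pem=None, key_pem=None):
    """Server-side TLS context which requires a trusted client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Abort the handshake unless the client presents a certificate that
    # validates against the configured trust anchors.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if truststore_pem is not None:
        # Trust anchors, e.g. the single certificate generated by the operator.
        ctx.load_verify_locations(cafile=truststore_pem)
    if cert_pem is not None:
        # The server's own certificate and private key.
        ctx.load_cert_chain(certfile=cert_pem, keyfile=key_pem)
    return ctx
```

Calling build_rest_server_context("truststore.pem", "server.pem", "server.key")
with real PEM files would give a context for a mutually authenticated
endpoint. Without arguments it only shows the verification mode.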

Maybe we are also getting our wires crossed here and are talking about
different things.

Thanks for listing the pros and cons of Kerberos. Concerning what other
authentication mechanisms are used in the industry, I am not 100% sure.

Cheers,
Till

On Fri, Jun 4, 2021 at 11:09 AM Gabor Somogyi <[hidden email]>
wrote:

> > I did not mean for the user to sign its own certificates but for the
> operator of the cluster. Once the user request hits the proxy, it should no
> longer be under his control. I think I do not fully understand yet why this
> would not work.
> I said it's not solving the authentication problem over any proxy. Even if
> the operator is signing the certificate one can have access to an internal
> node.
> Such case anybody can craft certificates which is accepted by the server.
> When it's accepted a bad guy can cancel jobs causing huge impacts.
>
> > Also, I am missing a bit the comparison of Kerberos to other
> authentication mechanisms and why they were rejected in favour of Kerberos.
> PROS:
> * Since it's not depending on cloud provider and/or k8s or bare-metal etc.
> deployment it's the biggest plus
> * Centralized with tools and no need to write tons of tools around
> * There are clients/tools on almost all OS-es and several languages
> * Super huge users are using it for years in production w/o huge issues
> * Provides cross-realm trust possibility amongst other features
> * Several open source components using it which could increase
> compatibility
>
> CONS:
> * Not everybody using kerberos
> * It would increase the code footprint but this is true for many features
> (as a side note I'm here to maintain it)
>
> Feel free to add your points because it only represents a single viewpoint.
> Also if you have any better option for strong authentication please share
> it and we can consider the pros/cons here.
>
> BR,
> G
>
>
> On Fri, Jun 4, 2021 at 10:32 AM Till Rohrmann <[hidden email]>
> wrote:
>
>> I did not mean for the user to sign its own certificates but for the
>> operator of the cluster. Once the user request hits the proxy, it should no
>> longer be under his control. I think I do not fully understand yet why this
>> would not work.
>>
>> What I would like to avoid is to add more complexity into Flink if there
>> is an easy solution which fulfills the requirements. That's why I would
>> like to exercise thoroughly through the different alternatives. Also, I am
>> missing a bit the comparison of Kerberos to other authentication mechanisms
>> and why they were rejected in favour of Kerberos.
>>
>> Cheers,
>> Till
>>

Re: [DISCUSS] Dashboard/HistoryServer authentication

Gabor Somogyi
Hi Till,

Your proxy suggestion has been considered in depth and the FLIP has been
updated accordingly.
We evaluated two proxy implementations (Nginx and Squid), but according to
our analysis and testing they are not suitable for the mentioned use cases.
Please take a look at the rejected alternatives for a detailed explanation.

Thanks for your time in advance!

BR,
G


On Fri, Jun 4, 2021 at 3:31 PM Till Rohrmann <[hidden email]> wrote:

> As I've said I am not a security expert and that's why I have to ask for
> clarification, Gabor. You are saying that if we configure a truststore for
> the REST endpoint with a single trusted certificate which has been
> generated by the operator of the Flink cluster, then the attacker can
> generate a new certificate, sign it and then talk to the Flink cluster if
> he has access to the node on which the REST endpoint runs? My understanding
> was that you need the corresponding private key which in my proposed setup
> would be under the control of the operator as well (e.g. stored in a
> keystore on the same machine but guarded by some secret). That way (if I am
> not mistaken), only the entity which has access to the keystore is able to
> talk to the Flink cluster.
>
> Maybe we are also getting our wires crossed here and are talking about
> different things.
>
> Thanks for listing the pros and cons of Kerberos. Concerning what other
> authentication mechanisms are used in the industry, I am not 100% sure.
>
> Cheers,
> Till
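
For reference, the mutual-TLS setup Till describes in the quoted message (a
REST endpoint that only accepts clients whose certificate is accepted by its
truststore) could be sketched with Flink's REST SSL options roughly as
follows. The option names are taken from the Flink SSL documentation linked
earlier in this thread; the file paths and password placeholders are
hypothetical and only illustrate the shape of the configuration.

```yaml
# Sketch only: paths and passwords below are placeholders, not real values.

# Enable TLS on the external REST endpoint (Dashboard / REST API).
security.ssl.rest.enabled: true

# Keystore holding the REST server's own certificate and private key.
security.ssl.rest.keystore: /path/to/rest.keystore
security.ssl.rest.keystore-password: <keystore-password>
security.ssl.rest.key-password: <key-password>

# Truststore listing the certificate(s) the server is willing to trust.
security.ssl.rest.truststore: /path/to/rest.truststore
security.ssl.rest.truststore-password: <truststore-password>

# Require clients to present a certificate (mutual authentication).
security.ssl.rest.authentication-enabled: true
```

With mutual authentication enabled, a client has to hold the private key of a
certificate that the truststore accepts. This identifies the holder of that
key, not an individual end user, which is the limitation raised earlier in the
thread.
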

Re: [DISCUSS] Dashboard/HistoryServer authentication

Till Rohrmann
Thanks for the update Gabor. I'll take a look and respond in the document.

Cheers,
Till

On Wed, Jun 9, 2021 at 12:59 PM Gabor Somogyi <[hidden email]>
wrote:

> Hi Till,
>
> Your proxy suggestion has been considered in depth and the FLIP has been
> updated accordingly.
> We evaluated two proxy implementations (Nginx and Squid), but according to
> our analysis and testing they are not suitable for the mentioned use cases.
> Please take a look at the rejected alternatives for a detailed explanation.
>
> Thanks for your time in advance!
>
> BR,
> G
>
>
> On Fri, Jun 4, 2021 at 3:31 PM Till Rohrmann <[hidden email]> wrote:
>
>> As I've said I am not a security expert and that's why I have to ask for
>> clarification, Gabor. You are saying that if we configure a truststore for
>> the REST endpoint with a single trusted certificate which has been
>> generated by the operator of the Flink cluster, then the attacker can
>> generate a new certificate, sign it and then talk to the Flink cluster if
>> he has access to the node on which the REST endpoint runs? My understanding
>> was that you need the corresponding private key which in my proposed setup
>> would be under the control of the operator as well (e.g. stored in a
>> keystore on the same machine but guarded by some secret). That way (if I am
>> not mistaken), only the entity which has access to the keystore is able to
>> talk to the Flink cluster.
>>
>> Maybe we are also getting our wires crossed here and are talking about
>> different things.
>>
>> Thanks for listing the pros and cons of Kerberos. Concerning what other
>> authentication mechanisms are used in the industry, I am not 100% sure.
>>
>> Cheers,
>> Till
>>
>> On Fri, Jun 4, 2021 at 11:09 AM Gabor Somogyi <[hidden email]>
>> wrote:
>>
>>> > I did not mean for the user to sign its own certificates but for the
>>> operator of the cluster. Once the user request hits the proxy, it should no
>>> longer be under his control. I think I do not fully understand yet why this
>>> would not work.
>>> I said it does not solve the authentication problem over any proxy. Even
>>> if the operator signs the certificate, someone can gain access to an
>>> internal node.
>>> In that case anybody can craft certificates which are accepted by the
>>> server. Once they are accepted, an attacker can cancel jobs, causing huge impact.
>>>
>>> > Also, I am missing a bit the comparison of Kerberos to other
>>> authentication mechanisms and why they were rejected in favour of Kerberos.
>>> PROS:
>>> * It does not depend on the cloud provider or on the deployment type
>>> (k8s, bare-metal, etc.), which is the biggest plus
>>> * It is centralized and comes with tooling, so there is no need to write tons of tools around it
>>> * There are clients/tools for almost all OSes and for several languages
>>> * Very large users have been running it in production for years w/o huge issues
>>> * Provides cross-realm trust amongst other features
>>> * Several open source components use it, which could increase
>>> compatibility
>>>
>>> CONS:
>>> * Not everybody uses Kerberos
>>> * It would increase the code footprint, but this is true for many
>>> features (as a side note I'm here to maintain it)
>>>
>>> Feel free to add your points because it only represents a single
>>> viewpoint.
>>> Also if you have any better option for strong authentication please
>>> share it and we can consider the pros/cons here.
>>>
>>> BR,
>>> G
>>>
>>>
>>> On Fri, Jun 4, 2021 at 10:32 AM Till Rohrmann <[hidden email]>
>>> wrote:
>>>
>>>> I did not mean for the user to sign its own certificates but for the
>>>> operator of the cluster. Once the user request hits the proxy, it should no
>>>> longer be under his control. I think I do not fully understand yet why this
>>>> would not work.
>>>>
>>>> What I would like to avoid is to add more complexity into Flink if
>>>> there is an easy solution which fulfills the requirements. That's why I
>>>> would like to exercise thoroughly through the different alternatives. Also,
>>>> I am missing a bit the comparison of Kerberos to other authentication
>>>> mechanisms and why they were rejected in favour of Kerberos.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Fri, Jun 4, 2021 at 10:26 AM Gyula Fóra <[hidden email]> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> I think there might be possible alternatives but it seems Kerberos on
>>>>> the rest endpoint ticks all the right boxes and provides a super clean and
>>>>> simple solution for strong authentication.
>>>>>
>>>>> I wouldn’t even consider sidecar proxies etc if we can solve it in
>>>>> such a simple way as proposed by G.
>>>>>
>>>>> Cheers
>>>>> Gyula
>>>>>
>>>>> On Fri, 4 Jun 2021 at 10:03, Till Rohrmann <[hidden email]>
>>>>> wrote:
>>>>>
>>>>>> I am not saying that we shouldn't add a strong authentication
>>>>>> mechanism if there are good reasons for it. I primarily would like to
>>>>>> understand the context a bit better in order to give qualified feedback and
>>>>>> come to a good decision. In order to do this, I have the feeling that we
>>>>>> haven't fully considered all available options which are on the table, tbh.
>>>>>>
>>>>>> Does the problem of certificate expiry also apply for self-signed
>>>>>> certificates? If yes, then this should then also be a problem for the
>>>>>> internal encryption of Flink's communication. If not, then one could use
>>>>>> self-signed certificates with a longer validity to solve the mentioned
>>>>>> issue.
>>>>>>
>>>>>> I think you can set up Flink in such a way that you don't have to
>>>>>> handle all the different certificates. For example, you could deploy Flink
>>>>>> with a "sidecar proxy" which is responsible for the authentication using an
>>>>>> arbitrary method (e.g. Kerberos) and then bind the REST endpoint to a local
>>>>>> network interface. That way, the REST endpoint would only be available
>>>>>> through the sidecar proxy. Additionally, one could enable SSL for this
>>>>>> communication. Would this be a solution for the problem?
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Thu, Jun 3, 2021 at 10:46 PM Márton Balassi <
>>>>>> [hidden email]> wrote:
>>>>>>
>>>>>>> That is an interesting idea, Till.
>>>>>>>
>>>>>>> The main issue with it is that TLS certificates have an expiration
>>>>>>> time, usually they get approved for a couple years. Forcing our users to
>>>>>>> restart jobs to reprovision TLS certificates would be weird when we could
>>>>>>> just implement a single proper strong authentication mechanism instead in a
>>>>>>> couple hundred lines of code. :-)
>>>>>>>
>>>>>>> In many cases it is also impractical to go the TLS mutual route,
>>>>>>> because the Flink Dashboard can end up on any node in the k8s/Yarn cluster
>>>>>>> which means that we need a certificate per node (due to the mutual auth),
>>>>>>> but if we also want to protect the private key of these from users
>>>>>>> accidentally or intentionally leaking them then we need this per user. As
>>>>>>> in we end up managing user*machine number certificates and having to renew
>>>>>>> them periodically, which albeit automatable is unfortunately not yet
>>>>>>> automated in all large organizations.
>>>>>>>
>>>>>>> I fully agree that TLS certificate mutual authentication has its
>>>>>>> nice properties, especially at very large (multiple thousand node) clusters
>>>>>>> - but it has its own challenges too. Thanks for bringing it up.
>>>>>>>
>>>>>>> Happy to have this added to the rejected alternative list so that we
>>>>>>> have the full picture documented.
>>>>>>>
>>>>>>> On Thu, Jun 3, 2021 at 5:52 PM Till Rohrmann <[hidden email]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I guess the idea would then be to let the proxy do the
>>>>>>>> authentication job and only forward the request via an SSL mutually
>>>>>>>> encrypted connection to the Flink cluster. Would this be possible? The
>>>>>>>> beauty of this setup is in my opinion that this setup should work with all
>>>>>>>> kinds of authentication mechanisms.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Till
>>>>>>>>
>>>>>>>> On Thu, Jun 3, 2021 at 3:12 PM Gabor Somogyi <
>>>>>>>> [hidden email]> wrote:
>>>>>>>>
>>>>>>>>> Thanks for giving options to fulfil the need.
>>>>>>>>>
>>>>>>>>> Users are looking for a solution where users can be identified across
>>>>>>>>> the whole cluster and access to resources/actions can be restricted.
>>>>>>>>> A good example of such an action is cancelling other users'
>>>>>>>>> running jobs.
>>>>>>>>>
>>>>>>>>> * SSL does provide mutual authentication, but once authentication has
>>>>>>>>> passed there is no user identity on which restrictions can be based.
>>>>>>>>> * The less problematic part is that generating/maintaining short-lived
>>>>>>>>> certificates would be hard (that's the reason KDC-like servers
>>>>>>>>> exist).
>>>>>>>>> Having long-lived certificates would widen the attack surface,
>>>>>>>>> but since the first concern already applies this is just a cosmetic issue.
>>>>>>>>>
>>>>>>>>> All in all using TLS certificates is not sufficient in these
>>>>>>>>> environments unfortunately.
>>>>>>>>>
>>>>>>>>> BR,
>>>>>>>>> G
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jun 3, 2021 at 12:49 PM Till Rohrmann <
>>>>>>>>> [hidden email]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the information Gabor. If it is about securing the
>>>>>>>>>> communication between the REST client and the REST server, then Flink
>>>>>>>>>> already supports enabling mutual SSL authentication [1]. Would this be
>>>>>>>>>> enough to secure the communication and to pass an audit?
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/security/security-ssl/#external--rest-connectivity
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Till
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 3, 2021 at 10:33 AM Gabor Somogyi <
>>>>>>>>>> [hidden email]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Till,
>>>>>>>>>>>
>>>>>>>>>>> Since I have been working in the security area for 10+ years, let me
>>>>>>>>>>> share my thoughts.
>>>>>>>>>>> I would like to emphasise that there are experts better than me, but I
>>>>>>>>>>> have some
>>>>>>>>>>> basics.
>>>>>>>>>>> The discussion is open and I am not trying to decide things alone...
>>>>>>>>>>>
>>>>>>>>>>> > I mean if an attacker can get access to one of the machines,
>>>>>>>>>>> then it
>>>>>>>>>>> should also be possible to obtain the right Kerberos token.
>>>>>>>>>>> Not necessarily. For example, if one gets access to a specific
>>>>>>>>>>> user's
>>>>>>>>>>> credentials then it's not possible to compromise other users'
>>>>>>>>>>> jobs, data,
>>>>>>>>>>> etc...
>>>>>>>>>>> Security is like an onion: the more layers have been added, the
>>>>>>>>>>> more time an
>>>>>>>>>>> attacker needs to proceed.
>>>>>>>>>>> At the end of the day, if one is in, they can most probably find
>>>>>>>>>>> a way, but
>>>>>>>>>>> this time is normally enough for sysadmins or security experts to
>>>>>>>>>>> close down the system and minimize the damage.
>>>>>>>>>>>
>>>>>>>>>>> The other thing is that all tokens have a timeout, and once the
>>>>>>>>>>> token is
>>>>>>>>>>> invalid the attacker can't proceed further.
>>>>>>>>>>>
>>>>>>>>>>> > Is Kerberos also the standard authentication protocol for
>>>>>>>>>>> Kubernetes
>>>>>>>>>>> deployments?
>>>>>>>>>>> Kerberos is an industry standard which is cloud/deployment
>>>>>>>>>>> agnostic and it
>>>>>>>>>>> can be used in any deployments including k8s.
>>>>>>>>>>> The main intention is to use kerberos in k8s deployments too
>>>>>>>>>>> since we're
>>>>>>>>>>> going this direction as well.
>>>>>>>>>>> Please see how Spark does this:
>>>>>>>>>>>
>>>>>>>>>>> https://spark.apache.org/docs/latest/security.html#secure-interaction-with-kubernetes
>>>>>>>>>>>
>>>>>>>>>>> Last but not least, the most important reason to add at least one
>>>>>>>>>>> strong
>>>>>>>>>>> authentication mechanism is that we have users who have
>>>>>>>>>>> hard requirements on this. They're doing security audits and if
>>>>>>>>>>> they fail
>>>>>>>>>>> then it's a deal breaker.
>>>>>>>>>>> That is why we added Kerberos in the first place.
>>>>>>>>>>> Unfortunately we
>>>>>>>>>>> can't name them in this public list, however
>>>>>>>>>>> the customers who specifically asked for this were mainly in the
>>>>>>>>>>> banking
>>>>>>>>>>> and telco sectors.
>>>>>>>>>>>
>>>>>>>>>>> BR,
>>>>>>>>>>> G
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 3, 2021 at 9:20 AM Till Rohrmann <
>>>>>>>>>>> [hidden email]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> > Thanks for updating the document Márton. Why is it that banks
>>>>>>>>>>> will
>>>>>>>>>>> > consider it more secure if Flink comes with Kerberos
>>>>>>>>>>> authentication
>>>>>>>>>>> > (assuming a properly secured setup)? I mean if an attacker can
>>>>>>>>>>> get access
>>>>>>>>>>> > to one of the machines, then it should also be possible to
>>>>>>>>>>> obtain the right
>>>>>>>>>>> > Kerberos token.
>>>>>>>>>>> >
>>>>>>>>>>> > I am not an authentication expert and that's why I wanted to
>>>>>>>>>>> ask what are
>>>>>>>>>>> > other authentication protocols other than Kerberos? Why did we
>>>>>>>>>>> select
>>>>>>>>>>> > Kerberos and not any other authentication protocol? Maybe you
>>>>>>>>>>> can list the
>>>>>>>>>>> > pros and cons for the different protocols. Is Kerberos also
>>>>>>>>>>> the standard
>>>>>>>>>>> > authentication protocol for Kubernetes deployments? If not,
>>>>>>>>>>> what would be
>>>>>>>>>>> > the answer when deploying on K8s?
>>>>>>>>>>> >
>>>>>>>>>>> > Cheers,
>>>>>>>>>>> > Till
>>>>>>>>>>> >
>>>>>>>>>>> > On Wed, Jun 2, 2021 at 12:07 PM Gabor Somogyi <
>>>>>>>>>>> [hidden email]>
>>>>>>>>>>> > wrote:
>>>>>>>>>>> >
>>>>>>>>>>> >> Hi team,
>>>>>>>>>>> >>
>>>>>>>>>>> >> Happy to be here and hope I can provide quality additions in
>>>>>>>>>>> the future.
>>>>>>>>>>> >>
>>>>>>>>>>> >> Thank you all for the helpful suggestions!
>>>>>>>>>>> >> Considering them the FLIP has been modified and the work
>>>>>>>>>>> continues on the
>>>>>>>>>>> >> already existing Jira.
>>>>>>>>>>> >>
>>>>>>>>>>> >> BR,
>>>>>>>>>>> >> G
>>>>>>>>>>> >>
>>>>>>>>>>> >>
>>>>>>>>>>> >> On Wed, Jun 2, 2021 at 11:23 AM Márton Balassi <
>>>>>>>>>>> [hidden email]>
>>>>>>>>>>> >> wrote:
>>>>>>>>>>> >>
>>>>>>>>>>> >>> Thanks, Chesnay - I totally missed that. Answered on the
>>>>>>>>>>> ticket too, let
>>>>>>>>>>> >>> us continue there then.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Till, I agree that we should keep this codepath as slim as
>>>>>>>>>>> possible. It
>>>>>>>>>>> >>> is an important design decision that we aim to keep the list
>>>>>>>>>>> of
>>>>>>>>>>> >>> authentication protocols to a minimum. We believe that this
>>>>>>>>>>> should not be a
>>>>>>>>>>> >>> primary concern of Flink and a trusted proxy service (for
>>>>>>>>>>> example Apache
>>>>>>>>>>> >>> Knox) should be used to enable a multitude of enduser
>>>>>>>>>>> authentication
>>>>>>>>>>> >>> mechanisms. The bare minimum of authentication mechanisms to
>>>>>>>>>>> support
>>>>>>>>>>> >>> consequently consists of a single strong authentication
>>>>>>>>>>> protocol for which
>>>>>>>>>>> >>> Kerberos is the enterprise solution and HTTP Basic primarily
>>>>>>>>>>> for development
>>>>>>>>>>> >>> and light-weight scenarios.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Added the above wording to G's doc.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>>
>>>>>>>>>>> https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit
>>>>>>>>>>> >>>
>>>>>>>>>>> >>>
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> On Tue, Jun 1, 2021 at 11:47 AM Chesnay Schepler <
>>>>>>>>>>> [hidden email]>
>>>>>>>>>>> >>> wrote:
>>>>>>>>>>> >>>
>>>>>>>>>>> >>>> There's a related effort:
>>>>>>>>>>> >>>> https://issues.apache.org/jira/browse/FLINK-21108
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>> On 6/1/2021 10:14 AM, Till Rohrmann wrote:
>>>>>>>>>>> >>>> > Hi Gabor, welcome to the Flink community!
>>>>>>>>>>> >>>> >
>>>>>>>>>>> >>>> > Thanks for sharing this proposal with the community
>>>>>>>>>>> Márton. In
>>>>>>>>>>> >>>> general, I
>>>>>>>>>>> >>>> > agree that authentication is missing and that this is
>>>>>>>>>>> required for
>>>>>>>>>>> >>>> using
>>>>>>>>>>> >>>> > Flink within an enterprise. The thing I am wondering is
>>>>>>>>>>> whether this
>>>>>>>>>>> >>>> > feature strictly needs to be implemented inside of Flink
>>>>>>>>>>> or whether a
>>>>>>>>>>> >>>> proxy
>>>>>>>>>>> >>>> > setup could do the job? Have you considered this option?
>>>>>>>>>>> If yes, then
>>>>>>>>>>> >>>> it
>>>>>>>>>>> >>>> > would be good to list it under the point of rejected
>>>>>>>>>>> alternatives.
>>>>>>>>>>> >>>> >
>>>>>>>>>>> >>>> > I do see the benefit of implementing this feature inside
>>>>>>>>>>> of Flink if
>>>>>>>>>>> >>>> many
>>>>>>>>>>> >>>> > users need it. If not, then it might be easier for the
>>>>>>>>>>> project to not
>>>>>>>>>>> >>>> > increase the surface area since it makes the overall
>>>>>>>>>>> maintenance
>>>>>>>>>>> >>>> harder.
>>>>>>>>>>> >>>> >
>>>>>>>>>>> >>>> > Cheers,
>>>>>>>>>>> >>>> > Till
>>>>>>>>>>> >>>> >
>>>>>>>>>>> >>>> > On Mon, May 31, 2021 at 4:57 PM Márton Balassi <
>>>>>>>>>>> [hidden email]>
>>>>>>>>>>> >>>> wrote:
>>>>>>>>>>> >>>> >
>>>>>>>>>>> >>>> >> Hi team,
>>>>>>>>>>> >>>> >>
>>>>>>>>>>> >>>> >> Firstly I would like to introduce Gabor or G [1] for
>>>>>>>>>>> short to the
>>>>>>>>>>> >>>> >> community, he is a Spark committer who has recently
>>>>>>>>>>> transitioned to
>>>>>>>>>>> >>>> the
>>>>>>>>>>> >>>> >> Flink Engineering team at Cloudera and is looking
>>>>>>>>>>> forward to
>>>>>>>>>>> >>>> contributing
>>>>>>>>>>> >>>> >> to Apache Flink. Previously G primarily focused on Spark
>>>>>>>>>>> Streaming
>>>>>>>>>>> >>>> and
>>>>>>>>>>> >>>> >> security.
>>>>>>>>>>> >>>> >>
>>>>>>>>>>> >>>> >> Based on requests from our customers G has implemented
>>>>>>>>>>> Kerberos and
>>>>>>>>>>> >>>> HTTP
>>>>>>>>>>> >>>> >> Basic Authentication for the Flink Dashboard and
>>>>>>>>>>> HistoryServer.
>>>>>>>>>>> >>>> Previously
>>>>>>>>>>> >>>> >> lacked an authentication story.
>>>>>>>>>>> >>>> >>
>>>>>>>>>>> >>>> >> We are looking to contribute this functionality back to
>>>>>>>>>>> the
>>>>>>>>>>> >>>> community, we
>>>>>>>>>>> >>>> >> believe that given Flink's maturity there should be a
>>>>>>>>>>> common code
>>>>>>>>>>> >>>> solution
>>>>>>>>>>> >>>> >> for this general pattern.
>>>>>>>>>>> >>>> >>
>>>>>>>>>>> >>>> >> We are looking forward to your feedback on G's design.
>>>>>>>>>>> [2]
>>>>>>>>>>> >>>> >>
>>>>>>>>>>> >>>> >> [1] http://gaborsomogyi.com/
>>>>>>>>>>> >>>> >> [2]
>>>>>>>>>>> >>>> >>
>>>>>>>>>>> >>>> >>
>>>>>>>>>>> >>>>
>>>>>>>>>>> https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit
>>>>>>>>>>> >>>> >>
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>>
>>>>>>>>>>>
>>>>>>>>>>

Re: [DISCUSS] Dashboard/HistoryServer authentication

Gabor Somogyi
Hi Till,

Have you had a chance to take a look at the doc? I haven't seen any update yet.

BR,
G


On Wed, Jun 9, 2021 at 1:43 PM Till Rohrmann <[hidden email]> wrote:

> Thanks for the update Gabor. I'll take a look and respond in the document.
>
> Cheers,
> Till
>
> On Wed, Jun 9, 2021 at 12:59 PM Gabor Somogyi <[hidden email]>
> wrote:
>
>> Hi Till,
>>
>> Your proxy suggestion has been considered in depth and the FLIP has been
>> updated accordingly.
>> We've considered two proxy implementations (Nginx and Squid), but according
>> to our analysis and testing they are not suitable for the mentioned use cases.
>> Please take a look at the rejected alternatives for a detailed explanation.
>>
>> Thanks for your time in advance!
>>
>> BR,
>> G
>>
>>
>> On Fri, Jun 4, 2021 at 3:31 PM Till Rohrmann <[hidden email]>
>> wrote:
>>
>>> As I've said I am not a security expert and that's why I have to ask for
>>> clarification, Gabor. You are saying that if we configure a truststore for
>>> the REST endpoint with a single trusted certificate which has been
>>> generated by the operator of the Flink cluster, then the attacker can
>>> generate a new certificate, sign it and then talk to the Flink cluster if
>>> he has access to the node on which the REST endpoint runs? My understanding
>>> was that you need the corresponding private key which in my proposed setup
>>> would be under the control of the operator as well (e.g. stored in a
>>> keystore on the same machine but guarded by some secret). That way (if I am
>>> not mistaken), only the entity which has access to the keystore is able to
>>> talk to the Flink cluster.
>>>
>>> Maybe we are also getting our wires crossed here and are talking about
>>> different things.
>>>
>>> Thanks for listing the pros and cons of Kerberos. Concerning what other
>>> authentication mechanisms are used in the industry, I am not 100% sure.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Fri, Jun 4, 2021 at 11:09 AM Gabor Somogyi <[hidden email]>
>>> wrote:
>>>
>>>> > I did not mean for the user to sign its own certificates but for the
>>>> operator of the cluster. Once the user request hits the proxy, it should no
>>>> longer be under his control. I think I do not fully understand yet why this
>>>> would not work.
>>>> I said it does not solve the authentication problem over any proxy. Even
>>>> if the operator signs the certificate, someone can gain access to an
>>>> internal node.
>>>> In that case anybody can craft certificates which are accepted by the
>>>> server. Once they are accepted, an attacker can cancel jobs, causing huge impact.
>>>>
>>>> > Also, I am missing a bit the comparison of Kerberos to other
>>>> authentication mechanisms and why they were rejected in favour of Kerberos.
>>>> PROS:
>>>> * It does not depend on the cloud provider or on the deployment type
>>>> (k8s, bare-metal, etc.), which is the biggest plus
>>>> * It is centralized and comes with tooling, so there is no need to write tons of tools around it
>>>> * There are clients/tools for almost all OSes and for several languages
>>>> * Very large users have been running it in production for years w/o huge issues
>>>> * Provides cross-realm trust amongst other features
>>>> * Several open source components use it, which could increase
>>>> compatibility
>>>>
>>>> CONS:
>>>> * Not everybody uses Kerberos
>>>> * It would increase the code footprint, but this is true for many
>>>> features (as a side note I'm here to maintain it)
>>>>
>>>> Feel free to add your points because it only represents a single
>>>> viewpoint.
>>>> Also if you have any better option for strong authentication please
>>>> share it and we can consider the pros/cons here.
>>>>
>>>> BR,
>>>> G
>>>>
>>>>
>>>> On Fri, Jun 4, 2021 at 10:32 AM Till Rohrmann <[hidden email]>
>>>> wrote:
>>>>
>>>>> I did not mean for the user to sign its own certificates but for the
>>>>> operator of the cluster. Once the user request hits the proxy, it should no
>>>>> longer be under his control. I think I do not fully understand yet why this
>>>>> would not work.
>>>>>
>>>>> What I would like to avoid is to add more complexity into Flink if
>>>>> there is an easy solution which fulfills the requirements. That's why I
>>>>> would like to exercise thoroughly through the different alternatives. Also,
>>>>> I am missing a bit the comparison of Kerberos to other authentication
>>>>> mechanisms and why they were rejected in favour of Kerberos.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Fri, Jun 4, 2021 at 10:26 AM Gyula Fóra <[hidden email]> wrote:
>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> I think there might be possible alternatives but it seems Kerberos on
>>>>>> the rest endpoint ticks all the right boxes and provides a super clean and
>>>>>> simple solution for strong authentication.
>>>>>>
>>>>>> I wouldn’t even consider sidecar proxies etc if we can solve it in
>>>>>> such a simple way as proposed by G.
>>>>>>
>>>>>> Cheers
>>>>>> Gyula
>>>>>>
>>>>>> On Fri, 4 Jun 2021 at 10:03, Till Rohrmann <[hidden email]>
>>>>>> wrote:
>>>>>>
>>>>>>> I am not saying that we shouldn't add a strong authentication
>>>>>>> mechanism if there are good reasons for it. I primarily would like to
>>>>>>> understand the context a bit better in order to give qualified feedback and
>>>>>>> come to a good decision. In order to do this, I have the feeling that we
>>>>>>> haven't fully considered all available options which are on the table, tbh.
>>>>>>>
>>>>>>> Does the problem of certificate expiry also apply for self-signed
>>>>>>> certificates? If yes, then this should then also be a problem for the
>>>>>>> internal encryption of Flink's communication. If not, then one could use
>>>>>>> self-signed certificates with a longer validity to solve the mentioned
>>>>>>> issue.
>>>>>>>
>>>>>>> I think you can set up Flink in such a way that you don't have to
>>>>>>> handle all the different certificates. For example, you could deploy Flink
>>>>>>> with a "sidecar proxy" which is responsible for the authentication using an
>>>>>>> arbitrary method (e.g. Kerberos) and then bind the REST endpoint to a local
>>>>>>> network interface. That way, the REST endpoint would only be available
>>>>>>> through the sidecar proxy. Additionally, one could enable SSL for this
>>>>>>> communication. Would this be a solution for the problem?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>>
>>>>>>> On Thu, Jun 3, 2021 at 10:46 PM Márton Balassi <
>>>>>>> [hidden email]> wrote:
>>>>>>>
>>>>>>>> That is an interesting idea, Till.
>>>>>>>>
>>>>>>>> The main issue with it is that TLS certificates have an expiration
>>>>>>>> time, usually they get approved for a couple years. Forcing our users to
>>>>>>>> restart jobs to reprovision TLS certificates would be weird when we could
>>>>>>>> just implement a single proper strong authentication mechanism instead in a
>>>>>>>> couple hundred lines of code. :-)
>>>>>>>>
>>>>>>>> In many cases it is also impractical to go the TLS mutual route,
>>>>>>>> because the Flink Dashboard can end up on any node in the k8s/Yarn cluster
>>>>>>>> which means that we need a certificate per node (due to the mutual auth),
>>>>>>>> but if we also want to protect the private key of these from users
>>>>>>>> accidentally or intentionally leaking them then we need this per user. As
>>>>>>>> in we end up managing user*machine number certificates and having to renew
>>>>>>>> them periodically, which albeit automatable is unfortunately not yet
>>>>>>>> automated in all large organizations.
>>>>>>>>
>>>>>>>> I fully agree that TLS certificate mutual authentication has its
>>>>>>>> nice properties, especially at very large (multiple thousand node) clusters
>>>>>>>> - but it has its own challenges too. Thanks for bringing it up.
>>>>>>>>
>>>>>>>> Happy to have this added to the rejected alternative list so that
>>>>>>>> we have the full picture documented.
>>>>>>>>
>>>>>>>> On Thu, Jun 3, 2021 at 5:52 PM Till Rohrmann <[hidden email]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I guess the idea would then be to let the proxy do the
>>>>>>>>> authentication job and only forward the request via an SSL mutually
>>>>>>>>> encrypted connection to the Flink cluster. Would this be possible? The
>>>>>>>>> beauty of this setup is in my opinion that this setup should work with all
>>>>>>>>> kinds of authentication mechanisms.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Till
>>>>>>>>>
>>>>>>>>> On Thu, Jun 3, 2021 at 3:12 PM Gabor Somogyi <
>>>>>>>>> [hidden email]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for giving options to fulfil the need.
>>>>>>>>>>
>>>>>>>>>> Users are looking for a solution where users can be identified across
>>>>>>>>>> the whole cluster and access to resources/actions can be restricted.
>>>>>>>>>> A good example of such an action is cancelling other users'
>>>>>>>>>> running jobs.
>>>>>>>>>>
>>>>>>>>>> * SSL does provide mutual authentication, but once authentication has
>>>>>>>>>> passed there is no user identity on which restrictions can be based.
>>>>>>>>>> * The less problematic part is that generating/maintaining short-lived
>>>>>>>>>> certificates would be hard (that's the reason KDC-like servers
>>>>>>>>>> exist).
>>>>>>>>>> Having long-lived certificates would widen the attack
>>>>>>>>>> surface, but since the first concern already applies this is just a cosmetic issue.
>>>>>>>>>>
>>>>>>>>>> All in all using TLS certificates is not sufficient in these
>>>>>>>>>> environments unfortunately.
>>>>>>>>>>
>>>>>>>>>> BR,
>>>>>>>>>> G
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 3, 2021 at 12:49 PM Till Rohrmann <
>>>>>>>>>> [hidden email]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for the information Gabor. If it is about securing the
>>>>>>>>>>> communication between the REST client and the REST server, then Flink
>>>>>>>>>>> already supports enabling mutual SSL authentication [1]. Would this be
>>>>>>>>>>> enough to secure the communication and to pass an audit?
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/security/security-ssl/#external--rest-connectivity
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Till
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 3, 2021 at 10:33 AM Gabor Somogyi <
>>>>>>>>>>> [hidden email]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>
>>>>>>>>>>>> Since I have been working in the security area for 10+ years, let me
>>>>>>>>>>>> share my thoughts.
>>>>>>>>>>>> I would like to emphasise that there are experts better than me, but
>>>>>>>>>>>> I have some
>>>>>>>>>>>> basics.
>>>>>>>>>>>> The discussion is open and I am not trying to decide things alone...
>>>>>>>>>>>>
>>>>>>>>>>>> > I mean if an attacker can get access to one of the machines,
>>>>>>>>>>>> then it
>>>>>>>>>>>> should also be possible to obtain the right Kerberos token.
>>>>>>>>>>>> Not necessarily. For example, if one gets access to a specific
>>>>>>>>>>>> user's
>>>>>>>>>>>> credentials then it's not possible to compromise other users'
>>>>>>>>>>>> jobs, data,
>>>>>>>>>>>> etc...
>>>>>>>>>>>> Security is like an onion: the more layers have been added, the
>>>>>>>>>>>> more time an
>>>>>>>>>>>> attacker needs to proceed.
>>>>>>>>>>>> At the end of the day, if one is in, they can most probably find
>>>>>>>>>>>> a way, but
>>>>>>>>>>>> this time is normally enough for sysadmins or security experts to
>>>>>>>>>>>> close down the system and minimize the damage.
>>>>>>>>>>>>
>>>>>>>>>>>> The other thing is that all tokens have a timeout, and once the
>>>>>>>>>>>> token is
>>>>>>>>>>>> invalid the attacker can't proceed further.
>>>>>>>>>>>>
>>>>>>>>>>>> > Is Kerberos also the standard authentication protocol for
>>>>>>>>>>>> Kubernetes
>>>>>>>>>>>> deployments?
>>>>>>>>>>>> Kerberos is an industry standard which is cloud/deployment
>>>>>>>>>>>> agnostic and it
>>>>>>>>>>>> can be used in any deployments including k8s.
>>>>>>>>>>>> The main intention is to use kerberos in k8s deployments too
>>>>>>>>>>>> since we're
>>>>>>>>>>>> going this direction as well.
>>>>>>>>>>>> Please see how Spark does this:
>>>>>>>>>>>>
>>>>>>>>>>>> https://spark.apache.org/docs/latest/security.html#secure-interaction-with-kubernetes
>>>>>>>>>>>>
>>>>>>>>>>>> Last but not least, the most important reason to add at least
>>>>>>>>>>>> one strong
>>>>>>>>>>>> authentication mechanism is that we have users who have
>>>>>>>>>>>> hard requirements on this. They're doing security audits and if
>>>>>>>>>>>> they fail
>>>>>>>>>>>> then it's a deal breaker.
>>>>>>>>>>>> That is why we added Kerberos in the first place.
>>>>>>>>>>>> Unfortunately we
>>>>>>>>>>>> can't name them in this public list, however
>>>>>>>>>>>> the customers who specifically asked for this were mainly in
>>>>>>>>>>>> the banking
>>>>>>>>>>>> and telco sectors.
>>>>>>>>>>>>
>>>>>>>>>>>> BR,
>>>>>>>>>>>> G
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jun 3, 2021 at 9:20 AM Till Rohrmann <
>>>>>>>>>>>> [hidden email]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> > Thanks for updating the document Márton. Why is it that banks
>>>>>>>>>>>> will
>>>>>>>>>>>> > consider it more secure if Flink comes with Kerberos
>>>>>>>>>>>> authentication
>>>>>>>>>>>> > (assuming a properly secured setup)? I mean if an attacker
>>>>>>>>>>>> can get access
>>>>>>>>>>>> > to one of the machines, then it should also be possible to
>>>>>>>>>>>> obtain the right
>>>>>>>>>>>> > Kerberos token.
>>>>>>>>>>>> >
>>>>>>>>>>>> > I am not an authentication expert and that's why I wanted to
>>>>>>>>>>>> ask what are
>>>>>>>>>>>> > other authentication protocols other than Kerberos? Why did
>>>>>>>>>>>> we select
>>>>>>>>>>>> > Kerberos and not any other authentication protocol? Maybe you
>>>>>>>>>>>> can list the
>>>>>>>>>>>> > pros and cons for the different protocols. Is Kerberos also
>>>>>>>>>>>> the standard
>>>>>>>>>>>> > authentication protocol for Kubernetes deployments? If not,
>>>>>>>>>>>> what would be
>>>>>>>>>>>> > the answer when deploying on K8s?
>>>>>>>>>>>> >
>>>>>>>>>>>> > Cheers,
>>>>>>>>>>>> > Till
>>>>>>>>>>>> >
>>>>>>>>>>>> > On Wed, Jun 2, 2021 at 12:07 PM Gabor Somogyi <
>>>>>>>>>>>> [hidden email]>
>>>>>>>>>>>> > wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> >> Hi team,
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Happy to be here and hope I can provide quality additions in
>>>>>>>>>>>> the future.
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Thank you all for the helpful suggestions!
>>>>>>>>>>>> >> Considering them the FLIP has been modified and the work
>>>>>>>>>>>> continues on the
>>>>>>>>>>>> >> already existing Jira.
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> BR,
>>>>>>>>>>>> >> G
>>>>>>>>>>>> >>
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> On Wed, Jun 2, 2021 at 11:23 AM Márton Balassi <
>>>>>>>>>>>> [hidden email]>
>>>>>>>>>>>> >> wrote:
>>>>>>>>>>>> >>
>>>>>>>>>>>> >>> Thanks, Chesnay - I totally missed that. Answered on the
>>>>>>>>>>>> ticket too, let
>>>>>>>>>>>> >>> us continue there then.
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> Till, I agree that we should keep this codepath as slim as
>>>>>>>>>>>> possible. It
>>>>>>>>>>>> >>> is an important design decision that we aim to keep the
>>>>>>>>>>>> list of
>>>>>>>>>>>> >>> authentication protocols to a minimum. We believe that this
>>>>>>>>>>>> should not be a
>>>>>>>>>>>> >>> primary concern of Flink and a trusted proxy service (for
>>>>>>>>>>>> example Apache
>>>>>>>>>>>> >>> Knox) should be used to enable a multitude of end-user
>>>>>>>>>>>> authentication
>>>>>>>>>>>> >>> mechanisms. The bare minimum of authentication mechanisms
>>>>>>>>>>>> to support
>>>>>>>>>>>> >>> consequently consists of a single strong authentication
>>>>>>>>>>>> protocol, for which
>>>>>>>>>>>> >>> Kerberos is the enterprise solution, and HTTP Basic, primarily
>>>>>>>>>>>> for development
>>>>>>>>>>>> >>> and lightweight scenarios.
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> Added the above wording to G's doc.
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>>
>>>>>>>>>>>> https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>> On Tue, Jun 1, 2021 at 11:47 AM Chesnay Schepler <
>>>>>>>>>>>> [hidden email]>
>>>>>>>>>>>> >>> wrote:
>>>>>>>>>>>> >>>
>>>>>>>>>>>> >>>> There's a related effort:
>>>>>>>>>>>> >>>> https://issues.apache.org/jira/browse/FLINK-21108
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>> On 6/1/2021 10:14 AM, Till Rohrmann wrote:
>>>>>>>>>>>> >>>> > Hi Gabor, welcome to the Flink community!
>>>>>>>>>>>> >>>> >
>>>>>>>>>>>> >>>> > Thanks for sharing this proposal with the community
>>>>>>>>>>>> Márton. In
>>>>>>>>>>>> >>>> general, I
>>>>>>>>>>>> >>>> > agree that authentication is missing and that this is
>>>>>>>>>>>> required for
>>>>>>>>>>>> >>>> using
>>>>>>>>>>>> >>>> > Flink within an enterprise. The thing I am wondering is
>>>>>>>>>>>> whether this
>>>>>>>>>>>> >>>> > feature strictly needs to be implemented inside of Flink
>>>>>>>>>>>> or whether a
>>>>>>>>>>>> >>>> proxy
>>>>>>>>>>>> >>>> > setup could do the job? Have you considered this option?
>>>>>>>>>>>> If yes, then
>>>>>>>>>>>> >>>> it
>>>>>>>>>>>> >>>> > would be good to list it under the point of rejected
>>>>>>>>>>>> alternatives.
>>>>>>>>>>>> >>>> >
>>>>>>>>>>>> >>>> > I do see the benefit of implementing this feature inside
>>>>>>>>>>>> of Flink if
>>>>>>>>>>>> >>>> many
>>>>>>>>>>>> >>>> > users need it. If not, then it might be easier for the
>>>>>>>>>>>> project to not
>>>>>>>>>>>> >>>> > increase the surface area since it makes the overall
>>>>>>>>>>>> maintenance
>>>>>>>>>>>> >>>> harder.
>>>>>>>>>>>> >>>> >
>>>>>>>>>>>> >>>> > Cheers,
>>>>>>>>>>>> >>>> > Till
>>>>>>>>>>>> >>>> >
>>>>>>>>>>>> >>>> > On Mon, May 31, 2021 at 4:57 PM Márton Balassi <
>>>>>>>>>>>> [hidden email]>
>>>>>>>>>>>> >>>> wrote:
>>>>>>>>>>>> >>>> >
>>>>>>>>>>>> >>>> >> Hi team,
>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>> >>>> >> Firstly I would like to introduce Gabor or G [1] for
>>>>>>>>>>>> short to the
>>>>>>>>>>>> >>>> >> community, he is a Spark committer who has recently
>>>>>>>>>>>> transitioned to
>>>>>>>>>>>> >>>> the
>>>>>>>>>>>> >>>> >> Flink Engineering team at Cloudera and is looking
>>>>>>>>>>>> forward to
>>>>>>>>>>>> >>>> contributing
>>>>>>>>>>>> >>>> >> to Apache Flink. Previously G primarily focused on
>>>>>>>>>>>> Spark Streaming
>>>>>>>>>>>> >>>> and
>>>>>>>>>>>> >>>> >> security.
>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>> >>>> >> Based on requests from our customers G has implemented
>>>>>>>>>>>> Kerberos and
>>>>>>>>>>>> >>>> HTTP
>>>>>>>>>>>> >>>> >> Basic Authentication for the Flink Dashboard and
>>>>>>>>>>>> HistoryServer.
>>>>>>>>>>>> >>>> Previously
>>>>>>>>>>>> >>>> >> these lacked an authentication story.
>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>> >>>> >> We are looking to contribute this functionality back to
>>>>>>>>>>>> the
>>>>>>>>>>>> >>>> community, we
>>>>>>>>>>>> >>>> >> believe that given Flink's maturity there should be a
>>>>>>>>>>>> common code
>>>>>>>>>>>> >>>> solution
>>>>>>>>>>>> >>>> >> for this general pattern.
>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>> >>>> >> We are looking forward to your feedback on G's design.
>>>>>>>>>>>> [2]
>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>> >>>> >> [1] http://gaborsomogyi.com/
>>>>>>>>>>>> >>>> >> [2]
>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit
>>>>>>>>>>>> >>>> >>
>>>>>>>>>>>> >>>>
>>>>>>>>>>>> >>>>
>>>>>>>>>>>>
>>>>>>>>>>>