Dear devs,

Currently, for log output, Flink does not explicitly distinguish between framework logs and user logs. In the Task Manager, logs from the framework are intermixed with the user's business logs. In some deployment modes, such as standalone or YARN session, task instances of different jobs are deployed in the same Task Manager. This makes the log event flow confusing unless users explicitly tag their log lines to distinguish them, and it makes locating problems difficult and inefficient. For the YARN job cluster deployment mode this problem is less serious, but we still have to distinguish the framework log from the business log by hand. Overall, we found that Flink's existing log model has the following problems:

- Framework log and business log are mixed in the same log file. There is no way to make a clear distinction, which is not conducive to problem location and analysis;
- It is not conducive to collecting the business logs independently.
Therefore, we propose a mechanism to separate the framework and business logs. It can split the existing log files for the Task Manager. Currently, it is associated with two JIRA issues:

- FLINK-11202 [1]: Split log file per job
- FLINK-11782 [2]: Enhance TaskManager log visualization by listing all log files for the Flink web UI
We have implemented and validated it in standalone and Flink on YARN (job cluster) mode.

Sketch 1: [image: flink-web-ui-taskmanager-log-files.png]

Sketch 2: [image: flink-web-ui-taskmanager-log-files-2.png]

Design documentation: https://docs.google.com/document/d/1TTYAtFoTWaGCveKDZH394FYdRyNyQFnVoW5AYFvnr5I/edit?usp=sharing

Best,
Vino

[1]: https://issues.apache.org/jira/browse/FLINK-11202
[2]: https://issues.apache.org/jira/browse/FLINK-11782
If I've understood this correctly, I think this design may be going in the wrong direction. The problem with Flink logging, when you are running multiple jobs in the same TMs, is not just about separating the business-level logging into separate files. The Flink framework itself logs many things where there is clearly a single job in context, but that all ends up in the same log file with no clear separation among the log lines.

Also, I don't think aiming for multiple log files is a very good idea either. It's common, especially on container-based deployments, that the expectation is that a process (like Flink) logs everything to stdout and the surrounding tooling takes care of routing that log data somewhere. I think we should stick with that model and expect a single log stream coming out of each Flink process.

Instead, I think it would be better to enhance Flink's logging capability so that the appropriate context can be added to each log line, with the exact format controlled by the end user. It might make sense to take a look at MDC, for example, as a way to approach this.
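[Editor's note: for readers unfamiliar with MDC, the idea is to attach per-thread context (e.g. a job id) that the logging backend can render into every line. A minimal SLF4J sketch, purely illustrative — the "jobId"/"jobName" keys and the surrounding class are hypothetical, not something Flink sets today:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class JobContextLoggingSketch {

    private static final Logger LOG = LoggerFactory.getLogger(JobContextLoggingSketch.class);

    public static void main(String[] args) {
        // Hypothetical: the runtime would set these on the task's thread before
        // invoking user code and clear them afterwards.
        MDC.put("jobId", "a1b2c3d4");
        MDC.put("jobName", "my-streaming-job");
        try {
            // With a layout such as "%d %-5p [%X{jobName}/%X{jobId}] %c - %m%n",
            // this line would carry the job context without touching the call site.
            LOG.info("Processing record ...");
        } finally {
            MDC.remove("jobId");
            MDC.remove("jobName");
        }
    }
}

]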
Hi Jamie Grier,
Thank you for your reply; let me add some explanation to this design.

First of all, as stated in "Goal", it is mainly aimed at the "Standalone" cluster mode. Although we have also implemented it for Flink on YARN, that does not mean we can't turn the feature off via an option. It should be noted that the separation is essentially driven by the log configuration file; it is very extensible and even allows users to define the log pattern in the configuration file (this is an extension feature not mentioned in the design document). In fact, a single log file can be treated as a special case of multiple files: we can provide an option that keeps the current single-file behavior as the default, which should match the container scenario you describe.

According to Flink's official 2016 user survey [1], the number of users running standalone mode is quite close to the number running YARN mode (unfortunately there is no comparable data for 2017). Although we mainly use Flink on YARN now, we have also used standalone mode heavily (processing close to 20 trillion messages per day). In this scenario, the user logs generated by tasks of different jobs are mixed together, which makes it very difficult to locate issues. Moreover, because we configure a log-file rolling policy, we have to log in to the server to view the logs. Therefore, we would like the user logs produced by tasks of the same job on the same Task Manager to be distinguishable.

In addition, I have tried MDC, but it cannot achieve this goal. Flink uses log4j 1.x and logback underneath; we need to stay compatible with both frameworks at the same time, avoid large-scale changes to the existing code, and keep the change transparent to users.

Some other points:

1) Many of our users have experience with Storm and Spark and are more accustomed to that style in standalone mode;
2) Splitting the user log by job will also help us implement a job-based "business log aggregation" feature later.

Best,
Vino

[1]: https://www.ververica.com/blog/flink-user-survey-2016-part-1
Is that something that can just be done with the right logging framework and configuration?

Like having a log framework with two targets, one filtered on "org.apache.flink" and the other filtered on "my.company.project", or so?
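[Editor's note: to make the two-target idea concrete, here is a rough log4j 1.x properties sketch — appender names and the "my.company.project" package are placeholders, not an actual Flink configuration:

# Framework log: everything goes to the usual file by default.
log4j.rootLogger=INFO, framework
log4j.appender.framework=org.apache.log4j.FileAppender
log4j.appender.framework.file=${log.file}
log4j.appender.framework.layout=org.apache.log4j.PatternLayout
log4j.appender.framework.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c - %m%n

# Business log: everything under the user's package goes to a separate file
# and does not propagate to the root appender.
log4j.logger.my.company.project=INFO, business
log4j.additivity.my.company.project=false
log4j.appender.business=org.apache.log4j.FileAppender
log4j.appender.business.file=${log.file}.business
log4j.appender.business.layout=org.apache.log4j.PatternLayout
log4j.appender.business.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c - %m%n

]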
From what I understand, this isn't about logging Flink/user messages to different files, but about logging everything relevant to a specific job to a separate file (including what is logged in runtime classes, i.e. Tasks, Operators, etc.).
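[Editor's note: if the goal is one file per job as described above, one conceivable way to express that purely in configuration — for logback only — would be a SiftingAppender keyed on a per-job MDC value. This is only an illustration under the assumption that the runtime sets a "jobId" MDC key, which it does not do today:

<configuration>
  <appender name="PER_JOB" class="ch.qos.logback.classic.sift.SiftingAppender">
    <discriminator>
      <key>jobId</key>
      <defaultValue>taskmanager</defaultValue>
    </discriminator>
    <sift>
      <!-- One file per distinct jobId value seen in the MDC -->
      <appender name="FILE-${jobId}" class="ch.qos.logback.core.FileAppender">
        <file>${log.file}.${jobId}</file>
        <encoder>
          <pattern>%d{yyyy-MM-dd HH:mm:ss,SSS} %-5level %logger{60} - %msg%n</pattern>
        </encoder>
      </appender>
    </sift>
  </appender>

  <root level="INFO">
    <appender-ref ref="PER_JOB"/>
  </root>
</configuration>

]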
Hi Stephan,
Thanks for your reply. In some cases your solution works. However, in other scenarios it does not meet the requirement:

- one program may contain multiple job instances;
- if we run Flink as a platform, we cannot know the packages of the users' programs in advance in order to configure the log profiles before starting the cluster.

Chesnay's understanding is right: we need to split the business logs by job. Recently, a user also asked for this requirement. [1]

[1]: https://issues.apache.org/jira/browse/FLINK-12953