Hello All,
I am a committer on DataSketches.apache.org <http://datasketches.apache.org/> and am just learning about Flink. Since Flink is designed for stateful stream processing, I would think it makes sense to have the DataSketches library integrated into its core so that all users of Flink can take advantage of these advanced streaming algorithms. If there is interest in the Flink community in this capability, please contact us at [hidden email] or on our datasketches-dev Slack channel.

Cheers,
Lee.
Hi Lee,
I must admit that this is the first time I have heard of DataSketches (there really are many Apache projects). It sounds really exciting. As a (former) data engineer, I can say with 100% certainty that this is something (end-)users want and need, and it would make a lot of sense to have it in Flink from the get-go.

Flink, however, is already quite an old project, which grew at a strong pace, leading to some 150 modules in the core. We are currently in the process of restructuring that and reducing the number of things in the core so that build times and stability improve. To counter that, we created Flink packages [1], which hosts everything new that we deem not to be essential. I'd propose to incorporate a Flink DataSketches package there; if it turns out to be essential, we can still move it to the core at a later point.

As I have seen on your page, there are already plenty of adoptions. That leaves a few questions for me:

1. How would you estimate the effort to port DataSketches to Flink? It already has a Java API, but how difficult would it be to subdivide the task into parallel chunks of work? Since it has already been ported to Pig, I think we could use that port as a baseline.
2. Do you have any idea who usually drives these adoptions?

[1] https://flink-packages.org/

--
Arvid Heise | Senior Java Developer
Ververica <https://www.ververica.com/>
Hi Arvid,
Note: I am dual-listing this thread on both dev lists for better tracking.

> 1. How would you estimate the effort to port DataSketches to Flink? It already has a Java API, but how difficult would it be to subdivide the task into parallel chunks of work? Since it has already been ported to Pig, I think we could use that port as a baseline.

Most systems (Druid, Hive, Pig, Spark, PostgreSQL, other databases, streaming platforms, map-reduce platforms, etc.) have some sort of aggregation API that allows users to plug in custom aggregation functions. Typical functions found in these APIs are Initialize(), Update() (or Add()), Merge(), and getResult(). How these are named and how they operate varies considerably from system to system. These APIs are sometimes called User Defined Functions (UDFs) or User Defined Aggregation Functions (UDAFs).

DataSketches is a library of sketching (streaming) aggregation functions, each of which performs a specific type of aggregation: for example, counting unique items, determining quantiles and histograms of unknown distributions, or identifying the most frequent items (heavy hitters) in a stream. The advantage of using DataSketches is that the sketches are extremely fast, small in size, and have well-defined error properties backed by published scientific papers that establish the underlying mathematics.

Porting DataSketches usually amounts to developing a thin wrapper layer that translates the specific UDAF API of Flink to the equivalent API methods of the targeted sketches in the library (a rough illustration of such a wrapper is sketched below). This is best done by someone with deep knowledge of the UDAF code of the targeted system. We are certainly available to answer questions about the DataSketches APIs. Although we did write the UDAF layers for Hive and Pig, we did that as a proof of concept and as an example of how to write such layers. We are a small team and are not in a position to support these integration layers for every system out there.

> 2. Do you have any idea who usually drives these adoptions?

To start, you only need to write the UDAF layer for the sketches that you think would be in most demand by your users. The big four categories are distinct (unique) counting, quantiles, frequent items, and sampling. This is a natural way of subdividing the task: choose the sketches you want to adapt and in what order. Each sketch is independent, so it can be adapted whenever it is needed.

Please let us know if you have any further questions :)

Lee.
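To make the wrapper idea above concrete, here is a minimal sketch of what such a layer could look like against Flink's windowed DataStream aggregation interface, which already has the same four-method shape. It is illustrative only: the class name, the String input type, and the choice of the theta-sketch Union are assumptions, not something either project ships today.

    import org.apache.datasketches.theta.SetOperation;
    import org.apache.datasketches.theta.Union;
    import org.apache.flink.api.common.functions.AggregateFunction;

    // Illustrative only: maps Flink's DataStream aggregate lifecycle onto a theta-sketch Union.
    public class ThetaDistinctCountAggregate
            implements AggregateFunction<String, Union, Double> {

        @Override
        public Union createAccumulator() {           // "Initialize()"
            return SetOperation.builder().buildUnion();
        }

        @Override
        public Union add(String value, Union acc) {  // "Update()" / "Add()"
            acc.update(value);
            return acc;
        }

        @Override
        public Double getResult(Union acc) {         // "getResult()": estimated distinct count
            return acc.getResult().getEstimate();
        }

        @Override
        public Union merge(Union a, Union b) {       // "Merge()": fold in the other union's compact result
            a.update(b.getResult());
            return a;
        }
    }

Such a function could then be plugged into a keyed window, e.g. stream.keyBy(...).timeWindow(...).aggregate(new ThetaDistinctCountAggregate()). How efficiently the Union accumulator serializes through Flink's state backends is exactly the kind of detail the two communities would need to work out.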
If this can encourage Lee: I'm one of the Flink users who already uses DataSketches, and I find it an amazing library. When I was trying it out (last year) I tried to stimulate some discussion [1], but at that time it was probably too early. I really hope that things are now mature enough for both communities!

[1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-and-sketches-td26852.html

Best,
Flavio
Hi Lee,
I really like this project; I used it with Flink a few years ago when it was still Yahoo DataSketches. The projects clearly complement each other. As Arvid mentioned, the Flink community is trying to foster an ecosystem larger than what is in the main Flink repository. The reason is that the project has grown to such a scale that it cannot reasonably maintain everything. To encourage that sort of growth, Flink is extensively pluggable, which means that components do not need to live within the main repository to be treated as first-class.

I'd like to outline some things the DataSketches community could do to integrate with Flink.

1) Create a page on the flink-packages website.

The Flink community hosts a website called flink-packages to increase the visibility of ecosystem projects with the Flink user base [1]. DataSketches is usable from Flink today, so I'd encourage you to create a page right away.

2) Implement TypeInformation for DataSketches.

TypeInformation is Flink's internal type system and is used as a factory for creating serializers for different types. These serializers are what Flink uses when shuffling data around the cluster and when storing records as state in the state backends. Providing TypeInformation instances for the different sketch types would mostly mean writing wrappers around the existing serializers in the DataSketches codebase, so this should be relatively straightforward. There is no DataStream aggregation API of the kind you are describing, so this is the *only* step you would need to take to provide first-class support for the Flink DataStream API [2][3].

3) Implement sketch UDFs.

Along with its Java API, Flink also offers a relational API with UDFs. The community could provide UDFs for DataSketches, similar to what exists for Hive. Doing so only requires implementing the aggregation function interface [4] (a rough illustration follows after the links below). Flink SQL also offers the concept of modules, which are collections of SQL UDFs that can easily be loaded into the system [5]. A DataSketches SQL module would provide a simple way for users to get started and would expose these UDFs as if they were native to Flink.

I hope this helps. I look forward to watching the DataSketches community grow!

Seth

[1] https://flink-packages.org/
[2] https://ci.apache.org/projects/flink/flink-docs-stable/dev/types_serialization.html
[3] https://ci.apache.org/projects/flink/flink-docs-stable/dev/datastream_api.html
[4] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/functions/udfs.html#aggregation-functions
[5] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/modules.html
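A rough sketch of the Table/SQL aggregate function from point 3 could look like the following. It is illustrative only: the class, function, and column names are made up, and the accumulator-serialization question from point 2 applies here as well.

    import org.apache.datasketches.theta.SetOperation;
    import org.apache.datasketches.theta.Union;
    import org.apache.flink.table.functions.AggregateFunction;

    // Illustrative sketch of a Flink Table/SQL aggregate function backed by a theta sketch.
    public class ThetaSketchDistinctCount extends AggregateFunction<Long, Union> {

        @Override
        public Union createAccumulator() {
            return SetOperation.builder().buildUnion();
        }

        // Called by Flink for each input row (discovered by name, not overridden).
        public void accumulate(Union acc, String value) {
            if (value != null) {
                acc.update(value);
            }
        }

        // Used when Flink merges partial aggregates (e.g. session windows).
        public void merge(Union acc, Iterable<Union> others) {
            for (Union other : others) {
                acc.update(other.getResult());
            }
        }

        @Override
        public Long getValue(Union acc) {
            return (long) acc.getResult().getEstimate();
        }
    }

Once registered with the table environment, it could be called from SQL, e.g. SELECT theta_sketch_distinct_count(user_id) FROM clicks; the function and column names are of course hypothetical.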
One more point I forgot to mention.
Flink SQL supports Hive UDFs [1]. I haven't tested it, but the DataSketches Hive package should just work out of the box.

Seth

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/hive/hive_functions.html
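For anyone who wants to try this, the rough shape would be something like the following. It is untested and only a sketch: it assumes Flink 1.10 with the Hive connector on the classpath, and that the datasketches-hive UDAFs have already been registered in the Hive catalog in use under a name such as data_to_sketch; all names below are illustrative.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.module.hive.HiveModule;

    public class HiveSketchExample {
        public static void main(String[] args) {
            TableEnvironment tEnv = TableEnvironment.create(
                    EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build());

            // Enable Hive function compatibility; the version must match the Hive jars on the classpath.
            tEnv.loadModule("hive", new HiveModule("2.3.6"));

            // Assumes the sketch UDAF is registered as "data_to_sketch" in the current Hive catalog.
            tEnv.sqlQuery("SELECT data_to_sketch(user_id) FROM events");
        }
    }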
Seth,
Thanks for the enthusiastic reply. However, I have some questions ... and concerns :)

1) Create a page on the flink-packages website.

I looked at this website and it raises a number of red flags for me:

- There are no instructions anywhere on the site on how to add a listing.
- The "Login with GitHub" button raises security concerns, and without any explanation: why would I want or need to authorize this site to have "access to my email account"? Whoa!
- The site has registered fewer than 100 GitHub users. That is a very small number; it seems a lot of GitHub users have the same concerns that I have.
- The packages listed are "not endorsed by Apache Flink project or Ververica. This site is not affiliated with or released by Apache Flink", and there is no verification of licensing.
- In other words, this site carries zero or even negative weight. Why would I want to add a listing for our very high quality and properly licensed Apache DataSketches product alongside other listings that are possibly junk?

2) Implement TypeInformation for DataSketches.

In terms of serialization and deserialization, the sketches in our library have their own serialization: to and from a byte array, in a format that is language independent across Java, C++, and Python. How to transport bytes from one system to another is system dependent and external to the DataSketches library; some systems use Base64, ProtoBuf, Kryo, Kafka, or whatever. As long as we can deserialize (or wrap) the same byte array that was serialized, we are fine (a small round-trip example follows at the end of this message).

If you are asking for metadata about a specific blob of bytes, such as which sketch created the blob, we can perhaps do that, but the documentation is not clear about how much metadata is really required, because our library does not need it. So we could use some help here in defining what is really required. Be aware that metadata also increases the stored size of an object, and we have worked very hard to keep the stored size of our sketches very small, because that is one of the key advantages of using sketches. This is also why we don't use Java serialization: it is way too heavy!

3) Implement sketch UDFs.

Thanks for the references, but this is getting way too deep into the weeds for me right now. I would suggest we start simple and build these UDFs later, as they seem optional, if I understand your comments correctly.

I would suggest we set up a video call with a couple of your key developers who could steer us quickly through the options. Please be aware that we are *extremely* resource limited; Flink is at least 10 times our size, so we could use some help getting started. What would be ideal is for someone in your community who is interested in seeing DataSketches integrated into Flink to work with us on making it happen.

I am looking forward to working with Flink to make this happen.

Cheers,
Lee.
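For point 2, a minimal round trip looks roughly like this. It is illustrative only: it uses the theta sketch from datasketches-java, and how the resulting byte[] is transported or stored would be entirely up to Flink.

    import org.apache.datasketches.memory.Memory;
    import org.apache.datasketches.theta.Sketch;
    import org.apache.datasketches.theta.Sketches;
    import org.apache.datasketches.theta.UpdateSketch;

    public class SketchRoundTrip {
        public static void main(String[] args) {
            UpdateSketch sketch = UpdateSketch.builder().build();
            for (long i = 0; i < 100_000; i++) {
                sketch.update(i);
            }

            // Compact, immutable, language-independent byte image of the sketch.
            byte[] bytes = sketch.compact().toByteArray();

            // Read the same bytes back by wrapping them (no copy, no Java serialization).
            Sketch restored = Sketches.wrapSketch(Memory.wrap(bytes));
            System.out.println(restored.getEstimate()); // ~100,000, with well-defined error bounds
        }
    }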