Hi everyone,
I just recently came across a use-case where I needed to read gzip files and handle byte order marks transparently. I know that gzip can be read with Hadoop input formats, but that did not work for me since I wanted to reuse my existing custom Flink input formats.

It turned out that both requirements (and more) can be dealt with by allowing the input formats to decorate the input stream. Do you think it is worthwhile to include these changes in Flink? I could take care of it.

Cheers,
Sebastian
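For illustration, here is a minimal sketch of what "decorating the input stream" could look like for the two requirements named above (gzip and byte order marks). It uses only JDK classes. The class name `StreamDecoration` and the methods `decorate`, `skipUtf8Bom`, and `readText` are hypothetical and are not part of Flink's API; choosing gzip by the `.gz` file extension is likewise an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class StreamDecoration {

    /** Hypothetical hook: wraps the raw file stream before a format parses it. */
    static InputStream decorate(String fileName, InputStream raw) throws IOException {
        // Assumption: gzip files are recognized by their extension.
        InputStream in = fileName.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
        return skipUtf8Bom(in);
    }

    /** Drops a leading UTF-8 byte order mark (EF BB BF) if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: give the bytes back
        }
        return pushback;
    }

    /** Reads a whole (decorated) stream as UTF-8 text; used only for the demo. */
    static String readText(String fileName, byte[] content) {
        try (InputStream in = decorate(fileName, new ByteArrayInputStream(content))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int r;
            while ((r = in.read(chunk)) != -1) {
                out.write(chunk, 0, r);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a gzip file in memory that starts with a BOM, then reads it back. */
    static String demo() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});
            gz.write("hello".getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return readText("data.csv.gz", buf.toByteArray());
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the sketch is that the parsing code never learns whether the bytes came from a compressed file or carried a BOM; it only sees the decorated stream.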
I think that would be very worthwhile :-) Happy to hear that you want to contribute that!

Decorating the input stream sounds like a great approach and would also work for other compression formats.

The other thing that needs to be taken into account is that GZIP files are not splittable in the same way as uncompressed files. You may have to invent something clever there, or simply restrict the format to have one input split per file (rather than block).

On Thu, Apr 30, 2015 at 5:41 PM, Kruse, Sebastian <[hidden email]> wrote:
> [...]
> It turned out that both requirements (and more) can be dealt with by
> allowing the input formats to decorate the input stream. Do you think it is
> worthwhile to include these changes in Flink? I could take care of it.
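To make the "one input split per file (rather than block)" option concrete, here is a rough sketch of split planning with an unsplittable switch. The names `SplitPlanning`, `Split`, and `planSplits` are hypothetical and this is not Flink's actual split logic; it only shows the idea that a gzip file, which cannot be decompressed from an arbitrary offset, is handed to a single reader as one byte range.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitPlanning {

    /** Hypothetical split descriptor: the byte range [start, start + length) of one file. */
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Plans splits for a single file. Splittable files get one split per block;
     * unsplittable files (e.g. gzip) always get exactly one split covering the file.
     */
    static List<Split> planSplits(long fileLength, long blockSize, boolean unsplittable) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and blockSize > 0");
        }
        List<Split> splits = new ArrayList<>();
        if (unsplittable) {
            splits.add(new Split(0, fileLength));
            return splits;
        }
        for (long start = 0; start < fileLength; start += blockSize) {
            splits.add(new Split(start, Math.min(blockSize, fileLength - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(planSplits(250, 100, false).size()); // block-wise splits
        System.out.println(planSplits(250, 100, true).size());  // one split for the whole file
    }
}
```

The trade-off is parallelism: a large gzip file is then read by a single task, so many medium-sized files parallelize better than one huge archive.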
There is already support for deflate-compressed files, and I introduced logic to handle unsplittable formats.
On 30.04.2015, at 19:39, Stephan Ewen <[hidden email]> wrote:
> [...]
> The other thing that needs to be taken into account is that GZIP files are
> not splittable in the same way as uncompressed files. You may have to
> invent something clever there, or simply restrict the format to have one
> input split per file (rather than block).
Right, I saw the .deflate file support and the unsplittable flag and built upon that code. I just tried to generalize it and expose it as a hook, so that unforeseen needs such as new exotic compression formats or custom preambles can be handled by the users themselves.
I can create a ticket and a pull request by this week, so that you can have a look at it.

Cheers,
Sebastian
________________________________________
From: Robert Metzger [[hidden email]]
Sent: Thursday, April 30, 2015 21:01
To: [hidden email]
Subject: Re: Gzip support

There is already support for inflate compressed files and I introduced logic to handle unsplittable formats.
Great. Please file a JIRA and open a pull request for the feature!
On Mon, May 4, 2015 at 10:37 AM, Kruse, Sebastian <[hidden email]> wrote:
> [...]
> I can create a ticket and a pull request by this week, so that you can
> have a look at it.