Read XML from HDFS

santosh_rajaguru
Hi,

Is there any way to read a complete XML string or file from HDFS using Flink?

Thanks and Regards,
Santosh

Re: Read XML from HDFS

Fabian Hueske-2
Hi Santosh,

Yes, that is possible if you want to read a complete file without splitting
it into records. However, you need to implement a custom InputFormat for
that, one that extends Flink's FileInputFormat.
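
The contract such a format has to fulfil can be sketched without any Flink dependency: the whole file becomes exactly one record. The class below is a hypothetical, plain-Java illustration, not Flink code. Its reachedEnd() and nextRecord() methods only mirror the two methods a FileInputFormat subclass implements, and UTF-8 is an assumed encoding. A real subclass would also need to mark the file as unsplittable so that a single task reads all of it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: emits the full content of a stream as one record.
final class WholeFileRecordReader {
    private final InputStream in;
    private boolean consumed = false;

    WholeFileRecordReader(InputStream in) {
        this.in = in;
    }

    // True once the single record has been handed out.
    boolean reachedEnd() {
        return consumed;
    }

    // Drains the stream and returns everything as one string (UTF-8 assumed).
    String nextRecord() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        consumed = true;
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Reading everything into memory is only reasonable when each XML file fits comfortably on one worker.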

If you want to split it into records, you need a character sequence that
delimits two records. Depending on the schema and format of your data this
might not be possible. If you have such a delimiting character sequence,
you can use Flink's DelimitedInputFormat.
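
To illustrate what "a character sequence that delimits two records" means, here is a minimal plain-Java sketch. It is not how DelimitedInputFormat is implemented (that class works on byte buffers and file splits); DelimiterSplitter and its split method are hypothetical names, and the example assumes one XML record per line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: cuts a text into records at each delimiter occurrence.
final class DelimiterSplitter {
    static List<String> split(String data, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = data.indexOf(delimiter, start)) != -1) {
            records.add(data.substring(start, idx)); // the delimiter itself is dropped
            start = idx + delimiter.length();
        }
        if (start < data.length()) {
            records.add(data.substring(start)); // trailing record without a delimiter
        }
        return records;
    }
}
```

Calling DelimiterSplitter.split("<rec>1</rec>\n<rec>2</rec>\n", "\n") yields the two records <rec>1</rec> and <rec>2</rec>. This only works if the delimiter can never occur inside a record, which is why it may not be possible for arbitrary XML.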

Cheers, Fabian


2015-07-15 12:15 GMT+02:00 santosh_rajaguru <[hidden email]>:

> Hi,
>
> Is there any way to read the complete XML string or file from HDFS using
> flink?
>
> Thanks and Regards,
> Santosh

Re: Read XML from HDFS

Kostas Tzoumas-2
Perhaps there is also an existing HadoopInputFormat for XML that you might
be able to reuse for your purposes (Flink supports Hadoop input formats).

For example, there is an XmlInputFormat in the Apache Mahout codebase that
you could take a look at:
https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java
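
As far as I understand it, input formats of this kind scan the input for a configured start tag and end tag and emit everything from one to the other as a single record. The plain-Java sketch below shows only that scanning idea on an in-memory string. XmlRecordScanner and extract are hypothetical names, not part of Mahout, Hadoop, or Flink.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects every span from startTag through endTag.
final class XmlRecordScanner {
    static List<String> extract(String xml, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = xml.indexOf(startTag, pos);
            if (start == -1) {
                break; // no further record begins
            }
            int end = xml.indexOf(endTag, start + startTag.length());
            if (end == -1) {
                break; // unterminated record, stop scanning
            }
            end += endTag.length();
            records.add(xml.substring(start, end)); // the tags are part of the record
            pos = end;
        }
        return records;
    }
}
```

For the input <docs><doc>a</doc><doc>b</doc></docs> with the tags <doc> and </doc>, this returns <doc>a</doc> and <doc>b</doc>. The sketch matches tags literally, so a start tag with attributes or nested elements of the same name would need extra handling.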




On Wed, Jul 15, 2015 at 1:37 PM, Fabian Hueske <[hidden email]> wrote:

> Hi Santosh,
>
> yes that is possible, if you want to read a complete file without splitting
> it into records. However, you need to implement a custom InputFormat for
> that which extends Flink's FileInputFormat.
>
> If you want to split it into records, you need a character sequence that
> delimits two records. Depending on the schema and format of your data this
> might not be possible. If you have such a delimiting character sequence,
> you can use Flink's DelimitedInputFormat.
>
> Cheers, Fabian

Re: Read XML from HDFS

santosh_rajaguru
Thanks, Fabian and Kostas, for the info. Using XmlInputFormat, I am able to read an XML file from HDFS.

Cheers,
Santosh