[jira] [Created] (FLINK-22143) Flink returns less rows than expected when using limit in SQL

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (FLINK-22143) Flink returns less rows than expected when using limit in SQL

Shang Yuanchun (Jira)
Peng Yu created FLINK-22143:
-------------------------------

             Summary: Flink returns less rows than expected when using limit in SQL
                 Key: FLINK-22143
                 URL: https://issues.apache.org/jira/browse/FLINK-22143
             Project: Flink
          Issue Type: Bug
          Components: Table SQL / Runtime
    Affects Versions: 1.13.0
            Reporter: Peng Yu
             Fix For: 1.13.0


Flink's blink runtime returns less rows than expected when querying Hive tables with limit.
{code:java}
// sql
select i_item_sk from tpcds_1g_snappy.item limit 5000;
{code}
 

Above query will return only *4998* lines in some cases.

 

This problem can be re-produced on below conditions:
 # A Hive table with parquet format.
 # Running SQL with limit using blink planner since Flink version 1.12.0
 # The input table is small. (With only 1 data file in which there is only 1 row group, e.g. 1 GB of TPCDS benchmark data)
 # The requested count of lines by `limit` is above the batch size (2048 by default)

 

After investigation, a bug is found lying in the *LimitableBulkFormat* class.

In this class, for each batch, *numRead* will be increased *1* more than actual count of rows returned by reader.readBatch().

The reason is that *numRead* get increased even when next() reaches then end of current batch.

If there is only 1 input split, no more lines will be merged into the final result. 

As a result, less lines will be returned by Flink.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)