missing data when reading avro file

Bug #1619480 reported by Steve Yang
This bug affects 1 person
Affects                           Status      Importance  Assigned to  Milestone
OpenStack Object Storage (swift)  Incomplete  Undecided   Unassigned
Sahara                            New         Undecided   Unassigned

Bug Description

Library used: org.apache.hadoop:hadoop-openstack:2.6.0
We are loading avro files from an Oracle Storage Service server (i.e., a Swift server) into a Spark DataFrame object through the Spark Data Source API. For example:
return hiveCtx.read().format("com.databricks.spark.avro").load(objectName);

When the avro file is read from the Storage Service server through the OpenStack Swift API, the number of records returned is less than the actual record count in the file.

If we run a SQL query on top of the returned data frame, such as "select count(*) as C1 from <temp table>", we can see the record count is smaller when reading the same avro file from local file system.

For a large avro file (awclassic.avro, 105M) the count is always wrong (42451 records read vs. 60855 actual). From the log file we can see that the read of the file is split into 4:
2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:100663296+10044747
2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:33554432+33554432
2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:0+33554432
2016-09-01 14:18:27 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/awclassic.avro:67108864+33554432

For a smaller avro file (wine.avro, 19M) the count is sometimes correct (57076 records) and sometimes wrong (26999 records). Running the same Spark SQL query 10 times back-to-back produces the following record counts:
run 1: 26999
run 2: 26999
run 3: 57076
run 4: 57056
run 5: 57076
run 6: 26999
run 7: 57076
run 8: 57076
run 9: 57076
run 10: 57076

For this wine.avro test case there are two splits:
2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/wine.avro:9965269+9965270
2016-08-31 17:42:32 INFO HadoopRDD:59 - Input split: swift://qaTestData.oracleswift/testAvro/wine.avro:0+9965269
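
The split offsets in both logs can be reproduced with a few lines of arithmetic, which suggests the splitting itself is behaving as expected. The following is only a sketch, assuming the splits are computed the way Hadoop's FileInputFormat does it (goal size = file size / requested splits, capped at the block size, with a 10% slack on the last split). The 32 MiB block size, the 2 requested splits, and the file sizes (taken as the last offset plus its length) are assumptions inferred from the logged numbers, and SplitSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {
    // Slack factor: the final split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    /** Returns {offset, length} pairs covering one file of the given size. */
    static List<long[]> computeSplits(long fileSize, long blockSize, int numSplits, long minSize) {
        long goalSize = fileSize / Math.max(numSplits, 1);
        long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));
        List<long[]> splits = new ArrayList<>();
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileSize - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining > 0) {
            // Whatever is left becomes the last split.
            splits.add(new long[] {fileSize - remaining, remaining});
        }
        return splits;
    }

    /** Formats splits as "offset+length", the notation used in the HadoopRDD log lines. */
    static String format(List<long[]> splits) {
        StringBuilder sb = new StringBuilder();
        for (long[] s : splits) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(s[0]).append('+').append(s[1]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long blockSize = 32L * 1024 * 1024; // assumed block size (33554432 bytes)
        // awclassic.avro: assumed size 100663296 + 10044747 = 110708043 bytes
        System.out.println(format(computeSplits(110708043L, blockSize, 2, 1)));
        // wine.avro: assumed size 9965269 + 9965270 = 19930539 bytes
        System.out.println(format(computeSplits(19930539L, blockSize, 2, 1)));
    }
}
```

Under these assumptions the sketch prints the same four splits for awclassic.avro and the same two splits for wine.avro as the logs show, so the missing records are more likely to come from how each split's byte range is fetched than from how the file is divided.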

I have attached a zip file containing the two avro files in question and the relevant sections of the debug log for reading the wine.avro file - one for a successful read (C4.ok) and one for a read with missing records (C5.miss).

Steve Yang (syang97) wrote :

Correction:
old: we can see the record count is smaller when reading the same avro file from local file system.
new: we can see the record count is ALWAYS CORRECT when reading the same avro file from local file system.

clayg (clay-gerrard) wrote :

Man, I'm really hoping we can get you figured out here, but I'm not sure who to reach out to about that Hadoop package. As with lp bug #1618252, I think the best we can do is loop in the Sahara folk and see if they can provide direction.

Tim Burke (1-tim-z) wrote :

Looks like they're using DLOs [1]; I wonder whether this is a container-listing eventual consistency issue?

I also wonder what versions of Swift they want to support, and how difficult it would be to add SLO support...

[1] https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-openstack/src/main/java/org/apache/hadoop/fs/swift/snative/SwiftNativeFileSystemStore.java#L170
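
To illustrate the eventual-consistency hypothesis: a DLO GET concatenates whatever segments the container listing returns under the manifest prefix, so a listing that has not caught up yet yields a shorter object. The sketch below only simulates that effect. The segment names, sizes, and the DloListingSketch class are hypothetical, and nothing in this report confirms that the affected files were stored as DLOs.

```java
import java.util.Map;
import java.util.TreeMap;

final class DloListingSketch {
    /** Length of a DLO as served: the sum of the listed segments under the prefix. */
    static long assembledLength(Map<String, Long> listing, String prefix) {
        long total = 0;
        // Segments are concatenated in name order; order does not affect the length.
        for (Map.Entry<String, Long> e : new TreeMap<>(listing).entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                total += e.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical segments of one large object.
        Map<String, Long> complete = new TreeMap<>();
        complete.put("data.avro/000001", 1000L);
        complete.put("data.avro/000002", 1000L);
        complete.put("data.avro/000003", 500L);

        // A stale listing that does not show the last segment yet.
        Map<String, Long> stale = new TreeMap<>(complete);
        stale.remove("data.avro/000003");

        System.out.println(assembledLength(complete, "data.avro/")); // 2500
        System.out.println(assembledLength(stale, "data.avro/"));    // 2000
    }
}
```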

Tim Burke (1-tim-z) wrote :

This part seems curious, though:

2016-08-31 17:52:55 DEBUG SwiftNativeInputStream:103 - Fetching 10030805 bytes starting at 9965269
2016-08-31 17:52:55 DEBUG SwiftRestClient:622 - getData:bytes=9965269-19996073
2016-08-31 17:52:55 DEBUG SwiftRestClient:1727 - GET https://storage.oraclecorp.com/v1/Storage-dfisher/qaTestData/testAvro/wine.avro
Range: bytes=9965269-19996073
X-Newest: true
X-Auth-Token: AUTH_tk2407a330ae25f8098c4e31e62f68cd8f
User-Agent: Apache Hadoop Swift Client 2.6.0-cdh5.5.0 from fd21232cef7b8c1f536965897ce20f50b83ee7b2 by jenkins source checksum 98e07176d1787150a6a9c087627562c

2016-08-31 17:52:55 DEBUG SwiftRestClient:1731 - Status code = 200
2016-08-31 17:52:55 DEBUG SwiftNativeInputStream:309 - Seek to 0; current pos =0; offset=0
2016-08-31 17:52:55 DEBUG SwiftNativeInputStream:313 - seek is no-op
2016-08-31 17:52:55 DEBUG SwiftNativeInputStream:103 - read(buffer, 0, 65536)
2016-08-31 17:52:55 DEBUG SwiftNativeInputStream:103 - read(buffer, 0, 65536)
2016-08-31 17:52:55 DEBUG SwiftNativeInputStream:309 - Seek to 0; current pos =16054; offset=-16054
2016-08-31 17:52:55 DEBUG SwiftNativeInputStream:318 - seek is backwards
2016-08-31 17:52:55 DEBUG SwiftNativeInputStream:222 - Closing HTTP input stream : seeking to 0

Why did we get a 200 instead of a 206?
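
To spell out why the status code matters: under HTTP range semantics, 206 Partial Content means the body starts at the requested offset, while 200 OK means the Range header was not applied and the body is the whole object starting at byte 0. A client that treats a 200 body as if it began at the requested offset would read the wrong bytes. The following is only a sketch of that check; RangeResponseSketch and its methods are hypothetical names, and the log does not show whether the Hadoop client performs anything like it.

```java
final class RangeResponseSketch {
    /** Number of bytes covered by an inclusive "bytes=first-last" range. */
    static long rangeLength(long first, long last) {
        return last - first + 1;
    }

    /**
     * Offset within the object of the first byte of the response body.
     * 206: the range was honored, so the body starts at the requested offset.
     * 200: the range was ignored, so the body starts at byte 0.
     */
    static long bodyStartOffset(int status, long requestedFirst) {
        switch (status) {
            case 206:
                return requestedFirst;
            case 200:
                return 0L;
            default:
                throw new IllegalStateException("Unexpected status for ranged GET: " + status);
        }
    }

    /** Bytes a reader must discard from the body before reaching the requested offset. */
    static long bytesToSkip(int status, long requestedFirst) {
        return requestedFirst - bodyStartOffset(status, requestedFirst);
    }

    public static void main(String[] args) {
        long first = 9965269L;  // range start from the log above
        long last = 19996073L;  // range end from the log above
        System.out.println(rangeLength(first, last));  // 10030805, matching "Fetching 10030805 bytes"
        System.out.println(bytesToSkip(206, first));   // 0
        System.out.println(bytesToSkip(200, first));   // 9965269
    }
}
```

If the server really answered 200 with the full object here, the reader for the second split would have to skip 9965269 bytes before reaching its data; reading from the start of the body instead would be one way to end up with missing records.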

Matthew Oliver (matt-0)
Changed in swift:
status: New → Incomplete