Sahara + Swift + DLO: file not found

Bug #1639819 reported by Frédéric Gaudet
This bug affects 1 person
Affects: Sahara
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

I'm running Liberty, with Sahara 4.0.0 from the CentOS 7 repo. It is a vanilla installation, except for the hadoop-swift jar file: I downloaded a fairly recent build because it fixes a previous issue I had (socket timeout while accessing Swift). When I try to process a large file from Swift, I get a FileNotFoundException.

Steps to reproduce:
-------------------

In Swift, upload a big file split into several 1 GB segments. I pass the following argument to my jar file:

Input source file = swift://2mass.sahara/2mass.csv

The saveAsHadoopFile action should read all of the segments from the 2mass_segments container, as specified in the 2mass.csv manifest, but Java exits with the following error: Exception in thread "main" java.io.FileNotFoundException: Not Found
swift://2mass.sahara/2mass.csv/1476283376.280434/62949101353/1073741824/00000000

It seems that the path should be:
swift://2mass_segments.sahara/2mass.csv/1476283376.280434/62949101353/1073741824/00000000
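
For context, the actual cds.xmatch.spark.PreprocessData source is not included here, but the failing pattern boils down to reading the DLO manifest as a Hadoop input and writing the result back to Swift. A minimal sketch of that pattern (class and variable names are hypothetical, and the Swift credentials, fs.swift.service.sahara.*, are assumed to be injected by Sahara through the spark.xml shown further down):

// Minimal sketch only, not the real cds.xmatch.spark.PreprocessData.
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class DloReadSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("DloReadSketch"));

        // The input is the DLO manifest object; computing the input splits is
        // what forces hadoop-swift to list the segments behind the manifest.
        JavaRDD<String> lines = sc.textFile("swift://2mass.sahara/2mass.csv");

        // saveAsHadoopFile is the first action, so the FileNotFoundException
        // surfaces here (see the stack trace further down).
        lines.mapToPair(line -> new Tuple2<>(new Text(line), NullWritable.get()))
             .saveAsHadoopFile("swift://2massoutput.sahara/2mass.rdd",
                               Text.class, NullWritable.class, TextOutputFormat.class);

        sc.stop();
    }
}

The exception is thrown while Spark computes the input splits, i.e. before any data is actually read (the job fails after about 0.3 s).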

Information:
------------

The hadoop-swift MD5:
ubuntu@xmatch-fgaudet-ngt-sparkmaster-0:/usr/lib/hadoop$ md5sum hadoop-swift.jar
cbe1478523ba79497bb41cf22c7f9c79 hadoop-swift.jar

Contents of my Swift containers:
--------------------------------

ubuntu@xmatch-fgaudet-ngt-sparkmaster-0:/tmp/spark-edp/XMatch/2fe16654-58b8-4df2-b0f9-d9f4ab1a942c$ swift list
2mass
2mass_segments

ubuntu@xmatch-fgaudet-ngt-sparkmaster-0:/tmp/spark-edp/XMatch/2fe16654-58b8-4df2-b0f9-d9f4ab1a942c$ swift list --lh 2mass
   0 2016-10-26 08:03:48 None 2mass.csv
   0

Headers:
-------

ubuntu@xmatch-fgaudet-ngt-sparkmaster-0:/tmp/spark-edp/XMatch/2fe16654-58b8-4df2-b0f9-d9f4ab1a942c$ swift stat --lh 2mass 2mass.csv
       Account: v1
     Container: 2mass
        Object: 2mass.csv
  Content Type: binary/octet-stream
Content Length: 58G
 Last Modified: Wed, 26 Oct 2016 08:03:48 GMT
          ETag: "c1a3b941c9a1a41e0347bd50672215ad"
      Manifest: 2mass_segments/2mass.csv/1476283376.280434/62949101353/1073741824/
    Meta Mtime: 1476283376.280434
 Accept-Ranges: bytes
   X-Timestamp: 1477469028.93125
    X-Trans-Id: tx000000000000000003c91-0058208895-10cc84-default
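
The Manifest header above already points at the 2mass_segments container, which is why I expect the segment lookups to be issued against it rather than against 2mass. As a small illustration (this is only a sketch of how a DLO manifest value splits into a segment container and an object prefix, not code taken from hadoop-swift):

// Illustration only: how the Manifest header value maps to the expected
// segment paths. Not taken from the hadoop-swift sources.
public class ManifestPathSketch {
    public static void main(String[] args) {
        // Manifest header format: "<segment container>/<object prefix>"
        String manifest = "2mass_segments/2mass.csv/1476283376.280434/62949101353/1073741824/";

        int slash = manifest.indexOf('/');
        String segmentContainer = manifest.substring(0, slash);  // "2mass_segments"
        String segmentPrefix = manifest.substring(slash + 1);    // "2mass.csv/1476283376.280434/..."

        // First segment, as I would expect it to be looked up:
        System.out.println("swift://" + segmentContainer + ".sahara/" + segmentPrefix + "00000000");
        // -> swift://2mass_segments.sahara/2mass.csv/1476283376.280434/62949101353/1073741824/00000000
        // The stack trace instead shows the lookup going to the manifest's own
        // container (2mass), which is why Swift answers "Not Found".
    }
}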

Spark command line (as Sahara built it):
---------------------------------------

ubuntu@xmatch-fgaudet-ngt-sparkmaster-0:/tmp/spark-edp/XMatch/2fe16654-58b8-4df2-b0f9-d9f4ab1a942c$ more launch_command.log
2016-11-07 13:49:13,329 INFO Running /opt/spark/bin/spark-submit --driver-class-path /usr/lib/hadoop/lib/jackson-core-asl-1.8.8.jar:/usr/lib/hadoop/hadoop-swift.jar: --files spark.xml --class org.openstack.sahara.edp.SparkWrapper --jars builtin-4aa76035-7cc1-404d-ac34-966cc03fef73.jar --master spark://xmatch-fgaudet-ngt-sparkmaster-0:7077 --deploy-mode client cds.xmatch.spark.jar spark.xml cds.xmatch.spark.PreprocessData swift://2mass.sahara/2mass.csv RA 4 5 4096 200 swift://2massoutput.sahara/2mass.rdd

cat stderr:
------------

<---- SNIP --->

16/11/07 13:49:18 INFO scheduler.DAGScheduler: Job 0 failed: saveAsHadoopFile at PreprocessData.java:89, took 0.277151 s
Exception in thread "main" java.io.FileNotFoundException: Not Found swift://2mass.sahara/2mass.csv/1476283376.280434/62949101353/1073741824/00000000
 at org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystemStore.getObjectMetadata(SwiftNativeFileSystemStore.java:240)
 at org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystemStore.listDirectory(SwiftNativeFileSystemStore.java:559)
 at org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystemStore.listSegments(SwiftNativeFileSystemStore.java:1213)
 at org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem.getFileBlockLocations(SwiftNativeFileSystem.java:238)
 at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:332)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

<--- SNIP --->

Thanks, everyone.

Vitalii Gridnev (vgridnev) wrote:

Frédéric Gaudet (frgaudet) wrote:

Yes, it seems so.
