[hadoop-swift] Cannot access Swift Static Large Objects

Bug #1593663 reported by Saverio Proto
This bug affects 5 people
Affects: Sahara
Status: Confirmed
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

When I try to access a 15 GB file on Swift, Hadoop with swiftfs cannot read it. I think it gets confused because the file is split into segments.

ubuntu@masternoded1a418d4-fb0f-11e5-b6c6-fa163ea692ea:~/saverio$ swift stat googlebooks-ngrams-gz-swift eng/googlebooks-eng-all-0gram-20120701-a.gz
       Account: v1
     Container: googlebooks-ngrams-gz-swift
        Object: eng/googlebooks-eng-all-0gram-20120701-a.gz
  Content Type: application/octet-stream
Content Length: 15339202495
 Last Modified: Fri, 17 Jun 2016 05:54:17 GMT
          ETag: d41d8cd98f00b204e9800998ecf8427e
      Manifest: googlebooks%2Dngrams%2Dgz%2Dswift%5Fsegments/eng/googlebooks%2Deng%2Dall%2D0gram%2D20120701%2Da.gz/1466054520.469164168/15339202495
    Meta Mtime: 1448415962
 Accept-Ranges: bytes
    Keep-Alive: timeout=5, max=100
        Server: Apache
    Connection: Keep-Alive
   X-Timestamp: 1466142857.00000
    X-Trans-Id: tx00000000000000003cc2f-005763cd42-2b2a7af-default
ubuntu@masternoded1a418d4-fb0f-11e5-b6c6-fa163ea692ea:~/saverio$

ubuntu@masternoded1a418d4-fb0f-11e5-b6c6-fa163ea692ea:~/saverio$ hadoop jar /usr/lib/hadoop/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -input swift://googlebooks-ngrams-gz-swift.datasets/eng/googlebooks-eng-all-0gram-20120701-a.gz -output swift://results.switchengines/testnumber2 -mapper mapper-ngrams.py -reducer reducer-ngrams.py -numReduceTasks 1
16/06/17 08:38:25 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/06/17 08:38:25 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/06/17 08:38:25 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
16/06/17 08:38:26 INFO Configuration.deprecation: topology.node.switch.mapping.impl is deprecated. Instead, use net.topology.node.switch.mapping.impl
16/06/17 08:38:27 INFO mapred.FileInputFormat: Total input paths to process : 1
16/06/17 08:38:27 INFO mapreduce.JobSubmitter: Cleaning up the staging area file:/app/hadoop/tmp/mapred/staging/ubuntu366412939/.staging/job_local366412939_0001
16/06/17 08:38:27 ERROR streaming.StreamJob: Error launching job , bad input path : Not Found swift://googlebooks-ngrams-gz-swift.datasets/eng/googlebooks-eng-all-0gram-20120701-a.gz/1466054520.469164168/15339202495/00000000
Streaming Command Failed!
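
For triage, the percent-encoded Manifest value from the swift stat output can be decoded and compared with the path in the "Not Found" error. The following is a minimal sketch, not part of hadoop-swiftfs; the helper name split_manifest is hypothetical, and it assumes the Manifest line corresponds to the object's X-Object-Manifest header, whose value has the form <container>/<prefix>.

```python
import hashlib
from urllib.parse import unquote, urlparse

# Values copied verbatim from the swift stat output and the error message above.
MANIFEST = ("googlebooks%2Dngrams%2Dgz%2Dswift%5Fsegments/eng/"
            "googlebooks%2Deng%2Dall%2D0gram%2D20120701%2Da.gz/"
            "1466054520.469164168/15339202495")
FAILED_PATH = ("swift://googlebooks-ngrams-gz-swift.datasets/eng/"
               "googlebooks-eng-all-0gram-20120701-a.gz/"
               "1466054520.469164168/15339202495/00000000")
ETAG = "d41d8cd98f00b204e9800998ecf8427e"


def split_manifest(manifest):
    """Hypothetical helper: decode a manifest value into (container, prefix)."""
    decoded = unquote(manifest)
    container, _, prefix = decoded.partition("/")
    return container, prefix


container, prefix = split_manifest(MANIFEST)

# The ETag equals the MD5 of empty input, so the manifest object itself holds no data.
empty_md5 = hashlib.md5(b"").hexdigest()
manifest_is_empty = (ETAG == empty_md5)

# Compare the object path Hadoop asked for with the segment prefix.
requested = urlparse(FAILED_PATH).path.lstrip("/")
looks_like_segment = requested.startswith(prefix + "/")

print("segment container:", container)
print("segment prefix:   ", prefix)
print("manifest is empty:", manifest_is_empty)
print("failed path is a segment name:", looks_like_segment)
```

The decoded manifest points at the container googlebooks-ngrams-gz-swift_segments, while the failing path is the segment prefix plus /00000000 resolved against the original googlebooks-ngrams-gz-swift container. That is consistent with the segment being looked up in the wrong container, although the output above does not by itself confirm the root cause.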

Changed in sahara:
importance: Undecided → High
tags: added: hadoop.swift.lib
Changed in sahara:
status: New → Confirmed
Saverio Proto (zioproto)
summary: - [hadoop-swift] Cannot access SWIFT Static Large Objects
+ [hadoop-swift] Cannot access Swift Static Large Objects
Changed in sahara:
milestone: none → next
milestone: next → none
John Dickinson (notmyname) wrote :

How can I help diagnose this from the swift side?

Saverio Proto (zioproto) wrote :

While trying to reproduce the bug, I found that copying the file with

hadoop dfs -cp swift://googlebooks-ngrams-gz-swift.switchengines/eng/googlebooks-eng-all-0gram-20120701-a.gz .

works without problems.

The bug is triggered when accessing swiftfs through hadoop-streaming-2.7.1.jar.

Saverio Proto (zioproto) wrote :

Hello,

I retested with the Hortonworks distribution.

Using the jars:
/usr/hdp/2.4.3.0-227/hadoop-mapreduce/hadoop-openstack-2.7.1.2.4.3.0-227.jar
/usr/hdp/2.4.3.0-227/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.3.0-227.jar

I cannot reproduce the bug.

But, if I compile the hadoop-openstack jar from the Sahara repository, then I can reproduce the bug.

The code here works:
https://github.com/hortonworks/hadoop-release/tree/HDP-2.5.2.1-tag/hadoop-tools/hadoop-openstack

But the code here does not work:
https://github.com/openstack/sahara-extra/tree/master/hadoop-swiftfs

ubuntu@ambari1:~/hadoop-swift-tutorial$ sudo -u hdfs -i hadoop jar /usr/hdp/2.4.3.0-227/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.3.0-227.jar -input swift://googlebooks-ngrams-gz-swift.switchengines/eng/oglebooks-eng-all-0gram-20120701-a.gz -output /switch/testnumber4 -mapper mapper-ngrams.py -reducer reducer-ngrams.py -file /home/ubuntu/hadoop-swift-tutorial/mapper-ngrams.py -file /home/ubuntu/hadoop-swift-tutorial/reducer-ngrams.py -numReduceTasks 1
WARNING: Use "yarn jar" to launch YARN applications.
16/12/05 17:04:48 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/home/ubuntu/hadoop-swift-tutorial/mapper-ngrams.py, /home/ubuntu/hadoop-swift-tutorial/reducer-ngrams.py] [/usr/hdp/2.4.3.0-227/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.3.0-227.jar] /tmp/streamjob1748730503238342006.jar tmpDir=null
16/12/05 17:04:49 INFO impl.TimelineClientImpl: Timeline service address: http://ambari2:8188/ws/v1/timeline/
16/12/05 17:04:49 INFO client.RMProxy: Connecting to ResourceManager at ambari2/10.0.192.10:8050
16/12/05 17:04:49 INFO impl.TimelineClientImpl: Timeline service address: http://ambari2:8188/ws/v1/timeline/
16/12/05 17:04:49 INFO client.RMProxy: Connecting to ResourceManager at ambari2/10.0.192.10:8050
16/12/05 17:04:49 ERROR streaming.StreamJob: Error Launching job : Output directory hdfs://ambari1:8020/switch/testnumber4 already exists
Streaming Command Failed!
ubuntu@ambari1:~/hadoop-swift-tutorial$ sudo -u hdfs -i hadoop jar /usr/hdp/2.4.3.0-227/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.3.0-227.jar -input swift://googlebooks-ngrams-gz-swift.switchengines/eng/googlebooks-eng-all-0gram-20120701-a.gz -output /switch/testnumber5 -mapper mapper-ngrams.py -reducer reducer-ngrams.py -file /home/ubuntu/hadoop-swift-tutorial/mapper-ngrams.py -file /home/ubuntu/hadoop-swift-tutorial/reducer-ngrams.py -numReduceTasks 1
WARNING: Use "yarn jar" to launch YARN applications.
16/12/05 17:05:33 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/home/ubuntu/hadoop-swift-tutorial/mapper-ngrams.py, /home/ubuntu/hadoop-swift-tutorial/reducer-ngrams.py] [/usr/hdp/2.4.3.0-227/hadoop-mapreduce/hadoop-streaming-2.7.1.2.4.3.0-227.jar] /tmp/streamjob3486478259414753576.jar tmpDir=null
16/12/05 17:05:34 INFO impl.TimelineClientImpl: Timeline service address: http://ambari2:8188/ws/v1/timeline/
16/12/05 17:05:34 INFO client.RMProxy: Connecting to ResourceManager at ambari2/10.0.192.10:8050
16/12/05 17:05:35 INFO impl.TimelineClientImpl: Timeline service address: h...


Luigi Toscano (ltoscano)
tags: added: sahara-extra