
Exercise 6

1. Hadoop Word Count


The next program to test is the Hadoop WordCount program. This example reads text files and
counts how often words occur. The input is text files and the output is text files, each line of
which contains a word and the count of how often it occurred, separated by a tab.
Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the
word and 1. Each reducer sums the counts for each word and emits a single key/value pair of the
word and the sum. As an optimization, the reducer is also used as a combiner on the map outputs.
This reduces the amount of data sent across the network by combining the counts for each word
into a single record.
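To make the data flow concrete, here is a minimal sketch of the same logic as a pair of Hadoop
Streaming scripts in Python. This is only an illustration of the algorithm, not the Java code
inside hadoop-examples.jar, and the script names mapper.py and reducer.py are placeholders.

mapper.py:

#!/usr/bin/env python
import sys

# Emit "word<TAB>1" for every word on every input line.
for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))

reducer.py:

#!/usr/bin/env python
import sys

# The shuffle delivers lines sorted by key, so identical words arrive
# in consecutive runs; sum each run and emit one total per word.
current_word = None
current_count = 0
for line in sys.stdin:
    word, count = line.strip().split('\t')
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))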
Before you can run the example, you will have to copy some data into the distributed filesystem
(HDFS). Here we will create an input directory and copy in the complete works of Shakespeare
and the Bible (a standard large corpus for text mining).
The data file is also available for download; make sure to gunzip it after downloading: bible+shakes.nopunc.gz
$ hadoop fs -mkdir /user/USERNAME/wordcount

$ hadoop fs -mkdir /user/USERNAME/wordcount/input

$ hadoop fs -put /bluearc/data/schatz/data/textmining/bible+shakes.nopunc \
    /user/USERNAME/wordcount/input

To run the example, the command syntax is:

$ hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount \
    /user/USERNAME/wordcount/input \
    /user/USERNAME/wordcount/output

After this completes, download the results to your local directory like this:
$ hadoop fs -get /user/USERNAME/wordcount/output output

Question: What are the top 10 most frequently used words in the corpus?
Hint: Use the Unix commands sort and head to scan the output file.
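For example, something like the following should work once the output has been downloaded. The
part-* pattern is an assumption about Hadoop's output file naming (typically part-00000 or
part-r-00000, depending on the version), and -k2,2nr sorts numerically, in descending order, on
the tab-separated count column:

$ sort -k2,2nr output/part-* | head -10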

2. Hadoop Kmer Counting


The next exercise will be to implement a k-mer counter using Hadoop. Conceptually this is very
similar to the WordCount program, but since there are no spaces in a genome sequence, we will
count overlapping k-mers instead of discrete words.
The idea is that if the genome is:

>chr1
ACACACAGT
and we are counting 3-mers, then your map function will output:

ACA 1
CAC 1
ACA 1
CAC 1
ACA 1
CAG 1
AGT 1

The shuffle phase will then sort the pairs so that identical keys come right after each other:

ACA 1
ACA 1
ACA 1
AGT 1
CAC 1
CAC 1
CAG 1

And your reducer will output:

ACA 3
AGT 1
CAC 2
CAG 1

You can implement this in Java, using the WordCount program as an example, or you can use
Hadoop Streaming to implement it in any language you would like.
The Hadoop Streaming documentation describes how to use it:
https://hadoop.apache.org/docs/r1.2.1/streaming.html
And here is a nice tutorial using Python:
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
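As a starting point, here is a minimal sketch of a streaming k-mer mapper in Python; the reducer
can be the same summing reducer as in the WordCount sketch above. The script name
kmer_mapper.py is a placeholder, and for simplicity this per-line mapper skips FASTA header
lines and ignores k-mers that span line breaks in the sequence file:

kmer_mapper.py:

#!/usr/bin/env python
import sys

K = 3  # set to 9 for the E. coli question below

for line in sys.stdin:
    line = line.strip()
    # Skip FASTA headers (lines starting with '>') and blank lines.
    if not line or line.startswith('>'):
        continue
    seq = line.upper()
    # Emit every overlapping k-mer on this line with a count of 1.
    for i in range(len(seq) - K + 1):
        print('%s\t%d' % (seq[i:i+K], 1))

A typical streaming invocation looks something like the following; the exact path to the
streaming jar varies by installation, and kmer_reducer.py stands in for your summing reducer:

$ hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar \
    -input /user/USERNAME/kmercount/input \
    -output /user/USERNAME/kmercount/output \
    -mapper kmer_mapper.py -reducer kmer_reducer.py \
    -file kmer_mapper.py -file kmer_reducer.py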
The genome file is available here: ecoli.fa.gz
Question: What are the top 10 most frequently occurring 9-mers in E. coli?
