Hadoop MapReduce v2 Cookbook - Second Edition
About this ebook
- Process large and complex datasets using next generation Hadoop
- Install, configure, and administer MapReduce programs and learn what’s new in MapReduce v2
- More than 90 Hadoop MapReduce recipes presented in a simple and straightforward manner, with step-by-step instructions and real-world examples
If you are a Big Data enthusiast and wish to use Hadoop v2 to solve your problems, then this book is for you. It is aimed at Java programmers with little to moderate knowledge of Hadoop MapReduce and also serves as a one-stop reference for developers and system administrators who want to get up to speed with Hadoop v2 quickly. A basic knowledge of software development using Java and a basic working knowledge of Linux will be helpful.
Table of Contents
Hadoop MapReduce v2 Cookbook Second Edition
Credits
About the Author
Acknowledgments
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started with Hadoop v2
Introduction
Hadoop Distributed File System – HDFS
Hadoop YARN
Hadoop MapReduce
Hadoop installation modes
Setting up Hadoop v2 on your local machine
Getting ready
How to do it...
How it works...
Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode
Getting ready
How to do it...
How it works...
There's more...
See also
Adding a combiner step to the WordCount MapReduce program
How to do it...
How it works...
There's more...
Setting up HDFS
Getting ready
How to do it...
See also
Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2
Getting ready
How to do it...
How it works...
See also
Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution
Getting ready
How to do it...
There's more...
HDFS command-line file operations
Getting ready
How to do it...
How it works...
There's more...
Running the WordCount program in a distributed cluster environment
Getting ready
How to do it...
How it works...
There's more...
Benchmarking HDFS using DFSIO
Getting ready
How to do it...
How it works...
There's more...
Benchmarking Hadoop MapReduce using TeraSort
Getting ready
How to do it...
How it works...
2. Cloud Deployments – Using Hadoop YARN on Cloud Environments
Introduction
Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce
Getting ready
How to do it...
See also
Saving money using Amazon EC2 Spot Instances to execute EMR job flows
How to do it...
There's more...
See also
Executing a Pig script using EMR
How to do it...
There's more...
Starting a Pig interactive session
Executing a Hive script using EMR
How to do it...
There's more...
Starting a Hive interactive session
See also
Creating an Amazon EMR job flow using the AWS Command Line Interface
Getting ready
How to do it...
There's more...
See also
Deploying an Apache HBase cluster on Amazon EC2 using EMR
Getting ready
How to do it...
See also
Using EMR bootstrap actions to configure VMs for the Amazon EMR jobs
How to do it...
There's more...
Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
How to do it...
How it works...
See also
3. Hadoop Essentials – Configurations, Unit Tests, and Other APIs
Introduction
Optimizing Hadoop YARN and MapReduce configurations for cluster deployments
Getting ready
How to do it...
How it works...
There's more...
Shared user Hadoop clusters – using Fair and Capacity schedulers
How to do it...
How it works...
There's more...
Setting classpath precedence to user-provided JARs
How to do it...
How it works...
Speculative execution of straggling tasks
How to do it...
There's more...
Unit testing Hadoop MapReduce applications using MRUnit
Getting ready
How to do it...
See also
Integration testing Hadoop MapReduce applications using MiniYarnCluster
Getting ready
How to do it...
See also
Adding a new DataNode
Getting ready
How to do it...
There's more...
Rebalancing HDFS
See also
Decommissioning DataNodes
How to do it...
How it works...
See also
Using multiple disks/volumes and limiting HDFS disk usage
How to do it...
Setting the HDFS block size
How to do it...
There's more...
See also
Setting the file replication factor
How to do it...
How it works...
There's more...
See also
Using the HDFS Java API
How to do it...
How it works...
There's more...
Configuring the FileSystem object
Retrieving the list of data blocks of a file
4. Developing Complex Hadoop MapReduce Applications
Introduction
Choosing appropriate Hadoop data types
How to do it...
There's more...
See also
Implementing a custom Hadoop Writable data type
How to do it...
How it works...
There's more...
See also
Implementing a custom Hadoop key type
How to do it...
How it works...
See also
Emitting data of different value types from a Mapper
How to do it...
How it works...
There's more...
See also
Choosing a suitable Hadoop InputFormat for your input data format
How to do it...
How it works...
There's more...
See also
Adding support for new input data formats – implementing a custom InputFormat
How to do it...
How it works...
There's more...
See also
Formatting the results of MapReduce computations – using Hadoop OutputFormats
How to do it...
How it works...
There's more...
Writing multiple outputs from a MapReduce computation
How to do it...
How it works...
Using multiple input data types and multiple Mapper implementations in a single MapReduce application
See also
Hadoop intermediate data partitioning
How to do it...
How it works...
There's more...
TotalOrderPartitioner
KeyFieldBasedPartitioner
Secondary sorting – sorting Reduce input values
How to do it...
How it works...
See also
Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
How to do it...
How it works...
There's more...
Distributing archives using the DistributedCache
Adding resources to the DistributedCache from the command line
Adding resources to the classpath using the DistributedCache
Using Hadoop with legacy applications – Hadoop streaming
How to do it...
How it works...
There's more...
See also
Adding dependencies between MapReduce jobs
How to do it...
How it works...
There's more...
Hadoop counters to report custom metrics
How to do it...
How it works...
5. Analytics
Introduction
Simple analytics using MapReduce
Getting ready
How to do it...
How it works...
There's more...
Performing GROUP BY using MapReduce
Getting ready
How to do it...
How it works...
Calculating frequency distributions and sorting using MapReduce
Getting ready
How to do it...
How it works...
There's more...
Plotting the Hadoop MapReduce results using gnuplot
Getting ready
How to do it...
How it works...
There's more...
Calculating histograms using MapReduce
Getting ready
How to do it...
How it works...
Calculating scatter plots using MapReduce
Getting ready
How to do it...
How it works...
Parsing a complex dataset with Hadoop
Getting ready
How to do it...
How it works...
There's more...
Joining two datasets using MapReduce
Getting ready
How to do it...
How it works...
6. Hadoop Ecosystem – Apache Hive
Introduction
Getting started with Apache Hive
How to do it...
See also
Creating databases and tables using Hive CLI
Getting ready
How to do it...
How it works...
There's more...
Hive data types
Hive external tables
Using the describe formatted command to inspect the metadata of Hive tables
Simple SQL-style data querying using Apache Hive
Getting ready
How to do it...
How it works...
There's more...
Using Apache Tez as the execution engine for Hive
See also
Creating and populating Hive tables and views using Hive query results
Getting ready
How to do it...
Utilizing different storage formats in Hive – storing table data using ORC files
Getting ready
How to do it...
How it works...
Using Hive built-in functions
Getting ready
How to do it...
How it works...
There's more...
See also
Hive batch mode – using a query file
How to do it...
How it works...
There's more...
See also
Performing a join with Hive
Getting ready
How to do it...
How it works...
See also
Creating partitioned Hive tables
Getting ready
How to do it...
Writing Hive User-defined Functions (UDF)
Getting ready
How to do it...
How it works...
HCatalog – performing Java MapReduce computations on data mapped to Hive tables
Getting ready
How to do it...
How it works...
HCatalog – writing data to Hive tables from Java MapReduce computations
Getting ready
How to do it...
How it works...
7. Hadoop Ecosystem II – Pig, HBase, Mahout, and Sqoop
Introduction
Getting started with Apache Pig
Getting ready
How to do it...
How it works...
There's more...
See also
Joining two datasets using Pig
How to do it...
How it works...
There's more...
Accessing Hive table data in Pig using HCatalog
Getting ready
How to do it...
There's more...
See also
Getting started with Apache HBase
Getting ready
How to do it...
There's more...
See also
Data random access using Java client APIs
Getting ready
How to do it...
How it works...
Running MapReduce jobs on HBase
Getting ready
How to do it...
How it works...
Using Hive to insert data into HBase tables
Getting ready
How to do it...
See also
Getting started with Apache Mahout
How to do it...
How it works...
There's more...
Running K-means with Mahout
Getting ready
How to do it...
How it works...
Importing data to HDFS from a relational database using Apache Sqoop
Getting ready
How to do it...
Exporting data from HDFS to a relational database using Apache Sqoop
Getting ready
How to do it...
8. Searching and Indexing
Introduction
Generating an inverted index using Hadoop MapReduce
Getting ready
How to do it...
How it works...
There's more...
Outputting a randomly accessible indexed InvertedIndex
See also
Intradomain web crawling using Apache Nutch
Getting ready
How to do it...
See also
Indexing and searching web documents using Apache Solr
Getting ready
How to do it...
How it works...
See also
Configuring Apache HBase as the backend data store for Apache Nutch
Getting ready
How to do it...
How it works...
See also
Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
Getting ready
How to do it...
How it works...
See also
Elasticsearch for indexing and searching
Getting ready
How to do it...
How it works...
See also
Generating the in-links graph for crawled web pages
Getting ready
How to do it...
How it works...
See also
9. Classifications, Recommendations, and Finding Relationships
Introduction
Performing content-based recommendations
How to do it...
How it works...
There's more...
Classification using the naïve Bayes classifier
How to do it...
How it works...
Assigning advertisements to keywords using the Adwords balance algorithm
How to do it...
How it works...
There's more...
10. Mass Text Data Processing
Introduction
Data preprocessing using Hadoop streaming and Python
Getting ready
How to do it...
How it works...
There's more...
See also
De-duplicating data using Hadoop streaming
Getting ready
How to do it...
How it works...
See also
Loading large datasets to an Apache HBase data store – importtsv and bulkload
Getting ready
How to do it...
How it works...
There's more...
Data de-duplication using HBase
See also
Creating TF and TF-IDF vectors for the text data
Getting ready
How to do it...
How it works...
See also
Clustering text data using Apache Mahout
Getting ready
How to do it...
How it works...
See also
Topic discovery using Latent Dirichlet Allocation (LDA)
Getting ready
How to do it...
How it works...
See also
Document classification using Mahout Naive Bayes Classifier
Getting ready
How to do it...
How it works...
See also
Index
Hadoop MapReduce v2 Cookbook Second Edition
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: January 2013
Second edition: February 2015
Production reference: 1200215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78328-547-1
www.packtpub.com
Cover image by Jarek Blaminsky (<milak6@wp.pl>)
Credits
Authors
Thilina Gunarathne
Srinath Perera
Reviewers
Skanda Bhargav
Randal Scott King
Dmitry Spikhalskiy
Jeroen van Wilgenburg
Shinichi Yamashita
Commissioning Editor
Edward Gordon
Acquisition Editor
Joanne Fitzpatrick
Content Development Editor
Shweta Pant
Technical Editors
Indrajit A. Das
Pankaj Kadam
Copy Editors
Puja Lalwani
Alfida Paiva
Laxmi Subramanian
Project Coordinator
Shipra Chawhan
Proofreaders
Bridget Braund
Maria Gould
Paul Hindle
Bernadette Watkins
Indexer
Priya Sane
Production Coordinator
Nitesh Thakur
Cover Work
Nitesh Thakur
About the Author
Thilina Gunarathne is a senior data scientist at KPMG LLP. He led the Hadoop-related efforts at Link Analytics before its acquisition by KPMG LLP. He has extensive experience in using Apache Hadoop and its related technologies for large-scale data-intensive computations. He coauthored the first edition of this book, Hadoop MapReduce Cookbook, with Dr. Srinath Perera.
Thilina has contributed to several open source projects at the Apache Software Foundation as a member, committer, and PMC member. He has also published many peer-reviewed research articles on how to extend the MapReduce model to perform efficient data mining and data analytics computations in the cloud. Thilina received his PhD and MSc degrees in computer science from Indiana University, Bloomington, USA, and his bachelor of science degree in computer science and engineering from the University of Moratuwa, Sri Lanka.
Acknowledgments
I would like to thank my wife, Bimalee, my son, Kaveen, and my daughter, Yasali, for putting up with me for all the missing family time and for providing me with love and encouragement throughout the writing period. I would also like to thank my parents and siblings. Without their love, guidance, and encouragement, I would not be where I am today.
I really appreciate the contributions of my coauthor, Dr. Srinath Perera, to the first edition of this book. Many of his contributions from the first edition have been carried over into the current book, even though he wasn't able to coauthor this edition due to his work and family commitments.
I would like to thank the Hadoop, HBase, Mahout, Pig, Hive, Sqoop, Nutch, and Lucene communities for developing great open source products. Thanks to Apache Software Foundation for fostering vibrant open source communities.
Big thanks to the editorial staff at Packt for providing me with the opportunity to write this book and feedback and guidance throughout the process. Thanks to the reviewers of this book for the many useful suggestions and corrections.
I would like to express my deepest gratitude to all the mentors I have had over the years, including Prof. Geoffrey Fox, Dr. Chris Groer, Dr. Sanjiva Weerawarana, Prof. Dennis Gannon, Prof. Judy Qiu, Prof. Beth Plale, and all my professors at Indiana University and the University of Moratuwa, for all the knowledge and guidance they gave me. Thanks to all my past and present colleagues for the many insightful discussions we've had and the knowledge they shared with me.
About the Author
Srinath Perera (coauthor of the first edition of this book) is a senior software architect at WSO2 Inc., where he oversees the overall WSO2 platform architecture with the CTO. He also serves as a research scientist at the Lanka Software Foundation and teaches as a member of the visiting faculty at the Department of Computer Science and Engineering, University of Moratuwa. He is a cofounder of the Apache Axis2 open source project, has been involved with the Apache Web Services project since 2002, and is a member of the Apache Software Foundation and the Apache Web Services project PMC. Srinath is also a committer on the Apache open source projects Axis, Axis2, and Geronimo.
Srinath received his PhD and MSc in computer science from Indiana University, Bloomington, USA, and his bachelor of science in computer science and engineering from the University of Moratuwa, Sri Lanka.
Srinath has authored many technical and peer-reviewed research articles; more details can be found on his website. He is also a frequent speaker at technical venues.
Srinath has worked with large-scale distributed systems for a long time. He works closely with big data technologies such as Hadoop and Cassandra on a daily basis. He also teaches a parallel programming graduate class at the University of Moratuwa, which is primarily based on Hadoop.
I would like to thank my wife, Miyuru, and my parents, whose never-ending support keeps me going. I would also like to thank Sanjiva from WSO2, who encouraged us to make our mark even though projects such as these are not in our job descriptions. Finally, I would like to thank my colleagues at WSO2 for the ideas and companionship that have shaped the book in many ways.
About the Reviewers
Skanda Bhargav is an engineering graduate from Visvesvaraya Technological University (VTU), Belgaum, Karnataka, India, where he majored in computer science engineering. He is currently employed with Happiest Minds Technologies, a multinational company based in Bangalore. He is a Cloudera-certified developer in Apache Hadoop. His interests are big data and Hadoop.
He has been a reviewer for the following books and a video, all by Packt Publishing:
Instant MapReduce Patterns – Hadoop Essentials How-to
Hadoop Cluster Deployment
Building Hadoop Clusters [Video]
Cloudera Administration Handbook
I would like to thank my family for their immense support and faith in me throughout my learning stage. My friends have given me the confidence to bring out the best in myself. I am happy that God has blessed me with such wonderful people, without whom I wouldn't have tasted the success I've achieved today.
Randal Scott King is a global consultant who specializes in big data and network architecture. His 15 years of experience in IT consulting has resulted in a client list that looks like a Who's Who of the Fortune 500. His recent projects include a complete network redesign for an aircraft manufacturer and an in-store video analytics pilot for a major home improvement retailer.
He lives with his children outside Atlanta, GA. You can visit his blog at www.randalscottking.com.
Dmitry Spikhalskiy is currently a software engineer at the Russian social networking service Odnoklassniki, where he works on the search engine, video recommendation system, and movie content analysis.
Previously, Dmitry took part in developing Mind Labs' platform, infrastructure, and benchmarks for a high-load video conference and streaming service, which earned a Guinness World Record for the biggest online training in the world, with more than 12,000 participants. As a technical lead and architect, he launched a mobile social banking start-up called Instabank. He has also reviewed Learning Google Guice and PostgreSQL 9 Admin Cookbook, both by Packt Publishing.
Dmitry graduated from Moscow State University with an MSc degree in computer science, where he first got interested in parallel data processing, high-load systems, and databases.
Jeroen van Wilgenburg is a software craftsman at JPoint (http://www.jpoint.nl), a software agency based in the center of the Netherlands. Its main focus is on developing high-quality Java and Scala software with open source frameworks.
Currently, Jeroen is developing several big data applications with Hadoop, MapReduce, Storm, Spark, Kafka, MongoDB, and Elasticsearch.
Jeroen is a car enthusiast and likes to be outdoors, usually training for a triathlon. In his spare time, Jeroen writes about his work experience at http://vanwilgenburg.wordpress.com.
Shinichi Yamashita is a solutions architect in the System Platform Sector at NTT DATA Corporation, Japan. He has more than 9 years of experience in software and middleware engineering (Apache, Tomcat, PostgreSQL, the Hadoop ecosystem, and Spark). Shinichi has written a few books on Hadoop in Japanese.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Preface
We are currently facing an avalanche of data, and this data contains many insights that hold the keys to success or failure in the data-driven world. Next generation Hadoop (v2) offers a cutting-edge platform to store and analyze these massive datasets, improving upon the widely used and highly successful Hadoop MapReduce v1. The recipes in this book will provide you with the skills and knowledge needed to process large and complex datasets using the next generation Hadoop ecosystem.
This book presents many exciting topics, such as using MapReduce patterns in Hadoop to solve analytics, classification, and data indexing and searching problems. You will also be introduced to several Hadoop ecosystem components, including Hive, Pig, HBase, Mahout, Nutch, and Sqoop.
This book starts with simple examples and then dives deep into solving in-depth big data use cases. It presents more than 90 ready-to-use Hadoop MapReduce recipes in a simple and straightforward manner, with step-by-step instructions and real-world examples.
What this book covers
Chapter 1, Getting Started with Hadoop v2, introduces Hadoop MapReduce, YARN, and HDFS, and walks through the installation of Hadoop v2.
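As a taste of what Chapter 1 builds up to, the following is a minimal sketch of the classic WordCount program written against the Hadoop v2 MapReduce API. This is an illustrative sketch rather than the book's bundled example code; the class and variable names are our own, though the structure follows the canonical Hadoop WordCount:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Tokenize each input line and emit a (word, 1) pair per token
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all partial counts received for this word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // map-side pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note how the combiner reuses the reducer class to pre-aggregate counts on the map side before data is shuffled across the network; adding exactly this combiner step is one of the Chapter 1 recipes.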
Chapter 2, Cloud Deployments – Using Hadoop YARN on Cloud Environments, explains how to use Amazon Elastic MapReduce (EMR) and Apache Whirr to deploy and execute Hadoop MapReduce, Pig, Hive, and HBase computations on cloud infrastructures.
Chapter 3, Hadoop Essentials – Configurations, Unit Tests, and Other APIs, introduces basic Hadoop YARN and HDFS configurations, HDFS Java API, and unit testing methods for MapReduce applications.
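To give a flavor of the unit testing approach in Chapter 3, here is a minimal MRUnit sketch (an illustration under our own naming, not the book's bundled test code) that exercises the TokenizerMapper from the WordCount sketch above without requiring a running cluster. MRUnit's MapDriver feeds a single input record to the mapper and verifies the emitted key-value pairs:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

    @Test
    public void mapperEmitsOneCountPerToken() throws IOException {
        MapDriver<Object, Text, Text, IntWritable> driver =
                MapDriver.newMapDriver(new WordCount.TokenizerMapper());
        driver.withInput(new LongWritable(0), new Text("hadoop yarn hadoop"))
              .withOutput(new Text("hadoop"), new IntWritable(1))
              .withOutput(new Text("yarn"), new IntWritable(1))
              .withOutput(new Text("hadoop"), new IntWritable(1))
              .runTest(); // verifies the emitted pairs against the expectations, in order
    }
}

Because the driver runs the mapper in-process, such tests catch logic errors in seconds, long before a job is submitted to a cluster.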
Chapter 4, Developing Complex Hadoop MapReduce Applications, introduces you to several advanced Hadoop MapReduce features that will help you develop highly customized and efficient MapReduce applications.
Chapter 5, Analytics, explains how to perform basic data analytic operations using Hadoop MapReduce.
Chapter 6, Hadoop Ecosystem – Apache Hive, introduces Apache Hive, which provides data warehouse capabilities on top of Hadoop, using a SQL-like query language.
Chapter 7, Hadoop Ecosystem II – Pig, HBase, Mahout, and Sqoop, introduces the Apache Pig dataflow-style data-processing language, the Apache HBase NoSQL data store, the Apache Mahout machine learning and data-mining toolkit, and the Apache Sqoop bulk data transfer utility for transferring data between Hadoop and relational databases.
Chapter 8, Searching and Indexing, introduces several tools and techniques that you can use with Apache Hadoop to perform large-scale searching and indexing.
Chapter 9, Classifications, Recommendations, and Finding Relationships, explains how to implement complex algorithms such as classifications, recommendations, and finding relationships using MapReduce.