Beginning Apache Spark 2: With Resilient Distributed Datasets, Spark SQL, Structured Streaming and Spark Machine Learning library
Ebook · 594 pages · 4 hours


About this ebook

Develop applications for the big data landscape with Spark and Hadoop. This book also explains the role of Spark in developing scalable machine learning and analytics applications with Cloud technologies. Beginning Apache Spark 2 gives you an introduction to Apache Spark and shows you how to work with it.
Along the way, you’ll discover resilient distributed datasets (RDDs); use Spark SQL for structured data; and learn stream processing and build real-time applications with Spark Structured Streaming. Furthermore, you’ll learn the fundamentals of Spark ML for machine learning and much more. 

After you read this book, you will have the fundamentals to become proficient in using Apache Spark and know when and how to apply it to your big data applications.  

What You Will Learn  
  • Understand Spark's unified data processing platform
  • Run Spark in the Spark shell or in Databricks
  • Use and manipulate RDDs 
  • Deal with structured data using Spark SQL through its operations and advanced functions
  • Build real-time applications using Spark Structured Streaming
  • Develop intelligent applications with the Spark Machine Learning library

Who This Book Is For
Programmers and developers active in big data, Hadoop, and Java who are new to the Apache Spark platform.
Language: English
Publisher: Apress
Release date: Aug 16, 2018
ISBN: 9781484235799

    Beginning Apache Spark 2 - Hien Luu

    © Hien Luu 2018

    Hien Luu, Beginning Apache Spark 2, https://doi.org/10.1007/978-1-4842-3579-9_1

    1. Introduction to Apache Spark

    Hien Luu, San Jose, California, USA

    There is no better time to learn Spark than now. Spark has become one of the critical components in the big data stack because of its ease of use, speed, and flexibility. This scalable data processing system is being widely adopted across many industries by companies both small and large, including Facebook, Microsoft, Netflix, and LinkedIn. This chapter provides a high-level overview of Spark, including the core concepts, the architecture, and the various components inside the Apache Spark stack.

    Overview

    Spark is a general distributed data processing engine built for speed, ease of use, and flexibility. The combination of these three properties is what makes Spark so popular and widely adopted in the industry.

    The Apache Spark website claims it can run a certain data processing job up to 100 times faster than Hadoop MapReduce. In fact, in 2014, Spark won the Daytona GraySort contest, an industry benchmark for sorting 100TB of data (one trillion records). The submission from Databricks claimed Spark was able to sort 100TB of data three times faster while using ten times fewer resources than the previous world record set by Hadoop MapReduce.

    Since the inception of the Spark project, the ease of use has been one of the main focuses of the Spark creators. It offers more than 80 high-level, commonly needed data processing operators to make it easy for developers, data scientists, and data analysts to build all kinds of interesting data applications. In addition, these operators are available in multiple languages, namely, Scala, Java, Python, and R. Software engineers, data scientists, and data analysts can pick and choose their favorite language to solve large-scale data processing problems with Spark.

    In terms of flexibility, Spark offers a single unified data processing stack that can be used to solve multiple types of data processing workloads, including batch processing, interactive queries, iterative processing needed by machine learning algorithms, and real-time streaming processing to extract actionable insights in near real time. Before Spark existed, each of these workload types required a different solution and technology. Now companies can simply leverage Spark for most of their data processing needs, and standardizing on a single technology stack dramatically reduces operational cost and resources.

    A big data ecosystem consists of many pieces of technology including a distributed storage engine called HDFS, a cluster management system to efficiently manage a cluster of machines, and different file formats to store a large amount of data efficiently in binary and columnar format. Spark integrates really well with the big data ecosystem. This is another reason why Spark adoption has been growing at a really fast pace.

    Another really cool thing about Spark is that it is open source; anyone can download the source code to examine it, figure out how a certain feature was implemented, or extend its functionality. In some cases, this can dramatically reduce the time needed to debug problems.

    History

    Spark started out as a research project at Berkeley AMPLab in 2009. At that time, the researchers on the project observed the inefficiencies of the Hadoop MapReduce framework in handling interactive and iterative data processing use cases, so they came up with ways to overcome those inefficiencies by introducing ideas such as in-memory computation and an efficient way of dealing with fault recovery. Once this research project proved to be a viable solution that outperformed MapReduce, it was open sourced in 2010 and became an Apache top-level project in 2013. A group of researchers working on the project got together and founded a company called Databricks, which raised more than $43 million in 2013; Databricks is the primary commercial steward behind Spark. In 2015, IBM announced a major investment in building a Spark technology center to advance Apache Spark by working closely with the open source community and building Spark into the core of the company's analytics and commerce platforms.

    Two popular research papers about Spark are Spark: Cluster Computing with Working Sets and Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. These papers were well received at academic conferences and provide good foundations for anyone who would like to learn and understand Spark. You can find them at http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf and http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf , respectively.

    Since its inception, the Spark open source project has been an active project with a vibrant community. The number of contributors increased to more than 1,000 in 2016, and there are more than 200,000 members of Apache Spark meetup groups. In fact, the number of Apache Spark contributors has exceeded that of Apache Hadoop, one of the most popular open source projects. Spark is so popular now that it has its own summit, called Spark Summit, which is held annually in North America and Europe. Summit attendance has doubled each year since its inception.

    The creators of Spark selected the Scala programming language for their project because of the combination of Scala’s conciseness and static typing. Now Spark is considered to be one of the largest applications written in Scala, and its popularity certainly has helped Scala to become a mainstream programming language.

    Spark Core Concepts and Architecture

    Before diving into the details of Spark, it is important to have a high-level understanding of the core concepts and the various core components in Spark. This section will cover the following:

    Spark clusters

    The resource management system

    Spark applications

    Spark drivers

    Spark executors

    Spark Clusters and the Resource Management System

    Spark is essentially a distributed system that was designed to process a large volume of data efficiently and quickly. This distributed system is typically deployed onto a collection of machines, which is known as a Spark cluster. A cluster size can be as small as a few machines or as large as thousands of machines. The largest publicly announced Spark cluster in the world has more than 8,000 machines. To efficiently and intelligently manage a collection of machines, companies rely on a resource management system such as Apache YARN or Apache Mesos. The two main components in a typical resource management system are the cluster manager and the worker. The cluster manager knows where the workers are located, how much memory they have, and the number of CPU cores each one has. One of the main responsibilities of the cluster manager is to orchestrate the work by assigning it to each worker. Each worker offers resources (memory, CPU, etc.) to the cluster manager and performs the assigned work. An example of the type of work is to launch a particular process and monitor its health. Spark is designed to easily interoperate with these systems. Most companies that have been adopting big data technologies in recent years usually already have a YARN cluster to run MapReduce jobs or other data processing frameworks such as Apache Pig or Apache Hive.

    Startup companies that fully adopt Spark can just use the out-of-the-box Spark cluster manager to manage a set of dedicated machines to perform data processing using Spark. Figure 1-1 depicts this type of setup.


    Figure 1-1

    Interactions between a Spark application and a cluster manager

    Spark Application

    A Spark application consists of two parts. The first is the application data processing logic expressed using Spark APIs, and the other is the Spark driver. The application data processing logic can be as simple as a few lines of code to perform a few data processing operations or can be as complex as training a large machine learning model that requires many iterations and could run for many hours to complete. The Spark driver is the central coordinator of a Spark application, and it interacts with a cluster manager to figure out which machines to run the data processing logic on. For each one of those machines, the Spark driver requests that the cluster manager launch a process called the Spark executor. Another important job of the Spark driver is to manage and distribute Spark tasks onto each executor on behalf of the application. If the data processing logic requires the Spark driver to display the computed results to a user, then it will coordinate with each Spark executor to collect the computed results and merge them together. The entry point into a Spark application is through a class called SparkSession, which provides facilities for setting up configurations as well as APIs for expressing data processing logic.
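
    To make that entry point concrete, here is a minimal sketch of creating a SparkSession in Scala and getting a handle to the underlying SparkContext; the application name and the local master setting are illustrative choices, not taken from the book:

    import org.apache.spark.sql.SparkSession

    // Build the entry point into a Spark application.
    val spark = SparkSession.builder()
      .appName("IntroApp")      // illustrative application name
      .master("local[*]")       // run locally, using all available CPU cores
      .getOrCreate()

    // The lower-level SparkContext is still available through the session.
    val sc = spark.sparkContext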

    Spark Driver and Executor

    Each Spark executor is a JVM process and is exclusively allocated to a specific Spark application. This was a conscious design decision to avoid sharing a Spark executor between multiple Spark applications in order to isolate them from each other so one badly behaving Spark application wouldn’t affect other Spark applications. The lifetime of a Spark executor is the duration of a Spark application, which could run for a few minutes or for a few days. Since Spark applications are running in separate Spark executors, sharing data between them will require writing the data to an external storage system like HDFS.

    As depicted in Figure 1-2, Spark employs a master-slave architecture, where the Spark driver is the master and the Spark executor is the slave. Each of these components runs as an independent process on a Spark cluster. A Spark application consists of one and only one Spark driver and one or more Spark executors. Playing the slave role, each Spark executor does what it is told, which is to execute the data processing logic in the form of tasks. Each task is executed on a separate CPU core. This is how Spark can speed up the processing of a large amount of data by processing it in parallel. In addition to executing assigned tasks, each Spark executor has the responsibility of caching a portion of the data in memory and/or on disk when it is told to do so by the application logic.

    When launching a Spark application, you can request how many Spark executors the application needs and how much memory and how many CPU cores each executor should have. Figuring out an appropriate number of executors, the amount of memory, and the number of CPU cores requires some understanding of the amount of data that will be processed, the complexity of the data processing logic, and the desired duration by which the Spark application should complete its processing.
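
    These settings are normally supplied when the application is submitted (for example, through spark-submit), but the corresponding configuration properties can also be sketched in code, as below. The values are placeholders for illustration, not recommendations:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ResourceSizedApp")
      .config("spark.executor.instances", "4")  // how many executors to request
      .config("spark.executor.memory", "4g")    // memory per executor
      .config("spark.executor.cores", "2")      // CPU cores per executor
      .getOrCreate()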


    Figure 1-2

    A small Spark cluster with three executors

    Spark Unified Stack

    Unlike its predecessors, Spark provides a unified data processing engine known as the Spark stack. Similar to other well-designed systems, this stack is built on top of a strong foundation called Spark Core, which provides all the necessary functionalities to manage and run distributed applications such as scheduling, coordination, and fault tolerance. In addition, it provides a powerful and generic programming abstraction for data processing called resilient distributed datasets (RDDs). On top of this strong foundation is a collection of components where each one is designed for a specific data processing workload, as shown in Figure 1-3. Spark SQL is for batch as well as interactive data processing. Spark Streaming is for real-time stream data processing. Spark GraphX is for graph processing. Spark MLlib is for machine learning. Spark R is for running machine learning tasks using the R shell.


    Figure 1-3

    Spark unified stack

    This unified engine brings several important benefits to the task of building scalable and intelligent big data applications. First, applications are simpler to develop and deploy because they use a unified set of APIs and run on a single engine. Second, it is far more efficient to combine different types of data processing (batch, streaming, etc.) because Spark can run those different sets of APIs over the same data without writing the intermediate data out to a storage system. Finally, the most exciting benefit is that Spark enables new applications that were not possible before because it makes it easy to compose different types of data processing within a single Spark application. For example, it can run interactive queries over the results of machine learning predictions of real-time data streams. An analogy that everyone can relate to is the smartphone, which combines a powerful camera, a cell phone, and a GPS device. By combining the functions of these components, a smartphone enables innovative applications such as Waze, a traffic and navigation application.

    Spark Core

    Spark Core is the bedrock of the Spark distributed data processing engine. It consists of two parts: the distributed computing infrastructure and the RDD programming abstraction.

    The distributed computing infrastructure is responsible for the distribution, coordination, and scheduling of computing tasks across many machines in the cluster. It makes it possible to process a large volume of data in parallel, efficiently and quickly, on a large cluster of machines. Two other important responsibilities of the distributed computing infrastructure are handling computing task failures and efficiently moving data across machines, which is known as data shuffling. Advanced users of Spark need to have intimate knowledge of the Spark distributed computing infrastructure to be effective at designing highly performant Spark applications.

    The key programming abstraction in Spark is called the RDD, and it is something every Spark developer should have some knowledge of, especially its APIs and main concepts. The technical definition of an RDD is an immutable and fault-tolerant collection of objects partitioned across a cluster that can be manipulated in parallel. Essentially, it provides a set of APIs for Spark application developers to easily and efficiently perform large-scale data processing without worrying about where data resides on the cluster or about machine failures. For example, say you have a 1.5TB log file that resides on HDFS and you need to find out the number of lines containing the word Exception. You can create an instance of an RDD to represent all the log statements in that log file, and Spark can partition them across the nodes in the cluster such that the filtering and counting logic can be executed in parallel to speed up the search.
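
    A minimal sketch of that log-search example follows; the HDFS path is a placeholder, and sc is the SparkContext:

    // Count the lines that contain the word "Exception" in a large log file on HDFS.
    val logLines = sc.textFile("hdfs://.../large.log")
    val exceptionCount = logLines
      .filter(line => line.contains("Exception"))  // keep only matching lines
      .count()                                     // triggers the distributed computation
    println(s"Lines containing Exception: $exceptionCount")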

    The RDD APIs are exposed in multiple programming languages (Scala, Java, and Python), and they allow users to pass local functions to run on the cluster, which is something that is powerful and unique. RDDs will be covered in detail in Chapter 3.

    The rest of the components in the Spark stack are designed to run on top of Spark Core. Therefore, any improvements or optimizations done in Spark Core between versions of Spark will be automatically available to the other components.

    Spark SQL

    Spark SQL is a component built on top of Spark Core, and it is designed for structured data processing at scale. Its popularity has skyrocketed since its inception because it brings a new level of flexibility, ease of use, and performance.

    Structured Query Language (SQL) has been the lingua franca for data processing because it is fairly easy for users to express their intent, and the execution engine then performs the necessary intelligent optimizations. Spark SQL brings that to the world of data processing at the petabyte level. Spark users now can issue SQL queries to perform data processing or use the high-level abstraction exposed through the DataFrames APIs. A DataFrame is effectively a distributed collection of data organized into named columns. This is not a novel idea; in fact, this idea was inspired by data frames in R and Python. An easier way to think about a DataFrame is that it is conceptually equivalent to a table in a relational database.
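
    As a brief sketch of the two equivalent styles, the following assumes a JSON file of people with name and age columns; the path and column names are made up for illustration:

    // DataFrame API
    val people = spark.read.json("hdfs://.../people.json")
    people.filter(people("age") > 21).select("name", "age").show()

    // The same query expressed in SQL against a temporary view
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()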

    Behind the scenes, Spark SQL leverages the Catalyst optimizer to perform the kinds of optimizations that are commonly done in many analytical database engines.

    Another feature Spark SQL provides is the ability to read data from and write data to various structured formats and storage systems, such as JavaScript Object Notation (JSON), comma-separated value (CSV) files, Parquet or ORC files, relational databases, Hive, and others. This feature really helps in elevating the level of versatility of Spark because Spark SQL can be used as a data converter tool to easily convert data from one format to another.
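
    For example, using Spark SQL as a converter might look like the following sketch, which reads a CSV file and writes it back out as Parquet; the paths and options are illustrative:

    val csvData = spark.read
      .option("header", "true")       // the first line contains column names
      .option("inferSchema", "true")  // let Spark guess the column types
      .csv("hdfs://.../input.csv")

    csvData.write.parquet("hdfs://.../output")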

    According to a 2016 Spark survey, Spark SQL was the fastest-growing component. This makes sense because Spark SQL enables a wider audience beyond big data engineers to leverage the power of distributed data processing (i.e., data analysts or anyone who is familiar with SQL).

    The motto for Spark SQL is to write less code, read less data, and let the optimizer do the hard work.

    Spark Structured Streaming and Spark Streaming

    It has been said that "data in motion has equal or greater value than historical data." The ability to process data as it arrives is becoming a competitive advantage for many companies in highly competitive industries. The Spark Streaming module enables the processing of real-time streaming data from various data sources in a high-throughput and fault-tolerant manner. Data can be ingested from sources such as Kafka, Flume, Kinesis, Twitter, HDFS, or TCP sockets.

    The main abstraction in the first generation of the Spark Streaming processing engine is called the discretized stream (DStream), which implements an incremental stream processing model by splitting the input data into small batches (based on a time interval) that are regularly combined with the current processing state to produce new results. In other words, once the incoming data is split into small batches, each batch is treated as an RDD and replicated out onto the cluster so it can be processed accordingly.
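
    A minimal DStream sketch is shown below: a word count over a TCP socket, assuming something is writing text to port 9999 on localhost; the source and the batch interval are illustrative:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))      // split the stream into 10-second batches
    val lines = ssc.socketTextStream("localhost", 9999)  // each batch becomes an RDD of lines
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()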

    Stream processing sometimes involves joining with data at rest, and Spark makes it easy to do so. In other words, combining batch and interactive queries with streaming processing can be easily done in Spark because of the unified Spark stack.

    A new scalable and fault-tolerant streaming processing engine called Structured Streaming was introduced in Spark 2.1, and it was built on top of the Spark SQL engine. This engine further simplifies the life of streaming application developers by letting them express a streaming computation the same way they would express a batch computation on static data. The engine automatically executes the streaming processing logic incrementally and continuously as new streaming data arrives. An important new feature Structured Streaming provides is the ability to process incoming streaming data based on event time, which is necessary for many of the new streaming processing use cases. Another unique feature in the Structured Streaming engine is the end-to-end exactly-once guarantee, which makes a big data engineer's life much easier than before in terms of saving data to a storage system such as a relational database or a NoSQL database.
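
    To illustrate the batch-like programming style, here is a sketch of the same word count expressed with Structured Streaming; the socket source and console sink are demonstration choices, not production recommendations:

    import org.apache.spark.sql.functions._

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    val counts = lines
      .select(explode(split(col("value"), " ")).as("word"))
      .groupBy("word")
      .count()

    val query = counts.writeStream
      .outputMode("complete")   // emit the full updated counts on every trigger
      .format("console")
      .start()

    query.awaitTermination()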

    As this new engine matures, undoubtedly it will enable a new class of streaming processing applications that are easy to develop and maintain.

    According to Reynold Xin, Databricks’ chief architect, the simplest way to perform streaming analytics is not having to reason about streaming.

    Spark MLlib

    In addition to providing more than 50 common machine learning algorithms, the Spark MLlib library provides abstractions for managing and simplifying many of the tasks involved in building machine learning models, such as featurization; pipelines for constructing, evaluating, and tuning models; and model persistence to help move models from development to production.

    Starting with Spark 2.0, the MLlib APIs are based on DataFrames to take advantage of the user-friendliness and the many optimizations provided by the Catalyst and Tungsten components in the Spark SQL engine.

    Machine learning algorithms are iterative in nature, meaning they run through many iterations until a desired objective is achieved. Spark makes it extremely easy to implement those algorithms and run them in a scalable manner through a cluster of machines. Commonly used machine learning algorithms such as classification, regression, clustering, and collaborative filtering are available out of the box for data scientists and engineers to use.
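
    The following sketch shows what the pipeline and persistence abstractions described above can look like in practice; the training DataFrame (with text and label columns), the chosen stages, and the save path are all assumptions made for illustration:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Featurization stages followed by a classifier, chained into a single Pipeline.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(trainingData)   // trainingData: a DataFrame with "text" and "label" columns

    // Persist the fitted model so it can be moved from development to production.
    model.write.overwrite().save("hdfs://.../spark-lr-model")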

    Spark GraphX

    Graph processing operates on data structures consisting of vertices and the edges connecting them. A graph data structure is often used to represent real-life networks of interconnected entities, such as professional social networks on LinkedIn and networks of connected web pages on the Internet. Spark GraphX is a component that enables graph-parallel computations by providing an abstraction of a directed multigraph with properties attached to each vertex and edge. The GraphX component includes a collection of common graph processing algorithms, including PageRank, connected components, shortest paths, and others.
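
    A tiny GraphX sketch follows; the toy follower graph and the tolerance value are made up for illustration:

    import org.apache.spark.graphx.{Edge, Graph}

    // Vertices are (id, attribute) pairs; edges carry an arbitrary attribute (here just 1).
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

    val graph = Graph(users, follows)
    val ranks = graph.pageRank(0.001).vertices   // (vertexId, rank) pairs
    ranks.collect().foreach(println)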

    SparkR

    SparkR is an R package that provides a lightweight front end for using Apache Spark from R. R is a popular statistical programming language that supports data processing and machine learning tasks. However, R was not designed to handle large datasets that can't fit on a single machine. SparkR leverages Spark's distributed computing engine to enable large-scale data analysis using the familiar R shell and the APIs that many data scientists love.

    Apache Spark Applications

    Spark is a versatile, fast, and scalable data processing engine. It was designed as a general-purpose engine from the beginning and has proven that it can be used to solve a wide variety of use cases. Many companies in various industries are using Spark to solve real-life use cases. The following is a small list of applications that were developed using Spark:

    Customer intelligence applications

    Data warehouse solutions

    Real-time streaming solutions

    Recommendation engines

    Log processing

    User-facing services

    Fraud detection

    Example Spark Application

    In the world of big data processing, the canonical example application is a word count application. This tradition started with the introduction of the MapReduce framework. Since then, every big data processing technology book must follow this unwritten tradition by including this canonical example. The problem space in the word count example application is easy for everyone to understand since all it does is count how many times a particular word appears in a given set of documents, whether that is a chapter of a book or hundreds of terabytes of web pages from the Internet. Listing 1-1 contains the word count example written in Spark using Scala APIs.

    val textFiles = sc.textFile("hdfs://...")                // read the text files in the specified folder
    val words = textFiles.flatMap(line => line.split(" "))   // tokenize each line into words
    val wordTuples = words.map(word => (word, 1))            // attach a count of 1 to each word
    val wordCounts = wordTuples.reduceByKey(_ + _)           // sum the counts per word
    wordCounts.saveAsTextFile("hdfs://...")                  // save the result to the specified folder

    Listing 1-1

    Word Count Example Application in Spark in the Scala Language

    There is a lot going on behind these five lines of code. The first line is responsible for reading in the text files in the specified folder. The second line iterates through each line in each of the files, tokenizes each line into an array of words, and finally flattens each array into one word per line. To count the number of words across all the documents, the third line attaches a count of 1 to each word. The fourth line performs the summation of the count of each word. Finally, the last line saves the result in the specified folder. Ideally this gives you a general sense of the ease of use of Spark to perform data processing. Future chapters will go into much more detail about what each of these lines of code does.

    Summary

    In this chapter, you learned the following:

    Apache Spark has certainly produced many sparks since its inception. It has created much excitement and many opportunities in the world of big data. More important, it allows you to create many new and innovative big data applications to solve a diverse set of use cases.

    The three important properties of Spark to note are ease of use, speed, and flexibility.

    The Spark distributed computing infrastructure employs a master-slave architecture. Each Spark application consists of a driver, which plays the master role, and one or more executors, which are the slaves, to process data in parallel.

    Spark provides a unified scalable and distributed data processing engine that can be used for batch processing, interactive and exploratory data processing, real-time streaming processing, training machine learning models and performing predictions, and graph processing.

    Spark applications can be written in multiple programming languages including Scala, Java, Python, and R.

    © Hien Luu 2018

    Hien Luu, Beginning Apache Spark 2, https://doi.org/10.1007/978-1-4842-3579-9_2

    2. Working with Apache Spark

    Hien Luu, San Jose, California, USA

    This chapter provides details about the different ways of working with Spark, including using the Spark shell, submitting a Spark application from the command line, and using a hosted cloud platform called Databricks. The last part of this chapter is geared toward software engineers who want to set up the Apache Spark source code on a local machine to examine the Spark code and learn how certain features were implemented.

    Downloading and Installing Spark

    For the purposes of learning or experimenting with Spark, it is good to install Spark locally on your computer. This way you can easily try the Spark features or test your data processing logic with small datasets. By having Spark locally installed on your laptop, you can learn Spark from anywhere, including in your comfortable living room, on the beach, or at a bar in Mexico.

    Spark is written in the Scala programming language, and it is packaged in such a way that it can be run on both Windows and Unix-like systems (e.g., Linux and macOS). All that is needed is to have Java installed on your computer.

    Setting up a multitenant Spark production cluster requires a lot more details and resources and is beyond the scope of this book.

    Downloading Spark

    The Download section of the Apache Spark website ( http://spark.apache.org/downloads.html ) has detailed instructions for downloading the prepackaged Spark binary file. At the time of writing this book, the latest version is 2.3.0. In terms of the package type, choose the one with the latest version of Hadoop. Figure 2-1 shows the various options for downloading Spark. The important thing is to download the prepackaged binary file because it contains the necessary JAR files to run Spark on your computer. Clicking the link on line 4 will trigger the binary file download process. There is a way to manually package the Spark binary from source code, and the instructions for how to do that will be available later in the chapter.


    Figure 2-1

    Apache Spark download options

    Installing Spark

    Once the binary file is successfully downloaded onto your computer, the next step is to uncompress it. The downloaded file, spark-2.x.x-bin-hadoop2.7.tgz, is a GZIP-compressed tar archive, so you need to use the right tool to uncompress it.

    For Linux or Mac computers, the tar command should already exist, so run the following command to uncompress the archive.
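
    The exact file name depends on the version you downloaded; assuming the 2.3.0 package mentioned earlier, a typical invocation looks like this:

    tar -xzf spark-2.3.0-bin-hadoop2.7.tgz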
