Kafka Streams - Real-time Streams Processing - Prashant Kumar Pandey

Preface

Welcome to the first edition of Kafka Streams: Real-time Stream Processing. I am excited to bring you a valuable resource on Apache Kafka. This book focuses mainly on the new generation of the Kafka Streams library available in Apache Kafka 2.x. The primary focus of this book is on Kafka Streams. However, I will also touch on the other Kafka capabilities and concepts that are necessary to grasp Kafka Streams programming.

Apache Kafka is currently one of the most popular and necessary components of any Bigdata solution. Initially conceived as a messaging queue, Kafka has quickly evolved into a full-fledged streaming platform. As a streaming platform, Kafka provides a highly scalable, fault-tolerant, publish-subscribe pipeline for streams of events. With the addition of the Kafka Streams library, it can now process streams of events in real time with millisecond responses to support a variety of business use cases.

I hope this book gives you a solid foundation for writing modern real-time stream processing applications using Apache Kafka and the Kafka Streams library. In the next section, I will tell you a little bit about my background, who the intended audience of this book is, and how I have organized the material.

About the Author

Prashant Kumar Pandey is passionate about helping people learn and grow in their careers by bridging the gap between their existing and required skills. In his quest to fulfil this mission, he authors books, publishes technical articles, and creates training videos to help IT professionals and students succeed in the industry.

With over 18 years of experience in IT as a developer, architect, consultant, trainer, and mentor, he has worked with international software services organizations on various data-centric and Bigdata projects.

Prashant is a firm believer in lifelong continuous learning and skill development. To popularize the importance of lifelong continuous learning, he started publishing free training videos on his YouTube channel (www.youtube.com/learningjournalin) and conceptualized the idea of creating a Journal of his learning under the banner of Learning Journal. 

He is the founder, lead author, and chief editor of the portal (www.learningjournal.guru), which has offered various skill development courses, training, and technical articles since the beginning of 2018.

About the Book

I am writing Kafka Streams: Real-time Stream Processing to help you understand stream processing in general and apply that skill to Kafka Streams programming. My approach to writing this book is a progressive, common-sense approach to teaching a complex subject. Using this approach, I will help you apply your general ability to perceive, understand, and reason about the concepts as I explain them progressively in this book.

Who should read this book?

Kafka Streams: Real-time Stream Processing is written for software engineers who want to develop a stream processing application using the Kafka Streams library. I am also writing this book for data architects and data engineers who are responsible for designing and building the organization’s data-centric infrastructure. A third group is the managers and architects who do not work directly on the Kafka implementation but work with the people who implement Kafka Streams at the ground level.

What should you already know?

This book assumes that the reader is familiar with the basics of the Java programming language. The source code and examples in this book use Java 8, and I will be using the Java 8 lambda syntax, so experience with lambdas will be helpful.
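
If you have not written lambdas before, the following minimal sketch shows the Java 8 lambda and method-reference syntax that the book's examples rely on. The class name and the list contents are purely illustrative placeholders.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LambdaSyntaxDemo {

    public static void main(String[] args) {
        List<String> topics = Arrays.asList("orders", "returns", "shipments");

        // A lambda expression replaces a one-method anonymous class
        topics.forEach(topic -> System.out.println("Subscribing to " + topic));

        // Lambdas and method references compose with the Java 8 Stream API
        List<String> selected = topics.stream()
                .filter(topic -> topic.startsWith("o"))
                .map(String::toUpperCase)
                .collect(Collectors.toList());

        System.out.println(selected);   // prints [ORDERS]
    }
}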

Kafka Streams is a library that runs on Kafka. Having a good fundamental knowledge of Kafka is essential to get the most out of Kafka Streams. I will cover the essential Kafka concepts for those who are new to Kafka. The book also assumes that you have some familiarity and experience with running and working on the Linux operating system.

Kafka Version

This book is based on the Kafka Streams library available in Apache Kafka 2.x. All the source code and examples in this book are tested on the Apache Kafka 2.1 open source distribution. Some chapters of this book also make use of the Confluent Community edition to explain and demonstrate functionalities that are only available in the Confluent Platform, such as the Schema Registry and the Avro serializer/deserializer. This book is not a replacement for the Kafka documentation. I recommend that you frequently refer to the Javadocs for the Apache Kafka client API as well as the Kafka Streams API. You can also leverage the Confluent Platform documentation for more details.

Source Code Repository

All the examples and source code listings in the different chapters of the book are trimmed to a readable format. The code snippets in the book are not meant to be executed independently. However, a working version of the source code for all the examples is available in the GitHub repository. You can access the GitHub repository at the URL below.

https://github.com/LearningJournal/Kafka-Streams-Real-time-Stream-Processing

Examples and Exercises

Working examples are the most critical tool for converting your knowledge into a skill. I have already included a lot of examples in the book. I also have a video version of some of the examples in this book to help you understand a few critical things, such as setting up your development IDE and executing examples from the GitHub repository. I may also create additional examples and exercises as and when time permits. You can find those extra resources and more content at the publisher’s website. A link to the list of all additional content and resources is available on the book’s webpage at the URL below.

https://www.learningjournal.guru/ebook/kafka-streams-real-time-stream-processing/

Part 1

Bigdata and Stream Processing

In Part 1 of this book, I will talk about the beginning and rise of the Bigdata era in general. We will then discuss how the Bigdata problem has progressed into real-time stream processing requirements. We will also try to understand the importance of real-time stream processing by considering some common industry use cases. Finally, I will close Part 1 of the book by talking about the unique challenges of real-time stream processing and how it differs from batch processing of large volumes of data.

Chapter 1 - Dawn of Bigdata

The internet and the World Wide Web have had a revolutionary impact on our life, culture, commerce, and technology. The search engine has become a necessity in our daily life. However, until the year 1993, the World Wide Web was indexed by hand. Sir Tim Berners-Lee used to edit that list by hand and host it on the CERN web server. Later, Yahoo emerged as the first popular method for people to find web pages on the internet. However, it was based on a web directory rather than a full-text index of the web pages. Around the year 2000, a new player, Google, rose to prominence with an innovative idea: the PageRank algorithm. The PageRank algorithm was based on the number and the domain authority of other websites that link to a given web page. One of the critical requirements for implementing the PageRank algorithm was to crawl and collect a massive amount of data from the World Wide Web. This is the time, and the need, that I would attribute to the birth of Bigdata.

Inception

Google was not only the first to realize the Bigdata problem, but also the first to develop a viable solution and establish a commercially successful business around the invention. At a high level, Google had four main problems to solve.

Discover and crawl the web pages over the internet and capture content/metadata.

Store the massive amount of data collected by the Web Crawler.

Apply the PageRank algorithm on the received data and create an index to be used by the search engine.

Organize the index in a random-access database that can support high-speed queries by the Google Search Engine application.

These four problems are at the core of any data-centric application, and they translate into four categories of the problem domain.

Data Ingestion/Integration

Data Storage and Management

Data Processing and Transformation

Data Access and Retrieval

These problems were not new, and in general, they existed even before the year 2000, when Google was trying to solve them using a new and innovative approach. The industry was already generating, collecting, processing, and accessing data. Database management systems (DBMS) like Oracle and MS SQL Server were at the center of such applications. At the time, some organizations were already managing terabytes of data using systems like Teradata. However, the problem that Google was trying to solve had a combination of three unique attributes that made it much more difficult to address using the DBMSs of the time. Let’s try to understand these attributes that uniquely characterize the Bigdata problem.

Volume

Variety

Velocity

Volume

Creating a Web Crawler was not a big deal for Google. It was a simple program that takes a page URL, retrieves the content of the page, and stores it in the storage system. The main problem in this process was the amount of data that the crawler collects. The Web Crawler was supposed to read all the web pages on the internet and store a copy of those pages in the storage system. Google knew that they would be dealing with a gigantic volume of data, and no DBMS at the time could scale to store and manage that quantity.

Variety

The content of the web pages had no structure. It was not possible to convert them into a row-column format without first storing and processing them. Google knew that they needed a system to deal with raw data files that may come in a variety of formats. The DBMSs of the time had no meaningful support for dealing with raw data.

Velocity

Velocity is one of the most critical characteristics of any usable system. Google needed to acquire data quickly, process it quickly, and put it to use at a fast rate. The speed of the Google Search Engine has been its USP since it came into the search engine business.

It is this speed that is driving the new age of Bigdata applications that need to deal with data in real time. In some cases, the data points are highly time-sensitive, like a patient’s vitals. Such data elements have a limited shelf life: their value diminishes with time, in some cases very quickly. I will come back to velocity once again because velocity is the driving force of a real-time system, the main subject of this book.

Explication

Google successfully solved all the above problems, and they were generous enough to reveal their solution to the rest of the world in a series of three white papers. These three whitepapers talked about Google’s approach to solving data storage, data processing, and data retrieval. 

The first whitepaper was published by Google in the year 2003 (https://ai.google/research/pubs/pub51). It elucidated the architecture of a scalable, distributed filesystem, which they termed the Google File System (GFS). GFS solved their storage problem by leveraging commodity hardware to store data on a distributed cluster.

The second whitepaper was published by Google in the year 2004 (https://ai.google/research/pubs/pub62). This paper revealed the MapReduce (MR) programming model. The MR model adopted a functional programming style, and Google was able to parallelize the MR functions on a large cluster of commodity machines. The MR model solved the problem of processing large quantities of data in a distributed cluster.

The last paper in this series was published by Google in the year 2006 (https://ai.google/research/pubs/pub27898). This paper described the design of a distributed database for managing structured data across thousands of commodity servers. They named it Google Bigtable and used it for storing petabytes of data and serving it at low latency.

All three whitepapers were well received by the open source community, and they formed the basis for the design and development of an analogous open source implementation: Hadoop. GFS was implemented as the Hadoop Distributed File System (HDFS). Google’s MR was implemented as the Hadoop MapReduce programming framework on top of HDFS. Finally, Bigtable was implemented as the Hadoop database, HBase.

Since then, many other solutions have been developed over the Hadoop platform and open sourced by various organizations. Some of the most widely adopted systems are Pig, Hive, and finally Apache Spark. All these solutions were simplifications of and improvements over the Hadoop MapReduce framework, and they shared the common trait of processing large volumes of data on a distributed cluster of commodity hardware.

Expansion

Hadoop and all the other solutions developed over HDFS tried to solve a common problem: processing large volumes of data that were previously stored in distributed storage. This approach is popularly termed batch processing. In the batch processing approach, the data is collected and stored in the distributed system. Then a batch job, written in MapReduce, reads the data and processes it by utilizing the distributed cluster. While you are processing the previously stored data, new data keeps arriving at the storage. You must then run the next batch and take care of combining its results with the outcome of the previous batch. This process goes on in a series of batches.

In the batch processing approach, the outcome is available after a specific time that depends on the frequency of your batches and the time taken by each batch to complete its processing. The insights derived from these data processing batches are valuable. However, all such insights are not equal. Some insights have a much higher value shortly after the data was initially generated, and that value diminishes very fast with time.

Many use cases need to collect the data about the events in real-time as they occur and extract an actionable insight as quickly as possible. For example, a fraud detection system is much more valuable if it can identify a fraudulent transaction even before the transaction completes. Similarly, in a healthcare ICU or in the surgical operation setup, the data from various monitors can be used in real time to generate alerts for nurses and doctors, so they know instantly about changes in a patient’s condition.

In many cases, the data points are highly time-sensitive, and they need to be acted upon in minutes, seconds, or even milliseconds. Enormous amounts of time-sensitive data are being generated from a rapidly growing set of disparate data sources. We will discuss some of the prevalent use cases and data sources for such requirements in the next chapter. However, it is essential to comprehend that the demand for speed in processing time-sensitive data is continuously pushing the limits of Bigdata processing solutions. These requirements are the source of many new and innovative frameworks such as Kafka Streams, Spark Streaming, Apache Storm, Apache Flink, and Apache Samza, and cloud services like Google Cloud Dataflow and Amazon Kinesis. These new solutions are evolving to meet the real-time requirements and are called by many names. Some of the commonly used names are Real-time Stream Processing, Event Processing, and Complex Event Processing. However, in this book, we will simply call it stream processing.

In the next chapter, I will talk about some popular industry use cases of stream processing. It will help you develop a better understanding of real-time stream processing requirements and differentiate them from batch processing needs. You should be able to clearly identify when to use real-time stream processing versus batch processing.

Summary

In this chapter, I talked about the Bigdata problem and how it started. We also talked about the innovative solutions that Google created and then shared with the world. We briefly touched on the open source implementation of the Google whitepapers under the banner of Hadoop. Hadoop grabbed immense attention and popularity among organizations and professionals. We further discussed the growing expectation and need to handle time-sensitive data at speed and why it is fostering new innovations in real-time stream processing. In the next chapter, we will progress our discussion of real-time stream processing to develop a better understanding of the subject area.

Chapter 2 - Real-time Streams

In the previous chapter, I talked about the genesis of Bigdata, and we learned how it is expanding into real-time requirements. In this chapter, we will try to develop a broader sense of the following.

What is a stream?

What do we mean by stream processing?

We will start by developing a general sense of stream processing and elaborate on the subject considering some well-known use cases.

Conception of Events

Bigdata emerged with a focus on storing the data first and then processing large volumes of data afterward. The advent of the Data Lake is the result of the same idea. The Data Lake is the place where data goes to sit, and later you process it in smaller batches.

While many are building upon the data lake, a whole new fleet of technology companies like Uber, Netflix, Twitter, Spotify, and many more start-ups are emerging and differentiating themselves. They are architecting their systems in a non-traditional way that takes everything happening in the business and makes it available as a sequence of events.

The notion of looking at day-to-day business activities as a sequence of events is the foundation of real-time stream processing systems. For example, a retailer has a series of events for orders, returns, shipments, and so on. A bank may have events for customer requests, stock ticks, market indicators, trades, and financial time-series transactions. A popular website might want to track the occurrence of clicks and impressions. Google AdSense is all about matching a bid event to an ad-request event in real time and then monitoring the ad-click events.

The first step of heading towards stream processing is the conception of business activities as events.

Event Streams

An individual event happening in a business is discrete, but these events occur as a continuous stream. An event stream is nothing but a constant flow of separate business events happening in real time, and in this book, we refer to it as a real-time stream of events.

The real "big" in Bigdata is created by the act of capturing each event of the business and putting it to use for decision making. However, here we are talking about a change in perspective: looking at those events as a real-time stream rather than a dataset stored in the data lake. This new perspective of data as a stream of events may seem a little foreign to people who are accustomed to thinking of data as a table in a database or a file in the data lake. The data lake is stationary in nature, but the stream refers to data in motion.

Once you are comfortable with the notion of looking at the data as a real-time stream of events rather than a file or a table, you are ready to dive into the new era of stream processing.

Stream Processing

The notion of processing event streams appears in many different areas, and hence people in various fields may use different vocabulary to refer to the same thing. In general, however, stream processing is the act of continuously incorporating new data to compute a result. Processing a stream is all about dealing with a series of events as they arrive at the stream processing system.

It is essential to understand that a stream has no predetermined beginning or end. It just forms a series of unbounded data packets, such as credit card transactions, clicks on a website, or sensor readings from an IoT device. Stream processing applications continuously incorporate this data into the system and compute various queries over the stream, or perhaps join it with some other datasets, and channel the outcome to external storage.
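
To make the idea concrete, here is a minimal Kafka Streams sketch that continuously incorporates new events to maintain a running count per key. The topic names and the application id are hypothetical; the sketch only illustrates the idea of a never-ending computation over an unbounded stream.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class ClickCountApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-count-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Each record is one click event; the key identifies the page that was clicked
        KStream<String, String> clicks = builder.stream("click-events");

        // The count is continuously updated as new events arrive - the result is never "final"
        KTable<String, Long> clicksPerPage = clicks
                .groupByKey()
                .count();

        // Publish every update of the running count to an output topic
        clicksPerPage.toStream().to("clicks-per-page", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

Notice that the count is never final; every new click event updates the result, which is exactly what distinguishes this style of computation from a batch job.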

Stream Processing Use Cases

Everything I have talked about so far should be good enough to help you develop a general perception of what stream processing means and what we do in a stream processing system. However, to cement the idea and solidify your understanding, I am going to give five example use cases of stream processing. While it is not an exhaustive list, I have arranged them in increasing order of complexity. A typical business may want to start with the first category of use case and progress towards the more complex ones.

Finally, I will talk about a sixth use case in a little more detail to develop a holistic understanding of a stream processing implementation for an internet-scale application.

Incremental ETL

Incremental ETL is one of the most common requirements for enterprises. They turn their operational data stores into a stream of transactions by applying a change data capture (CDC) tool and bring those changes to the data warehouse or data lake at low latency. Figure 2.1 shows a typical implementation of CDC in a data warehouse or data lake setup. In this configuration, the Change Producer continuously monitors the redo logs and produces a stream of events for the Change Consumer on the other side. The consumer materializes the stream back to the target database. Figure 2.1 also shows a Coordinator. CDC tools often need a Coordinator to configure, coordinate, and monitor the whole process.


Figure 2.1 - CDC Configuration – Turn your database into a Stream

The CDC process may be as simple as continuously bringing data to the data lake, but in many cases, we can also implement a sequence of transformations on the fly. Such data transformations may include data quality checks, data enrichment, and other data enhancements. Incremental ETL is the most basic streaming application, and it is quite popular because it allows you to incorporate new data within seconds, enabling users to query it faster downstream.
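
The sketch below illustrates such on-the-fly transformations with the Kafka Streams DSL. It assumes a hypothetical "db-changes" topic fed by a CDC tool and a "datalake-landing" topic consumed by the data lake loader; the filter and enrichment rules are placeholders, not a prescribed pipeline.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

public class IncrementalEtlTopology {

    // Builds the topology only; it plugs into the same KafkaStreams/Properties
    // setup used in the click-count sketch earlier in this chapter.
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> changes = builder.stream("db-changes");

        changes
                // Data quality check: drop tombstones and empty change records
                .filter((key, value) -> value != null && !value.trim().isEmpty())
                // Data enrichment: stamp each record with a processing timestamp
                .mapValues(value -> value + ",processed_at=" + System.currentTimeMillis())
                // Deliver the cleansed, enriched stream to the landing topic
                .to("datalake-landing");

        return builder.build();
    }
}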

Real-time Reporting

Many organizations use stream processing applications to run real-time dashboards. Real-time dashboarding is a popular category, and some specific use cases are listed below, followed by a small sketch of the kind of aggregation that powers such dashboards.

Infrastructure Monitoring – Every platform team would use a dashboard to monitor the platform KPIs such as storage, CPU, network usage, overall system load, uptime, etc. Another excellent example of monitoring is to watch the adoption of new features as they are rolled out among applications.

Campaign Management – Most e-commerce applications continuously monitor the effects that new campaigns and site changes have on their traffic, such as checking whether a one-day promotion is driving traffic to the site.

Business Dashboards – Businesses use real-time dashboards to monitor KPIs such as a delivery backlog, on-time shipment, loading-time indicators, etc. These KPIs when generated in real-time are critical for planning, organizing and taking in-time decisions for day to day operations.
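
As an illustration of how such dashboard KPIs can be computed, the sketch below maintains a per-key order count over five-minute windows. It assumes a hypothetical "order-events" topic and the Kafka 2.1 (or later) Streams API, and it prints results instead of feeding a real dashboard.

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class DashboardKpiTopology {

    // Builds the topology only; it reuses the KafkaStreams/Properties setup
    // from the click-count sketch earlier in this chapter.
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Each record is one order event, keyed by region (or any other dimension)
        KStream<String, String> orders = builder.stream("order-events");

        orders.groupByKey()
                // Tumbling five-minute windows give the dashboard a rolling KPI
                .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
                .count()
                .toStream()
                // In a real dashboard this would be written to a topic or a store;
                // printing keeps the sketch self-contained
                .foreach((windowedKey, count) ->
                        System.out.println(windowedKey.key() + " @ "
                                + windowedKey.window().start() + " -> " + count));

        return builder.build();
    }
}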

Real-time Alerting

Notification and alerting are probably the most apparent streaming use cases. In a typical example, these applications identify a pattern of events or continuously compare readings against a threshold to trigger an alert. Some specific use cases are listed below, and a small threshold-based sketch follows the list.

Supply Chain Alerts – Real-time notifications are pushed to the mobile devices of the warehouse employees for the fulfilment of critical orders.

Healthcare Monitoring – Patients’ vital readings are continuously monitored remotely against a threshold using sensors, and alerts are generated to provide timely and efficient medical services.

Traffic Monitoring – These systems use traffic data streams to immediately detect incidents and reduce the impact by initiating timely actions.
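
Here is a minimal sketch of the threshold style of alerting, assuming vital readings arrive as numeric strings on a hypothetical "patient-vitals" topic; the threshold value and topic names are placeholders chosen only for illustration.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

public class VitalsAlertTopology {

    private static final double THRESHOLD = 120.0;   // hypothetical alert threshold

    // Builds the topology only; it reuses the KafkaStreams/Properties setup
    // from the click-count sketch earlier in this chapter.
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Key = patient id, value = latest reading as a numeric string
        KStream<String, String> vitals = builder.stream("patient-vitals");

        vitals
                // Keep only the readings that breach the threshold
                .filter((patientId, reading) -> Double.parseDouble(reading) > THRESHOLD)
                // Turn each breach into a human-readable alert message
                .mapValues(reading -> "ALERT: reading " + reading + " exceeded " + THRESHOLD)
                // Downstream consumers push these alerts to pagers, mobile apps, etc.
                .to("patient-alerts");

        return builder.build();
    }
}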

Real-time Decision Making and ML

Real-time decision making involves analysing new inputs and responding to them automatically using business logic or by applying machine learning models. The business logic may be a set of simple business rules or a moderately to highly sophisticated machine learning model, but these systems make clear, concrete, and automated decisions. Some specific examples are listed below.

Fraud detection – A bank wants to automatically verify whether a new transaction on a customer’s credit card represents fraud by applying a decision tree, and to deny the operation if the charge is determined to be fraudulent.

Real-time clinical measurement - Visensia Safety Index is a real-time data-driven clinical measurement. It can make accurate predictions about the patient’s deterioration up to 24 hours earlier. Such predictions should permit medical interventions to save both lives and money.

Real-time bidding – When a user visits a webpage, a request for an advertisement goes to the exchange, where multiple advertisers bid for the ad space in real time. The decision goes in favour of the highest bidder.

Personalized customer experience – Customer experience is critical for every industry, but for our purposes, let’s take an example from an online gaming system. Gaming platforms implement adaptive gameplay to fine-tune the difficulty level, ensuring that the session is sufficiently challenging without being boring or frustrating.

Online Machine Learning and AI

Online machine learning is the real artificial intelligence, and it is the most challenging area of stream processing and machine learning. In this technique, real-time data is used to update the best prediction at each step. In many cases, online learning dynamically adapts to new patterns in the data and optimizes the learning model in real time. Not much work has been done in this area yet, and various research efforts are still underway.

Real-time Data Preparation at Scale

So far, we have seen a bunch of use cases where generating and processing a stream of real-time events makes sense. By now, you should have clarity on where real-time stream processing is applied in
