Data Pipelines with Spark & DataStax Enterprise


  1. 1. Simon Ambridge Data Pipelines With Spark & DSE An Introduction To Building Agile, Flexible and Scalable Big Data and Data Science Pipelines Version 0.8
  2. 2. Certified Apache Cassandra and DataStax enthusiast who enjoys explaining that the traditional approaches to data management just don’t cut it anymore in the new always-on, no-single-point-of-failure, high-volume, high-velocity, real-time distributed data management world. Previously 25 years implementing Oracle relational data management solutions. Certified in Exadata, Oracle Cloud, Oracle Essbase, Oracle Linux and OBIEE. simon.ambridge@datastax.com @stratman1958 Simon Ambridge, Pre-Sales Solution Engineer, DataStax UK
  3. 3. Introduction To Big Data Pipelines
  4. 4. Big Data Pipelining: Classification. Big Data Pipelines can mean different things to different people. Big, Static Data: repeated analysis on a static but massive dataset • An element of research – e.g. genomics, clinical trial, demographic data • Typically repetitive, iterative, shared amongst data scientists for analysis. Fast, Streaming Data: real-time analytics on streaming data • Industrialised or commercial processes – sensors, tick data, bioinformatics, transactional data, real-time personalisation • Happening in real-time, data cannot be dropped or lost
  5. 5. Static Datasets All You Can Eat? Really.
  6. 6. Static Data Analytics : Traditional Tools • Repeated iterations, at each stage • Run/debug cycle can be slow [Diagram: typical traditional ‘static’ data analysis model. Input: Data; stages: Sampling, Modeling, Interpret, Tuning, Reporting, Re-sample; output: Results]
  7. 7. Static Data Analytics : Scale Up Challenges Sampling and analysis often run on a single machine • CPU and memory limitations Limited sampling of a large dataset because of data size limitations • Multiple iterations over large datasets are frequently not an ideal approach
  8. 8. Static Data Analytics : Traditional Scaling DATA (GB) DATA (MB) DATA (TB) Small datasets, small servers Large datasets, large servers
  9. 9. Static Data Analytics: Big Data Problems Data is getting Really Big! • Data volumes are getting larger! • The number of data sources is exploding! • More data is arriving faster! Scaling up is becoming impractical • Physical limits • Data limits • The validity of the analysis becomes obsolete, faster
  10. 10. Static Data Analytics : Big Data Needs We need scalable infrastructure + distributed technologies • Data volumes can be scaled • Distribute the data across multiple low-cost machines • Faster processing • More complex processing • No single point of failure
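
The idea of distributing data across several machines and processing each share independently can be sketched in a few lines. This is a toy, single-process illustration only: the function names, the `crc32` hash and the sample sensor records are assumptions made for the example, and it is not how Cassandra or Spark actually place data.

```python
from collections import defaultdict
from zlib import crc32


def partition(records, num_nodes):
    """Assign each (key, value) record to a node by hashing its key."""
    nodes = defaultdict(list)
    for key, value in records:
        node_id = crc32(key.encode("utf-8")) % num_nodes
        nodes[node_id].append((key, value))
    return nodes


def local_sum(node_records):
    """Work one node can do on its own: sum the values for each key it holds."""
    totals = defaultdict(int)
    for key, value in node_records:
        totals[key] += value
    return dict(totals)


def distributed_sum(records, num_nodes):
    """Combine the per-node results; a key never spans two nodes."""
    merged = {}
    for node_records in partition(records, num_nodes).values():
        merged.update(local_sum(node_records))
    return merged


# Hypothetical sample data
records = [("sensor-a", 3), ("sensor-b", 5), ("sensor-a", 4), ("sensor-c", 1)]
result = distributed_sum(records, num_nodes=3)
print(result)
```

Because every record with the same key lands on the same node, the answer does not change when nodes are added; only the amount of work per node does.
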
  11. 11. Static Data Analytics : DSE Delivers Building a distributed data processing framework can be a complex task! It needs to be: • Scalable • Fast in-memory processing • Replicated for resiliency • Batch and real-time data feeds • Ad-hoc queries DataStax delivers an integrated analytics platform
  12. 12. Cassandra: THE Web, IoT & Cloud Database What is Apache Cassandra? • Very fast • Extremely resilient • Across multiple data centres • No single point of failure • Continuous Availability, Disaster Avoidance • Linear scale • Easy to operate Enterprise Cassandra platform from DataStax
  13. 13. DataStax Enterprise DataStax Enterprise: Editions DataStax Enterprise Standard • DSE Standard is DataStax’s entry level commercial database offering • Represents the minimum recommended to deploy Cassandra in a production environment DataStax Enterprise Max • DSE Max is DataStax’s advanced commercial database offering • Designed for production Cassandra environments that have mixed workload requirements
  14. 14. Spark: THE Analytics Engine What is Apache Spark? • Distributed in-memory analytic processing • Batch and streaming analytics • Fast - 10x-100x faster than Hadoop MapReduce • Rich Scala, Java and Python APIs Tightly integrated with DSE
  15. 15. Spark: Daytona GraySort Contest Daytona GraySort benchmark - tests how fast a system can sort 100 TB of data (1 trillion records) • Previous world record held by Hadoop MapReduce cluster of 2100 nodes, in 72 minutes • 2014: Spark completed the benchmark in 23 minutes on just 206 EC2 nodes = 3X faster using 10X fewer machines • Spark sorted 1 PB (10 trillion records) on 190 machines in < 4 hours. Previous Hadoop MapReduce time of 16 hours on 3800 machines = 4X faster using 20X fewer machines
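
The "3X faster using 10X fewer machines" and "4X faster using 20X fewer machines" claims can be checked directly from the figures quoted on the slide. The helper names below are made up for the example, and the 1 PB run is taken as exactly 4 hours, so its speedup is a lower bound (the slide says "< 4 hours").

```python
def speedup(old_minutes, new_minutes):
    """How many times faster the new run is."""
    return old_minutes / new_minutes


def machine_ratio(old_nodes, new_nodes):
    """How many times fewer machines the new run used."""
    return old_nodes / new_nodes


# 100 TB sort: Hadoop MapReduce (72 min, 2100 nodes) vs Spark (23 min, 206 nodes)
sort_100tb = (speedup(72, 23), machine_ratio(2100, 206))

# 1 PB sort: Hadoop MapReduce (16 h, 3800 machines) vs Spark (< 4 h, 190 machines)
sort_1pb = (speedup(16 * 60, 4 * 60), machine_ratio(3800, 190))

print(f"100 TB: {sort_100tb[0]:.1f}x faster, {sort_100tb[1]:.1f}x fewer machines")
print(f"1 PB:   {sort_1pb[0]:.1f}x faster, {sort_1pb[1]:.1f}x fewer machines")
```
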
  16. 16. DataStax Enterprise: Analytics Integration. Integrated Cassandra and Spark cluster: • Tight integration • Data locality • Microsecond response times • Apache Cassandra for Distributed Persistent Storage • Integrated Apache Spark for Distributed Real-Time Analytics • Analytics nodes close to data - no ETL required. Separate Cassandra cluster and Spark cluster linked by ETL: • Loose integration • Data separate from processing • Millisecond response times. “Latency when transferring data is unavoidable. The trick is to reduce the latency to as close to zero as possible…”
  17. 17. Static Data Analytics : Requirements Valid data pipeline analysis methods must be: Auditable • Reproducible • Documented Controlled • Version control Collaborative • Accessible
  18. 18. Notebooks: Features What are Notebooks? • Drive your data analysis from the browser • Highly interactive • Tight integration with Apache Spark • Handy tools for analysts: • Reproducible visual analysis • Code in Scala, CQL, SparkSQL, Python • Charting – pie, bar, line etc • Extensible with custom libraries
  19. 19. Example: Spark Notebook Cells Markdown Output Controls
  20. 20. Static Data Analytics : Approach Example architecture & requirements 1. Optimised source data format 2. Distributed in-memory analytics 3. Interactive and flexible data analysis tool 4. Persistent data store 5. Visualisation tools
  21. 21. Static Data Analytics : Example ADAM Notebook Persistent Storage OLTP Database Visualisation Genome research platform - ADST (Agile Data Science Toolkit)
  22. 22. Static Data Analytics : Pipeline Process Flow 3. Persistent data storage 2. Interactive, flexible and reproducible analysis 1. Source data 4. Visualise and analyse
  23. 23. Static Data Analytics : Pipeline Scalability • Add more (physical or virtual) nodes as required to add capacity • Container tools ease configuration management and deployment • Scale out quickly
  24. 24. Static Data Analytics : Now • No longer an iterative process constrained by hardware limitations • Now a more scalable, resilient, dynamic, interactive process, easily shareable [Diagram: the new model for large-scale static data analytics. Steps: Load, Analyse, Share; scale and distribute processing]
  25. 25. Real-Time Datasets If it’s Not “Now”, Then It’s Probably Already Too Late
  26. 26. Big Data Pipelining: Why Real-Time? • React to customers faster and with more accuracy • Reduce risk through more accurate understanding of the market • Optimise return on marketing investment • Faster time to market • Improve efficiency In a highly connected world, in most cases ‘real-time’ means data changing at <1s intervals
  27. 27. Big Data Pipelining: Real-Time Analytics • Capture, prepare, and process fast streaming data • Different approach from traditional batch processing • The speed of now – cannot wait • Immediate insight, instant decisions What problem are we trying to solve?
  28. 28. Big Data Pipelining: Real-Time Use Cases Sensor data (IoT) Transactional data User Experience Social media Use cases for streaming analytics
  29. 29. Big Data Analytics: Streams Data tidal waves! Netflix • Ingests petabytes of data per day • Over 1 TRILLION transactions per day (>10 million per second) into DSE Data streams? Data torrent?
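
The per-second rate follows from the daily total by simple division. A quick check, assuming the load is spread evenly across the day (real traffic will peak higher):

```python
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400 seconds
transactions_per_day = 1_000_000_000_000  # 1 trillion, as quoted on the slide

per_second = transactions_per_day / SECONDS_PER_DAY
print(f"{per_second:,.0f} transactions per second on average")
```
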
  30. 30. Big Data Pipelining: Real-Time architecture Analytics in real-time, at scale Fast processing, distributed, in-memory Increasingly using a technology stack comprising Kafka, Spark and Cassandra • Scalable • Distributed • Resilient Streaming analytics architecture - what do we need?
  31. 31. Kafka: Architecture How Does Kafka Work? Kafka “de-couples” producers and consumers in data pipelines. ‘Producers’ send messages to the Kafka cluster, which in turn serves them up to ‘Consumers’ • Kafka maintains feeds of messages in categories called topics • A Kafka cluster comprises one or more servers, each called a broker [Diagram: three Producers, a Kafka Cluster, three Consumers]
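
To make the de-coupling concrete, here is a minimal in-memory sketch. `ToyBroker`, its `send` and `poll` methods and the topic names are hypothetical stand-ins invented for this example; they are not the Kafka client API, and a real cluster adds persistence, partitioning and replication.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a Kafka cluster: topic name -> ordered messages."""

    def __init__(self):
        self._topics = defaultdict(list)

    def send(self, topic, message):
        """Producer side: append a message (raw bytes) to a topic."""
        self._topics[topic].append(message)

    def poll(self, topic, position):
        """Consumer side: return every message at or after `position`."""
        return self._topics[topic][position:]


broker = ToyBroker()

# Producers only know the broker and the topic name, never the consumers.
broker.send("temperature", b"21.5")
broker.send("temperature", b"22.0")
broker.send("rainfall", b"0.3")

# Two independent consumers read the same topic, each from its own position.
dashboard_seen = broker.poll("temperature", 0)
alerting_seen = broker.poll("temperature", 1)
print(dashboard_seen, alerting_seen)
```

Neither side holds a reference to the other, so producers and consumers can be added, removed or restarted independently.
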
  32. 32. Kafka: Streaming With Spark Kafka writes, Spark reads • Topics can have multiple partitions • Each topic partition stored as a log (an ordered set of messages) • Messages are simply byte arrays, so can store any object in any format • Each message in a partition is assigned a unique offset Spark consumes messages as a stream, in micro batches, saved as RDDs [Diagram: Temperature Topic with Partition 0 and Partition 1 (messages 1-8 each), Rainfall Topic with Partition 0 (messages 1-6); two Temperature Consumers and one Rainfall Consumer]
  33. 33. DataStax Enterprise: Streaming Schematic Sensor Network Signal Aggregation Services Messaging Queue Sensor Data Queue Management Broker Broker Collection Service Data Storage OLTP Persistence Layer Streaming Data Ingest
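
The partition log, its offsets and micro-batch consumption can be sketched as follows. `PartitionLog` and `micro_batches` are hypothetical names made up for illustration, not Kafka or Spark APIs. Two simplifying assumptions: offsets start at 0, and batches are cut by message count, whereas Spark Streaming cuts them by time interval.

```python
class PartitionLog:
    """One topic partition: an append-only, ordered list of byte-array messages."""

    def __init__(self):
        self._messages = []

    def append(self, payload):
        """Store a message and return the offset it was assigned."""
        self._messages.append(payload)
        return len(self._messages) - 1

    def read(self, offset, max_messages):
        """Return up to `max_messages` messages starting at `offset`."""
        return self._messages[offset:offset + max_messages]


def micro_batches(log, batch_size):
    """Yield (starting offset, messages) pairs until the log is exhausted."""
    offset = 0
    while True:
        batch = log.read(offset, batch_size)
        if not batch:
            return
        yield offset, batch
        offset += len(batch)


log = PartitionLog()
# Hypothetical temperature readings, serialised to bytes before being stored
offsets = [log.append(str(t).encode()) for t in (20.5, 21.0, 21.5, 22.0, 22.5)]
batches = list(micro_batches(log, batch_size=2))
print(offsets)
print(batches)
```

Because the consumer tracks only an offset, it can resume from where it stopped, or replay the partition from an earlier offset, without the producer being involved.
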
  34. 34. DataStax Enterprise: Streaming Analytics Real-time Analytics Persistent Storage OLTP Database !$£€! Personalisation Actionable insight Monitoring Web / Analytics / BI
  35. 35. DataStax Enterprise: Multi-DC Uses • Real-time active-active geo-replication across physical datacentres (DC: EUROPE, DC: USA) • Workload separation via virtual datacentres (OLTP: Cassandra; Analytics: Cassandra + Spark) [Diagram: rings of numbered nodes in each datacentre, with replication between them]
  36. 36. Real-Time Analytics: DSE Multi-DC Workload Management and Separation With DSE Analytics / BI Analytics Datacentre OLTP Datacentre 100% Uptime, Global Scale OLTP Real-Time Analytics Mixed Load OLTP and Analytics Platform Replication Replication JDBC ODBC Separation of OLTP from Analytics Social Media IoT Personalisation & Persistence Personalisation !$£€! Actionable insight Monitoring App, Web
  37. 37. DSE & Analytics : Summary Static, Massive Data Scalable Data Pipelines 1. Optimised data storage formats 2. Scalable, distributed technologies 3. Flexible and interactive analysis tools 4. Resilient, persistent Storage Real-Time Streaming Data Scalable Data Pipelines 1. Scalable, distributed technologies 2. De-coupled Producers and Consumers 3. Real-Time analytics 4. Resilient, persistent Storage Spark Mesos Akka Cassandra Kafka
  38. 38. Thank you!