Intro to Cassandra



  1. Intro to Cassandra
      Tyler Hobbs
  2. History
      [Figure: Dynamo (clustering) and BigTable (data model) combine into Cassandra]
  3. Users
  4. Clustering
      Every node plays the same role
      – No masters, slaves, or special nodes
      – No single point of failure
  5. Consistent Hashing
      [Figure: a ring with nodes at positions 0, 10, 20, 30, 40, and 50]
  6. Consistent Hashing
      Key: “www.google.com”
      [Figure: the same ring of nodes at 0, 10, 20, 30, 40, and 50]
  7-9. Consistent Hashing
      Key: “www.google.com”
      [Figure: md5(“www.google.com”) places the key at position 14 on the ring, between the nodes at 10 and 20]
  10. Consistent Hashing
      Key: “www.google.com”
      [Figure: the key at position 14 on the same ring]
      Replication Factor = 3
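The ring lookup on these slides can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's implementation. It assumes a token space of 60 positions (`RING_SIZE` is invented here to fit the six nodes in the figure). It also assumes that a key belongs to the first node at or clockwise after its token, with further replicas placed on the nodes that follow. The position 14 on the slide is illustrative, so the real MD5 of the key is not expected to land there.

```python
import bisect
import hashlib

# Node positions taken from the slide's ring; RING_SIZE is an assumed token space.
NODE_TOKENS = [0, 10, 20, 30, 40, 50]
RING_SIZE = 60


def token_for_key(key: str) -> int:
    """Hash a key with MD5 and fold the result onto the small illustrative ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def replicas_for_token(token: int, replication_factor: int) -> list[int]:
    """Walk clockwise from the token and collect the next distinct nodes."""
    start = bisect.bisect_left(NODE_TOKENS, token)
    count = min(replication_factor, len(NODE_TOKENS))
    return [NODE_TOKENS[(start + i) % len(NODE_TOKENS)] for i in range(count)]


# The slide's example position, 14, with Replication Factor = 3.
print(replicas_for_token(14, 3))
# A token past the last node wraps around to the start of the ring.
print(replicas_for_token(55, 3))
```

Under these assumptions the key at position 14 is stored on the nodes at 20, 30, and 40.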
  11. Clustering
      Client can talk to any node
  12. Scaling (RF = 2)
      [Figure: ring with nodes at 0, 10, 20, 30, and 50]
      The node at 50 owns the red portion
  13-14. Scaling (RF = 2)
      [Figure: ring with nodes at 0, 10, 20, 30, 40, and 50]
      Add a new node at 40
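A rough way to see why adding the node at 40 is cheap is to list the tokens whose primary owner changes. The sketch below assumes a token space of 0 to 59 and the "first node at or clockwise after the token" ownership rule. The helper `primary_owner` is hypothetical, and it looks only at primary ownership, ignoring the second replica implied by RF = 2.

```python
import bisect


def primary_owner(node_tokens, token):
    """Return the first node at or clockwise after the token, wrapping around."""
    ring = sorted(node_tokens)
    index = bisect.bisect_left(ring, token)
    return ring[index % len(ring)]


before = [0, 10, 20, 30, 50]
after = before + [40]

# Tokens in the assumed 0..59 space whose primary owner changes when 40 joins.
moved = [t for t in range(60) if primary_owner(before, t) != primary_owner(after, t)]
print(moved[0], moved[-1], len(moved))  # prints "31 40 10"
```

Only the range that the node at 50 used to own between 30 and 40 changes hands; every other token keeps its owner.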
  15-16. Node Failures (RF = 2)
      [Figure: ring with nodes at 0, 10, 20, 30, 40, and 50, with the replicas labeled]
  17. Node Failures (RF = 2)
      [Figure: the same ring without the “Replicas” label]
  18. Consistency, Availability
      Consistency
      – Can I read stale data?
      Availability
      – Can I write/read at all?
      Tunable Consistency
  19-20. Consistency
      N = Total number of replicas
      R = Number of replicas read from
      – (before the response is returned)
      W = Number of replicas written to
      – (before the write is considered a success)
      W + R > N gives strong consistency
  21-22. Consistency
      W + R > N gives strong consistency
      N = 3, W = 2, R = 2
      2 + 2 > 3 ==> strongly consistent
      Only 2 of the 3 replicas must be available.
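The reason W + R > N gives strong consistency is that any set of W written replicas must then share at least one replica with any set of R read replicas. The brute-force check below is a small sketch of that argument; the function name `always_overlaps` is made up for illustration.

```python
from itertools import combinations


def always_overlaps(n: int, r: int, w: int) -> bool:
    """True if every possible write set of size w shares a replica with every read set of size r."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# The slide's example: N=3, W=2, R=2.
print(always_overlaps(3, 2, 2))  # True: 2 + 2 > 3
# Reading and writing a single replica each can miss the latest write.
print(always_overlaps(3, 1, 1))  # False: 1 + 1 <= 3
```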
  23-24. Consistency
      Tunable Consistency
      – Specify N (Replication Factor) per data set
      – Specify R, W per operation
      – Quorum: N/2 + 1 (integer division)
        • R = W = Quorum
        • Strong consistency
        • Tolerate the loss of N - Quorum replicas
      – R, W can also be 1 or N
  25. Availability
      Can tolerate the loss of:
      – N - R replicas for reads
      – N - W replicas for writes
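The quorum and availability arithmetic from these slides can be tabulated with a short sketch. It assumes N/2 means integer division, and the helper names are illustrative only.

```python
def quorum(n: int) -> int:
    """Quorum size for n replicas: N/2 + 1 with integer division."""
    return n // 2 + 1


def tolerated_losses(n: int, required: int) -> int:
    """Replicas that can be down while `required` replicas still respond."""
    return n - required


for n in (1, 2, 3, 5):
    q = quorum(n)
    # Columns: N, quorum size, replicas that may be lost at R = W = quorum.
    print(n, q, tolerated_losses(n, q))
```

With N = 3 the quorum is 2, so one replica can be lost for both reads and writes. With N = 2 the quorum is also 2, so quorum operations tolerate no losses.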
  26. CAP Theorem
      During node or network failure:
      [Figure: availability plotted against consistency, each axis running to 100%, with a “Possible” region and a “Not Possible” region]
  27. CAP Theorem
      During node or network failure:
      [Figure: the same plot with “Cassandra” labeled on it]
  28. Clustering
      No single point of failure
      Replication that works
      Scales linearly
      – 2x nodes = 2x performance
        • For both writes and reads
      – Up to 100s of nodes
      Operationally simple
      Multi-Datacenter Replication
  29. Data Model
      Comes from Google BigTable
      Goals
      – Minimize disk seeks
      – High throughput
      – Low latency
      – Durable
  30. Data Model
      Keyspace
      – A collection of Column Families
      – Controls replication settings
      Column Family
      – Kinda resembles a table
  31. Column Families
      Static
      – Object data
      – Similar to a table in a relational database
      Dynamic
      – Pre-calculated query results
      – Materialized views
  32. Static Column Families
      Users
        zznate:  password: *, name: Nate
        driftx:  password: *, name: Brandon
        thobbs:  password: *, name: Tyler
        jbellis: password: *, name: Jonathan, site: riptano.com
  33. Dynamic Column Families
      Rows
      – Each row has a unique primary key
      – Sorted list of (name, value) tuples
        • Like a sorted map or dictionary
      – The (name, value) tuple is called a “column”
  34. Dynamic Column Families
      Following (column names only, values empty)
        zznate:  driftx, thobbs
        driftx:  (none shown)
        thobbs:  zznate
        jbellis: driftx, mdennis, pcmanus, thobbs, xedin, zznate
  35. Dynamic Column Families
      Column Timestamps
      – Each column (tuple) has a timestamp
      – In the case of a collision, the latest timestamp wins
      – Client specifies timestamp with write
      – Writes are idempotent
        • Infinite retries allowed
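The timestamp rule can be sketched as a last-write-wins merge on a single row. This is a simplified model, not Cassandra's storage code. The row is a plain dictionary, `apply_write` is a hypothetical helper, and a write with an equal timestamp is assumed to leave the existing column in place.

```python
def apply_write(row: dict, name: str, value, timestamp: int) -> dict:
    """Store the column only if it is newer than what the row already holds."""
    current = row.get(name)
    if current is None or timestamp > current[1]:
        row[name] = (value, timestamp)
    return row


row = {}
apply_write(row, "age", 24, timestamp=1)
apply_write(row, "age", 34, timestamp=2)
# A client retrying the first write changes nothing: the newer column wins.
apply_write(row, "age", 24, timestamp=1)
print(row)  # {'age': (34, 2)}
```

Because the outcome depends only on the timestamps, replaying a write any number of times, or in a different order, leaves the row in the same state.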
  36. Dynamic Column Families
      Other Examples:
      – Timeline of tweets by a user
      – Timeline of tweets by all of the people a user is following
      – List of comments sorted by score
      – List of friends grouped by state
  37. The Data API
      Two choices
      – RPC-based API
      – CQL
        • Cassandra Query Language
  38. Inserting Data
      INSERT INTO users (KEY, "name", "age")
          VALUES ("thobbs", "Tyler", 24);
  39. Updating Data
      Updates are the same as inserts:
      INSERT INTO users (KEY, "age")
          VALUES ("thobbs", 34);
      Or
      UPDATE users SET "age" = 34 WHERE KEY = "thobbs";
  40. Fetching Data
      Whole row select:
      SELECT * FROM users WHERE KEY = "thobbs";
  41. Fetching Data
      Explicit column select:
      SELECT "name", "age" FROM users
          WHERE KEY = "thobbs";
  42. Fetching Data
      Get a slice of columns
      UPDATE letters SET 1=a, 2=b, 3=c, 4=d, 5=e
          WHERE KEY = "key";
      SELECT 1..3 FROM letters WHERE KEY = "key";
      Returns [(1, a), (2, b), (3, c)]
  43. Fetching Data
      Get a slice of columns
      SELECT FIRST 2 FROM letters WHERE KEY = "key";
      Returns [(1, a), (2, b)]
      SELECT FIRST 2 REVERSED FROM letters WHERE KEY = "key";
      Returns [(5, e), (4, d)]
  44. Fetching Data
      Get a slice of columns
      SELECT 3.. FROM letters WHERE KEY = "key";
      Returns [(3, c), (4, d), (5, e)]
      SELECT FIRST 2 REVERSED 4.. FROM letters WHERE KEY = "key";
      Returns [(4, d), (3, c)]
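The slice queries on the last three slides can be modeled in Python by treating a row as a sorted map of columns. This is only a sketch of the semantics the slides describe. `slice_columns` and its parameters are invented for illustration and are not part of any Cassandra client API.

```python
def slice_columns(row, start=None, end=None, first=None, reverse=False):
    """Return (name, value) pairs in column order, optionally bounded, limited, and reversed."""
    columns = sorted(row.items(), reverse=reverse)
    if reverse:
        # When reversed, `start` is the highest column name to include.
        selected = [(n, v) for n, v in columns
                    if (start is None or n <= start) and (end is None or n >= end)]
    else:
        selected = [(n, v) for n, v in columns
                    if (start is None or n >= start) and (end is None or n <= end)]
    # Slicing with first=None keeps every selected column.
    return selected[:first]


letters = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}

print(slice_columns(letters, start=1, end=3))                # SELECT 1..3
print(slice_columns(letters, first=2))                       # SELECT FIRST 2
print(slice_columns(letters, first=2, reverse=True))         # SELECT FIRST 2 REVERSED
print(slice_columns(letters, start=3))                       # SELECT 3..
print(slice_columns(letters, start=4, first=2, reverse=True))  # SELECT FIRST 2 REVERSED 4..
```

Each call mirrors one query from the slides and produces the same columns the slides list as the result.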
  45. Deleting Data
      Delete a whole row:
      DELETE FROM users WHERE KEY = "thobbs";
      Delete specific columns:
      DELETE "age" FROM users WHERE KEY = "thobbs";
  46. Secondary Indexes
      Built-in basic indexes
      CREATE INDEX ageIndex ON users (age);
      SELECT name FROM USERS
          WHERE age = 24 AND state = "TX";
  47. Performance
      Writes
      – 10k to 30k per second per node
      – Sub-millisecond latency
      Reads
      – 1k to 10k per second per node
      – Depends on data set, caching
      – Usually 0.1 to 10 ms latency
  48. Other Features
      Distributed Counters
      – Can support millions of high-volume counters
      Excellent Multi-datacenter Support
      – Disaster recovery
      – Locality
      Hadoop Integration
      – Isolation of resources
      – Hive and Pig drivers
      Compression
  49. What Cassandra Can't Do
      Transactions
      – Unless you use a distributed lock
      – Atomicity, Isolation
      – These aren't needed as often as you'd think
      Limited support for ad-hoc queries
      – Know what you want to do with the data
  50. Not One-size-fits-all
      Use alongside an RDBMS
      – Use the RDBMS for highly-transactional or highly-relational data
        • Usually a small set of data
      – Let Cassandra scale to handle the rest
  51. Language Support
      Good:
      – Java
      – Python
      – Ruby
      – PHP
      – C#
      Coming Soon:
      – Everything else, now that we have CQL
  52. Questions?
      Tyler Hobbs
      @tylhobbs
      tyler@datastax.com