
Cassandra Installation Review:

Which one is not an aspect of the CAP theorem - Cluster Tolerance
Where does Cassandra fit within the CAP theorem - Availability
and Partition Tolerance
What are the technological roots of Cassandra - Google
BigTable and Amazon Dynamo
What technology does Cassandra use to model data - CQL (a more structured DML)
What open source library must be installed for production use - JNA
What setting determines a node's cluster, and where is it
configured - cluster_name in cassandra.yaml (alongside the listen address and seed
addresses)
What open source library must be installed for production use?
a. JNA - Java Native Access
Where do you find and set the system log file location?
a. Cassandra 2.0: log4j-server.properties
b. Cassandra 2.1: logback.xml
What setting determines a node's cluster, and where is it
configured?
a. The cluster_name setting in the cassandra.yaml file
How would you stop a background Cassandra instance on Linux or
Mac OS X?
a. Find the process ID (for example with ps aux) and kill it
What settings might you adjust, in which configuration file, to tune
Cassandra memory use?
a. MAX_HEAP_SIZE (8 GB) and HEAP_NEWSIZE (100 MB per core), set
in cassandra-env.sh
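The 8 GB and 100 MB per core figures above can be tied together with a small calculation. The sketch below mirrors, as far as I recall it, the default heap sizing logic in the Cassandra 2.x cassandra-env.sh script; the function name and the exact rounding are assumptions for illustration, not the script itself.

```python
from typing import Tuple


def default_heap_sizes(system_memory_mb: int, cpu_cores: int) -> Tuple[int, int]:
    """Return (max_heap_mb, heap_new_mb) for a machine.

    Assumed rule: max heap = max(min(RAM/2, 1 GB), min(RAM/4, 8 GB)),
    new generation = min(100 MB per core, max heap / 4).
    """
    half = min(system_memory_mb // 2, 1024)      # half of RAM, capped at 1 GB
    quarter = min(system_memory_mb // 4, 8192)   # quarter of RAM, capped at 8 GB
    max_heap = max(half, quarter)
    heap_new = min(100 * cpu_cores, max_heap // 4)
    return max_heap, heap_new


if __name__ == "__main__":
    # A 32 GB machine with 8 cores hits the 8 GB cap and 100 MB per core.
    print(default_heap_sizes(32 * 1024, 8))
```

Setting MAX_HEAP_SIZE and HEAP_NEWSIZE explicitly in cassandra-env.sh overrides whatever the script would otherwise compute.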
Additional Notes:
Cluster membership protocol - gossip
Each node has its own IP address and a seed address; that makes it a cluster node.
Partition key - the part of the primary key that determines the partition.
num_tokens greater than 1 enables virtual nodes (vnodes).
Set - unique elements
List - elements may or may not be unique
Map - key/value pairs
Data density - load size per node; to break down the problem, add
more nodes.
RPC - remote procedure call - for communication across
zones.
cassandra-stress - command-line benchmark and load-testing tool
Murmur3 token range - from -2^63 to +2^63 - 1
Node - acts as coordinator and persists data
DC - regional and workload isolation
Workload isolation - important for cache, performance and
analytics. E.g. separate reads from analytics for faster performance.

14 Seed address - entry-point address used to join the gossip network.

15 Vnodes - tokens are randomly distributed and automatically generated. In a
single-token configuration, tokens have to be assigned manually.
16 Sequential writes hold 64 GB of data for vnodes.
17 Timestamp - 64 bits
18 System hints table
19 Gossip communication - carries rack and DC info (location) and membership;
it does not communicate health directly, which is inferred from heartbeats
20 Each node gossips with 1-3 other nodes per second to communicate health.
21 Low latency - also has broadcast for health communication.
22 Compaction thresholds.
23 Memtables reaching 25% of the JVM heap cause a flush, which can initiate compaction.
24 Vnodes - num_tokens defaults to 256 - bootstrap is faster because of vnodes - more
granular distribution of data.
25 Coordinator and persistence problem
26 Replication factor - how many replicas to make of each piece of data
27 Replication strategy - which node each replica should sit on
28 Keyspace - a collection of tables that has an RF associated with it.
a. SimpleStrategy - does not recognize multiple data
centers
2 Snitch logic - to avoid a single point of failure - best to have
a homogeneous rack setup.
3 Latency - for reading and writing (faster). Latency is measured in
milliseconds, based on acknowledgements.
4 Network topology - racks and DCs - milliseconds
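The notes on tokens, replication factor and replication strategy can be illustrated with a toy token ring. This is only a sketch: Python's standard library has no Murmur3, so toy_token uses an MD5-based stand-in, and replicas_for imitates SimpleStrategy-style placement (walk clockwise and take the next distinct nodes). All names and the three-node EXAMPLE_RING are hypothetical.

```python
import bisect
import hashlib

MIN_TOKEN = -2**63       # lower bound of the Murmur3 token range
MAX_TOKEN = 2**63 - 1    # upper bound of the Murmur3 token range


def toy_token(partition_key: str) -> int:
    """Stand-in for the Murmur3 partitioner: hash a key to a signed 64-bit token."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def replicas_for(token, ring, replication_factor):
    """Return the nodes holding a token, walking the ring clockwise.

    ring is a list of (token, node) pairs sorted by token. The first node whose
    token is >= the key's token owns it; the following distinct nodes hold replicas.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)  # wrap past the last token
    owners = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in owners:
            owners.append(node)
        if len(owners) == replication_factor:
            break
    return owners


# Hypothetical ring: one token per node (no vnodes), evenly spread.
EXAMPLE_RING = sorted([
    (-6_000_000_000_000_000_000, "node1"),
    (0, "node2"),
    (6_000_000_000_000_000_000, "node3"),
])

if __name__ == "__main__":
    key_token = toy_token("alice")
    print(key_token, replicas_for(key_token, EXAMPLE_RING, 2))
```

With vnodes, each node would simply appear in the ring many times (256 by default), which is why the distribution becomes more granular.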
Cassandra Tools Review:
1 Which is not a nodetool option to provide cluster status information - token
2 Port 7199 is the default JMX port - True
3 What tools/commands can be used to populate data in Cassandra - COPY, sstableloader
4 Which is NOT a use of CCM - Split a rack
5 What is a coordinator - The client-selected node responsible for servicing
a read or write request to the cluster
6 In a 3-node cluster with RF=2, how much total data volume does
each node own - 2/3 of the data
7 How could RF and CL be tuned to ensure immediate consistency - (nodes_written + nodes_read) > replication_factor
8 What is the function of the nodetool repair operation - Synchronizes
node data with the most current cluster replicas
9 Describe the relationship of nodes, racks, clusters and data centers?
Nodes store and replicate data; nodes are grouped into racks, racks into data centers, and data centers make up the cluster
10 What is the function of the partitioner? The partitioner hashes the partition key given
to it using an algorithm (Murmur3). The cluster then uses
the resulting token to determine where the data should live.
11 Can a node hold a partition with a token outside its primary range?
Yes, if it is replicating data OR if it is holding a hint for another node
12 In a 3-node cluster with RF=2, how much total data volume does
each node own? 2/3 [formula is RF / number of nodes; n/2 + 1 is the quorum formula]
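The 2/3 answer and the quorum formula are easy to mix up, so here is a minimal sketch of both calculations. The function names are hypothetical, and the ownership figure assumes data is spread evenly across nodes.

```python
from fractions import Fraction


def ownership_fraction(replication_factor: int, node_count: int) -> Fraction:
    """Share of the total data each node stores, assuming an even distribution."""
    effective_rf = min(replication_factor, node_count)  # cannot exceed the node count
    return Fraction(effective_rf, node_count)


def quorum(replication_factor: int) -> int:
    """Number of replicas that make up a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


if __name__ == "__main__":
    print(ownership_fraction(2, 3))  # 3 nodes, RF=2
    print(quorum(3))
```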

What is the function of the nodetool repair operation? To
synchronize replication
What is a remote coordinator? When using NetworkTopologyStrategy
and writing data to another DC, as an optimization for
network traffic the coordinator picks one of the replicas in the remote Cassandra
DC as the remote coordinator and sends it the write; the remote coordinator then replicates the
data within that DC.
How could RF and CL be tuned to ensure immediate consistency?
RF (replication factor) is set once, when creating the keyspace.
CL (consistency level) is changed frequently.
a. Read ALL - write ONE
b. Read ONE - write ALL
c. Read QUORUM - write QUORUM
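The three combinations above all satisfy the rule (nodes_written + nodes_read) > replication_factor. A minimal sketch of that check follows; it only models the ONE, QUORUM and ALL levels, and the function names are hypothetical.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Translate a consistency level name into a replica count."""
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def is_immediately_consistent(read_level: str, write_level: str,
                              replication_factor: int) -> bool:
    """True when the read and write replica sets must overlap (R + W > RF)."""
    nodes_read = replicas_required(read_level, replication_factor)
    nodes_written = replicas_required(write_level, replication_factor)
    return nodes_read + nodes_written > replication_factor


if __name__ == "__main__":
    for read, write in [("ALL", "ONE"), ("ONE", "ALL"),
                        ("QUORUM", "QUORUM"), ("ONE", "ONE")]:
        print(read, write, is_immediately_consistent(read, write, 3))
```

Reading and writing at QUORUM is usually the practical choice of the three, since it tolerates a replica being down on both paths.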
What is the relationship between a column family and a CQL table - A table is a two-dimensional view of a multi-dimensional column family
How are wide rows displayed in CQL - As multiple CQL rows, one per clustering column value
By default, how are clustering columns ordered - Ascending by
default; can be changed with CLUSTERING ORDER BY in CREATE TABLE
CQL counters are 100% accurate - False
Which of the following are NOT allowed in a CQL query - GROUP BY
How can data from two tables be combined in a CQL query? - They
can't be joined
What is the relationship between a column family and a CQL table - Terminologically the same. Column family is the CLI view and the CQL table is the
logical tabular presentation of the data.
How are wide rows implemented in CQL - Clustering columns allow us to
nest and organize data inside a partition
How are clustering columns ordered - Ascending by default; can be
changed with CLUSTERING ORDER BY in CREATE TABLE when creating the table
What is the difference between UUID and TIMEUUID - A UUID
guarantees randomness. A TIMEUUID has a date and time component
built in that can be extracted, giving uniqueness, an actual value
and sortability all in one value
When should secondary indexes be used - Rarely - if used, they should
be used for columns with low cardinality
Are CQL counters 100% accurate - No, because on a retry the column may be incremented or
decremented again
How does an upsert work - INSERT and UPDATE behave the same: both write the row whether or not it already exists
What predicates are allowed in a CQL query - equality matches,
inequality matches, the IN clause and range queries between multiple
values (slices).
When should the ALLOW FILTERING clause be used - In
development, to scan for data, if at all. In production, usually only for
very small volumes of data.
How can data from two tables be combined in a CQL framework - By nesting data into one table or with a client-side join on the application
side.
What are the components of the data modeling framework - Conceptual, logical and physical data models
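The UUID versus TIMEUUID answer above can be tried out with Python's standard uuid module, where uuid4() is a random UUID and uuid1() is the time-based kind that CQL calls timeuuid. The helper name timeuuid_to_datetime is hypothetical; it relies on the version 1 timestamp counting 100-nanosecond intervals since 15 October 1582.

```python
import datetime
import uuid

# Version 1 UUID timestamps count 100 ns intervals from this date.
GREGORIAN_EPOCH = datetime.datetime(1582, 10, 15, tzinfo=datetime.timezone.utc)


def timeuuid_to_datetime(value: uuid.UUID) -> datetime.datetime:
    """Extract the embedded creation time from a version 1 (time-based) UUID."""
    if value.version != 1:
        raise ValueError("only version 1 UUIDs carry a timestamp")
    return GREGORIAN_EPOCH + datetime.timedelta(microseconds=value.time // 10)


random_id = uuid.uuid4()  # random: unique, but carries no time information
time_id = uuid.uuid1()    # time-based: unique and carries its creation time

if __name__ == "__main__":
    print(random_id, "version", random_id.version)
    print(time_id, "created at", timeuuid_to_datetime(time_id))
```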

What is the purpose of Chebotko diagrams - to draw out tables
with their fields and label query access patterns with the queries that
will be performed during that access pattern
Rotational drives perform at least two head seeks when writing to
the commit log - False
When handling a write request, Cassandra stores the new values in
Memtables and the commit log - True
Which will not cause a Memtable to flush - sstable_max_size is
reached
What happens when a Memtable is flushed - Cassandra writes it to a
new SSTable on disk
What is compaction - Combining all SSTables into a single SSTable
What happens when a Memtable is flushed - A new SSTable is created
on disk
What causes a Memtable to flush - The maximum allocation for
memtables is reached, or the commit log reaches its size cap and memtables are flushed so that commit log
space can be released
What is the relationship of CQL table to Memtables and SSTables - A table's data is split: some of the newest data will be in a Memtable and
all flushed data will be on disk in SSTables
Do disk seeks happen during writes - No, we write to memory.
How are data files organized - Data directory - one directory for each
keyspace - one directory for each table - All SSTables in these
directories
What benefit do Bloom filters provide to the read process - Prevents
reading an SSTable that does not contain the partition
What benefit does the key cache provide - Allows skipping the
partition summary and partition index
Why is the row cache turned off by default - To prevent double
caching with the operating system
Cassandra reads the partition summary for partition keys in the
key cache - False
What benefits do Bloom filters provide to the read process - They
Allow us to skip SSTables that do not have the data we are looking
for.
Is the partition summary read for partition keys in the key cache - No, the key cache allows us to skip over the partition summary and partition index
and go straight to the SSTable data
What is the relationship between the partition summary and index - The summary is like an index into the partition index
How many key caches are maintained for a Memtable - Just one.
Which of the following statements is incorrect - Compaction causes
several sporadic disk/head seeks
Which one is NOT a disadvantage of Size-tiered compaction - Poor
performance for write-heavy applications
What are zombie columns, and how do you prevent them - Deleted
data resurrected by a node that was down long enough to miss
the delete; prevent them by running nodetool repair regularly, within gc_grace_seconds
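The write and read path described in the answers above (commit log plus Memtable on write, flush to a new immutable SSTable, reads checking the newest data first) can be sketched as a toy in-memory model. ToyTable and memtable_limit are hypothetical names; real flush triggers are size-based, and the membership test on each SSTable stands in for the Bloom filter, key cache and partition index.

```python
from types import MappingProxyType


class ToyTable:
    """Toy model of one table: commit log, Memtable and a list of SSTables."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable_limit = memtable_limit  # flush after this many distinct keys
        self.commit_log = []                  # append-only, sequential writes
        self.memtable = {}                    # newest data, held in memory
        self.sstables = []                    # flushed data, oldest first

    def write(self, key, value):
        # A write goes to the commit log and the Memtable; no existing
        # SSTable is ever read or modified on the write path.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Flushing writes the Memtable out as a new, read-only SSTable
        # and lets the matching commit log entries be discarded.
        if not self.memtable:
            return
        self.sstables.append(MappingProxyType(dict(sorted(self.memtable.items()))))
        self.memtable = {}
        self.commit_log.clear()

    def read(self, key):
        # Check the Memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:  # stand-in for the Bloom filter / index lookup
                return sstable[key]
        return None


if __name__ == "__main__":
    table = ToyTable(memtable_limit=2)
    table.write("a", 1)
    table.write("b", 2)   # second key triggers a flush
    table.write("a", 3)   # newer value lives in the Memtable
    print(table.read("a"), table.read("b"), len(table.sstables))
```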

How do SSTables change during compaction - The existing SSTables are not
changed. They are read and new ones are created, filtering out
obsolete data from the old SSTables
What are some benefits of size-tiered compaction - fast write
operations / lower pressure on disk bandwidth - 50% 50gz - 1tb
What are some benefits of leveled compaction - Less free disk
space required. Predictable and consistent read performance - 10% not
more than 8 gb required
DateTieredCompaction
When might you use nodetool compact - Not very frequently. Not in
production.
A single SSD has lower latency than a RAID - True
Networked attached storage is ideal for Cassandra - False
Adding more memory helps with - Reads
The CAP theorem states that only two out of three characteristics
can be met by a distributed database system simultaneously. Which
of the two characteristics does Apache Cassandra value most?
Choose 2. - Availability & Partition Tolerance
Apache Cassandra achieves high availability by ____________ - having
no single point of failure
Durability is the property that ensures that all written data is the
same on all nodes that it is written to - False
What does the gossip protocol do - Enables each node to share
known state and location with other nodes
Which of these Cassandra technologies work together to keep track
of the cluster data center and rack topology? Choose all that apply - Gossip Protocol / Configured Partitioner and Snitch Implementation
Apache Cassandra can be downloaded in which formats, including
DataStax Community variants? Choose all that apply - Debian
package/ Windows Installer / Tarball / Ubuntu package
Clients should have ________________ so that the column values will
replace older values based on their timestamp - Synchronized clocks
Which of the following is not true for Apache Cassandra - Cassandra
can compute the minimum value for a column with CQL
In a Cassandra instance, a table called Orders holds order
information. Each time an order is placed, an entry must also be
placed in a denormalized table OrdersByCustomer to keep track of
order history per customer. How does Cassandra handle this?
Cassandra will do nothing. An additional write must be specified in
the application, which performs the writes to both the Orders and
OrdersByCustomer tables.
When defining a table in Apache Cassandra, a _________________ must
be defined - Primary key
The main function of a keyspace is to control - replication
An application tracks a user's habits. Every time a user clicks a link
on a page within your site, the time of the event is recorded as well
as the link clicked. In order to write an efficient query, all data must
be stored in a single partition. Which of the following tables best
models the need of the application - CREATE TABLE habits (userid
uuid, clicktime timestamp, link text, PRIMARY KEY (userid, clicktime));

28 You have designed a query to return all users by a designated state
and designated age from a table holding all users' information. Which
approach is best for efficient queries like this - Create the users
table. Create a table that stores partition keys grouped by state and
age. Query the second table
29 Which of the following is a valid Cassandra data type - timestamp
30 Given the following table, which of the following statements is an
example of Data Manipulation Language (DML) in CQL - SELECT *
FROM comics;
31 An SSTable is immutable, meaning that it cannot be modified - True
32 Compaction does NOT do which of the following - Redistribute
SSTables to rebalance node workload
33 Tombstones are - Markers that a delete has occurred - the standard
write mechanisms are used to propagate the delete to other replicas
34 Which of the following statements about writes is incorrect - Cassandra sends write requests to the minimum number of replica
nodes needed to fulfill consistency requirements
35 Given the following table from a physical data model, what are the
most likely choices for missing data types for the avg_rating,
category and amount_ingrediant, in that order? - float, list<text>, map<text, text>
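Several answers in this review (immutable SSTables, compaction, tombstones, timestamp-based replacement of older values) fit together in one small merge routine. The sketch below is a simplification under stated assumptions: each SSTable is a dict of key to (timestamp, value), TOMBSTONE is a hypothetical marker object, and tombstones are dropped immediately, whereas Cassandra keeps them until gc_grace_seconds has passed.

```python
TOMBSTONE = object()  # hypothetical marker meaning "this key was deleted"


def compact(sstables):
    """Merge SSTables into one new SSTable without modifying the inputs.

    For each key the cell with the highest timestamp wins; keys whose
    winning cell is a tombstone are left out of the result.
    """
    merged = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return {
        key: cell
        for key, cell in sorted(merged.items())
        if cell[1] is not TOMBSTONE
    }


if __name__ == "__main__":
    older = {"a": (1, "x"), "b": (1, "y")}
    newer = {"a": (2, "z"), "b": (3, TOMBSTONE)}
    print(compact([older, newer]))
```

Because the winner is chosen purely by timestamp, clients with unsynchronized clocks can cause an older write to shadow a newer one, which is why synchronized clocks matter.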


Data Modeling:
1:1 - one to one
1:n - one to many
m:n - many to many
Ellipses - attributes
Rectangle - entity (data)
Diamond - relationship
