
Google Cloud Platform and Machine Learning Specialization Coursera

1. Products and Use Cases where ML is used are discussed in the course.
2. Compute and Storage -> Processing Data
3. Decision whether to move to GCP

5. A MySQL database can be hosted on GCP as Cloud SQL.
6. Dataproc is the managed Hadoop/Spark service which can be used to process the data

8. What is GCP (Google Cloud Platform) – To organize the world’s information, Google has created
the most powerful infrastructure on the planet.
9. Edge Locations – A cached version of a resource is maintained at the edge locations. The
data centres contain the main data resources, while the edge locations maintain cached
versions of the same resources.
10. Colossus is the successor to GFS (Google File System); it is the base file system used in Google
Cloud Platform. Colossus takes large datasets, breaks them into smaller pieces and distributes
them across various storage resources.
11. Dremel and Flume are Google’s internal large-scale data processing systems. Between them they
cover the kinds of tasks that MapReduce performs in Hadoop.
12. Flume is part of Dataflow
13. MapReduce is limited by the number of compute nodes that can be used.
14. Auto-scale the cluster, distribute the compute and distribute the storage
15. No-Ops means Google Cloud handles the operations itself: it identifies where the data resources
are located and processes them in a compute environment without the user managing servers.

18. (Rooms to Go) Customer Purchase data and Navigation History are used to identify the
customer purchase patterns and create new packages which were then sold to the
customers
19. Compute need not be thought of at the level of individual virtual machines. The computing is at a
much higher level of abstraction, with the flexibility to use compute capacity as per our requirement.
20. A preemptible machine is a machine on which the customer gets a large discount, of the
order of 80%, but in exchange the customer agrees to give up the machine if another
customer is willing to pay the full price for it. For ex. a machine might be part of a
Hadoop cluster, which can handle a failover scenario: if a machine goes down, the other
machines in the cluster can take over the tasks it was performing. By mixing preemptible
machines with standard machines, the company running the Hadoop cluster drops its cost
significantly.
21. Cloud Shell is an ephemeral machine which is used to access the google cloud platform.
Thus, it is reset every 60 min. In case of SSH tunnelling, the connections will need to be
reset every 60 min.
22. Creating a Dataproc cluster to act as a Hadoop cluster.
23. There are Standard VMs and Preemptible VMs.
24. There is an option of Cloud Storage, which is cheaper than persistent disk storage and
efficient in handling data. Other options, like the disks attached to compute engines, can be risky:
if the compute engine is taken away, so is the disk attached to it.
Thus, the data required for processing on the compute engine can be stored in and
retrieved from Cloud Storage.
25. gsutil is a command-line tool used to copy data from the local machine into Cloud Storage,
from where it can be used by compute engines in the cloud.
26. Cloud Storage is a key-value object store with a hierarchy-style wrapper which
reflects the folder structures of file systems.
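A minimal sketch of such a copy using the Python client library rather than gsutil (the bucket and file names are made up for illustration):

    from google.cloud import storage

    client = storage.Client()                     # uses the project's default credentials
    bucket = client.bucket("my-staging-bucket")   # assumed bucket name
    blob = bucket.blob("data/sales.csv")          # the object key; the "folder" is just part of the key
    blob.upload_from_filename("sales.csv")        # copy a local file into Cloud Storage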
27. REST API is available to be used to transfer content to and from the Cloud Storage.
28. Cloud storage acts as a staging area for data, from where the data can be used by various
cloud applications such as Cloud SQL, Compute engine, etc.
29. Every bucket belongs to a project in GCP, and billing is done at the project level in Google Cloud.
30. Every object in a bucket has a URL which can be shared and accessed by users logged
into their Google account.
31. Region > Zone
32. The location for caching the data in cloud storage is not controlled by the user, however
the system by itself takes care of caching the data in edge locations based on the locations
from where the data is accessed frequently.
33. Cloud Storage buckets can be mounted onto the Compute engines as persistent storage
volumes.
34. The instructions for the cloud storage bucket mount and access from compute engine can
be found at the following links.
a. https://cloud.google.com/storage/docs/gcs-fuse
b. https://cloud.google.com/compute/docs/disks/gcs-buckets
c. https://cloud.google.com/storage/docs/authentication#service_accounts
d. https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/installing.md
e. https://cloud.google.com/compute/docs/instances/transfer-files
35. GCS should not be used for anything that requires high throughput. GCS is not useful for
real-time data storage; frequent reads/writes are not a speciality of GCS.
36. There is a price estimator for calculating the total monthly charges based on the usage
specification of the Company.
37. Cloud Launcher is a platform that allows the user to use the VM instances which already
have the required software installed on them.
38. GCS is a geo-redundant storage with high availability and performance. It is ideal for low-
latency and high QPS content. It is well suited for reading the content from a single storage
at high availability, however it is not suited for real-time data exchanges.
39. Nearline (data accessed about once a month) and Coldline (about once a year) are cloud archival
storage classes. They are counterparts of Amazon Glacier in AWS.
40. The original App Engine is where the program is deployed for the world to be able to reach the
application. This was the basis of Google’s application hosting.
41. The original App Engine is ideal for greenfield development rather than legacy systems.
42. Google App Engine allows users to run their programs directly, with integration into
Google Cloud. If the application must run on Tomcat itself, Google suggests that the
Tomcat application be deployed in Docker containers, which can then be managed by Google
App Engine in case of any service issues.
43. Google has a managed services facility which allows the companies to request help from
Google in sorting out the issues faced by the company on the front end.
44. Cloud Pub/Sub is a messaging service which doesn’t need a server to execute the messaging
services.
45. Google Cloud should be understood as a service that provides a complete environment which
can run all of a company’s tasks on Google’s infrastructure. Each company’s independent
applications could simply be deployed on a compute engine and run from there, but Google Cloud
offers a much larger perspective: a system working on the cloud that uses the various components
of Google Cloud to perform tasks without requiring a dedicated application server to be running.

47. The recommendation engines identify the following:
a. Ratings given by a user to a house
b. The time spent by a user on a specific page
c. The links on which the customer has clicked
48. For ML,
a. There will be 2 users who rate a house at the same level
b. There will be houses which are not so great and all the users have rated the house 1 or 2.
c. The rating is created based on such similarities and the top 3-5 ratings are extracted from
the model and suggested as recommendations.
49. Cloud SQL is a managed MySQL platform
50. Data management is done using PySpark (Python on Spark) on Dataproc
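The course runs this on Dataproc with PySpark; a rough sketch of the kind of collaborative-filtering job involved (the Cloud Storage path, schema and column names are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("house-recommendations").getOrCreate()

    # Ratings exported from Cloud SQL to Cloud Storage (path and schema assumed)
    ratings = spark.read.csv("gs://my-bucket/ratings.csv", header=True, inferSchema=True)

    # Collaborative filtering: users who rate houses similarly get similar suggestions
    als = ALS(userCol="userId", itemCol="houseId", ratingCol="rating",
              coldStartStrategy="drop")
    model = als.fit(ratings)

    # Top 5 houses per user, matching the "top 3-5 ratings" idea in the notes
    model.recommendForAllUsers(5).show()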

52. Data of up to a few hundred GB is well suited to management in Cloud SQL.
54. A Cloud SQL instance can be stopped when it is not needed. This reduces the cost for databases
that are not required to run 24x7.
55. In Cloud SQL, the client’s IP address must be added to the authorized networks every time a user
connects from a new Cloud Shell (because the Cloud Shell IP is different each time). This
authorization list is checked whenever a system tries to access Cloud SQL.

57. Google Dataproc allows you to integrate Google Cloud Storage with the Hadoop clusters
for data processing using Spark, Hive, etc. The benefits are listed below,
a. Instead of initiating compute machines for storage of data after processing, the data can
be directly stored in the Cloud Storage.
b. The Dataproc instances can be started just for performing the compute and be shut once
the jobs are completed. Thus, saving on the costs.
c. The integration with Cloud Storage reduces the cost of maintaining the machines and the
amount of I/O from the compute machines to various databases.
d. The security offered by Dataproc on the data is the same as the security provided by Cloud
storage since the data is ultimately stored there. Thus, reducing the cost for securing the
compute engine environments.
e. The hassle of managing the infrastructure is reduced
58. The systems must be updated so that they have authorization to access other components
in Cloud environment. For ex. To access CloudSQL, the dataproc machines must have
permissions and authorization to access the Cloud SQL.
59. Relational databases (Cloud SQL) have relatively slow writes, so high-throughput
writes are not possible. They work well only on data up to the level of a few hundred gigabytes.

62. The datastore can hold data up to Terabytes. It allows you to store data as objects. Each
object is stored as a key value pair. Each object has a key. The data can be searched in 2
ways - using the key and using properties of the object.
63. Datastore allows transactional exchanges of structured data. It uses indexing of
information to store data in an easily retrievable manner.
64. Cloud SQL persistence of data is a part of customization and is not a default feature of Cloud
SQL.
65. While reading data from Datastore, the items (objects) are returned as iterables,
because a single query on terabytes of data can respond with gigabytes of results, and returning
them as a list would consume that many gigabytes of memory. Thus the matching objects are
returned as an iterable.
66. Datastore supports all the operational functions of an RDBMS – create, read, update
and delete – but it is not a relational database.
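A minimal sketch with the Python Datastore client (the kind, key and property names are made up):

    from google.cloud import datastore

    client = datastore.Client()

    # Each object (entity) is stored under a key
    key = client.key("House", 1011)
    house = datastore.Entity(key=key)
    house.update({"city": "Austin", "bedrooms": 3})
    client.put(house)

    # Look up by key ...
    print(client.get(key))

    # ... or query by a property; results come back as an iterable, not an in-memory list
    query = client.query(kind="House")
    query.add_filter("city", "=", "Austin")
    for entity in query.fetch():
        print(entity["bedrooms"])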

68. For streaming data and high-throughput needs, Bigtable is used. Bigtable scales to
petabytes and can manage real-time stream data. The data is stored in the form of key-value pairs.
There is no transactional support in Bigtable, so it cannot be efficiently used for
aggregations and other batch operations on the data, but it is designed to be
extremely efficient at ingesting and serving real-time data.
69. The search in Bigtable is based only on the key. Tables are recommended to be tall and
narrow. For ex. if there are 5 Boolean columns, in Bigtable it is suggested that instead
of storing the 5 variables as columns, only the variables which are true are stored.
Instead of saving the data as

    key    Bool1  Bool2  Bool3  Bool4  Bool5
    1011   T      T      T      F      F

save the data as

    key    Bool_Column
    1011   Bool1; Bool2; Bool3
70. In case of an update, a new row is added to the table and Bigtable takes care of returning
the latest data from the table for any read requests. This is done by reading data from
the bottom, so that the latest data is retrieved first.
71. Importing data into Bigtable and working with it is done through the HBase API.
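The course works through the HBase API; as a rough alternative, the same tall-and-narrow row could be written with the Python Bigtable client (project, instance, table and column-family names are assumptions):

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project", admin=True)
    table = client.instance("my-instance").table("user-flags")

    # One row per key; only the flags that are true are written as columns
    row = table.direct_row(b"1011")
    for flag in ["Bool1", "Bool2", "Bool3"]:
        row.set_cell("flags", flag, b"1")
    row.commit()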
72. Notebook Development - Datalabs

73. Each Datalab has a web URL for sharing the draft codes to collaborate with team members.
74. Datalab inherently follows a server-and-client model. Thus, the server must be hosted
somewhere for others to be able to access the URLs. The 3 places where the server can
be hosted are
a. Local machine (on your laptop) – if the laptop is shut down, the access stops
b. On a Compute Engine instance – the setup is not cost effective, and an SSH tunnel
needs to be established from Cloud Shell to connect with the machine
c. Gateway VM – the server is installed in a VM using one of the existing Cloud
Launcher images
75. BigQuery – a superfast data processor. It can handle data at petabyte scale. It can
query CSV and JSON data without ingesting them into BigQuery; however, for full-speed processing
the data needs to be ingested into the BigQuery warehouse.
76. Data can be streamed in using Cloud Dataflow and while the streaming is going on, the
processing can be done using Bigquery.
77. BigQuery can be run from Datalab as well, which allows the user to work with Python and
BigQuery in parallel in the same notebook.
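For example, from a Datalab notebook one might run something like the following with the BigQuery Python client (the public Shakespeare sample dataset is used purely as an illustration):

    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
        SELECT corpus, SUM(word_count) AS total_words
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY corpus
        ORDER BY total_words DESC
        LIMIT 5
    """
    df = client.query(sql).to_dataframe()   # BigQuery does the heavy lifting; pandas gets a small result
    print(df)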
78. BigQuery is Awesome!!!
79. TensorFlow is a C++-based framework for performing numeric processing. It is extremely
efficient for doing machine learning on neural networks.
80. TensorFlow is a way to perform machine learning modelling.
81. Machine Learning APIs from Google are trained on Google’s own data. They can be called
directly instead of building complete ML models from scratch.
82. ML Engine also allows Jobs to be deployed from models. It requires the models to be
created and trained and then jobs can be deployed in real time execution.
83. Asynchronous processing is a way of keeping the processing system separate from the data-
receiving system. It essentially introduces a message-queueing service to manage the
scaling and expansion of resource usage in case of escalated load.
84. Pub/Sub is the system which manages this message processing. Pub/Sub has topics,
where each topic can be made responsible for a specific set of functions.
85. Different tasks can thus be managed in a modular way through Pub/Sub.
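A tiny publisher sketch with the Python Pub/Sub client (the project and topic names are made up):

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "sales-events")

    # Messages are bytes; attributes are optional key-value metadata
    future = publisher.publish(topic_path, b"order received", order_id="1234")
    print(future.result())   # message ID once the publish succeeds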
86. Dataflow – serverless data pipelines using Apache Beam. Each step of a Dataflow pipeline is a
class in Java. It is a No-Ops data pipeline.
87. The ParDo transform executes a task in parallel across all the elements of the input. Dataflow
creates multiple workers to process the data in parallel; these machines need not be created
beforehand, as Dataflow manages their creation on the fly.
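The course writes these pipelines in Java; a rough equivalent of a ParDo step in the Beam Python SDK (the file paths are placeholders):

    import apache_beam as beam

    class ExtractWords(beam.DoFn):
        def process(self, line):
            # process() is called once per element; the runner executes many copies in parallel
            for word in line.split():
                yield word

    with beam.Pipeline() as p:
        (p
         | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
         | "Words" >> beam.ParDo(ExtractWords())
         | "Count" >> beam.combiners.Count.PerElement()
         | "Write" >> beam.io.WriteToText("gs://my-bucket/output/wordcount"))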
88. Cloud Dataflow allows the system to keep reading the information from Pub/Sub for a
specific amount of time. The information can be read from a file or any source including
cloud storage.
89. Dataflow is primarily made to process historical data and real-time data coming in from
different sources. It can massage both in the same way and send the results to BigQuery
for scoring and testing the ML models.

91. Charges apply when data leaves a zone, not when it is added to a zone. This plays an
important role: the cloud storage and the compute engine should be in the same zone to
nullify this cost. Hence it is advised that the data location is the same as the compute
location.
92. The bandwidth between zones in different regions is not as good as the bandwidth
between zones in the same region. Usually the zones in the same region are present in
different buildings in the same facility.
93. There are 3 ways of using the HDFS file system in dataproc
a. Single machine cluster – All the compute is done in the single machine
b. Single Master Node – There is just 1 master node and other worker nodes. In this case if
the master node dies, then the complete system is broken
c. High Availability System with 3 master nodes – There is a load balancer in this system which
manages the information processing on 3 masters and respective worker nodes
94. Using the HDFS file system is not recommended for Dataproc workloads, because it
prevents Dataproc clusters from being created and deleted dynamically: every new cluster
configuration would require reconfiguring the connections to enable input/output of
information.
95. It is suggested to use preemptible nodes as additional worker nodes for clusters. A system can
have 1 master, 3 standard workers and 7 preemptible workers in a cluster. The preemptible
nodes live at most 24 hours and can be shut off based on demand for compute. Thus, the
company benefits from such nodes since they reduce costs drastically. The minimum
number of worker nodes in a cluster is 2.
96. The preemptible nodes added to a cluster have the same configuration as other worker
nodes. There is no primary disk assigned to these nodes since they are not a part of the
HDFS.
97. It is difficult to obtain the larger machine types as preemptible instances: there is only a limited
number and everyone else on the cloud is looking for such machines, so the lifetime of such a
preemptible machine may not be long; it may be taken away quickly.
98. The master node in Dataproc manages the joining and leaving of preemptible machines.
Each preemptible machine when leaving has 30 seconds to complete the job it is assigned
at that moment, report back to the master and then shut down.
99. The pricing for a preemptible machine starts after 10 minutes of running. If the
machine is lost within the first 10 minutes, there is no charge for it. If it survives
beyond 10 minutes, it can exist for a maximum of 24 hours.
100. There is an option to select the versions for the software on Dataproc.
101. A custom Dataproc cluster can be created using the CLI or the graphical interface. In the CLI
the machine type is written as custom-6-30720 for 6 vCPUs and 30 GB RAM (30 x 1024 = 30720 MB).
102. Standard machines for Dataproc are more cost efficient than the custom machines.
103. Pig is used for transformation of data.
104. YARN (Yet Another Resource Negotiator) is the management platform that manages the compute
resources of a Hadoop cluster; HDFS is used for persistent storage.
105. A shard is a horizontal partition of a database table, such that each shard can be stored in a
different table in a different database, possibly on completely different hardware.
The benefit of sharding data is that it reduces the amount of indexed content in
each table, further reducing the time taken to search data in a table.
106. Google cloud platform infrastructure tries to isolate compute from the storage through
dataproc component. Reasons
a. When a dataproc setup needs to be scaled out – the data needs to be copied again
onto all the new nodes added into the system
b. Dataproc fundamentally eases the process of creating new compute engines
whenever needed to tackle high loads and deleting the engines when the
processing is finished. Copying data onto the new machines every time and keeping
them as persistent storage units defeats the point of using Dataproc services.
107. Colossus is a global file system of petabyte scale. It lays the foundation for Google Cloud
Storage buckets.
108. Dataproc, Dataflow and BigQuery can be used to decouple the storage and compute
components of a dataproc system.
109. With Dataproc, the option to utilize persistent cloud storage buckets instead of HDFS file
system is available. The reference can be changed from hdfs:// to gs:// to point to the
appropriate cloud storage buckets.
110. Apache Beam can be configured to run processes on Dataproc in a stateless manner. This
can help further automate the processing capabilities in a completely serverless
environment.
111. All the installations and infrastructural configurations are done on both master and worker
nodes in the dataproc cluster. This is managed by the GCP itself so that the tasks need not
be repeated on every new node.
112. There is a way of installing software on just the master or just the worker nodes by writing an
external (initialization) script. The script recognizes master or worker nodes by reading the
node’s role attribute (for example “Master”), which is obtained from the
metadata server.
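The course’s script itself is not reproduced in these notes; the gist is to read the node’s role from the metadata server and branch on it. A hedged Python sketch of that check (the dataproc-role attribute name follows the Dataproc convention the course refers to):

    import requests

    # The Compute Engine metadata server exposes instance attributes; Dataproc sets "dataproc-role"
    URL = "http://metadata.google.internal/computeMetadata/v1/instance/attributes/dataproc-role"
    role = requests.get(URL, headers={"Metadata-Flavor": "Google"}).text

    if role == "Master":
        pass  # master-only installation steps would go here
    else:
        pass  # worker-only (or common) steps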
113. The metadata server keeps a note of all the existing master and worker nodes in dataproc.
In case of addition of preemptible nodes into the cluster, metadata server manages the
awareness of new worker nodes at the master node level. Thus, the tasks are distributed
to the new worker nodes as well.
114. There is a Google Git repository with a set of scripts to install open-source software on
newly created cluster nodes. Such an initialization script can be stored in Google Cloud
Storage and triggered as soon as the cluster is up and running.
115. Any other script can also be created and run as an initializing script.
116. The initialization actions can be specified in the CLI (the --initialization-actions flag) or in the GUI
while creating the Dataproc cluster. The Cloud Storage locations of the initialization scripts are
given as part of the command line or in the console and are executed as soon as the cluster is
created. Each file must be referenced in the command; in case of multiple files, the file references
are separated by commas.
117. GCP also allows you to change the properties of the complete cluster and not just individual nodes.
These properties are stored in a file called core-site.xml, which can be modified using
commands from the gcloud SDK.
118. Spark code can process bigquery data. There can be a huge machine learning algorithm
running in spark which might need to use bigquery data and thus it can communicate and
interact with bigquery to generate output for the algorithm.
119. Query output from BigQuery cannot always be imported into a Pandas DataFrame,
because the data Pandas processes must fit in memory, and a large result from BigQuery is
unlikely to fit into the memory of the Dataproc cluster machine. Hence collecting a suitably
small subset of the BigQuery data makes more sense than pulling a large result out of
BigQuery.
120. BigQuery is efficient if the data is denormalized
121. Payment for BigQuery is made on the basis of the amount of data that we process.
122. The cost of storage in BigQuery is the same as the cost of storage in Cloud Storage.
123. BigQuery is a near real-time data warehouse which can be used for ad-hoc analysis.
124. There are both flat payment and pay-per-use payment plans for bigquery. Based on the
amount of usage, the payment plans can be selected for different functions.
125. Bigquery data can be shared across the company. Somebody who has access to the data
can utilize data to perform ad-hoc analysis on the same.
126. A bigquery dataset can have tables and views. The access is managed at the dataset level.
127. The storage of data in BigQuery is columnar. BigQuery is not a transactional store. Every
column of data in BigQuery is stored in a separate file, and there are no indexes or keys
to manage the data. Partitions can be used, but they are not part of the default setup
of BigQuery. Pricing is also related to the number of columns a query reads, so
to reduce the cost one must read only the columns that are actually needed.
128. Bigquery is meant for immutable large datasets.
129. The output data can be exported into Google sheet or tables anywhere else.
131. The LPAD function in BigQuery left-pads text content. For ex. LPAD('2', 2, '0') returns '02', so a
month value of 2 concatenated into a date gives YYYY-02-DD.
132. The bq command-line tool can be used from Cloud Shell to create a new dataset in BigQuery
(for example, bq mk my_dataset).
133. A query can be given a name, like a view, and referred to in another query by using the WITH
keyword. For ex. WITH xyz AS (SELECT * FROM abc) SELECT * FROM xyz WHERE pqr.
134. ARRAY and STRUCT create nested, repeated key-value structures in BigQuery. The arrays are
untangled back into rows using the UNNEST keyword, e.g. SELECT item FROM t, UNNEST(t.items) AS item.
135. Tables in BigQuery can be joined based on functions, not just column equality. For ex.
STARTS_WITH(a, b) is true for all values of a which start with b and can be used as a join condition.
136. BigQuery has analytic functions like LEAD(), LAG() and NTH_VALUE(). There are also functions like
ROW_NUMBER(), RANK(), etc. With RANK() the OVER clause is used, e.g. RANK() OVER (ORDER BY x).
137. The end of a regular expression is marked by a $
138. Optimization of a query can be managed by the following:

139. To optimize the query,


a. select only the required number of columns from the table
b. Join the biggest tables first, then progressively smaller ones, and so on
c. Built-in functions are faster than UDFs
d. APPROX functions are internal BigQuery functions (which are extremely fast) which can
be used in cases where an approximate number is acceptable. The error of APPROX
functions is around 1%.
e. Order the outermost query instead of on any inner queries.
f. Wildcard tables function helps union all the tables starting with a certain pattern
of names. This function can operate faster in bq.
g. Partitioned tables (based on Time stamp) can be processed faster in bq.
140. The wait time is a relative value. If the wait value is 100% then all the other functions are
measured relative to the wait time.
141. Stackdriver allows you to monitor the Bigquery from a GUI based interface
142. Batch loading data into BigQuery and exporting data from BigQuery are free. Any cached query is
free of cost; the cache is maintained per user. Any query that results in an error is free. There is a
cost for streaming ingestion into BigQuery. Query processing beyond the free 1 TB is charged.
143. Dataflow allows the management of data pipelines. For ex. Dataflow is used to transfer
data from Bigquery to Cloud storage through a series of transform steps. If the data that
must be transferred is huge, Dataflow has the capability to autoscale and run the transform
logics on multiple parallel servers and create the output in cloud storage.
144. While reading from Pub/Sub, the aggregate functions must be run over a window;
an average, for example, then becomes a moving average.
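A sketch of a windowed mean over a Pub/Sub stream with the Beam Python SDK (the topic name and window sizes are assumptions):

    import apache_beam as beam
    from apache_beam.transforms import window
    from apache_beam.options.pipeline_options import PipelineOptions

    opts = PipelineOptions(streaming=True)
    with beam.Pipeline(options=opts) as p:
        (p
         | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/sensor-readings")
         | "Parse" >> beam.Map(lambda msg: float(msg.decode("utf-8")))
         | "Window" >> beam.WindowInto(window.SlidingWindows(size=300, period=60))  # 5-min window, every minute
         | "Mean" >> beam.CombineGlobally(beam.combiners.MeanCombineFn()).without_defaults()
         | "Print" >> beam.Map(print))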
145. Parallel Collection – Pcollection is a list of items which is not bounded (it need not fit into
the memory). Data in a pipeline is represented as a parallel collection. The transformations
are applied to parallel collections (input and output of transformation).
146. Without sharding the output content is stored on a single machine which creates i/o issues.
147. Use “mvn” to execute a java program. It downloads or includes all the dependencies for
the code flow.
148. The pipeline tasks are launched by the main program, which takes care of the
subsequent flow. By default, main runs on the local machine; to execute it in the cloud, the
Dataflow runner must be configured to manage the pipeline tasks.
149. Like map in MapReduce, ParDo acts on 1 item at a time but ParDo allows parallel
processing.
150. A PCollection can be used to pass a list of strings.
151. Combine operation and GroupByKey perform similar operations. Combine is preferred as
the first choice for any of the aggregation tasks.
152. A model is a mathematical function which takes an input and provides an output based on
training.
153. An N-dimensional array is a tensor.
155. Gradient descent is the method used to identify w1, w2 and b. If changing the value of the
weights makes the mean squared error go down, it means we are closer to a better model.
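A toy illustration of that loop in plain Python/NumPy (the data here is synthetic, just to show the mechanics):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))             # two inputs: x1, x2
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5   # "true" w1, w2 and b for the toy data

    w, b, lr = np.zeros(2), 0.0, 0.1
    for _ in range(500):
        err = X @ w + b - y                   # prediction error
        mse = np.mean(err ** 2)               # mean squared error
        w -= lr * (2 * X.T @ err / len(y))    # step the weights downhill
        b -= lr * (2 * err.mean())            # step the bias downhill

    print(w, b, mse)                          # w ~ [3, -2], b ~ 0.5, mse ~ 0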
156. The machine learning model is only as good as the training data used to train the model.
157. A neuron is a weighted sum of its inputs; it can be understood as a new feature learned
from the same set of input variables.
158. The training data must also have a good number of negative data to train the model to
identify if a certain type of input is negative. This is more like fine tuning the model to give
better results.
159. In machine learning, as opposed to Statistical modelling, if there are outliers then the user
must go search for more instances of such outliers to train the model to learn about the
occurrence of such outliers and include the outliers within the ML model.

162. The error of the model must be measured, e.g. with MSE, in order to iterate and create a
better model. The model that gives the lowest MSE is the best model for the problem.
163. For a classification problem, the error measure is called cross-entropy

165. Classification metrics:
a. Accuracy – the fraction of cases in which the model has identified the label correctly
b. Precision – when the model predicts a certain label to be positive, how often is
the model correct: TP/(TP+FP)
c. Recall – out of all the actual positives, how many did the model recognize
correctly: TP/(TP+FN)
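A tiny numeric check of those formulas (the counts are made up):

    tp, fp, fn = 90, 10, 30          # true positives, false positives, false negatives

    precision = tp / (tp + fp)       # 90 / 100 = 0.90
    recall = tp / (tp + fn)          # 90 / 120 = 0.75
    print(precision, recall)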
166. Accuracy is used when the outcomes are balanced or there are roughly 50% chances of
each outcome.
167. Root Mean Square Error – Root of MSE
168. Hyper-evaluation
169. ML models are built using TensorFlow. There is a component called TensorBoard which is used
to visualize and monitor the training.
170. TensorFlow – it’s a library which allows one to do numeric computations. Much as Java code
runs on a variety of hardware, TensorFlow code is portable across hardware platforms.
171. TensorFlow allows one to train the ML model on GPUs and run it on mobile phone. It
represents the codes as dataflow graphs which can be presented in different hardware
platforms.
172. TensorFlow is a framework, like various Python libraries, which can perform complex
machine learning modelling operations. It is an open-source framework which can be
downloaded onto a local system and used for ML.
173. TensorFlow can be run at scale in Google Cloud using Cloud ML Engine.
174. Operations on tensors are executed only inside a TensorFlow session.
176. TensorFlow separates building a graph from running it: the user can create a graph in one
place and run it in another, so remote execution of a tensor graph is possible.
177. To run operations on tensors, one must set up placeholders for the values and then feed
values into the respective placeholders at run time.
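A minimal example in the TensorFlow 1.x style used at the time of the course (in TensorFlow 2.x the same calls live under tf.compat.v1):

    import tensorflow as tf   # TensorFlow 1.x style

    # Build the graph: placeholders receive real values only at run time
    a = tf.placeholder(tf.float32)
    b = tf.placeholder(tf.float32)
    total = a + b

    # The graph executes only inside a session; that session could be on another machine or a GPU
    with tf.Session() as sess:
        print(sess.run(total, feed_dict={a: 3.0, b: 4.0}))   # 7.0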

180. There is an Experiment class in TensorFlow which allows the user to create a robust
modelling setup. It allows the user to choose the metric on which to base the
iteration/evaluation cycle.
181. Hyperparameter tuning
182. A tensorflow model can be packaged as a Python module for scaling out the execution of
tensorflow in real time.
183. Use single region bucket for ML training data. The ML modelling process will occur in a
single region and hence the region must be maintained consistently.
184. ML requires default values for the predictor variables input into the model
185. While modelling it is a good practice to keep the data coming from multiple sources synced
on time. For ex. If the transaction data for prediction is used till 3 days ago, then the
customer information data should also be used as it was 3 days ago. This maintains
consistency of the model
186. In machine learning, missing values are not imputed from the available data as they are in
statistics. Instead, with one-hot encoding there is an extra column for the cases where the value
is not provided.
187. Tight coupling between the sender and receiver of a certain message in stream processing
can create issues in case a component (sender or receiver) crashes.
188. Thus, to manage messages in a loosely coupled system, a message bus is used to buffer the
messages.
190. Pub/Sub is that solution on GCP. Smaller packets will write faster than larger packets,
so the data can arrive jumbled up and latency is involved in the data ingestion part of the
system

192. Any kind of stream processing done on already-stored data increases the latency. BigQuery, in
contrast, can process streaming data while it arrives into the warehouse.
193. For very high throughput and very low latency – Bigtable is a more appropriate solution.

195. Pub/Sub primarily absorbs the variable speed of the input data. It guarantees delivery of a
message at least once. It supports both push and pull for subscribers.

197. By publishing the messages in batches, the network costs can be reduced.
If, alternatively, each message is sent separately, the cost incurred for passing the messages
is higher.

199. Pull has delays because the subscriber function works periodically, while push is faster with
zero latency.
200. 7 days is the maximum time duration for which the data is saved in the message buffer in
pub/sub.
201. A message published before a subscription exists will not be delivered to that
subscriber once the subscription is created on the topic.
202. The continuously arriving data can be out of order. Thus, to manage this shuffled data the
processing can be done based on windowing. For ex. All data arriving before 9 am can be
processed together.
203. Windowing the data helps in organizing the data and distinguishing between the processing
time and event time.
205. Watermark tracks the time difference between processing and event time of a message.
206. Dataflow knows whether all the messages have been received from Pub/Sub for a window.

208. Data Studio is a dashboarding tool from Google which can be connected to various GCP
sources. It helps one build charts, maps, etc.
209. Cloud Spanner must be used if one needs global consistency with data. It enables horizontal
scalability with addition of more and more nodes.
213. BigTable separates processing and storage by storing the data in colossus filesystem (cloud
storage) and processes the information on nodes. It uses pointers to the data to compute
the information.
214. BigTable information is stored in form of Column families. There is a single row key which
can be indexed and the data is stored in ascending order of the row key in sections called
tablets. While reading the data, it is preferred that sections of similar row-keys are stacked
together in the same tablet for low-latency access to the data.
215. While designing BigTable there can be a wide table or a narrow table. In wide table format
each column has a value for every row.
216. The rows are sorted lexicographically by row key. If the newest data should be read first, the key
can include a reverse timestamp (or a timestamp written so that lexicographic order matches
chronological order, e.g. year-month-day).
217. The row key must not start with a domain name. If user IDs are sequentially assigned, then a user
ID must not be used to start the row key either (both lead to hotspotting on a few nodes).
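One common way to build such a key, sketched in Python (the sensor-id scheme and the constant are illustrative):

    import time

    def row_key(sensor_id, event_time):
        # Row keys sort lexicographically, so subtracting from a large constant
        # makes newer events sort first (a "reverse timestamp")
        reverse_ts = 10**13 - int(event_time * 1000)
        return f"{sensor_id}#{reverse_ts:013d}".encode()

    print(row_key("sensor-42", time.time()))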
218. Bigtable learns the access patterns and reorganizes the data to balance the load across the
nodes of the cluster. Since Bigtable only stores pointers to the data on its nodes, this kind of
redistribution is not a heavy process. To let Bigtable make this adjustment, it should be given
more than roughly 300 GB of data and allowed to run for at least a few hours.

