

ID: 3714727 • Letter: A

Question

Amazon is one of the most technologically-savvy companies -- especially as it relates to its "cloud computing" offerings (check http://en.wikipedia.org/wiki/Cloud_computing for more details about cloud computing, beyond what we covered in class). In mid-April 2011, Amazon's cloud services suffered significant disruptions. Amazon offered a detailed explanation of what mitigated these disruptions. You can read the description of the incident and how Amazon responded to it at http://aws.amazon.com/message/65648/. The article discusses a number of issues that contributed to a domino effect of serious failures. Most of these issues were subjects of study in CS-350. For the purposes of this problem, refer to the highlighted sections from the Amazon report posted on Piazza at http://www.piazza.com/class_profile/get_resource/jbzqx5zv9uw689/jfzrspesuf42ga. Based on this, answer the following questions:

a. In your own words, explain why it is necessary that "when data on a customer's volume is being re-mirrored, access to that data is blocked until the system has identified a new primary (or writable) replica."

b. The article refers to at least one situation where starvation was an issue. In your own words, explain the situation by identifying the resource that was the subject of starvation and the processes that were starved.

c. The article refers to a distributed synchronization operation that requires solution to the leader election problem, which we discussed in class following our coverage of the global snapshot problem as well as when we discussed Lamport’s distributed consensus protocol (Paxos). In your own words, explain why it was necessary to solve that problem in the Amazon system. In particular, identify the functionality of the elected leader and why it is crucial to have one agreed-upon leader to carry out this functionality.

d. Identify one place in the Amazon system (as described) in which the designers intentionally enforced an upper bound on the level of concurrency. In your own words, explain the pros and cons of doing so.

e. The article shows that it might have been necessary to control the level of concurrency in other parts of the system to avoid the thrashing that ensued because of the cascaded set of failures in the system. Suggest at least one way this could have been done. Specifically, what process or service would have benefitted from a cap on the level of concurrency?

Explanation / Answer

a) Yes, it is necessary to block access to the data while it is being re-mirrored to another node.

Otherwise another client could modify the data at the same time, which may lead to data inconsistency.

Whenever we keep replicas of the same data on different systems, we must ensure that every modification to the original data is also applied to the replicas.

Data inconsistency means that different clients see different values for the same data at the same time, which produces incorrect or unexpected results.

It is very similar to the cache coherence problem in shared distributed systems, which also arises from inconsistent copies.

So while a particular piece of data is being re-mirrored, it is good practice to block all access to it until a new primary (writable) replica has been identified.

That gives one more level of protection to the client's data.
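The blocking rule in answer (a) can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the class `MirroredVolume`, its method names, and the node names are hypothetical and are not taken from Amazon's report. It shows reads and writes waiting on a condition variable until a new primary replica has been named.

```python
import threading


class MirroredVolume:
    """Toy volume that refuses I/O while it has no agreed-upon primary replica."""

    def __init__(self, primary):
        self._cond = threading.Condition()
        self._primary = primary      # name of the writable replica, or None
        self._blocks = {}            # block number -> bytes

    def begin_remirror(self):
        # The old primary is no longer trusted; block all access from now on.
        with self._cond:
            self._primary = None

    def finish_remirror(self, new_primary):
        # A new primary has been identified; wake up any blocked callers.
        with self._cond:
            self._primary = new_primary
            self._cond.notify_all()

    def _wait_for_primary(self, timeout):
        # Caller must hold self._cond. Waits until a primary exists.
        if not self._cond.wait_for(lambda: self._primary is not None, timeout):
            raise TimeoutError("volume is re-mirroring; no primary replica yet")

    def write(self, block, value, timeout=None):
        with self._cond:
            self._wait_for_primary(timeout)
            self._blocks[block] = value
            return self._primary     # which replica accepted the write

    def read(self, block, timeout=None):
        with self._cond:
            self._wait_for_primary(timeout)
            return self._blocks.get(block)


if __name__ == "__main__":
    vol = MirroredVolume("node-a")   # hypothetical node names
    vol.write(7, b"old")
    vol.begin_remirror()
    try:
        vol.read(7, timeout=0.1)
    except TimeoutError as exc:
        print("blocked:", exc)
    vol.finish_remirror("node-b")
    print(vol.read(7))
```

Because no client can read or write between `begin_remirror` and `finish_remirror`, no client can observe or create a value that exists on only one copy.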

b) The problem occurred when high-volume traffic was misrouted to a low-speed network instead of to a router on the high-speed primary network.

As a result, many nodes lost connectivity to the other nodes holding replicas of their data.

The nodes' response to this kind of failure was to search immediately for a new node with enough free space to hold a complete new replica.

Because so many nodes lost connectivity to their replica nodes at once, the free space was exhausted in a short time.

This event is called a re-mirroring storm. At that point the cluster was completely unable to process any more API requests.

The API requests to create new volumes then began to back up, and this resulted in thread starvation in the EBS control plane. In other words, the starved resource was the control plane's request-handling threads, and the starved processes were the API requests that could not obtain one.
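The thread starvation described in answer (b) can be reproduced in miniature. The sketch below is an illustration under assumptions, not the EBS control plane: the pool size, the function names `create_volume` and `describe_volume`, and the `patience` timeout are all made up. Requests aimed at the degraded cluster hold every worker thread, so an unrelated request cannot run until the cluster recovers.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def starvation_demo(pool_size=2, patience=0.2):
    """Return (starved, healthy_result, stuck_results) for a shared thread pool."""
    cluster_recovered = threading.Event()

    def create_volume():
        # Request to the degraded cluster: holds its thread until recovery.
        cluster_recovered.wait()
        return "created"

    def describe_volume():
        # Request that needs nothing from the degraded cluster.
        return "ok"

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Enough stuck requests to occupy every worker thread in the pool.
        stuck = [pool.submit(create_volume) for _ in range(pool_size)]
        healthy = pool.submit(describe_volume)   # queued behind the stuck ones
        try:
            healthy.result(timeout=patience)
            starved = False
        except FutureTimeout:
            starved = True                        # no thread was free to run it
        finally:
            cluster_recovered.set()               # let the stuck requests finish
        return (starved,
                healthy.result(timeout=5),
                [f.result(timeout=5) for f in stuck])


if __name__ == "__main__":
    print(starvation_demo())
```

The healthy request is trivial, yet it waits as long as the slowest stuck request, which is the pattern the answer calls thread starvation.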

c) Leader election is a problem that arises in distributed systems.

Distributed systems are systems in which data is distributed among different nodes.

By electing one of these nodes as a leader that is capable of managing shared resources, we can avoid the consistency problems that may arise when multiple nodes try to access the same resource.

Amazon faced a severe "no space for new replicas" problem. That problem could have been reduced to some extent by a leader node capable of managing resources in an optimal way.
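To make the "one agreed-upon leader" requirement in answer (c) concrete, here is a minimal Python sketch. It is neither Paxos nor Amazon's actual protocol: it assumes a single hypothetical arbiter, `PrimaryRegistry`, that grants the primary role through a compare-and-set on an epoch number, so that when several replicas race for the role exactly one of them wins.

```python
import threading


class PrimaryRegistry:
    """Toy arbiter: at most one node becomes primary per volume per epoch."""

    def __init__(self):
        self._lock = threading.Lock()
        self._primary = {}   # volume_id -> (epoch, node_id)

    def current(self, volume_id):
        with self._lock:
            return self._primary.get(volume_id, (0, None))

    def try_become_primary(self, volume_id, node_id, expected_epoch):
        # Compare-and-set: succeed only if nobody has won this epoch yet.
        with self._lock:
            epoch, _ = self._primary.get(volume_id, (0, None))
            if epoch != expected_epoch:
                return False
            self._primary[volume_id] = (epoch + 1, node_id)
            return True


def race(registry, volume_id, node_ids):
    """Let every node try to claim the primary role at once; return the winners."""
    epoch, _ = registry.current(volume_id)
    winners = []
    winners_lock = threading.Lock()

    def candidate(node_id):
        if registry.try_become_primary(volume_id, node_id, epoch):
            with winners_lock:
                winners.append(node_id)

    threads = [threading.Thread(target=candidate, args=(n,)) for n in node_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners


if __name__ == "__main__":
    registry = PrimaryRegistry()
    print(race(registry, "vol-1", ["node-a", "node-b", "node-c"]))
    print(registry.current("vol-1"))
```

Without a single winner, two replicas could both accept writes for the same volume and diverge, which is the inconsistency described in answer (a).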

d) When the failure in one zone started affecting the other availability zones as well, the designers decided to disable all of the requests, irrespective of the zone affected.
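Disabling requests outright can be seen as the extreme case of capping concurrency, namely a cap of zero. The Python sketch below shows the general idea under assumptions: `ZoneGate`, its methods, and the cap values are hypothetical and do not come from the Amazon report. Requests beyond the cap are rejected immediately instead of queueing and holding resources.

```python
import threading


class ZoneGate:
    """Toy admission control: at most `cap` requests in flight at once."""

    def __init__(self, cap):
        self._lock = threading.Lock()
        self._cap = cap
        self._in_flight = 0

    def set_cap(self, cap):
        # A cap of 0 disables new requests entirely.
        with self._lock:
            self._cap = cap

    def try_enter(self):
        # Admit the request only if the cap has not been reached.
        with self._lock:
            if self._in_flight >= self._cap:
                return False
            self._in_flight += 1
            return True

    def leave(self):
        # Called when an admitted request completes.
        with self._lock:
            self._in_flight -= 1


if __name__ == "__main__":
    gate = ZoneGate(cap=2)
    print([gate.try_enter() for _ in range(3)])  # third request is rejected
    gate.leave()
    print(gate.try_enter())                      # a slot has been freed
    gate.set_cap(0)
    gate.leave()
    print(gate.try_enter())                      # disabled: nothing is admitted
```

The trade-off is the usual one for a concurrency bound: rejected requests fail fast and protect the rest of the system, but some work that could have succeeded is turned away.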
