CS 541 Lecture -*- Outline -*-

* Specifying distributed systems
  A programming method for distributed systems, based on the paper
     Liskov and Weihl, ``Specification of Distributed Programs'',
     Distributed Computing Vol. 1, pages 102-108, 1986.

** the problem: fault-tolerance vs. performance
   we want to be able to design and program fault-tolerant systems,
     with good availability and reliability
     despite crashes and link failures
   to achieve availability and reliability,
     have to replicate functions and data
     but then synchronizing changes becomes slower...

*** idealized system: logically centralized, one-user-at-a-time
**** logically centralized
     the simplest way to specify a distributed system
       is to make it *logically centralized*
     def: a logically centralized system acts as if it were
       running on one computer
***** replication (atomic)
      if replicating data to provide availability and reliability,
        need to avoid inconsistency among the different copies
      can use atomic transactions (so all copies are updated or none are)
      byzantine generals problem
***** machine crashes (recoverable)
      crashes after the user walks away shouldn't concern the user
        need persistent storage that survives crashes
          (stable storage: Lampson, LNCS 105)
      need to make a crash of the user's machine
        look like a crash of the whole system
        aborts the user's changes
        similar kinds of stuff needed if the user aborts
          the computation (Control-C)
          have to ensure that side-effects on the rest
            of the system are undone
          rollback, orphan detection
      crashes of other machines
        program should be able to exploit replication,
          by trying other machines on demand
          (so needs to know, or timeout, etc.)
        aborting effects of the machines that crashed
        timeouts and retries in the system should not mean
          things happen more than once
          (3 computer chain, middle one crashes)
        user shouldn't have to wait indefinitely for other machines
**** one-user-at-a-time (serializable)
     each user has exclusive access
     user's activity must be serializable
       def: serializable means the effect is the same
         as if executed one by one
     preserves invariants
     need locking or some other concurrency control mechanism
*** consequences
    effects of a user request do not take into account concurrent activity
    users aren't bothered with implementation details
    Some performance problems:
      atomic commit takes about 2 orders of magnitude longer than RPC
      locking means making copies of objects (for abort)
    These performance penalties may be worth paying
      for things like banking systems
    Some authors don't specify systems as atomic, to avoid these penalties

** the idea
   specify the system as if it were atomic
   but use nondeterminism to allow efficient implementation.
*** advantages
    illusion of logically-centralized, atomic system
      allows spec to concentrate on behavior relevant to clients
    nondeterminism allows implementation without using
      transactions, locking, etc.

** examples
*** dictionary (e.g. a directory without additional info)
    logically-centralized spec
    --------------
    DICTIONARY OBJECT
    LOGICALLY CENTRALIZED

    insert = proc(x:element)
      REQUIRES: x not in Members
      EFFECT: adds x to Members

    delete = proc(x:element)
      REQUIRES: x in Members
      EFFECT: adds x to ExMembers

    list = proc() returns(sequence[element])
      EFFECT: return Members - ExMembers
    ----------------
    To implement this, have to do locking, etc.
    Want to be able to do this sort of thing for some apps.

    -------------------
    DISTRIBUTED (weaker) SPECIFICATION

    list = proc() returns(sequence[element])
      EFFECT: return a subset of Members
    ---------------
    Problem is that this doesn't convey enough information.
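    A minimal sketch (not from the paper) of why the weaker spec says
    too little. It models the abstract sets Members and ExMembers in
    Python and writes the two EFFECT clauses for list as predicates.
    The class name Dictionary and the two helper functions
    satisfies_centralized_list and satisfies_weaker_list are
    hypothetical names chosen for illustration only.

```python
class Dictionary:
    """Model of the logically centralized spec (illustrative only)."""

    def __init__(self):
        self.members = set()     # abstract set Members: everything ever inserted
        self.ex_members = set()  # abstract set ExMembers: everything ever deleted

    def insert(self, x):
        # REQUIRES: x not in Members
        assert x not in self.members, "REQUIRES violated: x already in Members"
        # EFFECT: adds x to Members
        self.members.add(x)

    def delete(self, x):
        # REQUIRES: x in Members
        assert x in self.members, "REQUIRES violated: x not in Members"
        # EFFECT: adds x to ExMembers
        self.ex_members.add(x)

    def list(self):
        # EFFECT: return Members - ExMembers (sorted only to get a sequence)
        return sorted(self.members - self.ex_members)


def satisfies_centralized_list(d, result):
    """Centralized EFFECT: result is exactly Members - ExMembers."""
    return set(result) == d.members - d.ex_members


def satisfies_weaker_list(d, result):
    """Weaker EFFECT: result is any subset of Members."""
    return set(result) <= d.members


d = Dictionary()
d.insert("a")
d.insert("b")
d.delete("a")

print(d.list())
print(satisfies_weaker_list(d, []))       # empty answer is allowed
print(satisfies_weaker_list(d, ["a"]))    # deleted element is allowed
print(satisfies_centralized_list(d, []))  # centralized spec rejects it
```

    Under the weaker predicate an implementation could always return
    the empty sequence, or return deleted elements, and still be
    correct. A client cannot tell from that spec what to expect in the
    common case.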
    ------------------
    RECOMMENDED FORMAT

    list = proc() returns(sequence[element])
      NORMAL EFFECT: return Members - ExMembers
      ABNORMAL EFFECT: return subset of Members
    --------------
    Normal is same as in logically centralized system
    abnormal effects allow for distribution, but don't say how.

    Starting with logically centralized spec,
      add nondeterminism as desired for performance
    specs show user point of view,
      and help evaluate whether the nondeterminism is acceptable

*** banking system (p. 107)
    Q: what would the logically centralized system look like?
    Q: how is it weakened?
    Q: is the weakening acceptable?

* what is expected in their designs?
  language must show programmers some machine/link failures
    while program runs (SR doesn't)
  like it to be fairly high-level (low-level timeouts done by language)
  explicit support for atomic, serializable, recoverable programs
  has to use SR syntax as base, extensions/deletions permitted
    but extensions should resemble SR syntax
  hints:
    may want to add atomic transactions (or maybe SR has enough?)
    may want to add exception handling
    may want to add stable storage (or maybe SR can do that?)
    take out some of the low-level stuff to compensate?