CS 541 Lecture -*- Outline -*-

* Specifying distributed systems
	A programming method for distributed systems, based on the paper
	Liskov and Weihl, ``Specification of Distributed Programs'',
	Distributed Computing Vol. 1, pages 102-108, 1986.

** the problem: fault-tolerance vs. performance
	we want to be able to design and program fault-tolerant systems,
		with good availability and reliability
		despite crashes and link failures
	to achieve availability and reliability, have to replicate functions
		and data
	but then keeping the copies consistent makes updates slower...

*** idealized system: logically centralized, one-user-at-a-time
**** logically centralized
	the simplest way to specify a distributed system is to make it
		*logically centralized*

	def: a logically centralized system acts as if it were running
		on one computer
***** replication (atomic)
	If replicating data to provide availability and reliability,
		need to avoid inconsistency in different copies
	can use atomic transactions (so either all copies are updated or none are)
		cf. Byzantine generals problem
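	A minimal sketch of "all copies updated or none are", using two-phase commit (one standard atomic commit protocol; the notes don't name a specific one). Replica, prepare, commit, abort and two_phase_update are hypothetical names; the replicas are in-memory objects standing in for remote sites, and coordinator crashes, timeouts and logging to stable storage are all left out.

```python
class Replica:
    """One copy of the data (hypothetical in-memory stand-in for a remote site)."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up            # False simulates a crashed or unreachable site
        self.data = {}
        self.pending = None     # update staged during phase 1

    def prepare(self, key, value):
        # Phase 1: stage the update and vote; a down replica cannot vote yes.
        if not self.up:
            return False
        self.pending = (key, value)
        return True

    def commit(self):
        # Phase 2 (commit): make the staged update visible.
        key, value = self.pending
        self.data[key] = value
        self.pending = None

    def abort(self):
        # Phase 2 (abort): discard any staged update.
        self.pending = None


def two_phase_update(replicas, key, value):
    """Update every replica or none of them; returns True if committed."""
    votes = [r.prepare(key, value) for r in replicas]
    if all(votes):
        for r in replicas:
            r.commit()
        return True
    for r in replicas:
        r.abort()
    return False
```

	If any replica cannot prepare, every staged update is discarded, so the copies never disagree.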
***** machine crashes (recoverable)
	crashes after user walks away shouldn't concern the user
		need persistent storage that survives crashes
		(stable storage: Lampson, LNCS 105)

	need to make crash of user's machine look like crash of whole system
		aborts the user's changes
	similar machinery is needed if the user aborts the computation
		(Control-C)
	have to ensure that side effects on the rest of the system are undone
		rollback, orphan detection
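	Rollback can be sketched with an undo log: before each write, record the old value; abort replays the log backwards. Action and its methods are hypothetical names, and orphan detection (finding computations still running on behalf of an aborted action) is not modeled.

```python
_MISSING = object()  # marks keys that did not exist before the action wrote them


class Action:
    """An abortable action over shared state, using an undo log (hypothetical API).

    Every write records the previous value first, so abort() can restore the
    shared state to exactly what it was before the action began.
    """

    def __init__(self, state: dict):
        self.state = state
        self.undo_log = []   # (key, old_value) pairs, oldest first

    def write(self, key, value):
        self.undo_log.append((key, self.state.get(key, _MISSING)))
        self.state[key] = value

    def commit(self):
        self.undo_log.clear()   # changes become permanent

    def abort(self):
        # Undo in reverse order so repeated writes to one key unwind correctly.
        for key, old in reversed(self.undo_log):
            if old is _MISSING:
                del self.state[key]
            else:
                self.state[key] = old
        self.undo_log.clear()
```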

	crashes of other machines
		program should be able to exploit replication,
			by trying other machines on demand
				(so it must detect the crash, e.g., by a timeout),
			aborting the effects of computations on the machines that crashed
		timeouts and retries in the system should not make things happen
			more than once (e.g., in a chain of 3 computers where the middle
			one crashes, a retry must not repeat the last one's work)
		user shouldn't have to wait indefinitely for other machines
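	One common way to make retries harmless is duplicate suppression: the client reuses one unique request id across retries, and the server caches its reply for each id it has already executed. Server, deposit and call_with_retries are hypothetical names; a real server would keep the reply table in stable storage and garbage-collect it. In the 3-computer chain, the middle machine would also have to forward the same id so the last machine can recognize a retry.

```python
import uuid


class Server:
    """Duplicate suppression: each request carries a unique id and the server
    remembers the reply, so a retried request is not re-executed."""

    def __init__(self):
        self.balance = 0
        self.replies = {}   # request id -> reply already sent

    def deposit(self, request_id, amount):
        if request_id in self.replies:       # a retry of a request already done
            return self.replies[request_id]
        self.balance += amount
        self.replies[request_id] = self.balance
        return self.balance


def call_with_retries(server, amount, lost_replies=0):
    """Client side: keep the same id across retries. `lost_replies` simulates
    replies that were lost, which makes the client time out and resend."""
    request_id = uuid.uuid4().hex
    while True:
        reply = server.deposit(request_id, amount)
        if lost_replies == 0:
            return reply
        lost_replies -= 1   # reply lost: time out and resend the same request
```

	Even when two replies are lost, the deposit is applied once: the server sees the same id three times and executes it only the first time.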

**** one-user-at-a-time (serializable)
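	each user appears to have exclusive access
		user's activity must be serializable

	def: serializable means the effect is the same as if the activities
		were executed one at a time, in some order
		so activities that each preserve invariants when run alone
			also preserve them when run concurrently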

	need locking or some other concurrency control mechanism
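	The definition can be checked by brute force on a tiny example: run the transactions one at a time in every order and ask whether the concurrent result matches one of those serial results. This compares final states only and illustrates the definition; it is not itself a concurrency control mechanism. The transactions deposit_10 and double and the lost-update interleaving are made up for the example.

```python
from itertools import permutations


def serial_outcomes(initial, txns):
    """All final states reachable by running the transactions one at a time,
    in every possible order."""
    outcomes = []
    for order in permutations(txns):
        state = dict(initial)
        for txn in order:
            state = txn(state)
        outcomes.append(state)
    return outcomes


def is_serializable(initial, txns, observed):
    """Final-state check only: is `observed` the result of *some* serial order?"""
    return observed in serial_outcomes(initial, txns)


def deposit_10(s):
    return {**s, "x": s["x"] + 10}


def double(s):
    return {**s, "x": s["x"] * 2}


def lost_update_interleaving(initial):
    # Both transactions read x before either writes it; the second write wins.
    read_a = initial["x"]
    read_b = initial["x"]
    x = read_a + 10      # deposit_10 writes its result
    x = read_b * 2       # double writes, overwriting the deposit
    return {"x": x}
```

	With x = 5 the serial orders give 30 or 20, while the interleaving gives 10, so it is not serializable; locking would rule that interleaving out.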

*** consequences
	specs of a user request need not take concurrent activity into account
	users aren't bothered with implementation details

	Some performance problems:
		atomic commit takes about 2 orders of magnitude longer than RPC
		locking means making copies of objects (for abort)
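	The cost in the second point can be sketched as an object whose writer gets a private copy when it acquires the write lock: commit installs the copy, abort just discards it. AtomicObject and its methods are hypothetical; there is a single writer at a time, and readers and waiting for locks are not modeled.

```python
import copy


class AtomicObject:
    """An object whose writer works on a private copy (hypothetical class).

    Acquiring the write lock copies the committed value; commit installs the
    copy, abort throws it away, so abort needs no undo work.
    """

    def __init__(self, value):
        self.committed = value
        self.owner = None        # action currently holding the write lock
        self.tentative = None    # the writer's private copy

    def write_lock(self, action):
        if self.owner not in (None, action):
            raise RuntimeError("write lock held by another action")
        if self.owner is None:
            self.owner = action
            self.tentative = copy.deepcopy(self.committed)   # the extra copy
        return self.tentative

    def commit(self, action):
        if self.owner != action:
            raise RuntimeError("action does not hold the write lock")
        self.committed, self.tentative, self.owner = self.tentative, None, None

    def abort(self, action):
        if self.owner != action:
            raise RuntimeError("action does not hold the write lock")
        self.tentative, self.owner = None, None
```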

	These performance penalties may be worth paying for things like
		banking systems

	Some authors don't specify systems as atomic, to avoid these penalties

** the idea
	specify the system as if it were atomic
	but use nondeterminism to allow efficient implementation.

*** advantages
	illusion of logically-centralized, atomic system
		allows spec to concentrate on behavior relevant to clients
	nondeterminism allows implementation without using
		transactions, locking, etc.

** examples
*** dictionary (e.g. a directory without additional info)
	logically-centralized spec
--------------
	DICTIONARY OBJECT

 LOGICALLY CENTRALIZED

insert = proc(x:element)
   REQUIRES: x not in Members
   EFFECT:   adds x to Members  

delete = proc(x:element)
   REQUIRES: x in Members
   EFFECT:   adds x to ExMembers

list = proc() returns(sequence[element])
   EFFECT:   return Members - ExMembers
----------------
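	A direct, single-machine rendering of the logically centralized spec, keeping the abstract state (Members, ExMembers) as two sets. The REQUIRES clauses leave behavior undefined when violated; raising ValueError there is an assumption made for the sketch.

```python
class Dictionary:
    """Logically centralized dictionary with the spec's abstract state:
    Members (everything ever inserted) and ExMembers (everything deleted)."""

    def __init__(self):
        self.members = set()
        self.ex_members = set()

    def insert(self, x):
        if x in self.members:                      # REQUIRES: x not in Members
            raise ValueError(f"{x!r} already inserted")
        self.members.add(x)                        # EFFECT: adds x to Members

    def delete(self, x):
        if x not in self.members:                  # REQUIRES: x in Members
            raise ValueError(f"{x!r} was never inserted")
        self.ex_members.add(x)                     # EFFECT: adds x to ExMembers

    def list(self):
        # EFFECT: return Members - ExMembers (order unspecified)
        return [x for x in self.members if x not in self.ex_members]
```

	As written, a deleted element stays in Members, so the REQUIRES clause of insert means it can never be reinserted.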
	To implement this exactly (e.g. with replicated data), have to do
		locking, atomic commit, etc.
	For some apps we'd rather not pay that cost, so weaken the spec:
-------------------
 DISTRIBUTED (weaker) SPECIFICATION

list = proc() returns(sequence[element])
   EFFECT:   return a subset of Members
---------------
	Problem: this doesn't convey enough information; e.g., an
		implementation that always returns the empty sequence satisfies it.
------------------
 RECOMMENDED FORMAT

list = proc() returns(sequence[element])
 NORMAL EFFECT: return Members - ExMembers
 ABNORMAL EFFECT: return subset of Members
--------------
	Normal effect is the same as in the logically centralized spec
	abnormal effect allows for distribution, but doesn't say how
		it is implemented.
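	A sketch of how the recommended format admits a cheap implementation: a hypothetical lazily replicated dictionary in which updates reach other replicas only when gossip() runs, plus a classify helper that checks a list() result against the normal and abnormal effects. The Members and ExMembers passed to classify are the abstract (logically centralized) state; REQUIRES checks are omitted.

```python
class LazyReplica:
    """One copy of the dictionary's state; may lag behind the others."""

    def __init__(self):
        self.members, self.ex_members = set(), set()

    def list(self):
        return [x for x in self.members if x not in self.ex_members]


class ReplicatedDictionary:
    """Updates go to one replica and reach the others only when gossip()
    runs, so list() at a lagging replica can return a stale answer."""

    def __init__(self, n=2):
        self.replicas = [LazyReplica() for _ in range(n)]

    def insert(self, x, at=0):
        self.replicas[at].members.add(x)

    def delete(self, x, at=0):
        self.replicas[at].ex_members.add(x)

    def gossip(self):
        # Propagate everything any replica knows to all replicas.
        all_m = set().union(*(r.members for r in self.replicas))
        all_x = set().union(*(r.ex_members for r in self.replicas))
        for r in self.replicas:
            r.members, r.ex_members = set(all_m), set(all_x)

    def list(self, at=0):
        return self.replicas[at].list()


def classify(result, members, ex_members):
    """Check a list() result against the normal/abnormal spec."""
    result = set(result)
    if result == members - ex_members:
        return "normal"
    if result <= members:
        return "abnormal"
    return "violates spec"
```

	A stale replica gives abnormal but legal answers, and so does an empty result, which is why the normal effect has to be stated as well.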

	Starting with logically centralized spec,
		add nondeterminism as desired for performance
		specs show the user's point of view, and help evaluate whether
			the nondeterminism is acceptable
*** banking system (p. 107)
	Q: what would the logically centralized system look like?
	Q: how is it weakened?
	Q: is the weakening acceptable?

* what is expected in their designs?
	language must expose some machine/link failures to programmers
		while the program runs (SR doesn't)
	would like it to be fairly high-level (low-level timeouts handled
		by the language)
	explicit support for atomic, serializable, recoverable programs
	has to use SR syntax as base, extensions/deletions permitted
		but extensions should resemble SR syntax
	hints: may want to add atomic transactions (or maybe SR has enough?)
		may want to add exception handling
		may want to add stable storage (or maybe SR can do that?)
		take out some of the low-level stuff to compensate?