Introduction to distributed system design

SekFook
4 min read · Aug 16, 2022

Introduction

Distributed systems are one of the hardest things in computer science. In this article, we briefly go through the core concepts in system design and some common design patterns.

Scaling

Load balancer: lets you distribute requests across a group of servers. You can add more servers during peak hours to handle all the requests.
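For illustration, here is a minimal round-robin load balancer sketch. The server addresses are made up, and real load balancers (nginx, HAProxy, cloud load balancers) do far more than this:

```python
# A minimal round-robin load balancer sketch; backend addresses are made up.
import itertools

class RoundRobinBalancer:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self) -> str:
        """Return the next server to receive a request."""
        return next(self._cycle)

balancer = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
print([balancer.pick() for _ in range(4)])  # requests spread evenly across servers
```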

Database: replicate the database across multiple instances with master-slave replication. The master handles write requests while the slaves handle read requests.

Cache: resides between the application and database layers. Data is cached in Redis or Memcached for fast access by the client.

Latency: the time taken to perform an action

Throughput: the number of actions completed per unit of time

CAP

These are three important guarantees in a distributed system, but we can only support two of them at any given time:

  1. Consistency — every read will receive the most recent write result
  2. Availability — every request would get a response.
  3. Partition/network tolerance — the system continues to operate despite arbitrary partitioning due to network failures

We always need partition tolerance since the network is not reliable, so the real trade-off is between consistency and availability.

CP — consistency and partition tolerance

A request to a partitioned node might result in a timeout error; the system prefers returning an error over returning stale data.

AP — availability and partition tolerance

Each request will always get a response with whatever data is available on the node, which is not guaranteed to be the latest.

Consistency Pattern

  1. Weak consistency: after a write, reads may or may not see it. Use case: video chat, multiplayer games, where a few seconds of lost data are acceptable.
  2. Eventual consistency: after a write, reads will eventually see it, as data is replicated asynchronously. Use case: DNS, email.
  3. Strong consistency: after a write, reads will see it; data is replicated synchronously.

Availability Pattern

Active-passive failover: heartbeats are sent between the active server and the passive server (standby). If no heartbeat is received, the passive server takes over from the active server.

Active-active failover: both servers manage traffic, spreading the load between them.

Service Discovery

Etcd, ZooKeeper, or other systems can help services find each other by keeping track of registered names, addresses, and ports. Health checks can be done to verify service status.
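To make this concrete, here is a minimal in-memory registry sketch with a TTL-based health check. The class, service names, and endpoints are made up for illustration; this is not the etcd or ZooKeeper API.

```python
# A minimal in-memory service registry sketch (a stand-in for etcd/ZooKeeper).
import time

class ServiceRegistry:
    def __init__(self, ttl_seconds: float = 10.0):
        self.ttl = ttl_seconds
        self.services = {}  # name -> (address, port, last_heartbeat)

    def register(self, name: str, address: str, port: int):
        self.services[name] = (address, port, time.monotonic())

    def heartbeat(self, name: str):
        addr, port, _ = self.services[name]
        self.services[name] = (addr, port, time.monotonic())

    def lookup(self, name: str):
        """Return the endpoint only if the service looks healthy."""
        addr, port, seen = self.services[name]
        if time.monotonic() - seen > self.ttl:
            raise LookupError(f"{name} failed its health check")
        return addr, port

registry = ServiceRegistry()
registry.register("payments", "10.0.0.5", 8080)
print(registry.lookup("payments"))  # ('10.0.0.5', 8080)
```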

Database

  • Atomicity: a transaction is all or nothing (see the sketch after this list)
  • Consistency: a transaction brings the database from one valid state to another
  • Isolation: transactions executed concurrently do not affect each other's results
  • Durability: a committed transaction remains committed
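As a rough illustration of atomicity, here is a small sketch using Python's built-in sqlite3 module; the accounts table and amounts are made up for the demo.

```python
# A minimal sketch of atomicity: either both account updates are committed,
# or neither is.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
except sqlite3.Error:
    pass  # the rollback already restored the previous valid state

print(dict(conn.execute("SELECT name, balance FROM accounts")))
```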

Federation

Functional partitioning that splits up the database by function (for example, separate databases for users, products, and forums), so each database receives fewer read and write requests. Throughput can be improved as you can write to the databases in parallel.

Sharding

Sharding distributes data across different databases such that each database only manages a subset of the data. The advantages are similar to federation. A hash-based routing sketch is shown below.
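A minimal sketch of hash-based shard routing, assuming hypothetical shard URLs and a user ID as the shard key:

```python
# A minimal hash-based shard routing sketch; the shard URLs are made up.
import hashlib

SHARDS = [
    "postgres://db-shard-0.internal",
    "postgres://db-shard-1.internal",
    "postgres://db-shard-2.internal",
]

def shard_for(user_id: str) -> str:
    """Map a user ID to one shard deterministically."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user-42"))  # the same user always routes to the same shard
```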

Cache

Write-through

The application uses the cache as the main data store, reading and writing data to it, while the cache synchronously writes the data through to the database.

The write process is relatively slow, but subsequent reads are fast. Users are generally more tolerant of latency when updating data than when reading data.
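A minimal write-through sketch, using plain dicts to stand in for the cache and the database (in practice these would be something like Redis and a SQL database):

```python
# A minimal write-through cache sketch.
class WriteThroughCache:
    def __init__(self, database: dict):
        self.cache = {}
        self.db = database

    def set(self, key, value):
        self.cache[key] = value  # update the cache...
        self.db[key] = value     # ...and synchronously write through to the database

    def get(self, key):
        if key in self.cache:    # fast path: served from the cache
            return self.cache[key]
        value = self.db[key]     # miss: load from the database
        self.cache[key] = value
        return value

db = {}
store = WriteThroughCache(db)
store.set("user:1", {"name": "alice"})
print(store.get("user:1"), db)
```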

Write behind

Write-behind adds/updates the entry in the cache and asynchronously writes the entry to the data store, improving write performance.

user writes → cache → event added to queue → event processor handles the event → upsert into the database

Data could be lost if the cache goes down before the entry is written to the data store.
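A minimal write-behind sketch, with a background thread standing in for the event processor; the queue and worker here are illustrative, not a production design:

```python
# A minimal write-behind sketch: writes hit the cache immediately while a
# background worker drains a queue into the "database" asynchronously.
import queue
import threading

cache, db = {}, {}
events = queue.Queue()

def writer():
    """Event processor: upserts queued entries into the database."""
    while True:
        key, value = events.get()
        db[key] = value
        events.task_done()

threading.Thread(target=writer, daemon=True).start()

def put(key, value):
    cache[key] = value        # write to the cache right away
    events.put((key, value))  # persist later, off the request path

put("user:1", {"name": "alice"})
events.join()                 # wait for the queue to drain (demo only)
print(cache, db)
```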

Pattern

1. Write Ahead Log

A process might crash at any time due to hardware or software faults. If a process is responsible for data storage, it must at least be designed to provide a durability guarantee for the data stored on the server.

A write-ahead log stores each state change as a command in an append-only file on disk, and appending to a file is very fast. If the server crashes, the log can be replayed to rebuild the in-memory state.
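A minimal write-ahead-log sketch; the file name and command format are made up for the demo:

```python
# A minimal write-ahead log sketch: each state change is appended to a log
# file before being applied, and the log is replayed on startup to rebuild
# the in-memory state.
import json
import os

LOG_FILE = "wal.log"

def apply(state: dict, command: dict):
    state[command["key"]] = command["value"]

def write(state: dict, key, value):
    command = {"key": key, "value": value}
    with open(LOG_FILE, "a") as log:
        log.write(json.dumps(command) + "\n")
        log.flush()
        os.fsync(log.fileno())  # make the entry durable before applying it
    apply(state, command)

def recover() -> dict:
    """Replay the log to rebuild the in-memory state after a crash."""
    state = {}
    if os.path.exists(LOG_FILE):
        with open(LOG_FILE) as log:
            for line in log:
                apply(state, json.loads(line))
    return state

state = recover()
write(state, "x", 1)
write(state, "x", 2)
print(recover())  # {'x': 2}
```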

2. Heartbeat

In a network, there is no upper bound on the delay in transmitting messages. Besides, a network partition might happen, leaving some sets of servers unable to communicate with each other.

These scenarios create two problems:

  1. A server can't wait indefinitely to know whether another server has crashed.
  2. We don't want two sets of servers, each considering the other set to have failed, continuing to serve different sets of clients. This is called a split brain.

Heartbeat: a pattern in which each server sends a heartbeat message to the other servers at regular intervals. If a heartbeat is missed, the server that was supposed to send it is considered crashed. This handles the first problem; a small failure-detection sketch follows.
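A minimal failure-detection sketch based on heartbeats; the timeout value and server names are made up:

```python
# A minimal heartbeat-detection sketch: a monitor marks a server as failed
# if no heartbeat arrives within a timeout.
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds without a heartbeat before declaring failure

class FailureDetector:
    def __init__(self):
        self.last_seen = {}

    def record_heartbeat(self, server_id: str):
        self.last_seen[server_id] = time.monotonic()

    def failed_servers(self):
        now = time.monotonic()
        return [s for s, t in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]

detector = FailureDetector()
detector.record_heartbeat("server-1")
detector.record_heartbeat("server-2")
print(detector.failed_servers())  # [] while heartbeats are recent
```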

3. Quorum

In the previous section, we mentioned the split brain; a quorum helps us take care of it. Every action taken by the servers is only considered successful when a majority of the servers agree. If the servers cannot get a majority, the request is not processed. A sketch is shown below.
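A minimal quorum-write sketch; replica responses are simulated with random failures purely for illustration:

```python
# A minimal quorum sketch: a write only succeeds when a majority of replicas
# acknowledge it.
import random

REPLICAS = ["node-1", "node-2", "node-3", "node-4", "node-5"]

def send_write(replica: str, key, value) -> bool:
    """Pretend to send the write; a partitioned replica fails to respond."""
    return random.random() > 0.3  # ~70% chance each replica acknowledges

def quorum_write(key, value) -> bool:
    acks = sum(send_write(r, key, value) for r in REPLICAS)
    majority = len(REPLICAS) // 2 + 1
    return acks >= majority  # only successful if a majority agree

print("write accepted" if quorum_write("x", 1) else "write rejected")
```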

4. Leader and followers

A quorum is a good solution, but unfortunately it doesn't by itself provide consistency: imagine there is a recent write, and when the client does a read operation, the latest value might not be returned because the data is not yet available on the server being read.

The leader server controls and coordinates the replication of data to the followers. It maintains a high-water mark: an index into the write-ahead log showing the last successfully replicated entry. Clients can only read log entries up to the high-water mark, because entries beyond it have no confirmation of being replicated. The leader also propagates the high-water mark to the followers, so if the leader fails there are no inconsistencies. A small sketch is shown below.
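A minimal high-water-mark sketch, assuming a leader that tracks how far each follower has replicated; the names and structure are illustrative, not any particular protocol's implementation:

```python
# A minimal high-water mark sketch: the leader appends entries to its log,
# tracks how far each follower has replicated, and only exposes entries up
# to the high-water mark (the index replicated on a majority of followers).
class Leader:
    def __init__(self, followers):
        self.log = []
        self.replicated = {f: 0 for f in followers}  # entries acked per follower
        self.high_water_mark = 0

    def append(self, entry):
        self.log.append(entry)

    def ack(self, follower, index):
        """A follower confirms it has replicated the log up to `index`."""
        self.replicated[follower] = index
        acks = sorted(self.replicated.values(), reverse=True)
        majority = len(self.replicated) // 2 + 1
        self.high_water_mark = acks[majority - 1]  # highest index on a majority

    def readable_entries(self):
        return self.log[:self.high_water_mark]  # clients never see unconfirmed entries

leader = Leader(["f1", "f2", "f3"])
leader.append("x=1")
leader.append("x=2")
leader.ack("f1", 2)
leader.ack("f2", 1)
print(leader.readable_entries())  # ['x=1'], only the majority-replicated prefix
```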

ref: https://martinfowler.com/articles/patterns-of-distributed-systems/

ref: https://github.com/donnemartin/system-design-primer#system-design-topics-start-here
