03-System Design

#System-Design #LLD-HLD

# Topics covered
* Performance
  * Latency, Throughput, Bandwidth, Response Time
* Consistency, Availability, and Partition Tolerance (CAP)
* CAP Theorem
* Failure & Fault Tolerance

Performance vs scalability

Performance refers to the responsiveness and speed of a system. It measures how quickly a system can execute its intended function within given time constraints.

Scalability refers to the capability of a system to maintain its performance as the load increases or decreases, typically by adding or removing resources.

A scalable system can handle growing demand and increasing load without a significant impact on performance.

Another way to look at performance vs scalability:

If you have a performance problem, your system is slow for a single user.
If you have a scalability problem, your system is fast for a single user but slow under heavy load.

# Performance Metrics to evaluate a system
* Throughput
* Bandwidth
* Latency
* Response Time

Latency vs Response Time

Latency is the time it takes for data to pass from one point on a network to another, i.e. the time spent on the network.

$\text{Latency} = \text{Time Spent on the Network}$

Response time refers to the total time it takes for a system to respond to a request, including the time spent on processing the request. It includes both the latency and the processing time.

$\text{Response Time} = \text{Latency} + \text{Processing Time}$
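As a worked instance of the formula above, here is a minimal Python sketch. The function name `response_time_ms` and the sample numbers are illustrative assumptions, not measurements from a real system.

```python
def response_time_ms(latency_ms, processing_ms):
    """Response time = latency + processing time (all values in milliseconds)."""
    return latency_ms + processing_ms


# Hypothetical request: 40 ms spent on the network, 110 ms of server-side processing.
total = response_time_ms(latency_ms=40, processing_ms=110)
print(total)  # 150
```

With these numbers the network accounts for 40 of the 150 ms, so optimising server code alone cannot bring the response time below 40 ms.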

Latency vs throughput

Throughput is the number of actions performed per unit of time, i.e. the work done per unit of time.

$\text{Throughput} = \frac{\text{Load}}{\text{Time Taken}} = \frac{\text{Work Done}}{\text{Time Taken}}$

Generally, you should aim for maximal throughput with acceptable latency.
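The throughput formula can be sketched in a few lines of Python. The function name `throughput` and the request counts are hypothetical values chosen for illustration.

```python
def throughput(work_done, time_taken_s):
    """Throughput = work done / time taken (here: requests per second)."""
    if time_taken_s <= 0:
        raise ValueError("time_taken_s must be positive")
    return work_done / time_taken_s


# Hypothetical load test: 1200 requests completed in 60 seconds.
rps = throughput(work_done=1200, time_taken_s=60)
print(rps)  # 20.0
```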

Bandwidth vs Throughput

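Bandwidth is the maximum data capacity of a network, or how much data can potentially travel from one point to another in a given time.

Throughput is how much data actually gets through in that time, so it can never exceed the bandwidth.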
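To make the difference concrete, the sketch below compares a measured throughput with the bandwidth of a link. The helper names and the figures (a 100 Mbps link, 500 MB transferred in 50 s) are assumptions for illustration only.

```python
def measured_throughput_mbps(bytes_transferred, seconds):
    """Actual data rate in megabits per second (1 Mb = 1,000,000 bits)."""
    return bytes_transferred * 8 / 1_000_000 / seconds


def utilization(throughput_mbps, bandwidth_mbps):
    """Fraction of the link's capacity that is actually used."""
    return throughput_mbps / bandwidth_mbps


# Hypothetical transfer: 500 MB moved in 50 s over a 100 Mbps link.
bandwidth_mbps = 100
tp = measured_throughput_mbps(bytes_transferred=500_000_000, seconds=50)
used = utilization(tp, bandwidth_mbps)
print(tp, used)
```

Here the throughput works out to 80 Mbps against a capacity of 100 Mbps, i.e. the link is 80% utilised.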

Performance Metrics of components

Application
- API response time
- API throughput
- Error rate

Database
- Time taken by various database queries
- Number of queries executed per unit of time (throughput)

Cache
- Latency of writing to the cache
- Number of cache evictions and invalidations
- Memory usage of the cache instance

Message Queues
- Rate of production and consumption
- Fraction of stale or unprocessed messages

Workers
- Time taken for job completion
- Resources used in processing
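As an illustration of the application-level metrics above, the following sketch derives response time, error rate, and throughput from a small request log. The log entries, the 5-second window, and the choice to count only 5xx responses as errors are assumptions made for this example.

```python
import statistics

# Hypothetical request log: (response time in ms, HTTP status code).
request_log = [
    (120, 200), (95, 200), (310, 500), (88, 200), (101, 200),
    (450, 200), (99, 404), (105, 200), (92, 200), (130, 200),
]
window_s = 5  # assumed length of the observation window, in seconds

times = [t for t, _ in request_log]
# Assumption: only server-side failures (5xx) count as errors.
errors = sum(1 for _, status in request_log if status >= 500)

summary = {
    "mean_ms": statistics.mean(times),
    "median_ms": statistics.median(times),
    "max_ms": max(times),
    "error_rate": errors / len(request_log),
    "throughput_rps": len(request_log) / window_s,
}
print(summary)
```

The mean is pulled up by two slow requests, which is why the median or a high percentile is often reported alongside it.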

Performance management tools: New Relic, Datadog, SolarWinds

Consistency, Availability, and Partition Tolerance

https://github.com/karanpratapsingh/system-design#cap-theorem

Consistency

Every read receives the most recent write or an error.

Consistency means that all clients see the same data at the same time, no matter which node they connect to. For this to happen, whenever data is written to one node, it must be instantly forwarded or replicated across all the nodes in the system before the write is deemed “successful”.
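The "replicate to every node before reporting success" rule can be sketched as follows. `Node` and `consistent_write` are hypothetical names, and checking reachability up front is a simplification; real systems use commit protocols to get the same all-or-nothing effect.

```python
class Node:
    """A toy in-memory replica."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.reachable = True

    def apply(self, key, value):
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        self.data[key] = value


def consistent_write(nodes, key, value):
    """Report success only if every node accepts the write."""
    # Simplification: check all nodes first so a failed write updates none of them.
    if not all(node.reachable for node in nodes):
        return False
    for node in nodes:
        node.apply(key, value)
    return True


nodes = [Node("a"), Node("b"), Node("c")]
first = consistent_write(nodes, "x", 1)   # all nodes reachable

nodes[2].reachable = False                # node "c" becomes unreachable
second = consistent_write(nodes, "x", 2)  # rejected, no node is updated
print(first, second, [n.data for n in nodes])
```

The second write is refused rather than applied to only two nodes, so every client still reads the same value whichever node it contacts.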

Availability

Every request receives a response, without guarantee that it contains the most recent version of the information.

Availability in a distributed system means that every request sent to a working node receives a response, even when other nodes are down or unreachable.

Partition tolerance

The system continues to operate even if messages between nodes are dropped or delayed, for example because of a link failure.

A system that is partition-tolerant can sustain any amount of network failure that doesn’t result in a failure of the entire network. Data is sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages.

CAP Theorem

https://www.educative.io/answers/what-is-the-cap-theorem

Consistency, Availability, and Partition Tolerance (CAP) is a concept in distributed computing that describes the trade-offs between three desirable properties of a distributed system.

The CAP theorem states that it is impossible for a distributed system to simultaneously provide all three guarantees.

The CAP theorem (also called Brewer’s theorem) states that a distributed database system can only guarantee two out of these three characteristics: Consistency, Availability, and Partition Tolerance.

Consistency-Availability Tradeoff

We live in a physical world and can’t guarantee the stability of a network, so distributed databases must choose Partition Tolerance (P). This implies a tradeoff between Consistency (C) and Availability (A).

A CA database delivers consistency and availability across all nodes. It can’t do this if there is a partition between any two nodes in the system, and therefore can’t deliver fault tolerance.

Example: PostgreSQL, MariaDB.

An AP database delivers availability and partition tolerance at the expense of consistency. When a partition occurs, all nodes remain available, but those at the wrong end of the partition might return an older version of the data than others. When the partition is resolved, an AP database typically re-syncs the nodes to repair all inconsistencies in the system.

Example: Apache Cassandra, CouchDB.
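The AP behaviour described above (stale reads during a partition, repair afterwards) can be sketched with two toy replicas. The names `Replica`, `replicate`, and `resync` are hypothetical, and "highest version wins" is an assumed repair policy, not how any particular database is implemented.

```python
class Replica:
    """A toy replica holding one value and a version counter."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.version = 0

    def write(self, value, version):
        self.value, self.version = value, version

    def read(self):
        # Always answers, even if the value may be stale (availability).
        return self.value


def replicate(source, targets, partitioned):
    """Copy the source's state to every target not cut off by the partition."""
    for target in targets:
        if target.name not in partitioned:
            target.write(source.value, source.version)


def resync(replicas):
    """Repair after the partition heals: highest version wins (assumed policy)."""
    newest = max(replicas, key=lambda r: r.version)
    for r in replicas:
        r.write(newest.value, newest.version)


a, b = Replica("a"), Replica("b")
a.write("v1", 1)
replicate(a, [b], partitioned=set())   # healthy network: b receives v1

a.write("v2", 2)
replicate(a, [b], partitioned={"b"})   # partition: b misses v2
stale = b.read()                       # b still answers, with the old value

resync([a, b])                         # partition resolved
fresh = b.read()
print(stale, fresh)
```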

Failure & Fault Tolerance

1. Understanding types of faults
2. Tolerating faults - continue operating without interruption
3. Making the system fail-safe

Example 1

Fault: Out of memory (hardware unable to handle a heavy load)
Tolerance: System scaling

Example 2

Fault: Hardware failure
Tolerance: System replication

Example 3

Fault: Bug in the code
Tolerance: Friendly error message in the front end (FE)

Hardware fault tolerance is typically achieved through replication.
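Examples 2 and 3 can be combined in one small sketch: reads fail over to another replica when one has a hardware fault, and the user sees a friendly message only if every replica fails. All names here (`read_with_failover`, the two replica functions, the message text) are hypothetical.

```python
FALLBACK_MESSAGE = "Service is temporarily unavailable. Please try again later."


def failed_replica(key):
    """Simulates a replica whose hardware has failed."""
    raise ConnectionError("disk failure")


def healthy_replica(key):
    """Simulates a working replica backed by an in-memory store."""
    store = {"user:1": "Alice"}
    return store[key]


def read_with_failover(replicas, key):
    """Try each replica in order and return the first successful read."""
    for replica in replicas:
        try:
            return replica(key)
        except ConnectionError:
            continue  # fault on this replica: move on to the next one
    # Every replica failed: degrade gracefully instead of crashing.
    return FALLBACK_MESSAGE


ok = read_with_failover([failed_replica, healthy_replica], "user:1")
degraded = read_with_failover([failed_replica, failed_replica], "user:1")
print(ok)
print(degraded)
```

In the first call the fault is tolerated without the caller noticing. In the second, the system fails safe by returning a readable message rather than an unhandled error.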
