# Topic covered
* Database partitioning
* Vertical partitioning (aka Normalisation)
* Horizontal partitioning (aka Database Sharding)
* Hashing
* Consistent hashing
Database partitioning
Database partitioning is the backbone of modern DBMS
(distributed database management systems).
It is a process of dividing a large dataset into several small partitions
placed on different machines.
In other words, it is a way of partitioning data like tables or index-organized tables into smaller pieces so that data can be easily accessed and managed.
https://www.enjoyalgorithms.com/blog/data-partitioning-system-design-concept
https://github.com/karanpratapsingh/system-design#sharding
Why database partitioning?
With the growth in services and user base, it becomes tricky for a single server or database
to function efficiently.
We may experience lower performance
with the architecture of a single database server.
Here is some situation that could arise:
- Database operations become
slower
. - Network
bandwidth
starts reaching thesaturation
level. - The database server starts running
out of disk space
at some point.
DB partitioning types
.
- Vertical partitioning (aka
Normalisation
) - Horizontal partitioning (aka
Database Sharding
)
Vertical partitioning(aka Normalisation)
In vertical partitioning, we partition the data vertically based on columns
.
We divide the tables into relatively smaller tables with fewer elements, and each part is present in a separate partition.
It is also referred to as Normalisation
Usage
An ideal scenario for this type of partition is when you don’t need all the information
about the customer in your query.
Can handle the critical part of data differently
from the not so crucial part of our data.
For example, we can store sensitive data like passwords, salary information, etc., in a separate partition to restrict access or provide additional security controls.
Horizontal partitioning (aka Database Sharding)
In horizontal partitioning we split the table data horizontally based on rows
and have the same number of columns.
Partision key
is used to decide which row goes on which table.
In below table ID is used a partision key
Advantages
Scalability: Proves to increase scalability by distributing the data across multiple partitions.
Query Performance: Improves the performance of the system.
Instead of querying the whole database, now the system has to query only a smaller partition
.
Reliability and accessibility If any shard fails, the majority of the system remains accessible.
Security: Helps improve the system’s security by storing sensitive and non-sensitive
data in different partitions.
This could provide better manageability and security to sensitive data.
Data Manageability: Divides tables and indexes into smaller and more manageable units.
Disadvantages
Complexity: Sharding increases the complexity of the system
in general.
Joins across shards: Each shards runs on different server. Some cross-shard queries can be very expensive or just not possible
.
Undo sharding Once sharding is done it’s very hard if not impossible on some system to undo sharding.
Rebalancing: If the data distribution is not uniform
or there is a lot of load on a single shard, in such cases,
we have to rebalance our shards so that the requests are as equally distributed among the shards as possible.
When to use sharding?
Here are some reasons why sharding might be the right choice:
Leveraging existing hardware
instead of high-end machines.- Maintain data in
distinct geographic
regions. Quickly scale
by adding more shards.Better performance
as each machine is under less load.- When more concurrent connections are required.
.
Key or hash-based partitioning
.
This strategy divides the rows into different partitions based on a hashing algorithm
rather than grouping database rows based on continuous indexes.
Hash value determines
which database partition to use.
So it should be calculated based on static key(like user_id)
, that does not change.
For example:
Suppose we have 4 database
partitions, and each request contained an application id.
So we can simply perform a modulo operation
on the application id with 4
and take the remainder to determine the partition to place the data.
Cons:
Increase or decrease of db partition
is an issue.
The keys need to be remapped and migrated
to a new partition, and the hash function will need to be changed.
Range based sharding
.
In Range partitioning, we partition the table in such a way that each partition contains rows within a given range
defined by the partition key.
Eg:Use partitioning key from 01-Jan-2022 to 31-Jan-2022
Pros
Increase or decrease of db partition does not have any issue
Cons:
Hotspot - caused by uneven distribution
of Data
Directory based sharding
Directory-based sharding utilizes a lookup table
that keeps track of which shard holds what data.
In other words, it specifies a one-to-one mapping of the data with the shard that it is stored in.
Consider the example below. A column(ID
) from the original table is selected as shard key.
Each shard key is then given a specific shard ID
, which tells which shard has the data with its corresponding shard key.
In this way, the rows of all the rows in the original table are divided into different shards.
Hashing
Hashing is a process of mapping data
of arbitrary size (or keys) to fixed-size values (hash codes or hash values) using a hash function
.
The primary purpose of hashing is to quickly locate a data record
given its search key.
Hash functions are designed to produce a relatively uniform distribution of hash codes for different inputs.
.
Consistent hashing
Consistent Hashing is a specific technique used in distributed computing to distribute data across multiple nodes
while minimizing the impact of adding or removing nodes from the system
.
It’s designed to address some of the challenges of traditional hashing when used in distributed systems.