
The Math Behind LinkedIn’s Real-Time Social Network


Ever wondered how LinkedIn instantly knows whether someone is your 2nd- or 3rd-degree connection, in a graph with millions of members and billions of connections? Spoiler: it’s not magic, it’s math.

In this blog, we’ll dive deep into the math and algorithms that make this possible in real time.

Member Representation

When you create an account on LinkedIn, you are represented as a vertex in LinkedIn’s social graph. Each connection you add is represented as an edge.


Social networks require the ability to perform various graph computations, such as fetching a member’s connections, checking whether two members are connected, and computing the graph distance between two members.

The first two operations are pretty cheap, but computing graph distance at LinkedIn’s scale is where things get tricky.

Why Naive BFS Fails

Social graphs are huge, with hundreds of millions of vertices and billions of edges. The graph is stored in a distributed fashion, split into partitions, with each node (a physical machine) holding a few partitions.

If you run a naive BFS on this graph, the system will likely blow up: every hop fans out across partitions, so a single query can touch millions of vertices and fire network calls to nearly every machine in the cluster.

Naive BFS Approach

Bi-Directional BFS + Caching

LinkedIn tackles this with a smart approach: precompute and cache each member’s second-degree connection set, and answer distance queries by searching from both ends at once. If B is among A’s direct connections, the distance is 1; if A’s and B’s connection sets overlap, it’s 2; if A’s cached second-degree set overlaps B’s connections, it’s 3.

Each GetDistance call becomes a simple set intersection problem, which is way faster.

Bi-directional BFS
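To make the intersection idea concrete, here is a minimal sketch. The dict-of-sets caches, member names, and the `get_distance` signature are illustrative assumptions, not LinkedIn’s actual API; their real caches live in a distributed tier.

```python
# Hedged sketch: graph distance via set intersections, assuming
# hypothetical in-memory dict-of-sets caches.

def get_distance(a, b, first_degree, second_degree):
    """Return the graph distance between members a and b (up to 3).

    first_degree[m]  -- set of m's direct connections
    second_degree[m] -- cached set of m's 2nd-degree members
    """
    if a == b:
        return 0
    if b in first_degree[a]:
        return 1
    if first_degree[a] & first_degree[b]:   # shared direct connection
        return 2
    if second_degree[a] & first_degree[b]:  # the two searches meet in the middle
        return 3
    return -1  # farther than 3rd degree

# Toy graph: A-B, A-C, B-D, D-E
first = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"},
         "D": {"B", "E"}, "E": {"D"}}
second = {"A": {"D"}}  # A's 2nd degree: D (reached via B)
```

Each branch is a single set intersection over cached data, so no graph traversal happens at query time.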

But here’s the catch: cache misses. When a member’s second-degree connections aren’t cached, we have to compute them on the fly.

The Real-Time Challenge

To build a member’s 2nd-degree array on the fly, we have to fetch their 1st-degree connections, ask the GraphDB nodes holding those members’ partitions for their connection lists, and finally merge everything into a single deduplicated set.

Since social graphs are highly interconnected, we get tons of duplicates. Doing all the merging at the caching layer is expensive and slow.
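The naive caching-layer merge can be sketched as follows. The dict-of-sets stand-in for the GraphDB and the function name are hypothetical, for illustration only.

```python
# Naive on-the-fly build of a member's 2nd-degree set, with all merging
# done at the caching layer (hypothetical dict-of-sets GraphDB stand-in).

def build_second_degree(member, first_degree):
    merged = set()
    for conn in first_degree[member]:
        merged |= first_degree[conn]   # duplicates collapse in the union
    merged -= first_degree[member]     # drop 1st-degree connections
    merged.discard(member)             # and the member themselves
    return merged

# Toy graph: A-B, A-C, B-D, C-D
first = {"A": {"B", "C"}, "B": {"A", "D"},
         "C": {"A", "D"}, "D": {"B", "C"}}
```

Note how D arrives twice (via both B and C); at LinkedIn’s scale, that duplicate-heavy merge is exactly the cost we want to push down to the GraphDB nodes.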

Ideally, we want to push as much merging as possible down to the GraphDB nodes themselves, which requires finding the optimal set of GraphDB nodes that can serve our request.

So the problem boils down to selecting the optimal set of GraphDB nodes that covers all the required partitions for a member’s second-degree network. This is an instance of the set-cover problem.

Set-Cover in a Distributed Graph

Given a set of elements and a collection of subsets whose union equals that set, the classic set-cover problem is to find the minimum number of subsets whose union still equals the original set.

In our case, the set-cover problem is to find the minimum number of GraphDB nodes that cover all the partitions required to serve a member’s second-degree network.

Now, the classic set-cover problem is NP-hard, so LinkedIn uses a greedy approximation to make it work at scale: at each iteration, pick the subset that covers the most remaining uncovered elements.
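The greedy approximation can be sketched in a few lines. The node names and replica layout below are illustrative (chosen to match the worked example that follows), not LinkedIn’s actual topology.

```python
# Greedy set-cover sketch: repeatedly pick the node whose partition set
# covers the most still-uncovered required partitions.

def greedy_set_cover(required, node_partitions):
    uncovered = set(required)
    chosen = []
    while uncovered:
        best = max(node_partitions,
                   key=lambda n: len(node_partitions[n] & uncovered))
        if not node_partitions[best] & uncovered:
            raise ValueError("required partitions cannot be covered")
        chosen.append(best)
        uncovered -= node_partitions[best]
    return chosen

# Illustrative layout: 6 partitions, 2 per node, 2 replicas.
nodes = {"R11": {1, 2}, "R12": {3, 4}, "R13": {5, 6},
         "R21": {1, 4}, "R22": {2, 5}, "R23": {3, 6}}
cover = greedy_set_cover({1, 2, 3, 4, 6}, nodes)
```

With these sets, the greedy loop finds a cover of 3 nodes for the 5 required partitions.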

An example of how the greedy set-cover algorithm works:

Assume we have a distributed graph with 6 partitions. We store 2 partitions on each node, with 2 replicas of each partition.

Say a member’s second-degree connections lie on partitions {1, 2, 3, 4, 6}. Our sets (nodes) look like this:

(figure: the partition set held by each replica node; the sets used below are R11 = {1, 2}, R12 = {3, 4}, R23 = {3, 6})

Running our greedy set-cover algorithm:

  1. K = {1, 2, 3, 4, 6}. Pick R11, which covers {1, 2}. Now K = {3, 4, 6}.
  2. From the remaining sets, pick R23, which covers {3, 6}. Now K = {4}.
  3. Finally, pick R12, which covers the remaining {4}. Done.

We only touched 3 nodes instead of all 6, a ~50% fan-out reduction.

Optimizing Further

If you noticed, at each iteration we picked the best-coverage set by intersecting every node’s partition set with the set of required partitions, roughly O(l²) intersection operations in total (l being the number of partition sets). This adds up to significant latency as the number of connections grows.

We can optimize this by exploiting the fact that the partition sets within a replica are disjoint, so each partition maps to a small, known list of nodes. Instead of intersecting K with every node’s set, we only score the nodes that hold some still-uncovered partition, since one of them must appear in any cover.

C ← ∅
repeat
  pk ← a randomly selected partition from K
  nodes ← map[pk]  // the nodes whose replicas cover pk
  nodek ← the node in nodes not yet in C whose coverage Sk maximizes |K ∩ Sk|
  K ← K − Sk
  C ← C ∪ {nodek}
until K = ∅
return C

modified set cover algorithm

This dramatically cuts down the number of set intersections, saving precious milliseconds in real-time queries.
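The modified algorithm can be sketched like this. `partition_to_nodes` plays the role of `map[]` in the pseudocode above; the replica layout is illustrative, and we pick the smallest uncovered partition for determinism where the pseudocode picks one at random.

```python
# Modified set-cover sketch: instead of scoring every node, look up only
# the nodes that replicate one still-uncovered partition (one of those
# nodes must appear in any valid cover).

def modified_set_cover(required, node_partitions, partition_to_nodes):
    uncovered = set(required)
    chosen = []
    while uncovered:
        pk = min(uncovered)  # pseudocode picks a random uncovered partition
        candidates = [n for n in partition_to_nodes[pk] if n not in chosen]
        best = max(candidates,
                   key=lambda n: len(node_partitions[n] & uncovered))
        chosen.append(best)
        uncovered -= node_partitions[best]
    return chosen

# Illustrative layout: 6 partitions, 2 per node, 2 replicas.
nodes = {"R11": {1, 2}, "R12": {3, 4}, "R13": {5, 6},
         "R21": {1, 4}, "R22": {2, 5}, "R23": {3, 6}}
p2n = {p: [n for n, parts in nodes.items() if p in parts]
       for p in range(1, 7)}
cover = modified_set_cover({1, 2, 3, 4, 6}, nodes, p2n)
```

Each iteration now scores only the handful of replicas holding one partition, instead of all l nodes, while still returning a full cover.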

Results

LinkedIn’s modified set-cover approach produced massive results: less fan-out per query, less merging work at the caching layer, and graph distances served in real time.

Wrapping Up

Next time you see those little “1st”, “2nd”, or “3rd” labels above someone’s profile, remember the clever math and engineering going on behind the scenes. By framing second-degree queries as a set-cover problem, LinkedIn was able to cut down fan-out, reduce merges, and serve graph distances in real time.

That’s the beauty of math and engineering: invisible to the user, yet it keeps these massive systems running smoothly.

That tiny “2nd” label you see? It’s backed by one of the hardest problems in computer science. Wild, right?