DDL Ep 05: AI and Data — The LinkedIn Story

What does it take for a tech giant like LinkedIn to transform from a simple professional networking site into a data-driven, AI-powered career ecosystem?

In the fifth episode of our Decoding Data Leadership series, Shirshanka Das (Co-founder/CTO of DataHub) sits down with Kapil Surlaker (VP of Engineering, LinkedIn) to discuss LinkedIn’s engineering prowess in AI and data.

Their discussion offers a compelling case study of how a major tech company adapts to the rapidly evolving digital landscape. They break down how LinkedIn’s data strategy evolved from a centralized team focused on basic metrics and dashboards to a decentralized network of data specialists driving AI innovation.

This transformation showcases how LinkedIn stays at the cutting edge of innovation while tackling the challenges of maintaining quality and governance at scale, providing valuable insights for organizations navigating their own data and AI journeys.


From Observation to Action: The BI to AI Shift

In its early days, LinkedIn’s data strategy centered around understanding product performance and user interactions — a classic Business Intelligence (BI) approach. However, as the potential of data grew, so did LinkedIn’s ambitions. The company recognized that data could be more than just an observational tool; it could become an active ingredient in creating personalized, impactful user experiences.

This shift from BI to AI marked a pivotal moment in LinkedIn’s data story. It necessitated a complete overhaul of team structures and infrastructure. The once-centralized data team gave way to a decentralized model, where multiple teams work independently on domain-specific data and products. This new approach fostered greater agility and innovation, allowing LinkedIn to rapidly develop analytics and machine learning models and leverage cutting-edge technologies.

Navigating the Challenges of Decentralization with DataHub

However, with great power comes great responsibility. While the decentralized model promoted innovation, it also presented significant challenges in governance and quality control. Maintaining consistency and high standards across diverse teams became a critical focus. LinkedIn’s solution? A robust metadata management system and dedicated metrics platforms.

Enter DataHub, LinkedIn’s answer to the metadata challenge. What started as a basic search and discovery mechanism evolved into a sophisticated AI enabler, centralizing billions of data artifacts. DataHub now serves as the foundational substrate, connecting various components of LinkedIn’s data and AI systems. It ensures repeatability, traceability, and auditability of data workflows — critical factors in today’s data-driven, regulatory-conscious environment.

Lessons for the Data-Driven Enterprise

The evolution of LinkedIn’s data strategy offers valuable lessons for companies at any stage of their data and AI journey:

  • Focus on a Hierarchy of Data and AI Needs: Companies transitioning to AI-enabled environments should prioritize building a solid foundation of scalable infrastructure, robust metadata infrastructure, and data quality assurance. Skipping steps in this hierarchy will only hinder AI development.
  • Avoid the ‘Answering Machine’ Trap: To prevent data teams from being trapped in a cycle of providing answers, the focus should be on empowering users to self-serve through the right tools and metadata infrastructure. AI can augment tool experiences, making them more intuitive and self-improving over time.

As data continues to shape our digital experiences, the lessons from LinkedIn’s journey will undoubtedly prove valuable for organizations striving to stay ahead in the data revolution.

Check out the full recording on YouTube and read on for a slightly edited version of the conversation.👇

Shirshanka: Kapil, before we dive in, please tell us about your journey to data — before and at LinkedIn.

Kapil: I’ve been in the data and infrastructure domain for most of my career. I started with traditional database management systems back at Oracle. After a few years, I moved to several startups, specializing in data and database solutions. About 13 years ago, I joined LinkedIn — around the same time you joined — when it was experiencing a growth spurt and inflection point in traffic. That was a foundational phase where a lot of data infrastructure was being built from the ground up.

This included work on source-of-truth systems like Espresso, a low-latency, high QPS database. Then, I moved to offline batch processing infrastructure when we moved from traditional data warehouse approaches to a modern, distributed, open-source stack. As we went up the stack, we worked together on DataHub, which became LinkedIn’s central metadata platform.

More recently, I’ve focused on what we call our data and AI platforms. This includes the entire spectrum of the AI lifecycle — feature engineering, data preparation for model training, and model serving, which powers all the experiences on LinkedIn.

Shirshanka: Your journey at LinkedIn has been amazing, and I was fortunate to share part of it with you. Most people see Kapil as an engineering leader, but I had the unique opportunity to work with him as a technical brain. One of the things I admired most was his ability to simplify complex problems quickly. I recall LinkedIn’s early Espresso design decisions being rather convoluted, and Kapil was instrumental in streamlining them.

I find it fascinating to compare data teams across different organizations. Often, people hear about the technologies and solutions used, but what they really need to understand are the underlying reasons for their success.

So, tell us about the data team at LinkedIn. How is it organized, and how has its position within the organization evolved?

Kapil: The key point is understanding the business value and the company’s vision, which ties data back to your goal.

LinkedIn’s mission is to connect the world’s professionals and make them more productive and successful. And we do this through various products like people recommendations, feed updates, and job suggestions, all powered by data.

When we started, AI wasn’t as integrated into the process as it is now. The value of data was primarily in understanding product performance and member interactions — metrics, dashboards, and insights into what was working and what wasn’t. This led to a more centralized data team operating on a traditional data warehouse, initially an Oracle data warehouse, later evolving into more modern data solutions.

The shift happened when data transitioned from being merely observational to becoming an active ingredient in the product experience.

Think about logging into LinkedIn and seeing personalized recommendations. This personalization is driven by the collective activities on LinkedIn and the system’s ability to recommend relevant content. Multiply this by hundreds of engineers working on various projects using data to enhance user experiences.

Today, LinkedIn employs a decentralized approach. Multiple teams work on data related to their specific products and domains, building analytics, developing machine learning models, and leveraging the latest technologies. They all use a common infrastructure and platform layers, including databases like Espresso, big data storage technologies, and machine learning frameworks.

This decentralization allows for greater agility and innovation, aligning closely with the evolving role of data in delivering personalized, impactful user experiences.

Shirshanka: Got it. Data went from being something that told you about how the business was doing to actually becoming part of the products. So, this also changed the function of the data team from being a single unit that provided insights and dashboards to being integrated into product teams, making data a key part of their deliverables.

Kapil: Exactly. That dynamic has a profound impact on how those teams structure themselves, and it also impacts how the infrastructure and platform teams work and what they need to build.

You don’t have one place where a single team carefully curates all the data; instead, you have multiple teams producing and consuming their data in a connected layer. All these producers and consumers need to find data that others have built. How do you understand the semantics of data someone else has produced? How do you understand the contracts and the meaning of the data? To make an effective decentralized data system, the infrastructure and platform team has to build and operate with the right tools. This is where a technology like DataHub becomes extremely critical.

Shirshanka: Got it. What I find interesting is also this transition from the BI-centric view of data to a more AI-centric view of data.

Kapil: The other way to think about it is that data goes from feeding dashboards to becoming the oxygen that powers your products.

Shirshanka: Speaking of which, AI and data are a vast topic, but with your role — being in charge of both the data and the AI platforms — it’s an interesting aspect. Not many organizations have realized the importance of integrating these elements as early as LinkedIn has.

I’m curious — what facets of AI and data are you finding interesting, and how are you orienting them towards the next wave of LinkedIn’s data evolution?

Kapil: That’s a great question. As you mentioned, the space that data and AI occupy in an ecosystem like ours is vast. There is no space or experience at LinkedIn today that isn’t involved in AI. You can think of it as different buckets. One dominant pattern is our large-scale recommendation system, and the LinkedIn feed is a perfect example of that.

When you log in to LinkedIn and your feed pops up, it’s very personalized for you — every member on LinkedIn sees a different version of the feed. This personalization extends to other features like People You May Know, job recommendations, or even search. These are examples of large-scale recommendation systems. You’re starting from potentially billions of items, selecting a candidate pool from those billions, and narrowing it down to a few hundred items.

The monumental task is accomplishing this at scale. At the foundation, the scale of these problems is in the exabytes. That’s a ton of data to process.

When you think about AI, there are two phases. One is using data to train machine learning models, involving large-scale computing for feature computation, distributed training on hundreds of thousands of GPUs, and distributed storage.

The other phase is the inference part, which happens when you show up on the site. In milliseconds, we go from billions of items to a few hundred that are useful for you. This involves multiple factors to optimize your final list, such as the likelihood of you finding an item interesting, liking it, sharing it, or clicking it.

At a high level, all these recommendation systems work similarly. The first stage, candidate generation, selects a few thousand candidates from billions of items, optimizing for recall. It’s a classic needle-in-a-haystack problem. For the LinkedIn feed, there are traditional index-based lookups and newer machine-learning-based techniques like embedding-based retrieval. The next stage, L1 ranking, uses simpler models to cut that list down to about a thousand. The final stage, L2 ranking, uses deep neural networks or transformer-based models to go from a thousand to a few hundred items. Then, re-rankers optimize for fairness and other factors, ensuring a good-quality pipeline.
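The funnel Kapil describes can be sketched as a chain of progressively more expensive scoring stages. This is a minimal illustration only — the stage sizes, the `Item` shape, and the scoring formulas are hypothetical stand-ins, not LinkedIn’s actual models:

```python
from dataclasses import dataclass

@dataclass
class Item:
    item_id: int
    cheap_score: float   # index-derived / precomputed signal, cheap to read
    rich_score: float    # stand-in for an expensive model's score

def candidate_generation(corpus, k=3000):
    # Stage 0: recall-oriented retrieval (index lookups, embedding search).
    # Faked here with a cheap precomputed signal.
    return sorted(corpus, key=lambda i: i.cheap_score, reverse=True)[:k]

def l1_rank(candidates, k=1000):
    # Stage 1: a lightweight model scores thousands of items quickly.
    return sorted(candidates,
                  key=lambda i: 0.9 * i.cheap_score + 0.1 * i.rich_score,
                  reverse=True)[:k]

def l2_rank(candidates, k=200):
    # Stage 2: the expensive model (DNN / transformer) scores the short list.
    return sorted(candidates, key=lambda i: i.rich_score, reverse=True)[:k]

def rerank(items):
    # Final pass: business rules, diversity, fairness adjustments.
    return items  # no-op in this sketch

def recommend(corpus):
    return rerank(l2_rank(l1_rank(candidate_generation(corpus))))
```

The point of the structure is economic: each stage spends more compute per item on fewer items, so the expensive model only ever sees a short list.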

Shirshanka: What you just described seems incredibly complex. I’d love to look at the lineage of that whole endpoint. It’s probably twenty or thirty hops from the original datasets back into the product.

Kapil: Exactly. And the way you determine how good — loosely speaking — the quality of the algorithm, is through an online evaluation of the entire system.

We typically rely on A/B tests on our A/B testing platform T-Rex, and monitor metrics on how the old algorithms are doing versus the new ones. Then we ramp up the new algorithm — assuming we see the right behavior. Ultimately, the quality of that experience is the final judgment. That’s a loose way of describing a large-scale system.
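The ramp decision Kapil sketches — compare old versus new, ship only on the right behavior — can be illustrated with a standard two-proportion z-test on a click-through metric. T-Rex itself is internal to LinkedIn; the counts and the 1.96 significance threshold below are illustrative assumptions, not their actual methodology:

```python
import math

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Compare click-through rates of control (a) and treatment (b)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Pooled rate under the null hypothesis that both arms are identical.
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return p_a, p_b, z

# Hypothetical experiment: 100k sessions per arm.
p_a, p_b, z = two_proportion_ztest(clicks_a=5200, n_a=100_000,
                                   clicks_b=5500, n_b=100_000)
# Ramp the new algorithm only on a positive, significant lift (|z| > 1.96).
ship = (p_b > p_a) and abs(z) > 1.96
```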

Shirshanka: Right. And how does the quality of this end experience get impacted by the quality of the individual hops? How does that affect the quality of your A/B test analysis?

Kapil: You hit the nail on the head. Your entire chain is only as good as the weakest link, so the lineage of any number of hops becomes super important. There are different aspects of metadata that we talk about. Questions about the semantics of your data touch on one aspect while here we’re touching on metadata about your data pipelines — more active monitoring metadata.

You want to answer questions like: “When did this data arrive? What was the quality at each particular stage?” This leads to another important point: only determining quality at the end of a multi-stage pipeline. Of course, end-of-pipeline quality matters, but by then it’s too late to catch all the problems.

If there’s a possibility of catching problems early on, you’re better off than waiting until the end. If your data is unreliable to begin with, that affects every subsequent stage. Many hops later, you find something is off with your online A/B test. Think about the time spent debugging and understanding what happened. It’s crucial to insist on the right data quality at the first stage of your pipeline.

Data quality became top of mind for us.

We started looking into various aspects and asked, “What data quality do we want? Did we get 100% of the expected input data? Can you guarantee 99.9% completeness?”

Then you can do much more sophisticated data quality checks, like: “Does the column distribution in your dataset match what I’m expecting? Are all the values between zero and a hundred? This column should never have a null. Is there an implicit relationship between two or sometimes even multiple columns?”
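The checks listed here — completeness thresholds, non-null columns, range bounds, cross-column relationships — can all be expressed as simple assertions over a batch of records. A minimal sketch, where the column names (`member_id`, `score`, `start_ts`, `end_ts`) and the 99.9% threshold are illustrative, not LinkedIn’s actual schema:

```python
def check_batch(rows, expected_count):
    """Run basic data quality checks on a batch of dict rows."""
    issues = []

    # Completeness: did at least 99.9% of the expected input arrive?
    if len(rows) < 0.999 * expected_count:
        issues.append(f"completeness: got {len(rows)} of {expected_count} rows")

    for i, row in enumerate(rows):
        # Non-null constraint on a required column.
        if row.get("member_id") is None:
            issues.append(f"row {i}: member_id is null")
        # Range constraint: score must lie in [0, 100].
        score = row.get("score")
        if score is not None and not (0 <= score <= 100):
            issues.append(f"row {i}: score {score} out of range")
        # Cross-column relationship: end_ts must not precede start_ts.
        if row.get("end_ts", 0) < row.get("start_ts", 0):
            issues.append(f"row {i}: end_ts before start_ts")

    return issues
```

Distribution checks (does a column’s histogram match expectations?) follow the same pattern but compare aggregates over the whole batch rather than individual rows.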

This brings us to the concept of data contracts. You and DataHub have talked about this, and you’re right on the money. If I take a dependency on a dataset you produced, I need to understand its quality in terms of a contract we both agree on. If you tell me the size and shape of the data and that a column adheres to this contract, I know your quality needs, and we can extend that to the rest of the pipeline.
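One way to make such an agreement concrete is a small declarative contract that the producer publishes and the consumer validates incoming data against. This is a hedged sketch of the idea, not DataHub’s actual contract model — the field names and thresholds are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class ColumnContract:
    name: str
    dtype: type
    nullable: bool = True

@dataclass
class DataContract:
    dataset: str
    min_completeness: float            # e.g. 0.999 of expected rows
    columns: list = field(default_factory=list)

    def validate(self, rows, expected_count):
        """Check a batch of dict rows against the agreed contract."""
        violations = []
        if expected_count and len(rows) / expected_count < self.min_completeness:
            violations.append("completeness below agreed threshold")
        for col in self.columns:
            for row in rows:
                value = row.get(col.name)
                if value is None:
                    if not col.nullable:
                        violations.append(f"{col.name}: unexpected null")
                elif not isinstance(value, col.dtype):
                    violations.append(f"{col.name}: expected {col.dtype.__name__}")
        return violations
```

The value is in the agreement itself: the producer knows exactly which guarantees downstream consumers depend on, and the consumer can fail fast at the first hop instead of debugging a surprise many hops later.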

As the data goes through many hops and reaches model training and inference, it gets more complicated. You can’t simply assert strict rules; you need to consider the behavior itself. When looking at quality, this is a real top-of-mind issue.

Shirshanka: It’s almost like Maslow’s hierarchy of needs — the first step is having the infrastructure to serve the data at the required speeds and scale. Different companies may be at various points on this curve; LinkedIn, for instance, is quite advanced, but smaller companies will encounter similar challenges sooner or later and may need multiple systems to address them.

The second aspect is data quality. If data is vital to your operations, ensuring its quality becomes paramount. Managing quality is a multi-stage process, much like how humans organize the world with countries, states, and borders, ensuring each boundary can be independently governed.

Once you have the infrastructure to move swiftly and ensure reliability at each stage of the pipeline, other challenges may arise.

I’m curious — with the multitude of applications, such as the feed and jobs, and the constant stream of new projects at LinkedIn, maintaining this fast pace can raise concerns. Are there any issues you’ve encountered with this rapid pace of operations?

Kapil: At LinkedIn, you’re in a decentralized system where multiple people are independently building things very fast. So far, what we discussed allows them to build high-quality pipelines that are trusted and stamped for data quality. However, this doesn’t mean they are well-governed or consistent with each other.

Think about your metrics. In any mature data ecosystem, you need to measure things accurately because, as the saying goes, “you can’t fix what you don’t measure.” It’s common for multiple people to compute the same metric in parallel, leading to inconsistencies even if each one is high-quality, trusted, and controlled. Perhaps they’re computing slightly different things but calling them the same thing. So you have to get consistency and governance right across that broader ecosystem.

For metrics, we built a platform to enforce a consistent governance process and collect the right metadata. Then it becomes about higher-level data discovery. Can you discover the right metrics? Can you understand who owns them and how they are defined and computed?
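The core of such a platform is a single registry of metric definitions with named owners and one canonical computation, so two teams cannot silently compute “the same” metric differently. A hypothetical sketch — the registry API, metric name, and owner below are invented for illustration:

```python
class MetricRegistry:
    """Single source of truth for metric definitions and ownership."""

    def __init__(self):
        self._metrics = {}

    def register(self, name, owner, description, compute):
        # Governance: one name, one owner, one canonical computation.
        if name in self._metrics:
            raise ValueError(f"metric '{name}' already defined by "
                             f"{self._metrics[name]['owner']}")
        self._metrics[name] = {"owner": owner, "description": description,
                               "compute": compute}

    def compute(self, name, data):
        return self._metrics[name]["compute"](data)

    def describe(self, name):
        m = self._metrics[name]
        return f"{name} (owner: {m['owner']}): {m['description']}"

registry = MetricRegistry()
registry.register(
    "weekly_active_members", owner="growth-team",
    description="Distinct members with any activity in a 7-day window",
    compute=lambda events: len({e["member_id"] for e in events}),
)
```

A second team trying to register its own `weekly_active_members` is rejected and pointed at the existing owner — which is exactly the discovery-plus-governance behavior described above.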

Extending this approach, you realize the complexity of the data and AI system. What ultimately connects it all is your metadata, as the foundational substrate.

In AI practice, it goes deeper — because now you’re not just getting metadata about your individual data artifacts but also automatically collecting metadata about your machine learning features. You’re collecting metadata about your machine learning models, especially for repeatability. For example, when you’re training a bunch of machine learning models together, you need to tie together the metadata of those trained models.

As your ecosystem gets more mature and sophisticated, the foundational substrate of metadata connecting everything becomes even more important.

Shirshanka: We just talked about the evolution of data at LinkedIn, going from BI-centric to AI-centric and from centralized to decentralized.

It was interesting how at every step, metadata was seen as an enabler. Can you tell us how metadata evolved at LinkedIn, from the early days to now, along the same chronology?

Kapil: I think the evolution of the metadata story follows the evolution of the data story, because it doesn’t happen in a vacuum. I’m happy to tell it, although I feel this is something you probably know better than I do, since you built DataHub and worked on all these problems.

Going back to when we were in a more BI-centric world, your needs for metadata were much more centered around basic search and discovery. And as we observed back then, people were relying on either asking each other or on spreadsheets. So where do I go to find my data? I go to the spreadsheet. And when you’re small or early-stage, that may not be a bad place to start. But that’s why you built DataHub as the one place where you can discover billions of data artifacts.

Then you quickly get to the next phase as you build more and more of these pipelines. Then it’s not just about observing and discovering — you need lineage. And then it becomes about whether you can trust the data and how you can monitor it.

Shirshanka: I remember what we call the valley of disillusionment after our initial rollout of the search and discovery tool. People would come in, search, and sometimes find what they wanted. But when they didn’t, they walked away and never returned because they couldn’t trust it.

It wasn’t until we integrated operational signals and information from the metrics pipeline that data scientists and analysts felt confident. They could see their metrics, know when they were computed, see the current computation window, and even request a backfill. These features brought the next wave of acknowledgment.

Kapil: That’s an interesting insight because the system becomes more useful as it collects more metadata. With more signals, like detailed metadata about computation and different windows, it becomes even more useful. Add regulatory compliance to it and that metadata is now more useful.

The effect on data productivity is another layer. It’s a self-fulfilling cycle: as it gets better, people use it more, bringing in more metadata, which fuels growth.

This is the story of DataHub evolving from a basic search and discovery mechanism to a place for all metadata. You and Acryl have done a great job pushing it forward. It’s wonderful to see what we built at LinkedIn finding a life outside and being so widely adopted. At LinkedIn, we continue to centralize metadata to avoid silos, especially with the influx of metadata from the AI lifecycle. Centralizing that into DataHub is the goal, and there’s a lot of exciting stuff still coming.

Shirshanka: That’s great to hear. LinkedIn is a special company with a unique set of characters and interesting technology.

But you’ve probably also looked at many startups and early-stage companies seeking advice on developing a data culture like LinkedIn or Airbnb.

What advice would you give companies transitioning from being data-driven to AI-enabled or AI-driven?

One question I hear a lot is, “Is traditional AI dead? Should we be focusing on new AI technologies?”

Kapil: Where we are today, there’s a place for different kinds of technologies and ML models. This space evolves very quickly. But for advice, I’ll go back to one of the points you made earlier about thinking of it as a hierarchy of needs, like Maslow’s hierarchy, but for data and AI.

First, the business value you’re trying to establish is key — what problem are you trying to solve and why?

You can’t skip the steps in the pyramid. To produce good AI and machine learning, you need great data and a solid data foundation of scalable infrastructure. And you need a strong foundation of metadata infrastructure.

It’s tempting to think you can skip to AI and have it solve all your problems, but you can’t have great AI without a great data story. And you can’t have a great data story without a solid metadata foundation. This includes discovery, catalogs, and data quality assurance. The hierarchy of needs is a good framing, and that’s the advice I’d give every company. Start wherever you are, but think about your investments in these terms.

Shirshanka: So, an MVP metadata strategy with an MVP data quality strategy, focusing on a slice of the whole pyramid instead of jumping straight into AI, is key.

One of the challenges data teams often face, especially in analytics, is being asked to provide one answer, then the next question comes, and they provide another answer, and so on.

What advice would you give to those teams to stop getting trapped in that vicious cycle of just being an answering machine?

Kapil: Answers lead to more questions. Right? And, to some extent, it’s a much longer conversation. We probably need another segment to discuss metrics and iterative data analytics. From an infrastructure and platform point of view, if we focus on making those experiences much more self-serve, people can answer those questions themselves without needing to go to someone else.

So, having the right tools, like notebooks for querying and issuing repeated queries on your data, is key.

We talked about the platform built on our Notebooks infrastructure, but connecting that with metadata is crucial. Many people struggle because they don’t know which datasets to query — it’s intimidating to find the right thing among thousands or millions of tables. That is, again, why having a good metadata foundation is so important.

Since we’re talking about GenAI, this is also a space where AI can help our tools get better. This is where infrastructure and data make AI better, and AI in turn improves tool experiences — almost like a perpetual-motion machine feeding itself.

But, of course, this is just the quick and short answer.

Shirshanka: Thank you so much, Kapil, for doing this. It’s been great.
