DDL: Data Mesh - Lessons from the Field
Summary
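TL;DR: In this episode of the DDL show, Darren Hacken, engineering director at AutoTrader, talks with host Shashanka, co-founder and CTO of Acryl, about how the data field has evolved and what data mesh means in practice. Darren shares his own career path and explains how AutoTrader decentralized its data teams to increase its data capability. They also discuss implementing data mesh, including how data products and metadata management support better data governance and observability. Darren is optimistic about the future of data mesh and believes it will help organizations build and use data products in a more decentralized way.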
Takeaways
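- 🎉 Darren Hacken is the engineering director responsible for platform and data at AutoTrader, the UK's largest automotive platform.
- 🚀 Darren was not interested in data work early in his career, but the rise of big data technology made him passionate about the field.
- 🌐 AutoTrader's data teams are relatively decentralized: several platform teams plus data teams that each focus on a specific problem domain.
- 🔄 The move from centralized to decentralized data teams reflects how data management has to adapt as an organization grows.
- 🤖 Data mesh is a sociotechnical concept that emphasizes culture and team structure as well as how to achieve decentralization.
- 🛠️ While implementing data mesh, AutoTrader ran into a gap between tooling that assumes centralization and the need for decentralization.
- 📊 Using Kubernetes and DataHub, AutoTrader is building data product thinking and practice to improve data discoverability and governance.
- 🔧 Implementing data mesh brought new challenges around naming data products and data modeling practices.
- 🌟 Darren sees the data product concept as the most powerful part of data mesh because it helps organize and use data better.
- 🚫 Implementing data mesh does not happen overnight; it takes time and continued technical progress to overcome the existing challenges.
- 🔮 Looking ahead, Darren expects data mesh and data products to further drive data use and innovation inside organizations, especially in AI and ML.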
Q & A
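What is Darren Hacken's current role?
-Darren Hacken is the engineering director at AutoTrader, responsible for platform and data.
What is AutoTrader's main business?
-AutoTrader is an automotive marketplace and technology platform. It is the UK's largest automotive platform and covers buying and selling cars and related services.
How does Darren Hacken view the data field?
-Darren cares deeply about the data field. He believes data can shape and change organizations, and that the field keeps growing with the development of technologies such as AI and ML.
What changes did Darren Hacken go through in his career?
-Early in his career Darren disliked data work because it was built around GUI-based ETL tools rather than code. As big data technology emerged, he found the field increasingly attractive, and it eventually became the area he is most passionate about.
What is the data product that Darren Hacken refers to?
-A data product bundles data together with its related capabilities. It helps an organization manage and use data more effectively and supports data discovery, analysis, and governance.
How do AutoTrader's data teams operate?
-AutoTrader's data teams are decentralized. There are several platform teams and data teams, each focused on a business domain such as advertising, user behavior, or vehicle pricing, and they build data products and provide self-serve analytics.
How does Darren Hacken view data governance and metadata management?
-Darren sees data governance and metadata management as key requirements once data is decentralized, particularly for establishing clear ownership and dependencies between data products and for ensuring data quality and security.
Which technologies does Darren Hacken mention in the data field?
-Darren mentions dbt, Kubernetes, and DataHub, which help his teams create, manage, and govern data products.
What does Darren Hacken expect from the future of the data field?
-Darren hopes the data product concept will take deeper root, and he wants to see more technology that supports decentralization so that data management and governance become easier.
How does Darren Hacken view the challenges in the data field?
-Darren thinks the challenge is keeping data quality and practices at a high standard, and maintaining those standards without a central team. Data naming and modeling are also ongoing challenges.
What does Darren Hacken think about data contracts?
-Darren finds data contracts an interesting area. His teams currently use them mostly implicitly, relying on standardized approaches and validators to detect schema changes, and he is open to how data contracts develop in the future.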
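The implicit data contracts mentioned in the last answer come down to comparing a data product's published schema with a proposed one and flagging changes that could break consumers. The Python sketch below illustrates only that idea; it is not AutoTrader's validator. The `SchemaChange` type, the `diff_schemas` function, and the column names are hypothetical, and treating every removal or type change as breaking is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaChange:
    column: str
    kind: str       # "removed", "type_changed", or "added"
    breaking: bool  # assumption: removals and type changes break consumers


def diff_schemas(published: dict, candidate: dict) -> list:
    """Compare two {column: type} mappings and classify each difference."""
    changes = []
    # Columns that consumers already rely on: removal or retyping is breaking.
    for column, old_type in published.items():
        if column not in candidate:
            changes.append(SchemaChange(column, "removed", breaking=True))
        elif candidate[column] != old_type:
            changes.append(SchemaChange(column, "type_changed", breaking=True))
    # New columns are treated as additive and therefore non-breaking.
    for column in candidate:
        if column not in published:
            changes.append(SchemaChange(column, "added", breaking=False))
    return changes


# Hypothetical schemas for illustration only.
published = {"vehicle_id": "STRING", "price_gbp": "NUMERIC", "listed_at": "TIMESTAMP"}
candidate = {"vehicle_id": "STRING", "price_gbp": "FLOAT64", "seller_id": "STRING"}

changes = diff_schemas(published, candidate)
breaking = [c for c in changes if c.breaking]

for change in changes:
    print(change)
```

A check like this could run in continuous integration and fail the build whenever `breaking` is non-empty, which is one way a contract can stay implicit while still being enforced.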
Outlines
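🎤 Opening and introductions
This section covers the opening of the show. The host says how excited he is about the topic and welcomes the guest, Darren Hacken, engineering director for platform and data at AutoTrader. The host, Shashanka, is co-founder and CTO of Acryl and founder of the DataHub project. Darren describes how he came to work in data and how he went from disliking data work to being passionate about it.
🔍 How the data teams are structured and operate
Darren describes AutoTrader's data team structure, which includes platform teams and data teams focused on specific domains. He emphasizes that the data teams are decentralized and explains how building a data platform supports data capabilities across the organization. He also covers how the data teams interact with other teams and how teams are organized around problems.
🌐 Understanding and practicing data mesh
Darren shares his understanding of data mesh as a sociotechnical practice and a cultural shift. He mentions where data mesh came from and how it helps organizations decentralize. He discusses how his teams began applying data mesh principles, in particular moving the technical architecture from a centralized model to more decentralized data products.
🛠️ Governing data products and the challenges involved
Darren discusses the challenges encountered while implementing data mesh, especially in data governance, metadata management, and observability. He points out where technical tooling falls short in supporting decentralization and explains how his teams use metadata and DataHub to address these problems.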
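One way to picture the metadata described here is a small record per data product that names its owning team and the consumers allowed to read it, plus a check that flags products with no owner or with undeclared dependencies. The Python sketch below is a hypothetical illustration: the `DataProduct` fields, the `find_violations` function, and the product and team names are assumptions, and it does not use the DataHub API or reflect AutoTrader's actual setup.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    name: str
    owner: str = ""                                         # owning team; empty means undeclared
    reads_from: set = field(default_factory=set)            # upstream product names
    allowed_consumers: set = field(default_factory=set)     # products permitted to read this one


def find_violations(products: list) -> list:
    """Return human-readable governance problems found in the catalog."""
    by_name = {p.name: p for p in products}
    violations = []
    for product in products:
        if not product.owner:
            violations.append(f"{product.name}: no owning team declared")
        for upstream_name in sorted(product.reads_from):
            upstream = by_name.get(upstream_name)
            if upstream is None:
                violations.append(f"{product.name}: reads unknown product {upstream_name}")
            elif product.name not in upstream.allowed_consumers:
                violations.append(f"{product.name}: not an allowed consumer of {upstream_name}")
    return violations


# Hypothetical catalog for illustration only.
catalog = [
    DataProduct("vehicle_pricing", owner="pricing-team", allowed_consumers={"ad_performance"}),
    DataProduct("ad_performance", owner="advertising-team", reads_from={"vehicle_pricing"}),
    DataProduct("user_behaviour", reads_from={"vehicle_pricing"}),
]

violations = find_violations(catalog)
for violation in violations:
    print(violation)
```

Keeping ownership and access in the same record means one pass over the catalog can surface both unowned products and the accidental cross-team dependencies that a shared monolith tends to hide.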
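🔄 Creating and managing data products
Darren explains how his teams create and manage data products by using Kubernetes as a control plane. He discusses how data product metadata is handled through automation and code, and how they use DataHub to collect metadata and connect data products.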
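The summary does not describe how the control plane works internally. As a rough illustration of the general pattern, a declarative control plane compares the desired state (data product definitions kept as code) with the observed state and derives the actions needed to converge the two. The Python sketch below assumes that pattern; the `reconcile` function, the spec fields, and the product names are hypothetical, and no Kubernetes API is called.

```python
def reconcile(desired: dict, observed: dict) -> list:
    """Derive (action, product_name) pairs that move observed state toward desired state."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in observed:
            actions.append(("create", name))   # declared in code but not yet deployed
        elif observed[name] != spec:
            actions.append(("update", name))   # deployed, but the spec has drifted
    for name in sorted(observed):
        if name not in desired:
            actions.append(("delete", name))   # deployed but no longer declared
    return actions


# Hypothetical data product specs for illustration only.
desired = {
    "vehicle_pricing": {"owner": "pricing-team", "schedule": "hourly"},
    "ad_performance": {"owner": "advertising-team", "schedule": "daily"},
}
observed = {
    "vehicle_pricing": {"owner": "pricing-team", "schedule": "daily"},
    "legacy_exports": {"owner": "platform-team", "schedule": "daily"},
}

actions = reconcile(desired, observed)
for action, name in actions:
    print(action, name)
```

Because the function only computes a plan, running it repeatedly is safe, and the same specs can also be published to a metadata catalog so that governance and deployment share one source of truth.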
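🤔 Challenges and future of data mesh
Darren explores the architectural pressure data mesh can put on an organization and how to keep data practices at a high quality without a central team. He also raises the challenges of data naming and modeling, and explains how his teams handle them implicitly through data contracts.
🚀 Outlook for data mesh
Darren is excited about the future of data mesh and data products. He anticipates that data product thinking will help organizations make better use of data, and that data mesh will help shorten time to market and improve responsiveness to the market. He also describes how AI and data products can reinforce each other and is optimistic about future technical developments.
🙌 Closing and thanks
At the end of the show, host Shashanka thanks Darren for joining and sharing, and looks forward to future collaboration. They discuss the future of data products and data mesh and how community and open source projects can move these concepts forward.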
Keywords
💡Data Mesh
💡Data Products
💡Metadata
💡Data Governance
💡Data Platform
💡Data Teams
💡Data Ownership
💡Data Quality
💡Data Discovery
💡Data Contracts
💡Data Architecture
Highlights
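Darren shares his passion for the data field and describes his role and responsibilities at AutoTrader.
Darren recounts his career shift from disliking data work to becoming a leader in the data field.
AutoTrader's data team structure is decentralized, with dedicated data teams for domains such as advertising and user behavior.
Darren explains the data product concept and how data products enable collaboration and data sharing between teams.
AutoTrader faces challenges in building its data platform, especially around technology boundaries and data governance.
Darren discusses the concept of data mesh and how it helps organizations decentralize data.
Darren shares AutoTrader's experience implementing data mesh, including the technical challenges and the cultural change.
The importance of data governance, metadata management, and observability when implementing data mesh is discussed.
Darren describes using Kubernetes as the control plane for data products and how automation improves efficiency.
The future of data mesh and its effect on data use and product development inside organizations is discussed.
Darren gives his outlook on the role and future of data products and data contracts within data mesh.
The challenges of data mesh are discussed, including how to maintain data quality and the difficulties met in practice.
Darren shares his view of where the data mesh concept is heading and how it adapts to a changing technology landscape.
How data mesh helps organizations make better use of data and improve the speed and quality of decisions is discussed.
Darren is optimistic about the future of data mesh and data products and looks forward to further technical progress.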
Transcripts
[Music]
hello everyone and welcome to episode
four of the DDL show I am so excited
that we're going to be talking about a
topic that used to be exciting and has
stopped being exciting and that itself
is exciting so I'm super excited to
bring on Darren Hacken uh I think our
first conversation Darren was literally
on the data mesh learning group first
time we met um and so it's it's kind of
a full circle I'm super excited to
welcome you to the show Darren is an
engineering director heading up uh
platform and data at AutoTrader and I'm
your host Shashanka co-founder and CTO
at uh Acryl and founder of the DataHub
project so Darren tell us uh about
yourself and how you got into Data hi
shash thank you for having me today um
yeah so my name is Darren I work for a
company in the UK in the United Kingdom
called AutoTrader so we're an
automotive Marketplace and Technology
platform that drives it's the UK's
largest um Automotive platform so buying
and selling cars that kind of thing and
one of the areas I deeply deeply care
about is is the data space um so here at
AutoTrader I kind of look after our kind
of data platform um the capabilities
that we need in order to surface data
been working in data a long time now
maybe eight nine years um funny
story I vowed I would never work in data
because when I started my career I
worked in fintech in a data
team and I absolutely hated it because
it was all GUI based ETL tools and I
got out of this as fast as I possibly could
and said never again I love engineering
I you know I'm a coder I need to get
away and do this other thing you don't
like pointing and clicking clearly I
didn't like pointing and clicking I like
I like code um and then it kind of got
really sexy and big data and technology
changed and I think it's one of the most
exciting areas of Technology now so
never say never is probably my I always
find that a funny kind of starting point
for me in terms of data to leave
a role and go never again and here I am
um so yeah passionate about data really
think it's one of them things that
really can shape and change
organizations it's um and it's it's
growing all the time right with things
like AI and LLMs and hype cycles around
things like that but yeah thanks for
having me they do say data has gravity
and you know uh normally it's like
pulling other data close to it but uh
clearly people also get attracted to it
and can never leave I was literally the
same way uh well I never went to data
and I wasn't able to leave so I was um
you know an engineer on the um online
data infrastructure teams right so I was
uh doing U display ads and uh doing
real-time bidding on ads at Yahoo and
then I uh was offered the uh chance of a
lifetime to go rebuild linkedin's data
infrastructure and I didn't actually
know what data meant at that point I was
scared of databases honestly because you
know it's hard to build something that's
supposed to be a source of Truth like
wait you're responsible for actually
making sure the write actually made it
to disk and it actually got flushed and
was replicated three times so that no
one loses an update well that seems like
a hard problem so you know that was my
mission impossible that I went to
LinkedIn for and I never left I've just
been in data this whole time so can
totally relate you never escape the
gravity you do not um so well so you're
you're leading big uh teams at
AutoTrader right now you know platform and
data tell me a little bit about what
that team does because you know as I
have talked to so many data leaders
around the world it seems clear to me
that all data teams are similar but not
all teams are exactly the same so maybe
walk our audience through what does the
data team do and who are the surrounding
teams and how do they interact with them
yeah um so interestingly
AutoTrader has been around for about
40 years so they started as a magazine
you could go into your you know local
store and find the magazine and pick it
up so that interestingly means that as
Technologies evolved throughout the
decades you know they've gone through
many chapters of of it um but today
we're relatively decentralized in terms
of our data team setup and you know
we'll get into that I guess a little bit
more when we talk about data mesh today
um but we have a kind of platform team
so we have several platform teams and we
have a platform team um predominantly
made up of Engineers and kind of
SRE you know folks and they build um
what we call our data platform and that
is the kind of product name I guess for
the bundling of
technology which would would help Drive
data capabilities across the
organization you know that might be
building data products which we can get
into later it could be um metadata
management how to create security
policies with data um but crucially
their play is about building
capabilities that let other people um
use these capabilities and and build
technology and other than that we try to
keep data teams closer to um the domain
of of a of an area or a problem so we
may have data teams that focus a lot on
like advertising or user Behavior maybe
more around like vehicles and pricing
and fulfillment type problems um but we
we tend to have kind of Engineers or
Engineers that specialize in data um
scientists and analysts so they they're
kind of as a discipline together and
managed together from a craft perspective
but then in terms of how they work
together we tend to form them
around problems um pricing as I said
earlier and things like that and they
would maybe do analytics self- serve
analytics um product analytics machine
learning um you know feature engineering
very much that kind of thing and we're
trying to keep it as close to kind of
engineering as as possible so very much
a decentralized play or that's been our
our current generation of
peopleware and team topologies um got it got
it and by the way for the audience who's
listening in um definitely uh feel free
to ask questions we'll we'll try to pull
them up uh as they come in so you know
this is meant to be me talking to Darren
and Darren talking to me and all of you
being uh having the ability to kind of
participate in the conversation so um
definitely as we keep talking about this
topic uh keep asking questions and we'll
try to pull them up and um combine them
so Darren you talked a little bit about
how the teams were structured it
definitely resonated with kind of how uh
LinkedIn evolved over the over the years
I was there we started out uh with uh a
single data team that was uh responsible
for both platform as well as
uh business so you know they were
responsible for making decisions like
what warehousing technology to use and
how to go about it and then but also
building the executive dashboard and
building the foundational data sets we
had so many debates about whether to
call them foundational or gold but the
concept was still the same you build
kind of the the the canonical business
model on top of which you want all um
insights as well as you know analytics
as well as AI to be derived from and
then over the years we definitely had a
lot of stress with that centralization
and had to kind of split apart the
responsibilities uh we ended up going to
a model where there was essentially a
data unaware or semantics unaware team
that was fully responsible just for the
platform and um sub teams that emerged
out of those out of that original team
that sometimes got fully embedded into
product delivery teams to actually um
essentially have a local Loop where
product gets built data comes out of it
and then the whole Loop of creating
insights models and features and then
shipping it back into the product was
all owned and operated by um a specific
team so it looks like that's kind of
where you've ended up as well yeah in
fact that's spookily similar I mean we
started definitely more centralized and
then these teams sort of came out of
that more centralized model so like we
we built a team around user behavior and
advertising kind of build that that went
really well and then they felt a lot
more connected and it did evolve like
that um and and a lot of this I think
just spawns from scale really so I mean
my organization is definitely nowhere near
the figures where you were previously
working Shashanka but we definitely find
that you know the more hungry an
organization gets for data eventually
you you simply can't keep up with this
centralized team with this scarcity of
resource and everyone fighting over the
same thing gets really hard to think
about you know do I invest in the
finance team do I uh invest in our
advertising or our marketing team so
like eventually like partitioning almost
your resource in some way feels
inevitable that you have to to otherwise
it becomes it becomes so
hard cool so let's let's let's talk
about the topic of
the day what does data mesh mean to you
then now that we've kind of understood
how the teams have evolved and what your
uh teams are doing day to day yeah and I
think it's a really good point that we
started around teams and culture
actually because that is really what I
think the heart of what J mesh is um so
I I used to work um For Thought Works
where shaku also kind of came up with
the the data mesh thing um kind of came
from and I I wasn't working at the time
but I remember reading it and we've
we're already on this journey of like we
need to decentralize and our platform is
really important to us and we need
capabilities and we want more people to
do that and in fact you know we were
succeeding at decentralizing and scaling
um but I think when we did that we were
entering new spaces where a lot of
people hadn't really talked about it so
for me data mesh one of the things that
it means it's a you know socio technical
thing a cultural thing it's like devops
really or something like that for me
she's done a great job describing how
to
um you know get there like data products
and all this kind of thing but one of
the great things I think that Zhamak did with
talking about data mesh was build a
lexicon a grammar a way of us all
communicating to each other like
shashanka me and you met on a on a data
mesh you know community and immediately
we we were able to speak at a level that
we simply wouldn't have been able to
maybe if we would have met five years
ago and tried to have the same
conversation um so a lot of it's that
for me that's what data mesh is it's
about it's a it's a method or an
architectural pattern or set of
principles or guidelines about how you
could achieve decentralization and and
move away from this this Central team
and kind of break apart from it um and
that has been and that has been the big
draw right to of of the concepts because
a lot of people relate to it uh and kind
of resonate with it and then that from
that um what is it Summit of Hope comes
the the valley of Despair where you you
start figuring out okay how do I
translate this idea into reality and how
much do I need to change um so walk us
through your journey of like how have
you implemented data mesh how have you
taken these principles and brought them
to life or at least attempted to bring
them to life and we'll see how you feel
about it like would you give yourself an
a grade or a b-grade we we'll we'll
figure that out later but what have you
done to bring it to life so so at the point
when we started trying to apply data
mesh um we were in this place where we
we decentralized some of our teams but
our technology underneath is still very
much centralized and shared so almost
like a monolith with teams contributing
to it but everything was partitioned or
structured around technology so we'd end
up with I don't know a dbt project or
something right or we had a monolith
around Spark jobs and things it's very
technology partitioned um and then when
we started looking at data mesh we were
really excited because one of the big
things that we took out of it was this term
data product and we're like great we've
now found this this this language to
describe how we were going to try and
break things down like before that we
were trying to break break you know lots
of data down into chunks of data but we
just couldn't think of like the wording
gave us a lot more power to to start
communicating so we we started trying to
break down our DBT monolith essentially
into Data products um and that's been
one of our journeys of like breaking it
partitioning it and doing that so that
was the big starting point of doing that
um so it was very much like we had some
teams that were decentralized and then
like how do we almost catch the
technology up so DBT was the starting
point of
that so you went from a monolithic repo
where all of your transformation logic
was being
hosted to chopping it up and um
splitting it up uh across multiple
different teams um great so once you did
that what did you then
find well then you find that the tooling
and system that we've got today has some
gaps when you start to think about
decentralization like a lot of the
technologies that we use in the data
space do promote very much very Central
centralized approach um like I think
it's becoming a little bit less popular
but you know Airflow it'd be like one
Airflow for your whole
organization dbt might say one big
project even though they are saying
that less now but there was definitely a
period where like you know that was the
that was the popular approach so we you
broke things apart
and now you've got gaps between data
products where you've got DBT and DBT
and now you've got gaps and that's where
you really start to realize that there
are other requirements that start to
come in that you need and two big ones
that felt obvious for us were around
data governance metadata kind of knowing
more about these data products at a
meta level observability and
how you define that and also how you
start creating security policy between
them so it's the classic thing of when
organizations move to microservices like
all of a sudden like monitoring between
things things breaking in you know in
the infrastructure level between the the
network protocol starts to
happen I think the data world is not
there and is catching up and I think it
will one day but today they were some of
the gaps that we started to see um so
like by breaking down I'll give you an
example so like by breaking down dbt so you
have this monolith with maybe I don't
know 50 people working on an area of a
monolith and then you break that down
into Data products you then start to
realize well we didn't really have clear
ownership with that like who owned it
like people were contributing together
as maintainers maybe but who owns who
owns this data asset who actually who is
the team that do it and that's where we
started to realize well you need kind of
metadata over the top to start labeling
things like that or we also had this
other symptom coming out because we had
all of our code in one place it was very
easy for like team a and Team B to use
data between each other and not really
realize and start creating dependencies
so then we were almost trying to start
using metadata to say well who should be
allowed to use my data product and that
stuff starts to get teased out so cross