OSMC 2025: Onotify: A scalable, flexible Alertmanager by Colin Douch.pdf

onotify: A scalable, flexible
Alertmanager
Colin Douch (SRE @ DuckDuckGo)

Which brings me to Alertmanager
• Alertmanager infuriates me
• What happened to all my nice
alerting features?
• Why am I being forced to use
this thing?

Alertmanager is this weird little orphan child
• It’s had a fifth of the commits of
Prometheus itself
• People are sort of forced to use
it if they use Prometheus
• Providers are bullied into
helping maintain it because of
Prometheus’ success

So, let’s talk about it
• Alertmanager’s weird
incentives
• What we’ve lost
• Why is Alertmanager so
difficult?
• Can we do better?

Hi! I’m Colin
• Formerly leading the
Observability Team at
Cloudflare
• Now an SRE at DuckDuckGo,
working on privacy preserving
telemetry at scale
• Older than I look
• My knees hurt

What about Authentication?
Key and lock from Alicia
• Believe it or not, it’s useful to
know who does things
• Alertmanager, when it can, sort
of relies on an honor system
• But for the most part, you’re on
your own. Who made changes?
Who knows! Not you!

What about Silences? Or downtimes?
• Maybe you’re doing some
maintenance
• Maybe you’re getting a PTSD
response to your PagerDuty
alert noise

“If you like it you shoulda put a proxy in front of it”
Bouncer from Alicia
• Alertmanager recommends
putting a proxy in front of it for any
business logic
• But it notably doesn’t actually
provide any such proxy
• We made “Alertmanager
Bouncer” at Cloudflare:
https://github.com/sinkingpoint/alertmanager-bouncer

Automation? We had Event Handlers for that
• Automating Alert Remediation
(temporarily while a real fix is
being implemented) is sort of a
core DevOps tenet
• We’re left on our own, yet again,
with Alertmanager
• Better get used to the
Alertmanager Webhook
interface
• Good luck
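
For a sense of what "getting used to the webhook interface" means in practice, here is a minimal sketch of a handler for the JSON body Alertmanager sends to a webhook receiver. The payload fields follow Alertmanager's documented webhook format; the remediation table, the action names, and the `planRemediations` helper are hypothetical and only there for illustration.

```typescript
// Shape of one alert inside an Alertmanager webhook payload.
interface WebhookAlert {
  status: "firing" | "resolved";
  labels: Record<string, string>;
  annotations: Record<string, string>;
  startsAt: string;
  endsAt: string;
  generatorURL: string;
  fingerprint: string;
}

// Shape of the webhook payload (version "4") that Alertmanager POSTs.
interface WebhookPayload {
  version: string;
  groupKey: string;
  status: "firing" | "resolved";
  receiver: string;
  groupLabels: Record<string, string>;
  commonLabels: Record<string, string>;
  commonAnnotations: Record<string, string>;
  externalURL: string;
  alerts: WebhookAlert[];
}

// Hypothetical helper: maps firing alerts to remediation commands.
// `table` maps an alertname to an action name (both are assumptions).
function planRemediations(
  payload: WebhookPayload,
  table: Map<string, string>,
): string[] {
  const actions: string[] = [];
  for (const alert of payload.alerts) {
    // Resolved alerts need no remediation.
    if (alert.status !== "firing") continue;
    const action = table.get(alert.labels["alertname"] ?? "");
    if (action === undefined) continue; // no automation registered
    const target = alert.labels["instance"] ?? "unknown";
    actions.push(`${action} ${target}`);
  }
  return actions;
}
```

Everything beyond parsing (retries, deduplication, recording who ran what) is still left to whoever writes the receiver.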

What about Alert Histories?
• Yet another hard coupling with
Prometheus – you can just use
the ALERTS timeseries
• But what if you have multiple
Prometheus servers?
• What if you need longer
retentions?

What about Alert Histories?
Histories!!!

Let’s talk about Acknowledging
• Once again, Alertmanager
punts this to the downstream
sinks
• Fine, until you have multiple
alerting systems
• Oh look, yet more vendor
lock-in

Which brings us back to: Alertmanager is weird
• Which brings us back to this:
Alertmanager is weird
• It’s almost toxically open
source
• Providers are both incentivized
to help maintain Alertmanager,
and simultaneously do as little
as possible to it

We’ve lost so much
• All our functionality is
offloaded to downstream
alerting sinks
• Providers are incentivized to
put as little work into
Alertmanager as necessary
• This is great for vendor lock-in,
and bad for those of us who
actually have to use Alertmanager
Old man from Alicia

Which brings me to onotify
https://github.com/sinkingpoint/onotify

What we’ve done differently
• Kiora attempted to mimic
Alertmanager as much as
possible
• Onotify acknowledges:
• Alerting is bursty
• Alertmanager is wasteful
• There’s a lot of effort already put
into Alertmanager both in Open
Source, and internally

Serverless to the rescue!
• Have bursty traffic? Serverless
might be for you!
• Cloudflare is big on
dogfooding, so Cloudflare
Workers it is
• Not just a serverless platform:
Lots of nice primitives to build
full applications, like Durable
Objects

Onotify has logical separation by “account”

What about all the existing stuff?
• There’s a lot of Alertmanager
stuff out there!
• Despite its flaws, there is a
large community effort to make
up for it
• Karma, amtool, Alertmanager
Bouncer, all your existing
config, there’s lots of effort put
into them!

onotify needed to mimic Alertmanager
• onotify needed to look as
much like Alertmanager as
possible
• It takes the same config
• It’s entirely backwards
compatible with
Alertmanager’s API
• Karma, amtool, your random
script that keeps everything
running? It “just works”
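
As an illustration of what API compatibility buys, a script that already posts alerts to Alertmanager's v2 endpoint should only need its base URL changed. The sketch below builds such a request without sending it. The `/api/v2/alerts` path and the alert fields are Alertmanager's; the `buildPostAlerts` helper, the `HttpCall` type, the example hostname, and the optional extra headers (for example, for whatever credentials a deployment requires) are assumptions.

```typescript
// An alert as accepted by Alertmanager's POST /api/v2/alerts.
interface PostableAlert {
  labels: Record<string, string>;
  annotations?: Record<string, string>;
  startsAt?: string;
  endsAt?: string;
  generatorURL?: string;
}

// Hypothetical plain description of an HTTP request, so the sketch
// does not depend on any particular HTTP client.
interface HttpCall {
  method: string;
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Builds the request a client would send to an Alertmanager-compatible
// server. Only `baseUrl` differs between Alertmanager and a drop-in
// replacement; `extraHeaders` is an assumed hook for credentials.
function buildPostAlerts(
  baseUrl: string,
  alerts: PostableAlert[],
  extraHeaders: Record<string, string> = {},
): HttpCall {
  const url = baseUrl.replace(/\/+$/, "") + "/api/v2/alerts";
  return {
    method: "POST",
    url,
    headers: { "Content-Type": "application/json", ...extraHeaders },
    body: JSON.stringify(alerts),
  };
}

// Usage: the same call, pointed at a different (hypothetical) host.
const call = buildPostAlerts("https://onotify.example/", [
  { labels: { alertname: "DiskFull", instance: "web-1" } },
]);
console.log(call.method, call.url);
```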

But we still improved things

We have links!
Canonical Links

We have authentication!
• onotify authenticates every
request - no more giving
everything access to
everything
• It supports its own
authentication, or farming it
out to Cloudflare Access or
Clerk
• No more honor system!

Requests have pagination
• onotify is built to scale!
• Pagination
• Authentication/authorization
• Centralized Alert histories,
comments, and
acknowledgements

What about configuration?
• You can upload your
Alertmanager configuration
directly
• But at Cloudflare, we had lots
of duplicated branches
• Our alert evaluations started to
push 10 seconds

What about configuration?
• But no more routing trees!
• Instead: The routing Directed
Acyclic Graph (DAG)
• Faster to evaluate, no
duplication, less resource
usage, easier updates
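
To illustrate the idea (this is not onotify's actual implementation, which the slides do not show), here is a minimal sketch of routing over a DAG. In a tree, a sub-route shared by several teams has to be copied under each parent; in a DAG, several parents can point at the same node, and a visited set ensures each node is evaluated at most once per alert. The `RouteNode` shape, the equality-only matchers, and the `route` function are all assumptions made for the example.

```typescript
type Labels = Record<string, string>;

// Hypothetical routing node: matches when every entry in `match`
// equals the corresponding alert label.
interface RouteNode {
  id: string;
  match: Labels;
  receiver?: string;
  children: string[]; // ids of child nodes; may be shared by parents
}

function nodeMatches(node: RouteNode, labels: Labels): boolean {
  return Object.keys(node.match).every((k) => labels[k] === node.match[k]);
}

// Walks the DAG from `rootId`, descending only through matching nodes.
// Returns the receivers reached and how many nodes were evaluated.
function route(
  nodes: Map<string, RouteNode>,
  rootId: string,
  labels: Labels,
): { receivers: string[]; evaluated: number } {
  const visited = new Set<string>();
  const receivers = new Set<string>();
  const stack: string[] = [rootId];
  let evaluated = 0;
  while (stack.length > 0) {
    const id = stack.pop() as string;
    if (visited.has(id)) continue; // shared node already handled
    visited.add(id);
    const node = nodes.get(id);
    if (node === undefined) continue; // dangling reference, skip
    evaluated++;
    if (!nodeMatches(node, labels)) continue;
    if (node.receiver !== undefined) receivers.add(node.receiver);
    for (const child of node.children) stack.push(child);
  }
  return { receivers: Array.from(receivers).sort(), evaluated };
}
```

In this sketch a "critical goes to the pager" node can be listed as a child of every team's node while being defined, stored, and updated only once.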

onotify has proper histories!

And we can acknowledge them too!

onotify centralizes acknowledgements
• You can acknowledge alerts in
onotify
• Those acknowledgements can
be viewed in the history
• They can be synced with
downstream providers – no
more checking multiple places

So what’s next?
• DuckDuckGo probably isn’t
going to move away from Icinga
any time soon
• But we can still learn from it!
• onotify will continue to evolve,
as an example of what we
could have
• Prometheus and its children
are not the be-all and end-all

Questions?
Art by Alicia Edwards:
https://www.aliciaillustration.com/
https://colindou.ch/slides
https://github.com/sinkingpoint/onotify
