onotify: A scalable, flexible
Alertmanager
Colin Douch (SRE @ DuckDuckGo)
2
Which brings me to Alertmanager
• Alertmanager infuriates me
• What happened to all my nice
alerting features?
• Why am I being forced to use
this thing?
3
Alertmanager is this weird little orphan child
• It’s had a fifth of the commits of
Prometheus itself
• People are sort of forced to use
it if they use Prometheus
• Providers are bullied into
helping maintain it because of
Prometheus’ success
4
So, let’s talk about it
• Alertmanager’s weird
incentives
• What we’ve lost
• Why is Alertmanager so
difficult?
• Can we do better?
5
Hi! I’m Colin
• Formerly leading the
Observability Team at
Cloudflare
• Now an SRE at DuckDuckGo,
working on privacy preserving
telemetry at scale
• Older than I look
• My knees hurt
6
What about Authentication?
Key and lock from Alicia
• Believe it or not, it’s useful to
know who does things
• Alertmanager, when it can, sort
of relies on an honor system
• But for the most part, you’re on
your own. Who made changes?
Who knows! Not you!
7
What about Silences? Or downtimes?
• Maybe you’re doing some
maintenance
• Maybe you’re getting a PTSD
response to your PagerDuty
alert noise
8
What about Silences? Or downtimes?
9
“If you like it you shoulda put a proxy in front of it”
Bouncer from Alicia
• Alertmanager recommends
putting a proxy in front of it for any
business logic
• But it notably doesn’t actually
provide any such proxy
• We made “Alertmanager
Bouncer” at Cloudflare:
https://github.com/sinkingpoint/alertmanager-bouncer
10
Automation? We had Event Handlers for that
• Automating Alert Remediation
(temporarily while a real fix is
being implemented) is sort of a
core DevOps tenet
• We’re left on our own, yet again
with Alertmanager
• Better get used to the
Alertmanager Webhook
interface
• Good luck
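
For anyone who does end up on the webhook interface, here is a minimal sketch of the receiving side. It assumes the documented Alertmanager webhook payload (a JSON object with an `alerts` list, each alert carrying `status` and `labels`). The `REMEDIATIONS` table and its action names are hypothetical, made up for illustration, and nothing here actually runs a remediation.

```python
import json
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    status: str
    labels: dict


def parse_webhook(body: str) -> list:
    """Extract the alerts from an Alertmanager webhook payload."""
    payload = json.loads(body)
    alerts = []
    for raw in payload.get("alerts", []):
        labels = raw.get("labels", {})
        alerts.append(Alert(
            name=labels.get("alertname", "<unnamed>"),
            status=raw.get("status", "firing"),
            labels=labels,
        ))
    return alerts


# Hypothetical mapping from alert name to a remediation action.
REMEDIATIONS = {"DiskFull": "rotate-logs", "HighLatency": "restart-proxy"}


def plan_remediations(alerts):
    """Return (alertname, action) pairs for firing alerts with a known action."""
    return [
        (a.name, REMEDIATIONS[a.name])
        for a in alerts
        if a.status == "firing" and a.name in REMEDIATIONS
    ]


# A trimmed-down example payload; real payloads carry more fields.
SAMPLE_BODY = json.dumps({
    "version": "4",
    "status": "firing",
    "receiver": "automation",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "DiskFull", "instance": "db1"}},
        {"status": "resolved", "labels": {"alertname": "HighLatency"}},
    ],
})
```

Calling `plan_remediations(parse_webhook(SAMPLE_BODY))` only plans an action for the alert that is still firing. Serving this over HTTP, retrying, and recording who triggered what are all left to you, which is rather the point of the slide.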
11
What about Alert Histories?
• Yet another hard coupling with
Prometheus – you can just use
the ALERTS time series
• But what if you have multiple
Prometheus servers?
• What if you need longer
retentions?
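
To make the multiple-servers problem concrete, here is a rough sketch that builds one range query per Prometheus server for the `ALERTS` series, using the standard `/api/v1/query_range` endpoint. The server URLs and the alert name are placeholders, label values are not escaped, and merging the per-server results is still left to the caller.

```python
from urllib.parse import urlencode


def alerts_history_url(base_url, alertname, start, end, step="60s"):
    """Build a query_range URL for the firing history of one alert."""
    query = f'ALERTS{{alertname="{alertname}", alertstate="firing"}}'
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def alerts_history_urls(servers, alertname, start, end, step="60s"):
    """One URL per Prometheus server: each only knows its own alerts."""
    return [alerts_history_url(s, alertname, start, end, step) for s in servers]
```

Each server only answers for its own rules and its own retention window, so the history is exactly as complete as the least-retained server you remembered to ask.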
12
What about Alert Histories?
13
Histories!!!
Let’s talk about Acknowledging
14
Let’s talk about Acknowledging
• Once again, Alertmanager
punts this to the downstream
sinks
• Fine, until you have multiple
alerting systems
• Oh look, yet more vendor
lock-in
15
Which brings us back to: Alertmanager is weird
• Which brings us back to this:
Alertmanager is weird
• It’s almost toxically open
source
• Providers are both incentivized
to help maintain Alertmanager,
and simultaneously do as little
as possible to it
16
We’ve lost so much
• All our functionality is
offloaded to downstream
alerting sinks
• Providers are incentivized to
put as little work into
Alertmanager as necessary
• This is great for vendor lock-in,
and bad for us who actually
have to use Alertmanager
17
Old man from Alicia
Which brings me to onotify
18
https://github.com/sinkingpoint/onotify
What we’ve done differently
• Kiora attempted to mimic
Alertmanager as much as
possible
• Onotify acknowledges:
• Alerting is bursty
• Alertmanager is wasteful
• There’s a lot of effort already put
into Alertmanager both in Open
Source, and internally
19
Serverless to the rescue!
• Have bursty traffic? Serverless
might be for you!
• Cloudflare is big on
dogfooding, so Cloudflare
Workers it is
• Not just a serverless platform:
Lots of nice primitives to build
full applications, like Durable
Objects
20
Onotify has logical separation by “account”
21
What about all the existing stuff?
• There’s a lot of Alertmanager
stuff out there!
• Despite its flaws, there is a
large community effort to make
up for it
• Karma, amtool, Alertmanager
Bouncer, all your existing
config, there’s lots of effort put
into them!
22
onotify needed to mimic Alertmanager
• onotify needed to look as
much like Alertmanager as
possible
• It takes the same config
• It’s entirely backwards
compatible with
Alertmanager’s API
• Karma, amtool, your random
script that keeps everything
running? It “just works”
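
As a sketch of what API compatibility means for that random script: the request below targets Alertmanager's v2 silences endpoint (`POST /api/v2/silences`), so the same code should only need a different base URL. The host name is a placeholder, and since onotify authenticates every request and these slides do not say which header or token scheme it expects, credentials are deliberately left out. The request is built but not sent.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import Request


def build_silence_request(base_url, alertname, hours, created_by, comment, now=None):
    """Build (but do not send) a request that silences one alert name."""
    now = now or datetime.now(timezone.utc)
    body = {
        "matchers": [
            {"name": "alertname", "value": alertname, "isRegex": False, "isEqual": True}
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }
    return Request(
        f"{base_url.rstrip('/')}/api/v2/silences",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would submit it. Note that `createdBy` is whatever the caller types, which is the honor system from earlier in action.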
23
But we still improved things
24
25
We have links!
Canonical Links
We have authentication!
• onotify authenticates every
request - no more giving
everything access to
everything
• It supports its own
authentication, or farming it
out to Cloudflare Access or
Clerk
• No more honor system!
26
Requests have pagination
• onotify is built to scale!
• Pagination
• Authentication/authorization
• Centralized Alert histories,
comments, and
acknowledgements
27
What about configuration?
• You can upload your
Alertmanager configuration
directly
• But at Cloudflare, we had lots
of duplicated branches
• Our alert evaluations started to
push 10 seconds
28
What about configuration?
• But no more routing trees!
• Instead: The routing Directed
Acyclic Graph (DAG)
• Faster to evaluate, no
duplication, less resource
usage, easier updates
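
A toy sketch of why a DAG helps, and emphatically not onotify's actual implementation: the `Node` type, the exact-match matchers, and the traversal are all assumptions made for illustration, and Alertmanager's `continue` semantics are ignored. The idea is that a branch shared by several parents exists once and is evaluated once per alert, where a tree would carry one copy per parent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    matchers: Dict[str, str]            # label -> required value (exact match)
    receiver: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def matches(node, labels):
    return all(labels.get(k) == v for k, v in node.matchers.items())


def route(root, labels):
    """Walk the graph and return (receivers, number of nodes evaluated)."""
    receivers, seen, stack = [], set(), [root]
    evaluated = 0
    while stack:
        node = stack.pop()
        if id(node) in seen:            # shared node already handled
            continue
        seen.add(id(node))
        evaluated += 1
        if not matches(node, labels):
            continue
        if node.receiver and node.receiver not in receivers:
            receivers.append(node.receiver)
        stack.extend(reversed(node.children))
    return receivers, evaluated


# Two parents share one "critical" branch instead of each carrying a copy.
critical = Node("critical", {"severity": "critical"}, receiver="pager")
infra = Node("infra", {"org": "infra"}, children=[critical])
prod = Node("prod", {"env": "prod"}, children=[critical])
root = Node("root", {}, children=[infra, prod])
```

For an alert labelled `org=infra`, `env=prod`, `severity=critical`, this walk evaluates four nodes and pages once. The equivalent tree would hold two copies of the critical branch and evaluate five, and the gap grows with every duplicated branch.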
29
onotify has proper histories!
30
We can annotate alerts
31
And we can acknowledge them too!
32
onotify centralizes acknowledgements
• You can acknowledge alerts in
onotify
• Those acknowledgements can
be viewed in the history
• They can be synced with
downstream providers – no
more checking multiple places
33
We can even export them!
34
Demo Time
Please Pray
35
So what’s next?
• DuckDuckGo probably isn’t
going to move away from Icinga
any time soon
• But we can still learn from it!
• onotify will continue to evolve,
as an example of what we
could have
• Prometheus and its children
are not the be-all and end-all
36
Questions?
37
Art by Alicia Edwards:
https://www.aliciaillustration.com/
https://colindou.ch/slides
https://github.com/sinkingpoint/onotify

OSMC 2025: Onotify: A scalable, flexible Alertmanager by Colin Douch.pdf
