# Interview with Ilan Rabinovitch by Seth Kenlon

Ilan Rabinovitch is well-known to anyone who has done the conference circuit around Southern California. He's a helpful and friendly guy whom I met once at a [BarCamp](http://barcamp.org/w/page/402984/FrontPage) years ago, encountered more than once on IRC, and often ended up getting great tech tips from by sheer proximity. He's speaking at this year's [LinuxFest Northwest](https://www.linuxfestnorthwest.org/2016/sessions/monitoring-101-finding-signal-noise), training attendees on system monitoring. I spoke with him about what he'd be covering:

**Anyone starting out in system administration hears a lot about "monitoring". Broadly speaking, what is "monitoring" and how's it done?**

As systems administrators and application developers, we build and deploy services that our colleagues or users depend on. Monitoring is the practice of observing and recording the behavior of these services to ensure they are functioning as we expect. Monitoring allows us to receive meaningful, automated alerts for potential problems and ongoing failures. It also allows us to quickly investigate and get to the bottom of issues or incidents once they occur.

**What should I be monitoring? Is it just my network, or are there other things I should be looking at? And will your talk address those things?**

In short, monitor everything. Collecting data is cheap, but not having it when you need it can be expensive, so you should instrument everything and collect all the useful data you reasonably can. So yes, you want data about your network, but also about your operating systems, your applications, and even business metrics. Individually, they might be interesting. Combined, they can help you tell the full story of why an incident occurs in your environment. You never want to be in a situation where you cannot explain why you experienced an outage or service degradation.
That being said, the specifics of what to monitor and alert on will differ from environment to environment. What's important to you and your application may not be critical for me. During the session at LinuxFest Northwest, we'll review some frameworks and methods for categorizing your metrics and identifying what is important to you and your business.

**A focus of your talk at LinuxFest Northwest is separating "signal from noise". The more you monitor, the more noise you have to sift through. Without copy-pasting your entire talk, what are some of the tactics you propose a sysadmin take to mitigate that?**

It's important to distinguish between monitoring and alerting. Just because you collect a piece of data for trending or investigative purposes doesn't mean you need to page a team member every time it moves. If we're paging someone, it should be urgent and actionable every single time. As soon as you start questioning the validity of alerts and whether they require a response, you are going to start picking and choosing which alerts from your monitoring to respond to and investigate. This is called *pager fatigue*, and it significantly undermines your monitoring goals.

You primarily want to focus your alerting around [work metrics](https://www.datadoghq.com/blog/monitoring-101-alerting): metrics that quantify the work or useful output of your environment. These represent the top-level health of your systems and services and are the true indicator of whether your systems are performing correctly. Your users will never call you to say "CPU is high", but they might complain about slow responses from your APIs or the service being down entirely. So why are you waking up engineers about high CPU usage?

**Your talk is not tool-specific, but what are some of your personal favourite monitoring tools?**

I'd have to say [Datadog](https://www.datadoghq.com) is my favorite tool at the moment! Not because they're my employer (it's actually the other way around.
The reason I joined Datadog this past summer was how much I loved using their products as a customer). Datadog is a mix of a hosted service and an open source agent that runs on your servers to collect metrics. We tend to focus on environments with dynamic infrastructure (containers, cloud, and auto-scaling or scheduled workloads), as well as on aggregating metrics from a number of sources, including other monitoring systems.

The open source world has seen some great developments and improvements in our toolsets in recent years. While Nagios, Cacti, and Ganglia have been the workhorses of the open source monitoring stack for the better part of the last 20 years, we now have a number of newer tools such as Graphite and Grafana for time-series data, ELK for log management, and much more. Which tools you pick to form your monitoring stack will depend on your environment, staff, and scaling needs. The [monitoringsucks project](https://github.com/monitoringsucks/tool-repos) offers a great overview of available tools. Brendan Gregg also has some amazing resources for understanding Linux system performance on his blog; I especially like [the Linux Observability slide](http://www.brendangregg.com/Perf/linux_observability_tools.png).

**One thing that intimidates people from actively monitoring systems, I think, is the setup required and learning the new tools. How long did it take you to get comfortable with monitoring, and the tools that go along with it?**

I've been working with some form of monitoring for over 15 years, but I wouldn't say I've mastered it. Monitoring, like other areas of engineering, is an ongoing learning experience. Each time I build a new environment or system, it has different requirements, either because of new technologies or new needs of the business. With that in mind, each project lets me re-evaluate which metrics are most important and how best to collect them.
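To make the earlier distinction between collecting a metric and alerting on one concrete, here is a minimal sketch, not from the talk itself: page a human only when a work metric such as API latency breaches its target, while resource metrics like CPU are merely recorded for later investigation. The metric names and the 500 ms threshold are illustrative assumptions.

```python
from statistics import quantiles

def p95(samples_ms):
    """95th percentile of latency samples, in milliseconds."""
    return quantiles(samples_ms, n=100)[94]

def should_page(latency_samples_ms, threshold_ms=500.0):
    """Page only when the work metric breaches its target.

    `threshold_ms` is an illustrative service-level target,
    not a universal value.
    """
    return p95(latency_samples_ms) > threshold_ms

def record(store, metric, value):
    """Resource metrics (CPU, memory, ...) are collected for
    trending and investigation, never paged on directly."""
    store.setdefault(metric, []).append(value)

metrics = {}
record(metrics, "system.cpu.pct", 97.0)   # high CPU: logged, no page
healthy = [120.0] * 95 + [300.0] * 5      # p95 well under target
degraded = [120.0] * 90 + [900.0] * 10    # p95 breaches target
print(should_page(healthy), should_page(degraded))  # → False True
```

The point of the split is that the CPU reading stays queryable for diagnosis after an incident, but only the user-visible latency figure can wake anyone up.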
**Another thing that intimidates people from monitoring is the fear that they won't know how to diagnose an issue when they find it. What do you do when you see an issue that raises a red flag, but you have no idea how to go about solving it (or even understanding what needs solving)?**

Burying one's head in the sand is never a solution. Even if you don't know how to respond, it's much better to be aware that your service or systems have failed than to find out when irate users or customers call. With that in mind: start simple. Find the metrics and checks that tell you whether your service is online and performing the way your users expect. Once you've got that in place, you can expand your coverage to other types of metrics that deepen visibility into how your services interact, or into their underlying systems.

**How do you keep from falling into a false sense of security once your monitoring system is up and running?**

Post-mortems and incident reviews are great learning opportunities. Use them to learn from any mistakes, as well as to identify areas in your environment where you would benefit from more visibility. Did your monitoring detect the issue, or did a user? Is there a leading indicator that might have brought your attention to the issue before it impacted your business?

**What's your background as a professional geek? How did you get started in the tech business, and what do you do?**

I got my start tinkering with a hand-me-down computer my parents gave my siblings when they upgraded the machine they used for their small business. I did quite a bit of tinkering over the years, but my interest got a jump start when I started playing with Linux with the goal of building a home router. I quickly found myself attending user group meetings with groups like [SCLUG](http://www.sclug.org), [UCLA LUG](http://linux.ucla.edu), and others. The skills I picked up at LUGs and in personal projects helped me land consulting and entry-level sysadmin work.
Since then, I've spent many years as a sysadmin, later leading infrastructure automation and tooling teams for large web operations like Edmunds.com and Ooyala. Most recently, I've had the opportunity to combine my interest in large-scale systems with my open source community activities as Director of Technical Community at Datadog.

**You clearly do a lot with open source, and for the open source community. Why's open source important to you?**

The open source community and Linux helped me get my start in technology. The open nature of Linux and other software made it possible for me to dive in and learn how everything fit together, and LUGs and other community activities were always there to help if I ever got stuck. It's important to me to stay active in the community and give back so that others can have similar experiences.

**Wait a minute, aren't you that one guy from SCaLE? How did you get involved with SCaLE, and how's it going?**

Indeed! I'm one of the co-founders and the current conference chair of [SCaLE](https://www.socallinuxexpo.org). SCaLE started back in 2002 as a one-day event focused on open source and free software at the University of Southern California. We started SCaLE at a time when there were very few tech events in the Southern California area, and the available FOSS/Linux-focused events were primarily organized as commercial ventures (e.g., LinuxWorld). Between geography, cost, and in some cases age requirements, we found it quite difficult to see developer-led sessions about the open source projects we were interested in. In the spirit of open source, we decided to scratch our own itch and build a place where we could attend the sessions we wanted to see. Naively, we thought, "How hard could it be to start a conference?" Thus SCaLE was born. We are now launching planning for our 15th year of SCaLE, which will be held March 2-5, 2017 at the Pasadena Convention Center.
While the event has grown from just a few hundred attendees in our first year to 3,200, we've made a strong effort to keep our community feel. We like to think we've succeeded in meeting our original goals of bringing community, academia, and business together under a single, affordable, and accessible conference.

**In that case, I have to ask: what distribution of Linux do you use, and what's your current desktop or window manager?**

My home machine runs Ubuntu, so Unity on the desktop. My personal servers tend to be Debian-based. At Datadog, we primarily use Ubuntu across our infrastructure, although we work with a wide set of Linux distros to ensure our open source agent works well across the board. Our Docker containers start from a Debian image.