While working at Booking.com, I was looking for a solution to logging that matched the ease of use and power as Graphite did for metrics. Reluctant to bring a new technology into production, I talked to co-workers and one mentioned that they were using ElasticSearch in some front-end systems for search and disambiguation. He mentioned hearing there were a few projects using ElasticSearch for storing log data.
This began my love-hate-love relationship with ElasticSearch. I've spent the past 8 years working with ElasticSearch professionally and in my spare time. Graphite and ElasticSearch are two projects that change the game in terms of exploring your data. The countless insights I've gained into system performance, application performance, and system and network security with these tools is unparalleled. Tools like Grafana and Kibana allow you to visualize your data quickly and beautifully. As a system and security engineer, sometimes this isn't enough. I spend most of my day in a terminal and needed something to explore and pivot through the data there.
This is the first part, in a many part series about a tool I created to make ElasticSearch's powerful search interface more accessible from the terminal. This tool has been essential to nearly every incident I've investigated. It was developed with the help, patience, and amazing ideas from co-workers both at Booking.com and now at Craigslist.
Full disclosure, I'm not a fan of systemd. I started working with Linux in the late 90's and watched it grow from a marginalized operating system to the most dominant operating system in the datacenter. I've lived through so many "year of the Linux desktop" years I remember when it wasn't a joke. From my vantage point, administering Linux servers professionally for nearly 20 years, systemd is Linux on the desktop at the cost of Linux in the datacenter.
Why do I feel this way? It's mostly the reinvention and incorrect implementations of core UNIX tools and modalities. There's a lot of information on systemd out there. There's a lot of bias involved. So, today, I'm not going to talk about that. I am going to address a critical mistake in the systemd-resolved daemon which implements DNS lookups for systems running systemd.
I'll jump right to the work-around. If you're running a system which is using systemd, you should probably be running systemd-resolved configured to use a single DNS resolver, 127.0.0.1, and run Unbound. There are resources on how to configure and run Unbound, but the best is Calomel's Unbound Tutorial. If you need to maintain consistent, reliable DNS resolution that's compatible with previous versions of Linux, the only way to do that is to have a single DNS server in /etc/resolv.conf.
After getting a few questions from concerned folks about VPN services. I realized this might be better served as an article. This way anyone who is curious about how to protect themselves better online can reference it.
The Bad News
Well, there's really no easy way to this: There is very little, if any, privacy on the Internet. Even after following all of the advice I'm about to give, all sorts of clever folks in the Valley and beyond are envisioning clever new ways to improve the "User Experience" (UX) and in the process accidentally creating newer, clever means to circumvent any and all privacy controls you might deploy.
In 2004, when I was starting a new job at the National Institute on Aging's Intramural Research Program I began evaluating products to meet FISMA requirements for file integrity monitoring. We already purchased a copy of Tripwire, but I was being driven mad by the volume of alerting from the system. I wanted something open source. I wanted something that would save me time, rather than waste 2 hours a day clicking through a GUI confirming file changes caused by system updates and daily operations.
At the time, I found two projects: Samhain and OSSEC-HIDS. Samhain is a great project that does one thing and does that one thing very well. However, I was buried in a mountain of FISMA compliance requirements and OSSEC offered more than file integrity monitoring; OSSEC offered a framework for distributed analysis of logs, file changes, and other anomalous events in the same open source project.
I now work at Booking.com and manage one of the world's largest distributions of OSSEC-HIDS. My team and I are active contributors to the OSSEC Community. After nearly a decade of experience deploying, managing, and extracting value from OSSEC, I was approached to write a book introducing new users to OSSEC. After 6 months of work, the book has been published!
We use ElasticSearch at my job for web front-end searches. Performance is critical, and for our purposes, the data is mostly static. We update the search indexes daily, but have no problems running on old indexes for weeks. The majority of the traffic to this cluster is search; it is a "read heavy" cluster. We had some performance hiccups at the beginning, but we worked closely with Shay Bannon of ElasticSearch to eliminate those problems. Now our front end clusters are very reliable, resilient, and fast.
I am now working to implement a centralized logging infrastructure that meets compliance requirements, but is also useful. The goal of the logging infrastructure is to emulate as much of the Splunk functionality as possible. My previous write-up on logging explains why we decided against Splunk.
After evaluating a number of options, I've decided to utilize ElasticSearch as the storage back-end for that system. This type of cluster is very different from the cluster we've implemented for heavy search loads.
If you haven't looked at OSSEC HIDS, here's the overview:
OSSEC is a scalable, multi-platform, open source Host-based Intrusion Detection System (HIDS). It has a powerful correlation and analysis engine, integrating log analysis, file integrity checking, Windows registry monitoring, centralized policy enforcement, rootkit detection, real-time alerting and active response.
It runs on most operating systems, including Linux, OpenBSD, FreeBSD, MacOS, Solaris and Windows.
I do most of my work over SSH. Even when I'm working in my browser or pgAdminIII, I'm usually doing that over SSH tunnels. VPN Software has been around for quite some time and it's still mostly disappointing and usually run by the least competent group in any IT department. I developed a workflow using SSH from my laptop, either on the corporate network or at home, I can ssh /directly/ to the server I'm interested in working on.
In order to accomplish this, I have made some compromises. First off, if I'm SSH-ing from my home, I am /required/ to type the fully qualified domain names (FQDN) when workign remotely. I use the presence of the domain name to activate the proper leap frogging. I also decided to use ControlMaster's with SSH that can leave me with a terminal without a prompt when I forget which shell is my master. Overall, the pros outweigh the cons and I'm more productive because of it.
First things first. I've stated that you should drop everything and install Graphite. If you didn't already, please do that now. Go ahead, I'll wait.
When you get back we'll talk about how to monitor ElasticSearch with Graphite for fun and profit!
The reaction to my Central Logging post has been significantly greater and more positive than I could've expected, so I wanted to recap some of the conversation that came out of this. I am pleasantly surprised by most of the comments on the Hacker News Thread. So, here's a real quick recap of the responses I've received. I will continue this series this weekend with more technical details.
I have worn many hats over the past few years: System Administrator, PostgreSQL and MySQL DBA, Perl Programmer, PHP Programmer, Network Administrator, and Security Engineer/Officer. The common thread is having the data I need available, searchable, and visible.
So what data am I talking about? Honestly, everything. System logs, application logs, events, system performance data, and network traffic data are key requirements to making any tough infrastructure decision, if not key to the trivial infrastructure and implementation decisions we have to make everyday.
I'm in the midst of implementing a comprehensive solution, and this post is a brain dump and road map for how I went about it, and why.