While working at Booking.com, I was looking for a logging solution that matched the ease of use and power that Graphite gave us for metrics. Reluctant to bring a new technology into production, I talked to co-workers, and one mentioned that they were using ElasticSearch in some front-end systems for search and disambiguation. He had also heard of a few projects using ElasticSearch to store log data.
This began my love-hate-love relationship with ElasticSearch. I've spent the past 8 years working with ElasticSearch professionally and in my spare time. Graphite and ElasticSearch are two projects that change the game in terms of exploring your data. The insights I've gained into system performance, application performance, and system and network security with these tools are unparalleled. Tools like Grafana and Kibana let you visualize your data quickly and beautifully. As a system and security engineer, though, sometimes that isn't enough. I spend most of my day in a terminal and needed something to explore and pivot through the data there.
This is the first part of a multi-part series about a tool I created to make ElasticSearch's powerful search interface more accessible from the terminal. This tool has been essential to nearly every incident I've investigated. It was developed with the help, patience, and amazing ideas of co-workers both at Booking.com and now at Craigslist.
Perl Setup
I'm a Perl programmer. You may have strong feelings about that, but Perl has been good to me. The freedom to write code as beautifully, or as ugly, as I need to get the job done is liberating. I recommend using Perl 5.28 or newer with Perlbrew.
You should be comfortable with the command line, so follow the steps to install Perlbrew from its homepage. After that:
$ perlbrew init
$ perlbrew install -j 8 -n --thread 5.28.2
$ perlbrew switch 5.28.2
$ perlbrew install-cpanm
Now that you have a working, local, user-managed Perl, we'll install the toolset.
$ cpanm App::ElasticSearch::Utilities
The utilities and their dependencies will be installed in your local, user-managed Perl path.
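If the install completes cleanly, the scripts land in your perlbrew-managed bin directory; a quick sanity check (the exact path will vary with your setup):
$ which es-search.pl
/home/you/perl5/perlbrew/perls/perl-5.28.2/bin/es-search.pl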
(Some) Utilities Installed
- es-alias-manager.pl - Alternative to curator for managing aliases for indexes
- es-apply-settings.pl - Applies settings to an index based on the index name and age
- es-copy-index.pl - Copies all documents (or those matching a search) from an index on the same or a different cluster to another index; optionally accepts alternate settings/mappings for the destination index if it's being created
- es-daily-index-maintenance.pl - Alternative to curator for maintaining index life spans
- es-graphite-dynamic.pl - Extracts ElasticSearch performance metrics into Graphite directly or via collectd/diamond
- es-status.pl - A quick "how's the cluster" status overview
- es-storage-overview.pl - Checks how much storage each node and/or index is consuming in the cluster
And finally, the tool I'm going to be talking about: es-search.pl. This is a tool designed with the UNIX philosophy in mind, enabling workflows where the output of one query can be fed into another.
Configuration
To get the most out of the tool, let's set up some defaults to make our command lines shorter. All of the scripts (and, if you're so inclined, all of the App::ElasticSearch::Utilities functions) use this config file to determine how to find, connect to, and talk to your ElasticSearch cluster.
Create a ~/.es-utils.yaml file with something like this:
---
host: localhost
port: 9200
base: syslog
days: 1
timestamp: '@timestamp'
- host - The hostname or IP of the node you'd like to connect to; defaults to localhost
- port - The port to use to connect; defaults to 9200
- base - The default index base name; defaults to logstash
- days - The default number of days to search; defaults to 7
- timestamp - The default name of the field containing the timestamp for logging events; defaults to @timestamp
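Any of these defaults can be overridden per invocation; the command-line flags mirror the config keys (--base and --timestamp appear later in this post; I'm assuming --days follows the same convention):
$ es-search.pl --base access --days 7 --timestamp timestamp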
Index Bases
The idea behind this tool is to make things as simple as possible. If you're like me, you probably use index names to differentiate where shards are allocated and, ultimately, how long shards will live on your cluster. On large indices, where data is of varying interest, I tend to use this pattern:
- I want to index HTTP access logs, so I'll designate the mappings keying off the pattern *-access-*.
- My logs span multiple datacenters, so I'll set allocation rules to keep each datacenter's shards in that datacenter. If my datacenter tag is sfo, I'd set a pattern sfo-* to grab those shards (a template sketch follows this list).
- There may be lower-value data in the logs, like requests for images, CSS, or JavaScript assets. I want these around, but they're 90% of my logging volume and they generally become less interesting more quickly, so I'll want shorter retention rules applied to them. These indexes might include a tag in the index name, *-bulk-*, to make them distinguishable.
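For the datacenter allocation rule above, here's a minimal sketch of an index template (assuming an ElasticSearch 6.x-style _template API and nodes started with a node.attr.dc attribute; the names here are illustrative):
$ curl -s -XPUT 'localhost:9200/_template/sfo-logs' \
    -H 'Content-Type: application/json' -d '{
  "index_patterns": ["sfo-*"],
  "settings": {
    "index.routing.allocation.require.dc": "sfo"
  }
}'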
At the end of this madness I might have a list of indexes like:
| Index Name | Alias | Retention | Content |
|---|---|---|---|
| ams-access-2019.05.19 | access-2019.05.19 | 90d | Normal access logs for `ams` servers |
| ams-access-bulk-2019.05.19 | access-2019.05.19 | 7d | Uninteresting access logs for `ams` servers |
| ams-syslog-2019.05.19 | syslog-2019.05.19 | 90d | Syslog data for `ams` servers |
| sfo-access-2019.05.19 | access-2019.05.19 | 90d | Normal access logs for `sfo` servers |
| sfo-access-bulk-2019.05.19 | access-2019.05.19 | 7d | Uninteresting access logs for `sfo` servers |
| sfo-syslog-2019.05.19 | syslog-2019.05.19 | 90d | Syslog data for `sfo` servers |
If I wanted to search those indexes, I could just use --base access, since all of those index names parse down to that base. If you're not sure which bases es-search.pl thinks you have available, ask it to tell you:
$ es-search.pl --bases
Bases available for search:
access
ams-access
ams-access-bulk
ams-syslog
sfo-access
sfo-access-bulk
sfo-syslog
syslog
# Bases: 8 from a combined 6 indices.
Handling More Than One Index Base with Ease!
That's all fine and good if all of your indexes contain the same document types. That's unlikely, as you should be splitting different document types into separate indices, if not clusters. If you want to work with es-search.pl across all those indexes easily, it will need to know the correct timestamp field for each. To enable per-base timestamp fields, just add a meta section to your ~/.es-utils.yaml file:
---
host: localhost
port: 9200
base: syslog
days: 1
meta:
access:
timestamp: timestamp
ossec:
timestamp: ts
zeek:
timestamp: event_ts
Now es-search.pl and the rest of the utilities will know that when you specify --base zeek, the timestamp field to sort on is event_ts, and you won't need to think about adding --timestamp event_ts to the command line.
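For example, this searches the zeek indexes, sorted on event_ts, with no extra flags:
$ es-search.pl --base zeek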
Seeing Data
Now that we're configured, we can just run:
$ es-search.pl
= Querying Indexes: syslog-2019.05.19
---
action: connect
hostname: janus
message: 'connect from unknown[102.165.34.33]'
proc: smtpd
proc_id: 30775
program: postfix/smtpd
src: unknown
src_ip: 102.165.34.33
tags:
- decoder_syslog
- mail
- postfix
timestamp: 2019-05-19T02:07:34.861416
total_time: 0.004363
<snip>
# Search Parameters:
# {"bool":{}}
# Displaying 20 of 357 in 0.0584328174591064 seconds.
# Indexes (1 of 1) searched: syslog-2019.05.19
Each document's _source is printed to the screen as YAML. This is not the usual use case for es-search.pl, so let's do better. It's also likely that the documents you're viewing don't contain every field available in the index.
Finding the Fields in the Index
When you start working with ElasticSearch indexes, you may not know all the fields available for search. es-search.pl allows you to explore a bit:
$ es-search.pl --base syslog --fields
Fields available for search:
- action
- dev
- dst_geoip.continent
- dst_geoip.country
- dst_geoip.location
- dst_ip
- dst_port
- exe
- file
- hostname
- in_bytes
- message
- out_bytes
- proc
- proc_id
- program
- proto_app
- rec_id
- src
- src_geoip.city
- src_geoip.continent
- src_geoip.country
- src_geoip.location
- src_geoip.postal_code
- src_ip
- src_port
- src_user
- tags
- timestamp
- timing.phase
- timing.seconds
- total_time
# Fields: 32 from a combined 1 indices.
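The list is assembled from the index mappings (or so I'd assume); to see the raw view ElasticSearch itself holds, you can fetch the mapping directly (index name taken from the example above):
$ curl -s 'localhost:9200/syslog-2019.05.19/_mapping?pretty'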
This field list will help you understand what an index contains. Maybe you want to see what's in a particular field? There are two ways: the first with search, the second with aggregations.
Finding Field Values with Search
The simplest and least taxing way to ask ElasticSearch what a field contains is to query the index and return the relevant field. To restrict the results to documents containing the field, we can use the --exists <fieldname> filter. If I want to see the most recent 20 documents where the field proc exists, and display only the proc entry, it's as simple as:
$ es-search.pl --exists proc --show proc
= Querying Indexes: syslog-2019.05.19
timestamp proc
2019-05-19T02:04:06.135686 smtpd
2019-05-19T02:04:06.135786 smtpd
2019-05-19T02:04:05.856884 smtpd
2019-05-19T02:03:46.471311 smtpd
2019-05-19T02:03:46.471352 smtpd
2019-05-19T02:03:46.199116 smtpd
2019-05-19T02:03:37.013022 smtpd
2019-05-19T02:03:37.012866 smtpd
2019-05-19T02:03:36.741711 smtpd
2019-05-19T02:03:18.239108 smtpd
2019-05-19T02:03:18.239135 smtpd
2019-05-19T02:03:17.947805 smtpd
2019-05-19T02:03:07.837098 smtpd
2019-05-19T02:03:07.837133 smtpd
2019-05-19T02:03:07.553645 smtpd
2019-05-19T02:03:07.342514 smtpd
2019-05-19T02:03:07.342686 smtpd
2019-05-19T02:03:07.067929 smtpd
2019-05-19T02:02:57.157830 smtpd
2019-05-19T02:02:57.157612 smtpd
# Search Parameters:
# {"bool":{"must":[{"exists":{"field":"proc"}}]}}
# Displaying 20 of 85 in 0.0445699691772461 seconds.
# Indexes (1 of 1) searched: syslog-2019.05.19
This might not give me the best understanding of what the field is, but already I know that postfix log entries are setting this field.
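For reference, the printed search parameters correspond to a raw _search call roughly like this (a hand-written sketch; the tool also builds the date-range filter and sort for you):
$ curl -s 'localhost:9200/syslog-2019.05.19/_search' \
    -H 'Content-Type: application/json' -d '{
  "query": { "bool": { "must": [ { "exists": { "field": "proc" } } ] } },
  "_source": [ "timestamp", "proc" ],
  "size": 20,
  "sort": [ { "timestamp": "desc" } ]
}'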
Finding Field Values with Aggregations
We can do a lot better by leveraging aggregations in ElasticSearch. To do so, we ask es-search.pl for the top values.
$ es-search.pl --top proc
= Querying Indexes: syslog-2019.05.19
count proc
224 smtpd
27 smtps_smtpd
12 qmgr
9 localsmtp_smtpd
6 cleanup
4 submission_smtpd
3 anvil
3 lmtp
3 pipe
# Search Parameters:
# {"bool":{}}
# Displaying 9 of 693 in 0.00798892974853516 seconds.
# Indexes (1 of 1) searched: syslog-2019.05.19
#
# Totals across batch
#
count proc
224 smtpd
27 smtps_smtpd
12 qmgr
9 localsmtp_smtpd
6 cleanup
4 submission_smtpd
3 anvil
3 pipe
3 lmtp
We now have the top 20 (or fewer, if there aren't 20 in total) values in the proc field.
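Under the hood, --top proc is a standard terms aggregation; a rough hand-written equivalent (not necessarily the tool's literal request body):
$ curl -s 'localhost:9200/syslog-2019.05.19/_search' \
    -H 'Content-Type: application/json' -d '{
  "size": 0,
  "aggs": {
    "proc": { "terms": { "field": "proc", "size": 20 } }
  }
}'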
Putting It Together
It looks like proc is the component piece for postfix syslog data. To be sure, let's ask ElasticSearch for the top programs with the top 10 procs each. Since es-search.pl is designed to make this easy, we type almost exactly that:
$ es-search.pl --top program --with proc:10 --exists proc
= Querying Indexes: syslog-2019.05.19
count program
224 postfix/smtpd terms.proc smtpd 224
35 postfix/smtps/smtpd terms.proc smtps_smtpd 35
12 postfix/qmgr terms.proc qmgr 12
9 postfix/localsmtp/smtpd terms.proc localsmtp_smtpd 9
6 postfix/anvil terms.proc anvil 6
6 postfix/cleanup terms.proc cleanup 6
6 postfix/submission/smtpd terms.proc submission_smtpd 6
3 postfix/lmtp terms.proc lmtp 3
3 postfix/pipe terms.proc pipe 3
# Search Parameters:
# {"bool":{"must":[{"exists":{"field":"proc"}}]}}
# Displaying 9 of 304 in 0.0130970478057861 seconds.
# Indexes (1 of 1) searched: syslog-2019.05.19
Let's break down that query:
- --top program - Top aggregation; infers terms, and uses the value of --size, which defaults to 20
- --with proc:10 - Sub-aggregation; the form is agg_type:field_name:sub_size (a DSL sketch follows this list)
  - agg_type - Defaults to terms and can be omitted, but can also be: significant_terms, max, min, sum, avg, cardinality
  - field_name - Required; the name of the sub-aggregate field
  - sub_size - Defaults to 3
- --exists proc - Filters the entire aggregation to just documents with the proc field
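In raw ElasticSearch DSL, that command builds something roughly like a terms aggregation with a nested terms sub-aggregation; here is a hand-written sketch of the request body (not the tool's literal output):
{
  "size": 0,
  "query": { "bool": { "must": [ { "exists": { "field": "proc" } } ] } },
  "aggs": {
    "program": {
      "terms": { "field": "program", "size": 20 },
      "aggs": {
        "proc": { "terms": { "field": "proc", "size": 10 } }
      }
    }
  }
}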
Wrapping up for now
I think this is a reasonable point to pause. This provides you with enough information to start getting your feet wet with the tool. In the next part, I'll examine building useful queries and how this tool enables pivoting and data exploration.
If you can't wait until next time, run es-search.pl --manual to get an in-depth overview of the available options, or find the man page online:
- GitHub Project Page: reyjrar/es-utils
- es-search.pl man page
- MetaCPAN Project Page: BLHOTSKY/App-ElasticSearch-Utilities