Viewing live Apache logs with tail, grep and egrep

There are times you may need to monitor what's happening on an Apache web server as it happens. This can be done from the command line using a combination of the tail command, which outputs the last part of a file, and grep or egrep, which are used for regular expression pattern matching.

Viewing everything

If the log file to view is at /var/log/apache/myvirtualhost.log, the command below will show the last few lines from the file and then continue to echo new lines to the command line as they are written to the log, i.e. as additional requests are made to the web server.

tail -f /var/log/apache/myvirtualhost.log

The -f flag is what makes the tail command output additional data as it is appended to the log.
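
By default tail starts with the last ten lines of the file. If you want more (or fewer) of the existing entries before the live output begins, you can add the -n switch as well, for example to start with the last 100 lines:

tail -n 100 -f /var/log/apache/myvirtualhost.log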

Viewing everything from a specific IP address

Tail can be combined with grep to pattern match. To filter the results to only show requests for a specific IP address (in this example 192.168.206.1) pipe the output from tail through grep like so:

tail -f /var/log/apache/myvirtualhost.log | grep 192.168.206.1

This can be useful to only show results from your own requests.

Note that the above example would also match e.g. 192.168.206.10 and so on, and that in a regular expression the dots will match any character, not just a literal period; if this is a concern then escape the dots with \ and put the IP address in quotes with a space after the last digit, like this:

tail -f /var/log/apache/myvirtualhost.log | grep "192\.168\.206\.1 "

Viewing everything excluding a specific IP address

Adding the -v flag to grep inverts the match, excluding any lines that contain the pattern. This can be useful if you want to exclude requests from your own IP address but show everything else:

tail -f /var/log/apache/myvirtualhost.log | grep -v "192\.168\.206\.1 "

Including particular file types only

If you only want to watch for requests for a particular file type, or even a particular file, use the same approach as grepping for the IP address. For example, to show only requests for jpg files:

tail -f /var/log/apache/myvirtualhost.log | grep .jpg
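
As before, the dot will match any character, not just a literal period; if that matters you can escape it here too:

tail -f /var/log/apache/myvirtualhost.log | grep "\.jpg"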

And to match a specific file, for example the robots.txt file, perhaps if you are watching for when a search engine bot hits the site:

tail -f /var/log/apache/myvirtualhost.log | grep robots.txt

Excluding particular file types

Showing only web pages can be problematic, especially if there is no common extension for the files being served: some URLs might end with / whereas others might end with .html, or there might be query strings at the end of the URL which make pattern matching awkward.

A possible solution is instead to exclude everything that's not a web page. Multiple patterns can be combined by separating them with the pipe | character when using egrep instead of grep. To exclude several common file extensions and, hopefully, show just web pages, do this:

tail -f /var/log/apache/myvirtualhost.log | egrep -v "(.gif|.jpg|.png|.swf|.ico|.txt|.xml|.css|.js|.rss)"

Note that because the regular expression contains the pipe character the expression must be contained within quotes. You can adjust the above list of extensions to suit your own conditions.
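
As with the IP address example earlier, the dots in those extensions will match any character; if you want to be stricter you could escape the dot and group the alternatives after a single literal period, for example:

tail -f /var/log/apache/myvirtualhost.log | egrep -v "\.(gif|jpg|png|swf|ico|txt|xml|css|js|rss)"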

Quoted from: electrictoolbox.com

Analyzing access logs to find patterns, or “Just how many bots are out there anyway?”: In which we give a gentle introduction to command-line tools.

You find the most interesting stuff in access logs. I spend a fair amount of time digging around in web analytics software. Looking at who is referring to us, analyzing traffic patterns, figuring out peak usage times…you get the picture. Sometimes, though, you just need to get right down into the raw logs to figure out what's going on, and that situation happened to me a while back.

Around the middle of last year a security exploit was found in the implementation of the XML-RPC protocol in PHP. PHP is obviously a very popular web scripting language and dozens of content management systems, including the blogging software I use for this blog, utilize the language. Now an open security hole in such a popular language is just an invitation for the underground to attempt to exploit the situation, and try to exploit they did.

It started out as just a trickle, but before long it became obvious that the underground was attempting to exploit the security hole. How’d we know? By analyzing our logs.

 

An access log is generated by your web server, in our case Apache, and contains a lot of information. For example, here is a random line from this blog.

208.52.128.170 - - [28/Apr/2006:11:07:53 -0700] "GET /wp-content/themes/rdc/style.css HTTP/1.1" 200 7581 "http://www.creative-workflow-hacks.com/" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.2"

 

The first column is the IP address the request came from. The two dashes after it are identity fields that usually go unused, then comes the date in square brackets, then the request itself, where GET means a GET-style HTTP request for the URL shown. After that come the status code and response size, the referring page, and finally the user agent, in this case Safari. So that's all well and good, but what's that have to do with the security exploit?

Well, by looking at your raw access logs, you start to notice patterns. Something like this is what we started to see in our access logs.

 

209.213.106.25 - - [26/Apr/2006:05:46:19 -0400] "GET /xmlrpc.php HTTP/1.0" 404 1453

and

 

209.213.106.25 - - [26/Apr/2006:05:46:20 -0400] "GET /drupal/xmlrpc.php HTTP/1.0" 404 1453

 

Drupal? We're not using Drupal. What we were seeing were attempts to exploit the security hole. And there were quite a few attempts, and they kept coming.

So was this some lone hacker typing URLs into her browser? Not likely. It was bots.

You’ve probably heard about spyware. In addition to sending you all kinds of weird popups and putting casino shortcuts on your desktop, malware like this can take over your computer and do things like send spam email, or in this case attempt to send security exploits to take over yet more computers and send spam email and then attempt…well, you get the picture.

So by analyzing our logs we start to get a picture of what to watch out for and figure out what software *really* needs to be up to date. Better get over to WordPress and check for patches.

What I want to show today is one little tool chain that I used to analyze our logs while this exploit was going on full tilt. We still see a lot of requests for xmlrpc, as the last laggards still haven't updated their software and the law of diminishing returns kicks in for the bots.

We're going to use the command line today (it's really not that scary, you might even end up with 10 different shells open on your desktop one day like a real unix grey beard…don't laugh, it could happen). It goes without saying that the following will only work if your web server is running Apache and you have unix shell access to the server where your log files are kept. The first order of business is to find out where your log files are. A common location is /var/log/httpd or, if you're on a shared host, a logs directory in your home directory. Once you've figured out where your log file is located, let's cd to that directory.

Unix tools are designed to do one thing very well and to work in conjunction with other small tools to accomplish the task at hand. Unix tools have a concept called standard input and standard output which allows each tool to do its bit of business and then pass the result on to the next tool. The command we are going to use today is:

 

cat access_log | grep xmlrpc |  awk '{print $1}' | sort | uniq | wc -l

 

Scary, huh? But don't freak out. We'll go through each part and you'll start to understand that it really is easy, and that you might be able to do some useful work with this command line stuff anyhow.

So let's start off with the first part. Commands read from the left, like you're reading a sentence. cat access_log writes out (the name is short for concatenate) the access_log to standard output, which in this case is the terminal in which you typed the command. Your version of the access log might be named something else like access.log or somesuch, but you should be able to figure it out. You could also use the tail command if you had an access_log that is several days long and you only want to check out the last bit of access.

Use tail -n 100 to grab the last 100 entries. Which brings us to Unix command switches: the -n 100 is a command switch. Most unix commands let you alter their base behavior with a switch. How do you know which switches are available? You use the man command. Type man tail and see what you get.

A lot of the stuff in a man page can seem hopelessly geeky at first, but with a little perseverance we can find what we need. So we've written out our access_log file. This is the raw data that comes from the web server. Take a look around; it's pretty interesting even before you've run analytics on it, but for most folks it's just too overwhelming to find anything useful. Let's take a look at the next part of our command.

 

cat access_log | grep xmlrpc

 

The | is the pipe character. We talked earlier about standard input and standard output; the | lets us pipe the output of cat into the input of grep. If we wanted to write our access_log out to a file we would use the > character.

tail -n 100 access_log > ourfile.txt


would write a file called ourfile.txt with the last 100 lines of our access_log.

Back to our command. We pipe the output of our access_log to something called grep. What is grep? The name comes from "global regular expression print" (the ed command g/re/p). Grep allows us to search our output for patterns called regular expressions. Think of a regular expression as find and replace on steroids. Regular expressions can get really complicated, but in our command we're just looking for the literal string xmlrpc. It won't match on variations with upper case or with dashes, etc., but it's surprisingly effective here.
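If you did want to catch upper-case variations such as XMLRPC as well, grep's -i switch makes the match case-insensitive, for example:

cat access_log | grep -i xmlrpc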

Let’s move on to

cat access_log | grep xmlrpc | awk '{print $1}'


We now pipe any matches on xmlrpc to awk. awk is a general purpose language that is designed to process text data. I don't use awk a lot, but it is really handy for grabbing a specific column from text output. That's what we do here: awk '{print $1}' takes only the first column and prints it to standard output. So from the first three parts of our command, we've printed the IP address of every access on our web server that used the word xmlrpc. Pretty neat. Let's move on.
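
As an aside, awk splits each line on whitespace, so with the log format shown above the requested URL happens to land in the seventh field; a variation like this would print the requested paths instead (we don't need it here, but it's handy to know):

cat access_log | grep xmlrpc | awk '{print $7}'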

 

cat access_log | grep xmlrpc | awk '{print $1}' | sort

 

All of those IP addresses are all over the place. Let's sort them into something useful. We first sort them with sort; a plain text sort is fine here, since we just need identical addresses to end up next to each other. Then we use the uniq command to grab only the unique IPs; uniq collapses adjacent duplicate lines, which is why the sort comes first. These bots might hit us once or a bunch of times, but we are only interested in grabbing the unique ones.

 

cat access_log | grep xmlrpc | awk '{print $1}' | sort | uniq

 

If we stop here we get a sorted, unique list of the IPs that requested xmlrpc. If we had a blackhole list we could add these IPs and not accept requests from them in the future, when the next exploit becomes available. We do need to be careful with that, though: your most important customers might have a virus or spyware and would no longer be able to access your site, or your grep search string might match IPs that are accessing your site legitimately. Use your judgement.
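
If you also want to know how many times each IP hit you, uniq's -c switch prefixes every line with a count, and a second numeric sort puts the busiest offenders first; something like this:

cat access_log | grep xmlrpc | awk '{print $1}' | sort | uniq -c | sort -rn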

And finally

 

cat access_log | grep xmlrpc | awk '{print $1}' | sort | uniq | wc -l

 

let's use the word count command, wc, with the -l switch to count the number of lines. 137. Wow. During the worst of our xmlrpc weirdness we were getting 137 unique IPs trying to exploit our web server. Definitely time to check for patches.

So that's it. This was definitely a long-winded post, but I hope you find it useful, especially if you're new to the command line. Like most things computer oriented, try to break down your big goal into smaller pieces until you work out the combinations that get you where you need to go.

----

Quoted from: creative-workflow-hacks.com