Server log analysis or how to get valuable information about a website

The most common tool for marketing and web analysis today is undoubtedly Google Analytics. But it is far from the only tool you can use to analyse your website. A comparative analysis of server logs, for example, can serve as a useful extension. A server log records the real traffic on your website, whereas the output from Google Analytics reflects the behavioural trends of your visitors. But do the two data sets coincide, or does each provide a different picture? Let’s take a closer look at both methods.

Anyone involved in web analytics would like complete awareness of what is happening on their website. Is the information we get from tools like Google Analytics really complete and representative of the site’s overall traffic? And if not, how do we know what information is missing, and how much? Server log analysis can help answer these questions: it gives you an overview of the volume and structure of visits that are not recorded in Google Analytics.

Server logs and Google Analytics Pageview: what do they have in common?

Server logs are simple text records automatically created and stored by the server (whether your own, in the cloud, or on shared webhosting) that document all requests sent to it. These logs follow common, well-defined formats and provide a wide range of information about how, when, and who visited your website. We’ll look at their structure in more detail in the next section.

If we want to compare server logs with Google Analytics, we need to understand when and how pageview information is sent to GA, because the two systems receive data at different moments.

Google Analytics collects web data based on cookies and a JavaScript measurement snippet that you must place on each page you want to measure. The snippet only fires when the page is rendered, and the pageviews metric counts the individual loads of the tracked page. Server logs, on the other hand, document every request the server receives, regardless of whether a page was actually rendered.

Structure of server logs

The most common form of server logs we can encounter is the Combined Log Format (an extension of CLF, the Common Log Format, with referer and user agent fields added), which can look like this:

66.249.76.9 - - [01/Sep/2020:07:06:53 +0200] "GET /2020/03/14/vizualizace-vyvoj-covid-19-v-cesku-powerbi/ HTTP/1.1" 200 18868 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.92 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

That’s a lot of interesting information. How do we decode it?

66.249.76.9 is the IP address of the client that sent the request to our website.

  • The second field may hold the client’s identd user identifier, but in our case that information is unavailable (hence the hyphen).
  • The third field may hold the user ID from HTTP authentication, but again we don’t have that information.

[01/Sep/2020:07:06:53 +0200] is the timestamp of when the server received the request. It is a string in the strftime format %d/%b/%Y:%H:%M:%S %z.
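In Python, for instance, this timestamp can be parsed with the standard datetime module (a minimal sketch; the variable names are our own):

```python
from datetime import datetime

# Parse the bracket-stripped CLF timestamp using the strftime codes above
raw = "01/Sep/2020:07:06:53 +0200"
ts = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")

print(ts.isoformat())  # 2020-09-01T07:06:53+02:00
```

The %z directive keeps the +0200 offset as timezone information, so timestamps from different servers can be compared on a common timeline.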

"GET /2020/03/14/vizualizace-vyvoj-covid-19-v-cesku-powerbi/ HTTP/1.1" is the specific request received by the server and consists of several interesting parts. GET is the request method of the HTTP protocol – most often you will come across GET, HEAD, and POST. It is followed by the URL of the requested page and the HTTP protocol version.

200 is the HTTP status code, which specifies how the request was processed by the server – three-digit numbers starting with 2 indicate success, 3 indicates redirection, 4 a client-side error, 5 a server-side error, and so on. See the list of status codes for details.

18868 is the size of the object sent to the client, measured in bytes.

"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.92 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The last part of our server log line is called the user agent, and it carries a wide range of information about the user and their browser. From the individual parts of this string you can learn details about the user’s system and browser, which platform the browser runs on, and, last but not least, whether this is a normal user visit or a request generated by a bot or crawler – as in our case. This user agent belongs to the Googlebot crawler, which works through the site to gather the freshest information for Google’s search engine. The full set of information that can be extracted from this particular user agent can be found here.
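Putting the pieces together, a combined-format log line can be split into its fields with a single regular expression. This is a sketch; the pattern and group names are our own, not part of any official standard:

```python
import re

# Regex for the combined log format: host, identd, user, timestamp,
# request, status code, response size, referer, user agent
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<identd>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('66.249.76.9 - - [01/Sep/2020:07:06:53 +0200] '
        '"GET /2020/03/14/vizualizace-vyvoj-covid-19-v-cesku-powerbi/ HTTP/1.1" '
        '200 18868 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
        '+http://www.google.com/bot.html)"')

fields = LOG_PATTERN.match(line).groupdict()
# The request field itself splits into method, URL and protocol version
method, url, protocol = fields["request"].split(" ")

print(fields["host"], fields["status"], method, url)
```

Each field then lands in a named group, which makes the filtering steps described below straightforward.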

Data editing and cleaning

Now we know what type of information we can get from server logs. The amount of data is enormous, and to make even a basic comparison with the GA pageview metric, we need to clean our data properly.

The first step is to modify our server log data so that it is comparable to GA pageviews: cleaning it up and trimming it down to match the way GA reports our website.

Once the excess data has been extracted from each log line into separate fields, we can use those fields for further filtering – specifically the HTTP method and the status code. For our comparison, we will only use log lines with a status code of 200, the standard response for a successful HTTP request. In that case the requested page was loaded and, if a GA snippet is embedded in it, the view was logged in Analytics.

Another filter we should apply: for the comparative analysis we will only work with rows that contain the HTTP GET method, the default method for requesting hypertext pages.

Server logs archive every request that reached the server, and not every request was meant to load a specific page. It may have been a request for an image or another type of file, so you should exclude all such formats – e.g. .jpg, .png, .gif, but also .php or .js, and anything starting with /wp if your site is built on WordPress, and so on. It is also important to note that the logs contain records of the activity of admins and other users who may be filtered out in GA, so you need to think about how to account for those settings during server log analysis.
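The filters described so far – status 200, GET only, and no static-file or /wp requests – can be sketched as a simple predicate over parsed log records. The field names and the exact extension list are our own assumptions and should be adapted to your site:

```python
# File types that represent assets rather than page views (our own, non-exhaustive list)
EXCLUDED_EXTENSIONS = (".jpg", ".png", ".gif", ".js", ".php", ".ico")

def is_pageview(record):
    """Keep only records that look like real page views:
    successful (200), fetched with GET, not a static asset or WordPress internals."""
    path = record["url"].split("?")[0].lower()  # ignore query strings
    return (
        record["status"] == "200"
        and record["method"] == "GET"
        and not path.endswith(EXCLUDED_EXTENSIONS)
        and not path.startswith("/wp")
    )

records = [
    {"method": "GET",  "url": "/2020/03/14/vizualizace-vyvoj-covid-19-v-cesku-powerbi/", "status": "200"},
    {"method": "GET",  "url": "/wp-admin/admin-ajax.php", "status": "200"},
    {"method": "GET",  "url": "/logo.png",                "status": "200"},
    {"method": "POST", "url": "/contact/",                "status": "200"},
    {"method": "GET",  "url": "/missing/",                "status": "404"},
]

pageviews = [r for r in records if is_pageview(r)]
print(len(pageviews))  # only the first record survives
```

On real data you would run the same predicate over every parsed line of the log file before comparing the counts with GA.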

Finally, we must exclude all “non-human traffic” from our data. The number of bots and crawlers of all kinds has grown in recent years, making this one of the most difficult steps. You could argue that at this stage it would be possible to compare unfiltered data, since GA usually offers an unfiltered view as well, but I wouldn’t recommend it. GA detects only a minimal share of bots, because most bots and crawlers do not execute JavaScript, so the GA code never runs for them. I therefore recommend doing the comparison on cleaned data to minimise distortion. So how do we find all those bots and crawlers?

To identify bots or crawlers we can use information from user agents.

"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.92 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

At the end of this string you will find a note that it is a bot. The same is true for most official bots and crawlers. Then you just set up a filter to eliminate all that traffic and you’re done.

That’s the theory. In practice, it’s a bit more complicated. You need to look at what the data tells you, what the user agent structure looks like, and what it contains. If a user agent has a non-standard form, it may be a bot – you can check it against the various bot databases available on the web, e.g. https://user-agents.net/bots. There are also various paid and public lists of bots that you can join to your data and use for filtering, but that approach has not worked well for us.
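A simple first pass at bot detection relies on the naming convention most self-identifying crawlers follow in their user agent. This is only a heuristic sketch (the marker list is our own); as noted above, real-world data still needs manual review against bot databases:

```python
# Substrings that most self-identifying crawlers include in their user agent
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def looks_like_bot(user_agent):
    """Return True if the user agent contains a common crawler marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

googlebot = ("Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
             "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.92 "
             "Mobile Safari/537.36 (compatible; Googlebot/2.1; "
             "+http://www.google.com/bot.html)")
human = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
         "(KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36")

print(looks_like_bot(googlebot))  # True
print(looks_like_bot(human))      # False
```

This catches the official, well-behaved crawlers; bots that imitate a normal browser string will slip through and require the manual checks described above.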

Conclusion

This is how we can prepare data for a comparative analysis. Where the raw server data initially exceeded the Google Analytics data several times over, at this point we should be close to numbers that correspond to the structure and volume of the data in Analytics. The two will never match exactly, but that shouldn’t matter. Think of it this way: Analytics shows the behavioural trends of visitors on your site, while server logs give you a snapshot of its real traffic. They are two different views of the same activity, and they need to be treated as such.
