Log File Analysis needs to be part of each SEO professional’s instrument belt, however most SEOs have by no means performed one. Which means most SEOs are lacking out on distinctive and invaluable insights that common crawling instruments simply can’t produce.
Let’s demystify Log File Analysis so it is not so intimidating. If you’re within the fantastic world of log recordsdata and what they’ll convey to your web site audits, this information is certainly for you.
What are Log Files?
Log Files are recordsdata containing detailed logs on who and what’s making requests to your web site server. Every time a bot makes a request to your web site, information (such because the time, date IP handle, consumer agent, and so forth.) is saved on this log. This useful information permits any SEO to discover out what Googlebot and different crawlers are doing in your web site. Unlike common crawlings, similar to with the Screaming Frog SEO Spider, that is real-world information — not an estimation of how your web site is being crawled. It is an actual overview of how your web site is being crawled.
Having this correct information will help you establish areas of crawl funds waste, simply discover entry errors, perceive how your SEO efforts are affecting crawling and far, way more. The better part is that, normally, you are able to do this with easy spreadsheet software program.
In this information, we can be focussing on Excel to carry out Log File Analysis, however I’ll additionally focus on different instruments similar to Screaming Frog’s much less well-known Log File Analyser which might simply make the job a bit simpler and sooner by serving to you handle bigger information units.
Note: proudly owning any software program aside from Excel is just not a requirement to comply with this information or get your fingers soiled with Log Files.
How to Open Log Files
Rename .log to .csv
When you get a log file with a .log extension, it’s actually as straightforward as renaming the file extension .csv and opening the file in spreadsheet software program. Remember to set your working system to present file extensions if you would like to edit these.
How to open break up log recordsdata
Log recordsdata can are available both one massive log or a number of recordsdata, relying on the server configuration of your web site. Some servers will use server load balancing to distribute visitors throughout a pool or farm of servers, inflicting log recordsdata to be break up up. The excellent news is that it is very easy to mix, and you should use one among these three strategies to mix them after which open them as regular:
- Use the command line in Windows by Shift + right-clicking within the folder containing your log recordsdata and deciding on “Run Powershell from here”
Then run the next command:
copy *.log mylogfiles.csv
You can now open mylogfile.csv and it’ll include all of your log information.
Or if you’re a Mac consumer, first use the cd command to go to the listing of your log recordsdata:
Then, use the cat or concatenate command to be a part of up your recordsdata:
cat *.log > mylogfiles.csv
2) Using the free instrument, Log File Merge, mix all of the log recordsdata after which edit the file extension to .csv and open as regular.
three) Open the log recordsdata with the Screaming Frog Log File Analyser, which is so simple as dragging and dropping the log recordsdata:
(Please word: This step isn’t required if you’re utilizing Screaming Frog’s Log File Analyser)
Once you’ve gotten your log file open, you’re going to want to break up the cumbersome textual content in every cell into columns for simpler sorting later.
Excel’s Text to Column operate is useful right here, and is as straightforward as deciding on all of the crammed cells (Ctrl / Cmd + A) and going to Excel > Data > Text to Columns and deciding on the “Delimited” choice, and the delimiter being a Space character.
Once you’ve separated this out, you might also need to kind by time and date — you are able to do so within the Time and Date stamp column, generally separating the information with the “:” colon delimiter.
Your file ought to look comparable to the one beneath:
As talked about earlier than, don’t fear in case your log file doesn’t look precisely the identical — totally different log recordsdata have totally different codecs. As lengthy as you’ve gotten the fundamental information there (time and date, URL, user-agent, and so forth.) you’re good to go!
Understanding Log Files
Now that your log recordsdata are prepared for evaluation, we will dive in and begin to perceive our information. There are many codecs that log recordsdata can take with a number of totally different information factors, however they often embrace the next:
- Server IP
- Date and time
- Server request methodology (e.g. GET / POST)
- Requested URL
- HTTP standing code
More particulars on the frequent codecs will be discovered beneath for those who’re within the nitty gritty particulars:
- Apache and NGINX
- Amazon Elastic Load Balancing
- HA Proxy
How to shortly reveal crawl funds waste
As a fast recap, Crawl Budget is the variety of pages a search engine crawls upon each go to of your web site. Numerous elements have an effect on crawl funds, together with hyperlink fairness or area authority, web site pace, and extra. With Log File Analysis, we can be in a position to see what kind of crawl funds your web site has and the place there are issues inflicting crawl funds to be wasted.
Ideally, we wish to give crawlers essentially the most environment friendly crawling expertise attainable. Crawling shouldn’t be wasted on low-value pages and URLs, and precedence pages (product pages for instance) shouldn’t have slower indexation and crawl charges as a result of an internet site has so many lifeless weight pages. The title of the sport is crawl funds conservation, and with good crawl funds conversion comes higher natural search efficiency.
See crawled URLs by consumer agent
Seeing how incessantly URLs of the location are being crawled can shortly reveal the place serps are placing their time into crawling.
If you’re all for seeing the conduct of a single consumer agent, that is straightforward as filtering out the related column in excel. In this case, with a WC3 format log file, I’m filtering the cs(User-Agent) column by Googlebot:
And then filtering the URI column to present the variety of instances Googlebot crawled the house web page of this instance web site:
This is a quick approach of seeing if there are any drawback areas by URI stem for a singular user-agent. You can take this a step additional by trying on the filtering choices for the URI stem column, which on this case is cs-uri-stem:
From this fundamental menu, we will see what URLs, together with useful resource recordsdata, are being crawled to shortly establish any drawback URLs (parameterized URLs that shouldn’t be being crawled for instance).
You may also do broader analyses with Pivot tables. To get the variety of instances a selected consumer agent has crawled a selected URL, choose the entire desk (Ctrl/cmd + A), go to Insert > Pivot Table after which use the next choices:
All we’re doing is filtering by User Agent, with the URL stems as rows, after which counting the variety of instances every User-agent happens.
With my instance log file, I bought the next:
Then, to filter by particular User-Agent, I clicked the drop-down icon on the cell containing “(All),” and chosen Googlebot:
Understanding what totally different bots are crawling, how cellular bots are crawling otherwise to desktop, and the place essentially the most crawling is going on will help you see instantly the place there may be crawl funds waste and what areas of the location want enchancment.
Find low-value add URLs
Crawl funds shouldn’t be wasted on Low value-add URLs, that are usually attributable to session IDs, infinite crawl areas, and faceted navigation.
To do that, return to your log file, and filter by URLs that include a “?” or query mark symbols from the URL column (containing the URL stem). To do that in Excel, bear in mind to use “~?” or tilde query mark, as proven beneath:
A single “?” or query mark, as acknowledged within the auto filter window, represents any single character, so including the tilde is like an escape character and makes certain to filter out the query mark image itself.
Isn’t that straightforward?
Find duplicate URLs
Duplicate URLs is usually a crawl funds waste and an enormous SEO challenge, however discovering them is usually a ache. URLs can typically have slight variants (similar to a trailing slash vs a non-trailing slash model of a URL).
Ultimately, one of the best ways to discover duplicate URLs can also be the least enjoyable approach to achieve this — you’ve gotten to kind by web site URL stem alphabetically and manually eyeball it.
One approach you will discover trailing and non-trailing slash variations of the identical URL is to use the SUBSTITUTE operate in one other column and use it to take away all ahead slashes:
=SUBSTITUTE(C2, “/”, “”)
In my case, the goal cell is C2 because the stem information is on the third column.
Then, use conditional formatting to establish duplicate values and spotlight them.
However, eyeballing is, sadly, the perfect methodology for now.
See the crawl frequency of subdirectories
Finding out which subdirectories are getting crawled essentially the most is one other fast approach to reveal crawl funds waste. Although consider, simply because a shopper’s weblog has by no means earned a single backlink and solely will get three views a yr from the enterprise proprietor’s grandma doesn’t imply it is best to contemplate it crawl funds waste — inner linking construction needs to be persistently good all through the location and there is perhaps a robust cause for that content material from the shopper’s perspective.
To discover out crawl frequency by subdirectory degree, you will have to largely eyeball it however the next formulation will help:
The above formulation seems to be like a little bit of a doozy, however all it does is examine if there’s a trailing slash, and relying on the reply, depend the variety of trailing slashes and subtract both 2 or 1 from the quantity. This formulation might be shortened for those who take away all trailing slashes out of your URL listing utilizing the RIGHT formulation — however who has the time. What you’re left with is subdirectory depend (ranging from zero from as the primary subdirectory).
Replace C2 with the primary URL stem / URL cell after which copy the formulation down your total listing to get it working.
Make certain you substitute the entire C2s with the suitable beginning cell after which kind the brand new subdirectory counting column by smallest to largest to get a superb listing of folders in a logical order, or simply filter by subdirectory degree. For instance, as proven within the beneath screenshots:
The above picture is subdirectories sorted by degree.
The above picture is subdirectories sorted by depth.
If you’re not coping with plenty of URLs, you could possibly merely kind the URLs by alphabetical order however then you definitely received’t get the subdirectory depend filtering which is usually a lot sooner for bigger websites.
See crawl frequency by content material kind
Finding out what content material is getting crawled, or if there are any content material varieties which can be hogging crawl funds, is a good examine to spot crawl funds waste. Frequent crawling on pointless or low precedence CSS and JS recordsdata, or how crawling is going on on pictures if you’re attempting to optimize for picture search, can simply be noticed with this tactic.
In Excel, seeing crawl frequency by content material kind is as straightforward as filtering by URL or URI stem utilizing the Ends With filtering choice.
Quick Tip: You may also use the “Does Not End With” filter and use a .html extension to see how non-HTML web page recordsdata are being crawled — all the time price checking in case of crawl funds waste on pointless js or css recordsdata, and even pictures and picture variations (taking a look at you WordPress). Also, bear in mind if in case you have a web site with trailing and non-trailing slash URLs to take that into consideration with the “or” operator with filtering.
Spying on bots: Understand web site crawl conduct
Log File Analysis permits us to perceive how bots behave by giving us an thought of how they prioritize. How do totally different bots behave in several conditions? With this data, you can’t solely deepen your understanding of SEO and crawling, but additionally provide you with an enormous leap in understanding the effectiveness of your web site structure.
See most and least crawled URLs
This technique has been touched up beforehand with seeing crawled URLs by user-agent, but it surely’s even sooner.
In Excel, choose a cell in your desk after which click on Insert > Pivot Table, make sure that the choice comprises the required columns (on this case, the URL or URI stem and the user-agent) and click on OK.
Once you’ve gotten your pivot desk created, set the rows to the URL or URI stem, and the summed worth because the user-agent.
From there, you’ll be able to right-click within the user-agent column and type the URLs from largest to smallest by crawl depend:
Now you’ll have a terrific desk to make charts from or shortly evaluation and search for any problematic areas:
A query to ask your self when reviewing this information is: Are the pages you or the shopper would need being crawled? How typically? Frequent crawling doesn’t essentially imply higher outcomes, however it may be a sign as to what Google and different content material user-agents prioritize most.
Crawl frequency per day, week, or month
Checking the crawling exercise to establish points the place there was lack of visibility round a time period, after a Google replace or in an emergency can inform you the place the issue is perhaps. This is so simple as deciding on the “date” column, ensuring the column is within the “date” format kind, after which utilizing the date filtering choices on the date column. If you’re trying to analyze an entire week, simply choose the corresponding days with the filtering choices obtainable.
Crawl frequency by directive
Understanding what directives are being adopted (for example, if you’re utilizing a disallow or perhaps a no-index directive in robots.txt) by Google is crucial to any SEO audit or marketing campaign. If a web site is utilizing disallows with faceted navigation URLs, for instance, you’ll need to make sure that these are being obeyed. If they aren’t, advocate a greater answer similar to on-page directives like meta robots tags.
To see crawl frequency by directive, you’ll want to mix a crawl report together with your log file evaluation.
(Warning: We’re going to be utilizing VLOOKUP, but it surely’s actually not as difficult as individuals make it out to be)
To get the mixed information, do the next:
- Get the crawl out of your web site utilizing your favourite crawling software program. I is perhaps biased, however I’m an enormous fan of the Screaming Frog SEO Spider, so I’m going to use that.
If you’re additionally utilizing the spider, comply with the steps verbatim, however in any other case, make your individual name to get the identical outcomes.
- Export the Internal HTML report from the SEO Spider (Internal Tab > “Filter: HTML”) and open up the “internal_all.xlsx” file.
From there, you’ll be able to filter the “Indexability Status” column and take away all clean cells. To do that, use the “does not contain” filter and simply go away it clean. You may also add the “and” operator and filter out redirected URLs by making the filter worth equal “does not contain → “Redirected” as proven beneath:
This will present you canonicalized, no-index by meta robots and canonicalized URLs.
- Copy this new desk out (with simply the Address and Indexability Status columns) and paste it in one other sheet of your log file evaluation export.
- Now for some VLOOKUP magic. First, we’d like to make sure that the URI or URL column information is in the identical format because the crawl information.
Log Files don’t usually have the foundation area or protocol within the URL, so we both want to take away the top of the URL utilizing “Find and Replace” in our newly made sheet, or make a brand new column in your log file evaluation sheet append the protocol and root area to the URI stem. I favor this methodology as a result of then you’ll be able to shortly copy and paste a URL that you’re seeing issues with and have a look. However, if in case you have a large log file, it’s in all probability quite a bit much less CPU intensive with the “Find and Replace” methodology.
To get your full URLs, use the next formulation however with the URL area modified to no matter web site you might be analyzing (and ensure the protocol is right as nicely). You’ll additionally need to change D2 to the primary cell of your URL column
https://www.example.com”&D2 Drag down the formulation to the tip of your Log file desk and get a pleasant listing of full URLs:
- Now, create one other column and name it “Indexability Status”. In the primary cell, use a VLOOKUP comparable to the next: =VLOOKUP(E2,CrawlSheet!A$1:B$1128,2,FALSE). Replace E2 with the primary cell of you “Full URL” column, then make the lookup desk into your new. crawl sheet. Remember to sue the greenback indicators in order that the lookup desk does not change as you. apply the formulation to additional roles. Then, choose the right column (1 could be the primary column of the index desk, so quantity 2 is the one we’re after). Use the FALSE vary lookup mode for actual matching. Now you’ve gotten a pleasant tidy listing of URLs and their indexability standing matched with crawl information:
Crawl frequency by depth and inner hyperlinks
This evaluation permits us to see how a web site’s structure is performing by way of crawl funds and crawlability. The important purpose is to see if in case you have way more URLs than you do requests — and for those who do then you’ve gotten an issue. Bots shouldn’t be “giving up” on crawling your total web site and never discovering essential content material or losing crawl funds on content material that’s not essential.
Tip: It can also be price utilizing a crawl visualization instrument alongside this evaluation to see the general structure of the location and see the place there are “off-shoots” or pages with poor inner linking.
To get this all-important information, do the next:
- Crawl your web site together with your most well-liked crawling instrument and export whichever report has each the clicking depth and variety of inner hyperlinks with every URL.
In my case, I’m utilizing the Screaming Frog SEO Spider, going exporting the Internal report:
- Use a VLOOKUP to match your URL with the Crawl Depth column and the variety of Inlinks, which will provide you with one thing like this:
- Depending on the kind of information you need to see, you may want to filter out solely URLs returning a 200 response code at this level or make them filterable choices within the pivot desk we create later. If you’re checking an e-commerce web site, you may want to focus solely on product URLs, or for those who’re optimizing crawling of pictures you’ll be able to filter out by file kind by filtering the URI column of your log file utilizing the “Content-Type” column of your crawl export and making an choice to filter with a pivot desk. As with all of those checks, you’ve gotten loads of choices!
- Using a pivot desk, now you can analyze crawl charge by crawl depth (filtering by the actual bot on this case) with the next choices:
To get one thing like the next:
Better information than Search Console? Identifying crawl points
Search Console is perhaps a go-to for each SEO, but it surely definitely has flaws. Historical information is more durable to get, and there are limits on the variety of rows you’ll be able to view (at the moment of writing it’s 1000). But, with Log File Analysis, the sky’s the restrict. With the next checks, we’re going to be found crawl and response errors to give your web site a full well being examine.
Discover Crawl Errors
An apparent and fast examine to add to your arsenal, all you’ve gotten to do is filter the standing column of your log file (in my case “sc-status” with a W3C log file kind) for 4xx and 5xx errors:
Find inconsistent server responses
A specific URL could have various server responses over time, which might both be regular conduct, similar to when a damaged hyperlink has been mounted or an indication there’s a critical server challenge occurring similar to when heavy visitors to your web site causes much more inner server errors and is affecting your web site’s crawlability.
Analyzing server responses is as straightforward as filtering by URL and by Date:
Alternatively, if you would like to shortly see how a URL is various in response code, you should use a pivot desk with the rows set to the URL, the columns set to the response codes and counting the variety of instances a URL has produced that response code. To obtain this setup create a pivot desk with the next settings:
This will produce the next:
As you’ll be able to see within the above desk, you’ll be able to clearly see “/inconcistent.html” (highlighted within the purple field) has various response codes.
View Errors by Subdirectory
To discover which subdirectories are producing essentially the most issues, we simply want to do some easy URL filtering. Filter out the URI column (in my case “cs-uri-stem”) and use the “contains” filtering choice to choose a selected subdirectory and any pages inside that subdirectory (with the wildcard *):
For me, I checked out the weblog subdirectory, and this produced the next:
View Errors by User Agent
Finding which bots are struggling will be helpful for quite a few causes together with seeing the variations in web site efficiency for cellular and desktop bots, or which serps are finest in a position to crawl extra of your web site.
You may need to see which specific URLs are inflicting points with a selected bot. The easiest method to do that is with a pivot desk that enables for filtering the variety of instances a selected response code happens per URI. To obtain this make a pivot desk with the next settings:
From there, you’ll be able to filter by your chosen bot and response code kind, similar to picture beneath, the place I am filtering for Googlebot desktop to hunt down 404 errors:
Alternatively, you may also use a pivot desk to see what number of instances a selected bot produces totally different response codes as an entire by making a pivot desk that filters by bot, counts by URI incidence, and makes use of response codes as rows. To obtain this use the settings beneath:
For instance, within the pivot desk (beneath), I’m taking a look at what number of of every response code Googlebot is receiving:
Diagnose on-page issues
Websites want to be designed not only for people, however for bots. Pages shouldn’t be gradual loading or be an enormous obtain, and with log file evaluation, you’ll be able to see each of those metrics per URL from a bot’s perspective.
Find gradual & giant pages
While you’ll be able to kind your log file by the “time taken” or “loading time” column from largest to smallest to discover the slowest loading pages, it’s higher to have a look at the typical load time per URL as there might be different elements that may have contributed to a gradual request aside from the online web page’s precise pace.
To do that, create a pivot desk with the rows set to the URI stem or URL and the summed worth set to the time taken to load or load time:
Then utilizing the drop-down arrow, on this case, the place it says “Sum of time-taken” and go to “Value Field Settings”:
In the brand new window, choose “Average” and also you’re all set:
Now it is best to have one thing comparable to the next if you kind the URI stems by largest to smallest and common time taken:
Find giant pages
You can now add the obtain dimension column (in my case “sc-bytes”) utilizing the settings proven beneath. Remember that the set the dimensions to the typical or sum relying on what you prefer to to see. For me, I’ve completed the typical:
And it is best to get one thing comparable to the next:
Bot conduct: Verifying and analyzing bots
The finest and easiest method to perceive bot and crawl conduct is with log file evaluation as you might be once more getting real-world information, and it’s quite a bit much less trouble than different strategies.
Find un-crawled URLs
Simply take the crawl of your web site together with your instrument of alternative, after which take your log file an examine the URLs to discover distinctive paths. You can do that with the “Remove Duplicates” characteristic of Excel or conditional formatting, though the previous is quite a bit much less CPU intensive particularly for bigger log recordsdata. Easy!
Identify spam bots
Unnecessary server pressure from spam and spoof bots is well recognized with log recordsdata and a few fundamental command line operators. Most requests may also have an IP related to it, so utilizing your IP column (in my case, it’s titled “c-ip” in a W3C format log), take away all duplicates to discover every particular person requesting IP.
From there, it is best to comply with the method outlined in Google’s doc for verifying IPs (word: For Windows customers, use the nslookup command):
Or, for those who’re verifying a bing bot, use their helpful instrument:
Conclusion: Log Files Analysis — not as scary because it sounds
With some easy instruments at your disposal, you’ll be able to dive deep into how Googlebot behaves. When you perceive how an internet site handles crawling, you’ll be able to diagnose extra issues than you’ll be able to chew — however the true energy of Log File Analysis lies in having the ability to check your theories about Googlebot and increasing the above methods to collect your individual insights and revelations.
What theories would you check utilizing log file evaluation? What insights might you collect from log recordsdata aside from those listed above? Let me know within the feedback beneath.