Search engines as a reconnaissance tool

Search engines can be a powerful tool for reconnaissance. In this article we look at how search engines work, the types of information they collect, and what this means for our reconnaissance process.

Web search engines

The sheer volume of information on the web means that, without an efficient way of finding things, we would be limited to well-known sources of information. Much like the early days of BBSs, the knowledge available to you would be based on whatever sources you had already identified.

Given that the internet is estimated to contain around 1.7 billion websites, with estimates of around 130 trillion pages, sifting through this huge amount of information can be a difficult task. (It is also interesting to note that estimates put the indexed, searchable content at around 3-10% of the total content available.)

The premise of a search engine is simple: you type in a set of terms you are interested in, and the search engine returns a set of results that should match those terms. The relevance of the returned results is a key factor in a search engine's success.

Note

Cynical note: most of these search engines are really in the business of advertising. Collecting vast amounts of information on the web's content, and linking that to people's interests via their search queries, means they can build detailed profiles of their users. I find it interesting that this multi-billion-dollar industry is based on the intelligence-gathering techniques we have been discussing this week.

The leading search engines include:

  • Google
  • Bing
  • DuckDuckGo
  • Yandex
  • Baidu

Note

For this next set of articles, I will focus on using Google for search. Other search engines are available, and offer similar features. Due to the way each of these engines indexes and presents the information it finds, it is often worthwhile searching with different tools to compare what is found.

How is the data collected?

Data on pages is collected by "web crawlers" (or spiders), which visit web pages and collect data about the contents of each one. Each crawler starts with a list of pages to scan; as it visits a page, it adds any links it finds to that list. When the crawler has finished indexing a page, it visits the next link in its list.
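
The crawl loop described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the seed URL and page limit are arbitrary placeholders, and a real crawler would add politeness delays, robots.txt checks and far more robust error handling.

# Minimal crawler sketch: keep a queue of URLs, fetch each page,
# and add any links found back onto the queue.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    queue = deque([seed_url])   # pages still to visit
    seen = set()                # pages already visited

    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # skip pages that fail to load
        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))  # resolve relative links

    return seen

print(crawl("https://example.com"))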

The process used to index the content of a page is proprietary (and, in the case of Google, a highly guarded secret). Approaches range from making use of <meta> tags that have been placed on a page, to machine-learning techniques that analyse the full text of a page for content.
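
As a simple illustration of the <meta> tag approach, the sketch below pulls the title, description and keywords out of a page using Python's standard-library HTML parser. Real engines combine many more signals than this.

# Extract the <title> and common <meta> tags from a page's HTML.
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("name") in ("description", "keywords"):
            self.meta[attrs["name"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

page = ('<html><head><title>Example</title>'
        '<meta name="description" content="A demo page"></head></html>')
extractor = MetaExtractor()
extractor.feed(page)
print(extractor.title, extractor.meta)   # Example {'description': 'A demo page'}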

Limiting what is discovered by web crawlers.

Crawlers tend to be limited to information that is publicly available. Therefore pages behind authentication mechanisms (i.e. password-protected parts of sites) may not be scanned and indexed.

There can also be issues with "paywalled" content. Usually a content provider will want this information to be included in search results, as it will drive visits to the site; however, applying the browsing restrictions would stop crawlers being able to visit the pages. A common workaround is to examine the user-agent of visitors to the site: if it is one of the well-known crawlers, the paywall restrictions are lifted.
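
From the server's side, this workaround is just a check on the incoming User-Agent header. The sketch below is purely illustrative (real sites implement this in their CMS or CDN configuration, and the crawler names listed are examples):

# Serve the full article to well-known crawler user-agents,
# and a teaser to everyone else.
from http.server import BaseHTTPRequestHandler, HTTPServer

KNOWN_CRAWLERS = ("Googlebot", "Bingbot", "DuckDuckBot")

class PaywallHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        user_agent = self.headers.get("User-Agent", "")
        if any(bot in user_agent for bot in KNOWN_CRAWLERS):
            body = b"Full article text, visible to crawlers."
        else:
            body = b"Subscribe to read the full article."
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("localhost", 8000), PaywallHandler).serve_forever()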

Tip

You may want to consider what would happen if you modified your browser's user-agent the next time you hit a paywall.
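
To make the tip concrete: the request below presents a crawler-style user-agent string (Googlebot's published one) when fetching a page. The URL is a placeholder, and whether this changes what a paywalled site returns depends entirely on that site's own rules.

# Fetch a page while presenting a crawler-style User-Agent header.
from urllib.request import Request, urlopen

url = "https://example.com/some-article"   # placeholder URL
crawler_ua = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
              "+http://www.google.com/bot.html)")

request = Request(url, headers={"User-Agent": crawler_ua})
page = urlopen(request, timeout=5).read()
print(len(page), "bytes returned")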

Another issue is dynamically generated content: sites where the data is generated "on the fly" from a database can be difficult to index in a sensible way. This means that information from APIs and similar sources may not be available in the search results.

It is also acknowledged that web crawling can cause security issues. For example, a site may contain information that needs to be available, but that the owner does not wish to be in the public domain. While security through obscurity is a Bad Thing(TM), the majority of crawlers operate a "politeness" policy and will follow any rules for indexing found in the robots.txt file.

Robots.txt

The robots.txt file is placed in the root folder of a site. It is a standard that lets the author inform web spiders which elements of a site they want indexing, by providing a set of rules that the spider should follow.

Tip

Obviously, if a site lists something in robots.txt, it means they don't want Google to find it. That tends to imply it might be interesting to look at (or it could be a tonne of CSS).

There are many different definitions available in the robots.txt file (See http://www.robotstxt.org/). A few examples of common definitions follow.

The following definition will block all robots from visiting everything on the site. This means that no content will be indexed or available in the search results.

User-agent: * 
Disallow: /

We can block robots from visiting specific directories (in this case /cgi-bin/) and specific types of file, such as PDFs:

User-agent: * 
Disallow: /cgi-bin/
Disallow: *.pdf
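
A polite crawler can check these rules programmatically before fetching anything. The sketch below uses Python's built-in parser and assumes rules like those above are served from https://example.com/robots.txt (a placeholder host):

# Check robots.txt rules before fetching, as a polite crawler would.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()   # fetch and parse the live robots.txt

# can_fetch(user_agent, url) answers "may this agent visit this URL?"
print(rp.can_fetch("*", "https://example.com/index.html"))
print(rp.can_fetch("*", "https://example.com/cgi-bin/search"))

# Note: the standard-library parser implements the original robots.txt
# standard, so extensions such as the *.pdf wildcard may not be honoured.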

While the use of robots.txt may seem like a great idea, it relies on the spider (or visitor) actually being polite enough to follow the rules, and it can leak information about what is available on the site.

For example, consider the following definition:

User-agent: *
Disallow: /sekret-login.php

While the secret login page may not turn up in Google's search results, it would tell any malicious entity that reads the robots.txt file that there is something interesting here.
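
This is exactly why robots.txt is interesting during reconnaissance: it is public, so anyone can read the paths a site has asked crawlers to avoid. A quick sketch (the host is a placeholder):

# Download robots.txt and list the paths the site asked crawlers to avoid.
from urllib.request import urlopen

url = "https://example.com/robots.txt"
try:
    body = urlopen(url, timeout=5).read().decode("utf-8", "replace")
except Exception as error:
    body = ""
    print("Could not fetch robots.txt:", error)

for line in body.splitlines():
    line = line.strip()
    if line.lower().startswith("disallow:"):
        print(line)   # each entry hints at content the site would rather hide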

It's not just web pages that are indexed

As well as indexing HTML-based web pages, many search engines also index other documents. While most people will be familiar with image search functionality, things like Word documents and PDF files are also collected. We will consider the impact of this when we introduce Google Hacking.

Summary

In this article we have discussed search engines: how they collect data on content, and how this information is presented to the user. In the next article we will discuss searching techniques and how they can be used to discover more information about your target.

Discuss

Previously we introduced the robots.txt file, which is intended to provide web crawlers with a method of identifying content that should not be indexed in search results. In the forums, discuss this approach: what are the benefits of using this kind of file? Are there any significant drawbacks?

  • How could we (mis)use robots.txt to gain more information on a site's structure?
  • Do you have any ideas on how we could mitigate this as a problem?