
Common Files and Metadata

Many sites make use of "well known" files to store information about them. Looking at these files can give us an idea of hidden functionality.

In this article we will take a look at some of the more common files that can help our enumeration.

Site Structure: Sitemaps

Sitemaps are generally used for search engine optimisation.

Here a developer builds a text or XML representation of all of the pages in the site, and either submits it to the search provider, or leaves it in the root directory of the site.

Example

The following is part of a text-based sitemap of cueh.coventry.ac.uk

http://cueh.coventry.ac.uk/
http://cueh.coventry.ac.uk/staff.html
http://cueh.coventry.ac.uk/facilities.html
http://cueh.coventry.ac.uk/blog/
http://cueh.coventry.ac.uk/more_info_research.html
http://cueh.coventry.ac.uk/index.html
http://cueh.coventry.ac.uk/blog/2016/12/new-toy 

Or its XML equivalent:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<!--  created with Free Online Sitemap Generator www.xml-sitemaps.com  -->
<url>
    <loc>http://cueh.coventry.ac.uk/</loc>
    <lastmod>2017-09-05T11:10:09+00:00</lastmod>
    <priority>1.00</priority>
</url>
<url>
    <loc>http://cueh.coventry.ac.uk/staff.html</loc>
    <lastmod>2017-09-05T11:10:09+00:00</lastmod>
    <priority>0.80</priority>
</url>
<url>
    <loc>http://cueh.coventry.ac.uk/facilities.html</loc>
    <lastmod>2017-09-05T11:10:09+00:00</lastmod>
    <priority>0.80</priority>
</url>
<url>
    <loc>http://cueh.coventry.ac.uk/blog/</loc>
    <lastmod>2021-08-19T14:49:41+00:00</lastmod>
    <priority>0.80</priority>
</url>

If we find a sitemap, we can audit the links it provides to see whether they match the endpoints we have already identified. Any new endpoints can be mapped for interesting parameters or interactions, as before.

There are several ways of generating sitemaps, either by hand or with an automated process. If the process is automated, then the accuracy of the map will depend on the spider used. If it is a web based spider (like in the sitemap above), then it will only visit links that are visible from the site itself. However, if the sitemap is generated server side, it may be based on the contents of the web directory. In this case, every file in the directory can appear in the map, including pages that are not linked from the site itself.
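
To help with this audit, a short script can pull the sitemap and compare it against what we already know. The following is a minimal sketch, assuming the Python requests library is installed; the sitemap location and the known_endpoints set are placeholders for your own audit data.

```python
# A minimal sketch: fetch an XML sitemap and flag URLs we have not mapped yet.
# Assumes the `requests` library; the sitemap location and known_endpoints
# are placeholders for your own audit data.
import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = "http://cueh.coventry.ac.uk/sitemap.xml"  # assumed location
known_endpoints = {
    "http://cueh.coventry.ac.uk/",
    "http://cueh.coventry.ac.uk/staff.html",
}

response = requests.get(SITEMAP_URL, timeout=10)
root = ET.fromstring(response.content)

# <loc> elements live in the sitemap namespace declared on <urlset>.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
sitemap_urls = {loc.text.strip() for loc in root.findall(".//sm:loc", ns)}

for url in sorted(sitemap_urls - known_endpoints):
    print("New endpoint to map:", url)
```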

Robots.txt

Robots.txt gives web developers a way of asking search engines not to include certain results in their listings. The intention is to stop Google, or other providers, from listing sensitive pages in their results. However, this creates a bit of a problem, because it puts a list of "pages I don't want people to look at" in a well known place.

The format of robots.txt looks something like this:

User-agent: *
Disallow: <path to block>
Disallow: <path>
Allow: <path to allow>
  • User-Agent refers to the web crawler used for indexing. We can either have * for everything, or specific agents, such as bingbot for Microsoft or Googlebot for Google.
  • One or more Disallow fields specify paths that we don't want the spider to crawl.
  • (Optional) Allow fields give the crawler permission to visit that part of the site.

When developing, it is also important to note that following the directives in robots.txt is not mandatory; we have to trust the developers of specific crawlers to honour them.

As part of our security audit we should add the URLs listed in robots.txt to our list of endpoints.
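
A minimal sketch of that step, assuming the Python requests library and a placeholder target, might pull robots.txt and collect every Disallow path for our endpoint list:

```python
# A minimal sketch: pull robots.txt and collect every Disallow path.
# Assumes the `requests` library; TARGET is a placeholder for the site under audit.
import requests

TARGET = "https://www.example.com"

robots = requests.get(TARGET + "/robots.txt", timeout=10).text

disallowed = []
for line in robots.splitlines():
    line = line.split("#", 1)[0].strip()  # drop comments and whitespace
    if line.lower().startswith("disallow:"):
        path = line.split(":", 1)[1].strip()
        if path:
            disallowed.append(path)

# Each entry is a candidate for our endpoint list.
for path in disallowed:
    print(path)
```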

Coventry's Robots.txt

Here we have part of Coventry's robots.txt:

Sitemap: https://www.coventry.ac.uk/sitemapindex.xml
User-agent: *
Disallow: /study-at-coventry/course-finder-search-results/*
Disallow: /*.xml$
Allow: /*sitemap*.xml$
Allow: /sitemapindex.xml$

Disallow: /episerver
Disallow: /episerver/*
Disallow: /util/*
...
Disallow: *.aspx
Disallow: *.axd 
...
Disallow: /testing/*
Disallow: */testing/*
...

We can identify:

- A sitemap file.
- That the Episerver CMS is used (which gives us an idea of the technologies involved).
- That ASPX files are blocked (which might indicate the site runs on Microsoft IIS).
- That there is a *testing* directory that may be of interest.

Note

Each year[^1] in the security lecture we look for interesting Google dorks on Coventry's site, as part of the lecture on OSInt.

Each year[^2], I go to ITS / SMT with another big list of "Things that are public, but probably shouldn't be".

Over this time, Coventry's robots.txt has evolved dramatically, from disallowing only a small number of things[^3] to being quite restrictive in the types of files it allows Google to index.

In-Page Meta Tags

As we discussed in the HTML Recap, meta tags give us a way of placing extra information in our web pages. These are often used by web crawlers, or other applications (like social media), to help build a bigger picture of what the site contains. Meta tags can include things like the site author, or details of the CMS used.

For example, <meta name="generator" content="mkdocs-1.2.1, mkdocs-material-7.1.7"> shows that the page was generated using the mkdocs static site generator.

Example: Open Graph Data

For example, the Open Graph standard defines ways of making "content cards" to be displayed on social media.
An example of Open Graph metadata can be seen below. Here we can identify some directories where images are stored.

```
<meta property="og:image" content="https://example.com/ogp.jpg" />
<meta property="og:image:secure_url" content="https://secure.example.com/ogp.jpg" />
<meta property="og:image:type" content="image/jpeg" />
```

Meta tags can also be useful for identifying new URL endpoints and other areas of interest in the application.
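
As a rough illustration of gathering these tags during an audit (assuming the requests library, with the target URL as a placeholder), the standard library's html.parser is enough to list every meta tag on a page:

```python
# A minimal sketch: collect the <meta> tags from a page using only the
# standard library parser. Assumes `requests`; the URL is a placeholder.
import requests
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.tags.append(dict(attrs))


page = requests.get("http://cueh.coventry.ac.uk/", timeout=10).text
collector = MetaCollector()
collector.feed(page)

for meta in collector.tags:
    name = meta.get("name") or meta.get("property")
    print(name, "->", meta.get("content"))
```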

Meta Tags and Robots

To address some of the issues with having a public robots.txt, we can also include instructions for crawlers in meta tags. As a developer this is a great idea, as it means we don't have to publish a list of "places I don't want you to look"[^4].

Each page can have a set of tags that give the crawler instructions on how to behave.

<meta name="robots" content="noimageindex, nofollow, nosnippet">
<meta name="googlebot" content="nofollow">

While this may not be as useful as having a list of pages provided to us by robots.txt, once we have discovered a page it might give us an indication of whether the developers want it hidden or not.
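
A small sketch of that check, assuming the requests library and a hypothetical discovered page, looks at both the X-Robots-Tag response header and any robots meta tags in the HTML:

```python
# A minimal sketch: check a discovered page for robots directives in both the
# X-Robots-Tag response header and any robots/googlebot <meta> tags.
# Assumes `requests`; the URL is a placeholder for a page we have found.
import re
import requests

url = "http://cueh.coventry.ac.uk/staff.html"
response = requests.get(url, timeout=10)

print("X-Robots-Tag header:", response.headers.get("X-Robots-Tag", "not set"))

for match in re.finditer(
    r'<meta[^>]+name=["\'](?:robots|googlebot)["\'][^>]*>',
    response.text,
    re.IGNORECASE,
):
    print("Robots meta tag:", match.group(0))
```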

Other Well Known Files

Two other files of interest that may exist on the system are:

Security.txt
Gives the site's security policy, details of bug bounties, etc. It will sometimes contain details of CTF / recruitment opportunities too.
Humans.txt
Gives details of the developers. May include things like developer names / emails / job titles. Can be useful for social engineering.
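
Checking for these is just a request to a couple of well known locations. A minimal sketch, assuming the requests library and a placeholder target:

```python
# A minimal sketch: request the usual locations for security.txt and humans.txt.
# Assumes `requests`; TARGET is a placeholder for the site under audit.
import requests

TARGET = "https://www.example.com"

for path in ("/.well-known/security.txt", "/security.txt", "/humans.txt"):
    response = requests.get(TARGET + path, timeout=10)
    if response.status_code == 200:
        print("Found", path)
        print(response.text)
```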

Application Specific Entrypoints

If we have identified a specific technology used on the site, we may be able to find default pages (such as admin panels). Our success here will depend on the technology used, and on whether the administrators have moved the default locations.

If we can identify the technology and version number, a quick Google along the lines of wordpress 3.14 default login should help us find the relevant page.

Example

URLs that allow us to do this include:

  • /login, /admin, /wp-admin/ for WordPress
  • /manager/html for Apache Tomcat
  • /phpmyadmin for phpMyAdmin
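
As a rough sketch of this probing (assuming the requests library; the target and the path list are illustrative only), we can request a handful of default locations and note anything that does not return a 404:

```python
# A minimal sketch: probe a few default paths and report anything that does not 404.
# Assumes `requests`; the target and the path list are illustrative only.
import requests

TARGET = "https://www.example.com"
DEFAULT_PATHS = ["/login", "/admin", "/wp-admin/", "/manager/html", "/phpmyadmin"]

for path in DEFAULT_PATHS:
    response = requests.get(TARGET + path, timeout=10, allow_redirects=False)
    if response.status_code != 404:
        print(path, "->", response.status_code)
```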

Summary

As part of the audit process, common or well known files can help us identify further endpoints for mapping. In this article we have looked at some of the meta information that websites use to help with SEO, and to provide users and other services with information.

Next we are going to look at the concept of Fuzzing and Brute Forcing.


[^1]: Next week, for OSInt.

[^2]: And sometimes, during the lecture.

[^3]: One year it even included a "Death Switch" that would crash the site and reboot the webserver.

[^4]: We can also set robots information in the X-Robots-Tag field of an HTTP response header.
