Encoding Data
Encoding data is the process of converting data into a specified format. We can use this to work around restrictions in the way data is transmitted (usually due to legacy design decisions).
In this article we will take a look at some of the common encoding methods used on the web.
HTML Escaping
HTML uses markup to help us provide structure to the sites we build. As part of this markup language there are several reserved characters that are used as part of the markup syntax.
For example, we know that we can create tags using the <tag> </tag> markup. The reserved characters here are the less than (<) and greater than (>) symbols.
When the browser encounters one of these characters it interprets it as HTML.
This can lead to unexpected behaviour in our pages, as the browser doesn't know if 10 < 20 and > 5 means:
- "Ten is less than 20 and greater than 5"
- 10 <20 and> 5 (where <20 and> is an HTML element)
To work around this problem, we make use of character escape codes, which represent the symbols as plain characters without any special meaning.
In HTML, escape codes take the format of an ampersand &, the escape sequence for the character, and a semicolon ;.
The escape sequences for the reserved characters are as follows.
Character | Encoding (Name) | Encoding (Number) |
---|---|---|
< | &lt; | &#60; |
> | &gt; | &#62; |
& | &amp; | &#38; |
' | &apos; | &#39; |
" | &quot; | &#34; |
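As a small sketch of the table above in practice, Python's standard library `html` module can apply and reverse these escapes. The sample string is made up for illustration. Note that `html.escape` writes the apostrophe as the hexadecimal numeric form `&#x27;` rather than the named form.

```python
import html

# A made-up string containing several reserved characters
raw = '10 < 20 and > 5 & "quotes"'

# html.escape replaces &, < and > (and, by default, the quote characters)
escaped = html.escape(raw)
print(escaped)  # 10 &lt; 20 and &gt; 5 &amp; &quot;quotes&quot;

# html.unescape reverses the process
restored = html.unescape(escaped)
print(restored == raw)  # True
```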
Backslash Escaping
We get a similar problem with any "language" where symbols have special meanings.
For example, in regular expressions the star * character means "match zero or more of the preceding item".
The approach here, common to many Unix tools, is to escape any special character with a backslash, e.g. \* will match the literal "*" symbol.[2]
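To illustrate the difference between an escaped and an unescaped star, here is a minimal sketch using Python's `re` module. The sample strings are invented for the example. Raw strings (the `r"..."` prefix) are used so that Python itself does not also try to interpret the backslash.

```python
import re

# Escaped: \* matches a literal asterisk
literal_hits = re.findall(r"\*", "2*3*4")
print(literal_hits)  # ['*', '*']

# Unescaped: a* means "zero or more 'a' characters"
matches_run = re.fullmatch(r"a*", "aaaa") is not None
print(matches_run)  # True

# re.escape adds the backslashes for us, so arbitrary text can be
# used safely as a literal pattern
pattern = re.escape("1+1=2*1")
print(re.fullmatch(pattern, "1+1=2*1") is not None)  # True
```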
We can also make use of escape codes to represent Unicode or other special characters (although just saving the file as UTF-8 will probably be enough to show them). In this case we could encode the μ (mu) character[3] (Unicode code point 956) as:
- Numeric (Decimal): &#956;
- Numeric (Hex): &#x3BC;
- Named entity: &mu;
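As a quick check that the three forms are equivalent, the following sketch builds the decimal and hexadecimal references from the code point and decodes all three with `html.unescape`. The variable names are just for illustration.

```python
import html

mu = "\u03bc"                    # the μ character
code_point = ord(mu)             # 956

decimal_ref = f"&#{code_point};"     # decimal numeric reference
hex_ref = f"&#x{code_point:X};"      # hexadecimal numeric reference
named_ref = "&mu;"                   # named entity

# All three references decode to the same character
decoded = [html.unescape(ref) for ref in (decimal_ref, hex_ref, named_ref)]
print(decimal_ref, hex_ref, named_ref)
print(decoded)
```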
Important
Escaping characters in HTML is also important for security. Allowing the user to enter text that is then interpreted as HTML can lead to problems such as cross-site scripting (XSS), and the same failure to escape input in database queries leads to SQL injection. We will cover this in much more detail later in the module.
URL Encoding
The syntax rules for URIs / URLs permit only a limited set of characters, and some of those characters also have a special meaning.
URL encoding is used to transform data into the alphabet allowed in a URL.[1] This allows us to transmit data for things like form requests, or binary data, in a consistent form.
This is done by encoding each special character as a numerical code, prefixed with a percent symbol %. For example:
- space becomes %20
- # becomes %23
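The two substitutions above can be reproduced with `quote` and `unquote` from Python's `urllib.parse`. This is only a sketch: the μ example is an extra illustration, and it assumes the default UTF-8 encoding, where a non-ASCII character becomes one percent code per byte.

```python
from urllib.parse import quote, unquote

space_encoded = quote(" ")    # '%20'
hash_encoded = quote("#")     # '%23'

# Non-ASCII characters are encoded as their UTF-8 bytes, one %XX per byte
mu_encoded = quote("μ")       # '%CE%BC'

# unquote reverses the encoding
mu_decoded = unquote(mu_encoded)

print(space_encoded, hash_encoded, mu_encoded, mu_decoded)
```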
Example
We know that the ampersand & is used to separate data items in an HTTP request.
For example, let's imagine a login form that sends a username and password.
(Yes, I know sending passwords in a GET request is a bad idea, but POST isn't that much better.)
login?username=Foo&password=Bar
However, if our password has the & symbol in it, this would break the way the URL is interpreted, as it is unclear whether the & is part of the password or the separator symbol. Encoding the & in the password as %26 removes the ambiguity.
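The ambiguity can be seen in a short sketch using `urlencode` and `parse_qs` from `urllib.parse`. The password `B&r` is a made-up value chosen to contain the separator character.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical credentials; the password contains the separator character
params = {"username": "Foo", "password": "B&r"}

# Building the query string by hand leaves the & unencoded
naive_query = "username=Foo&password=B&r"
naive_parsed = parse_qs(naive_query)
print(naive_parsed)   # {'username': ['Foo'], 'password': ['B']}

# urlencode percent-encodes the values, so the & becomes %26
safe_query = urlencode(params)
safe_parsed = parse_qs(safe_query)
print(safe_query)     # username=Foo&password=B%26r
print(safe_parsed)    # {'username': ['Foo'], 'password': ['B&r']}
```

In the hand-built version the password is silently cut short at the &, while the encoded version round-trips intact.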
Task
Using whatever tool you want (e.g. Python, or GCHQ's excellent CyberChef), decode the following query string. Spot anything interesting in the output?
'flag=5067%7B%E2%88%AAn%D1%96code_%E2%85%BEec%CE%BFde%7D'
Base64
Base64 is another common method for encoding data. It takes a string of binary input (which can include text) and converts it to a sequence of printable ASCII characters.
This means that we can use Base64 to transmit data stored in binary format using protocols that can only reliably support text-based content (for example, a lot of the web-based protocols, and email attachments).
The majority of Base64 implementations use the characters A-Z a-z 0-9 for the first 62 characters of the alphabet, but may differ in the values chosen for the last 2 characters. However, + and / are common.
Padding with the = character is also used. Base64 turns each group of 3 input bytes into 4 output characters, so when the input is not a multiple of 3 bytes, = is appended to make the encoded string a multiple of 4 characters long.
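Both the padding and the choice of the last two alphabet characters can be seen with Python's `base64` module. This is a small sketch with arbitrary inputs: the "URL-safe" variant shown is one example of an alternative alphabet, which swaps + and / for - and _.

```python
import base64

# 3 input bytes -> 4 characters, no padding needed
three_bytes = base64.b64encode(b"Man")   # b'TWFu'
# 2 input bytes -> one = of padding
two_bytes = base64.b64encode(b"Ma")      # b'TWE='
# 1 input byte -> two = of padding
one_byte = base64.b64encode(b"M")        # b'TQ=='

# The last two alphabet characters differ between variants
standard = base64.b64encode(b"\xfb\xff")          # b'+/8='
url_safe = base64.urlsafe_b64encode(b"\xfb\xff")  # b'-_8='

# Decoding gives back the original bytes
round_trip = base64.b64decode(two_bytes)          # b'Ma'

print(three_bytes, two_bytes, one_byte, standard, url_safe, round_trip)
```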
Note
These characteristics make it easy to spot Base64 data. If we get a string made up only of alphanumeric characters (perhaps with + and /), with equals symbols at the end, it's a good bet it's been Base64 encoded.
Email Attachments
So back in the days when things like email were invented, we only really expected people to use them for text messages. So a decision was made to base the SMTP protocol on the ASCII character set.
However, a few years later, people started to want to transfer files. The problem is that there is no guarantee that a file will only contain ASCII characters. (There is also the added complexity that there was no guarantee that different operating systems would handle non-ASCII characters in the same way.)
So Mary Ann Horton came up with the idea of uuencode, where groups of bytes are mapped onto a limited subset of the ASCII characters. This was cool, as it meant that we could encode files using characters that we knew would be handled in a standard way.
Base64 is a derivative of uuencoding that was chosen to be the standard for MIME.
So when we transfer a file via email, it gets converted to Base64, then transmitted using the same SMTP protocol that has been about since the beginning. It's pretty cool.
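As a rough sketch of that last step, Python's standard `email` package can build a MIME message with a binary attachment. The subject, filename, and payload below are invented for the example, and no message is actually sent.

```python
from email.message import EmailMessage

# Arbitrary binary data, including bytes outside the ASCII range
payload = bytes(range(256))

msg = EmailMessage()
msg["Subject"] = "File attached"            # hypothetical subject
msg.set_content("This is the body of the message.")
msg.add_attachment(
    payload,
    maintype="application",
    subtype="octet-stream",
    filename="data.bin",                    # hypothetical filename
)

# The serialised message is plain text, with the attachment in Base64
wire_text = msg.as_string()

attachment = next(msg.iter_attachments())
print(msg.get_content_type())                    # multipart/mixed
print(attachment["Content-Transfer-Encoding"])   # base64

# Decoding the attachment gives back the original bytes
recovered = attachment.get_payload(decode=True)
print(recovered == payload)                      # True
```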
Task
Using a tool (e.g. CyberChef), can you get the data from the following email?
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=frontier
This is a message with multiple parts in MIME format.
--frontier
Content-Type: text/plain
This is the body of the message.
--frontier
Content-Type: application/octet-stream
Content-Transfer-Encoding: base64
PGh0bWw+CiAgPGhlYWQ+CiAgPC9oZWFkPgogIDxib2R5PgogICAgPHA+PjUwNjd7RGVjb2RlX0VtYWlsX2I2NH08L3A+CiAgPC9ib2R5Pgo8L2h0bWw+Cgo=
--frontier--
Summary
In this topic we looked at encoding as a form of encryption (or not). Encoding is a common way of transferring data between machines, and allows us to deal with types of data we might not be able to handle otherwise.
We looked at HTML escapes, URL encoding, and Base64 as examples of encoding systems, and explored some examples of where they are used.
From a security standpoint, it's important to remember that encoding is not encryption. While the meaning of the data may be obscured, the decoding process is well known and needs no secret key, meaning that the original data can be obtained easily.
Other Encoding forms #encodings
In this topic we have introduced:
- HTML Encoding
- URL Encoding
- Base64 Encoding
What other common forms of encoding are there?
Are there any systems you have used before?
Discuss in the feed with the tag #encodings
[2] Which can lead to all sorts of fun things like the backslash plague in Python.
[3] Always liked Mu, one of my favorite Greek symbols. ≆ is another good one.