Encoding Data
Encoding data is the process of converting the data into a specified format. We can use this to work around restrictions in the way data is transmitted (usually due to legacy design decisions).
Note
While not strictly cryptography, encoding does change the data we store, and is often confused with crypto. Therefore it makes sense to talk about encoding here.
Some common forms of encoding are:
URL Encoding
The syntax rules for a URI / URL permit only a limited set of characters, and some of those characters also have a special meaning.
URL encoding is used to transform data into the alphabet allowed in a URL [1]. This allows us to transmit data for things like form requests, or binary data, in a consistent form.
This is done by encoding each special character as a numerical code (the hexadecimal value of the byte), prefixed with a percent symbol %. For example:
space becomes %20
# becomes %23
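The percent-encoding described above can be tried out with Python's standard library. This is a minimal sketch using `urllib.parse.quote` and `unquote`; the sample string "café" is just an illustration and is not part of the course material.

```python
from urllib.parse import quote, unquote

# Characters that may not appear literally are replaced by % plus a hex code.
print(quote(" "))   # %20
print(quote("#"))   # %23

# Non-ASCII text is first turned into UTF-8 bytes, then each byte is encoded.
encoded = quote("café")
print(encoded)            # caf%C3%A9

# Decoding reverses the process.
print(unquote(encoded))   # café
```

Note that a single non-ASCII character can become several %XX groups, one for each byte of its UTF-8 representation.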
Example
We know that the ampersand & is used to separate data items in an HTTP request.
For example, let's imagine a login form that sends a username and password.
(Yes I know sending passwords as a GET request is a bad idea, but POST isn't that much better)
login?username=Foo&password=Bar
However, if our password has the & symbol in it, this would break the way the URL is represented, as it is unclear whether the & is part of the password, or the separator symbol.
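To see the ambiguity in practice, here is a small sketch using Python's `urllib.parse`. The credentials are made up for illustration (the password "B&r" is an assumption, chosen because it contains an ampersand).

```python
from urllib.parse import urlencode, parse_qs

# Naively pasting the password into the query string: the second & is
# treated as a separator, so the password is silently cut short.
naive = "username=Foo&password=B&r"
print(parse_qs(naive))   # {'username': ['Foo'], 'password': ['B']}

# urlencode percent-encodes the values, so the & inside the password
# can no longer be mistaken for a separator.
params = {"username": "Foo", "password": "B&r"}
query = urlencode(params)
print(query)             # username=Foo&password=B%26r
print(parse_qs(query))   # {'username': ['Foo'], 'password': ['B&r']}
```

The receiving side decodes %26 back into & only after it has split the string on the real separators, which is why the round trip is safe.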
Task
Using whatever tool you want (e.g. Python, or GCHQ's excellent CyberChef), decode the following query string. Spot anything interesting in the output?
'flag=%E1%92%BF45%7B%E2%88%AAn%D1%96code_%E2%85%BEec%CE%BFde%7D'
Base64
Base64 is another common method for encoding data. It takes a string of binary input (which can include text), and converts it to a sequence of printable ASCII characters.
This means that we can use Base64 to transmit data stored in binary format, using protocols that can only reliably support text-based content (for example, a lot of the web-based protocols, and email attachments).
The majority of Base64 implementations make use of the characters A-Z a-z 0-9 for the first 62 characters of the alphabet, but may differ with the values chosen for the last 2 characters. However, + and / are common.
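Padding is also used: Base64 turns every 3 bytes of input into 4 output characters, so when the input is not a multiple of 3 bytes, the = character is added to make the encoded string a multiple of 4 characters.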
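The alphabet and padding behaviour can be seen with Python's `base64` module. This is a minimal sketch; the input bytes are arbitrary examples picked to show each case.

```python
import base64

# 3 input bytes fit exactly into 4 output characters: no padding needed.
print(base64.b64encode(b"abc"))   # b'YWJj'
# 2 input bytes: one = is added to reach 4 characters.
print(base64.b64encode(b"ab"))    # b'YWI='
# 1 input byte: two = are added.
print(base64.b64encode(b"a"))     # b'YQ=='

# The last two characters of the alphabet depend on the variant.
raw = b"\xfb\xff"
print(base64.b64encode(raw))           # b'+/8='  (standard alphabet)
print(base64.urlsafe_b64encode(raw))   # b'-_8='  (URL-safe alphabet)
```

The URL-safe variant exists because + and / both have special meanings in a URL, which ties back to the URL encoding problem above.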
Note
These characteristics make it easy to spot Base64 data. If we get a string of only alphanumeric characters, with equals symbols at the end, it's a good bet it's been Base64 encoded.
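That rule of thumb can be turned into a rough check. The function below, `looks_like_base64`, is a hypothetical helper written for this example (it is not a standard library function), and it assumes the standard alphabet with + and /. It is only a heuristic: plenty of ordinary words happen to be valid Base64 too.

```python
import base64
import binascii
import re

# Standard alphabet, optionally followed by one or two = padding characters.
_B64_PATTERN = re.compile(r"[A-Za-z0-9+/]+={0,2}")

def looks_like_base64(text: str) -> bool:
    """Heuristic: True if text is well-formed standard Base64."""
    # Encoded output is always a multiple of 4 characters long.
    if len(text) % 4 != 0 or not _B64_PATTERN.fullmatch(text):
        return False
    try:
        base64.b64decode(text, validate=True)
    except binascii.Error:
        return False
    return True

print(looks_like_base64("YWI="))          # True
print(looks_like_base64("Hello world!"))  # False (space and ! are not in the alphabet)
print(looks_like_base64("abcd"))          # True, a false positive for plain text
```

The last line shows why this is a "good bet" rather than a certainty: the check tells us the string could be Base64, not that it is.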
Email Attachments
So back in the days when things like email were invented, we only really expected people to use them for text messages. So a decision was made to base the SMTP protocol on the ASCII character set.
However, a few years later, people started to want to transfer files. The problem is that there is no guarantee that a file will only contain ASCII characters. (There is also the added complexity that there was no guarantee that different operating systems would handle non-ASCII characters in the same way.)
So Mary Ann Horton came up with the idea of uuencode, where groups of bytes are converted into a limited subset of the ASCII characters. This was cool, as it meant that we could encode files using characters that we knew would be handled in a standard way.
Base64 is a derivative of uuencoding, that was chosen to be the standard for MIME.
So when we transfer a file via email, it gets converted to Base64, then transmitted using the same SMTP protocols that have been about since the beginning. It's pretty cool.
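We can watch this conversion happen with Python's standard `email` package. This is a sketch under some assumptions: the subject, filename and the four payload bytes are invented for the example, and no message is actually sent.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Example"  # hypothetical subject
msg.set_content("This is the body of the message.")

# Arbitrary binary data that is not valid ASCII text.
payload = bytes([0, 255, 16, 128])
msg.add_attachment(
    payload,
    maintype="application",
    subtype="octet-stream",
    filename="data.bin",  # hypothetical filename
)

attachment = next(msg.iter_attachments())
# The library picks Base64 as the transfer encoding for binary attachments.
print(attachment["Content-Transfer-Encoding"])   # base64
# This is the text that would actually travel over SMTP.
print(attachment.get_payload().strip())          # AP8QgA==
# Decoding on the receiving side gives back the original bytes.
print(attachment.get_content() == payload)       # True
```

Printing `msg.as_string()` shows the full multipart message, with the same kind of boundary markers and Content-Transfer-Encoding header as the email in the task that follows.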
Task
Using a tool (e.g. CyberChef), can you get the data from the following email?
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=frontier
This is a message with multiple parts in MIME format.
--frontier
Content-Type: text/plain
This is the body of the message.
--frontier
Content-Type: application/octet-stream
Content-Transfer-Encoding: base64
PGh0bWw+CiAgPGhlYWQ+CiAgPC9oZWFkPgogIDxib2R5PgogICAgPHA+MjQ1Q1R7RGVjb2Rl
X0VtYWlsX2I2NDwvcD4KICA8L2JvZHk+CjwvaHRtbD4K
--frontier--
Summary
In this topic we looked at encoding as a form of encryption (or not). Encoding is a common way of transferring data between machines, and allows us to deal with types of data we might not be able to otherwise.
We looked at URL encoding, and Base64 as two examples of encoding systems, and have explored some examples of where they are used.
From a security standpoint, it's important to remember that encoding is not encryption. While the meaning of the data may be obscured, the process for decoding it is also well known, meaning that the original data can be obtained easily.
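A quick sketch of why this matters: the "secret" below (a made-up value for illustration) looks scrambled once encoded, but anyone can reverse it, because no key is involved at any point.

```python
import base64

secret = "password=hunter2"  # hypothetical sensitive value

# Encoding changes how the data looks...
obscured = base64.b64encode(secret.encode("utf-8")).decode("ascii")
print(obscured)

# ...but decoding needs nothing except knowing which encoding was used.
recovered = base64.b64decode(obscured).decode("utf-8")
print(recovered == secret)   # True
```

With real encryption, the second step would fail without the correct key. Here there is no key to keep secret.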
Other Encoding forms #encodings
What other common forms of encoding are there?
Are there any systems you have used before?
Discuss in the feed with the tag #encodings