Representation of Data

As the saying goes, "computers are all just a bunch of 1's and 0's", but those ones and zeros can represent everything you can imagine a computer doing. Raw 1's and 0's are unreadable for most people, though, and you need a lot of them to represent anything even remotely complex. This section covers the ways data is represented, a rough idea of how to tell the different representations apart, and how we can work with them using Python.

Bytes and bits

A byte, for the unenlightened, is a collection of 8 bits that represents an integer value from 0 to 255. One or more of those bytes then map to a single character in a character map, depending on the encoding you are using. Common character sets and encodings are ASCII, Unicode and UTF-8, which we'll cover below. When data is stored in a system it is generally stored as bytes and then decoded so it can be displayed in the console. In Python 2, string objects were byte strings by default. In Python 3, however, strings are Unicode by default and have to be encoded to get bytes. Many APIs work with bytes rather than strings, so they will often only accept byte string inputs. Here is a short Python program that shows the two types side by side:

example-1.py
unicode_char_string = "This string is unicode!"
bytes_char_string = b"This string is a bytestring!"

print(type(unicode_char_string))
print(type(bytes_char_string))

print(unicode_char_string)
print(bytes_char_string)   # print() shows a bytes object in its literal form, with the b'...' prefix.


# To show you what raw bytes look like, we can create a bytes object directly.
# bytes(size) gives a bytes object holding `size` zero bytes:
size = 3
zero_bytes = bytes(size)
print(zero_bytes)

Run it and see for yourself!

Bytes can be converted to strings and strings to bytes, and the same data can be represented in many other ways too. It's worth noting that encoding is not the same as encryption. Representing data in a different way is at most a form of obfuscation, not a control mechanism or security barrier: anyone can reverse it with Python's built-in functions, no key required.
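As a minimal sketch of that conversion (the variable names here are only illustrative), a string can be turned into bytes with .encode() and back again with .decode():

```python
text = "This string is unicode!"

# str -> bytes: encode() uses UTF-8 when no encoding is given
as_bytes = text.encode("utf-8")

# bytes -> str: decode() reverses the step with the same encoding
back_to_text = as_bytes.decode("utf-8")

print(type(as_bytes), as_bytes)
print(type(back_to_text), back_to_text)

# Each byte is just an integer from 0 to 255
first_bytes = list(as_bytes[:4])
print(first_bytes)  # [84, 104, 105, 115]
```

Nothing secret is needed to undo the encoding, which is why it offers no protection on its own.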

Binary

Hopefully by now you all know what binary looks like and how to count in binary. Refreshers are easy enough to find online, but a favourite video of mine is here; it covers how to count and do simple arithmetic in binary. To convert an integer to its binary representation in Python you simply need to use the bin() function, which returns a string such as '0b101'. You can't pass a string to bin() directly, however. You first have to give Python an integer to work with, which we can do by iterating through the string and converting each of the characters separately. Consider this code:

example-2.py
string = "We will convert this string to binary!"

# We can convert a char into an int with ord() and 
# then convert the int into a binary number with bin()
temp = []
for i in string:
    temp.append(bin(ord(i)))    # i gets turned into an int and then into binary then added to temp with .append()
print(temp) 
Output
['0b1010111', '0b1100101', '0b100000', '0b1110111', '0b1101001', '0b1101100', '0b1101100', '0b100000', '0b1100011', '0b1101111', '0b1101110', '0b1110110', '0b1100101', '0b1110010', '0b1110100', '0b100000', '0b1110100', '0b1101000', '0b1101001', '0b1110011', '0b100000', '0b1110011', '0b1110100', '0b1110010', '0b1101001', '0b1101110', '0b1100111', '0b100000', '0b1110100', '0b1101111', '0b100000', '0b1100010', '0b1101001', '0b1101110', '0b1100001', '0b1110010', '0b1111001', '0b100001']

Example

In this example we can see that we need to iterate through the string, change each character into its integer representation with ord() and then change that into its binary representation with bin(). There are other ways to do it, but this is one of the easiest ways to understand it all.
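One of those other ways, sketched here with a shorter sample string, is format() with the "08b" format code. It pads every value to a full 8 bits and drops the 0b prefix, and int(bits, 2) with chr() takes you back to the original text:

```python
string = "Hi!"

# format(n, "08b") gives the binary digits of n, zero-padded to 8 places
bits = [format(ord(ch), "08b") for ch in string]
print(bits)  # ['01001000', '01101001', '00100001']

# Reverse the process: binary text -> int -> character
restored = "".join(chr(int(b, 2)) for b in bits)
print(restored)  # Hi!
```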

Bases

Binary is also known as a base 2 counting system, because there are only 2 symbols to represent numbers for counting. We usually use a base 10 system, with ten symbols that represent values for counting: the digits 0 to 9. In computing, we also commonly use base 8 (also known as octal) and base 16 (also known as hexadecimal or hex). There are lots of other bases that you can use, and they're worth a look if you get a free moment because some of them are very interesting, but base 8 and base 16 are the common ones you'll see in computing so we'll cover those.
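As a quick sketch of moving between bases in Python, the built-ins bin(), oct() and hex() turn an integer into text, and int() with a base argument goes the other way. The value 157 is just an arbitrary example:

```python
value = 157  # arbitrary example value

# Integer -> text in another base
as_binary = bin(value)
as_octal = oct(value)
as_hex = hex(value)
print(as_binary, as_octal, as_hex)  # 0b10011101 0o235 0x9d

# Text in some base -> integer; the second argument is the base (2 to 36)
print(int("10011101", 2))  # 157
print(int("235", 8))       # 157
print(int("9d", 16))       # 157

# Less common bases work too: base 36 uses 0-9 followed by a-z
base36_value = int("z", 36)
print(base36_value)  # 35
```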

Base 8

Octal is the counting system that uses only 8 symbols for counting, the digits 0 to 7. We use it in computing to shorten binary numbers by grouping their bits into groups of 3. To convert a binary number to an octal number, start with any binary number:

10011101

Then, starting from the right, split it into groups of 3, adding any 0's you need on the left hand side to make the last group up to a set of 3.

010-011-101

Then convert each set of 3 into its respective decimal number.

010	011	101
2	3	5

So the final octal number for that binary number is 235. You can find lots of videos online with lots more examples if that explanation didn't quite work for you. On a Linux system, commands like chmod use octal to set read/write/execute permissions on files. To get the octal representation of an integer in Python, all we need to do is use the oct() function on it.
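The grouping steps above can be sketched in a few lines. The helper name binary_to_octal is made up for this illustration; in practice oct() does the job, and the sketch only checks the manual method against it:

```python
def binary_to_octal(bits: str) -> str:
    """Convert a string of binary digits to octal by grouping bits in 3's."""
    # Pad on the left with 0's so the length is a multiple of 3
    padded_length = -(-len(bits) // 3) * 3  # round up to a multiple of 3
    padded = bits.zfill(padded_length)
    # Split into groups of 3 and convert each group to one octal digit
    groups = [padded[i:i + 3] for i in range(0, len(padded), 3)]
    return "".join(str(int(group, 2)) for group in groups)


manual = binary_to_octal("10011101")
builtin = oct(0b10011101)
print(manual)   # 235
print(builtin)  # 0o235

# A chmod-style permission value: each octal digit is one group of rwx bits
permissions = 0o644
print(bin(permissions))  # 0b110100100
```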

Base 16

Base 16, or hexadecimal, generally referred to as 'hex', is another way to shorten a binary number. As bytes are 8 bits (binary digits) long, they're quite unruly to deal with in code or on paper, and it's much easier to make a mistake when you have large numbers of them. Splitting 8 bits into two separate 4 bit numbers and representing each with a single symbol is a great way to solve that, and exactly what we do in base 16. To convert a binary number into base 16, take any 8 bit binary number and split it into two chunks:

1001-1101

Now we convert the value of each 4 bit chunk into a single hex digit. Here is a conversion table to help:

Decimal	Hex	Binary	Decimal	Hex	Binary
0	0	0000	8	8	1000
1	1	0001	9	9	1001
2	2	0010	10	A	1010
3	3	0011	11	B	1011
4	4	0100	12	C	1100
5	5	0101	13	D	1101
6	6	0110	14	E	1110
7	7	0111	15	F	1111

So for our example we can easily change the two chunks of the binary number into hex:

1001-1101 = 9-D = 9D

For another guide on how to count and convert in hex, there is a very good video here. In Python, an integer can be converted into its hexadecimal representation using the function hex().
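As a small sketch of the same split in code, a shift and a mask pull the two 4 bit chunks out of a byte, and each chunk indexes into the 16 hex symbols. The variable names are illustrative only:

```python
value = 0b10011101  # the example byte from above

high_chunk = value >> 4      # top 4 bits:    1001
low_chunk = value & 0b1111   # bottom 4 bits: 1101

digits = "0123456789ABCDEF"
hex_text = digits[high_chunk] + digits[low_chunk]
print(hex_text)    # 9D

# The built-in gives the same digits, lowercase and with a 0x prefix
print(hex(value))  # 0x9d
```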

ASCII, Unicode and UTF-8

We touched on this above, but single characters can also be changed to integers using Python's built-in ord() function. But where does the number come from? Each character's integer comes from its character map (in Python 3, ord() returns the Unicode code point), and the bytes used to store it depend on the encoding. In Python, we can take a string and use its .encode() method to encode it into bytes using the encoding name we pass as an argument; the .decode() method of a bytes object goes the other way.

example-3.py
#!/usr/bin/env python3

# First we have a long string of hex values that we split
# into a list of strings, each of which can be converted
# to an integer and then to a character.
bytes_in_a_string = "54 68 69 73 20 69 73 20 61 20 73 \
    74 72 69 6e 67 20 74 6f 20 63 6f 6e 76 65 72 74 21"
characters = bytes_in_a_string.split()
print(characters)

# Using a list comprehension, we can overwrite the list
# by looping through it and changing each base 16 hex value to
# its integer equivalent using int(i, 16), and then using
# chr() we can change it into its Unicode character
# equivalent.
characters = [chr(int(i, 16)) for i in characters]
print(characters)

# The next line joins the list we have in characters
# back into a string to be printed.
fixed = "".join(characters)
print(fixed) 
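For comparison, here is a shorter sketch of the same idea using bytes.fromhex(), which parses pairs of hex digits straight into a bytes object that can then be decoded. The sample is shortened for illustration:

```python
hex_text = "54 68 69 73 20 69 73"

# fromhex() skips the spaces and reads two hex digits per byte
raw = bytes.fromhex(hex_text)
print(raw)  # b'This is'

# Decode the bytes into a regular string
decoded = raw.decode("utf-8")
print(decoded)  # This is

# ord() and chr() are the single-character equivalents
print(ord("T"), chr(0x54))  # 84 T

# bytes.hex() goes back the other way
print(raw.hex())  # 54686973206973
```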

ASCII

Originally, the American Standard Code for Information Interchange (ASCII) was derived from telegraphic codes for use in digital communications. The first edition of the character-encoding scheme was released in 1963. It defines 128 characters, many of which are unprintable control characters or are obsolete now, alongside the English alphabet, digits and punctuation. These characters were mapped to integer values to make them easy to represent in a computer, and set a standard for digital printers everywhere. The obvious limitation of such a character map is that it doesn't have many characters overall, and it only supports the basic Latin letters used in English. ASCII has evolved some and is still used today, but it still really only supports those characters.
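That limitation is easy to see from Python. In this small sketch (the sample words are arbitrary), encoding plain English text as ASCII works, while a single accented letter raises an error:

```python
# Every ASCII character has a code below 128
ascii_bytes = "hello".encode("ascii")
print(ascii_bytes)  # b'hello'
print(all(ord(ch) < 128 for ch in "hello"))  # True

# An accented letter is outside the 128-character map
try:
    "caf\u00e9".encode("ascii")  # "café"
    failed = False
except UnicodeEncodeError as error:
    failed = True
    print("not ASCII:", error.reason)
```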

Unicode

Unicode was started in the 80's to create a mapping of characters to integers for far more writing systems than ASCII could handle, including Chinese, Japanese, Arabic and Cyrillic, to make many more characters and languages available to be printed and displayed. ASCII itself only covers 128 characters; the various "extended ASCII" encodings stretch that to 256 by using the eighth bit. To make Unicode backwards compatible, all of the integer values for the character mapping in ASCII were copied over to Unicode, and a comprehensive list of Arabic, Chinese, Cyrillic, emoji, Greek, Korean, Japanese, and lots more were added. The list is so comprehensive that a fixed-width encoding such as UTF-32 uses 32 bits for every character. Unicode has room for over 1.1 million code points in total (only a fraction are assigned so far), so a file stored with 32 bits per character can end up 4 times the size of the same text stored at one byte per character. As of version 3.0, Python strings are Unicode and source files are assumed to be UTF-8 encoded unless otherwise specified, which means you can name your functions and variables in Chinese characters if you really wanted to, but keywords and built-in functions and attributes still need their specific English names, not a translated equivalent.
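To make the idea of code points concrete, here is a rough sketch with four arbitrary sample characters. It prints each code point and the size of its fixed-width UTF-32 form; the "utf-32-be" codec is used so that no byte order mark is added to the count:

```python
samples = ["A", "é", "€", "😀"]

for ch in samples:
    code_point = ord(ch)
    fixed_width = ch.encode("utf-32-be")  # big-endian, no byte order mark
    # Prints the code point in U+XXXX form and the encoded size in bytes
    print(f"U+{code_point:04X}", code_point, len(fixed_width))

# chr() is the inverse of ord()
euro = chr(0x20AC)
print(euro == "€")  # True

# Code points stop at 0x10FFFF; anything above is rejected
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
print(out_of_range_rejected)  # True
```

Every sample takes 4 bytes here, including the plain letter A.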

UTF-8

In response to fixed-width Unicode being so large, the world needed something more transmittable than 32 bit long characters. After some playing around with 16 bit character encodings that no one liked, UTF-8 was developed as a variable-length encoding that uses between 1 and 4 bytes per character, with ASCII characters taking a single byte. For some extra details of how each character range is encoded, you can find some more info here. UTF-8 has some other advantages for transmission: because it is byte oriented and the start of each character is recognisable, it is easy to determine where bytes of data are corrupted or lost, and easy for other systems to interpret with their own methods of decoding. Because it is compact for mostly-ASCII text, can encode every Unicode character, and is backwards compatible with ASCII, it's become the most popular character encoding on the internet! Python's str.encode() and bytes.decode() default to UTF-8.
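The variable length is easy to check for yourself. This sketch encodes four arbitrary sample characters and prints how many bytes each one needs in UTF-8:

```python
samples = ["A", "é", "€", "😀"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Code point, number of bytes, and the bytes themselves in hex
    print(f"U+{ord(ch):04X}", len(encoded), encoded.hex())

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]

# Plain ASCII text is byte-for-byte identical in UTF-8
same_as_ascii = "plain text".encode("utf-8") == "plain text".encode("ascii")
print(same_as_ascii)  # True
```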

Using these data types in Python

See the code example below:

example-4.py
#!/usr/bin/env python3
import codecs

# Declare a bunch of strings to show types and how encoding works.
unicode_string = "This is how to declare a unicode string, they're default in python 3."
byte_string = b"This is how to declare a byte string in python"
decoded_string = b"This is how to decode a bytestring into a regular string".decode("utf-8")
encoded_string = "This is how to encode a regular string into a bytestring".encode()

# Print out types and messages to show byte and string objects
print(type(unicode_string))
print(unicode_string)
print(type(byte_string))
print(byte_string)
print(type(decoded_string))
print(decoded_string)
print(type(encoded_string))
print(encoded_string)


## This is how you can specify base64 encoding. Which codec you use
## determines whether it expects a bytes object or a string as the
## source to be encoded. All of the ones I tested output a bytes object
## after execution regardless.
try:
    print(codecs.encode(byte_string, encoding="base64", errors="strict").decode())
except UnicodeError:
    print("contains unrecognized characters or problem with encoding.")


try:
    utf_7_string = "This string is utf-7 encoded"
    print(codecs.encode(utf_7_string, encoding="utf-7", errors="strict").decode())
except UnicodeError:
    print("contains unrecognized characters or problem with encoding.")

try:
    # This does not raise an error. UTF-32 stores every character in
    # 4 bytes, so the output starts with a byte order mark and is
    # mostly padding bytes (\x00), which makes it hard to read.
    utf_32_string = "This string is utf-32 encoded"
    print(codecs.encode(utf_32_string, encoding="utf-32", errors="ignore"))
except UnicodeError:
    print("contains unrecognized characters or problem with encoding.")

Example

In Python, if you use the .encode() method on a string it will encode it into a byte string ready for transmission. If you use the .decode() method on a byte string then you can turn the bytes back into a string to make changes to it. In our example we go through a few of these to show this in action, but we also introduce a way to directly encode byte data using the codecs module from Python's standard library. If you're unsure about the try and except block, then you should refer to our testing section.

Other representations

Base64

Base64 is another type of encoding that you'll see quite a lot as hackers, especially during CTF (capture the flag) events, where flags and clues are often base64 encoded. Like the bases above, it is a way of representing binary data as text, this time using 64 symbols (A-Z, a-z, 0-9, + and /) with each symbol standing for 6 bits. Unlike the other bases, up to two equals (=) signs are used to pad the string out so that the final number of characters is divisible by 4. The example above shows it in use.
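The padding is easiest to see with the standard library's base64 module. This is a small sketch with made-up sample inputs of 3, 4 and 5 bytes:

```python
import base64

# Three input bytes fit exactly into four base64 characters;
# anything left over is padded out with = signs.
for raw in [b"fla", b"flag", b"flag{"]:
    encoded = base64.b64encode(raw)
    padding = encoded.count(b"=")
    print(raw, encoded, "padding:", padding)

# Decoding reverses it, with no key or secret involved
decoded = base64.b64decode(b"ZmxhZw==")
print(decoded)  # b'flag'
```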