CS Ramble — Set 1c: variables

This post is part of set 1 of A Ramble Around CS.

Variables

When you say a = 42, what is happening in memory? Well, a is actually just a synonym for a particular piece of your computer’s memory.

Let’s say your program happened to put a in memory location 1234:

address:  1233  1234  1235  1236
value:       0    42     0     0
name:              a

If we don’t care about the exact memory location (and we usually don’t), we can draw that like this:

a: [ 42 ]

Now, when you say b = a (and assuming your program happens to pick 1235 for b), we get:

address:  1233  1234  1235  1236
value:       0    42    42     0
name:              a     b

or, alternatively:

a: [ 42 ]    b: [ 42 ]

As you can see, the program has made a copy of the value of a, and given b the same value.

What if we have a more complicated value? Many languages have “structure” or “struct” types, which can combine pieces of data together. For instance, here’s a snippet of Go:

// Define a “struct” with two fields, both bytes, named `x` and `y`:
type MyThing struct {
  x byte
  y byte
}

// Create a new variable named `a`, of type MyThing, and give it some
// initial values:
a := MyThing{ x: 42, y: 99 }

Now, what we have is this:

address:  1233  1234  1235  1236
value:       0    42    99     0
name:           [    a    ]

or, alternatively:

a: [ x: 42 | y: 99 ]

We can update the individual fields of a:

// Update the `y` field of `a` to 17:
a.y = 17

And now we have:

a: [ x: 42 | y: 17 ]

If we now create a variable b with a’s value:

// Create a new variable named `b`, also of type MyThing, and give it
// the same value as `a`:
b := a

We get:

a: [ x: 42 | y: 17 ]    b: [ x: 42 | y: 17 ]

Once again, our program created a copy of the value of a.
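If you want to convince yourself that b really is an independent copy, here is a minimal sketch you can run. The helper name `copyAndModify` is made up for this illustration; it is not part of the earlier snippets.

```go
package main

import "fmt"

type MyThing struct {
	x byte
	y byte
}

// copyAndModify is a hypothetical helper: it copies a into b, changes
// b.y, and returns both structs so we can compare them afterwards.
func copyAndModify(a MyThing) (MyThing, MyThing) {
	b := a   // b gets its own copy of both fields
	b.y = 23 // only the copy changes
	return a, b
}

func main() {
	a, b := copyAndModify(MyThing{x: 42, y: 17})
	fmt.Println(a.y) // → 17
	fmt.Println(b.y) // → 23
}
```

Changing b leaves a alone, because the two variables occupy separate pieces of memory.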

What if we have wider types inside the struct? No problem, they work just as described in the last post:

// Define a struct with two fields, one a byte, and one a 16-bit
// unsigned integer (which takes up 2 bytes):
type YourThing struct {
  w byte
  z uint16
}

// Create a new variable named `c`, of type YourThing, and give it some
// initial values:
c := YourThing{ w: 42, z: 258 }

Which looks like:

value:       0    42     2     1     0
field:             w  [    z    ]
struct:         [        c        ]
(258 = 2 + 1×256)

Or, alternatively:

c: [ w: 42 | z: 258 ]
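The byte-level picture stores 258 as the bytes 2 and 1, low byte first (so-called "little-endian" order). Here is a small sketch that reproduces that split using Go's standard `encoding/binary` package. The helper name `littleEndianBytes` is an assumption for illustration. Note also that the diagram is simplified: a real compiler may insert padding between `w` and `z`, which this sketch does not try to show.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// littleEndianBytes is a hypothetical helper that splits a 16-bit value
// into two bytes, least significant byte first.
func littleEndianBytes(v uint16) [2]byte {
	var buf [2]byte
	binary.LittleEndian.PutUint16(buf[:], v)
	return buf
}

func main() {
	b := littleEndianBytes(258)
	fmt.Println(b[0], b[1]) // → 2 1

	// Reassemble the value the same way the diagram does: 2 + 1×256.
	fmt.Println(int(b[0]) + int(b[1])*256) // → 258
}
```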

Pointers

To talk about text, we first need to introduce something called a “pointer”. You may have run into that term before, and it may have been confusing or intimidating.

Actually, it’s a way of talking about something we’ve already seen.

Remember when we said we’d put 42 in a, and that our program happened to put a in location 1234 in memory? Here’s the diagram again:

address:  1233  1234  1235  1236
value:       0    42     0     0
name:              a

Well, 1234 is called the “address” of a. If you put that in a variable, say b, then we say b is a “pointer” to a:

address:  1233  1234  1235  1236
value:       0    42   210     4
name:              a  [    b    ]
(1234 = 210 + 4×256)

In Go, you’d do that like this:

a := 42
b := &a

// `b` now points to `a`. Its value is the address at which a's value
// is stored. In Go (and most languages), the computer still keeps
// track of the fact that it is a "pointer to integer", not just a
// generic "pointer", so that it knows what to do with it if you, say,
// print the value:

fmt.Println(a)  // → 42
fmt.Println(*b) // → 42, *b is "the value that `b` points to" (`a`)
fmt.Println(b)  // → 0xc00009c000, when I ran it; of course it varies
fmt.Println(&b) // → 0xc000094018, this is where `b` is stored

(live example for you to play with - click “run”)

That last value—0xc00009c000—brings up a good point. The example we showed above, with b getting a value of 1234, that fits in just two bytes, is a little outdated. Two bytes was enough to store addresses on machines like the Apple II or Commodore 64, which had a chip that maxed out at 64k (256×256 = 65536 bytes). Nowadays, your computer and operating system are in 32- or 64-bit mode, so pointers take 32 or 64 bits (4 or 8 bytes) to store. So the example above would more accurately look something like this:

address:  1233  1234  1235  1236  1237  1238  1239  1240  1241  1242  1243
value:       0    42   210     4     0     0     0     0     0     0     0
name:              a  [                     b                      ]
(&a = 1234 = 210 + 4×256 + 0×65536 + 0×16777216 + 0× ...)
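If you're curious how big pointers are on your own machine, here is a quick sketch using `unsafe.Sizeof` from Go's standard library. The helper name `pointerSize` is made up for this example.

```go
package main

import (
	"fmt"
	"unsafe"
)

// pointerSize is a hypothetical helper that reports how many bytes a
// pointer to an int occupies on the machine running the program.
func pointerSize() uintptr {
	a := 42
	b := &a
	return unsafe.Sizeof(b)
}

func main() {
	// Expect 8 on a 64-bit system and 4 on a 32-bit one.
	fmt.Println(pointerSize())
}
```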

We don’t usually care about exactly how big pointers are (unless we’re in a phase of caring how big everything is), and often draw that last picture like this:

a: [ 42 ] ⟵ b: [ • ]

Some ways of talking about pointers. We “dereference” b to get the value it points to, 42. In Go, as mentioned above, and in fact in many languages that descend from C, this is written *b. Conversely, we can “get a reference to a” (a’s address in memory) by writing &a. People often talk about “following a pointer”, which means exactly what it sounds like: just follow the arrow!
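Dereferencing works for writing as well as reading. The sketch below, with a made-up helper called `setThrough`, follows a pointer and stores a new value at the other end, which changes the variable being pointed to.

```go
package main

import "fmt"

// setThrough is a hypothetical helper that stores v in whatever int
// the pointer p points to.
func setThrough(p *int, v int) {
	*p = v // dereference p and overwrite the value it points to
}

func main() {
	a := 42
	b := &a // get a reference to a

	fmt.Println(*b) // → 42

	setThrough(b, 7)
	fmt.Println(a)  // → 7
	fmt.Println(*b) // → 7
}
```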

You actually now know enough to understand what a “linked list” is!

a: [ • ]⟶ [ data: 42 | next: • ]⟶ [ data: 17 | next: • ]⟶ [ data: 23 | next: • ]⟶‖

That last pointer is a “null pointer”[1]: by convention, a pointer with the value 0, also called null, points to nothing, and is drawn as ⟶‖

Let’s create that linked list in Go:

// Each Node has some data, and a pointer to the next Node (possibly
// null)
type Node struct {
	data int
	next *Node
}

// Construct the linked list in the diagram above, the laborious way:
node1 := Node{data: 42}
node2 := Node{data: 17}
node3 := Node{data: 23}
node1.next = &node2
node2.next = &node3
node3.next = nil // Go spells "null" as "nil".
a := &node1

// Construct the linked list in a more concise way. `&Node{}` is Go
// shorthand for creating a struct value and then immediately taking
// its address.
a = &Node{
	data: 42,
	next: &Node{
		data: 17,
		next: &Node{
			data: 23,
			next: nil,
		},
	},
}

// Print the linked list by repeatedly following the `next` pointer
// until we get a null pointer.
for a != nil {
	fmt.Println(a.data)
	a = a.next
}

// Prints:
// 42
// 17
// 23

(playground version)
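One thing to notice about the printing loop: it moves a itself along the list, so once it finishes, a is nil and we've lost our handle on the first node. A possible alternative, sketched below, walks the list with a separate local variable. The helper name `values` is an assumption for this example.

```go
package main

import "fmt"

type Node struct {
	data int
	next *Node
}

// values is a hypothetical helper that follows next pointers from head
// and collects each node's data. It walks with a local variable, so the
// caller's pointer to the head of the list is left alone.
func values(head *Node) []int {
	var out []int
	for n := head; n != nil; n = n.next {
		out = append(out, n.data)
	}
	return out
}

func main() {
	a := &Node{data: 42, next: &Node{data: 17, next: &Node{data: 23}}}

	fmt.Println(values(a)) // → [42 17 23]
	fmt.Println(a.data)    // → 42, a still points at the first node
}
```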

Values vs References

One last thing for this part. Remember our earlier MyThing example?

// Define a “struct” with two fields, both bytes, named `x` and `y`:
type MyThing struct {
  x byte
  y byte
}

// Create a new variable named `a`, of type MyThing, and give it some
// initial values:
a := MyThing{ x: 42, y: 17 }
b := a

which copied a’s value into b, giving us:

a: [ x: 42 | y: 17 ]    b: [ x: 42 | y: 17 ]

What if a and b were pointers instead?

// Create a new variable named `a`, of type pointer-to-MyThing, and
// give it some initial values:
a := &MyThing{ x: 42, y: 17 }
// Copy a's value into b:
b := a

Now, a’s value is a pointer: a 64-bit number, the address of that MyThing structure. So copying that value into b makes b point to the same structure!

a: [ • ]⟶ [ x: 42 | y: 17 ] ⟵[ • ] :b

If you now change part of what b points to, you’re changing the thing a points to too, since they point to the same value!

// In Go, you can refer to the `y` field of `b` with `b.y`, whether
// `b` is a struct value, or a pointer to a struct value. In C, you'd
// need to use `b->y` if `b` was a pointer.
b.y = 23

fmt.Println(b.y) // 23
fmt.Println(a.y) // 23
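The same distinction shows up when you hand a value to a function: Go copies the argument, so a struct argument is a fresh copy, while a pointer argument is a copy of the address. Here is a sketch, with two made-up helpers, of how that plays out.

```go
package main

import "fmt"

type MyThing struct {
	x byte
	y byte
}

// setYOnCopy receives a copy of the struct, so the caller's value is
// unaffected. (Hypothetical helper.)
func setYOnCopy(t MyThing, y byte) {
	t.y = y
}

// setYThroughPointer receives a copy of the pointer, which still points
// at the caller's struct. (Hypothetical helper.)
func setYThroughPointer(t *MyThing, y byte) {
	t.y = y
}

func main() {
	a := MyThing{x: 42, y: 17}

	setYOnCopy(a, 23)
	fmt.Println(a.y) // → 17

	setYThroughPointer(&a, 23)
	fmt.Println(a.y) // → 23
}
```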

Text

We finally have enough background to write about how text is stored. Most languages have a string type, which stores pieces of text. A common representation for a string, under the covers, is a pointer to an area of memory that holds the text, plus an integer holding the length in bytes.[2]

The actual characters are stored in memory as we described in the last post:

a := "Hello, world"
a: [ str: • | len: 12 ]
str ⟶ 'H' 'e' 'l' 'l' 'o' ',' ' ' 'w' 'o' 'r' 'l' 'd'
bytes: 72 101 108 108 111 44 32 119 111 114 108 100
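You can peek at those bytes from Go itself. In this sketch, `byteValues` is a hypothetical helper that just converts a string to its underlying bytes.

```go
package main

import "fmt"

// byteValues is a hypothetical helper that returns the bytes that make
// up the string s.
func byteValues(s string) []byte {
	return []byte(s)
}

func main() {
	a := "Hello, world"

	fmt.Println(len(a))        // → 12, the length in bytes
	fmt.Println(byteValues(a)) // → [72 101 108 108 111 44 32 119 111 114 108 100]
	fmt.Println(a[0])          // → 72, the byte for 'H'
}
```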

How about letters and words in foreign scripts? Well, for that, you need a “character encoding”, a way of giving each character a number and a representation as bytes. The most common one used to be ASCII[3], but the modern answer to the problem is Unicode, an ambitious scheme to assign a unique number to every letter in every alphabet of every written language. And a bunch of emoji. And other things.

Unicode takes care of giving each character/symbol a unique number[4], but the numbers get pretty big, and you still have to figure out how to represent them in just little old bytes. For that you need an encoding. Unless you have good reasons and know what you’re doing, you’ll mostly want to use UTF-8, which leaves the lower 128 ASCII characters alone, and then uses a clever encoding to encode all the other Unicode code points. The origin story is fun.
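Go strings are conventionally UTF-8, so you can watch the encoding at work. The sketch below compares the number of bytes with the number of code points for a few strings, using the standard `unicode/utf8` package. The helper name `describe` is made up, and the non-ASCII characters are written as escapes so the source file's own encoding doesn't matter.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper returning the number of bytes and
// the number of Unicode code points in s.
func describe(s string) (nBytes, nCodePoints int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	fmt.Println(describe("A"))          // → 1 1 (plain ASCII stays one byte)
	fmt.Println(describe("\u00e9"))     // → 2 1 (é)
	fmt.Println(describe("\u4e16"))     // → 3 1 (世)
	fmt.Println(describe("\U0001F600")) // → 4 1 (an emoji)

	fmt.Println([]byte("\u00e9")) // → [195 169]
}
```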

Summary

Well, that’s been a fun ramble; thanks for sticking with us! Remember, you can start at any of the concepts we’ve discussed here, and explore as deeply as you want. There is nothing wrong with not knowing, questions, disinterest, or way too much interest, and no such thing as bad questions or things you’re “supposed” to know.

A couple of interesting links to leave you with, since they contain interesting and clear discussions of topics we’ve covered here:

 

Up next are the questions for set 2.


  1. Many consider null pointers to be a big mistake, and many modern languages make them impossible.

  2. At least, that’s how Go does it. In C, it’s just a pointer, and you count on a 0 byte to terminate the string. If that sounds risky, like you could miss the 0, and just run off into some other part of memory, well, yep!

  3. On Unix systems, like Linux and Mac OS, you can type man ascii in a terminal window to see the ASCII values. The bottom set shows them in decimal, and is probably what you want. We’ll get into Octal and Hexadecimal and all that some other time.

  4. Actually, it gives each code point a number, and then one or more of those combine to make one “user-perceived character”, as Wikipedia puts it. It’s complicated. This post by Manish Goregaokar goes into lots of details.