A linear search works, and is the best that we can do (in the GCSE) if our data is unsorted. However, if we have sorted data then a binary search is (on average) much more efficient. This is because instead of starting with the first item, checking if it's the one we want, and then proceeding onto the next item, we can instead start in the middle, see if the item we want is higher or lower than the element in the middle, and then use that to refine our search.

Let's assume we have this data:


[
    {
        "name": "Bertrand",
        "photo_count": 8846
    },
    {
        "name": "Boyd",
        "photo_count": 939
    },
    {
        "name": "Carlos",
        "photo_count": 204
    },
    {
        "name": "Cielo",
        "photo_count": 26
    },
    {
        "name": "Clementina",
        "photo_count": 3111
    },
    {
        "name": "Conor",
        "photo_count": 378
    },
    {
        "name": "Demarcus",
        "photo_count": 42923
    },
    {
        "name": "Eli",
        "photo_count": 290
    },
    {
        "name": "Enrico",
        "photo_count": 338
    },
    {
        "name": "Eula",
        "photo_count": 853
    },
    {
        "name": "Fernando",
        "photo_count": 84567
    },
    {
        "name": "Jovanny",
        "photo_count": 17909
    },
    {
        "name": "Maya",
        "photo_count": 3372
    },
    {
        "name": "Minnie",
        "photo_count": 13
    },
    {
        "name": "Morton",
        "photo_count": 16658
    },
    {
        "name": "Raymond",
        "photo_count": 22
    },
    {
        "name": "Saul",
        "photo_count": 3065
    },
    {
        "name": "Selena",
        "photo_count": 314
    },
    {
        "name": "Sigrid",
        "photo_count": 40
    },
    {
        "name": "Thora",
        "photo_count": 170
    }
]


number of items = 20

We want find out how many photos "Morton" has. This is sorted data (in order to determine the order of the names, we use a similar algorithm to that of the dictionary – start with the first letter and see which one comes first in the alphabet; if they are the same letter, move onto the next letter and repeat) and all the names are unique (so we cannot have a collision).

The mechanism for a binary search is not too complicated:

pick the middle item
check if the middle item is greater or smaller than the item we're looking for (if it is, then we stop searching and return the value)
pick either the lower half or the upper half of the list (lower half if the middle item is higher than the item we want, upper half if the middle item is lower than the one we want)
select the middle item of that list
repeat (unless we've reached the end)

Consider how we'd do that for the list above:

we start by selecting the middle item, i.e. "Eula" (don't forget – list indices start from zero in most programming languages). If we have an even number of items, we don't actually have a middle of the list – both the 10th and 11th (not zero-indexed) items of the list could be the middle. Here I've chosen to always use the one that comes first (so we round down from the median, which is the (n+1)/2th item).
We compare "Eula" to "Morton". As "Eula" begins with "E" and "Morgan" with "M" we conclude that "Eula" comes before "Morton". This means that we can ascertain that "Morton" is definitely in the top half of the list.
We now have this data


[
  {
    "name": "Fernando",
    "photo_count": 84567
  },
  {
    "name": "Jovanny",
    "photo_count": 17909
  },
  {
    "name": "Maya",
    "photo_count": 3372
  },
  {
    "name": "Minnie",
    "photo_count": 13
  },
  {
    "name": "Morton",
    "photo_count": 16658
  },
  {
    "name": "Raymond",
    "photo_count": 22
  },
  {
    "name": "Saul",
    "photo_count": 3065
  },
  {
    "name": "Selena",
    "photo_count": 314
  },
  {
    "name": "Sigrid",
    "photo_count": 40
  },
  {
    "name": "Thora",
    "photo_count": 170
  }
]

We pick the middle item in this list – i.e. "Morton" (there ten items, so I've picked the fifth one – but we could have picked the sixth)
We have a match!

One thing that is interesting to consider is how many items we have to look up with a binary vs a linear search.

With a binary search we had to look at two items
With a linear search we would have had to look at 15 items (assuming we started from the front – six if we started from the back)

Yes, I know that this list turned out to be quite convenient, which is why in general we look at the average performance of algorithms and their worst-case performance when we select them. We can consider the differences between these two algorithms mathematically.

For a binary search of a list containing 20 elements (assuming we always pick the value that comes first) once we pick element 9, we have conducted one lookup. Assuming the worse case (i.e. that we have to search the 10 element list, from 11-20 inclusive) we then select element five. In this case, we have done two lookups. Assuming that we then have to search the larger list again, we now search a five element list. We pick the middle item of that list – so we have now made three lookups. In the worst case, we now search a list of two items, so by the time we have searched the whole (sorted) list, we have looked at a maximum of five items.

In contrast, in the worst case linear search we have to look at an entire 20 items. Think of all the power savings you could get with a binary search!

You can also have a look at some further information on binary vs linear searching.

The exam board *love* to ask questions on this.

Broadly speaking, if you can use one (i.e. your data is sorted) you should prefer a binary search. If your data is not sorted, then you need to use a linear search.

Why prefere a binary search? It's much more efficient because it has to look at a lot fewer of the elements (consider this graph showing the average running times of a binary search vs a linear search as the number of elements increases).

There are three sorting algorithms that you need to be aware of, and although they might not exactly make your soul sing with joy, they do contain a lot of interesting computer science priniciples that you can and should apply to other 'problems' you find interesting.

Linear search is really simple: keep looking until we find the right item.


function is_the_one_we_want(item):
    # this would be a user-defined function that takes an item
    # and returns `True` if and only if this is the one we want,
    # and `False` otherwise
    False

function find_item():
    # we look through each item
    for item in list:
        # check if this item is the one
        if is_the_one_we_want(item):
            # return from this function, aborting the loop
            return Item
    # once we've gone through the list, if we haven't found the item
    # then it cannot be in the list
    return None

That's it!

The main problem with a linear search is that it is not very fast (compared to other search algorithms that we can use). This is because we have to look over every single item in the dataset – sometimes this is not too bad (if the data happens to be near where we start), but in the worst case we have to look over every item in the dataset.

The other search algorithm in the GSCE is a binary search which is more efficient, but can only be used when the data is sorted (whereas a linear search does not require this.)

A search algorithm allows you to find a specific item in a data structure. For the GCSE, this data structure is usually a list, but one can also use other data structures (such as trees).

For example, if you are building a social media platform, you might have some data represented a bit like this:


[
    {
        user_id: 1, 
        name: "Alice", 
        file_locations: ["files/rocket.jpg", "files/solving_equations.mp4", ...]
    },
    {
        user_id: 2, 
        name: "Bob", 
        video_locations: ["files/dog.mp4", "files/bear.png", ...]
    },
    {
        user_id: 3, 
        name: "Charlie", 
        file_locations: ["files/one.png", "files/two.png", ...]
    },
    ...
]

(where ... means "and so on")

Which is all very well, until we want to get a list of Alice's videos because one of their friends has submitted a request to the server (assuming that we have checked that they have Alice's permission!!!). Here it is easy to see that Alice is the first person in our list, and therefore the problem is trivial. However, this is not always the case! Imagine you have a million users, and you want to find "Quentin" – in this case we have no idea where in the list Quentin is!

This is what a search algorithm is for – we might want to find the position of an item in a list (which is the, slightly boring, example that most textbooks give), or (much more frequently) we might want to find an item in our records based on a "key" (e.g. the name of the person).

There are a few search algorithms you should know about:

The bubble sort runs on a quite a simple idea. Pick the first two items, use them to form a "bubble" and then sort the items in the bubble.

Imagine that you have a list a bit like this:


[12, 1, 8, 9, 7, 5]

To sort this using a bubble sort:

take the first two items (i.e. [12, 1]) and sort them – 12 > 1 so we swap them, such that our list now looks like:


[1, 12, 8, 9, 7, 5]

Take the next two items (i.e. [12, 8]) and sort them – 12 is also greater than 8, so we swap them (again) such that our list now looks like:


[1, 8, 12, 9, 7, 5]

We do the next thing for all the other elements – try doing it yourself. The end result should look like:


[1, 8, 9, 7, 5, 12]

We now return to step one (except that we use the list from step 3, in stead of the one we started with). We keep going until the list is sorted. One way of doing this is to keep a boolean variable denoting whether or not we swapped any values. If we didn't swap any values, then the list must be sorted, so we stop.

In general: merge sort is faster than insertion sort which is faster than bubble sort.

Yeah this one is real simple.

Steps:

pick up the next item in the list
look at all the previous elements in the list, and move it to the appropriate position
repeat until we've done all the items

e.g. for [12, 6, 21, 5, 9]

we start with 12. There are no previous elements, so we move onto the next item.
We look at 6. Comparing it to the only previous element 6 we can see that we want to move it before 12, so our list should look like [6, 12, 21, 5, 9].
We look at 21, and compare it to the previous elements. First we compare 21 and 12 – clearly 21 is greater than 12 (and we know that the previous elements must all be sorted, so we can stop there).
We look at 5. 5 is smaller than 21, so we compare it to the previous element. 5 is also smaller than 12 so we turn to look at 6. 5 is less than 6. Therefore our list is now [5, 6, 12, 21, 9]
We now look at 9. 9 is less than 21, so we look at 12. 9 is less than 12, so we look at 6. 9 is greater than 6, so we insert 9 after 6. Our list is now [5, 6, 9, 12, 21]

The final list is therefore [5, 6, 9, 12, 21].

why is merge sort fast(er)?

We can do operations in parallel – for example, when we need to merge four lists at one stage, we can use multiple processor cores so that we can merge each pair of lists at the same time rather than having to merge the first pair and the second pair sequentially.

practice questions

# practice questions

**[work in progress](<../work%20in%20progress.md>)**

practice questions

Given a table website_users write an SQL progarm to select all the data from this table.

practice questions

Which of these is not a substring of the string "grace hopper"?
- A: "grace"
- B: "hopper"
- C: "ce h"
- D: "gracehopper"

I have written some practice questions [on this website](https://stepik.org/course/103434).

All contents on a line following a `#` are ignored by Python.

This means that in terms of output,


print("one") # print("one")


print("one")
# print("two")


# print("one")
print("one") # hello world
# some more random characters

are all identical (you can check yourself by running all those snippets in Python).

The reason that this is a feature is so that you can add annotations to your programs.

**[work in progress](<../work%20in%20progress.md>)**

An expression is a way of detailing a series of instructions that the machine should perform. You've probably encountered these in maths before.

For example:

1 + 2
3 * 2
3 * 2 + 1

You've also probably encountered expressions containing a variable. For example:

A for loop provides a way to "iterate" over an "iterator". This is probably best explained with an example: suppose that you run a popular web service, as part of which you collect anonymous usage data on how long each page load (for the same page) takes. This data might come in the form of a "[list]()" looking like this (in milliseconds – these are not actually implausible values for page load times):


load_times = [190, 170, 120, 80, 300, 210, 533, 1012, 50]

It's useful having this data alone, and we can look at these nine items and see roughly how long page loads are taking. To calculate the average (we'll use the mean) you know (or should know) that the mean of a list containing n items (where n is any whole number greater than zero) we use the formula (sum of all the items) / (number of items) (where / means "divide").

In Python, to calculate the length of load_times we use len(load_times). The sum is a bit less obvious (there is actually a sum(load_times) built into the programming language). Let's create a new variable sum_load_times and work from there.


load_times = [190, 170, 120, 80, 300, 210, 533, 1012, 50]
sum_load_times = 0

How do we add each item in load_times to sum_load_times? One way is to try something like this:


# add the first item in `load_times` to `sum_load_times`
sum_load_times += load_times[0]
# add the second item in `load_times` to `sum_load_times`
sum_load_times += load_times[1]
# add the third item in `load_times` to `sum_load_times`
sum_load_times += load_times[2]
# and so on...

We quickly run into a problem though. We can continue this all the way up to nine (that's how many items there are in load_times). The problem is that we don't actually know how long load_times will be in advance (have some imagination and assume that load_times is coming from our web pages, rather than me having just invented some numbers).

This is not a new problem for programming languages, and there is quite a neat solution to this: the for loop. In the code above (where we were doing the adding), you might have noticed that we were actually doing the same thing each time: sum_load_times += load_times[<some number here>]. The only thing that was changing was <some number here>. A for loop gives us a way to fill in <some number here>.


for i in range(len(load_times)):
    sum_load_times += load_times[i]

Here, range(len(load_times)) means "all the numbers between 0 and len(load_times)-1 (this means we will run sum_load_times += load_times[i] on a total of len(load_times) different numbers)."

We can now compute the average:


load_times = [190, 170, 120, 80, 300, 210, 533, 1012, 50]
# compute the sum
sum_load_times = 0
for i in range(len(load_times)):
    sum_load_times += load_times[i]
average = sum_load_times / len(load_times)
print("The average load time for this page is " + str(average) + "ms.")

Python allows you to also run a for loop for each item in a list directly. This means that


for i in range(len(load_times)):
    sum_load_times += load_times[i]

is equivalent to


for load_time in load_times:
    sum_load_times += load_time

practice questions

Write a program to the sum of squares of the numbers in the list [1, 2, 3, 4, 5, 6, 7, 8, 9] (hint: to calculate the square of a number in Python do n**2, e.g. two squared would be 2**2 and three squared would be 3**2).

Answers: for loop answers

writing conditions for if statements

An if statement looks like this:


if <some condition>:
    <take some action>

In this section we'll work out how to add the missing <some condition>. Let's imagine that we're checking that a username is correct before writing it to a database. Usernames should start with a capital letter and contain only letters in the range "a"-"z" and "A"-"Z". We start with a function.


def check_username(username):
    # condition 1
    if <some condition>:
        return False
    # condition 2
    if <some condition>:
        return False
    return True

This is a function that takes a username (a string) and returns a boolean – so it maps strings to booleans. Let's think about how we can check that the start of the string is a capital letter.

How do we get the start of the string. Using indexing, we know that the start of a string will be some_string[0]. In our case, the variable storing our string is username, so to get the start of the string we want to pick username[0].


# condition 1
if username[0] # some expression here:
    return False

For this condition, we will return false (i.e. that the username is invalid) when the condition is true. So our condition must be false when the condition is valid (i.e. when the first character in the string is upper case – which is when a username would be valid – we want to return false) and true when the condition is invalid (i.e. when the first character is not capital letter our condition should be true). If you think this is confusing you are not alone. Everyone makes mistakes doing this (and many professional computer scientists have experienced the horror of having to debug a live computer system in the middle of the night, having been woken up because faulty boolean logic caused a malfunction). To mitigate this, we use testing in order to catch faulty boolean logic before it becomes a problem.

This means that the boolean expression we want is not username[0].isupper(). This is the same as username[0].islower() (because we will check in the next condition that the string does not contain any non-latin alphabet characters).


def check_username(username):
    # condition 1
    if username[0].islower():
        return False
    # condition 2
    if <some condition>:
        return False
    return True

For condition (work in progress, I have not uploaded this section yet).

practice questions

Write an algorithm that will allow the user to input two numbers. The algorithm should then compare these two numbers, and output the larger one. You can check your answer here
Write a program that inputs a string and outputs "yes" (case-sensitive) if the string is equal to "discombobulated" and outputs "no" if it is not equal. You can check your answer here if answers

**[work in progress](<../work%20in%20progress.md>)**

There are some essential things that you need to know in order to get your Python programs to interact with a user.

input() reads a line from the terminal keyboard. print(<something>) prints content to the terminal.

**TODO**

Heaven knows why this is on the specification.

The concept of an "integrated development environment" is just a fancy name for "a text editor, but smarter." A text editor is something like Notepad++ or Vim – you type text and it saves it into a file. You can also highlight bits of text, cut and paste, and so on. In an integrated development environment, the text editor is made more sophisticated by making it aware of the existence of the programming language you are using. This provides some advantages:

the IDE can check your program for bugs (such as syntax errors) and automatically highlight the offending parts of your progam
some IDEs will allow you to select lines that the program should be paused at, and then run it under a debuger. At each of the lines that you have selected, the IDE will pause the program, and allow you to inspect the program state
some IDEs will provide a button to allow you to run your code
some IDEs allow you to rename items (think variable, function names) and see these automatically updated accross the whole codebase.

further resources

If you are interested, I have collected some further reading:

"Why an IDE?" by Aleksey Kladov (a programmer who works on IDEs, including previously at Jetbrains – probably the most famous company building IDEs). Note that you can obtain free access to the usually paid Jetbrains IDEs as a student.

These are actually quite fun.

Machines run machine code.

But writing machine code is not particularly pleasant. It's very fiddly, and easy to get wrong. Unfortunately, machines run machine code, not other forms of code. Grace Hopper pioneered a solution to this problem; she worked out that it was possible to design other forms of code, which were easier to write certain classes of problems (e.g. code for processing financial transactions, which doesn't need exact control ove how every bit is layed out on the machine) and then "translate" that code into machine code.

This is the essence of a compiler. It turns programs into machine code.

program -> compiler -> machine code

There are also "interpreters". Instead of feeding in the entire program and then generating the appropriate machine code, they start with the entry point (maybe the first line) of the program, and then run all the lines of the program whenever it needs to.

practice questions

Which of these descriptions would apply to a compiler?
- A: runs through a program line by line, and runs each line as soon as it reads it
- B: takes an entire program and translates it in its entirety into machine code, which can then be executed
- C: takes a series of abbreviations for specific machine code instructions and replaces them with the corresponding machine code
Consider the program below, and decide (with a reason for each) which of a compiler, assembler or interpreter you could use in order to run it.


function f(x)
    return x * x
endfunction

Consider this program (taken from this resource) and explain whether an assembler would be a suitable device to translate it.


; ----------------------------------------------------------------------------------------
; Writes "Hello, World" to the console using only system calls. Runs on 64-bit Linux only.
; To assemble and run:
;
;     nasm -felf64 hello.asm && ld hello.o && ./a.out
; ----------------------------------------------------------------------------------------

          global    _start

          section   .text
_start:   mov       rax, 1                  ; system call for write
          mov       rdi, 1                  ; file handle 1 is stdout
          mov       rsi, message            ; address of string to output
          mov       rdx, 13                 ; number of bytes
          syscall                           ; invoke operating system to do the write
          mov       rax, 60                 ; system call for exit
          xor       rdi, rdi                ; exit code 0
          syscall                           ; invoke operating system to exit

          section   .data
message:  db        "Hello, World", 10      ; note the newline at the end

Note – if you would like solutions you can try the exercises in a self-marking form on stepik.org.

programming problems

Often, students are confused as to how to recieve the input data into their programs.

The trick is to look the example inputs. For example, if you have a question asking you to sum two numbers, and some sample input looking like:


15
21

Asking for output looking like:

Then you should write code something like:


# reads a single line from the standard input and converts it to an integer
a = int(input())
# reads a single line from the standard input and converts it to an integer
b = int(input())
# outputs the result in the requested form
print(a + b)

This is mostly me summarising what I've seen in examiner's reports.

fair warning, this section is mostly polemic

One thing that it appears students struggle with exams is the terminology (used here as a politer form of "jargon") that exam boards require students to use. I would argue that the fault for this is, however, more on the exam board than it is on the students. Take a question that OCR asked in 2019 – "describe two examples of defensive design that should be considered when developing this program" (the program was given in a previous part of the question). If you ask most working programmers what defensive design is, you are likely to get a lot of blank stares (for the uninitiated defensive design means anticipating malicious users of one's program). Even those who work in some of the most error-prone and complex areas of software engineering (e.g. distributed systems) will probably draw a blank.

One really has to question the pedagogic utility of making up your own concepts (ok, to be fair, "defensive design" does have a Wikipedia page, but I was unable to find a Google result that did not mention "OCR GCSE Computer Science" in it) to quiz concepts on. I can live with questions about email-protocol minutiae, because perhaps some students will end up building email software, but I really struggle when it comes to these made-up ideas.

end the polemnic

issues with programming

For more useful insights, it seems that functions confuse students (and I think many maths teachers would agree that students find function confusing). This is unfortunate given that functions are at the heart of computer science. The key issue I feel students experience with functions is the idea of "binding" – i.e. where variables are valid, and their relation to the overall lifecycle of the function (when they are taught programming, many students do not seem to have a very good idea of progam dynamics – more on this below). Many students are not sure of as to the meaning of a "return value". This is perhaps a consequence of Python programs (unfortunately frequently written by teachers) that look a bit like this:


x = 12

def square(number):
    x = number ** 2

square(12)
print(x)

Which is very confusing to a lot of students, because they are not really taught about the idea of "scope". In programming language terms, what this does is:


# creates the global variable x
x = 12

# define the square function
def square(number):
    # 'capture' a reference to `x` inside this function
    # this reference will then be used when the function
    # is later called
    x = number ** 2

# invocate square, which updates the reference to `x` it has
square(12)
# prints the value of "x" to the console
print(x)

Unfortunately this approach really undermines a core reason why we use functions – encapsulation. Whereas if we write:


def square(number):
    return number ** 2
    
print(square(12))

We have moved the squaring logic into a separate function. This has all the usual benefits that functions bring. Of course, in reality our functions tend to do more complicated things, and also reference each other.

I think the key things to double down on are:

binding (aka scopes, namespaces)
that functions are designed for encapsulation
how the control flow of functions works (i.e. something about how they are executed – don't start explaining how stack frames work, but do try to explain to them that programs are things that do things.)

Relating to 3, is the issue of dynamics. This is a term from programming language theory (from the excellent Practical Foundations for Programming Languages the following definition is provided – "the dynamics of a language describes how programs are executed"). You might accuse me of 'inventing' unnecessary terms in the same was as OCR do with "defensive design" here, but the key difference is that the term "dynamic" is actually used in programming language theory, whereas "defensive design" is not. Explaining how programs are executed is essential. Using a debugger, or other mechanism for visualising a program's execution throughout time is really essential here.

One question that seemed to give a lot of students trouble one year was where candidates were asked as to whether or not a number of loops would run forever or terminate. Given the importance of the halting problem to theoretical computer science (and its easy proof) I imagine that an engaging lesson could be devised around it.

issues with number bases

It seems that the key issue here is place value – i.e. that numbers in different columns need to be multiplied by relevant_base^(the zero-indexed column number in question) before they can be added together.

issues with algorithms

general explanation of algorithms

At GCSE it is exceedingly rare that students are asked to implement an algorithm from scratch. However, they are often asked to comment on implementations on algorithms and many students seem to fall down here.

binary searches

When asked to perform binary searches on datatypes other than numbers, many students struggle. I think this is to do with how binary searches are usually introduced to students by using lists of numeric data. The example I give on this site uses names (see binary search for exact details) and I think some explanation on how a system for sorting strings can be devised (as in the dictionary) is helpful. It is also probably sensible to explain how records can be searched (by sorting on a single field).

This site contains materials suitable for UK secondary school computer science exams. At the moment it is mostly notes, but I am working on adding practice questions.

I think that computer science can and should be a really enjoyable subject. The computer removes a lot of physical constraints that in other subjects make it much harder to apply what you learn directly; you can't conceivably set up an industrial chemical plant working in your bedroom on your own, but you could build an industrial social media platform (albeit perhaps not as polished as one built by a billion-dollar company).

If you have thoughts, comments, feedback or suggestions please feel free to contact me via Reddit.

Note that all code is written in the Python programming language.

For ambitious pupils wanting to go beyond the GCSE spec. # freely available books on programming - [Structure and Interpretation of Computer Programs](https://web.mit.edu/alexmv/6.037/sicp.pdf). Aka "The Wizard Book" - [How to Design Programs](http://htdp.org/2003-09-26/)

This means that I have not yet finished writing the page in question. Don't worry – I will get around to it eventually!

Backlinks

sql

defensive design

logic gates

truth tables

introduction

boolean

string

expression

function

if

python basics

UK secondary Computer Science