Test your own words here (letters a-z only; no spaces or other characters). Try random words, names, place names, keyboard mashing, etc.!
I used two different techniques to vectorise the text. Text itself is not directly usable by a neural network; instead, the text is 'vectorised' - turned into an array of numbers.
E.g. in an alphabet with only five letters, for the word 'bed' we would create 15 columns (5 for the size of the alphabet x 3 letters in the word), and place a '1' in the second column to indicate the 'B'.
| A | B | C | D | E |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 |
Then, we would do the same for the next letter, like so:
| A | B | C | D | E |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
And so on for each letter. This means you get a lot of columns, with the majority of them set to '0'. You may be tempted to think: 'why not cut down on columns by representing each letter as its position in the alphabet - 1 for A, 2 for B, 26 for Z, etc.?' The reason is that this would imply to the algorithm a deeper mathematical relationship than actually exists. B isn't double A, and E isn't half of J.
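This per-letter scheme is usually called 'one-hot' encoding. As a rough sketch (illustrative, not the exact code behind this demo):

```js
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz';

// One-hot encode a word: 26 columns per letter, with a single '1'
// marking which letter it is and zeros everywhere else.
function oneHotVectorise(word) {
  const vector = [];
  for (const letter of word.toLowerCase()) {
    for (const candidate of ALPHABET) {
      vector.push(letter === candidate ? 1 : 0);
    }
  }
  return vector;
}

// 'bed' -> 26 x 3 = 78 values, almost all of them zero.
```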
However, this technique generates a lot of data for each word. Is there a way to do better?
Instead of representing each letter individually, we could try to represent which letters the word contains. Take the word BABE.
| A | B | C | D | E |
|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 1 |
We could use a '1' to indicate that the word contains the letter. In fact, we can do one better and count the number of occurrences of each letter. This is the second technique I used, 'count vectorisation'.
| A | B | C | D | E |
|---|---|---|---|---|
| 1 | 2 | 0 | 0 | 1 |
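A minimal sketch of count vectorisation (again illustrative; the function name is my own):

```js
// Count vectorise a word: one column per letter of the alphabet,
// holding the number of times that letter occurs in the word.
function countVectorise(word) {
  const counts = new Array(26).fill(0);
  for (const letter of word.toLowerCase()) {
    const index = letter.charCodeAt(0) - 'a'.charCodeAt(0);
    if (index >= 0 && index < 26) counts[index]++;
  }
  return counts;
}

// countVectorise('babe') -> [1, 2, 0, 0, 1, 0, 0, ...]
```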
I also wanted to extract some of the structure of the words. Using count vectorisation, the words 'able' and 'bale' look identical. I found that using regular expressions, it is possible to get a rough idea of the number of syllables in a word. It's not exact (it may count 'bale' as having two syllables), but it was generally close. The syllable count by itself was not good enough, because it did not encode the 'density' of syllables, so I used the word length divided by the syllable count instead.
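The exact regex I used isn't shown here, but a common heuristic of this kind is to count runs of vowels - a minimal sketch, assuming that heuristic:

```js
// Rough syllable estimate: count groups of consecutive vowels.
// Deliberately imprecise - it counts 'bale' as two syllables,
// because the trailing 'e' forms its own vowel group.
function estimateSyllables(word) {
  const groups = word.toLowerCase().match(/[aeiouy]+/g);
  return groups ? groups.length : 1;
}

// Syllable 'density': word length divided by the syllable estimate.
function syllableDensity(word) {
  return word.length / estimateSyllables(word);
}
```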
Finally, I added one more feature to the model: whether the last letter is a vowel, encoded as a binary 1 or 0.
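Putting the features together, the model input looks roughly like this (using the sketch functions above; the exact assembly in the demo may differ):

```js
// Full feature vector: 26 letter counts, syllable density, and a
// last-letter-is-vowel flag - 28 values in total.
function features(word) {
  const counts = countVectorise(word);     // 26 letter counts
  const density = syllableDensity(word);   // length / syllables
  const endsInVowel = /[aeiou]$/.test(word) ? 1 : 0;
  return [...counts, density, endsInVowel];
}
```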
A complete explanation of what a neural network is would be beyond the scope of this post.
The basic idea is that data is fed into the network, iteration by iteration. The data 'flows' through the network, where it is combined and undergoes relatively simple mathematical operations. What pops out the other side is the prediction. The data is input one 'column' at a time, so each datapoint (in this case the 26 letter counts plus the other features) gets its own input.
The mathematical operations are based on 'weights' - numbers that are multiplied with the input data. Initially the weights are as good as random, so the output from the network is random and not useful. Data is also passed through nonlinear 'activation' functions, so that the network can learn nonlinear relationships between the data points.
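To make that concrete, here is a single neuron written out in plain JavaScript (a simplified sketch; in the real network the weights and bias are learned, not hand-set):

```js
// ReLU activation: zero out negative values, pass positives through.
function relu(x) {
  return x > 0 ? x : 0;
}

// One neuron: multiply each input by its weight, sum, add a bias,
// then pass the result through the nonlinear activation.
function neuron(inputs, weights, bias) {
  let sum = bias;
  for (let i = 0; i < inputs.length; i++) {
    sum += inputs[i] * weights[i];
  }
  return relu(sum);
}
```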
We can compare the output from the network with the 'label' - the actual classification. We take the difference between the output and the actual value, then go back through the network and change the weights, just a little, so that next time the output will be slightly closer to the correct answer. Doing this over many iterations on enough data means the network gets better at generalising - learning the relationships in the data that cause it to be labelled one way or the other.
The neural network code comes from synaptic.js. It is a simple network, with an input layer of 28 nodes (26 letter counts + the other 2 features). There is one hidden layer of the same size, with a ReLU activation function (or 'squash', as the synaptic API calls it). ReLU stands for 'Rectified Linear Unit', which just means that values under 0 are set to 0, and values above 0 are unchanged. The final layer has one output with the logistic function, necessary for binary classification.
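Roughly, building and training such a network with synaptic looks like this. This is a sketch rather than the demo's exact code, and it assumes a synaptic version whose Neuron.squash includes RELU; trainingData is a stand-in for an array of { features, label } pairs:

```js
const { Layer, Network, Neuron } = require('synaptic');

// 26 letter counts + 2 extra features = 28 inputs.
const inputLayer = new Layer(28);
const hiddenLayer = new Layer(28);  // one hidden layer of the same size
const outputLayer = new Layer(1);   // single output for the binary label

// ReLU 'squash' on the hidden layer, logistic on the output.
hiddenLayer.set({ squash: Neuron.squash.RELU });
outputLayer.set({ squash: Neuron.squash.LOGISTIC });

inputLayer.project(hiddenLayer);
hiddenLayer.project(outputLayer);

const network = new Network({
  input: inputLayer,
  hidden: [hiddenLayer],
  output: outputLayer,
});

// Training: activate on a feature vector, then propagate the error
// for the known label back through the network.
const learningRate = 0.1;
for (const { features, label } of trainingData) {
  network.activate(features);
  network.propagate(learningRate, [label]);
}
```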
Scoring
The network is scored by accuracy at the moment: the fraction of words it classifies correctly.
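A minimal sketch of that scoring, assuming the { features, label } sample format from above:

```js
// Accuracy: the fraction of samples the network classifies correctly.
// The network outputs a value in [0, 1]; threshold it at 0.5.
function accuracy(network, samples) {
  let correct = 0;
  for (const { features, label } of samples) {
    const [output] = network.activate(features);
    const predicted = output >= 0.5 ? 1 : 0;
    if (predicted === label) correct++;
  }
  return correct / samples.length;
}
```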