10

What is a good way of implementing "next-word prediction"? For example, the user types "I am" and the system suggests "a" and "not" (or possibly others) as the next word. I am aware of a method that uses Markov chains and some training text (obviously) to more or less achieve this. But I read somewhere that this method is very restrictive and applies only to very simple cases.

I understand the basics of neural networks and genetic algorithms (though I have never used them in a serious project), and maybe they could be of some help. I wonder if there are any algorithms that, given appropriate training text (e.g., newspaper articles and the user's own typing), can come up with reasonably appropriate suggestions for the next word. If not (links to) algorithms, general high-level methods to attack this problem are welcome.

jonsca
yati sagade

3 Answers

9

Take a look at n-grams. An n-gram is a sequence of n words. In your case you want n to be 3, since you need two query words and a resulting word. One 3-gram would be, for example, "I am tired"; another would be "I am happy".

What you then need is a collection of these 3-grams gathered from your target language, say English. Since you cannot collect them from everything ever written in English, you need to make a selection. That selection of representative texts is called a corpus. If your corpus is good, it will tell you how often a sequence of three specific words occurs together in English. From that you can estimate the probability of a 3-gram.

Collecting this kind of data is the hardest part. Once you have the list of all 3-grams together with their probabilities, you can filter the list down to the 3-grams starting with "I am". Then you sort that list by probability, et voilà: your prediction.
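
The steps above (count 3-grams, filter by the first two words, sort by probability) can be sketched in a few lines of Python. This is only a minimal illustration: the function names `train_trigrams` and `predict_next` and the five-sentence toy corpus are made up for this example, and a real system would need a far larger corpus, proper tokenization, and some form of smoothing for word pairs it has never seen.

```python
from collections import Counter, defaultdict

def train_trigrams(corpus):
    """Count, for every pair of consecutive words, which words follow it."""
    followers = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()  # naive whitespace tokenization
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            followers[(w1, w2)][w3] += 1
    return followers

def predict_next(followers, w1, w2, k=3):
    """Return up to k (word, probability) pairs, most likely first."""
    counts = followers.get((w1.lower(), w2.lower()))
    if not counts:
        return []  # this word pair never occurred in the corpus
    total = sum(counts.values())
    return [(word, n / total) for word, n in counts.most_common(k)]

# Toy corpus, invented purely for illustration.
corpus = [
    "I am tired",
    "I am happy",
    "I am not sure",
    "I am not here",
    "you are not here",
]
model = train_trigrams(corpus)
print(predict_next(model, "I", "am"))
```

Here the probability is the relative frequency of the third word among all 3-grams that share the same first two words, which is the quantity you actually want to rank by when suggesting a next word.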

2

It looks like the problem domain is a subset of string search. By extending words to include white spaces, fuzzy string matching can be applied here.

You might want to treat each complete user input as one "word" during training, in addition to your dictionary. This lets you suggest the next word and also auto-complete words or phrases.
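
One possible reading of this idea, as a rough sketch: keep every phrase the user has typed, and compare what is currently being typed against the beginning of each stored phrase with a fuzzy similarity score. The example uses `difflib.SequenceMatcher` from the Python standard library; the function name `suggest_phrases`, the cutoff of 0.7, and the sample phrases are assumptions chosen for illustration, not something the answer prescribes.

```python
from difflib import SequenceMatcher

def suggest_phrases(typed, phrases, cutoff=0.7):
    """Return stored phrases whose beginning roughly matches what was typed."""
    typed = typed.lower()
    scored = []
    for phrase in phrases:
        # Compare only the part of the phrase as long as the typed text,
        # so that whitespace is matched like any other character.
        prefix = phrase.lower()[:len(typed)]
        score = SequenceMatcher(None, typed, prefix).ratio()
        if score >= cutoff:
            scored.append((score, phrase))
    # Best matches first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [phrase for _, phrase in scored]

# Hypothetical phrases remembered from earlier user input.
phrases = ["I am tired", "I am happy", "you are late"]
print(suggest_phrases("I an", phrases))  # tolerates the typo "an" for "am"
```

Because the comparison is fuzzy, a small typo in the typed text still retrieves the stored phrases, and the remainder of each returned phrase can be offered as the completion.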

Here is a link to a compilation of fuzzy string search algorithms

http://ntz-develop.blogspot.com/2011/03/fuzzy-string-search.html

1

You are looking for a (statistical) language model.

A statistical language model assigns a probability to a sequence of m words P(w_1,...,w_m) by means of a probability distribution...

In speech recognition and in data compression, such a model tries to capture the properties of a language, and to predict the next word in a speech sequence...
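
To connect the quoted definition to next-word prediction: the joint probability can be factored with the chain rule, and an n-gram model then shortens the history each word depends on. The trigram truncation (n = 3) and the count-based estimate below are assumptions added here for illustration; they are not part of the quoted text.

```latex
P(w_1,\dots,w_m)
  = \prod_{i=1}^{m} P(w_i \mid w_1,\dots,w_{i-1})
  \approx \prod_{i=1}^{m} P(w_i \mid w_{i-2}, w_{i-1}),
\qquad
P(w_i \mid w_{i-2}, w_{i-1})
  \approx \frac{\operatorname{count}(w_{i-2}\, w_{i-1}\, w_i)}{\operatorname{count}(w_{i-2}\, w_{i-1})}
```

Predicting the next word after "I am" then amounts to picking the w that maximizes P(w | I, am).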

gnat
user3287