Find the user with closest taste of mine

Question

I trying to write in SQL the following requirement:

Each user can rate as much films as they like. (1= I don't like, 5=I love it) And the system should provide to the current user a list of films that he doesn't rate yet and that other users like too.

To simplify, I have this table and the following data:

create table wich (
  uid int,
  film varchar(50),
  rate int
);

insert into wich values 
(1, 'usual suspect', 5), (1, 'gataca', 4), (1, 'goldeneye', 2),
(2, 'usual suspect', 4), (2, 'gataca', 5), (2, 'i am a legend', 4), (2, 'the hobbit', 1),
(3, 'usual suspect', 1), (3, 'gataca', 5), (3, 'goldeneye', 5),
(4, 'usual suspect', 5), (4, 'goldeneye', 5),
(5, 'usual suspect', 5), (5, 'gataca', 4), (5, 'goldeneye', 5),
(6, 'usual suspect', 4), (6, 'gataca', 3), (6, 'goldeneye', 3), (6, 'shrek', 4);

This can be read as: User 1 like 'usual suspect' and 'gataca' but not 'goldeneye'.

So what I want know is to find the user(s) that have the closest tastes of the current user and get a list of films that the current user could probably like.

Following, what I have done so far:

Step 1: for every films rated by user 1, if user1 and user2 rate the film in the same way, compute a score (should be 0 if the difference is greater than 2) and it's great if both like a film or both doesn't like a film.

select w2.uid, 1 as nb, case when abs(w1.rate - w2.rate) <= 2 THEN 5-abs(w1.rate - w2.rate) ELSE 0 END as score
        from wich w1
        inner join wich w2 on w1.uid<>w2.uid and w1.film=w2.film
        where w1.uid=1

Step 2: compute a score by user

select scores.uid, SUM(scores.score) * 100 / sum(scores.nb) as score
from (
    select w2.uid, 1 as nb, case when abs(w1.rate - w2.rate) <= 2 THEN 5-abs(w1.rate - w2.rate) ELSE 0 END as score
    from wich w1
    inner join wich w2 on w1.uid<>w2.uid and w1.film=w2.film
    where w1.uid=1
) scores
group by uid
ORDER BY score desc

step 3: find films that user1 has not yet rated that he could like

select w3.uid, w3.film,w3.rate, matches.score
from (
    select scores.uid, SUM(scores.score) * 100 / sum(scores.nb) as score
    from (
        select w2.uid, 1 as nb, case when abs(w1.rate - w2.rate) <= 2 THEN 5-abs(w1.rate - w2.rate) ELSE 0 END as score
        from wich w1
        inner join wich w2 on w1.uid<>w2.uid and w1.film=w2.film
        where w1.uid=1
    ) scores
    group by uid
    ORDER BY score desc
) as matches
inner join wich w3 on matches.uid=w3.uid and w3.rate >= 3
left join wich w4 on w4.uid=1 and w3.film=w4.film
where w4.uid is null
order by w3.rate desc, matches.score desc;

This seems to work, but I am not sure that this will still respond in a small amount of time if each user rate of lot of films and the table become become bigger.

What do you think ?

Is there a better way to accomplish such a thing ?

A working fiddle: http://sqlfiddle.com/#!2/89c57/1/0

Edit:

In my real table, the film column is an integer and I have a second table with films data (title, description, year, ...)

create table wich (
  uid int,
  film int,
  rate int
);

The example given here is just to simplify the problem with just one table.

score 1 · Answer 1 · answered Jul 30 '14 at 14:05

You can make a table which holds the users with similar taste for every user.

CREATE TABLE `similar_users` (
    `uid1` INT,
    `uid2` INT,
    `score` INT,
    PRIMARY KEY (`uid1`, `uid2`)
)

The algorithm for calculating the score can be anything, as long as a higher score means a closer taste.

You you should limit the amount of similar users to a fixed number and also exclude the ones with a score less than a certain value (the ones with a not so similar tastes).

The contents of the table will be calculated periodically (e.g. every day).

Then step 3 from your solution becomes something like this:

SELECT w.film
FROM similar_users su
JOIN wich w ON w.uid = su.uid2 AND w.rate > 2
LEFT JOIN wich wn ON wn.film = w.film AND wn.uid = {current_user}
WHERE su.uid1 = {current_user}
ORDER BY su.score * w.rate DESC

score -1 · Answer 2 · answered Jun 09 '14 at 20:37

Normalize it

The biggest thing that stands out to me is that you need to normalize the relationship here. You can find tons of stuff online about normalizing your database and ER diagrams. But to cut right to the code, it means pushing out the film titles to what you could consider a "lookup table". Then your wich table will just be all int codes. More difficult for humans to read, but much better for machines to process, and avoids a lot of data duplication and potential data anomalies.

CREATE TABLE wich (
  uid int,
  film_id int REFERENCES film(id),
  rate int,
  PRIMARY KEY (uid, film_id)
);

CREATE TABLE film (  
  id int PRIMARY KEY,
  title varchar(50)
);

To get the film's title to show up in your queries, you'll just want to JOIN the film table to get the title.

On performance

Your database will probably slow down as you data size grows UNLESS you add some indexes. The indexes to make depend largely on your queries that you will run against the database. A good place to start is to create indexes on:

columns that you are using as part of a JOIN condition
columns that you reference in your WHERE clauses

score -1 · Answer 3 · edited Jul 29 '14 at 09:47

I'll advice that you create another table to store the films first, although not necessary, so the film field would only contain integers in case you might one to restrict users from creating same movies multiple times with different spacing or case etcetera , then you need to take your time to understand the problem well.

SELECT b.uid AS pple_wit_cmon_intrest, b.film AS film, b.rate AS rate 
FROM `wich` AS a 
LEFT JOIN wich AS b on b.film = a.film  
WHERE a.uid = 1 #the id you are trying to find people with common interest with
ORDER BY film ASC;#order the films alphabetically in asc order

Find the user with closest taste of mine

3 Answers3