I'm fairly new to working with databases, but I currently have a web crawler that I've written in C# using a MySQL database. The crawler is frequently writing and deleting records in the database as sites are scraped.
Each record has a primary key, which is the MD5 checksum of the URL, to ensure that no table contains duplicate entries.
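For context, the key is just an MD5 hex digest of the URL, along these lines (a minimal sketch rather than my exact crawler code):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class UrlKey
{
    // Produce a 32-character hex MD5 digest of the URL to use as the primary key.
    public static string ForUrl(string url)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(url));
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```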
Is it good practice to check the table for a duplicate before inserting, i.e. two operations against the database?
Or is it sufficient to attempt the insert and let the database gracefully reject the duplicate?
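To make the comparison concrete, here is a rough sketch of the single-statement approach I mean, using MySql.Data; the table and column names (`pages`, `url_md5`, `url`) are just placeholders, not my real schema:

```csharp
using MySql.Data.MySqlClient;

static class PageWriter
{
    // Attempt the insert and let MySQL skip the row if the key already exists.
    // INSERT IGNORE avoids the need for a prior SELECT; an alternative would be
    // a plain INSERT and catching MySqlException with Number == 1062 (duplicate key).
    public static bool InsertPage(MySqlConnection conn, string urlMd5, string url)
    {
        const string sql = "INSERT IGNORE INTO pages (url_md5, url) VALUES (@md5, @url)";
        using (var cmd = new MySqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@md5", urlMd5);
            cmd.Parameters.AddWithValue("@url", url);
            // ExecuteNonQuery returns 0 when the row was ignored as a duplicate.
            return cmd.ExecuteNonQuery() > 0;
        }
    }
}
```

The alternative I'm weighing against this is a SELECT for the key first, followed by the INSERT only if nothing comes back.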
The same question applies to other operations, such as deletes.
At the moment I'm operating on thousands of records a minute from a single client (with multiple connections from that client). Does the answer change given that level of database activity?
Also, duplicates are fairly common, so the add is frequently skipped: roughly only one in every ten attempted adds is actually a new record.