
For an upcoming post on performance tuning I’ve been playing around with larger databases. To do this I wrote a small script to insert (possibly) millions of rows. This presented a problem, however: inserting the data was taking a very long time, and I was keen to have a process I could repeat so the comparisons stayed fair.

For this post I’ll look at inserting 40,000 accounts with 100 contacts each and 100 larger accounts with 10,000 contacts each.

The usual (and correct) way to do this would be to use SuiteCRM’s inbuilt importer. This is good for reasonably sized imports but will quickly grind to a halt with the sizes I was trying.

The next option was direct database access, which is far more practical than the inbuilt importer at these volumes.

My first attempt, which both generated the data (using the Faker library) and inserted it, looked something like this:

<?php
require_once 'vendor/autoload.php';

//Connect to the SuiteCRM database (the DSN and credentials here are placeholders)
$db = new PDO('mysql:host=localhost;dbname=suitecrm;charset=utf8', 'dbuser', 'dbpass');

//Do the few, large accounts
$accountsCount = 100;
$contactsCount = 10000;
doImport($db,$accountsCount,$contactsCount);

//Do the many, smaller accounts
$accountsCount = 40000;
$contactsCount = 100;
doImport($db,$accountsCount,$contactsCount);


function doImport($db,$accountsCount,$contactsCount){
  $accountPrep = $db->prepare("INSERT INTO accounts (id,name,date_entered,date_modified,modified_user_id,created_by) VALUES(:id,:name, NOW(), NOW(),'1','1');");
  $contactPrep = $db->prepare("INSERT INTO contacts (id,first_name,last_name,date_entered, date_modified,modified_user_id,created_by) VALUES(:id,:first_name, :last_name, NOW(), NOW(),'1','1');");
  $contactLinkPrep = $db->prepare("INSERT INTO accounts_contacts (id,contact_id, account_id,deleted,date_modified) VALUES(:id,:contactId, :accountId, 0,NOW());");
  $faker = Faker\Factory::create();
  for($x = 0; $x < $accountsCount; $x++){
    if($x % 100 === 0){
      echo ".";
    }
    if($x > 0 && $x % 1000 === 0){
      echo "\n";
    }
    $accountId = getGUID();
    $accountPrep->execute(array(':id' => $accountId, ':name' => $faker->company));
    for($y = 0; $y < $contactsCount; $y++){
      $contactId = getGUID();
      $contactPrep->execute(array(':id' => $contactId, ':first_name' => $faker->firstName, ':last_name' => $faker->lastName));
      $contactLinkPrep->execute(array(':id'=>getGUID(),':contactId' => $contactId, ':accountId' => $accountId));
    }
  }

}
echo "\nComplete\n";

It turns out this is slow, very slow. I had planned to leave it running for a few days to provide data for this post, but gave up after 41 hours. Even then it was only about a quarter of the way through (1,207,229 out of 5,000,000 records) - the full process would have taken around a week.

The above code is quite simplistic. At the cost of using more RAM we could batch the inserts. I was a bit sceptical that this would make enough of a difference to be worth it, but here we go. Our pseudo-code now looks something like this:

<?php
require_once 'vendor/autoload.php';

//Connect to the SuiteCRM database (the DSN and credentials here are placeholders)
$db = new PDO('mysql:host=localhost;dbname=suitecrm;charset=utf8', 'dbuser', 'dbpass');

//Do the few, large accounts
$accountsCount = 100;
$contactsCount = 10000;
doImport($db,$accountsCount,$contactsCount);

//Do the many, smaller accounts
$accountsCount = 40000;
$contactsCount = 100;
doImport($db,$accountsCount,$contactsCount);


function doImport($db,$accountsCount,$contactsCount){
  $accountVals = array();
  $contactVals = array();
  $contactLinkVals = array();
  $faker = Faker\Factory::create();
  $now = (new DateTime())->format("Y-m-d H:i:s");
  for($x = 0; $x < $accountsCount; $x++){
    if($x % 100 === 0){
      echo ".";
    }
    if($x > 0 && $x % 1000 === 0){
      echo "\n";
    }
    $accountId = getGUID();
    $accountVals[] = "('".$accountId."', \"".$faker->company."\",NOW(),NOW(),'1','1')";
    for($y = 0; $y < $contactsCount; $y++){
      $contactId = getGUID();
      $contactVals[] = "('".$contactId."',\"".$faker->firstName."\",\"". $faker->lastName."\",'". $now."','". $now."','1','1')";
      $contactLinkVals[] = "('".getGUID()."','".$contactId."','".$accountId."', '0','".$now."')";
    }
    //Periodically insert the accumulated values and clear the buffers
    if($x > 0 && $x % 1000 === 0){
      doInserts($db, $accountVals, $contactVals, $contactLinkVals);
      $accountVals = array();
      $contactVals = array();
      $contactLinkVals = array();
    }
  }
  //Insert whatever is left in the final partial batch
  doInserts($db, $accountVals, $contactVals, $contactLinkVals);
}
echo "\nComplete\n";

This was better. Again I gave up after a while, but this time the estimate was a little over two days to complete. Still far too slow, however. We need a different approach.

MySQL provides the “LOAD DATA INFILE” command, which the MySQL docs suggest can be up to 20 times faster than using INSERT statements. Let’s give that a try.

We still use the script to generate the data, but now it writes CSV files instead of inserting rows directly:

<?php
require_once 'vendor/autoload.php';

//Do the few, large accounts
$accountsCount = 100;
$contactsCount = 10000;
doImport($accountsCount,$contactsCount);

//Do the many, smaller accounts
$accountsCount = 40000;
$contactsCount = 100;
doImport($accountsCount,$contactsCount);


function doImport($accountsCount,$contactsCount){
  $faker = Faker\Factory::create();
  $outAccounts = 'accounts.csv';
  $outContacts = 'contacts.csv';
  $outAccountsContacts = 'accountsContacts.csv';
  $outAccountsHandle = fopen($outAccounts,'a');
  $outContactsHandle = fopen($outContacts,'a');
  $outAccountsContactsHandle = fopen($outAccountsContacts,'a');
  $now = (new DateTime())->format("Y-m-d H:i:s");
  for($x = 0; $x < $accountsCount; $x++){
    if($x % 100 === 0){
      echo ".";
    }
    if($x > 0 && $x % 1000 === 0){
      echo "\n";
    }
    $accountId = getGUID();
    fputcsv($outAccountsHandle,array($accountId,$faker->company, $now, $now,1,1));
    for($y = 0; $y < $contactsCount; $y++){
      $contactId = getGUID();
      fputcsv($outContactsHandle,array($contactId,$faker->firstName, $faker->lastName, $now, $now,1,1));
      fputcsv($outAccountsContactsHandle,array(getGUID(),$contactId, $accountId, 0,$now));
    }
  }
  fclose($outAccountsHandle);
  fclose($outContactsHandle);
  fclose($outAccountsContactsHandle);
}
echo "\nComplete\n";

Generating the CSV files takes about 50 minutes.
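
One caveat before running it: plain LOAD DATA INFILE reads the file on the database server, so the CSVs need to live somewhere the server can read (and the secure_file_priv setting may restrict where that is). If you’d rather drive the load from PHP, the LOCAL variant can be issued through PDO. A rough sketch, assuming local_infile is enabled on the server and reusing the placeholder credentials from earlier:

<?php
// Sketch: load one of the generated CSVs with LOAD DATA LOCAL INFILE via PDO.
// The DSN and credentials are placeholders; local_infile must be enabled server-side.
$db = new PDO(
  'mysql:host=localhost;dbname=suitecrm;charset=utf8',
  'dbuser',
  'dbpass',
  array(PDO::MYSQL_ATTR_LOCAL_INFILE => true)
);
$rows = $db->exec(
  "LOAD DATA LOCAL INFILE 'accounts.csv' INTO TABLE accounts " .
  "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\n' " .
  "(id,name,date_entered,date_modified,modified_user_id,created_by)"
);
echo $rows . " rows loaded\n";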

Finally we can run the actual import:

mysql> LOAD DATA INFILE 'accounts.csv' INTO TABLE accounts FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' (id,name,date_entered, date_modified,modified_user_id,created_by);  
Query OK, 40100 rows affected (30.15 sec)  
Records: 40100  Deleted: 0  Skipped: 0  Warnings: 0

mysql> LOAD DATA INFILE 'contacts.csv' INTO TABLE contacts FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' (id,first_name,last_name,date_entered, date_modified,modified_user_id,created_by);  
Query OK, 5000000 rows affected (4 hours 56 min 52.34 sec)  
Records: 5000000  Deleted: 0  Skipped: 0  Warnings: 0

mysql> LOAD DATA INFILE 'accountsContacts.csv' INTO TABLE accounts_contacts FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' (id,contact_id, account_id,deleted,date_modified);  
Query OK, 5000000 rows affected (2 hours 18 min 19.70 sec)  
Records: 5000000  Deleted: 0  Skipped: 0  Warnings: 0

This totals about 7.3 hours (plus the 50 minutes to generate the data). At least that brings the time down to less than a working day, but it’s still quite slow.

MySQL is likely doing a lot of work to keep the indexes up to date as the rows come in, so we can save it some work (and time) by dropping the indexes and adding them back once the import is done. Using SHOW CREATE TABLE we can check which indexes exist and remove them like so:

SHOW CREATE TABLE accounts;
ALTER TABLE accounts DROP KEY idx_accnt_id_del;
ALTER TABLE accounts DROP KEY idx_accnt_name_del;
ALTER TABLE accounts DROP KEY idx_accnt_assigned_del;
ALTER TABLE accounts DROP KEY idx_accnt_parent_id;
SHOW CREATE TABLE contacts;
ALTER TABLE contacts DROP KEY idx_cont_last_first;
ALTER TABLE contacts DROP KEY idx_contacts_del_last;
ALTER TABLE contacts DROP KEY idx_cont_del_reports;
ALTER TABLE contacts DROP KEY idx_reports_to_id;
ALTER TABLE contacts DROP KEY idx_del_id_user;
ALTER TABLE contacts DROP KEY idx_cont_assigned;
SHOW CREATE TABLE accounts_contacts;
ALTER TABLE accounts_contacts DROP KEY idx_account_contact;
ALTER TABLE accounts_contacts DROP KEY idx_contid_del_accid;

Once the indexes are gone we import the data:

LOAD DATA INFILE 'accounts.csv' INTO TABLE accounts FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' (id,name,date_entered, date_modified,modified_user_id,created_by);  
Query OK, 40100 rows affected (4.72 sec)  
Records: 40100  Deleted: 0  Skipped: 0  Warnings: 0

LOAD DATA INFILE 'contacts.csv' INTO TABLE contacts FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' (id,first_name,last_name,date_entered, date_modified,modified_user_id,created_by);  
Query OK, 5000000 rows affected (58 min 8.44 sec)  
Records: 5000000  Deleted: 0  Skipped: 0  Warnings: 0

LOAD DATA INFILE 'accountsContacts.csv' INTO TABLE accounts_contacts FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' (id,contact_id, account_id,deleted,date_modified);  
Query OK, 5000000 rows affected (32 min 55.77 sec)  
Records: 5000000  Deleted: 0  Skipped: 0  Warnings: 0

And, after a quick Repair and Rebuild, we’re given the SQL needed to re-add the indexes:

mysql> ALTER TABLE accounts ADD INDEX idx_accnt_id_del (id,deleted), ADD INDEX idx_accnt_name_del (name,deleted), ADD INDEX idx_accnt_assigned_del (deleted,assigned_user_id), ADD INDEX idx_accnt_parent_id (parent_id);
Query OK, 0 rows affected (5.37 sec)

mysql> ALTER TABLE contacts ADD INDEX idx_cont_last_first (last_name,first_name,deleted), ADD INDEX idx_contacts_del_last (deleted,last_name), ADD INDEX idx_cont_del_reports (deleted,reports_to_id,last_name), ADD INDEX idx_reports_to_id (reports_to_id), ADD INDEX idx_del_id_user (deleted,id,assigned_user_id), ADD INDEX idx_cont_assigned (assigned_user_id);
Query OK, 0 rows affected (15 min 41.37 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE accounts_contacts ADD INDEX idx_account_contact (account_id,contact_id), ADD INDEX idx_contid_del_accid (contact_id,deleted,account_id);
Query OK, 0 rows affected (24 min 55.06 sec)
Records: 0  Duplicates: 0  Warnings: 0

About 90 minutes (plus 50 to generate the data and 45 to rebuild the indexes). Still slower than I would like, but this at least brings it into the realm of the realistic. I can now run repeated tests against a large dataset to measure performance tweaks.

Any approaches or tricks I’ve missed? Let me know via the contact form!