Monday 13 February 2012

Data Migration Performance to Crm Online

I've recently been looking at the rate of data migration into Crm Online using SSIS, and how this can be optimised. I started with a baseline rate of 12 records per second, and have so far improved this to 430 records per second, all using one client machine.


The easiest way to migrate data into Crm Online is via a synchronous, single-threaded process that writes one record at a time, so this was the baseline. It soon became very clear that the performance bottleneck in this scenario is network latency - i.e. the round-trip time for the network packets to make a request to the Crm Online server, and to receive a response back.
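As a back-of-envelope check, the baseline figures reported in the results further down (100,000 records in 8,321 seconds) work out to roughly 83 ms per record, which is plausible for one network round trip per record. A minimal sketch of the arithmetic; treating the whole interval as round-trip time is a simplifying assumption, as some of it is server-side processing:

```python
# Baseline figures taken from the results table in this post.
records = 100_000
elapsed_seconds = 8_321

# Average wall-clock time per record for the synchronous,
# single-threaded, one-record-per-request baseline.
ms_per_record = elapsed_seconds / records * 1000
records_per_second = records / elapsed_seconds

print(f"{ms_per_record:.1f} ms per record")
print(f"{records_per_second:.1f} records per second")
```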


So, the challenge was to improve on this, which can be done by addressing each aspect of the simple scenario - i.e.



  1. Synchronous v. asynchronous calls

  2. Single or multi-threaded - either within one process, or multiple concurrent processes

  3. Sending more than one record at a time

So far, I've not tested asynchronous calls, primarily because SSIS is stream-based, and I can't see a way to write out synchronous error output if using an asynchronous pattern. It would be possible to write out asynchronous error information, but for now that would involve too many code rewrites. In general, though, I would expect an asynchronous pattern to give similar performance benefits to the multi-threaded approach, and it may be possible to multiply the benefits by combining the two.


Multiple threads
SSIS controls the threading behaviour of a package, so rather than try for a multi-threaded single process, I went for running several instances of the same package concurrently, which you can do with the dtexec tool. There are two main aspects to making this work:



  1. You will need to be able to partition the source data, so that each package instance submits different records. For my tests, I had an integer identity column on the source data, and used the SQL modulo operator (%) to filter on the remainder from an integer division. For 10 concurrent packages, the where clause was 'WHERE id % 10 = ?' with the '?' replaced by a package variable.

  2. The package will not be able to reference any files, either as data sources, destinations or log files, as SSIS will attempt to get exclusive access to the files. So, I used a SQL Server source, and wrote log information to SQL via the SSIS SQL Log Provider.

I tested 10 concurrent packages, and this gave between a 7-fold and 9-fold performance improvement.
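The partitioning scheme can be sketched as follows. This is a minimal Python illustration rather than the SSIS package itself: the package file name (MigrateToCrm.dtsx) and the variable name (PartitionRemainder) are hypothetical, and the dtexec command lines are only built, not run. It assumes the package is loaded from a file with /F and the package variable is overridden with /SET.

```python
def partition_ids(ids, package_count):
    """Group ids by remainder, mirroring 'WHERE id % package_count = ?'."""
    partitions = {remainder: [] for remainder in range(package_count)}
    for record_id in ids:
        partitions[record_id % package_count].append(record_id)
    return partitions


def dtexec_command(package_path, variable_name, remainder):
    """Build one dtexec command line that sets the partition variable.

    package_path and variable_name are hypothetical placeholders.
    """
    property_path = rf"\Package.Variables[User::{variable_name}].Properties[Value]"
    return ["dtexec", "/F", package_path, "/SET", f"{property_path};{remainder}"]


PACKAGE_COUNT = 10
partitions = partition_ids(range(1, 101), PACKAGE_COUNT)

# Every id lands in exactly one partition, so no record is submitted twice.
all_ids = sorted(i for group in partitions.values() for i in group)
print(all_ids == list(range(1, 101)))

# One command per package instance, each with a different remainder.
commands = [
    dtexec_command("MigrateToCrm.dtsx", "PartitionRemainder", remainder)
    for remainder in range(PACKAGE_COUNT)
]
print(" ".join(commands[3]))
```

Each command would be started as a separate process so that the ten package instances run concurrently.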


Submitting multiple records
The Crm API is primarily designed around modifying one record at a time, with a separate request per record. However, CRM 2011 introduced the facility to pass multiple records, using the RelatedEntities property of a 'parent' entity. This allows you to build (in memory) a RelatedEntityCollection of multiple records, then attach this to one record, and submit this as one request.


There are two limitations to this approach:



  1. The entities have to be related via a relationship in CRM.

  2. The same data operation has to apply to the parent record, and the records in the RelatedEntityCollection

Initially I'd hoped to use the systemuser entity as the parent entity, as there is necessarily a relationship between this entity and any user-owned entity. However, this wouldn't work with limitation 2, as I wanted to update the systemuser, but create the related entities, and this doesn't work.


Instead, I had to make schema changes. I created a new entity (e.g. exc_batchimport), and a 1-N relationship with each entity that I wanted to import. Each request would then create 1 exc_batchimport record, and a configurable number of related records.
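The batching logic itself is independent of the CRM SDK and can be sketched in a few lines. The example below is a Python model of the idea, not the SDK code: each request is represented as a plain dictionary, and the relationship name is a hypothetical placeholder. In the real package, each request would be an exc_batchimport entity with its RelatedEntities property populated.

```python
from itertools import islice

# Hypothetical name for the 1-N relationship from exc_batchimport.
RELATIONSHIP = "exc_batchimport_exc_customrecord"


def batches(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def build_request(batch, relationship_name):
    """Model one create request: a parent record carrying its related children."""
    return {
        "entity": "exc_batchimport",             # parent entity named in the post
        "related": {relationship_name: batch},   # children created in the same call
    }


rows = [{"name": f"record {i}"} for i in range(25)]
requests = [build_request(batch, RELATIONSHIP) for batch in batches(rows, 10)]

# 25 records with a batch size of 10 need 3 requests instead of 25.
print(len(requests))
print([len(request["related"][RELATIONSHIP]) for request in requests])
```

Keeping the batch size as a parameter makes it easy to repeat the import with different sizes.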


I tried various batch sizes, and had success up to a batch size of 1000, but failures with a batch of 5000. My initial view is that the failures come from the number of records, and not the total data size, but I've not tested this extensively.


This approach also gave significant performance gains, though only when network latency was a main performance factor - i.e. it helped a lot when connecting to Crm Online, but gave no appreciable benefit (and in some cases, worse performance) when connecting to an On Premise CRM server. Most of the benefit came with a batch size of 10 records, though performance did continue to improve slightly when increasing the batch size up to 1000 records.


Performance results
I did the tests running on Windows 2008 Server with moderate capacity:



  • A virtual server running via Hyper-V

  • One processor core allocated to the server, running at 2GHz

  • 4 GB of memory

  • Server was running in a hosted environment in England, connecting to the EMEA Crm Online Data Centre

The tests were done writing 100000 new records to a custom entity, writing 2 text fields, an integer field, and an option set, and allowing CRM to generate the primary key values.

Concurrent packages   Batch size   Time (seconds)   Records / sec
1                     n/a          8321             12
10                    n/a          1094             91
10                    10           276              362
10                    100          267              374
10                    1000         233              429
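The rate and speed-up figures follow directly from the elapsed times. As a quick check, the sketch below recomputes them from the table, assuming the Time column is in seconds and that all tests wrote the same 100,000 records:

```python
# (concurrent packages, batch size, elapsed seconds) from the results table.
RESULTS = [
    (1, None, 8321),
    (10, None, 1094),
    (10, 10, 276),
    (10, 100, 267),
    (10, 1000, 233),
]
RECORDS = 100_000


def summarise(results, records):
    """Return (packages, batch size, records/sec, speed-up vs. first row)."""
    baseline_seconds = results[0][2]
    rows = []
    for packages, batch_size, seconds in results:
        rate = int(records / seconds)  # truncated, which matches the table
        speedup = round(baseline_seconds / seconds, 1)
        rows.append((packages, batch_size, rate, speedup))
    return rows


for row in summarise(RESULTS, RECORDS):
    print(row)
```

The last row works out at a little over 35 times the baseline rate, and the second row (10 packages, no batching) at between 7 and 8 times.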


Conclusions
The main performance issue with modifying multiple records with Crm Online relates to network latency. I've successfully tested 2 complementary approaches which give a combined 35-fold speed improvement, and it may also be possible to gain further improvements with asynchronous calls.


The performance figures were for a custom entity. I'm intending to do further tests with the customer entities (account and contact), and the activity entities, as each of these requires more SQL updates than a custom entity.