
Monday, 13 February 2012

Data Migration Performance to Crm Online

I've recently been looking at the rate of data migration into Crm Online using SSIS, and how this can be optimised. I started with a baseline rate of 12 records per second, and have so far improved this to 430 records per second, all using one client machine.


The easiest way to migrate data into Crm Online is via a synchronous, single-threaded process that writes one record at a time, so this was the baseline. It soon became very clear that the performance bottleneck in this scenario is network latency - i.e. the round-trip time for the network packets to make a request to the Crm Online server, and to receive a response back.
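
For illustration, the baseline load is nothing more than a loop of synchronous Create calls. A minimal sketch with the CRM 2011 SDK, assuming an IOrganizationService already connected to Crm Online, and with hypothetical entity and attribute names (exc_testrecord, exc_name):

using System.Collections.Generic;
using Microsoft.Xrm.Sdk;

static void LoadRecords(IOrganizationService service, IEnumerable<string> names)
{
    foreach (string name in names)
    {
        var record = new Entity("exc_testrecord");
        record["exc_name"] = name;
        service.Create(record);   // one network round-trip per record
    }
}

Every record pays the full round-trip to the data centre, which is why latency dominates.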


So, the challenge was to improve on this, which can be done by addressing each aspect of the simple scenario - i.e.



  1. Synchronous v. asynchronous calls

  2. Single or multi-threaded - either within one process, or multiple concurrent processes

  3. Sending more than one record at a time

So far, I've not tested asynchronous calls, primarily because SSIS is stream-based, and I can't see a way to write out synchronous error output when using an asynchronous pattern. It would be possible to write out asynchronous error information, but for now that would involve too much code rewriting. In general, though, I would expect an asynchronous pattern to give similar performance benefits to the multi-threaded approach, and it may be possible to multiply the benefits by combining the approaches.


Multiple threads
SSIS controls the threading behaviour of a package, so rather than try for a multi-threaded single process, I went for running several instances of the same package concurrently, which you can do with the dtexec tool. There are two main aspects to making this work:



  1. You will need to be able to partition the source data, so that each package instance submits different records. For my tests, I had an integer identity column on the source data, and used the SQL modulo operator (%) to filter on the remainder from an integer division. For 10 concurrent packages, the where clause was 'WHERE id % 10 = ?', with the '?' replaced by a package variable (see the example command line after this list).

  2. The package will not be able to reference any files, whether as data sources, destinations or log files, as SSIS will attempt to get exclusive access to the files. So, I used a SQL Server source, and wrote log information to SQL via the SSIS SQL Log Provider.
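
As an example of the partitioning, each of the 10 instances can be launched with a dtexec command line like the one below, varying the final value from 0 to 9. The package path and variable name are illustrative:

dtexec /F "C:\Packages\CrmImport.dtsx" /SET "\Package.Variables[User::PartitionId].Properties[Value];3"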

I tested 10 concurrent packages, and this gave between a 7-fold and 9-fold performance improvement.


Submitting multiple records
The Crm API is primarily designed around modifying one record at a time, with a separate request per record. However, CRM 2011 introduced the facility to pass multiple records, using the RelatedEntities property of a 'parent' entity. This allows you to build (in memory) a RelatedEntityCollection of multiple records, then attach this to one record, and submit this as one request.


There are two limitations to this approach:



  1. The entities have to be related via a relationship in CRM.

  2. The same data operation has to apply to the parent record and to the records in the RelatedEntityCollection.

Initially I'd hoped to use the systemuser entity as the parent entity, as there is necessarily a relationship between this entity and any user-owned entity. However, this fell foul of limitation 2: I wanted to update the systemuser record but create the related entities, which is not permitted.


Instead, I had to make schema changes. I created a new entity (e.g. exc_batchimport), and a 1-N relationship with each entity that I wanted to import. Each request would then create 1 exc_batchimport record, and a configurable number of related records.
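
As a rough sketch of what one batched request looks like with the CRM 2011 SDK - the child entity (exc_testrecord) and the relationship schema name (exc_batchimport_testrecord) are illustrative, and service is an IOrganizationService:

using System.Collections.Generic;
using Microsoft.Xrm.Sdk;

static void CreateBatch(IOrganizationService service, IList<string> names)
{
    // Build the child records in memory
    var children = new EntityCollection();
    foreach (string name in names)
    {
        var child = new Entity("exc_testrecord");
        child["exc_name"] = name;
        children.Entities.Add(child);
    }

    // Attach the collection to one parent record via the 1-N relationship
    var parent = new Entity("exc_batchimport");
    parent.RelatedEntities.Add(new Relationship("exc_batchimport_testrecord"), children);

    // One Create call submits the parent and all related records
    // in a single request, and hence a single round-trip
    service.Create(parent);
}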


I tried various batch sizes, and had success up to a batch size of 1000, but failures with a batch size of 5000. My initial view is that the failures stem from the number of records, rather than the total data size, but I've not tested this extensively.


This approach also gave significant performance gains, though only when network latency was a major performance factor - i.e. it helped a lot when connecting to Crm Online, but gave no appreciable benefit (and in some cases worse performance) when connecting to an On Premise CRM server. Most of the benefit came with a batch size of 10 records, though performance did continue to improve slightly as batch sizes increased up to 1000 records.


Performance results
I did the tests running on Windows 2008 Server with moderate capacity:



  • A virtual server running via Hyper-V

  • One processor core allocated to the server, running at 2GHz

  • 4 GB of memory

  • Server was running in a hosted environment in England, connecting to the EMEA Crm Online Data Centre

The tests were done writing 100,000 new records to a custom entity, writing 2 text fields, an integer field and an option set, and allowing CRM to generate the primary key values.


Concurrent packages   Batch size   Time (s)   Records / sec
1                     n/a          8321       12
10                    n/a          1094       91
10                    10           276        362
10                    100          267        374
10                    1000         233        429


Conclusions
The main performance issue with modifying multiple records with Crm Online relates to network latency. I've successfully tested 2 complementary approaches which give a combined 35-fold speed improvement, and it may also be possible to gain further improvements with asynchronous calls.


The performance figures were for a custom entity. I'm intending to do further tests with the customer entities (account and contact), and the activity entities, as each of these requires more SQL updates than a custom entity.

Friday, 4 February 2011

.Net Framework versions of custom components with SQL 2008 R2 BIDS

Rather a long title, but I couldn't think of anything shorter. Anyway, it's a topic that I would have preferred was better publicised.

SQL 2008 R2 Business Intelligence Development Studio (BIDS) will only recognise extension components (such as SSIS Data Flow Components) that are built against .Net Framework 3.5. Neither earlier nor later versions will work, and I've yet to find any useful messages to tell you why.

So far I've only done enough testing to find combinations that definitely work, as summarised in the following table.

BIDS Version    .Net Framework version of component
SQL 2005        2.0
SQL 2008        2.0
SQL 2008 R2     3.5

Thursday, 5 November 2009

SSIS Packages and Excel Data on 64 bit machines

We do a fair amount of integration work between CRM and Excel using SSIS, and it's all got a bit harder with 64 bit machines. The main issues are:

There is no 64 bit OLEDB provider for Excel
So the packages cannot be executed by 64 bit code, which means the programmatic route for running packages with the classes in the Microsoft.SqlServer.Dts.Runtime namespace no longer works from a 64 bit process. The error you get is typically unhelpful - 'AcquireConnection method call to the connection manager "EXCEL: Source" failed with error code 0xC00F9304'. You can also get error code 0xC020801C.

The workaround is to run a package using the 32 bit version of dtexec.exe, which should be installed in Program Files (x86)\Microsoft SQL Server\100\DTS\Binn. This is also the fix if scheduling packages with the SQL Agent; your job steps need to be of type CmdExec instead of SSIS Package, and you have to learn the command line syntax of dtexec.
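
For example, to force a package to run under the 32 bit runtime (the package path is illustrative):

"C:\Program Files (x86)\Microsoft SQL Server\100\DTS\Binn\dtexec.exe" /F "C:\Packages\ExcelImport.dtsx"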

Overall, something of a pain, but I've not yet found anything I can't do with dtexec.exe and its command line parameters.

The Registry settings for the 32 bit OLEDB provider for Excel have moved
There are several registry settings that you may need in order to make the OLEDB provider for Excel work, as described here and here. When running a package in 32 bit mode on a 64 bit server, you will find that it uses registry keys in a different location.

Values that were in HKLM\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel are now in HKLM\SOFTWARE\Wow6432Node\Microsoft\Jet\4.0\Engines\Excel.

Note also that the paths to the providers have changed, so for example "Win32"="C:\Windows\system32\msexcl40.dll" becomes "Win32"="C:\Windows\syswow64\msexcl40.dll".
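
Put together as a .reg sketch (note that .reg syntax requires doubled backslashes in string values; any other Excel engine values you rely on, such as TypeGuessRows, move to the same key):

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Jet\4.0\Engines\Excel]
"Win32"="C:\\Windows\\syswow64\\msexcl40.dll"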

Although my emphasis here has been on Excel providers and SSIS, the issues and resolutions can apply to other OLEDB providers and environments, as there are several OLEDB providers that are not expected to have 64 bit releases (the Exchange OLEDB provider and the FoxPro one are two that I'm aware of).

Monday, 26 January 2009

Office 2007 System Driver - errors in SSIS if you don't have them

A quick post, as I didn't find any others out there giving all the details you may need. SQL 2008 SSIS supports the Office 2007 file formats, but requires the Office 2007 System Driver.

If you deploy a package onto a machine without these drivers, you get errors like the following when the component is validated:

SSIS Error Code DTS_E_OLEDB_NOPROVIDER_ERROR
The requested OLE DB provider Microsoft.ACE.OLEDB.12.0 is not registered
Hresult: 0x80040154 Description: "Class not registered"

Installing the Office 2007 System Driver should fix this.

Friday, 10 October 2008

Excel Data Type issues with OLEDB and SSIS

I recently met an annoying issue when reading Excel data in SSIS, where a column contained a mix of data-types – in this case string and numeric information. The problem comes from the fact that Excel does not have a concept of column data-types, and from the way the OLE-DB provider for Excel attempts to resolve this.

I found the information I needed in an excellent blog post here, so I won't repeat the content. One additional point that is specific to SSIS is how to add the IMEX extended property to the connection string. In Visual Studio 2005 (I’ve not checked other versions) you can only do this by editing the ConnectionString property directly in the Properties window – the dialog box doesn’t offer this option.
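
For reference, the edited ConnectionString property ends up looking something like this (the file path is illustrative):

Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Data\Source.xls;Extended Properties="Excel 8.0;HDR=YES;IMEX=1";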

Thursday, 20 December 2007

Stored procedures in SSIS; the need for nocount

I've been doing a lot of SSIS work lately, and met an odd problem when using stored procedures in an OLEDB source. The general idea was to have a stored procedure that did some data modification, then finished with a SELECT statement to output logging information.

When previewing the OLEDB source everything was fine, but running the package would give the following error in the pre-execute phase:
'A rowset based on the SQL command was not returned by the OLE DB provider'

It took a while to work out, but eventually the problem came down to the way that SQL Server returns information about the number of rows affected. The resolution is to put the following SQL statement at the start of the stored procedure definition to prevent output of the number of rows:

SET NOCOUNT ON

After that it was all fine.
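
For reference, a stripped-down version of the working pattern - the procedure, table and column names are illustrative:

CREATE PROCEDURE dbo.exc_ProcessData
AS
BEGIN
    -- Prevent 'rows affected' messages being treated as results
    SET NOCOUNT ON;

    -- The data modification
    UPDATE dbo.SourceData SET Processed = 1 WHERE Processed = 0;

    -- Finish with a SELECT so the OLEDB source receives a rowset
    SELECT @@ROWCOUNT AS RecordsProcessed, GETDATE() AS ProcessedOn;
END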