Choose the right .NET data provider, optimize application performance

In this article, John Goodson discusses the choices a .NET developer faces in using the ADO.NET/System.Data elements of the Framework Class Library. He discusses how to choose between a DataSet and a DataReader, what's wrong with the CommandBuilder, and how to use connection pooling to minimize the number of open connections against the database server.

Developing performance-oriented .NET applications is not easy. The .NET standard includes only basic guidelines and interface definitions to help programmers develop .NET applications. In addition, .NET data providers do not throw exceptions to say that your code is running too slowly.

Designing a .NET application is a complex process, in part because the code can be very data provider-specific. If you are working with several databases, you will find differences in the programming concepts between the data providers. You will need much more database knowledge to design your application to run efficiently.

This article contains guidelines compiled by examining the .NET implementations of shipping .NET applications and analyzing the common performance mistakes that we found. The guidelines discuss selecting .NET objects and methods, designing .NET applications, retrieving data, and updating data.

Selecting .NET Objects and Methods

The guidelines in this section will help you to optimize system performance when selecting and using .NET objects and methods.

Choosing Between a DataSet and a DataReader

A critical choice when designing your application is whether to use a DataSet or a DataReader to retrieve data. If you need to retrieve many records rapidly, use a DataReader. The DataReader object is fast, returning a fire hose of read-only data from the server, one record at a time. In addition, retrieving results with a DataReader requires significantly less memory than creating a DataSet. The DataReader does not allow random fetching, nor does it allow for updating the data. However, .NET data providers optimize their DataReaders for efficiently fetching large amounts of data.
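As a minimal sketch (shown here with the SQL Server provider and a hypothetical EMP table; conn is assumed to be an open connection), a DataReader is consumed in a simple forward-only loop:

// Forward-only, read-only pass over the result set. Rows stream
// from the server; earlier rows cannot be revisited.
SqlCommand DBCmd = new SqlCommand("SELECT empno, ename FROM EMP", conn);
SqlDataReader myDataReader = DBCmd.ExecuteReader();
while (myDataReader.Read())
{
    int empno = myDataReader.GetInt32(0);
    string ename = myDataReader.GetString(1);
    // Process the current row, then move on
}
myDataReader.Close();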

In contrast, the DataSet object is a cache of disconnected data stored in memory on the client. In effect, it is a small database in itself. Because the DataSet contains all of the data that has been retrieved, you have more options in the way you can process the data. You can randomly choose records from within the DataSet and update/insert/delete records at will. You can also manipulate relational data as XML. This flexibility provides impressive functionality for any application, but comes with a high cost in memory consumption. In addition to keeping the entire result set in memory, the DataSet maintains both the original and the changed data, which leads to even higher memory usage. Do not use DataSets with very large result sets as the scalability of the application will be drastically reduced.
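For comparison, a sketch of filling a DataSet from the same hypothetical table shows why the flexibility costs memory: every row is cached on the client before any processing begins.

// Fill a disconnected DataSet; all rows are cached in client memory
SqlDataAdapter myDataAdapter = new SqlDataAdapter(
    "SELECT empno, ename FROM EMP", conn);
DataSet myDataSet = new DataSet();
myDataAdapter.Fill(myDataSet, "Employees");

// Random access and offline changes are now possible
DataRow row = myDataSet.Tables["Employees"].Rows[0];
row["ename"] = "SMITH";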

Avoiding the CommandBuilder Object

It is tempting to use a CommandBuilder object because it generates SQL statements and can save the developer time when coding a new application that uses DataSets. However, this shortcut can have a negative effect on performance. Because of concurrency restrictions, the CommandBuilder can generate highly inefficient SQL statements. For example, suppose you have a table called EMP, which is an 8-column table with simple employee records. A CommandBuilder would generate the following update statement:

CommandText: "UPDATE EMP SET EMPNO = ?, ENAME = ?, JOB = ?, MGR = ?, HIREDATE = ?, SAL = ?, COMM = ?, DEPT = ? WHERE ( (EMPNO = ?) AND (ENAME = ?) AND (JOB = ?) AND ((MGR IS NULL AND ? IS NULL) OR (MGR = ?)) AND (HIREDATE = ?) AND (SAL = ?) AND ((COMM IS NULL AND ? IS NULL) OR (COMM = ?)) AND (DEPT = ?) )"

The application developer can often write much more efficient UPDATE and DELETE statements than those that the CommandBuilder generates. In the previous example, the programmer, knowing the underlying database schema and that the EMPNO column of the EMP table is the primary key for the table, can code the same UPDATE statement as follows:

UPDATE EMP SET EMPNO = ?, ENAME = ?, JOB = ?, MGR = ?, HIREDATE = ?, SAL = ?, COMM = ?, DEPT = ? WHERE EMPNO = ?

This statement will run much more efficiently on the database server than the statement generated by the CommandBuilder.

Another drawback is also implicit in the design of the CommandBuilder object. The CommandBuilder must generate statements at runtime. Each time a DataAdapter.Update method is called, the CommandBuilder must analyze the contents of the result set and generate UPDATE, INSERT, and DELETE statements for the DataAdapter. When the programmer explicitly specifies the UPDATE, INSERT, and DELETE statements for the DataAdapter, this extra processing time is avoided.
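As a sketch of the alternative (shown with the SQL Server provider and a shortened column list for brevity), the programmer attaches the hand-written UPDATE statement to the DataAdapter once, so no statement generation happens at runtime:

// Supply the update logic yourself instead of letting a
// CommandBuilder derive it each time Update is called
SqlDataAdapter myDataAdapter = new SqlDataAdapter("SELECT * FROM EMP", conn);
myDataAdapter.UpdateCommand = new SqlCommand(
    "UPDATE EMP SET ENAME = @ename, JOB = @job WHERE EMPNO = @empno", conn);
myDataAdapter.UpdateCommand.Parameters.Add("@ename", SqlDbType.VarChar, 10, "ENAME");
myDataAdapter.UpdateCommand.Parameters.Add("@job", SqlDbType.VarChar, 9, "JOB");
myDataAdapter.UpdateCommand.Parameters.Add("@empno", SqlDbType.Int, 4, "EMPNO");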

Using Parameter Markers as Arguments to Stored Procedures

When calling stored procedures, always use parameters for the arguments instead of using literal arguments.

.NET data providers can call stored procedures on the database server either by executing the procedure the same way as any other SQL query, or by optimizing the execution by invoking an RPC (Remote Procedure Call) directly into the database server. When you execute the stored procedure as a SQL query, the database server parses the statement, validates the argument types, and converts the arguments into the correct data types.

Remember that SQL is always sent to the database server as a character string, for example, “getCustName (12345)”. In this case, even though the application programmer might assume that the only argument to getCustName is an integer, the argument is actually passed inside a character string to the server. The database server would parse the SQL query, consult database metadata to determine the parameter contract of the procedure, isolate the single argument value 12345, then convert the string ‘12345’ into an integer value before finally executing the procedure as a SQL language event.

By invoking an RPC inside the database server, the overhead of using a SQL character string is avoided. Instead, a .NET data provider will construct a network packet that contains the parameters in their native data type formats and execute the procedure remotely.

To use stored procedures correctly, set the CommandText property of the Command object to the name of the stored procedure. Then, set the CommandType property of the command to StoredProcedure. Finally, pass the arguments to the stored procedure using parameter objects. Do not physically code the literal arguments into the CommandText.

Case 1

SybaseCommand   DBCmd = new SybaseCommand ("getCustName(12345)", Conn);
SybaseDataReader  myDataReader;
myDataReader = DBCmd.ExecuteReader();

In this case, the stored procedure cannot be optimized to use a server-side RPC. The database server must treat the SQL request as a normal language event, which includes parsing the statement, validating the argument types, and converting the arguments into the correct data types before executing the procedure.

Case 2

SybaseCommand   DBCmd = new SybaseCommand("getCustName", Conn);
DBCmd.Parameters.Add("param1",SybaseDbType.Int,10,"").Value = 12345;
SybaseDataReader  myDataReader;
myDataReader.CommandType = CommandType.StoredProcedure
myDataReader = DBCmd.ExecuteReader();

In Case 2, the stored procedure can be optimized to use a server-side RPC. Because the application avoids literal arguments and calls the procedure by specifying all arguments as parameters, the .NET data provider can optimize the execution by invoking the stored procedure directly inside the database as an RPC. This case avoids SQL language processing on the database server and the execution time is greatly improved.

Designing .NET Applications

The guidelines in this section will help you to optimize system performance when designing .NET applications.

Using Connection Pooling

Connecting to a database is the single slowest operation inside a data-centric application. That’s why connection management is important to application performance. Optimize your application by connecting once and using multiple statement objects, instead of performing multiple connections. Avoid connecting to a data source after establishing an initial connection.

Connection pooling is an intrinsic part of the .NET data provider. Connection pooling lets you reuse active connections. Closing connections does not close the physical connection to the database. When an application requests a connection, an active connection is reused. This avoids the network I/O needed to create a new connection.

Pre-allocate connections. Decide what connection strings you will need. Remember that each unique connection string creates a new connection pool.

In the following example, three new OracleConnection objects are created, but only two connection pools are required to manage them. Note that the first and second connection strings differ only by the value assigned for User ID and Password.

OracleConnection conn1 = new OracleConnection ();
conn1.ConnectionString = "SID=ORCL;Host=Accounting;User ID=scott;password=tiger";
conn1.Open();
// Pool A is created.

OracleConnection conn2 = new OracleConnection();
conn2.ConnectionString = "SID=ORCL;Host=Accounting;User ID=lucy;password=quake";
conn2.Open();
// Pool B is created because the connection strings differ.

OracleConnection conn3 = new OracleConnection();
conn3.ConnectionString = "SID=ORCL;Host=Accounting;User ID=scott;password=tiger";
conn3.Open();
// conn3 is assigned an existing connection that was created in Pool A
// when the pool was created for conn1.

Once created, connection pools are not destroyed until the active process ends or the connection lifetime is exceeded. Maintenance of inactive or empty pools involves minimal system overhead.

Connection and statement handling should be addressed before implementation. Spending time up front on thoughtful connection management improves both application performance and maintainability.

Opening and Closing Connections

Open connections just before they are needed. Opening them earlier than necessary decreases the number of connections available to other users and can increase the demand for resources.

To keep resources available, explicitly Close the connection as soon as it is no longer needed. If you wait for the garbage collector to implicitly clean up connections that go out of scope, the connections will not be returned to the connection pool immediately, tying up resources that are not actually being used.

Close connections inside a finally block. Code in the finally block always runs, regardless of whether an exception occurs. This guarantees explicit closing of connections. For example:

try
{
 DBConn.Open();
... // Do some other interesting work
}
catch (Exception ex)
{
 // Handle exceptions
}
finally
{
 //  Close the connection
 if (DBConn != null)
  DBConn.Close();
}

If you are using connection pooling, opening and closing connections is not an expensive operation. Using the Close() method of the data provider’s Connection object adds or returns the connection to the connection pool. Remember, however, that closing a connection automatically closes all DataReader objects associated with the connection.
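In C#, a using block is an equally safe and more concise way to guarantee the close, because Dispose runs when the block exits, even if an exception is thrown, and disposing a Connection object closes it. A sketch, assuming connString is defined elsewhere:

using (SqlConnection DBConn = new SqlConnection(connString))
{
    DBConn.Open();
    // Do some other interesting work
} // DBConn is closed here and returned to the connection pool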

Managing Commits in Transactions

Committing transactions is slow because of disk I/O and, potentially, network I/O. Always start a transaction after connecting; otherwise, you are in autocommit mode.

What does a commit actually involve? The database server must flush back to disk every data page that contains updated or new data. This is usually a sequential write to a journal file, but nonetheless, is a disk I/O. By default, Autocommit is on when connecting to a data source, and Autocommit mode usually impairs performance because of the significant amount of disk I/O needed to commit every operation.

Furthermore, some database servers do not provide an Autocommit mode natively. For this type of server, the .NET data provider must explicitly issue a COMMIT statement and a BEGIN TRANSACTION for every operation sent to the server. In addition to the large amount of disk input/output required to support Autocommit mode, a performance penalty is paid for up to three network requests for every statement issued by an application.

The following code fragment starts a transaction for Oracle:

OracleConnection MyConn = new OracleConnection("Connection String info");
MyConn.Open();

// Start a transaction
OracleTransaction TransId = MyConn.BeginTransaction();

// Enlist a command in the current transaction
OracleCommand OracleToDS = new OracleCommand();
OracleToDS.Transaction = TransId;
...
// Continue on and do more useful work in the
// transaction

Although using transactions can help application performance, do not take this tip too far. Leaving transactions active can reduce throughput by holding locks on rows for long times, preventing other users from accessing the rows. Commit transactions in intervals that allow maximum concurrency.
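One useful middle ground is to commit every N operations. The following sketch reuses the Oracle objects from the fragment above; the batch size of 100 and the totalOperations count are illustrative assumptions to be tuned against your own concurrency requirements:

OracleTransaction TransId = MyConn.BeginTransaction();
for (int i = 0; i < totalOperations; i++)
{
    // ... execute a statement enlisted in TransId ...
    if ((i + 1) % 100 == 0)
    {
        TransId.Commit();                     // release locks periodically
        TransId = MyConn.BeginTransaction();  // start the next batch
    }
}
TransId.Commit(); // commit whatever remains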

Choosing the Right Transaction Model

Many systems support distributed transactions; that is, transactions that span multiple connections. Distributed transactions are at least four times slower than normal transactions due to the logging and network I/O necessary to communicate between all the components involved in the distributed transaction (the .NET data provider, the transaction monitor, and the database system). Distributed transactions should be used only when transactions must span multiple DBMSs or multiple servers.

Unless distributed transactions are required, avoid using them. Instead, use local transactions whenever possible.
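For .NET Framework 2.0 and later, the System.Transactions namespace makes this boundary visible: a TransactionScope remains a lightweight local transaction while only one connection is enlisted and, with SQL Server 2005 or later, is promoted to a distributed transaction only when a second connection or server joins. A minimal sketch, assuming connString is defined elsewhere:

using (TransactionScope scope = new TransactionScope())
{
    using (SqlConnection DBConn = new SqlConnection(connString))
    {
        DBConn.Open();  // enlists in a lightweight local transaction
        // ... do transactional work on this one connection ...
    }
    // Opening a second connection inside the scope would promote the
    // transaction to a distributed one; avoid that when you can.
    scope.Complete();
}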

Using Commands that Retrieve Little or No Data

Commands such as INSERT, UPDATE, and DELETE do not return data. Use these commands with the ExecuteNonQuery method of the Command object. Although you may successfully execute these commands using the ExecuteReader method, the .NET data provider will properly optimize the database access for INSERT, UPDATE, and DELETE statements only through the ExecuteNonQuery method. The following example shows how to insert a row into the EMPLOYEE table using ExecuteNonQuery.

DBConn.Open();
DBTxn = DBConn.BeginTransaction();
// Set the Connection property of the Command object
DBCmd.Connection = DBConn;
// Set the text of the Command to the INSERT statement
DBCmd.CommandText = "INSERT into EMPLOYEE VALUES (15,'HAYES',
     'ADMIN',6, " + "'17-APR-2002',18000,NULL,4)";
// Set the transaction property of the Command object
DBCmd.Transaction = DBTxn;
// Execute the statement with ExecuteNonQuery, because we are not
// returning results
DBCmd.ExecuteNonQuery();
// Now commit the transaction
DBTxn.Commit();

// Close the connection
DBConn.Close();

Use the ExecuteScalar method of the Command object to return a single value, such as a sum or a count, from the database. The ExecuteScalar method returns only the value of the first column of the first row of the result set. Once again, you could use the ExecuteReader method to successfully execute such queries, but by using the ExecuteScalar method, you tell the .NET data provider to optimize for a result set that consists of a single row and a single column. By doing so, the data provider can avoid a lot of overhead and improve performance. The following example shows how to retrieve the count of a group:

// Retrieve the number of employees who make more than $50000
// from the EMPLOYEE table

// Open connection to Sybase database
SybaseConnection  Conn;
Conn = new SybaseConnection("host=bowhead;port=4100;User ID=test01;" +
                            "Password=test01;Database Name=Accounting");
Conn.Open();

// Make a command object
SybaseCommand  salCmd = new SybaseCommand("SELECT count(sal) FROM EMPLOYEES " +
                                          "WHERE sal > 50000", Conn);

try
{
    int count = (int)salCmd.ExecuteScalar();
}
catch (Exception ex)
{
    // Display any exceptions in a messagebox
    MessageBox.Show (ex.Message);
}
//  Close the connection
Conn.Close();

Using Commands Multiple Times

Choosing whether to use the Command.Prepare method can have a significant positive (or negative) effect on query execution performance. The Command.Prepare method tells the underlying data provider to optimize for multiple executions of statements that use parameter markers. Note that it is possible to Prepare any command regardless of which execution method is used (ExecuteReader, ExecuteNonQuery, or ExecuteScalar).

Consider the case where a .NET data provider implements Command.Prepare by creating a stored procedure on the server that contains the prepared statement. Creating a stored procedure involves substantial overhead, but the statement can then be executed multiple times. Although creating the stored procedure is performance-expensive, the cost of each execution is minimized because the query is parsed and optimization paths are stored when the procedure is created. Applications that execute the same statement multiple times can benefit greatly from calling Command.Prepare and then executing that Command multiple times.

However, using Command.Prepare for a statement that is executed only once results in unnecessary overhead. Furthermore, applications that use Command.Prepare for large single execution query batches exhibit poor performance. Similarly, applications that either always use Command.Prepare or never use Command.Prepare do not perform as well as those that use a logical combination of prepared and unprepared statements.
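As a sketch of the intended pattern, the following fragment prepares a parameterized INSERT once and then executes it in a loop; the EMPLOYEE column list, the open connection DBConn, and the newEmployees array are assumptions for illustration:

SqlCommand DBCmd = new SqlCommand(
    "INSERT INTO EMPLOYEE (empno, ename) VALUES (@empno, @ename)", DBConn);
DBCmd.Parameters.Add("@empno", SqlDbType.Int);
DBCmd.Parameters.Add("@ename", SqlDbType.VarChar, 20);
DBCmd.Prepare(); // one-time cost; pays off over many executions

for (int i = 0; i < newEmployees.Length; i++)
{
    DBCmd.Parameters["@empno"].Value = newEmployees[i].Id;
    DBCmd.Parameters["@ename"].Value = newEmployees[i].Name;
    DBCmd.ExecuteNonQuery(); // reuses the prepared statement
}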

Using Native Managed Providers

Bridges into unmanaged code, that is, code outside the .NET environment, adversely affect performance. Calling unmanaged or native code from managed code causes the CLR (Common Language Runtime) to make additional checks on calls to the unmanaged code, which impacts performance.

The .NET CLR is a very efficient and highly tuned environment. By using 100% managed code so that your .NET assemblies run inside the CLR, you can take advantage of the numerous built-in services to enhance the performance of your managed application and the productivity of your development staff. The CLR provides automatic memory management, so developers don't have to spend time debugging memory leaks. Automatic lifetime control of objects includes garbage collection, scalability features, and support for side-by-side versions. In addition, the .NET Framework security model enforces security restrictions on managed code that protect the code and data from being misused or damaged by other code. An administrator can define a security policy to grant or revoke permissions at the enterprise, machine, assembly, or user level.

However, many .NET data provider architectures must bridge outside the CLR into native code to establish network communication with the database server. The overhead and processing required to enter this bridge is slow in the current version of the CLR.

Depending on your architecture, you may not realize that the underlying .NET data provider is incurring this security risk and performance penalty. Be careful when choosing a .NET data provider that advertises itself as a 100% or pure managed code data provider. If the “Managed Data Provider” requires unmanaged database clients or other unmanaged pieces, then it is not a 100% managed data access solution. Only a very few vendors produce true managed code providers that implement their entire stack as a managed component.

Retrieving Data

To retrieve data efficiently, return only the data that you need, and choose the most efficient method of doing so. The guidelines in this section will help you to optimize system performance when retrieving data with .NET applications.

Retrieving Long Data

Unless it is necessary, applications should not request long data because retrieving long data across a network is slow and resource-intensive. Remember that when you use a DataSet, all data is retrieved from the data source, even if you never use it.

Although the best method is to exclude long data from the select list, some applications do not formulate the select list before sending the query to the .NET data provider (that is, some applications send SELECT * from <table name> ...). If the select list contains long data, most .NET data providers must retrieve that data at fetch time, even if the application does not ultimately bind the long data result columns to display to the user. When possible, the developer should try to implement a method that limits the number of columns that are retrieved.

Most users don’t want to see long data. If the user does want to process these result items, then the application can query the database again, specifying only the long columns in the select list. This method allows the average user to retrieve the result set without having to pay a high performance penalty for network traffic.

For example, consider a query such as

"SELECT * from EMPLOYEES WHERE SSID = '999-99-2222' ".

An application might only want to retrieve this employee’s name and address. But, remember that a .NET data provider is not intuitive. It has no idea which result columns an application might be trying to retrieve when the query is executed. A data provider only knows that an application can request any of the result columns. When the .NET data provider processes the fetch request, it will most likely return at least one, if not more, result rows across the network from the database server. In this case, a result row will contain all the column values for each row — including an employee photograph if the Employees table happens to contain such a column. Limiting the select list to contain only the name and address columns results in decreased network traffic and a faster performing query at runtime.
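A sketch of the narrower request (the name and address column names are assumptions, and DBConn is an open connection):

// Fetch only the columns the application actually displays
SqlCommand DBCmd = new SqlCommand(
    "SELECT name, address FROM EMPLOYEES WHERE SSID = '999-99-2222'", DBConn);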

Reducing the Size of Data Retrieved

To reduce network traffic and improve performance, you can reduce the size of any data being retrieved to some manageable limit by using a database-specific command. For example, an Oracle data provider might let you limit the number of bytes of data the connection uses to fetch multiple rows. A Sybase data provider might let you limit the number of bytes of data that can be returned from a single IMAGE column in a result set. For example, with Microsoft SQL Server and Sybase ASE, you can issue “Set TEXTSIZE n” on any connection, where n sets the maximum number of bytes that will ever be returned to you from any TEXT or IMAGE column.
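The command is issued like any other statement. A sketch, in which the 32 KB cap is an arbitrary choice:

// Cap every TEXT/IMAGE value fetched on this connection at 32 KB
SqlCommand limitCmd = new SqlCommand("SET TEXTSIZE 32768", DBConn);
limitCmd.ExecuteNonQuery();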

If the data provider allows you to define the packet size, use the smallest packet size that will meet your needs.

In addition, be careful to return only the rows you need. If you return five rows when you only need two rows, performance is decreased, especially if the unnecessary rows include long data.

Especially when using a DataSet, be sure to use a WHERE clause with every SELECT statement to limit the amount of data that will be retrieved. Even with a WHERE clause, a SELECT statement that does not adequately restrict the request could return hundreds of rows of data. For example, if you want the complete row of data from the EMPLOYEE table for each manager hired since 2000, you might be tempted to issue the following statement and then, in your application code, filter out the rows for employees who are not managers:

SELECT * from EMPLOYEE WHERE hiredate > 2000

However, suppose the EMPLOYEE table contains a PHOTOGRAPH column. Retrieving all the extra rows could be extremely expensive. Let the database filter them for you and avoid having all the extra data that you don’t need sent across the network. A better request further limits the data returned and improves performance:

SELECT * from EMPLOYEE WHERE hiredate > 2000 and job_title = 'Manager'

Choosing the Right Data Type

Advances in processor technology have brought significant improvements to the way that operations such as floating-point math are handled. However, retrieving and sending certain data types is still expensive when the active portion of your application does not fit into on-chip cache. When you are working with data on a large scale, it is still important to select the data type that can be processed most efficiently. For example, integer data is processed at least twice as fast as floating-point data. Floating-point data is defined according to internal database-specific formats, usually in a compressed format. The data must be decompressed and converted into a different format so that it can be processed by the wire protocol.

Processing time is shortest for character strings, followed by integers, which usually require some conversion or byte ordering.

Synchronizing Changes Back to the Data Source

This section provides guidelines to help you to optimize system performance when updating data in databases with a .NET data provider.

You must build the update logic into the DataAdapter to synchronize changes back to the data source (or let the CommandBuilder do it). Your UPDATE, INSERT, and DELETE statements will be most efficient when the WHERE clause uses only the columns that guarantee a unique row in the table. The following table lists the mechanisms that you can use for different databases:

Database      Update Mechanism
--------      ----------------
DB2           Primary key, Identity
Oracle        Rowid
SQL Server    Primary key, Identity
Sybase        Primary key, Identity

The following example shows the application flow for updating a DataSet using Oracle’s Rowid as the update mechanism:

// Create the DataAdapter and DataSet
OracleCommand DBCmd = new OracleCommand
  ("SELECT rowid, deptid, deptname FROM department", DBConn);

OracleDataAdapter myDataAdapter = new OracleDataAdapter();
myDataAdapter.SelectCommand = DBCmd;
DataSet myDataSet = new DataSet();
myDataAdapter.Fill(myDataSet, "Departments");

// Build the update rules
// Specify how to update data in the DataSet
myDataAdapter.UpdateCommand = new OracleCommand("UPDATE department " +
  "SET deptname = ?, deptid = ? WHERE rowid = ?", DBConn);

// Bind parameters
myDataAdapter.UpdateCommand.Parameters.Add
  ("param1", OracleDbType.VarChar, 100, "deptname");
myDataAdapter.UpdateCommand.Parameters.Add
  ("param2", OracleDbType.Number, 4, "deptid");
myDataAdapter.UpdateCommand.Parameters.Add
  ("param3", OracleDbType.Number, 4, "rowid");

In this example, performance of the queries on the Oracle server improves because the WHERE clause includes only the rowid as a search condition.

Conclusion

With thoughtful design and implementation, the performance of .NET applications can be improved. By retrieving only required data, selecting functions that optimize performance, and properly managing connections and updates, your applications can run more efficiently and generate less network traffic.

About the author

John Goodson leads the product strategy, direction and development efforts for DataDirect products. For more than 10 years, John has worked closely with Sun and Microsoft on the development and evolution of database connectivity standards including J2EE, JDBC, .NET, ODBC and ADO. His active memberships in various standards committees, including the JDBC Expert Group, have helped John's team develop the most technically advanced data connectivity technologies. John is a frequent speaker at major industry events and has most recently been published in Java Developers Journal, Integration Developer News, and Sun Developer Source. John holds a Bachelor of Science in Computer Science from Virginia Tech.


This was first published in December 2007
