How to (efficiently) delete millions of records in a SQL table

A while ago, I blogged about how to efficiently update millions of records in a SQL table, today I’d like to show you a simple trick to do the same when deleting millions of records in a SQL table.

The idea is the same, you don’t want to do it all at once, instead, you want to delete a fixed number of records at a time. Why do we need to do this? For starters, when you delete data from a table, a lot of things happen behind the scenes, and because of this, you need to be careful when removing data, especially, large amounts of data.

Many activities are performed by SQL when removing data, and because of this, it is important to know that certain items like the ones listed below can affect the locking and performance behavior of SQL delete statements:

  • Number of indexes in the table
  • Poor indexed foreign keys
  • Lock escalation mode
  • Isolation levels
  • Batch sizes of the delete statement
  • Trigger(s)
  • Temporal tables

One thing you need to consider is also cascade deletes, when you delete data from one table, SQL might also attempt to remove data from a few or many more tables with references to the data you are deleting. This is a good things as it prevents orphan records, but it has also the potential to affect the performance of your delete statements.

The list above shows you some of the main areas to consider when removing large amounts of data, but in this post, I will assume you have considered having the correct type of indexes to make sure that the indexes are helpful when deleting data, and not the reason for table blocking or worst, deadlocks.

I’d like to show you how to do batch deletes to only delete an specific number of records at a time. Performing the delete in batches, helps avoiding or reducing locking on the tables where the data is being removed. Below are some examples on how to do this in code, as well as when deleting data directly via SQL code.

Batch deletes with C#

n C#, you can use the Dapper library to delete data in batches using the “DELETE” statement in conjunction with a “WHERE” clause and a loop to iterate through a set of records.

Here is an example of how to delete data in batches of 1000 records using Dapper:

using (IDbConnection connection = new SqlConnection(connectionString))
{
    connection.Open();
    int batchSize = 1000;
    int currentIndex = 0;
    while (true)
    {
        string deleteSql = "DELETE FROM myTable WHERE Id IN (SELECT Id FROM myTable ORDER BY Id OFFSET @index ROWS FETCH NEXT @batch ROWS ONLY)";
        var parameters = new { index = currentIndex, batch = batchSize };
        int rowsAffected = connection.Execute(deleteSql, parameters);
        if (rowsAffected == 0)
        {
            break;
        }
        currentIndex += rowsAffected;
    }
}

Please note that this is a simple example for demonstration purposes only and it is recommended that you add additional error handling and logging in your actual implementation.

Batch deletes with SQL

Here is an example of a DELETE statement that can be used to delete data in batches of 1000 records within Microsoft SQL.

DECLARE @batchSize INT = 1000;
DECLARE @currentIndex INT = 0;

WHILE 1 = 1
BEGIN
    DELETE FROM myTable
    WHERE Id IN (
        SELECT Id
        FROM myTable
        ORDER BY Id
        OFFSET @currentIndex ROWS
        FETCH NEXT @batchSize ROWS ONLY
    );

    SET @currentIndex = @currentIndex + @batchSize;

    IF @@ROWCOUNT = 0
        BREAK;
END;

This query uses a WHILE loop to repeatedly execute the DELETE statement until all rows have been deleted. The number of rows deleted in each iteration is determined by the @batchSize variable. The loop continues to execute as long as the @@ROWCOUNT variable is not 0.

It is worth noting that this kind of query can be resource-intensive if the table being deleted from is very large, so it’s important to test the performance on a test environment before running it on production. However, if you have the need to delete millions of records, this is still much better than attempting to delete all records at once.

It’s also worth noting that this query will work on SQL Server. If you are using other SQL databases such as MySQL, PostgreSQL or Oracle, the syntax may be slightly different.

The optimal batch size for deleting records in SQL will depend on various factors such as the size of the table, the amount of available memory, and the overall performance of the database. In general, larger batch sizes can be more efficient as they reduce the number of times the database needs to perform the delete operation, but they also require more memory to hold the set of records being deleted.

A batch size of 1000 records is a common starting point, but it may not be the best option for your specific use case. You may want to experiment with different batch sizes such as 5000 and observe the performance of your database to determine the optimal batch size for your needs.

It’s important to keep in mind that larger batch sizes can consume more memory, especially if your table is large and contains a lot of data. Also, If you have indexes on the table, then deleting large batch of records at once may lead to a more significant impact on the performance of the database.

In addition, if you are working with a production database, you should also consider the possible impact on other queries or transactions that might be executing concurrently. It’s always a good practice to test your code on a test environment before deploying to a production environment.

Lastly, if you have a table or tables where the data needs to be deleted often, you might want to create a scheduled SQL job that will run weekly or as often as you need to, and delete the data that isn’t needed anymore. This will be helpful to avoid deleting large amounts of data at once, and it will also keep your table(s) from increasing in size without need.

Happy coding!

A note about impostor syndrome

Impostor syndrome is when a person doubts their accomplishments, feels that they don’t deserve it, or think that their achievements (a promotion, a raise, etc.) are the result of luck. The impostor syndrome can affect anyone, especially women and minorities who fear they owe their accomplishments to affirmative action.

What if we do the following when the impostor syndrome surround us?

What if we pretended we didn’t feel it?

What if we acted as though we were more confident and more competent?

What if we showed appreciation for what we’ve accomplished and behaved as we thoroughly deserved it?

What if we told our friends and family how happy we are about our accomplishments and how the result was expected due to all of our hard work and persistence?

It takes a lot of work to do this, it takes a lot of effort, more so than any of us is able to cope with.

But what if we did it every time the impostor syndrome shows up?

It’s possible that after doing the above for a while and acting as if we deserve our accomplishments, perhaps we would teach ourselves to take what we deserve and see the outcome we have always hoped for.

 

Parallelism. Using Parallel.For and ConcurrentBag.

Parallelism refers to the technique of running multiple calculations at the same time to speed up a computer program. Historically, this has been a complicated thing to write requiring a developer to do complicated coding including low-level manipulation of threads and locks.

A program will generally run faster if you allow it to execute multiple calculations at the same time. For example, you might have a program where you need check how many orders a customer has, and instead of looping through each customer to check on their orders, you could check on multiple customers at the same time by using something like Parallel.For.

Code example:

private IEnumerable<Orders> MyMethod(List<Orders> orders)
    {
        // Converting the List<Orders> to ConcurrentBag for thread-safe purposes.
        var result = new ConcurrentBag<Orders>();

        Parallel.ForEach(orders, item =>
        {
           // Some data manipulation
           result.Add(new Orders(/* constructor parameters */);
        });

        return result;
    }

The .NET Framework makes writing parallel code a much simpler task than before. A variety of enhancements and additions such as runtime, class library types, and diagnostic tools were introduced with the .NET Framework 4.0 to help developers write safe and efficient parallel code.

Below are some of these tools and enhancements, you can click any of the links for access to Microsoft’s documentation for each one of these:

The benefits

The benefit of using parallel programming is gaining the advantage to execute multiple instructions at the same time. This offers the benefit of making your program faster by reducing the time for the same code to execute sequentially. While this is a great way to speed up your code, you should still consider other ideas as well and not use the framework features around parallelism before knowing more about it. Believe, I know by personal experience, unfortunately.

The disadvantages

The disadvantages of using parallel coding are the increase of use of CPU for it (something to be aware of) and also the potential for issues when using collection objects that aren’t thread-safe. Thread safe means multiple threads can access the common data without any problem. When using something like Parallel.For you want to use a thread-safe object such as ConcurrentBag<T>. Bags are useful for storing objects when ordering doesn’t matter, and unlike sets, bags support duplicates. If you need your collection to be ordered, remember to sort it after converting it to a List<>.

As with everything else, test your code and find out if using the Parallel library or PLINQ in your existing scenario is the right thing for it or not. While it might seem that running things in parallel will always be faster, this isn’t always true. Read more about it here.

Happy coding!