52

I'd like to be able to batch delete thousands or tens of thousands of files at a time on S3. Each file would be anywhere from 1MB to 50MB. Naturally, I don't want the user (or my server) to be waiting while the files are in the process of being deleted. Hence, the questions:

  1. How does S3 handle file deletion, especially when deleting large numbers of files?
  2. Is there an efficient way to do this and make AWS do most of the work? By efficient, I mean by making the least number of requests to S3 and taking the least amount of time using the least amount of resources on my servers.
tpml7
  • 479
SudoKill
  • 623

11 Answers

55

The excruciatingly slow option is aws s3 rm --recursive if you actually like waiting.

Running multiple aws s3 rm --recursive commands in parallel with differing --include patterns is slightly faster, but a lot of time is still spent waiting, as each process individually fetches the entire key list just to perform the --include pattern matching locally.

Enter bulk deletion.

I found I was able to get the most speed by deleting 1000 keys at a time using aws s3api delete-objects.

Here's an example:

cat file-of-keys | xargs -P8 -n1000 bash -c 'aws s3api delete-objects --bucket MY_BUCKET_NAME --delete "Objects=[$(printf "{Key=%s}," "$@")],Quiet=true"' _
  • The -P8 option on xargs controls the parallelism. It's eight in this case, meaning 8 instances of 1000 deletions at a time.
  • The -n1000 option tells xargs to bundle 1000 keys for each aws s3api delete-objects call.
  • Removing ,Quiet=true or changing it to false will spew out server responses.
  • Note: There's an easily missed _ at the end of that command line. @VladNikiforov posted an excellent explanation of what it's for in the comments, so I'm just going to link to that.

But how do you get file-of-keys?

If you already have your list of keys, good for you. Job complete.

If not, here's one way I guess:

aws s3 ls "s3://MY_BUCKET_NAME/SOME_SUB_DIR" | sed -nre "s|[0-9-]+ [0-9:]+ +[0-9]+ |SOME_SUB_DIR|p" >file-of-keys
antak
  • 669
27

AWS supports bulk deletion of up to 1000 objects per request using the S3 REST API and its various wrappers. This method assumes you know the S3 object keys you want to remove (that is, it's not designed to handle something like a retention policy, files that are over a certain size, etc).

The S3 REST API can specify up to 1000 files to be deleted in a single request, which is much quicker than making individual requests. Remember, each request is an HTTP (thus TCP) request, so each request carries overhead. You just need to know the objects' keys and create an HTTP request (or use a wrapper in your language of choice). AWS provides great information on this feature and its usage. Just choose the method you're most comfortable with!
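
For example, with boto3 this is a single delete_objects call per batch of up to 1000 keys. Just a sketch; the bucket name and key list are placeholders, and real code should check the response for Errors:

import boto3

s3 = boto3.client('s3')

keys_to_delete = ['some/key-1', 'some/key-2']  # placeholder: keys you already know

# delete_objects accepts at most 1000 keys per call, so chunk larger lists
response = s3.delete_objects(
    Bucket='MY_BUCKET_NAME',
    Delete={
        'Objects': [{'Key': k} for k in keys_to_delete],
        'Quiet': True,  # only failures are reported in the response
    },
)
print(response.get('Errors', []))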

I'm assuming your use case involves end users specifying a number of specific files to delete at once, rather than initiating a task such as "purge all objects that refer to picture files" or "purge all files older than a certain date" (which I believe is easy to configure separately in S3).

If so, you'll know the keys that you need to delete. It also means the user will likely want more real-time feedback about whether their files were deleted successfully or not. References to exact keys are supposed to be very quick, since S3 was designed to scale efficiently despite handling an extremely large amount of data.

If not, you can look into asynchronous API calls. You can read a bit about how they'd work in general from this blog post or search for how to do it in the language of your choice. This would allow the deletion request to take up its own thread, and the rest of the code can execute without making a user wait. Or, you could offload the request to a queue. But both of these options needlessly complicate either your code (asynchronous code can be annoying) or your environment (you'd need a service/daemon/container/server to handle the queue), so I'd avoid this scenario if possible.
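
As a rough illustration of the "don't make the user wait" route, here's a sketch that just hands the batch off to a background thread pool (bucket and keys are placeholders; a real queue service would be the more robust version of this):

from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client('s3')
executor = ThreadPoolExecutor(max_workers=4)

def delete_batch(bucket, keys):
    # Runs in a background thread so the request handler can return immediately
    return s3.delete_objects(
        Bucket=bucket,
        Delete={'Objects': [{'Key': k} for k in keys], 'Quiet': True},
    )

# Fire and forget; keep the Future around if you want to surface errors later
future = executor.submit(delete_batch, 'MY_BUCKET_NAME', ['some/key-1', 'some/key-2'])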

Edit: I don't have the reputation to post more than 2 links. But you can see Amazon's comments on request rate and performance here: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html And the S3 FAQ notes that bulk deletion is the way to go if possible.

Ed D'Azzo
  • 416
20

A neat trick is using lifecycle rules to handle the delete for you. You can set up a rule to delete the prefix or the objects that you want, and Amazon will just take care of the deletion.

https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html
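
For instance, a rule that expires everything under a prefix can be set up with boto3 roughly like this (a sketch only; the bucket, prefix and rule ID are placeholders, and S3 applies lifecycle rules in the background rather than immediately):

import boto3

s3 = boto3.client('s3')

# Note: this call replaces any existing lifecycle configuration on the bucket
s3.put_bucket_lifecycle_configuration(
    Bucket='MY_BUCKET_NAME',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'purge-old-uploads',
                'Filter': {'Prefix': 'SOME_SUB_DIR/'},
                'Status': 'Enabled',
                'Expiration': {'Days': 1},  # delete current versions a day after creation
            },
        ],
    },
)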

cam8001
  • 301
  • 2
  • 4
7

One liner batch delete w/ S3API

Here is my one liner based on other responses to this post.

  1. Gets a batch of 1000 S3 object keys (without needing to save them to a file)
  2. Pipes the keys to the delete command
  3. Two delete commands are initiated in parallel for 500 objects each
aws s3api list-objects-v2 --bucket $BUCKET --prefix $PREFIX --output text --query \
'Contents[].[Key]' | grep -v -e "'" | tr '\n' '\0' | xargs -0 -P2 -n500 bash -c \
'aws s3api delete-objects --bucket $BUCKET --delete "Objects=[$(printf "{Key=%q}," "$@")],Quiet=true"' _ 

Why no keys file?

One of the complaints about piping the keys to a file was that errors could happen while deleting from s3. If you had to restart the delete command, you would have a file with lots of keys that are already deleted and you would waste time running the delete commands again.

Why 2 delete commands?

I tried to delete all 1000 objects with one delete command, but I got an error that my argument list was too long (because my keys are long).

ksutton
  • 71
5

I know this post is really old at this point, but if you're having to do this today, the AWS dashboard now has an "Empty" feature on the bucket search page which will perform a bulk delete (1000 at a time) for you:

Image of AWS Dashboard - S3 Empty Button Highlighted

5

I was frustrated by the performance of the web console for this task. I found that the AWS CLI command does this well. For example:

aws s3 rm --recursive s3://my-bucket-name/huge-directory-full-of-files

For a large file hierarchy, this may take a considerable amount of time. You can set this running in a tmux or screen session and check back later.

dannyman
  • 388
  • 5
  • 15
5

The s3 sync command was already mentioned above, but without an example or any word about its --delete option.

I found the fastest way to delete the contents of a folder in the S3 bucket my_bucket was:

aws s3 sync --delete "local-empty-dir/" "s3://my_bucket/path-to-clear"

Hubbitus
  • 311
  • 3
  • 6
3

There is a way in Python that uses boto3's delete_objects, tqdm to give a bit of feedback, and multiple threads to give it a little bit of parallelism (in spite of the GIL, multiple requests to S3 should run at the same time).

This is at https://gist.github.com/michalc/20b79c9028c342ef7c38df8693f8715b but copied here:

# From some light testing deletes at between 2000 to 6000 objects per second, and works best if you
# have objects distributed into folders/CommonPrefixes as specified by the delimiter.

from concurrent.futures import FIRST_EXCEPTION, ThreadPoolExecutor, wait
from dataclasses import dataclass, field
from functools import partial
from typing import Callable, Optional, Tuple
from queue import PriorityQueue

import boto3
from botocore.config import Config

def bulk_delete(
    bucket, prefix,
    workers=4, page_size=1000, delimiter='/',
    get_s3_client=lambda: boto3.client('s3', config=Config(retries={'max_attempts': 16, 'mode': 'adaptive'})),
    on_delete=lambda num: None,
):
    s3 = get_s3_client()
    queue = PriorityQueue()

    @dataclass(order=True)
    class Work:
        # A tuple that determines the priority of the bit of work in "func". This is a sort of
        # "coordinate" in the paginated node tree that prioritises a depth-first search.
        priority: Tuple[Tuple[int,int], ...]
        # The work function itself that fetches a page of Key/CommonPrefixes, or deletes
        func: Optional[Callable[[], None]] = field(compare=False)

    # A sentinel "stop" Work instance with priority chosen to be before all work. So when it's
    # queued the workers will stop at the very next opportunity
    stop = Work(((-1,-1),), None)

    def do_work():
        while (work := queue.get()) is not stop:
            work.func()
            queue.task_done()
            with queue.mutex:
                unfinished_tasks = queue.unfinished_tasks
            if unfinished_tasks == 0:
                for _ in range(0, workers):
                    queue.put(stop)

    def list_iter(prefix):
        return iter(s3.get_paginator('list_objects_v2').paginate(
            Bucket=bucket, Prefix=prefix,
            Delimiter=delimiter, MaxKeys=page_size, PaginationConfig={'PageSize': page_size},
        ))

    def delete_page(page):
        s3.delete_objects(Bucket=bucket, Delete={'Objects': page})
        on_delete(len(page))

    def process_page(page_iter, priority):
        try:
            page = next(page_iter)
        except StopIteration:
            return

        # Deleting a page is done at the same priority as this function. It will often be done
        # straight after this call because this call must have been the highest priority for it to
        # have run, but there could be unfinished nodes earlier in the depth-first search that have
        # since submitted work, and so would be prioritised over the deletion
        if contents := page.get('Contents'):
            delete_priority = priority
            queue.put(Work(
                priority=delete_priority,
                func=partial(delete_page, [{'Key': obj['Key']} for obj in contents]),
            ))

        # Processing child prefixes is done after deletion and in order. Importantly anything each
        # child prefix itself enqueues should be done before the work of any later child prefixes
        # to make it a depth-first search. Appending the index of the child prefix to the priority
        # tuple of this function does this, because the work inside each child prefix will only
        # ever enqueue work at its priority or greater, but always less than the priority of
        # subsequent child prefixes or the next page.
        for prefix_index, obj in enumerate(page.get('CommonPrefixes', [])):
            prefix_priority = priority + ((prefix_index,0),)
            queue.put(Work(
                priority=prefix_priority,
                func=partial(process_page,
                             page_iter=list_iter(obj['Prefix']), priority=prefix_priority)
            ))

        # The next page in pagination for this prefix is processed after delete for this page,
        # after all the child prefixes are processed, and after anything the child prefixes
        # themselves enqueue.
        next_page_priority = priority[:-1] + ((priority[-1][0], priority[-1][1] + 1),)
        queue.put(Work(
            priority=next_page_priority,
            func=partial(process_page, page_iter=page_iter, priority=next_page_priority),
        ))

    with ThreadPoolExecutor(max_workers=workers) as worker_pool:
        # Bootstrap with the first page
        priority = ((0,0),)
        queue.put(Work(
            priority=priority,
            func=partial(process_page, page_iter=list_iter(prefix), priority=priority),
        ))

        try:
            # Run workers, waiting for the first exception, if any, raised by them
            done, _ = wait(
                tuple(worker_pool.submit(do_work) for _ in range(0, workers)),
                return_when=FIRST_EXCEPTION,
            )
        finally:
            # The workers might have all stopped cleanly, or just one since it raised an exception,
            # or none of them have stopped because the above raised an exception (e.g. on
            # KeyboardInterrupt). By putting enough stop sentinels in the queue, we can make sure
            # to eventually stop any remaining workers in any of these cases.
            for _ in range(0, workers):
                queue.put(stop)

        # If an exception was raised by a worker, re-raise it
        if e := next(iter(done)).exception():
            raise e from None

used as:

from tqdm import tqdm

print('Deleting...')
with tqdm(unit=' objects') as pbar:
    bulk_delete(
        bucket='my-bucket',
        prefix='my-prefix/',
        on_delete=lambda num: pbar.update(num),
    )
print('Done')

As the gist suggests, it seems to delete at a rate between 2000 and 6000 objects per second, which from my testing is better than plain delete_objects without multiple threads. But your mileage may vary.

It works by walking the tree of CommonPrefixes (folder-like structures in S3) and paginating through the objects in them, but walking and deleting multiple branches of the tree at the same time. So it should give a performance benefit over plain pagination + delete_objects with no threads if there are a good number of such CommonPrefixes and a good number of objects per CommonPrefix.

1

I found rclone to be pretty fast as it uses the S3 API.

https://rclone.org/

rclone delete --progress --transfers=1000 <rclone_confg>:<s3_bucket_and_prefix>

Shawnzam
  • 111
  • 2
0

I made a Python script for this.

P.S. It nukes every bucket in your S3 account.

import concurrent.futures
import boto3

def purge_bucket(Bucket, S3Client):
    response = S3Client.list_objects_v2(Bucket=Bucket)
    while 'Contents' in response and response['KeyCount'] > 0:
        # Strip each listed object down to just its Key, which is all delete_objects needs
        for key in response['Contents']:
            value = key['Key']
            key.clear()
            key['Key'] = value
        print(f'Deleting {len(response["Contents"])} keys at {Bucket}')
        out = S3Client.delete_objects(
            Bucket=Bucket,
            Delete={'Objects': response['Contents']}
        )
        if 'Errors' in out:
            print(f'Errors at {Bucket}: {out["Errors"]}')
        response = S3Client.list_objects_v2(Bucket=Bucket)
    return Bucket

s3 = boto3.client('s3')
response = s3.list_buckets()
if len(response['Buckets']) > 0:
    with concurrent.futures.ThreadPoolExecutor() as executor:
        runs = []
        for bucket in response['Buckets']:
            bucket = bucket['Name']
            runs.append(executor.submit(purge_bucket, Bucket=bucket, S3Client=s3))
        for run in concurrent.futures.as_completed(runs):
            try:
                # Once a bucket is empty, delete the bucket itself
                end = s3.delete_bucket(Bucket=run.result())
                print(end)
            except Exception as e:
                print(f'{run.result()}: {e}')

0

Without knowing how you're managing the s3 buckets, this may or may not be particularly useful.

The AWS CLI tools have a command called "sync" which can be particularly effective for ensuring S3 has the correct objects. If you, or your users, are managing S3 from a local filesystem, you may be able to save a ton of work determining which objects need to be deleted by using the CLI tools.

http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

Bill B
  • 49