As businesses become increasingly reliant on cloud-based storage solutions, it is imperative that they have the right tools and techniques for effective management of their big data. In previous posts (e.g., here and here) we have explored several different methods for retrieving data from cloud storage and demonstrated their effectiveness on different types of tasks. We found that the optimal tool can vary based on the specific task at hand (e.g., file format, size of the data files, data access pattern) and the metrics we wish to optimize (e.g., latency, speed, or cost). In this post, we explore another popular tool for cloud-based storage management, sometimes referred to as "the Swiss army knife of cloud storage" – the rclone command-line utility. Supporting more than 70 storage service providers, rclone offers functionality similar to vendor-specific storage management applications such as the AWS CLI (for Amazon S3) and gsutil (for Google Storage). But does it perform well enough to be a viable alternative? Are there situations in which rclone would be the tool of choice? In the following sections we will demonstrate the use of rclone, evaluate its performance, and highlight its value in a particular use case: transferring data between different object storage systems.
Disclaimers
This post is not intended, in any way, to replace the official rclone documentation. Nor is it an endorsement of rclone or any of the other tools we mention. The best choice for your cloud-based data management will greatly depend on the details of your project and should be made after extensive, use-case-specific testing. Be sure to reevaluate the statements we make against the most up-to-date tools available at the time you read this.
The following command line uses rclone sync to synchronize the contents of a cloud-based object storage path with a local directory. This example demonstrates the use of the Amazon S3 cloud storage service, but a different cloud storage service could just as easily have been used.
rclone sync -P \
--transfers 4 \
--multi-thread-streams 4 \
S3store:my-bucket/my_files ./my_files
The rclone command has dozens of flags for programming its behavior. The -P flag outputs the progress of the data transfer, including the transfer rate and total time. In the command above we included two (of many) controls that can affect rclone's runtime performance: the transfers flag determines the maximum number of files to download concurrently, and multi-thread-streams determines the maximum number of threads to use when transferring a single file. Here we have left both at their default values (4).
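Rclone exposes many more performance-related flags. Purely as an illustration (the values below are arbitrary and should be chosen based on your own experiments), the same sync could be run with higher parallelism and a lower file-size threshold for multi-threaded downloads:

rclone sync -P \
--transfers 32 \
--multi-thread-streams 8 \
--multi-thread-cutoff 64Mi \
S3store:my-bucket/my_files ./my_files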
Rclone's functionality relies on the proper definition of the rclone configuration file. Below we demonstrate the definition of the S3store remote object storage location used in the command line above.
[S3store]
type = s3
provider = AWS
access_key_id = <id>
secret_access_key = <key>
region = us-east-1
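The configuration file need not be edited by hand. Rclone's built-in configuration commands (all standard rclone subcommands) can create and inspect it, for example:

# launch the interactive configuration wizard
rclone config
# print the location of the configuration file in use
rclone config file
# list the remotes that are currently defined
rclone listremotes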
Now that we’ve seen rclone in action, the question that arises is whether it provides any value over other cloud storage management tools out there, such as the popular AWS CLI. In the next two sections we will evaluate the performance of rclone compared to some of its alternatives in two scenarios that we have explored in detail in our previous posts: 1) downloading a 2 GB file and 2) downloading hundreds of 1 MB files.
Use case 1: Download a large file
The following command line uses the AWS CLI to download a 2 GB file from Amazon S3. This is just one of the many methods we evaluated in a previous post. We use the Linux time command to measure performance.
time aws s3 cp s3://my-bucket/2GB.bin .
The reported download time was approximately 26 seconds (i.e., ~79 MB/s). Keep in mind that this value was measured on our own local PC and can vary greatly from one runtime environment to another. The equivalent rclone sync command appears below:
rclone sync -P S3store:my-bucket/2GB.bin .
In our setup, we found rclone's download time to be more than twice that of the standard AWS CLI. It is very likely that this could be improved significantly by appropriately tuning rclone's control flags.
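For example, one direction worth exploring (shown here purely as an illustration, with arbitrary values) is lowering the threshold at which rclone switches to multi-threaded downloads and increasing the number of streams per file:

time rclone copy -P \
--multi-thread-cutoff 64Mi \
--multi-thread-streams 16 \
S3store:my-bucket/2GB.bin .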
Use case 2: Download a large number of small files
In this use case we evaluate the runtime performance of downloading 800 relatively small files of 1 MB each. In a previous blog post, we discussed this use case in the context of streaming data samples to a deep learning training workload and demonstrated the superior performance of s5cmd's beast mode. In beast mode we create a file containing a list of object file operations which s5cmd executes using multiple parallel workers (256 by default). The s5cmd beast mode option is shown below:
time s5cmd --run cmds.txt
The cmds.txt file contains a list of 800 lines of the form:
cp s3://my-bucket/small_files/<i>.jpg <local_path>/<i>.jpg
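A hypothetical way to generate such a file (assuming the objects are named 0.jpg through 799.jpg and /local/small_files is the destination directory) is a simple shell loop:

# write one s5cmd cp operation per file to cmds.txt
for i in $(seq 0 799); do
echo "cp s3://my-bucket/small_files/${i}.jpg /local/small_files/${i}.jpg"
done > cmds.txt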
The s5cmd command took an average time of 9.3 seconds (average of ten tests).
Rclone supports functionality similar to s5cmd's beast mode via the --files-from command-line option. Below we run rclone copy on our 800 files with the transfers value set to 256 to match the default concurrency setting of s5cmd.
rclone copy -P --transfers 256 --files-from files.txt S3store:my-bucket /my-local
The files.txt file contains 800 lines of the form:
small_files/<i>.jpg
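Once again, a simple (hypothetical) shell loop can generate the list, under the same naming assumption as above:

for i in $(seq 0 799); do echo "small_files/${i}.jpg"; done > files.txt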
The rclone copy of our 800 files took an average of 8.5 seconds, slightly faster than s5cmd (averaged over ten trials).
We recognize that the results demonstrated so far may not be enough to convince you to prefer rclone over your existing tools. In the next section we will describe a use case that highlights one of the potential advantages of rclone.
Nowadays it is not uncommon for development teams to maintain their data in more than one object store. The motivation might be the need to protect against the possibility of a storage failure, or a decision to use data processing offerings from multiple cloud service providers. For example, your AI development solution might depend on training your models on AWS using data in Amazon S3 and running data analytics on Microsoft Azure using the same data stored in Azure Storage. Additionally, you may want to keep a copy of your data in a local storage infrastructure such as FlashBlade, Cloudian, or VAST. These circumstances require the ability to transfer and synchronize your data between multiple object stores in a secure, reliable, and timely manner.
Some cloud service providers offer dedicated services for such purposes. However, these do not always address the precise needs of your project, or they may not allow the level of control you desire. For example, Google Storage Transfer excels at the speedy migration of all of the data within a specified storage folder, but does not (at the time of this writing) support transferring a specific subset of files from it.
Another option might be to apply our existing data management tools to this purpose. The problem with this is that tools such as the AWS CLI and s5cmd do not (at the time of this writing) support specifying different access settings and security credentials for the source and destination storage systems. Migrating data between storage locations therefore requires transferring it through an intermediate (temporary) location. In the command below we combine the use of s5cmd and the AWS CLI to copy a file from Amazon S3 to Google Storage via system memory, using Linux piping:
s5cmd cat s3://my-bucket/file \
| aws s3 cp --endpoint-url https://storage.googleapis.com \
--profile gcp - s3://gs-bucket/file
While this is a legitimate, if clumsy, way to transfer a single file, in practice we may need the ability to transfer many millions of files. To support this, we would need to add an additional layer to spawn and manage multiple parallel worker processes. Things could get ugly pretty quickly.
Data transfer with Rclone
Unlike tools like AWS CLI and s5cmd, rclone allows us to specify different access configurations for the source and destination. In the following rclone configuration file we add configurations for access to Google Cloud Storage:
[S3store]
type = s3
provider = AWS
access_key_id = <id>
secret_access_key = <key>

[GSstore]
type = google cloud storage
provider = GCS
access_key_id = <id>
secret_access_key = <key>
endpoint = https://storage.googleapis.com
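With both remotes defined, one way to sanity-check access to the new remote before moving any data is with rclone's standard listing subcommands (the bucket name here is from our example):

# list the buckets visible through the GSstore remote
rclone lsd GSstore:
# list the objects at the top level of the target bucket
rclone ls GSstore:gs-bucket --max-depth 1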
Transferring a single file between storage systems has the same format as copying it to a local directory:
rclone copy -P S3store:my-bucket/file GSstore:gs-bucket/file
However, the real power of rclone comes from combining this feature with the --files-from option described above. Rather than having to orchestrate a custom solution for parallelizing the data migration, we can transfer a long list of files using a single command:
rclone copy -P --transfers 256 --files-from files.txt \
S3store:my-bucket GSstore:gs-bucket
In practice, we can further accelerate the data migration by partitioning the list of object files into smaller lists (e.g., of 10,000 files each) and running each list on a separate compute resource. While the precise impact of this kind of solution will vary from project to project, it can provide a significant boost to the speed and efficiency of your development. A minimal sketch of the idea appears below.
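The sketch below (assuming the files.txt from above) runs the chunks as parallel processes on a single machine; in practice each chunk could instead be dispatched to a separate compute resource:

# split the object list into chunks of 10,000 entries (chunk_aa, chunk_ab, ...)
split -l 10000 files.txt chunk_
# run one rclone process per chunk in the background
for f in chunk_*; do
rclone copy --transfers 256 --files-from "$f" \
S3store:my-bucket GSstore:gs-bucket &
done
wait # block until all transfers complete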
In this post, we explored cloud-based storage management using rclone and demonstrated its application to the challenge of maintaining and synchronizing data across multiple storage systems. There are undoubtedly many alternative solutions for data transfer, but there is no doubting the convenience and elegance of the rclone-based method.
This is just one of many posts we’ve written on the topic of maximizing the efficiency of cloud-based storage solutions. Be sure to check out some of our other publications on this important topic.