Google Drive & rclone on Agave
Introduction
The University Technology Office provides all staff, faculty, and students with cloud storage via Google Drive. Google Drive is not an ideal home for research data because its storage limits are tight, but some research data may live there anyway. This page documents how to configure a sophisticated command-line tool, rclone, to transfer data off of Google Drive.
Using rclone, archiving or moving files and data from Google Drive to a different cloud or storage medium becomes a relatively straightforward task. However, rclone is somewhat complex to configure for Google Drive the first time, and that initial configuration is mostly what this page documents.
Configuring rclone
Once the shared drive is initialized, it may be accessed from the supercomputer. It is highly recommended that these steps be followed from a Virtual Desktop session in our web portal.
module load rclone/1.58.1
rclone config
This starts a multi-step interactive configuration process that generates $HOME/.config/rclone/rclone.conf. These steps are well documented by the rclone devs, but they are shown here as well for the configuration of a shared drive named rc-drive. For reference, the prompts are covered one section at a time.
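Before answering the prompts, it can help to confirm that the module loaded and to see where rclone will write its configuration. This is a minimal sketch, assuming the rclone/1.58.1 module shown above; the guard only keeps the snippet safe to paste when rclone is not yet on the PATH:

```shell
# Default location rclone uses for its configuration file.
conf="${HOME}/.config/rclone/rclone.conf"

if command -v rclone >/dev/null 2>&1; then
    rclone version        # confirm which rclone is on the PATH
    rclone config file    # print the path of the config file rclone will use
else
    echo "rclone not found; run 'module load rclone/1.58.1' first" >&2
fi

# Report whether a configuration already exists at the default path.
if [ -f "$conf" ]; then
    echo "existing config: $conf"
else
    echo "no config yet: $conf"
fi
```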
Some users may wish to use a virtual desktop session as provided by our webapp, as one of the steps (prompt 10) will provide an authentication link and may attempt to open a browser window. This link may also be opened locally on the user’s computer. The guide assumes that the user is in a virtual desktop session.
Creating a Client ID and Secret through the Google API Console
The Green Info Box.
For those that are ready for production, rerun rclone config and edit the previous remote’s client_id and client_secret (Prompts 4 and 5). The steps below must be followed first (taken from the rclone Google Drive docs). Note that a single client ID and secret pair is enough for all shared drives.
Here is how to create your own Google Drive client ID for rclone:
Log into the Google API Console with your Google account. It doesn’t matter what Google account you use. (It need not be the same account as the Google Drive you want to access)
Select a project or create a new project.
Under “ENABLE APIS AND SERVICES” search for “Drive”, and enable the “Google Drive API”.
Click “Credentials” in the left-side panel (not “Create credentials”, which opens the wizard), then “Create credentials”, then “OAuth client ID”. It will prompt you to set the OAuth consent screen product name, if you haven’t set one already.
Choose an application type of “Desktop App”, and click “Create”. (the default name is fine)
It will show you a client ID and client secret. Use these values in rclone config to add a new remote or edit an existing remote.
- 1 Creating a Client ID and Secret through the Google API Console
- 2 Prompt 1 – New Remote – Response: n
- 3 Prompt 2 – Name Remote – Response: <your-project-name>
- 4 Prompt 3 – Choose Cloud – Response: drive
- 5 Prompt 4 – Client ID – Response: Client ID from the Green Info Box or None
- 6 Prompt 5 – Client Secret – Response: Client Secret from the Green Info Box or None
- 7 Prompt 6 – Remote Permissions (Scopes) – Response: 1
- 8 Prompt 7 – Root Folder ID – Response: None
- 9 Prompt 8 – Service Account File – Response: None
- 10 Prompt 9 – Enter Advanced Config – Response: n
- 11 Prompt 10 – Auto Config – Response: n
- 12 Prompt 11 – Configure Shared Drive – Response: y
- 13 Prompt 12 – Choose Shared Drive – Response: <integer_associated_with_shared_drive>
- 14 Prompt 13 – Summary – Response: y
- 15 Prompt 14 – Finished – Response: q
- 16 Post Prompt
Prompt 1 – New Remote – Response: n
The first prompt looks like this:
No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q>
We respond with n, as the rclone configuration does not yet exist for our shared drive.
Prompt 2 – Name Remote – Response: <your-project-name>
The second prompt:
name>
This is arbitrary, but it’s wise to use the shared drive’s name. In this case, rc-drive.
Prompt 3 – Choose Cloud – Response: drive
The third prompt lists all the available cloud backends. Currently there are 46 enumerated options, but for simplicity, only the relevant option (# 17) is shown (N.B. in the future the number associated with Google Drive may change):
Either drive or 17 is a valid response.
Prompt 4 – Client ID – Response: Client ID from the Green Info Box or None
This prompt and the next involve creating an application on Google Cloud, and may lead to improved throughput. Note that the steps for this are documented above in the green Info Box and also in the video documentation.
So the response here is to paste in the Client ID.
Prompt 5 – Client Secret – Response: Client Secret from the Green Info Box or None
This prompt and the previous involve creating an application on Google Cloud, and may lead to improved throughput. Note that the steps for this are documented above in the green Info Box and also in the video documentation.
So the response here is to paste in the Client Secret.
Prompt 6 – Remote Permissions (Scopes) – Response: 1
This is another long prompt. rclone needs full access to all files, so the first scope is chosen. The remote is restricted to our shared drive later, in Prompts 11 and 12, as confirmed in the Prompt 13 summary. Scopes are described in the rclone Google Drive docs and defined by Google in its API documentation.
Either drive or 1 is a sufficient response.
Prompt 7 – Root Folder ID – Response: None
We leave this next prompt’s response blank, as we want to work from the root of our shared drive.
We accept the default empty string (just hit the return key).
Prompt 8 – Service Account File – Response: None
This is another advanced Google Cloud application feature, which we ignore. Service accounts may be used to automate certain tasks for users.
We accept the default empty string (just hit the return key).
Prompt 9 – Enter Advanced Config – Response: n
The basic configuration is done at this point, and rclone offers to continue with advanced options. The advanced configuration is optional and not recommended.
n is the recommended response.
Prompt 10 – Auto Config – Response: n
We now inform rclone that we are on the supercomputer. Responding y here will lead to a remote browser session, which is not an issue if using a virtual desktop session as provided by our webapp. With ssh, n is recommended and assumed instead.
We respond with n.
Prompt 11 – Configure Shared Drive – Response: y
This prompt configures rclone to only associate with a shared drive. Note that rclone refers to the shared drive as a team drive.
y is the recommended response.
Prompt 12 – Choose Shared Drive – Response: <integer_associated_with_shared_drive>
Assuming y was passed to Prompt 11, rclone retrieves a list of shared drives (referred to as team drives) from the user’s Google Drive.
To choose the shared drive noted in the example, either the integer associated with the shared drive or its alphanumeric ID may be given as the response, that is, 1 or 0XXXXXXXXXXXXXXXXXX.
Prompt 13 – Summary – Response: y
rclone then summarizes the configuration and asks for confirmation. The shared drive is specified by its alphanumeric ID from Prompt 12, and the access tokens are saved in the JSON structure token.
y confirms the summary.
Prompt 14 – Finished – Response: q
The configuration is complete, and rclone loops back to the first prompt but with an existing configuration.
Assuming there are no more remotes to configure, q may be passed to the prompt.
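After quitting, the new remote can be checked from the shell. This is a minimal sketch, assuming the remote was named rc-drive at Prompt 2 as in this guide's example:

```shell
remote="rc-drive"   # remote name from this guide's example (Prompt 2)

if command -v rclone >/dev/null 2>&1; then
    rclone listremotes             # one configured remote per line, e.g. "rc-drive:"
    rclone config show "$remote"   # stored settings for this remote; may include secrets
else
    echo "rclone not found; run 'module load rclone/1.58.1' first" >&2
fi
```

Because the second command can print the client secret and access token, its output is best kept out of shared logs and screenshots.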
Post Prompt
Recalling that this was all done within an interactive session, the session should now be closed.
From here, rclone should be fully configured to interact with a previously created shared drive for a research project, and may now be used within new interactive sessions or, preferably, sbatch job submissions. When first learning how to use rclone, interactive sessions are prudent, but once the archiving part of the workflow is figured out, sbatch is highly recommended.
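As a rough sketch of the sbatch route, the snippet below writes a small batch script and leaves submission to the user. The script name, job name, time limit, and source directory are placeholders chosen for illustration. Any partition or QOS options required on Agave are not shown and would need to be added:

```shell
# Write a hypothetical batch script; every value below is a placeholder.
cat > rclone_backup.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=rclone-backup
#SBATCH --ntasks=1
#SBATCH --time=0-04:00:00
#SBATCH --output=rclone-backup.%j.out

module load rclone/1.58.1

# Copy a local project directory into the "backup" directory of the remote.
rclone copy "${HOME}/research-project" rc-drive:backup
EOF

# Check the script for syntax errors without running it.
bash -n rclone_backup.sh && echo "syntax ok"

# Submit once the script has been reviewed:
# sbatch rclone_backup.sh
```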
Be very careful with rclone from here on out. Verify that rclone remotes only have access to the shared drives created for their respective research projects. Do not use rclone subcommands (e.g. lsd, ls, copy, purge, or delete) without first knowing their downstream effects. Test before applying, as any lost files may be unrecoverable!
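One way to act on the verification advice above is to read the shared drive ID straight from the configuration file. This is a minimal sketch, assuming the file uses the "key = value" layout written by rclone config. The function name team_drive_of is a hypothetical helper, not part of rclone:

```shell
# Print the team_drive value stored for one remote.
#   $1 = remote name (the section name in rclone.conf)
#   $2 = config file (optional; defaults to the usual rclone path)
team_drive_of() {
    awk -v section="[$1]" '
        $0 == section                    { in_section = 1; next }  # start of the wanted section
        /^\[/                            { in_section = 0 }        # any other section header ends it
        in_section && $1 == "team_drive" { print $3 }              # line reads: team_drive = <id>
    ' "${2:-$HOME/.config/rclone/rclone.conf}"
}
```

Calling team_drive_of rc-drive should print the same alphanumeric ID that was chosen at Prompt 12. An empty result suggests the remote is not restricted to a shared drive.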
Using rclone
When first getting acquainted with rclone, it may be prudent to work in an interactive session.
Assuming a remote named project-name, we note the following commands:
List directories in the top level of your project-name drive
List all the files in your drive
To copy a research project directory to a drive directory called backup
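The three tasks above map onto the rclone subcommands lsd, ls, and copy. The following is a sketch rather than the exact commands used on Agave: the source directory is a hypothetical path, and the dry run is added here as a precaution:

```shell
remote="project-name"              # remote name assumed in this section
src="${HOME}/research-project"     # hypothetical local project directory
dest="${remote}:backup"            # "backup" directory on the drive

if command -v rclone >/dev/null 2>&1; then
    rclone lsd "${remote}:"                 # directories at the top level of the drive
    rclone ls "${remote}:"                  # all files, recursively, with sizes in bytes
    rclone copy --dry-run "$src" "$dest"    # preview: reports what would be copied
    rclone copy --progress "$src" "$dest"   # the real transfer, with live progress
else
    echo "rclone not found; run 'module load rclone/1.58.1' first" >&2
fi
```

In practice the dry-run output would be reviewed before the final line is run. Unlike purge or delete, rclone copy does not remove files at the destination.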
To improve the speed of an rclone copy, keep in mind points (2) and (3) from the Introduction: fewer large files will always transfer faster than a large number of small files. The following flags achieved 800 MBps on fn1 (and hit Google Drive's 750 GB daily upload limit within 10 minutes):
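Independent of any transfer flags, the point about file counts can be addressed before rclone runs, by bundling many small files into one archive. This is a minimal sketch using tar. The directory and file names are created here only for illustration:

```shell
# Hypothetical project directory with a few small files (illustration only).
workdir="$(mktemp -d)"
mkdir -p "${workdir}/research-project"
for i in 1 2 3; do
    echo "sample ${i}" > "${workdir}/research-project/file_${i}.txt"
done

# Bundle the directory into one compressed archive so that a single
# large file is transferred instead of many small ones.
tar -czf "${workdir}/research-project.tar.gz" -C "$workdir" research-project

# List the archive contents to confirm nothing was missed.
tar -tzf "${workdir}/research-project.tar.gz"
```

The resulting archive can then be given to rclone copy in place of the original directory.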
Using rclone with Dropbox
Once Google Drive is configured, please follow this video documentation on how to configure rclone for use with your ASU Dropbox. rclone enables cloud-to-cloud transfers without storing anything intermediate on Agave.
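A cloud-to-cloud transfer uses the same copy subcommand with a remote on both sides. This is a sketch only: asu-dropbox is a hypothetical name for a Dropbox remote, and the command is shown as a dry run so nothing is transferred:

```shell
src_remote="project-name"     # Google Drive remote configured above
dst_remote="asu-dropbox"      # hypothetical name for a Dropbox remote

src="${src_remote}:backup"
dst="${dst_remote}:backup"

if command -v rclone >/dev/null 2>&1; then
    rclone copy --dry-run "$src" "$dst"   # preview the drive-to-Dropbox copy
else
    echo "rclone not found; run 'module load rclone/1.58.1' first" >&2
fi
```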