Rclone
Rclone is robust software for distributing files between A and B. The supported operations are move, copy, and sync, and A and B can each be any of roughly 80 different protocols.
It is high-quality software with a well-deserved, huge user base, written and still maintained by Nick Craig-Wood. I can easily recommend it as a component wherever files need to go in or out between systems or zones.
What Rclone is not (as of 2022-07-05)
- It does not contain an implementation of the rsync algorithm (https://rsync.samba.org/). Instead, each protocol's own features are used to implement similar functionality such as checksums and resume. This is transparent to the user, but check the Rclone documentation for your specific protocol to see whether checksums etc. are implemented.
- It doesn't emit an event for each file. Instead, its metrics or log file can be used for this, or you can easily add your own custom event with 1-2 lines of Go; see below and my post "use Rclone programmatically".
- It doesn't do "pure" streaming. It can read from a pipe, but the whole file is read into memory behind the scenes.
- It doesn't (didn't ..) contain a directory watch function. Instead, a cron job can be used to repeatedly call an operation such as copy or move. If the previous operation might not have finished before the next cron run, a parameter that allocates a specific port can be used: if the port is already taken by an earlier rclone job, the new command fails immediately.
- Strangely enough, it is missing the WebHDFS protocol. It is the only protocol I found missing.
- It is not totally safe against misuse, i.e. a failed file transfer will be cleaned up nicely, BUT the file will be visible on the destination side for a brief moment.
This video catches the moment Rclone finds out that the source file is still growing (perhaps from an ongoing FTP put operation into your server's filesystem), correctly aborts the transfer, and cleans away the destination file. Note that Rclone has multiple ways of mitigating this default behavior. One option is to wait a minimum time before transferring: the option "--min-age 10m" gives a 10-minute margin before the file is read. But let's pretend we didn't know this!
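The cron overlap guard mentioned in the list above relies on a simple fact: only one process can listen on a given address and port at a time. The sketch below shows that idea in plain Go, without calling Rclone at all. The names acquirePortLock and ErrAlreadyRunning and the port number are my own assumptions for illustration, not part of Rclone.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// ErrAlreadyRunning signals that the lock port could not be bound.
// Hypothetical name, used only in this sketch.
var ErrAlreadyRunning = errors.New("lock port is not available")

// acquirePortLock tries to listen on addr. The operating system allows
// only one listener per address, so a second job started while the
// first one still holds the listener fails immediately. Note that any
// bind failure (not only "address in use") is reported as
// ErrAlreadyRunning here.
func acquirePortLock(addr string) (net.Listener, error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, fmt.Errorf("%w: %v", ErrAlreadyRunning, err)
	}
	return ln, nil
}

func main() {
	// Assumed port, pick any free local port for your own jobs.
	const lockAddr = "127.0.0.1:45572"

	lock, err := acquirePortLock(lockAddr)
	if err != nil {
		fmt.Println("skipping this run:", err)
		return
	}
	// The port is released when the process exits or Close is called.
	defer lock.Close()

	fmt.Println("lock acquired, the transfer would run here")
}
```

A nice property of this kind of lock is that it cannot go stale: if the job crashes, the operating system releases the port, so the next cron run proceeds without any cleanup.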
Programmatic usage
The minimal example below uses Rclone as a library and lists a directory programmatically.
package main

import (
	"context"
	"fmt"
	"log"

	// Registers all backends (including hdfs) with the library.
	_ "github.com/rclone/rclone/backend/all"
	"github.com/rclone/rclone/fs"
	"github.com/rclone/rclone/fs/config/configfile"
)

func main() {
	ctx := context.Background()

	// Load the default rclone.conf so that named remotes can be resolved.
	configfile.Install()

	// Open the remote named "hadoop" from the config file.
	fsource, err := fs.NewFs(ctx, "hadoop:")
	if err != nil {
		log.Fatal(err)
	}

	// List the files and directories in /data on the remote.
	entries, err := fsource.List(ctx, "/data")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(entries)
}
My config file is located under /home/rickard/.config/rclone/rclone.conf
[hadoop]
type = hdfs
namenode = 10.1.1.190:8020
username = rickard
It is easy to see that the above example could be extended to mitigate the "flickering file" problem by placing an empty marker file or a checksum file next to each successfully transferred, ready file. The destination side should monitor these marker files, verify the checksum, and then grab the file. See my upcoming post "use Rclone programmatically".