Sysadmin: file and folder synchronisation

Over the years I’ve struggled to keep my folder data synchronised between my various desktops and laptops.

Here I present the tools I’ve tried and what I’ve finally settled on as possibly the ultimate answer to the problem of synchronising files and folders across multiple computers:

* [rsync](#rsync)
* [unison](#unison)
* [WinSCP](#winscp)
* [General Backup tools](#backup)
* [Revision Control Systems](#rcs)
* [Complex setup](#compexsetup)
* [What we want from data synchronisation](#whatwewant)
* [Live Mesh folders](#livemesh)
* [Conclusion](#conclusion)
* [References](#references)

### rsync {#rsync}
I’ve tried [rsync][], which is a great Open Source tool to securely synchronise data one way or, by running it once in each direction, both ways.
It’s very efficient with bandwidth as it only transfers the blocks of data that have actually changed in a file instead of the whole file. It can tunnel traffic across SSH and I’ve got a few [cronjobs][] set up between various servers to back up files daily.
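
To give an idea of why transferring only changed blocks saves bandwidth, here is a minimal Python sketch. It is a simplification, not how `rsync` is actually implemented: it compares blocks at fixed offsets, whereas `rsync` uses a rolling checksum so that it can also match blocks that have shifted position. The function names and the tiny block size are made up for the illustration.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny so the example is easy to follow


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old, new, block_size=BLOCK_SIZE):
    """Return the indexes of blocks in `new` that differ from `old`.

    Only these blocks would need to travel over the network; blocks
    past the end of `old` count as changed.
    """
    old_hashes = block_hashes(old, block_size)
    new_hashes = block_hashes(new, block_size)
    return [
        index for index, digest in enumerate(new_hashes)
        if index >= len(old_hashes) or old_hashes[index] != digest
    ]


old = b"aaaabbbbccccdddd"
new = b"aaaaXbbbccccdddd"
print(changed_blocks(old, new))  # prints [1]
```

Only one block out of four differs here, so only a quarter of the file would need to be sent.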

Its only weaknesses are that:

* Every time it runs, it needs to inspect all files on both sides to determine the
changes, which is quite an expensive operation.
* Setting up synchronisation between multiple copies of the data can be tricky:
you need to sync your computers in pairs multiple times, which quickly becomes
expensive and risky if you have the same copy across multiple computers.
* It doesn’t necessarily detect that files are in use at the time of the sync, which
could corrupt them.
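
The first weakness, having to inspect every file on both sides, can be illustrated with a rough Python sketch. It assumes that comparing file size and modification time is enough to spot a change, which is a common shortcut; the function names are hypothetical and this is not `rsync`’s code.

```python
import os


def snapshot(root):
    """Walk every file under root and record its size and modification time."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            info = os.stat(full_path)
            relative = os.path.relpath(full_path, root)
            state[relative] = (info.st_size, int(info.st_mtime))
    return state


def changed_paths(source_state, target_state):
    """Return paths that are missing from, or different on, the target side."""
    return sorted(
        path for path, meta in source_state.items()
        if target_state.get(path) != meta
    )


def files_to_copy(source, target):
    """Scan both trees in full, then compare them."""
    return changed_paths(snapshot(source), snapshot(target))
```

Even when a single file has changed, `snapshot` still has to visit every file in both trees, so the cost grows with the size of the folder rather than with the size of the change.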

### unison {#unison}
Unison is a folder synchronisation tool whose specific purpose is to address some of the shortcomings of `rsync` when synchronising folders between computers.
It’s also a cross-platform Open Source tool that works on Linux, OS X, Windows, etc.

[Unison][unison] uses an efficient file transfer method similar to that of `rsync` but it is better at detecting conflicts and it will give you a chance to decide which copy you want when a conflict is detected.

The issue, though, is that, like `rsync`, it needs to inspect all files to detect changes, which prevents it from detecting and propagating updates as they happen.

The biggest issue with these synchronisation tools is that they tend to increase the risk of conflict because changes are only detected infrequently.
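
The kind of conflict detection described above for Unison can be sketched in a few lines of Python. The idea is to compare each copy against the state recorded at the last successful sync: if only one side has moved away from that state, the change can be propagated safely, and if both have, it’s a conflict. This is only an illustration under my own assumptions (the `classify` function and the version labels are hypothetical), not Unison’s actual code.

```python
def classify(path, last_sync, left, right):
    """Decide what to do with one file given three views of it.

    Each argument after `path` maps a path to a content fingerprint
    (a hash, for instance); a missing key means the file does not exist.
    """
    base = last_sync.get(path)
    a = left.get(path)
    b = right.get(path)
    if a == b:
        return "in sync"
    if a == base:
        return "copy right to left"   # only the right side changed
    if b == base:
        return "copy left to right"   # only the left side changed
    return "conflict"                 # both sides changed differently


last_sync = {"notes.txt": "v1", "todo.txt": "v1", "plan.txt": "v1"}
left = {"notes.txt": "v2", "todo.txt": "v1", "plan.txt": "v2"}
right = {"notes.txt": "v1", "todo.txt": "v1", "plan.txt": "v3"}

for name in sorted(last_sync):
    print(name, "->", classify(name, last_sync, left, right))
```

The longer the gap between two syncs, the more likely it is that both copies have drifted away from the recorded state, which is exactly why infrequent syncs produce more conflicts.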

### WinSCP {#winscp}
[WinSCP][] is an Open Source Windows GUI FTP utility that also allows you to synchronise folders between a local copy and a remote one on the FTP server.

It has conflict resolution and allows you to decide which copy to keep.

It’s great for what it does and allows you to keep a repository of your data in sync with your local copies. Here again, though, WinSCP needs to go through each file to detect the differences and you need to manually sync each computer against the server, which is cumbersome and time-consuming.

### General Backup tools {#backup}
There are a lot more tools that fall into the category of backup utilities: they all keep a copy of your current data in an archive, on a separate disk or online.
Some are great in that they allow you to access that data on the web (I use the excellent [JungleDisk][] myself) but file synchronisation is not their purpose.

Now for some [Captain Obvious][capobvious] recommendation: remember that file synchronisation is _not a backup plan_: you must have a separate process to keep read-only copies of your important data.
File synchronisation will update and delete files you modify across all your machines, clearly not what you want if you need to be able to recover them!

### Revision Control Systems {#rcs}
[Revision control software][versioncontrol] like [cvs][], [subversion][], [git][], [etc][vcetc] are generally used to keep track of changes of source code files; however, they have also been used successfully to keep multiple copies of the same data in sync.
It’s actually exactly [what I use][TortoiseSVN] for all my source code and associated files: I have a subversion server and I check-out copies of my software project folders on various computers.

After making changes on one computer, I _commit_ the changes back to the server and _update_ these changes on all other computers manually.

While great at keeping track of each version of your files and ideally suited to pure text documents like source code, revision control systems have drawbacks that make them cumbersome for general data synchronisation:

* You need to manually commit and update your local copies against the server.
* Not all of them are well suited to dealing with binary files.
* When they do work with binary files, they often just copy the whole file when it changes, which is wasteful and inefficient.

Revision control systems are great for synchronising source code and configuration files but using them beyond that is rather cumbersome.

### Complex setup {#compexsetup}
All of the above solutions also have a major drawback: getting them to work across the Internet requires complex setup involving firewall configurations, security logins, exchange of public encryption keys in some cases, etc.

All these are workable but don’t make for a friendly setup that gives you peace of mind.

### What we want from data synchronisation {#whatwewant}
I don’t know about you but what I’m looking for in a synchronisation tool is pretty straightforward:

* Point to a folder on one computer and have it synchronise across one or more other computers.
* Detect and update changed files transparently in the background, without my intervention, as the changes happen.
* Be smart about conflict detection and only ask me to make a decision when the case isn’t obvious to resolve.

### Live Mesh folders {#livemesh}
Enter [Microsoft Live Mesh Folders][meshwiki], now in beta and available to the public.
Live Mesh is meant to be Microsoft’s answer to synchronising _information_ (note, I’m not saying _data_ here) across computers, devices and the Internet.
While Live Mesh wants to be something a lot bigger than just synchronising folders, let’s just concentrate on that aspect of it.

Installing Live Mesh is pretty easy: you will need a Windows Live account to log in but once this is done, it’s a small download and a short installation.

Once you’ve added your computer to your “Mesh” and are logged in, you are ready to use Live Mesh:

* You decide how the data is synchronised for each computer participating in your Mesh:
you’re in charge of what gets copied where, so it’s easy to pair large folders between, say, your laptop and work desktop, but not your online Live Desktop (which has a 5GB limit) or your computer at home.
* Files are automatically synchronised as they change across all computers that share the particular folder you’re working in.
If a file is currently in use, it won’t be synced until it is closed.
* If the other computers are not available, the sync will automatically happen as soon as they are up again.
* There is no firewall setup: each computer knows how to contact the others and automatically uses the appropriate network: transfers are local if the computers are on the same LAN or done across the Internet otherwise.
All that without user intervention at all.
* Whenever possible, data is exchanged in a P2P fashion where each device gets data from all the other devices it can see, making transfers quite efficient.
* File transfers are encrypted so they should be pretty safe even when using unsafe public connections.
* If you don’t want to allow sync, say because you’re on a low-bandwidth dial-up connection, you can work offline.
* The Mesh Operating Environment (MOE) is pretty efficient at detecting changes to files. Unlike other systems, in most cases it doesn’t need to scan all files to find out which ones have been updated or deleted.

__Some drawbacks__

* It’s not a final product, so there are some quirks and not all expected functionalities are there yet.
* The Mesh Operating Environment (MOE) services can be pretty resource hungry, although, in fairness, it’s not too bad except that it slows down your computer’s responsiveness while it loads at boot time.
* You can’t define patterns of files to exclude in your folder hierarchy.
That can be a bit annoying if the software you use often creates large backup files automatically (like CorelDraw does) or if there are subfolders you don’t need to take everywhere.
* The initial sync process can take a long time if you have lots of files.
A solution if you have large folders to sync is to copy them first manually on each computer and then force Live Mesh to use these specific folders: the folders will be merged together and the initial sync process will be a lot faster as very little data needs to be exchanged between computers.

Bear in mind that Live Mesh is currently in early beta and that most of these drawbacks will surely be addressed in the coming months.

### Conclusion {#conclusion}
I currently have more than 18GB representing about 20,000 files synchronised between 3 computers (work desktop, laptop and home desktop) using Live Mesh.

While not 100% there, Live Mesh Folder synchronisation is really close to the real thing: it’s transparent, efficient, easy to use and it just works as you would expect.

Now that Microsoft has released the [Sync Framework][syncf] to developers, I’m sure that other, more capable data synchronisation products will come on the market.
In the meantime, Live Mesh has answered my needs so far.

### References {#references}
* [Wikipedia reference on file synchronisation][filesync]
* [rsync][] home page and [Wikipedia entry][rsync2]
* [unison][] home page and [Wikipedia entry][unison2]
* [JungleDisk][] online backup (cheap and very configurable, uses [Amazon S3][S3] for storage)
* Microsoft [Live Mesh][mesh] web site and [Live Mesh Wikipedia entry][meshwiki]

[rsync]:http://samba.anu.edu.au/rsync/
[rsync2]:http://en.wikipedia.org/wiki/Rsync
[filesync]:http://en.wikipedia.org/wiki/File_synchronization
[unison]:http://www.cis.upenn.edu/~bcpierce/unison/
[unison2]:http://en.wikipedia.org/wiki/Unison_(file_synchronizer)
[JungleDisk]:http://www.jungledisk.com/
[S3]:http://aws.amazon.com/s3/
[WinSCP]:http://winscp.net/eng/
[cronjobs]:http://en.wikipedia.org/wiki/Cronjob
[capobvious]:http://uncyclopedia.wikia.com/wiki/Captain_Obvious
[mesh]:http://mesh.com
[meshwiki]:http://en.wikipedia.org/wiki/Windows_Live_Core
[versioncontrol]:http://en.wikipedia.org/wiki/Revision_control
[git]:http://en.wikipedia.org/wiki/Git_(software)
[cvs]:http://savannah.nongnu.org/project/memberlist.php?detailed=1&group=cvs
[subversion]:http://subversion.tigris.org/
[TortoiseSVN]:http://tortoisesvn.tigris.org/
[vcetc]:http://en.wikipedia.org/wiki/Comparison_of_revision_control_software
[syncf]:http://msdn.microsoft.com/en-us/sync/default.aspx

Comments

Dave commented 16 years ago

I’ve been looking at all these too and would love to use live mesh (or live sync, sugar sync or syncplicity for that matter) but none of them seem to deal with network drives. If I must I guess I could use two pieces of software, but that scares me a bit. Any ideas?

Renaud (author) commented 16 years ago

Hi Dave, thank you for dropping by. You’re right, the sync functionality in Mesh (or other frameworks) doesn’t work well on shared drives.

The issue with shared network locations is that, to my knowledge, it’s not possible to monitor file changes in the folder structure without going through a full enumeration of the content. On local drives, these sync tools use the capabilities of the OS and the filesystem to get notifications of file changes, which makes them very efficient. Full enumeration over the network would be tremendously costly.

The only solutions I see are:

* a way for the remote storage to notify network subscribers of changes in its structure;
* installing the sync tool on the file server itself, but that’s not really an option for network storage hardware devices.

I’m sure these issues will eventually be solved, and as the Sync Framework matures we will get background services for Linux, Mac, etc. appearing. It will probably take a little while but I’m sure it’ll be there eventually.

Jay Levitt commented 14 years ago

There’s also a project called “lsyncd” which uses inotify to watch for changes, and feeds that into rsync. Haven’t tried it yet, but I’m about to: http://code.google.com/p/lsyncd
