When it works, PIM data synchronization is like magic: neat effects, apparently effortless… and the magician performing the act doesn’t reveal his secret. One unspoken “secret” of the industry specializing in this particular kind of computer magic is that there are situations where it all breaks down and the user’s data gets mangled or lost.
I believe that this is something that users should (and will want to) know about. It’s going to be long and somewhat technical, but I have trust in you, dear reader of this blog and (perhaps) user of SyncEvolution! Therefore I’m going to explain some of the tricky situations that can arise during data synchronization, how two SyncML servers (ScheduleWorld and Funambol) handle them, and what you as the user can do to avoid problems. Don’t worry, it’s not difficult. Just bring your towel and don’t panic…
Please let me know what you think about this kind of blog entry: too long and/or difficult, or useful? Writing all this up took quite a bit of time. If it doesn’t interest anyone, then I’d better write more code and less prose in the future.
Merging Items: Need More Data
Suppose someone tells you that John Doe has home phone number 12345. Later someone else tells you John has the home phone number 67890. Which phone number do you write down in your address book for John? Are both persons even talking about the same John Doe? There are multiple solutions, but only one of them is right for this conflict:
- The first person was right: John Doe has 12345. 67890 is an old, now invalid number.
- The second person was right: John Doe currently has 67890.
- John really has two different home phone numbers.
- There are two persons called “John Doe” (a popular name…), so two different address book entries are needed.
This is the same situation sometimes faced by SyncML servers, and just like humans, they cannot decide what the right solution is without more information. A human might ask the person he got the information from how old the information is. A SyncML server cannot do that because the protocol provides neither such a query nor time stamps. A human might have additional information (“John now lives in town ABC, the town has area code XYZ, so the second number is right.”) but this is highly specific to the data being handled and therefore won’t solve the problem in general. A SyncML server also cannot ask its user (who might be able to decide the issue) because again, the protocol and clients don’t support such an interaction.
SyncML servers run into this problem each time they do a so-called “slow” synchronization: in this mode the client sends all its data to the server, the server has to decide what to do with it and then tells the client how to update its own data. A slow synchronization with merging usually occurs in two situations:
- A client is synchronized with the server for the first time and both client and server already contain data. An example is when a user manually added contacts on his phone and on his computer. Then he starts using a SyncML server for data synchronization and first synchronizes the computer (no data on the server yet, so all is fine), then the phone (server contains data from the computer, now needs to merge with data on the phone).
- A synchronization fails, perhaps because the connection got lost. Neither client nor server knows which state its peer is in, so usually they recover by falling back to a slow synchronization, as if the client had never connected to the server.
In both cases the server has to merge potentially different items: either they are really different (e.g., old vs. new phone number) or they might have been stored differently (e.g., +49-228-12345 vs. +4922812345; “John Doe” vs. “Doe, John”). Because of this different storing of items, a slow sync can be problematic even when client and server have the same data.
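The following minimal sketch illustrates why identical data can still look different during a slow sync. The helper names `normalize_phone` and `normalize_name` are hypothetical, and the normalization rules are deliberately simplistic assumptions; real servers may compare items quite differently.

```python
import re

def normalize_phone(number: str) -> str:
    """Keep a leading '+' and the digits, dropping separators like '-' or ' '."""
    number = number.strip()
    prefix = "+" if number.startswith("+") else ""
    return prefix + re.sub(r"\D", "", number)

def normalize_name(name: str) -> str:
    """Turn 'Last, First' into 'First Last', collapse whitespace, lowercase."""
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    return " ".join(name.split()).lower()

# The same contact as stored by two different clients.
a = {"name": "John Doe", "phone": "+49-228-12345"}
b = {"name": "Doe, John", "phone": "+4922812345"}

# A naive comparison sees two different items and would create a duplicate.
naive_equal = a == b

# Comparing normalized values recognizes them as the same contact.
normalized_equal = (
    normalize_name(a["name"]) == normalize_name(b["name"])
    and normalize_phone(a["phone"]) == normalize_phone(b["phone"])
)
```

Even this tiny example needs format-specific knowledge (name order, phone separators), which is why the quality of a slow sync depends so much on how well the server understands the data.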
After the initial slow synchronization clients are in sync with the server and from then on, only modified items are exchanged. A conflict can still occur if the user modifies the same item on two different clients and then synchronizes them with the server. Synchronize before modifying an item that was modified elsewhere, otherwise the conflict results in the same problems as a conflict during slow synchronization.
Dumb vs. Smart, or Limited vs. Capable
Many mobile devices only have limited amounts of memory or (more likely nowadays, with hardware becoming cheaper) limited software. Therefore they often only store a subset of the information that full-blown desktop applications store. For example, pictures for a contact might not be stored. Suppose a contact with many different attributes is sent to the server, from there to such a limited device, modified on the device and from there sent back to the server.
What the server gets is a contact where certain attributes were removed. The server is not told whether it was the user who removed them (for example, removing an obsolete phone number) or whether the device was incapable of storing them. What is the poor server supposed to do?
- Some servers simply replace their own data completely with what is sent by the device; anything that the device couldn’t store is lost and will also be removed from all other devices. Clearly not desirable…
- Therefore more intelligent servers merge their own data with the data sent by clients. With such a server it might not be possible to remove certain attributes even if that is desired, because the attribute will be preserved by the server.
- The most intelligent servers know (= hard-coded) or detect (= by analyzing information sent by clients about themselves) which attributes a device can store. In this case the server can distinguish between attributes that were intentionally removed and those which were lost. Removing attributes becomes possible.
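The three server strategies listed above can be modelled in a few lines. This is only a sketch under simplifying assumptions: items are flat dictionaries, and the function names as well as the `supported` set are hypothetical, not taken from any real server.

```python
def replace(server_item: dict, client_item: dict) -> dict:
    """Strategy 1: the client's copy replaces the server's copy entirely."""
    return dict(client_item)

def merge(server_item: dict, client_item: dict) -> dict:
    """Strategy 2: keep everything the server has, overlay what the client sent."""
    merged = dict(server_item)
    merged.update(client_item)
    return merged

def capability_aware_merge(server_item: dict, client_item: dict, supported: set) -> dict:
    """Strategy 3: preserve only attributes the device cannot store at all."""
    merged = dict(client_item)
    for key, value in server_item.items():
        if key not in supported and key not in client_item:
            merged[key] = value  # lost by the device, not removed by the user
    return merged

server_item = {"name": "John Doe", "home": "12345", "work": "67890", "photo": "<jpeg data>"}
# The device cannot store photos; the user deliberately deleted the work number.
client_item = {"name": "John Doe", "home": "12345"}
supported = {"name", "home", "work"}

after_replace = replace(server_item, client_item)
after_merge = merge(server_item, client_item)
after_aware = capability_aware_merge(server_item, client_item, supported)
```

With `replace` the photo is lost, with `merge` the deleted work number comes back, and only the capability-aware variant both keeps the photo and honours the deletion.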
All of the operations mentioned so far for servers rely on some understanding of the data which is to be synchronized. In contrast to file synchronization which can treat files as opaque blobs of data, PIM data needs to be parsed by clients and server. It is the server’s responsibility to transform data between different formats; in some cases one format might be less capable than the other (older vCalendar 1.0 compared to its successor iCalendar 2.0; rich text notes vs. plain text).
It is hard to tell from product descriptions how complete the support for a certain data format really is. “Supports vCard 2.1” can mean anything from “name + one phone number” to “all attributes which are defined by vCard 2.1 + several common extensions”. Some servers are known to not support certain attributes. These attributes then cannot be synchronized, even if all clients support them. In practice only experiments with all clients and the server will show how well the combination really works.
General Advice On Avoiding Pitfalls
The guidelines in this section work independently from the specific SyncML server or client. These are “best known methods” and only guidelines instead of hard rules; there might be situations where other methods work better. The most important rule first:
Don’t assume that everything will just work! Make backups of all data that you cannot afford to lose.
When getting started with synchronization, gather all data on the most capable client, usually your computer. Clean up duplicate entries. Do a “refresh from client” synchronization with that client. In this mode any data on the server is wiped out and replaced by the data on the client. SyncEvolution does this when invoked with --sync refresh-from-client. Then do a “refresh from server” sync on each additional device. This replaces the (hopefully old!) data on those devices with the data on the server, which is the same as on the main device. If your SyncML client doesn’t let you choose these synchronization modes, then manually wipe out the old data on the device or on the server, respectively, before synchronizing for the first time.
Be careful when deleting data on either client or server if it has already been synchronized! An ensuing normal sync will also remove the data elsewhere.
Clearly this manual merging can be a lot of work, but at least it is safe; describing how it can be done with the various clients is beyond the scope of this article.
Once all clients are in sync with the server, avoid synchronization failures at all costs. If you are on an unreliable mobile phone network, then it is better not to synchronize. Don’t interrupt a synchronization manually.
Specific Advice for …
When using SyncEvolution, consider enabling automatic backups of your data:
- Create a new directory
- Set this as the “log dir”:
syncevolution --sync-property logdir=/home/myname/foo
- Now SyncEvolution will create a new directory inside the log dir for each sync and store database dumps before and after each synchronization there; maxlogdirs can be set to the maximum number of directories which are to be kept around before deleting the oldest one.
After a failed synchronization it is possible to wipe out the local data and reimport it from one of the older database dumps. Evolution can read these .ics files directly. The synccompare command line utility can be used to compare two database dumps.
Keeping these log dirs around is also useful for debugging a problem. Finally, it allows comparing the data currently stored locally with the data sent to the server in the last sync (syncevolution --status <server name>).
When copying configurations, always remember to update the deviceId property. This ID is used by the server to identify which device it is talking to; if two different configurations or clients use the same ID, they confuse the server and will run into unexpected slow synchronizations. Usually it should not be necessary to copy configurations. When creating them from scratch with --configure, a unique deviceId is created automatically.
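Why a shared deviceId leads to unexpected slow synchronizations can be sketched as follows. The model assumes that the server remembers one sync anchor per device ID and falls back to a slow sync when the anchor presented by the client does not match; the class `ToyServer`, the anchor strings and the device ID are all hypothetical, and real servers keep more state than this.

```python
class ToyServer:
    """Remembers, per device ID, the anchor of the last completed sync."""

    def __init__(self):
        self.anchors = {}  # device ID -> anchor stored after the last sync

    def start_sync(self, device_id, last_anchor, next_anchor):
        stored = self.anchors.get(device_id)
        # A normal sync is only possible if the client's idea of the last
        # sync matches what the server remembers for this device ID.
        mode = "two-way" if stored is not None and stored == last_anchor else "slow"
        self.anchors[device_id] = next_anchor
        return mode

server = ToyServer()
shared_id = "sc-pim-example"  # hypothetical ID, copied to a second machine

modes = [
    server.start_sync(shared_id, None, "laptop-1"),        # first sync of the laptop
    server.start_sync(shared_id, "laptop-1", "laptop-2"),  # laptop again
    server.start_sync(shared_id, None, "copy-1"),          # copied config, first sync
    server.start_sync(shared_id, "laptop-2", "laptop-3"),  # laptop again
]
```

In this model the laptop's third sync is slow although nothing went wrong on the laptop itself: the copy overwrote the anchor that the server stored under the shared ID.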
Evolution uses the full iCalendar 2.0 and vCard 3.0 internally. SyncEvolution usually just passes this data through to servers, so hardly any data is lost due to conversions. iCalendar 2.0 items are not modified at all, which implies that servers must support this format. Contacts can be converted to and from vCard 2.1 by SyncEvolution, which is mostly a lossless conversion. Evolution and SyncEvolution store some attributes not defined by the vCard standards as extensions; currently these extensions are marked with the prefix X-EVOLUTION-, which is not supported by all servers. It is planned to switch over to the more common X- prefix as soon as the Funambol server also uses them (planned for 7.1/Q1 ’09). ScheduleWorld already understands the
Another known limitation of SyncEvolution is the missing support for file attachments in events/todos. For a long time it didn’t seem like servers would support these either, but then I learned that ScheduleWorld does – time for SyncEvolution to catch up!
Contact lists are not synchronized, which is a limitation of both SyncEvolution and servers.
SyncEvolution for the Mac OS X and iPhone address book generates and parses vCard 2.1. When I wrote that code, I documented all known limitations on the compatibility page. This is not the code used by the Funambol iPhone and Mac OS X plug-in! Those two projects each use different code and may have different limitations.
ScheduleWorld is a very capable server which (like Evolution) uses vCard 3.0 and iCalendar 2.0 internally. There are no attributes which are known to be unsupported.
ScheduleWorld resolves conflicts during slow synchronizations by storing both copies of an item. This leads to duplicates that the user has to merge manually later on. In order to help with duplicates ScheduleWorld provides two solutions (contacts only at the moment):
- If you are using the syncSW Thunderbird add-on you can use the ‘merge contact duplicates’ feature. This feature finds duplicates and merges their attributes into a single contact and deletes the other one.
- The new ScheduleWorld web app also provides this functionality. The web app is in beta but the merge code is the same well tested code found in the Thunderbird syncSW add-on.
When receiving updates, ScheduleWorld merges its existing item with the update sent by a client. With some clients, ScheduleWorld parses the information about supported attributes and in addition, it knows that all SyncEvolution backends store all data. This means that it is possible to remove attributes.
Update 2008-12-03: when receiving an update for an item that was also in the “modified” state on the server because it was updated by another client, ScheduleWorld resolves the conflict by replacing data of the item on the server with data of the item it just received from the client. This means that changes made by the first client will be lost. For example, if the first client adds a cell phone number to a contact and the second client a work phone number, then the cell phone number will be lost when the second client sends its update of the contact.
There is currently one bug in ScheduleWorld which can cause problems. Each time a new client is synchronized with it, all other clients are forced to do a slow synchronization the next time they connect to ScheduleWorld. To limit the effects of this bug (like unwanted duplicates), be careful when adding a new client:
- Synchronize all existing clients with ScheduleWorld.
- Synchronize all clients again, in case one of them modified data on the server.
- Synchronize the new client. If possible (= it has no valuable data), choose a “refresh from server” to avoid merging of data on the server.
- Synchronize all old clients, if possible using “refresh from server”.
The last step is important! If data was modified on one of the old clients without getting it into sync with the server first, then during the ensuing “slow” sync conflicts between new data on the client and old data on the server would lead to duplicates.
This section refers to the default data storage of PIM data. There are connectors for Funambol which store the data differently.
Funambol stores PIM data in a database schema that is modelled after the capabilities of vCard 2.1 and vCalendar 1.0. There are attributes which are not supported yet; I had a closer look at that in a previous blog post. In particular calendar support with iCalendar 2.0 has several limitations. Improvements are expected for Funambol 7.1 and 8.0.
Funambol assumes that clients send properties with empty values if a property was removed. Not including the property in the update is interpreted as “client was unable to store the property”, in which case the property as stored on the server is preserved. Evolution (and thus SyncEvolution) does not generate entries for properties that were removed, therefore removing obsolete values on the server is not currently possible. SyncEvolution could work around this by mangling the items it sends out, but it has no means of detecting the Funambol server, and modifying the data unconditionally might confuse other servers. Besides, it would increase the amount of data that has to be sent. It would be better to reconfigure the Funambol server, as is already done for other clients which behave like SyncEvolution. In the meantime:
Overwrite old information with a space instead of removing it. This ensures that the server is sent the property, which causes it to overwrite the obsolete value in its own copy of the item instead of preserving the value.
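The difference between an omitted property, an empty property and the space workaround can be modelled like this. It is a sketch of the behaviour described above, not Funambol’s actual code; `apply_update` and the flat dictionary representation are assumptions made for illustration.

```python
def apply_update(server_item: dict, update: dict) -> dict:
    """Apply a client update using the 'empty value means removed' convention."""
    result = dict(server_item)
    for key, value in update.items():
        if value == "":
            result.pop(key, None)  # empty value: the client removed the property
        else:
            result[key] = value    # any other value overwrites the server's copy
    # Properties missing from the update are left untouched: the server
    # assumes the client was simply unable to store them.
    return result

server_item = {"name": "John Doe", "home": "12345", "work": "67890"}

# Evolution omits the removed property, so the obsolete number survives.
omitted = apply_update(server_item, {"name": "John Doe", "home": "12345"})

# A client following the convention sends an empty value; the number is removed.
emptied = apply_update(server_item, {"name": "John Doe", "home": "12345", "work": ""})

# The workaround: a single space overwrites the obsolete value.
spaced = apply_update(server_item, {"name": "John Doe", "home": "12345", "work": " "})
```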
Update 2008-12-03: when receiving an update for an item that was also in the “modified” state on the server, Funambol resolves the conflict by merging the server’s data with the client data. In the example given for ScheduleWorld above that means that both the new cell phone and the work phone number are preserved.
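The two conflict policies described in the updates for ScheduleWorld and Funambol can be compared with a small model of the phone number example. The function names are hypothetical and the items are simplified to flat dictionaries; this only mirrors the behaviour as described in this article.

```python
def resolve_client_wins(server_item: dict, client_item: dict) -> dict:
    """Conflict policy as described for ScheduleWorld: incoming data replaces the server's."""
    return dict(client_item)

def resolve_merge(server_item: dict, client_item: dict) -> dict:
    """Conflict policy as described for Funambol: combine server and client data."""
    merged = dict(server_item)
    merged.update(client_item)
    return merged

base = {"name": "John Doe", "home": "12345"}

# The first client added a cell phone number and has already synchronized.
on_server = dict(base, cell="0170-111")

# The second client added a work number to its (older) copy and now sends it.
from_second_client = dict(base, work="0228-222")

replaced = resolve_client_wins(on_server, from_second_client)
merged = resolve_merge(on_server, from_second_client)
```

Under the first policy the cell phone number is gone after the second client’s update; under the second policy both new numbers survive.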
The philosophy that Funambol follows is that duplication of items has to be avoided at all costs because that causes work for users. It is considered acceptable that data is lost in some cases. In the merge example given above, the server would choose the second option (“the second person is right”). The rationale is that the data currently received is the one that the user wants to have stored on the server. If the currently sent item is in fact older than the one on the server, then the more recent data gets lost. From this heuristic follows this advice:
To avoid data loss with the Funambol server, always synchronize the client with the oldest data first, then clients with more recent data.
This may not be possible in all cases. Suppose a user owns two devices which he has not synchronized automatically yet. He manually entered two contacts on each device.
Device A and B:
- John Doe, home phone “old-phone-john”
- Alice Doe, home phone “old-phone-alice”
Later he made changes, but forgot to keep them in sync – it would have to be done manually after all, which is so cumbersome…
Device A:
- John Doe, home phone “old-phone-john”
- Alice Doe, home phone “new-phone-alice”
Device B:
- John Doe, home phone “new-phone-john”
- Alice Doe, home phone “old-phone-alice”
Now he synchronizes device A with the server, then device B. In both cases a slow sync is done. He synchronizes device A and B again, doing a two-way sync in both cases.
The result is on device A and B:
- John Doe, home phone “new-phone-john”
- Alice Doe, home phone “old-phone-alice”
Alice’s new phone number is lost. If the order of syncing the two devices had been swapped, then John’s number would have been lost (exercise left to the reader…). I intentionally used two contacts to demonstrate that no matter how the user syncs, he’ll lose data in both cases.
The only way around that is to merge the two devices manually, as suggested in the “general advice” section above.
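The two-device scenario can be replayed with a toy model in which the server, during a slow sync, always keeps the item it just received. The functions below are hypothetical simplifications (contacts are matched by name, and the two-way sync only delivers server changes because no local changes are pending in this scenario).

```python
def slow_sync(server: dict, device: dict) -> None:
    """The device sends everything; on conflict the incoming item wins."""
    server.update(device)
    device.update(server)  # the server tells the device how to update itself

def two_way_sync(server: dict, device: dict) -> None:
    """Deliver the server's state to a device without pending local changes."""
    device.update(server)

def run(first: dict, second: dict):
    """Slow-sync both devices in the given order, then two-way sync both."""
    server = {}
    first, second = dict(first), dict(second)
    slow_sync(server, first)
    slow_sync(server, second)
    two_way_sync(server, first)
    two_way_sync(server, second)
    return first, second

device_a = {"John Doe": "old-phone-john", "Alice Doe": "new-phone-alice"}
device_b = {"John Doe": "new-phone-john", "Alice Doe": "old-phone-alice"}

a_then_b = run(device_a, device_b)  # Alice's new number is lost
b_then_a = run(device_b, device_a)  # John's new number is lost
```

Whichever device goes second imposes its complete state, so one of the two new numbers is always overwritten.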
I hope this article was not too long and scary. Always remember that this was about the corner cases; most of the time the magic really works. All sync solutions have to deal with these issues in one way or another. If they don’t document how it is done, then most likely the pitfalls also exist but aren’t talked about… When you are just getting started with syncing, then simply avoid the pitfalls and enjoy the magic!
I believe that “forewarned is forearmed”. Now no-one can claim that I haven’t warned him, and armed with that knowledge I can peacefully wait for whatever user feedback the future might bring. If there are factual errors or omissions, then please let me know and I’ll update this blog post.