Backing up Large Files
by Paul Zarucki, Electronic Equipments
Ltd., 2008-06-27. Updated 2008-07-02.
There are quite a few programs
for backing up files, some with graphical interfaces, some with web
interfaces and others which work from the command line. They are
mostly designed to do scheduled backups to a server or hard disc but I
wanted a straightforward way to make one-off backups of some large files
to DVDs or a USB drive. I decided, therefore, to use a simple command
line procedure to do most of the work. It takes only two commands to
create the archive and three to restore the archive and check it for errors.
The beauty of the command line approach is that, once you have a
working procedure, you can save the commands in a text file (called a
shell script) and link it to an icon on your desktop. Anytime you need
to repeat the procedure, simply click the icon! The shell script also
serves to document the procedure and you can easily edit it if you need
to change it.
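To illustrate the idea, here is what such a script might look like. This is only a sketch using placeholder names (mydir, myarchive, and the function name backup are mine, not part of the tutorial); the tar, split and md5sum commands it uses are explained in the rest of this article.

```shell
#!/bin/sh
# backup.sh -- a sketch of a one-click backup script.
# backup <dir> <name> packs <dir> into 1000MB chunks named
# <name>.tgz.00, <name>.tgz.01, ... and writes their checksums
# to <name>.md5.
backup () {
    tar -cz "$1" | split -d -b 1000m - "$2.tgz." &&
    md5sum "$2".tgz.* > "$2.md5"
}
```

You would add a line such as "backup mydir myarchive" at the end, save the file, mark it executable with chmod +x, and link it to a desktop icon.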
If you are new to the command line, I highly recommend the beginner's
introduction by Rosalyn Hunter.
I use the Debian 4.0 GNU/Linux system but the methods described here
should work on most GNU/Linux and Unix systems with little or no
modification. The references to copy-and-paste assume you are using an
X-windows based graphical desktop system (I use Gnome but the
principles apply to almost any of the
desktop systems available for GNU/Linux and Unix).
This tutorial makes use of the command
line. If you are only used to point-and-click methods, don't be afraid!
The command line is your friend and, for some tasks, it is quicker and
easier than the available graphical software. To start typing commands,
open a console or terminal window (if you are using the Gnome desktop,
look under Applications -> Accessories -> Terminal).
You can copy-and-paste each command from this page into
the console window then edit it to suit your needs before pressing the
return key. The left and right arrow keys move the cursor, the delete
and backspace keys delete text. Anything you type or paste will be
inserted at the cursor position.
TIP: to copy,
select some text with the mouse then, to paste, simply move the mouse
to the destination and middle-click (press either the middle button or
the scroll wheel on your mouse).
I have a directory containing some files for a virtual computer. The
files that hold the data for the virtual computer's hard discs are not
only big, they are also "sparse" files, which means that they use only
enough disc space for the data that was actually written to the file.
For example, the virtual computer may have a 30GB drive of which 2GB
has been used. Even though it uses only 2GB on my hard disc, a program
that reads the file may see it as a 30GB file. This type of file can be
tricky to back up because, when you copy it, you can end up with a 30GB
file, or it might simply fail to copy, depending on the type of file
system used on the backup storage.
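You can see this behaviour for yourself. The following commands (the file name sparse.img is invented for the demonstration) create a 100MB sparse file with dd, then compare its apparent size with the disc space it actually occupies:

```shell
# Seek 100MB into a new file and write nothing: the result is a
# "hole" -- the file claims to be 100MB but stores no data blocks.
dd if=/dev/zero of=sparse.img bs=1 count=0 seek=100M

ls -l sparse.img    # apparent size: 104857600 bytes (100MB)
du -k sparse.img    # actual disc usage: close to zero
```

Most Linux file systems support such holes; copying the file to, say, a FAT-formatted USB stick would expand it to the full 100MB.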
It would also be nice to preserve links as well as the ownership and
properties of the files, all of which might be lost if they were simply
copied to a DVD or USB drive.
A simple way to create an archive that uses no more space than is
necessary for sparse files, as well as preserving links and file
permissions, is to use the venerable tar
program. This packs the files into a single "archive" file and can
compress the files as well. Think of it as the Unix equivalent of a ZIP
file. (GNU tar also offers a -S, or --sparse, option which stores sparse
files efficiently and restores them as sparse.) The command would look
something like
tar -czf myarchive.tgz mydir
where mydir is the directory to be archived and myarchive.tgz is the
name of the archive file to be created.
I want the archive to be easy to copy to a variety of storage media
like a USB flash drive and different formats of DVD. These have limits
on the maximum size of a file which can be as small as 1GB so I use the
split program to divide the
archive into 1000MB chunks. The result is a set of files, each no larger
than 1000MB, which can easily be copied to and moved between different
types of storage. This also makes it easy to copy the archive onto
multiple DVDs if it is too big to go onto one disc. I could use the
following command to split the file "myarchive.tgz"
split -d -b 1000m myarchive.tgz myarchive.tgz.
This would leave me with the original file (myarchive.tgz) plus a
series of files named "myarchive.tgz.00", "myarchive.tgz.01",
etc., each no larger than 1000MB.
In practice it would be better to use a pipe to feed the output of the
tar command to the split command and create the 1000MB
files directly, without the need first to save the archive as a single
file. This halves the amount of disc space and time required for the
job. The above commands would then be replaced by the single line
tar -cz mydir | split -d -b 1000m - myarchive.tgz.
where "|" tells the computer to pipe the output of the tar program to
the split program.
I then use the md5sum program
to calculate the checksum of each file which will make it easy to check
for corrupted files when restoring the archive. The command would be
md5sum myarchive.tgz.* > myarchive.md5
which creates the file myarchive.md5 containing the checksums of each
of the files produced by the split command.
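As a small illustration of what the checksum file contains (the file names below are invented, not part of the example): each line pairs a 32-character MD5 hash with a file name, and md5sum's -c option re-reads each named file and reports whether it still matches.

```shell
# Make a stand-in "chunk" and record its checksum.
echo "some archive data" > demo.tgz.00
md5sum demo.tgz.* > demo.md5

cat demo.md5        # one line: "<32-hex-digit hash>  demo.tgz.00"
md5sum -c demo.md5  # prints "demo.tgz.00: OK" while the file is intact
```

If a chunk were truncated or corrupted on the backup medium, the same check would report "FAILED" for that file instead.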
Restoring the Archive
When I want to restore the archive, I copy the files from the backup
medium into a temporary directory. In the console window, I change to
the temporary directory and type
md5sum -c myarchive.md5
which calculates new checksums for the archive and compares them with
the original checksums in the file myarchive.md5. If all is OK I then
re-combine the chunks into a single archive and extract the original
cat myarchive.tgz.* | tar -xz
(Again we use a pipe to avoid the need for an intermediate
file, saving time and disc space.)
The current directory now has a sub-directory containing a perfect copy
of the original files.
Creating the Archive
I have a directory "vm" with a sub-directory "oldpc" that contains some
files that I want to archive. I'll call the archive "oldpc2008-06-29".
I type the following command to create the archive:
tar -cz vm/oldpc | split -d -b 1000m - oldpc2008-06-29.tgz.
The result is a series of files named "oldpc2008-06-29.tgz.00", "oldpc2008-06-29.tgz.01",
etc., each no larger than 1000MB.
To create the checksum file for this archive, I type
md5sum oldpc2008-06-29.tgz.* > oldpc2008-06-29.md5
That's it! In two lines we have created a set of archive and checksum
files ready to be copied to the backup storage of our choice. You can
use your favourite
DVD writing program (GnomeBaker, K3b, etc.) to write the files to DVD
or, alternatively, plug in a USB drive and simply copy the files to it.
Restoring the Archive
To restore the files from the archive, I create a temporary directory,
say "restored", and copy the archive files into it. This is something I
can do using my normal file manager program. I then open a console
window and type the following commands:
cd restored
md5sum -c oldpc2008-06-29.md5
The first line does a "change directory" to select the directory
containing the archive files. The second line verifies the checksums.
If all is OK, I then type
cat oldpc2008-06-29.tgz.* | tar -xz
The directory "restored" now has a sub-directory "vm/oldpc" containing
a perfect copy of the original files.
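Like the backup commands, the restore steps collapse naturally into a reusable script. Again this is only a sketch; the function name and layout are my own, not part of the tutorial:

```shell
#!/bin/sh
# restore.sh -- verify an archive's checksums, then unpack it.
# restore <name> expects <name>.md5 and <name>.tgz.* in the
# current directory; the && means it refuses to unpack if any
# chunk fails its checksum.
restore () {
    md5sum -c "$1.md5" &&
    cat "$1".tgz.* | tar -xz
}
```

A line such as "restore oldpc2008-06-29", run from the temporary directory, would then verify and extract the archive in one go.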
A Note about Very Large Files
The tar archive format imposes an upper limit of 68GB on the size of
any single file that can be added to a tar archive. I believe the GNU
version of the tar program will handle files up to this size but some
other versions may have a limit of 8GB. If you have files larger than
the maximum then, as long as they are not sparse files, you can
overcome the size limitation by compressing and/or splitting these
files before making the archive (e.g. using the gzip and split commands). As far as I know,
there is no upper limit on the total size of the tar archive or the
number of files it may contain.
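As an illustration of that workaround, scaled down to a few megabytes so it can be tried quickly (for real multi-gigabyte files you would raise the split size to, say, 4000m; the file names here are invented):

```shell
# A small random file stands in for one that exceeds tar's limit.
dd if=/dev/urandom of=hugefile.img bs=1M count=8 2>/dev/null

# Before archiving: compress the file and split it into pieces,
# then archive the pieces instead of the original.
gzip -c hugefile.img | split -d -b 2m - hugefile.img.gz.

# After restoring the pieces from the archive: reassemble and
# decompress them to recover the original file.
cat hugefile.img.gz.* | gunzip -c > hugefile-restored.img

cmp hugefile.img hugefile-restored.img && echo "files match"
```

Note that gzip removes any sparseness, which is why this trick only suits non-sparse files.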
The tar and split commands are very versatile
and this tutorial shows just one way of using them. More information
can be obtained on any of the commands used here from the manual pages.
To view the "man page" for the tar command, for example, type
man tar
Scroll the man page using the up/down arrow,
PageUp, PageDown, Home and End keys. To finish, press the letter "q".
TIP: You can have multiple console
windows open at the same time, one for entering commands and another
for viewing man pages, for example. You can also copy and paste between
them (or between almost any two windows) using the
select-and-middle-click method described above.
Thanks to Kurt Jürgen Andereya for pointing out the 68GB
limit on the size of any single file that can be added to a tar archive.
Thanks to Michael Crider for pointing out that the md5sum program can
also compare the calculated checksums with pre-existing ones.
2008-07-02: added a note about file size limits (very large files).
2008-07-01: corrected a mistake in the restore procedure (it should read
"md5sum -c myarchive.md5").
2008-06-29: used md5sum with the -c option to eliminate the need to use
the diff command.
2008-06-27: first version.
Feedback and comments to paul@electronic-equipments.co.uk
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no
Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.