207 lines
8.9 KiB
Text
207 lines
8.9 KiB
Text
# $Id: parallel.txt,v 1.5 2004/07/10 14:37:36 grant Exp $
|
|
#
|
|
|
|
These are my (<dmcmahill>) thoughts on how one would want a parallel
|
|
bulk build to work.
|
|
|
|
|
|
====================================================================
|
|
Single Machine Build Process
|
|
====================================================================
|
|
|
|
The current (as of 2003-03-16) bulk build system works in the
|
|
following manner:
|
|
|
|
1) All installed packages are removed.
|
|
|
|
2) Packages listed in the BULK_PREREQ variable are installed. This
|
|
must be done before step 2 as some packages (like xpkgwedge) can
|
|
affect the dependencies of other packages when installed.
|
|
|
|
3) Each package directory is visited and its explicitly listed
|
|
dependencies are extracted and put in a 'dependstree' file. The
|
|
mk/bulk/tflat script is used to generate flattened dependencies
|
|
for all packages from this dependstree file in both the up and
|
|
down directions. The result is a file 'dependsfile' which has one
|
|
line per package that lists all build dependencies. Additionally,
|
|
a 'supportsfile' is created which has one line for each package
|
|
and lists all packages which depend upon the listed pacakge.
|
|
Finally, tsort(1) is applied to the 'dependstree' file to
|
|
determine the correct build order for the bulk build. The build
|
|
order is stored in a 'buildorder' file. This is all achieved via
|
|
the 'bulk-cache' top level target. By extracting dependencies in
|
|
this fashion, we avoid highly redundant recursive make calls. For
|
|
example, we no longer need to use a recursive make to find the
|
|
dependencies for libtool literally thousands and thousands of
|
|
times throughout the build.
|
|
|
|
4) During the build, the 'buildorder' file is consulted to figure out
|
|
which package should be built next. Then to build the package,
|
|
the following steps are taken:
|
|
|
|
a) Check for the existance of a '.broken' file in the package
|
|
directory. If this file exists, then the package is already
|
|
broken for some reason so move on to the next package.
|
|
|
|
b) Remove all packages which are not needed to build the current
|
|
package. This dependency list is obtained from the 'dependsfile'
|
|
created in step 3 and the BULK_PREREQ variable.
|
|
|
|
c) Install via pkg_add all packages which are needed to build the
|
|
current package. We are able to do this because we have been
|
|
building our packages in a bottom up order so all dependencies
|
|
should have been built.
|
|
|
|
d) Build and package the package.
|
|
|
|
e) If the package build fails, then we copy over the build log to
|
|
a .broken file and in addition, we consult the 'supportsfile' and
|
|
mark all packages which depend upon this one as broken by adding a
|
|
line to their .broken files (creating them if needed). By going
|
|
ahead and marking these packages as broken, we avoid wasting time
|
|
on them later.
|
|
|
|
f) Append the package directory name to the top level pkgsrc
|
|
'.make' file to indicate that we have processed this package.
|
|
|
|
5) Run the mk/bulk/post-build script to collect the summary and
|
|
generate html pages and the email we've all seen.
|
|
|
|
====================================================================
|
|
Single Machine Build Comments
|
|
====================================================================
|
|
|
|
There are several features of this approach that are worth mentioning
|
|
explicitly.
|
|
|
|
1) Packages are built in the correct order. We don't want to rebuild
|
|
the gnome meta-pkg and then rebuild gnome-libs for example.
|
|
|
|
2) Restarting the build is a cheap operation. Remember that this
|
|
build can take weeks or more. In fact the 1.6 build took nearly 6
|
|
weeks on a sparc 20! If for some reason, the build needs to be
|
|
interrupted, it can be easily restarted because in step 4f we keep
|
|
track of what has been built in a file. The lines in the build
|
|
script which control this are:
|
|
|
|
for pkgdir in `cat $ORDERFILE` ; do
|
|
if ! grep -q "^${pkgdir}\$" $BUILDLOG ; then
|
|
(cd $pkgdir && \
|
|
nice -n 20 ${BMAKE} USE_BULK_CACHE=yes bulk-package)
|
|
fi
|
|
done
|
|
|
|
In addition to storing the progress to disk, the bulk cache files
|
|
(the 'dependstreefile', 'dependsfile', 'supportsfile', and
|
|
'orderfile') are stored on disk so they do not need to be
|
|
recreated if a build is stopped and then restarted.
|
|
|
|
3) By leaving packages installed and only deleting the ones which are
|
|
not needed before each build, we reduce the amount of installing
|
|
and deinstalling needed during the build. For example, it is
|
|
quite common to build several packages in a row which all need GNU
|
|
make or perl.
|
|
|
|
4) Using the 'supportsfile' to mark all packages which depend upon a
|
|
package which has just failed to build can greatly reduce the time
|
|
wasted on trying to build packages which known broken dependencies.
|
|
|
|
====================================================================
|
|
Parallel Build Thoughts
|
|
====================================================================
|
|
|
|
To exploit multiple machines in an attempt to reduce the build time,
|
|
many of the same ideas used in the single machine build can still be
|
|
used. My view of how a parallel build should work is detailed here.
|
|
|
|
master == master machine. This machine is in charge of directing
|
|
the build and may or may not actively participate in it.
|
|
In addition, this machine might not be of the same
|
|
architecture or operating system as the slaves (unless it
|
|
is to be used as a slave as well).
|
|
|
|
slave#x == slave machine #x. All slave machines are of the same
|
|
MACHINE_ARCH and have the same operating system and access
|
|
the same pkgsrc tree via NFS and access the same binary
|
|
packages directory.
|
|
|
|
If the master machine is also to be used as a build
|
|
machine, then it is also considered a slave.
|
|
|
|
Prior to starting the build, the master directs one of the slaves to
|
|
extract the dependency information per steps 1-3 in the single machine
|
|
case.
|
|
|
|
The actually build should progress as follows:
|
|
|
|
1) For each slave which needs a job, the master assigns a package to
|
|
build based on the rule that only packages that have had all their
|
|
dependencies built will be sent to slaves for compilation.
|
|
|
|
2) When a slave finishes, the master either notes that the binary
|
|
package is now available for use as a depends _or_ notes failure
|
|
and marks all packages which depend upon it as broken as in step
|
|
4e of the single machine build.
|
|
|
|
|
|
Each slave builds a package in the same way as it would in a single
|
|
machine build (steps 4a-d).
|
|
|
|
====================================================================
|
|
Important Parallel Build Considerations
|
|
====================================================================
|
|
|
|
|
|
1) Security. Packages are installed as root prior to packaging.
|
|
|
|
2) All state kept by the master should be stored to disk to
|
|
facilitate restarting a build. Remember this could take weeks so
|
|
we don't want to have to start over.
|
|
|
|
3) The master needs to be able to monitor all slaves for signs of
|
|
life. I.e., if a slave machine is simply shut off, the master
|
|
should detect that it's no longer there and re-assign that slaves
|
|
current job.
|
|
|
|
3a) The master must be able to distinguish between a slave failing to
|
|
compile a package due to the package failing vs a
|
|
network/power/disk/etc. failure. The former causes the package to
|
|
be marked as broken, the latter causes the slave to be marked as
|
|
broken.
|
|
|
|
4) Ability to add and remove slaves from the cluster during a build.
|
|
Again, a build may take a long time so we want to add/remove
|
|
slaves while the build is in progress.
|
|
|
|
====================================================================
|
|
Additional Thoughts
|
|
====================================================================
|
|
|
|
This is mostly related to using slaves which are not on a local
|
|
network.
|
|
|
|
- maybe a hook could be put in place which rsync's the binary package
|
|
tree between the binary package repository machine and the slave
|
|
machine before and after each package is built?
|
|
|
|
- security
|
|
|
|
- support for Kerberos?
|
|
|
|
====================================================================
|
|
Implementation Thoughts
|
|
====================================================================
|
|
|
|
- Can this all be written around using ssh to send out tasks? How do
|
|
we monitor slaves for signs of life? How do we indicate 'build
|
|
failed/build succeeded/slave failed' conditions?
|
|
|
|
- Maybe we could have a file listing slaves and the master consults
|
|
this each time it needs a slave. That would make adding/removing
|
|
slaves easy. There would need to be another file to keep track of
|
|
which slaves are busy (and with what).
|
|
|
|
- Do we want to use something like pvm instead? There is a
|
|
p5-Parallel-Pvm package and perl nicely deals with parsing some of
|
|
these files and sorting dependencies although I hate to add any
|
|
extra dependencies to the build system.
|