Some practical advice when running Haplin
You may occasionally run into problems when using Haplin. Although I’ve tried to make Haplin respond with proper messages as often as possible, sometimes an error may occur, and the error message may be cryptic. Warnings may also appear, even if Haplin completes its run. You may also have problems with Haplin running very slowly. Below I list a few pieces of advice to avoid problems. Some of this will be built into Haplin in the future, so that it produces appropriate warnings along the way.
- Too low threshold: The threshold parameter decides how rare a haplotype must be before it is excluded from the analysis. After a preliminary haplotype frequency estimation, Haplin removes all haplotypes with an estimated frequency below this limit. The default is 0.01. This default is fairly low, which means you will sometimes have too many haplotypes in the analysis. This may cause Haplin to run slowly or even crash. It will also give many of the double-dose estimates very wide confidence limits, since there is little data to estimate double doses for rare haplotypes. A sign that the threshold is too low is that Haplin has a large number of parameters to estimate in each EM step. Try running Haplin with, for instance, threshold = 0.05; this will usually reduce or eliminate the problem. You can then tune the threshold parameter once you get an idea of the haplotype distribution.
A side effect of increasing the threshold parameter is that Haplin must remove a number of triads that contain only the rare (excluded) haplotypes. This leads to a loss of data, though not necessarily a serious one. The first part of the Haplin printout reports how many triads were actually removed due to rare haplotypes.
An alternative, and sometimes better, solution is to set response = "mult", which skips the double-dose estimation and instead assumes a multiplicative (dose-response) model; this is usually more stable.
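The two workarounds above can be written as ordinary haplin() calls. The file name below is only a placeholder for your own triad data file, and use.missing is included just as a typical companion setting:

```r
library(Haplin)

# Raise the haplotype frequency threshold from the default 0.01 to 0.05,
# so that more of the rare haplotypes are excluded before estimation:
res.thresh <- haplin("triad_data.dat", threshold = 0.05, use.missing = TRUE)

# Alternatively, keep the default threshold but skip the double-dose
# estimation by assuming a multiplicative dose-response model:
res.mult <- haplin("triad_data.dat", response = "mult", use.missing = TRUE)
```

In practice it can be worth running both and comparing: if the multiplicative model gives much narrower confidence intervals for the same haplotypes, the double-dose estimates were probably starved of data.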
- Too much missing information: It usually works fine to include triads where, for instance, all information on the father is missing. However, you should watch out for triads that lack information on all markers for several of the family members. These contain little information but a lot of ambiguity, so Haplin has to work hard to make sense of them with little extra power in return. Haplin should detect these automatically in the future, but for the time being it is a good idea to try and remove the "hopelessly" uninformative triads from the file before running Haplin.
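As a rough pre-filter, one could drop rows (triads) whose genotype columns are almost entirely missing before feeding the file to Haplin. This is only a sketch: the missing-value coding (NA here), the absence of non-genotype columns, and the 80% cutoff are all assumptions you must adapt to your own file format:

```r
# Read the triad file; this sketch assumes every column is genotype data
# and that missing values are coded as NA (adapt to your own coding,
# e.g. "0" or "NN").
triads <- read.table("triad_data.dat", header = FALSE, na.strings = "NA")

# Fraction of missing entries in each row, i.e. in each triad
frac.missing <- rowMeans(is.na(triads))

# Keep only triads with, say, less than 80% missing information
triads.kept <- triads[frac.missing < 0.8, ]
write.table(triads.kept, "triad_data_filtered.dat",
            row.names = FALSE, col.names = FALSE, quote = FALSE)
```

The filtered file can then be passed to haplin() as usual, with use.missing = TRUE handling the remaining, less extreme missingness.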
- Too many markers: Since the number of markers included determines how many possible haplotypes there are, the workload of Haplin increases rapidly with the number of markers. Haplin 2.0 handles this pretty well, but keep in mind that if you run more than, say, 6-7 SNP markers at a time, there will be many rare haplotypes that Haplin will need to get rid of. This may lead to a data loss that sometimes becomes serious. It is therefore probably a good idea to limit the number of markers in each run. The markers you want to include can be picked from the data using the markers argument. Setting, for instance, markers = c(2,3,5,6) picks markers 2, 3, 5, and 6 from the file, so you don't have to make a separate file for each combination you want to try.
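Because the markers argument selects columns from the same data file, different marker subsets can be tried in a loop without creating separate files. The file name and the total of 7 markers below are placeholders:

```r
library(Haplin)

# Analyze only markers 2, 3, 5, and 6 from the full file:
res.sub <- haplin("triad_data.dat", markers = c(2, 3, 5, 6),
                  use.missing = TRUE)

# Sliding windows of three adjacent markers, one haplin run per window
# (assuming the file contains 7 markers in total):
for (start in 1:5) {
  win <- start:(start + 2)
  cat("Running markers", paste(win, collapse = ","), "\n")
  res.win <- haplin("triad_data.dat", markers = win, use.missing = TRUE)
}
```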
- Too much printout: If you think Haplin has a tendency to produce too much printout during the EM process, you may be right. Even though I do recommend checking convergence by looking at the parameter estimates printed during the EM iterations, the amount of printout is considerably reduced by setting the verbose argument to F (i.e., FALSE), as in:
haplin(prepdata, use.missing = T, verbose = F).