Hotspot and PTIH
----------------

Hotspot is a program for identifying regions of local enrichment of
short-read sequence tags mapped to the genome, using a binomial
distribution model.  Regions flagged by the algorithm are called
"hotspots."  See the doc directory for somewhat out-of-date
documentation.

The purpose of this particular distribution is to use hotspot to
compute PTIH (percent tags in hotspots), a quality measure for
short-read sequence experiments.  PTIH is exactly what it says it is:
the percentage of all tags that fall in hotspots.
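PTIH itself is a simple ratio.  As an illustration only (the
distribution computes PTIH from bed files via the pipeline scripts
described below; the in-memory data structures here are our own), a
sketch in Python:

```python
# Illustrative sketch of the PTIH ratio on in-memory data.  The actual
# pipeline operates on bed files; this just shows the arithmetic.
import bisect

def ptih(tags, hotspots):
    """Fraction of tags whose position falls inside a hotspot.

    tags:     iterable of (chrom, pos) tag positions
    hotspots: dict mapping chrom to a sorted, non-overlapping list of
              (start, end) half-open intervals
    """
    # Precompute interval start positions per chromosome for bisection.
    starts = {c: [s for s, _ in ivals] for c, ivals in hotspots.items()}
    total = hits = 0
    for chrom, pos in tags:
        total += 1
        ivals = hotspots.get(chrom)
        if not ivals:
            continue
        # Rightmost interval whose start is <= pos.
        i = bisect.bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < ivals[i][1]:
            hits += 1
    return float(hits) / total if total else 0.0
```

For example, with three tags of which one lands in a hotspot, the
function returns 1/3.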

Hotspot was originally conceived and written by Mike Hawrylycz, and is
now maintained by Bob Thurman, University of Washington, with
contributions by Eric Haugen, University of Washington, and especially
Scott Kuehn.  Hotspot has not yet been published, so we ask that you
do not publish data or results using hotspot without permission.  We
will give you appropriate citation at that time.

Bob Thurman
rthurman@uw.edu
3 Mar 2010


Making the hotspot program
--------------------------

A compiled (Red Hat Linux, 64-bit) version of the hotspot binary is
included in this distribution, in

    hotspot-deploy/bin/hotspot5

Type the program name without arguments for a usage statement.

To compile your own version, cd to hotspot-deploy and type "make".


Running hotspot and computing PTIH
----------------------------------

In its current incarnation, proper hotspot identification (and PTIH
calculation) requires two passes of calls to hotspot (see the
documentation in the doc directory).  Scripts are provided to do this
correctly for you.  A test example is provided, with test data in the
data directory.  The file DNaseI.DS10081.5m.hg18.bed.gz is a test data
file consisting of 5 million mapped tag positions, in bed format.
This is a high-quality experiment, with PTIH = 0.7411.

To use the scripts provided to compute PTIH yourself, see the directory

    pipeline-scripts/test

and the files runall and runall.tokens.txt therein.  The file
runall.tokens.txt contains the parameters for calling hotspot and
computing PTIH.  To run the example, you will need to edit the
directory paths of the files defined in runall.tokens.txt.  The runall
script generates child scripts populated with the parameters defined
in runall.tokens.txt, and runs them.  You only need to change the
values in runall.tokens.txt, not runall itself.  Then simply type
"runall".
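The token-substitution step that runall performs can be pictured as
follows.  This is a minimal sketch: the token names (_TAGS_, _K_) and
the command line shown are illustrative only, and the real work is
done by the ScriptTokenizer; consult runall.tokens.txt itself for the
actual tokens and syntax.

```python
# Minimal sketch of token substitution: fill _TOKEN_-style
# placeholders in a script template with values from a tokens file.
# Token names and values here are made up for illustration.

def substitute_tokens(template, tokens):
    """Replace every occurrence of each token name with its value."""
    for name, value in tokens.items():
        template = template.replace(name, value)
    return template

tokens = {"_TAGS_": "/data/DNaseI.DS10081.5m.hg18.bed.gz",
          "_K_": "27"}
script = substitute_tokens("hotspot input=_TAGS_ kmer=_K_", tokens)
```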

In addition to the input tags bed file itself, two key auxiliary files
are worth mentioning.  The hotspot program makes use of mappability
information: for a given tag length (k-mer size), only a subset of the
genome is uniquely mappable.  The file defined by the variable
_MAPPABLE_FILE_ in runall.tokens.txt contains the mappable positions
in the genome for a given k-mer size.  We have included these files
for various values of k for the human genome hg18 in the data
directory.  Similarly, the file defined by _MAPPABLE_10KB_FILE_
contains counts of uniquely mappable positions in 10 kb windows across
the genome, and again, we have included a selection of these.
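Conceptually, the 10 kb window counts can be derived from a list of
uniquely mappable positions by binning.  A sketch (the input and
output layouts here are assumptions, not the format of the actual
distribution files):

```python
# Sketch of deriving per-window counts of uniquely mappable positions,
# i.e. the kind of summary held in the _MAPPABLE_10KB_FILE_ input.
# The (chrom, pos) tuple layout is illustrative only.
from collections import defaultdict

def window_counts(mappable_positions, window=10000):
    """Count mappable positions per (chrom, window index)."""
    counts = defaultdict(int)
    for chrom, pos in mappable_positions:
        counts[(chrom, pos // window)] += 1
    return dict(counts)
```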

For purposes of computing PTIH, you can generally get by with one of
the provided sets of files, even if it does not exactly match the
k-mer size you use.

If you include tags mapped to locations that are not uniquely
mappable, you can provide your own mappability files, use one of our
precomputed files as an approximation, or assume the entire genome is
mappable.  In the last case, set _MAPPABLE_FILE_ to the chromosome
coordinates themselves (for example, the file chromInfo.hg18.bed in
the data directory), and set _MAPPABLE_10KB_FILE_ to the file
everything.mappable.counts.10kb.total.summed (in the case of hg18) in
the data directory.
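For the whole-genome-mappable case, the relevant settings in
runall.tokens.txt would look something like the following.  The
"token = value" layout and the directory prefix shown here are
illustrative; match the formatting already used in your copy of the
file.

```
_MAPPABLE_FILE_ = /path/to/data/chromInfo.hg18.bed
_MAPPABLE_10KB_FILE_ = /path/to/data/everything.mappable.counts.10kb.total.summed
```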

Important:  see Dependencies section below.

Dependencies
------------

The scripts in the pipeline-scripts directory make liberal use of UW
internal utility programs for computing standard set operations on bed
files, so you will need those utilities in your path.  The utilities
are available in the parallel distribution uw_utils.

Also, the script tokenizer (see ScriptTokenizer directory) requires
Python 2.6+.
    
