Hotspot and PTIH
----------------

Hotspot is a program for identifying regions of local enrichment of
short-read sequence tags mapped to the genome, using a binomial
distribution model.  Regions flagged by the algorithm are called
"hotspots."  See the doc directory for somewhat out-of-date
documentation.
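The exact test statistic used by hotspot is spelled out in the doc directory, but the familiar normal approximation to the binomial gives the flavor of how a z-score for local enrichment can be computed.  All the counts below are invented purely for illustration:

```shell
# Toy z-score under the normal approximation to the binomial:
#   z = (observed - n*p) / sqrt(n*p*(1-p))
# where n is the tag count in the local background window and p the
# background probability of a tag landing in the small window.
# These numbers are invented and are not hotspot defaults.
awk 'BEGIN {
  n   = 1000    # tags in the local background window
  p   = 0.01    # background probability for the small window
  obs = 25      # observed tags in the small window
  mu = n * p
  sd = sqrt(n * p * (1 - p))
  printf "%.2f\n", (obs - mu) / sd
}'
```

A window with 25 tags against an expectation of 10 yields a z-score well above the minimal threshold of 2 used for the minimally thresholded hotspots.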

This distribution includes scripts for computing minimally thresholded
(z-score = 2) hotspots; FDR thresholded hotspots, using randomly
generated tags; and FDR thresholded peaks within hotspots. Also
included is a script for computing the SPOT (Signal Portion Of Tags)
score, a quality measure for short-read sequence experiments.  The
SPOT score is computed as percentage of all tags that fall in
(minimally thresholded) hotspots.
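As a toy illustration of the SPOT definition (the actual pipeline computes this with the bedops tools on full bed files), the fraction of tags falling in hotspots can be sketched as follows; all coordinates are invented:

```shell
# Toy SPOT calculation: fraction of tag midpoints falling inside
# hotspot intervals.  Real runs use bedmap on real data; the
# positions and intervals here are invented for illustration.
tags="100 250 900 1500 2100"       # tag midpoint positions
spots="200-400 2000-2500"          # hotspot intervals, start-end
awk -v tags="$tags" -v spots="$spots" 'BEGIN {
  nt = split(tags, t, " ")
  ns = split(spots, s, " ")
  hit = 0
  for (i = 1; i <= nt; i++)
    for (j = 1; j <= ns; j++) {
      split(s[j], r, "-")
      # +0 forces numeric (not string) comparison in awk
      if (t[i]+0 >= r[1]+0 && t[i]+0 <= r[2]+0) { hit++; break }
    }
  printf "SPOT = %.2f\n", hit / nt
}'
```

Here 2 of the 5 tags land in hotspots, for a SPOT score of 0.40.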

IMPORTANT: See the Dependencies section, below, for program
requirements that fall outside this distribution.

Hotspot was originally conceived and written by Mike Hawrylycz, and is
now maintained by Bob Thurman, University of Washington, with
contributions by Eric Haugen, University of Washington, and especially
Scott Kuehn.  Although there is no stand-alone publication for hotspot,
the algorithm is described in detail in 

Sam John et al., Chromatin accessibility pre-determines glucocorticoid
receptor binding patterns, Nature Genetics 43, 264-268 (2011).

The above should therefore serve as the primary citation for hotspot.

This distribution is available via the uwencode website, at

    http://uwencode.org/software/hotspot

Bob Thurman
rthurman@uw.edu
22 Jan 2013


Making the hotspot program
--------------------------

A compiled (Red Hat Linux, 64-bit) version of the hotspot binary is
included in this distribution, in 

    hotspot-deploy/bin/hotspot5

Type the program name without arguments for a usage statement.

To compile your own version, cd to hotspot-deploy and type "make".



Running hotspot
---------------

Hotspot calling, FDR thresholding, and peak-calling require two
passes of the hotspot program (see the documentation in the doc
directory), as well as generation of random data, hotspot calling on
that random data, and peak-calling on tag densities.  A number of
scripts are provided to do all of these steps for you properly.  A
test example is provided, with test data (DNase-seq) in the data
directory.

To use the scripts provided to compute hotspots and peaks yourself,
see the directory

    pipeline-scripts/test

and files runhotspot and runall.tokens.txt therein.  The file
runall.tokens.txt contains parameters (tokens) for calling hotspot.
To run the example you will need to edit the file paths defined in
runall.tokens.txt to match the locations on your own file system.  The
runhotspot file generates child scripts populated with the parameters
defined in runall.tokens.txt, and runs the scripts.  You will need to
change paths in runhotspot as well.  After you change these files,
just type

    ./runhotspot 

on the command line in the directory containing that script.  This
particular example is configured to run all of the steps of hotspot
(hotspots, FDR thresholding, peak-finding, and SPOT score
calculation), and should take about an hour total to complete all
steps. See below for locations and descriptions of the final output.

For running hotspot on your own data, please see runall.tokens.txt for
descriptions of the adjustable parameters, the most important of which
are described below.  You will need to change all of the paths to
reflect your environment.  As mentioned above, the runhotspot script
is configured to run all of the steps of hotspot.  There are
instructions in the runhotspot script for changing the script to just
compute SPOT scores.  The latter does not require peak-finding or FDR
thresholding, and therefore can be done more quickly and efficiently
than doing everything.

The primary input to the hotspot program is a tags file in bam format
(variable _TAGS_ in runall.tokens.txt).  In addition, a tag density
file (150 bp count of tags, sliding every 20 bp; variable _DENS_) is
required for peak-calling, although that file will be generated
automatically if needed.
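The density file will be regenerated by the scripts, but its definition is simple; a self-contained sketch of counting tags in 150 bp windows stepped every 20 bp (with invented tag positions) looks like:

```shell
# Count tag midpoints in 150 bp windows sliding every 20 bp.
# Tag positions below are invented for illustration; the pipeline
# derives the real density from the bam file automatically.
awk 'BEGIN {
  split("10 60 75 140 300", tags, " ")
  for (start = 0; start <= 200; start += 20) {
    end = start + 150
    n = 0
    for (i in tags)
      if (tags[i]+0 >= start && tags[i]+0 < end) n++
    printf "%d\t%d\t%d\n", start, end, n
  }
}'
```

Each output line gives a window start, window end, and the tag count for that window, which is the form of signal the peak-finder operates on.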

In addition to the input tag file, some other key auxiliary files are
required, and worth mentioning.  The hotspot program makes use of
mappability information.  For a given tag length (k-mer size, variable
_K_), only a subset of the genome is uniquely mappable.  Hotspot uses
this information to help compute the background expectation for
gauging enrichment.  The file defined by variable _MAPPABLE_FILE_ in
runall.tokens.txt contains the mappable positions in the genome for a
given k-mer size.  We have included these files for various values of
k for the human genome hg19 in the data directory.

If your k-mer size is close to, but not the same as, one of those
provided, you can likely get by with the nearest available set.  For
substantially different k-mer sizes, contact us and we can easily
provide the necessary auxiliary files.

If you include tags mapped to locations that are not uniquely
mappable, you can either provide your own mappability files, use one
of our precomputed files as an approximation, or assume the entire
genome is mappable.  In the latter case, you would set _MAPPABLE_FILE_
to be the chromosome coordinates themselves (file chromInfo.hg19.bed
in the data directory, for example).
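For instance, treating the whole genome as mappable would mean pointing the token at the chromosome coordinates file.  The path below is hypothetical (substitute your own checkout location), and the assignment should follow the style of the existing entries in runall.tokens.txt:

```
_MAPPABLE_FILE_ = /your/path/to/hotspot/data/chromInfo.hg19.bed
```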

For peak-finding, the routine run_wavelet_peak_finding performs a
"level 3" wavelet smooth of the density, and finds local maxima of the
result.  This means it smooths the density to a scale of 2^3=8 times
the resolution of the density file.  The level of smoothing can be
controlled by editing the _PKFIND_SMTH_LVL_ variable in
runall.tokens.txt.
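Since the density is sampled every 20 bp, a level-L smooth corresponds to a scale of 2^L x 20 bp; for the default level 3 that works out to:

```shell
# Effective smoothing scale in bp for a given wavelet level,
# given the 20 bp step of the density file.
LEVEL=3
STEP=20
echo $(( (1 << LEVEL) * STEP ))   # 2^3 * 20 = 160 bp
```

Raising _PKFIND_SMTH_LVL_ by one doubles the smoothing scale, producing broader, fewer peaks.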

Understanding Output
--------------------

If all of the steps of hotspot, peak-finding, and FDR thresholding
are followed, a (potentially confusing) number of output files and
directories are generated.  For practical purposes, however, you will
probably only need to look in the directory containing the final set
of hotspots and peaks (FDR thresholded or not).  This is the
sub-directory of the output directory (token _OUTDIR_ in
runall.tokens.txt) whose name matches the tags file name (minus the
.bam extension), with the suffix "-both-passes" appended.  Within
this directory you will find files with some or all of the following
names.

*.twopass.merge150.wgt10.zgt2.wig	minimally thresholded hotspots
*.hotspot.twopass.fdr0.05.merge.wig	FDR thresholded hotspots
*.hotspot.twopass.fdr0.05.merge.pks.bed	FDR thresholded peaks
*.hotspot.twopass.fdr0.05.merge.pks.wig	FDR thresholded peaks (wig)

Some files are in bed format and some are in "wig" format, which is
roughly bed format with a header line making the file viewable in the
UCSC genome browser.  The scores in the 4th or 5th column of the hotspot
files are the z-scores that come from the binomial model for scoring
hotspots.  The scores in the peaks files represent the maximum tag
density value for each peak.
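A hypothetical example of such a "wig" file (coordinates and scores invented) is shown below; the track line is what makes the file loadable as a UCSC custom track:

```
track name="hotspots" description="FDR 0.05 hotspots"
chr1    1000    1500    12.3
chr1    4200    4800    8.7
```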

The SPOT score will be in a file with extension spot.out in the output
directory.


Dependencies
------------

The scripts in the pipeline-scripts directory depend on three
non-standard, outside programs or suites, described below, which you
will need to have in your path.  In addition, the script tokenizer
(see the ScriptTokenizer directory) requires Python 2.6+, and the FDR
thresholding routines require R (I am not sure what the minimal
version requirement is).

1) The scripts make liberal use of the bedops suite of utility
programs (created by UW informatics staff) for computing standard set
operations, etc. on bed files, in addition to routines for efficiently
compressing bed files.  The bedops package is available for download
at

    http://code.google.com/p/bedops/

which contains precompiled binaries as well as source code and
makefiles for building them yourself.  Individual programs used by the
scripts include bedops, bedmap, and sort-bed, for sorting and
operating on bed files, and starch and unstarch for bed file
compression.

2) The script that generates random tags, run_generate_random_lib,
uses a single routine, shuffleBed, from the BEDTools package,
available for download at 

    http://code.google.com/p/bedtools/

3) The peak-finding script run_wavelet_peak_finding uses wavelets to
smooth tag densities before performing peak-finding.  It calls a more
general bash script, wavePeaks (found in the src directory), which in
turn calls the wavelets program.  A precompiled Linux binary of the
latter is provided in the bin directory (you will still need to add
this to your path).  If this binary does not work for you, you can
download and build the wavelets program yourself, from

    http://staff.washington.edu/dbp/WMTSA/NEPH/wavelets.html
