Specification for:
Feature Distance Calculation Software

1. PURPOSE

   Comparing the relative locations of two genomic features is an
   essential task in a computational genomics toolkit.  The feature
   distance calculation program is an optimized utility that calculates
   the distance, in base pairs (bp), between two segments.


2. SYSTEM STRATEGIES

   For each DNA segment listed in the master BED file, the software
   outputs the signed distance to the nearest entry in the comparison
   BED file.  The comparison BED file is assumed to be sorted.
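
   A minimal sketch of how a sorted comparison file could be searched for
   the nearest entry.  It is an illustration, not the reference
   implementation.  The names build_index and nearest_gap are hypothetical,
   and it assumes BED-style half-open coordinates, a single chromosome,
   edge-to-edge distance, and a distance of 0 for overlapping segments.
   None of these details is fixed by this specification.  The sketch
   returns the unsigned gap only.

```python
from bisect import bisect_left
from itertools import accumulate


def build_index(segments):
    """Index one chromosome's comparison segments (hypothetical helper).

    `segments` is a list of (start, end) pairs sorted by start, using
    BED-style half-open coordinates (an assumption of this sketch).
    """
    starts = [start for start, _ in segments]
    # running_max_end[i] is the largest end among segments[0..i].
    running_max_end = list(accumulate((end for _, end in segments), max))
    return starts, running_max_end


def nearest_gap(query, starts, running_max_end):
    """Return the unsigned gap in bp to the nearest segment, or None."""
    q_start, q_end = query
    i = bisect_left(starts, q_start)
    candidates = []
    if i < len(starts):
        # Among segments starting at or after the query start,
        # the first one has the smallest gap.
        candidates.append(max(0, starts[i] - q_end))
    if i > 0:
        # Among segments starting before the query start, the one
        # reaching furthest to the right has the smallest gap.
        candidates.append(max(0, q_start - running_max_end[i - 1]))
    return min(candidates) if candidates else None


if __name__ == "__main__":
    starts, ends = build_index([(100, 200), (300, 400), (1000, 1100)])
    print(nearest_gap((450, 500), starts, ends))
```

   The binary search relies on the sorted-input assumption; the running
   maximum of end coordinates keeps the lookup correct when an earlier
   segment extends past later ones.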

   Various options modify the behavior of the program.  It should be
   possible to eliminate comparisons between identical segments.  For
   example, if the line "chr1   100 200 id-1" appears in both the master
   file and the comparison file, that line should be ignored in the
   comparison file, and the distance to the next nearest segment
   should be reported.  This option lets a user find each segment's
   nearest neighbor within a single file by supplying that file as both
   the master and the comparison input.
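
   The identical-segment rule can be sketched as follows.  This is an
   illustration under stated assumptions, not the program's actual logic:
   the function names are hypothetical, "identical" is taken to mean that
   every field of the line matches, distance is measured edge to edge
   with half-open coordinates, and a plain linear scan is used for
   clarity rather than speed.

```python
def gap(a_start, a_end, b_start, b_end):
    """Unsigned edge-to-edge gap in bp; 0 when the segments overlap."""
    return max(0, b_start - a_end, a_start - b_end)


def nearest(master, comparison, skip_identical=False):
    """Return (entry, gap) for the nearest comparison entry, or None.

    Entries are (chrom, start, end, id) tuples.  With skip_identical
    set, an entry equal to the master entry in every field is ignored.
    """
    best = None
    for entry in comparison:
        if entry[0] != master[0]:
            continue  # different chromosome
        if skip_identical and entry == master:
            continue  # never compare a segment to itself
        d = gap(master[1], master[2], entry[1], entry[2])
        if best is None or d < best[1]:
            best = (entry, d)
    return best


if __name__ == "__main__":
    master = ("chr1", 100, 200, "id-1")
    comparison = [
        ("chr1", 100, 200, "id-1"),
        ("chr1", 250, 300, "id-2"),
        ("chr1", 900, 950, "id-3"),
    ]
    print(nearest(master, comparison))
    print(nearest(master, comparison, skip_identical=True))
```

   Without the skip, the segment matches itself at distance 0; with it,
   the next nearest entry (id-2 in this made-up data) is selected.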

   This program should be highly optimized and should therefore be
   written in C or C++.  Alternatively, it can be written in Python and
   link to routines written in native code for resource-intensive
   calculations.

   Several behavior-specific notes:
       o If the comparison file does not contain any values for the
         chromosome listed in the master file, output 'no-data-chrN',
         where 'N' is replaced with the chrom from the master file.
       o If the nearest segment is in the 3' direction of the target
         segment, the resulting value is negative.
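
   The two notes above can be sketched as follows.  The function names
   are hypothetical.  The sketch assumes that strand is ignored, so the
   3' direction is treated as the direction of increasing coordinates,
   and that the chromosome name read from the master file already
   carries its "chr" prefix (for example "chr2").  Both are assumptions
   of this illustration, not statements of the specification.

```python
def signed_distance(master, other):
    """Signed gap in bp between two (start, end) half-open segments.

    Negative when `other` lies in the 3' direction of `master`
    (assumed here to mean higher coordinates), positive when it lies
    in the 5' direction, and 0 when the segments overlap.
    """
    m_start, m_end = master
    o_start, o_end = other
    if o_start >= m_end:
        return -(o_start - m_end)
    if o_end <= m_start:
        return m_start - o_end
    return 0


def report(chrom, master, comparison_by_chrom):
    """Return the output field for one master segment as a string."""
    others = comparison_by_chrom.get(chrom)
    if not others:
        # No comparison entries on this chromosome.
        return f"no-data-{chrom}"
    # Pick the candidate with the smallest absolute distance.
    return str(min((signed_distance(master, o) for o in others), key=abs))


if __name__ == "__main__":
    comparison = {"chr1": [(10, 40), (300, 400)]}
    print(report("chr1", (100, 200), comparison))
    print(report("chr2", (100, 200), comparison))
```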

   This program is designed as an improvement on the featDistance
   utility, written by Bill Noble and Scott Kuehn.  A reference copy
   of that program is in the doc directory.


3. DATA AND PROGRAM SPECIFICATIONS   

   The program requires the following as input:

       o A master BED file, sorted (optionally read from stdin).
       o A comparison BED file: genomic DNA segments, one per line, sorted
         and formatted as UCSC BED (optionally read from stdin).
       o A flag to indicate that only the same strand should be evaluated
         (-p).
       o A verbosity parameter, defaulting to a quiet setting.
         (Implementation note: this parameter was not included.)
       o A flag to indicate that a segment should never be compared to
         itself (-n).
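
   A minimal sketch of a command line matching the inputs listed above.
   Only the -p and -n flags come from this specification.  The program
   name, the positional argument names, and the use of "-" to mean stdin
   are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser for the inputs listed in the specification."""
    parser = argparse.ArgumentParser(
        prog="feature-distance",  # hypothetical program name
        description="Report the signed distance from each master segment "
                    "to the nearest comparison segment.",
    )
    parser.add_argument("-p", dest="same_strand", action="store_true",
                        help="evaluate only segments on the same strand")
    parser.add_argument("-n", dest="no_self", action="store_true",
                        help="never compare a segment to itself")
    # Positional names and the '-' convention are assumptions.
    parser.add_argument("master",
                        help="sorted master BED file, or '-' for stdin")
    parser.add_argument("comparison",
                        help="sorted comparison BED file, or '-' for stdin")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

   On invalid arguments, argparse prints a usage statement and the error
   to stderr and exits with a non-zero status.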

   Calculation results are written to stdout.  Verbose information
   is purely diagnostic and should be written to stderr.  At the most
   verbose level, it consists of results from any underlying programs,
   periodic estimates of progress, warnings, and errors.

   Implementation note: no verbosity flag was included.  Errors are
   written to stderr along with a usage statement.  The program is fast
   enough that progress estimation is not useful.

   
Shane Neph, Scott Kuehn
Thu Mar 23 11:33:21 EST 2006



