//=============================================================================
//
// ============
// make notes:
// ============
// To create the setops binary, just type 'make'.
// To create both the setops binary and the supporting java regression test
//  classes, type 'make test'.
// To clean the setops binary, use 'make clean'.
// To clean both the setops binary and supporting java regression test classes,
//  type 'make clean_test'.
//
// Look at the make file: The g++ call uses -I to refer to a directory where
//  general headers are placed.  If that directory is moved, then update the
//  -I path to point to the new location (these resources are not called at
//  runtime - only used to make the setops binary).
//
// The code for setops is ANSI/ISO-compliant and should compile without issue
//  on any OS with a compliant compiler.  The 'Makefile' is very simple, but
//  is targeted toward unix-variants.  On Windows, use a suitable alternative.
//  On Mac OS X, remove the -static flag in the make file only.
//  In short, the code is completely portable, but the make system is not.
//
// ===================
// Regression Testing
// ===================
// TestPlan.xml contains the historical regression tests, but one could make
//  another .xml file and do things similarly.  To begin regression testing,
//  one must 'make test'.  This creates the Regression.class file.  Assuming
//  your CLASSPATH and PATH variables point to the directory containing setops
//  and Regression.class, type the following:
//  'java Regression TestPlan.xml'
//
// Files created during testing are not deleted so you have the opportunity
//  to look them over.
//
// TestPlan.xml is fairly easy to use.  Look over the existing examples when
//  creating new tests.  You can add to the current TestPlan.xml or make your
//  own .xml test file.
//
// ===============
// Other programs
// ===============
// LexiBedSort.pl is a script originally written by Bill Noble.  The file
//  was modified to perform (case sensitive) lexicographical chromosome
//  comparisons prior to comparing coordinate positions.
// sort-bed is a C program written by Scott Kuehn that is designed for speed.
//  It also sorts files consistent with the requirements of setops.
// Sorting should be case-sensitive for chromosome names.
//
// ============
// User notes:
// ============
// Regarding -c|d|i|m|s:
// All overlapping regions within a file or between files are handled properly
//  by this program.  All adjacent/overlapping regions are merged during
//  analysis on a per-chromosome basis, so all output consists of disjoint
//  regions.  To support this merging, every output row has only 3 columns:
//  chrom start end
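//
// As an illustrative sketch (hypothetical rows, not taken from TestPlan.xml),
//  consider the sorted input
//    chr1  10  20
//    chr1  20  30
//    chr1  25  40
//    chr2   5  15
//  With -m, the description above implies the output
//    chr1  10  40
//    chr2   5  15
//  since [10,20) and [20,30) are adjacent, [20,30) and [25,40) overlap, and
//  merging never crosses a chromosome boundary.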
//
// Regarding -e|n|u:
// Regions are not merged.  Output includes all corresponding input column
//  information.  In the case of -e|n, the corresponding output information
//  comes only from the first specified file (the reference file).
//
// You can use '-' as a placeholder for any one input file name in order to
//  read from std::cin, allowing you to pipe input into setops.
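//
// For example, assuming sort-bed accepts a file name and writes its sorted
//  result to standard output (an assumption; check the sort-bed usage notes),
//  one might run:
//    sort-bed unsorted.bed | setops -m - other.bed
//  where unsorted.bed and other.bed are hypothetical file names.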
//
// Program sweeps through all input files only once and requires minimum
//  memory, except for -e|n.  To achieve this, all input files must be ordered
//  by:
//  1) Chromosome (case sensitive, lexicographical), then
//  2) first coordinate, then
//  3) second coordinate,
//  where first coordinate < second coordinate always.
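//
// A small illustration of this ordering (hypothetical rows).  The chromosome
//  comparison is lexicographical, so chr10 sorts before chr2:
//    chr1   100  200
//    chr1   100  250
//    chr1   150  175
//    chr10    5   10
//    chr2     1   50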
//
// For the -e|n (element (not) of) options, some file input is cached, which
//  consumes more memory and takes more time than all other options.  The
//  extra memory requirement is likely to remain small, and the running time
//  is likely to be approximately double that of the other set operations.
//  However, in the worst case, time would be O(n^2) and all input would be
//  read into memory.
//
// Duplicate entries (ties) are allowed.  "Completely covered" rows are allowed
//  (i.e., Row1=chr7 100 200 and Row2=chr7 150 175).
// Proper ordering is assumed and is not checked by the program itself.  Use
//  sort-bed to ensure that your input files are sorted correctly.
//
// Proper column formatting is assumed by the program, and minimal error
//  checking is performed on the inputs themselves.  Internally, the program
//  uses unsigned types for coordinates, so no value < 0 should ever be used.
//
// ===============
// Preconditions:
// ===============
// All input files must be in (possibly relaxed) BED-like format, with at least
//  3 columns: chromosome, first coordinate, second coordinate.  setops does
//  not use any information after these first 3 columns, so you can choose to
//  have any other information you like, in any format, in the 4th column and
//  beyond in any input row.  This same information can be retained in the
//  output by running setops in any of the -e|n|u modes.
// All input coordinate ranges are assumed to be of the form: [start, stop).
// All input files must be sorted.  Use sort-bed (or equivalent program).
// The program must be invoked with exactly one of -c, -d, -e, -i, -m, -n, -s
//  or -u, followed by one or more input files.
// All options require a minimum number of files (1 or 2: see invocation
//  details below), but there is no maximum number of input files, apart from
//  limitations imposed by the OS (number of open files).  You may pipe
//  (sorted) input into setops using '-' as a file placeholder.
//
// ========
// Output:
// ========
// Ordered regions found.  All results are sent to std::cout.
// For -c|d|i|m|s
//  All output ranges are unique and disjoint.  All adjacent regions are
//   merged.  The output consists of the chromosome and coordinate information
//   (only), and is sorted per the sorting requirements specified above.
// For -e|n|u
//  Output is a subset of the input (not necessarily a proper subset) with all
//   column information intact.  Row order is preserved.
//
// ===========
// Invocation
// ===========
// NOTE: Can use - (dash) as a placeholder filename --> pipe in to std::cin.
//       The dash can be substituted for any (one) input file.
// NOTE: -h|H will bring up the same screen as just typing 'setops'.
//
// Invoke program with -c|C|d|D|e|E|i|I|m|M|n|N|s|S|u|U file1 [file]*
//  -c|C = Set Complement: { ~A }
//         Complement of all coordinate regions found within the input file(s).
//         Report regions between those found in the input file(s), since
//         the universe of discourse (uod) is unknown to the program itself.
//         To receive the complete complement of A with respect to the uod,
//         the user must place boundary coordinates in the input file(s).  For
//         example, if you want coordinates 1 to 99 included, but your
//         (ordered) input file starts at 100, then insert a row with
//         coordinates 0 1 at the top of the list for that chromosome.
//         Similarly, you can add end boundary coordinates.  Never use
//         negative numbers (the program uses unsigned values internally).
//         Minimum of 1 input file required.
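//         Illustrative sketch (hypothetical rows; ranges are [start, stop)):
//         with the input rows
//           chr1    0    1
//           chr1  100  200
//           chr1  300  400
//         the description above implies that -c reports
//           chr1    1  100
//           chr1  200  300
//         The 0 1 boundary row is what brings coordinates 1 to 99 into the
//         output.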
//
//  -d|D = Set Difference: { A - B }
//         Coordinate regions found within the first (master) file, but not
//         found within any other input file(s).
//         Minimum of 2 input files required.  First file is the (sole) master.
//
//  -e|E = Set Element Of: { element of A is also an element of B }
//         Rows from the first (reference) file whose chrom start stop are
//         also "found" in any other input file.  To be included in the output,
//         an element (row) of the reference file must qualify by overlapping a
//         specified percentage, or number of base pairs, of the merged
//         coordinate regions of the other input files.  By default, 100% is
//         used.  The user can choose a different percentage by:
//           setops -e -X% refFile File2 [File]*
//         where 0 <= X <= 100.  If X = 0 then any overlap (>= 1 nucleotide)
//         qualifies and will be included in the output.  The percentage is
//         relative to the size of the region from the reference file only.
//         Or, the user can explicitly ask for >= some number of bp overlap:
//           setops -e -X refFile File2 [File]*
//         where 1 <= X
//         All applicable merged regions from File2 to FileN will be used
//         in the calculated overlap, including multiple, disjoint regions.
//         So, while all nucleotides that overlap are distinct, they may not
//         be contiguous.
//         Minimum of 2 input files required.
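//         Illustrative arithmetic (hypothetical rows, assuming a row
//         qualifies when its overlap is >= the requested threshold): a
//         reference row chr1 100 200 spans 100 bp.  If the merged regions of
//         the other files are chr1 90 130 and chr1 180 250, the overlap is
//         [100,130) + [180,200) = 30 + 20 = 50 bp, or 50%.  The row would then
//         qualify under -50% or -50, but not under the default 100% or -51.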
//
//  -h|H = Help:
//         Send program usage to std::cerr.
//
//  -i|I = Set Intersection: { A ^ B }
//         Coordinate regions found within every input file.
//         Minimum of 2 input files required.
//
//  -m|M = Merge: { A v B }
//         All merged coordinate regions found within any input file(s).
//         Minimum of 1 input file required.
//         With only 1 input file, the output will simply be the input after
//         dealing with overlapping/adjacent coordinates in that file.
//
//  -n|N = Set Not Element Of: { element of A is not an element of B }
//         The output from this operation will be all rows of A that would not
//         be included when -e is used.  Refer to -e (and invert overlap logic)
//         for details.  Also includes options -X% or -X (in bps) just as
//         -e does.  If -100% is used, then the output would be all lines of
//         the reference file that are overlapped < 100% in the merged regions
//         of all other input files.
//         Minimum of 2 input files required.
//
//  -s|S = Set Symmetric Difference: { { A - B } v { B - A } }
//         All unique coordinate regions found in any one file, but not in any
//         other input file(s).  For any N inputs, computes:
//         { {A - {B v C v ... v N}} v ... v {N - {A v B v C v ... v N-1}} }
//         Minimum of 2 input files required.
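//         Illustrative sketch (hypothetical rows): with A = chr1 10 30 and
//         B = chr1 20 40, { A - B } is chr1 10 20 and { B - A } is chr1 30 40,
//         so the description above implies the output
//           chr1  10  20
//           chr1  30  40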
//
//  -u|U = Set Union All: { A v B }
//         This is quite different than -m|M.  The output is functionally
//         equivalent to 'cat (all files) + sort', but runs much faster.
//         All output columns are kept intact and no merging of coordinates is
//         done.  This option could be useful when performing a merge sort on
//         very large inputs.
//         Note: The output will have multiple, identical entries if file
//         inputs also do.
//         Minimum of 2 input files required (with 1 input, the output would
//         just echo the input).
//=============================================================================
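
// -----------------------------------------------------------------------------
// Illustrative sketch only: this is NOT the setops implementation.  It shows
//  one way the merge described above for -m could be written for rows that are
//  already sorted as required (chromosome, then start, then end).  The names
//  sketch_merge, Region and mergeSorted are hypothetical and exist only for
//  this example.
// -----------------------------------------------------------------------------
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace sketch_merge {

// One BED-like row reduced to its first 3 columns; the range is [start, end).
struct Region {
  std::string chrom;
  unsigned long start;
  unsigned long end;
};

// Merge overlapping or adjacent regions of sorted input into disjoint regions.
inline std::vector<Region> mergeSorted(const std::vector<Region>& in) {
  std::vector<Region> out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    const Region& r = in[i];
    assert(r.start < r.end); // first coordinate < second coordinate always
    if (!out.empty() && out.back().chrom == r.chrom &&
        r.start <= out.back().end) {
      // Overlapping or adjacent on the same chromosome: extend the last region.
      out.back().end = std::max(out.back().end, r.end);
    } else {
      // New chromosome, or a gap before this region: start a new output region.
      out.push_back(r);
    }
  }
  return out;
}

} // namespace sketch_merge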

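// -----------------------------------------------------------------------------
// Illustrative sketch only: this is NOT the setops implementation.  It shows
//  how the -e qualification described above (overlap as a percentage of the
//  reference row, or as a number of bp) could be computed once the other
//  files have been merged into disjoint regions on the same chromosome.  It
//  assumes a row qualifies when its overlap is >= the threshold.  The names
//  sketch_element_of, Interval, overlapBases, qualifiesByPercent and
//  qualifiesByBases are hypothetical.
// -----------------------------------------------------------------------------
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch_element_of {

// A [start, end) range on a single chromosome.
struct Interval {
  unsigned long start;
  unsigned long end;
};

// Count the bases of ref covered by the disjoint intervals in merged.
inline unsigned long overlapBases(const Interval& ref,
                                  const std::vector<Interval>& merged) {
  unsigned long total = 0;
  for (std::size_t i = 0; i < merged.size(); ++i) {
    const unsigned long lo = std::max(ref.start, merged[i].start);
    const unsigned long hi = std::min(ref.end, merged[i].end);
    if (lo < hi)
      total += hi - lo; // covered bases need not be contiguous
  }
  return total;
}

// -X% form: the percentage is relative to the size of the reference row only.
inline bool qualifiesByPercent(const Interval& ref,
                               const std::vector<Interval>& merged,
                               unsigned pct) {
  assert(ref.start < ref.end && pct <= 100);
  const unsigned long ov = overlapBases(ref, merged);
  if (pct == 0)
    return ov >= 1; // X = 0: any overlap of at least 1 nucleotide
  return ov * 100 >= (ref.end - ref.start) * pct; // no overflow guard (sketch)
}

// -X form: at least minBases bp of overlap.
inline bool qualifiesByBases(const Interval& ref,
                             const std::vector<Interval>& merged,
                             unsigned long minBases) {
  return overlapBases(ref, merged) >= minBases;
}

} // namespace sketch_element_of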