ENCODE DNase I and ChIP-Seq allelic imbalance data
---------------------------
If you have any questions, please reach us at:
sabramov@altius.org Sergey Abramov
sboytsov@altius.org Alexandr Boytsov
---------------------------
Accessing the data
The data is stored at https://resources.altius.org/~jvierstra/projects/encode4-allelic-imbalance-v2
The directory contains:

DNase data (/dnase folder):
	Metadata file: metadata+encode_id.tsv
	Tested variants aggregated across all the samples: all.aggregation.dnase.bed.gz
	Tested variants aggregated by cell type: cell_type.aggregation.bed.gz;
		the last column corresponds to taxonomy_name_id in the metadata
	First round genotypes: genotypes_dnase.vcf.gz
	Files for custom aggregation (see the Jupyter notebook for details): sample-split-variants.zip

ChIP-Seq data (/chip-seq folder):
	Metadata file: metadata.tsv
	Metadata provided by Anshul, Vivek and Idan: encode_meta.tsv
	Tested variants aggregated across all the samples: all.aggregation.chipseq.bed.gz
	First round genotypes: genotypes_chipseq.vcf.gz
	Files for custom aggregation (see the Jupyter notebook for details): sample-split-variants.zip

This readme file: readme.txt
Jupyter notebook: ENCODE DNASE I Allelic Imbalance - 2023-01-24.ipynb
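
Reading an aggregation file (sketch):
The snippet below is a minimal sketch of how a downloaded *.bed.gz file could be read with the Python standard library. It assumes the files are tab-separated and start with a single header line beginning with "#chr"; this has not been checked against every file. The function names, the local path in the usage note and the in-memory example data are hypothetical.

```python
import csv
import gzip
import io


def read_bed_like(handle):
    """Yield one dict per variant from a tab-separated text handle.

    Assumes the first line is a header that may start with '#'.
    """
    header = handle.readline().rstrip("\n").lstrip("#").split("\t")
    reader = csv.DictReader(handle, fieldnames=header, delimiter="\t")
    for row in reader:
        yield row


def open_bed_gz(path):
    """Open a gzip-compressed file in text mode (path is a local file)."""
    return gzip.open(path, "rt", newline="")


# Tiny made-up table standing in for a downloaded file.
sample = (
    "#chr\tstart\tend\tID\tref\talt\n"
    "chr1\t999\t1000\trs0000001\tA\tG\n"
)
rows = list(read_bed_like(io.StringIO(sample)))
print(rows[0]["chr"], rows[0]["ID"])
```

For a real file, something like `with open_bed_gz("all.aggregation.dnase.bed.gz") as fh: rows = list(read_bed_like(fh))` should work once the file has been downloaded. All values are returned as strings, so numeric columns need an explicit float() or int() conversion.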
---------------------------
Metadata file columns:
ag_id - a unique identifier of a sample
indiv_id - individual genotype ID. Samples with the same indiv_id have similar genotypes
 (they are not necessarily of the same cell type)

---------------------------
Genotypes:
Genotypes are provided in VCF format; each sample ID corresponds to an ag_id in the corresponding metadata file.
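
Matching VCF samples to metadata (sketch):
The snippet below is a minimal sketch of how the sample IDs in a VCF header could be checked against the ag_id column of the metadata. It relies only on the standard VCF layout, where sample names follow the nine fixed columns of the "#CHROM" line. The function name, the example VCF text and the ag_id values are made up for illustration.

```python
import io


def vcf_sample_ids(handle):
    """Return the sample IDs from the '#CHROM' header line of a VCF text handle."""
    for line in handle:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\n").split("\t")
            return fields[9:]  # the first 9 columns are fixed by the VCF format
        if not line.startswith("#"):
            break  # reached data lines without seeing a header
    raise ValueError("no #CHROM header line found")


# Made-up VCF content standing in for a genotype file.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tAG0001\tAG0002\n"
    "chr1\t1000\trs0000001\tA\tG\t.\t.\t.\tGT\t0/1\t0/0\n"
)
# Hypothetical ag_id values, as they might be read from the metadata file.
metadata_ag_ids = {"AG0001", "AG0002", "AG0003"}

samples = vcf_sample_ids(io.StringIO(vcf_text))
missing = [s for s in samples if s not in metadata_ag_ids]
print(samples, missing)
```

For the real genotype files, the handle can be opened with gzip.open(path, "rt") from the standard library.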
---------------------------
Variants format:
Variants are stored in a BED-like format:
#chr, start, end: genomic position of the SNV, hg38 genome assembly;

ID: dbSNP rs ID of the SNV (dbSNP build 151);

ref: reference allele (A,C,G, or T, according to hg38);

alt: alternative allele;

AAF, RAF: TOPMed allele frequencies of the alternative and reference alleles, respectively;

FMR: failed mapping rate, (# WASP-filtered reads) / (# total reads);

mean_BAD: mean background allelic dosage (BAD) estimate at the variant. Higher BAD values correspond to a higher contribution of aneuploidy and local copy-number variants. BAD scores serve as a baseline when estimating the statistical significance and the effect size of each tested variant. BAD=1 for diploid regions and BAD=2 for triploid regions;

footprints_n: number of individual variants in DNase footprints (called with https://www.vierstra.org/resources/dgf);

hotspots_n: number of individual variants in DNase peaks (called with hotspot2);

logit_pval_ref, logit_pval_alt: Logit aggregated p-values for reference and alternative alleles;

!! Effect sizes are on a log2 scale; positive values correspond to a preference for the reference allele.
es_mean: Aggregated effect size, the mean of individual effect sizes (less affected by outliers);
es_weighted_mean: Aggregated effect size, averaged with weights -log10[min(pval_ref, pval_alt)] (more sensitive estimate);

nSNPs: number of individual variants aggregated at this position;

max_cover: Maximum read coverage of aggregated variants;

fdrp_bh_ref, fdrp_bh_alt: FDR-corrected (Benjamini-Hochberg) p-values for the reference and alternative alleles;

min_fdr: minimum of fdrp_bh_ref and fdrp_bh_alt. Allele-specific (AS) events have min_fdr <= 0.05;

taxonomy_name_id (only in the cell-type-specific aggregation): corresponds to taxonomy_name_id in the metadata.
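
Effect size aggregation (sketch):
The snippet below illustrates the difference between es_mean and es_weighted_mean as described above. It is a minimal sketch, not the pipeline code. It assumes that the weights -log10[min(pval_ref, pval_alt)] are normalised by their sum, that all p-values are greater than 0, and that the plain mean is used when every weight is zero. The function names and the example numbers are made up.

```python
import math


def es_mean(effect_sizes):
    """Plain mean of the individual effect sizes."""
    return sum(effect_sizes) / len(effect_sizes)


def es_weighted_mean(effect_sizes, pvals_ref, pvals_alt):
    """Mean of effect sizes weighted by -log10(min(pval_ref, pval_alt)).

    Assumes p-values are in (0, 1]; falls back to the plain mean if all
    weights are zero (every p-value equal to 1).
    """
    weights = [-math.log10(min(pr, pa)) for pr, pa in zip(pvals_ref, pvals_alt)]
    total = sum(weights)
    if total == 0:
        return es_mean(effect_sizes)
    return sum(w * e for w, e in zip(weights, effect_sizes)) / total


# Two made-up individual variants at the same position.
effect_sizes = [1.0, 0.0]   # log2 scale, positive = reference allele preferred
pvals_ref = [0.001, 1.0]
pvals_alt = [1.0, 0.1]

plain = es_mean(effect_sizes)
weighted = es_weighted_mean(effect_sizes, pvals_ref, pvals_alt)
print(plain, weighted)
```

In this example the first variant has the smaller p-value and therefore the larger weight (3 versus 1), so the weighted estimate is pulled towards its effect size, while the plain mean treats both variants equally.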
----------------------------
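Selecting AS events (sketch):
The snippet below is a minimal sketch of how rows of an aggregation file could be filtered with the min_fdr <= 0.05 criterion described above. It assumes each row is a dict of strings keyed by column name, and that fdrp_bh_ref and fdrp_bh_alt test for a preference for the reference and alternative allele, respectively. The function name and the example rows are made up.

```python
def call_as_event(row, fdr_threshold=0.05):
    """Return 'ref', 'alt', or None for a single aggregated variant.

    The preferred allele is taken to be the one with the smaller
    FDR-corrected p-value (an assumption of this sketch).
    """
    fdr_ref = float(row["fdrp_bh_ref"])
    fdr_alt = float(row["fdrp_bh_alt"])
    if min(fdr_ref, fdr_alt) > fdr_threshold:
        return None  # not an allele-specific event at this threshold
    return "ref" if fdr_ref <= fdr_alt else "alt"


# Made-up rows standing in for lines of an aggregation file.
rows = [
    {"ID": "rs0000001", "fdrp_bh_ref": "0.001", "fdrp_bh_alt": "0.9"},
    {"ID": "rs0000002", "fdrp_bh_ref": "0.8", "fdrp_bh_alt": "0.03"},
    {"ID": "rs0000003", "fdrp_bh_ref": "0.2", "fdrp_bh_alt": "0.6"},
]

calls = {row["ID"]: call_as_event(row) for row in rows}
as_events = [rs_id for rs_id, call in calls.items() if call is not None]
print(calls, as_events)
```

The sign of es_mean or es_weighted_mean can be used as a cross-check: positive values are expected for variants called "ref" and negative values for variants called "alt".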