ENCODE DNase I and ChIP-Seq allelic imbalance data
---------------------------
If you have any questions, please reach us at:
sabramov@altius.org Sergey Abramov
sboytsov@altius.org Alexandr Boytsov
---------------------------
Accessing the data

The data is stored at https://resources.altius.org/~jvierstra/projects/encode4-allelic-imbalance-v2

The directory contains:

DNase data (/dnase folder):
    Metadata file: metadata+encode_id.tsv
    Tested variants aggregated across all samples: all.aggregation.dnase.bed.gz
    Tested variants aggregated by cell type: cell_type.aggregation.bed.gz (the last column corresponds to taxonomy_name_id in the metadata)
    First round genotypes: genotypes_dnase.vcf.gz
    Files for a custom aggregation (see the Jupyter notebook for more details): sample-split-variants.zip

ChIP-Seq data (/chip-seq folder):
    Metadata file: metadata.tsv
    Metadata provided by Anshul, Vivek and Idan: encode_meta.tsv
    Tested variants aggregated across all samples: all.aggregation.chipseq.bed.gz
    First round genotypes: genotypes_chipseq.vcf.gz
    Files for a custom aggregation (see the Jupyter notebook for more details): sample-split-variants.zip

This readme file: readme.txt
Jupyter notebook: ENCODE DNASE I Allelic Imbalance - 2023-01-24.ipynb
---------------------------
Metadata file columns:
    ag_id - a unique identifier of a sample
    indiv_id - individual genotype id. Samples with the same indiv_id have similar genotypes (not necessarily the same cell type)
---------------------------
Genotypes:
Genotypes are provided in VCF format; each sample id corresponds to an ag_id in the corresponding metadata file.
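The link between VCF sample ids and the metadata can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes the metadata file is tab-separated with a header row containing ag_id and indiv_id, the helper names are hypothetical, and the toy ids are invented.

```python
import csv
import io
from collections import defaultdict


def read_vcf_sample_ids(handle):
    """Return sample ids from the #CHROM header line of a VCF text stream."""
    for line in handle:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\n").split("\t")
            # The first 9 columns (CHROM ... FORMAT) are fixed by the VCF format.
            return fields[9:]
    raise ValueError("no #CHROM header line found")


def group_samples_by_individual(metadata_handle, sample_ids):
    """Map indiv_id -> list of ag_id values that are present in the VCF."""
    wanted = set(sample_ids)
    groups = defaultdict(list)
    for row in csv.DictReader(metadata_handle, delimiter="\t"):
        if row["ag_id"] in wanted:
            groups[row["indiv_id"]].append(row["ag_id"])
    return dict(groups)


# Toy in-memory data; these ids are invented for illustration only.
toy_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tAG1\tAG2\tAG3\n"
)
toy_meta = io.StringIO(
    "ag_id\tindiv_id\n"
    "AG1\tINDIV_A\n"
    "AG2\tINDIV_A\n"
    "AG3\tINDIV_B\n"
    "AG4\tINDIV_B\n"
)

sample_ids = read_vcf_sample_ids(toy_vcf)
groups = group_samples_by_individual(toy_meta, sample_ids)
print(groups)
```

For the real files, the same functions should work on gzip.open("genotypes_dnase.vcf.gz", "rt") and open("metadata+encode_id.tsv"), provided the header assumption above holds.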
---------------------------
Variants format:
Variants are stored in a BED-like format:
    #chr, start, end: genomic position of the SNV, hg38 genome assembly;
    ID: dbSNP rs ID of the SNV (dbSNP build 151);
    ref: reference allele (A, C, G, or T, according to hg38);
    alt: alternative allele;
    AAF, RAF: TOPMed allele frequencies of the alternative and reference alleles, respectively;
    FMR: failed mapping rate, # of WASP-filtered reads / # of total reads;
    mean_BAD: mean background allelic dosage (BAD) estimate at the variant. Higher BAD values correspond to a higher contribution of aneuploidy and local copy-number variants. BAD scores serve as a baseline when estimating the statistical significance and the effect size of each tested variant. BAD=1 for diploid regions and BAD=2 for triploid regions;
    footprints_n: number of individual variants in DNase footprints (called with https://www.vierstra.org/resources/dgf);
    hotspots_n: number of individual variants in DNase peaks (called with hotspot2);
    logit_pval_ref, logit_pval_alt: logit-aggregated p-values for the reference and alternative alleles;

    !! Effect sizes are in log2 scale; positive values correspond to a preference towards the reference allele
    es_mean: aggregated effect size, the mean of individual effect sizes (less affected by outliers);
    es_weighted_mean: aggregated effect size, averaged with weights -log10[min(pval_ref, pval_alt)] (more sensitive estimate);
    nSNPs: number of individual variants aggregated at this position;
    max_cover: maximum read coverage of aggregated variants;
    fdrp_bh_ref, fdrp_bh_alt: FDR-corrected p-values for the reference and alternative alleles;
    min_fdr: minimum of fdrp_bh_ref and fdrp_bh_alt. Allele-specific (AS) events have min_fdr <= 0.05;
    taxonomy_name_id: (only in the cell-type specific aggregation) corresponds to taxonomy_name_id in the metadata
----------------------------
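As an illustration of the columns described above, the sketch below recomputes a weighted mean effect size from per-variant values and selects AS events from an aggregated table. It is a hedged example, not the pipeline's implementation: it assumes the file is tab-separated with a header line naming the columns as listed here, the function names are hypothetical, the fallback for all-zero weights is our own choice, and the toy rows are invented.

```python
import csv
import io
import math


def weighted_mean_es(effect_sizes, pvals_ref, pvals_alt):
    """Average effect sizes with weights -log10(min(pval_ref, pval_alt)).

    P-values must be > 0, otherwise log10 is undefined.
    """
    weights = [-math.log10(min(pr, pa)) for pr, pa in zip(pvals_ref, pvals_alt)]
    total = sum(weights)
    if total == 0:
        # All p-values equal 1: fall back to the plain mean (an assumption).
        return sum(effect_sizes) / len(effect_sizes)
    return sum(w * es for w, es in zip(weights, effect_sizes)) / total


def significant_events(handle, fdr_threshold=0.05):
    """Yield (ID, preferred allele, es_mean) for rows with min_fdr <= threshold."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["min_fdr"]) <= fdr_threshold:
            es = float(row["es_mean"])
            # Positive log2 effect size: preference towards the reference allele.
            if es > 0:
                preferred = row["ref"]
            elif es < 0:
                preferred = row["alt"]
            else:
                preferred = "none"
            yield row["ID"], preferred, es


# Toy values, invented for illustration only.
wm = weighted_mean_es([1.0, -1.0], [0.01, 0.5], [0.5, 0.1])

toy_table = io.StringIO(
    "#chr\tstart\tend\tID\tref\talt\tes_mean\tmin_fdr\n"
    "chr1\t100\t101\trs1\tA\tG\t0.8\t0.01\n"
    "chr1\t200\t201\trs2\tC\tT\t-0.5\t0.04\n"
    "chr2\t300\t301\trs3\tG\tA\t0.2\t0.60\n"
)
events = list(significant_events(toy_table))
print(wm, events)
```

For the real data, the same reader could be pointed at gzip.open("all.aggregation.dnase.bed.gz", "rt"), assuming the header matches the column names above.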