{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "82629a77",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import os"
]
},
{
"cell_type": "markdown",
"id": "63a907ae",
"metadata": {},
"source": [
"# ENCODE DNase I and ChIP-Seq allelic imbalance data\n",
"\n",
"### If you have any questions, please reach us at:\n",
"+ sabramov@altius.org Sergey Abramov \n",
"+ sboytsov@altius.org Alexandr Boytsov\n"
]
},
{
"cell_type": "markdown",
"id": "9942ab9e",
"metadata": {},
"source": [
"# Accessing the data\n",
"\n",
"The data is stored at https://resources.altius.org/~jvierstra/projects/encode4-allelic-imbalance-v2
\n",
"The directory contains:\n",
"+ DNase I variants (`dnase-seq` folder): \n",
" - `metadata+encode_id.tsv` - metadata file\n",
" - `all.aggregation.dnase.bed.gz` - all tested variants (aggregated accross all the samples)\n",
" - `cell_type.aggregation.bed.gz` - tested variants (aggregated accross cell types). Last column corresponds to `taxonomy_name_id` in the metadata.\n",
" - `genotypes_dnase.vcf.gz` - genotypes file\n",
" - `sample-split-variants.zip` - non-aggregated variants, used for custom aggregations.\n",
"
\n",
"+ ChIP-seq variants (`chip-seq` folder):\n",
" - `metadata.tsv` - metadata file\n",
" - `encode_meta.tsv` - metadata provided by Anshul, Vivek and Idan\n",
" - `all.aggregation.chipseq.bed.gz` - all tested variants (aggregated accross all the samples)\n",
" - `genotypes_chipseq.vcf.gz` - genotypes file\n",
" - `sample-split-variants.zip` - non-aggregated variants, used for custom aggregations\n",
"
\n",
"+ Readme file: `readme.txt`\n",
"+ This notebook"
]
},
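The file list above can be turned into download URLs programmatically. The following is a minimal sketch, not part of the original notebook: `BASE_URL` is copied from the text, while the helper name `data_url` is hypothetical. The folder is left as an argument because the listing names the folder `dnase-seq`, whereas the `wget` log further down in this notebook fetches the metadata from `dnase/`.

```python
from urllib.parse import quote

# Base URL as quoted in the "Accessing the data" cell.
BASE_URL = "https://resources.altius.org/~jvierstra/projects/encode4-allelic-imbalance-v2"


def data_url(folder: str, filename: str, base_url: str = BASE_URL) -> str:
    """Join the base URL, an assay folder and a file name into one URL.

    `data_url` is a hypothetical helper for illustration only.
    """
    # Keep '+' literal so that `metadata+encode_id.tsv` stays unchanged.
    parts = [base_url.rstrip("/"), quote(folder.strip("/")), quote(filename, safe="+")]
    return "/".join(parts)


if __name__ == "__main__":
    # Folder name taken from the wget log; adjust if the server layout differs.
    print(data_url("dnase", "metadata+encode_id.tsv"))
```

The resulting URL can be passed to `pd.read_csv(url, sep="\t")` to load a TSV file directly.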
{
"cell_type": "markdown",
"id": "153fe2ec",
"metadata": {},
"source": [
"## Metadata file\n",
"+ `ag_id` - a unique identifier of a sample\n",
"+ `indiv_id` - individual genotype id. Samples with the same indiv_id have very similar genotype (not necessarily the same cell type)\n",
"+ `taxonomy_name` - cell type name\n",
"+ `taxonomy_name_id` - cell type ID."
]
},
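To illustrate how these columns relate, here is a small sketch that groups samples by `indiv_id` and builds a lookup from `taxonomy_name_id` to `taxonomy_name`. It uses only the standard library and a handful of rows copied from the metadata preview in this notebook. The function names are hypothetical, and treating empty strings or `NaN` as missing IDs is an assumption about how the TSV encodes missing values.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the metadata preview shown in this notebook
# (only the four columns described above are kept).
SAMPLE_TSV = (
    "ag_id\ttaxonomy_name\ttaxonomy_name_id\tindiv_id\n"
    "AG80615\tK562\t134.0\tINDIV_D0001\n"
    "AG80660\tK562\t134.0\tINDIV_D0001\n"
    "AG79936\th.CACO-2\t101.0\tINDIV_D0756\n"
    "AG80650\tfHeart\t33.0\tINDIV_D0760\n"
)


def samples_by_individual(rows):
    """Map each indiv_id to the list of ag_id values that share it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["indiv_id"]].append(row["ag_id"])
    return dict(groups)


def cell_type_names(rows):
    """Map each taxonomy_name_id (as int) to its taxonomy_name."""
    names = {}
    for row in rows:
        raw_id = row["taxonomy_name_id"]
        # Assumption: missing IDs appear as an empty string or "NaN".
        if raw_id in ("", "NaN"):
            continue
        names[int(float(raw_id))] = row["taxonomy_name"]
    return names


rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
print(samples_by_individual(rows))
print(cell_type_names(rows))
```

With pandas, the first grouping corresponds to `metadata.groupby("indiv_id")["ag_id"].apply(list)`, assuming `metadata` holds the loaded table.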
{
"cell_type": "code",
"execution_count": 2,
"id": "db332905",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2023-01-25 17:43:23-- https://resources.altius.org/~jvierstra/projects/encode4-allelic-imbalance-v2/dnase/metadata+encode_id.tsv\r\n",
"Resolving resources.altius.org (resources.altius.org)... 10.129.64.28\r\n",
"Connecting to resources.altius.org (resources.altius.org)|10.129.64.28|:443... connected.\r\n",
"HTTP request sent, awaiting response... 200 OK\r\n",
"Length: 388372 (379K) [text/tab-separated-values]\r\n",
"Saving to: ‘metadata+encode_id.tsv’\r\n",
"\r\n",
"\r",
" 0% [ ] 0 --.-K/s \r",
"100%[======================================>] 388,372 --.-K/s in 0.003s \r\n",
"\r\n",
"2023-01-25 17:43:23 (124 MB/s) - ‘metadata+encode_id.tsv’ saved [388372/388372]\r\n",
"\r\n"
]
},
{
"data": {
"text/html": [
"
\n", " | ag_id | \n", "ln_number | \n", "taxonomy_name | \n", "ontology_id | \n", "ontology_term | \n", "system | \n", "subsystem | \n", "organ | \n", "organ_region | \n", "side_position | \n", "... | \n", "treatment_name | \n", "treatment_details | \n", "dose | \n", "dose_units | \n", "hotspot2_spot | \n", "nuclear_percent_duplication | \n", "paired_nuclear_align | \n", "encode_library_id | \n", "taxonomy_name_id | \n", "indiv_id | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "AG7206 | \n", "LN40338H | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "0.5719 | \n", "27.9486 | \n", "348371100.0 | \n", "ENCLB096YUZ | \n", "NaN | \n", "INDIV_D0001 | \n", "
1 | \n", "AG80615 | \n", "LN3817 | \n", "K562 | \n", "EFO:0002067 | \n", "K562 | \n", "Hematopoietic | \n", "Lymphoid | \n", "Blood | \n", "NaN | \n", "NaN | \n", "... | \n", "untreated | \n", "untreated | \n", "NaN | \n", "NaN | \n", "0.5447 | \n", "16.3414 | \n", "237879484.0 | \n", "ENCLB843GMH | \n", "134.0 | \n", "INDIV_D0001 | \n", "
2 | \n", "AG80660 | \n", "LN3327 | \n", "K562 | \n", "EFO:0002067 | \n", "K562 | \n", "Hematopoietic | \n", "Lymphoid | \n", "Blood | \n", "NaN | \n", "NaN | \n", "... | \n", "untreated | \n", "untreated | \n", "NaN | \n", "NaN | \n", "0.4751 | \n", "11.7948 | \n", "213181356.0 | \n", "ENCLB253REF | \n", "134.0 | \n", "INDIV_D0001 | \n", "
3 | \n", "AG80850 | \n", "LN1691 | \n", "K562 | \n", "EFO:0002067 | \n", "K562 | \n", "Hematopoietic | \n", "Lymphoid | \n", "Blood | \n", "NaN | \n", "NaN | \n", "... | \n", "untreated | \n", "untreated | \n", "NaN | \n", "NaN | \n", "0.4702 | \n", "33.4191 | \n", "NaN | \n", "ENCLB540ZZZ | \n", "134.0 | \n", "INDIV_D0001 | \n", "
4 | \n", "AG80851 | \n", "LN1684 | \n", "K562 | \n", "EFO:0002067 | \n", "K562 | \n", "Hematopoietic | \n", "Lymphoid | \n", "Blood | \n", "NaN | \n", "NaN | \n", "... | \n", "untreated | \n", "untreated | \n", "NaN | \n", "NaN | \n", "0.2837 | \n", "14.1151 | \n", "NaN | \n", "ENCLB539ZZZ | \n", "134.0 | \n", "INDIV_D0001 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
1406 | \n", "AG79936 | \n", "LN69969A | \n", "h.CACO-2 | \n", "EFO:0001099 | \n", "Caco-2 | \n", "Digestive | \n", "NaN | \n", "Colon | \n", "NaN | \n", "NaN | \n", "... | \n", "untreated | \n", "untreated | \n", "NaN | \n", "NaN | \n", "0.4227 | \n", "18.8354 | \n", "249544258.0 | \n", "ENCLB351LZG | \n", "101.0 | \n", "INDIV_D0756 | \n", "
1407 | \n", "AG80025 | \n", "LN54667A | \n", "h.CACO-2 | \n", "EFO:0001099 | \n", "Caco-2 | \n", "Digestive | \n", "NaN | \n", "Colon | \n", "NaN | \n", "NaN | \n", "... | \n", "untreated | \n", "untreated | \n", "NaN | \n", "NaN | \n", "0.3935 | \n", "34.8901 | \n", "228776942.0 | \n", "ENCLB564DMT | \n", "101.0 | \n", "INDIV_D0757 | \n", "
1408 | \n", "AG80858 | \n", "LN1269 | \n", "h.CACO-2 | \n", "EFO:0001099 | \n", "Caco-2 | \n", "Digestive | \n", "NaN | \n", "Colon | \n", "NaN | \n", "NaN | \n", "... | \n", "untreated | \n", "untreated | \n", "NaN | \n", "NaN | \n", "0.5724 | \n", "41.4670 | \n", "NaN | \n", "ENCLB422ZZZ | \n", "101.0 | \n", "INDIV_D0758 | \n", "
1409 | \n", "AG80857 | \n", "LN1289 | \n", "h.CACO-2 | \n", "EFO:0001099 | \n", "Caco-2 | \n", "Digestive | \n", "NaN | \n", "Colon | \n", "NaN | \n", "NaN | \n", "... | \n", "untreated | \n", "untreated | \n", "NaN | \n", "NaN | \n", "0.4214 | \n", "23.4804 | \n", "NaN | \n", "ENCLB423ZZZ | \n", "101.0 | \n", "INDIV_D0759 | \n", "
1410 | \n", "AG80650 | \n", "LN3462 | \n", "fHeart | \n", "UBERON:0000948 | \n", "heart | \n", "Cardiovascular | \n", "Cardiac | \n", "Heart | \n", "NaN | \n", "NaN | \n", "... | \n", "untreated | \n", "untreated | \n", "NaN | \n", "NaN | \n", "0.1077 | \n", "30.6859 | \n", "NaN | \n", "ENCLB740GHK | \n", "33.0 | \n", "INDIV_D0760 | \n", "
1411 rows × 36 columns
\n", "\n",
"| | #chr | start | end | ID | ref | alt | AAF | RAF | FMR | mean_BAD | ... | logit_pval_alt | es_mean | es_weighted_mean | nSNPs | max_cover | fdrp_bh_ref | fdrp_bh_alt | min_fdr | taxonomy_name_id | taxonomy_name |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | chr1 | 804903 | 804904 | rs61770167 | C | T | 0.00409339959225280 | 0.9959066004077471 | 0.000000 | 1.0 | ... | 0.221049 | -0.540568 | -0.540568 | 1 | 27 | 1.000000 | 1.0 | 1.000000 | 75 | h.CD19.naive |\n",
"| 1 | chr1 | 976214 | 976215 | rs7417106 | A | G | 0.70966329001019367 | 0.2903367099898063 | 0.000000 | 1.0 | ... | 0.997060 | 0.846024 | 0.880585 | 2 | 45 | 0.252424 | 1.0 | 0.252424 | 75 | h.CD19.naive |\n",
"| 2 | chr1 | 999841 | 999842 | rs2298214 | C | A | 0.45247993119266055 | 0.5475200688073394 | 0.066667 | 1.0 | ... | 0.285810 | -0.415037 | -0.415037 | 1 | 28 | 1.000000 | 1.0 | 1.000000 | 75 | h.CD19.naive |\n",
"| 3 | chr1 | 1000017 | 1000018 | rs146254088 | G | A | 0.02547623598369011 | 0.9745237640163099 | 0.121212 | 1.0 | ... | 0.574769 | 0.000000 | 0.000000 | 1 | 28 | 1.000000 | 1.0 | 1.000000 | 75 | h.CD19.naive |\n",
"| 4 | chr1 | 1000078 | 1000079 | rs3128113 | A | G | 0.67313328236493374 | 0.3268348623853211 | 0.218750 | 1.0 | ... | 0.500529 | -0.125531 | -0.125531 | 1 | 23 | 1.000000 | 1.0 | 1.000000 | 75 | h.CD19.naive |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| 5209518 | chr15 | 88895246 | 88895247 | rs12148326 | C | T | 0.12808995922528032 | 0.8517297400611621 | 0.000000 | 1.5 | ... | 0.253940 | -0.512982 | -0.512982 | 1 | 28 | 0.974588 | 1.0 | 0.974588 | 153 | h.HAP1 |\n",
"| 5209519 | chr15 | 88903785 | 88903786 | rs35729705 | G | A | 0.23169915902140672 | 0.768292877166157 | 0.000000 | 1.5 | ... | 0.560094 | 0.038049 | 0.038049 | 1 | 22 | 0.872462 | 1.0 | 0.872462 | 153 | h.HAP1 |\n",
"| 5209520 | chr15 | 88909531 | 88909532 | rs2280213 | T | C | 0.22235760703363914 | 0.7776423929663608 | 0.000000 | 1.5 | ... | 0.570281 | -0.008538 | -0.008538 | 1 | 23 | 0.872462 | 1.0 | 0.872462 | 153 | h.HAP1 |\n",
"| 5209521 | chr15 | 88913387 | 88913388 | rs139643219 | C | G | 0.01145196228338430 | 0.9885480377166157 | 0.048077 | 1.5 | ... | 0.794197 | 0.210090 | 0.262976 | 2 | 149 | 0.872462 | 1.0 | 0.872462 | 153 | h.HAP1 |\n",
"| 5209522 | chr15 | 88992978 | 88992979 | rs34757735 | C | T | 0.31483339704383282 | 0.6851666029561672 | 0.000000 | 1.5 | ... | 0.779762 | 0.194097 | 0.202742 | 2 | 54 | 0.872462 | 1.0 | 0.872462 | 153 | h.HAP1 |\n",
"\n",
"5209523 rows × 23 columns
\n", "\n",
"| | #chr | start | end | ID | ref | alt | AAF | RAF | FMR | mean_BAD | ... | logit_pval_alt | es_mean | es_weighted_mean | nSNPs | max_cover | fdrp_bh_ref | fdrp_bh_alt | min_fdr | taxonomy_name_id | taxonomy_name |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 18 | chr1 | 1171931 | 1171932 | rs61768481 | G | A | 0.00821865443425076 | 0.991773381753313 | 0.028169 | 1.0 | ... | 1.000000 | 1.965527 | 2.282978 | 2 | 67 | 0.000007 | 1.0 | 0.000007 | 75 | h.CD19.naive |\n",
"| 59 | chr1 | 5992030 | 5992031 | rs34031616 | G | C | 0.18855918705402650 | 0.8109390927624872 | 0.102564 | 1.0 | ... | 0.999999 | 1.713739 | 2.145760 | 2 | 39 | 0.001332 | 1.0 | 0.001332 | 75 | h.CD19.naive |\n",
"| 71 | chr1 | 8526043 | 8526044 | rs151028640 | C | T | 0.00907874617737003 | 0.99092125382263 | 0.000000 | 1.0 | ... | 0.999944 | 1.891949 | 1.902251 | 2 | 41 | 0.002238 | 1.0 | 0.002238 | 75 | h.CD19.naive |\n",
"| 111 | chr1 | 11749344 | 11749345 | rs11121820 | T | G | 0.40270610346585117 | 0.5972938965341488 | 0.030303 | 1.0 | ... | 0.999921 | 2.058894 | 2.058894 | 1 | 31 | 0.034874 | 1.0 | 0.034874 | 75 | h.CD19.naive |\n",
"| 141 | chr1 | 16073670 | 16073671 | rs578144077 | G | A | 0.03300203873598369 | 0.9669979612640163 | 0.000000 | 1.0 | ... | 0.999981 | 1.474769 | 1.528200 | 2 | 46 | 0.013908 | 1.0 | 0.013908 | 75 | h.CD19.naive |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| 10861 | chr9 | 124800693 | 124800694 | rs913232 | G | A | 0.63069412589194699 | 0.36928198267074414 | 0.058824 | 1.0 | ... | 0.999950 | 1.503529 | 1.627195 | 2 | 30 | 0.032853 | 1.0 | 0.032853 | 75 | h.CD19.naive |\n",
"| 10893 | chr9 | 129803078 | 129803079 | rs76154399 | C | T | 0.01259078746177370 | 0.98740124872579 | 0.012987 | 1.0 | ... | 0.999836 | 0.583522 | 0.761074 | 2 | 149 | 0.030899 | 1.0 | 0.030899 | 75 | h.CD19.naive |\n",
"| 10907 | chr9 | 129888566 | 129888567 | rs2296791 | G | T | 0.01343495158002038 | 0.986549120795107 | 0.044118 | 1.0 | ... | 1.000000 | 0.958625 | 1.627922 | 2 | 138 | 0.000351 | 1.0 | 0.000351 | 75 | h.CD19.naive |\n",
"| 10931 | chr9 | 131669097 | 131669098 | rs117156575 | G | C | 0.02819189602446483 | 0.9718081039755352 | 0.029412 | 1.0 | ... | 0.999951 | 1.355409 | 1.758573 | 2 | 31 | 0.030349 | 1.0 | 0.030349 | 75 | h.CD19.naive |\n",
"| 11021 | chr9 | 137761126 | 137761127 | rs11137204 | C | T | 0.27375605249745158 | 0.7260368883792049 | 0.062500 | 1.0 | ... | 0.999868 | 0.977140 | 1.341225 | 2 | 69 | 0.045060 | 1.0 | 0.045060 | 75 | h.CD19.naive |\n",
"\n",
"236 rows × 23 columns
\n", "