WP9 Data warehousing

Description

Only a small amount of “complete” toxicology data is available in the public domain. The most advanced and comprehensive is that of the Japanese public-private partnership project TG-GATEs, whose open version, Open TG-GATEs, contains histology, clinical chemistry and array expression data for three dose levels in rat and human systems. Beyond this there are smaller datasets resulting from smaller-scale projects, for example the PredTox project, which is restricted to only around 15 compounds but does include metabolomics and proteomics data alongside the gene expression data. The diXa infrastructure addresses the storage and curation of ‘omics data in the toxicology space, and also links to some 25 currently available, web-accessible chemo-informatics databases operated via the OECD eChemPortal. With several of the HeCaToS partners involved in diXa, data access and processing will be straightforward. Under the recently started FP7 iCORDI project, the diXa data infrastructure will be further internationalized, in addition to TG-GATEs, by developing collaborations with the US databases CEBS, Comparative Toxicogenomics Database and Connectivity Map.
Molecular interaction and pharmacology data are typically more disparate in publication and disclosure, with datasets typically limited to around 20-50 variants of a chemical structure in a small number of bioassays. The ChEMBL database contains a large set of such data, and the published data can be normalized (compound structures converted to comparable canonical structural representations, salts and mixtures handled, bioassays tagged to targets, units normalized and so forth).
This normalized and curated data forms a strong platform for data mining, but requires further processing for effective use. These frameworks also need aligning with initiatives in other parts of the globe - for example, the data produced by the US EPA Tox21 project is readily mappable to the ChEMBL infrastructure, but adds valuable diversity, ‘inactive’, and contextual assay data to our integration plans.
The basic data integration problem is to allow indexing across these existing stable and sustainable resources while unlocking the hidden data by combining them in a framework for multi-scale toxicology prediction, one that in particular reflects the established knowledge and conventions of specialist research fields (e.g. cardio- and hepatotoxicology). Automated approaches to filter the large quantity of data for consistency are required to build reliable data for modelling. For example, sampling statistics from ChEMBL indicate that up to 1.5% of the bioassay data reported in the published literature is wrong by a factor of 10³, as researchers convert from nM to µM during the publication process. Predictive models and outlier analysis are powerful approaches to ensure comparability of data.
Genedata Analyst™ is a leading computational system for integrated analysis of data coming from diverse technological platforms and has the ability to process billions of data points with unmatched high performance. Analyst is the only computational system on the market today that supports such an integrated approach. Partner Genedata has successfully performed such tasks in the EU FP6 project InnoMed PredTox, where transcriptomics, proteomics, and metabolomics data were acquired from in vivo samples and analyzed together to identify biomarkers for liver and kidney toxicity.
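As an illustration of the automated consistency filtering mentioned above, the sketch below flags measurements that sit close to three log units (a factor of 10³) away from the median of other measurements for the same compound and target, which is the signature of an nM/µM conversion slip. It is a minimal example under stated assumptions, not the project's actual pipeline: the function name `flag_unit_errors`, the record layout `(compound_id, target_id, value_nM)`, the tolerance of 0.3 log units and the minimum group size of three are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict
from statistics import median


def flag_unit_errors(records, tolerance=0.3, min_group=3):
    """Flag values roughly 1000-fold away from their group median.

    records: iterable of (compound_id, target_id, value_nM) tuples with
    strictly positive values (hypothetical layout, for illustration only).
    """
    # Group all reported values by compound-target pair.
    groups = defaultdict(list)
    for compound, target, value_nm in records:
        groups[(compound, target)].append(value_nm)

    flagged = []
    for (compound, target), values in groups.items():
        # Too few measurements give no reliable reference point.
        if len(values) < min_group:
            continue
        logs = [math.log10(v) for v in values]
        centre = median(logs)
        for value, log_value in zip(values, logs):
            # A deviation close to 3 log units suggests an nM/uM mix-up.
            if abs(abs(log_value - centre) - 3.0) <= tolerance:
                flagged.append((compound, target, value))
    return flagged
```

Flagged values would be candidates for manual curation rather than automatic correction, since a genuine 1000-fold difference between assays cannot be ruled out from the numbers alone.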
It is also important to bear in mind the experimental error in a reported measurement. Here again, comparing results for the same compound in the same assay across different labs can yield generally applicable a priori error estimates, which can be incorporated into machine learning and QSAR approaches as a way of estimating likely errors in predictive models.
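One simple way to derive such an a priori error estimate is to pool the spread of lab-level results over all compound-assay pairs that were measured in more than one lab. The sketch below does this on log-scale activities (e.g. pIC50). It is a hedged illustration only: the function name `pooled_interlab_sd` and the record layout `(compound_id, assay_id, lab_id, p_activity)` are assumptions, and a pooled standard deviation is just one of several reasonable estimators.

```python
import math
from collections import defaultdict


def pooled_interlab_sd(records):
    """Pooled between-lab standard deviation of log-scale activities.

    records: iterable of (compound_id, assay_id, lab_id, p_activity)
    tuples (hypothetical layout, for illustration only).
    Returns None when no compound-assay pair has data from two labs.
    """
    # Collect measurements per compound-assay pair, split by lab.
    groups = defaultdict(dict)
    for compound, assay, lab, p_activity in records:
        groups[(compound, assay)].setdefault(lab, []).append(p_activity)

    sum_squares = 0.0
    degrees_of_freedom = 0
    for labs in groups.values():
        if len(labs) < 2:
            continue  # a single lab says nothing about between-lab error
        # Average replicates within each lab, then compare lab means.
        lab_means = [sum(v) / len(v) for v in labs.values()]
        grand_mean = sum(lab_means) / len(lab_means)
        sum_squares += sum((m - grand_mean) ** 2 for m in lab_means)
        degrees_of_freedom += len(lab_means) - 1

    if degrees_of_freedom == 0:
        return None
    return math.sqrt(sum_squares / degrees_of_freedom)
```

The resulting value, in log units, could serve as a lower bound on the error one should expect from any model trained on such data, since a model cannot reasonably be more precise than the measurements it learns from.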

 
Ugis Sarkans
EMBL