WP11 Integrated statistics

Description

Only a small amount of “complete” toxicology data is available in the public domain. The most advanced and comprehensive set is that of the Japanese public-private partnership project TG-GATEs, whose open version, Open TG-GATEs, contains histology, clinical chemistry and array expression data for three dose levels in rat and human systems. Beyond this there are smaller datasets from smaller-scale projects, for example the PredTox project, which is restricted to only around 15 compounds but does include metabolomics and proteomics data alongside the gene expression data. The diXa infrastructure addresses the storage and curation of ‘omics data in the toxicology space, and also links to some 25 currently available, web-accessible chemoinformatics databases operated via the OECD eChemPortal. With several of the HeCaToS partners involved in diXa, data access and processing will be straightforward. Under the recently started FP7 iCORDI project, the diXa data infrastructure will be further internationalized, in addition to TG-GATEs, by developing collaborations with the US databases CEBS, Comparative Toxicogenomics Database and Connectivity Map.
Molecular interaction and pharmacology data is typically more disparate in publication and disclosure, with datasets typically limited to around 20-50 variants of a chemical structure in a small number of bioassays. The ChEMBL database contains a large set of such data, and the published data can be normalized (compound structures converted to comparable canonical structural representations, salts and mixtures handled, bioassays tagged to targets, units normalized and so forth).
This normalized and curated data forms a strong platform for data mining, but requires further processing for effective use. These frameworks also need aligning with initiatives in other parts of the globe. For example, the data produced by the US EPA Tox21 project is readily mappable to the ChEMBL infrastructure, yet adds valuable diversity, ‘inactive’, and contextual assay data to our integration plans. The basic data integration problem is to allow indexing across these existing stable and sustainable resources while unlocking the hidden data by combining them in a framework for multi-scale toxicology prediction, one that in particular reflects the established knowledge and conventions of specialist research fields (e.g. cardio- and hepatotoxicology).
Automated approaches to filter the large quantity of data for consistency are required to build reliable data for modelling. For example, sampling statistics from ChEMBL indicate that up to 1.5% of the bioassay data reported in the published literature is wrong by a factor of 10^3, as researchers convert from nM to µM during the publication process. Predictive models and outlier analysis are powerful approaches to ensure comparability of data. Genedata Analyst™ is a leading computational system for integrated analysis of data coming from diverse technological platforms and has the ability to process billions of data points with unmatched high performance. Analyst is the only computational system on the market today that supports such an integrated approach. Partner Genedata has successfully performed such tasks in the EU FP6 project Innomed PredTox, where transcriptomics, proteomics, and metabolomics data were acquired from in vivo samples and analyzed together to identify biomarkers for liver and kidney toxicity.
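As an illustration of the kind of automated consistency filter meant here, the following minimal sketch flags measurements that sit roughly 1000-fold away from the other reported values for the same compound and assay, the signature of an nM/µM mix-up. It is not the method used in ChEMBL or in Genedata Analyst; the function name `flag_unit_errors`, the tolerance of 0.5 log units and the example values are assumptions made for illustration only.

```python
import math
from statistics import median


def flag_unit_errors(values_nm, tolerance=0.5):
    """Return indices of values that lie about 1000-fold from the rest.

    values_nm: positive activity values for ONE compound in ONE assay,
               nominally all in nM (hypothetical input layout).
    tolerance: allowed distance, in log10 units, from an exact 3-log shift
               (an assumed default, not a published threshold).
    """
    flagged = []
    for i, value in enumerate(values_nm):
        # Compare each value with the median of the OTHER measurements,
        # so that the suspect value cannot pull the reference towards itself.
        others = [math.log10(x) for j, x in enumerate(values_nm) if j != i]
        if len(others) < 2:
            continue  # too few independent values to judge
        delta = math.log10(value) - median(others)
        if abs(abs(delta) - 3.0) <= tolerance:
            flagged.append(i)
    return flagged


# Made-up example: three consistent values and one reported 1000-fold too high.
measurements = [12.0, 15.0, 9.5, 11000.0]
suspect = flag_unit_errors(measurements)
```

Flagged values would normally be sent for manual curation against the original publication rather than corrected automatically, since a genuine 1000-fold difference between assays cannot be excluded by statistics alone.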
Secondly, it is important to bear in mind the experimental error in a reported measurement. Comparing measurements of the same compound in the same assay from different labs can yield generally applicable a priori error estimates, which can be incorporated into machine learning and QSAR approaches as a way of estimating the likely errors in predictive models.
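One simple way to obtain such an a priori error estimate is a pooled standard deviation over all compound-assay pairs that were measured more than once. The sketch below assumes activities are already on a log scale (for example pIC50) and that each record comes from a different lab; the record layout, the function name `pooled_interlab_sd` and the example values are hypothetical and serve only to illustrate the calculation.

```python
import math
from collections import defaultdict


def pooled_interlab_sd(records):
    """Pooled standard deviation of repeated compound-assay measurements.

    records: iterable of (compound, assay, lab, p_activity) tuples, where
             p_activity is a log-scale activity such as pIC50
             (hypothetical layout chosen for this example).
    """
    groups = defaultdict(list)
    for compound, assay, _lab, p_activity in records:
        groups[(compound, assay)].append(p_activity)

    sum_sq, dof = 0.0, 0
    for values in groups.values():
        if len(values) < 2:
            continue  # a single measurement carries no error information
        mean = sum(values) / len(values)
        sum_sq += sum((v - mean) ** 2 for v in values)
        dof += len(values) - 1

    if dof == 0:
        raise ValueError("no compound-assay pair was measured more than once")
    return math.sqrt(sum_sq / dof)


# Made-up example data: two pairs measured in two labs, one measured once.
records = [
    ("cpd1", "assayA", "lab1", 6.0),
    ("cpd1", "assayA", "lab2", 6.4),
    ("cpd2", "assayA", "lab1", 7.1),
    ("cpd2", "assayA", "lab3", 6.9),
    ("cpd3", "assayA", "lab1", 5.0),
]
sigma = pooled_interlab_sd(records)
```

The resulting value could serve as a lower bound on the error a QSAR or machine learning model can be expected to achieve on such data, since a model cannot be more accurate than the measurements it is trained on.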

 
Ralf Herwig
MPIMG