Zengyou He: Data Mining for Bioinformatics Applications

E-book, English, 100 pages



1st edition, 2015
ISBN: 978-0-08-100107-3
Publisher: Elsevier Science & Techn.
Format: EPUB
Copy protection: 6 - ePub Watermark




Data Mining for Bioinformatics Applications provides valuable information on the data mining methods that have been widely used for solving real bioinformatics problems, covering problem definition, data collection, data preprocessing, modeling, and validation. The text uses an example-based method to illustrate how to apply data mining techniques to real bioinformatics problems and contains 45 bioinformatics problems that have been investigated in recent research. For each example, the entire data mining process is described, ranging from data preprocessing to modeling and result validation.
- Provides valuable information on the data mining methods that have been widely used for solving real bioinformatics problems
- Uses an example-based method to illustrate how to apply data mining techniques to solve real bioinformatics problems
- Contains 45 bioinformatics problems that have been investigated in recent research

Zengyou He is an Associate Professor at the School of Software, Dalian University of Technology, P.R. China. He received his BS, MS, and PhD in computer science from Harbin Institute of Technology, P.R. China, and was a research associate in the Department of Electronic and Computer Engineering at the Hong Kong University of Science and Technology from 2007 to 2010. His research interests include computational proteomics and biological data mining. He has published more than 20 papers in leading bioinformatics journals, including Bioinformatics, BMC Bioinformatics, Briefings in Bioinformatics, IEEE/ACM Transactions on Computational Biology and Bioinformatics, and Journal of Computational Biology.



Further information & material


List of figures
Figure 1.1 Typical phases involved in a data mining process model.
Figure 2.1 An example of the alignment of five biological sequences. Here “–” denotes the gap inserted between different residues.
Figure 3.1 Overview of the Motif-All algorithm. In the first phase, it finds frequent motifs from P to reduce the number of candidate motifs. In the second phase, it performs the significance testing procedure to report all statistically significant motifs to the user.
Figure 3.2 Overview of the C-Motif algorithm. The algorithm generates and tests candidate phosphorylation motifs in a breadth-first manner, where the support and the statistical significance values are evaluated simultaneously.
Figure 3.3 The calculation of conditional significance in C-Motif. In the figure, Sig(m, P(mi), N(mi)) denotes the new significance value of m on its ith submotif induced data sets.
Figure 4.1 An illustration of the training data construction methods for non-kinase-specific phosphorylation site prediction. Here the shadowed part denotes the set of phosphorylated proteins and the unshadowed area represents the set of unphosphorylated proteins.
Figure 4.2 An illustration of the training data construction methods for kinase-specific phosphorylation site prediction. The proteins are divided into three parts: (I) the set of proteins that are phosphorylated by the target kinase, (II) the set of proteins that are phosphorylated by the other kinases, and (III) the set of unphosphorylated proteins.
Figure 4.3 An illustration of the basic idea of the active learning procedure for phosphorylation site prediction. (a) The SVM classifier (solid line) generated from the original training data. (b) The new SVM classifier (dashed line) built from the enlarged training data. The enlarged training data are composed of the initial training data and a new labeled sample.
Figure 4.4 An overview of the PHOSFER method. The training data are constructed with peptides from both soybean and other organisms, in which different training peptides have different weights. The classifier (e.g., random forest) is built on the training data set to predict the phosphorylation status of remaining S/T/Y residues in the soybean organism.
Figure 5.1 The protein identification process. In shotgun proteomics, the protein identification procedure has two main steps: peptide identification and protein inference.
Figure 5.2 An overview of the BagReg method. It is composed of three major steps: feature extraction, prediction model construction, and prediction result combination. In feature extraction, the BagReg method generates five features that are highly correlated with the presence probabilities of proteins. In prediction model construction, five classification models are built and applied to predict the presence probability of proteins, respectively. In prediction result combination, the presence probabilities from different classification models are combined to obtain a consensus probability.
Figure 5.3 The feature extraction process. Five features are extracted from the original input data for each protein: the number of matched peptides (MP), the number of unique peptides (UP), the number of matched spectra (MS), the maximal score of matched peptides (MSP), and the average score of matched peptides (AMP).
Figure 5.4 A single learning process. Each separate learning process accomplishes a typical supervised learning procedure. The model construction phase involves constructing the training set and learning the classification model. The prediction phase predicts the presence probabilities of all candidate proteins with the classifier obtained in the previous phase.
Figure 5.5 The basic idea of ProteinLasso. ProteinLasso formulates the protein inference problem as a minimization problem, where yi is the peptide probability, Di represents the vector of peptide detectabilities for the ith peptide, xj denotes the unknown protein probability of the jth protein, and λ is a user-specified parameter. This optimization problem is the well-known Lasso regression problem in statistics and data mining.
Figure 5.6 The target-decoy strategy for evaluating protein inference results. The MS/MS spectra are searched against the target-decoy database, and the identified proteins are sorted according to their scores or probabilities. The false discovery rate at a threshold can be estimated as the ratio of the number of decoy matches to that of target matches.
Figure 5.7 An overview of the decoy-free FDR estimation algorithm.
Figure 5.8 The correct and incorrect procedure for assessing the performance of protein inference algorithms. In model selection, we cannot use any ground truth information that should only be visible in the model assessment stage. Otherwise, we may overestimate the actual performance of inference algorithms.
Figure 6.1 A typical AP-MS workflow for constructing a PPI network. A typical AP-MS study performs a set of experiments on bait proteins of interest, with the goal of identifying their interaction partners. In each experiment, a bait protein is first tagged and expressed in the cell. Then, the bait protein and its potential interaction partners (prey proteins) are affinity purified using AP. The resulting proteins (both bait and prey proteins) are digested into peptides and passed to a tandem mass spectrometer for analysis. Peptides are identified from the MS/MS spectra with peptide identification algorithms and proteins are inferred from identified peptides with protein inference algorithms. In addition, a label-free quantification method such as spectral counting is typically used to estimate the protein abundance in each experiment. Such pull-down bait–prey data from all AP-MS runs are used to filter contaminants and construct the PPI network.
Figure 6.2 A sample AP-MS data set with six purifications.
Figure 6.3 The PPI network constructed from the sample data. Here DC is used as the correlation measure and the score threshold is 0.5, that is, a protein pair is considered to be a true interaction if the DC score is above 0.5. In the figure, the width of the edge that connects two proteins is proportional to the corresponding DC score.
Figure 6.4 An illustration of a database-free method for validating the interaction prediction results. Under the null hypothesis that each bait protein capturing a prey protein is a random event, some simulated data sets are generated such that they are comparable to the original one. Then, an empirical p-value representing the probability that an original interaction score for a protein pair would occur in the random data sets by chance can be calculated. Finally, the false discovery rate is calculated according to these p-values.
Figure 7.1 An example bait–prey graph. In this figure, each Bi (i = 1, 2, 3, 4) denotes a bait protein and each Pi (i = 1, 2, 3, 4, 5, 6) represents a prey protein. The score that measures interaction strength between a bait–prey pair is provided as well.
Figure 7.2 Three maximal bicliques are identified. Among these three bicliques, C1 and C2 are reliable and only C1 is finally reported as a protein-complex core.
Figure 7.3 The final protein complex by including both the protein complex core C1 and an attachment B3.
Figure 8.1 A typical data analysis pipeline for biomarker discovery from mass spectrometry data. In this workflow, there are three preprocessing steps: feature extraction, feature alignment, and feature transformation. After preprocessing the raw data, feature selection techniques are employed to identify a subset of features as the biomarker.
Figure 8.2 An illustration of feature transformation based on protein–protein interaction (PPI) information. The PPI information is used to find groups of correlated features in terms of proteins. These identified feature groups are transformed into a set of new features for biomarker...
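
The caption of Figure 5.6 describes estimating the false discovery rate at a threshold as the ratio of decoy matches to target matches. The following is a minimal Python sketch of that ratio only, not the book's implementation. The function name `target_decoy_fdr`, the `(score, is_decoy)` input layout, the sample scores, and the choice to return 0.0 when no target passes the threshold are all assumptions made for illustration.

```python
from typing import List, Tuple


def target_decoy_fdr(hits: List[Tuple[float, bool]], threshold: float) -> float:
    """Estimate FDR at a score threshold as (#decoy matches) / (#target matches).

    `hits` is a hypothetical list of (score, is_decoy) pairs, one per
    identified protein from a target-decoy database search.
    """
    # Count proteins at or above the threshold, split by database origin.
    targets = sum(1 for score, is_decoy in hits if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in hits if score >= threshold and is_decoy)
    if targets == 0:
        # Assumed convention: no accepted targets means nothing to be wrong about.
        return 0.0
    return decoys / targets


# Made-up example scores; True marks a decoy match.
hits = [
    (0.99, False),
    (0.95, False),
    (0.90, True),
    (0.85, False),
    (0.60, True),
    (0.55, False),
]

for t in (0.93, 0.8, 0.5):
    print(f"threshold={t}: estimated FDR={target_decoy_fdr(hits, t):.3f}")
```

Lowering the threshold accepts more proteins, and in this made-up data it also raises the estimated FDR, which is the trade-off the sorted list in the figure is meant to expose.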


