E-book, English, 348 pages
Dua / Chowriappa Data Mining for Bioinformatics
1st edition, 2013
ISBN: 978-1-4665-8866-0
Publisher: Taylor & Francis
Format: EPUB
Copy protection: Adobe DRM
Covering theory, algorithms, and methodologies, as well as data mining technologies, Data Mining for Bioinformatics provides a comprehensive discussion of data-intensive computations used in data mining with applications in bioinformatics. It supplies a broad, yet in-depth, overview of the application domains of data mining for bioinformatics to help readers from both biology and computer science backgrounds gain an enhanced understanding of this cross-disciplinary field.
The book offers authoritative coverage of data mining techniques, technologies, and frameworks used for storing, analyzing, and extracting knowledge from large databases in the bioinformatics domains, including genomics and proteomics. It begins by describing the evolution of bioinformatics and highlighting the challenges that can be addressed using data mining techniques. Introducing the various data mining techniques that can be employed in biological databases, the text is organized into four sections:
- Supplies a complete overview of the evolution of the field and its intersection with computational learning
- Describes the role of data mining in analyzing large biological databases, explaining the breadth of the various feature selection and feature extraction techniques that data mining has to offer
- Focuses on concepts of unsupervised learning using clustering techniques and their application to large biological data
- Covers supervised learning using classification techniques most commonly used in bioinformatics—addressing the need for validation and benchmarking of inferences derived using either clustering or classification
The book describes the various biological databases prominently referred to in bioinformatics and includes a detailed list of the applications of advanced clustering algorithms used in bioinformatics. Highlighting the challenges encountered during the application of classification on biological databases, it considers systems of both single and ensemble classifiers and shares effort-saving tips for model selection and performance estimation strategies.
Target audience
bioinformatics software engineers/developers; bioinformaticians; bioinformatics scientists and support specialists; data mining analysts/architects; data mining engineers and support specialists; faculty; postdoctoral researchers
Further information and material
Introduction to Bioinformatics
Introduction
Transcription and Translation The Central Dogma of Molecular Biology
The Human Genome Project
Beyond the Human Genome Project Sequencing Technology Dideoxy Sequencing Cyclic Array Sequencing Sequencing by Hybridization Microelectrophoresis Mass Spectrometry Nanopore Sequencing Next-Generation Sequencing Challenges of Handling NGS Data Sequence Variation Studies Kinds of Genomic Variations SNP Characterization Functional Genomics Splicing and Alternative Splicing Microarray-Based Functional Genomics Comparative Genomics Functional Annotation Function Prediction Aspects
Conclusion
References
Biological Databases and Integration
Introduction: Scientific Work Flows and Knowledge Discovery
Biological Data Storage and Analysis Challenges of Biological Data Classification of Bioscience Databases Primary versus Secondary Databases Deep versus Broad Databases Point Solution versus General Solution Databases Gene Expression Omnibus (GEO) Database The Protein Data Bank (PDB)
The Curse of Dimensionality
Data Cleaning Problems of Data Cleaning Challenges of Handling Evolving Databases Problems Associated with Single-Source Techniques Problems Associated with Multisource Integration Data Argumentation: Cleaning at the Schema Level Knowledge-Based Framework: Cleaning at the Instance Level Data Integration Ensembl Sequence Retrieval System (SRS) IBM’s DiscoveryLink Wrappers: Customizable Database Software Data Warehousing: Data Management with Query Optimization Data Integration in the PDB
Conclusion
References
Knowledge Discovery in Databases
Introduction
Analysis of Data Using Large Databases Distance Metrics Data Cleaning and Data Preprocessing
Challenges in Data Cleaning Models of Data Cleaning Proximity-Based Techniques Parametric Methods Nonparametric Methods Semiparametric Methods Neural Networks Machine Learning Hybrid Systems
Data Integration Data Integration and Data Linkage Schema Integration Issues Field Matching Techniques Character-Based Similarity Metrics Token-Based Similarity Metrics Data Linkage/Matching Techniques
Data Warehousing Online Analytical Processing Differences between OLAP and OLTP OLAP Tasks Life Cycle of a Data Warehouse
Conclusion
References
Section II
Feature Selection and Extraction Strategies in Data Mining
Introduction
Overfitting
Data Transformation Data Smoothing by Discretization Discretization of Continuous Attributes Normalization and Standardization Min-Max Normalization z-Score Standardization Normalization by Decimal Scaling
Features and Relevance Strongly Relevant Features Weakly Relevant to the Dataset/Distribution Pearson Correlation Coefficient Information Theoretic Ranking Criteria
Overview of Feature Selection Filter Approaches Wrapper Approaches
Filter Approaches for Feature Selection FOCUS Algorithm Relief Method—Weight-Based Approach.
Feature Subset Selection Using Forward Selection Gram-Schmidt Forward Feature Selection
Other Nested Subset Selection Methods
Feature Construction and Extraction Matrix Factorization LU Decomposition QR Factorization to Extract Orthogonal Features Eigenvalues and Eigenvectors of a Matrix Other Properties of a Matrix A Square Matrix and Matrix Diagonalization Symmetric Real Matrix: Spectral Theorem Singular Value Decomposition (SVD) Principal Component Analysis (PCA) Jordan Decomposition of a Matrix Principal Components Partial Least-Squares-Based Dimension Reduction (PLS) Factor Analysis (FA) Independent Component Analysis (ICA) Multidimensional Scaling (MDS)
Conclusion
References
Feature Interpretation for Biological Learning
Introduction
Normalization Techniques for Gene Expression Analysis Normalization and Standardization Techniques Expression Ratios Intensity-Based Normalization Total Intensity Normalization Intensity-Based Filtering of Array Elements Identification of Differentially Expressed Genes Selection Bias of Gene Expression Data
Data Preprocessing of Mass Spectrometry Data Data Transformation Techniques Baseline Subtraction (Smoothing) Normalization Binning Peak Detection Peak Alignment
Application of Dimensionality Reduction Techniques for MS Data Analysis Feature Selection Techniques Univariate Methods Multivariate Methods
Data Preprocessing for Genomic Sequence Data Feature Selection for Sequence Analysis
Ontologies in Bioinformatics The Role of Ontologies in Bioinformatics Description Logics Gene Ontology (GO) Open Biomedical Ontologies (OBO)
Conclusion
References
Section III
Clustering Techniques in Bioinformatics
Introduction
Clustering in Bioinformatics
Clustering Techniques Distance-Based Clustering and Measures Mahalanobis Distance Minkowski Distance Pearson Correlation Binary Features Nominal Features Mixed Variables Distance Measure Properties k-Means Algorithm k-Modes Algorithm Genetic Distance Measure (GDM)
Applications of Distance-Based Clustering in Bioinformatics New Distance Metric in Gene Expressions for Coexpressed Genes Gene Expression Clustering Using Mutual Information Distance Measure Gene Expression Data Clustering Using a Local Shape-Based Clustering Exact Similarity Computation Approximate Similarity Computation
Implementation of k-Means in WEKA
Hierarchical Clustering Agglomerative Hierarchical Clustering Cluster Splitting and Merging Calculate Distance between Clusters Applications of Hierarchical Clustering Techniques in Bioinformatics Hierarchical Clustering Based on Partially Overlapping and Irregular Data Cluster Stability Estimation for Microarray Data Comparing Gene Expression Sequences Using Pairwise Average Linking
Implementation of Hierarchical Clustering
Self-Organizing Maps Clustering SOM Algorithm Application of SOM in Bioinformatics Identifying Distinct Gene Expression Patterns Using SOM SOTA: Combining SOM and Hierarchical Clustering for Representation of Genes
Fuzzy Clustering Fuzzy c-Means (FCM) Application of Fuzzy Clustering in Bioinformatics Clustering Genes Using Fuzzy J-Means and VNS Methods Fuzzy k-Means Clustering on Gene Expression Comparison of Fuzzy Clustering Algorithms
Implementation of Expectation Maximization Algorithm
Conclusion
References
Advanced Clustering Techniques
Graph-Based Clustering Graph-Based Cluster Properties Cut in a Graph Intracluster and Intercluster Density
Measures for Identifying Clusters Identifying Clusters by Computing Values for the Vertices or Vertex Similarity Distance and Similarity Measure Adjacency-Based Measures Connectivity Measures Computing the Fitness Measure Density Measure Cut-Based Measures
Determining a Split in the Graph Cuts Spectral Methods Edge-Betweenness
Graph-Based Algorithms Chameleon Algorithm CLICK Algorithm
Application of Graph-Based Clustering in Bioinformatics Analysis of Gene Expression Data Using Shortest Path (SP) Construction of Genetic Linkage Maps Using Minimum Spanning Tree of a Graph Finding Isolated Groups in a Random Graph Process Implementation in Cytoscape Seeding Method
Kernel-Based Clustering Kernel Functions Gaussian Function
Application of Kernel Clustering in Bioinformatics Kernel Clustering Kernel-Based Support Vector Clustering Analyzing Gene Expression Data Using SOM and Kernel-Based Clustering
Model-Based Clustering for Gene Expression Data Gaussian Mixtures Diagonal Model Model Selection
Relevant Number of Genes A Resampling-Based Approach for Identifying Stable and Tight Patterns Overcoming the Local Minimum Problem in k-Means Clustering Tight Clustering Tight Clustering of Gene Expression Time Courses
Higher-Order Mining Clustering for Association Rule Discovery Clustering of Association Rules Clustering Clusters
Conclusion
References
Section IV
Classification Techniques in Bioinformatics
Introduction Bias-Variance Trade-Off in Supervised Learning Linear and Nonlinear Classifiers Model Complexity and Size of Training Data Dimensionality of Input Space
Supervised Learning in Bioinformatics
Support Vector Machines (SVMs) Hyperplanes Large Margin of Separation Soft Margin of Separation Kernel Functions Applications of SVM in Bioinformatics Gene Expression Analysis Remote Protein Homology Detection
Bayesian Approaches Bayes’ Theorem Naive Bayes Classification Handling of Prior Probabilities Handling of Posterior Probability Bayesian Networks Methodology Capturing Data Distributions Using Bayesian Networks Equivalence Classes of Bayesian Networks Learning Bayesian Networks Bayesian Scoring Metric Application of Bayesian Classifiers in Bioinformatics Binary Classification Multiclass Classification Computational Challenges for Gene Expression Analysis
Decision Trees Tree Pruning
Ensemble Approaches Bagging Unweighted Voting Methods Confidence Voting Methods Ranked Voting Methods Boosting Seeking Prospective Classifiers to Be Part of the Ensemble Choosing an Optimal Set of Classifiers Assigning Weight to the Chosen Classifier Random Forest Application of Ensemble Approaches in Bioinformatics
Computational Challenges of Supervised Learning
Conclusion
References
Validation and Benchmarking
Introduction: Performance Evaluation Techniques
Classifier Validation Model Selection Challenges Model Selection Performance Estimation Strategies Holdout Three-Way Split k-Fold Cross-Validation Random Subsampling
Performance Measures Sensitivity and Specificity Precision, Recall, and f-Measure ROC Curve
Cluster Validation Techniques The Need for Cluster Validation External Measures Internal Measures Performance Evaluation Using Validity Indices Silhouette Index (SI) Davies-Bouldin and Dunn’s Index Calinski Harabasz (CH) Index Rand Index
Conclusion
References




