
E-book, English, 688 pages

Sarkar Text Analytics with Python

A Practitioner's Guide to Natural Language Processing
2nd edition
ISBN: 978-1-4842-4354-1
Publisher: Apress
Format: PDF
Copy protection: PDF watermark




Leverage Natural Language Processing (NLP) in Python and learn how to set up your own robust environment for performing text analytics. This second edition has gone through a major revamp and introduces several significant changes and new topics based on recent trends in NLP. You'll see how to use the latest state-of-the-art NLP frameworks, coupled with machine learning and deep learning models for supervised sentiment analysis, to solve actual case studies in Python.

Start by reviewing Python fundamentals for NLP on strings and text data, then move on to engineering representation methods for text data, including both traditional statistical models and newer deep learning-based embedding models. Improved techniques and new methods around parsing and processing text are discussed as well. Text summarization and topic models have been overhauled, so the book showcases how to build, tune, and interpret topic models in the context of an interesting dataset of NIPS conference papers. Additionally, the book covers text similarity techniques with a real-world example of movie recommenders, along with sentiment analysis using supervised and unsupervised techniques. There is also a chapter dedicated to semantic analysis, where you'll see how to build your own named entity recognition (NER) system from scratch. While the overall structure of the book remains the same, the entire code base, modules, and chapters have been updated to the latest Python 3.x release.
What You'll Learn

• Understand NLP and text syntax, semantics, and structure
• Discover text cleaning and feature engineering
• Review text classification and text clustering
• Assess text summarization and topic models
• Study deep learning for NLP
Who This Book Is For
IT professionals, data analysts, developers, linguistic experts, data scientists, and engineers, and anyone with a keen interest in linguistics, analytics, and generating insights from textual data.


Dipanjan Sarkar is a Data Scientist at Intel, the world's largest silicon company, which is on a mission to make the world more connected and productive. He primarily works on analytics, business intelligence, application development, and building large-scale intelligent systems. He received his master's degree in Information Technology from the International Institute of Information Technology, Bangalore, with a focus on data science and software engineering. He is also an avid supporter of self-learning, especially through Massive Open Online Courses, and holds a Data Science Specialization from Johns Hopkins University on Coursera. He has been an analytics practitioner for over six years, specializing in statistical, predictive, and text analytics. He has also authored books on R and machine learning, occasionally reviews technical books, and acts as a course beta tester for Coursera. Dipanjan's interests include learning about new technology, financial markets, disruptive start-ups, data science, and more recently, artificial intelligence and deep learning. In his spare time he loves reading, gaming, and watching popular sitcoms and football.




Further Information & Material


Table of Contents  5
About the Author  14
About the Technical Reviewer  15
Foreword  16
Acknowledgments  17
Introduction  19
Chapter 1: Natural Language Processing Basics  21
  Natural Language  23
    What Is Natural Language?  23
    The Philosophy of Language  23
    Language Acquisition and Usage  26
      Language Acquisition and Cognitive Learning  27
      Language Usage  28
  Linguistics  30
  Language Syntax and Structure  33
    Words  35
    Phrases  37
    Clauses  40
    Grammar  41
      Dependency Grammar  42
      Constituency Grammar  46
    Word-Order Typology  53
  Language Semantics  55
    Lexical Semantic Relations  55
      Lemmas and Wordforms  56
      Homonyms, Homographs, and Homophones  56
      Heteronyms and Heterographs  57
      Polysemes  57
      Capitonyms  57
      Synonyms and Antonyms  57
      Hyponyms and Hypernyms  58
    Semantic Networks and Models  59
    Representation of Semantics  61
      Propositional Logic  61
      First Order Logic  66
  Text Corpora  71
    Corpora Annotation and Utilities  72
    Popular Corpora  73
    Accessing Text Corpora  75
      Accessing the Brown Corpus  76
      Accessing the Reuters Corpus  79
      Accessing the WordNet Corpus  80
  Natural Language Processing  82
    Machine Translation  82
    Speech Recognition Systems  83
    Question Answering Systems  84
    Contextual Recognition and Resolution  84
    Text Summarization  85
    Text Categorization  85
  Text Analytics  86
  Machine Learning  87
  Deep Learning  88
  Summary  88
Chapter 2: Python for Natural Language Processing  89
  Getting to Know Python  90
  The Zen of Python  91
  Applications: When Should You Use Python?  93
  Drawbacks: When Should You Not Use Python?  95
  Python Implementations and Versions  96
  Setting Up a Robust Python Environment  98
    Which Python Version?  98
    Which Operating System?  99
    Integrated Development Environments  99
    Environment Setup  100
    Package Management  104
    Virtual Environments  105
  Python Syntax and Structure  108
  Working with Text Data  109
    String Literals  109
    Representing Strings  111
    String Operations and Methods  113
      Basic Operations  114
      Indexing and Slicing  115
      Methods  118
      Formatting  120
      Regular Expressions  122
  Basic Text Processing and Analysis: Putting It All Together  126
  Natural Language Processing Frameworks  131
  Summary  133
Chapter 3: Processing and Understanding Text  135
  Text Preprocessing and Wrangling  137
    Removing HTML Tags  137
    Text Tokenization  139
      Sentence Tokenization  140
        Default Sentence Tokenizer  141
        Pretrained Sentence Tokenizer Models  143
        PunktSentenceTokenizer  145
        RegexpTokenizer  145
      Word Tokenization  146
        Default Word Tokenizer  147
        TreebankWordTokenizer  147
        TokTokTokenizer  148
        RegexpTokenizer  149
        Inherited Tokenizers from RegexpTokenizer  151
      Building Robust Tokenizers with NLTK and spaCy  152
    Removing Accented Characters  155
    Expanding Contractions  156
    Removing Special Characters  158
    Case Conversions  158
    Text Correction  159
      Correcting Repeating Characters  159
      Correcting Spellings  162
    Stemming  168
    Lemmatization  172
    Removing Stopwords  174
    Bringing It All Together — Building a Text Normalizer  175
  Understanding Text Syntax and Structure  177
    Installing Necessary Dependencies  179
    Important Machine Learning Concepts  182
    Parts of Speech Tagging  183
      Building POS Taggers  186
    Shallow Parsing or Chunking  192
      Building Shallow Parsers  193
    Dependency Parsing  203
      Building Dependency Parsers  205
    Constituency Parsing  210
      Building Constituency Parsers  212
  Summary  219
Chapter 4: Feature Engineering for Text Representation  220
  Understanding Text Data  221
  Building a Text Corpus  222
  Preprocessing Our Text Corpus  224
  Traditional Feature Engineering Models  227
    Bag of Words Model  227
    Bag of N-Grams Model  229
    TF-IDF Model  230
      Using TfidfTransformer  232
      Using TfidfVectorizer  233
      Understanding the TF-IDF Model  234
    Extracting Features for New Documents  239
    Document Similarity  239
      Document Clustering with Similarity Features  241
    Topic Models  245
  Advanced Feature Engineering Models  250
    Loading the Bible Corpus  252
    Word2Vec Model  253
      The Continuous Bag of Words (CBOW) Model  253
      Implementing the Continuous Bag of Words (CBOW) Model  255
        Build the Corpus Vocabulary  255
        Build a CBOW (Context, Target) Generator  256
        Build the CBOW Model Architecture  257
        Train the Model  260
        Get Word Embeddings  261
      The Skip-Gram Model  263
      Implementing the Skip-Gram Model  265
        Build the Corpus Vocabulary  265
        Build a Skip-Gram [(target, context), relevancy] Generator  266
        Build the Skip-Gram Model Architecture  267
        Train the Model  270
        Get Word Embeddings  271
    Robust Word2Vec Models with Gensim  274
    Applying Word2Vec Features for Machine Learning Tasks  277
      Strategy for Getting Document Embeddings  279
    The GloVe Model  282
    Applying GloVe Features for Machine Learning Tasks  284
    The FastText Model  288
    Applying FastText Features to Machine Learning Tasks  289
  Summary  292
Chapter 5: Text Classification  293
  What Is Text Classification?  295
    Formal Definition  295
    Major Text Classification Variants  296
  Automated Text Classification  297
    Formal Definition  299
    Text Classification Task Variants  300
  Text Classification Blueprint  300
  Data Retrieval  303
  Data Preprocessing and Normalization  305
  Building Train and Test Datasets  310
  Feature Engineering Techniques  311
    Traditional Feature Engineering Models  312
    Advanced Feature Engineering Models  313
  Classification Models  314
    Multinomial Naïve Bayes  316
    Logistic Regression  319
    Support Vector Machines  321
    Ensemble Models  324
    Random Forest  325
    Gradient Boosting Machines  326
  Evaluating Classification Models  327
    Confusion Matrix  328
      Understanding the Confusion Matrix  329
        Performance Metrics  330
  Building and Evaluating Our Text Classifier  333
    Bag of Words Features with Classification Models  333
    TF-IDF Features with Classification Models  337
    Comparative Model Performance Evaluation  340
    Word2Vec Embeddings with Classification Models  341
    GloVe Embeddings with Classification Models  344
    FastText Embeddings with Classification Models  345
    Model Tuning  346
    Model Performance Evaluation  352
  Applications  359
  Summary  359
Chapter 6: Text Summarization and Topic Models  361
  Text Summarization and Information Extraction  362
    Keyphrase Extraction  364
    Topic Modeling  364
    Automated Document Summarization  364
  Important Concepts  365
  Keyphrase Extraction  368
    Collocations  369
    Weighted Tag-Based Phrase Extraction  375
  Topic Modeling  380
  Topic Modeling on Research Papers  382
    The Main Objective  382
    Data Retrieval  383
    Load and View Dataset  384
    Basic Text Wrangling  385
  Topic Models with Gensim  386
    Text Representation with Feature Engineering  387
    Latent Semantic Indexing  390
    Implementing LSI Topic Models from Scratch  400
    Latent Dirichlet Allocation  407
    LDA Models with MALLET  417
    LDA Tuning: Finding the Optimal Number of Topics  420
    Interpreting Topic Model Results  427
      Dominant Topics Distribution Across Corpus  428
      Dominant Topics in Specific Research Papers  430
      Relevant Research Papers per Topic Based on Dominance  431
    Predicting Topics for New Research Papers  433
  Topic Models with Scikit-Learn  436
    Text Representation with Feature Engineering  437
    Latent Semantic Indexing  437
    Latent Dirichlet Allocation  443
    Non-Negative Matrix Factorization  446
    Predicting Topics for New Research Papers  450
    Visualizing Topic Models  452
  Automated Document Summarization  453
    Text Wrangling  457
    Text Representation with Feature Engineering  458
    Latent Semantic Analysis  459
    TextRank  463
  Summary  468
Chapter 7: Text Similarity and Clustering  470
  Essential Concepts  472
    Information Retrieval (IR)  472
    Feature Engineering  472
    Similarity Measures  473
    Unsupervised Machine Learning Algorithms  474
  Text Similarity  474
  Analyzing Term Similarity  475
    Hamming Distance  478
    Manhattan Distance  479
    Euclidean Distance  481
    Levenshtein Edit Distance  482
    Cosine Distance and Similarity  488
  Analyzing Document Similarity  492
  Building a Movie Recommender  493
    Load and View Dataset  494
    Text Preprocessing  497
    Extract TF-IDF Features  498
    Cosine Similarity for Pairwise Document Similarity  499
    Find Top Similar Movies for a Sample Movie  500
      Find Movie ID  500
      Get Movie Similarities  500
      Get Top Five Similar Movie IDs  500
      Get Top Five Similar Movies  501
    Build a Movie Recommender  501
    Get a List of Popular Movies  502
    Okapi BM25 Ranking for Pairwise Document Similarity  505
  Document Clustering  514
  Clustering Movies  517
    Feature Engineering  517
    K-Means Clustering  518
    Affinity Propagation  525
    Ward's Agglomerative Hierarchical Clustering  529
  Summary  534
Chapter 8: Semantic Analysis  535
  Semantic Analysis  536
  Exploring WordNet  537
    Understanding Synsets  538
    Analyzing Lexical Semantic Relationships  539
      Entailments  540
      Homonyms and Homographs  540
      Synonyms and Antonyms  541
      Hyponyms and Hypernyms  542
      Holonyms and Meronyms  545
      Semantic Relationships and Similarity  546
  Word Sense Disambiguation  549
  Named Entity Recognition  552
  Building an NER Tagger from Scratch  560
  Building an End-to-End NER Tagger with Our Trained NER Model  570
  Analyzing Semantic Representations  574
    Propositional Logic  574
    First Order Logic  576
  Summary  582
Chapter 9: Sentiment Analysis  583
  Problem Statement  584
  Setting Up Dependencies  585
  Getting the Data  585
  Text Preprocessing and Normalization  586
  Unsupervised Lexicon-Based Models  588
    Bing Liu's Lexicon  590
    MPQA Subjectivity Lexicon  590
    Pattern Lexicon  591
    TextBlob Lexicon  591
    AFINN Lexicon  594
    SentiWordNet Lexicon  596
    VADER Lexicon  600
  Classifying Sentiment with Supervised Learning  603
  Traditional Supervised Machine Learning Models  606
  Newer Supervised Deep Learning Models  609
  Advanced Supervised Deep Learning Models  618
  Analyzing Sentiment Causation  630
    Interpreting Predictive Models  630
    Analyzing Topic Models  638
  Summary  645
Chapter 10: The Promise of Deep Learning  646
  Why Are We Crazy for Embeddings?  648
  Trends in Word-Embedding Models  650
  Trends in Universal Sentence-Embedding Models  651
  Understanding Our Text Classification Problem  657
  Universal Sentence Embeddings in Action  658
    Load Up Dependencies  658
    Load and View the Dataset  659
    Building Train, Validation, and Test Datasets  660
    Basic Text Wrangling  660
    Build Data Ingestion Functions  662
    Build Deep Learning Model with Universal Sentence Encoder  663
    Model Training  664
    Model Evaluation  666
  Bonus: Transfer Learning with Different Universal Sentence Embeddings  667
  Summary and Future Scope  674
Index  675


