Taming Text: How to Find, Organize, and Manipulate It

Price: 200.25 lei
Availability: available to order
ISBN: 9781933988382
Publication year: 2013
Pages: 320

DESCRIPTION

Taming Text is a hands-on, example-driven guide to working with unstructured text in the context of real-world applications. This book explores how to automatically organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. The book guides you through examples illustrating each of these topics, as well as the foundations upon which they are built.

About this Book

There is so much text in our lives that we are practically drowning in it. Fortunately, there are innovative tools and techniques for managing unstructured information that can throw the smart developer a much-needed lifeline. You'll find them in this book.

Taming Text is a practical, example-driven guide to working with text in real applications. This book introduces you to useful techniques like full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. You'll explore real use cases as you systematically absorb the foundations upon which they are built.

Written in a clear and concise style, this book avoids jargon, explaining the subject in terms you can understand without a background in statistics or natural language processing. Examples are in Java, but the concepts can be applied in any language.

What's Inside
• When to use text-taming techniques
• Important open-source libraries like Solr and Mahout
• How to build text-processing applications

About the Authors

Grant Ingersoll is an engineer, speaker, and trainer, a Lucene committer, and a cofounder of the Mahout machine-learning project. Thomas Morton is the primary developer of OpenNLP and Maximum Entropy. Drew Farris is a technology consultant, software developer, and contributor to Mahout, Lucene, and Solr.

Contents
--------------------------------------------------------------------------------

foreword
preface
acknowledgments
about this book
about the cover illustration

Chapter 1 Getting started taming text
Why taming text is important
Preview: A fact-based question answering system
Understanding text is hard
Text, tamed
Text and the intelligent app: search and beyond
Summary
Resources
Chapter 2 Foundations of taming text
Foundations of language
Common tools for text processing
Preprocessing and extracting content from common file formats
Summary
Resources
Chapter 3 Searching
Search and faceting example: Amazon.com
Introduction to search concepts
Introducing the Apache Solr search server
Indexing content with Apache Solr
Searching content with Apache Solr
Understanding search performance factors
Improving search performance
Search alternatives
Summary
Resources
Chapter 4 Fuzzy string matching
Approaches to fuzzy string matching
Finding fuzzy string matches
Building fuzzy string matching applications
Summary
Resources
Chapter 5 Identifying people, places, and things
Approaches to named-entity recognition
Basic entity identification with OpenNLP
In-depth entity identification with OpenNLP
Performance of OpenNLP
Customizing OpenNLP entity identification for a new domain
Summary
Further reading
Chapter 6 Clustering text
Google News document clustering
Clustering foundations
Setting up a simple clustering application
Clustering search results using Carrot2
Clustering document collections with Apache Mahout
Topic modeling using Apache Mahout
Examining clustering performance
Acknowledgments
Summary
References
Chapter 7 Classification, categorization, and tagging
Introduction to classification and categorization
The classification process
Building document categorizers using Apache Lucene
Training a naive Bayes classifier using Apache Mahout
Categorizing documents with OpenNLP
Building a tag recommender using Apache Solr
Summary
References
Chapter 8 Building an example question answering system
Basics of a question answering system
Installing and running the QA code
A sample question answering architecture
Understanding questions and producing answers
Steps to improve the system
Summary
Resources
Chapter 9 Untamed text: exploring the next frontier
Semantics, discourse, and pragmatics: exploring higher levels of NLP
Document and collection summarization
Relationship extraction
Identifying important content and people
Detecting emotions via sentiment analysis
Cross-language information retrieval
Summary
References
index
