Text Data Management and Analysis

A Practical Introduction to Information Retrieval and Text Mining

Author: ChengXiang Zhai,Sean Massung

Publisher: Morgan & Claypool

ISBN: 1970001178

Category: Computers

Page: 530

View: 9359


Recent years have seen a dramatic growth of natural language text data, including web pages, news articles, scientific literature, emails, enterprise documents, and social media such as blog articles, forum posts, product reviews, and tweets. This has led to an increasing demand for powerful software tools to help people analyze and manage vast amounts of text data effectively and efficiently. Unlike data generated by a computer system or sensors, text data are usually generated directly by humans, and are accompanied by semantically rich content. As such, text data are especially valuable for discovering knowledge about human opinions and preferences, in addition to many other kinds of knowledge that we encode in text. In contrast to structured data, which conform to well-defined schemas (thus are relatively easy for computers to handle), text has less explicit structure, requiring computer processing toward understanding of the content encoded in text. The current technology of natural language processing has not yet reached a point to enable a computer to precisely understand natural language text, but a wide range of statistical and heuristic approaches to analysis and management of text data have been developed over the past few decades. They are usually very robust and can be applied to analyze and manage text data in any natural language, and about any topic. This book provides a systematic introduction to all these approaches, with an emphasis on covering the most useful knowledge and skills required to build a variety of practically useful text information systems. The focus is on text mining applications that can help users analyze patterns in text data to extract and reveal useful knowledge. Information retrieval systems, including search engines and recommender systems, are also covered as supporting technology for text mining applications. The book covers the major concepts, techniques, and ideas in text data mining and information retrieval from a practical viewpoint, and includes many hands-on exercises designed with a companion software toolkit (i.e., MeTA) to help readers learn how to apply techniques of text mining and information retrieval to real-world text data and how to experiment with and improve some of the algorithms for interesting application tasks. The book can be used as a textbook for a computer science undergraduate course or a reference book for practitioners working on relevant problems in analyzing and managing text data.

An Architecture for Fast and General Data Processing on Large Clusters

Author: Matei Zaharia

Publisher: Morgan & Claypool

ISBN: 1970001577

Category: Computers

Page: 141

View: 7114


The past few years have seen a major change in computing systems, as growing data volumes and stalling processor speeds require more and more applications to scale out to clusters. Today, a myriad data sources, from the Internet to business operations to scientific instruments, produce large and valuable data streams. However, the processing capabilities of single machines have not kept up with the size of data. As a result, organizations increasingly need to scale out their computations over clusters. At the same time, the speed and sophistication required of data processing have grown. In addition to simple queries, complex algorithms like machine learning and graph analysis are becoming common. And in addition to batch processing, streaming analysis of real-time data is required to let organizations take timely action. Future computing platforms will need to not only scale out traditional workloads, but support these new applications too. This book, a revised version of the 2014 ACM Dissertation Award winning dissertation, proposes an architecture for cluster computing systems that can tackle emerging data processing workloads at scale. Whereas early cluster computing systems, like MapReduce, handled batch processing, our architecture also enables streaming and interactive queries, while keeping MapReduce's scalability and fault tolerance. And whereas most deployed systems only support simple one-pass computations (e.g., SQL queries), ours also extends to the multi-pass algorithms required for complex analytics like machine learning. Finally, unlike the specialized systems proposed for some of these workloads, our architecture allows these computations to be combined, enabling rich new applications that intermix, for example, streaming and batch processing. We achieve these results through a simple extension to MapReduce that adds primitives for data sharing, called Resilient Distributed Datasets (RDDs). We show that this is enough to capture a wide range of workloads. We implement RDDs in the open source Spark system, which we evaluate using synthetic and real workloads. Spark matches or exceeds the performance of specialized systems in many domains, while offering stronger fault tolerance properties and allowing these workloads to be combined. Finally, we examine the generality of RDDs from both a theoretical modeling perspective and a systems perspective. This version of the dissertation makes corrections throughout the text and adds a new section on the evolution of Apache Spark in industry since 2014. In addition, editing, formatting, and links for the references have been added.

Shared-Memory Parallelism Can be Simple, Fast, and Scalable

Author: Julian Shun

Publisher: Morgan & Claypool

ISBN: 1970001895

Category: Computers

Page: 426

View: 2629


Parallelism is the key to achieving high performance in computing. However, writing efficient and scalable parallel programs is notoriously difficult, and often requires significant expertise. To address this challenge, it is crucial to provide programmers with high-level tools to enable them to develop solutions easily, and at the same time emphasize the theoretical and practical aspects of algorithm design to allow the solutions developed to run efficiently under many different settings. This thesis addresses this challenge using a three-pronged approach consisting of the design of shared-memory programming techniques, frameworks, and algorithms for important problems in computing. The thesis provides evidence that with appropriate programming techniques, frameworks, and algorithms, shared-memory programs can be simple, fast, and scalable, both in theory and in practice. The results developed in this thesis serve to ease the transition into the multicore era. The first part of this thesis introduces tools and techniques for deterministic parallel programming, including means for encapsulating nondeterminism via powerful commutative building blocks, as well as a novel framework for executing sequential iterative loops in parallel, which lead to deterministic parallel algorithms that are efficient both in theory and in practice. The second part of this thesis introduces Ligra, the first high-level shared memory framework for parallel graph traversal algorithms. The framework allows programmers to express graph traversal algorithms using very short and concise code, delivers performance competitive with that of highly-optimized code, and is up to orders of magnitude faster than existing systems designed for distributed memory. This part of the thesis also introduces Ligra+, which extends Ligra with graph compression techniques to reduce space usage and improve parallel performance at the same time, and is also the first graph processing system to support in-memory graph compression. The third and fourth parts of this thesis bridge the gap between theory and practice in parallel algorithm design by introducing the first algorithms for a variety of important problems on graphs and strings that are efficient both in theory and in practice. For example, the thesis develops the first linear-work and polylogarithmic-depth algorithms for suffix tree construction and graph connectivity that are also practical, as well as a work-efficient, polylogarithmic-depth, and cache-efficient shared-memory algorithm for triangle computations that achieves a 2–5x speedup over the best existing algorithms on 40 cores. This is a revised version of the thesis that won the 2015 ACM Doctoral Dissertation Award.

Communities of Computing

Computer Science and Society in the ACM

Author: Thomas J. Misa

Publisher: Morgan & Claypool

ISBN: 1970001852

Category: Computers

Page: 422

View: 853


Communities of Computing is the first book-length history of the Association for Computing Machinery (ACM), founded in 1947 and with a membership today of 100,000 worldwide. It profiles ACM's notable SIGs, active chapters, and individual members, setting ACM's history into a rich social and political context. The book's 12 core chapters are organized into three thematic sections. "Defining the Discipline" examines the 1960s and 1970s when the field of computer science was taking form at the National Science Foundation, Stanford University, and through ACM's notable efforts in education and curriculum standards. "Broadening the Profession" looks outward into the wider society as ACM engaged with social and political issues - and as members struggled with balancing a focus on scientific issues and awareness of the wider world. Chapters examine the social turbulence surrounding the Vietnam War, debates about the women's movement, efforts for computing and community education, and international issues including professionalization and the Cold War. "Expanding Research Frontiers" profiles three areas of research activity where ACM members and ACM itself shaped notable advances in computing, including computer graphics, computer security, and hypertext. Featuring insightful profiles of notable ACM leaders, such as Edmund Berkeley, George Forsythe, Jean Sammet, Peter Denning, and Kelly Gotlieb, and honest assessments of controversial episodes, the volume deals with compelling and complex issues involving ACM and computing. It is not a narrow organizational history of ACM committees and SIGS, although much information about them is given. All chapters are original works of research. Many chapters draw on archival records of ACM's headquarters, ACM SIGs, and ACM leaders. This volume makes a permanent contribution to documenting the history of ACM and understanding its central role in the history of computing.

Data mining

praktische Werkzeuge und Techniken für das maschinelle Lernen

Author: Ian H. Witten,Eibe Frank

Publisher: N.A

ISBN: 9783446215337


Page: 386

View: 4466


Managing Gigabytes

Compressing and Indexing Documents and Images

Author: Ian H. Witten,Alistair Moffat,Timothy C. Bell

Publisher: Morgan Kaufmann

ISBN: 9781558605701

Category: Business & Economics

Page: 519

View: 2651


In this fully updated second edition of the highly acclaimed Managing Gigabytes, authors Witten, Moffat, and Bell continue to provide unparalleled coverage of state-of-the-art techniques for compressing and indexing data. Whatever your field, if you work with large quantities of information, this book is essential reading--an authoritative theoretical resource and a practical guide to meeting the toughest storage and access challenges. It covers the latest developments in compression and indexing and their application on the Web and in digital libraries. It also details dozens of powerful techniques supported by mg, the authors' own system for compressing, storing, and retrieving text, images, and textual images. mg's source code is freely available on the Web. * Up-to-date coverage of new text compression algorithms such as block sorting, approximate arithmetic coding, and fat Huffman coding * New sections on content-based index compression and distributed querying, with 2 new data structures for fast indexing * New coverage of image coding, including descriptions of de facto standards in use on the Web (GIF and PNG), information on CALIC, the new proposed JPEG Lossless standard, and JBIG2 * New information on the Internet and WWW, digital libraries, web search engines, and agent-based retrieval * Accompanied by a public domain system called MG which is a fully worked-out operational example of the advanced techniques developed and explained in the book * New appendix on an existing digital library system that uses the MG software

Data Mining, Southeast Asia Edition

Author: Jiawei Han,Jian Pei,Micheline Kamber

Publisher: Elsevier

ISBN: 9780080475585

Category: Computers

Page: 800

View: 3504


Our ability to generate and collect data has been increasing rapidly. Not only are all of our business, scientific, and government transactions now computerized, but the widespread use of digital cameras, publication tools, and bar codes also generate data. On the collection side, scanned text and image platforms, satellite remote sensing systems, and the World Wide Web have flooded us with a tremendous amount of data. This explosive growth has generated an even more urgent need for new techniques and automated tools that can help us transform this data into useful information and knowledge. Like the first edition, voted the most popular data mining book by KD Nuggets readers, this book explores concepts and techniques for the discovery of patterns hidden in large data sets, focusing on issues relating to their feasibility, usefulness, effectiveness, and scalability. However, since the publication of the first edition, great progress has been made in the development of new data mining methods, systems, and applications. This new edition substantially enhances the first edition, and new chapters have been added to address recent developments on mining complex types of data— including stream data, sequence data, graph structured data, social network data, and multi-relational data. A comprehensive, practical look at the concepts and techniques you need to know to get the most out of real business data Updates that incorporate input from readers, changes in the field, and more material on statistics and machine learning Dozens of algorithms and implementation examples, all in easily understood pseudo-code and suitable for use in real-world, large-scale data mining projects Complete classroom support for instructors at www.mkp.com/datamining2e companion site

Data Mining

Practical Machine Learning Tools and Techniques, Second Edition

Author: Ian H. Witten,Eibe Frank

Publisher: Elsevier

ISBN: 9780080477022

Category: Computers

Page: 560

View: 2820


Data Mining, Second Edition, describes data mining techniques and shows how they work. The book is a major revision of the first edition that appeared in 1999. While the basic core remains the same, it has been updated to reflect the changes that have taken place over five years, and now has nearly double the references. The highlights of this new edition include thirty new technique sections; an enhanced Weka machine learning workbench, which now features an interactive interface; comprehensive information on neural networks; a new section on Bayesian networks; and much more. This text is designed for information systems practitioners, programmers, consultants, developers, information technology managers, specification writers as well as professors and students of graduate-level data mining and machine learning courses. Algorithmic methods at the heart of successful data mining—including tried and true techniques as well as leading edge methods Performance improvement techniques that work by transforming the input or output

Algorithmen - Eine Einführung

Author: Thomas H. Cormen,Charles E. Leiserson,Ronald Rivest,Clifford Stein

Publisher: Walter de Gruyter GmbH & Co KG

ISBN: 3110522012

Category: Computers

Page: 1339

View: 9255


Der "Cormen" bietet eine umfassende und vielseitige Einführung in das moderne Studium von Algorithmen. Es stellt viele Algorithmen Schritt für Schritt vor, behandelt sie detailliert und macht deren Entwurf und deren Analyse allen Leserschichten zugänglich. Sorgfältige Erklärungen zur notwendigen Mathematik helfen, die Analyse der Algorithmen zu verstehen. Den Autoren ist es dabei geglückt, Erklärungen elementar zu halten, ohne auf Tiefe oder mathematische Exaktheit zu verzichten. Jedes der weitgehend eigenständig gestalteten Kapitel stellt einen Algorithmus, eine Entwurfstechnik, ein Anwendungsgebiet oder ein verwandtes Thema vor. Algorithmen werden beschrieben und in Pseudocode entworfen, der für jeden lesbar sein sollte, der schon selbst ein wenig programmiert hat. Zahlreiche Abbildungen verdeutlichen, wie die Algorithmen arbeiten. Ebenfalls angesprochen werden Belange der Implementierung und andere technische Fragen, wobei, da Effizienz als Entwurfskriterium betont wird, die Ausführungen eine sorgfältige Analyse der Laufzeiten der Programme mit ein schließen. Über 1000 Übungen und Problemstellungen und ein umfangreiches Quellen- und Literaturverzeichnis komplettieren das Lehrbuch, dass durch das ganze Studium, aber auch noch danach als mathematisches Nachschlagewerk oder als technisches Handbuch nützlich ist. Für die dritte Auflage wurde das gesamte Buch aktualisiert. Die Änderungen sind vielfältig und umfassen insbesondere neue Kapitel, überarbeiteten Pseudocode, didaktische Verbesserungen und einen lebhafteren Schreibstil. So wurden etwa - neue Kapitel zu van-Emde-Boas-Bäume und mehrfädigen (engl.: multithreaded) Algorithmen aufgenommen, - das Kapitel zu Rekursionsgleichungen überarbeitet, sodass es nunmehr die Teile-und-Beherrsche-Methode besser abdeckt, - die Betrachtungen zu dynamischer Programmierung und Greedy-Algorithmen überarbeitet; Memoisation und der Begriff des Teilproblem-Graphen als eine Möglichkeit, die Laufzeit eines auf dynamischer Programmierung beruhender Algorithmus zu verstehen, werden eingeführt. - 100 neue Übungsaufgaben und 28 neue Problemstellungen ergänzt. Umfangreiches Dozentenmaterial (auf englisch) ist über die Website des US-Verlags verfügbar.

Handbook of Neural Computation

Author: Pijush Samui,Sanjiban Sekhar Roy,Valentina E. Balas

Publisher: Academic Press

ISBN: 0128113197

Category: Technology & Engineering

Page: 658

View: 426


Handbook of Neural Computation explores neural computation applications, ranging from conventional fields of mechanical and civil engineering, to electronics, electrical engineering and computer science. This book covers the numerous applications of artificial and deep neural networks and their uses in learning machines, including image and speech recognition, natural language processing and risk analysis. Edited by renowned authorities in this field, this work is comprised of articles from reputable industry and academic scholars and experts from around the world. Each contributor presents a specific research issue with its recent and future trends. As the demand rises in the engineering and medical industries for neural networks and other machine learning methods to solve different types of operations, such as data prediction, classification of images, analysis of big data, and intelligent decision-making, this book provides readers with the latest, cutting-edge research in one comprehensive text. Features high-quality research articles on multivariate adaptive regression splines, the minimax probability machine, and more Discusses machine learning techniques, including classification, clustering, regression, web mining, information retrieval and natural language processing Covers supervised, unsupervised, reinforced, ensemble, and nature-inspired learning methods

Moving Objects Databases

Author: Ralf Hartmut Güting,Markus Schneider

Publisher: Academic Press

ISBN: 0120887991

Category: Computers

Page: 389

View: 4536


First uniform treatment of moving objects databases, the technology that supports GPS and RFID data analysis.

Learning from Multiple Social Networks

Author: Liqiang Nie,Xuemeng Song,Tat-Seng Chua

Publisher: Morgan & Claypool Publishers

ISBN: 1627059865

Category: Computers

Page: 118

View: 1313


With the proliferation of social network services, more and more social users, such as individuals and organizations, are simultaneously involved in multiple social networks for various purposes. In fact, multiple social networks characterize the same social users from different perspectives, and their contexts are usually consistent or complementary rather than independent. Hence, as compared to using information from a single social network, appropriate aggregation of multiple social networks offers us a better way to comprehensively understand the given social users. Learning across multiple social networks brings opportunities to new services and applications as well as new insights on user online behaviors, yet it raises tough challenges: (1) How can we map different social network accounts to the same social users? (2) How can we complete the item-wise and block-wise missing data? (3) How can we leverage the relatedness among sources to strengthen the learning performance? And (4) How can we jointly model the dual-heterogeneities: multiple tasks exist for the given application and each task has various features from multiple sources? These questions have been largely unexplored to date. We noticed this timely opportunity, and in this book we present some state-of-the-art theories and novel practical applications on aggregation of multiple social networks. In particular, we first introduce multi-source dataset construction. We then introduce how to effectively and efficiently complete the item-wise and block-wise missing data, which are caused by the inactive social users in some social networks. We next detail the proposed multi-source mono-task learning model and its application in volunteerism tendency prediction. As a counterpart, we also present a mono-source multi-task learning model and apply it to user interest inference. We seamlessly unify these models with the so-called multi-source multi-task learning, and demonstrate several application scenarios, such as occupation prediction. Finally, we conclude the book and figure out the future research directions in multiple social network learning, including the privacy issues and source complementarity modeling. This is preliminary research on learning from multiple social networks, and we hope it can inspire more active researchers to work on this exciting area. If we have seen further it is by standing on the shoulders of giants.

Ergonomie und Mensch-Maschine-Systeme

Author: Ludger Schmidt,Christopher M. Schlick,Jürgen Grosche

Publisher: Springer Science & Business Media

ISBN: 354078330X

Category: Business & Economics

Page: 462

View: 7220


Festschrift der Abteilung Ergonomie und Führungssysteme des Forschungsinstituts für Kommunikation, Informationsverarbeitung und Ergonomie anlässlich 40 Jahre Ergonomie in der Forschungsgesellschaft für Angewandte Naturwissenschaften. 1967 wurde die Forschungsgruppe Anthropotechnik und Flugmesstechnik, die zuvor an der TU Berlin tätig war, in die Gesellschaft zur Förderung der astrophysikalischen Forschung e. V. eingegliedert. Zwei Jahre später erfolgte mit einer Erweiterung des Aufgabenspektrums die Gründung des Forschungsinstituts für Anthropotechnik (FAT). Aus dem FAT und zwei weiteren Instituten ging schließlich 1999 das FGAN Forschungsinstitut für Kommunikation, Informationsverarbeitung und Ergonomie (FKIE) hervor, in dem das Arbeitsspektrum des bisherigen FAT nun von der Abteilung Ergonomie und Führungssysteme abgedeckt wurde. Heute arbeiten in der Abteilung über 60 Mitarbeiter/-innen aus verschiedenen Ingenieurwissenschaften, der Informatik, Psychologie, Biologie, Mathematik, Physik u. a. interdisziplinär zusammen. In der Abteilung Ergonomie und Führungssysteme werden Konzepte, Methoden und Werkzeuge zur benutzerzentrierten Gestaltung von Führungs- und Führungsinformationssysteme erforscht, entwickelt und angewandt. Aufbauend auf ergonomischen Anforderungsanalysen werden innovative Mensch-Maschine-Schnittstellen konzipiert, in Form von Prototypen realisiert und hinsichtlich ihrer nutzergerechten Gestaltung in Feld- und Laborstudien evaluiert. Grobgliederung anhand der Schwerpunkte von Forschung und Entwicklung: - Gestaltung und Bewertung von Mensch-Maschine-Systemen, - 3D-Visualisierung und Interaktion, - Führung unbemannter Robotersysteme, - Methoden zur ergonomischen Bewertung. In ca. 25 wissenschaftlichen Beiträgen wird in diesem Herausgeberwerk das Spektrum der Arbeiten im Bereich Ergonomie und Mensch-Maschine-Systeme dargestellt. Autoren sind aktive und ehemalige wissenschaftliche Mitarbeiter und Professoren des Institutes.

CIKM 2003

proceedings of the Twelfth ACM International Conference on Information & Knowledge Management : November 3-8, 2003, New Orleans, Louisiana, USA

Author: Association for Computing Machinery

Publisher: N.A

ISBN: 9781581137231

Category: Database management

Page: 578

View: 2699