Data Science at the Command Line

Obtain, Scrub, Explore, and Model Data with Unix Power Tools

by Jeroen Janssens

Book Description

This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 100 Unix power tools - useful whether you work with Windows, macOS, or Linux. You'll quickly discover why the command line is an agile, scalable, and extensible technology. Even if you're comfortable processing data with Python or R, you'll learn how to greatly improve your data science workflow by leveraging the command line's power. This book is ideal for data scientists, analysts, engineers, system administrators, and researchers.

- Obtain data from websites, APIs, databases, and spreadsheets;
- Perform scrub operations on text, CSV, HTML, XML, and JSON files;
- Explore data, compute descriptive statistics, and create visualizations;
- Manage your data science workflow;
- Create your own tools from one-liners and existing Python or R code;
- Parallelize and distribute data-intensive pipelines;
- Model data with dimensionality reduction, regression, and classification algorithms;
- Leverage the command line from Python, Jupyter, R, RStudio, and Apache Spark.

This open book is licensed under a Creative Commons License (CC BY-NC-ND). A free PDF download is not available, but you can read Data Science at the Command Line online for free.

Table of Contents

Chapter 1: Introduction
Chapter 2: Getting Started
Chapter 3: Obtaining Data
Chapter 4: Creating Command-line Tools
Chapter 5: Scrubbing Data
Chapter 6: Project Management with Make
Chapter 7: Exploring Data
Chapter 8: Parallel Pipelines
Chapter 9: Modeling Data
Chapter 10: Polyglot Data Science
Chapter 11: Conclusion

Book Details

Title: Data Science at the Command Line
Subject: Computer Science
Publisher: O'Reilly Media
Published: 2021
Pages: 282
Edition: 2
Language: English
ISBN13 Digital: 9781492087915
ISBN10 Digital: 1492087912
License: CC BY-NC-ND

Related Books

The Data Journalism Handbook
When you combine the sheer scale and range of digital information now available with a journalist's "nose for news" and her ability to tell a compelling story, a new world of possibility opens up. With The Data Journalism Handbook, you'll explore the potential, limits, and applied uses of this new and fascinating field. This ...
Data Science with Microsoft SQL Server 2016
R is one of the most popular, powerful data analytics languages and environments in use by data scientists. Actionable business data is often stored in Relational Database Management Systems (RDBMS), and one of the most widely used RDBMS is Microsoft SQL Server. Much more than a database server, it's a rich ecostructure with advanced analytic ...
What Is Data Science?
We've all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in What is Web 2.0, Tim O'Reilly said that "data is the next Intel Inside." But what does that statement mean? Why do we suddenly care about statistics and about data? This report examines the many sides of data science - the technologi...
The Data Science Design Manual
This engaging and clearly written textbook/reference provides a must-have introduction to the rapidly emerging interdisciplinary field of data science. It focuses on the principles fundamental to becoming a good data scientist and the key skills needed to build systems for collecting, analyzing, and interpreting data. The Data Science Design Manual...
The Linux Command Line
The Linux Command Line takes you from your very first terminal keystrokes to writing full programs in Bash, the most popular Linux shell (or command line). Along the way you'll learn the timeless skills handed down by generations of experienced, mouse-shunning gurus: file navigation, environment configuration, command chaining, pattern matchin...
Data Journeys in the Sciences
This groundbreaking, open volume analyses and compares data practices across several fields through the analysis of specific cases of data journeys. It brings together leading scholars in the philosophy, history and social studies of science to achieve two goals: tracking the travel of data across different spaces, times and domains of research pra...