# What is data science?

Data science is a branch of computer science dealing with capturing, processing, and analyzing data in order to gain new insights about the systems being studied. Data scientists deal with vast amounts of information from different sources and in different contexts, so the processing they must do is usually unique to each study, utilizing custom algorithms, artificial intelligence (AI), machine learning, and human interpretation. It's a broad field that's expanding rapidly across many industries, including medicine, astronomy, meteorology, marketing, sociology, visual effects, and much more.

## Why is data science important?

Science is based on gathering evidence and interpreting that evidence in order to draw logical conclusions. This principle has served civilization well enough to enable transatlantic flights, telephony, treatments for diseases, landing rovers on the surface of Mars, and much more.

In the modern world, there's a proliferation of data being gathered. There's data about lifestyle habits, dietary preferences, music choices, purchasing habits, energy consumption, weather systems, migratory patterns, seismic activity, flight times, and so much more. Computers are everywhere, so there's almost constant input into a pool of big data. That's more information about the world around us than we've ever had access to before, and it's spread across a wider sample set than ever.

Analyzing large data sets can lead to surprising revelations. Sometimes patterns and correlations are found in places not previously expected, or in places that had only been theorized. Observing and analyzing the environment is important for humans to learn and grow as a better-informed species.

A lot of data science is applied to frivolous pursuits, and sometimes even to ethically questionable ones, but there's just as much analysis happening around worthwhile, healthy, and helpful causes that open source should be proud to support.
And as it turns out, open source software is vital to the growth and development of data science.

## Infrastructure

Because of the vast amount of data that data science analyzes, it's a field that requires a solid computing infrastructure. The data sets involved in serious data science are often too large to process on a single machine or even a small cluster, so hybrid clouds are used to store and process the information, and to make correlations among what's been parsed.

This means that a data scientist's toolbox includes a platform like [OpenShift](http://openshift.io) for running processing services, distributed computing software like Apache [Hadoop](https://hadoop.apache.org/) or Apache [Spark](https://spark.apache.org/), a distributed file system like [Ceph](http://ceph.io) or [Gluster](https://www.gluster.org/) for scalable and highly available storage, and so on. A data scientist's job is as much statistics and math as it is programming and computer engineering.

## What does a data scientist do?

A data scientist gathers data, parses and normalizes it, and then creates routines for a computer to run on the data in search of a pattern, a trend, or just a helpful visualization. For instance, if you have ever created a pie chart or bar graph from the fields of a spreadsheet, then you've acted as a low-level data scientist by interpreting a data set and visualizing it to help others understand it.

When data is being analyzed for patterns, there's no way to tell a computer what to look for (because it hasn't been found yet). While AI and machine learning can scrub vast data sets to find arbitrary patterns, it takes human ingenuity to look for the irrational and to interpret what's found.

That means a data scientist must be able to design custom routines with programming languages like [Python](https://opensource.com/tags/python), [R](https://opensource.com/health/12/7/join-m-revolution-m-and-r-programming-languages), and Scala.
They must be familiar with important libraries, like Beautiful Soup, NumPy, and Pandas, so that they can scrape data, sanitize it, and organize it. They need to be able to version control and iterate upon their code so that the way they look at data matures and develops as they continue to understand the relationships they discover.

## How to start learning data science

Data science is a career, so you don't learn all of data science in a year or two of study and then call yourself a data scientist. Instead, you start studying now, maybe on your own or maybe through formalized training, and then you apply what you've learned in a real-world situation. You repeat that process until you either solve all of the world's problems or retire.

Fortunately, data science is largely driven by open source software that is freely available to everyone. A good first step is to [try a Linux distribution](https://opensource.com/article/19/7/ways-get-started-linux), which can serve as a good platform for your work. Linux is an open source operating system, so it's not only free to use but also uncommonly flexible, making it ideal for a field known for its constant need to adapt.

Most Linux distributions also ship with Python, which is a leading language in data science today. The [NumPy](https://numpy.org/) and [Pandas](https://pandas.pydata.org/pandas-docs/stable/) libraries are specifically designed for number crunching and data analytics, and their documentation is very thorough.

As is often the case, though, one of the greatest struggles when learning a new language or library is finding a way to apply the tools to something in your life. Unlike in many other disciplines, there are no wrong answers in data science. You can apply the principles of data science to *any* set of data. At worst, you'll discover that there's no correlation between two sets of data, or that there's no pattern in a seemingly random event.
But that's valid research, so not only will you have learned about data science, you'll also have proven or disproven a hypothesis.

Thanks to the influence of open source itself, open data sets are easy to find. There are data sets available on [data.gov](https://www.data.gov/), [Worldbank.org](https://data.worldbank.org/), [Google](https://cloud.google.com/public-datasets/) (including data from NASA, GitHub, the US Census, and others), and many more. These are excellent resources for you to learn how to scrape the web for data, how to parse it into a format you can easily process, and how to analyze it with specialized libraries.

## Why use Python for data science?

You can use several different languages for data science, but Python is one of the most popular. Nearly any language is capable of analyzing data, but some languages and libraries are designed with certain expectations; for instance, the NumPy library provides tools for processing matrices so that you don't have to write a matrix library on your own.

Python, as a language, has a few advantages over many others. First, Python is famous for being relatively easy to read. While Python code may not make sense to someone completely unfamiliar with computer programming, it tends to be easier to parse than, say, C or C++. That means Python is easier for other people to reuse, because they can read your code to become confident that it does what it claims to do, and they may even be able to add to it.

Furthermore, Python has several strong purpose-built libraries geared specifically toward data science. Tasks that data scientists need to perform often are already provided by Python's data science libraries, which has earned the language a rightful place as a leading one in the field.

All the other benefits of Python apply, such as the convenience of the `pip` package manager, the robust `venv` virtual environment interface, an interactive shell, and so on.

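
To make the readability point concrete, here is a minimal sketch that parses a small data set and checks two columns for a correlation, using only Python's standard library. The data values and the column names (`hours_sunlight`, `ice_cream_sales`) are invented purely for illustration, and the `load_columns` and `pearson` helpers are hypothetical names written for this example. In real work, a library such as NumPy or Pandas would typically replace the hand-written correlation function.

```python
import csv
import io
import math

# Hypothetical sample data, invented for this example only.
RAW_DATA = """hours_sunlight,ice_cream_sales
2,110
4,190
6,320
8,400
10,510
"""


def load_columns(text):
    """Parse CSV text into a dict mapping each column name to a list of floats."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value))
    return columns


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of each pair's deviations from the means.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return covariance / (spread_x * spread_y)


columns = load_columns(RAW_DATA)
r = pearson(columns["hours_sunlight"], columns["ice_cream_sales"])
print(f"correlation: {r:.3f}")
```

A result close to 1 or -1 suggests a strong linear relationship between the two columns, while a result near 0 suggests none. Either outcome is a valid finding, and neither one says anything about cause and effect on its own.
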
## Data science and the future

As computers continue to proliferate, available data grows. If you're the sort who wants to understand how the world works, there's no better way to start than data science. And whatever you do in data science, remember to keep it open so that everyone benefits.