# How to become a data scientist in 3 easy steps

by Seth Kenlon

Data science is an exciting new field in computing, built around analyzing, visualizing, correlating, and interpreting the boundless amounts of information we have our computers collecting about the world. Of course, calling it a "new" field is a little disingenuous, because the discipline is a derivative of statistics, data analysis, and plain old obsessive scientific observation. But data science is a formalized branch of these disciplines, with processes and tools all its own, and it can be broadly applied across disciplines (such as visual effects) that had never produced big dumps of unmanageable data before. Data science is a new opportunity to take a fresh look at data from oceanography, meteorology, geography, cartography, biology, medicine and health, and entertainment industries, and gain a better understanding of patterns, influences, and causality.

Like other big and seemingly all-inclusive fields, it can be intimidating to know where to start exploring data science. There are a lot of resources out there to help data scientists use their favourite programming languages to accomplish their goals, and that includes one of the most popular languages out there: Python. Using the [Pandas](https://pandas.pydata.org/), [Matplotlib](https://matplotlib.org/), and [Seaborn](https://seaborn.pydata.org/index.html) libraries, you can learn the basic toolset of data science.

If you're not familiar with the basics of Python yet, read my [introduction to Python](https://opensource.com/article/17/10/python-101) before continuing.

## How to create a Python virtual environment

Programmers sometimes forget which libraries they have installed on their own development machine, which can lead to shipping code that worked on their computer but fails on all others for lack of a library. Python has a system designed to avoid this manner of unpleasant surprise: the virtual environment.
A virtual environment intentionally ignores all the Python libraries you have installed, effectively forcing you to begin development with nothing more than a stock install of Python itself.

To create a virtual environment with ``venv``, invent a name for your environment (this article uses ``example``) and then create it:

```
$ python3 -m venv example
```

Source the ``activate`` file in the environment's ``bin`` directory to activate it:

```
$ source ./example/bin/activate
(example) $
```

You are now "in" your virtual environment, a clean slate upon which you can build custom solutions to problems, with the added burden of consciously installing required libraries.

## Installing Pandas and NumPy

The first libraries you must install are Pandas and NumPy. These libraries are common in data science, so this won't be the last time you'll install them. These, of course, aren't the only libraries you'll ever need in data science, but they're a good start.

Pandas is an open source, BSD-licensed library that makes it easy to process data structures for analysis. It depends upon NumPy, a scientific library providing multi-dimensional arrays, linear algebra, Fourier transforms, and much more. Install both using ``pip3``:

```
(example) $ pip3 install pandas
```

Installing Pandas also installs NumPy, so there's no need to specify both. After you have installed these in a virtual environment once, the install packages are cached so that when you install them again, they don't actually have to be downloaded from the Internet.

Those are the only libraries necessary for now. Next, you need some sample data.

## Generating a sample data set

Data science is all about data, and luckily there are lots of free and open data sets available from scientific, computing, and government organizations. While these data sets are a great resource for education, they've got a lot more data than necessary for this simple example.
You can create a small, manageable sample data set quickly with Python:

```
#!/usr/bin/env python3

import random

def rgb():
    # return a random colour channel value between 0 and 1
    NUMBER = random.randint(0,255)/255
    return NUMBER

# the with statement closes the file once all rows are written
with open('sample.csv','w') as FILE:
    FILE.write('"red","green","blue"')
    for COUNT in range(10):
        FILE.write('\n{:0.2f},{:0.2f},{:0.2f}'.format(rgb(),rgb(),rgb()))
```

This produces a file called ``sample.csv`` consisting of randomly generated floats representing, in this example, RGB values (a commonly tracked value, among hundreds, in visual effects). You can use a ``.csv`` file as a data source for Pandas.

## Ingesting data with Pandas

One of the basic features of Pandas is its ability to ingest data and process it without you, the programmer, writing new functions just to parse input. If you're used to applications that do that automatically, this might not seem like it's very special, but imagine opening a CSV in [LibreOffice](http://libreoffice.org) and having to write formulas to split the values at each comma. Pandas shields you from low-level operations like that. Here's some simple code to ingest and then print out a file of comma-separated values:

```
#!/usr/bin/env python3

from pandas import read_csv, DataFrame
import pandas as pd

FILE = open('sample.csv','r')
DATAFRAME = pd.read_csv(FILE)
print(DATAFRAME)
```

The first few lines import components of the Pandas library. The Pandas library is extensive, so you'll refer to its documentation frequently when looking for functions beyond the basic ones in this article.

Next, a variable ``FILE`` is created by opening the ``sample.csv`` file you created. That variable is passed to the Pandas function ``read_csv`` (called here through the ``pd`` alias) to create a *dataframe*. In Pandas, a dataframe is a two-dimensional array, commonly thought of as a table. Once your data is in a dataframe, you can manipulate it by column and row, you can query it for ranges, and much more. The sample code, for now, just prints the dataframe to the terminal.

Run the code.
Your output differs slightly from this sample output, because the numbers are randomly generated, but the format is the same:

```
(example) $ python3 ./parse.py
    red  green  blue
0  0.31   0.96  0.47
1  0.95   0.17  0.64
2  0.00   0.23  0.59
3  0.22   0.16  0.42
4  0.53   0.52  0.18
5  0.76   0.80  0.28
6  0.68   0.69  0.46
7  0.75   0.52  0.27
8  0.53   0.76  0.96
9  0.01   0.81  0.79
```

Now assume you need only the red values of your data set. You can do this by declaring the column names of your dataframe, and then selectively printing only the column you're interested in:

```
from pandas import read_csv, DataFrame
import pandas as pd

FILE = open('sample.csv','r')
DATAFRAME = pd.read_csv(FILE)

# define columns
DATAFRAME.columns = [ 'red','green','blue' ]

print(DATAFRAME['red'])
```

Run the code now, and you get just the red column:

```
(example) $ python3 ./parse.py
0    0.31
1    0.95
2    0.00
3    0.22
4    0.53
5    0.76
6    0.68
7    0.75
8    0.53
9    0.01
Name: red, dtype: float64
```

Manipulating tables of data is a great way to get used to how data can be parsed with Pandas. There are many more ways to select data from a dataframe, and the more you experiment, the more natural it becomes.

## Visualization

It's no secret that many humans prefer to visualize information. It's the reason charts and graphs are staples of meetings with upper management, and why "infographics" are popular in mainstream news. Part of a data scientist's job is to help others understand large samples of data, and so there are libraries to help with that task. Combining Pandas with a visualization library can produce visual interpretations of your data. One popular open source library for visualization is [Seaborn](https://seaborn.pydata.org/), which is in turn based on the open source [Matplotlib](https://matplotlib.org/).

### Installing Seaborn and Matplotlib

Your Python virtual environment doesn't yet have Seaborn and Matplotlib installed, so first you must install them with ``pip3``.
Installing Seaborn also installs Matplotlib (along with several other dependencies):

```
(example) $ pip3 install seaborn
```

For Matplotlib to display graphics, you must also have PyGObject and PyCairo installed. This involves compiling code, which ``pip3`` can do for you as long as you have the necessary header files and libraries installed. Your Python virtual environment has no awareness of these support libraries, so you can execute the install command inside or outside of the environment.

On Fedora and CentOS:

```
(example) $ sudo dnf install -y gcc zlib-devel bzip2 bzip2-devel readline-devel \
sqlite sqlite-devel openssl-devel tk-devel git python3-cairo-devel \
cairo-gobject-devel gobject-introspection-devel
```

On Ubuntu and Debian:

```
(example) $ sudo apt install -y libgirepository1.0-dev build-essential \
libbz2-dev libreadline-dev libssl-dev zlib1g-dev libsqlite3-dev wget \
curl llvm libncurses5-dev libncursesw5-dev xz-utils tk-dev libcairo2-dev
```

Once those are installed, you can install the GUI components needed by Matplotlib:

```
(example) $ pip3 install PyGObject pycairo
```

## Displaying a graph with Seaborn and Matplotlib

Open a file called ``vizualize.py`` in your favourite text editor. To create a line graph visualization of your data, you must first import the necessary Python modules. You've already used the Pandas modules in the previous code examples:

```
#!/usr/bin/env python3

from pandas import read_csv, DataFrame
import pandas as pd
```

Next, import Seaborn, Matplotlib, and several components of Matplotlib so you can configure the graphics you produce:

```
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import rcParams
```

Matplotlib can export its output to many formats, including PDF, SVG, or just a GUI window on your desktop. For this example, it makes most sense to output to the desktop, so you must set the Matplotlib backend to GTK3Agg.
If you're not using Linux, then you may need to use the TkAgg backend instead.

After setting the backend for the GUI window, set the size of the window and the Seaborn preset style:

```
matplotlib.use('GTK3Agg')
rcParams['figure.figsize'] = 11,8
sns.set_style('darkgrid')
```

Now that your display is configured, the next part of the code is familiar. Ingest your ``sample.csv`` file with Pandas, and define the columns of your dataframe:

```
FILE = open('sample.csv','r')
DATAFRAME = pd.read_csv(FILE)
DATAFRAME.columns = [ 'red','green','blue' ]
```

With the data in a useful format, you can plot it out in a graph. Use each column as input for a plot, and then use ``plt.show()`` to draw the graph in a GUI window. The ``plt.legend()`` function associates the column header with each line on your graph (the ``bbox_to_anchor`` and ``loc`` parameters together place the legend outside the chart rather than over it).

```
# draw one line per column
for i in DATAFRAME.columns:
    DATAFRAME[i].plot()

plt.legend(bbox_to_anchor=(1, 1), loc=2, borderaxespad=1)
plt.show()
```

Run the code to display the results.

![Your first graph](seaborn-matplotlib-graph_0.png)

Your graph accurately displays all the information contained in your CSV file: values are on the Y-axis, index numbers are on the X-axis, and the lines of the graph are identified so you know what they represent. However, since in this case the code is tracking colour values (or at least, it's pretending to), the colours of the lines are not just non-intuitive, but counter-intuitive. If you're never tasked with analyzing colour data, you may never run into this problem, but you're sure to run into something analogous. When visualizing data, you must consider the best way to present it so that the viewer doesn't extrapolate false information from what you're presenting.
To fix this particular problem, and to demonstrate some of the customization available, this final code sample assigns each plotted line a specific colour:

```
import matplotlib
from pandas import read_csv, DataFrame
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rcParams

matplotlib.use('GTK3Agg')
rcParams['figure.figsize'] = 11,8
sns.set_style('whitegrid')

FILE = open('sample.csv','r')
DATAFRAME = pd.read_csv(FILE)
DATAFRAME.columns = [ 'red','green','blue' ]

# solid lines, one per column, in the matching colour
plt.plot(DATAFRAME['red'],'r-')
plt.plot(DATAFRAME['green'],'g-')
plt.plot(DATAFRAME['blue'],'b-')
# dots marking each data point
plt.plot(DATAFRAME['red'],'ro')
plt.plot(DATAFRAME['green'],'go')
plt.plot(DATAFRAME['blue'],'bo')

plt.show()
```

This uses Matplotlib's format-string notation to create two plots per column. The initial plot of each column is assigned a colour (``r`` for red, ``g`` for green, and ``b`` for blue). These are built-in Matplotlib settings. The ``-`` notation indicates a solid line (double dashes, such as ``r--``, create a dashed line). A second plot is created for each column, with the same colours but using ``o`` to denote dots or nodes. To demonstrate built-in Seaborn themes, the value of ``sns.set_style`` has changed to ``whitegrid``.

![An improved graph](seaborn-matplotlib-graph_1.png)

## Deactivating your virtual environment

When you're finished exploring Pandas and plotting, you can deactivate your Python virtual environment with the ``deactivate`` command:

```
(example) $ deactivate
$
```

When you want to get back to it, just reactivate it as shown at the start of the article. The modules you installed stay in the environment, so you don't have to reinstall them when you reactivate it.

## Endless possibilities

The true power of Pandas, Matplotlib, Seaborn, and indeed data science itself, is the limitless potential for you to parse, interpret, and structure data in a meaningful and enlightening way.
Your next step is to explore simple data sets with the new tools you've experienced in this article. There's a lot more to Matplotlib and Seaborn than just line graphs, so try creating a bar graph, a pie chart, or something else entirely. The possibilities are limitless once you understand your toolset, and have some idea of how to correlate the data you're faced with. Data science is a new way to find stories hidden within data; let open source be your medium.
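
As one possible starting point for the bar graph suggested above, here is a minimal sketch that averages each colour column and draws one bar per column. It is an illustration under a few assumptions, not part of the original walkthrough: the data is the ten rows from the sample output earlier in this article (embedded as text so the script runs without ``sample.csv``), the ``Agg`` backend is used so the chart is written to a PNG file instead of a GUI window, and the names ``load_sample``, ``save_bar_chart``, and ``channel-means.png`` are invented for this example.

```python
#!/usr/bin/env python3
import io
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # assumption: render to a file, so no GUI toolkit is needed
import matplotlib.pyplot as plt
import pandas as pd

# The ten rows from the sample output shown earlier in this article.
SAMPLE_CSV = '''"red","green","blue"
0.31,0.96,0.47
0.95,0.17,0.64
0.00,0.23,0.59
0.22,0.16,0.42
0.53,0.52,0.18
0.76,0.80,0.28
0.68,0.69,0.46
0.75,0.52,0.27
0.53,0.76,0.96
0.01,0.81,0.79
'''

def load_sample(text):
    """Hypothetical helper: read CSV text into a dataframe with read_csv."""
    return pd.read_csv(io.StringIO(text))

def save_bar_chart(means, path):
    """Hypothetical helper: draw one bar per column and save it as an image."""
    fig, ax = plt.subplots(figsize=(6, 4))
    # The column names happen to be valid Matplotlib colour names,
    # so each bar is drawn in the colour it represents.
    ax.bar(list(means.index), list(means.values), color=list(means.index))
    ax.set_ylabel('mean value')
    fig.savefig(path)
    plt.close(fig)

DATAFRAME = load_sample(SAMPLE_CSV)

# Average of every column, returned as a Series indexed by column name.
MEANS = DATAFRAME.mean()

# Query for a range: keep only the rows whose red value is above 0.5.
BRIGHT_RED = DATAFRAME[DATAFRAME['red'] > 0.5]

CHART = os.path.join(tempfile.gettempdir(), 'channel-means.png')
save_bar_chart(MEANS, CHART)

print(MEANS)
print(BRIGHT_RED)
```

To run this against your own data, replace ``load_sample(SAMPLE_CSV)`` with ``pd.read_csv('sample.csv')``. If you would rather see the chart in a window, switch the backend back to the one you used earlier and call ``plt.show()`` instead of saving the figure.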