Data Science

Data Science Tutorial

Today, data rules the world, and this has created a huge demand for Data Scientists. A Data Scientist helps companies make data-driven decisions that improve their business.

Learning by Examples

This tutorial teaches through short Python examples that you can run and edit yourself.

Example

    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy import stats

    # Load the health data; the first row holds the column names
    full_health_data = pd.read_csv("data.csv", header=0, sep=",")

    x = full_health_data["Average_Pulse"]
    y = full_health_data["Calorie_Burnage"]

    # Fit a straight line (linear regression) through the points
    slope, intercept, r, p, std_err = stats.linregress(x, y)

    def myfunc(x):
        return slope * x + intercept

    # Predicted calorie burnage for every observed average pulse
    mymodel = list(map(myfunc, x))
    print(mymodel)

    plt.scatter(x, y)
    plt.plot(x, mymodel)
    plt.ylim(0, 2000)
    plt.xlim(0, 200)
    plt.xlabel("Average_Pulse")
    plt.ylabel("Calorie_Burnage")
    plt.show()

Data Science Introduction

Data Science combines multiple disciplines, using statistics, data analysis, and machine learning to analyze data and to extract knowledge and insights from it.

What is Data Science?

Data Science is about data gathering, analysis, and decision-making. It is about finding patterns in data through analysis and using them to make predictions.

By using Data Science, companies are able to make:

- Better decisions (should we choose A or B?)
- Predictive analysis (what will happen next?)
- Pattern discoveries (find patterns, or maybe hidden information, in the data)

Where is Data Science Needed?

Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare, and manufacturing.

Examples of where Data Science is needed:

- For route planning: to discover the best routes to ship
- To foresee delays for flight/ship/train etc.
  (through predictive analysis)
- To create promotional offers
- To find the best-suited time to deliver goods
- To forecast next year's revenue for a company
- To analyze the health benefits of training
- To predict who will win elections

Data Science can be applied in nearly every part of a business where data is available. Examples are:

- Consumer goods
- Stock markets
- Industry
- Politics
- Logistics companies
- E-commerce

How Does a Data Scientist Work?

A Data Scientist requires expertise in several fields:

- Machine Learning
- Statistics
- Programming (Python or R)
- Mathematics
- Databases

A Data Scientist must find patterns within the data. Before they can find the patterns, they must organize the data in a standard format.

Here is how a Data Scientist works:

1. Ask the right questions - To understand the business problem.
2. Explore and collect data - From databases, web logs, customer feedback, etc.
3. Extract the data - Transform the data to a standardized format.
4. Clean the data - Remove erroneous values from the data.
5. Find and replace missing values - Check for missing values and replace them with a suitable value (e.g. an average value).
6. Normalize data - Scale the values to a practical range (e.g. 140 cm is shorter than 1.8 m, yet the number 140 is larger than 1.8, so scaling is important).
7. Analyze data, find patterns, and make future predictions.
8. Represent the result - Present the result with useful insights in a way the "company" can understand.

Where to Start?

In this tutorial, we will start by presenting what data is and how data can be analyzed.

You will learn how to use statistics and mathematical functions to make predictions.

Data Science - What is Data?

What is Data?

Data is a collection of information. One purpose of Data Science is to structure data, making it interpretable and easy to work with.

Data can be categorized into two groups:

- Structured data
- Unstructured data

Unstructured Data

Unstructured data is not organized. We must organize the data for analysis purposes.
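
As a rough illustration of what organizing unstructured data can look like, the sketch below pulls numbers out of free-text training notes and stores them as uniform records. The notes, the parse_note() helper, and the regular expressions are all made up for this illustration; real unstructured data usually needs far more careful handling.

```python
import re

# Hypothetical free-text training notes (invented for this illustration).
notes = [
    "Trained 30 min, average pulse 80, burned 240 calories",
    "45 min session today. Pulse averaged 90 and I burned 260 calories",
    "Felt tired, skipped training",
]

def parse_note(note):
    """Return one structured record, or None if a value is missing."""
    duration = re.search(r"(\d+)\s*min", note)
    pulse = re.search(r"pulse\D*?(\d+)", note, flags=re.IGNORECASE)
    calories = re.search(r"(\d+)\s*calories", note)
    if not (duration and pulse and calories):
        return None
    return {
        "Duration": int(duration.group(1)),
        "Average_Pulse": int(pulse.group(1)),
        "Calorie_Burnage": int(calories.group(1)),
    }

# Keep only the notes that could be turned into a complete record.
records = [record for record in map(parse_note, notes) if record is not None]

for record in records:
    print(record)
```

Every record now has the same three fields, so the notes can be compared and analyzed. The third note holds no measurements and is left out.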
Structured Data

Structured data is organized and easier to work with.

How to Structure Data?

We can use an array or a database table to structure or present data.

Example of an array:

    [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

The following example shows how to create an array in Python (as a list):

Example

    Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
    print(Array)

It is common to work with very large data sets in Data Science.

In this tutorial we will try to make the concepts of Data Science as easy as possible to understand. We will therefore work with a small data set that is easy to interpret.

Data Science - Database Table

Database Table

A database table is a table with structured data.

The following table shows a database table with health data extracted from a sports watch:

Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
45        100            140        280              0           7
60        105            140        290              7           8
60        110            145        300              7           8
60        115            145        310              8           8
75        120            150        320              0           8
75        125            150        330              8           8

This data set contains information about a typical training session, such as duration, average pulse, calorie burnage, etc.

Database Table Structure

A database table consists of column(s) and row(s):

       Column 1  Column 2       Column 3   Column 4         ......      ......
       Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
Row 1  30        80             120        240              10          7
Row 2  30        85             120        250              10          7
Row 3  45        90             130        260              8           7
Row 4  45        95             130        270              8           7
       45        100            140        280              0           7
       60        105            140        290              7           8
       60        110            145        300              7           8
       60        115            145        310              8           8
       75        120            150        320              0           8
       75        125            150        330              8           8

A row is a horizontal representation of data.

A column is a vertical representation of data.

Variables

A variable is defined as something that can be measured or counted.

Examples can be characters, numbers, or time.

In the example below, we can observe that each column represents a variable.
Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
45        100            140        280              0           7
60        105            140        290              7           8
60        110            145        300              7           8
60        115            145        310              8           8
75        120            150        320              0           8
75        125            150        330              8           8

There are 6 columns, meaning that there are 6 variables.

There are 11 rows, meaning that each variable has 10 observations.

But if there are 11 rows, how come there are only 10 observations?

It is because the first row is the label, meaning that it is the name of the variable.

Data Science & Python

Python

Python is a programming language widely used by Data Scientists.

Python has built-in mathematical libraries and functions, making it easier to solve mathematical problems and to perform data analysis.

We will provide practical examples using Python.

To learn more about Python, see a dedicated Python tutorial.

Python Libraries

Python has libraries with large collections of mathematical functions and analytical tools.

In this course, we will use the following libraries:

- Pandas
- NumPy
- Matplotlib
- SciPy

We will use these libraries throughout the course to create examples.

Data Science - Python DataFrame

Create a DataFrame with Rows and Columns

A data frame is a structured representation of data.

Let's define a data frame with 3 columns and 5 rows of fictional numbers:

Example

    import pandas as pd

    d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}

    df = pd.DataFrame(data=d)

    print(df)

Example Explained

- Import the Pandas library as pd
- Define data with columns and rows in a variable named d
- Create a data frame using the function pd.DataFrame()
- The data frame contains 3 columns and 5 rows
- Print the data frame output with the print() function

We write pd. in front of DataFrame() to let Python know that we want to use the DataFrame() function from the Pandas library.

Be aware of the capital D and F in DataFrame!
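
The same pattern can hold the sports watch table from the earlier chapter. The sketch below is only an illustration: the names sports_watch and health_df are our own, and the values are copied from the table in this tutorial. It shows that the labels become column names rather than a data row, which is why the table has 10 observations per variable.

```python
import pandas as pd

# Values copied from the sports watch table shown earlier in this tutorial.
# The names sports_watch and health_df are chosen for this sketch only.
sports_watch = {
    "Duration": [30, 30, 45, 45, 45, 60, 60, 60, 75, 75],
    "Average_Pulse": [80, 85, 90, 95, 100, 105, 110, 115, 120, 125],
    "Max_Pulse": [120, 120, 130, 130, 140, 140, 145, 145, 150, 150],
    "Calorie_Burnage": [240, 250, 260, 270, 280, 290, 300, 310, 320, 330],
    "Hours_Work": [10, 10, 8, 8, 0, 7, 7, 8, 0, 8],
    "Hours_Sleep": [7, 7, 7, 7, 7, 8, 8, 8, 8, 8],
}

health_df = pd.DataFrame(data=sports_watch)

# The labels are stored as column names, not as a row of data.
variable_names = list(health_df.columns)
n_variables = len(variable_names)
n_observations = len(health_df)

print(variable_names)
print(n_variables, n_observations)
```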
Interpreting the Output

This is the output:

       col1  col2  col3
    0     1     4     7
    1     2     5     8
    2     3     6    12
    3     4     9     1
    4     7     5    11

We see that "col1", "col2" and "col3" are the names of the columns.

Do not be confused by the vertical numbers ranging from 0-4. They tell us the position of the rows.

In Python, the numbering of rows starts with zero.

Now, we can use Python to count the columns and rows.

We can use df.shape[1] to find the number of columns:

Example

Count the number of columns:

    count_column = df.shape[1]
    print(count_column)

We can use df.shape[0] to find the number of rows:

Example

Count the number of rows:

    count_row = df.shape[0]
    print(count_row)

Why Can We Not Just Count the Rows and Columns Ourselves?

If we work with larger data sets with many columns and rows, counting them by hand is confusing and error-prone. If we use the built-in functions in Python correctly, we can be sure that the count is correct.

Data Science Functions

This chapter shows three commonly used functions when working with Data Science: max(), min(), and mean().

The Sports Watch Data Set

Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
30        80             120        240              10          7
30        85             120        250              10          7
45        90             130        260              8           7
45        95             130        270              8           7
45        100            140        280              0           7
60        105            140        290              7           8
60        110            145        300              7           8
60        115            145        310              8           8
75        120            150        320              0           8
75        125            150        330              8           8

The data set above consists of 6 variables, each with 10 observations:

- Duration - How long did the training session last, in minutes?
- Average_Pulse - What was the average pulse during the training session? This is measured in beats per minute
- Max_Pulse - What was the max pulse during the training session?
- Calorie_Burnage - How many calories were burnt during the training session?
- Hours_Work - How many hours did we work at our job before the training session?
- Hours_Sleep - How much did we sleep the night before the training session?

We use an underscore (_) to separate the words in each variable name because Python names cannot contain spaces.
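
To see what the underscore convention buys us in practice, here is a small sketch with made-up column labels. Pandas stores column labels as strings, so a label with a space still works with square brackets, but only a label that is a valid Python name can also be reached with dot notation. The frames raw and clean are hypothetical names used only here.

```python
import pandas as pd

# A tiny made-up frame whose column labels contain spaces.
raw = pd.DataFrame({"Average Pulse": [80, 85], "Calorie Burnage": [240, 250]})

# Replace every space in the column labels with an underscore.
clean = raw.rename(columns=lambda name: name.replace(" ", "_"))

# Square brackets work with any label, spaces included.
spaced = raw["Average Pulse"].tolist()

# Dot notation only works when the label is a valid Python name.
underscored = clean.Average_Pulse.tolist()

print(list(clean.columns))
print(spaced, underscored)
```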
The max() function

The Python max() function is used to find the highest value among its arguments (or in an array).

Example

    Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)

    print(Average_pulse_max)

The min() function

The Python min() function is used to find the lowest value among its arguments (or in an array).

Example

    Average_pulse_min = min(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)

    print(Average_pulse_min)

The mean() function

The NumPy mean() function is used to find the average value of an array.

Example

    import numpy as np

    Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]

    Average_calorie_burnage = np.mean(Calorie_burnage)

    print(Average_calorie_burnage)

We write np. in front of mean to let Python know that we want to use the mean function from the NumPy library.

Data Science - Data Preparation

Before analyzing data, a Data Scientist must extract the data and make it clean and valuable.

Extract Data

Before data can be analyzed, it must be imported/extracted.

Pandas is a library in Python used for data analysis and data manipulation. In the example below, we show you how to import data using Pandas in Python.

We use the read_csv() function to import a CSV file with the health data:

Example

    import pandas as pd

    health_data = pd.read_csv("data.csv", header=0, sep=",")

    print(health_data)

Example Explained

- Import the Pandas library
- Name the data frame health_data
- header=0 means that the headers for the variable names are to be found in the first row (note that 0 means the first row in Python)
- sep="," means that "," is used as the separator between the values. This is because we are using the file type .csv (comma-separated values)

Data Cleaning

Look at the imported data.
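
One way to look at imported data beyond printing it is to count the blank cells and list the columns that Pandas could not read as numbers. The sketch below uses a small in-memory CSV that we made up for illustration (it is not the tutorial's data.csv). The names csv_text, missing_per_column, and non_numeric_columns are our own.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk (not the real data.csv).
csv_text = """Duration,Average_Pulse,Max_Pulse,Calorie_Burnage,Hours_Work,Hours_Sleep
30,80,120,240,10,7
45,90,130,260,8,7
60,9 000,140,,7,8
45,100,AF,280,,7
60,105,140,290,7,8
75,125,150,330,8,8
"""

health_data = pd.read_csv(io.StringIO(csv_text), header=0, sep=",")

# Blank cells are read as NaN; count them per column.
missing_per_column = health_data.isna().sum()

# Columns holding text such as "9 000" or "AF" are not read as numbers.
non_numeric_columns = [
    name
    for name in health_data.columns
    if not pd.api.types.is_numeric_dtype(health_data[name])
]

print(missing_per_column)
print(non_numeric_columns)
```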
As you can see, the data are "dirty", with wrong or unregistered values:

[Figure: Dirty data]

- There are some blank fields
- An average pulse of 9 000 is not possible
- 9 000 will be treated as non-numeric, because of the space separator
- One observation of max pulse is denoted as "AF", which does not make sense

So, we must clean the data in order to perform the analysis.

Remove Blank Rows

We see that the non-numeric values (9 000 and AF) are in the same rows as the missing values.

Solution: We can remove the rows with missing observations to fix this problem.

When we load a data set using Pandas, all blank cells are automatically converted into "NaN" values.

So, removing the rows that contain NaN cells gives us a clean data set that can be analyzed.

We can use the dropna() function to remove the NaNs. axis=0 means that we want to remove all rows that have a NaN value:

Example

    health_data.dropna(axis=0, inplace=True)

    print(health_data)

The result is a data set without NaN rows:

[Figure: Cleaned data]

Data Categories

To analyze data, we also need to know the types of data we are dealing with.

Data can be split into three main categories:

- Numerical - Contains numerical values. Can be divided into two categories:
  - Discrete: Numbers are counted as "whole". Example: You cannot have trained 2.5 sessions; it is either 2 or 3
  - Continuous: Numbers can be of infinite precision. For example, you can sleep for 7 hours, 30 minutes and 20 seconds, or about 7.506 hours
- Categorical - Contains values that cannot be measured up against each other. Example: A color or a type of training
- Ordinal - Contains categorical data that can be measured up against each other. Example: School grades where A is better than B and so on

By knowing the type of your data, you will be able to know which technique to use when analyzing it.
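
The categories above can be mirrored in Pandas. The following sketch is one possible mapping, not the only one: whole numbers for discrete values, decimals for continuous values, an unordered category for categorical values, and an ordered category for ordinal values. The frame sessions and all its values are invented for illustration.

```python
import pandas as pd

# Invented example values, one column per data category.
sessions = pd.DataFrame({
    "sessions_per_week": [2, 3, 1],            # numerical, discrete
    "hours_sleep": [7.5, 8.0, 6.25],           # numerical, continuous
    "training_type": ["run", "swim", "run"],   # categorical
    "grade": ["B", "A", "C"],                  # ordinal
})

# Categorical: the values have no order.
sessions["training_type"] = sessions["training_type"].astype("category")

# Ordinal: the same idea, but with an explicit order from worst to best.
grade_order = pd.CategoricalDtype(categories=["C", "B", "A"], ordered=True)
sessions["grade"] = sessions["grade"].astype(grade_order)

# Ordered categories can be compared; unordered ones cannot.
best_grade = sessions["grade"].max()
better_than_b = (sessions["grade"] > "B").tolist()

print(sessions.dtypes)
print(best_grade, better_than_b)
```

Declaring the order explicitly matters here: without it, Pandas has no way to know that A is better than B.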
Data Types

We can use the info() function to list the data types within our data set:

Example

    # info() prints its report directly, so no print() is needed
    health_data.info()

Result:

[Figure: Datatype float and object]

We see that this data set has two different types of data:

- Float64
- Object

We cannot use objects to calculate and perform analysis here. We must convert the type object to float64 (float64 is a number with a decimal in Python).

We can use the astype() function to convert the data into float64.

The following example converts Average_Pulse and Max_Pulse into data type float64 (the other variables are already of data type float64):

Example

    health_data["Average_Pulse"] = health_data["Average_Pulse"].astype(float)
    health_data["Max_Pulse"] = health_data["Max_Pulse"].astype(float)

    health_data.info()

Result:

[Figure: Datatype float]

Now, the data set has only float64 data types.

Analyze the Data

When we have cleaned the data set, we can start analyzing the data.

We can use the describe() function in Python to summarize data:

Example

    print(health_data.describe())

Result:

       Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
Count  10.0      10.0           10.0       10.0             10.0        10.0
Mean   51.0      102.5          137.0      285.0            6.6         7.5
Std    10.49     15.4           11.35      30.28            3.63        0.53
Min    30.0      80.0           120.0      240.0            0.0         7.0
25%    45.0      91.25          130.0      262.5            7.0         7.0
50%    52.5      102.5          140.0      285.0            8.0         7.5
75%    60.0      113.75         145.0      307.5            8.0         8.0
Max    60.0      125.0          150.0      330.0            10.0        8.0

- Count - Counts the number of observations
- Mean - The average value
- Std - Standard deviation (explained in the statistics chapter)
- Min - The lowest value
- 25%, 50% and 75% are percentiles (explained in the statistics chapter)
- Max - The highest value
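
The cleaning, conversion, and summary steps of this chapter can be put together in one short run. This is a minimal sketch on a made-up in-memory CSV (not the tutorial's data.csv), so its numbers differ from the result table above. It assumes, as in this chapter, that the non-numeric entries sit in rows that also have a blank cell.

```python
import io

import pandas as pd

# Made-up CSV text for illustration (not the real data.csv).
csv_text = """Duration,Average_Pulse,Max_Pulse,Calorie_Burnage,Hours_Work,Hours_Sleep
30,80,120,240,10,7
45,90,130,260,8,7
60,9 000,140,,7,8
45,100,AF,280,,7
60,105,140,290,7,8
75,125,150,330,8,8
"""

health_data = pd.read_csv(io.StringIO(csv_text), header=0, sep=",")

# Step 1: drop every row that contains at least one NaN cell.
health_data = health_data.dropna(axis=0)

# Step 2: the two text columns now hold only digits, so they convert cleanly.
for column in ["Average_Pulse", "Max_Pulse"]:
    health_data[column] = health_data[column].astype(float)

# Step 3: summarize the cleaned data.
summary = health_data.describe()

print(summary)
```

If a non-numeric entry appeared in a row without any blank cell, dropna() would keep that row and astype(float) would raise an error, so it is worth inspecting the data before converting.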

myteknohood
