mirror of https://github.com/01-edu/public.git

Pandas

The goal of this day is to understand the practical usage of Pandas. As Pandas is intensively used in Data Science, other days of the piscine will be dedicated to it.

Not only is the Pandas library a central component of the data science toolkit, but it is also used in conjunction with other libraries in that collection.

Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in Pandas. Data in Pandas is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn.

Most of the topics we will cover today are explained and described with examples in the first resource. The number of exercises is low on purpose: take the time to understand chapter 5 of the resource, even though it is 40 pages long.

Exercises of the day

Exercise 0: Environment and libraries
Exercise 1: Your first DataFrame
Exercise 2: Electric power consumption
Exercise 3: E-commerce purchases
Exercise 4: Handling missing values

Virtual Environment

Python 3.x
NumPy
Pandas
Jupyter or JupyterLab

Version of Pandas I used to do the exercises: 1.0.1. I suggest using the most recent one.

Resources

If I had to give you one resource it would be this one:

https://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf

It contains ALL you need to know about Pandas.

Pandas documentation:

Exercise 0: Environment and libraries

The goal of this exercise is to set up the Python work environment with the required libraries.

Note: For each quest, your first exercise will be to set up the virtual environment with the required libraries.

I recommend using:

the latest stable version of Python.
the virtual environment you're the most comfortable with. virtualenv and conda are the most used in Data Science.
one of the most recent versions of the libraries required

Create a virtual environment named ex00, with a version of Python >= 3.8, with the following libraries: pandas, numpy and jupyter.
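
Once the environment is activated, a short check like the one below can confirm that it matches the requirements. This is only a sketch: the helper name `installed_versions` is made up for this example, and it relies on `importlib.metadata`, which is part of the standard library from Python 3.8 onwards.

```python
import sys
from importlib import metadata

# Libraries requested for the ex00 environment.
REQUIRED = ("pandas", "numpy", "jupyter")


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# True when the interpreter satisfies the Python >= 3.8 requirement.
print("Python >= 3.8:", sys.version_info >= (3, 8))
for name, version in installed_versions().items():
    print(name, version if version is not None else "NOT INSTALLED")
```

Run it with the interpreter of the virtual environment, not the system one, otherwise the reported versions describe the wrong installation.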

Exercise 1: Your first DataFrame

The goal of this exercise is to learn to create basic Pandas objects.

Create the DataFrame shown below in two ways:
- From a NumPy array
- From a Pandas Series

	color	list	number
1	Blue	[1, 2]	1.1
3	Red	[3, 4]	2.2
5	Pink	[5, 6]	3.3
7	Grey	[7, 8]	4.4
9	Black	[9, 10]	5.5

Print the type of every column and the type of the first value of every column.
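
One possible way to approach this exercise is sketched below. It is not the only valid solution: the variable names are made up for the example, and the NumPy version assumes an array of dtype `object` because the three columns hold different kinds of values.

```python
import numpy as np
import pandas as pd

index = [1, 3, 5, 7, 9]
columns = ["color", "list", "number"]
colors = ["Blue", "Red", "Pink", "Grey", "Black"]
lists = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
numbers = [1.1, 2.2, 3.3, 4.4, 5.5]

# Way 1: from a NumPy array of Python objects.
data = np.empty((len(index), len(columns)), dtype=object)
for row, values in enumerate(zip(colors, lists, numbers)):
    for col, value in enumerate(values):
        data[row, col] = value  # cell by cell, so each list stays a single cell
df_from_array = pd.DataFrame(data, index=index, columns=columns)
# The array carries no numeric dtype, so convert the number column explicitly.
df_from_array["number"] = df_from_array["number"].astype(float)

# Way 2: from one Pandas Series per column, all sharing the same index.
df_from_series = pd.DataFrame(
    {
        "color": pd.Series(colors, index=index),
        "list": pd.Series(lists, index=index),
        "number": pd.Series(numbers, index=index),
    }
)

# Type of every column, then type of the first value of every column.
print(df_from_series.dtypes)
first_value_types = {name: type(df_from_series[name].iloc[0]) for name in columns}
print(first_value_types)
```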

Exercise 2: Electric power consumption

The goal of this exercise is to learn to manipulate real data with Pandas.

The data set used is Individual household electric power consumption.

Delete the columns Time, Sub_metering_2 and Sub_metering_3
Set Date as index
Create a function that takes as input the DataFrame with the data set and returns a DataFrame with updated types:
```
    def update_types(df):
        #TODO
        return df
```
Use describe to have an overview on the data set
Delete the rows with missing values
Modify Sub_metering_1 by adding 1 to it and multiplying the total by 0.06. If x is a row the output is: (x+1)*0.06
Select all the rows for which the Date is greater than or equal to 2008-12-27 and Voltage is greater than or equal to 242
Print the 88888th row.
What is the date for which the Global_active_power is maximal?
Sort the first three columns by descending order of Global_active_power and ascending order of Voltage.
Compute the daily average of Global_active_power.
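
Several of the steps above can be tried first on a tiny made-up sample before running them on the full file. The sketch below reuses the column names of the data set, but every value in it is invented, and it assumes the Date column has already been converted to datetime.

```python
import pandas as pd

# Invented sample rows; only the column names come from the data set.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2008-12-26", "2008-12-27", "2008-12-27", "2008-12-28", "2008-12-28"]
        ),
        "Global_active_power": [1.5, 0.4, 2.0, 1.0, 3.0],
        "Voltage": [240.0, 243.5, 241.0, 242.0, 239.0],
        "Sub_metering_1": [0.0, 1.0, 2.0, None, 0.0],
    }
).set_index("Date")

# Delete the rows with missing values.
df = df.dropna()

# Apply (x + 1) * 0.06 to Sub_metering_1.
df["Sub_metering_1"] = (df["Sub_metering_1"] + 1) * 0.06

# Rows on or after 2008-12-27 with a Voltage of at least 242.
selected = df[(df.index >= "2008-12-27") & (df["Voltage"] >= 242)]

# Date at which Global_active_power is maximal.
date_of_max = df["Global_active_power"].idxmax()

# Descending Global_active_power, then ascending Voltage.
ordered = df.sort_values(["Global_active_power", "Voltage"], ascending=[False, True])

# Daily average of Global_active_power.
daily_mean = df.groupby(level="Date")["Global_active_power"].mean()

print(selected)
print(date_of_max)
print(daily_mean)
```

The real file needs extra care that this sample hides, in particular the type conversion done by `update_types`.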

Exercise 3: E-commerce purchases

The goal of this exercise is to learn to manipulate real data with Pandas. This exercise is less guided, since Exercise 2 should have given you a good introduction.

The data set used is E-commerce purchases.

Questions:

How many rows and columns are there?
What is the average Purchase Price?
What were the highest and lowest purchase prices?
How many people have English 'en' as their Language of choice on the website?
How many people have the job title of "Lawyer"?
How many people made the purchase during the AM and how many people made the purchase during PM?
What are the 5 most common Job Titles?
Someone made a purchase that came from Lot: "90 WT", what was the Purchase Price for this transaction?
What is the email of the person with the following Credit Card Number: 4926535242672853?
How many people have American Express as their Credit Card Provider and made a purchase above $95?
How many people have a credit card that expires in 2025?
What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc.)?
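
Most of these questions come down to a handful of Pandas idioms: `shape`, boolean masks, `value_counts` and the `.str` accessor. The sketch below shows them on invented rows. The column names `Email`, `Job` and `Purchase Price` are assumptions made for the example, so check them against the actual file.

```python
import pandas as pd

# Invented rows; the column names are assumptions, not taken from the real file.
purchases = pd.DataFrame(
    {
        "Email": ["ann@gmail.com", "bob@yahoo.com", "cat@gmail.com", "dan@hotmail.com"],
        "Job": ["Lawyer", "Engineer", "Lawyer", "Nurse"],
        "Purchase Price": [98.14, 70.73, 0.95, 78.04],
    }
)

# Number of rows and columns.
n_rows, n_cols = purchases.shape

# Average, highest and lowest purchase price.
mean_price = purchases["Purchase Price"].mean()
max_price = purchases["Purchase Price"].max()
min_price = purchases["Purchase Price"].min()

# Counting rows that match a condition: sum a boolean mask.
n_lawyers = int((purchases["Job"] == "Lawyer").sum())

# Most common values of a column.
top_jobs = purchases["Job"].value_counts().head(5)

# Email host: keep what follows the "@", then count.
top_hosts = purchases["Email"].str.split("@").str[1].value_counts().head(5)

print(n_rows, n_cols, mean_price, max_price, min_price, n_lawyers)
print(top_jobs)
print(top_hosts)
```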

Exercise 4: Handling missing values

The goal of this exercise is to learn to handle missing values. In the previous exercise we used the first technique: filtering out the missing values. We were lucky because the proportion of missing values was low. But in some cases, dropping the missing values is not possible because the filtered data set would be too small.

This article explains the different types of missing data and how they should be handled.

https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b

"It’s important to understand these different types of missing data from a statistics point of view. The type of missing data will influence how you deal with filling in the missing values."

Preliminary: Drop the flower column

Fill the missing values with a different "strategy" for each column:
- sepal_length -> mean
- sepal_width -> median
- petal_length, petal_width -> 0

Fill the missing values with the median of the associated column using fillna.
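
Both filling steps can be expressed with `fillna`, which accepts either a dict or a Series mapping each column to its fill value. The sketch below uses a few invented rows with the same column names. It illustrates the mechanism and is not a cleaned version of the real data set.

```python
import numpy as np
import pandas as pd

# Invented rows with some missing values.
iris = pd.DataFrame(
    {
        "sepal_length": [5.0, np.nan, 7.0],
        "sepal_width": [3.0, 2.0, np.nan],
        "petal_length": [np.nan, 1.5, 4.0],
        "petal_width": [0.2, np.nan, 1.3],
    }
)

# One strategy per column: the dict maps column name -> fill value.
fill_values = {
    "sepal_length": iris["sepal_length"].mean(),
    "sepal_width": iris["sepal_width"].median(),
    "petal_length": 0,
    "petal_width": 0,
}
filled_by_strategy = iris.fillna(fill_values)

# Same strategy everywhere: iris.median() is a Series indexed by column name.
filled_by_median = iris.fillna(iris.median())

print(filled_by_strategy)
print(filled_by_median)
```

`fillna` returns a new DataFrame here, so `iris` keeps its missing values and both strategies can be compared side by side.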

Bonus questions:
- Filling the missing values with 0 or with the mean of the associated column is common in Data Science. For this data set, explain why filling the missing values with 0 or the mean is a bad idea.
- Find a special row ;-).