mirror of https://github.com/01-edu/public.git

Exercise 0: Environment and libraries

The exercise is validated if all questions of the exercise are validated.

Activate the virtual environment. If you used `conda`, run `conda activate your_env`.

Run `python --version`.

Does it print `Python 3.x`, with x >= 8?

Does `import jupyter`, `import numpy` and `import pandas` run without any error?
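The same checks can be sketched as a small script. This is only an illustration: `find_spec` reports whether a package can be located, which is a weaker check than actually running the `import` statements listed above.

```python
import importlib.util
import sys

# Exercise 0 asks for Python 3.x with x >= 8.
python_ok = sys.version_info >= (3, 8)
print(f"Python {sys.version_info.major}.{sys.version_info.minor} -> ok: {python_ok}")

# find_spec locates a package without importing it (None means not installed).
required = ["jupyter", "numpy", "pandas"]
status = {name: importlib.util.find_spec(name) is not None for name in required}
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```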

Exercise 1: Your first DataFrame

The exercise is validated if all questions of the exercise are validated.

The solution of question 1 is accepted if the DataFrame created is the same as the "model" DataFrame. Check that the index is not 1,2,3,4,5.

The solution of question 2 is accepted if the columns' types are as below and if the types of the first value of the columns are as below:

    <class 'pandas.core.series.Series'>
    <class 'pandas.core.series.Series'>
    <class 'pandas.core.series.Series'>

        <class 'str'>
        <class 'list'>
        <class 'float'>

Exercise 2: Electric power consumption

The exercise is validated if all questions of the exercise are validated.

The solution of question 1 is accepted if `drop` is used with `axis=1`. `inplace=True` may be useful to avoid assigning the result to a variable. `del` could also be accepted, even though it is not a solution I recommend.
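A minimal sketch of `drop` with `axis=1` on a toy DataFrame. The column names are placeholders chosen for illustration, not the columns of the real dataset.

```python
import pandas as pd

# Toy frame; the column names are hypothetical placeholders.
df = pd.DataFrame({"keep_me": [1, 2], "drop_a": [3, 4], "drop_b": [5, 6]})

# axis=1 targets columns; inplace=True modifies df and returns None,
# so there is no result to assign to a variable.
df.drop(["drop_a", "drop_b"], axis=1, inplace=True)

print(list(df.columns))
```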

The solution of question 2 is accepted if the DataFrame returns the output below. If the type of the index is not `dtype='datetime64[ns]'`, the solution is not accepted. I recommend using `set_index` with `inplace=True` to do so.

        Input: df.head().index

        Output:

        DatetimeIndex(['2006-12-16', '2006-12-16','2006-12-16', '2006-12-16','2006-12-16'],
        dtype='datetime64[ns]', name='Date', freq=None)
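One way to obtain such an index, sketched on toy rows. The day/month/year string format and the power values are assumptions made for this example; adapt the `format` argument to the actual file.

```python
import pandas as pd

# Toy rows; the date format (day/month/year) and the values are assumptions.
df = pd.DataFrame(
    {
        "Date": ["16/12/2006", "16/12/2006", "17/12/2006"],
        "Global_active_power": [4.216, 5.360, 3.446],
    }
)

# Parse the strings into datetimes, then promote the column to the index.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df.set_index("Date", inplace=True)

print(df.head().index)
```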

The solution of question 3 is accepted if all the types are `float64` as below. The preferred solution is `pd.to_numeric` with `errors='coerce'`.

        Input: df.dtypes

        Output:

            Global_active_power      float64
            Global_reactive_power    float64
            Voltage                  float64
            Global_intensity         float64
            Sub_metering_1           float64
            dtype: object
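A short sketch of the conversion on toy data. It assumes missing readings are stored as a non-numeric marker such as `?`; the marker and the values are illustrative only.

```python
import pandas as pd

# Toy frame of strings; "?" stands in for a non-numeric marker (an assumption).
df = pd.DataFrame(
    {
        "Voltage": ["234.84", "?", "233.29"],
        "Global_intensity": ["18.4", "23.0", "?"],
    }
)

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising, so every column ends up as float64.
df = df.apply(pd.to_numeric, errors="coerce")

print(df.dtypes)
```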

The solution of question 4 is accepted if you use `df.describe()`.

The solution of question 5 is accepted if `dropna` is used and if the number of missing values is equal to 0. It is important to notice that 25979 rows contain missing values (for a total of 129895 missing values). `df.isna().sum()` lets you check the number of missing values, and `df.dropna()` with `inplace=True` removes the rows that contain them.
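The check-then-drop sequence can be sketched as follows on a toy DataFrame with illustrative values:

```python
import numpy as np
import pandas as pd

# Toy frame with two missing readings (values are illustrative).
df = pd.DataFrame(
    {
        "Voltage": [234.84, np.nan, 233.29],
        "Global_intensity": [18.4, 23.0, np.nan],
    }
)

missing_before = df.isna().sum()   # missing values per column
df.dropna(inplace=True)            # removes every row that has a NaN
missing_after = df.isna().sum()

print(missing_before, missing_after, sep="\n")
```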

The solution of question 6 is accepted if one of the two approaches below was used:

        #solution 1
        df.loc[:,'A'] = (df['A'] + 1) * 0.06

        #solution 2
        df.loc[:,'A'] = df.loc[:,'A'].apply(lambda x: (x+1)*0.06)

You may wonder whether `df.loc[:,'A']` is required and whether `df['A'] = ...` works too. **The answer is no**. This is important in Pandas. Depending on the version of Pandas, it may raise a `SettingWithCopyWarning`. The reason is that you may be assigning a value to a **copy** of the DataFrame and not to the DataFrame itself.
More details: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
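As a quick sanity check, the two accepted approaches can be compared on a toy column. The column name `A` follows the snippets above; the values are made up.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1.0, 2.5, 4.0]})
df2 = df1.copy()

# Approach 1: vectorised arithmetic on the whole column.
df1.loc[:, "A"] = (df1["A"] + 1) * 0.06

# Approach 2: element-wise lambda through apply.
df2.loc[:, "A"] = df2.loc[:, "A"].apply(lambda x: (x + 1) * 0.06)

# Both approaches should produce the same column.
print(df1["A"].equals(df2["A"]))
```

The vectorised form is usually faster on large columns, since `apply` calls the Python function once per element.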

The solution of question 7 is accepted as long as the output of `print(filtered_df.head().to_markdown())` is as below and if the number of rows is equal to 449667.

| Date                |   Global_active_power |   Global_reactive_power |
|:--------------------|----------------------:|------------------------:|
| 2008-12-27 00:00:00 |                 0.996 |                   0.066 |
| 2008-12-27 00:00:00 |                 1.076 |                   0.162 |
| 2008-12-27 00:00:00 |                 1.064 |                   0.172 |
| 2008-12-27 00:00:00 |                 1.07  |                   0.174 |
| 2008-12-27 00:00:00 |                 0.804 |                   0.184 |

The solution of question 8 is accepted if the output is:

        Global_active_power        0.254
        Global_reactive_power      0.000
        Voltage                  238.350
        Global_intensity           1.200
        Sub_metering_1             0.000
        Name: 2007-02-16 00:00:00, dtype: float64

The solution of question 9 is accepted if the output is `Timestamp('2009-02-22 00:00:00')`.

The solution of question 10 is accepted if the output of `print(sorted_df.tail().to_markdown())` is:

| Date                |   Global_active_power |   Global_reactive_power |   Voltage |
|:--------------------|----------------------:|------------------------:|----------:|
| 2008-08-28 00:00:00 |                 0.076 |                       0 |    234.88 |
| 2008-08-28 00:00:00 |                 0.076 |                       0 |    235.18 |
| 2008-08-28 00:00:00 |                 0.076 |                       0 |    235.4  |
| 2008-08-28 00:00:00 |                 0.076 |                       0 |    235.64 |
| 2008-12-08 00:00:00 |                 0.076 |                       0 |    236.5  |

The solution of question 11 is accepted if the output is as below. The solution is based on `groupby` which creates groups based on the index `Date` and aggregates the groups using the `mean`.

    Date
    2006-12-16    3.053475
    2006-12-17    2.354486
    2006-12-18    1.530435
    2006-12-19    1.157079
    2006-12-20    1.545658
                    ...
    2010-12-07    0.770538
    2010-12-08    0.367846
    2010-12-09    1.119508
    2010-12-10    1.097008
    2010-12-11    1.275571
    Name: Global_active_power, Length: 1433, dtype: float64
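A minimal sketch of the `groupby` step on toy data. It assumes `Date` is already the name of a `DatetimeIndex`; the four readings are invented.

```python
import pandas as pd

# Toy frame: two readings per day, indexed by a DatetimeIndex named "Date".
df = pd.DataFrame(
    {"Global_active_power": [1.0, 3.0, 2.0, 4.0]},
    index=pd.DatetimeIndex(
        ["2006-12-16", "2006-12-16", "2006-12-17", "2006-12-17"], name="Date"
    ),
)

# groupby accepts the name of an index level; each day becomes one group
# and mean() aggregates the readings within it.
daily_mean = df.groupby("Date")["Global_active_power"].mean()

print(daily_mean)
```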

Exercise 3: E-commerce purchases

The exercise is validated if all questions of the exercise are validated.

To validate this exercise, all answers should return the expected numerical value given in the correction AND use Pandas. For example, using NumPy to compute the mean doesn't respect the philosophy of the exercise, which is to use Pandas.

The solution of question 1 is accepted if it contains 10000 entries and 14 columns. There are many solutions based on: shape, info, describe.

The solution of question 2 is accepted if the answer is 50.34730200000025.

Even if `np.mean` gives the solution, `df['Purchase Price'].mean()` is preferred.

The solution of question 3 is accepted if the min is `0` and the max is `99.989999999999995`.

The solution of question 4 is accepted if the answer is 1098.

The solution of question 5 is accepted if the answer is 30.

The solution of question 6 is accepted if there are `4932` people who made the purchase during the `AM` and `5068` people who made the purchase during the `PM`. There are many ways to reach the solution, but the goal of this question was to make you use `value_counts`.

The solution of question 7 is accepted if the answer is as below. There are many ways to reach the solution, but the goal of this question was to use `value_counts`.

    Interior and spatial designer    31
    Lawyer                           30
    Social researcher                28
    Purchasing manager               27
    Designer, jewellery              27
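A small sketch of `value_counts` for questions 6 and 7. The column names `AM or PM` and `Job` are assumptions about the dataset, and the rows are invented.

```python
import pandas as pd

# Toy frame; column names and rows are hypothetical.
df = pd.DataFrame(
    {
        "AM or PM": ["AM", "PM", "PM", "AM", "PM"],
        "Job": ["Lawyer", "Lawyer", "Designer", "Lawyer", "Designer"],
    }
)

# value_counts returns the occurrences sorted from most to least frequent.
am_pm_counts = df["AM or PM"].value_counts()
top_jobs = df["Job"].value_counts().head(2)

print(am_pm_counts)
print(top_jobs)
```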

The solution of question 8 is accepted if the purchase price is 75.1.

The solution of question 9 is accepted if the email address is bondellen@williams-garza.com.

The solution of question 10 is accepted if the answer is 39. The preferred solution is based on this: `df[(df['A'] == X) & (df['B'] > Y)]`
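The filtering pattern can be sketched like this. `Purchase Price` appears earlier in this correction, while the `CC Provider` column, the provider name and the threshold of 95 are assumptions used only to make the example concrete.

```python
import pandas as pd

# Toy frame; "CC Provider", its values and the threshold are hypothetical.
df = pd.DataFrame(
    {
        "CC Provider": ["American Express", "Visa", "American Express", "American Express"],
        "Purchase Price": [96.5, 99.0, 20.0, 95.01],
    }
)

# Each condition needs its own parentheses because & binds tighter than == and >.
mask = (df["CC Provider"] == "American Express") & (df["Purchase Price"] > 95)
count = len(df[mask])

print(count)
```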

The solution of question 11 is accepted if the answer is 1033. The preferred solution uses `apply` with a `lambda` function that slices the string containing the expiration date.

The solution of question 12 is accepted if the answer is as below. The preferred solution uses `apply` with a `lambda` function that extracts the domain from the string containing the email. The `lambda` function uses `split` to split the string on `@`. Finally, `value_counts` is used to count the occurrences.
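A sketch of the slicing idea, under several assumptions: the column is called `CC Exp Date`, dates are stored as `MM/YY` strings, and the year of interest is 2025. None of these is confirmed by this correction.

```python
import pandas as pd

# Toy frame; column name, "MM/YY" format and target year are assumptions.
df = pd.DataFrame({"CC Exp Date": ["02/20", "11/25", "07/25", "01/19"]})

# The last two characters of the string hold the year.
expires_in_2025 = df["CC Exp Date"].apply(lambda date: date[-2:] == "25")

# Summing booleans counts the True values.
print(expires_in_2025.sum())
```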

- hotmail.com     1638
- yahoo.com       1616
- gmail.com       1605
- smith.com         42
- williams.com      37

Exercise 4: Handling missing values

The exercise is validated if all questions of the exercise are validated (except the bonus question).

The solution of question 1 is accepted if the two steps are implemented in that order. First, convert the numerical columns to `float` and then fill the missing values. The first step may involve `pd.to_numeric(df.loc[:,col], errors='coerce')`. The second step is validated if you eliminated all missing values. However, there are many possibilities to fill the missing values. Here is one of them:

example:

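    # fillna returns a new DataFrame, so assign the result back;
    # the keys are the column names
    df = df.fillna({'sepal_length': df.sepal_length.mean(),
                    'sepal_width': df.sepal_width.median(),
                    'petal_length': 0,
                    'petal_width': 0})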

The solution of question 2 is accepted if the solution is `df.loc[:,col].fillna(df[col].median())`.
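The two steps (convert, then fill with the median) can be sketched end to end on toy data. The `?` marker and the values are assumptions; note that `fillna` returns a new Series, so the result is assigned back to the column.

```python
import pandas as pd

# Toy frame of strings; "?" marks a missing value (an assumption).
df = pd.DataFrame(
    {
        "sepal_length": ["5.1", "?", "4.9", "6900"],
        "sepal_width": ["3.5", "3.0", "?", "3.2"],
    }
)

# Step 1: convert to float, turning unparsable entries into NaN.
df = df.apply(pd.to_numeric, errors="coerce")

# Step 2: fill each column with its own median and assign the result back.
for col in df.columns:
    df.loc[:, col] = df.loc[:, col].fillna(df[col].median())

print(df)
```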

The solution of the bonus question is accepted if you reach this answer: once the missing values are filled as suggested in the first question, `df.describe()` returns the interesting summary below. The mean is far higher than the median, which suggests that the data may contain outliers. The 75th percentile and the max confirm it: 75% of the flowers have a sepal length smaller than 6.4 cm, but the max is 6900 cm. A quick check on the internet shows that this small flower can't be that big. The outliers have a major impact on the mean, which equals 56.9. Filling the missing values with this value is not correct since it doesn't correspond to the real size of this flower. That is why, in this case, the best strategy for filling the missing values is the median. The truth is that I modified the data set! But real data sets ALWAYS contain outliers. Always think about the meaning of the data transformation! If you fill the missing values with zero, you are assuming that the length or width of some flowers may be 0, which doesn't make sense.

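|       |   sepal_length |   sepal_width |   petal_length |   petal_width |
|:------|---------------:|--------------:|---------------:|--------------:|
| count |            146 |           141 |            120 |           147 |
| mean  |        56.9075 |       52.6255 |        15.5292 |       12.0265 |
| std   |        572.222 |       417.127 |         127.46 |       131.873 |
| min   |           -4.4 |          -3.6 |           -4.8 |          -2.5 |
| 25%   |            5.1 |           2.8 |          2.725 |           0.3 |
| 50%   |           5.75 |             3 |            4.5 |           1.3 |
| 75%   |            6.4 |           3.3 |            5.1 |           1.8 |
| max   |           6900 |          3809 |           1400 |          1600 |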

The solution of the bonus question is accepted if the presence of negative values and huge values has been detected. A good data scientist always checks for abnormal values in the dataset. YOU SHOULD ALWAYS TRY TO UNDERSTAND YOUR DATA. Print the row with index 122 ;-) This week, we will have the opportunity to focus on data pre-processing to understand how outliers can be handled.
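The effect of an outlier on the mean versus the median can be seen on a toy Series. The values are invented, and the upper bound of 100 used to flag huge values is an arbitrary threshold chosen for this sketch.

```python
import pandas as pd

# Toy measurements in cm, with one huge value and one negative value.
sepal_length = pd.Series([5.1, 4.9, 5.8, 6.4, 6900.0, -4.4])

mean = sepal_length.mean()      # pulled far upward by the single outlier
median = sepal_length.median()  # stays close to the typical value

# Flag non-positive values and values above an arbitrary plausibility bound.
suspicious = sepal_length[(sepal_length <= 0) | (sepal_length > 100)]

print(f"mean={mean:.2f}, median={median:.2f}")
print(suspicious)
```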
