###### For the solution of question 1 is `drop` used with `axis=1`.`inplace=True`? It may be useful to avoid to affect the result to a variable. A solution that could be accepted too (even if it's not a solution I recommend) is `del`.
###### For question 2 does the DataFrame return the output below? If the type of the index is not `dtype='datetime64[ns]'` the solution is not accepted. I recommend to use `set_index` with `inplace=True` to do so.
###### For question 5 is `dropna` being used and is the number of missing values equal to 0? It is important to notice that 25979 rows contain missing values (for a total of 129895). `df.isna().sum()` allows to check the number of missing values and `df.dropna()` with `inplace=True` allows to remove the rows with missing values.
You may wonder `df.loc[:,'A']` is required and if `df['A'] = ...` works too. **The answer is no**. This is important in Pandas. Depending on the version of Pandas, it may return a warning. The reason is that you are affecting a value to a **copy** of the DataFrame and not in the DataFrame.
More details: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
###### For question 11, is the output the following? The solution is based on `groupby` which creates groups based on the index `Date` and aggregates the groups using the `mean`.
##### To validate this exercise all answers should return the expected numerical value given in the correction AND use Pandas. For example using NumPy to compute the mean doesn't respect the philosophy of the exercise which is to use Pandas.
###### For question 6, are there `4932` people that made the purchase during the `AM` and `5068` people that made the purchase during `PM`? There are many ways to get the solution but the goal of this question was to make you use `value_counts`.
###### For question 11, is the answer **1033**? The preferred solution is based on the usage of `apply` on a `lambda` function that slices the string that contains the expiration date.
###### For question 12, is the answer as below? The preferred solution is based on the usage of `apply` on a `lambda` function that slices the string that contains the email. The `lambda` function uses `split` to split the string on `@`. Finally, `value_counts` is used to count the occurrences.
###### For question 1, are the two steps implemented in that order? First, convert the numerical columns to `float` and then fill the missing values. The first step may involve `pd.to_numeric(df.loc[:,col], errors='coerce')`. The second step is validated if you eliminated all missing values. However there are many possibilities to fill the missing values. Here is one of them:
###### +The solution of bonus question is accepted if you find out this answer: Once we filled the missing values as suggested in the first question, `df.describe()` returns this interesting summary. We notice that the mean is way higher than the median. It means that there are maybe some outliers in the data. The quantile 75 and the max confirm that: 75% of the flowers have a sepal length smaller than 6.4 cm, but the max is 6900 cm. If you check on the internet you realise this small flower can't be that big. The outliers have a major impact on the mean which equals to 56.9. Filling this value for the missing value is not correct since it doesn't correspond to the real size of this flower. That is why in that case the best strategy to fill the missing values is the median. The truth is that I modified the data set ! But real data sets ALWAYS contains outliers. Always think about the meaning of the data transformation! If you fill the missing values by zero, it means that you consider that the length or width of some flowers may be 0. It doesn't make sense.
###### +Has the presence of negative values and huge values been detected? A good data scientist always check abnormal values in the dataset. **YOU SHOULD ALWAYS TRY TO UNDERSTAND YOUR DATA**. Print the row with index 122 ;-) This week, we will have the opportunity to focus on the data pre-processing to understand how the outliers can be handled.