#### Exercise 0: Environment and libraries ##### The exercise is validated if all questions of the exercise are validated. ##### Activate the virtual environment. If you used `conda` run `conda activate your_env`. ##### Run `python --version`. ###### Does it print `Python 3.x`? x >= 8 ###### Does `import jupyter`, `import numpy` and `import pandas` run without any error? --- --- #### Exercise 1: Concatenate ###### Is the outputted DataFrame as below for question 1? | | letter | number | |---:|:---------|---------:| | 0 | a | 1 | | 1 | b | 2 | | 2 | c | 1 | | 3 | d | 2 | --- --- #### Exercise 2: Merge ##### The exercise is validated if all questions of the exercise are validated. ###### Does the output for question 1 look as below? | | id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | |---:|-----:|:-------------|:-------------|:-------------|:-------------| | 0 | 1 | A | B | K | L | | 1 | 2 | C | D | M | N | ###### Does the output for question 2 look as below? | | id | Feature1_df1 | Feature2_df1 | Feature1_df2 | Feature2_df2 | |---:|-----:|:---------------|:---------------|:---------------|:---------------| | 0 | 1 | A | B | K | L | | 1 | 2 | C | D | M | N | | 2 | 3 | E | F | nan | nan | | 3 | 4 | G | H | nan | nan | | 4 | 5 | I | J | nan | nan | | 5 | 6 | nan | nan | O | P | | 6 | 7 | nan | nan | Q | R | | 7 | 8 | nan | nan | S | T | Note: Check that the suffixes are set using the suffix parameters rather than manually changing the columns' name. --- --- #### Exercise 3: Merge MultiIndex ##### The exercise is validated if all questions of the exercise are validated. ###### Is the outputted DataFrame's shape `(1305, 5)` and `merged.head()` returns a table as below for question 1? One of the answers that returns the correct DataFrame is `market_data.merge(alternative_data, how='left', left_index=True, right_index=True)` | | Open | Close | Close_Adjusted | Twitter | Reddit | | :--------------------------------------------------- | --------: | -------: | -------------: | ----------: | --------: | | (Timestamp('2021-01-01 00:00:00', freq='B'), 'AAPL') | 0.0991792 | -0.31603 | 0.634787 | -0.00159041 | 1.06053 | | (Timestamp('2021-01-01 00:00:00', freq='B'), 'FB') | -0.123753 | 1.00269 | 0.713264 | 0.0142127 | -0.487028 | | (Timestamp('2021-01-01 00:00:00', freq='B'), 'GE') | -1.37775 | -1.01504 | 1.2858 | 0.109835 | 0.04273 | | (Timestamp('2021-01-01 00:00:00', freq='B'), 'AMZN') | 1.06324 | 0.841241 | -0.799481 | -0.805677 | 0.511769 | | (Timestamp('2021-01-01 00:00:00', freq='B'), 'DAI') | -0.603453 | -2.06141 | -0.969064 | 1.49817 | 0.730055 | ###### For question 2, are the numbers that are missing in the DataFrame equal to 0 and `filled_df.sum().sum() == merged_df.sum().sum()` gives: `True`? --- --- #### Exercise 4: Groupby Apply ##### The exercise is validated is all questions of the exercise are validated and if the for loop hasn't been used. The goal is to use `groupby` and `apply`. ###### Is the output for question 1 the following? ```python df = pd.DataFrame(range(1,11), columns=['sequence']) print(winsorize(df, [0.20, 0.80]).to_markdown()) ``` | | sequence | |---:|-----------:| | 0 | 2.8 | | 1 | 2.8 | | 2 | 3 | | 3 | 4 | | 4 | 5 | | 5 | 6 | | 6 | 7 | | 7 | 8 | | 8 | 8.2 | | 9 | 8.2 | ###### Is the output for question 2 a Pandas Series or DataFrame with the first 11 rows equal to the output below? The code below gives a solution. | | sequence | |---:|-----------:| | 0 | 1.45 | | 1 | 2 | | 2 | 3 | | 3 | 4 | | 4 | 5 | | 5 | 6 | | 6 | 7 | | 7 | 8 | | 8 | 9 | | 9 | 9.55 | | 10 | 11.45 | ```python def winsorize(df_series, quantiles): """ df: pd.DataFrame or pd.Series quantiles: list [0.05, 0.95] """ min_value = np.quantile(df_series, quantiles[0]) max_value = np.quantile(df_series, quantiles[1]) return df_series.clip(lower = min_value, upper = max_value) df.groupby("group")[['sequence']].apply(winsorize, [0.05,0.95]) ``` - https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e --- --- #### Exercise 5: Groupby Agg ###### Is the output for question 1 as below? The columns don't have to be MultiIndex. A solution could be `df.groupby('product').agg({'value':['min','max','mean']})` | product | ('value', 'min') | ('value', 'max') | ('value', 'mean') | | :----------- | ---------------: | ---------------: | ----------------: | | chair | 22.89 | 32.12 | 27.505 | | mobile phone | 100 | 111.22 | 105.61 | | table | 20.45 | 99.99 | 51.22 | --- --- #### Exercise 6: Unstack ###### Is the output similar (as the values are generated randomly, it's obvious the audit doesn't require to match the values below) to what `unstacked_df.head()`returns for question 1? | Date | ('Prediction', 'AAPL') | ('Prediction', 'AMZN') | ('Prediction', 'DAI') | ('Prediction', 'FB') | ('Prediction', 'GE') | |:--------------------|-------------------------:|-------------------------:|------------------------:|-----------------------:|-----------------------:| | 2021-01-01 00:00:00 | 0.382312 | -0.072392 | -0.551167 | -0.0585555 | 1.05955 | | 2021-01-04 00:00:00 | -0.560953 | 0.503199 | -0.79517 | -3.23136 | 1.50271 | | 2021-01-05 00:00:00 | 0.211489 | 1.84867 | 0.287906 | -1.81119 | 1.20321 | ###### Is the answer for question 2: `unstacked.plot(title = 'Stocks 2021')`? The title can be anything else.