mirror of https://github.com/01-edu/public.git
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
6.8 KiB
6.8 KiB
Exercise 0: Environment and libraries
The exercise is validated if all questions of the exercise are validated.
Activate the virtual environment. If you used conda
run conda activate your_env
.
Run python --version
.
Does it print Python 3.x
? x >= 8
Does import jupyter
, import numpy
and import pandas
run without any error?
Exercise 1: Concatenate
Is the outputted DataFrame as below for question 1?
| | letter | number |
|---:|:---------|---------:|
| 0 | a | 1 |
| 1 | b | 2 |
| 2 | c | 1 |
| 3 | d | 2 |
Exercise 2: Merge
The exercise is validated if all questions of the exercise are validated.
Does the output for question 1 look as below?
| | id | Feature1_x | Feature2_x | Feature1_y | Feature2_y |
|---:|-----:|:-------------|:-------------|:-------------|:-------------|
| 0 | 1 | A | B | K | L |
| 1 | 2 | C | D | M | N |
Does the output for question 2 look as below?
| | id | Feature1_df1 | Feature2_df1 | Feature1_df2 | Feature2_df2 |
|---:|-----:|:---------------|:---------------|:---------------|:---------------|
| 0 | 1 | A | B | K | L |
| 1 | 2 | C | D | M | N |
| 2 | 3 | E | F | nan | nan |
| 3 | 4 | G | H | nan | nan |
| 4 | 5 | I | J | nan | nan |
| 5 | 6 | nan | nan | O | P |
| 6 | 7 | nan | nan | Q | R |
| 7 | 8 | nan | nan | S | T |
Note: Check that the suffixes are set using the suffix parameters rather than manually changing the columns' name.
Exercise 3: Merge MultiIndex
The exercise is validated if all questions of the exercise are validated.
Is the outputted DataFrame's shape (1305, 5)
and merged.head()
returns a table as below for question 1? One of the answers that returns the correct DataFrame is market_data.merge(alternative_data, how='left', left_index=True, right_index=True)
Open | Close | Close_Adjusted | |||
---|---|---|---|---|---|
(Timestamp('2021-01-01 00:00:00', freq='B'), 'AAPL') | 0.0991792 | -0.31603 | 0.634787 | -0.00159041 | 1.06053 |
(Timestamp('2021-01-01 00:00:00', freq='B'), 'FB') | -0.123753 | 1.00269 | 0.713264 | 0.0142127 | -0.487028 |
(Timestamp('2021-01-01 00:00:00', freq='B'), 'GE') | -1.37775 | -1.01504 | 1.2858 | 0.109835 | 0.04273 |
(Timestamp('2021-01-01 00:00:00', freq='B'), 'AMZN') | 1.06324 | 0.841241 | -0.799481 | -0.805677 | 0.511769 |
(Timestamp('2021-01-01 00:00:00', freq='B'), 'DAI') | -0.603453 | -2.06141 | -0.969064 | 1.49817 | 0.730055 |
For question 2, are the numbers that are missing in the DataFrame equal to 0 and filled_df.sum().sum() == merged_df.sum().sum()
gives: True
?
Exercise 4: Groupby Apply
The exercise is validated if all questions of the exercise are validated and if the for loop hasn't been used. The goal is to use groupby
and apply
.
Is the output for question 1 the following?
df = pd.DataFrame(range(1,11), columns=['sequence'])
print(winsorize(df, [0.20, 0.80]).to_markdown())
| | sequence |
|---:|-----------:|
| 0 | 2.8 |
| 1 | 2.8 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 8.2 |
| 9 | 8.2 |
Is the output for question 2 a Pandas Series or DataFrame with the first 11 rows equal to the output below? The code below gives a solution.
| | sequence |
|---:|-----------:|
| 0 | 1.45 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |
| 9 | 9.55 |
| 10 | 11.45 |
def winsorize(df_series, quantiles):
"""
df: pd.DataFrame or pd.Series
quantiles: list [0.05, 0.95]
"""
min_value = np.quantile(df_series, quantiles[0])
max_value = np.quantile(df_series, quantiles[1])
return df_series.clip(lower = min_value, upper = max_value)
df.groupby("group")[['sequence']].apply(winsorize, [0.05,0.95])
Exercise 5: Groupby Agg
Is the output for question 1 as below? The columns don't have to be MultiIndex. A solution could be df.groupby('product').agg({'value':['min','max','mean']})
product | ('value', 'min') | ('value', 'max') | ('value', 'mean') |
---|---|---|---|
chair | 22.89 | 32.12 | 27.505 |
mobile phone | 100 | 111.22 | 105.61 |
table | 20.45 | 99.99 | 51.22 |
Exercise 6: Unstack
Is the output similar (as the values are generated randomly, it's obvious the audit doesn't require to match the values below) to what unstacked_df.head()
returns for question 1?
| Date | ('Prediction', 'AAPL') | ('Prediction', 'AMZN') | ('Prediction', 'DAI') | ('Prediction', 'FB') | ('Prediction', 'GE') |
|:--------------------|-------------------------:|-------------------------:|------------------------:|-----------------------:|-----------------------:|
| 2021-01-01 00:00:00 | 0.382312 | -0.072392 | -0.551167 | -0.0585555 | 1.05955 |
| 2021-01-04 00:00:00 | -0.560953 | 0.503199 | -0.79517 | -3.23136 | 1.50271 |
| 2021-01-05 00:00:00 | 0.211489 | 1.84867 | 0.287906 | -1.81119 | 1.20321 |