Introduction to Data Transformation with Python and Pandas
Importing Modules
import numpy as np
import pandas as pd
Loading Data
For this tutorial you need to download:
- The avocados, homelessness.csv, sales_subset.csv, and temperatures.csv datasets from Kaggle's website.
- The COVID19 Daily Updates data from Kaggle's website.
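Before loading anything, it can help to confirm the files landed in the working directory. A minimal sketch; the file names below are the ones used throughout this tutorial, so adjust them if your downloads are named differently:
from pathlib import Path

# Check that each CSV used below is present in the working directory
for name in ["homelessness.csv", "sales_subset.csv", "temperatures.csv", "covid-19-all.csv"]:
    print(name, "found" if Path(name).exists() else "missing")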
Creating DataFrames
A DataFrame from a list
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df11 = pd.DataFrame(a, columns=['one', 'two', 'three'])
print(df11)
one two three
0 a 1.2 4.2
1 b 70 0.03
2 x 5 0
A DataFrame from an array
Example 1
dates = pd.date_range("20130101", periods=6) # pandas indexes
r_num = np.random.randn(6, 4) # array
df = pd.DataFrame(r_num, index=dates, columns=list("ABCD"))
print(df)
A B C D
2013-01-01 0.817994 -0.924007 -1.515711 -0.198598
2013-01-02 0.673364 -1.914110 -0.126208 -0.282033
2013-01-03 1.312579 0.340656 -0.300397 -0.838614
2013-01-04 -0.732977 -0.560867 -0.515910 -0.768784
2013-01-05 -2.045106 -0.929131 -0.029660 0.529883
2013-01-06 -1.343257 -0.250821 -0.046303 0.944569
Example 2
r_num = np.array(range(1,26,1))
# Convert 1D array to a 2D numpy array of 5 rows and 5 columns
arr_2d = np.reshape(r_num, (5, 5))
df4 = pd.DataFrame(arr_2d)
print(df4)
print(type(df4))
0 1 2 3 4
0 1 2 3 4 5
1 6 7 8 9 10
2 11 12 13 14 15
3 16 17 18 19 20
4 21 22 23 24 25
<class 'pandas.core.frame.DataFrame'>
A DataFrame from a dictionary
# Avoid calling the variable "dict": that would shadow the built-in type
data = {
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
}
df2 = pd.DataFrame(data)
print(df2)
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
A DataFrame from a list of dictionaries
# Create a list of dictionaries with new data
df9 = [
{"date": "2019-11-03", "small_sold": 10376832, "large_sold": 7835071},
{"date": "2019-11-10", "small_sold": 10717154, "large_sold": 8561348},
]
# Convert list into DataFrame
df9 = pd.DataFrame(df9)
# Print the new DataFrame
print(df9)
date small_sold large_sold
0 2019-11-03 10376832 7835071
1 2019-11-10 10717154 8561348
A DataFrame from CSV
To display df3, I use the head() method to print only the top rows.
# Read the csv from working directory
df3 = pd.read_csv("homelessness.csv")
print(df3.head())
Unnamed: 0 region ... family_members state_pop
0 0 East South Central ... 864.0 4887681
1 1 Pacific ... 582.0 735139
2 2 Mountain ... 2606.0 7158024
3 3 West South Central ... 432.0 3009733
4 4 Pacific ... 20964.0 39461588
[5 rows x 6 columns]
Another example of importing from a CSV:
df6 = pd.read_csv("sales_subset.csv")
print(df6.head())
Unnamed: 0 store type ... temperature_c fuel_price_usd_per_l unemployment
0 0 1 A ... 5.727778 0.679451 8.106
1 1 1 A ... 8.055556 0.693452 8.106
2 2 1 A ... 16.816667 0.718284 7.808
3 3 1 A ... 22.527778 0.748928 7.808
4 4 1 A ... 27.050000 0.714586 7.808
[5 rows x 10 columns]
df7 = pd.read_csv("temperatures.csv")
print(df7.head())
Unnamed: 0 date city country avg_temp_c
0 0 2000-01-01 Abidjan Côte D'Ivoire 27.293
1 1 2000-02-01 Abidjan Côte D'Ivoire 27.685
2 2 2000-03-01 Abidjan Côte D'Ivoire 29.061
3 3 2000-04-01 Abidjan Côte D'Ivoire 28.162
4 4 2000-05-01 Abidjan Côte D'Ivoire 27.547
df8 = pd.read_csv("covid-19-all.csv")
print(df8.head())
<string>:1: DtypeWarning: Columns (0,1) have mixed types. Specify dtype option on import or set low_memory=False.
Country/Region Province/State Latitude ... Recovered Deaths Date
0 NaN NaN NaN ... 41727.0 2191.0 2021-01-01
1 NaN NaN NaN ... 33634.0 1181.0 2021-01-01
2 NaN NaN NaN ... 67395.0 2762.0 2021-01-01
3 NaN NaN NaN ... 7463.0 84.0 2021-01-01
4 NaN NaN NaN ... 11146.0 405.0 2021-01-01
[5 rows x 8 columns]
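The DtypeWarning appears because pandas infers column types chunk by chunk on large files. One option, suggested by the warning itself, is to read the whole file in a single pass:
# Read the whole file at once so column types are inferred consistently
df8 = pd.read_csv("covid-19-all.csv", low_memory=False)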
Methods and Attributes of a DataFrame
In Python, there are specific methods, or operations, that can be performed on each data structure. Similarly, there are specific characteristics of the data structures called attributes. We can list all the methods and attributes associated with a data structure using the following lines:
att_meth = dir(df)
print(att_meth)
['A', 'B', 'C', 'D', 'T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__array_wrap__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '__xor__', '_accessors', '_accum_func', '_add_numeric_operations', '_agg_by_level', '_agg_examples_doc', '_agg_summary_and_see_also_doc', '_align_frame', '_align_series', '_append', '_arith_method', '_as_manager', '_attrs', '_box_col_values', '_can_fast_transpose', '_check_inplace_and_allows_duplicate_labels', '_check_inplace_setting', '_check_is_chained_assignment_possible', '_check_label_or_level_ambiguity', '_check_setitem_copy', '_clear_item_cache', '_clip_with_one_bound', '_clip_with_scalar', '_cmp_method', '_combine_frame', '_consolidate', '_consolidate_inplace', '_construct_axes_dict', '_construct_axes_from_arguments', '_construct_result', '_constructor', '_constructor_sliced', '_convert', '_count_level', '_data', '_dir_additions', '_dir_deletions', '_dispatch_frame_op', '_drop_axis', '_drop_labels_or_levels', '_ensure_valid_index', '_find_valid_index', '_flags', '_from_arrays', '_from_mgr', '_get_agg_axis', '_get_axis', '_get_axis_name', '_get_axis_number', '_get_axis_resolvers', '_get_block_manager_axis', '_get_bool_data', '_get_cleaned_column_resolvers', '_get_column_array', '_get_index_resolvers', '_get_item_cache', '_get_label_or_level_values', '_get_numeric_data', '_get_value', '_getitem_bool_array', '_getitem_multilevel', '_gotitem', '_hidden_attrs', '_indexed_same', '_info_axis', '_info_axis_name', '_info_axis_number', '_info_repr', '_init_mgr', '_inplace_method', '_internal_names', '_internal_names_set', '_is_copy', '_is_homogeneous_type', '_is_label_or_level_reference', '_is_label_reference', '_is_level_reference', '_is_mixed_type', '_is_view', '_iset_item', '_iset_item_mgr', '_iset_not_inplace', '_item_cache', '_iter_column_arrays', '_ixs', '_join_compat', '_logical_func', '_logical_method', '_maybe_cache_changed', '_maybe_update_cacher', '_metadata', '_mgr', '_min_count_stat_function', '_needs_reindex_multi', '_protect_consolidate', '_reduce', '_reduce_axis1', '_reindex_axes', '_reindex_columns', '_reindex_index', '_reindex_multi', '_reindex_with_indexers', '_rename', '_replace_columnwise', '_repr_data_resource_', '_repr_fits_horizontal_', '_repr_fits_vertical_', '_repr_html_', '_repr_latex_', '_reset_cache', '_reset_cacher', '_sanitize_column', '_series', '_set_axis', '_set_axis_name', '_set_axis_nocheck', '_set_is_copy', '_set_item', '_set_item_frame_value', '_set_item_mgr', 
'_set_value', '_setitem_array', '_setitem_frame', '_setitem_slice', '_slice', '_stat_axis', '_stat_axis_name', '_stat_axis_number', '_stat_function', '_stat_function_ddof', '_take_with_is_copy', '_to_dict_of_blocks', '_typ', '_update_inplace', '_validate_dtype', '_values', '_where', 'abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'append', 'apply', 'applymap', 'asfreq', 'asof', 'assign', 'astype', 'at', 'at_time', 'attrs', 'axes', 'backfill', 'between_time', 'bfill', 'bool', 'boxplot', 'clip', 'columns', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'corrwith', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'dtypes', 'duplicated', 'empty', 'eq', 'equals', 'eval', 'ewm', 'expanding', 'explode', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'flags', 'floordiv', 'from_dict', 'from_records', 'ge', 'get', 'groupby', 'gt', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iloc', 'index', 'infer_objects', 'info', 'insert', 'interpolate', 'isin', 'isna', 'isnull', 'items', 'iteritems', 'iterrows', 'itertuples', 'join', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'loc', 'lookup', 'lt', 'mad', 'mask', 'max', 'mean', 'median', 'melt', 'memory_usage', 'merge', 'min', 'mod', 'mode', 'mul', 'multiply', 'ndim', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'pivot', 'pivot_table', 'plot', 'pop', 'pow', 'prod', 'product', 'quantile', 'query', 'radd', 'rank', 'rdiv', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'select_dtypes', 'sem', 'set_axis', 'set_flags', 'set_index', 'shape', 'shift', 'size', 'skew', 'slice_shift', 'sort_index', 'sort_values', 'squeeze', 'stack', 'std', 'style', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_feather', 'to_gbq', 'to_hdf', 'to_html', 'to_json', 'to_latex', 'to_markdown', 'to_numpy', 'to_parquet', 'to_period', 'to_pickle', 'to_records', 'to_sql', 'to_stata', 'to_string', 'to_timestamp', 'to_xarray', 'to_xml', 'transform', 'transpose', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'unstack', 'update', 'value_counts', 'values', 'var', 'where', 'xs']
All methods have parentheses and typically take arguments. Attributes, however, do not have parentheses. Both attributes and methods are accessed using the . (dot) notation.
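For example, contrasting an attribute with a method on the df created earlier:
# shape is an attribute: no parentheses, no arguments
print(df.shape)    # (6, 4)
# head() is a method: parentheses, and it optionally takes arguments
print(df.head(2))  # first two rows of df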
DataFrame Attributes
Dimension and data types
# Print the dimension of the dataframe
print(df3.shape)
# Print the dataframe column types
print(df3.dtypes)
(51, 6)
Unnamed: 0 int64
region object
state object
individuals float64
family_members float64
state_pop int64
dtype: object
Columns and Rows
# Print the column index of dataframe
print(df3.columns)
# Print the row index of dataframe
print(df3.index)
Index(['Unnamed: 0', 'region', 'state', 'individuals', 'family_members',
'state_pop'],
dtype='object')
RangeIndex(start=0, stop=51, step=1)
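A few more attributes are worth knowing; a quick sketch on df3, where the values follow from its shape of (51, 6):
print(df3.ndim)   # 2: a DataFrame is two-dimensional
print(df3.size)   # 306: number of cells (51 rows * 6 columns)
print(df3.values) # the underlying data as a NumPy array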
DataFrame Methods
Describe the DataFrame
# First Rows
print(df.head())
# Last Rows
print(df.tail())
A B C D
2013-01-01 0.817994 -0.924007 -1.515711 -0.198598
2013-01-02 0.673364 -1.914110 -0.126208 -0.282033
2013-01-03 1.312579 0.340656 -0.300397 -0.838614
2013-01-04 -0.732977 -0.560867 -0.515910 -0.768784
2013-01-05 -2.045106 -0.929131 -0.029660 0.529883
A B C D
2013-01-02 0.673364 -1.914110 -0.126208 -0.282033
2013-01-03 1.312579 0.340656 -0.300397 -0.838614
2013-01-04 -0.732977 -0.560867 -0.515910 -0.768784
2013-01-05 -2.045106 -0.929131 -0.029660 0.529883
2013-01-06 -1.343257 -0.250821 -0.046303 0.944569
Information
print(df.info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2013-01-01 to 2013-01-06
Freq: D
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 6 non-null float64
1 B 6 non-null float64
2 C 6 non-null float64
3 D 6 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes
None
Sort a DataFrame
# Sort df3 by individuals (ascending)
df3_ind = df3.sort_values("individuals")
# Print the top few rows
print(df3_ind.head())
# Sort df3 by individuals (descending)
df3_ind = df3.sort_values("individuals", ascending=False)
# Print the top few rows
print(df3_ind.head())
# Sort df3 by region ascending, then by family_members descending
df3_reg_fam = df3.sort_values(['region', 'family_members'], ascending=[True, False])
# Print the top few rows
print(df3_reg_fam.head())
Unnamed: 0 region ... family_members state_pop
50 50 Mountain ... 205.0 577601
34 34 West North Central ... 75.0 758080
7 7 South Atlantic ... 374.0 965479
39 39 New England ... 354.0 1058287
45 45 New England ... 511.0 624358
[5 rows x 6 columns]
Unnamed: 0 region ... family_members state_pop
4 4 Pacific ... 20964.0 39461588
32 32 Mid-Atlantic ... 52070.0 19530351
9 9 South Atlantic ... 9587.0 21244317
43 43 West South Central ... 6111.0 28628666
47 47 Pacific ... 5880.0 7523869
[5 rows x 6 columns]
Unnamed: 0 region ... family_members state_pop
13 13 East North Central ... 3891.0 12723071
35 35 East North Central ... 3320.0 11676341
22 22 East North Central ... 3142.0 9984072
49 49 East North Central ... 2167.0 5807406
14 14 East North Central ... 1482.0 6695497
[5 rows x 6 columns]
Indexing
By default, when we create a DataFrame from scratch, pandas assigns integer indexes to both the rows and the columns, ranging from zero to the last value; see, for instance, df4, which was created from a 2D array.
- Column index
df4.columns
RangeIndex(start=0, stop=5, step=1)
- Row index
df4.index
RangeIndex(start=0, stop=5, step=1)
However, DataFrames created from a CSV have column names by default and whole numbers as row indexes; for instance, the DataFrames df6 and df7.
- Column index
# Examples of column indexes
df6.columns
df7.columns
Index(['Unnamed: 0', 'store', 'type', 'department', 'date', 'weekly_sales',
'is_holiday', 'temperature_c', 'fuel_price_usd_per_l', 'unemployment'],
dtype='object')
Index(['Unnamed: 0', 'date', 'city', 'country', 'avg_temp_c'], dtype='object')
- Row index
# Examples of row indexes
df6.index
df7.index
RangeIndex(start=0, stop=10774, step=1)
RangeIndex(start=0, stop=16500, step=1)
Changing column names is straightforward:
# Old
print(df9)
df9.columns = ['A', 'B', 'C']
# New
print(df9)
date small_sold large_sold
0 2019-11-03 10376832 7835071
1 2019-11-10 10717154 8561348
A B C
0 2019-11-03 10376832 7835071
1 2019-11-10 10717154 8561348
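If you only need to change some names, or want to restore the originals, the rename method takes a mapping from old to new names; a sketch that puts the original df9 names back:
# Rename via a mapping; columns not listed keep their current names
df9 = df9.rename(columns={"A": "date", "B": "small_sold", "C": "large_sold"})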
It is possible to use one column as a row index, as follows:
df7_1 = df7.set_index("city")
df7_1.head()
Unnamed: 0 date country avg_temp_c
city
Abidjan 0 2000-01-01 Côte D'Ivoire 27.293
Abidjan 1 2000-02-01 Côte D'Ivoire 27.685
Abidjan 2 2000-03-01 Côte D'Ivoire 29.061
Abidjan 3 2000-04-01 Côte D'Ivoire 28.162
Abidjan 4 2000-05-01 Côte D'Ivoire 27.547
Dropping the index works as follows:
df7_1.reset_index().head()
city Unnamed: 0 date country avg_temp_c
0 Abidjan 0 2000-01-01 Côte D'Ivoire 27.293
1 Abidjan 1 2000-02-01 Côte D'Ivoire 27.685
2 Abidjan 2 2000-03-01 Côte D'Ivoire 29.061
3 Abidjan 3 2000-04-01 Côte D'Ivoire 28.162
4 Abidjan 4 2000-05-01 Côte D'Ivoire 27.547
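If you want to discard the index rather than turn it back into a column, reset_index accepts drop=True:
# Throw the city index away instead of restoring it as a column
df7_1.reset_index(drop=True).head()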
Once row-indexed, you will usually want to sort the DataFrame by this index.
print(df7_1.sort_index())
Unnamed: 0 date country avg_temp_c
city
Abidjan 0 2000-01-01 Côte D'Ivoire 27.293
Abidjan 106 2008-11-01 Côte D'Ivoire 27.302
Abidjan 107 2008-12-01 Côte D'Ivoire 27.472
Abidjan 108 2009-01-01 Côte D'Ivoire 26.912
Abidjan 109 2009-02-01 Côte D'Ivoire 28.224
... ... ... ... ...
Xian 16391 2004-09-01 China 17.889
Xian 16392 2004-10-01 China 11.229
Xian 16393 2004-11-01 China 5.720
Xian 16395 2005-01-01 China -2.209
Xian 16499 2013-09-01 China NaN
[16500 rows x 4 columns]
Subsetting rows
To subset specific rows of the DataFrame, we can filter columns using relational operators (Boolean conditions) that return True or False for each row. There are at least two common ways to achieve the same objective. Imagine you want to subset the rows of df3 where the column individuals is greater than 10000.
For example, passing the column as an attribute:
df3[df3.individuals>10000]
Unnamed: 0 region ... family_members state_pop
4 4 Pacific ... 20964.0 39461588
9 9 South Atlantic ... 9587.0 21244317
32 32 Mid-Atlantic ... 52070.0 19530351
37 37 Pacific ... 3337.0 4181886
43 43 West South Central ... 6111.0 28628666
47 47 Pacific ... 5880.0 7523869
[6 rows x 6 columns]
Alternatively, we can subset directly by passing df3["individuals"] > 10000, which evaluates to an object of class pandas.core.series.Series:
print(type(df3["individuals"]>10000))
df3[df3["individuals"]>10000]
<class 'pandas.core.series.Series'>
Unnamed: 0 region ... family_members state_pop
4 4 Pacific ... 20964.0 39461588
9 9 South Atlantic ... 9587.0 21244317
32 32 Mid-Atlantic ... 52070.0 19530351
37 37 Pacific ... 3337.0 4181886
43 43 West South Central ... 6111.0 28628666
47 47 Pacific ... 5880.0 7523869
[6 rows x 6 columns]
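A third option, shown here as a sketch, is the query method, which expresses the same Boolean condition as a string and returns the same six rows:
# Same filter written as a query string
df3.query("individuals > 10000")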
Another way to subset rows is with the loc method, which takes advantage of the row and column indexes.
Subsetting rows using the loc
- We use [rows, columns] brackets after the DataFrame to distinguish between rows and columns.
- Row and column indexes are typically strings nested in list objects.
- Subsetting a range of rows or columns is easy with the : slicing operator.
Let's look at df7: first we set the column city as a row index, and then we pass the list cities to select all the rows for the cities "Abidjan" and "Xian".
cities = ["Abidjan", "Xian"]
df7_1.loc[cities].head()
Unnamed: 0 date country avg_temp_c
city
Abidjan 0 2000-01-01 Côte D'Ivoire 27.293
Abidjan 1 2000-02-01 Côte D'Ivoire 27.685
Abidjan 2 2000-03-01 Côte D'Ivoire 29.061
Abidjan 3 2000-04-01 Côte D'Ivoire 28.162
Abidjan 4 2000-05-01 Côte D'Ivoire 27.547
Going further, we can specify multilevel indexes, that is, indexes nested inside other indexes. Let's give an example where we nest the column city inside the index of country from the dataset df7.
# Index df7 by country & city
df7_1 = df7.set_index(["country", "city"])
# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")] # this is a list
print(df7_1.loc[rows_to_keep].head())
Unnamed: 0 date avg_temp_c
country city
Brazil Rio De Janeiro 12540 2000-01-01 25.974
Rio De Janeiro 12541 2000-02-01 26.699
Rio De Janeiro 12542 2000-03-01 26.270
Rio De Janeiro 12543 2000-04-01 25.750
Rio De Janeiro 12544 2000-05-01 24.356
Now that we have two indexes, it is possible to sort the data accordingly:
df7_1.sort_index(level=["country", "city"], ascending=[True, False]).head()
Unnamed: 0 date avg_temp_c
country city
Afghanistan Kabul 7260 2000-01-01 3.326
Kabul 7261 2000-02-01 3.454
Kabul 7262 2000-03-01 9.612
Kabul 7263 2000-04-01 17.925
Kabul 7264 2000-05-01 24.658
Using the loc method, it is easy to slice blocks of values using the indexes. For instance, in df7_1, which was indexed and sorted, we can pass the range "Pakistan":"Russia" to subset this span of values from the DataFrame. It is important to sort the DataFrame before slicing by row index, as follows:
df7_1 = df7_1.sort_index()
df7_1.loc["Pakistan":"Russia"]
Unnamed: 0 date avg_temp_c
country city
Pakistan Faisalabad 4785 2000-01-01 12.792
Faisalabad 4786 2000-02-01 14.339
Faisalabad 4787 2000-03-01 20.309
Faisalabad 4788 2000-04-01 29.072
Faisalabad 4789 2000-05-01 34.845
... ... ... ...
Russia Saint Petersburg 13360 2013-05-01 12.355
Saint Petersburg 13361 2013-06-01 17.185
Saint Petersburg 13362 2013-07-01 17.234
Saint Petersburg 13363 2013-08-01 17.153
Saint Petersburg 13364 2013-09-01 NaN
[1155 rows x 3 columns]
Finally, given that we have indexed df7_1 by two columns, we can subset between pairs of index values and return a given set of rows.
df7_1.loc[("Pakistan","Lahore"):("Russia", "Moscow")].head()
Unnamed: 0 date avg_temp_c
country city
Pakistan Lahore 8415 2000-01-01 12.792
Lahore 8416 2000-02-01 14.339
Lahore 8417 2000-03-01 20.309
Lahore 8418 2000-04-01 29.072
Lahore 8419 2000-05-01 34.845
Subsetting rows using the iloc
Using iloc with a DataFrame is similar to loc:
- We use [rows, columns] brackets after the DataFrame to distinguish between rows and columns.
- Row and column indexes are integers starting from zero.
- Subsetting a range of rows or columns is easy with the : slicing operator.
Subset the first five rows
df7.iloc[:5, :]
Unnamed: 0 date city country avg_temp_c
0 0 2000-01-01 Abidjan Côte D'Ivoire 27.293
1 1 2000-02-01 Abidjan Côte D'Ivoire 27.685
2 2 2000-03-01 Abidjan Côte D'Ivoire 29.061
3 3 2000-04-01 Abidjan Côte D'Ivoire 28.162
4 4 2000-05-01 Abidjan Côte D'Ivoire 27.547
Subsetting rows on two or more columns
A more complex example involves filtering on two columns. We subset on two conditions, grouping each condition inside () parentheses as follows:
print(df3[(df3["state_pop"] > 5000000) & (df3["individuals"]> 5000)])
Unnamed: 0 region ... family_members state_pop
2 2 Mountain ... 2606.0 7158024
4 4 Pacific ... 20964.0 39461588
5 5 Mountain ... 3250.0 5691287
9 9 South Atlantic ... 9587.0 21244317
10 10 South Atlantic ... 2556.0 10511131
13 13 East North Central ... 3891.0 12723071
21 21 New England ... 13257.0 6882635
22 22 East North Central ... 3142.0 9984072
30 30 Mid-Atlantic ... 3350.0 8886025
32 32 Mid-Atlantic ... 52070.0 19530351
33 33 South Atlantic ... 2817.0 10381615
35 35 East North Central ... 3320.0 11676341
38 38 Mid-Atlantic ... 5349.0 12800922
42 42 East South Central ... 1744.0 6771631
43 43 West South Central ... 6111.0 28628666
47 47 Pacific ... 5880.0 7523869
[16 rows x 6 columns]
Another example: in df3, filter for rows where family_members is less than 1000 and region is "Pacific".
df3[(df3.family_members < 1000) & (df3.region == "Pacific")]
Unnamed: 0 region state individuals family_members state_pop
1 1 Pacific Alaska 1434.0 582.0 735139
Finally, we can subset from a list of options using the isin() method.
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]
# Filter for rows in the Mojave Desert states
df3[df3.state.isin(canu)].head()
Unnamed: 0 region state individuals family_members state_pop
2 2 Mountain Arizona 7259.0 2606.0 7158024
4 4 Pacific California 109008.0 20964.0 39461588
28 28 Mountain Nevada 7058.0 486.0 3027341
44 44 Mountain Utah 1904.0 972.0 3153550
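The mask can also be inverted with ~ to keep everything outside the list:
# Rows for states NOT in the Mojave Desert list
df3[~df3.state.isin(canu)].head()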
Subsetting columns
To subset, it is necessary to know the column names of the DataFrame. For instance, the column names of df4 can be retrieved using the attribute df4.columns.values, as follows:
print(df4.columns.values)
[0 1 2 3 4]
Knowing the column index, it is then easy to subset the DataFrame:
# First column
df4[[0]]
# Last column
df4[[4]]
0
0 1
1 6
2 11
3 16
4 21
4
0 5
1 10
2 15
3 20
4 25
Another example uses the DataFrame df3, whose column indexes are:
print(df3.columns.values)
['Unnamed: 0' 'region' 'state' 'individuals' 'family_members' 'state_pop']
We can pass a list of column names to subset the state and family_members columns as follows:
# select two columns
df3[["state", "family_members"]].head()
state family_members
0 Alabama 864.0
1 Alaska 582.0
2 Arizona 2606.0
3 Arkansas 432.0
4 California 20964.0
To subset columns using the loc method, we have to specify both the rows and the columns. If we intend to keep all the rows and only subset columns, we pass the : slice operator for the rows.
df3.loc[:, ["state", "family_members"]].head()
state family_members
0 Alabama 864.0
1 Alaska 582.0
2 Arizona 2606.0
3 Arkansas 432.0
4 California 20964.0
Another example using df7:
df7.loc[:, ["city", "country"]].head()
city country
0 Abidjan Côte D'Ivoire
1 Abidjan Côte D'Ivoire
2 Abidjan Côte D'Ivoire
3 Abidjan Côte D'Ivoire
4 Abidjan Côte D'Ivoire
Similarly to the loc method, subsetting columns with iloc uses integers to map columns; for instance, positions 1 and 3 select the date and country columns of df7. With iloc, the : slice operator states that we are selecting all the rows (or all the columns).
df7.columns
df7.iloc[:, [1,3]].head()
Index(['Unnamed: 0', 'date', 'city', 'country', 'avg_temp_c'], dtype='object')
date country
0 2000-01-01 Côte D'Ivoire
1 2000-02-01 Côte D'Ivoire
2 2000-03-01 Côte D'Ivoire
3 2000-04-01 Côte D'Ivoire
4 2000-05-01 Côte D'Ivoire
Subsetting rows and columns
First, locate the rows of the second column that comply with a rule. In this example, we locate elements that are divisible by 2.
# Subset elements of the second column that are divisible by two
rows = df4[1] % 2 == 0
print(type(rows))
<class 'pandas.core.series.Series'>
Notice that Python creates an object of type pandas.core.series.Series. Next, we use this object to subset the elements of the second column in the following way:
df5 = df4[[1]][rows]
print(df5.shape)
print(type(df5))
print(df5)
(3, 1)
<class 'pandas.core.frame.DataFrame'>
1
0 2
2 12
4 22
Another method of subsetting rows and columns is by using indexes and the loc method. Recall that slicing this way works best when the dataset is indexed and sorted accordingly. To refresh these steps, let's perform the following operations:
df7_1 = df7.set_index(["country", "city"]) # set the indexes
df7_1 = df7_1.sort_index() # sort the DataFrame by its index (ascending)
Now we can subset the column date for all rows where the country is Zimbabwe, as follows:
df7_1.loc["Zimbabwe", "date"].head()
type(df7_1.index.values)
city
Harare 2000-01-01
Harare 2000-02-01
Harare 2000-03-01
Harare 2000-04-01
Harare 2000-05-01
Name: date, dtype: object
<class 'numpy.ndarray'>
Subsetting rows and columns using the iloc
This method is similar to the loc method, but instead of using strings as indexes, we use integers to call rows and columns. We follow these two rules:
- We use [rows, columns] brackets to distinguish between rows and columns.
- Row and column indexes are numbers that start at 0.
Recall df4
print(df4)
0 1 2 3 4
0 1 2 3 4 5
1 6 7 8 9 10
2 11 12 13 14 15
3 16 17 18 19 20
4 21 22 23 24 25
Subset the first and last element of df4
# Subset the first element of df4
print(df4.iloc[0,0])
# Subset the last element of df4
print(df4.iloc[4,4])
1
25
Extract the first column
df4.iloc[:,[0]]
0
0 1
1 6
2 11
3 16
4 21
Extract the first row
df4.iloc[[0],:]
0 1 2 3 4
0 1 2 3 4 5
Extract the last 10 elements of the second column
# Negative slicing counts from the end, so -10: takes the last ten rows
df7.iloc[-10:, [1]]
date
16490 2012-12-01
16491 2013-01-01
16492 2013-02-01
16493 2013-03-01
16494 2013-04-01
16495 2013-05-01
16496 2013-06-01
16497 2013-07-01
16498 2013-08-01
16499 2013-09-01
Get the first four rows of the columns city and country (positions 2 and 3).
df7.iloc[:4, 2:4]
city country
0 Abidjan Côte D'Ivoire
1 Abidjan Côte D'Ivoire
2 Abidjan Côte D'Ivoire
3 Abidjan Côte D'Ivoire
Extract all the elements in the first row that comply with a condition. For this example we build the condition that the values in the first row are greater than two, df4.iloc[0,:] > 2. Then we use the np.where function to locate the positions of the elements that are True.
df4.iloc[[0], np.where(df4.iloc[0,:]>2)[0]]
2 3 4
0 3 4 5
Transforming
Describe (Summary Statistics)
# mean
print(df3.state_pop.mean())
# median
print(df3.state_pop.median())
# variance
print(df3.state_pop.var())
# standard deviation
print(df3.state_pop.std())
# min value
print(df3.state_pop.min())
# max value
print(df3.state_pop.max())
# all together
print(df3.describe())
6405637.274509804
4461153.0
53688706994844.23
7327257.808678784
577601
39461588
Unnamed: 0 individuals family_members state_pop
count 51.000000 51.000000 51.000000 5.100000e+01
mean 25.000000 7225.784314 3504.882353 6.405637e+06
std 14.866069 15991.025083 7805.411811 7.327258e+06
min 0.000000 434.000000 75.000000 5.776010e+05
25% 12.500000 1446.500000 592.000000 1.777414e+06
50% 25.000000 3082.000000 1482.000000 4.461153e+06
75% 37.500000 6781.500000 3196.000000 7.340946e+06
max 50.000000 109008.000000 52070.000000 3.946159e+07
If the DataFrame contains integers or real numbers, it is easy to perform these operations across the columns or the rows.
# Mean of each row (across columns)
df4.mean(axis=1)
# Mean of each column (down the rows)
df4.mean(axis=0)
0 3.0
1 8.0
2 13.0
3 18.0
4 23.0
dtype: float64
0 11.0
1 12.0
2 13.0
3 14.0
4 15.0
dtype: float64
Column operations
The main column operations are:
- Sum
print(df3.state_pop.sum())
326687501
- Cumulative sum
print(df3.state_pop.cumsum())
0 4887681
1 5622820
2 12780844
3 15790577
4 55252165
5 60943452
6 64514972
7 65480451
8 66181998
9 87426315
10 97937446
11 99358039
12 101108575
13 113831646
14 120527143
15 123675761
16 126587120
17 131048273
18 135707963
19 137047020
20 143082822
21 149965457
22 159949529
23 165555778
24 168536798
25 174658421
26 175719086
27 177644700
28 180672041
29 182025506
30 190911531
31 193004272
32 212534623
33 222916238
34 223674318
35 235350659
36 239290894
37 243472780
38 256273702
39 257331989
40 262416145
41 263294843
42 270066474
43 298695140
44 301848690
45 302473048
46 310974334
47 318498203
48 320302494
49 326109900
50 326687501
Name: state_pop, dtype: int64
- Cumulative product (note how the running int64 product overflows here, wrapping into meaningless negative values)
print(df3.state_pop.cumprod())
0 4887681
1 3593124922659
2 7272930357681714200
3 7903566051232904824
4 -6717292260295177376
5 8873061169631614368
6 8148294484792166400
7 5819023669836478464
8 -1205241372944291840
9 -6292177132517226496
10 6622373097463437312
11 4962005595153618944
12 3470577629245915136
13 -8423188867997155328
14 -2225371326546493440
15 3946997816422858752
16 5096266778097451008
17 1136688303390556160
18 1494541717077688320
19 7730232222543446016
20 8113663438137196544
21 -1654939752842264576
22 6166687450565443584
23 1500558237656678400
24 6253697334853500928
25 -8790508571491041280
26 -7109132568401805312
27 -5744006173422518272
28 -2284299832833081344
29 -8673003558171312128
30 -3504656614122586112
31 -5377308965807849472
32 -4878392185126387712
33 1709610911523667968
34 8941421250245296128
35 9215048314536329216
36 -8334418405679431680
37 4198662944055099392
38 -8723413785341067264
39 581225513859678208
40 1910243011917250560
41 2130586611002376192
42 -2684937009104945152
43 -2708850407756529664
44 7509713552135946240
45 -6781182851288137728
46 -6382118816838582272
47 -4484145143506534400
48 2332118313460563968
49 -4719128645426216960
50 -8733419210157326336
Name: state_pop, dtype: int64
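One way to avoid the wrap-around is to cast to float64 first; a sketch. The running product can still reach inf for a column this large, but it no longer flips into negative values:
# Cast to float64 before the cumulative product to avoid int64 wrap-around
print(df3.state_pop.astype("float64").cumprod())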
Adding new columns
To add new columns, we can take a DataFrame, for instance df4, write the new index (or column name) on the left of =, and assign the values on the right, as follows:
# add columns 3 and 4 and save them on a new column
df4[5] = df4[2] + df4[3]
# estimate column 3 as a proportion of column 5 and add the result in a new column
df4[6] = df4[2] / df4[5]
print(df4)
0 1 2 3 4 5 6
0 1 2 3 4 5 7 0.428571
1 6 7 8 9 10 17 0.470588
2 11 12 13 14 15 27 0.481481
3 16 17 18 19 20 37 0.486486
4 21 22 23 24 25 47 0.489362
If the columns have string names rather than a numeric index, add a new column by passing a string with the new column's name, as follows:
# Create indiv_per_10k col as homeless individuals per 10k state pop
df3["indiv_per_10k"] = 10000 * df3.individuals / df3.state_pop
Converting column types
Transform the date column into datetime64[ns]:
# Print class of object
print(df7["date"].dtypes) # column date is an object
# transform
df7["date"] = pd.to_datetime(df7["date"])
# Print the new class of object
print(df7["date"].dtypes) # column date is now datetime64[ns]
object
datetime64[ns]
Now that date is of class datetime64[ns], we can create new columns for the year, month, or day:
# Add a year column to temperatures
df7["year"] = df7["date"].dt.year
df7["month"] = df7["date"].dt.month
df7["day"] = df7["date"].dt.day
df7.head()
print(df7.dtypes)
Unnamed: 0 date city country avg_temp_c year month day
0 0 2000-01-01 Abidjan Côte D'Ivoire 27.293 2000 1 1
1 1 2000-02-01 Abidjan Côte D'Ivoire 27.685 2000 2 1
2 2 2000-03-01 Abidjan Côte D'Ivoire 29.061 2000 3 1
3 3 2000-04-01 Abidjan Côte D'Ivoire 28.162 2000 4 1
4 4 2000-05-01 Abidjan Côte D'Ivoire 27.547 2000 5 1
Unnamed: 0 int64
date datetime64[ns]
city object
country object
avg_temp_c float64
year int64
month int64
day int64
dtype: object
Here is another example using df11: columns two and three are strings that can be converted into numeric values.
print(df11.dtypes)
df11[['two', 'three']] = df11[['two', 'three']].astype(float)
one object
two object
three object
dtype: object
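If a string column may contain badly formed entries, astype(float) would raise an error; pd.to_numeric with errors="coerce" is a more forgiving alternative (a sketch): any unparseable value becomes NaN.
# Unparseable strings become NaN instead of raising an error
df11["two"] = pd.to_numeric(df11["two"], errors="coerce")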
Aggregating
The aggregate method works by defining a function that is later called on each column of a DataFrame using the agg() method. For instance, consider the following function, which computes the interquartile range (the difference between the 75th and 25th percentiles):
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
We can use this function on the first and third columns of the df4 DataFrame as follows:
print(df4[[0, 2]].agg(iqr))
0 10.0
2 10.0
dtype: float64
agg() can take more than one function, for instance:
print(df4[[0, 2]].agg([iqr, np.median, np.mean]))
0 2
iqr 10.0 10.0
median 11.0 13.0
mean 11.0 13.0
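Newer pandas versions prefer string aliases over the NumPy functions here; this sketch produces the same table:
# String aliases are equivalent and avoid deprecation warnings in recent pandas
print(df4[[0, 2]].agg([iqr, "median", "mean"]))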
Grouping
For grouping, we use the pandas method .groupby(), which splits a DataFrame according to a categorical variable. The following snippet splits the DataFrame into two categories using the variable type, and then computes the sum() of the column weekly_sales.
df6.groupby("type")["weekly_sales"].sum()
type
A 2.337163e+08
B 2.317840e+07
Name: weekly_sales, dtype: float64
We can even group by more than one category:
df6.groupby(["type", "is_holiday"])["weekly_sales"].sum()
type is_holiday
A False 2.336927e+08
True 2.360181e+04
B False 2.317678e+07
True 1.621410e+03
Name: weekly_sales, dtype: float64
Now we can use .groupby together with agg to calculate the mean, median, max, and min of a variable.
df6.groupby("type")[["unemployment", "fuel_price_usd_per_l"]].agg([np.mean, np.median, np.max, np.min])
unemployment ... fuel_price_usd_per_l
mean median amax ... median amax amin
type ...
A 7.972611 8.067 8.992 ... 0.735455 1.107410 0.664129
B 9.279323 9.199 9.765 ... 0.803348 1.107674 0.760023
[2 rows x 8 columns]
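Named aggregation is a tidy alternative when you want control over the output column names; a sketch:
# Each keyword becomes a column in the result
df6.groupby("type").agg(
    mean_unemployment=("unemployment", "mean"),
    max_fuel_price=("fuel_price_usd_per_l", "max"),
)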
Dealing with NaN values
# Initial number of rows
na_rows = df8.shape[0]
# Report the sum of NaN values in each column
df8.isna().sum()
# Report the sum of NaN values in each column as a proportion
df8.isna().sum()/df8.shape[0]
Country/Region 171061
Province/State 223208
Latitude 171062
Longitude 171062
Confirmed 19
Recovered 386
Deaths 432
Date 0
dtype: int64
Country/Region 0.137736
Province/State 0.179724
Latitude 0.137736
Longitude 0.137736
Confirmed 0.000015
Recovered 0.000311
Deaths 0.000348
Date 0.000000
dtype: float64
Imputing NaN values with zero
# Impute NaN with zero
df8_1 = df8.fillna(0)
# Confirm that no NaN values remain
df8_1.isna().any()
Country/Region False
Province/State False
Latitude False
Longitude False
Confirmed False
Recovered False
Deaths False
Date False
dtype: bool
Removing rows with missing values
# Drop NaN values
df8 = df8.dropna()
# How many rows were dropped?
na_rows - df8.shape[0]
223572
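Since most of the missing values sit in Province/State, dropping every row with any NaN also discards rows whose case counts are complete. A gentler sketch, applied to df8 before the blanket dropna above, restricts which columns count:
# Drop rows only when the numeric case counts are missing
df8_2 = df8.dropna(subset=["Confirmed", "Recovered", "Deaths"])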
Pivot Tables
The pivot table is another method to transform data, using categorical variables to split the DataFrame. By default, the .pivot_table method computes the mean of a variable.
# Pivot for mean weekly_sales for each store type
df6.pivot_table(values="weekly_sales", index="type")
weekly_sales
type
A 23674.667242
B 25696.678370
An example selecting two variables:
# Pivot for the mean of weekly_sales and unemployment for each store type
df6.pivot_table(values=["weekly_sales", "unemployment"], index="type")
unemployment weekly_sales
type
A 7.972611 23674.667242
B 9.279323 25696.678370
We can extend the pivot table capabilities by passing two functions instead of one, as follows:
# Pivot for mean and median weekly_sales for each store type
df6.pivot_table(values="weekly_sales", index="type", aggfunc=[np.mean, np.median])
mean median
weekly_sales weekly_sales
type
A 23674.667242 11943.92
B 25696.678370 13336.08
We can split the DataFrame passing one variable as index and another as columns:
df6.pivot_table(values="weekly_sales", index="type", columns="is_holiday")
is_holiday False True
type
A 23768.583523 590.04525
B 25751.980533 810.70500
Create a pivot table of the avg_temp_c column from df7, with country and city as rows, and year as columns:
df7["year"] = df7["date"].dt.year
# Build the pivot: country and city as rows, years as columns
df7.pivot_table(values="avg_temp_c", index=["country", "city"], columns="year")
The next example computes over two grouping variables, replaces the NaN values with fill_value=0, and adds a total of the columns and rows with the argument margins=True.
df6.pivot_table(values="weekly_sales", index="department", columns=["store", "type"], fill_value=0, margins=True)
store 1 2 ... 39 All
type A A ... A
department ...
1 23491.755000 32392.588333 ... 21423.068333 32052.467153
2 47421.124167 68156.664167 ... 60768.638333 71380.022778
3 12872.590000 17012.000000 ... 16847.852500 18278.390625
4 38382.255833 47650.447500 ... 41670.077500 44863.253681
5 23761.120000 30331.717500 ... 23466.439167 37189.000000
... ... ... ... ... ...
96 27897.153333 33841.960833 ... 24947.875833 20337.607681
97 33771.761667 40757.997500 ... 23002.670000 26584.400833
98 10853.782500 14009.203333 ... 9089.097500 11820.590278
99 466.364545 455.516364 ... 317.189091 379.123659
All 20896.941787 26517.435162 ... 18414.938423 23843.950149
[81 rows x 13 columns]
Exporting
DataFrame to CSV
The most common way to export data is saving the DataFrame to a CSV file:
df9.to_csv("df9.csv")
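By default, to_csv writes the row index as an extra column; this is exactly where the "Unnamed: 0" columns seen throughout this tutorial come from. Passing index=False avoids it:
# Omit the row index so a later read_csv does not create an "Unnamed: 0" column
df9.to_csv("df9.csv", index=False)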
DataFrame to LaTeX
print(df.style.to_latex())
\begin{tabular}{lrrrr}
& A & B & C & D \\
2013-01-01 00:00:00 & 0.817994 & -0.924007 & -1.515711 & -0.198598 \\
2013-01-02 00:00:00 & 0.673364 & -1.914110 & -0.126208 & -0.282033 \\
2013-01-03 00:00:00 & 1.312579 & 0.340656 & -0.300397 & -0.838614 \\
2013-01-04 00:00:00 & -0.732977 & -0.560867 & -0.515910 & -0.768784 \\
2013-01-05 00:00:00 & -2.045106 & -0.929131 & -0.029660 & 0.529883 \\
2013-01-06 00:00:00 & -1.343257 & -0.250821 & -0.046303 & 0.944569 \\
\end{tabular}