Water Harvesting Project¶

Data Analysis and Visualization using Python (Pandas, Matplotlib, and Seaborn Libraries)¶


This is a sample visualization for an imaginary water harvesting project. All the information provided here are imaginary, created solely for the purpose of simulating a case to develop innovative intervention that will strengthen resilience of concerned communities (example country taken here is Somalia). The case scenario is based on plausible assumptions and projections but does not reflect the actual situation or needs of any specific location or group. The aim of this exercise is to develop a sample portfolio showcasing the skills and knowledge related to data analysis and visualization.


1) Importing required libraries¶

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.io as pio
pio.renderers.default ='notebook'

2) Importing relevant excel file¶

In [2]:
harvest=pd.read_excel("harvest.xlsx", sheet_name=2)
In [3]:
harvest.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 17 columns):
 #   Column                           Non-Null Count  Dtype         
---  ------                           --------------  -----         
 0   SN                               50 non-null     int64         
 1   Name                             50 non-null     object        
 2   Gender                           50 non-null     object        
 3   Age                              50 non-null     int64         
 4   District                         50 non-null     object        
 5   Latitude                         50 non-null     float64       
 6   Longitude                        50 non-null     float64       
 7   Date                             50 non-null     datetime64[ns]
 8   Land Size                        50 non-null     int64         
 9   Land ID                          50 non-null     object        
 10  Crops                            50 non-null     object        
 11  Updated Yield                    50 non-null     int64         
 12  Updated Income from Agriculture  50 non-null     int64         
 13  Updated Total income             50 non-null     float64       
 14  # Trained                        50 non-null     int64         
 15  Satisfaction                     50 non-null     int64         
 16  Perception                       50 non-null     int64         
dtypes: datetime64[ns](1), float64(3), int64(8), object(5)
memory usage: 6.8+ KB

3) To show gender distribution of respondents¶

In [4]:
gender_count_respondents=harvest.Gender.value_counts()
plt.style.use("ggplot")
ax=harvest.Gender.value_counts().plot(kind="bar",
                                      figsize = [4,3],
                                      width = .4,
                                      color = ["cornflowerblue", "hotpink"],
                                      ylim = [0,40])
ax.set_xticklabels(["Male","Female"], rotation="horizontal")
ax.bar_label(ax.containers[0], padding=5)
plt.title("# Respondents' Gender")
plt.xlabel("Gender")
plt.show()

4) To show age distribution of respondents¶

In [5]:
def age_group(Age):
    if Age > 50:
        return "Over 50"
    elif Age >= 41:
        return "41 to 50"
    elif Age >=31:
        return "31 to 40"
    elif Age >=21:
        return "21 to 30"
    else:
        return "20 or Under"
respondent_ageGroup=harvest.Age.apply(age_group).value_counts(normalize=True).mul(100).round(1).sort_index()
In [6]:
respondent_ageGroup
Out[6]:
20 or Under     6.0
21 to 30       24.0
31 to 40       32.0
41 to 50       24.0
Over 50        14.0
Name: Age, dtype: float64
In [7]:
plt.pie(respondent_ageGroup, 
        labels=respondent_ageGroup.index, 
        autopct = "%1.0f%%", 
        wedgeprops={'linewidth':1, 'edgecolor': 'dimgrey'})
plt.title("Age Group of Respondents")
plt.show()
In [8]:
ax=sns.boxplot(data=harvest, x="Gender", y="Age")
ax.set_xticklabels(["Male", "Female"])
plt.title("Age Distribution by Gender")
plt.show()

5) Yield comparision (Baseline vs Current)¶

Importing baseline data

In [9]:
baseline=pd.read_excel("harvest.xlsx", sheet_name=0)
baseline.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 14 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   SN                                50 non-null     int64         
 1   Name                              50 non-null     object        
 2   Gender                            50 non-null     object        
 3   Age                               50 non-null     int64         
 4   District                          50 non-null     object        
 5   Latitude                          50 non-null     float64       
 6   Longitude                         50 non-null     float64       
 7   Date                              50 non-null     datetime64[ns]
 8   Land Size                         50 non-null     int64         
 9   Land ID                           50 non-null     object        
 10  Crops                             50 non-null     object        
 11  Baseline Yield                    50 non-null     int64         
 12  Baseline Income from Agriculture  50 non-null     int64         
 13  Baseline Total income             50 non-null     float64       
dtypes: datetime64[ns](1), float64(3), int64(5), object(5)
memory usage: 5.6+ KB

Joining two database tables

In [10]:
combined_baseline=harvest.merge(baseline, on="Land ID", suffixes=("_harvest", "_baseline"))
In [11]:
combined_baseline.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 0 to 49
Data columns (total 30 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   SN_harvest                        50 non-null     int64         
 1   Name_harvest                      50 non-null     object        
 2   Gender_harvest                    50 non-null     object        
 3   Age_harvest                       50 non-null     int64         
 4   District_harvest                  50 non-null     object        
 5   Latitude_harvest                  50 non-null     float64       
 6   Longitude_harvest                 50 non-null     float64       
 7   Date_harvest                      50 non-null     datetime64[ns]
 8   Land Size_harvest                 50 non-null     int64         
 9   Land ID                           50 non-null     object        
 10  Crops_harvest                     50 non-null     object        
 11  Updated Yield                     50 non-null     int64         
 12  Updated Income from Agriculture   50 non-null     int64         
 13  Updated Total income              50 non-null     float64       
 14  # Trained                         50 non-null     int64         
 15  Satisfaction                      50 non-null     int64         
 16  Perception                        50 non-null     int64         
 17  SN_baseline                       50 non-null     int64         
 18  Name_baseline                     50 non-null     object        
 19  Gender_baseline                   50 non-null     object        
 20  Age_baseline                      50 non-null     int64         
 21  District_baseline                 50 non-null     object        
 22  Latitude_baseline                 50 non-null     float64       
 23  Longitude_baseline                50 non-null     float64       
 24  Date_baseline                     50 non-null     datetime64[ns]
 25  Land Size_baseline                50 non-null     int64         
 26  Crops_baseline                    50 non-null     object        
 27  Baseline Yield                    50 non-null     int64         
 28  Baseline Income from Agriculture  50 non-null     int64         
 29  Baseline Total income             50 non-null     float64       
dtypes: datetime64[ns](2), float64(6), int64(13), object(9)
memory usage: 12.1+ KB
In [12]:
combined_baseline[["District_harvest", "Updated Yield", "Baseline Yield"]].head()
Out[12]:
District_harvest Updated Yield Baseline Yield
0 Mogadishu 1600 800
1 Mogadishu 875 500
2 Mogadishu 840 800
3 Mogadishu 1500 1000
4 Mogadishu 630 600

Grouping datafraame based on district

In [13]:
df=combined_baseline.groupby("District_harvest").agg({"Updated Yield": "sum", "Baseline Yield": "sum"})
In [14]:
df
Out[14]:
Updated Yield Baseline Yield
District_harvest
Baidowa 11745 7700
Burco 11725 7400
Garowe 12385 8700
Hargeisa 13595 8600
Mogadishu 10680 7300

Creating varialbes for plot

In [15]:
district=df.index
updated=df["Updated Yield"]
baseline=df["Baseline Yield"]

Barplot using matplotlib.pyplot method

In [16]:
plt.figure(figsize=[8,6])
plt.bar(district, updated, 
        width = 0.4, 
        align="edge", 
        color = "cornflowerblue", 
        label = "Current Yield")
plt.bar(district, baseline, 
        align="edge", 
        width = -0.4, 
        color = "orange", 
        label = "Baseline Yield")
plt.legend(shadow=True)
plt.title("Baseline vs Current Yield")
plt.xlabel("District")
plt.ylabel("Total yield in KG")
plt.tight_layout()
plt.show()
In [17]:
df4=combined_baseline[["District_harvest", "Baseline Yield", "Updated Yield"]]

Baseline and Current yield comparision based on type of crops¶

In [18]:
df5=combined_baseline.rename(columns=
                             {"District_harvest": "District", "Crops_harvest": "Crops", "Updated Yield": "Current Yield"}
                            )
In [19]:
plt.figure(figsize=[10,8])
ax1=sns.catplot(data=df5, 
                x = "District", 
                y="Baseline Yield", 
                kind="bar", 
                col="Crops", 
                color="orange", 
                errorbar=("ci",0), 
                estimator=sum)

plt.ylim(0,11000)
ax1.set_axis_labels("", "Baseline Yield")

plt.subplots_adjust(top=0.85)
plt.suptitle("Baseline and Current Yield per crops", va="bottom", size=20)

ax2=sns.catplot(data=df5, 
                x = "District", 
                y="Current Yield", 
                kind="bar", 
                col="Crops", 
                color="cornflowerblue", 
                errorbar=("ci",0), 
                estimator=sum)

plt.tight_layout()
plt.show()
<Figure size 1000x800 with 0 Axes>

6) Agricultural income comparision (Baseline vs Current)¶

Grouping dataframe

In [20]:
df1=combined_baseline.groupby("District_harvest").agg(
    {"Updated Income from Agriculture": "sum",
    "Baseline Income from Agriculture": "sum"}
)
df1
Out[20]:
Updated Income from Agriculture Baseline Income from Agriculture
District_harvest
Baidowa 35713 22970
Burco 34790 21790
Garowe 52126 34825
Hargeisa 43036 26035
Mogadishu 42978 29525

Creating variables

In [21]:
district1=df1.index
updatedIncome=df1["Updated Income from Agriculture"]
baselineIncome=df1["Baseline Income from Agriculture"]

Barplot using matplotlib.pyplot method

In [22]:
plt.figure(figsize=[8,6])
plt.bar(district1, updatedIncome, 
        width = 0.4, 
        align="edge", 
        color = "cornflowerblue", 
        label = "Current Agricultural Income")

plt.bar(district1, baselineIncome, 
        align="edge", 
        width = -0.4, 
        color = "orange", 
        label = "Baseline Agricultural Income")

plt.legend(shadow=True)
plt.title("Baseline vs Current Agricultural Income")
plt.xlabel("District")
plt.ylabel("Total income in USD")
plt.tight_layout()
plt.show()

7) Total income comparision (Baseline vs Current)¶

Grouping dataframe

In [23]:
df2=combined_baseline.groupby("District_harvest").agg(
    {"Updated Total income": "sum", 
     "Baseline Total income": "sum"}
)
df2
Out[23]:
Updated Total income Baseline Total income
District_harvest
Baidowa 55645.6350 35152.70
Burco 55131.9000 34786.50
Garowe 74401.8375 51706.65
Hargeisa 64396.0250 38878.60
Mogadishu 68049.2200 46688.40

Creating variables

In [24]:
district2=df2.index
upTotalIncome=df2["Updated Total income"]
baseTotalIncome=df2["Baseline Total income"]

Barplot using matplotlib.pyplot method

In [25]:
plt.figure(figsize=[8,6])
plt.bar(district2, upTotalIncome, 
        width = 0.4, 
        align="edge", 
        color = "cornflowerblue", 
        label = "Current Total")

plt.bar(district1, baseTotalIncome, 
        align="edge", 
        width = -0.4, 
        color = "orange", 
        label = "Baseline Total")

plt.legend(shadow=True)
plt.title("Baseline vs Current Total Income")
plt.xlabel("District")
plt.ylabel("Total income in USD")
plt.tight_layout()
plt.show()

8) Income relation (Agricultural and Total)¶

Scatterplot using Seaborn Libraries¶

In [26]:
sns.scatterplot(data=harvest, 
                x="Updated Income from Agriculture", 
                y = "Updated Total income")

plt.ylabel("Total Income")
plt.xlabel("Agricultural Income")
plt.title("Relation Between Current Agricultural and Total Income")
plt.show()
In [27]:
ax=sns.relplot(data=harvest, 
               x="Updated Income from Agriculture", 
               y = "Updated Total income", 
               kind="scatter", 
               col = "Crops")

ax.set_axis_labels("Agricultural Income", "Total Income", size = 15)

plt.subplots_adjust(top=0.85)
plt.suptitle("Current Total and Agricultural Income by Crops Type", size = 20)
plt.show()

9) Gender distribution of training participants¶

Computing series

In [28]:
gender_count_trainees=harvest.groupby("Gender")["# Trained"].sum().sort_values(ascending=False)
gender_count_trainees
Out[28]:
Gender
M    54
F    22
Name: # Trained, dtype: int64

Barplot using pandas .plot method

In [29]:
ax=gender_count_trainees.plot(kind="bar",
                            figsize = [4,3],
                            width = .4,
                            color = ["cornflowerblue", "hotpink"],
                            ylim = [0,60])

ax.set_xticklabels(["Male","Female"], rotation="horizontal")
ax.bar_label(ax.containers[0], padding=3)
plt.title("# Trainees' Gender")
plt.xlabel("Gender")
plt.show()

10) Community Satisfaction¶

Overall average

In [30]:
harvest.Satisfaction.mean()
Out[30]:
3.44
In [31]:
df3=harvest[["District", "Satisfaction", "Gender", "Perception"]]

Barplot using seaborn libraries¶

Average satisfaction by gender

In [32]:
plt.figure(figsize=[4,3])
ax=sns.barplot(df3, 
               x="Gender", 
               y="Satisfaction", 
               errorbar=("ci",0), 
               width = 0.4, 
               palette = ["cornflowerblue", "hotpink"])

ax.set_xticklabels(["Male","Female"], rotation="horizontal")
ax.bar_label(ax.containers[0], fmt="%.2f", padding=5)
plt.title("Community Satisfaction by Gender")
plt.ylim(0,5)
plt.show()

Average satisfaction by location and gender

In [33]:
ax=sns.barplot(df3, 
               x="District", 
               y="Satisfaction", 
               errorbar=("ci",0), 
               hue="Gender", 
               palette = ["cornflowerblue", "hotpink"])

new_labels = ["Male", "Female"]
handles, _ = ax.get_legend_handles_labels()
ax.legend(handles, new_labels, facecolor="white")
plt.ylim(0,5)
for container in ax.containers:
    ax.bar_label(container, fmt="%.2f")
plt.title("Community Satisfaction by Location and Gender")
plt.tight_layout()
plt.show()

Community Satisfaction Distribution based on gender by histogram

In [34]:
ax=sns.displot(data=harvest, 
               kind="hist", 
               x="Satisfaction", 
               bins=5, 
               col="Gender", 
               hue="Gender", 
               palette = ["cornflowerblue", "hotpink"], 
               alpha=1, 
               facet_kws=dict(margin_titles=True), 
               legend=False)

new_titles = ["Male", "Female"]
for i, ax in enumerate(ax.axes.flat):
    ax.set_title(new_titles[i], fontsize=12, weight="light")

plt.subplots_adjust(top=0.85)
plt.suptitle("Community Satisfaction by Gender", size=15)
    
plt.show()

11) Impact perception - average total, as per location, as per gender¶

Overall average impact perception

In [35]:
df3.Perception.mean()
Out[35]:
3.46

Average impact perception by gender

In [36]:
plt.figure(figsize=[4,3])
ax=sns.barplot(df3, 
               x = "Gender", 
               y = "Perception", 
               width=0.4, 
               errorbar=("ci",0), 
               palette=["cornflowerblue", "hotpink"])

plt.ylim(0,5)
ax.bar_label(ax.containers[0], fmt="%.2f", padding=5)
ax.set_xticklabels(["Male", "Female"], rotation="horizontal")
plt.title("Impact Perception by Gender")
plt.show()

Average impact perception by gender and location

In [37]:
ax=sns.barplot(df3, 
               x="District", 
               y="Perception", 
               hue="Gender", 
               errorbar=("ci",0), 
               palette=["cornflowerblue", "hotpink"])

new_labels = ["Male", "Female"]
handles, _ = ax.get_legend_handles_labels()
ax.legend(handles, new_labels, facecolor="white")
for container in ax.containers:
    ax.bar_label(container, fmt="%.2f")
plt.ylim(0,5)
plt.title("Impact Perception by Location")
plt.tight_layout()
plt.show()

Impact Perception Distribution based on gender by histogram

In [38]:
ax=sns.displot(data=harvest, 
               kind="hist", 
               x="Perception", 
               bins=5, col="Gender", 
               hue="Gender", 
               palette = ["cornflowerblue", "hotpink"], 
               alpha=1, facet_kws=dict(margin_titles=True), 
               legend=False)

new_titles = ["Male", "Female"]
for i, ax in enumerate(ax.axes.flat):
    ax.set_title(new_titles[i], fontsize=12, weight="light")

plt.subplots_adjust(top=0.85)
plt.suptitle("Impact Perception by Gender", size=15)
    
plt.show()

12) Reservoir capacity and harvest per week¶

Importing relevant excel file and creating a dataframe

In [39]:
infrastructure=pd.read_excel("harvest.xlsx", sheet_name=4, index_col="District")
In [40]:
infrastructure.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, Mogadishu to Burco
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Location           5 non-null      object        
 1   Structure ID       5 non-null      int64         
 2   Type of Structure  5 non-null      object        
 3   Size (cubic m)     5 non-null      int64         
 4   Cost (USD)         5 non-null      int64         
 5   Capacity (litres)  5 non-null      int64         
 6   Volume Harvested   5 non-null      int64         
 7   Harvested Date     5 non-null      datetime64[ns]
 8   Frequency          5 non-null      object        
dtypes: datetime64[ns](1), int64(5), object(3)
memory usage: 400.0+ bytes

Table showing reservoir information

In [41]:
infrastructure[["Type of Structure", "Capacity (litres)", "Volume Harvested", "Frequency"]]
Out[41]:
Type of Structure Capacity (litres) Volume Harvested Frequency
District
Mogadishu Reservoir 150000 90000 weekly
Baidowa Reservoir 120000 72000 weekly
Garowe Reservoir 90000 54000 weekly
Hargeisa Reservoir 90000 54000 weekly
Burco Reservoir 75000 45000 weekly