Clustering the Familiar by Charles Thickstun (MSc Cand)

July 16, 2018

It’s striking how quickly the different becomes familiar. Only four months ago the narrow strip of lights flanking the road from Kilimanjaro International Airport seemed an impenetrable boundary; the dark shapes of hills and trees looming beyond, an inextricable mass cast in shadow. Now the faint crackling of Swahili from the radio straining to be heard over the rushing wind and the roar of the engine felt a comforting presence as I tracked the turns to the project house in Moshi.

I’ve returned to Tanzania to sit with project leaders from the Pan African Malaria Vector Research Consortium (PAMVERC) and discuss the optimal means of aggregating more than 40,000 households in the Misungwi region into clusters for analysis in a large-scale malaria bednet trial. Investigators here have been supervising an initial mapping census of the region over the last two months. This massive undertaking involves geolocating every household in the region and conducting an interview with the head of each to gather information on the number of people living there.

After an initial meeting on the cluster mapping process, Dr. Protopopoff, the Principal Investigator on the PAMVERC Misungwi trial, encouraged me to spend my first day in the field accompanying a woman named Tatu as she surveyed a local hamlet.

I am quickly thrust back into the unfamiliar as our van full of field workers makes its way down a dirt road outside Misungwi town, my lack of Swahili becoming readily apparent as Tatu begins giving her team orders for the day before each is dropped off at a hamlet. My only solace is that the team seems to be similarly uncertain of the terrain as we backtrack along the road and through a set of fields before coming across the hamlet leader walking down a path.

The hamlet leader accompanies us, as Tatu later explains to me, because he knows each household in the hamlet and where the boundaries of nearby hamlets are, but also because he speaks the local vernacular, a variant of Swahili that she has been attempting to learn so that she will not be required to take all her answers through his interpretation.

As soon as this census is finalized the trial will enter an operational limbo; distribution of bednets cannot commence until a decision has been made on which intervention group each household belongs to. While theoretically the virtue of a Randomized-Controlled Trial (RCT) is that those decisions are left to a computer, the reality is much more complicated.

One of the first assumptions we must contend with in statistics is that all variables gathered in a sample are independent: that is to say that no sampled variable is related to another. This of course is impossible. Variables are highly interlinked: malaria is closely tied to a person’s socioeconomic status, which is closely tied to the type of house they live in, to what work they do, to where they live, which is again tied to malaria, and the people in one area are more likely to have similar variables than those in another. One thing we can attempt to do in a trial of this scope is to control for these dependencies.

The first step we take in this endeavor is to aggregate households into groups or “clusters.” In an ideal scenario we would hope that households within these clusters were different in exactly the same way from others in the study, but in practice we are aiming for them to be more similar to the others in their cluster than to those outside it. By doing so we may weight these households by an intracluster correlation coefficient (ICC), which accounts for some of the disparate effects that an intervention may have on a community due to the similarity of households within the cluster, and their difference from those in another. This process makes a Cluster-Randomized Controlled Trial (CRCT) in which we can allocate interventions based on the cluster, rather than the individual.

CRCTs are the standard in malaria trials for a number of reasons. The first is the difficulty of randomizing at the household level: the logistics of delivering a blinded intervention to each of 40,000 locations, ensuring that households do not swap nets with their neighbors, and linking effects back from individual households to the intervention they have been given. The second is that there is a community effect to bednet trials: a grouping of households sharing the same intervention produces a greater antimalarial effect than a single household with a bednet. In order to accurately capture this effect, as would be achieved in a large-scale bednet rollout, the same intervention must be given to the entire community.
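To make the role of the ICC a little more concrete, here is a minimal sketch of the standard design effect calculation, 1 + (m - 1) × ICC, where m is the number of units sampled per cluster. It shows how similarity within clusters shrinks the effective sample size. The function names and the ICC value of 0.05 are illustrative assumptions of mine, not figures from the trial.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Variance inflation caused by sampling in clusters: 1 + (m - 1) * ICC."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n_clusters: int, cluster_size: float, icc: float) -> float:
    """Number of independent observations the clustered sample is 'worth'."""
    total = n_clusters * cluster_size
    return total / design_effect(cluster_size, icc)


if __name__ == "__main__":
    # Hypothetical ICC; the cluster count and size echo the trial's minimums.
    deff = design_effect(cluster_size=150, icc=0.05)
    ess = effective_sample_size(n_clusters=84, cluster_size=150, icc=0.05)
    print(f"design effect: {deff:.2f}, effective sample size: {ess:.0f}")
```

With an ICC of zero the design effect is 1 and nothing is lost; as the ICC grows, adding more households to a cluster helps less than adding more clusters.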

In creating a cluster map of the Misungwi region we first begin with the minimum unit of aggregation. This study considers people (particularly children), who live in households, which are organized into hamlets, which are part of villages, which are placed in wards. Each of these constitutes a unit of aggregation. In some cases we can dismiss levels without much thought: all the people within a household often sleep in the same place, so having two or more bednets, and ensuring the right people slept under the right net, would be unmanageable. Similarly, there are only a handful of wards in Misungwi, so dividing based on wards would make it impossible to have balanced intervention groups. Ideally, we use the smallest unit of aggregation possible, as this allows for the most nuance in the creation of clusters. In this case the decision was made for us, as the trial requires the assistance of hamlet leaders to distribute bednets, so no cluster may cross a hamlet boundary.

Once this has been determined we are given a number of requirements to take into account: 1) each cluster must have a minimum of 150 households in the “core” sampling area; 2) there must be a minimum of 84 clusters in the region; and 3) each household in a “core” sampling area must be no closer than 300 meters to a household in a different intervention group. The first two of these considerations are for sample size reasons: in order to detect a significant difference between intervention groups there must be a certain number of clusters of a certain size. The last is due to the community effect I spoke of earlier. If sampled households are too close to households in a different intervention group they may be contaminated by the effects of the other intervention group. This would bias the results of both interventions towards one another, minimizing any difference in their effects.
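The three requirements above can be written down as a simple checker. What follows is only a sketch under my own assumptions: the `Household` record, its field names, and the use of great-circle (haversine) distance are all hypothetical, and the brute-force pairwise comparison would be far too slow for 40,000 households without a spatial index.

```python
import math
from collections import Counter
from dataclasses import dataclass

MIN_CORE_HOUSEHOLDS = 150   # requirement 1
MIN_CLUSTERS = 84           # requirement 2
MIN_BUFFER_M = 300.0        # requirement 3
EARTH_RADIUS_M = 6_371_000.0


@dataclass(frozen=True)
class Household:            # hypothetical record layout
    lat: float
    lon: float
    cluster: str
    arm: str                # intervention group of the household's cluster
    core: bool              # True if inside the "core" sampling area


def haversine_m(a: Household, b: Household) -> float:
    """Great-circle distance between two households, in meters."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def check_requirements(households, min_core=MIN_CORE_HOUSEHOLDS,
                       min_clusters=MIN_CLUSTERS, min_buffer_m=MIN_BUFFER_M):
    """Return a list of human-readable problems; an empty list means all pass."""
    problems = []
    clusters = {h.cluster for h in households}
    core_counts = Counter(h.cluster for h in households if h.core)

    for c in sorted(clusters):
        if core_counts[c] < min_core:
            problems.append(f"cluster {c}: only {core_counts[c]} core households")

    if len(clusters) < min_clusters:
        problems.append(f"only {len(clusters)} clusters")

    # Each core household must be at least min_buffer_m from any household
    # (core or not) that belongs to a different intervention group.
    for h in households:
        if not h.core:
            continue
        for other in households:
            if other.arm != h.arm and haversine_m(h, other) < min_buffer_m:
                problems.append(f"cluster {h.cluster}: core household too close to arm {other.arm}")
                break
    return problems
```

Returning a list of problems rather than a single pass or fail makes it easier to see which clusters need to be revisited.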

ArcGIS can provide a solid start in building groups of hamlets based on optimal spatial constraints; unfortunately this approach takes none of the above requirements into consideration. From here, the clustering process becomes a game of trial and error, as each hamlet is assessed individually and split or regrouped with its geographic neighbors to build clusters of adequate size and distance from others.
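To give a flavor of the regrouping half of that trial and error, here is a toy greedy pass that merges undersized hamlets with their smallest neighbor until every cluster reaches the minimum size. This is not the procedure used in the trial, nor anything ArcGIS does: the function name, the adjacency dictionary, and the merge rule are assumptions for illustration, and the sketch ignores both hamlet splitting and the 300 meter buffer.

```python
def merge_small_hamlets(sizes: dict[str, int],
                        neighbours: dict[str, set[str]],
                        min_size: int = 150) -> list[set[str]]:
    """Greedily merge hamlets into clusters of at least min_size households.

    sizes:      hamlet name -> number of core households
    neighbours: hamlet name -> names of geographically adjacent hamlets
    """
    cluster_of = {h: h for h in sizes}    # hamlet -> label of its cluster
    members = {h: {h} for h in sizes}     # cluster label -> hamlets in it
    total = dict(sizes)                   # cluster label -> household count
    stuck = set()                         # undersized clusters with no neighbor

    while True:
        small = [c for c in members if total[c] < min_size and c not in stuck]
        if not small:
            break
        # Deal with the smallest cluster first (name breaks ties).
        c = min(small, key=lambda k: (total[k], k))
        adjacent = {cluster_of[n] for h in members[c]
                    for n in neighbours.get(h, ())} - {c}
        if not adjacent:
            stuck.add(c)                  # nothing to merge with; leave for a human
            continue
        # Merge into the smallest adjacent cluster to keep sizes balanced.
        target = min(adjacent, key=lambda k: (total[k], k))
        for h in members[c]:
            cluster_of[h] = target
        members[target] |= members.pop(c)
        total[target] += total.pop(c)
        stuck.discard(target)

    return [members[c] for c in sorted(members)]


if __name__ == "__main__":
    sizes = {"A": 60, "B": 100, "C": 200, "D": 40}
    neighbours = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
    print(merge_small_hamlets(sizes, neighbours))
```

A pass like this only produces a starting point; clusters it leaves undersized, or that end up too close to a different intervention group, still have to be assessed by hand.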

As my time in Tanzania ends and I begin the long journey back to Ottawa I am left pondering my brief stay here: the many peoples and communities I have met and experienced, eroding my unfamiliar; the maps of clusters and buffers that I have built, and what that means for the people here; and the hope that when our work here is done each community in Misungwi will have one more variable in common: that fewer of their children will die of malaria.

© 2018 by Manisha Kulkarni.

INSIGHT Lab, University of Ottawa | Ottawa, Ontario, Canada