Multi-armed bandits for targeted advertising in social networks
Downloads: paper (draft)
This was a group project completed for the class Applied Mechanism Design for Social Good at UMD; it also served as my Master’s scholarly paper. We posed the problem of maximizing user engagement with advertisements in a social network in which highly influential users are prioritized. The framework was inspired by social networks where a few users, often called influencers, are followed by a large number of other users (for instance, on Twitter and Instagram). In our proposed setting, we use observations from an influencer’s followers to better select ads for the influencer.
The motivation behind this problem setting is that choosing an advertisement to show to an influencer is a high-risk, high-reward decision. If the influencer engages with an ad and purchases the product, they may recommend it to their followers, resulting in many purchases. However, a bad ad choice risks losing the engagement of the influencer, which is more costly than losing the engagement of a follower, since influencers are much rarer. On the other hand, the collective interests of an influencer’s followers may reflect the interests of the influencer, so we can learn about an influencer’s interests by exploring those of their followers.
We modelled this problem as a multi-armed bandit (MAB) problem with a graph structure. At each time step, we chose an advertisement to show to each user in the social network; these choices correspond to the actions/arms of the MAB. We focused on the setting of a single influencer with several followers and imposed a reward structure on the ads so that the click-through rates (CTRs) of the followers reflect those of the influencer on average.
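As a rough illustration of this setting, the sketch below simulates one plausible environment: each follower’s per-ad CTR is a noisy perturbation of the influencer’s, so the follower population approximately matches the influencer on average. The sizes, noise model, and function names here are hypothetical, chosen for illustration rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only (not taken from the paper).
N_ADS = 5          # number of advertisements (arms)
N_FOLLOWERS = 20   # followers of the single influencer

# The influencer has an unknown CTR for each ad; each follower's CTR vector
# is a noisy, clipped perturbation of the influencer's, so follower CTRs
# approximately agree with the influencer's on average.
influencer_ctr = rng.uniform(0.02, 0.15, size=N_ADS)
follower_ctr = np.clip(
    influencer_ctr + rng.normal(0.0, 0.02, size=(N_FOLLOWERS, N_ADS)),
    0.0, 1.0,
)

def show_ad(user_ctr, ad):
    """Bernoulli click feedback for showing ad `ad` to a user with CTR vector `user_ctr`."""
    return rng.random() < user_ctr[ad]
```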
We designed an algorithm for this setting that reduces the need for exploration at the cost of using a biased estimator of the influencer’s rewards. We compared its regret against that of two unbiased algorithms from the literature and found that our algorithm achieved much lower regret, even after a large number of time steps. This might be because the unbiased algorithms do not make efficient use of the problem’s reward structure; on the other hand, over an infinite horizon the biased algorithm might not converge to the optimal arm. Future work could combine the best of both approaches by using the biased estimator as a warm start and subsequently relying on unbiased estimators.
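To make the bias/exploration trade-off concrete, here is a minimal sketch of three selection rules, assuming UCB-style indices and the toy environment above. This is not the algorithm from the paper, only one way to read the idea: pooling follower feedback into the influencer’s estimates shrinks the exploration bonus early but introduces bias, while the warm-start variant corresponds to the future-work direction of switching to an unbiased estimator later. All function names and the switch point are hypothetical.

```python
import numpy as np

def pooled_ucb(inf_clicks, inf_pulls, fol_clicks, fol_pulls, t):
    """Biased rule: pool follower feedback into the influencer's estimates.

    The pooled counts shrink the exploration bonus early on, so less
    exploration is spent on the influencer, but the estimator is biased
    because follower CTRs only match the influencer's on average.
    """
    clicks = inf_clicks + fol_clicks
    pulls = np.maximum(inf_pulls + fol_pulls, 1)
    return int(np.argmax(clicks / pulls + np.sqrt(2 * np.log(t + 1) / pulls)))

def ucb1(inf_clicks, inf_pulls, t):
    """Unbiased baseline: standard UCB1 using only the influencer's own feedback."""
    if np.any(inf_pulls == 0):
        return int(np.argmin(inf_pulls))  # play each arm once first
    return int(np.argmax(inf_clicks / inf_pulls
                         + np.sqrt(2 * np.log(t + 1) / inf_pulls)))

def warm_start(inf_clicks, inf_pulls, fol_clicks, fol_pulls, t, switch=500):
    """One way to combine them: pooled (biased) estimates early, then plain UCB1."""
    if t < switch:
        return pooled_ucb(inf_clicks, inf_pulls, fol_clicks, fol_pulls, t)
    return ucb1(inf_clicks, inf_pulls, t)
```

The switch point (500 steps here) is arbitrary; the intent is that follower data accelerates early learning for the influencer, while the unbiased estimator eventually takes over so the choice can still converge to the optimal arm.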