Much of human social interaction is mediated through conversations. These are speech exchanges between two individuals in which smooth turn-taking occurs without formal or explicit rules. Given the central importance of conversation in social interactions, it is natural to ask how turn-taking evolved and what its neural basis might be. To investigate these questions, we use marmoset monkeys as a model system. Marmosets are a highly vocal primate species that often exchange vocalizations with conspecifics to maintain social contact. We show that marmosets, like humans, take turns during natural dyadic vocal exchanges and that the timing of these exchanges is periodically coupled. This suggests that an oscillatory mechanism is responsible for the dynamics of turn-taking. Consistent with this idea, we show that marmosets entrain the timing of their vocal output during vocal exchanges: faster (or slower) response intervals from one marmoset lead to faster (or slower) response intervals from the other. To explain these results, we built a stochastic dynamical systems model of two interacting oscillators. The model is based on the interactions among four neural structures (‘drive’, ‘motor’ and two ‘auditory’ nodes) with connectivity inspired by published physiological and anatomical data. We validate the model by showing that it generates turn-taking dynamics nearly identical to those seen in natural marmoset vocal exchanges. We then use the model to predict that a self-monitoring mechanism is crucial for the correct timing of vocal turn-taking.
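
The general idea that two interacting oscillators can settle into alternation may be illustrated with a deliberately simplified sketch. It is not the four-node model described above: it assumes two generic noisy phase oscillators with mutually inhibitory (negative) sinusoidal coupling, and the function name and all parameter values are hypothetical choices made for illustration only.

```python
import math
import random


def simulate_antiphase(steps=20000, dt=0.001, omega=2 * math.pi * 0.2,
                       coupling=-1.0, noise=0.05, seed=0):
    """Euler-Maruyama integration of two noisy, mutually inhibiting phase oscillators.

    Illustrative assumption only: generic phase oscillators, not the
    drive/motor/auditory node model. Returns the final phase difference
    (oscillator B minus oscillator A), wrapped to [0, 2*pi).
    """
    rng = random.Random(seed)
    theta_a, theta_b = 0.0, 0.3  # start nearly in phase
    for _ in range(steps):
        # Negative coupling pushes the two phases apart.
        drift_a = omega + coupling * math.sin(theta_b - theta_a)
        drift_b = omega + coupling * math.sin(theta_a - theta_b)
        # Independent Gaussian noise on each oscillator (stochastic term).
        theta_a += drift_a * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        theta_b += drift_b * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return (theta_b - theta_a) % (2 * math.pi)


if __name__ == "__main__":
    phase_difference = simulate_antiphase()
    print(f"final phase difference: {phase_difference:.3f} rad")
```

Under these assumptions the phase difference drifts away from zero and fluctuates around half a cycle, so the two oscillators peak alternately rather than together, which is a toy analogue of turn-taking.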