Model selection methods based on stochastic regularization, such as Dropout, have been widely used in deep learning due to their simplicity and effectiveness. The standard Dropout method treats all units, visible or hidden, in the same way, thus ignoring any a priori information related to grouping or structure. Such structure is present in multi-modal learning applications, where subsets of units may correspond to individual modalities. In this abstract we describe Modout, a model selection method based on stochastic regularization, which is particularly useful in the multi-modal setting. Unlike previous methods, it is capable of learning whether or when to fuse two modalities in a layer. Evaluated on the Montalbano gesture recognition dataset, Modout improves on other stochastic regularization methods and performs on par with a carefully designed state-of-the-art fusion architecture.