Understanding people’s sentiments from data published on the web is a significant research problem with a variety of applications, such as understanding context, predicting election results, and gauging opinion about an incident. So far, sentiment analysis of web data has focused primarily on a single modality, such as text or image. However, combining the readily available multimodal information, such as images and different forms of text, can help estimate sentiments more accurately. At the same time, blindly combining visual and textual features increases the complexity of the model and ultimately degrades sentiment analysis performance, because such fusion often fails to capture the correct interrelationships between the modalities. Hence, in this study, a sentiment analysis framework is proposed that carefully fuses salient visual cues and high-attention textual cues by exploiting the interrelationships within multimodal web data. To automatically learn discriminative features from the image and text, two streams of unimodal deep feature extractors are proposed to extract the visual and textual features most relevant to sentiment. A multimodal deep association learner is then stacked on top to learn the relationships between the learned salient visual features and textual features. Finally, the sentiment is estimated from the features combined through a late fusion mechanism. Extensive evaluations show that our proposed framework achieves promising results for sentiment analysis of web data compared with existing unimodal approaches and with multimodal approaches that blindly combine visual and textual features.
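To make the described pipeline concrete, the sketch below is one plausible realization in PyTorch, not the authors' implementation: two unimodal projection streams standing in for the deep feature extractors, a cross-attention module playing the role of the association learner, and late fusion of per-stream sentiment predictions. All layer choices, dimensions, and the use of multi-head attention are illustrative assumptions.

```python
# Minimal sketch of a two-stream multimodal sentiment model (assumed design):
# unimodal visual/textual encoders, a cross-attention "association" module,
# and late fusion of the two streams for the final sentiment prediction.
import torch
import torch.nn as nn


class MultimodalSentimentNet(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hid_dim=256, num_classes=3):
        super().__init__()
        # Unimodal projections standing in for the deep feature extractors
        # (e.g. CNN region features for images, token embeddings for text).
        self.visual_proj = nn.Sequential(nn.Linear(img_dim, hid_dim), nn.ReLU())
        self.textual_proj = nn.Sequential(nn.Linear(txt_dim, hid_dim), nn.ReLU())
        # Cross-modal attention models the association between modalities:
        # each stream attends over the other to capture interrelationships.
        self.txt2img_attn = nn.MultiheadAttention(hid_dim, num_heads=4, batch_first=True)
        self.img2txt_attn = nn.MultiheadAttention(hid_dim, num_heads=4, batch_first=True)
        # Late fusion: each stream is classified separately and the
        # class scores are averaged.
        self.visual_head = nn.Linear(hid_dim, num_classes)
        self.textual_head = nn.Linear(hid_dim, num_classes)

    def forward(self, img_feats, txt_feats):
        # img_feats: (batch, num_regions, img_dim), e.g. CNN region features
        # txt_feats: (batch, num_tokens, txt_dim), e.g. word embeddings
        v = self.visual_proj(img_feats)
        t = self.textual_proj(txt_feats)
        # Association learning via cross-attention in both directions.
        t_aware_v, _ = self.img2txt_attn(query=v, key=t, value=t)
        v_aware_t, _ = self.txt2img_attn(query=t, key=v, value=v)
        # Pool each stream to a single vector per sample.
        v_vec = t_aware_v.mean(dim=1)
        t_vec = v_aware_t.mean(dim=1)
        # Late fusion of the per-stream predictions.
        logits = 0.5 * (self.visual_head(v_vec) + self.textual_head(t_vec))
        return logits


if __name__ == "__main__":
    model = MultimodalSentimentNet()
    img = torch.randn(4, 36, 2048)   # e.g. 36 region features per image
    txt = torch.randn(4, 20, 768)    # e.g. 20 token embeddings per caption
    print(model(img, txt).shape)     # torch.Size([4, 3])
```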