We consider multipath TCP (MPTCP) flows subject to the data networking dynamics of IEEE 802.11ay, for drone surveillance of areas using high-definition video streaming. Mobility-induced handoffs are critical in IEEE 802.11ay (because of the smaller coverage of mmWaves) and adversely affect the performance of such data streaming flows. As a result of the enhanced 802.11ay network events and features (triggered by beamforming, channel bonding, MIMO, mobility-induced handoffs, channel sharing, retransmissions, etc.), the end-to-end packet delay in 802.11ay is inherently time-varying. Several fundamental assumptions inherent in stochastic TCP models, including Poisson packet arrivals, Gaussian processes, and parameter certainty, are challenged by the enhanced data traffic dynamics of IEEE 802.11ay networks. Consequently, the state estimates of existing MPTCP models deviate substantially from the actual network values. We develop a new data-driven stochastic framework to address these deficiencies of current MPTCP models and design a foundational architecture for intelligent multipath scheduling (at the transport layer) that takes lower-layer (hybrid) beamforming into account. At the heart of our cross-layer architecture is an intelligent learning agent for actuating and interfacing, which learns optimal packet cloning, scheduling, aggregation, and beamforming from experience, using successful features of multi-armed bandits and federated learning. We demonstrate that the proposed framework can estimate and optimize jointly (explore–exploit) and is more practicable for designing the next generation of low-delay and robust MPTCP models.