Adjoint tomography is a state-of-the-art, high-resolution imaging method. It yields better-resolved models by solving the full wave equation to accurately simulate seismic wave propagation and by using full waveform information in the inversion. However, the computational cost and storage requirements of 3-D adjoint tomography are high; 2-D adjoint tomography is, by comparison, much more computationally efficient. Surface waves and teleseismic body waves provide essential data for studying crustal and uppermost mantle structure. Because the two data types differ in their sensitivities to shear wave velocity (Vs) at depth and to the Moho discontinuity, joint inversion of both datasets can better resolve the Vs model and the Moho interface. To take advantage of both data types, we propose a joint inversion strategy, based on the adjoint-state method, for ambient noise surface waves and teleseismic body waves recorded by linear arrays; it can yield a fine-scale Vs model and Moho topography. We perform various synthetic imaging experiments in which the model has typical features of the crustal structure of the North China Craton (NCC). Compared with surface wave inversion alone, joint inversion improves image resolution and better constrains the undulations of discontinuities. Compared with body wave inversion alone, joint inversion suppresses high-frequency artifacts and reduces the nonlinearity of the inversion. This study could provide an efficient alternative for imaging fine-scale velocity structure beneath linear arrays and also establishes a framework for joint inversion. It could improve the resolution of lithospheric imaging and offers strategies for incorporating other waveforms in the future.