Nvidia's CUDA 4.0 toolkit is knocking around for Nvidia GPUs.
According to Nvidia, it has been designed with parallel programming in mind, so developers can get more of their stuff ported to GPUs. The new release includes GPUDirect 2.0, which supports peer-to-peer communication between GPUs in a single workstation and should give speed a kick up the jacksie.
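For the curious, here is a rough sketch of what peer-to-peer copying looks like from the programmer's side. It is an illustration rather than Nvidia's own sample: the helper name `copy_between_gpus` is made up for this example, and it assumes a machine with at least two GPUs that the CUDA 4.0 (or later) runtime can see.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): copies `bytes` bytes from a buffer
// on GPU `src_dev` to a buffer on GPU `dst_dev`.
// Side effect: may change the calling thread's current device to dst_dev.
bool copy_between_gpus(void* dst, int dst_dev,
                       const void* src, int src_dev, size_t bytes) {
    int can_access = 0;
    if (cudaDeviceCanAccessPeer(&can_access, dst_dev, src_dev) != cudaSuccess)
        return false;

    if (can_access) {
        // Let dst_dev address src_dev's memory directly.
        cudaSetDevice(dst_dev);
        cudaError_t err = cudaDeviceEnablePeerAccess(src_dev, 0);
        if (err != cudaSuccess && err != cudaErrorPeerAccessAlreadyEnabled)
            return false;
        cudaGetLastError();  // clear a harmless "already enabled" error
    }

    // cudaMemcpyPeer is valid whether or not peer access is enabled;
    // without it, the runtime is free to route the copy another way.
    return cudaMemcpyPeer(dst, dst_dev, src, src_dev, bytes) == cudaSuccess;
}
```

Calling `copy_between_gpus(b, 1, a, 0, n)` with `a` allocated on device 0 and `b` on device 1 would move `n` bytes from one card to the other. Whether that traffic actually stays off the host depends on the hardware and how the cards are plugged in.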
Then there's Unified Virtual Addressing (UVA), which Nvidia tells us provides a single merged address space covering main system memory and the GPUs' memories. UVA, Nvidia hopes, will make parallel programming easier for developers. CUDA 4.0 also sports the Thrust library of C++ template performance primitives. In non-Nvidia terms, this is a collection of open source C++ parallel algorithms, making life a tad easier for C++ developers. Thrust, says Nvidia, makes parallel sorting up to 100 times faster than with the Standard Template Library and Threading Building Blocks.
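To make that a bit more concrete, the sketch below puts the two features side by side. It assumes a 64-bit system and a GPU that supports unified addressing; the function name `sort_on_gpu` is invented for this example. The copies use `cudaMemcpyDefault`, which asks the runtime to work out the direction from the pointers themselves, and the sort is a plain `thrust::sort` over the device buffer.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// Hypothetical helper: sorts `values` in ascending order on the GPU.
// Returns false if a CUDA allocation or copy fails.
bool sort_on_gpu(std::vector<int>& values) {
    const size_t bytes = values.size() * sizeof(int);
    if (bytes == 0) return true;

    int* dev = NULL;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    // With UVA the copy direction is inferred from the pointer values,
    // so no HostToDevice / DeviceToHost flag is spelled out.
    bool ok = cudaMemcpy(dev, values.data(), bytes,
                         cudaMemcpyDefault) == cudaSuccess;
    if (ok) {
        // Wrap the raw pointer so Thrust knows it lives on the device.
        thrust::device_ptr<int> first(dev);
        thrust::sort(first, first + values.size());

        ok = cudaMemcpy(values.data(), dev, bytes,
                        cudaMemcpyDefault) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

Note that the buffer is still allocated with `cudaMalloc` and moved with `cudaMemcpy`. In this sketch, what unified addressing changes is that the caller no longer has to say which side each pointer lives on.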
C++ support will also include new/delete and virtual functions.
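In practice that means something like the following can run inside a kernel. This is a minimal sketch, assuming a GPU of compute capability 2.0 or higher; the `Shape` and `Square` types and the `square_area_on_gpu` wrapper are invented purely for illustration.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical class hierarchy used only for this example.
struct Shape {
    __device__ virtual float area() const = 0;
    __device__ virtual ~Shape() {}
};

struct Square : Shape {
    float side;
    __device__ Square(float s) : side(s) {}
    __device__ virtual float area() const { return side * side; }
};

__global__ void area_kernel(float side, float* out) {
    Shape* s = new Square(side);   // allocated from the device heap
    if (s == NULL) {               // device heap exhausted
        *out = -1.0f;
        return;
    }
    *out = s->area();              // virtual call resolved on the GPU
    delete s;
}

// Launches a single thread and returns the computed area,
// or -1.0f if an allocation or copy failed.
float square_area_on_gpu(float side) {
    float* d_out = NULL;
    float h_out = -1.0f;
    if (cudaMalloc(&d_out, sizeof(float)) != cudaSuccess) return -1.0f;

    area_kernel<<<1, 1>>>(side, d_out);

    // This blocking copy also waits for the kernel to finish.
    if (cudaMemcpy(&h_out, d_out, sizeof(float),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        h_out = -1.0f;
    cudaFree(d_out);
    return h_out;
}
```

The object is created inside the kernel on purpose: an object with virtual functions that was built on the host cannot simply be handed to the device and have its virtual functions called there.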
Aside from the Thrusting in a parallel manner, Nvidia is excited about MPI integration with CUDA applications, which will automatically bung data to and from the GPU's memory over InfiniBand when an application does an MPI send or receive. Then there's multi-thread sharing of GPUs, so multiple CPU host threads can share contexts on one GPU.
CUDA 4.0 will also pack a new GPU binary disassembler and improved support for Mac OS X, we're promised.
Nvidia thinks the release is the most important thing in the world, so be sure to enroll as a CUDA registered developer. The toolkit is free to pilfer on the 4th of March if you do.
Does this mean that there is no more cudaMalloc and cudaMemcpy?
Nvidia has staked a large part of its future on the idea that GPUs and their massively parallel architectures can replace CPUs for a big chunk of computational jobs. But parallel programming on one device is tough, across two incompatible devices is very difficult, and across clusters of hybrid machines can be very tricky indeed. That's why Nvidia's CUDA parallel programming environment is probably as important as any chip or Tesla GPU co-processor that Nvidia will ever ship.