I recently built a home computer for training ML models. I’m using one of NVIDIA’s Titan cards, so I’ll be working with CUDA, cuBLAS, cuDNN, and TensorRT. I opted to install the most recent version of Ubuntu, 18.04. After installing everything, I wanted to get DIGITS up and running to benchmark the GPU. I installed the nvidia-docker runtime and fired up the DIGITS container, only to find that the Caffe framework was available but TensorFlow was not. When I tried to install TensorFlow, I quickly discovered that the packaged versions of CUDA would not work with TensorFlow, and the versions of CUDA that would work with TensorFlow would not work with DIGITS. Getting everything to work together (Ubuntu 18.04, CUDA 10, TensorFlow 1.12, DIGITS 6.1.1, and Caffe 0.17.2) takes a bit of work, but it’s ultimately worth it. Here’s what’s needed.

To start with, I needed to ditch all the packaged versions of the NVIDIA software, since we’re going to install the drivers directly from NVIDIA. I had installed the packaged version of Caffe as well, so it’s going too. After removing these packages, I booted into a non-graphical runlevel and unloaded the nvidia kernel modules.

root@titan:/home/lane# apt remove nvidia-*
root@titan:/home/lane# apt remove caffe-cuda

root@titan:/home/lane# init 3
root@titan:/home/lane# rmmod nvidia_drm
root@titan:/home/lane# rmmod nvidia_modeset
root@titan:/home/lane# rmmod nvidia_uvm
root@titan:/home/lane# rmmod nvidia

The next step is to download and install CUDA. It can be downloaded from https://docs.nvidia.com/cuda/. Before doing so, I needed to make sure I had the kernel headers and gcc installed, which I did, so all was good. Download the runfile version of CUDA; it builds the driver on the fly and has no external dependencies.
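If you want to verify those prerequisites before launching the installer, a check along these lines works (a sketch; the package names are the usual Ubuntu ones):

```shell
# Pre-flight check before running the CUDA runfile installer: it builds
# the kernel module on the fly, so gcc and the headers for the running
# kernel must already be present.
missing=""
command -v gcc >/dev/null 2>&1 || missing="$missing build-essential"
[ -d "/lib/modules/$(uname -r)/build" ] || missing="$missing linux-headers-$(uname -r)"
if [ -n "$missing" ]; then
    msg="missing:$missing -- install with: sudo apt install$missing"
else
    msg="kernel headers and gcc present; ready for the runfile"
fi
echo "$msg"
```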

root@titan:/home/lane/Downloads# sh cuda_10.0.130_410.48_linux
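Once the runfile finishes, the toolchain needs to be on the path. Assuming the runfile’s default install prefix of /usr/local/cuda-10.0, something like the following (added to ~/.bashrc to make it permanent) does the job:

```shell
# Assumes the runfile installed to its default prefix, /usr/local/cuda-10.0.
CUDA_HOME=/usr/local/cuda-10.0
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
# nvcc --version should now report release 10.0
```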

Build and install Caffe

lane@titan:~$ git clone https://github.com/NVIDIA/caffe.git
Cloning into 'caffe'...
remote: Enumerating objects: 2, done.
remote: Counting objects: 100% (2/2), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 34594 (delta 0), reused 0 (delta 0), pack-reused 34592
Receiving objects: 100% (34594/34594), 111.68 MiB | 5.18 MiB/s, done.
Resolving deltas: 100% (22974/22974), done.

lane@titan:~$ cd caffe/
lane@titan:~/caffe [caffe-0.17|✔] $ git checkout -b v0.17.2
Switched to a new branch 'v0.17.2'

lane@titan:~/caffe [v0.17.2 L|✔] $ mkdir build && cd build/
lane@titan:~/caffe/build [v0.17.2 L|✔] $ cmake ..
-- The C compiler identification is GNU 7.3.0
-- The CXX compiler identification is GNU 7.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Boost version: 1.65.1
-- Found the following Boost libraries:
--   system
--   thread
--   filesystem
--   regex
--   chrono
--   date_time
--   atomic
-- Found GFlags: /usr/include  
-- Found gflags  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libgflags.so)
-- Found Glog: /usr/include  
-- Found glog    (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libglog.so)
-- Found Protobuf: /usr/lib/x86_64-linux-gnu/libprotobuf.so;-lpthread (found version "3.0.0") 
-- Found PROTOBUF Compiler: /usr/bin/protoc
-- HDF5: Using hdf5 compiler wrapper to determine C configuration
-- HDF5: Using hdf5 compiler wrapper to determine CXX configuration
-- Found HDF5: /usr/lib/x86_64-linux-gnu/hdf5/serial/libhdf5_cpp.so;/usr/lib/x86_64-linux-gnu/hdf5/serial/libhdf5.so;/usr/lib/x86_64-linux-gnu/libpthread.so;/usr/lib/x86_64-linux-gnu/libsz.so;/usr/lib/x86_64-linux-gnu/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so (found version "1.10.0.1") found components:  HL 
-- Found LMDB: /usr/include  
-- Found lmdb    (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/liblmdb.so)
-- Found LevelDB: /usr/include  
-- Found LevelDB (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libleveldb.so)
-- Found Snappy: /usr/include  
-- Found Snappy  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libsnappy.so)
-- Found JPEGTurbo: /usr/include  
-- Found JPEGTurbo  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libturbojpeg.so.0)
-- CUDA detected: 10.0
-- Found CUDNN: /usr/local/cuda-10.0/lib64/libcudnn.so (found version "7.4") 
-- Added CUDA NVCC flags for: sm_61
-- Found OpenCV: /usr (found version "3.2.0") found components:  core imgcodecs highgui imgproc videoio 
-- Found OpenCV 3.x: /usr/share/OpenCV
-- Found OpenBLAS libraries: /usr/lib/x86_64-linux-gnu/libopenblas.so
-- Found OpenBLAS include: /usr/include/x86_64-linux-gnu
-- Found PythonInterp: /usr/bin/python2 (found suitable version "2.7.15", minimum required is "2") 
-- Found Boost Python Library /usr/lib/x86_64-linux-gnu/libboost_python-py27.so
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython2.7.so (found suitable version "2.7.15rc1", minimum required is "2") 
-- Found NumPy: /home/lane/.local/lib/python2.7/site-packages/numpy/core/include (found suitable version "1.15.4", minimum required is "1.7.1") 
-- NumPy ver. 1.15.4 found (include: /home/lane/.local/lib/python2.7/site-packages/numpy/core/include)
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE) 
-- Could NOT find NCCL (missing: NCCL_INCLUDE_DIR NCCL_LIBRARY) 
-- Found NVML: /usr/local/cuda-10.0/include  
-- Found NVML (include: /usr/local/cuda-10.0/include, library: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so)
-- Found Git: /usr/bin/git (found version "2.17.1") 
-- 
-- ******************* Caffe Configuration Summary *******************
-- General:
--   Version           :   0.17.2
--   Git               :   v0.17.2-4-g3abc8f53
--   System            :   Linux
--   C++ compiler      :   /usr/bin/c++
--   Release CXX flags :   -O3 -DNDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
--   Debug CXX flags   :   -g -DDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
--   Build type        :   Release
-- 
--   BUILD_SHARED_LIBS :   ON
--   BUILD_python      :   ON
--   BUILD_matlab      :   OFF
--   BUILD_docs        :   ON
--   USE_LEVELDB       :   ON
--   USE_LMDB          :   ON
--   TEST_FP16         :   OFF
-- 
-- Dependencies:
--   BLAS              :   Yes (Open)
--   Boost             :   Yes (ver. 1.65)
--   glog              :   Yes
--   gflags            :   Yes
--   protobuf          :   Yes (ver. 3.0.0)
--   lmdb              :   Yes (ver. 0.9.21)
--   LevelDB           :   Yes (ver. 1.20)
--   Snappy            :   Yes (ver. ..)
--   OpenCV            :   Yes (ver. 3.2.0)
--   JPEGTurbo         :   Yes
--   CUDA              :   Yes (ver. 10.0)
-- 
-- NVIDIA CUDA:
--   Target GPU(s)     :   Auto
--   GPU arch(s)       :   sm_61
--   cuDNN             :   Yes (ver. 7.4)
--   NCCL              :   Not found (not requested)
--   USE_MPI           :   OFF
--   NVML              :   /usr/lib/x86_64-linux-gnu/libnvidia-ml.so 
-- 
-- Python:
--   Interpreter       :   /usr/bin/python2 (ver. 2.7.15)
--   Libraries         :   /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.15rc1)
--   NumPy             :   /home/lane/.local/lib/python2.7/site-packages/numpy/core/include (ver 1.15.4)
-- 
-- Documentaion:
--   Doxygen           :   No
--   config_file       :   
-- 
-- Install:
--   Install path      :   /home/lane/caffe/build/install
-- 
-- Configuring done
-- Generating done
-- Build files have been written to: /home/lane/caffe/build
lane@titan:~/caffe/build [v0.17.2 L|✔] $ make
root@titan:/home/lane/caffe/build# make install
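For `import caffe` to work, the Python bindings need to be on PYTHONPATH. A sketch, assuming the install prefix shown in the configuration summary above (generalized to $HOME here):

```shell
# Install path from the cmake summary above; pycaffe lands under python/.
export PYTHONPATH="$HOME/caffe/build/install/python${PYTHONPATH:+:$PYTHONPATH}"
echo "$PYTHONPATH"
```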

Test that Caffe is installed and can properly use the GPU

lane@titan:~/caffe/build [v0.17.2 L|✔] $ python
Python 2.7.15rc1 (default, Nov 12 2018, 14:31:15) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import caffe
>>> print (caffe.__version__);
0.17.2
>>> caffe.set_mode_gpu()
I1207 11:43:19.508618  6204 gpu_memory.cpp:105] GPUMemory::Manager initialized
I1207 11:43:19.508757  6204 gpu_memory.cpp:107] Total memory: 12785221632, Free: 12466323456, dev_info[0]: total=12785221632 free=12466323456

Clone and configure TensorFlow, following https://www.tensorflow.org/install/source

lane@titan:~$ git clone https://github.com/tensorflow/tensorflow.git
lane@titan:~$ cd tensorflow
lane@titan:~/tensorflow [master|✔︎] $ git checkout -b r1.12
lane@titan:~/tensorflow [r1.12||✔︎] $ ./configure 
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.15.0 installed.
Please specify the location of python. [Default is /usr/bin/python]: 

Found possible Python library paths:
  /usr/local/python/
  /usr/local/lib/python2.7/dist-packages
  /usr/lib/python2.7/dist-packages
Please input the desired Python library path to use.  Default is [/usr/local/python/]

Do you wish to build TensorFlow with Apache Ignite support? [Y/n]: 
Apache Ignite support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [Y/n]: 
XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: 
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]: 
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]: 10.0

Please specify the location where CUDA 10.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 

Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]: 

Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 

Do you wish to build TensorFlow with TensorRT support? [y/N]: y
TensorRT support will be enabled for TensorFlow.

Please specify the location where TensorRT is installed. [Default is /usr/lib/x86_64-linux-gnu]:/usr/local/TensorRT-5.0.2.6/

Please specify the NCCL version you want to use. If NCCL 2.2 is not installed, then you can use version 1.3 that can be fetched automatically but it may have worse performance with multiple GPUs. [Default is 2.2]: 2.3.7

Please specify the location where NCCL 2 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:/usr/local/nccl_2.3.7/

Assuming NCCL header path is /usr/local/nccl_2.3.7/lib/../include/nccl.h
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 6.1]: 

Do you want to use clang as CUDA compiler? [y/N]: 
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: 

Do you wish to build TensorFlow with MPI support? [y/N]: 
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: 

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: 
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
	--config=mkl         	# Build with MKL support.
	--config=monolithic  	# Config for mostly static monolithic build.
	--config=gdr         	# Build with GDR support.
	--config=verbs       	# Build with libverbs support.
	--config=ngraph      	# Build with Intel nGraph support.
Configuration finished
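As an aside, configure reads most of its answers from environment variables, so the interactive session above can be scripted for repeat builds. The variable names below are my reading of the r1.12 configure script, so treat this as a sketch; the values mirror my answers above:

```shell
# Environment-driven ./configure (values mirror the interactive answers).
export TF_NEED_CUDA=1
export TF_CUDA_VERSION=10.0
export TF_CUDNN_VERSION=7
export TF_NEED_TENSORRT=1
export TENSORRT_INSTALL_PATH=/usr/local/TensorRT-5.0.2.6
export TF_NCCL_VERSION=2.3.7
export NCCL_INSTALL_PATH=/usr/local/nccl_2.3.7
export TF_CUDA_COMPUTE_CAPABILITIES=6.1
```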

Build TensorFlow with GPU support

lane@titan:~/tensorflow [r1.12|✔︎] bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

Build the TensorFlow python package

lane@titan:~/tensorflow [r1.12|✔︎] ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

This places the wheel in /tmp/tensorflow_pkg; the newly built Python module can then be installed with

lane@titan:~/tensorflow [r1.12|✔︎] sudo pip install /tmp/tensorflow_pkg/tensorflow-1.12.0-cp27-cp27mu-linux_x86_64.whl

Now for a quick test to confirm TensorFlow can see the GPU

>>> import tensorflow as tf
>>> with tf.Session() as sess:
...   devices = sess.list_devices()
...
2018-12-07 10:58:16.626765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-07 10:58:16.626825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-07 10:58:16.626839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2018-12-07 10:58:16.626851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2018-12-07 10:58:16.627080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11275 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:65:00.0, compute capability: 6.1)
>>>

That sums up the install. I’ve cloned DIGITS from https://github.com/NVIDIA/DIGITS and started an instance using the digits-devserver script. DIGITS indicates it’s using the newly installed version of Caffe and that the TensorFlow framework is available for training models.
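For reference, DIGITS locates the Caffe build through the CAFFE_ROOT environment variable, which I pointed at the install prefix from the Caffe build (the $HOME-relative path here assumes the checkout lives in ~/caffe):

```shell
# DIGITS finds Caffe via CAFFE_ROOT; this is the install prefix from
# the Caffe build earlier.
export CAFFE_ROOT="$HOME/caffe/build/install"
# then, from the DIGITS checkout:
#   ./digits-devserver
echo "$CAFFE_ROOT"
```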
