Cloud Optimized GeoTIFF Part 2


Welcome to part 2!

For testing and benchmarking I use Oracle Cloud Infrastructure (OCI). My hope is that a state-of-the-art infrastructure delivers good out-of-the-box performance without much tweaking. OCI claims to have top-notch network performance. I use an instance running a current Oracle Linux:

[opc@difusinstance1 ~]$ cat /etc/os-release
NAME="Oracle Linux Server"
VERSION="7.9"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.9"
PRETTY_NAME="Oracle Linux Server 7.9"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:7:9:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://bugzilla.oracle.com/"

ORACLE_BUGZILLA_PRODUCT="Oracle Linux 7"
ORACLE_BUGZILLA_PRODUCT_VERSION=7.9
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=7.9

The machine is a VM.Standard2.1 (1 core with hyper-threading enabled, so 2 logical CPUs). See the OCI compute shapes.

[opc@difusinstance1 ~]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
stepping        : 4
microcode       : 0x1
cpu MHz         : 1995.311
cache size      : 16384 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke md_clear arch_capabilities
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa
bogomips        : 3990.62
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
stepping        : 4
microcode       : 0x1
cpu MHz         : 1995.311
cache size      : 16384 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke md_clear arch_capabilities
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa
bogomips        : 3990.62
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

To build GDAL, install the development tools and PROJ 6. NOTE: I installed PROJ 6 because newer versions need a newer SQLite than the one available on this system. Be warned.

sudo yum install libcurl-devel
sudo yum install python3-devel
sudo yum groupinstall "Development Tools"
sudo yum install sqlite-devel
wget https://download.osgeo.org/proj/proj-6.0.0.tar.gz
tar -xf proj-6.0.0.tar.gz
cd proj-6.0.0/
./configure
make
sudo make install

The gdal build and installation itself is the same as in part 1, BUT:

./configure --with-python --with-curl

That’s important because we will use GDAL’s virtual file systems, here /vsicurl/, which needs curl support.

For some reason the Python bindings cannot find the GDAL shared library in this installation. Quick hack to fix it:

export LD_LIBRARY_PATH=/usr/local/lib
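To check whether the workaround took effect, a quick sanity check from Python can help. This is only a sketch: the helper name gdal_version_or_none is made up for illustration, and it assumes the bindings were installed for the same python3 that runs the snippet.

```python
def gdal_version_or_none():
    """Return the GDAL release name if the Python bindings load, else None."""
    try:
        # Importing osgeo.gdal loads the native GDAL library; if the dynamic
        # loader cannot find it, the import fails with ImportError.
        from osgeo import gdal
    except ImportError as exc:
        print(f"GDAL bindings not usable: {exc}")
        return None
    return gdal.VersionInfo("RELEASE_NAME")


version = gdal_version_or_none()
print("GDAL version:", version)
```

If this prints None together with a "cannot open shared object file" message, the library path is still not set in the current shell.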

I created an Object Storage bucket, cogTest, that should support all features needed to access COG files efficiently, especially HTTP range requests. To access OCI resources without having to specify credentials or passwords, I use instance principals. Maybe I’ll explain that in another blog post 😄

export OCI_CLI_AUTH=instance_principal

NOTE: Adapt --compartment-id

[opc@difusinstance1 ~]$ oci os bucket list --compartment-id=ocid1.compartment.oc1..aaaaaaaaxxxx
{
  "data": [
    {
      "compartment-id": "ocid1.compartment.oc1..aaaaaaaaxxxx",
      "created-by": "xxx",
      "defined-tags": null,
      "etag": "fe454127-e0c3-4fb0-884f-e3d3c2ade3cd",
      "freeform-tags": null,
      "name": "cogTest",
      "namespace": "xxx",
      "time-created": "2021-12-08T19:26:59.967000+00:00"
    }
  ]
}
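The HTTP range requests the bucket has to support are ordinary GET requests with a Range header. The following is a minimal sketch of how such a request is put together; the helper names and the object name example_cog.tiff are made up for illustration, and nothing is sent over the network.

```python
import urllib.request


def range_header(start, length):
    """Build the value of an HTTP Range header for `length` bytes from `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # Byte ranges are inclusive on both ends, hence the -1.
    return f"bytes={start}-{start + length - 1}"


def build_range_request(url, start, length):
    """Prepare (but do not send) a GET request for a byte range of `url`."""
    return urllib.request.Request(url, headers={"Range": range_header(start, length)})


# Placeholder URL: namespace and object name are assumptions, adapt them.
url = "https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/xxxx/b/cogTest/o/example_cog.tiff"
request = build_range_request(url, 0, 16384)
print(request.get_header("Range"))
```

Sending the request with urllib.request.urlopen(request) should return status 206 (Partial Content) if the server honors ranges, assuming the object is readable for the caller.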

Upload a GRIB file and see how long it takes.

[opc@difusinstance1 ~]$ time oci os object put -bn cogTest --file T_2M.2D.199501.grb --name T_2M.2D.199501.grb
Upload ID: xxx
Split file into 8 parts for upload.
Uploading object  [####################################]  100%
{
  "etag": "xxx",
  "last-modified": "xxx",
  "opc-multipart-md5": "LNk7hI+Bag9kJXVFr2KmWw==-8"
}

real    0m13.506s
user    0m7.187s
sys     0m1.889s
[opc@difusinstance1 ~]$

On the VM.Standard2.1 shape it took about 14 seconds to upload a 1 GB file.

This is “amazing”:

[opc@difusinstance1 ~]$ time gdal_translate T_2M.2D.199501.grb -co COMPRESS=LZW -of Gtiff T_2M.2D.199501.tiff
Input file size is 848, 824
Warning 1: T_2M.2D.199501.tiff: Metadata exceeding 32000 bytes cannot be written into GeoTIFF. Transferred to PAM instead.
0...10...20...30...40...50...60...70...80...90...100 - done.

real    844m22.804s
user    835m14.293s
sys     3m21.397s

Ehm… Whooot? 844 minutes? That is about 14 hours. Something is rotten here.

Luckily I found out that newer GDAL versions (the COG driver was added in GDAL 3.1) can create a COG directly, including overviews etc.:

[opc@difusinstance1 ~]$ time gdal_translate T_2M.2D.199501.grb T_2M.2D.199501_cog.tiff -of COG -co COMPRESS=LZW
Input file size is 848, 824
0...10...20Warning 1: T_2M.2D.199501_cog.tiff: Metadata exceeding 32000 bytes cannot be written into GeoTIFF. Transferred to PAM instead.
...30...40...50...60...70...80...90...100 - done.

real    19m40.172s
user    19m20.766s
sys     0m11.804s
[opc@difusinstance1 ~]$ time oci os object put -bn cogTest --file T_2M.2D.199501_cog.tiff --name T_2M.2D.199501_cog.tiff
Upload ID: 64ac2e1b-5ed2-79b3-90cc-ebd6b36eeb31
Split file into 22 parts for upload.
Uploading object  [####################################]  100%
{
  "etag": "c8f89623-13fb-4570-bc38-52f551c97053",
  "last-modified": "Sat, 11 Dec 2021 09:52:53 GMT",
  "opc-multipart-md5": "AYK515hswfqvRb5rWDnZYg==-22"
}

real    0m28.890s
user    0m14.107s
sys     0m5.336s

Geek note: As you can see, the CPU time (user + sys, about 19.4 s) is only about two thirds of the 28.9 s of wall-clock time the upload took. The rest was spent waiting, which indicates an I/O-bound workload.
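To put numbers on this, the output of `time` can be turned into a CPU share. This is a small sketch; the function names are made up for illustration, and the timings are copied from the runs above.

```python
import re


def parse_time(value):
    """Convert a bash `time` value such as '0m28.890s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unexpected time format: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def cpu_share(real, user, system):
    """Fraction of wall-clock time spent on the CPU: (user + sys) / real."""
    return (parse_time(user) + parse_time(system)) / parse_time(real)


# Timings copied from the runs above.
conversion = cpu_share("844m22.804s", "835m14.293s", "3m21.397s")
upload = cpu_share("0m28.890s", "0m14.107s", "0m5.336s")
print(f"GeoTIFF conversion: {conversion:.2f}")
print(f"COG upload:         {upload:.2f}")
```

A share close to 1 (the 844-minute conversion) points to a CPU-bound job, while a clearly lower share (the upload) means the process spent a good part of its time waiting. On a multi-core machine the share can exceed 1 when several threads work in parallel.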

From now on I have the opportunity to use a larger instance! 4 OCPUs and 64GB RAM. Harrr, harrr! More Power!

Let’s see if this speeds up the upload; according to the documentation, the network throughput is significantly higher:

(gdal) [opc@difusinstance1 ~]$ time oci os object put -bn cogTest --file T_2M.2D.199501_cog.tiff --name T_2M.2D.199501_cog.tiff
Upload ID: 55d9411d-1772-7749-f8b9-779962eb23bf
Split file into 22 parts for upload.
Uploading object  [####################################]  100%
{
  "etag": "bc0182a5-2426-4872-bcb3-ac35afeafda0",
  "last-modified": "Sun, 12 Dec 2021 13:30:30 GMT",
  "opc-multipart-md5": "AYK515hswfqvRb5rWDnZYg==-22"
}

real    0m17.305s
user    0m19.291s
sys     0m10.395s

Ooops! 17 secs vs. 29 secs! Bravo! Remember, the file size is about 3 GB!
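Expressed as throughput, the difference looks like this. The sketch assumes the exact file size of 2,940,113,117 bytes that GDAL reports in the debug output further down; the function name is made up for illustration.

```python
COG_SIZE_BYTES = 2_940_113_117  # size of T_2M.2D.199501_cog.tiff


def throughput_mb_per_s(size_bytes, seconds):
    """Average throughput in decimal megabytes per second."""
    return size_bytes / seconds / 1_000_000


small = throughput_mb_per_s(COG_SIZE_BYTES, 28.890)  # VM.Standard2.1
large = throughput_mb_per_s(COG_SIZE_BYTES, 17.305)  # 4 OCPU instance
print(f"small: {small:.1f} MB/s, large: {large:.1f} MB/s, speedup: {large / small:.2f}x")
```

These are end-to-end numbers that include the overhead of the oci CLI itself (splitting the file, computing checksums), so the raw network throughput is probably higher.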

Let’s get the values of a single pixel near the top-left corner (pixel 1, line 1) for all 744 bands:

(gdal) [opc@difusinstance1 ~]$ time gdallocationinfo /vsicurl/https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/xxxx/b/cogTest/o/T_2M.2D.199501_cog.tiff 1 1
Report:
  Location: (1P,1L)
  Band 1:
    Value: -15.705435180664
  Band 2:
    Value: -14.7335723876953
...

...
  Band 744:
    Value: -9.49413146972654

real    0m26.303s
user    0m13.483s
sys     0m2.190s

Now get the same pixel from the first band only:

(gdal) [opc@difusinstance1 ~]$ time gdallocationinfo -b 1 /vsicurl/https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/xxxxx/b/cogTest/o/T_2M.2D.199501_cog.tiff 1 1
Report:
  Location: (1P,1L)
  Band 1:
    Value: -15.705435180664

real    0m22.940s
user    0m11.845s
sys     0m1.103s
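The same single-pixel read can also be expressed with the Python bindings built earlier. This is a sketch under assumptions: the helper names are made up, the namespace "xxxx" is the placeholder used above, and read_pixel needs GDAL with NumPy support plus read access to the bucket, so it is not called here.

```python
def oci_vsicurl_path(region, namespace, bucket, object_name):
    """Build a /vsicurl/ path for an object in OCI Object Storage."""
    url = (
        f"https://objectstorage.{region}.oraclecloud.com"
        f"/n/{namespace}/b/{bucket}/o/{object_name}"
    )
    return "/vsicurl/" + url


def read_pixel(path, band=1, x=1, y=1):
    """Read one pixel of one band; needs the GDAL bindings with NumPy."""
    from osgeo import gdal  # imported lazily so the helper above works without GDAL

    gdal.UseExceptions()  # raise Python exceptions instead of returning None
    dataset = gdal.Open(path)
    # ReadAsArray(xoff, yoff, xsize, ysize) returns a 1x1 NumPy array here.
    return dataset.GetRasterBand(band).ReadAsArray(x, y, 1, 1)[0, 0]


path = oci_vsicurl_path("eu-frankfurt-1", "xxxx", "cogTest", "T_2M.2D.199501_cog.tiff")
print(path)
# read_pixel(path)  # uncomment on a machine that can reach the bucket
```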

Ok, suxxx. Much slower than expected. Check some debug output:

(gdal) [opc@difusinstance1 ~]$ time gdallocationinfo --debug on -b 1 /vsicurl/https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/xxxxx/b/cogTest/o/T_2M.2D.199501_cog.tiff 1 1
HTTP: libcurl/7.29.0 NSS/3.53.1 zlib/1.2.3 libidn/1.28 libssh2/1.8.0
VSICURL: GetFileSize(https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/xxxxx/b/cogTest/o/T_2M.2D.199501_cog.tiff)=2940113117  response_code=200
VSICURL: Downloading 0-16383 (https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/xxxxx/b/cogTest/o/T_2M.2D.199501_cog.tiff)...
VSICURL: Got response_code=206
GDAL: GDALOpen(/vsicurl/https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/xxxxx/b/cogTest/o/T_2M.2D.199501_cog.tiff, this=0x936d00) succeeds as GTiff.
Report:
  Location: (1P,1L)
  Band 1:
VSICURL: GetFileList(/vsicurl/https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/xxxxx/b/cogTest/o)
VSICURL: GetFileSize(https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/xxxxx/b/cogTest/o/T_2M.2D.199501_cog.tiff.aux.xml)=0  response_code=404
VSICURL: GetFileSize(https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/xxxxx/b/cogTest/o/T_2M.2D.199501_cog.aux)=0  response_code=404
VSICURL: GetFileSize(https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/xxxxx/b/cogTest/o/T_2M.2D.199501_cog.AUX)=0  response_code=404
VSICURL: GetFileSize(https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/xxxxx/b/cogTest/o/T_2M.2D.199501_cog.tiff.aux)=0  response_code=404
VSICURL: GetFileSize(https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/xxxxx/b/cogTest/o/T_2M.2D.199501_cog.tiff.AUX)=0  response_code=404

...

Lots of 404 responses. GDAL probes for auxiliary sidecar files such as the .aux.xml generated in Part 1, which I forgot to upload:

oci os object put -bn cogTest --file T_2M.2D.199501_cog.tiff.aux.xml --name T_2M.2D.199501_cog.tiff.aux.xml

Unfortunately, this did not help. It seems GDAL still reads too much of the file.
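To see at a glance which probes fail in a debug log like the one above, the VSICURL lines can be summarized with a few lines of Python. This is a sketch that only assumes the line format shown in the log; the function name is made up, and the sample lines are copied from the output above.

```python
import re
from collections import Counter

# Matches lines such as:
# VSICURL: GetFileSize(https://.../file.aux)=0  response_code=404
PROBE = re.compile(
    r"^VSICURL: GetFileSize\((?P<url>[^)]+)\)=(?P<size>\d+)\s+response_code=(?P<code>\d+)"
)


def summarize_probes(log_text):
    """Count GetFileSize probes per HTTP status and list the names that returned 404."""
    codes = Counter()
    missing = []
    for line in log_text.splitlines():
        match = PROBE.match(line.strip())
        if match is None:
            continue  # not a GetFileSize line
        codes[int(match.group("code"))] += 1
        if match.group("code") == "404":
            # Keep only the object name, not the full URL.
            missing.append(match.group("url").rsplit("/", 1)[-1])
    return codes, missing


sample = """\
VSICURL: GetFileSize(https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/xxxxx/b/cogTest/o/T_2M.2D.199501_cog.tiff)=2940113117  response_code=200
VSICURL: Got response_code=206
VSICURL: GetFileSize(https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/xxxxx/b/cogTest/o/T_2M.2D.199501_cog.tiff.aux.xml)=0  response_code=404
VSICURL: GetFileSize(https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/xxxxx/b/cogTest/o/T_2M.2D.199501_cog.aux)=0  response_code=404
"""
codes, missing = summarize_probes(sample)
print(dict(codes))
print(missing)
```

Running this over the complete log before and after uploading the .aux.xml file would show whether the number of failing probes actually changed.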

I think we have to do a part 3 to further investigate this…