[LDM #IRS-806047]: pqact keeping files open after scour deletes them
- Date: Wed, 06 Jul 2016 14:50:18 -0600
Hi Pete,
You've encountered a feature of Unix systems: a process can keep a file open
and continue writing to it even after another process has deleted the file.
The space the file occupies isn't released back to the system until the first
process closes the file.
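A minimal Python sketch of that behavior, assuming a Linux or other POSIX
system. The function name and the temporary file are purely illustrative and
have nothing to do with the LDM itself:

```python
import os
import tempfile


def demo_unlinked_file_still_writable(payload: bytes = b"x" * 4096) -> dict:
    """Delete a file's name while it is open, then keep writing to it."""
    fd, path = tempfile.mkstemp()
    try:
        os.unlink(path)                     # what scour does: remove the name
        name_gone = not os.path.exists(path)
        os.write(fd, payload)               # the open descriptor still works
        info = os.fstat(fd)                 # the inode is still alive
        size_via_fd = info.st_size
        link_count = info.st_nlink          # typically 0: no name refers to it
    finally:
        os.close(fd)                        # only now can the space be freed
    return {
        "name_gone": name_gone,
        "size_via_fd": size_via_fd,
        "link_count": link_count,
    }


if __name__ == "__main__":
    print(demo_unlinked_file_still_writable())
```

Between the unlink and the close, 'du' no longer counts the file because it
has no name, while the filesystem still counts its blocks as in use. That is
the same gap you saw on /data.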
By default, a pqact(1) FILE action will keep the file open for appending until
and unless the pqact(1) process runs out of file descriptors -- in which case
the least recently used file descriptor will be closed.
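The following Python sketch illustrates the idea of a least-recently-used
list of open files. It is not pqact's actual code (that is written in C): the
class name, the method names, and the fixed max_open cap are assumptions, with
the cap standing in for the process running out of file descriptors.

```python
from collections import OrderedDict


class LruFileCache:
    """Keep files open for appending; close the least recently used when full."""

    def __init__(self, max_open: int):
        self.max_open = max_open      # hypothetical stand-in for the fd limit
        self._open = OrderedDict()    # path -> file object, oldest first

    def append(self, path: str, data: bytes) -> None:
        f = self._open.pop(path, None)
        if f is None:
            if len(self._open) >= self.max_open:
                # Out of slots: close the least recently used file.
                _, oldest = self._open.popitem(last=False)
                oldest.close()
            f = open(path, "ab")
        self._open[path] = f          # (re)insert as most recently used
        f.write(data)
        f.flush()

    def open_paths(self) -> list:
        """Paths currently held open, least recently used first."""
        return list(self._open)

    def close_all(self) -> None:
        while self._open:
            _, f = self._open.popitem()
            f.close()
```

With a scheme like this, a process that writes to few distinct files never
fills the list, so nothing is ever evicted and every file stays open
indefinitely.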
You could add the option "-overwrite" to the FILE actions. This will cause each
file to only contain the most recent matching data-product.
Alternatively, you could add the option "-close" to the FILE actions. Yes, this
will increase the overhead, but in your case that appears justified. Be
advised, however, that matching products will, by default, be appended to the
same file unless the "-overwrite" option is also used.
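For example, the GFS entry quoted below from pqact.forecastModels might look
like this with "-close" added. Treat it as a sketch: it assumes the option goes
between FILE and the pathname, and that the fields are separated by tabs with
continuation lines starting with a tab, as elsewhere in pqact.conf.

```
CONDUIT	^data/nccf/com/gfs/prod/gfs\.(........)(..).*pgrb2\.0p25\.a
	FILE	-close
	/data/tds/native/grid/NCEP/GFS/Global_0p25deg_ana/GFS_Global_0p25deg_ana_\1_\200.grib2
```

Each matching product is still appended to the file, but the file is closed
after each write, so no descriptor is left open to hold the space once scour
deletes the file.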
> Sean and ldm support (Steve, Tom?)
>
> Sorry this email is so long, but I guess I need to be verbose to explain
> the issue and what I know about it so far..
>
> I've come across a problem/curiosity since Sean gave me two pqact files
> to recreate the thredds.ucar.edu tds structure for highest-resolution
> model data on my weather.aos.wisc.edu data server.
>
> I added them in by including them in my ldmd.conf, as:
>
> EXEC "pqact"
> exec "pqact -q /dev/shm/ldm.pq /usr/local/ldm/etc/pqact.forecastModels"
> exec "pqact -q /dev/shm/ldm.pq /usr/local/ldm/etc/pqact.forecastProdsAndAna"
>
> I've attached all three pqact files for reference.
>
> After having done that, I expected my disk usage to grow, so I cleaned
> off what I thought would be plenty - 500Gb or so. But after Sean and
> Ryan left, over the next weekend, I noticed my disk kept filling up. I
> kept deleting more historical data and yet it kept growing.
>
> At one point, I noticed that if I took the disk usage with the 'du'
> command of all of the files/directories on my /data directory and added
> them up, it did not come close to the full /data space available (was
> something like 2 Tb in use out of 2.5 Tb available.)
>
> For the heck of it I stopped and restarted the ldm. Imagine my surprise
> when 500 Gb suddenly were freed up!
>
> I have attached the output from lsof (showing open files and the process
> that owns them) - I deleted all but files that are open by one of the
> pqact processes - if you think the rest are necessary to figure this out,
> I can resend with the whole thing.
>
> There are 3 pqact processes running - 28574 (using my original
> pqact.conf), 28575 (running on Sean's pqact.forecastModels) and 28576
> (running on Sean's pqact.forecastProdsAndAna)
>
> I have my scour.conf set to delete files greater than one day old
> under the /data/tds directory (where all of the tds high res model data
> from Sean's pqact files gets put). I ran this lsof at 14:12 local time,
> and the scour last happened at 11:30 AM CDT (16:30 UTC) so in theory,
> any files older than 11:30 AM CDT on July 5 should be deleted.
>
> In fact, a ls -l of the HRRR_CONUS_2p5km files shows just that (I took
> out a bunch in the middle to shorten the list)
>
> [ldm@weather ~/etc]$ ls -l
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km*
>
> -rw-r--r--. 1 ldm ldm 1920199805 Jul 5 12:24
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_1600.grib2
> -rw-r--r--. 1 ldm ldm 200752 Jul 5 12:30
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_1600.grib2.gbx9
> -rw-r--r--. 1 ldm ldm 23230 Jul 5 12:30
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_1600.grib2.ncx3
> -rw-r--r--. 1 ldm ldm 1912314659 Jul 5 13:25
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_1700.grib2
> -rw-r--r--. 1 ldm ldm 200757 Jul 5 13:30
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_1700.grib2.gbx9
> -rw-r--r--. 1 ldm ldm 23232 Jul 5 13:30
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_1700.grib2.ncx3
>
> ....
>
> -rw-r--r--. 1 ldm ldm 1934749135 Jul 6 13:25
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_1700.grib2
> -rw-r--r--. 1 ldm ldm 200759 Jul 6 13:30
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_1700.grib2.gbx9
> -rw-r--r--. 1 ldm ldm 23231 Jul 6 13:30
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_1700.grib2.ncx3
> -rw-r--r--. 1 ldm ldm 1936641707 Jul 6 14:25
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_1800.grib2
> -rw-r--r--. 1 ldm ldm 200781 Jul 6 14:30
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_1800.grib2.gbx9
> -rw-r--r--. 1 ldm ldm 23236 Jul 6 14:30
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_1800.grib2.ncx3
>
>
> If you search for /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_ in
> the lsof listing, you'll see all of these plus bunches that are marked
> '(deleted)' - so scour nuked them, but pqact for some reason hasn't let them
> go. I last restarted the ldm around 05 UTC July 4.
>
> grep '\/data\/tds\/native\/grid\/NCEP\/HRRR\/CONUS_2p5km\/HRRR_'
> weather_aos_wisc_edu_lsof.txt
>
> pqact 28575 ldm 5w REG 8,4 3180287226
> 12996221
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_0400.grib2
> (deleted)
> pqact 28575 ldm 6w REG 8,4 3561617420
> 12996234
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_0300.grib2
> (deleted)
> pqact 28575 ldm 11w REG 8,4 3572003347
> 12996286
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160703_2100.grib2
> (deleted)
> pqact 28575 ldm 12w REG 8,4 1874048146
> 12996235
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_0200.grib2
> (deleted)
> pqact 28575 ldm 15w REG 8,4 3399192600
> 12996173
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_0500.grib2
> (deleted)
> pqact 28575 ldm 21w REG 8,4 2600954387
> 12996195
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_0600.grib2
> (deleted)
> pqact 28575 ldm 30w REG 8,4 2029102991
> 12996202
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_0700.grib2
> (deleted)
> pqact 28575 ldm 33w REG 8,4 2516119390
> 12996211
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_0800.grib2
> (deleted)
> pqact 28575 ldm 40w REG 8,4 2443414784
> 12996222
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_0900.grib2
> (deleted)
> pqact 28575 ldm 46w REG 8,4 2926153042
> 12996214
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_1000.grib2
> (deleted)
> pqact 28575 ldm 49w REG 8,4 2403645194
> 12996218
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_1100.grib2
> (deleted)
> pqact 28575 ldm 51w REG 8,4 2280661458
> 12996165
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_1200.grib2
> (deleted)
> pqact 28575 ldm 62w REG 8,4 2069064423
> 12996228
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_1300.grib2
> (deleted)
> pqact 28575 ldm 64w REG 8,4 1941733544
> 12996187
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_1400.grib2
> (deleted)
> pqact 28575 ldm 71w REG 8,4 2041348300
> 12996207
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_1500.grib2
> (deleted)
> pqact 28575 ldm 77w REG 8,4 1934161187
> 12996179
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_1600.grib2
> (deleted)
> pqact 28575 ldm 80w REG 8,4 1935748571
> 12996186
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_1700.grib2
> (deleted)
> pqact 28575 ldm 83w REG 8,4 1939644108
> 12996205
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_1800.grib2
> (deleted)
> pqact 28575 ldm 92w REG 8,4 1942403055
> 12996225
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_1900.grib2
> (deleted)
> pqact 28575 ldm 95w REG 8,4 1940111965
> 12996246
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_2000.grib2
> (deleted)
> pqact 28575 ldm 102w REG 8,4 1942075959
> 12996250
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_2100.grib2
> (deleted)
> pqact 28575 ldm 107w REG 8,4 1917970061
> 12996254
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_2200.grib2
> (deleted)
> pqact 28575 ldm 111w REG 8,4 1915809771
> 12996258
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160704_2300.grib2
> (deleted)
> pqact 28575 ldm 113w REG 8,4 1904154807
> 12996190
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_0000.grib2
> (deleted)
> pqact 28575 ldm 123w REG 8,4 1889709131
> 12996169
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_0100.grib2
> (deleted)
> pqact 28575 ldm 126w REG 8,4 1885391770
> 12996198
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_0200.grib2
> (deleted)
> pqact 28575 ldm 133w REG 8,4 1891804447
> 12996206
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_0300.grib2
> (deleted)
> pqact 28575 ldm 139w REG 8,4 1893351073
> 12996199
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_0400.grib2
> (deleted)
> pqact 28575 ldm 142w REG 8,4 1907215010
> 12996192
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_0500.grib2
> (deleted)
> pqact 28575 ldm 145w REG 8,4 1911105692
> 12996210
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_0600.grib2
> (deleted)
> pqact 28575 ldm 154w REG 8,4 1919158117
> 12996226
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_0700.grib2
> (deleted)
> pqact 28575 ldm 158w REG 8,4 1921279097
> 12996171
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_0800.grib2
> (deleted)
> pqact 28575 ldm 164w REG 8,4 1923891313
> 12996168
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_0900.grib2
> (deleted)
> pqact 28575 ldm 170w REG 8,4 1914028225
> 12996215
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_1000.grib2
> (deleted)
> pqact 28575 ldm 173w REG 8,4 1912047735
> 12996224
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_1100.grib2
> (deleted)
> pqact 28575 ldm 175w REG 8,4 1912127181
> 12996182
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_1200.grib2
> (deleted)
> pqact 28575 ldm 185w REG 8,4 1917546312
> 12996163
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_1300.grib2
> (deleted)
> pqact 28575 ldm 188w REG 8,4 1921040920
> 12996274
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_1400.grib2
> (deleted)
> pqact 28575 ldm 195w REG 8,4 1916252842
> 12996279
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_1500.grib2
> (deleted)
> pqact 28575 ldm 201w REG 8,4 1920199805
> 12996161
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_1600.grib2
> pqact 28575 ldm 204w REG 8,4 1912314659
> 12996160
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_1700.grib2
> pqact 28575 ldm 207w REG 8,4 1926374767
> 12996170
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_1800.grib2
> pqact 28575 ldm 216w REG 8,4 1918308546
> 12996178
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_1900.grib2
> pqact 28575 ldm 220w REG 8,4 1943169824
> 12996175
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_2000.grib2
> pqact 28575 ldm 226w REG 8,4 1912657143
> 12996162
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_2100.grib2
> pqact 28575 ldm 232w REG 8,4 1880507077
> 12996181
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_2200.grib2
> pqact 28575 ldm 235w REG 8,4 1876356353
> 12996231
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160705_2300.grib2
> pqact 28575 ldm 237w REG 8,4 1858990491
> 12996196
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_0000.grib2
> pqact 28575 ldm 247w REG 8,4 1848657586
> 12996242
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_0100.grib2
> pqact 28575 ldm 250w REG 8,4 1856959455
> 12996244
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_0200.grib2
> pqact 28575 ldm 257w REG 8,4 1859339751
> 12996248
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_0300.grib2
> pqact 28575 ldm 263w REG 8,4 1873889950
> 12996197
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_0400.grib2
> pqact 28575 ldm 266w REG 8,4 1865073308
> 12996194
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_0500.grib2
> pqact 28575 ldm 269w REG 8,4 1875881860
> 12996208
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_0600.grib2
> pqact 28575 ldm 278w REG 8,4 1895724505
> 12996260
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_0700.grib2
> pqact 28575 ldm 281w REG 8,4 1904900838
> 12996201
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_0800.grib2
> pqact 28575 ldm 288w REG 8,4 1910833578
> 12996189
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_0900.grib2
> pqact 28575 ldm 294w REG 8,4 1896415552
> 12996212
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_1000.grib2
> pqact 28575 ldm 297w REG 8,4 1898703686
> 12996267
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_1100.grib2
> pqact 28575 ldm 299w REG 8,4 1908547243
> 12996185
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_1200.grib2
> pqact 28575 ldm 309w REG 8,4 1909311831
> 12996177
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_1300.grib2
> pqact 28575 ldm 312w REG 8,4 1919560310
> 12996236
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_1400.grib2
> pqact 28575 ldm 319w REG 8,4 1921990669
> 12996275
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_1500.grib2
> pqact 28575 ldm 325w REG 8,4 1929673467
> 12996227
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_1600.grib2
> pqact 28575 ldm 328w REG 8,4 1934749135
> 12996223
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_1700.grib2
> pqact 28575 ldm 331w REG 8,4 1229212489
> 12996280
> /data/tds/native/grid/NCEP/HRRR/CONUS_2p5km/HRRR_CONUS_2p5km_20160706_1800.grib2
>
> I used HRRR here as an example, but similar things happen for pretty
> much all of the model data being saved by the two pqacts running on
> Sean's pqact.conf files.
>
> About the only difference I can see between Sean's and my original is
> that some of the entries in Sean's have the keyword -metadata on the
> FILE line. But I don't think it's that, because the GFS files do not
> have the -metadata keyword, and they also are being kept open by pqact.
>
> For example, here's the GFS FILE save line from Sean's
> pqact.forecastModels file:
>
> CONDUIT ^data/nccf/com/gfs/prod/gfs\.(........)(..).*pgrb2\.0p25\.a
> FILE
> /data/tds/native/grid/NCEP/GFS/Global_0p25deg_ana/GFS_Global_0p25deg_ana_\1_\200.grib2
>
> And here's my GFS FILE save line from pqact.conf for just the analysis:
>
> # 0.25 deg GFS analysis [huge]
> CONDUIT ^data/nccf/com/gfs/prod/gfs.20(..)(..)(..)(..)/.*pgrb2.0p25.(anl)
> FILE /data/grib2/gblav0p25.\1\2\3\4_F\5
>
> Any idea what might be going on? Does pqact by default keep a certain
> number of files open, and since I'm processing so many radar/etc files
> with my other pqact file that it just keeps flushing, whereas there are
> many fewer files with Sean's pqact's and they never get closed?
>
> I do notice that the file descriptors shown in the lsof for my pqact
> run from 0 to 1022, maybe roll over ~1024? Sean's go only to 336 and
> 647 respectively.
>
> Perhaps from FL_FLUSH 1024 in ldm-6.13.1/src/pqact/filel.h?
>
> /*
> * fl_entry.flags, args to close_lru()
> */
> #define FL_NEEDS_SYNC 1
> #define FL_OVERWRITE 2
> #define FL_NOTRANSIENT 16
> #define FL_STRIP 32
> #define FL_LOG 64
> #define FL_METADATA 128 /* write data-product metadata */
> #define FL_NODATA 256 /* don't write data */
> #define FL_EDEX 512 /* send message to memory segment */
> #define FL_FLUSH 1024
> #define FL_CLOSE 2048
>
> Do I need to use a -close option to the FILE line for these? Seems like
> it would be a lot of overhead to
>
> I did test with both ldm-6.12.4 and ldm-6.13.1 and the problem happens
> with both. I have not tried merging these pqact entries with my own and
> running a single pqact process, but maybe that would resolve the issue?
>
> I did check and I have selinux set to permissive (I know that sometimes
> bites people in the butt)..
>
> The directories and files in them are all owned by user/group ldm/ldm,
> which is the owner of ldm and pqact and other ldm procs.
>
> I can get around this by ldmadmin stop ; ldmadmin start every few days,
> but that's not ideal.
>
> Thanks for any ideas/help.
Regards,
Steve Emmerson
Ticket Details
===================
Ticket ID: IRS-806047
Department: Support LDM
Priority: Normal
Status: Closed
===================