
[netCDFJava #MQO-415619]: Efficiently serializing NetCDF in memory objects



The rule I am currently proposing is that
if the in-memory file is opened read-only, then no copy
is made and ownership remains with the caller.
If it is writable, then a copy is made internally and owned by the netcdf
library.
The reason for doing this is to reduce memory use. It is also consistent
with the way HDF5 works.
This is admittedly a somewhat complicated rule, so the alternative
is for the netcdf-c library to always make an internal copy of the incoming
data. In either case, the memory returned by nc_close_mem() would need to be
freed by the caller.
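[Editor's note: the ownership rule above can be modeled with a small stand-alone C sketch. This is a toy illustration only; `memfile`, `memfile_open`, and `memfile_close` are hypothetical names, not part of the netcdf-c API.]

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    void  *data;
    size_t size;
    int    owns_data;   /* did the "library" make an internal copy? */
} memfile;

/* Read-only: keep the caller's buffer (no copy; caller retains ownership).
   Writable: make an internal copy owned by the "library". */
static memfile *memfile_open(void *buf, size_t size, int writable) {
    memfile *f = malloc(sizeof *f);
    f->size = size;
    if (writable) {
        f->data = malloc(size);
        memcpy(f->data, buf, size);
        f->owns_data = 1;
    } else {
        f->data = buf;
        f->owns_data = 0;
    }
    return f;
}

/* Close: hand the final image to the caller, who must free() it. */
static void memfile_close(memfile *f, size_t *sizep, void **memp) {
    *sizep = f->size;
    if (f->owns_data) {
        *memp = f->data;          /* transfer the internal copy */
    } else {
        *memp = malloc(f->size);  /* copy out, so the caller always
                                     owns what it receives */
        memcpy(*memp, f->data, f->size);
    }
    free(f);
}
```

Either way, the buffer handed back at close time is the caller's to free, matching the rule stated above.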

> My understanding is that 1) would incur no copy. For 2), it seems at least
> one mandatory copy would be made. Am I correct ?
Any time you write to an in-memory file, it will incur at least one copy
internally.

> It also seems nc_create is missing an argument to provide a size hint for
> the memory buffer. This would help avoid unnecessary reallocations.
Good idea.

> Unnecessary copies will kill any potential improvements
> from the new API.
Possibly true, but I do not think I have enough
control over the HDF5 implementation to do this when
the in-memory file is modified. No-copy is only possible
if the file is read-only.



> 
> Does nc_open_mem takes ownership of the memory ? In other words, who is
> responsible for releasing the provided memory ? (It seems it is the caller
> since you mention a copy is made internally)
> 
> > 2. The memory returned by nc_close_mem would be a COPY of the current
> >    contents if the original was writeable (again HDF5 forced).
> 
> 
> Is it a copy of the initial data (provided through nc_open_mem for example)
> or is it a copy of the internal copy ? Also, can you clarify what the
> original is in that context ?
> 
> I have two main use cases for this feature
> 
> 1. Deserialization: open a read-only NetCDF from memory and provide it
> as read-only input for algorithms
> 2. Serialization: NetCDF is created and populated in-memory by
> algorithms. Once data is submitted from algorithms to the framework, it is
> treated as read-only. Framework would access memory to efficiently
> serialize it for transport
> 
> My understanding is that 1) would incur no copy. For 2), it seems at least
> one mandatory copy would be made. Am I correct ?
> 
> It also seems nc_create is missing an argument to provide a size hint for
> the memory buffer. This would help avoid unnecessary reallocations.
> 
> Regarding your thoughts, my use-case is about maximizing performance for
> large datasets. Unnecessary copies will kill any potential improvements
> from the new API.
> So I would request a "no-copy" API, even if it comes with its set of
> constraints. The nc_close_mem API seems a good way to enforce some of them.
> 
> No/zero-copy APIs are usually for advanced users and should be opt-in. One
> piece of advice I can give is to provide a dedicated API for zero copy:
> 
> 1. Default API could incur copy, which is safe for normal users
> 2. Dedicated API will give you more freedom for the design of the
> zero-copy function signatures
> 
> So to answer your question, I would provide two sets of API if you are
> concerned about complexity of zero copy optimizations for your users.
> 
> Regards,
> 
> Michaël
> 
> address@hidden>:
> 
> > Proposal:
> > I am in the process of adding support for retrieving
> > the final file image for an in-memory file. This would
> > extend the existing nc_open_mem functionality.
> >
> > This involves several changes.
> > 1. nc_open_mem() would now allow the NC_WRITE mode flag
> >    so a chunk of memory can be passed in and be modified
> > 2. nc_create() would now allow the NC_INMEMORY flag to be set
> >    to cause the created file to be kept in memory.
> > 3. the ability would be provided to obtain the in-memory contents
> >    of a file at the point it is closed using a new function
> >         nc_close_mem(int ncid, size_t* sizep, void** memory)
> > Note the following.
> > 1. If nc_open_mem is called with NC_WRITE, then a copy of the incoming
> >    memory would be made internally (dictated by HDF5 functionality).
> >    If instead, the file is read-only, no copy would be made.
> > 2. The memory returned by nc_close_mem would be a COPY of the current
> >    contents if the original was writeable (again HDF5 forced).
> >
> > I also have a couple of thoughts.
> > 1. Instead of nc_close_mem, I could separate out the memory extraction
> >    from the close operation. Hence one would call something like
> >    nc_get_mem() and then nc_close() instead of nc_close only.
> >    I prefer the nc_close_mem solution, but can anyone provide a
> >    use case where separate functions are better?
> > 2. I could always make it so that the incoming and extracted memory
> >    chunks were copies. This would mean that read-only opening
> >    a memory chunk would pay the copy cost when it did not have to.
> >    But the always-copy solution might be simpler for the user to understand
> >    and use.
> > Note that this is all preliminary and is subject to change if,
> > for example, I cannot get some piece of functionality to work.
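[Editor's note: taken together, the proposed serialization path might look like the following from the caller's side. This is a sketch only: the NC_INMEMORY create flag and nc_close_mem are proposed above and did not exist in any released netcdf-c at the time, so this cannot be compiled as-is.]

```c
/* Sketch of the PROPOSED API above; not compilable against current netcdf-c. */
#include <netcdf.h>

int serialize_to_memory(size_t *sizep, void **imagep) {
    int ncid, varid, dimid, ret;

    /* Item 2: create a file that is kept entirely in memory. */
    if ((ret = nc_create("virtual.nc", NC_NETCDF4 | NC_INMEMORY, &ncid)))
        return ret;

    /* Define and write data through the usual netCDF API. */
    if ((ret = nc_def_dim(ncid, "x", 10, &dimid))) return ret;
    if ((ret = nc_def_var(ncid, "v", NC_INT, 1, &dimid, &varid))) return ret;
    if ((ret = nc_enddef(ncid))) return ret;

    /* Item 3: close and take ownership of the final file image;
       the caller must free(*imagep) when done. */
    return nc_close_mem(ncid, sizep, imagep);
}
```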
> >
> >
> >
> >
> > > Sorry, I posted the proposal to the netcdf news group and forgot to copy
> > > it to the github site. I will do that ASAP.
> > >
> > > >
> > > > This is good news. I am eager to provide feedback on your proposal.
> > > > I did not find any proposal in the linked issue. Did you forget to
> > > > push :) ?
> > > >
> > > > Also, it has not been updated with our latest discussions. Do you want
> > > > me to fix that ?
> > > >
> > > > Regards,
> > > >
> > > > Michaël
> > > >
> > > > address@hidden>:
> > > >
> > > > > It appears I can do this. You might examine my proposed
> > > > > API and tell me if it will serve your purposes.
> > > > > See https://github.com/Unidata/netcdf-c/issues/708
> > > > >
> > > > > > I will investigate and see if I can use this function
> > > > > > to get the desired effect.
> > > > > > Thanks for bringing it to my attention.
> > > > > >
> > > > > > >
> > > > > > > The HDFgroup forum has pointed me to the H5Pset_file_image
> > > > > > > <https://support.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetFileImage>
> > > > > > > function. This is documented further here
> > > > > > > <https://support.hdfgroup.org/HDF5/doc/Advanced/FileImageOperations/HDF5FileImageOperations.pdf>
> > > > > > > and seems to be available from HDF5 1.8.9 upwards.
> > > > > > > So, I should rejoice as the limitation no longer comes from the
> > > > > > > HDF5 software. Yet, I am a bit suspicious because this feature has
> > > > > > > been available from mid-2012 and I fear you are not using it for a
> > > > > > > good reason.
> > > > > > >
> > > > > > > What do you think ? I am willing to try implementing it if someone
> > > > > > > can mentor me.
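[Editor's note: the H5Pset_file_image mechanism referenced above is typically combined with HDF5's core (in-memory) driver. A minimal open-from-buffer sketch follows (HDF5 >= 1.8.9; error handling elided, so treat it as illustrative rather than production code).]

```c
#include <hdf5.h>

/* Open an HDF5 file whose complete image already lives in `buf`. */
hid_t open_from_image(void *buf, size_t len) {
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_core(fapl, 4096, 0);     /* in-memory driver, no backing store */
    H5Pset_file_image(fapl, buf, len);   /* by default HDF5 copies the buffer */
    hid_t file = H5Fopen("in-memory.h5", H5F_ACC_RDONLY, fapl);
    H5Pclose(fapl);
    return file;                          /* negative on failure */
}
```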
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Michaël
> > > > > > >
> > > > > > > address@hidden>:
> > > > > > >
> > > > > > > > I think that as long as the mmap is set up with non-persist,
> > > > > > > > then the only writes to disk will occur with paging.
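[Editor's note: in POSIX terms, the "non-persist" behavior corresponds to a MAP_PRIVATE (copy-on-write) mapping, where modified pages are never flushed back to the file. A minimal sketch; `map_private` is an illustrative helper, not a netCDF function.]

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of `path` copy-on-write: reads see the file,
   writes go to private anonymous pages and never reach the disk. */
char *map_private(const char *path, size_t len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping survives closing the fd */
    return p == MAP_FAILED ? NULL : p;
}
```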
> > > > > > > >
> > > > > > > > > Actually, my solution is less than optimal since, using mmap,
> > > > > > > > > data is still written to disk eventually.
> > > > > > > > >
> > > > > > > > > The goal is to avoid both writes to disk and data copies. This
> > > > > > > > > is sometimes called the "zero copy" optimization and would
> > > > > > > > > work because in my application, we enforce that input data is
> > > > > > > > > read only. Output data is modified obviously but can be seen
> > > > > > > > > as read only just before transmission.
> > > > > > > > >
> > > > > > > > > I have started a thread on the HDF forum on the topic. I am
> > > > > > > > > also looking into the Apache Commons VFS Ram filesystem as a
> > > > > > > > > fallback workaround.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > address@hidden>:
> > > > > > > > >
> > > > > > > > > > Actually, we do use mmap already. The problem is that the
> > > > > > > > > > mapped memory is hidden deep in, for example, the hdf5
> > > > > > > > > > memory driver. So there is no way I can see to get access to
> > > > > > > > > > it. If we did have access to it, then of course, we could
> > > > > > > > > > copy it out and give you the functionality you desire.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > My ideal use case would allow users to create/modify
> > > > > > > > > > > NetCDF4 datasets in-memory. Once done, my software would
> > > > > > > > > > > read the binary data and send it over the network.
> > > > > > > > > > > One workaround I see is to use a POSIX memory mapped file
> > > > > > > > > > > <https://en.wikipedia.org/wiki/Memory-mapped_file> to
> > > > > > > > > > > trick the library into working in-memory.
> > > > > > > > > > >
> > > > > > > > > > > I now understand this issue originates from a limitation
> > > > > > > > > > > of the HDF5 library. I can raise this issue to the HDF
> > > > > > > > > > > group. Publications such as this one
> > > > > > > > > > > <https://support.hdfgroup.org/pubs/papers/Big_HDF_FAQs.pdf>
> > > > > > > > > > > (see also this
> > > > > > > > > > > <https://www.hdfgroup.org/2015/03/from-hdf5-datasets-to-apache-spark-rdds/>)
> > > > > > > > > > > seem to indicate they would be interested by the feature.
> > > > > > > > > > > Indeed, avoiding unnecessary transfers to disk is key to
> > > > > > > > > > > achieving good performance in Big Data systems (this is
> > > > > > > > > > > the whole point of Apache Spark BTW).
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > address@hidden>:
> > > > > > > > > > >
> > > > > > > > > > > > This question has come up before.
> > > > > > > > > > > > This is currently not possible for netcdf-4 files.
> > > > > > > > > > > > Using the NC_DISKLESS mode flag allows for keeping
> > > > > > > > > > > > the file in memory. The nc_open_mem function allows
> > > > > > > > > > > > read-only access that treats a chunk of memory as if it
> > > > > > > > > > > > were a netcdf file. Unfortunately, for netcdf-4, we
> > > > > > > > > > > > ultimately depend on the HDF5 operation H5Pset_fapl_core
> > > > > > > > > > > > (https://support.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetFaplCore).
> > > > > > > > > > > > Upon close(), this can optionally store the memory
> > > > > > > > > > > > buffer in a file. Unfortunately, it does not (currently)
> > > > > > > > > > > > provide an option to copy out the memory.
> > > > > > > > > > > > The only solution I can see for now is to build an
> > > > > > > > > > > > alternative to the core driver that provides access
> > > > > > > > > > > > (somehow) to the memory.
> > > > > > > > > > > > BTW, this is all going on at the netcdf-c library level.
> > > > > > > > > > > > Our pure Java HDF5 reader is read-only, hence cannot
> > > > > > > > > > > > create or modify files.
> > > > > > > > > > > >
> > > > > > > > > > > > I have created an issue for this
> > > > > > > > > > > > (https://github.com/Unidata/netcdf-c/issues/708)
> > > > > > > > > > > > but it is not likely to get implemented anytime soon.
> > > > > > > > > > > >
> > > > > > > > > > > > You will have to be content with writing the contents
> > > > > > > > > > > > to a file.
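[Editor's note: the core-driver behavior described above can be sketched directly in the HDF5 C API; the backing_store flag is what controls whether close() writes the buffer to a real file. Illustrative only; error handling elided.]

```c
#include <hdf5.h>

/* Create an HDF5 file that lives entirely in memory. With
   backing_store = 0 the buffer is simply discarded on H5Fclose();
   with 1, close() writes it to `name` -- there is no call here to
   hand the raw buffer back to the application. */
hid_t create_diskless(const char *name) {
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_core(fapl, 1 << 20 /* 1 MiB growth increment */, 0);
    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;
}
```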
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I am using NetCDF Java library 4.6.10.
> > > > > > > > > > > > >
> > > > > > > > > > > > > My goal is to efficiently send NetcdfFile
> > > > > > > > > > > > > <https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/javadoc/ucar/nc2/NetcdfFile.html>
> > > > > > > > > > > > > objects over the network using a communication library
> > > > > > > > > > > > > (such as ZeroMQ). Because the NetcdfFile class does
> > > > > > > > > > > > > not implement Serializable, I need to implement that
> > > > > > > > > > > > > step myself.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Since NetCDF is a machine-independent data format, I
> > > > > > > > > > > > > would like to access the raw binary data.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is trivial if the data has been written to a file
> > > > > > > > > > > > > on the disk. But what about in-memory datasets ? If
> > > > > > > > > > > > > possible, I would like to access the binary data
> > > > > > > > > > > > > without writing it to disk...
> > > > > > > > > > > > >
> > > > > > > > > > > > > Can I access the buffer of an in-memory NetcdfFile
> > > > > > > > > > > > > object from the Java API ? Any pointers will be
> > > > > > > > > > > > > appreciated.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Kind regards,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Michaël
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > =Dennis Heimbigner
> > > > > > > > > > > >   Unidata
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Ticket Details
> > > > > > > > > > > > ===================
> > > > > > > > > > > > Ticket ID: MQO-415619
> > > > > > > > > > > > Department: Support netCDF Java
> > > > > > > > > > > > Priority: Normal
> > > > > > > > > > > > Status: Open
> > > > > > > > > > > > ===================
> > > > > > > > > > > > NOTE: All email exchanges with Unidata User Support are
> > > > > recorded
> > > > > > > > in the
> > > > > > > > > > > > Unidata inquiry tracking system and then made publicly
> > > > > available
> > > > > > > > > > through
> > > > > > > > > > > > the web.  If you do not want to have your interactions
> > made
> > > > > > > > available
> > > > > > > > > > in
> > > > > > > > > > > > this way, you must let us know in each email you send
> > to us .
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > =Dennis Heimbigner
> > > > > > > > > >   Unidata
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > > =Dennis Heimbigner
> > > > > > > >   Unidata
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > =Dennis Heimbigner
> > > > > > Unidata
> > > > > >
> > > > >
> > > > > =Dennis Heimbigner
> > > > >   Unidata
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > > =Dennis Heimbigner
> > > Unidata
> > >
> >
> > =Dennis Heimbigner
> >   Unidata
> >
> >
> >
> >
> >
> 
> 

=Dennis Heimbigner
  Unidata


Ticket Details
===================
Ticket ID: MQO-415619
Department: Support netCDF Java
Priority: Normal
Status: Closed
===================
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata 
inquiry tracking system and then made publicly available through the web.  If 
you do not want to have your interactions made available in this way, you must 
let us know in each email you send to us.