
coffea interface with pyhf - statistical inference #104

Open
kratsg opened this issue Jun 9, 2019 · 5 comments
Labels
enhancement New feature or request

Comments

kratsg (Contributor) commented Jun 9, 2019

/cc @matthewfeickert @lukasheinrich

Is your feature request related to a problem? Please describe.

Feature request.

Describe the solution you'd like

An interface/way to export pyhf JSON workspaces in order to perform binned statistical fits (this should work for anything that lets you make histograms). The pyhf workspaces follow the schema (v1.0.0) defined here: https://diana-hep.org/pyhf/schemas/1.0.0/workspace.json. If you are reading this issue in the future, the schema may be at a later version.
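
For concreteness, a minimal workspace that validates against that schema can be built and loaded like this (the bin yields are made up; this is just a sketch of the target format):

```python
import pyhf

# A minimal single-channel workspace spec following the pyhf workspace schema.
# All yields below are made-up numbers, purely for illustration.
spec = {
    "channels": [
        {
            "name": "singlechannel",
            "samples": [
                {
                    "name": "signal",
                    "data": [5.0, 10.0],
                    "modifiers": [{"name": "mu", "type": "normfactor", "data": None}],
                },
                {
                    "name": "background",
                    "data": [50.0, 60.0],
                    "modifiers": [
                        {"name": "uncorr_bkguncrt", "type": "shapesys", "data": [5.0, 12.0]}
                    ],
                },
            ],
        }
    ],
    "observations": [{"name": "singlechannel", "data": [53.0, 65.0]}],
    "measurements": [{"name": "Measurement", "config": {"poi": "mu", "parameters": []}}],
    "version": "1.0.0",
}

workspace = pyhf.Workspace(spec)  # validates the spec against the workspace schema
model = workspace.model()         # HistFactory model built from the spec
data = workspace.data(model)      # observations plus auxiliary data for the fit
```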

Describe alternatives you've considered

None.

Additional context

A similar effort is ongoing to get pyhf and zfit working together, since pyhf is binned while zfit is unbinned: zfit/zfit#120.

kratsg (Contributor, Author) commented Jun 9, 2019

There are a couple of ways one could approach this:

  • depend on pyhf for the exporting, and provide a coffea.stats.export('pyhf') that returns a pyhf.Model or pyhf.Workspace instance (a sketch of this option is below). This is closer to what numpy+scipy tend to do.
  • rely on the pyhf JSON schema specification, dump JSON (or a Python dictionary) that validates against it, and leave it up to the user to run pyhf themselves.
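
A rough sketch of what the first option might look like (everything here is hypothetical - coffea.stats and export_pyhf do not exist; only the pyhf.Workspace call is real):

```python
import pyhf

# Hypothetical sketch of the first option: a coffea-side exporter that turns
# per-sample bin yields into a pyhf object. Neither coffea.stats nor
# export_pyhf exists today; the names are placeholders for illustration.
def export_pyhf(channel_name, sample_yields, observations, poi="mu"):
    samples = []
    for name, yields in sample_yields.items():
        modifiers = []
        if name == "signal":
            # attach the parameter of interest to the signal sample
            modifiers.append({"name": poi, "type": "normfactor", "data": None})
        samples.append({"name": name, "data": list(yields), "modifiers": modifiers})
    spec = {
        "channels": [{"name": channel_name, "samples": samples}],
        "observations": [{"name": channel_name, "data": list(observations)}],
        "measurements": [{"name": "meas", "config": {"poi": poi, "parameters": []}}],
        "version": "1.0.0",
    }
    # the second option would simply return `spec` (or json.dumps(spec)) instead
    return pyhf.Workspace(spec)

ws = export_pyhf(
    "signal_region",
    {"signal": [5.0, 10.0], "background": [50.0, 60.0]},
    observations=[53.0, 65.0],
)
```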

@lgray lgray changed the title [enhancement] coffea interface with pyhf - statistical inference [ENHANCEMENT] coffea interface with pyhf - statistical inference Jun 9, 2019
@lgray lgray changed the title [ENHANCEMENT] coffea interface with pyhf - statistical inference [enhancement] coffea interface with pyhf - statistical inference Jun 9, 2019
@lgray lgray added the enhancement New feature or request label Jun 10, 2019
@lgray lgray changed the title [enhancement] coffea interface with pyhf - statistical inference coffea interface with pyhf - statistical inference Jun 10, 2019
lgray (Collaborator) commented Jul 9, 2019

@guitargeek So the discussion should sorta start here.
@kratsg - after chewing on this for a bit, I am thinking of developing something that mimics the LLVM compiler infrastructure model.

I am thinking along this direction because we are not going to convince anyone to use a specific stats tool, but we can convince people to write things in a way that lets them use any stats tool - especially if that way of describing a model is expressive and easy to turn into any stats tool's inputs.

  1. Frontends based on some given flavor of histograms and parametric functions, where you can write the model down in a clean way.
  2. The frontend description is processed into an intermediate representation that describes the model to be fit in a way that is agnostic of the input histogram types, probably encoded in a fairly declarative way (see the sketch below). I would not expect us to optimize anything here, so the intermediate representation does not need to be as flexible as what you find in gcc or llvm, of course.
  3. Various backends then process the intermediate representation into a target statistical tool's expected inputs, producing all necessary files and descriptors.

I'm assuming this is all in Python so we have easy access to how functions are composed - so that we can, for instance, write something in PyROOT RooFit, take it apart, and reassemble it for PyHF, zfitter, combine, sklearn, or whatever we decide to make backends for.
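
To make the idea concrete, here is a minimal sketch of what such an intermediate representation and one backend could look like (all names are invented for illustration, not a proposal for the actual API):

```python
# Hypothetical sketch of a declarative intermediate representation for a fit
# model, independent of any histogram library or stats backend.
# All class and field names here are made up.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sample:
    name: str
    yields: List[float]                                   # bin contents, however they were booked
    modifiers: List[dict] = field(default_factory=list)   # e.g. {"name": ..., "type": "normsys", ...}

@dataclass
class Channel:
    name: str
    samples: List[Sample]
    observations: List[float]

@dataclass
class FitModel:
    channels: List[Channel]
    poi: str

def to_pyhf(model: FitModel) -> dict:
    """One possible backend: lower the IR into a pyhf workspace dictionary."""
    return {
        "channels": [
            {
                "name": c.name,
                "samples": [
                    {"name": s.name, "data": s.yields, "modifiers": s.modifiers}
                    for s in c.samples
                ],
            }
            for c in model.channels
        ],
        "observations": [{"name": c.name, "data": c.observations} for c in model.channels],
        "measurements": [{"name": "meas", "config": {"poi": model.poi, "parameters": []}}],
        "version": "1.0.0",
    }
```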

While this may seem a little silly at first (why not just write your fit in a given stats tool, after all?), I think we can arrive at a highly portable description of a fit: when something new and cool gets made for fitting, we/someone just supply a new backend and people can happily fit away.

What do you think?

lgray (Collaborator) commented Jul 9, 2019

@jpivarski if you have anything to add here it'd be super useful for discussion as well!

jpivarski (Member) commented

We talked about this yesterday and I thought the "uniform interface to fitters" idea was a good one, particularly if you're targeting large projects like TensorFlow. If you're thinking specifically of HistFitter-style fits, then I'm beginning to think it would be better to defer to pyhf JSON, because that JSON format was intended to be implementation-independent, after all. Is the scope of what you're considering broader than the scope of what pyhf is already standardizing?

This could morph into a project of linking histogram-booking with pyhf models...

alexander-held (Member) commented

I came across this old issue and wanted to add a bit of information based on progress in the last two years.

project of linking histogram-booking with pyhf models

The cabinetry library does something like this. Users specify the relevant information needed to build a HistFactory model (the type of model that pyhf supports), and cabinetry turns that information into instructions for creating all the required template histograms. A prototype interface exists that allows carrying out those instructions with coffea. After all histograms are produced, cabinetry assembles them into a workspace following the pyhf JSON format.
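
For reference, that workflow looks roughly like the following (assuming a recent cabinetry release; the module layout has changed between versions, so treat these calls as indicative rather than authoritative):

```python
import cabinetry

config = cabinetry.configuration.load("config.yml")   # channels, samples, systematics
cabinetry.templates.build(config)                      # produce the template histograms
ws = cabinetry.workspace.build(config)                 # assemble the pyhf JSON workspace
model, data = cabinetry.model_utils.model_and_data(ws)
fit_results = cabinetry.fit.fit(model, data)
```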

provide a coffea.stats.export('pyhf') which returns a pyhf.Model

The challenge with that approach is that the information about how HistFactory channels, samples, and systematics interact with each other is not known to coffea, and is not necessarily required for standard usage. This brings up an interesting point, closely related to the discussion in #469: when processing systematics, a coffea processor needs to know some related information - which types of detector systematics should be applied (typically affecting most/all samples in the same way), and which modeling systematics (often sample-dependent) need to be evaluated? To build the statistical model, this information needs to be provided. It can be hardcoded in the processor or provided externally (that is the route cabinetry is taking); a sketch of the external option is below.
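
As an illustration of the external route, the information a processor needs could be provided in a small configuration like the following (the structure and names are hypothetical, not cabinetry's actual format):

```python
# Hypothetical external specification of which systematics a processor should
# evaluate; the structure and keys are made up for illustration.
systematics_config = {
    "detector": {  # typically applied to most/all samples in the same way
        "jet_energy_scale": {"type": "histosys", "samples": "all"},
    },
    "modeling": {  # often sample-dependent
        "ttbar_generator": {"type": "normsys", "samples": ["ttbar"]},
    },
}

def variations_for(sample_name):
    """Return the systematic variations to evaluate for a given sample."""
    relevant = []
    for group in systematics_config.values():
        for name, spec in group.items():
            if spec["samples"] == "all" or sample_name in spec["samples"]:
                relevant.append(name)
    return relevant

# e.g. variations_for("ttbar") -> ["jet_energy_scale", "ttbar_generator"]
```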
