Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High-level helpers for working with unix domain sockets #279

Open
njsmith opened this issue Aug 11, 2017 · 20 comments · May be fixed by #3187
Open

High-level helpers for working with unix domain sockets #279

njsmith opened this issue Aug 11, 2017 · 20 comments · May be fixed by #3187

Comments

@njsmith
Copy link
Member

njsmith commented Aug 11, 2017

Now that the high level networking API is coming together, we should also add unix domain helpers. Should be pretty straightforward.

Some notes here: #73 (comment)

I guess something like:

async def open_unix_stream(path):
    ...

async def open_unix_listeners(path, *, mode=0o666, backlog=None):
    ...

mode=0o666 matches what twisted does; tornado does 0o600. Should research which is better as a default.

The biggest issue is to figure out what to do about unlink-on-close. It's nice to keep thing tidy, but it introduces a race condition if you're trying to do no-downtime upgrades...

@njsmith
Copy link
Member Author

njsmith commented Aug 14, 2017

If we do do unlink on close, then it's important to note that getsockname is not reliable on AF_UNIX sockets. In particular, if you bind a socket and then rename the inode, getsockname will return the name you originally bound to, not the name the socket has now.

So one option would be SocketListener(..., unlink_on_close=None) / SocketListener(..., unlink_on_close="/path/to/unlink"). Another option would be to add a sockaddr=... option, which I guess we'll need anyway for #280, and then unlink_on_close=True would trust this instead of calling getsockname.

@njsmith
Copy link
Member Author

njsmith commented Sep 17, 2017

I think the locking algorithm in #73 (comment) will work, but it's super annoying. So I'm leaning towards: start by implementing a version where we never clean up stale socket files, and then if/when someone complains, implement the full locking version, so we can keep the same API. [Edit: or maybe not... asyncio doesn't clean up stale socket files, and uvloop originally did. From this we learned that starting to clean up socket files when you didn't before is a backcompat-breaking change. So we should probably pick one option and stick to it.]

Other libraries, in their "try to blow away an existing socket file first" code, check to see if it's a socket and only unlink it if so. We aren't going to unlink the file, but we might still want to do a just-in-case check where if the target file exists and is not a socket, then error out instead of renaming over it.

@Fuyukai
Copy link
Member

Fuyukai commented Dec 16, 2017

Hi, has this had any work at all? If not, I'd like to take an attempt at making it. I'm working on converting one of my libraries to support curio and trio, and this is a missing part for a bit of it for the trio side.

@njsmith
Copy link
Member Author

njsmith commented Dec 17, 2017 via email

@Fuyukai
Copy link
Member

Fuyukai commented Dec 17, 2017

I was looking at implementing both, but if you're waiting on design decisions I only need the client part so I can just restrict it to that for now.

njsmith added a commit that referenced this issue Jan 19, 2018
Unix client socket support. See: #279
@njsmith
Copy link
Member Author

njsmith commented Mar 18, 2018

[Just discovered this half-written response that's apparently been lurking in a tab for like 3 months... whoops. I guess I'll post anyway, since it does have some useful summary of the issues, even if I didn't get around to making any conclusions...]

@SunDwarf Well, let me brain dump a bit about the issues, and maybe it'll lead to some conclusion, and maybe you'll find it interesting in any case :-)

The tricky parts are in starting and stopping a unix domain listening socket, and there are two issues that end up interacting in a complicated way.

First issue: the ideal way to set up a unix domain listening socket is to (a) bind to some temporary name, (b) call listen, and then (c) rename onto the correct name. Doing things this way has a really neat property: it lets us atomically replace an old listening socket with a new one, so that every connection attempt is guaranteed to go to either the old socket or the new socket. This can be used for zero-downtime server upgrades. In fact, this is the only way to do zero-downtime server upgrades without complicated extra machinery for coordinating the hand-off between servers – e.g., there is no way for one TCP listening socket to atomically replace another – so this is one of those things where if you need it, then you really need it. And there really aren't any downsides to doing things this way, so we should always do the bind/rename dance when setting up a listening socket.

Second issue: tearing down a socket. Unix domain sockets have really annoying semantics here: when you close a listening socket, it leaves behind a useless stray file in the filesystem. You can't re-open it or anything; it's just a random piece of litter. One consequence is that when creating a listening socket, you might need to clean up the old socket; the consensus seems to be that you should just go ahead and silently do this. The rename trick above automatically takes care of this. So leaving these behind doesn't really cause any problems; it's just ugly. To avoid that ugliness, many libraries try to delete (unlink) the socket file when shutting down.

Now here's the problem: if we want to support zero-downtime upgrades, we might be replacing the old server's socket with the new server's socket. If this happens just as the old server is shutting down, and the old server has code to clean up the socket file when it shuts down, then if the timing works out just right it might accidentally end up deleting the new socket file. This puts us in an difficult position: zero-downtime upgrades are a feature not many people need, but that's irreplaceable when you need it; OTOH not cleaning up old sockets is a low-grade annoyance for everyone, including lots of people who don't care about zero-downtime upgrades. And for bonus fun, we don't still don't have a full solution for zero-downtime upgrades currently, because we don't have a good way to do graceful shutdown of a server (meaning: drain the listening socket and then close it). See #14, #147.

We have a few options:

  • Don't try to support zero-downtime upgrades. This is what every other library I know of does, but it seems like the wrong approach to me; zero-downtime upgrades are a really powerful feature, and not that hard to support?

  • Don't clean up old sockets, and do document that we support zero-downtime upgrades: this is the next easiest option. It's what HAproxy does, and they're pretty smart.

  • Clean up old sockets, and document that we support zero-downtime upgrades, EXCEPT that if you kill your old server just before starting the new server then there's this race condition, so you shouldn't do that. Maybe make a note that if this causes a problem for you, then we'd love some patches to fix it (see below).

  • Add some flag to control whether we clean up old sockets, and document that we support zero-downtime upgrades IF you toggle this flag to False. I'm not a fan of this kind of flag – most people don't really know (and shouldn't have to know) which way to set it, and it's impossible to get rid of this kind of feature once you add it.

  • Clean up old sockets, and document that we support zero-downtime upgrades, while using some Extremely Clever Locking to make it all work out. I believe this is possible (details in the middle of this comment), and so I guess this is the ideal solution if we have infinite resources. Implementing and testing this is significantly more work than any of the other options here, so maybe it would be better to start with something else and plan to move to this eventually after Trio hits it big. Or maybe we could never convince ourselves that this is really 100% reliable, and it'd be better to KISS.

  • Did I miss any?

Bonus: it's not trivial to switch between these options later. For example, here's a bug filed on uvloop where code was breaking because asyncio doesn't clean up unix domain listening sockets, so someone wrote code that assumes this and tries to clean up manually, and uvloop did clean them up, so the manual cleanup code crashed. There are some things we could do to mitigate this, like put a giant warning in the docs saying that you shouldn't make assumptions about whether the stale socket gets left behind. But who knows if people would listen.

Quite a mess, huh?

@njsmith
Copy link
Member Author

njsmith commented Dec 18, 2018

Now that the latest versions of Windows support AF_UNIX, I guess we'll have to take that into account as well.

According to some random SO answer, there is a way to get atomic rename on recent Windows (AFAICT "Windows 10 1607" is well before the AF_UNIX work): https://stackoverflow.com/a/51737582

[Edit: @pquentin found the docs! https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/ntifs/ns-ntifs-_file_rename_information ...From this, it's actually not clear whether it is atomic?]

Unfortunately I won't believe we got the Windows details right without testing them (e.g. how does Windows handle a renamed AF_UNIX socket?), and I don't know Appveyor has a new enough version of Windows to let us test them yet.

@Tronic
Copy link
Contributor

Tronic commented Aug 27, 2019

Is there any way to find out if the socket being closed is the final copy of that socket, or alternatively to keep track of copies created by fork etc? The socket should not be unlinked as long as there are other copies still listening.

I'm using socket mtime to track if it has been replaced by another instance. There are a few race conditions (identical timestamps for sockets created at the same time, or someone getting in between checking of mtime and unlinking because I don't bother to use locks).

@smurfix
Copy link
Contributor

smurfix commented Aug 27, 2019

That's easy. Try to connect to it after closing your listener.

@Tronic
Copy link
Contributor

Tronic commented Aug 27, 2019

Closing still needs work and I have to work on something else. Here's the work so far:

class UnixSocketListener(trio.SocketListener):
    def __init__(self, sock, path, inode):
        self.path, self.inode = path, inode
        super().__init__(trio.socket.from_stdlib_socket(sock))

    @staticmethod
    def _create(path, mode, backlog):
        if os.path.exists(path) and not stat.S_ISSOCK(os.stat(path).st_mode):
            raise FileExistsError(f"Existing file is not a socket: {path}")
        sock = socket.socket(socket.AF_UNIX)
        try:
            # Using umask prevents others tampering with the socket during creation.
            # Unfortunately it also might affect other threads and signal handlers.
            tmp_path = f"{path}.{uuid4().hex[:8]}"
            old_mask = os.umask(0o777)
            try:
                sock.bind(tmp_path)
            finally:
                os.umask(old_mask)
            try:
                inode = os.stat(tmp_path).st_ino
                os.chmod(tmp_path, mode)  # os.fchmod doesn't work on sockets on MacOS
                sock.listen(backlog)
                os.rename(tmp_path, path)
            except:
                os.unlink(tmp_path)
                raise
        except:
            sock.close()
            raise
        return UnixSocketListener(sock, path, inode)

    @staticmethod
    async def create(path, *, mode=0o666, backlog=None):
        return await trio.to_thread.run_sync(
            UnixSocketListener._create, path, mode, backlog or 0xFFFF
        )

    def _close(self):
        try:
            # Test connection
            s = socket.socket(socket.AF_UNIX)
            try:
                s.connect(self.path)
            except ConnectionRefusedError:
                if self.inode == os.stat(self.path).st_ino:
                    os.unlink(self.path)
            finally:
                s.close()
        except Exception:
            pass

    async def aclose(self):
        with trio.fail_after(1) as cleanup:
            cleanup.shield = True
            await super().aclose()
            self._close()

async def open_unix_listeners(path, *, mode=0o666, backlog=None):
    return [await UnixSocketListener.create(path, mode=mode, backlog=backlog)]

The test connection trick doesn't work perfectly (closing races, I think), and in this version the cleanup is blocking. For some reason, trio.to_thread.run_sync fails with RuntimeError('must be called from async context')

@Tronic
Copy link
Contributor

Tronic commented Aug 28, 2019

Minor irritation with atomic rename approach: the original random socket name also appears in peername seen by clients.

@njsmith
Copy link
Member Author

njsmith commented Aug 31, 2019

Is there any way to find out if the socket being closed is the final copy of that socket, or alternatively to keep track of copies created by fork etc? The socket should not be unlinked as long as there are other copies still listening.

Maybe it's simpler to just offer users a way to disable auto-deletion? (Assuming we support auto-deletion at all...)

Minor irritation with atomic rename approach: the original random socket name also appears in peername seen by clients.

Oh, huh, good point. And we'll also see it if we call getsockname on our own socket.

We should add some API to Stream for accessing the local/remote addresses (#280). And on macOS getpeername is unreliable in general – so SocketListener needs to take the peername returned by accept and stash that in the SocketStream, instead of relying on getpeername like we do now. We can use a similar trick to make this API return the right addresses for Unix-domain sockets (i.e., the path that we know we connected or bound, not the path that the kernel returns from getsockname/getpeername).

Of course, that doesn't help for other non-Trio programs connecting to Trio-created Unix-domain sockets, but maybe that's not a big deal.

       # Using umask prevents others tampering with the socket during creation.
       # Unfortunately it also might affect other threads and signal handlers.

Yeah, I don't think we can afford to use umask... too disruptive. However, as long as we call chmod before we call listen, I think we're fine, because the only thing chmod is controlling is which processes are allowed to connect, and until we call listen, no-one is allowed to connect.

We do need to think carefully about race conditions here though... imagine this code gets run by root, to bind a socket in a globally writable directory like /tmp. Can someone mess us up in a dangerous way?

  • First we call bind. A classic attack is for an attacker to create a symlink at /tmp/blah pointing somewhere else, so that when later on someone with more privileges tries to bind a socket to /tmp/blah, they end up creating the socket at an attacker controlled location. This was the source of a sshd security bug back in 1999. Apparently Linux fixed this ages ago (by making bind refuse to follow symlinks – search this thread for "Solar Designer" and "Alan Cox", and I also verified on my laptop that trying to bind to a dangling symlink gives EADDRINUSE). I also tried on macOS... and it totally worked and created the file at the target of the symlink, so I guess macOS is still vulnerable to this issue. That's a bit disturbing, and I guess means you shouldn't create Unix sockets in /tmp on macOS, but there isn't really any way we can work around it (we can't avoid calling bind!), so I guess there's nothing for us to do.

  • Then we call chmod. fchmod would be easier to analyze because it's inherently immune to race conditions, but you say that fchmod doesn't work on macOS so... I guess we're using chmod. (Does fchmod work on Linux?) chmod looks up the file again using its path, so in between bind and chmod another process could in principle move our socket and put something else there instead (e.g. a symlink to an arbitrary file whose permissions we'll end up changing). However, /tmp at least has a protection against this: the sticky bit makes it so that other users can't rename our files. So for /tmp, only another process running as the same user could perform this switcheroo. In classic Unix, that means it's totally not something we need to worry about, because all processes with the same user-id have the same permissions, so any process that's able to mess up our chmod could just easily do their own chmod. I'm not quite 100% sure that's still true today, with our sandboxes and seccomp and namespaces and selinux and so on, but I doubt there are any sandboxes that try to block chmod while allowing files/symlinks to be created and renamed.

    Oh, and we can also use os.chmod(..., follow_symlinks=False) to be extra-sure.

  • And finally there's the rename. Again in principle someone could replace the socket before we call rename, but a similar argument applies as with chmod: if they have the ability to replace our temporary socket, then they also have the ability to replace our final socket directly, so there's no need to jump through hoops by replacing the temporary socket and then let us do the rename for them.

@Tronic
Copy link
Contributor

Tronic commented Sep 2, 2019

Maybe it's simpler to just offer users a way to disable auto-deletion? (Assuming we support auto-deletion at all...)

It is certainly difficult to do right. The test connection might also have side effects on the accepting end (e.g. application logging messages about failed connection handshakes). And if enabled by default without sufficient checks, multi-processing applications would get very nasty bugs (socket unlinked whenever the first worker terminates / is restarted).

Fortunately there is a simple and fairly safe solution. Never automatically unlink, but provide a utility function for unlinking. A multi-processing application could choose when to call this function (e.g. when it initiates shutdown, even before the workers actually terminate), and the socket would still internally check that the inode number matches.

This leaves a race condition between stat and unlink, between which another instance might replace the socket, which I can imagine being an issue with restarts like server-app& kill $oldpidand kill $oldpid; server-app&, which are broken in any case. One should wait for the new instance to signal that it has started (usually by pidfile being created) before killing the old one:

  1. new instance replaces socket and then creates pidfile
  2. system daemon sees the new pidfile and kills the old instance
  3. old instance graceful shutdown: socket inode mismatch, no unlinking

Minor irritation with atomic rename approach: the original random socket name also appears in peername seen by clients.

Oh, huh, good point. And we'll also see it if we call getsockname on our own socket.

Considering unlink + bind without rename, isn't the downtime there extremely minimal? Any connections in old server's backlog stay there and a new socket is bound and listened within microseconds. Hardly any connections should appear within such window, and network clients have to have retrying logic in any case.

In particular, unlink/bind could work as a fallback on Windows if true atomic replacement is not possible.

@njsmith
Copy link
Member Author

njsmith commented Sep 2, 2019

And if enabled by default without sufficient checks, multi-processing applications would get very nasty bugs (socket unlinked whenever the first worker terminates / is restarted).

I don't think I'm as worried about this as you are. Sharing a listening socket between processes isn't something you do by accident – it requires some careful work to get everything shared and set up properly. Even after we add an open_unix_listener helper, only the parent process will be able to use it – all the workers will need to convert a raw fd into a trio socket, manually wrap it in a SocketListener, etc. So we can assume everyone doing this is pretty familiar with these APIs, has read the docs, and is prepared to write a bit of code to say explicitly what they want. So if they also have to say autodelete=False, eh, it's not a big burden.

When choosing the defaults, we want to put more weight on doing something sensible that works for less sophisticated users, who don't know enough to pick their own settings.

Considering unlink + bind without rename, isn't the downtime there extremely minimal? Any connections in old server's backlog stay there and a new socket is bound and listened within microseconds. Hardly any connections should appear within such window, and network clients have to have retrying logic in any case.

Sure, but "definitely no connections are lost" is still better than "probably not many connections are lost" :-). The rename part is cheap and easy, and the only downside is the weird getsockname/getpeername issue. But that's extremely minor. I can think of any situation where you would use these to check the identity of a server socket – they're generally only used to identify clients. And even that doesn't make sense for Unix sockets, since clients are usually anonymous. So I'm not worried about using rename to create the socket. All the questions are about cleanup.

@Tronic
Copy link
Contributor

Tronic commented Sep 2, 2019

I don't think I'm as worried about this as you are. Sharing a listening socket between processes isn't something you do by accident

Actually it is quite simple to do with the high level API (probably not using Trio as planned, again):

import os, posix, trio

listeners = trio.run(trio.open_tcp_listeners, 5000)
for l in listeners: l.socket.set_inheritable(True)

posix.fork()
posix.fork()

trio.run(
    trio.serve_listeners,
    lambda s: s.send_all(b"Hello, I'm %d\n" % os.getpid()),
    listeners
)

Also, I suppose that sharing a listener object actually is something quite easily done by accident, even if the socket is not inheritable, by some completely unrelated fork() -- but that would have bigger issues: trying to close a fd that doesn't belong to the process.

@njsmith
Copy link
Member Author

njsmith commented Sep 2, 2019

(os and posix are the same module, no need to import both. Also your set_inheritable calls there aren't doing anything – "inheritable" controls what happens across exec. File descriptors are always inherited across fork.)

Sharing trio objects across multiple calls to trio.run isn't formally supported... I guess what you wrote will work right now in practice, but no guarantees. It doesn't look very accidental, either :-)

Also, I suppose that sharing a listener object actually is something quite easily done by accident, even if the socket is not inheritable, by some completely unrelated fork() -- but that would have bigger issues: trying to close a fd that doesn't belong to the process.

Much bigger issues. If you try to fork inside trio.run you'll end up with two processes trying to share the same wakeup fd, epoll fd, etc. It doesn't work at all. (Of course fork+exec is fine.)

@njsmith njsmith closed this as completed Sep 2, 2019
@njsmith njsmith reopened this Sep 2, 2019
@njsmith
Copy link
Member Author

njsmith commented Sep 2, 2019

[sorry, I fat-fingered that and posted a partial comment. I've reopened and edited my comment above to fill in the missing bits.]

@Tronic
Copy link
Contributor

Tronic commented Sep 5, 2019

(os and posix are the same module, no need to import both. Also your set_inheritable calls there aren't doing anything – "inheritable" controls what happens across exec. File descriptors are always inherited across fork.)

Dang! Looks like I accidentally mixed things from Nim language (where it is posix.fork()) and from Sanic's multiprocessing based approach, having recently used both to implement pre-forking servers.

Sharing trio objects across multiple calls to trio.run isn't formally supported... I guess what you wrote will work right now in practice, but no guarantees. It doesn't look very accidental, either :-)

I have to admit that I tried forking in async context first. As you can imagine, it didn't end well. Maybe should have used exec or trio processes, but I haven't found a convenient way to call Python code of the currently running program (determining how the program was started and trying to run a new interpreter etc. gets complicated).

Sanic is using a similar approach with asyncio -- calling loop.run_until_complete in small pieces, before and after spawning worker processes. I hope that this remains supported in Trio.

Btw, I have to admire the eye you have for detail. I've read your blogs about KeyboardInterrupt/signals and Cancellation, and I see the same attitude here with getting the UNIX sockets precisely right. Actually I originally found Trio when looking for information about cancellation in asyncio and was astonished by how the entire topic was apparently being ignored by everyone else. Needless to say, I was instantly convinced to switch and seeing bits of that kind of quality everywhere I look in Trio is certainly satisfying (the "little" details like making sure that SSL server_hostname actually gets verified, or making sockets NODELAY by default -- details unnoticed by most but they make a big difference as a whole). Keep up the good work! :)

@njsmith
Copy link
Member Author

njsmith commented Sep 12, 2019

I have to admit that I tried forking in async context first. As you can imagine, it didn't end well. Maybe should have used exec or trio processes, but I haven't found a convenient way to call Python code of the currently running program (determining how the program was started and trying to run a new interpreter etc. gets complicated).

Unfortunately I don't think there's another reliable way option... you can fork before starting any async code (and possibly pass the sockets over to the child processes later), but even then you'll probably find yourself having to support spawning new processes to support Windows.

I'm pretty sure that if you don't change directories, don't change the environment, and run sys.executable, then that will reliably spawn a new python interpreter in the same python environment, for all cases except for when Python is embedded as a library inside some larger program (e.g. blender or libreoffice). Of course in those cases you can't really expect fork to work either... multiprocessing has the same problem, so it provides multiprocessing.set_executable, and folks who embed python but also want multiprocessing to work are supposed to call that during startup to give the location of a working python interpreter. You can fetch the value it will use by doing multiprocessing.spawn.get_executable(). (By default it's just the same as sys.executable.)

Sanic is using a similar approach with asyncio -- calling loop.run_until_complete in small pieces, before and after spawning worker processes. I hope that this remains supported in Trio.

Unfortunately, it's not supported in Trio. It creates huge complications in asyncio because when run_until_complete returns, it effectively freezes all the other running tasks in-place, so deadlines don't work, you have to figure out what you want to do with those tasks if the user never re-starts the loop, etc. And in Trio it's not even very useful, since the nursery system means you can't have tasks that outlive a call to trio.run anyway.

Btw, I have to admire the eye you have for detail.

Ah, thank you! ☺️

@Tronic
Copy link
Contributor

Tronic commented Sep 14, 2019

(a quicker response that I'd like, and quite off topic too)

  • I've abandoned fork and using the spawn mode of multiprocessing (because it is anyway used on Windows and starting at Python 3.8 on MacOS; supporting different semantics for slightly improved resource use on Linux is not worth the trouble).
  • The way that multiprocessing runs the new process seems quite broken (long story short: running the main module cannot really be avoided, in particular with frozen executables).

This makes me think that the sane solution is to accept that main is getting run (including the if __name__ == "__main__": block).

Fortunately I believe I can restructure the problem so that the first process starts normally but ends up opening the listening sockets, and then re-starting the whole program as child processes that also end up in the same place -- but instead of opening sockets, receive them pickled (and on Windows duplicated as is required) from the original one. This has the benefit that all state other than the socket is initialized in normal way in each child process, and won't need to be pickled at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants