Indexing algorithm #161
Quoting jbd (2016-12-27 22:38:12)
The duc indexing is already quite fast but you might be interested in the
filesystem crawling algorithm of robinhood
([1]https://github.com/cea-hpc/robinhood) also written in C, if you don't
already know the project:
Thanks for that. I didn't know about Robinhood, but it seems that they
have taken the idea a few steps further already; it's like Duc on mass
steroids with a hundred additional features. These people are way ahead
of me.
Robinhood seems to have a different audience than Duc though; it is not
a simple tool you can set up in a few minutes, but it requires some
experience and knowledge to configure and run. I guess they pick up
where Duc hits its limits :)
To go beyond the performance of classical scanning tools, robinhood
implements a multi-threaded version of depth-first traversal[4]. To
parallelize the scan, the namespace traversal is split into individual
tasks that consist in reading single directories. A pool of worker
threads performs these tasks following a depth-first strategy (as
illustrated on figure 3).
Parallelization has been on my wish list for a long time (second to
incremental indexing), I'm sure a lot of users will benefit greatly from
this. On a single spindle and a modern Linux kernel multiple threads
will not add too much performance, but as soon as multiple physical
disks are involved doing things in parallel will speed up things a lot,
especially on Lustre-scale file systems.
I made a proof of concept of a parallel indexing algorithm for Duc a
few years ago. The implementation was close to your description above,
but I was not too happy with the increased complexity of the resulting
code, so it never made it into Duc in the end. I will definitely take a
look at the internals of robinhood to see if I can steal any good ideas!
If the storage could handle parallel requests nicely, that's a huge
win.
I guess that even with no concurrency on the database the speedup will
still be significant, since the overhead of reading metadata from a file
system is so much higher than writing out the results to the database.
Right now I don't have any metrics to compare the crawling performance
between robinhood and duc, but I'm using robinhood on a petabyte filer
through nfs3 with nearly 1 billion files on it, and the crawling
algorithm performance scales quite linearly up to 8 threads. After
that, I can't tell if the filer or the mysql database that holds the
results is the bottleneck at this moment.
I would be very interested in some numbers that compare Duc to Robinhood
though on a file system this size; if you ever get some numbers on this,
please let me know!
|
>>>> "Ico" == Ico Doornekamp ***@***.***> writes:
Ico> Quoting jbd (2016-12-27 22:38:12)
> The duc indexing is already quite fast but you might be interested in the
> filesystem crawling algorithm of robinhood
> ([1]https://github.com/cea-hpc/robinhood) also written in C, if you don't
> already know the project:
Ico> Thanks for that. I didn't know about Robinhood, but it seems that they
Ico> have taken the idea a few steps further already, it's like Duc on mass
Ico> steroids with a hundred additional features; these people are way ahead
Ico> of me.
I've looked at this too, and I think they have a crucial flaw, which
is using MySQL as the backend instead of some dedicated and optimized
DB like Duc is using, or even what Philesight used; duc was a big
performance and size improvement over that.
As for the parallel code, it will help up until the backend (NFS on
NetApp in my case mostly) hits its limits, I fear.
Ico> Robinhood seems to have a different audience than Duc though; it is not
Ico> a simple tool you can set up in a few minutes, but it requires some
Ico> experience and knowledge to configure and run. I guess they pick up
Ico> where Duc hits its limits :)
And they gloss over the performance issues of MySQL in a big way!
> To go beyond the performance of classical scanning tools, robinhood
> implements a multi-threaded version of depth-first traversal[4]. To
> parallelize the scan, the namespace traversal is split into individual
> tasks that consist in reading single directories. A pool of worker
> threads performs these tasks following a depth-first strategy (as
> illustrated on figure 3).
Ico> Parallelization is on my wish list for a long time (second to
Ico> incremental indexing), I'm sure a lot of users will benefit
Ico> greatly from this. On a single spindle and a modern Linux kernel
Ico> multiple threads will not add too much performance, but as soon
Ico> as multiple physical disks are involved doing things in parallel
Ico> will speed up things a lot, especially on Lustre-scale file
Ico> systems.
Ico> I have made a proof of concept of a parallel indexing algorithm
Ico> for Duc a few years ago. The implementation was close to your
Ico> description above, but I was not too happy with the increased
Ico> complexity of the resulting code, so it never made it into Duc in
Ico> the end. I will definitely take a look at the internals of
Ico> robinhood to see if I can steal any good ideas!
I did the same with Philesight in Ruby, but ran into the big kernel
lock issue of Ruby threading. I don't know if they've fixed that in
newer versions or not.
> If the storage could handle parallel requests nicely, that's a huge
> win.
Ico> I guess that even with no concurrency on the database the speedup
Ico> will still be significant, since the overhead of reading metadata
Ico> from a file system is so much higher than writing out the results
Ico> to the database.
Except if you do a crappy MySQL setup... In my view, the use case is
so specialized that doing a dedicated DB tool makes the most sense.
Using a general SQL DB for storage just doesn't scale. Look at the
crap that Bacula goes through.
> Right now I don't have any metrics to compare the crawling performance
> between robinhood and duc, but I'm using robinhood on a petabyte filer
> through nfs3 with nearly 1 billion files on it, and the crawling
> algorithm performance scales quite linearly up to 8 threads. After
> that, I can't tell if the filer or the mysql database that holds the
> results is the bottleneck at this moment.
Ico> I would be very interested in some numbers that compare Duc to
Ico> Robinhood though on a file system this size; if you ever get some
Ico> numbers on this, please let me know!
I've just compiled robinhood 3.0, but I need to find and setup a mysql
host first. Hopefully I'll get you some numbers in the next day or
so.
John
|
I tend to think like you, but I'm not a database expert. And if you want to run stuff like that on a huge file system, you'll have to build a solid database backend. The people behind robinhood seem to be quite involved and active on this matter (you can have a look at their mailing list) and the program is used in production on multi-petabyte storage systems. But again, I think that duc hits a sweet spot by using a KV backend. And my suggestion was only concerning the crawling algorithm =)
Yes, no magic here. But the speedup obtained while hammering/breaking the storage system could be noticeable. |
Quoting John (2016-12-28 17:35:04)
I've just compiled robinhood 3.0, but I need to find and setup a mysql
host first. Hopefully I'll get you some numbers in the next day or
so.
Great, thanks for that!
|
>>>> "jbd" == jbd ***@***.***> writes:
jbd> I've looked at this too, and I think they have a crucial flaw, which is
jbd> using Mysql as the backend
jbd> I tend to think like you, but I'm not a database expert. And if
jbd> you want to run stuff like that on a huge file system, you'll have
jbd> to build a solid database backend. The people behind robinhood
jbd> seem to be quite involved and active on this matter (you can
jbd> have a look at their mailing list) and the program is used in
jbd> production on multi-petabyte storage systems. But again, I think
jbd> that duc hits a sweet spot by using a KV backend. And my
jbd> suggestion was only concerning the crawling algorithm =)
That's true, the crawling algo was the major thrust, but in a lot of
ways it's not even that hard to do, really. The hard part is the
inserts into the DB in a consistent manner. And if you want the DB to
be consistent at all times, that makes the problem even harder!
Though it might explain why RobinHood uses MySQL, because it punts
that issue to the DB developers.
jbd> As for the parallel code, it will help up until the backend
jbd> (NFS on NetApp in my case mostly) hits its limits, I fear.
jbd> Yes, no magic here. But the speedup obtained while
jbd> hammering/breaking the storage system could be noticeable.
Yes, it is noticeable to spread the load. I've got multiple
multi-terabyte filesystems with up to 60 million files in some of
them, so we just index them all separately on a weekly schedule.
It takes approximately six to eight hours on some of them to complete.
Looking at Robinhood, they take advantage of Lustre's metadata
tracking, which I think is *really* cool. And I'd love to see if ext4
could be hacked to include that in the filesystem itself as well.
Instead of having outside tools do the work, put it into the
filesystem. That is a much, much harder problem in a lot of ways,
but as filesystems expand... it will only help.
But back to the core issue... even getting just two crawlers in
parallel would probably be a big win, except on single spindle
drives. And of course coordinating the inserts into the DB.
John
|
>>>> "Ico" == Ico Doornekamp ***@***.***> writes:
Ico> Quoting John (2016-12-28 17:35:04)
> I've just compiled robinhood 3.0, but I need to find and setup a mysql
> host first. Hopefully I'll get you some numbers in the next day or
> so.
Ico> Great, thanks for that!
Some preliminary numbers are that duc wins in terms of speed and DB
size compared to Robinhood. It's not even finished indexing and the
DB is up to 7.1 GB for a 6.1 TB filesystem with 18 million files. I did
*zero* tuning on the MySQL side of things, which is obviously
impacting the performance of the index. I think I need to just move
it all to /tmp (swap-backed ramfs), bump up the numbers and see how it
goes then.
Duc's DB size for the same filesystem (though probably more heavily used
at the time) is 240 MB. This is using the tokyocabinet backend, not
the lmdb stuff, which I need to look at.
So my initial opinion is that duc and a dedicated DB design is a real
win for this situation. And I say this because I've also used
different backup tools which use custom DBs (networker) and SQL based
(commvault) and the custom DB wins hands down there in terms of size,
speed and capacity.
Quick note, v1.4.2 needs to get updated on the DB options and where to
find them.
John
|
Hi John,
Thanks for testing this!
Quoting John (2016-12-29 18:36:46)
>>> "Ico" == Ico Doornekamp ***@***.***> writes:
Ico> Quoting John (2016-12-28 17:35:04)
> I've just compiled robinhood 3.0, but I need to find and setup a mysql
> host first. Hopefully I'll get you some numbers in the next day or
> so.
Ico> Great, thanks for that!
Some preliminary numbers are that duc wins in terms of speed and DB
size compared to Robinhood.
I'm pretty sure Duc wins on the DB size, since that's one of the things
I've optimized by using varints and tokyocabinet with compression. I
*am* surprised about the indexing speed, though. Duc does the simplest
possible thing here with no smart optimization at all; this makes me
wonder if there is not much to gain from threading (in your
particular setup), or if there is much more overhead in Robinhood which
might cause it to slow down.
(I haven't looked at all the details of Robinhood, but maybe they
store a lot more data than Duc?)
Quick note, v1.4.2 needs to get updated on the DB options and where to
find them.
Yeah, I'm afraid the docs are always a bit behind. I'll add lmdb to the
INSTALL file, but I guess it would pay to describe this all a bit more.
On the other hand - maybe we should choose a single backend and drop
support for the plethora of databases. I'll create a separate ticket to
discuss this.
|
>>>> "Ico" == Ico Doornekamp ***@***.***> writes:
Ico> Hi John,
Ico> Thanks for testing this!
It's going slowly for sure with robinhood. But from the DB size, I
suspect it won't work in my environment.
John> Some preliminary numbers are that duc wins in terms of speed and DB
John> size compared to Robinhood.
Ico> I'm pretty sure Duc wins on the DB size, since that's one of the
Ico> things I've optimized by using varints and tokyocabinet with
Ico> compression. I *am* surprised about the indexing speed,
Ico> though. Duc does the simplest possible thing here with no smart
Ico> optimization at all; this makes me wonder if there is not much to
Ico> gain from threading (in your particular setup), or if there
Ico> is much more overhead in Robinhood which might cause it to slow
Ico> down.
I don't know yet, honestly. I'm running the DB on a local disk, so it
might just be that I'm hitting the disk IOPS limits. Which again
tells me to move it to swap, and to increase the memory buffers and
such. I should have done it last night. I will probably kill it off and
do it now anyway.
Ico> (I haven't looked at all the details of Robinhood, but maybe
Ico> they store a lot more data than Duc?)
Dunno... but it's certainly not optimal in terms of size at all. The
DB looks like this:
mysql> show tables;
+------------------------------+
| Tables_in_robinhood_sjscratc |
+------------------------------+
| ACCT_STAT |
| ANNEX_INFO |
| ENTRIES |
| NAMES |
| SOFT_RM |
| VARS |
+------------------------------+
6 rows in set (0.00 sec)
And just doing a "SELECT COUNT(*) FROM NAMES;" is taking forever... so
it's almost certainly my DB tuning lack which is hurting robinhood.
> Quick note, v1.4.2 needs to get updated on the DB options and where to
> find them.
Ico> Yeah, I'm afraid the docs are always a bit behind. I'll add lmdb
Ico> to the INSTALL file, but I guess it would pay to describe this
Ico> all a bit more.
Ico> On the other hand - maybe we should choose a single backend and drop
Ico> support for the plethora of databases. I'll create a separate ticket to
Ico> discuss this.
Absolutely! We should move from tokyocabinet to lmdb, and I'd also
suggest that we pull lmdb into the duc source tree as well, to make it
simpler to deploy. It's only 32k in size, so it's trivial. Just make
it part of the process to see about pulling upstream changes from time
to time if they don't cause problems.
John
|
You can get a rough idea of what is stored in the robinhood database by quickly looking at the rbh-find command options.
FWIW, I fully agree =) |
I lost track of an interesting paper, but I found it again: On Distributed File Tree Walk of Parallel File Systems (http://www.cs.nmsu.edu/~misra/papers/sc12paper.pdf). |
Out of curiosity, I've tested what looks like an implementation of the previous paper: dwalk.c from the mpifileutils project, which uses libdftw underneath. It uses MPI to handle parallelism, but I'm only using the local machine. The goal is only to see the real benefits of a distributed file tree walk. See also http://lustre.ornl.gov/ecosystem-2015/documents/LustreEco2015-Tutorial5.pdf To test on a CentOS 7 system:
The yum openmpi library seems to have some problems, but it is working:
Let's test (as root, to access my no-root-squash NFS share, on an 8-core/16 GB virtual machine). I ran each test 5 times and took the best score, because the filer could be in heavy use at some point in time. dwalk performs a stat for each entry. The folder I'm indexing:
One MPI process:
Two:
Four:
Eight:
And for fun, 16 =)
I don't know where the bottleneck is (NFS, the local machine or the remote storage) but the gain is substantial. It would be interesting to have a dry-run option for the duc index command that would skip the db_put call in index.c to compare the results. |
Quoting jbd (2016-12-30 11:45:22)
I lost trace of an interesting paper, but I found it again: [On
Distributed File Tree Walk of Parallel File
Systems](http://www.cs.nmsu.edu/~misra/papers/sc12paper.pdf ).
Thanks for the link; I think I've seen this when looking for an
appropriate algorithm, but if I recall correctly I did not figure out
how to make this fit Duc's purpose. The problem solved in this paper is
to visit each directory exactly once and do some work there, but for Duc
the file system needs to be walked recursively and depth first because
the size of each directory is the sum of the sizes of all its files and
subdirectories. Order matters here.
While typing the above, it occurred to me that it might be possible to
change the process though: Duc could first do a complete walk of the
filesystem and only count the sum of the files in each directory - this
could be a completely parallel process using one of the algorithms from
the paper. After the indexing run, the directory sizes could be
recursively added up from the database without having to touch the file
system.
This sounds like a decent solution. I'll ponder this for some time and
do some prototyping to see if I can grok and implement the
parallelization and rerecursivization.
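To make the second half of that idea concrete, here is a rough, hypothetical sketch (this is not Duc code; the struct, function names and sample paths are all invented for illustration) of how per-directory file totals produced by a flat parallel walk could be rolled up into recursive directory sizes afterwards, without touching the file system again:
```
/* Hypothetical sketch of the "aggregate afterwards" step: assume a parallel
 * walk already produced, per directory, the size of the files directly in
 * it (local); a single pass then rolls those up into recursive totals
 * without touching the file system. All names and paths are invented. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct dir_entry {
	char path[4096];      /* absolute path of the directory           */
	long long local;      /* bytes of the files directly inside it    */
	long long total;      /* recursive total, filled in by roll_up()  */
};

static int depth(const char *p)
{
	int d = 0;
	for (; *p; p++) if (*p == '/') d++;
	return d;
}

/* deepest directories first, so children are folded in before their parents */
static int by_depth_desc(const void *a, const void *b)
{
	return depth(((const struct dir_entry *)b)->path) -
	       depth(((const struct dir_entry *)a)->path);
}

static struct dir_entry *find(struct dir_entry *es, size_t n, const char *path)
{
	for (size_t i = 0; i < n; i++)
		if (!strcmp(es[i].path, path)) return &es[i];
	return NULL;
}

static void roll_up(struct dir_entry *es, size_t n)
{
	for (size_t i = 0; i < n; i++) es[i].total = es[i].local;
	qsort(es, n, sizeof es[0], by_depth_desc);
	for (size_t i = 0; i < n; i++) {
		char parent[4096];
		snprintf(parent, sizeof parent, "%s", es[i].path);
		char *slash = strrchr(parent, '/');
		if (!slash || slash == parent) continue;   /* top of the scanned tree */
		*slash = '\0';
		struct dir_entry *p = find(es, n, parent);
		if (p) p->total += es[i].total;            /* fold child into parent  */
	}
}

int main(void)
{
	struct dir_entry es[] = {
		{ "/data",         10, 0 },
		{ "/data/a",      100, 0 },
		{ "/data/a/deep",  50, 0 },
		{ "/data/b",       30, 0 },
	};
	size_t n = sizeof es / sizeof es[0];
	roll_up(es, n);
	for (size_t i = 0; i < n; i++)
		printf("%-13s %lld\n", es[i].path, es[i].total);
	return 0;
}
```
Processing deepest directories first guarantees that every directory is folded into its parent before the parent itself is folded upwards; in Duc the per-directory totals would of course come from the database rather than a hard-coded array.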
|
Indeed, that was something I didn't have in mind. |
I don't know if you're interested, but I've tried to implement a --dry-run option to the index (#166). Given your last remark, maybe it is not as useful as I thought... =) |
Quoting jbd (2016-12-30 16:37:05)
Out of curiosity, I've tested what looks like an implementation of the
previous paper: [dwalk.c from the mpifileutils
project](https://github.com/hpc/mpifileutils/blob/master/src/dwalk/dwalk.c)
which is using [libdftw](https://github.com/hpc/libdftw) underneath.
It uses mpi to handle parallelism, but I'm only using the local
machine. The goal is only to see the real benefits of a distributed
file tree walk.
See also http://lustre.ornl.gov/ecosystem-2015/documents/LustreEco2015-Tutorial5.pdf
Thanks for testing, nice to know what kind of speedup can be expected
using parallelism.
Your test results:
1: 71.579805 seconds (10677.927944 files/sec)
2: 50.769305 seconds (15054.844655 files/sec)
4: 32.848336 seconds (23268.271489 files/sec)
8: 21.685690 seconds (35245.546718 files/sec)
16: 21.121180 seconds (36187.561490 files/sec)
That makes sense: http://zevv.nl/div/graph.png
I don't know where the bottleneck is (nfs, local machine or remote
storage) but the gain is substantial. It would be interesting to have
a dry-run option for the index duc command that will prevent the
db_put call in index.c to compare the results.
Test branch for you:
https://github.com/zevv/duc/tree/index-noop
Use the --no-operation flag to the 'duc index' command. Duc now
traverses and does fstat() but does not serialize data or write to the
database.
(Do you know if the MPI test app actually calls fstat() for each file?)
|
Grrr, I've lost my answer.
Thank you for merging my pull request. Let's hope I didn't introduce a subtle bug in the process =) Here is the result (best of five runs) on the same dataset:
A bit faster! That gives us a rough idea.
Since dwalk has a "--lite - walk file system without stat" option, I hope that the default behaviour is to perform a stat for each entry. An strace -f -estat,lstat confirms that. By quickly looking at the code, the bayer_flist_walk_paths function enqueues walk_stat_process callbacks that issue an lstat call in the end. bayer_flist_walk_paths also does other stuff regarding user and group (but I don't know if my previous runs fell inside this code path). |
Quoting jbd (2016-12-30 18:29:02)
Grrr, I've lost my answer.
> Test branch for you
Thank you for merging my pull request. Let's hope I didn't introduce a subtle bug in the process =)
I usually do, it would be nice to be able to blame someone every now and
then :)
Here is the result (best of five runs) on the same dataset:
```
# echo 3 | tee /proc/sys/vm/drop_caches;
# rm -f /root/.duc.db*
# time ./duc-lmdb index --dry-run -x -p /data
real 0m53.694s
user 0m1.114s
sys 0m11.823s
```
A bit faster! That gives us a rough idea.
> (Do you know if the MPI test app actually calls fstat() for each file?)
Since dwalk has a "--lite - walk file system without stat" option,
I hope that the default behaviour is to perform a stat for each entry.
That makes sense.
So, using this algorithm for parallel walking should bring us a 2 or 3
times speedup, but there is still the problem of how to map/reduce the
total directory sizes in recursive order.
My implementation (which I seem to have thrown away, a pity) did
something like this:
- there's a global work pool with jobs, which hold directories to be
scanned; one job = one directory. Each job has a counter which holds
the number of child jobs, starting at 0.
- a number of threads pick jobs from the pool and scan directories.
Finding subdirs means adding new work to the work pool and increasing
the child counter of the parent.
- each job points to its parent directory. When completing a job, the
totals are added to the parent job's totals and the child counter
decreases.
- when the child counter of a job reaches 0 again this means the subtree
is complete and its directory entries are written to the database.
I was pretty happy with the performance (2.5x speedup, iirc), but the
implementation was just too complicated. I'm probably just not smart
enough to write and maintain multithreaded applications.
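For the record, here is a rough, hypothetical sketch of that job/child-counter scheme (this is not the thrown-away implementation nor any Duc code; struct job, push(), complete(), scan() and worker() are invented names, error handling is minimal and the work queue is a toy fixed-size array):
```
/* Rough, hypothetical sketch of the job/child-counter scheme above (not Duc's
 * code): names invented, minimal error handling, toy fixed-size queue. */
#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

struct job {
	char path[4096];
	struct job *parent;   /* where this subtree's total gets added   */
	long long bytes;      /* subtree total, accumulated bottom-up    */
	int pending;          /* 1 for its own scan + one per open child */
};

static struct job *queue[1 << 16];   /* toy queue: never wraps or grows */
static int qhead, qtail, busy;       /* busy = scans queued or running  */
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

static void push(struct job *j)
{
	pthread_mutex_lock(&mu);
	queue[qtail++] = j; busy++;
	pthread_cond_signal(&cv); pthread_mutex_unlock(&mu);
}

/* called with mu held: a finished subtree rolls its total into its parent */
static void complete(struct job *j)
{
	while (j && --j->pending == 0) {
		struct job *p = j->parent;
		if (p) p->bytes += j->bytes;
		else   printf("%s: %lld bytes\n", j->path, j->bytes);
		free(j);
		j = p;
	}
}

/* scan one directory: sum its files, turn each subdirectory into a new job */
static long long scan(struct job *j)
{
	long long files = 0; struct dirent *e;
	DIR *d = opendir(j->path);
	if (!d) return 0;
	while ((e = readdir(d))) {
		char sub[4096]; struct stat st;
		if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, "..")) continue;
		snprintf(sub, sizeof sub, "%s/%s", j->path, e->d_name);
		if (lstat(sub, &st)) continue;
		if (S_ISDIR(st.st_mode)) {
			struct job *c = calloc(1, sizeof *c);
			snprintf(c->path, sizeof c->path, "%s", sub);
			c->parent = j; c->pending = 1;
			pthread_mutex_lock(&mu);
			j->pending++;           /* the child keeps its parent open */
			pthread_mutex_unlock(&mu);
			push(c);
		} else {
			files += st.st_size;
		}
	}
	closedir(d);
	return files;
}

static void *worker(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&mu);
		while (qhead == qtail && busy > 0)
			pthread_cond_wait(&cv, &mu);
		if (qhead == qtail && busy == 0) {      /* everything is done */
			pthread_cond_broadcast(&cv);
			pthread_mutex_unlock(&mu);
			return NULL;
		}
		struct job *j = queue[qhead++];
		pthread_mutex_unlock(&mu);
		long long files = scan(j);
		pthread_mutex_lock(&mu);
		j->bytes += files; busy--;
		complete(j);             /* may cascade all the way to the root */
		if (busy == 0) pthread_cond_broadcast(&cv);
		pthread_mutex_unlock(&mu);
	}
}

int main(int argc, char **argv)
{
	pthread_t t[4]; struct job *root = calloc(1, sizeof *root);
	snprintf(root->path, sizeof root->path, "%s", argc > 1 ? argv[1] : ".");
	root->pending = 1; push(root);
	for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
	return 0;
}
```
Even in this toy version the complexity mentioned above shows up: every counter and total has to be touched under the lock, and the shutdown condition (empty queue and no scan in flight) is easy to get subtly wrong.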
|
(Some ramblings in case I leave this for a year and forget about it - this happens to me far too often.)
Ok, I've spent some time thinking and prototyping, but there's one nasty problem on the road: at this time Duc recursively chdir()s itself through the file system, thus only needing relative file names for opendir(). This breaks when doing multiple threads, since there is only one CWD (current working directory) for all threads. The proper solution for this is to use openat() and fdopendir(), which allow referencing of file names relative to a parent file descriptor. The only problem is that for some reason this is not supported on MacOS X and Windows. I've stopped caring for Windows a long time ago, but I really would like to keep MacOS in the list of supported operating systems. Two solutions come to mind:
• Use processes instead of threads. Actually not that bad, although this will complicate the communication in the worker pool with some overhead. (pipe write/read = system calls)
• Have duc always keep track of absolute path names and pass those to opendir(). This causes a bit of overhead with some string handling, but is not that bad.
|
On Jan 1, 2017, at 11:25 AM, Ico Doornekamp ***@***.***> wrote:
(Some ramblings in case I leave this for a year and forget about it - this happens to me far too often.)
Ok, I've spent some time thinking and prototyping, but there's one nasty problem on the road:
At this time Duc recursively chdir()s itself through the file system, thus only needing relative file names for opendir(). This breaks when doing multiple threads since there is only one CWD (current working directory) for all threads. The proper solution for this is to use openat() and fdopendir() which allow referencing of file names relative to a parent file descriptor. The only problem is that for some reason this is not supported on MacOS X and Windows.
Another option for MacOS X support would be to check if the next-gen Apple filesystem supports relative paths, and potentially wait for that:
https://developer.apple.com/library/content/documentation/FileManagement/Conceptual/APFS_Guide/Introduction/Introduction.html
…--
Stuart Anderson [email protected]
http://www.ligo.caltech.edu/~anderson
|
Ico> (Some ramblings in case I leave this for a year and forget about
Ico> it - this happens to me far too often.)
Same here.... :-(
Ico> Ok, I've spent some time thinking and prototyping, but there's
Ico> one nasty problem on the road:
Ico> At this time Duc recursively chdir()s itself through the file
Ico> system, thus only needing relative file names for opendir(). This
Ico> breaks when doing multiple threads since there is only one CWD
Ico> (current working directory) for all threads. The proper solution
Ico> for this is to use openat() and fdopendir() which allow
Ico> referencing of file names relative to a parent file
Ico> descriptor. The only problem is that for some reason this is not
Ico> supported on MacOS X and Windows. I've stopped caring for
Ico> windows a long time ago, but I really would like to keep MacOS in
Ico> the list of supported operating systems.
What about fchdir() instead? You opendir() a directory, and then do a
'fchdir(dir)' into that directory. Keep it per-thread.
Ico> Two solutions come to mind:
Ico> • Use processes instead of threads. Actually not that bad, although this will
Ico> complicate the communication in the worker pool with some overhead. (pipe
Ico> write/read = system calls)
I'm not as familiar with MacOS/Windows IPC models, so I can't really
comment here. For the Unix side, it shouldn't be too hard; it's
mostly the waiting for completion that is a little tricky, and keeping
count of course.
Ico> • Have duc always keep track of absolute path names and pass those to opendir
Ico> (). This causes a bit of overhead with some string handling, but is not
Ico> that bad.
This is probably the simplest, especially since it's trivial to just keep
track of the directory name in each thread. The memory overhead
shouldn't be too bad, PATH_MAX bytes + some misc per-thread.
John
|
Quoting John (2017-01-01 22:06:17)
Ico> • Have duc always keep track of absolute path names and pass those to opendir
Ico> (). This causes a bit of overhead with some string handling, but is not
Ico> that bad.
This is probably the simplest, especially since it's trivial to just keep
track of the directory name in each thread. The memory overhead
shouldn't be too bad, PATH_MAX bytes + some misc per-thread.
It's not really about the string overhead, it's more about the extra work
for the file system layer to resolve the same path over and over again.
For example, when indexing something like:
/usr/src/linux-headers-4.6.0-1-amd64/include/config/bridge/igmp
/usr/src/linux-headers-4.6.0-1-amd64/include/config/bridge/nf
/usr/src/linux-headers-4.6.0-1-amd64/include/config/bridge/ebt
duc now simply chdirs to the 'bridge' directory and does a stat() for
each entry in the dir. Without chdir, the stat() will be done on the
full path, so the VFS layer will have to find 'usr', then find 'src',
then 'linux-headers-4.6.0-1-amd64', and so on. Of course a lot of this
will live in cache, but there is still more work to be done in the kernel.
This is why openat() exists (albeit not on MacOS X).
I'm working to do some tests on a patched duc that does just that (in a
single thread, still) to see what the impact would be.
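As a reference point, this is roughly what such an openat()/fdopendir()-based walk could look like (a minimal, single-threaded, hypothetical sketch, not the patched duc; the byte counting is only there to make it runnable):
```
/* Hypothetical sketch of walking with openat()/fdopendir()/fstatat(): every
 * lookup is resolved relative to the parent directory's fd, so no chdir()
 * and no repeated full-path resolution. Not Duc's actual code. */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static long long walk(int parent_fd, const char *name)
{
	long long total = 0;
	int fd = openat(parent_fd, name, O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
	if (fd < 0) return 0;
	DIR *d = fdopendir(fd);          /* the DIR stream takes ownership of fd */
	if (!d) { close(fd); return 0; }

	struct dirent *e;
	while ((e = readdir(d))) {
		struct stat st;
		if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
			continue;
		if (fstatat(fd, e->d_name, &st, AT_SYMLINK_NOFOLLOW))
			continue;
		if (S_ISDIR(st.st_mode))
			total += walk(fd, e->d_name);   /* recurse relative to fd */
		else
			total += st.st_size;
	}
	closedir(d);                     /* also closes fd */
	return total;
}

int main(int argc, char **argv)
{
	const char *root = argc > 1 ? argv[1] : ".";
	printf("%s: %lld bytes\n", root, walk(AT_FDCWD, root));
	return 0;
}
```
Every lookup stays relative to the parent directory's file descriptor, so the kernel never re-resolves the full path; the one cost is an open file descriptor per level of recursion.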
|
As of MacOS 10.10 (Darwin 14.0), which was released in the end of 2014, `openat()` is supported. You can see the full list of supported syscalls for this version here: https://opensource.apple.com/source/xnu/xnu-2782.1.97/bsd/kern/syscalls.master.auto.html |
Quoting Alex (2019-01-19 18:48:47)
> The proper solution for this is to use openat() and fdopendir() which allow referencing of file names relative to a parent file descriptor. The only problem is that for some reason this is not supported on MacOS X and Windows. I've stopped caring for windows a long time ago, but I really would like to keep MacOS in the list of supported operating systems.
As of MacOS 10.10 (Darwin 14.0), which was released in the end of 2014, `openat()` is supported. You can see the full list of supported syscalls for this version here: https://opensource.apple.com/source/xnu/xnu-2782.1.97/bsd/kern/syscalls.master.auto.html
Thanks, that is a good reason to rethink those requirements and possibly
refactor the indexer. I have never heard of a single user running on
Windows, so I would gladly drop support for that.
Side note: Duc also runs on Android (via termux), which also has
openat() and friends.
|
I'm currently working on a worker-pool-based implementation of the DFS algorithm described in this paper: https://www.lrde.epita.fr/~bleton/doc/parallel-depth-first-search.pdf If anyone is interested in a proof-of-concept, I can provide a Python script that minimally illustrates the filesystem crawling algorithm. If the maintainers have suggestions/opinions on this crawling algorithm, please let me know -- I'm open to suggestions. |
>>>> "Muhammad" == Muhammad Haris ***@***.***> writes:
Muhammad> I'm currently working on a worker-pool-based implementation
Muhammad> of the DFS algorithm described in this paper:
Muhammad> https://www.lrde.epita.fr/~bleton/doc/parallel-depth-first-search.pdf
Muhammad> If anyone is interested in a proof-of-concept, I can provide
Muhammad> a Python script that minimally illustrates the filesystem
Muhammad> crawling algorithm. If the maintainers have
Muhammad> suggestions/opinions on this crawling algorithm, please let
Muhammad> me know -- I'm open to suggestions.
Sure, but I think you might want to implement this in C instead,
because (as I dimly recall) Python isn't quite thread-safe and doesn't
scale well because of locking overhead. But I could be wrong.
Please feel free to post any results you get along with sample code
for us to look over.
John
|
Hello,
this is more a suggestion than an issue.
The duc indexing is already quite fast, but you might be interested in the filesystem crawling algorithm of robinhood (https://github.com/cea-hpc/robinhood), also written in C, if you don't already know the project:
To go beyond the performance of classical scanning tools, robinhood implements a multi-threaded version of depth-first traversal[4]. To parallelize the scan, the namespace traversal is split into individual tasks that consist in reading single directories. A pool of worker threads performs these tasks following a depth-first strategy (as illustrated on figure 3).
from the "Taking back control of HPC file systems with Robinhood Policy Engine" paper. See here for more.
If the storage could handle parallel requests nicely, that's a huge win. Right now I don't have any metrics to compare the crawling performance between robinhood and duc, but I'm using robinhood on a petabyte filer through nfs3 with nearly 1 billion files on it, and the crawling algorithm performance scales quite linearly up to 8 threads. After that, I can't tell if the filer or the mysql database that holds the results is the bottleneck at this moment.