Commpath#182
Conversation
Previously, chosen_realized_cachepath was copied into set_intercept_readlink_cachepath() chosen_realized_cachepath and chosen_parsed_cachepath were copied into set_should_intercept_cachepath() This PR removes both setter functions and makes the original pointers global.
Removes chosen_cachepath and cachepath_bitindex from spindle_launch.h Updates initialization of matching variables in ldcs_process_data. determineValidCachePaths() moved from spindle_be.cc to ldcs_audit_server_process.c to get ldcs_process_data visibility. Added #include "parseloc.h" to ldcs_audit_server_process.c to get declaration of determineValidCachePaths(). Relocated "parseloc.h" to src/util so ldcs_audit_server_process.c could find it. Trued up signedness of types caused my making "parseloc.h" more visible, e.g., cachepath_bitidx is now uint64_t everywhere.
The three-message-reply response is now a single message with two strings. The symbolic version of the cachepath is no longer communicated as it was not being used.
New name is ldcs_audit_server_md_allreduce_AND(). If we get to the point where we're using other allreduce operations we can solve the problem of duplicating the op list in md-land and cobo-land. For now, we're only using one op in md-land, so the op can go into the function name.
Unlikely it would ever make a difference, but this is much more correct.
src/client/beboot/spindle_bootstrap.c
Moved orig_location from static global to local
Renamed symbolic_location to symbolic_commpath
Renamed orig_location to orig_commpath
Renamed location to commpath
Renamed LDCS_LOCATION to LDCS_COMMPATH
src/client/client/client.c
Renamed LDCS_LOCATION to LDCS_COMMPATH
Renamed location to commpath
src/client/client_comlib/client_api.c
Added setenv("LDCS_CHOSEN_PARSED_CACHEPATH", local_cpc);
testsuite/test_driver.c
Replaced LDCS_LOCATION and LDCS_ORIG_LOCATION checks for cachepath with
LDCS_CHOSEN_PARSED_CACHEPATH
Replaced spindle_loc with cachepath
All tests pass with both distinct and identical commpaths/cachepaths.
Additionally populates /etc/environment just in case ssh is used to set up the servers.
The option is marked as obsolete in configure --help and will cause an error in configure if it is specified. As updates the CI configure scripts to use --with-cachepaths and --with-commpath instead of --with-localstorage.
Replaces args->location with args->commpath.
Additional integration for commpath + spank-plugin.
TMPDIR left out of a docker script, and --cachepath instead of --cachepaths.
This commit fixes a race condition where eager client processes can
submit a LDCS_MSG_CHOSEN_CACHEPATH_REQUEST before the servers have
come to a consensus. When that is the case, the server now responds
with LDCS_MSG_NO_CACHEPATH_CONSENSUS_YET and the client sleeps for
one second (max 10 retries) before sending the message again.
1) include/ldcs_api.h
Added LDCS_MSG_NO_CACHEPATH_CONSENSUS_YET
2) server/comlib/ldcs_api_util.c
Added STR_CASE entry for that message.
3) server/auditserver/ldcs_audit_server_handlers.c
Added global "static bool cachepath_consensus_reached" above handle_cachepath_consensus()
Set that variable to true inside handle_cachepath_consensus()
In handle_chosen_cachepath_request(), made msg.header.type conditional on cachepath_consensus_reached
4) client/client_comlib/client_api.c
This is the only place LDCS_MSG_CHOSEN_CACHEPATH_REQUEST is used.
Will sleep for 1 second after each LDCS_MSG_NO_CACHEPATH_CONSENSUS_YET message.
The theory being that eager clients are using an uninitialized cachepath variable. By delaying the consensus, the failure should happen more often.
"sending message of type: request_location_path" is now "sending message of type: CHOSEN_CACHEPATH_REQUEST"
Known to affect the symbolic form of candidate cachepaths. Not sure that's ever being used, but it's fixed now.
_message_type_to_str() can now be used in cobo_fe_comm.c. ldcs_audit_server_fe_broadcast() now reports message type. Only two messages are expected to be routed through there, but it's the correct way to report it.
Cleanup now takes both commpath and cachepath and prefixes for removing files created by Spindle.
The original LDCS_LOCATION_MOD checked to see if there were multiple servers running on a node and, if so, modified the location string so that each server had its own location. The code did not handle the case where the directory above the requested directory was not writeable, e.g., if the user passed in --location=/tmp, the code would try to create a directory /tmp-00 for the first server. That fails. With commpath and cachepath replacing location, and with new initialization paths, the existing code would modify only commpath after the commpath directory had been created. If the multiple-server case needs to be supported, commpath- and cachepath-specific code needs to be added back in.
That configure parameter is no longer supported.
Replaced with
--with-cachepaths=/tmp/commpath/cachepath
--with-commpath=/tmp/commpath
|
|
||
| char *chosen_parsed_cachepath = NULL; | ||
| send_cachepath_query( ldcsid , NULL, &chosen_parsed_cachepath); | ||
| assert( chosen_parsed_cachepath ); |
There was a problem hiding this comment.
Don't assert. Instead do an err_printf and return -1.
| } | ||
|
|
||
| send_cachepath_query( ldcsid, &chosen_realized_cachepath, &chosen_parsed_cachepath ); | ||
| assert( chosen_realized_cachepath ); |
There was a problem hiding this comment.
Again, don't assert in the client. err_printf and return -1.
| char *orig_file_name = (char *) name; | ||
| if (is_in_spindle_cache(name)) { | ||
| debug_printf2("Library %s is in spindle cache (%s). Translating request\n", name, location); | ||
| assert( chosen_realized_cachepath ); |
| extern char *chosen_realized_cachepath; | ||
|
|
||
| location_len = strlen(location); | ||
| assert( chosen_realized_cachepath ); |
| static int cachepath_size = 0; | ||
| static int orig_cachepath_size = 0; | ||
| extern char *chosen_realized_cachepath, *chosen_parsed_cachepath; | ||
| assert( chosen_realized_cachepath ); |
| return -1; | ||
| } | ||
| char *local_crc = strdup( buffer ); | ||
| char *local_cpc = strdup( &buffer[ strlen(local_crc) + 1 ] ); |
There was a problem hiding this comment.
spindle_strdup instead of strdup on the above two lines
| case HSHAKE_INTERNAL_ERROR: | ||
| err_printf("Internal error doing handshake: %s", spindle_handshake_last_error_str()); | ||
| exit(-1); | ||
| return -1; |
There was a problem hiding this comment.
Revert this back to exit(-1).
The handshake is security checks. If things go wrong here, we just want to exit out and stop. We don't know if an attacker might be able to trigger the HSHAKE_INTERNAL_ERROR, and if they do we're better off just stopping the process.
| case HSHAKE_INTERNAL_ERROR: | ||
| err_printf("Internal error doing handshake: %s", spindle_handshake_last_error_str()); | ||
| exit(-1); | ||
| return -1; |
|
|
||
| int num_children = ldcs_audit_server_md_get_num_children(procdata); | ||
|
|
||
| debug_printf( "Processing REQUEST_CACHEPATH_CONSENSUS.\n" ); |
There was a problem hiding this comment.
These debug_printfs are all at level1, but should be level 2-3.
level 1 should just be a single statement for the high-level operations. One or two "running consensus on ..." or "consensus reached on ..." would be appropriate.
level 2 should be the major operations within the high-level opertions. Many of these printfs should be level 2.
level 3 should be for the gross details.
| errno=0; | ||
| while( stat( origpath, buf ) == -1 ){ | ||
| local_errno = errno; | ||
| debug_printf("Failed to stat '%s' (%s).\n", origpath, strerror(local_errno)); |
There was a problem hiding this comment.
Should probably be a debug level 3 printf
Passed all tests on my clone. Let's see what it does over here.