System Administrator Interview 3/16/06 - 4:30PM in San Francisco
This is an effort to understand how a target user, the "System Administrator", would use Cosmo. This was not a structured interview, so there were no specific questions. The information collected will help prioritize features for target users in future releases of Cosmo.
Information about the user
- Senior level System Administrator
- Uses Cosmo mostly from the terminal window, rarely touches the web console
- Works closely with developers, i.e. when errors occur, there is a strong relationship for handing off issues from admin to developers
Highlights
Items of importance:
- Server status, really useful page.
- Having the server be able to repeat things by itself (instead of writing separate scripts).
- Running roll over counter, log in the last minute.
Top priorities:
- Backups: "can't back up w/ out stopping and restart" (about 20 mins to back up, big binary files, not an incremental backup)
- Understanding who is using the disk space w/ out clicking through each user: a sorted list of the names of the top users, then use the HTTP-based browser (web console) to go to a user's home directory
- Need better post-production validation
- Repeat installation: not having to write scripts, e.g. for installing Cosmo on each server, would improve productivity for senior admins
- "I can do a lot w/ web access logs"
- Don't need/want a long HTML list of all the users --> really long wait time.
- Questions sys admins always need to check: disk space, memory, network and power.
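The "who is using the disk space" request above could be sketched roughly as follows. This is only an illustration: it assumes each user's data sits in its own subdirectory under a single root directory, which these notes do not confirm for Cosmo, and the names `dir_size`, `top_users` and `homes_root` are hypothetical.

```python
import os


def dir_size(path):
    """Total size in bytes of all regular files under path (symlinks skipped)."""
    total = 0
    for dirpath, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def top_users(homes_root, n=10):
    """Return [(user, bytes)] for the n largest per-user directories.

    Assumption: one subdirectory per user directly under homes_root.
    Results are ordered largest first, ties broken by user name.
    """
    usage = []
    with os.scandir(homes_root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                usage.append((entry.name, dir_size(entry.path)))
    usage.sort(key=lambda item: (-item[1], item[0]))
    return usage[:n]
```

Calling `top_users("/path/to/homes", 10)` would give the short sorted list the admin asked for, without rendering a page for every user.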
Nice to have:
- Greater sophistication to view memory use in a graph, e.g. "50MB of number is tight, I'd like to know if it's dropped before 50MB, number of free space". Perhaps a script to fetch this page and put it in a database
- Ongoing monitoring
- Fetch a URL to get a list of the 6000 users, then extract their user names from the XML.
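The "fetch a URL, extract user names from the XML" idea above might look something like this. It is a sketch under stated assumptions: the notes do not say which URL serves the user list or what the XML looks like, so the element name `username` and both function names are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET


def extract_usernames(xml_text, tag="username"):
    """Return the text of every <tag> element in the document, in order.

    Assumption: user names appear as the text of un-namespaced elements
    called `tag`; a namespaced feed would need the qualified tag name.
    """
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(tag) if el.text and el.text.strip()]


def fetch_usernames(url, tag="username"):
    """Download the XML at url (hypothetical endpoint) and extract user names."""
    with urllib.request.urlopen(url) as resp:
        return extract_usernames(resp.read(), tag)
```

Keeping the parsing separate from the download makes it easy to run the same extraction against a saved copy of the page.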
Interview
Installation of Cosmo
- Look at the README; after untarring, try to follow the instructions --> go to the wiki for more (admins like what's in the bundle); went through it step by step
- Repeating the installation is "where I had difficulties"; don't want to install on all the machines
- Would it be valuable to have a public communication --> to talk about how one person would install it on a machine?
Writing Scripts
- Wrote a script to install Cosmo on many servers --> senior admins would know about scripting, but not all admins, especially those with less experience.
- If I want to run a mail server, I ask my OS to get me a known stable copy.
- Run many instances, support dev.
- Production every time --> don't want to install on all the machines
- Multiple project instances: old ones still available, here's the URL to get new data --> data wipe. 0.2 regularly, won't release a wipe w/ out migration --> script to go to the old instance. 0.2 --> 0.3
- Send upgrade to the Cosmo instance from the old one to 0.3.1, use scripts to create a new instance.
- Move/copy files, find me the 10 largest files. How do I find the bad seed if someone is abusing the server? Can't now browser simulator.
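For the "find me the 10 largest files" request above, a minimal sketch, assuming the files of interest are ordinary files on disk under some root directory (the function name `largest_files` is made up for illustration):

```python
import heapq
import os


def largest_files(root, n=10):
    """Return up to n (size_in_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    sizes.append((os.path.getsize(path), path))
            except OSError:
                # File vanished or is unreadable; skip it rather than abort.
                continue
    return heapq.nlargest(n, sizes)
```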
Log files
- Log files are very important --> what admins use most on a production instance is the log files. When the site isn't working, admins look into the log files --> not the best way to check errors
- Start a new instance: untar the tarball, 30 sec; sometimes the first instance takes 200 sec.
- Better visual cue of what is real vs. what is in production. Priscilla Chung: For example, everything in the left pane is production vs. right is real.
- Hard to get a unified view in a consistent, correct order; with multiple events, hard to correlate one event between 2 log files.
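One way to get the unified, correctly ordered view mentioned above is to merge the log files by timestamp. The sketch below assumes every line starts with a timestamp of the form `2006-03-16 16:30:00` and that each file is already in time order; the real Cosmo and Derby log formats are not given in these notes, so `TS_FORMAT` and the function names are assumptions.

```python
import heapq
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout at the start of each line
TS_LEN = 19                      # length of a timestamp in that layout


def parse_ts(line):
    """Parse the leading timestamp of a log line."""
    return datetime.strptime(line[:TS_LEN], TS_FORMAT)


def tagged(name, lines):
    """Yield (timestamp, log_name, line) for each non-blank line of one log."""
    for line in lines:
        if line.strip():
            yield (parse_ts(line), name, line.rstrip("\n"))


def merge_logs(named_logs):
    """Merge several time-ordered logs into one list ordered by timestamp.

    named_logs maps a label (e.g. "access") to an iterable of lines.
    """
    streams = [tagged(name, lines) for name, lines in named_logs.items()]
    return list(heapq.merge(*streams))
```

Each merged entry keeps the name of the log it came from, so an event in one file can be read next to what the other file recorded at about the same time.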
Errors
- Very common to look for errors such as 'file not found', 'I have a crash'
- A lot of visual analysis, looking for patterns.
- Is this different? Something is different, something is reporting a problem.
- Send to the dev. --> when errors occur, there is a strong hand off from admin to dev.
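Some of the visual pattern hunting described above could be backed by a simple count of known error phrases per log. A rough sketch, where the pattern labels and regular expressions are illustrative guesses based on the phrases quoted in the interview, not actual Cosmo log messages:

```python
import re
from collections import Counter

# Illustrative patterns only; real log messages may be worded differently.
PATTERNS = {
    "file not found": re.compile(r"file not found", re.IGNORECASE),
    "crash": re.compile(r"crash|exception", re.IGNORECASE),
}


def count_errors(lines, patterns=PATTERNS):
    """Count how many lines match each labelled pattern."""
    counts = Counter()
    for line in lines:
        for label, regex in patterns.items():
            if regex.search(line):
                counts[label] += 1
    return counts
```

Comparing the counts from one hour or one day to the next gives a quick answer to "is something different?" before handing the log over to the developers.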
Building Cosmo
- Provide switches, change password from the built in one, copy data from an old one, where to download one, pass all the right switches
- Put description, basic switch--> to rebuild all the instances. To rebuild everyone every time would be a special case
- The idea of revert, testing version, i.e. people are still working on the existing wiki, and at the same time people are coming up and recreating a new one
What Sys Admins check on and ask themselves:
- Check to see if it's not running out of disk space, serving a lot of bandwidth, that it's not running slow.
- Are things going fine? Is the application healthy?
- Not crashing, people are hosting, besides are Chandler is okay, but though may not be best practice.
- Server status, really useful page. Having the server be able to repeat things by itself.
- Running roll over counter, log in the last minute.
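The rolling counter mentioned above (how many events in the last minute) can be sketched with a queue of timestamps. This is an illustration of the idea only, not how the Cosmo server status page is implemented; the class name is hypothetical and events are assumed to be recorded in time order.

```python
import time
from collections import deque


class RollingCounter:
    """Count events that happened within the last `window_seconds`."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self._events = deque()  # event timestamps, oldest first

    def record(self, now=None):
        """Record one event at time `now` (defaults to the current time)."""
        self._events.append(time.time() if now is None else now)

    def count(self, now=None):
        """Drop events older than the window and return how many remain."""
        now = time.time() if now is None else now
        cutoff = now - self.window
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events)
```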
Web Console: Server Status
- Uses garbage collection for performance issues
- Java system app pauses for garbage collection, then pause and wait (not Sun)
- Production instance to work right is to lower behavior of the application
Other notes:
- Use Apache for the front end --> general web server for management control
- Directory index --> can't look through files, like grep (Unix, find files w/ a string in them)
- Lock on the calendar: when there is a lock, there is no tool to break the lock tickets and fix it
- Can I create a tool to read, update, and delete things --> collections, files, etc.?
- Cosmo is layered-->Derby, cosmo http, log, standard out log...random what goes to where...