Linux Data Recovery

Determining the Problem and Recovering from It

Now that we’ve decided what to recover, in this case booting into multi-user mode, we need to determine what the problem is. Initial troubleshooting should establish where the problem lies: hardware or software, configuration or libraries, and so on.

Boot Stages

Narrowing down the source of the problem is done in stages, following each stage of the Linux boot process as it completes. Actual fixes depend entirely on your specific environment, distribution and problem. Please read the noted documentation and understand the techniques before applying them.

Stage 1: LILO (Linux Loader)
Stage 2: Loading the kernel
Stage 3: Mounting the Disks
Stage 4: Startup Scripts
Stage 5: Runlevel Scripts
Stage 6: Providing a Login Prompt

LILO

After the BIOS screen, LILO should run. If LILO runs, we can safely assume that most of the hardware is functioning and that the Master Boot Record (MBR) is still loading. Each letter of “LILO” is printed as a different part of LILO loads. The most common problem is seeing “LI” and then nothing else. A common cause is another operating system, such as NT, being installed on top of Linux and overwriting the MBR. To resolve this problem, wipe out the broken MBR boot code (using DOS/Windows “fdisk /mbr” or “dd if=/dev/zero of=/dev/hda bs=446 count=1”) and reinstall LILO (/sbin/lilo) back into the MBR. Be careful with the dd block size: zeroing the whole 512-byte sector (bs=512) also erases the partition table, which is stored after the first 446 bytes. A current boot disk is needed to run /sbin/lilo or dd.
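
Before touching the MBR with dd, it is cheap insurance to save a copy of it first. The following is a minimal sketch of that idea, not a tested recovery procedure: the helper names backup_mbr and wipe_boot_code are made up for illustration, /dev/hda is the first IDE drive used in the example above, and the backup path is a placeholder for any writable location.

```shell
#!/bin/sh
# Sketch only: save the MBR, then clear just its boot-code area.
# backup_mbr and wipe_boot_code are hypothetical helper names.

backup_mbr() {
    # $1 = disk device (or disk image file), $2 = backup file to create
    # Copies the full first sector: boot code, partition table and signature.
    dd if="$1" of="$2" bs=512 count=1 2>/dev/null
}

wipe_boot_code() {
    # $1 = disk device (or disk image file)
    # Zeroes bytes 0-445 only, leaving the partition table untouched.
    # conv=notrunc stops dd from truncating the target when it is a regular file.
    dd if=/dev/zero of="$1" bs=446 count=1 conv=notrunc 2>/dev/null
}

# Intended use, as root from a boot disk (paths are placeholders):
#   backup_mbr /dev/hda /mnt/floppy/mbr.backup
#   wipe_boot_code /dev/hda
#   /sbin/lilo
# To put the saved sector back if something goes wrong:
#   dd if=/mnt/floppy/mbr.backup of=/dev/hda bs=512 count=1
```

Both helpers take the device as an argument, so they can be tried against a scratch image file before being pointed at a real disk.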

Other common problems for LILO include errors in /etc/lilo.conf, BIOS limitations in addressing IDE drives, and installing Linux to a FAT partition and then running defrag within DOS. Most problems can be fixed by verifying the options and mappings inside /etc/lilo.conf, running /sbin/lilo and rebooting. I always create a /boot directory with the kernel, bootstrap (boot.b) and System.map near the beginning of my disk to avoid cylinder (1024cyl) and size (>2GB) issues with the BIOS. When the kernel is located on a SCSI drive, put the keyword linear in lilo.conf. Basic documentation for LILO is provided in man:lilo(8), man:liloconfig(8) and man:lilo.conf(5), and extensive documentation is in /usr/share/doc/lilo/manual.gz.

Loading the Kernel

After LILO has run, it hands off control to the kernel image listed in lilo.conf. If the kernel image is corrupted, symptoms vary from silence to core dumps. If someone recently upgraded the kernel, they should have made a boot disk with a functional kernel and kept the old kernel bootable under a different alias (label). Use the boot disk or boot the older kernel, and work on creating a new functioning kernel. If the original kernel is corrupted, you may find a copy of your kernel on the distribution CD or inside /usr/src/linux.
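
Keeping the old kernel bootable under a different alias comes down to a second image section in lilo.conf. The fragment below is only an illustration: the device names, kernel file names and the labels linux and old are placeholders, not values taken from any particular system.

```
# /etc/lilo.conf (illustrative fragment; all paths and devices are placeholders)
boot=/dev/hda
map=/boot/map
install=/boot/boot.b
prompt
# timeout is given in tenths of a second
timeout=50

# newly built kernel
image=/boot/vmlinuz
    label=linux
    root=/dev/hda1
    read-only

# previous known-good kernel, kept bootable under its own label
image=/boot/vmlinuz.old
    label=old
    root=/dev/hda1
    read-only
```

After editing the file, run /sbin/lilo so the new map is written. With a layout like this, typing old at the LILO boot prompt would start the previous kernel.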

Problems loading hardware might also be listed during this time; the log is located in /var/log/dmesg. Additional or alternate kernel log locations are boot.log or kern.log depending on distribution. HOWTO documents and distribution specific kernel guides are your best references for creating kernels. Problems with a particular section of the kernel (advanced power management for example) are most directly addressed by the kernel mailing list. www.kernel.org is the Linux kernel homepage, with additional references and documentation available through web site links.
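
A quick way to skim the kernel log for trouble is to filter it for a few common failure words. This is a rough sketch: the function name boot_errors and the word list are my own choices, and the default path is the /var/log/dmesg location mentioned above, which may differ on your distribution.

```shell
#!/bin/sh
# Sketch only: print numbered lines of a kernel log that mention likely failures.
# boot_errors is a hypothetical helper name; the pattern is a starting point, not a complete list.

boot_errors() {
    # $1 = log file to search (defaults to /var/log/dmesg)
    grep -i -n -E 'error|fail|unable|timeout' "${1:-/var/log/dmesg}"
}

# Intended use:
#   boot_errors                      # search /var/log/dmesg
#   boot_errors /var/log/kern.log    # or another kernel log, if your distribution uses one
```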

Mounting the Disks

The kernel loads, then the partitions listed in /etc/fstab are mounted. man:fstab(5) provides file format and directive information. All of the mount points must be accessible at boot time. If a mount fails, the system will prompt for the root password (this is configurable and varies across distributions) and then boot to single user mode (runlevel 1). If a disk volume is not accessible during boot, use single user mode and comment out its mount line in /etc/fstab as a temporary fix.
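
Commenting out a mount line can be done in any editor, but here is a small sketch that does it while keeping a backup copy. The function name comment_out_mount and the /data mount point are invented for the example. It assumes the root filesystem is writable, which in single user mode may first require remounting it read-write.

```shell
#!/bin/sh
# Sketch only: disable one fstab entry by prefixing it with '#'.
# comment_out_mount is a hypothetical helper name.

comment_out_mount() {
    # $1 = path to the fstab file, $2 = mount point whose entry should be disabled
    # A copy of the original is kept as <fstab>.bak.
    cp "$1" "$1.bak" &&
    awk -v mp="$2" '
        $1 !~ /^#/ && $2 == mp { print "#" $0; next }   # matching, active entry
        { print }                                       # everything else unchanged
    ' "$1.bak" > "$1"
}

# Intended use, from single user mode:
#   mount -o remount,rw /
#   comment_out_mount /etc/fstab /data
```

Remember to remove the comment marker again once the volume is repaired, or it will silently stay unmounted.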

If a disk was not cleanly unmounted, fsck will check it for errors with automatic settings. These settings resolve simple problems, but more serious errors need to be fixed by running fsck manually in single user mode. fsck should only be run on unmounted filesystems. Read man:fsck(8) and man:e2fsck(8) for more information.
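
Since fsck should only touch unmounted filesystems, it is worth checking the mount table before running it by hand. A minimal sketch follows. The helper name is_mounted is hypothetical, /dev/hda3 is just an example partition, and the check relies on /proc/mounts being available.

```shell
#!/bin/sh
# Sketch only: succeed (exit 0) if a device appears in the mount table.
# is_mounted is a hypothetical helper name.

is_mounted() {
    # $1 = device to look for, $2 = mount table to read (defaults to /proc/mounts)
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Intended use, from single user mode (example partition):
#   is_mounted /dev/hda3 && umount /dev/hda3
#   e2fsck -f /dev/hda3
```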

If the superblock has been corrupted, fsck can use one of the backup superblocks instead. The location of the backup superblock depends on the filesystem’s block size. For ext2 filesystems with 1k blocks, a backup superblock can be found at block 8193; with 2k blocks, at block 16384; and with 4k blocks, at block 32768. To use one of these superblocks, run fsck -b [blocknumber] /dev/[partition]. For example, “fsck -b 8193 /dev/hda1” for the first partition on the first IDE drive with 1k blocks. Alternatively, fsck -B [blocksize] doesn’t require any math, only knowing the block size used.
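
The three block numbers above follow one pattern: a block group holds 8 blocks per byte of block size, and the first backup sits at the start of the second group, shifted by one for 1k filesystems because their first data block is block 1. The sketch below assumes the filesystem was created with default mke2fs settings. The function name first_backup_superblock is made up for the example.

```shell
#!/bin/sh
# Sketch only: compute the first backup superblock for an ext2 filesystem,
# assuming the default layout of 8 * blocksize blocks per group.
# first_backup_superblock is a hypothetical helper name.

first_backup_superblock() {
    # $1 = block size in bytes (1024, 2048 or 4096)
    bs=$1
    per_group=$((bs * 8))          # one bitmap block tracks 8 * bs blocks
    if [ "$bs" -eq 1024 ]; then
        first=1                    # 1k filesystems start their data at block 1
    else
        first=0
    fi
    echo $((per_group + first))
}

# Intended use (example partition):
#   fsck -b "$(first_backup_superblock 1024)" /dev/hda1
# If the filesystem was created with non-default options, "mke2fs -n" run with
# the same options lists the backup locations without writing anything.
```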

When using partitioning utilities from several different operating systems, an inconsistency may develop regarding the boundaries of the partitions. The starting and ending blocks will be listed twice when fdisk displays the partition table. This is rare, but it does happen on multiboot systems. My recommendation is to always use a third-party utility that consistently understands every filesystem in use, including every version of each (NT with SP3, SP4 and SP5 and Windows 2000 all have different versions of NTFS, for example), or to use utilities like fdisk, cfdisk, sfdisk, Disk Druid, etc. under one operating system only. Spoon-feeding an operating system a blank partition of a certain type, or an already formatted filesystem, generally works, although some OSes require blank space. The last time I had this issue, I used a popular third-party partitioning program and it resolved the issue.

Startup Scripts

Startup scripts are very distribution specific. Most distributions place the scripts in /etc/ or /etc/rc.d/, with the startup scripts in either rc.boot or rcS.d. In addition to the actual mounting, hardware detection occurs, networking is configured, the hostname is set, clocks are started, portmaps are rendered and console settings are declared. This is happy stuff that we need for the next section. If one of these scripts fails, find out what hardware it depends on (isapnp for the sound card failed? Then maybe you have a sound card hardware issue or a kernel module issue). Complaints or error messages may be displayed, but few problems will stop the boot sequence. Research the script in use and look for man pages for these scripts. Try man -k “scriptname” if you can’t find documentation.

Runlevel Scripts

After executing the initial startup scripts, either runlevel 3 (multiuser shell) or 5 (X Window) will be invoked, depending on the configuration in inittab. Read man:inittab(5) for differences between runlevels and initial configurations. PCMCIA is started along with the user, system and network daemons. Any number of these can fail, depending on configuration. Both Apache and Sendmail, for example, will hang without a proper hostname. The failing daemon should be reported on the screen and in the normal messages log. Check the software documentation or turn off the daemon in question. chkconfig --list will display runlevels and daemons under Red Hat and Mandrake. Linuxconf has a module for controlling the behavior of different services. Debian startup scripts should be managed using update-rc.d in accordance with the Debian policy manual.
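
To see which runlevel the system will try to enter, read the initdefault entry in inittab. Here is a small sketch. The function name default_runlevel is hypothetical, and sendmail in the usage notes is only an example service name.

```shell
#!/bin/sh
# Sketch only: print the default runlevel configured in an inittab file.
# default_runlevel is a hypothetical helper name.

default_runlevel() {
    # $1 = inittab path (defaults to /etc/inittab)
    # inittab entries have the form id:runlevels:action:process
    awk -F: '$0 !~ /^#/ && $3 == "initdefault" { print $2; exit }' "${1:-/etc/inittab}"
}

# Intended use:
#   default_runlevel
# Turning off a daemon that hangs the boot (example service name):
#   chkconfig --list | grep sendmail        # Red Hat / Mandrake: show its runlevels
#   chkconfig --level 35 sendmail off       # Red Hat / Mandrake: disable in 3 and 5
#   update-rc.d -f sendmail remove          # Debian: remove its runlevel links
```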

Login Prompt

The login prompt relies on a subsystem and a set of programs, in addition to the username and password validation. If there is a problem with logging in, common causes include caps lock, forgotten passwords (recover the root password from single user mode), trojan login scripts and rootkits. Most trojans and rootkits try to convince users that the system was not compromised by providing a moderately consistent user experience. Validate packages using rpm -V, or files using a file verification program like tripwire, if “strange” behavior is noticed. www.securityfocus.com lists extensive security-related resources and provides information about rootkits, trojans and utilities like tripwire.
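
The output of rpm -V can be long, so it helps to filter it down to files whose contents changed and that are not configuration files. This is a sketch under some assumptions: the function name changed_binaries is made up, the filter relies on the usual rpm -V layout (a flags column containing 5 when the file digest differs, then an optional c for config files, then the path), and paths containing spaces are not handled.

```shell
#!/bin/sh
# Sketch only: reduce "rpm -V" output to non-config files whose digest changed.
# changed_binaries is a hypothetical helper name.

changed_binaries() {
    # Reads rpm -V style lines on stdin.
    # $1 is the flags column, $2 is "c" for config files, $NF is the path.
    awk '$1 ~ /5/ && $2 != "c" { print $NF }'
}

# Intended use:
#   rpm -Va | changed_binaries
```

A modified /bin/login or similar binary in that list is worth investigating, though a rootkit that has replaced rpm itself could also falsify the result, so verifying from trusted boot media is safer.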