Author: ephuizi

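  • Comparing bloomfilter libraries

    Current problems

    • Memory is planned when the storage service is deployed, but after running for a while the service hits OOM
    • Oddly, when the storage service hits OOM, the dumped heap is often several GB smaller than the planned memory
    • After running for a while, the storage service is prone to full GC

    Background

    The storage does global deduplication. Each node manages the dedup index for its own data. Currently the dedup index is stored directly in files on SSD, and hash collisions are resolved with open addressing.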
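To make the open-addressing idea concrete, here is a minimal in-memory sketch. It is not the storage service's code: the class name, the linear probing strategy, and the `max_probes` collision budget are all assumptions made for illustration, and a Python list stands in for the fixed-size entries of an index file.

```python
class OpenAddressingIndex:
    """Toy stand-in for one index file (hypothetical, in memory only)."""

    def __init__(self, slots, max_probes):
        self.slots = [None] * slots    # one slot per fixed-size entry
        self.max_probes = max_probes   # assumed collision budget per lookup
        self.count = 0                 # number of hashes stored so far

    def _home(self, digest):
        # Home slot derived from the first 8 bytes of the digest.
        return int.from_bytes(digest[:8], "big") % len(self.slots)

    def insert(self, digest):
        """Return True if stored or already present, False if probing gave up."""
        start = self._home(digest)
        for step in range(self.max_probes):
            pos = (start + step) % len(self.slots)   # linear probing
            if self.slots[pos] is None:
                self.slots[pos] = digest
                self.count += 1
                return True
            if self.slots[pos] == digest:
                return True                          # duplicate: dedup hit
        return False                                 # too many collisions

    def contains(self, digest):
        start = self._home(digest)
        for step in range(self.max_probes):
            pos = (start + step) % len(self.slots)
            if self.slots[pos] is None:
                return False
            if self.slots[pos] == digest:
                return True
        return False
```

In this sketch a failed `insert` plays the role of the "too many collisions" signal; a caller would then treat the file as full and move on to a new one.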

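    Whether a file is full is decided by the number of hash collisions and the total number of hash values already stored in the file. The maximum number of hashes a single index file may hold is:

    the next prime after ((Integer.MAX_VALUE / ENTRY_SIZE) - 100)

    The next prime after ((2**31-1)/37.0-100) is 58039999

    As for where this formula comes from, all I can say is that it was already computed this way the first time I saw it. It probably accounts for the fact that a Java file memory mapping cannot exceed Integer.MAX_VALUE bytes.

    So a keys file is at most hash_count_limit * ENTRY_SIZE bytes long, which is roughly 2 GB.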
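The limit can be recomputed with a few lines of Python. This is only a sketch of the arithmetic: `ENTRY_SIZE = 37` is read off the `37.0` in the formula above, and `is_prime` / `next_prime` are helper names introduced here, not taken from the storage code.

```python
INTEGER_MAX_VALUE = 2**31 - 1   # Java's Integer.MAX_VALUE
ENTRY_SIZE = 37                 # bytes per entry, assumed from the 37.0 above


def is_prime(n):
    """Trial division; fast enough for numbers around 6e7."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def next_prime(x):
    """Smallest prime strictly greater than x."""
    n = int(x) + 1
    while not is_prime(n):
        n += 1
    return n


hash_count_limit = next_prime(INTEGER_MAX_VALUE / ENTRY_SIZE - 100)
file_bytes = hash_count_limit * ENTRY_SIZE
print(hash_count_limit, file_bytes, round(file_bytes / 2**30, 3))
```

If the result matches the 58039999 quoted above, the file length is 58039999 * 37 = 2147479963 bytes, which stays just under Integer.MAX_VALUE (2147483647).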

    (more…)

  • Replacing a failed disk in VMware vSAN

    Symptoms

    The vSAN storage capacity shrinks. The host raises an alarm: vSAN data error

    (more…)

  • Inspecting a process's threads on Linux

    Linux has no true threads; a thread on Linux is a lightweight process (LWP).

    The main difference between a light weight process (LWP) and a normal process is that LWPs share the same address space and other resources like open files etc. As some resources are shared so these processes are considered to be light weight as compared to other normal processes and hence the name light weight processes.

    Listing the threads of a process

    ➜  ~ ps -Lf 1210
    UID        PID  PPID   LWP  C NLWP STIME TTY      STAT   TIME CMD
    mysql     1210     1  1210  0   10 4月17 ?       Ssl    0:02 /usr/sbin/mariadbd
    mysql     1210     1  1478  0   10 4月17 ?       Ssl    3:24 /usr/sbin/mariadbd
    mysql     1210     1  1479  0   10 4月17 ?       Ssl    0:01 /usr/sbin/mariadbd
    mysql     1210     1  1480  0   10 4月17 ?       Ssl    0:00 /usr/sbin/mariadbd
    mysql     1210     1  1481  0   10 4月17 ?       Ssl    0:00 /usr/sbin/mariadbd
    mysql     1210     1  1492  0   10 4月17 ?       Ssl    0:00 /usr/sbin/mariadbd
    mysql     1210     1  1566  0   10 4月17 ?       Ssl    0:00 /usr/sbin/mariadbd
    mysql     1210     1  9933  0   10 5月16 ?       Ssl    0:00 /usr/sbin/mariadbd
    mysql     1210     1  9937  0   10 5月16 ?       Ssl    0:00 /usr/sbin/mariadbd
    mysql     1210     1 13703  0   10 00:11 ?        Ssl    0:00 /usr/sbin/mariadbd
    
    ➜  ~ ls -l /proc/1210/task/
    总用量 0
    dr-xr-xr-x 7 mysql mysql 0 5月  16 13:44 1210
    dr-xr-xr-x 7 mysql mysql 0 5月  17 00:12 13703
    dr-xr-xr-x 7 mysql mysql 0 5月  16 13:44 1478
    dr-xr-xr-x 7 mysql mysql 0 5月  16 13:44 1479
    dr-xr-xr-x 7 mysql mysql 0 5月  16 13:44 1480
    dr-xr-xr-x 7 mysql mysql 0 5月  16 13:44 1481
    dr-xr-xr-x 7 mysql mysql 0 5月  16 13:44 1492
    dr-xr-xr-x 7 mysql mysql 0 5月  16 13:44 1566
    dr-xr-xr-x 7 mysql mysql 0 5月  17 00:12 9933
    dr-xr-xr-x 7 mysql mysql 0 5月  17 00:12 9937
    
    ➜  ~ top -Hp 1210
    top - 00:18:19 up 29 days,  7:20,  1 user,  load average: 0.21, 0.15, 0.13
    Threads:  10 total,   0 running,  10 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem :  1881996 total,    85684 free,   900112 used,   896200 buff/cache
    KiB Swap:        0 total,        0 free,        0 used.   791844 avail Mem
    
      PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
     1210 mysql     20   0 1158504 130800   7732 S  0.0  7.0   0:02.17 mariadbd
     1478 mysql     20   0 1158504 130800   7732 S  0.0  7.0   3:24.05 mariadbd
     1479 mysql     20   0 1158504 130800   7732 S  0.0  7.0   0:01.59 mariadbd
     1480 mysql     20   0 1158504 130800   7732 S  0.0  7.0   0:00.00 mariadbd
     1481 mysql     20   0 1158504 130800   7732 S  0.0  7.0   0:00.05 mariadbd
     1492 mysql     20   0 1158504 130800   7732 S  0.0  7.0   0:00.00 mariadbd
     1566 mysql     20   0 1158504 130800   7732 S  0.0  7.0   0:00.00 mariadbd
     9933 mysql     20   0 1158504 130800   7732 S  0.0  7.0   0:00.22 mariadbd
     9937 mysql     20   0 1158504 130800   7732 S  0.0  7.0   0:00.24 mariadbd
    13703 mysql     20   0 1158504 130800   7732 S  0.0  7.0   0:00.00 mariadbd
    
    ➜  ~ pstree -p 1210
    mariadbd(1210)─┬─{mariadbd}(1478)
                   ├─{mariadbd}(1479)
                   ├─{mariadbd}(1480)
                   ├─{mariadbd}(1481)
                   ├─{mariadbd}(1492)
                   ├─{mariadbd}(1566)
                   ├─{mariadbd}(13703)
                   ├─{mariadbd}(15228)
                   ├─{mariadbd}(15229)
                   ├─{mariadbd}(15230)
                   └─{mariadbd}(15231)
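The same thread list can be read programmatically from `/proc/<pid>/task`, which is what the `ls -l /proc/1210/task/` call above shows. A minimal sketch follows; the function name and the `proc_root` parameter are introduced here for illustration (the parameter only exists so the function can be pointed at a different directory).

```python
import os


def list_thread_ids(pid, proc_root="/proc"):
    """Return the sorted LWP (thread) ids found under <proc_root>/<pid>/task."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    return sorted(int(name) for name in os.listdir(task_dir) if name.isdigit())
```

On a Linux host, `list_thread_ids(1210)` would return the ids that `ps -Lf 1210` prints in its LWP column.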
    

    ref

    what-is-the-difference-between-lightweight-process-and-thread

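  • Revisiting the CAP theorem

    1. Consistency in CAP specifically means linearizability.
    2. CAP says nothing about latency: a system that satisfies CAP availability may take arbitrarily long to answer a request.
    3. The system model in CAP is a single register that can only read and write one value; transactions are outside the scope of the theorem.
    4. When designing a distributed system you have to consider many more issues. Focusing too much on CAP makes it easy to overlook other important problems.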

    (more…)

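  • Inspecting JVM memory with Linux top

    Using the Linux top command to look at a process's memory

    • Which fields in top are about memory
    • Why VIRT is sometimes larger than the system's memory
    • Why RES is larger than the Xmx configured for the JVM
    • What DATA means in top

    Memory-related fields in top

    CODE and DATA are only displayed after you press F and select them with the space bar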

    top -p 1210
    
    PID USER      PR  NI    VIRT    RES    SHR   CODE    DATA   SWAP S %CPU %MEM     TIME+ COMMAND
    1210 mysql     20   0 1158504 130804   7736  22472 1052816      0 S  0.0  7.0   8:53.50 mariadbd
    

    VIRT

    36. VIRT  --  Virtual Memory Size (KiB)
        The total amount of virtual memory used by the task.  It includes all code, data and shared libraries plus pages that have been swapped out and pages that have  been mapped but not used.
    

    VIRT = CODE + DATA + shared libraries + pages that have been swapped out + pages that have been mapped but not used

    SWAP

    27. SWAP  --  Swapped Size (KiB)
        The non-resident portion of a task's address space.
    

    The size of the memory pages that have been swapped out

    RES

    17. RES  --  Resident Memory Size (KiB)
        The non-swapped physical memory a task is using.
    

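    The physical memory a task is using that has not been swapped out

    For example, the program below shows that RES includes anonymous mmap memory that is also counted in SHR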

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>
    
    int main(void)
    {
        const size_t len = 50 << 20;
    
        /* mmap 50MiB of shared anonymous memory */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_SHARED, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
    
        /* Touch every single page to make them resident (assumes 4 KiB pages) */
        for (size_t i = 0; i < len / 4096; i++) {
            p[i * 4096] = 1;
        }
    
        /* Let us see the process in top */
        sleep(1000000);
    
        return 0;
    }
    
    gcc -std=gnu99 main.c
    ./a.out &
    pgrep a.out
    339065
    top -p 339065
    
    
        PID USER        VIRT    RES    SHR S  %CPU %MEM   CODE    DATA     TIME+ COMMAND
     338377 root       55412  51564  51476 S   0.0  0.1      4     180   0:00.01 a.out
    

    CODE

     4. CODE  --  Code Size (KiB)
        The amount of physical memory devoted to executable code, also known as the Text Resident Set size or TRS.
    
        (TRS, the Text Resident Set: the physical memory in which executable code is resident.)
    

    DATA

    6. DATA  --  Data + Stack Size (KiB)
       The amount of physical memory devoted to other than executable code, also known as the Data Resident Set size or DRS.
    

    The man page is not accurate here; see the examples in the article quoted below

    The DATA column contains the amount of reserved private anonymous memory. By definition, the private anonymous memory is the memory that is specific to the program and that holds its data. It can only be shared by forking in a copy-on-write fashion. It includes (but is not limited to) the stacks and the heap ((But we will see later that it only partially contains the data segment of the loaded executables)). This column does not contain any piece of information about how much memory is actually used by the program, it just tells us that the program reserved some amount of memory, however that memory may be left untouched for a long time.

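    1. DATA includes the stacks and the heap, but is not limited to them.
    2. DATA does not tell us how much memory the program actually uses. It only tells us that the program has reserved a certain amount of memory, and that memory may stay untouched for a long time.

    $$ANON = RES - SHR$$ (ANON denotes resident anonymous memory, e.g. memory allocated on the heap.)

    $$ANON \le DATA$$ (vm_physic)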

    SHR

    21. SHR  --  Shared Memory Size (KiB)
            The amount of shared memory available to a task, not all of which is typically resident.  It simply reflects memory that could be potentially shared with other  processes.
    
            (SHR only reflects memory that could be shared with other processes; not all of it is resident.)
    

    SHR contains all virtual memory that could be shared with other processes, and RSS contains all memory physically in RAM that is used by the process.

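    Thus all shared memory currently in RAM is counted both in SHR and in RSS, so SHR + RSS has no meaning since it can contain duplicate counts.

    1. Besides the process's own shared memory, SHR also includes memory shared with other processes
    2. Even if the process uses only a few functions of a shared library, SHR counts the size of the whole library
    3. Formula for the physical memory occupied by a single process: RES - SHR
    4. SHR goes down after pages are swapped out

    Via the proc filesystem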

    cat /proc/1210/statm
    289626 32701 1934 5618 0 263204 0

    // OS memory page size

    getconf PAGESIZE
    4096

    Table 1-3: Contents of the statm files (as of 2.6.8-rc3)

    | Field | Content | Related top field |
    | --- | --- | --- |
    | size | total program size (pages) (same as VmSize in status) | $$VIRT=289626*4096/1024=1158504$$ |
    | resident | size of memory portions (pages) (same as VmRSS in status) | $$RES=32701*4096/1024=130804$$ |
    | shared | number of pages that are shared (i.e. backed by a file, same as RssFile+RssShmem in status) | $$SHR=1934*4096/1024=7736$$ |
    | trs | number of pages that are 'code' (not including libs; broken, includes data segment) | $$CODE=5618*4096/1024=22472$$ |
    | lrs | number of pages of library (always 0 on 2.6) | |
    | drs | number of pages of data/stack (including libs; broken, includes library text) | $$DATA=263204*4096/1024=1052816$$ |
    | dt | number of dirty pages (always 0 on 2.6) | |
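The page-to-KiB conversions in the table can be checked with a short Python sketch. The function name is made up for this example; the field order follows the statm table above, and the 4096-byte page size is the `getconf PAGESIZE` result shown earlier.

```python
def statm_to_top_kib(statm_line, page_size=4096):
    """Convert one /proc/<pid>/statm line (in pages) to top's KiB columns."""
    size, resident, shared, trs, _lrs, drs, _dt = (
        int(value) for value in statm_line.split()
    )
    kib_per_page = page_size // 1024
    return {
        "VIRT": size * kib_per_page,
        "RES": resident * kib_per_page,
        "SHR": shared * kib_per_page,
        "CODE": trs * kib_per_page,
        "DATA": drs * kib_per_page,
    }


columns = statm_to_top_kib("289626 32701 1934 5618 0 263204 0")
print(columns)
print("RES - SHR =", columns["RES"] - columns["SHR"], "KiB")
```

For the mariadbd sample this reproduces the VIRT, RES, SHR, CODE and DATA values shown by `top -p 1210`, and RES - SHR comes out at 123068 KiB.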

    Via pmap

    ➜  ~ pmap -X 1210|head -n 5
    1210:   /usr/sbin/mariadbd
             Address Perm   Offset Device Inode    Size    Rss    Pss Referenced Anonymous Swap Locked Mapping
        556335aee000 r-xp 00000000  fd:01 22182   22472   5228   5228       5172         0    0      0 mariadbd
        5563372df000 r--p 015f1000  fd:01 22182    1392   1392   1392       1392      1392    0      0 mariadbd
        55633743b000 rw-p 0174d000  fd:01 22182     720    416    416        416       384    0      0 mariadbd
    ➜  ~ pmap -X 1210|tail -n 5
        7ffc7142e000 rw-p 00000000  00:00     0     132     76     76         76        76    0      0 [stack]
        7ffc714af000 r-xp 00000000  00:00     0       8      4      0          4         0    0      0 [vdso]
    ffffffffff600000 r-xp 00000000  00:00     0       4      0      0          0         0    0      0 [vsyscall]
                                                ======= ====== ====== ========== ========= ==== ======
                                                1158508 131308 128832     131136    123272    0      0 KB
    

    In computing, proportional set size (PSS) is the portion of main memory (RAM) occupied by a process and is composed of the private memory of that process plus the proportion of shared memory with one or more other processes. Unshared memory including the proportion of shared memory is reported as the PSS.
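As a worked illustration of that definition, the sketch below charges each shared mapping to a process in proportion to the number of processes sharing it. The function name and the sample numbers are invented for the example; real values would come from the Pss column of `pmap -X` or from `/proc/<pid>/smaps`.

```python
def proportional_set_size(private_kib, shared_mappings):
    """PSS = private memory + each shared mapping divided by its sharer count.

    shared_mappings is an iterable of (resident_kib, processes_sharing) pairs.
    """
    return private_kib + sum(kib / sharers for kib, sharers in shared_mappings)


# 100 KiB private, 60 KiB shared by 3 processes, 40 KiB shared by 2 processes
print(proportional_set_size(100, [(60, 3), (40, 2)]))
```

Here the process is charged 100 + 60/3 + 40/2 = 140 KiB, whereas its RSS would count the full 200 KiB.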

    JVM and Linux

    Why does the actual memory used by the JVM process still change after Xms and Xmx are set?

    G1 will try expand the heap if the amount of time you spend doing GC work versus application work is greater than a specific threshold. Note: If your min/max heap are the same, expansion cannot occur.

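    In fact, once Xms equals Xmx the heap size is fixed and the JVM heap will not expand any further.

    Linux gives every process the same virtual address space, which keeps processes independent and isolated from one another. This is implemented with virtual memory: each process gets a certain amount of virtual address space, and physical memory is allocated only when the virtual memory is actually used.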

    With -Xms10g -Xmx10g, the JVM asks the operating system at startup for 10 GB of memory to be used as heap.

    The operating system then tries to allocate that memory for the JVM (it shows up as VIRT), but it does not promise to back it with physical memory; part of it may end up in swap 😉

    You will still find that VIRT is not exactly 10 GB. The reason is that the 10 GB covers only the heap, and a JVM contains much more than the heap: thread stacks, PermGen (HotSpot before JDK 8; JDK 8 and later use Metaspace instead), native stacks, code, mapped files, etc.
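The difference between reserving address space and actually using it can be observed from a small Python sketch. It assumes a Linux host with `/proc/self/statm`; the function names and the 32 MiB size are choices made for this example, and an anonymous `mmap` stands in for the JVM's heap reservation.

```python
import mmap
import os

PAGE = os.sysconf("SC_PAGE_SIZE")


def resident_kib():
    """Resident set size of this process in KiB, read from /proc/self/statm."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * PAGE // 1024


def touch_demo(size=32 << 20):
    """Return RSS (KiB) before mapping, after mapping, and after touching."""
    before = resident_kib()
    region = mmap.mmap(-1, size)            # anonymous mapping: address space only
    reserved = resident_kib()
    for offset in range(0, size, PAGE):     # write one byte into every page
        region[offset:offset + 1] = b"\x01"
    touched = resident_kib()
    region.close()
    return before, reserved, touched
```

Calling `touch_demo()` should show RSS staying roughly flat after the mapping is created and growing by about the mapping size only once every page has been written, which is the behaviour described above for VIRT versus RES.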

    The "used" value of the JVM heap usage is larger than RES in top

    [root@node2 octopus]#  /usr/lib/jvm/java-11/bin/jhsdb jmap --heap --pid 31821
    Attaching to process ID 31821, please wait...
    Debugger attached successfully.
    Server compiler detected.
    JVM version is 11.0.11+9-LTS
    
    using thread-local object allocation.
    Garbage-First (G1) GC with 13 thread(s)
    
    Heap Configuration:
       MinHeapFreeRatio         = 40
       MaxHeapFreeRatio         = 70
       MaxHeapSize              = 19327352832 (18432.0MB)
       NewSize                  = 1363144 (1.2999954223632812MB)
       MaxNewSize               = 11593056256 (11056.0MB)
       OldSize                  = 5452592 (5.1999969482421875MB)
       NewRatio                 = 2
       SurvivorRatio            = 8
       MetaspaceSize            = 21807104 (20.796875MB)
       CompressedClassSpaceSize = 1073741824 (1024.0MB)
       MaxMetaspaceSize         = 17592186044415 MB
       G1HeapRegionSize         = 16777216 (16.0MB)
    
    Heap Usage:
    G1 Heap:
       regions  = 1152
       capacity = 19327352832 (18432.0MB)
       used     = 17792765976 (16968.503929138184MB)  # <- this value
       free     = 1534586856 (1463.4960708618164MB)
       92.06002565721671% used
    G1 Young Generation:
    Eden Space:
       regions  = 11
       capacity = 872415232 (832.0MB)
       used     = 184549376 (176.0MB)
       free     = 687865856 (656.0MB)
       21.153846153846153% used
    Survivor Space:
       regions  = 8
       capacity = 134217728 (128.0MB)
       used     = 134217728 (128.0MB)
       free     = 0 (0.0MB)
       100.0% used
    G1 Old Generation:
       regions  = 1064
       capacity = 18320719872 (17472.0MB)
       used     = 17490776088 (16680.503929138184MB)
       free     = 829943784 (791.4960708618164MB)
       95.46991717684399% used
    
    [root@node2 octopus]#
    [root@node2 octopus]#
    [root@node2 octopus]# top -p 31821
    top - 16:54:33 up 21:52,  2 users,  load average: 0.00, 0.01, 0.05
    Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem : 49177608 total,   391092 free, 22373724 used, 26412792 buff/cache
    KiB Swap: 33554428 total, 33554428 free,        0 used. 26303500 avail Mem
    
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
      31821 root      20   0  159.4g  15.0g  25284 S   0.0 32.0   5:40.87 jsvc
    
    

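    The JVM heap reports used = 17792765976 (16968.503929138184MB), yet RES in top is only 15.0g (RES - SHR). This looks odd at first, but counting the objects in the JVM heap shows that they actually occupy only about 11 GB.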

    [root@node2 octopus]#  /usr/lib/jvm/java-11/bin/jmap -histo 31821 |head -n 5
     num     #instances         #bytes  class name (module)
    -------------------------------------------------------
       1:          3094    10086075648  [J (java.base@11.0.11)
       2:         90396     1081725104  [B (java.base@11.0.11)
       3:         11750      203173760  [I (java.base@11.0.11)
    [root@node2 octopus]#  /usr/lib/jvm/java-11/bin/jmap -histo 31821 |tail -n 5
    1661:             1             16  sun.util.locale.provider.TimeZoneNameUtility$TimeZoneNameGetter (java.base@11.0.11)
    1662:             1             16  sun.util.logging.internal.LoggingProviderImpl (java.logging@11.0.11)
    1663:             1             16  sun.util.resources.LocaleData$LocaleDataStrategy (java.base@11.0.11)
    1664:             1             16  sun.util.resources.cldr.provider.CLDRLocaleDataMetaInfo (jdk.localedata@11.0.11)
    Total        789125    11397387608 # <- this value
    

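    This shows that the real footprint of the JVM heap contains more than just the live objects. Could it be the total size of all the regions that are used to store objects?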
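The numbers from the two dumps above can be put side by side with a little arithmetic. This is only a sanity check, not an explanation of G1 internals: treating the `regions` fields of the Eden, Survivor and Old sections as "regions currently holding objects" is an assumption made here.

```python
MIB = 1 << 20
REGION_MIB = 16                     # G1HeapRegionSize from the jhsdb output

heap_used = 17_792_765_976          # "used" under "G1 Heap"
histo_total = 11_397_387_608        # "Total" bytes from jmap -histo
used_regions = 11 + 8 + 1064        # Eden + Survivor + Old "regions" (assumed in use)

gap_bytes = heap_used - histo_total
region_footprint_mib = used_regions * REGION_MIB

print(f"objects in histogram : {histo_total / MIB:10.1f} MiB")
print(f"heap 'used'          : {heap_used / MIB:10.1f} MiB")
print(f"regions * region size: {region_footprint_mib:10.1f} MiB")
print(f"used - histogram     : {gap_bytes / MIB:10.1f} MiB")
```

Under that assumption the region-granular footprint is 1083 * 16 = 17328 MiB, so the reported `used` value sits between the histogram total and the size of the occupied regions. That is consistent with the question above but does not prove it.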

    top VIRT and RSS

    When is Virtual Memory Size Important?

    The virtual memory map contains a lot of stuff.

    • Some of it is read-only,

    • some of it is shared,

    • and some of it is allocated but never touched (eg, almost all of the 4Gb of heap in this example).

    But the operating system is smart enough to only load what it needs, so the virtual memory size is largely irrelevant. (In short: the OS only gives a process the memory it really uses, so the virtual size rarely needs attention.)

    Where virtual memory size is important is if you’re running on a 32-bit operating system, where you can only allocate 2Gb (or, in some cases, 3Gb) of process address space. In that case you’re dealing with a scarce resource, and might have to make tradeoffs, such as reducing your heap size in order to memory-map a large file or create lots of threads. (Older machines were 32-bit and could address at most 4 GB; after the part reserved for the system, a process on most machines could only use 3 GB.)

    But, given that 64-bit machines are ubiquitous, I don’t think it will be long before Virtual Memory Size is a completely irrelevant statistic.

    When is Resident Set Size Important?

    Resident Set size is that portion of the virtual memory space that is actually in RAM. If your RSS grows to be a significant portion of your total physical memory, it might be time to start worrying. If your RSS grows to take up all your physical memory, and your system starts swapping, it’s well past time to start worrying.

    But RSS is also misleading, especially on a lightly loaded machine. The operating system doesn’t expend a lot of effort to reclaiming the pages used by a process. There’s little benefit to be gained by doing so, and the potential for an expensive page fault if the process touches the page in the future. As a result, the RSS statistic may include lots of pages that aren’t in active use. (In short: on a lightly loaded machine the OS may not reclaim stale pages promptly, so RSS can contain many pages that are no longer in use.)

    Memory – Part 2: Understanding Process memory

    The /proc Filesystem

    Proportional set size

    Linux和JVM内存